A few weeks ago I published what appears to have been quite a popular blog post on what I called the ‘Bug Cluster Phenomenon’. Today, I’m going to extend that concept somewhat by way of a mea culpa.
Earlier this week I had to eat some very humble pie. For the last six weeks or so I had received complaints that a temperature measurement wasn’t giving accurate results. The sensor in question measures approximately ambient temperature, and was returning values in the 18 to 26 Celsius range, which seemed reasonable to me. I wrote off the complaints as being due to the fact that humans have a very poor perception of absolute temperature. Well, finally, at my urging, someone dragged the device out into the winter cold, where it promptly read 18 Celsius. Thus I was faced with proof that something was wrong.
I proceeded to investigate the code and discovered that, for the current inputs, it was generating an output with an error of about 2 degrees. How was this possible, given that the calculation was nothing more than a series of multiplies, adds and shifts, which is not typically fodder for a 2 degree error?
Well, further investigation showed that at a certain point I was getting numeric overflow when two numbers were being multiplied together. Now typically, when this occurs, one gets answers that have huge ‘errors’. In my case I had the misfortune that the arithmetic worked out such that the error at room temperature was barely noticeable.
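To make the failure mode concrete, here is a minimal sketch of how a multiply-and-shift scaling step can silently go wrong. It is not the code from the device: the function names, the 16-bit product width, the gain of 250 and the shift of 4 are all assumptions chosen purely for illustration.

```c
#include <stdint.h>

/* Hypothetical scaling step: (raw * gain) >> 4.
   The product is truncated to 16 bits, mimicking a multiply
   performed in a type that is too narrow for the result. */
uint16_t scale_narrow(uint16_t raw, uint16_t gain)
{
    uint16_t product = (uint16_t)((uint32_t)raw * gain); /* high bits are lost */
    return (uint16_t)(product >> 4);
}

/* Same step with a 32-bit intermediate, so the product always fits. */
uint32_t scale_wide(uint16_t raw, uint16_t gain)
{
    uint32_t product = (uint32_t)raw * gain; /* 16 x 16 bits fits in 32 bits */
    return product >> 4;
}
```

With a gain of 250, the two functions agree for a raw value of 200 (the product, 50000, still fits in 16 bits) but disagree for a raw value of 300 (the product, 75000, does not). Nothing traps or warns in the narrow version; it just returns a smaller number. How wrong that number looks depends entirely on the constants involved, which is how an overflow can hide inside part of the input range.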
Anyway, I duly fixed the code. However, before moving on I took the time to reflect on this particular bug. Was this just one of those stupid coding errors that we all make from time to time, or was there more to it? I came to the conclusion that this was not just “one of those things”. Rather, I realized that this was at least the third time this year that I had written code that suffered from a numeric overflow problem. In short, I have a blind spot, if you will, for a particular class of bug.
Well, I’m told that recognizing one’s problems is the first step in solving them. So I did a little more investigating and discovered that my numeric overflow bugs always occurred when I combined multiple operators on a line. For example:
y = a * a + c;
The solution seems obvious to me: only one numeric operator per line. In future, I will always code like this:
y = a * a;
y += c;
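Splitting the expression does not change the arithmetic by itself. What it does provide is a natural place to look at each intermediate result. As a rough sketch of where that can lead, the hypothetical helper below (the name `square_plus`, the 32-bit operand type and the 64-bit intermediate are assumptions for this example, not part of the original fix) performs the multiply and the add as separate steps and checks the range after each one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: computes a * a + c one operation at a time.
   Returns false, leaving *y untouched, if either step would not
   fit in an int32_t. */
bool square_plus(int32_t a, int32_t c, int32_t *y)
{
    int64_t wide = (int64_t)a * a;   /* step 1: multiply in a wider type */
    if (wide > INT32_MAX) {          /* a square is never negative */
        return false;
    }

    wide += c;                       /* step 2: add, still in the wider type */
    if (wide > INT32_MAX || wide < INT32_MIN) {
        return false;                /* lower bound kept so the check stays general */
    }

    *y = (int32_t)wide;              /* safe: range verified above */
    return true;
}
```

A caller would test the return value before using `y`. On a small microcontroller a 64-bit intermediate may be too expensive, in which case the same one-step-at-a-time structure can be used with a range check on the operands before each operation instead.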
The bottom line: when you encounter a bug, as well as looking for other bugs nearby (as described in the bug cluster phenomenon post), also take the time to reflect on what caused the bug in the first place, and see if you can recognize any systemic problems in your approach to coding. When it comes down to it, this is nothing more than a process of ‘continuous quality improvement’. If it works for Toyota then it might just work in the embedded systems arena.