A few weeks ago I published what appears to have been quite a popular post on what I called the ‘Bug Cluster Phenomenon’. Today, I’m going to extend that concept somewhat by way of a mea culpa.
Earlier this week I had to eat some very humble pie. For the last six weeks or so I had received complaints that a temperature measurement wasn’t giving accurate results. The sensor in question measures approximately ambient temperature, and was returning values in the 18 to 26 Celsius range, which seemed reasonable to me. I just wrote off the complaints as being due to the fact that humans have a very poor perception of absolute temperature. Finally, at my urging, someone dragged the device out into the winter cold, where it promptly read 18 Celsius. Thus I was faced with proof that something was wrong.
I proceeded to investigate the code, and discovered that for the current inputs it was generating an output with an error of about 2 degrees. How was this possible, since it was nothing more than a series of multiplies, adds and shifts, not typically fodder for a 2 degree error?
Well, further investigation showed that at a certain point I was getting numeric overflow when two numbers were being multiplied together. Now typically, when this occurs, one gets answers that have huge ‘errors’. In my case I had the misfortune that the arithmetic worked out such that the error at room temperature was barely noticeable.
Anyway, I duly fixed the code. However, before moving on I took the time to reflect on this particular bug. Was this just one of those stupid coding errors that we all make from time to time, or was there more to it? I came to the conclusion that this was not just “one of those things”. Rather I realized that this was at least the third time this year that I had written code that suffered from a numeric overflow problem. In short, I have a blind spot, if you will, for a particular class of problem.
Well I’m told that recognizing one’s problems is the first step in solving them. So I proceeded to do a little bit more investigating and discovered that my numeric overflow bugs always occurred when I combined multiple operators on a line. For example:
y = a * a + c;
Thus the solution seems obvious to me – only one numeric operator per line. Thus in future, I will always code like this:
y = a * a;
y += c;
The bottom line: when you encounter a bug, as well as looking for other bugs nearby (as described in the bug cluster phenomenon post), also take the time to reflect on what caused the bug in the first place, and see if you can recognize any systemic problems in your approach to coding. When it comes down to it, this is nothing more than a process of ‘continuous quality improvement’. If it works for Toyota then it might just work in the embedded systems arena.
Dear Nigel

I think you touch a sore point for many C developers with:

y = a * a + c;

If y, a and c are all of the same data type, there are no surprises, but let’s say:

long y, c;
short a;

In this case, I’m not even sure if:

y = a * a;
y += c;

is good enough. C might perform the a*a in 16-bit, and only cast to 32-bit afterwards. I’m not sure. So I agree with your point. Know your own (and C’s) weaknesses and code defensively if you are in a danger area. This is what I would do:

long y, c;
short a;
y = (long)a * (long)a + c;

That being said, I have not encountered many developers who have not been bitten by C’s implicit casting rules… Another friendly feature from Brian and Dennis :)

Best regards and thanks for interesting blog posts, Lars
Thanks for your comment Lars. I agree that the C type promotion and casting rules can be a nightmare. I’ve seen some people criticize ‘unnecessary’ casting like you have shown. However, I’m with you on this one, and always use a lot of explicit casting to ensure that I get what I want.

As a corollary, I am not a great proponent of using unnecessary parentheses to ensure the order of evaluation of an expression. The order of evaluation rules are much easier to understand (if not always intuitive). Having said that, I find I usually avoid the situation simply by using just one operator per line.
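
For readers who want to see the explicit widening written out, here is a minimal sketch. It assumes fixed-width types from <stdint.h> in place of short and long, and the function names square_plus_offset and square_low16 are made up for illustration. In C, a short operand is promoted to int before the multiply, so on a target where int is 16 bits the product a * a is formed in 16 bits unless an operand is widened first. The second function only shows the low 16 bits of the true product, as one possible picture of what gets lost; signed overflow is undefined behavior, so a real 16-bit target is not guaranteed to produce that value.

```c
#include <stdint.h>

/* Hypothetical helper: square a 16-bit value and add a 32-bit offset.
   The operand is widened BEFORE the multiply, so the product is
   computed in 32 bits regardless of how wide the platform's int is. */
int32_t square_plus_offset(int16_t a, int32_t c)
{
    int32_t wide = (int32_t)a;
    int32_t y = wide * wide;   /* at most 32768 * 32768 = 2^30, fits in int32_t */
    y += c;                    /* caller must keep c small enough that the sum fits */
    return y;
}

/* Hypothetical helper: the low 16 bits of the true product.
   This only illustrates what a 16-bit multiply could leave behind;
   it is not a prediction of what any particular compiler does. */
uint16_t square_low16(int16_t a)
{
    uint32_t wide = (uint32_t)(int32_t)a;   /* unsigned arithmetic wraps, no UB */
    return (uint16_t)((wide * wide) & 0xFFFFu);
}
```

With a = 300 and c = 0 the widened version returns 90000, whereas the low 16 bits of the same product are only 24464.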
Nigel,

I think you or someone should do a blog post on the related topic of integer approximations for floating point scaling. I wrote my own little command line utility to get the nearest integer ratio to a given float, up to a given maximum denominator. I find I use this all the time on non-float capable systems with limited horsepower.

For example, you might have a formula that says the actual measurement is 2430/1024 (or equivalently, about 2.37305) times the 10-bit ADC reading. If you multiply a 10-bit ADC value by 2430, you will overflow your device's 16-bit integers for many inputs, and of course if you divide by 1024 first, you always underflow.

Even if you factor out the greatest common divisor (2) you get 1215/512 with the same problems. But it turns out that 19/8 approximates this to within 0.08%, or less than one bit.

The next better approximation is 83/35, but 83*1023 > 16 bits again.
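
The kind of search described above could be sketched roughly as follows. This is not the actual utility, just a brute-force illustration run on a host PC: the names ratio_t, nearest_ratio and fits_in_u16 are invented here, the target is assumed to be non-negative, and the numerator is assumed to fit in 32 bits.

```c
#include <float.h>
#include <stdint.h>

/* Hypothetical result type: target is approximated by num / den. */
typedef struct {
    uint32_t num;
    uint32_t den;
} ratio_t;

/* Try every denominator from 1 to max_den and keep the ratio whose
   value is closest to target. Assumes target >= 0. On equal error the
   smaller denominator wins because only a strictly smaller error
   replaces the current best. */
ratio_t nearest_ratio(double target, uint32_t max_den)
{
    ratio_t best = { 0u, 1u };
    double best_err = DBL_MAX;

    for (uint32_t den = 1u; den <= max_den; ++den) {
        /* Round to nearest: truncation after adding 0.5 (target >= 0). */
        uint32_t num = (uint32_t)(target * (double)den + 0.5);
        double err = (double)num / (double)den - target;
        if (err < 0.0) {
            err = -err;
        }
        if (err < best_err) {
            best_err = err;
            best.num = num;
            best.den = den;
        }
    }
    return best;
}

/* Does num * max_input stay within an unsigned 16-bit integer? */
int fits_in_u16(uint32_t num, uint32_t max_input)
{
    return (num * max_input) <= 65535u;
}
```

For the 2430/1024 example, a maximum denominator of 8 yields 19/8, and 19 * 1023 fits in 16 bits. Raising the limit to 35 yields 83/35, whose product with 1023 does not fit.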
Hi Greg. I’m afraid your comment got lost in the blogging platform move. Anyway I think I may have already done what you are asking for – please check this out.