Archive for the ‘Firmware Bugs’ Category

Common programming errors and presidential inaugurations

Tuesday, January 20th, 2009 Nigel Jones

I don’t normally link politics and embedded systems, but something happened today at the inauguration of Barack Obama that struck me as an obvious error, one which my family, and I suspect 99.999% of the other viewers, accepted without question. I’m referring to the third paragraph of Rick Warren’s invocation, where he stated:

Now, today, we rejoice not only in America’s peaceful transfer of power for the 44th time. We celebrate a …

Well it seems to me that if Barack Obama is the 44th president of the USA, then there can only have ever been 43 transitions of power. I suppose that one could claim that when Washington became president, it was a transition of power. However no one could possibly claim it was peaceful!

What’s my point? Well, Rick Warren had just made a classic programming blunder: the off-by-one error. I’m guessing that his invocation was scrutinized by an army of political hacks, many with advanced degrees from top universities – yet despite this the error was not caught. I guess the next time you make this mistake in your code, you can console yourself with that fact.

BTW, you will not be surprised to know that my wife and kids just think that this confirms their belief that I’m a complete Nerd who is in desperate need of a life!

Bug cluster phenomenon

Wednesday, October 8th, 2008 Nigel Jones

I was debugging a piece of code recently when I realized that there was a scenario, albeit unlikely, in which a divide by zero could occur. Rather than just fix the bug and move on, I invoked what I call the “bug cluster phenomenon” rule. What, you may ask, is this rule? Well, it has two variants. The first is as follows:

“Where there is one bug, there is usually another.” I’ve observed this phenomenon over many years. What seems to happen is that when I (or anyone else for that matter) am generating a block of code, I get interrupted, or I’m tired, or my focus is elsewhere. As a result, when I create one bug, I usually create several others while I’m at it. Thus when I find a bug in a function, I always assume that it has company nearby. In short, finding a bug in a function always triggers a top-to-bottom review of that function and its neighbors. This has dramatically reduced my debugging time over the years – and I strongly recommend you adopt it.

The second variant of the rule is as follows:

“Logical errors normally have company”. I’ve also observed this phenomenon over many years. In this case, it seems that if you have made a particular error in logic in one place in the code, the chances are you have made the same error elsewhere. In the case of the divide by zero issue mentioned in the introduction, this prompted me to wonder if I had any other possible divide by zero errors lurking in my code. As a result, I performed a search through the entire project – and sure enough I found a few other cases where there existed the possibility of a divide by zero error. Thus finding one bug caused me to fix several. That’s efficient debugging!

Incidentally, I was able to find all the divisions in my code quickly because I am absolutely anal about having a space on either side of an operator. Thus, I needed to search for only two strings: " / " and " /= ". I’ve observed that many people are lackadaisical about this, so you’ll often see expressions such as "y=a/b". These people have no option other than to search for just "/" – which of course returns every line containing a comment – or to construct a more sophisticated regular expression search, which again takes time and is error prone.

Thus I have three pieces of advice to pass on:
1. When you find a bug, look nearby for more.
2. If the bug belongs to a particular class, search the rest of your code to see if you have made the same mistake elsewhere.
3. Write your code so that it is trivial to search for certain constructs. It will save you time in the long run.
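
To make the third point concrete, here is a minimal sketch (the function and variable names are hypothetical, not taken from the project in question) showing both the spacing convention and one way to guard against a possible divide by zero:

#include <stdint.h>

uint16_t average_rate(uint32_t event_count, uint16_t elapsed_seconds)
{
 uint16_t rate = 0U;

 if (elapsed_seconds > 0U)
 {
  /* The " / " spacing makes this line trivial to find with a plain text search */
  rate = (uint16_t)(event_count / elapsed_seconds);
 }

 return rate;
}

A guard like this is cheap insurance, and because every division is written with the same spacing, a simple search for " / " will find it again in seconds.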


Have you looked at your linker output file recently?

Tuesday, August 12th, 2008 Nigel Jones

Of the myriad files involved in a typical embedded firmware project, probably the two most feared (and yes, I do mean feared) are the linker control file (which tells the linker how to link your application) and the linker output file. Today it’s the latter that I’ll be talking about.

The linker output file contains a wealth of information about the way your application has been put together. Unfortunately, much of it is in such a cryptic format that examining the file is a painful process. Indeed, for this reason, I suspect that most projects are completed with nothing more than a cursory glance at this file.

This is a shame, because examination of the linker output file can significantly reduce your debugging time. To show you what I mean, consider my typical action sequence when I first start coding up a project.

1. Write a module.
2. Compile module and correct all errors and warnings.
3. Lint module and correct all complaints from Lint.
4. Repeat steps 1, 2 & 3 until I have sufficient modules to be able to generate a linkable image.
5. Link image and repeat steps 1-4 until the linker has no warnings or errors.
6. Examine the linker output file.

I’d wager that most developers out there would be reaching for the debugger in step 6. The reason I do not is that I can typically find some bugs simply by looking at the linker output. For example, consider this code sequence:

if (0 == var)
{
 function_a();
}
else if (1 == var)
{
 function_b();
}
else if (2 == var)
{
 function_b();
}
else
{
 function_d();
}

I make this sort of copy and paste error all the time. In this case, when var is 2, I meant to call function_c() but inadvertently ended up calling function_b() again. Since function_b() exists, the compiler is happy, and so there are typically no warnings.

So how does looking at the linker output file help me in this case? Well, if you have a decent linker it will give you a list of all the functions that aren’t called and that consequently have been stripped out of the final image. If in perusing this list I see that function_c() is listed as uncalled, then I immediately know I’ve got a bug somewhere. Typically tracking it down is very easy.
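
As an aside, I won’t name a particular toolchain here, but with the GNU toolchain one way to get such a list is to place each function in its own section and then have the linker discard and report whatever is unreferenced. The compiler name and file names below are purely illustrative:

arm-none-eabi-gcc -ffunction-sections -fdata-sections -c main.c -o main.o
arm-none-eabi-gcc main.o -Wl,--gc-sections -Wl,--print-gc-sections -Wl,-Map=app.map -o app.elf

If function_c shows up among the removed sections (or sits in app.map with no references), you know there is a bug in the dispatch code above before you ever load the image onto a target.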

I’ll leave for another day the other ways I use the linker output file to debug code.


Thoughts on the optimal time to test code

Friday, June 6th, 2008 Nigel Jones

Today I’d like to take on one of the sacred cows of the embedded industry, namely the temporal relationship between coding and testing of the aforementioned code. The conventional wisdom seems to be as follows.

“Write a small piece of code. As soon as possible test the code. Repeat until the task is complete”

I know that for many of you, my merely having the temerity to suggest that this might be sub-optimal will put me firmly into the category of hopeless heretic. Well, before you write me off as a lunatic, let me tell you about an alternative approach, how I stumbled upon it, and why I think it has much to commend it.

Being in the consulting business I’m typically working on multiple projects at once. Often a given project will be put on hold for any number of reasons which aren’t germane to this post. As a result, it’s not uncommon for me to write some code, compile it and then not touch it again for several months. I then find myself in the position of having to test / debug code that I wrote months ago. Having now done this many times, I’ve come to the conclusion that rather than this being a problem, it is instead the optimal temporal relationship between coding and testing.

How can this be, you ask? Surely after a multi-month hiatus the code is no longer fresh in your mind, and so it must be that much more difficult to test and debug? Well, the answer is of course yes – the code is no longer fresh in my mind, and yes, it does make it a little harder to test and debug in the short term. In that qualifier, “in the short term”, lies the point of my argument.

Why do we write code? Most people would claim we write code in order to make a functional product. I disagree with this assertion. I think we write code so that the people coming after us can understand it and modify it. This rather strange claim is based upon the studies showing that companies spend far more money maintaining code than they do writing it. Thus the smart way to write code is to do so in a manner that gives preeminent importance to the long-term maintenance of that code. So how does one do this? Well, that’s a topic for another post. What I can tell you is that having to test and debug code you wrote several months ago is a terrific way for the developer of that code to see it as someone who’ll be maintaining it will see it. You’ll see the inadequate or plain wrong comments. You’ll see the copy and paste errors. You’ll see where you got tired and took a short cut, and you’ll see those stupid mistakes caused by the telephone ringing at the wrong time.

Indeed, because you don’t expect the code to work (after all, it’s never been tested), I find you cast a very jaundiced eye over it – and in the process find a plethora of the mistakes that one typically finds by sitting in front of a debugger. Maybe it’s just me, but I’d rather find bugs via code inspection than by fighting the debug environments common to most embedded systems.

So in a nutshell, I think the optimal way to write and test code is as follows:

1. Write the code. Make sure it compiles and is Lint free.
2. Wait a few months.
3. Reread the code looking for the usual suspects of bad / wrong comments, copy and paste errors, sloppy coding etc.
4. Test it.

The person that maintains your code (quite likely a future version of you) will thank you for doing it this way.


An unfortunate consequence of a 32-bit world

Wednesday, August 29th, 2007 Nigel Jones

Back in the bad old days when I was a lad, one learned about microprocessors by programming 8 bit devices in assembly language. In fact I can still remember my first lab assignment – namely to multiply two 8 bit unsigned quantities together to get a 16 bit result (without the use of a hardware multiplier of course). One of the indelible lessons that comes from doing an exercise such as this, is that it can take many instructions to perform even the most innocuous of high level language statements.
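
For the curious, here is the gist of that exercise rendered in C rather than the assembly language we actually used. It is the classic shift-and-add algorithm, and this sketch is purely illustrative, not the original lab code:

#include <stdint.h>

/* Multiply two 8 bit unsigned values to get a 16 bit result,
   using only shifts and adds (no hardware multiplier). */
uint16_t mul8x8(uint8_t a, uint8_t b)
{
 uint16_t result = 0U;
 uint16_t addend = a;    /* widened copy of the multiplicand so it can be shifted left */

 while (b != 0U)
 {
  if (b & 1U)            /* if the current multiplier bit is set ... */
  {
   result += addend;     /* ... add in the shifted multiplicand */
  }
  addend <<= 1;
  b >>= 1;
 }

 return result;
}

Even these few lines of C expand into a surprising number of instructions on an 8 bit part, which is precisely the lesson the exercise was designed to teach.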

I mention this because today I was looking at some code written by a young engineer who had been recommended to me. In it, I noticed the following construct:

int ivar;    /* note: not declared volatile */

void some_function(void)
{
 ...
 ++ivar;
 ...
}

interrupt void isr_handler(void)
{
 ...
 --ivar;
 ...
}

Notwithstanding the fact that ivar should have been declared volatile, the most egregious mistake here was the assumption that the statement ++ivar is an atomic operation. Now if one is used to working on 32 bit machines, the concept of incrementing an integer being anything other than an atomic operation is of course ludicrous. However, in the 8 or 16 bit world where many of us labor in the embedded space, the idea of incrementing an integer being an atomic operation is equally ridiculous. The trouble with bugs like this is that they are difficult to spot and will only rear their heads after months or even years of operation.
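
For what it’s worth, a minimal sketch of one common fix is shown below. The DISABLE_INTERRUPTS() and ENABLE_INTERRUPTS() macros are hypothetical placeholders for whatever your compiler or RTOS provides; the point is simply that the shared variable must be declared volatile and its read-modify-write sequence protected from the interrupt:

volatile int ivar;      /* shared with the ISR, hence volatile */

void some_function(void)
{
 DISABLE_INTERRUPTS();  /* hypothetical macro: mask interrupts */
 ++ivar;                /* the read-modify-write can no longer be interrupted */
 ENABLE_INTERRUPTS();   /* hypothetical macro: restore interrupts */
}

interrupt void isr_handler(void)
{
 --ivar;                /* runs with the main code suspended, so no extra protection is needed here (assuming interrupts are not nested) */
}

On some targets a better approach is to save and restore the interrupt state rather than blindly re-enabling it, but that detail is beyond the scope of this sketch.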

So, is this a case of an incompetent individual? Although nominally yes, I suspect that the real problem is that he was raised on a diet of big CPUs. Perhaps the universities could do these engineers a favor and throw away the ARM based evaluation boards, replacing them with an 8051 based system.
