Archive for the ‘Firmware Bugs’ Category

Bug cluster phenomenon

Wednesday, October 8th, 2008 Nigel Jones

I was debugging a piece of code recently when I realized that there was a scenario, albeit unlikely, in which a divide by zero could occur. Rather than just fix the bug and move on, I invoked what I call the “bug cluster phenomenon” rule. What you may ask is this rule? Well it has two variants. The first is as follows:

“Where there is one bug, there is usually another”. I’ve observed this phenomenon over many years. What seems to happen is that when I (or anyone else for that matter) is generating a block of code, I get interrupted, or I’m tired or my focus is elsewhere. As a result, when I create one bug, I usually create several others while I am at it. Thus when I find a bug in a function, I always assume that it has company near bye. In short, finding a bug in a function always triggers a top to bottom review of that function and its neighbors. This has dramatically reduced my debugging time over the years – and I strongly recommend you adopt it.

The second variant of the rule is as follows:

“Logical errors normally have company”. I’ve also observed this phenomenon over many years. In this case, it seems that if you have made a particular error in logic in one place in the code, the chances are you have made the same error elsewhere. In the case of the divide by zero issue mentioned in the introduction, this prompted me to wonder if I had any other possible divide by zero errors lurking in my code. As a result, I performed a search through the entire project – and sure enough I found a few other cases where there existed the possibility of a divide by zero error. Thus finding one bug caused me to fix several. That’s efficient debugging!

Incidentally, I was able to quickly find all the divisions in my code because I am absolutely anal about having a space on either side of an operator. Thus, I needed to search for only two strings – ” / ” and ” /= “. I’ve observed that many people are lackadaisical about this, such that you’ll often see expressions such as “y=a/b”. These people have no option other than to search either for just “/” – which of course returns every line with a comment, or they have to construct a more sophisticated regular expression search – which again takes time and is error prone.

Thus I have three pieces of advice to pass on:
1. When you find a bug, look nearby for more.
2. If the bug was of a particular class of bug, then search your code to see if you had made the same mistake elsewhere.
3. Write your code so that it is trivial to search for certain constructs. It will save you time in the long run.

Home

Have you looked at your linker output file recently?

Tuesday, August 12th, 2008 Nigel Jones

Of all the myriad of files involved in a typical embedded firmware project, probably the two most feared (and yes I do mean feared) are the linker control file (which tells the linker how to link your application) and the linker output file. Today it’s the latter which I’ll be talking about.

The linker output file tells you a myriad of information about the way your application has been put together. Unfortunately, much of it is in such a cryptic format that examination of the file is a painful process. Indeed, for this reason, I suspect that most projects are completed with nothing more than a cursory look at this file.

This is a shame, because examination of the linker output file can significantly reduce your debugging time. To show you what I mean, consider my typical action sequence when I first start coding up a project.

1. Write a module.
2. Compile module and correct all errors and warnings.
3. Lint module and correct all complaints from Lint.
4. Repeat steps 1, 2 & 3 until I have sufficient modules to be able to generate a linkable image.
5. Link image and repeat steps 1-4 until the linker has no warnings or errors.
6. Examine the linker output file.

I’d wager that most developers out there would be reaching for the debugger in step 6. The reason I do not, is because I can typically find some bugs simply by looking at the linker output. For example, consider this code sequence:

if (0 == var)
{
 function_a();
} else if (1 == var)
{
 function_b();
}
else if (2 == var)
{
 function_b();
{
else
{
 function_d();
}

I make these sort of copy and paste errors all the time. In this case, when var is 2, I meant to call function_c but inadvertently I ended up calling function_b again. Since function_b exists, the compiler is happy and so there are typically no warnings.

So how does looking at the linker output file help me in this case? Well, if you have a decent linker it will give you a list of all the functions that aren’t called and that consequently have been stripped out of the final image. If in perusing this list I see that function_c() is listed as uncalled, then I immediately know I’ve got a bug somewhere. Typically tracking it down is very easy.

I’ll leave for another day the other ways I use the linker output file to debug code.

Home

Thoughts on the optimal time to test code

Friday, June 6th, 2008 Nigel Jones

Today I’d like to take on one of the sacred cows of the embedded industry, namely the temporal relationship between coding and testing of the aforementioned code. The conventional wisdom seems to be as follows.

“Write a small piece of code. As soon as possible test the code. Repeat until the task is complete”

I know for many of you, me merely having the temerity to suggest this might be sub-optimal will put me firmly into the category of hopeless heretic. Well, before you write me off as a lunatic, let me tell you about an alternative approach, how I stumbled upon it and why I think it has much to commend it.

Being in the consulting business I’m typically working on multiple projects at once. Often a given project will be put on hold for any number of reasons which aren’t germane to this post. As a result, it’s not uncommon for me to write some code, compile it and then not touch it again for several months. I then find myself in the position of having to test / debug code that I wrote months ago. Having now done this many times, I’ve come to the conclusion that rather than this being a problem, it is instead the optimal temporal relationship between coding and testing.

How can this be you ask? Surely after a multi-month hiatus, the code is no longer fresh in your mind and so it must make it that much more difficult to test and debug? Well the answer is of course yes – the code is no longer fresh in my mind, and yes it does make it a little harder to test and debug in the short term. In my emphasis lies the point of my argument.

Why do we write code? Most people would claim we write code in order to make a functional product. I disagree with this assertion. I think we write code so that people coming after us can understand it and modify it. This rather strange claim is based upon those studies that show that companies spend far more money maintaining code than they do writing it. Thus the smart way to write code is to do so in a manner that gives preeminent importance to the long term maintenance of that code. So how does one do this? Well that’s a topic for another post. What I can tell you, is that having to test and debug code that you wrote several months ago is a terrific way for the developer of the code to see the code as someone who’ll be maintaining it will see it. You’ll see the inadequate or plain wrong comments. You’ll see the copy and paste errors. You’ll see where you got tired and took a short cut, and you’ll see those stupid mistakes caused by the telephone ringing at the wrong time.

Indeed because you don’t expect the code to work (after all it’s never been tested) I find you cast a very jaundiced eye over the code – and in the process find a plethora of the mistakes that one typically finds by sitting in front of a debugger. Maybe it’s just me, but I’d rather find bugs via code inspection than by fighting the debug environments common to most embedded systems.

So in a nutshell, I think the optimal way to write and test code is as follows:

1. Write the code. Make sure it compiles and is Lint free.
2. Wait a few months.
3. Reread the code looking for the usual suspects of bad / wrong comments, copy and paste errors, sloppy coding etc.
4. Test it.

The person that maintains your code (quite likely a future version of you) will thank you for doing it this way.

Home

An unfortunate consequence of a 32-bit world

Wednesday, August 29th, 2007 Nigel Jones

Back in the bad old days when I was a lad, one learned about microprocessors by programming 8 bit devices in assembly language. In fact I can still remember my first lab assignment – namely to multiply two 8 bit unsigned quantities together to get a 16 bit result (without the use of a hardware multiplier of course). One of the indelible lessons that comes from doing an exercise such as this, is that it can take many instructions to perform even the most innocuous of high level language statements.

I mention this, because today I was looking at some code written by a young engineer who was recommended to me. In examining some of his code, I noticed the following construct:

void some_function(void)
{
 ...
 ++ivar;
 ...
}

interrupt void isr_handler(void)
{
 ...
 --ivar;
 ...
}

Notwithstanding the fact that ivar should have been declared volatile, the most egregious mistake here was the assumption that the statement ++ivar is an atomic operation. Now if one is used to working on 32 bit machines, the concept of incrementing an integer being anything other than an atomic operation is of course ludicrous. However, in the 8 or 16 bit world where many of us labor in the embedded space, the idea of incrementing an integer being an atomic operation is equally ridiculous. The trouble is with bugs like this is that they are difficult to spot, and will only rear their head after months or even years of operation.

So, is this a case of an incompetent individual? Although nominally yes, I suspect that the real problem is that he was raised on a diet of big CPUs. Perhaps the universities could do these engineers a favor, and throw away the ARM based evaluation boards and replace them with an 8051 based system.

Home

Understanding Stack Overflow

Monday, June 4th, 2007 Nigel Jones

I suspect that many, if not all bloggers are somewhat narcissistic. In my case it shows through in that I use one of the free services that keeps track of how many visitors I get and what brought them to this blog. Well, it turns out that many of the visitors to this blog get here not because of the brilliance of my writing, but because they did a Google search on “stack overflow” often qualified by PIC, or MSP430 etc. For many of these visitors I suspect they leave empty handed. Thus in an attempt to make these visits less pointless, let me give you my take on what causes a stack overflow in an embedded system.

First of all, go read the Wikipedia description of stack overflow. There’s nothing wrong with the description – it’s just incomplete from an embedded systems perspective.

If you are having problems with 8 bit PICs, then you should read this. For other architectures, read on…

On the assumption that you are getting a stack overflow and that you aren’t performing recursion or attempting to allocate a large amount of storage on the stack, what can be going wrong? Here’s a check list.

  1. What’s your stack size set to? If you don’t understand the question then you need an introductory course to embedded systems programming. If you do understand the question – but don’t know the answer – then this is the most likely source of your problem. How can this be you ask? Well, most embedded systems compilers are designed to work with a particular family of processors. The low end of the family may have a tiny amount of memory (e.g. 128 bytes). As such setting the default stack size to 16 bytes may be a sensible thing to do. Thus, your first step is to ensure that the stack size is set to something reasonable for your system. Click here for advice on how to do this.
  2. Which stack is overflowing? Many processors / compilers support / implement multiple stacks. A typical dichotomy is a call stack (upon which the return addresses of functions are stored) and a data or parameter stack (upon which automatic variables are stored). If you are using an RTOS, then typically there will be a shared call stack while each thread will have its own data stack. Thus is it the shared call stack that is overflowing, or is it the parameter stack associated with a particular task? Once you’ve made the determination which stack is overflowing then finding out exactly what gets placed on that stack will help lead you to the solution to your problem. If you can see no obvious high level language construct that is causing the problem, then the single most likely cause of your misery is an interrupt service routine…
  3. An interrupt service routine can use up an extraordinary amount of space on the stack. For a discussion of how this arises and its impact on performance, see this article. This problem is compounded if your system allows interrupts to be nested (that is, it allows an ISR to itself be interrupted).
  4. Certain library functions (printf() and its brethren are prime offenders) can use an enormous amount of stack space.
  5. If you are writing partially in assembly language, are you failing to pop every register that you pushed? This often occurs if you have more than one exit point from a function or ISR.
  6. If you are writing entirely in assembly language, did you set up the stack pointer correctly and do you know which way the stack grows?
  7. Have you made the mistake of programming a microcontroller that you don’t understand? For example, low end PIC processors have a tiny call stack which is easily overflowed. If you are programming a PIC and don’t know about this limitation, then quite frankly, I’m not surprised you are having problems.
  8. If none of the above solve your problem, then I’m afraid you are most likely in to a stack over-write problem. That is, a pointer is being de-referenced that results in the stack being overwritten. This can often arise when you allocate an array on the stack and then access an element beyond the end of the array. Lint will find a lot of these problems for you. If you don’t know what Lint is, see this article. If you do know what Lint is and aren’t using it then you deserve to be faced with these sorts of problems.

I have also written a related article on setting your stack size that you may find useful.

Home