Archive for the ‘Firmware Bugs’ Category

Understanding Stack Overflow

Monday, June 4th, 2007 Nigel Jones

I suspect that many, if not all, bloggers are somewhat narcissistic. In my case it shows in that I use one of the free services that keeps track of how many visitors I get and what brought them to this blog. Well, it turns out that many of the visitors to this blog get here not because of the brilliance of my writing, but because they did a Google search on “stack overflow”, often qualified by PIC, MSP430, etc. I suspect that many of these visitors leave empty-handed. Thus, in an attempt to make these visits less pointless, let me give you my take on what causes a stack overflow in an embedded system.

First of all, go read the Wikipedia description of stack overflow. There’s nothing wrong with the description – it’s just incomplete from an embedded systems perspective.

If you are having problems with 8 bit PICs, then you should read this. For other architectures, read on…

On the assumption that you are getting a stack overflow and that you aren’t performing recursion or attempting to allocate a large amount of storage on the stack, what can be going wrong? Here’s a check list.

  1. What’s your stack size set to? If you don’t understand the question then you need an introductory course in embedded systems programming. If you do understand the question – but don’t know the answer – then this is the most likely source of your problem. How can this be, you ask? Well, most embedded systems compilers are designed to work with a particular family of processors. The low end of the family may have a tiny amount of memory (e.g. 128 bytes). As such, setting the default stack size to 16 bytes may be a sensible thing for the compiler vendor to do, even though it is far too small for an application running on a larger member of the family. Thus, your first step is to ensure that the stack size is set to something reasonable for your system. Click here for advice on how to do this.
  2. Which stack is overflowing? Many processors / compilers support / implement multiple stacks. A typical dichotomy is a call stack (upon which the return addresses of functions are stored) and a data or parameter stack (upon which automatic variables are stored). If you are using an RTOS, then typically there will be a shared call stack while each task will have its own data stack. So, is it the shared call stack that is overflowing, or is it the parameter stack associated with a particular task? Once you’ve determined which stack is overflowing, finding out exactly what gets placed on that stack will help lead you to the solution to your problem. If you can see no obvious high level language construct that is causing the problem, then the single most likely cause of your misery is an interrupt service routine…
  3. An interrupt service routine can use up an extraordinary amount of space on the stack. For a discussion of how this arises and its impact on performance, see this article. This problem is compounded if your system allows interrupts to be nested (that is, it allows an ISR to itself be interrupted).
  4. Certain library functions (printf() and its brethren are prime offenders) can use an enormous amount of stack space.
  5. If you are writing partially in assembly language, are you failing to pop every register that you pushed? This often occurs if you have more than one exit point from a function or ISR.
  6. If you are writing entirely in assembly language, did you set up the stack pointer correctly and do you know which way the stack grows?
  7. Have you made the mistake of programming a microcontroller that you don’t understand? For example, low end PIC processors have a tiny call stack which is easily overflowed. If you are programming a PIC and don’t know about this limitation, then quite frankly, I’m not surprised you are having problems.
  8. If none of the above solves your problem, then I’m afraid you most likely have a stack overwrite problem. That is, a pointer is being dereferenced in a way that overwrites the stack. This often arises when you allocate an array on the stack and then access an element beyond the end of the array. Lint will find a lot of these problems for you. If you don’t know what Lint is, see this article. If you do know what Lint is and aren’t using it then you deserve to be faced with these sorts of problems.
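
To put some numbers on items 1, 3 and 4, here is a minimal sketch of a worst-case stack budget for a single shared stack. The structure, the function name and every figure are hypothetical; real values have to come from your map file, your compiler's stack usage report, or measurement. If your system has separate call and data stacks, you would budget each one separately.

```c
#include <stddef.h>

/* All names and figures are illustrative assumptions, not measured values. */
typedef struct {
    size_t deepest_call_chain;   /* bytes used by the deepest chain of calls from main() */
    size_t largest_library_call; /* e.g. printf(), if not already counted in that chain */
    const size_t *isr_frame;     /* bytes used by one ISR at each level that may nest */
    size_t isr_levels;           /* how many ISRs can be active at the same time */
} stack_budget_t;

/* Worst case: the deepest foreground call chain is interrupted, and every
 * ISR level that is allowed to nest is active at the same moment. */
size_t worst_case_stack(const stack_budget_t *b)
{
    size_t total = b->deepest_call_chain + b->largest_library_call;

    for (size_t i = 0; i < b->isr_levels; i++) {
        total += b->isr_frame[i];
    }
    return total;
}
```

If interrupts cannot nest, only one ISR is ever on the stack at a time, so pass a single entry holding the largest ISR frame. Whatever the answer, I would add some margin on top rather than set the stack size to exactly this figure.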

I have also written a related article on setting your stack size that you may find useful.
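
One widely used way of checking a stack size is to fill the stack with a known pattern at startup and later see how much of the pattern has been destroyed. What follows is a minimal sketch of that idea, not necessarily the method described in the article mentioned above. The function names and the fill value are my own assumptions, and the code assumes a descending stack that starts at the highest address of the region and grows toward the lowest.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5u   /* arbitrary pattern, an assumption of this sketch */

/* Fill the whole stack region with the pattern. On a real target this would
 * be done in the startup code, before the stack is in use. */
void stack_paint(uint8_t *stack, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Return how many bytes have been used at some point since painting.
 * Untouched bytes sit at the low end of the region for a descending stack. */
size_t stack_high_water(const uint8_t *stack, size_t size)
{
    size_t untouched = 0;

    while (untouched < size && stack[untouched] == STACK_FILL) {
        untouched++;
    }
    return size - untouched;
}
```

A result equal to the size of the region means the pattern has been destroyed all the way down, so the stack has probably overflowed. The figure can be slightly low if the deepest data written happens to match the fill pattern, so treat it as an estimate.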


Reset Reason

Saturday, September 23rd, 2006 Nigel Jones

The title of this post is rather ambiguous and can be read several different ways. This is no accident as it reflects the ambiguity that I see concerning the most fundamental event in an embedded system’s life – reset. Being a consultant, I get to write a lot of my own code. I also get to read a lot of other people’s code, and the one area to which I almost never see much thought given is the handling of the various causes of a system reset. In the bad old days, you were reset and that was all you knew. Today, however, modern processors contain registers that may be interrogated to determine the cause of the last reset. For example, an AVR processor I am working with lists the following possible causes:
Power Up
Brown Out
External Reset
Watchdog Reset
JTAG Reset

Based on my experience, I’d say that 99% of the embedded systems out there don’t care what caused their last reset. This strikes me as foolhardy. At the very least, an embedded system should keep track of the number of times it has taken a watchdog reset for post deployment quality analysis (you do do this, don’t you?). Furthermore, a portable system should take remedial action if it underwent a brown out reset – presumably indicating that the battery is failing. As for a JTAG reset, could this be construed as an attempt by someone to determine the inner workings of your system – and if so what should you do about it?
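
To make this concrete, here is a minimal sketch of classifying the reset cause and keeping a tally. The bit positions are modeled on the reset flag register of AVR parts that have JTAG, but treat them as an assumption and check your datasheet. The enum, the log structure and the function names are hypothetical. On a real target the flags would be read from the register early in the startup code and then cleared, and the log would live in non-volatile memory; here the flags are passed in as a plain value so the logic can be exercised on a PC.

```c
#include <stdint.h>

/* Assumed bit positions; verify against the datasheet for your part. */
#define FLAG_POWER_ON  (1u << 0)
#define FLAG_EXTERNAL  (1u << 1)
#define FLAG_BROWN_OUT (1u << 2)
#define FLAG_WATCHDOG  (1u << 3)
#define FLAG_JTAG      (1u << 4)

typedef enum {
    RESET_UNKNOWN, RESET_POWER_UP, RESET_BROWN_OUT,
    RESET_EXTERNAL, RESET_WATCHDOG, RESET_JTAG
} reset_cause_t;

/* Several flags may be set at once (for example after a slow power-up),
 * so test them in a fixed priority order, power-on first. */
reset_cause_t classify_reset(uint8_t flags)
{
    if (flags & FLAG_POWER_ON)  return RESET_POWER_UP;
    if (flags & FLAG_BROWN_OUT) return RESET_BROWN_OUT;
    if (flags & FLAG_WATCHDOG)  return RESET_WATCHDOG;
    if (flags & FLAG_JTAG)      return RESET_JTAG;
    if (flags & FLAG_EXTERNAL)  return RESET_EXTERNAL;
    return RESET_UNKNOWN;
}

/* Hypothetical log, intended to be kept in EEPROM or similar. */
typedef struct {
    uint16_t watchdog_resets;
    uint16_t brown_out_resets;
} reset_log_t;

/* Count the resets worth analyzing after deployment, saturating at the top. */
void record_reset(reset_log_t *log, reset_cause_t cause)
{
    if (cause == RESET_WATCHDOG && log->watchdog_resets < UINT16_MAX) {
        log->watchdog_resets++;
    } else if (cause == RESET_BROWN_OUT && log->brown_out_resets < UINT16_MAX) {
        log->brown_out_resets++;
    }
}
```

The priority order is a design choice of this sketch. What the system actually does for each cause (warn about the battery, lock down after a JTAG reset, and so on) is the part that has to be decided per product.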

I have been involved in systems where support for handling the different reset sources has been added as an afterthought – and it shows. As a result, I’ve come to the conclusion that the only way to handle this is to think about it from the start, and to know up front what needs to be done for each of the different reset sources. If you go through this exercise, you’ll find that your startup code becomes a lot more sophisticated. You’ll also find that you’ve designed a better system – which after all is the point.