Firmware-Specific Bug #4: Stack Overflow

Thursday, March 11th, 2010 by Michael Barr

Every programmer knows that a stack overflow is a Very Bad Thing™. The effect of each stack overflow varies, though. The nature of the damage and the timing of the misbehavior depend entirely on which data or instructions are clobbered and how they are used. Importantly, the length of time between a stack overflow and its negative effects on the system depends on how long it is before the clobbered bits are used.

Unfortunately, stack overflow afflicts embedded systems far more often than it does desktop computers. This is for several reasons, including:

  1. embedded systems usually have to get by on a smaller amount of RAM;
  2. there is typically no virtual memory to fall back on (because there is no disk);
  3. firmware designs based on RTOS tasks utilize multiple stacks (one per task), each of which must be sized to accommodate its own worst-case stack depth;
  4. and interrupt handlers may try to use those same stacks.

Further complicating this issue, no amount of testing can ensure that a particular stack is sufficiently large. You can test your system under all sorts of loading conditions, but you can only test it for so long. A stack overflow that only occurs “once in a blue moon” may not be witnessed by tests that run for only “half a blue moon.” Demonstrating that a stack overflow will never occur can, under algorithmic limitations (such as no recursion), be done with a top-down analysis of the control flow of the code. But a top-down analysis will need to be redone every time the code is changed.
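To make the top-down analysis concrete, here is a minimal sketch in C. It assumes a hypothetical, hand-built call graph with made-up function names and frame sizes; in practice those numbers would have to come from the compiler or linker output. Because recursion is ruled out, the graph is acyclic and the worst case is simply the deepest path from the entry point.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4

/* One node of a hypothetical call graph (all names and sizes are made up). */
struct func_node {
    const char *name;
    unsigned frame_bytes;        /* stack used by this function itself */
    int callees[MAX_CALLEES];    /* indices into the graph, -1 = unused slot */
};

/* Worst-case stack depth in bytes for a call chain starting at graph[idx].
 * Assumes the analyzed code has no recursion, so the graph is acyclic and
 * this walk terminates. (The walk itself recurses, which is acceptable for
 * an offline analysis tool running on a host.) */
unsigned worst_case_depth(const struct func_node *graph, int idx)
{
    unsigned deepest_callee = 0;

    assert(idx >= 0);
    for (size_t i = 0; i < MAX_CALLEES; i++) {
        int callee = graph[idx].callees[i];
        if (callee < 0) {
            continue;
        }
        unsigned d = worst_case_depth(graph, callee);
        if (d > deepest_callee) {
            deepest_callee = d;
        }
    }
    return graph[idx].frame_bytes + deepest_callee;
}

/* Example: entry -> {read_sensor, log_event}, log_event -> format_msg */
static const struct func_node example_graph[] = {
    { "entry",        32, {  1,  2, -1, -1 } },
    { "read_sensor",  48, { -1, -1, -1, -1 } },
    { "log_event",    24, {  3, -1, -1, -1 } },
    { "format_msg",  128, { -1, -1, -1, -1 } },
};
```

For this example graph, `worst_case_depth(example_graph, 0)` follows entry, log_event, format_msg and returns 32 + 24 + 128 = 184 bytes. A real budget would also need to add interrupt handler usage if ISRs share the task stack, and the whole calculation has to be repeated whenever the code changes.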

Best Practice: On startup, paint an unlikely memory pattern throughout the stack(s). (I like to use hex 23 3D 3D 23, which looks like a fence ‘#==#’ in an ASCII memory dump.) At runtime, have a supervisor task periodically check that none of the paint above some pre-established high water mark has been changed. If something is found to be amiss with a stack, log the specific error (e.g., which stack and how high the flood) in non-volatile memory and do something safe for users of the product (e.g., controlled shut down or reset) before a true overflow can occur. This is a nice additional safety feature to add to the watchdog task.
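The painting and checking steps might look roughly like the following sketch. It is not tied to any particular RTOS: the stack is stood in for by a plain array, a descending stack is assumed (index 0 is the end an overflow would run past), and the names `stack_paint`, `stack_headroom_words`, and `stack_flooded` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 64u
#define PAINT_WORD  0x233D3D23u  /* bytes 23 3D 3D 23: "#==#" in either byte order */

/* Stand-in for one task's stack region. On a real target this would be the
 * memory handed to the RTOS when the task is created. */
static uint32_t task_stack[STACK_WORDS];

/* Startup: fill the entire region with the paint pattern. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = PAINT_WORD;
    }
}

/* Count intact paint words starting from the overflow end.
 * This is the headroom the stack has never touched. */
size_t stack_headroom_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == PAINT_WORD) {
        n++;
    }
    return n;
}

/* Supervisor check: nonzero if any paint within the last min_free_words
 * of the region has been disturbed, i.e. the high water mark was crossed. */
int stack_flooded(const uint32_t *stack, size_t words, size_t min_free_words)
{
    assert(min_free_words <= words);
    return stack_headroom_words(stack, words) < min_free_words;
}
```

A supervisor task would call `stack_flooded()` for each task stack on every pass and, on a nonzero result, log which stack it was and the value of `stack_headroom_words()` before taking the safe action. The headroom figure is also useful during development for tuning stack sizes.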

14 Responses to “Firmware-Specific Bug #4: Stack Overflow”

  1. Lundin says:

    Hardened embedded engineers will try to use MCU hardware to detect stack overflow. Some MCUs have, for example, an illegal-address interrupt that fires when unused parts of the memory map are written to. It is then wise to map the stack next to such an area, so that a stack overflow will result in an interrupt and/or reset. The ISR for handling such a stack overflow has to be written in inline asm that uses CPU registers only (without stacking them first…).

    Alternatively, as a second-best option, you can map the stack next to non-volatile memory. If you get a stack overflow, any pushing onto the stack will result in banging your head against a solid wall of read-only cells.

  2. GroovyD says:

    One method I have used during run-time is in the system timer interrupt to just update stackMax, stackMin, and stackSize pointers such as:

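    volatile char* stackMin = (volatile char*)0xffffffff; /* lowest address seen (32-bit) */
    volatile char* stackMax = 0;                          /* highest address seen */
    volatile unsigned stackSize = 0;                      /* observed span in bytes */

    void SystemTimerInterrupt(void) {
    #ifdef STACK_TEST
        char stack; /* its address approximates the current stack pointer */
        if (&stack < stackMin) stackMin = &stack;
        if (&stack > stackMax) stackMax = &stack;
        stackSize = (unsigned)(stackMax - stackMin);
    #endif

    }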

    Doing this doesn’t require ‘painting’ memory or even knowing where the stack is to begin with and the app can use the stackSize to decide when there is a problem. Then when you are confident you have enough space you can just #def it out.

  3. GroovyD says:

    oops, looks like it lost the code a bit there…


    if (&stack < stackMin) stackMin = &stack;
    if (&stack > stackMax) stackMax = &stack;
    stackSize = (unsigned) (stackMax - stackMin);

  4. John D. Bugger says:

    We have a periodic task which checks that the stack pointer did not exceed a certain limit. If so we log an error to the NVM and then execute a SW reset. For the PowerPC the periodic decrementer interrupt can be used for this.

    • MarkM says:

      I’m hoping you realized shortly after posting this that you’re only checking the stack level of the task you’ve assigned to check the stack level. Otherwise your task that’s dedicated to checking the stack level is faithfully checking its own dedicated stack, and faithfully ignoring the stack level of all other stacks for all other tasks in your system. Your stack level testing task is doing nothing but chewing through CPU bandwidth and eating up a (small) amount of RAM.

      • Michael Barr says:

        Mark,

        Thanks for your feedback and making sure everyone understands. I thought the original post was clear that your supervisor task would check up on painted regions of all task stacks and do “something smart” in the event that “a stack” is nearing overflow. Certainly, that’s what I meant.

        Cheers,
        Mike

  5. Rakesh Kumar says:

    After a code review, if we can’t find the reason for a crash and we suspect a stack overflow, one method I have tried is declaring variables as “static” instead of as ordinary locals, which moves them off the stack.
    If the problem goes away completely, the system really was having stack issues.

  6. Rakesh Jain says:

    What I do is define a const variable just above the stack and give it a default value with a known pattern, say 0x11223344. Then keep checking this variable; if it gets corrupted, you know the stack has overflowed.

    • Michael Barr says:

      That’s a clever idea, Rakesh. But, unfortunately, I don’t believe that one word remaining unchanged is generally sufficient to prove no stack overflows have occurred.

    • JeffH says:

      Michael, can you explain/elaborate why checking for a pattern immediately above the stack is generally insufficient to detect a stack overflow?

      • Michael Barr says:

        The risk is that checking just the first byte (or just the first 16- or 32-bit word) past the end of the stack could miss a corruption in a subsequent location. For example, if a new stack frame is allocated into that space, the one location you’re checking might belong to a variable that was never initialized, or might fall in a part of a larger object that wasn’t written on this call.
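The scenario in the reply above can be reproduced with a small simulation. This is only a sketch under stated assumptions: RAM is modeled as an array, the stack grows toward index 0, the guard word sits at a made-up `GUARD_INDEX`, and `push_frame` is a hypothetical helper that mimics a function whose frame is large but only partly written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS   32u
#define GUARD_INDEX 8u           /* the stack is meant to stay above this index */
#define GUARD_WORD  0x11223344u

static uint32_t mem[MEM_WORDS];  /* simulated RAM; the stack grows toward index 0 */

/* Simulate a call whose frame is frame_words long but which writes only the
 * first used_words of it, starting at the lowest address (like a large local
 * buffer that is only partly filled). Returns the new stack pointer index. */
size_t push_frame(size_t sp, size_t frame_words, size_t used_words)
{
    assert(frame_words <= sp && used_words <= frame_words);
    sp -= frame_words;
    for (size_t i = 0; i < used_words; i++) {
        mem[sp + i] = 0xDEADBEEFu;
    }
    return sp;
}

/* The single-word check: has the guard pattern survived? */
int guard_intact(void)
{
    return mem[GUARD_INDEX] == GUARD_WORD;
}
```

Suppose the guard is set with `mem[GUARD_INDEX] = GUARD_WORD`, an unrelated variable lives at `mem[5]`, and the stack pointer is at index 12. Calling `push_frame(12, 8, 2)` moves the stack pointer to index 4 and writes indices 4 and 5 only. The frame straddles the guard word without ever writing it, so `guard_intact()` still reports success even though `mem[5]` has been overwritten. Checking a wider painted region narrows this gap, though it cannot close it completely.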

  7. […] have written about what really happens during stack overflow before (Firmware-Specific Bug #4: Stack Overflow) and this explains why a reset may not result and also why it is so hard to trace a stack overflow […]

  8. Also, FreeRTOS/OpenRTOS/SafeRTOS offer built-in stack overflow checking as part of the kernel (executed on every context switch, after the context is pushed onto the stack).
