Posts Tagged ‘security’

Firmware-Specific Bug #5: Heap Fragmentation

Monday, March 15th, 2010 Michael Barr

Dynamic memory allocation is not widely used by embedded software developers—and for good reasons. One of those is the problem of fragmentation of the heap.

All data structures created via C’s malloc() standard library routine or C++’s new keyword live on the heap. The heap is a specific area in RAM of a pre-determined maximum size. Initially, each allocation from the heap reduces the amount of remaining “free” space by the same number of bytes. For example, the heap in a particular system might span 10 KB starting from address 0x20200000. An allocation of a pair of 4-KB data structures would leave 2 KB of free space.

The storage for data structures that are no longer needed can be returned to the heap by a call to free() or use of the delete keyword. In theory this makes that storage space available for reuse during subsequent allocations. But the order of allocations and deletions is generally at least pseudo-random—leading the heap to become a mess of smaller fragments.

To see how fragmentation can be a problem, consider what would happen if the first of the above 4 KB data structures is free. Now the heap consists of one 4-KB free chunk and another 2-KB free chunk; they are not adjacent and cannot be combined. So our heap is already fragmented. Despite 6 KB of total free space, allocations of more than 4 KB will fail.

Fragmentation is similar to entropy: both increase over time. In a long running system (i.e., most every embedded system ever created), fragmentation may eventually cause some allocation requests to fail. And what then? How should your firmware handle the case of a failed heap allocation request?

Best Practice: Avoiding all use of the heap may is a sure way of preventing this bug. But if dynamic memory allocation is either necessary or convenient in your system, there is an alternative way of structuring the heap that will prevent fragmentation. The key observation is that the problem is caused by variable sized requests.

If all of the requests were of the same size, then any free block is as good as any other—even if it happens not to be adjacent to any of the other free blocks. Thus it is possible to use multiple “heaps”—each for allocation requests of a specific size—can using a “memory pool” data structure.

If you like you can write your own fixed-sized memory pool API. You’ll just need three functions:

  • handle = pool_create(block_size, num_blocks) – to create a new pool (of size M chunks by N bytes);
  • p_block = pool_alloc(handle) – to allocate one chunk (from a specified pool); and
  • pool_free(handle, p_block).

But note that many real-time operating systems (RTOSes) feature a fixed-size memory pool API. If you have access to one of those, use it instead of the compiler’s malloc() and free() or your own implementation.

Firmware-Specific Bug #4

Firmware-Specific Bug #6

Firmware-Specific Bug #4: Stack Overflow

Thursday, March 11th, 2010 Michael Barr

Every programmer knows that a stack overflow is a Very Bad Thing™. The effect of each stack overflow varies, though. The nature of the damage and the timing of the misbehavior depend entirely on which data or instructions are clobbered and how they are used. Importantly, the length of time between a stack overflow and its negative effects on the system depends on how long it is before the clobbered bits are used.

Unfortunately, stack overflow afflicts embedded systems far more often than it does desktop computers. This is for several reasons, including:

  1. embedded systems usually have to get by on a smaller amount of RAM;
  2. there is typically no virtual memory to fall back on (because there is no disk);
  3. firmware designs based on RTOS tasks utilize multiple stacks (one per task), each of which must be sized sufficiently to ensure against unique worst-case stack depth;
  4. and interrupt handlers may try to use those same stacks.

Further complicating this issue, there is no amount of testing that can ensure that a particular stack is sufficiently large. You can test your system under all sorts of loading conditions but you can only test it for so long. A stack overflow that only occurs “once in a blue moon” may not be witnessed by tests that run for only “half a blue moon.” Demonstrating that a stack overflow will never occur can, under algorithmic limitations (such as no recursion), be done with a top down analysis of the control flow of the code. But a top down analysis will need to be redone every time the code is changed.

Best Practice: On startup, paint an unlikely memory pattern throughout the stack(s). (I like to use hex 23 3D 3D 23, which looks like a fence ‘#==#’ in an ASCII memory dump.) At runtime, have a supervisor task periodically check that none of the paint above some pre-established high water mark has been changed. If something is found to be amiss with a stack, log the specific error (e.g., which stack and how high the flood) in non-volatile memory and do something safe for users of the product (e.g., controlled shut down or reset) before a true overflow can occur. This is a nice additional safety feature to add to the watchdog task.

Firmware-Specific Bug #3

Firmware-Specific Bug #5 (coming soon)

Firmware-Specific Bug #3: Missing Volatile Keyword

Thursday, February 18th, 2010 Michael Barr

Failure to tag certain types of variables with C’s ‘volatile’ keyword, can cause a number of symptoms in a system that works properly only when the compiler’s optimizer is set to a low level or disabled. The volatile qualifier is used during variable declarations, where its purpose is to prevent optimization of the reads and writes of that variable.

For example, if you write code that says:


    g_alarm = ALARM_ON;    // Patient dying--get nurse!
    // Other code; with no reads of g_alarm state.
    g_alarm = ALARM_OFF;   // Patient stable.

the optimizer will generally try to make your program both faster and smaller by eliminating the first line above–to the detriment of the patient. However, if g_alarm is declared as volatile this optimization will not take place.

Best Practice: The ‘volatile’ keyword should be used to declare any: (a) global variable shared by an ISR and any other code; (b) global variable accessed by two or more RTOS tasks (even when race conditions in those accesses have been prevented); (c) pointer to a memory-mapped peripheral register (or register set); or (d) delay loop counter.

Note that in addition to ensuring all reads and writes take place for a given variable, the use of volatile also constrains the compiler by adding additional “sequence points”. Accesses to multiple volatiles must be executed in the order they are written in the code.

Firmware-Specific Bug #2

Firmware-Specific Bug #4

This Code Stinks! The Worst Embedded Code Ever

Thursday, November 5th, 2009 Michael Barr

At the Embedded Systems Conference Boston in September, I gave a popular ESC Theater talk titled “This Code Stinks! The Worst Embedded Code Ever” that used lousy code from real products as a teaching tool. The example code was gathered by a number of engineers from a broad swath of companies over several years. (Minor details, including variable names and function names, were changed as needed to hide the specifics of applications, companies, or programmers.)

Here’s just one example of the bad code in that presentation:

y = (x + 305) / 146097 * 400 + (x + 305) % 146097 / 36524 * 100 + (x + 305) % 146097 % 36524 / 1461 * 4 + (x + 305) % 146097 % 36524 % 1461 / 365;

I don’t know if the above snippet contains any bugs, as most of the other examples were found to. And that’s a problem. Where are we supposed to begin an analysis of the above? What is this code supposed to do when it works? What range of input values is appropriate to test? What are the correct output values for a given input? Is this code responsible for handling out of range inputs gracefully? In the original listing, there were no comments on or around this line to help.

I eventually learned that this code computes the year, with accounting for extra days in leap years, given the number of days since a known reference date (e.g., January 1, 1970). But I note that we still don’t know if it works in all cases; despite it being present in an FDA-regulated medical device. I note too that the Microsoft Zune Bug was buried in a much better formatted snippet of code that performed a very similar calculation.

Here’s another example, this time in C++, with the bug finding left as an exercise for the reader:

bool Probe::getParam(uint32_t param_id, int32_t idx)
{
int32_t val = 0;
int32_t ret = 0;

ret = m_pParam->readParam(param_id, idx, &val);

if (!ret)
{
logMsg(“attempt to read parameter failed\n”);
exit(1);
}
else …

Hint: This code was embedded in a piece of factory automation equipment.

I’ve placed the full set of slides online at http://bit.ly/badcode.

Coding Standard Rule #7: Don’t Mix Bit-Wise Operators and Signed Data

Wednesday, April 1st, 2009 Michael Barr

Rule: None of the bit-wise operators (i.e., &, |, ~, ^, <<, and >>) shall be used to manipulate signed integer data.

Example (don’t):

int8_t  signed_data = -4;
signed_data >>= 1; // not necessarily -2

Reasoning: The C standard does not specific the underlying format of signed data (e.g., 2’s complement) and leaves the effect of some bit-wise operators to be defined by the compiler author.

Coding Standard Rule #6
Coding Standard Rule #8

These rules are excerpts from the Embedded C Coding Standard book.