Archive for November, 2010

Firmware-Specific Bug #9: Incorrect Priority Assignment

Tuesday, November 30th, 2010 Michael Barr

Get your priorities straight! Or suffer the consequence of missed deadlines. Of course, I’m talking here about the relative priorities of your real-time tasks and interrupt service routines. In my travels around the embedded design community, I’ve learned that most real-time systems are designed with ad hoc priorities.

Unfortunately, mis-prioritized systems often “appear” to work fine without discernibly missing critical deadlines in testing. The worst-case workload may have never yet happened in the field or there is sufficient CPU to accidentally succeed despite the lack of proper planning. This has lead to a generation of embedded software developers being unaware of the proper technique. There is simply too little feedback from non-reproducible deadline misses in the field to the original design team—unless a death and a lawsuit forces an investigation.

Best Practice: There is a science to the process of assigning relative priorities. That science is associated with the “rate monotonic algorithm,” which provides a formulaic way to assign task priorities based on facts. It is also associated with the “rate monotonic analysis,” which helps you prove that your correctly-prioritized tasks and ISRs will find sufficient available CPU bandwidth between them during extreme busy workloads called “transient overload.” It’s too bad most engineers don’t know how to use these tools.

There’s insufficient space in this column for me to explain why and how RMA works. But I’ve written on these topics before and recommend you start with “Introduction to Rate-Monotonic Scheduling” and then read my column “3 Things Every Programmer Should Know About RMA.”

Please know that if you don’t use RMA to prioritize your tasks and ISRs (as a set), there’s only one entity with any guarantees: the one highest-priority task or ISR can take the CPU for itself at any busy time—barring priority inversions!—and thus has up to 100% of the CPU bandwidth available to it. Also note that there is no rule of thumb about what percentage of the CPU bandwidth you may safely use between a set of two or more runnables unless you do follow the RMA scheme.

Firmware-Specific Bug #8

Firmware-Specific Bug #10

Firmware-Specific Bug #8: Priority Inversion

Tuesday, November 23rd, 2010 Michael Barr

A wide range of nasty things can go wrong when two or more tasks coordinate their work through, or otherwise share, a singleton resource such as a global data area, heap object, or peripheral’s register set. In the first part of this column, I described two of the most common problems in task-sharing scenarios: race conditions and non-reentrant functions. But resource sharing combined with the priority-based preemption found in commercial real-time operating systems can also cause priority inversion, which is equally difficult to reproduce and debug.

The problem of priority inversion stems from the use of an operating system with fixed relative task priorities. In such a system, the programmer must assign each task it’s priority. The scheduler inside the RTOS provides a guarantee that the highest-priority task that’s ready to run gets the CPU—at all times. To meet this goal, the scheduler may preempt a lower-priority task in mid-execution. But when tasks share resources, events outside the scheduler’s control can sometimes prevent the highest-priority ready task from running when it should. When this happens, a critical deadline could be missed, causing the system to fail.

At least three tasks are required for a priority inversion to actually occur: the pair of highest and lowest relative priority must share a resource, say by a mutex, and the third must have a priority between the other two. The scenario is always as shown in the figure below. First, the low-priority task acquires the shared resource (time t1). After the high priority task preempts low, it next tries but fails to acquire their shared resource (time t2); control of the CPU returns back to low as high blocks. Finally, the medium priority task—which has no interest at all in the resource shared by low and high—preempts low (time t3). At this point the priorities are inverted: medium is allowed to use the CPU for as long as it wants, while high waits for low. There could even be multiple medium priority tasks.

Priority Inversion

The risk with priority inversion is that it can prevent the high-priority task in the set from meeting a real-time deadline. The need to meet deadlines often goes hand-in-hand with the choice of a preemptive RTOS. Depending on the end product, this missed deadline outcome might even be deadly for its user!

One of the major challenges with priority inversion is that it’s generally not a reproducible problem. First, the three steps need to happen—and in that order. And then the high priority task needs to actually miss a deadline. One or both of these may be rare or hard to reproduce events. Unfortunately, no amount of testing can assure they won’t ever happen in the field.[5]

Best Practice: The good news is that an easy three-step fix will eliminate all priority inversions from your system.
Choose an RTOS that includes a priority-inversion work-around in its mutex API. These work-arounds come by various names, such as priority inheritance protocol and priority ceiling emulation. Ask your sales rep for details.
Only use the mutex API (never the semaphore API, which lacks this work-around) to protect shared resources within real-time software.

Take the additional execution time cost of the work-around into account when performing the analysis to prove that all deadlines will always be met. Note that the method for doing this varies by the specific work-around.
Note that it’s safe to ignore the possibility of priority inversions if you don’t have any tasks with consequences for missing deadlines.

Footnotes

[5] Barr, Michael and Dave Stewart. “Introduction to Rate Monotonic Scheduling,” Beginner’s Corner, Embedded Systems Programming, February 2002. Available online at www.embedded.com/showArticle.jhtml?articleID=9900522.

Firmware-Specific Bug #7

Firmware-Specific Bug #9

Firmware-Specific Bug #7: Deadlock

Wednesday, November 17th, 2010 Michael Barr

A deadlock is a circular dependency between two or more tasks. For example, if Task 1 has already acquired A and is blocked waiting for B while Task 2 has previously acquired B and is blocked waiting for A, neither task will awake. Circular dependencies can occur at several levels in the architecture of a multithreaded system (for example, each task is waiting for an event only the other will send) but here I am concerned with the common problem of resource deadlocks involving mutexes.

Best Practice: Two simple programming practices are each able to entirely prevent resource deadlocks in embedded systems. The first technique, which I recommend over the other, is to never attempt or require the simultaneous acquisition of two or more mutexes. Holding one mutex while blocking for another mutex turns out to be a necessary condition for deadlock. Holding one mutex is never, by itself, a cause of deadlock.[3]

In my view, the practice of acquiring only one mutex at a time is also consistent with an excellent architectural practice of always pushing the acquisition and release of mutexes into the leaf nodes of your code. The leaf nodes are the device drivers and reentrant libraries. This keeps the mutex acquisition and release code out of the task-level algorithmics and helps to minimize the amount of code inside critical sections.[4]

The second technique is to assign an ordering to all of the mutexes in the system (for example, alphabetical order by mutex handle variable name) and to always acquire multiple mutexes in that same order. This technique will definitely remove all resource deadlocks but comes with an execution-time price. I recommend removing deadlocks this way only when you’re dealing with large bodies of legacy code that can’t be easily refactored to eliminate the multiple-mutex dependency.

Footnotes

[3] In theory, the task that wants the mutex could starve while a series of higher priority tasks take turns with the mutex. However, the rate monotonic analysis can be used to ensure this doesn’t happen to tasks with deadlines that must be met.

[4] An additional benefit of this architectural pattern is that it reduces the number of programmers on the team who must remember to use and correctly use each mutex. Other benefits are that each mutex handle can be hidden inside the leaf node that uses it and that doing this allows for easier switches between interrupt disables and mutex acquisition as appropriate to balance performance and task prioritization. One of the most famous priority inversions happened on Mars in 1997. Glitches were observed in Earth-based testing that could not be reproduced and were not attributed to priority inversion until after the problems on Mars forced investigation. For more details, read Glenn Reave’s “What really happened on Mars?” account.

Firmware-Specific Bug #6

Firmware-Specific Bug #8

Firmware-Specific Bug #6: Memory Leak

Tuesday, November 16th, 2010 Michael Barr

Eventually, systems that leak even small amounts of memory will run out of free space and subsequently fail in nasty ways. Often legitimate memory areas get overwritten and the failure isn’t registered until much later. This happens when, for example, a NULL pointer is returned by a failed call to malloc() and the caller blindly proceeds to overwrite the interrupt vector table or some other valuable code or data starting from physical address 0x00000000.

Memory leaks are mostly a problem in systems that use dynamic memory allocation.[1] And memory leaks are memory leaks whether we’re talking about an embedded system or a PC program. However, the long-running nature of embedded systems combined with the deadly or spectacular failures that some safety-critical systems may have make this one bug you definitely don’t want in your firmware.

Memory leaks are a problem of ownership management. Objects allocated from the heap always have a creator, such as a task that calls malloc() and passes the resulting pointer on to another task via message queue or inserts the new buffer into a meta heap object such as a linked list. But does each allocated object have a designated destroyer? Which other task is responsible and how does it know that every other task is finished with the buffer?

Best Practice: There is a simple way to avoid memory leaks and that is to clearly define the ownership pattern or lifetime of each type of heap-allocated object. The figure above shows one common ownership pattern involving buffers that are allocated by a producer task (P), sent through a message queue, and later destroyed by a consumer task (C). To the maximum extent possible this and other safe design patterns should be followed in real-time systems that use the heap.[2]

Footnotes

[1] Unlike fragmentation, memory leaks can happen even with fixed-size allocators.

[2] In addition to avoiding memory leaks, the design pattern shown in in the figure above can be used to ensure against “out-of-memory” errors, in which there are no buffers available in the buffer pool when the producer task attempts an allocation. The technique is to (1) create a dedicated buffer pool for that type of allocation, say a buffer pool of 17-byte buffers; (2) use queuing theory to appropriately size the message queue, which ensures against a full queue; and (3) size the buffer pool so there is initially one free buffer for each consumer, each producer, plus each slot in the message queue.

Firmware-Specific Bug #5

Firmware-Specific Bug #7

Subscribing to this Blog’s Comments

Tuesday, November 16th, 2010 Michael Barr

I recently learned it’s possible to easily subscribe to all the comments posted on this blog. Given the quality of the comments that are posted here, I thought others may be interested too. The URL is feed://embeddedgurus.com/barr-code/comments/feed/.