Posts Tagged ‘safety’

Firmware-Specific Bug #8: Priority Inversion

Tuesday, November 23rd, 2010 Michael Barr

A wide range of nasty things can go wrong when two or more tasks coordinate their work through, or otherwise share, a singleton resource such as a global data area, heap object, or peripheral’s register set. In the first part of this column, I described two of the most common problems in task-sharing scenarios: race conditions and non-reentrant functions. But resource sharing combined with the priority-based preemption found in commercial real-time operating systems can also cause priority inversion, which is equally difficult to reproduce and debug.

The problem of priority inversion stems from the use of an operating system with fixed relative task priorities. In such a system, the programmer must assign each task it’s priority. The scheduler inside the RTOS provides a guarantee that the highest-priority task that’s ready to run gets the CPU—at all times. To meet this goal, the scheduler may preempt a lower-priority task in mid-execution. But when tasks share resources, events outside the scheduler’s control can sometimes prevent the highest-priority ready task from running when it should. When this happens, a critical deadline could be missed, causing the system to fail.

At least three tasks are required for a priority inversion to actually occur: the pair of highest and lowest relative priority must share a resource, say by a mutex, and the third must have a priority between the other two. The scenario is always as shown in the figure below. First, the low-priority task acquires the shared resource (time t1). After the high priority task preempts low, it next tries but fails to acquire their shared resource (time t2); control of the CPU returns back to low as high blocks. Finally, the medium priority task—which has no interest at all in the resource shared by low and high—preempts low (time t3). At this point the priorities are inverted: medium is allowed to use the CPU for as long as it wants, while high waits for low. There could even be multiple medium priority tasks.

Priority Inversion

The risk with priority inversion is that it can prevent the high-priority task in the set from meeting a real-time deadline. The need to meet deadlines often goes hand-in-hand with the choice of a preemptive RTOS. Depending on the end product, this missed deadline outcome might even be deadly for its user!

One of the major challenges with priority inversion is that it’s generally not a reproducible problem. First, the three steps need to happen—and in that order. And then the high priority task needs to actually miss a deadline. One or both of these may be rare or hard to reproduce events. Unfortunately, no amount of testing can assure they won’t ever happen in the field.[5]

Best Practice: The good news is that an easy three-step fix will eliminate all priority inversions from your system.
Choose an RTOS that includes a priority-inversion work-around in its mutex API. These work-arounds come by various names, such as priority inheritance protocol and priority ceiling emulation. Ask your sales rep for details.
Only use the mutex API (never the semaphore API, which lacks this work-around) to protect shared resources within real-time software.

Take the additional execution time cost of the work-around into account when performing the analysis to prove that all deadlines will always be met. Note that the method for doing this varies by the specific work-around.
Note that it’s safe to ignore the possibility of priority inversions if you don’t have any tasks with consequences for missing deadlines.

Footnotes

[5] Barr, Michael and Dave Stewart. “Introduction to Rate Monotonic Scheduling,” Beginner’s Corner, Embedded Systems Programming, February 2002. Available online at www.embedded.com/showArticle.jhtml?articleID=9900522.

Firmware-Specific Bug #7

Firmware-Specific Bug #9

Firmware-Specific Bug #7: Deadlock

Wednesday, November 17th, 2010 Michael Barr

A deadlock is a circular dependency between two or more tasks. For example, if Task 1 has already acquired A and is blocked waiting for B while Task 2 has previously acquired B and is blocked waiting for A, neither task will awake. Circular dependencies can occur at several levels in the architecture of a multithreaded system (for example, each task is waiting for an event only the other will send) but here I am concerned with the common problem of resource deadlocks involving mutexes.

Best Practice: Two simple programming practices are each able to entirely prevent resource deadlocks in embedded systems. The first technique, which I recommend over the other, is to never attempt or require the simultaneous acquisition of two or more mutexes. Holding one mutex while blocking for another mutex turns out to be a necessary condition for deadlock. Holding one mutex is never, by itself, a cause of deadlock.[3]

In my view, the practice of acquiring only one mutex at a time is also consistent with an excellent architectural practice of always pushing the acquisition and release of mutexes into the leaf nodes of your code. The leaf nodes are the device drivers and reentrant libraries. This keeps the mutex acquisition and release code out of the task-level algorithmics and helps to minimize the amount of code inside critical sections.[4]

The second technique is to assign an ordering to all of the mutexes in the system (for example, alphabetical order by mutex handle variable name) and to always acquire multiple mutexes in that same order. This technique will definitely remove all resource deadlocks but comes with an execution-time price. I recommend removing deadlocks this way only when you’re dealing with large bodies of legacy code that can’t be easily refactored to eliminate the multiple-mutex dependency.

Footnotes

[3] In theory, the task that wants the mutex could starve while a series of higher priority tasks take turns with the mutex. However, the rate monotonic analysis can be used to ensure this doesn’t happen to tasks with deadlines that must be met.

[4] An additional benefit of this architectural pattern is that it reduces the number of programmers on the team who must remember to use and correctly use each mutex. Other benefits are that each mutex handle can be hidden inside the leaf node that uses it and that doing this allows for easier switches between interrupt disables and mutex acquisition as appropriate to balance performance and task prioritization. One of the most famous priority inversions happened on Mars in 1997. Glitches were observed in Earth-based testing that could not be reproduced and were not attributed to priority inversion until after the problems on Mars forced investigation. For more details, read Glenn Reave’s “What really happened on Mars?” account.

Firmware-Specific Bug #6

Firmware-Specific Bug #8

Firmware-Specific Bug #6: Memory Leak

Tuesday, November 16th, 2010 Michael Barr

Eventually, systems that leak even small amounts of memory will run out of free space and subsequently fail in nasty ways. Often legitimate memory areas get overwritten and the failure isn’t registered until much later. This happens when, for example, a NULL pointer is returned by a failed call to malloc() and the caller blindly proceeds to overwrite the interrupt vector table or some other valuable code or data starting from physical address 0×00000000.

Memory leaks are mostly a problem in systems that use dynamic memory allocation.[1] And memory leaks are memory leaks whether we’re talking about an embedded system or a PC program. However, the long-running nature of embedded systems combined with the deadly or spectacular failures that some safety-critical systems may have make this one bug you definitely don’t want in your firmware.

Memory leaks are a problem of ownership management. Objects allocated from the heap always have a creator, such as a task that calls malloc() and passes the resulting pointer on to another task via message queue or inserts the new buffer into a meta heap object such as a linked list. But does each allocated object have a designated destroyer? Which other task is responsible and how does it know that every other task is finished with the buffer?

Best Practice: There is a simple way to avoid memory leaks and that is to clearly define the ownership pattern or lifetime of each type of heap-allocated object. The figure above shows one common ownership pattern involving buffers that are allocated by a producer task (P), sent through a message queue, and later destroyed by a consumer task (C). To the maximum extent possible this and other safe design patterns should be followed in real-time systems that use the heap.[2]

Footnotes

[1] Unlike fragmentation, memory leaks can happen even with fixed-size allocators.

[2] In addition to avoiding memory leaks, the design pattern shown in in the figure above can be used to ensure against “out-of-memory” errors, in which there are no buffers available in the buffer pool when the producer task attempts an allocation. The technique is to (1) create a dedicated buffer pool for that type of allocation, say a buffer pool of 17-byte buffers; (2) use queuing theory to appropriately size the message queue, which ensures against a full queue; and (3) size the buffer pool so there is initially one free buffer for each consumer, each producer, plus each slot in the message queue.

Firmware-Specific Bug #5

Firmware-Specific Bug #7

Get Your (RTOS Task and ISR) Priorities Straight!

Thursday, October 14th, 2010 Michael Barr

Get your priorities straight!  Or suffer the consequence of missed deadlines.  Of course, I’m talking here about the relative priorities of your real-time tasks and interrupt service routines.  In my travels around the embedded design community, I have learned that most real-time systems are designed with ad hoc priorities.

Unfortunately, mis-prioritized systems often “appear” to work fine without discernibly missing critical deadlines in testing.  The worst-case workload may have never yet happened in the field or there is sufficient CPU to accidentally succeed despite the lack of proper planning.  This has lead to a generation of embedded software developers being unaware of the proper technique.  There is simply too little feedback from non-reproducible deadline misses in the field to the team that designed it—unless a death and a lawsuit forces an investigation.

Truth be told: There is a science to the process of assigning relative priorities.  That science is associated with the “rate monotonic algorithm,” which provides a formulaic way to assign task priorities based on facts.  It is also associated with the “rate monotonic analysis,” which helps you prove that your correctly-prioritized tasks and ISRs will find sufficient available CPU bandwidth between them during extreme busy workloads called “transient overload.”   It’s too bad most engineers don’t know how to use these tools.

I’ve written on the why and how of RMA before and recommend you start with my article Introduction to Rate-Monotonic Scheduling and then read my blog post 3 Things Every Programmer Should Know About RMA.

Toyota’s Embedded Software Image Problem

Friday, March 19th, 2010 Michael Barr

It remains unclear whether Toyota’s higher-than-industry-average number of complaints regarding sudden unintended acceleration (SUA) is caused (in whole or in part) by an embedded software problem. But whether it is or it isn’t actually firmware, the company has clearly denied it and yet still developed an embedded software “image problem”. They’ve brought some of this on themselves.

Side Note: I think it is a net positive that journalists, the mass media, and a broader swath of the general public are increasingly aware that there is software embedded inside cars, airplanes, medical devices, and just about everything else with a power supply or batteries. Firmware has been inside these products for many years, of course. But as I wrote in a recent article in Electronic Design, my experience working with companies across many industries lead me to believe there is a looming firmware quality crisis. Greater public awareness is sure to bring litigation. This will force engineering management to care more about firmware quality than they currently do.

Toyota’s Firmware Image Problem

Long before the “floor-mat recall” NHTSA had logged a higher number of unintended acceleration complaints (4.51 complaints per 100,000 cars sold for the 2005 to 2010 model years) for Toyota than any other company. (A recent Washington Post graphic has more data.) Apparently, NHTSA and Toyota were investigating the reports–but hadn’t yet taken any action.

It seems that what set that first Toyota recall in motion was a high-profile fatal August 2009 crash involving an off-duty California Highway Patrol office, his family, a runaway Lexus, and a disturbing 911 call,  Given the context of that specific crash, I’m not convinced the floor mat recall made much sense. In particular, I find it hard to believe that a police officer with adrenaline pumping through his veins and his family’s life on the line, wouldn’t just rip a stuck floor mat out of the way like the Incredible Hulk. (Or that he would choose running off the road at 125 mph vs. shutting the vehicle off entirely.)  But I don’t have all the facts about either that specific accident or the reasoning behind the floor mat recall.

The broader recalls that have happened since have focused on also adding mechanical strength to the accelerator pedals in a number of different makes and models. To this day, Toyota categorically denies any sort of electrical problem.  Yet some cars that have been modified in this way have since been reported to experience unintended acceleration!  Besides which, mechanical parts generally fail visibly or entirely once they first fail–rather than intermittently.  Intermittent failures are far more common with electronics (think EMI) and firmware.

Toyota’s firmware image problem stems from two things:  First, they have separately recalled the Prius for a braking-related firmware upgrade.  Other possible Prius software issues have been identified by Steve Wozniak and Jim Sikes, but these have not yet been confirmed.  Additionally, the continued reliance (by Toyota and NHTSA) on theories such as “we can’t reproduce the problem and we haven’t been able to see it during testing” as proof that there’s not a software bug is simply unbelievable.  

Anyone who works with software knows from experience that lots of bugs can’t be easily reproduced.  The fact that these incidents can’t be reproduced is not a proof of anything.

Software in Cars: The Future

Don’t get me wrong.  I want more software in my car not less.  I very much look forward to the day that an in-car computer takes over the driving for me.  After all, some cars already have more sensor data to make decisions on than the driver does.  Imagine what a car with an integrated GPS navigation system, auto-follow cruise control, and collision avoidance systems could do.  While I guess that I should move left one lane to avoid a crash, the computer is capable of seeing in all directions at once, calculating all of the trajectories of near-by cars, including instantaneous changes in their acceleration or deceleration.

Additionally, I suspect that even with bugs in a car’s drive-by-wire software the car may be much safer overall for its electronic traction control and anti-lock braking systems.

I just wish that Toyota would own up to the fact that the inability to reproduce a problem doesn’t rule out a software (or EMI) flaw.