Posts Tagged ‘firmware’

Firmware-Specific Bug #3: Missing Volatile Keyword

Thursday, February 18th, 2010 Michael Barr

Failure to tag certain types of variables with C’s ‘volatile’ keyword can cause a number of symptoms in a system that works properly only when the compiler’s optimizer is set to a low level or disabled. The volatile qualifier is used in variable declarations; its purpose is to prevent optimization of the reads and writes of that variable.

For example, if you write code that says:


    g_alarm = ALARM_ON;    // Patient dying--get nurse!
    // Other code, with no reads of g_alarm state.
    g_alarm = ALARM_OFF;   // Patient stable.

the optimizer will generally try to make your program both faster and smaller by eliminating the first line above, to the detriment of the patient. However, if g_alarm is declared volatile, this optimization will not take place.

Best Practice: The ‘volatile’ keyword should be used to declare any: (a) global variable shared by an ISR and any other code; (b) global variable accessed by two or more RTOS tasks (even when race conditions in those accesses have been prevented); (c) pointer to a memory-mapped peripheral register (or register set); or (d) delay loop counter.

Note that in addition to ensuring all reads and writes take place for a given variable, the use of volatile also constrains the compiler by adding additional “sequence points”. Accesses to multiple volatiles must be executed in the order they are written in the code.


Firmware-Specific Bug #2: Non-Reentrant Function

Monday, February 15th, 2010 Michael Barr

Technically, the problem of a non-reentrant function is a special case of the problem of a race condition.  For that reason, the run-time errors caused by a non-reentrant function are similar and also don’t occur in a reproducible way—making them just as hard to debug.  Unfortunately, a non-reentrant function is also more difficult to spot in a code review than other types of race conditions.

The figure below shows a typical scenario.  Here the software entities subject to preemption are RTOS tasks.  But rather than manipulating a shared object directly, they do so by way of function call indirection.  For example, suppose that Task A calls a sockets-layer protocol function, which calls a TCP-layer protocol function, which calls an IP-layer protocol function, which calls an Ethernet driver.  In order for the system to behave reliably, all of these functions must be reentrant.

But the functions of the driver module manipulate the same global object in the form of the registers of the Ethernet Controller chip.  If preemption is permitted during these register manipulations, Task B may preempt Task A after the Packet A data has been queued but before the transmit is begun.  Then Task B calls the sockets-layer function, which calls the TCP-layer function, which calls the IP-layer function, which calls the Ethernet driver, which queues and transmits Packet B.  When control of the CPU returns to Task A, it finally requests its transmission.  Depending on the design of the Ethernet controller chip, this may either retransmit Packet B or generate an error.  Either way, Packet A’s data is lost and does not go out onto the network.

In order for the functions of this Ethernet driver to be callable from multiple RTOS tasks (near-)simultaneously, those functions must be made reentrant.  If each function uses only stack variables, there is nothing to do; each RTOS task has its own private stack.  But drivers and some other functions will be non-reentrant unless carefully designed.

The key to making functions reentrant is to suspend preemption around all accesses of peripheral registers, global variables (including static local variables), persistent heap objects, and shared memory areas.  This can be done either by disabling one or more interrupts or by acquiring and releasing a mutex; the specifics of the type of shared data usually dictate the best solution.

Best Practice: Create and hide a mutex within each library or driver module that is not intrinsically reentrant.  Make acquisition of this mutex a pre-condition for the manipulation of any persistent data or shared registers used within the module as a whole.  For example, the same mutex may be used to prevent race conditions involving both the Ethernet controller registers and a global (or static local) packet counter.  All functions in the module that access this data must follow the protocol to acquire the mutex before manipulating these objects.

Beware that non-reentrant functions may come into your code base as part of third party middleware, legacy code, or device drivers.  Disturbingly, non-reentrant functions may even be part of the standard C or C++ library provided with your compiler.  For example, if you are using the GNU compiler to build RTOS-based applications, take note that you should be using the reentrant “newlib” standard C library rather than the default.


Firmware-Specific Bug #1: Race Condition

Thursday, February 11th, 2010 Michael Barr

A race condition is any situation in which the combined outcome of two or more threads of execution (which can be either RTOS tasks or main() plus an ISR) varies depending on the precise order in which the instructions of each are interleaved.

For example, suppose you have two threads of execution in which one regularly increments a global variable (g_counter += 1;) and the other occasionally resets it (g_counter = 0;). There is a race condition here if the increment cannot always be executed atomically (i.e., in a single instruction cycle). A collision between the two updates of the counter variable may never or only very rarely occur. But when it does, the counter will not actually be reset in memory; its value is henceforth corrupt. The effect of this may have serious consequences for the system, though perhaps not until a long time after the actual collision.

Best Practice: Race conditions can be prevented by surrounding the “critical sections” of code that must be executed atomically with an appropriate preemption-limiting pair of behaviors. To prevent a race condition involving an ISR, at least one interrupt signal must be disabled for the duration of the other code’s critical section. In the case of a race between RTOS tasks, the best practice is the creation of a mutex specific to that shared object, which each task must acquire before entering the critical section. Note that it is not a good idea to rely on the capabilities of a specific CPU to ensure atomicity, as that only prevents the race condition until a change of compiler or CPU.

Shared data and the random timing of preemption are culprits that cause the race condition. But the error might not always occur, making tracking down such bugs from symptoms to root causes incredibly difficult. It is, therefore, important to be ever-vigilant about protecting all shared objects.

Best Practice: Name all potentially shared objects—including global variables, heap objects, or peripheral registers and pointers to the same—in a way that the risk is immediately obvious to every future reader of the code. Netrino’s Embedded C Coding Standard advocates the use of a ‘g_’ prefix for this purpose.

Locating all potentially shared objects is the first step in a code audit for race conditions.


Rate Monotonic Analysis and Round Robin Scheduling

Friday, January 22nd, 2010 Michael Barr

Rate Monotonic Analysis (RMA) is a way of proving a priori via mathematics (rather than post-implementation via testing) that a set of tasks and interrupt service routines (ISRs) will always meet their deadlines–even under worst-case timing.  In this blog, I address the issue of what to do if two or more tasks or ISRs have equal priority and whether round robin scheduling is necessary in an RTOS to deal with that special case.

First a little background.  In order for the schedulability analysis portion of the RMA mathematics to provide meaningful results, the following assumptions must hold:

  1. Each task or ISR is periodic, or at least has a known worst-case interarrival time.
  2. The tasks are independent of one another; any blocking on shared resources is bounded and accounted for.
  3. Each deadline is equal to the corresponding period.
  4. Worst-case execution times are known, and context-switch overhead is negligible or folded into them.

Under RMA, the relative priorities are assigned according to a simple rule: “The more often a task or ISR runs (in the worst-case), the higher its priority.” Put another way, the task or ISR with the longest period between iterations (interarrival time, if you prefer) is least important. This is because an infrequent but high-priority task could cause a more frequent task to miss an entire iteration.

So what happens if you are using RMA to assign priorities and you wind up with two (or more) tasks or ISRs assigned equal priority? (Translation: they have the same worst-case interarrival times). Must they be assigned equal priority in the real system? What if the RTOS (in the case of tasks) or hardware (in the case of interrupts) doesn’t support round-robin scheduling–or even equal priorities with run-to-completion?

Interestingly, it turns out not to matter a bit whether you:

  1. Merge the two tasks into one (i.e., execute the code for Task A then Task B).
  2. Give them equal priority, either with round robin or run-to-completion behavior.
  3. Give them adjacent unequal priorities (in either relative order).

If you run through the timing diagrams for each of the above scenarios, you’ll see that all three are equivalent, except that equal priority with round robin potentially suffers a performance impact from the unnecessary additional context switches.

Firmware Wall of Shame: Kenmore Elite Electric Range

Monday, January 11th, 2010 Michael Barr

A couple of years back, my wife and I remodeled our kitchen. In the process, we replaced our oven and range with a Kenmore Elite slide-in unit, similar to the one in the picture below. My wife was pretty nervous about buying an oven with a display and a keyboard–because she understood that meant embedded software with its all-too-frequent crashes and upgrades. At the time, I assured her that oven controller firmware was the sort of thing anyone who could spell ‘C’ could write.

But now my day of reckoning has come. Our 3-year old oven isn’t working properly. It even failed my wife on Christmas Eve, as she prepared a meal for about 20 family and friends. (Praise be for a full tank of gas and a 3-burner outdoor grill!) But still I felt vindicated. Our oven problem was with the electronics, not the firmware, I assured her—as if that were some great thing in itself! The problem only occurred when the oven was hot. And a power-cycle didn’t cure it. We have learned that the buttons and display will work again, eventually, after the heat has dissipated.

Today the repairman is here. (I didn’t dare void the warranty by peeking at the electronics inside before he came.) “What error code does it give when it fails?” he wants to know. “F-1-?” I reported quickly. “We can’t read the last digit, because that’s a part of the display that doesn’t work when the oven fails in this way.” “Hmm,” he muttered, turning to his repair manual, “the fix for F10 is as different from the fix for F19 as for every error code in between.” “Can’t you hook up your laptop to the oven’s diagnostic serial port?” I wanted to know. “Nope,” he replied. “The display is the diagnostic port.”

Crap. My wife was right. Writing the embedded software for an oven controller is harder than I thought. The designers of the Kenmore Elite slide-in electric range’s firmware forgot to account for the fact that they had only one diagnostic port and that it itself might break. Or they knew it and, to reduce the BOM cost, cheated their customers (including us) out of a serial port we wouldn’t know we didn’t have until it was too late. Either way, shame on them.