
Firmware-Specific Bug #1: Race Condition

Thursday, February 11th, 2010 by Michael Barr

A race condition is any situation in which the combined outcome of two or more threads of execution (which can be either RTOS tasks or main() plus an ISR) varies depending on the precise order in which the instructions of each are interleaved.

For example, suppose you have two threads of execution in which one regularly increments a global variable (g_counter += 1;) and the other occasionally resets it (g_counter = 0;). There is a race condition here if the increment cannot always be executed atomically (i.e., as a single uninterruptible operation). On many processors the increment is a read-modify-write sequence: load the value, add one, store the result. A collision between the two updates of the counter variable may never occur, or may occur only very rarely. But when the reset lands between the load and the store, the store writes back the stale value plus one, so the counter is not actually reset in memory and its value is henceforth corrupt. This may have serious consequences for the system, though perhaps not until long after the actual collision.
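The collision described above can be replayed deterministically on a host machine. The sketch below is only an illustration: it assumes the increment compiles to a separate load and store, and the helper names (increment_load, increment_store, replay_lost_reset) are hypothetical, introduced here so the unlucky interleaving can be stepped through by hand.

```c
#include <stdint.h>

/* Shared counter; the g_ prefix marks it as global. */
static volatile uint32_t g_counter;

/*
 * Assumption: "g_counter += 1;" compiles to a load, an add, and a store.
 * These hypothetical helpers split the increment into its two memory
 * accesses so a preemption point can be placed between them.
 */
static uint32_t increment_load(void)
{
    return g_counter;                 /* read the current value */
}

static void increment_store(uint32_t reg)
{
    g_counter = reg + 1u;             /* write back old value + 1 */
}

/* Replays: increment starts, reset preempts it, increment finishes. */
uint32_t replay_lost_reset(uint32_t start_value)
{
    g_counter = start_value;

    uint32_t reg = increment_load();  /* thread A: load the old value   */
    g_counter = 0u;                   /* thread B preempts: reset       */
    increment_store(reg);             /* thread A resumes: stores old+1 */

    return g_counter;                 /* the reset has been overwritten */
}
```

Calling replay_lost_reset(41u) returns 42: the zero written by the reset is overwritten by the stale value plus one, which is exactly the "not actually reset" outcome.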

Best Practice: Race conditions can be prevented by surrounding the “critical sections” of code that must be executed atomically with an appropriate preemption-limiting pair of behaviors. To prevent a race condition involving an ISR, at least one interrupt signal must be disabled for the duration of the other code’s critical section. In the case of a race between RTOS tasks, the best practice is the creation of a mutex specific to that shared object, which each task must acquire before entering the critical section. Note that it is not a good idea to rely on the capabilities of a specific CPU to ensure atomicity, as that only prevents the race condition until a change of compiler or CPU.
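For the ISR case, the critical section might look like the following host-side sketch. Everything here is an assumption made for illustration: irq_save_and_disable() and irq_restore() stand in for whatever interrupt-masking primitives a real port provides, and the s_irq_enabled and s_irq_pending flags simulate the CPU's interrupt-enable bit and a latched interrupt request. It assumes main() does the increment and the ISR does the reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static volatile uint32_t g_counter;

/* Host-side stand-ins for the interrupt-enable bit and a latched request. */
static bool s_irq_enabled = true;
static bool s_irq_pending = false;

/* The ISR in this sketch: occasionally resets the counter. */
static void reset_isr(void)
{
    g_counter = 0u;
}

/* Simulates the hardware raising the interrupt. */
static void raise_reset_irq(void)
{
    if (s_irq_enabled) {
        reset_isr();                  /* taken immediately */
    } else {
        s_irq_pending = true;         /* held until interrupts are restored */
    }
}

/* Hypothetical port-layer primitives; a real target supplies its own. */
static bool irq_save_and_disable(void)
{
    bool was_enabled = s_irq_enabled;
    s_irq_enabled = false;
    return was_enabled;
}

static void irq_restore(bool was_enabled)
{
    s_irq_enabled = was_enabled;
    if (s_irq_enabled && s_irq_pending) {
        s_irq_pending = false;
        reset_isr();                  /* the deferred interrupt runs now */
    }
}

/* Replays an interrupt arriving in the middle of a protected increment. */
uint32_t replay_protected_increment(uint32_t start_value)
{
    g_counter = start_value;

    bool was_enabled = irq_save_and_disable();  /* enter critical section */
    uint32_t reg = g_counter;                   /* load                   */
    raise_reset_irq();                          /* request arrives here   */
    assert(!s_irq_enabled);                     /* still masked           */
    g_counter = reg + 1u;                       /* store                  */
    irq_restore(was_enabled);                   /* leave critical section */

    return g_counter;
}
```

With the critical section in place, replay_protected_increment(41u) returns 0: the reset is deferred until the increment has finished, so it is no longer lost. Saving and restoring the previous interrupt state, rather than unconditionally re-enabling, keeps the pair safe to nest. For a race between RTOS tasks the shape is the same, with the mutex acquire and release calls of the RTOS in place of the two interrupt primitives.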

Shared data and the random timing of preemption are the culprits behind every race condition. But because the error does not occur on every run, tracing such bugs from symptoms back to root causes is incredibly difficult. It is, therefore, important to be ever-vigilant about protecting all shared objects.

Best Practice: Name all potentially shared objects (including global variables, heap objects, or peripheral registers and pointers to the same) in such a way that the risk is immediately obvious to every future reader of the code. Netrino’s Embedded C Coding Standard advocates the use of a ‘g_’ prefix for this purpose.

Locating all potentially shared objects is the first step in a code audit for race conditions.



10 Responses to “Firmware-Specific Bug #1: Race Condition”

  1. J. Scott says:

    Your definition of a race condition is too narrow in scope and doesn't address race conditions caused by an unexpected sequence of inputs.

  2. Anonymous says:

    Best practice: don't share data, encapsulate it in one thread only and provide a message-based interface to it.

  3. Anonymous says:

    Might mention deadlock as a possible result of all those mutexes. Can be almost as hard to fix as the problem they are introduced to solve.

  4. Anonymous says:

    A common race condition I have seen involves communication between an ISR and main() or a task function by use of a global flag variable. The ISR sets the flag and the other code tests and then clears it. Too many developers either forget to disable the interrupt in main() or think the clear will always be atomic. Embedded programmers should never rely on the specific timings of instructions of their current processor.

  5. Anonymous says:

    As both a hardware designer and a software designer I once had to debug a race condition in a corner case in the hardware. The target system was running in a mode where the control processor was running at a very high clock speed and the peripheral at a low clock speed (clock ratios were of course programmable).

    First, the processor wrote to the peripheral instructing it to start. This command was buffered in the synchronization logic between the clock domains. Then, the processor read the status flag. But the read signal was not synchronized. So the processor thought the peripheral was already finished (actually, it hadn't even started), causing the loss of the buffer of data the peripheral should have been working on.

  6. Ram C. says:

    Great article on one of the trickiest classes of problems for embedded engineers. Michael, how about a follow-up article on avoiding (by using good system design practices) or detecting and resolving deadlocks in code?

  7. Ram C. says:

    Now, if one were to use a non-blocking, event-driven framework instead of a conventional OS/RTOS, would it not make life easier?

    • Miro Samek says:

      That’s an interesting observation. An event-driven framework can reduce (even eliminate) the need for sharing resources and replace it with sharing events that are managed by the framework in a thread-safe manner. Without sharing resources, race conditions won’t happen. However, the event-driven paradigm is perhaps more vulnerable to deadlocks than the traditional sequential programming style. If an event is lost, or someone simply forgets to post it, the application might freeze. I plan to write about race conditions and such deadlocks in my “state space” blog. Stay tuned…

  8. IT student says:

    A race condition could be the cause of Toyota’s unintended acceleration problems. The problem may be related to software-controlled mutual exclusion as opposed to more expensive hardware-controlled mutual exclusion. NASA had this problem in the late 70’s.

  9. […] software control flow.” (Appendix A, p. 11) NASA also spent time simulating possible race conditions due to worrisome “recursively nested interrupt masking” (pp. 44-46); note, though, that […]
