Archive for the ‘Firmware Bugs’ Category

My Embedded Toolbox: Source Code Whitespace Cleanup

Monday, August 7th, 2017 Miro Samek

In this installment of my “Embedded Toolbox” series, I would like to share with you the free source code cleanup utility called QClean for cleaning whitespace in your source files, header files, makefiles, linker scripts, etc.

You probably wonder why you might need such a utility? In fact, the common thinking is that compilers (C, C++, etc.) ignore whitespace anyway, so why bother? But, as a professional software developer you should not ignore whitespace, because it can cause all sorts of problems, some of them illustrated in the figure below:


  1. Trailing whitespace after the last printable character in line can cause bugs. For example, trailing whitespace after the C/C++ macro-continuation character ‘\’ can confuse the C pre-processor and can result in a program error, as indicated by the bug icons.
  2. Similarly, inconsistent use of End-Of-Line (EOL) convention can cause bugs. For example, mixing the DOS EOL Convention (0x0D,0x0A) with Unix EOL Convention (0x0A) can confuse the C pre-processor and can result in a program error, as indicated by the bug icons.
  3. Varying amount of trailing whitespace at the end of the lines plus inconsistent use of tabs and spaces can cause unnecessary churn in the version control system (VCS) in source files that otherwise should be identical. Sure, many VCSs allow you to “ignore whitespace”, but are files differing in size by as much as 20% really identical?
  4. Inconsistent use of tabs and spaces can lead to different rendering of the source code by different editors and printers.

Note: The problems caused by whitespace in the source code are particularly insidious, because you don’t see the culprit. By using an automated whitespace cleanup utility you can save yourself hours of frustration and significantly improve your code quality.


 QClean Source Code Cleanup Utility

QClean is a simple and blazingly fast command-line utility to automatically clean whitespace in your source code. QClean is deployed as natively compiled executable and is located in the QTools Collection (in the sub-directory  qtools/bin ). QClean is also available in portable source code and can be adapted and re-compiled on all desktop platforms (Windows, POSIX –Linux, MacOS).

Using QClean

Typically, you invoke QClean from a command-line prompt without any parameters. In that case, QClean will cleanup white space in the current directory and recursively in all its sub-directories.

Note: If you have added the qtools/bin/ directory to your PATH environment variable (see Installing QTools), you can run qclean directly from your terminal window.


As you can see in the screen shot above, QClean processes the files and prints out the names of the cleaned up files. Also, you get information as to what has been cleaned, for example, “Trail-WS” means that trailing whitespace has been cleaned up. Other possibilities are: “CR” (cleaned up DOS/Windows (CR) end-of-lines), “LF” (cleaned up Unix (LF) end-of-lines), and “Tabs” (replaced Tabs with spaces).

QClean Command-Line Parameters

QClean takes the following command-line parameters:

[root-dir] . root directory to clean (relative or absolute)
-h help (show help message and exit)
-q query only (no cleanup when -q present)
-r check also read-only files
-l[limit] 80 line length limit (not checked when -l absent)

QClean Features

QClean fixes the following whitespace problems:

  • removing of all trailing whitespace (see figure above 1)
  • applying consistent End-Of-Line convention (either Unix (LF) or DOS (CRLF) see figure above 2)
  • replacing Tabs with spaces (untabify, see figure above 2)
  • optionally, scan the source code for long lines exceeding the specified limit (-l option, default 80 characters per line).

Long Lines

QClean can optionally check the code for long lines of code that exceed a specified limit (80 characters by default) to reduce the need to either wrap the long lines (which destroys indentation), or the need to scroll the text horizontally. (All GUI usability guidelines universally agree that horizontal scrolling of text is always a bad idea.) In practice, the source code is very often copied-and-pasted and then modified, rather than created from scratch. For this style of editing, it’s very advantageous to see simultaneously and side-by-side both the original and the modified copy. Also, differencing the code is a routinely performed action of any VCS (Version Control System) whenever you check-in or merge the code. Limiting the line length allows to use the horizontal screen real estate much more efficiently for side-by-side-oriented text windows instead of much less convenient and error-prone top-to-bottom differencing.

QClean File Types

QClean applies the following rules for cleaning the whitespace depending on the file types:

.c Unix (LF) remove remove check
.h Unix (LF) remove remove check
.cpp Unix (LF) remove remove check
.hpp Unix (LF) remove remove check
.s Unix (LF) remove remove check
.asm Unix (LF) remove remove check
.lnt Unix (LF) remove remove check
.txt DOS (CR,LF) remove remove don’t check
.md DOS (CR,LF) remove remove don’t check
.bat DOS (CR,LF) remove remove don’t check
.ld Unix (LF) remove remove check
.tcl Unix (LF) remove remove check
.py Unix (LF) remove remove check
.java Unix (LF) remove remove check
Makefile Unix (LF) remove leave check
.mak Unix (LF) remove leave check
.html Unix (LF) remove remove don’t check
.htm Unix (LF) remove remove don’t check
.php Unix (LF) remove remove don’t check
.dox Unix (LF) remove remove don’t check
.m Unix (LF) remove remove check

The cleanup rules specified in the table above can be easily customized by editing the array l_fileTypes in the qclean/source/main.c file. Also, you can change the Tab sizeby modifying the TAB_SIZE constant (currently set to 4) as well as the default line-limit by modifying the LINE_LIMIT constant (currently set to 80) at the top of the the qclean/source/main.c file. Of course, after any such modification, you need to re-build the QClean executable and copy it into the qtools/bin directory.

Note: For best code portability, QClean enforces the consistent use of the specified End-Of-Line convention (typically Unix (LF)), regardless of the native EOL of the platform. The DOS/Windows EOL convention (CR,LF) is typically not applied because it causes compilation problems on Unix-like systems (Specifically, the C preprocessor doesn’t correctly parse the multi-line macros.) On the other hand, most DOS/Windows compilers seem to tolerate the Unix EOL convention without problems.


QClean is very simple to use (no parameters are needed in most cases) and is fast (it can easily cleanup hundreds of files per second). All this is designed so that you can use QClean frequently. In fact, the use of QClean after editing your code should become part of your basic hygiene–like washing hands after going to the bathroom.

Are We Shooting Ourselves in the Foot with Stack Overflow?

Monday, February 17th, 2014 Miro Samek

In the latest Lesson #10 of my Embedded C Programming with ARM Cortex-M Video Course I explain what stack overflow is and I show what can transpire deep inside an embedded microcontroller when the stack pointer register (SP) goes out of bounds. You can watch the YouTube video to see the details, but basically when the stack overflows, memory beyond the stack bound gets corrupted and your code will eventually fail. If you are lucky, your program will crash quickly. If you are less lucky, however, your program will limp along in some crippled state, not quite dead but not fully functional either. Code like this can kill people.


Stack Overflow Implicated in the Toyota Unintended Acceleration Lawsuit

Unless you’ve been living under a rock for a past couple of years, you must have heard of the Toyota unintended acceleration (UA) cases, where Camry and other Toyota vehicles accelerated unexpectedly and some of them managed to kill people and all of them scared the hell out of their drivers.

The recent trial testimony delivered at the Oklahoma trial by an embedded guru Michael Barr for the fist time in history of these trials offers a glimpse into the Toyota throttle control software. In his deposition, Michael explains how a stack overflow could corrupt the critical variables of the operating system (OSEK in this case), because they were located in memory adjacent to the top of the stack. The following two slides from Michael’s testimony explain the memory layout around the stack and why stack overflow was likely in the Toyota code (see the complete set of Michael’s slides).

Toyota stack overflow explained

Barr's Slides explaining Toyota stack overflow (NOTE: the stack grows "up" in this picture, because "Top" represents a lower address in RAM than "Bottom". Traditionally, however, such a stack is called "descending".)


Why Were People Killed?

The crucial aspect in the failure scenario described by Michael is that the stack overflow did not cause an immediate system failure. In fact, an immediate system failure followed by a reset would have saved lives, because Michael explains that even at 60 Mph, a complete CPU reset would have occurred within just 11 feet of vehicle’s travel.

Instead, the problem was exactly that the system kept running after the stack overflow. But due to the memory corruption some tasks got “killed” (or “forgotten”) by the OSEK real-time operating system while other tasks were still running. This, in turn, caused the engine to run, but with the throttle “stuck” in the wide-open position, because the “kitchen-sink” TaskX, as Michael calls it, which controlled the throttle among many other things, was dead.

A Shot in the Foot

The data corruption caused by the stack overflow is completely self inflicted. I mean, we know exactly which way the stack grows on any given CPU architecture. For example, on the V850 CPU used in the Toyota engine control module (ECM) the stack grows towards the lower memory addresses, which is traditionally called a “descending stack” or a stack growing “down”. In this sense the stack is like a loaded gun that points either up or down in the RAM address space. Placing your foot (or your critical data for that matter) exactly at the muzzle of this gun doesn’t sound very smart, does it? In fact, doing so goes squarely against the very first NRA Gun Safety Rule: “ ALWAYS keep the gun pointed in a safe direction”.

A standard memory map, in which the stack grows towards your program data.

Yet, as illustrated in the Figure above, most traditional, beaten path memory layouts allocate the stack space above the data sections in RAM, even though the stack grows “down” (towards the lower memory addresses) in most embedded processors (see Table below ). This arrangement puts your program data in the path of destruction of a stack overflow. In other words, you violate the first  NRA Gun Safety Rule and you end up shooting yourself in the foot, as did Toyota.

Processor Architecture Stack growth direction
ARM Cortex-M down
AVR down
AVR32 down
ColdFire down
HC12 down
MSP430 down
PIC18 up
PIC24/dsPIC up
PIC32 (MIPS) down
PowerPC down
RL78 down
RX100/600 down
SH down
V850 down
x86 down

A Smarter Way

At this point, I hope it makes sense to suggest that you consider pointing the stack in a safe direction. For a CPU with the stack growing “down” this means that you should place the stack at the start of RAM, below all the data sections. As illustrated in the Figure below, that way you will make sure that a stack overflow can’t corrupt anything.

A safer memory map, where a stack overflow can't corrupt the data.

Of course, a simple reordering of sections in RAM does nothing to actually prevent a stack overflow, in the same way as pointing a gun to the ground does not prevent the gun from firing. Stack overflow prevention is an entirely different issue that requires a careful software design and a thorough stack usage analysis to size the stack adequately.

But the reordering of sections in RAM helps in two ways. First, you play safe by protecting the data from corruption by the stack. Second, on many systems you also get an instantaneous and free stack overflow detection in form of a hardware exception triggered in the CPU. For example, on ARM Cortex-M an attempt to read to or write from an address below the beginning of RAM causes the Hard Fault exception. Later in the article I will show how to design the exception handler to avoid shooting yourself in the foot again. But before I do this, let me first explain how to change the order of sections in RAM.

How to Change the Default Order of Sections in RAM

To change the order of sections in RAM (or ROM), you typically need to edit the linker script file. For example, in a linker script for the GNU toolchain (typically a file with the .ld extension), you just move the .stack section before the .data section. The following listing shows the order of sections in the GNU .ld file (I provide the complete GNU linker script files for the Tiva-C ARM Cortex-M4F microcontroller in the code downloads accompanying my earlier article):

    .isr_vector : {~~~} >ROM
    .text : {~~~} >ROM
    .preinit_array : {~~~} >ROM
    .init_array : {~~~} >ROM
    .fini_array : {~~~} >ROM
    _etext = .; /* end of code in ROM */

     /* start of RAM */
    .stack : {
        __stack_start__ = .; /* start of the stack section */
        . = . + STACK_SIZE;
        . = ALIGN(4);
        __stack_end__ = .;   /* end of the stack section */
    } >RAM   /* stack at the start of RAM */

    .data :  AT (_etext) {~~~} >RAM
    .bss : {~~~} > RAM
    .heap : {~~~} > RAM

On the other hand, a linker script for the IAR toolchain (typically a file with the .icf extension) requires a different strategy. For some reason simple reordering of sections does not do the trick and you need to replace the last line of the standard linker script:

place in RAM_region { readwrite, block CSTACK, block HEAP };

with the following two lines:

place at start of RAM_region {block CSTACK }; /* stack at the start of RAM */
place in RAM_region { readwrite, block HEAP  };

Please note that thus modified linker script remains compatible with the IAR linker configuration file editor built into the IAR EWARM IDE. (Again I provide the complete IAR linker script files for the Tiva-C ARM Cortex-M4F microcontroller in the code downloads accompanying this post.)


Designing an Exception Handler for Stack Overflow

As I mentioned earlier, an overflow of a descending stack placed at the start of RAM causes the Hard Fault exception on an ARM Cortex-M microcontroller. This is exactly what you want, because the exception handler provides you the last line of defense to perform damage control. However, you must be very careful how you write the exception handler, because your stack pointer (SP) is out of bounds at this point and any attempt to use the stack will fail and cause another Hard Fault exception. I hope you can see how this would lead to an endless cycle that would lock up the machine even before you had a chance to do any damage control. In other words, you must be careful here not to shoot yourself in the foot again.

So, you clearly can’t write the Hard Fault exception handler in standard C, because a standard C function most likely will access the stack. But, it is still possible to use non-standard extensions to C to get the job done. For example, the GNU compiler provides the __attribute__((naked)) extension, which indicates to the compiler that the specified function does not need prologue/epilogue sequences. Specifically, the GNU compiler will not save or restore any registers on the stack for a “naked” function. The following listing shows the definition of the HardFault_Handler() exception handler, whereas the name conforms to the Cortex Microcontroller Software Interface Standard (CMSIS):

extern unsigned __stack_start__;          /* defined in the GNU linker script */
extern unsigned __stack_end__;            /* defined in the GNU linker script */
__attribute__((naked)) void HardFault_Handler(void);
void HardFault_Handler(void) {
    __asm volatile (
        "    mov r0,sp\n\t"
        "    ldr r1,=__stack_start__\n\t"
        "    cmp r0,r1\n\t"
        "    bcs stack_ok\n\t"
        "    ldr r0,=__stack_end__\n\t"
        "    mov sp,r0\n\t"
        "    ldr r0,=str_overflow\n\t"
        "    mov r1,#1\n\t"
        "    b assert_failed\n\t"
        "    ldr r0,=str_hardfault\n\t"
        "    mov r1,#2\n\t"
        "    b assert_failed\n\t"
        "str_overflow:  .asciz \"StackOverflow\"\n\t"
        "str_hardfault: .asciz \"HardFault\"\n\t"

Please note how the __attribute__((naked)) extension is applied to the declaration of the HardFault_Handler() function. The function definition is written entirely in assembly. It starts with moving the SP register into R0 and tests whether it is in bound. A one-sided check against __stack_start__ is sufficient, because you know that the stack grows “down” in this case. If a stack overflow is detected, the SP is restored back to the original end of the stack section __stack_end__. At this point the stack pointer is repaired and you can call a standard C function. Here, I call the function assert_failed(), commonly used to handle failing assertions. assert_failed() can be a standard C function, but it should not return. Its job is to perform application-specific fail-safe shutdown and logging of the error followed typically by a system reset. The code downloads accompanying this article[6] provide an example of assert_failed() implementation in the board support package (BSP).

On a side note, I’d like to warn you against coding any exception handler as an endless loop, which is another beaten path approach taken in most startup code examples provided by microcontroller vendors. Such code locks up the machine, which might be useful during debugging, but is almost never what you want in the production code. Unfortunately, all too often I see developers shooting themselves in the foot yet again by leaving this dangerous code in the final product.

For completeness, I want to mention how to implement HardFault_Handler() exception handler in the IAR toolset. The non-standard extended keyword you can use here is __stackless, which means exactly that the IAR compiler should not use the stack in the designated function. The IAR version can also use the IAR intrinsic functions __get_SP() and __set_SP() to get and set the stack pointer, respectively, instead of inline assembly:

extern int CSTACK$$Base;            /* symbol created by the IAR linker */
extern int CSTACK$$Limit;           /* symbol created by the IAR linker */

__stackless void HardFault_Handler(void) {
    unsigned old_sp = __get_SP();

    if (old_sp < (unsigned)&CSTACK$$Base) {          /* stack overflow? */
        __set_SP((unsigned)&CSTACK$$Limit);    /* initial stack pointer */
        assert_failed("StackOverflow", old_sp); /* should never return! */
    else {
        assert_failed("HardFault", __LINE__);   /* should never return! */

What About an RTOS?

The technique of placing the stack at the start of RAM is not going to work if you use an RTOS kernel that requires a separate stack for every task. In this case, you simply cannot align all these multiple stacks at the single address in RAM. But even for multiple stacks, I would recommend taking a minute to think about the safest placement of the stacks in RAM as opposed to allocating the stacks statically inside the code and leaving it completely up to the linker to place the stacks somewhere in the .bss section.

Finally, I would like to point out that preemptive multitasking is also possible with a single-stack kernel, for which the simple technique of aligning the stack at the start of RAM works very well. Contrary to many misconceptions, single-stack preemptive kernels are quite popular. For example, the so called basic tasks of the OSEK-VDX standard all nest on a single stack, and therefore Toyota had to deal with only one stack (see Barr’s slides at the beginning). For more information about single-stack preemptive kernels, please refer to my article “Build a Super-Simple Tasker”.

Test it!

The most important strategy to deal with rare, but catastrophic faults, such as stack overflow is that you need to carefully design and actually test your system’s response to such faults. However, typically you cannot just wait for a rare fault to happen by itself. Instead, you need to use a technique called scientifically fault injection, which simply means that you need to intentionally cause the fault (you need to fire the gun!). In case of stack overflow you have several options: you might intentionally reduce the size of the stack section so that it is too small. You can also use the debugger to change the SP register manually. From there, I recommend that you single -step through the code in the debugger and verify that the system behaves as you intended. Chances are that you might be shooting yourself in the foot, just as it happened to Toyota.

I would be very interested to hear what you find out. Is your stack placed above the data section? Are your exception handlers coded as endless loops? Please leave a comment!

Cutting Through the Confusion with ARM Cortex-M Interrupt Priorities

Saturday, February 1st, 2014 Miro Samek

The insanely popular ARM Cortex-M processor offers very versatile interrupt priority management, but unfortunately, the multiple priority numbering conventions used in managing the interrupt priorities are often counter-intuitive, inconsistent, and confusing, which can lead to bugs. In this post I attempt to explain the subject and cut through the confusion.

The Inverse Relationship Between Priority Numbers and Urgency of the Interrupts

The most important fact to know is that ARM Cortex-M uses the “reversed” priority numbering scheme for interrupts, where priority zero corresponds to the highest urgency interrupt and higher numerical values of priority correspond to lower urgency. This numbering scheme poses a constant threat of confusion, because any use of the terms “higher priority” or “lower priority” immediately requires clarification, whether they represent the numerical value of priority, or perhaps, the urgency of an interrupt.

NOTE: To avoid this confusion, in the rest of this post, the term “priority” means the numerical value of interrupt priority in the ARM Cortex-M convention. The term “urgency” means the capability of an interrupt to preempt other interrupts. A higher-urgency interrupt (lower priority number) can preempt a lower-urgency interrupt (higher priority number).

 Interrupt Priority Configuration Registers in the NVIC

The number of priority levels in the ARM Cortex-M core is configurable, meaning that various silicon vendors can implement different number of priority bits in their chips. However, there is a minimum number of interrupt priority bits that need to be implemented, which is 2 bits in ARM Cortex-M0/M0+ and 3 bits in ARM Cortex-M3/M4.

But here again, the most confusing fact is that the priority bits are implemented in the most-significant bits of the priority configuration registers in the NVIC (Nested Vectored Interrupt Controller). The following figure illustrates the bit assignment in a priority configuration register for 3-bit implementation (part A), such as TI Tiva MCUs, and 4-bit implementation (part B), such as the NXP LPC17xx ARM Cortex-M3 MCUs.

 Interrupt priory registers with 3 bits of priority (A), and 4 bits of priority (B)

Interrupt priory registers with 3 bits of priority (A), and 4 bits of priority (B)


The relevance of the bit representation in the NVIC priority register is that this creates another priority numbering scheme, in which the numerical value of the priority is shifted to the left by the number of unimplemented priority bits. If you ever write directly to the priority registers in the NVIC, you must remember to use this convention.

NOTE: The interrupt priorities don’t need to be uniquely assigned, so it is perfectly legal to assign the same interrupt priority to many interrupts in the system. That means that your application can service many more interrupts than the number of interrupt priority levels.

NOTE: Out of reset, all interrupts and exceptions with configurable priority have the same default priority of zero. This priority number represents the highest-possible interrupt urgency.

Interrupt Priority Numbering in the CMSIS

The Cortex Microcontroller Software Interface Standard (CMSIS) provided by ARM Ltd. is the recommended way of programming Cortex-M microcontrollers in a portable way. The CMSIS standard provides the function NVIC_SetPriority(IRQn, priority) for setting the interrupts priorities.

However, it is very important to note that the ‘priority‘ argument of this function must not be shifted by the number of unimplemented bits, because the function performs the shifting by (8 – __NVIC_PRIO_BITS) internally, before writing the value to the appropriate priority configuration register in the NVIC. The number of implemented priority bits __NVIC_PRIO_BITS is defined in CMSIS for each ARM Cortex-M device.

For example, calling NVIC_SetPriority(7, 6) will set the priority configuration register corresponding to IRQ#7 to 1100,0000 binary on ARM Cortex-M with 3-bits of interrupt priority and it will set the same register to 0110,0000 binary on ARM Cortex-M with 4-bits of priority.

NOTE: The confusion about the priority numbering scheme used in the NVIC_SetPriority() is further promulgated by various code examples on the Internet and even in reputable books. For example the book “The Definitive Guide to ARM Cortex-M3, Second Edition”, ISBN 979-0-12-382091-4, Section 8.3 on page 138 includes a call NVIC_SetPriority(7, 0xC0) with the intent to set priority of IR#7 to 6. This call is incorrect and at least in CMSIS version 3.x will set the priority of IR#7 to zero.

Preempt Priority and Subpriority

The interrupt priority registers for each interrupt is further divided into two parts. The upper part (most-significant bits) is the preempt priority, and the lower part (least-significant bits) is the subpriority. The number of bits in each part of the priority registers is configurable via the Application Interrupt and Reset Control Register (AIRC, at address 0xE000ED0C).

The preempt priority level defines whether an interrupt can be serviced when the processor is already running another interrupt handler. In other words, preempt priority determines if one interrupt can preempt another.

The subpriority level value is used only when two exceptions with the same preempt priority level are pending (because interrupts are disabled, for example). When the interrupts are re-enabled, the exception with the lower subpriority (higher urgency) will be handled first.

In most applications, I would highly recommended to assign all the interrupt priority bits to the preempt priority group, leaving no priority bits as subpriority bits, which is the default setting out of reset. Any other configuration complicates the otherwise direct relationship between the interrupt priority number and interrupt urgency.

NOTE: Some third-party code libraries (e.g., the STM32 driver library) change the priority grouping configuration to non-standard. Therefore, it is highly recommended to explicitly re-set the priority grouping to the default by calling the CMSIS function NVIC_SetPriorityGrouping(0U) after initializing such external libraries.

Disabling Interrupts with PRIMASK and BASEPRI Registers

Often in real-time embedded programming it is necessary to perform certain operations atomically to prevent data corruption.  The simplest way to achieve the atomicity is to briefly disable and re-enabe interrupts.

The ARM Cortex-M offers two methods of disabling and re-enabling interrupts. The simplest method is to set and clear the interrupt bit in the PRIMASK register. Specifically, disabling interrupts can be achieved with the “CPSID i” instruction and enabling interrupts with the “CPSIE i” instruction. This method is simple and fast, but it disables all interrupt levels indiscriminately. This is the only method available in the ARMv6-M architecture (Cortex-M0/M0+).

However, the more advanced ARMv7-M (Cortex-M3/M4/M4F) provides additionally the BASEPRI special register, which allows you to disable interrupts more selectively. Specifically, you can disable interrupts only with urgency lower than a certain level and leave the higher-urgency interrupts not disabled at all. (This feature is sometimes called “zero interrupt latency”.)

The CMSIS provides the function __set_BASEPRI(priority) for changing the value of the BASEPRI register. The function uses the hardware convention for the ‘priority’ argument, which means that the priority must be shifted left by the number of unimplemented bits (8 – __NVIC_PRIO_BITS).

NOTE: The priority numbering convention used in __set_BASEPRI(priority) is thus different than in the NVIC_SetPriority(priority) function, which expects the “priority” argument not shifted.

For example, if you want to selectively block interrupts with priority number higher or equal to 6, you could use the following code:

// code before critical section
__set_BASEPRI(6 << (8 - __NVIC_PRIO_BITS));
// critical section
__set_BASEPRI(0U); // remove the BASEPRI masking
// code after critical section


What’s the state of your Cortex?

Monday, September 26th, 2011 Miro Samek

Recently, I’ve been involved in a fascinating bug hunt related to a very peculiar behavior of the ARM Cortex-M3 core. Given the incredible popularity of this core, I thought that digging a little deeper into the mysteries of ARM Cortex could be interesting and informative.

First, I need to provide some background. So, the bug was related to the very unique ARM Cortex-M exception type called PendSV. This is an exception triggered by software, but unlike any regular software interrupt, PendSV is an asynchronous exception. This means that PendSV typically does not run immediately after it is triggered, but only after the Nested Vectored Interrupt Controller (NVIC) determines that the priority of the currently executing code drops below the priority associated with PendSV.

At this point, you might wonder, why and where would such “Pended Software Interrupt” be useful? Well, it turns out that PendSV is the only reliable way on ARM Cortex-M to find out when all (possibly nested) interrupt service routines (ISRs) have completed. And this determination is essential to run the scheduler in any preemptive real time kernel.

Virtually all preemptive RTOSes for ARM Cortex-M processors work as follows. Upon initialization the priority associated with PendSV is set to be the lowest of all exceptions (0xFF). All ISRs in the system, prioritized above PendSV, trigger the PendSV exception by writing 1 to the PENDSVSET bit in the NVIC ICSR register, like this:

*((uint32_t volatile *)0xE000ED04) = 0x10000000;

Now, the heavy lifting is left entirely to the NVIC hardware. NVIC will activate PendSV only after the last of all nested interrupts completes and is about to return to the preempted task context. This is exactly the right time for a context switch. In other words, the PendSV exception is designed to call the scheduler and perform the task preemption. ARM Cortex is so smart that it eliminates the overhead of exiting one exception (the last nested interrupt) and activating another (the PendSV) in the trick called “tail-chaining”.

Everything looks easy so far, but ARM Cortex has one more trick up it’s sleeve and this optimization, called “late-arrival”, has interesting side effects related to PendSV. This subtle interaction between PendSV and late-arrival leads essentially to a hardware race condition I’ve recently had a pleasure to chase down.

To illustrate the events that lead up to the bug, I’ve prepared a distilled hardware trace available for viewing at ARM-Cortex-M3_bug.txt. Please go ahead and click on this link to follow along.

The trace starts with an interrupt entry (labelled as Exception 83). This system runs under the preemptive kernel called QK, so the ISR calls QK_ISR_ENTRY() and later QK_ISR_EXIT() macros to inform the kernel about the interrupt. At trace index 069545 the QK_ISR_EXIT() macro triggers the PendSV exception by writing 0x10000000 into the ICSR register.

After this, the Exception 83 runs to completion and eventually tail-chains to Exception 14 (PendSV). This is all as expected.

However, the real problem starts at trace index 069618, at which the execution of the first instruction of PendSV (CPSID i) is cancelled due to arrival of a higher-priority Exception 36 (another interrupt).

This cancellation of low-priority Exception 14 in favor of the higher-priority Exception 36 is the ARM Cortex-special called late arrival. The ARM core optimizes the interrupt entry (which is identical for all exception), and instead of entering the low-priority exception and than immediately high-priority exception, it simply enters the high-priority exception.

The problem is that just before the late arrival, the PENDSVSET bit in the NVIC-ICSR register is already cleared.

However, the late-arriving Exception 36 sets this bit again in QK_ISR_EXIT(), which is normal for any interrupt (trace index 070126).

The Exception 36 eventually exits to the original PendSV (trace index 070130), but this is not the usual tail-chaining (the trace indicates tail-chaining by the pair Exception Exit/Exception Entry). This time around the trace shows only Exception Exit, but no entry.

This difference has very important implication, which is that the PENDSVSET bit in the NVIC-ICSR register is not cleared (remember that it is set, however).

What unfolds next is the consequence of the PENDSVSET bit being set. PendSV executes, fakes its own return to the QK scheduler, and eventually it unlocks interrupts. But before SVCall (Exception 11) can execute, the PendSV Exception 14 is taken again (because it is triggered by the PENDSVSET bit). This makes no sense and should never happen, because PendSV should never be in the triggered state at this point.

So, what are the consequences of this behavior and what is the fix?

Well, as you can see, due to late-arrival PendSV can be occasionally entered with the PENDSVSET bit being set, so it will be triggered again immediately after it completes. This might or might not have adverse consequences. In case of the QK kernel, this was unacceptable and led to a Hardware Fault. In other RTOSes it might simply cause another scheduler call, waste of some CPU, and delay of the task-level response, but perhaps not a catastrophic failure.

The actual fix of the problem is very simple. Since you cannot rely on the automatic clearing of the PENDSVSET bit in the NVIC-ICSR register, you need to clear it manually (by writing 1 to the PENDSVCLR bit in the NVIC-ICSR register.) Of course this is wasteful, because only one time in a million this bit is actually not cleared automatically.

Interestingly, I have not seen such writing to the PENDSVCLR bit in open source RTOSes for ARM Cortex-M (such as Recently, I’ve come across some posts to the ARM Community Forums that this problem exists for the Frescale MQX RTOS (see PendSV pending inside PendSV handler? (Cortex-M4)).

If you use a preemptive kernel on ARM Cortex-M0/M3, perhaps you could check how your kernel handles PendSV. If you don’t see an explicit write to the PENDSVCLR bit, I would recommend that you think through the consequences of re-entering PendSV. I’d be very interested to collect a survey of how the existing kernels for ARM Cortex-M handle this situation.

I hate RTOSes

Monday, April 12th, 2010 Miro Samek

I have to confess that I’ve been experiencing a severe writer’s block lately. It’s not that I’m short of subjects to talk about, but I’m getting tired of circling around the most important issues that matter to me most and should matter the most to any embedded software developer. I mean the basic software structure.

Unfortunately, I find it impossible to talk about truly important issues without stepping on somebody’s toes, which means picking a fight. So, in this installment I decided to come out of the closet and say it openly: I hate RTOSes, because they are a ticking bomb.

The main reason I say so is because a conventional RTOS implies a certain programming paradigm, which leads to particularly brittle designs. I’m talking about blocking. Blocking occurs any time you wait explicitly in-line for something to happen. All RTOSes provide an assortment of blocking mechanisms, such as various semaphores, event-flags, mailboxes, message queues, and so on. Every RTOS task, structured as an endless loop, must use at least one such blocking mechanism, or else it will take all the CPU cycles. Typically, however, tasks block in many places scattered throughout various functions called from the task routine (the endless loop). For example, a task can block and wait for a semaphore that indicates end of an ADC conversion. In other part of the code, the same task might wait for a timeout event flag, and so on.

Blocking is insidious, because it appears to work initially, but quickly degenerates into a unmanageable mess. The problem is that while a task is blocked, the task is not doing any other work and is not responsive to other events. Such task cannot be easily extended to handle other events, not just because the system is unresponsive, but also due to the fact the the whole structure of the code past the blocking call is designed to handle only the event that it was explicitly waiting for.

You might think that difficulty of adding new features (events and behaviors) to such designs is only important later, when the original software is maintained or reused for the next similar project. I disagree. Flexibility is vital from day one. Any application of nontrivial complexity is developed over time by gradually adding new events and behaviors. The inflexibility prevents an application to grow that way, so the design degenerates in the process known as architectural decay. This in turn makes it often impossible to even finish the original application, let alone maintain it.

The mechanisms of architectural decay of RTOS-based applications are manifold, but perhaps the worst is unnecessary proliferation of tasks. Designers, unable to add new events to unresponsive tasks are forced to create new tasks, regardless of coupling and cohesion. Often the new feature uses the same data as other feature in another tasks (we call such features cohesive). But placing the new feature in a different task requires very careful sharing of the common data. So mutexes and other such mechanisms must be applied. The designer ends up spending most of the time not on the feature at hand, but on managing subtle, hairy, unintended side-effects.

For decades embedded engineers were taught to believe that the only two alternatives for structuring embedded software are a “superloop” (main+ISRs) or an RTOS. But this is of course not true. Other alternatives exist, specifically event-driven programming with modern state machines is a much better way. It is not a silver bullet, of course, but after having used this method extensively for over a decade I will never go back to a raw RTOS. I plan to write more about this better way, why it is better and where it is still weak. Stay tuned.