Archive for the ‘Efficient C/C++’ Category

Fast, Deterministic, and Portable Counting Leading Zeros

Monday, September 8th, 2014 Miro Samek

Counting leading zeros in an integer number is a critical operation in many DSP algorithms, such as normalization of samples in sound or video processing, as well as in real-time schedulers to quickly find the highest-priority task ready-to-run.

In most such algorithms, it is important that the count-leading zeros operation be fast and deterministic. For this reason, many modern processors provide the CLZ (count-leading zeros) instruction, sometimes also called LZCNT, BSF (bit scan forward), FF1L (find first one-bit from left) or FBCL (find bit change from left).

Of course, if your processor supports CLZ or equivalent in hardware, you definitely should take advantage of it. In C you can often use a built-in function provided by the embedded compiler. A couple of examples below illustrate the calls for various CPUs and compilers:


y = __CLZ(x);          // ARM Cortex-M3/M4, IAR compiler (CMSIS standard)
y = __clz(x);          // ARM Cortex-M3/M4, ARM-KEIL compiler
y = __builtin_clz(x);  // ARM Cortex-M3/M4, GNU-ARM compiler
y = _clz(x);           // PIC32 (MIPS), XC32 compiler
y = __builtin_fbcl(x); // PIC24/dsPIC, XC16 compiler

However, what if your CPU does not provide the CLZ instruction? For example, ARM Cortex-M0 and M0+ cores do not support it. In this case, you need to implement CLZ() in software (typically an inline function or as a macro).

The Internet offers a lot of various algorithms for counting leading zeros and the closely related binary logarithm (log-base-2(x) = 32 – 1 – clz(x)). Here is a sample list of the most popular search results:

http://en.wikipedia.org/wiki/Find_first_set
http://aggregate.org/MAGIC/#Leading Zero Count
http://www.hackersdelight.org/hdcodetxt/nlz.c.txt (from Hacker’s Delight (2nd Edition), Section 5-3 “Counting Leading 0’s”)
http://www.codingforspeed.com/counting-the-number-of-leading-zeros-for-a-32-bit-integer-signed-or-unsigned/
http://stackoverflow.com/questions/9041837/find-the-index-of-the-highest-bit-set-of-a-32-bit-number-without-loops-obviously
http://stackoverflow.com/questions/671815/what-is-the-fastest-most-efficient-way-to-find-the-highest-set-bit-msb-in-an-i

But, unfortunately, most of the published algorithms are either incomplete, sub-optimal, or both. So, I thought it could be useful to post here a complete and, as I believe, optimal CLZ(x) function, which is both deterministic and outperforms most of the published implementations, including all of the “Hacker’s Delight” algorithms.

Here is the first version:

static inline uint32_t CLZ1(uint32_t x) {
    static uint8_t const clz_lkup[] = {
        32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U,
        28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U,
        27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U,
        27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U
    };
    uint32_t n;
    if (x >= (1U << 16)) {
        if (x >= (1U << 24)) {
            n = 24U;
        }
        else {
            n = 16U;
        }
    }
    else {
        if (x >= (1U << 8)) {
            n = 8U;
        }
        else {
            n = 0U;
        }
    }
    return (uint32_t)clz_lkup[x >> n] - n;
}

This algorithm uses a hybrid approach of bi-section to find out which 8-bit chunk of the 32-bit number contains the first 1-bit, which is followed by a lookup table clz_lkup[] to find the first 1-bit within the byte.

The CLZ1() function is deterministic in that it completes always in 13 instructions, when compiled with IAR EWARM compiler for ARM Cortex-M0 core with the highest level of optimization.

The CLZ1() implementation takes about 40 bytes for code plus 256 bytes of constant lookup table in ROM. Altogether, the algorithm uses some 300 bytes of ROM.

If the ROM footprint is too high for your application, at the cost of running the bi-section for one more step, you can reduce the size of the lookup table to only 16 bytes. Here is the CLZ2() function that illustrates this tradeoff:

static inline uint32_t CLZ2(uint32_t x) {
    static uint8_t const clz_lkup[] = {
        32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U,
        28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U
    };
    uint32_t n;

    if (x >= (1U << 16)) {
        if (x >= (1U << 24)) {
            if (x >= (1 << 28)) {
                n = 28U;
            }
            else {
                n = 24U;
            }
        }
        else {
            if (x >= (1U << 20)) {
                n = 20U;
            }
            else {
                n = 16U;
            }
        }
    }
    else {
        if (x >= (1U << 8)) {
            if (x >= (1U << 12)) {
                n = 12U;
            }
            else {
                n = 8U;
            }
        }
        else {
            if (x >= (1U << 4)) {
                n = 4U;
            }
            else {
                n = 0U;
            }
        }
    }
    return (uint32_t)clz_lkup[x >> n] - n;
}

The CLZ2() function completes always in 17 instructions, when compiled with IAR EWARM compiler for ARM Cortex-M0 core.

The CLZ2() implementation takes about 80 bytes for code plus 16 bytes of constant lookup table in ROM. Altogether, the algorithm uses some 100 bytes of ROM.

I wonder if you can beat the CLZ1() and CLZ2() implementations. If so, please post it in the comment. I would be really interested to find an even better way.

NOTE: In case you wish to use the published code in your projects, the code is released under the “Do What The F*ck You Want To Public License” (WTFPL).

 

Are We Shooting Ourselves in the Foot with Stack Overflow?

Monday, February 17th, 2014 Miro Samek

In the latest Lesson #10 of my Embedded C Programming with ARM Cortex-M Video Course I explain what stack overflow is and I show what can transpire deep inside an embedded microcontroller when the stack pointer register (SP) goes out of bounds. You can watch the YouTube video to see the details, but basically when the stack overflows, memory beyond the stack bound gets corrupted and your code will eventually fail. If you are lucky, your program will crash quickly. If you are less lucky, however, your program will limp along in some crippled state, not quite dead but not fully functional either. Code like this can kill people.

 

Stack Overflow Implicated in the Toyota Unintended Acceleration Lawsuit

Unless you’ve been living under a rock for a past couple of years, you must have heard of the Toyota unintended acceleration (UA) cases, where Camry and other Toyota vehicles accelerated unexpectedly and some of them managed to kill people and all of them scared the hell out of their drivers.

The recent trial testimony delivered at the Oklahoma trial by an embedded guru Michael Barr for the fist time in history of these trials offers a glimpse into the Toyota throttle control software. In his deposition, Michael explains how a stack overflow could corrupt the critical variables of the operating system (OSEK in this case), because they were located in memory adjacent to the top of the stack. The following two slides from Michael’s testimony explain the memory layout around the stack and why stack overflow was likely in the Toyota code (see the complete set of Michael’s slides).

Toyota stack overflow explained

Barr's Slides explaining Toyota stack overflow (NOTE: the stack grows "up" in this picture, because "Top" represents a lower address in RAM than "Bottom". Traditionally, however, such a stack is called "descending".)

 

Why Were People Killed?

The crucial aspect in the failure scenario described by Michael is that the stack overflow did not cause an immediate system failure. In fact, an immediate system failure followed by a reset would have saved lives, because Michael explains that even at 60 Mph, a complete CPU reset would have occurred within just 11 feet of vehicle’s travel.

Instead, the problem was exactly that the system kept running after the stack overflow. But due to the memory corruption some tasks got “killed” (or “forgotten”) by the OSEK real-time operating system while other tasks were still running. This, in turn, caused the engine to run, but with the throttle “stuck” in the wide-open position, because the “kitchen-sink” TaskX, as Michael calls it, which controlled the throttle among many other things, was dead.

A Shot in the Foot

The data corruption caused by the stack overflow is completely self inflicted. I mean, we know exactly which way the stack grows on any given CPU architecture. For example, on the V850 CPU used in the Toyota engine control module (ECM) the stack grows towards the lower memory addresses, which is traditionally called a “descending stack” or a stack growing “down”. In this sense the stack is like a loaded gun that points either up or down in the RAM address space. Placing your foot (or your critical data for that matter) exactly at the muzzle of this gun doesn’t sound very smart, does it? In fact, doing so goes squarely against the very first NRA Gun Safety Rule: “ ALWAYS keep the gun pointed in a safe direction”.

A standard memory map, in which the stack grows towards your program data.

Yet, as illustrated in the Figure above, most traditional, beaten path memory layouts allocate the stack space above the data sections in RAM, even though the stack grows “down” (towards the lower memory addresses) in most embedded processors (see Table below ). This arrangement puts your program data in the path of destruction of a stack overflow. In other words, you violate the first  NRA Gun Safety Rule and you end up shooting yourself in the foot, as did Toyota.

Processor Architecture Stack growth direction
ARM Cortex-M down
AVR down
AVR32 down
ColdFire down
HC12 down
MSP430 down
PIC18 up
PIC24/dsPIC up
PIC32 (MIPS) down
PowerPC down
RL78 down
RX100/600 down
SH down
V850 down
x86 down

A Smarter Way

At this point, I hope it makes sense to suggest that you consider pointing the stack in a safe direction. For a CPU with the stack growing “down” this means that you should place the stack at the start of RAM, below all the data sections. As illustrated in the Figure below, that way you will make sure that a stack overflow can’t corrupt anything.

A safer memory map, where a stack overflow can't corrupt the data.

Of course, a simple reordering of sections in RAM does nothing to actually prevent a stack overflow, in the same way as pointing a gun to the ground does not prevent the gun from firing. Stack overflow prevention is an entirely different issue that requires a careful software design and a thorough stack usage analysis to size the stack adequately.

But the reordering of sections in RAM helps in two ways. First, you play safe by protecting the data from corruption by the stack. Second, on many systems you also get an instantaneous and free stack overflow detection in form of a hardware exception triggered in the CPU. For example, on ARM Cortex-M an attempt to read to or write from an address below the beginning of RAM causes the Hard Fault exception. Later in the article I will show how to design the exception handler to avoid shooting yourself in the foot again. But before I do this, let me first explain how to change the order of sections in RAM.

How to Change the Default Order of Sections in RAM

To change the order of sections in RAM (or ROM), you typically need to edit the linker script file. For example, in a linker script for the GNU toolchain (typically a file with the .ld extension), you just move the .stack section before the .data section. The following listing shows the order of sections in the GNU .ld file (I provide the complete GNU linker script files for the Tiva-C ARM Cortex-M4F microcontroller in the code downloads accompanying my earlier article):

~~~
SECTIONS {
    .isr_vector : {~~~} >ROM
    .text : {~~~} >ROM
    .preinit_array : {~~~} >ROM
    .init_array : {~~~} >ROM
    .fini_array : {~~~} >ROM
    _etext = .; /* end of code in ROM */

     /* start of RAM */
    .stack : {
        __stack_start__ = .; /* start of the stack section */
        . = . + STACK_SIZE;
        . = ALIGN(4);
        __stack_end__ = .;   /* end of the stack section */
    } >RAM   /* stack at the start of RAM */

    .data :  AT (_etext) {~~~} >RAM
    .bss : {~~~} > RAM
    .heap : {~~~} > RAM
     ~~~
}

On the other hand, a linker script for the IAR toolchain (typically a file with the .icf extension) requires a different strategy. For some reason simple reordering of sections does not do the trick and you need to replace the last line of the standard linker script:

place in RAM_region { readwrite, block CSTACK, block HEAP };

with the following two lines:

place at start of RAM_region {block CSTACK }; /* stack at the start of RAM */
place in RAM_region { readwrite, block HEAP  };

Please note that thus modified linker script remains compatible with the IAR linker configuration file editor built into the IAR EWARM IDE. (Again I provide the complete IAR linker script files for the Tiva-C ARM Cortex-M4F microcontroller in the code downloads accompanying this post.)

 

Designing an Exception Handler for Stack Overflow

As I mentioned earlier, an overflow of a descending stack placed at the start of RAM causes the Hard Fault exception on an ARM Cortex-M microcontroller. This is exactly what you want, because the exception handler provides you the last line of defense to perform damage control. However, you must be very careful how you write the exception handler, because your stack pointer (SP) is out of bounds at this point and any attempt to use the stack will fail and cause another Hard Fault exception. I hope you can see how this would lead to an endless cycle that would lock up the machine even before you had a chance to do any damage control. In other words, you must be careful here not to shoot yourself in the foot again.

So, you clearly can’t write the Hard Fault exception handler in standard C, because a standard C function most likely will access the stack. But, it is still possible to use non-standard extensions to C to get the job done. For example, the GNU compiler provides the __attribute__((naked)) extension, which indicates to the compiler that the specified function does not need prologue/epilogue sequences. Specifically, the GNU compiler will not save or restore any registers on the stack for a “naked” function. The following listing shows the definition of the HardFault_Handler() exception handler, whereas the name conforms to the Cortex Microcontroller Software Interface Standard (CMSIS):

extern unsigned __stack_start__;          /* defined in the GNU linker script */
extern unsigned __stack_end__;            /* defined in the GNU linker script */
~~~
__attribute__((naked)) void HardFault_Handler(void);
void HardFault_Handler(void) {
    __asm volatile (
        "    mov r0,sp\n\t"
        "    ldr r1,=__stack_start__\n\t"
        "    cmp r0,r1\n\t"
        "    bcs stack_ok\n\t"
        "    ldr r0,=__stack_end__\n\t"
        "    mov sp,r0\n\t"
        "    ldr r0,=str_overflow\n\t"
        "    mov r1,#1\n\t"
        "    b assert_failed\n\t"
        "stack_ok:\n\t"
        "    ldr r0,=str_hardfault\n\t"
        "    mov r1,#2\n\t"
        "    b assert_failed\n\t"
        "str_overflow:  .asciz \"StackOverflow\"\n\t"
        "str_hardfault: .asciz \"HardFault\"\n\t"
    );
}

Please note how the __attribute__((naked)) extension is applied to the declaration of the HardFault_Handler() function. The function definition is written entirely in assembly. It starts with moving the SP register into R0 and tests whether it is in bound. A one-sided check against __stack_start__ is sufficient, because you know that the stack grows “down” in this case. If a stack overflow is detected, the SP is restored back to the original end of the stack section __stack_end__. At this point the stack pointer is repaired and you can call a standard C function. Here, I call the function assert_failed(), commonly used to handle failing assertions. assert_failed() can be a standard C function, but it should not return. Its job is to perform application-specific fail-safe shutdown and logging of the error followed typically by a system reset. The code downloads accompanying this article[6] provide an example of assert_failed() implementation in the board support package (BSP).

On a side note, I’d like to warn you against coding any exception handler as an endless loop, which is another beaten path approach taken in most startup code examples provided by microcontroller vendors. Such code locks up the machine, which might be useful during debugging, but is almost never what you want in the production code. Unfortunately, all too often I see developers shooting themselves in the foot yet again by leaving this dangerous code in the final product.

For completeness, I want to mention how to implement HardFault_Handler() exception handler in the IAR toolset. The non-standard extended keyword you can use here is __stackless, which means exactly that the IAR compiler should not use the stack in the designated function. The IAR version can also use the IAR intrinsic functions __get_SP() and __set_SP() to get and set the stack pointer, respectively, instead of inline assembly:

extern int CSTACK$$Base;            /* symbol created by the IAR linker */
extern int CSTACK$$Limit;           /* symbol created by the IAR linker */

__stackless void HardFault_Handler(void) {
    unsigned old_sp = __get_SP();

    if (old_sp < (unsigned)&CSTACK$$Base) {          /* stack overflow? */
        __set_SP((unsigned)&CSTACK$$Limit);    /* initial stack pointer */
        assert_failed("StackOverflow", old_sp); /* should never return! */
    }
    else {
        assert_failed("HardFault", __LINE__);   /* should never return! */
    }
}

What About an RTOS?

The technique of placing the stack at the start of RAM is not going to work if you use an RTOS kernel that requires a separate stack for every task. In this case, you simply cannot align all these multiple stacks at the single address in RAM. But even for multiple stacks, I would recommend taking a minute to think about the safest placement of the stacks in RAM as opposed to allocating the stacks statically inside the code and leaving it completely up to the linker to place the stacks somewhere in the .bss section.

Finally, I would like to point out that preemptive multitasking is also possible with a single-stack kernel, for which the simple technique of aligning the stack at the start of RAM works very well. Contrary to many misconceptions, single-stack preemptive kernels are quite popular. For example, the so called basic tasks of the OSEK-VDX standard all nest on a single stack, and therefore Toyota had to deal with only one stack (see Barr’s slides at the beginning). For more information about single-stack preemptive kernels, please refer to my article “Build a Super-Simple Tasker”.

Test it!

The most important strategy to deal with rare, but catastrophic faults, such as stack overflow is that you need to carefully design and actually test your system’s response to such faults. However, typically you cannot just wait for a rare fault to happen by itself. Instead, you need to use a technique called scientifically fault injection, which simply means that you need to intentionally cause the fault (you need to fire the gun!). In case of stack overflow you have several options: you might intentionally reduce the size of the stack section so that it is too small. You can also use the debugger to change the SP register manually. From there, I recommend that you single -step through the code in the debugger and verify that the system behaves as you intended. Chances are that you might be shooting yourself in the foot, just as it happened to Toyota.

I would be very interested to hear what you find out. Is your stack placed above the data section? Are your exception handlers coded as endless loops? Please leave a comment!

Dual Targeting and Agile Prototyping of Embedded Software on Windows

Friday, April 12th, 2013 Miro Samek

When developing embedded code for devices with non-trivial user interfaces, it often pays off to build a prototype (virtual prototype) of the embedded system of a PC. The strategy is called “dual targeting”, because you develop software on one machine (e.g., Windows PC) and run it on a deeply embedded target, as well as on the PC. Dual targeting is the main strategy for avoiding the “target system bottleneck” in the agile embedded software development, popularized in the book “Test-Driven Development for Embedded C” by James Grenning.

Avoiding Target Hardware Bottleneck with Dual Targeting

Please note that dual targeting does not mean that the embedded device has anything to do with the PC. Neither it means that the simulation must be cycle-exact with the embedded target CPU.

Dual targeting simply means that from day one, your embedded code (typically in C) is designed to run on at least two platforms: the final target hardware and your PC. All you really need for this is two C compilers: one for the PC and another for the embedded device.

However, the dual targeting strategy does require a specific way of designing the embedded software such that any target hardware dependencies are handled through a well-defined interface often called the Board Support Package (BSP). This interface has at least two implementations: one for the actual target and one for the PC, for example running Windows. With such interface in place, the bulk of the embedded code can remain completely unaware which BSP implementation it is linked to and so it can be developed quickly on the PC, but can also run on the target hardware without any changes.

While some embedded programmers can view dual targeting as a self-inflicted burden, the more experienced developers generally agree that paying attention to the boundaries between software and hardware is actually beneficial, because it results in more modular, more portable, and more maintainable software with much longer useful lifetime. The investment in dual targeting has also an immediate payback in the vastly accelerated compile-run-debug cycle, which is much faster and more productive on the powerful PC compared to much slower, recourse-constrained deeply embedded target with limited visibility into the running code.

Agile Rapid Prototyping of Embedded Software with Dual Targeting

Dual targeting can have many different objectives. For example, in the test-driven development (TDD) of embedded software, the objective is to build relatively concise unit tests and execute them on the desktop as console-type applications. The main challenge is management of the inter-module dependencies and flexibility of tests, but the overall architecture of the final product is of lesser concerns, as the unit tests are executed in isolation using special test harnesses.

However, dual targeting can also be used for (rapid) prototyping and simulating the whole embedded devices on the PC, not just executing unit tests. In this case, the objective is to build a possibly complete prototype of the embedded device as a GUI-type application. This approach is particularly interesting for embedded systems with non-trivial user interfaces, such as: home appliances, office equipment, thermostats, medical devices, industrial controllers, etc. As it turns out, significant percentage of the code embedded in all those devices is devoted to the user interface and can be, or even should be, developed on the desktop.

QWIN GUI Toolkit

When developing embedded code for devices with non-trivial user interfaces, one often runs into the problem of representing the embedded front panels as GUI elements on the PC. The problem is so common, that I’m really surprised that my internet search couldn’t uncover any simple C-only interface to the basic elements, such as LCDs, buttons, and LEDs. I’ve posted questions on StackOverflow, and other such forums, but again, I got recommendations for .NET, C#, VisualBasic, and many expensive proprietary tools, none of which provided an easy, direct binding to C. My objective is not really that complicated, yet it seems that every embedded developer has to re-invent this wheel over and over again.

qwin_ani

 

 

So, to help embedded developers interested in prototyping embedded devices on Windows, I have created a QWIN GUI Toolkit” and have posted on SourceForge (as part of the Qtools collection) under the permissive MIT open source license. This toolkit relies only on the raw Win32 API in C and currently provides the following elements:

  • Graphic display for an efficient, pixel-addressable displays such as graphical LCDs, OLEDs, etc. with full 24-bit color.
  • Segment display for segmented display such as segment LCDs, and segment LEDs with generic, custom bitmaps for the segments.
  • Owner-drawn buttons with custom “depressed” and “released” bitmaps and capable of generating separate events when depressed and when released.

The toolkit comes with an example and an App Note, showing how to handle input from the owner-drawn buttons, regular buttons, keyboard, and the mouse. You can also view a 1-minute YouTube video “Flyn ‘n’ Shoot game on windows” that shows a virtual embedded board running a game.

Regarding the size and complexity of the “QWIN GUI Toolkit“, the implementation of the aforementioned GUI elements takes only about 250 lines of C. The example with all sources of input and a lot of comments amounts to some 300 lines of C. The toolkit has been tested with the free MinGW compiler, the free Visual C++ Express 2013, and the free ResEdit resource editor.

Enjoy!

Embedded C Programming with ARM Cortex-M Video Course

Monday, January 21st, 2013 Miro Samek

As part of my New Year’s resolution for 2013, I just started to teach an Embedded C Programming Course with ARM Cortex-M on YouTube. The playlist for this course is available at: http://www.youtube.com/playlist?list=PLPW8O6W-1chwyTzI3BHwBLbGQoPFxPAPM .

The course is intended for beginners and is structured as a series of short, focused, hands-on lessons that teach you how to program ARM Cortex-M microcontrollers in C.

I’ve designed this course not just to be watched, but to follow it along on your own computer. In the “Getting Started” Lesson 0, I show you how to download and install the free evaluation version of IAR EWARM and how to order the Stellaris Launchpad ARM Cortex-M4 board (for just $12.99). The board is optional, as I show how to use the instruction set simulator.

My goal is not just to teach C–other courses do it already quite well. But there are virtually no courses that would step down to the machine level and show you exactly what happens inside the ARM processor.

Starting from Lesson 1 you actually see how the ARM Cortex-M processor executes your code, how it manipulates registers, and how it counts. You learn how binary numbers map to the hexadecimal system used in the debugger (and in C) and you learn about the two’s complement number representation of signed numbers.

In lesson 2, you learn about the flow of control and the ARM branch instructions. Actually, you witness a disection of the ARM B-instruction (branch). You also learn about the pipeline and pipeline stalls due to branching.

In lesson 3, you learn about variables and pointers. You learn how ARM accesses variables in memory through the load and store instructions (load-store architecture). You also learn how the fundamental concept of memory addresses maps to pointers in C, how to obtain an address of a variable and how to dereference a pointer.

I hope that this course will help you gain understanding of the ARM Cortex-M core, which will look really good on your resume.

This deeper understanding will allow you to use both the ARM processor and the C language more efficiently and with greater confidence. You will gain understanding not just what for your program does, but also how the C statements translate to machine instructions and how fast the processor can execute them.

I’d love to hear your comments about the course. Is there anything that you would like to see in the upcoming lessons? Do you see anything that you would teach differently? Or perhaps you have ideas for teaching specific subjects? Please share…

Protothreads versus State Machines

Thursday, June 9th, 2011 Miro Samek

For a number of years I’ve been getting questions regarding Protothreads and comparisons to state machines. Here is what I think.

Protothreads are an attempt to write event-driven code in a sequential way. To do so, protothreads introduce a concept of “blocking abstraction” to event-driven programming–something that event-driven programming is trying to get rid of in the first place.

Obviously, to be compatible with event-driven programming, protothreads cannot *really* block, at least not in the same sense as traditional threads of a conventional blocking RTOS can. Instead, a protothread is still called for every event and still *returns* to the caller without really blocking. However, when a given event does not match the “blocking abstraction”, the protothread returns without doing anything and without progressing. Only when the current event matches the “blocking abstraction” the protothread advances to the next “blocking abstraction” and also returns. Please note that protothreads allow standard flow control statements, such as IF-THEN-ELSE and WHILE loops between any two “blocking abstractions”. Therefore the program feels more “natural” for designers used to the traditional sequential programming style. You simply see the expected sequence of events.

Protothreads are indeed a simplification, but only for *sequential problems*, in which only specific sequences of events are valid and all other sequences are invalid. Examples for using protothreads include the sequence of events for initializing a radio modem.

However, protothreads lose their advantage entirely if there are many valid sequences of events. This is actually the most common situation in event-driven systems. State machines are capable to handle multiple sequences of events easily. In fact, state machines are getting simpler when the sequence of events matters less. In contrast, protothreads are getting more and more complex when they need to accept more sequences of events.

So, in the end, protothreads are an attempt to replace state machines, which are considered complex and messy by the inventors of the protothread mechanism. But, the “messiness” of state machines is obviously a very subjective statement. A good state machine implementation technique can remove many (most) accidental difficulties of coding state machines. I mean, if a state machine as depicted in a state diagram is simple, and if the code does not reflect this simplicity, the problem is with the implementation technique, not with the inherent complexity of a state machine.

The bottom line: good state machine implementation techniques eliminate most reasons for using protothreads. State machines are far more generic and flexible, because they can easily handle multiple sequences of events. In comparison, protothreads are intentionally crippled state machines that transition implicitly from one “blocking abstraction” to another executing code in between as the “actions” on such implicit transition.