embedded software boot camp

Embedded Programming Video Course Teaches RTOS

January 20th, 2019 by Miro Samek

If you’d like to understand how a Real-Time Operating System (RTOS) really works, here is a free video course for you:

RTOS part-1: In this first lesson on RTOS you will see how to extend the foreground/background architecture from the previous lesson, so that you can have multiple background loops running seemingly simultaneously:

RTOS part-2: In this second lesson on RTOS you will see how to automate the context switch process. Specifically, in this lesson, you will start building your own minimal RTOS that will implement the manual context switch procedure that you worked out in the previous lesson:

RTOS part-3: This third lesson on Real-Time Operating System (RTOS) shows how to automate the scheduling process. Specifically, in this lesson you will implement the simple round robin scheduler that runs threads in a circular order. Along the way, you will add several improvements to the MiROS RTOS and you will see how fast it runs:

RTOS part-4: This forth lesson on Real-Time Operating System (RTOS) shows how to replace the horribly inefficient polling for events with efficient BLOCKING of threads. Specifically, in this lesson you will add a blocking delay function to the MiROS RTOS and you’ll learn about some far-reaching implications of thread blocking on the RTOS design:

RTOS part-5: This fifth lesson on RTOS I’ll finally address the real-time aspect in the “Real-Time Operating System” name. Specifically, in this lesson you will augment the MiROS RTOS with a preemptive, priority-based scheduler, which can be mathematically proven to meet real-time deadlines under certain conditions:

RTOS part-6: This sixth lesson on RTOS talks about the RTOS mechanisms for synchronization and communication among concurrent threads. Such mechanisms are the most complex elements of any RTOS, and are generally really tricky to develop by yourself. For that reason, this lesson replaces the toy MiROS RTOS with the professional-grade QXK RTOS included in the QP/C framework, parts of which have been used since lesson 21. The lesson demonstrates the process of porting an existing application to a different RTOS, and once this is done, explains semaphores and shows how they work in practice:

RTOS part-7: This seventh lesson on RTOS talks about sharing resources among concurrent threads, and about the RTOS mechanisms for protecting such shared resources. First you see what can happen if you share resources without any protection, and then you get introduced to the “mutual exclusion mechanisms” for protecting the shared resources. Specifically, you learn about: critical sections, resource semaphores, selective scheduler locking, and mutexes. You also learn about the second-order problems caused by these mechanisms, such as unbounded prioroty inversion. Finally, you learn how to prevent these second-order effects by priority-ceiling protocol and priority-inheritance protocol.

Embedded Toolbox: Source Code Whitespace Cleanup

August 7th, 2017 by Miro Samek

In this installment of my “Embedded Toolbox” series, I would like to share with you the free source code cleanup utility called QClean for cleaning whitespace in your source files, header files, makefiles, linker scripts, etc.

You probably wonder why you might need such a utility? In fact, the common thinking is that compilers (C, C++, etc.) ignore whitespace anyway, so why bother? But, as a professional software developer you should not ignore whitespace, because it can cause all sorts of problems, some of them illustrated in the figure below:

qclean_dirty

  1. Trailing whitespace after the last printable character in line can cause bugs. For example, trailing whitespace after the C/C++ macro-continuation character ‘\’ can confuse the C pre-processor and can result in a program error, as indicated by the bug icons.
  2. Similarly, inconsistent use of End-Of-Line (EOL) convention can cause bugs. For example, mixing the DOS EOL Convention (0x0D,0x0A) with Unix EOL Convention (0x0A) can confuse the C pre-processor and can result in a program error, as indicated by the bug icons.
  3. Varying amount of trailing whitespace at the end of the lines plus inconsistent use of tabs and spaces can cause unnecessary churn in the version control system (VCS) in source files that otherwise should be identical. Sure, many VCSs allow you to “ignore whitespace”, but are files differing in size by as much as 20% really identical?
  4. Inconsistent use of tabs and spaces can lead to different rendering of the source code by different editors and printers.

Note: The problems caused by whitespace in the source code are particularly insidious, because you don’t see the culprit. By using an automated whitespace cleanup utility you can save yourself hours of frustration and significantly improve your code quality.

 


 QClean Source Code Cleanup Utility

QClean is a simple and blazingly fast command-line utility to automatically clean whitespace in your source code. QClean is deployed as natively compiled executable and is located in the QTools Collection (in the sub-directory  qtools/bin ). QClean is also available in portable source code and can be adapted and re-compiled on all desktop platforms (Windows, POSIX –Linux, MacOS).

Using QClean

Typically, you invoke QClean from a command-line prompt without any parameters. In that case, QClean will cleanup white space in the current directory and recursively in all its sub-directories.

Note: If you have added the qtools/bin/ directory to your PATH environment variable (see Installing QTools), you can run qclean directly from your terminal window.

qclean_run

As you can see in the screen shot above, QClean processes the files and prints out the names of the cleaned up files. Also, you get information as to what has been cleaned, for example, “Trail-WS” means that trailing whitespace has been cleaned up. Other possibilities are: “CR” (cleaned up DOS/Windows (CR) end-of-lines), “LF” (cleaned up Unix (LF) end-of-lines), and “Tabs” (replaced Tabs with spaces).

QClean Command-Line Parameters

QClean takes the following command-line parameters:

 PARAMETER  DEFAULT  COMMENT
[root-dir] . root directory to clean (relative or absolute)
OPTIONS
-h help (show help message and exit)
-q query only (no cleanup when -q present)
-r check also read-only files
-l[limit] 80 line length limit (not checked when -l absent)

QClean Features

QClean fixes the following whitespace problems:

  • removing of all trailing whitespace (see figure above 1)
  • applying consistent End-Of-Line convention (either Unix (LF) or DOS (CRLF) see figure above 2)
  • replacing Tabs with spaces (untabify, see figure above 2)
  • optionally, scan the source code for long lines exceeding the specified limit (-l option, default 80 characters per line).

Long Lines

QClean can optionally check the code for long lines of code that exceed a specified limit (80 characters by default) to reduce the need to either wrap the long lines (which destroys indentation), or the need to scroll the text horizontally. (All GUI usability guidelines universally agree that horizontal scrolling of text is always a bad idea.) In practice, the source code is very often copied-and-pasted and then modified, rather than created from scratch. For this style of editing, it’s very advantageous to see simultaneously and side-by-side both the original and the modified copy. Also, differencing the code is a routinely performed action of any VCS (Version Control System) whenever you check-in or merge the code. Limiting the line length allows to use the horizontal screen real estate much more efficiently for side-by-side-oriented text windows instead of much less convenient and error-prone top-to-bottom differencing.

QClean File Types

QClean applies the following rules for cleaning the whitespace depending on the file types:

 FILE TYPE  END-OF-LINE  TRAILING WS  TABS  LONG-LINES
.c Unix (LF) remove remove check
.h Unix (LF) remove remove check
.cpp Unix (LF) remove remove check
.hpp Unix (LF) remove remove check
.s Unix (LF) remove remove check
.asm Unix (LF) remove remove check
.lnt Unix (LF) remove remove check
.txt DOS (CR,LF) remove remove don’t check
.md DOS (CR,LF) remove remove don’t check
.bat DOS (CR,LF) remove remove don’t check
.ld Unix (LF) remove remove check
.tcl Unix (LF) remove remove check
.py Unix (LF) remove remove check
.java Unix (LF) remove remove check
Makefile Unix (LF) remove leave check
.mak Unix (LF) remove leave check
.html Unix (LF) remove remove don’t check
.htm Unix (LF) remove remove don’t check
.php Unix (LF) remove remove don’t check
.dox Unix (LF) remove remove don’t check
.m Unix (LF) remove remove check

The cleanup rules specified in the table above can be easily customized by editing the array l_fileTypes in the qclean/source/main.c file. Also, you can change the Tab sizeby modifying the TAB_SIZE constant (currently set to 4) as well as the default line-limit by modifying the LINE_LIMIT constant (currently set to 80) at the top of the the qclean/source/main.c file. Of course, after any such modification, you need to re-build the QClean executable and copy it into the qtools/bin directory.

Note: For best code portability, QClean enforces the consistent use of the specified End-Of-Line convention (typically Unix (LF)), regardless of the native EOL of the platform. The DOS/Windows EOL convention (CR,LF) is typically not applied because it causes compilation problems on Unix-like systems (Specifically, the C preprocessor doesn’t correctly parse the multi-line macros.) On the other hand, most DOS/Windows compilers seem to tolerate the Unix EOL convention without problems.

Summary

QClean is very simple to use (no parameters are needed in most cases) and is fast (it can easily cleanup hundreds of files per second). All this is designed so that you can use QClean frequently. In fact, the use of QClean after editing your code should become part of your basic hygiene–like washing hands after going to the bathroom.

Embedded Toolbox: Programmer’s Calculator

June 27th, 2017 by Miro Samek

Like any craftsman, I have accumulated quite a few tools during my embedded software development career. Some of them proved to me more useful than others. And these generally useful tools ended up in my Embedded Toolbox. In this blog, I’d like to share some of my tools with you. Today, I’d like to start with my cross-platform Programmer’s Calculator called QCalc.

qcalc

I’m sure that you already have your favorite calculator online or on your smartphone. But can your calculator accept complete expressions in the C-syntax, which you can cut-and-paste directly to and from your embedded code? How many buttons do you need to push to see your result in decimal, hex and binary? Well, QCalc can do this with less hassle than anything else I’ve seen out there. I begin with describing QCalc features and then I tell you how to download and launch it.

QCalc Features

Expressions in C-Syntax

The most important feature of QCalc is that it accepts expressions in the C-syntax – with the same operands and precedence rules as in the C or C++ source code. Among others, the expressions can contain all bit-wise operators (<<, >>, |, &, ^, ~) as well as mixed decimal, hexadecimal and even binary constants. QCalc is also a powerful floating-point scientific calculator and supports all mathematical functions (sin(), cos(), tan(), exp(), ln(), …). Some examples of acceptable expressions are:

((0xBEEF << 16) | 1280) & ~0xFF – binary operators, mixed hex and decimal numbers
($1011 << 24) | (1280 >> 8) ^ 0xFFF0 – mixed binary, dec and hex numbers
(1234 % 55) + 4321/33 – remainder, integer division
pow(sin($pi),2) + pow(cos($pi),2) – scientific floating-point calculations, pi-constant
($0111 & $GPIO_EXTIPINSELL_EXTIPINSEL0_MASK) << ($GPIO_EXTIPINSELL_EXTIPINSEL1_SHIFT * 12) – (see user-defined variables)

NOTE: QCalc internally uses the Tcl command expr to evaluate the expressions. Please refer to the documentation of the Tcl expr command for more details of supported syntax and features.

Automatic conversion to hexadecimal and binary

If the result of expression evaluation is integer (as opposed to floating point), QCalc automatically displays the result in hexadecimal and binary formats (see QCalc GUI). For better readability the hex display shows a comma between the two 16-bit half-words (e.g., 0xDEAD,BEEF). Similarly, the binary output shows a comma between the four 8-bit bytes (e.g., 0b11011110,10101101,10111110,11101111).

Binary constants

As the extension to the C-syntax, QCalc supports binary numbers in the range from 0-15 (0b0000-0b1111). These binary constants are represented as $0000$0001$0010,…, $1110, and $1111 and can be mixed into expressions. Here are a few examples of such expressions:

($0110 << 14) & 0xDEADBEEF
($0010 | $1000) * 123

History of inputs

QCalc remembers the history of up to 8 most recently entered expressions. You can recall and navigate the history of previously entered expressions by pressing the Up / Down keys.

The $ans variable

QCalc stores the result of the last computation in the $ans variable (note the dollar sign $ in front of the variable name). Here are some examples of expressions with the $ans variable:

1/$ans – find the inverse of the last computation
log($ans)/log(2) – find log-base-2 of the last computation

User variables

QCalc allows you also to define any number of your own user variables. To set a variable, you simply type the expression =alpha in the user input field. This will define the variable alpha and assign it the value of the last computation ($ans). Subsequently, you can use your alpha variable in expressions by typing $alpha (note the dollar sign $ in front of the variable name). Here is example of defining and using variable $GPIO_BASE:

0xE000E000 – set some value into $ans
=GPIO_BASE – define user variable GPIO_BASE and set it to $ans
$GPIO_BASE + 0x400 – use the variable $GPIO_BASE in an expression

Note: The names of user variables are case-sensitive.

Error handling

Expressions that you enter into QCalc might have all kinds of errors: syntax errors, computation errors (e.g., division by zero), undefined variable errors, etc. In all these cases, QCalc responds with the Error message and the explanation of the error:

qcalc_err

Downloading and Launching QCalc

QCalc is included in the open source QTools Collection, which you cad freely download from SourceForge or GitHub. Once you install QTools, QCalc is located in the sub-directory qtools/bin/ and consists of a single file qcalc.tcl. To launch QCalc, you need to open this file with the wish Tk interpreter.

NOTE: The wish Tk interpreter is included in the QTools Collection for Windows and is also pre-installed in most Linux distributions.

You use QCalc by typing (or pasting) an expression in the user input field and pressing the Enter key to evaluate the expression. You can conveniently edit any expression already inside the user input field, and you can recall the previous expressions by means of the Up/Down keys. You can also resize the QCalc window to see more or less of the input field.

QCalc on Windows

The wish Tk interpreter is conveniently provided in the same qtools/bin/ directory as the qcalc.tcl script. The directory contains also a shortcut qcalc, which you can copy to your desktop.

qcalc_lnk

QCalc on Linux

Most Linux distributions contain the Tk interpreter, which you can use to launch QCalc. You can do this either from a terminal, by typin wish $QTOOLS/qcalc.tcl & or by creating a shortcut to wish with the command-line argument $QTOOLS/qcalc.tcl.

qcalc_linux

Modern Embedded Programming: Beyond the RTOS

April 27th, 2016 by Miro Samek

An RTOS (Real-Time Operating System) is the most universally accepted way of designing and implementing embedded software. It is the most sought after component of any system that outgrows the venerable “superloop”. But it is also the design strategy that implies a certain programming paradigm, which leads to particularly brittle designs that often work only by chance. I’m talking about sequential programming based on blocking.

Blocking occurs any time you wait explicitly in-line for something to happen. All RTOSes provide an assortment of blocking mechanisms, such as time-delays, semaphores, event-flags, mailboxes, message queues, and so on. Every RTOS task, structured as an endless loop, must use at least one such blocking mechanism, or else it will take all the CPU cycles. Typically, however, tasks block not in just one place in the endless loop, but in many places scattered throughout various functions called from the task routine. For example, in one part of the loop a task can block and wait for a semaphore that indicates the end of an ADC conversion. In other part of the loop, the same task might wait for an event flag indicating a button press, and so on.

This excessive blocking is insidious, because it appears to work initially, but almost always degenerates into a unmanageable mess. The problem is that while a task is blocked, the task is not doing any other work and is not responsive to other events. Such a task cannot be easily extended to handle new events, not just because the system is unresponsive, but mostly due to the fact that the whole structure of the code past the blocking call is designed to handle only the event that it was explicitly waiting for.

You might think that difficulty of adding new features (events and behaviors) to such designs is only important later, when the original software is maintained or reused for the next similar project. I disagree. Flexibility is vital from day one. Any application of nontrivial complexity is developed over time by gradually adding new events and behaviors. The inflexibility makes it exponentially harder to grow and elaborate an application, so the design quickly degenerates in the process known as architectural decay.

perils_of_blocking

The mechanisms of architectural decay of RTOS-based applications are manifold, but perhaps the worst is the unnecessary proliferation of tasks. Designers, unable to add new events to unresponsive tasks are forced to create new tasks, regardless of coupling and cohesion. Often the new feature uses the same data and resources as an already existing feature (such features are called cohesive). But unresponsiveness forces you to add the new feature in a new task, which requires caution with sharing the common data. So mutexes and other such blocking mechanisms must be applied and the vicious cycle tightens. The designer ends up spending most of the time not on the feature at hand, but on managing subtle, intermittent, unintended side-effects.

For these reasons experienced software developers avoid blocking as much as possible. Instead, they use the Active Object design pattern. They structure their tasks in a particular way, as “message pumps”, with just one blocking call at the top of the task loop, which waits generically for all events that can flow to this particular task. Then, after this blocking call the code checks which event actually arrived, and based on the type of the event the appropriate event handler is called. The pivotal point is that these event handlers are not allowed to block, but must quickly return to the “message pump”. This is, of course, the event-driven paradigm applied on top of a traditional RTOS.

While you can implement Active Objects manually on top of a conventional RTOS, an even better way is to implement this pattern as a software framework, because a framework is the best known method to capture and reuse a software architecture. In fact, you can already see how such a framework already starts to emerge, because the “message pump” structure is identical for all tasks, so it can become part of the framework rather than being repeated in every application.

paradigm-shift

This also illustrates the most important characteristics of a framework called inversion of control. When you use an RTOS, you write the main body of each task and you call the code from the RTOS, such as delay(). In contrast, when you use a framework, you reuse the architecture, such as the “message pump” here, and write the code that it calls. The inversion of control is very characteristic to all event-driven systems. It is the main reason for the architectural-reuse and enforcement of the best practices, as opposed to re-inventing them for each project at hand.

But there is more, much more to the Active Object framework. For example, a framework like this can also provide support for state machines (or better yet, hierarchical state machines), with which to implement the internal behavior of active objects. In fact, this is exactly how you are supposed to model the behavior in the UML (Unified Modeling Language).

As it turns out, active objects provide the sufficiently high-level of abstraction and the right level of abstraction to effectively apply modeling. This is in contrast to a traditional RTOS, which does not provide the right abstractions. You will not find threads, semaphores, or time delays in the standard UML. But you will find active objects, events, and hierarchical state machines.

An AO framework and a modeling tool beautifully complement each other. The framework benefits from a modeling tool to take full advantage of the very expressive graphical notation of state machines, which are the most constructive part of the UML. The modeling tool benefits from the well-defined “framework extension points” designed for customizing the framework into applications, which in turn provide well-defined rules for generating code.

In summary, RTOS and superloop aren’t the only game in town. Actor frameworks, such as Akka, are becoming all the rage in enterprise computing, but active object frameworks are an even better fit for deeply embedded programming. After working with such frameworks for over 15 years , I believe that they represent a similar quantum leap of improvement over the RTOS, as the RTOS represents with respect to the “superloop”.

If you’d like to learn more about active objects, I recently posted a presentation on SlideShare: Beyond the RTOS: A Better Way to Design Real-Time Embedded Software

Also, I recently ran into another good presentation about the same ideas. This time a NASA JPL veteran describes the best practices of “Managing Concurrency in Complex Embedded Systems”. I would say, this is exactly active object model. So, it seems that it really is true that experts independently arrive at the same conclusions…

Fast, Deterministic, and Portable Counting Leading Zeros

September 8th, 2014 by Miro Samek

Counting leading zeros in an integer number is a critical operation in many DSP algorithms, such as normalization of samples in sound or video processing, as well as in real-time schedulers to quickly find the highest-priority task ready-to-run.

In most such algorithms, it is important that the count-leading zeros operation be fast and deterministic. For this reason, many modern processors provide the CLZ (count-leading zeros) instruction, sometimes also called LZCNT, BSF (bit scan forward), FF1L (find first one-bit from left) or FBCL (find bit change from left).

Of course, if your processor supports CLZ or equivalent in hardware, you definitely should take advantage of it. In C you can often use a built-in function provided by the embedded compiler. A couple of examples below illustrate the calls for various CPUs and compilers:


y = __CLZ(x);          // ARM Cortex-M3/M4, IAR compiler (CMSIS standard)
y = __clz(x);          // ARM Cortex-M3/M4, ARM-KEIL compiler
y = __builtin_clz(x);  // ARM Cortex-M3/M4, GNU-ARM compiler
y = _clz(x);           // PIC32 (MIPS), XC32 compiler
y = __builtin_fbcl(x); // PIC24/dsPIC, XC16 compiler

However, what if your CPU does not provide the CLZ instruction? For example, ARM Cortex-M0 and M0+ cores do not support it. In this case, you need to implement CLZ() in software (typically an inline function or as a macro).

The Internet offers a lot of various algorithms for counting leading zeros and the closely related binary logarithm (log-base-2(x) = 32 – 1 – clz(x)). Here is a sample list of the most popular search results:

http://en.wikipedia.org/wiki/Find_first_set
http://aggregate.org/MAGIC/#Leading Zero Count
http://www.hackersdelight.org/hdcodetxt/nlz.c.txt (from Hacker’s Delight (2nd Edition), Section 5-3 “Counting Leading 0’s”)
http://www.codingforspeed.com/counting-the-number-of-leading-zeros-for-a-32-bit-integer-signed-or-unsigned/
http://stackoverflow.com/questions/9041837/find-the-index-of-the-highest-bit-set-of-a-32-bit-number-without-loops-obviously
http://stackoverflow.com/questions/671815/what-is-the-fastest-most-efficient-way-to-find-the-highest-set-bit-msb-in-an-i

But, unfortunately, most of the published algorithms are either incomplete, sub-optimal, or both. So, I thought it could be useful to post here a complete and, as I believe, optimal CLZ(x) function, which is both deterministic and outperforms most of the published implementations, including all of the “Hacker’s Delight” algorithms.

Here is the first version:

static inline uint32_t CLZ1(uint32_t x) {
    static uint8_t const clz_lkup[] = {
        32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U,
        28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U,
        27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U,
        27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U
    };
    uint32_t n;
    if (x >= (1U << 16)) {
        if (x >= (1U << 24)) {
            n = 24U;
        }
        else {
            n = 16U;
        }
    }
    else {
        if (x >= (1U << 8)) {
            n = 8U;
        }
        else {
            n = 0U;
        }
    }
    return (uint32_t)clz_lkup[x >> n] - n;
}

This algorithm uses a hybrid approach of bi-section to find out which 8-bit chunk of the 32-bit number contains the first 1-bit, which is followed by a lookup table clz_lkup[] to find the first 1-bit within the byte.

The CLZ1() function is deterministic in that it completes always in 13 instructions, when compiled with IAR EWARM compiler for ARM Cortex-M0 core with the highest level of optimization.

The CLZ1() implementation takes about 40 bytes for code plus 256 bytes of constant lookup table in ROM. Altogether, the algorithm uses some 300 bytes of ROM.

If the ROM footprint is too high for your application, at the cost of running the bi-section for one more step, you can reduce the size of the lookup table to only 16 bytes. Here is the CLZ2() function that illustrates this tradeoff:

static inline uint32_t CLZ2(uint32_t x) {
    static uint8_t const clz_lkup[] = {
        32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U,
        28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U
    };
    uint32_t n;

    if (x >= (1U << 16)) {
        if (x >= (1U << 24)) {
            if (x >= (1 << 28)) {
                n = 28U;
            }
            else {
                n = 24U;
            }
        }
        else {
            if (x >= (1U << 20)) {
                n = 20U;
            }
            else {
                n = 16U;
            }
        }
    }
    else {
        if (x >= (1U << 8)) {
            if (x >= (1U << 12)) {
                n = 12U;
            }
            else {
                n = 8U;
            }
        }
        else {
            if (x >= (1U << 4)) {
                n = 4U;
            }
            else {
                n = 0U;
            }
        }
    }
    return (uint32_t)clz_lkup[x >> n] - n;
}

The CLZ2() function completes always in 17 instructions, when compiled with IAR EWARM compiler for ARM Cortex-M0 core.

The CLZ2() implementation takes about 80 bytes for code plus 16 bytes of constant lookup table in ROM. Altogether, the algorithm uses some 100 bytes of ROM.

I wonder if you can beat the CLZ1() and CLZ2() implementations. If so, please post it in the comment. I would be really interested to find an even better way.

NOTE: In case you wish to use the published code in your projects, the code is released under the “Do What The F*ck You Want To Public License” (WTFPL).