embedded software boot camp

MISRA-C Guidelines for Safety Critical Software

December 17th, 2009 by Nigel Jones

Also available in PDF version.

In 1998, the UK’s Motor Industry Software Reliability Association established a set of 127 guidelines for the use of C in safety-critical systems. Here’s a look at the rules, what they mean, and how they can work for you.

The reasons for the popularity of C in the embedded realm are clear: easy access to hardware, low memory requirements, and efficient run-time performance being foremost among them. Equally well known are the problems with C: highly limited run-time checking, a syntax that is prone to stupid mistakes that are technically legal, and a myriad of areas where the ISO C standard explicitly states that the behavior is either implementation defined or undefined.

In the hands of a truly experienced programmer, most of the limitations of C can be avoided. Of course, that leaves the problem of inexperienced programmers. The problem is similar to the dilemma faced by parents when their offspring learn to drive. The solution that most parents adopt is to give their child a large, slow lumbering vehicle with excellent brakes; few hand over the keys to a high performance sports car.

ISO C is like a high performance vehicle–with very questionable brakes. Despite this, organizations have little choice other than to let their beginners loose with the language. The results are predictable and well documented.

So, what can be done? The UK-based Motor Industry Software Reliability Association (MISRA) realized that in many areas of an automobile design, safety is of paramount importance. They also recognized that C was the de facto language for coding such systems and that these systems are, therefore, vulnerable to C’s limitations. Rather than mandating the use of a safer language such as Ada, the organization looked at ways to make C safer.

The result was a set of “Guidelines for the Use of the C Language in Vehicle-Based Software,” or “MISRA C,” as they are more informally known. The guidelines, which were first published in 1998, comprise a 70-page document that describes a workable subset of C that avoids many of its well-known problems. (Unfortunately, you have to buy the document from MISRA. It is available at www.misra.org.uk for about $50.)

The MISRA C document contains eight chapters and two appendices and was obviously written by experienced embedded programmers. Chapters 1 through 6 contain important information about the rationale for MISRA C and how to interpret the rules. These chapters should be read prior to diving into the actual rules, which are found in Chapter 7.

Safety guidelines

In all, MISRA C has 127 rules. Of these, 93 are required and the remaining 34 are advisory. The distinction between these two types of rules is important. C code that claims conformance to MISRA C must comply with all 93 required rules. Conforming code should adhere to the advisory rules as much as is practical. In other words, either you buy into the spirit of MISRA C or you don’t.

With this in mind, it’s time to discuss the rules themselves. The example rules are quoted verbatim, in part to illustrate the tone and style of the guidelines. Let’s start at the beginning:

Rule 1 (required): All code shall conform to ISO 9899 standard C, with no extensions permitted.

When I first read this, I laughed out loud. Because the C standard was not originally written with embedded systems in mind, it is impossible to write a non-trivial embedded program without resorting to a variety of compiler extensions. For example, ISO C provides no way to specify that a function is an interrupt service routine. (MISRA frowns upon assembly language, so you can’t use that as a way to skirt the issues.)

Fortunately, in reading the notes associated with Rule 1, one finds the following comment:

It is recognized that it may be necessary to raise deviations (as described in section 5.3.2) to permit certain language extensions, for example to support hardware specific features.

The deviations mentioned in the comment are, arguably, the strongest feature of MISRA C. In putting together the rules, the authors recognized that they could not possibly foresee all the circumstances associated with all embedded systems. Thus, they put in place a mechanism by which an organization can formally deviate from a given rule. Note that this does not mean that the programmer is free to violate the rules on a whim. Rather it means that all violations must be carefully considered, justified, and documented.

MISRA also recommends that deviations be localized as much as possible–for example, by performing all I/O operations in a single file. Notwithstanding MISRA C, this is good programming practice anyway.

Most of the other rules are common sense and are succinctly stated. For instance:

Rule 20 (required): All object and function identifiers shall be declared before use.

Any programmer who has a problem with this rule probably needs to find a new line of work.

Some of the rules are more stylistic in their intent and, as such, are more likely to offend individual sensibilities. This is particularly true of the advisory rules, such as:

Rule 49 (advisory): Tests of a value against zero should be made explicit, unless the operand is effectively Boolean.

This rule goes against my own preferred style of coding. However, I see where the authors are going and wouldn’t exactly be compromising deeply held beliefs by conforming to their advice.

While Rule 49 isn’t going to change much in the way most people write code, Rule 104 could have a bigger impact.

Rule 104 (required): Non-constant pointers to functions shall not be used.

The use of pointers to functions is a favored technique of many experienced embedded programmers.

The bottom line: regardless of your experience level, if you choose to go down the MISRA C path, expect to have to change the way you code in both big and small ways.

Compliance checking

In writing these rules, the authors appear to have tried very hard to ensure that compliance with as many rules as possible can be checked via static analysis. Static analysis is a fancy term meaning that a program can detect the violations by examining the source code, without ever running it.

The most obvious program to do the compliance checking is a compiler. In fact, several embedded compiler vendors, including Green Hills and Tasking (now Altium), do offer compilers with MISRA compliance switches. Does this mean that you have to use a subset of the compilers out there in order to get compliance checking? Fortunately, no! PC-Lint (from Gimpel) now offers MISRA compliance checking as well. (Read more about lint.)

Unfortunately, though, not all of the rules can be enforced automatically. For the 23 rules that cannot be checked statically, the only available enforcement mechanism is a code review. So, if nothing else, adherence to MISRA C guarantees that your organization will have to conduct code reviews.

MISRA recommends the use of a compliance matrix to track how your organization intends to check its conformance to each rule.

Final thoughts

Many organizations that develop software, particularly in the U.S., have not even heard of MISRA C–let alone require its use. However, this doesn’t mean that as an individual developer you shouldn’t consider adopting these guidelines on your own. After all, MISRA is a tool to help you write safer, more portable code.

For those of you who can’t handle the effort of formally adopting MISRA C, you should, at the very least, examine the rules. If you see rules that contradict your normal coding style, then I suggest you look long and hard at your practices. Chances are that the MISRA folks are older and wiser than you.

MISRA C does nothing for checking the validity of your algorithms. It doesn’t enforce a particular style, and it cannot stop you from writing completely idiotic code. What it will do–if you let it–is protect you from most of the darker corners of the C language.


This article was published in the July 2002 issue of Embedded Systems Programming. If you wish to cite the article in your own work, you may find the following MLA-style information helpful:

Jones, Nigel. “Introduction to MISRA C.” Embedded Systems Programming, July 2002.

Use Lint for Static Code Analysis

December 17th, 2009 by Nigel Jones


Language specifications, including those for C and C++, are often loosely written. A static analysis tool called lint can help you find dangerous and non-portable constructs in your code before your compiler turns them into run-time bugs.

Anyone who has written a program has had to debug code. In many cases, after staring at the code for hours, we finally realize that we have done something stupid like read an uninitialized variable, or index beyond the end of an array. With experience, the incidence of these “doh!” bugs decreases; but they can still be extremely frustrating and costly.

What we need at such times is an infinitely patient, anal-retentive expert to inspect every inch of our code. This expert must be more thorough than any compiler, and should report back in a completely dispassionate way everything that is potentially wrong with our code. A development tool called lint is the closest you can get to this ideal.

Lint is a tool similar to a compiler in that it parses C or C++ source files. It checks these files for syntactical correctness. To lint a file called foo.c, one typically just types:

lint foo.c

at the command line. Naturally, though, there are also many optional command line switches.

Standard features

Whereas a compiler concerns itself primarily with code generation, lint is completely devoted to checking your code for a myriad of possible defects. The key word here is possible. Just because lint flags a section of your code for review, it doesn’t necessarily mean a problem will occur when you compile that code with your particular compiler.

Lint is designed to be compiler-agnostic and is, in fact, frequently in the business of focusing your attention on parts of the code that might result in different behavior depending on the specific compiler used.

The specific list of problems that lint checks varies by implementation and version of the tool. However, most flavors of lint will at least check for the following:

  • Possible indexing beyond array bounds
  • De-referencing of null pointers
  • Suspicious assignments (such as if (a = b))
  • Mismatches in variable types (such as foo declared as a double in one file and used as a long in another)
  • Potentially dangerous data type combinations
  • Unused variables
  • Unreachable code
  • Header files included multiple times and/or unnecessarily
  • Non-portable constructs

Lint checks so many things, in fact, that it’s common for the tool to initially produce as many errors and warnings as there are lines of code in the source file that’s input. This is a big turn-off for many potential users, since their attitude tends to be “this tool is so picky it’s ridiculous.” However, working through each warning and correcting it can be a rewarding exercise. As an example, consider this seemingly innocuous code:

main(void)
{
    int i;

    for (i = 0; i < 10l; i++)
    {
        printf("%d\n",i);
    }
}

What will be the last number printed when you run this program? If you answered “100,” you’re wrong. This is a perfectly valid program and will compile without warning on most any compiler. However, lint will complain about something. If you can’t see the problem by visual inspection, then I suggest you cut and paste the code above and run it through your favorite debugger. (A big hint: do not simply retype the program into your editor.) Observe how long it takes you to find the problem–and then ask yourself whether wading through lint warnings is really so bad.

Getting the lint out

What sort of real world benefits can you expect from addressing all of the warnings produced by lint? My experiences on a typical project involving a small microcontroller (total code size below 32KB) included the following:

  • Lint found two or three outright bugs–before I had even started testing the code
  • I learned something about the C language each time I ran lint
  • My final code was cleaner because lint informed me of unused variables, macros, and header file includes that could be safely removed
  • I was better informed of potential portability issues

Given the above, it will probably not surprise you to learn that organizations that are really serious about code quality often insist not only that all code compile without warnings (which is relatively trivial to achieve) but also that it be “lint free”–that is, generate no warnings with lint. This is a much more difficult criterion to achieve.

Figure 1: How lint fits into the development process

It’s worth looking at where lint fits into the development process. My general design flow is shown in Figure 1. Once I have code that compiles, I lint it. If the code gets through lint okay, it’s highly unlikely that I’ll be embarrassed in the code review. During the debug phase, it’s normal for changes to be made to the code. However, once the code is debugged, and before it is passed to the release test, I normally lint the code again. Why? I’m always amazed at the number of sloppy coding constructs that occur when code is being debugged. Lint is a great tool for identifying that ratty piece of code that was put in there to help debug something and then promptly forgotten.

Sources of lint

Lint is a standard tool on most Linux or Unix development systems. In the PC realm, however, you often have to go out and buy lint, or find a free or shareware version. If you do buy lint, rest assured it will likely be the best money you have ever spent in your embedded career. Most lint variants are relatively inexpensive (less than $1,000) and worth every penny.

Incidentally, you may be wondering how well lint handles all those nasty little compiler extensions that are so common in embedded development. This is an area where the commercial programs outshine the free and shareware offerings. In particular, some versions of lint allow considerable customization of lint’s rules, such that all those extensions are correctly handled. In some cases, the compiler definitions are even supplied by the lint vendor. In others, you may be able to get them from the compiler vendor.

If you don’t have access to lint, but are using the GNU tools (on Unix or a PC), simply use gcc’s -Wall flag to achieve about 80% of the same functionality.

So, for all the neophytes out there, get yourself a copy of lint, and use it. If nothing else, your boss will be impressed with the maturity of your code. For all the experienced hacks who aren’t using lint–watch out! The new guys who are using it might show you up.

Further reading

Wikipedia – Lint

Darwin, Ian F. Checking C Programs with Lint. Sebastopol, CA: O’Reilly, 1988.


This article was published in the May 2002 issue of Embedded Systems Programming. If you wish to cite the article in your own work, you may find the following MLA-style information helpful:

Jones, Nigel. “Introduction to Lint.” Embedded Systems Programming, May 2002.

Optimal C Constructs for 8051 Microcontrollers

December 17th, 2009 by Nigel Jones


The limitations of an 8-bit microcontroller (MCU) can sometimes make conventional C constructs produce suboptimal code. In this article we look at common problems on the 8051 family and discuss workarounds in C.

The 8-bit MCU marketplace is massive. And, despite the inroads made by newer architectures, the ancient 8051 continues to hold significant market share. But old 8-bit architectures, such as the 8051 and HC11, were never designed to be programmed in a high-level language such as C.

Compiler vendors work hard to make good compilers for these 8-bit microcontrollers. But even with all the ingenuity in the world, compilers must eventually bump up against a target MCU’s fundamental limitations, producing some surprising results.

Those of you familiar with the 8051 know that it’s not exactly a great architecture from a programmer’s perspective. Accordingly, a lot of 8051-based projects end up hungry for every CPU cycle and memory byte that can be saved. Since I enjoy working on 8-bit designs, I’ve spent many years wringing the last drop of performance out of such systems–and learning a lot in the process.

If you have a poor software design, inefficient algorithms, or haven’t taken the basic steps advocated by the compiler vendor, nothing I can tell you here will make a difference. However, if you have taken all of those steps, have optimal algorithms, and still need more performance (but don’t want to resort to assembly language), finding the optimal C coding constructs can help you get the job done.

On 32-bit systems with a plethora of registers and addressing modes, functionally equivalent coding constructs will probably produce identical performance. On 8-bit systems, however, a subtle change in coding style can affect performance significantly. Consider the following blocks of code, which are functionally equivalent.

// 8 bytes and 42 cycles
for (i = 0; i < 10; i++)
{
    _nop_();
}

// 15 bytes and 120 cycles
i = 0;
while (i++ < 10)
{
    _nop_();
}

As the comments indicate, the difference between these two fragments is significant. Given the large difference, you’d think the compiler vendors would publish a nice list of optimal techniques to which one could refer. Well, I have pored over dozens of compiler manuals and have never come across such a list. I can only assume that either the vendors aren’t aware of these differences (which seems hard to believe), or they consider it a sign of weakness in their product and don’t want to publicize the information.

Preliminaries

Working on the adage that “What they won’t tell me is probably really worth knowing,” I have created just such a list of optimal C constructs. This list was hard won and I intend to share it with you. However, a few caveats are in order.

First, this list is only definitive for an 8051 and version 6.02 of the Keil compiler using optimization level 8, favoring execution speed over size. I also used the no integer promote (NOIP) switch. Does this mean that if you are using something else, this article is a waste of your time? I hope not! Many of the results come about because of the 8051 architecture. Thus, if you are using another 8051 compiler, I suspect that the general results will hold true. If you are using another 8-bit machine, such as the HC11 or the PIC, then at the very least the article should give you some food for thought.

Second, I can’t hope to cover every possible permutation of each of the scenarios. If you think I’ve missed an important variant, please add it to the test code and let me know your results.

“Optimal” can mean many things. I think there is a tremendous case for claiming that optimal code is code that is easy to maintain. In this article, however, I use optimal to refer to code that is optimally fast, small, and consistent in its execution time; I err on the side of speed. Generally, however, I’ve found that the fastest code is also the smallest.

On many occasions I have seen compilers do weird things with subtle changes to code. As a result, the information I’ll present here should be taken as a “rule of thumb.” If you apply the techniques, most of the time you’ll get more optimal code. However, if you are trying to wring the last ounce of performance out of your system, you should really check the results for your specific case.

Finally, some of the tests reveal surprising weaknesses in the Keil compiler. It is not my intention to criticize their product. Indeed, I have used their compiler for over ten years, which says a lot about what I think of it. I also suspect that similar compilers would also show their idiosyncrasies when subjected to similar detailed inspection. I’d like to go on the record as saying I think Keil’s products are excellent–and put a lot of the bigger names in the industry to shame.

Coding conventions

A central problem in trying to demonstrate the various constructs is to keep the code very simple–but not so simple that the optimizer realizes the code is doing nothing useful, and hence optimizes it away to nothing. To this end, I make extensive use of the intrinsic “function call” _nop_().

In comparing various techniques, and to ensure that every technique gets a fair hearing, I make extensive use of conditional compilation. To try the various techniques, it is of course necessary to set the value of the various manifest constants in order to compile the code one way or another.

In most cases, I report the number of execution cycles, the code size, and the data size of the various fragments. The data size is instructive. However, given that this memory gets heavily overlaid on an 8051, don’t get too worked up about differences in this parameter. When relevant, I also report the size of the library code.

In some of the projects, I run a lot of variants in an attempt to deduce patterns and optimal constructs. I usually only report on my conclusions. I encourage you to study the results of the various test cases and see if you agree with my conclusions.

As a final note, the execution times are normally given in cycles assuming a standard 8051 (12 clocks per instruction cycle). When I am looking at the benefits of dual data pointers, I use a processor with a different number of clocks per cycle. In either case, what is really important is the ratio (or difference) of the execution times.

Code constructs

We’ll be looking at these areas:

  • Looping
  • Function calling
  • Decision making
  • Calling one of N functions
  • Pointer manipulation
  • Copying
  • Bit-banging

The next to last section, on copying, builds on the earlier topics, and the bit-banging section builds on everything. That is, we’ll apply some of the techniques to the problem of communicating with a serial device by twiddling port lines.

Looping

All nontrivial programs contain loops. Of these, the most common is the loop that has to execute a fixed number of times. C offers numerous ways to code this. To see if there is any significant difference between the various constructs, I put together the looping project, which contains eight different constructs. After testing the eight variants, I measured a variation in performance that is quite spectacular (2.5:1 in size and 3.5:1 in speed). Furthermore, the best performing construct is not exactly intuitive.

First, let’s look at the worst performing loop construct:

// Code: 15. Cycles: 124.
void fne(void)
{
    unsigned char i; 

    i = 0;
    while (i++ < 10)
    {
       _nop_();
    }
}

This is a fairly innocuous construct, but for various reasons it causes the compiler to generate a lousy solution. In contrast, three different constructs produced the optimal solution. The following is the most “standard” looking:

// Code: 6. Cycles: 35.
void fnc(void)
{
    unsigned char i; 

    for (i = 10; i; i--)
    {
        _nop_();
    }
}

Note that this solution (along with the other two optimal solutions) involves counting down. In contrast, let’s look at the results for a standard count-up for-loop:

// Code: 8. Cycles: 46.
void fna(void)
{
    unsigned char i; 

    for (i = 0; i < 10; i++)
    {
        _nop_();
    }
}

The above requires two more code bytes and an extra cycle per loop iteration. This is an interesting result, one in which the standard construct doesn’t produce the optimal code. This result comes about because the 8051 has a DJNZ (Decrement and Jump if Not Zero) instruction that is designed for efficient looping. However, it requires the loop variable to count down–hence the for-loop that counts down gives the best performance.

Note, however, that the for-loop that counts down is usually problematic if the loop variable is used within the body of the loop. For example, if we wanted to set the elements of an array to some value, we’d have a problem because the values of the looping variable would be 10 to 1 instead of 9 to 0. If you need to use the looping variable within the body of the loop (which is the more common case), then the standard count-up for-loop construct gives the best results.

It’s also worth noting that if you were to use a 16-bit variable as the loop index, the optimal construct would probably change. Since I almost never have to use a 16-bit variable as a loop index, I’ve decided to leave this case as an exercise for the reader. If anyone does try it, I’d be interested in the results.

Function calling

Calling a function incurs overhead. On modern processors, this typically consists of pushing the required parameters onto a stack and then executing a call. The called function pops the required parameters off of the stack, performs the required work, and then returns. For many of the modern processors, the penalty is not so much the pushing and popping of parameters, as it is the fact that a function call often results in a cache miss and the attendant system slow down.

In the 8051 world, however, parameters are usually passed via registers. (It’s horribly inefficient to use the compiler’s reentrant model with parameters pushed onto an external stack. Consequently, it should never be used except in a case of true need.) However, the 8051 only has a few registers to use to pass parameters. Keil’s rules for parameter passing appear in Table 1 (at ftp://ftp.embedded.com/pub/11jones/tables).

There are several things to note:

1. At most, only three parameters will be passed in registers

2. The order in which arguments are declared affects how many bytes are passed in registers

3. At most, only one long or float and only one generic pointer can be passed in registers

4. There is no provision for passing other data types such as a structure or a union in registers

Let’s look at the implications of each of these limitations.

Three parameter limit

Passing lots of parameters to a function on an 8-bit machine is not a great idea. Here are some things you can try to get around this limitation:

  • Group data into structures and pass a pointer to the structure
  • If you need to pass four chars, you can group them as a union of four chars and one long and then pass the long (not pretty, but it works)

Parameter ordering

Here’s a limitation that often surprises people. Consider a function foo() that requires a char, an int, and a generic pointer to be passed to it.

void foo(char a, int b, int *c);

This will result in all six bytes (one for a, two for b, and three for c) being passed in registers. However, rearranging the parameter order to:

void bar(int *c, int b, char a);

will result in only five bytes being passed in registers, with parameter a being passed in a fixed memory location. This results in a significant performance hit.

The code to demonstrate this is in Listing 1 (at ftp://ftp.embedded.com/pub/11jones/listings), which displays Test Case 1 of the function calling project. The first method requires three fewer cycles, three fewer code bytes, and six fewer data bytes than the second method. If this function is called often, the overall savings can be significant.

The bottom line is this: if you have a function that is called repeatedly, it is worthwhile to check the parameter order to make sure as many parameters as possible are passed in registers.

Passing pointers

Table 1 shows that only one generic pointer can be passed in registers, whereas up to three memory-space-specific pointers may be passed in registers. This is an unsubtle hint not to use generic pointers. To give you an indication of how severe the penalty is, consider Listing 2, which shows the difference between passing two generic pointers and passing two typed pointers.

The casts in foo() are intended to eliminate the time difference associated with the expression evaluation (the Keil compiler is smart enough to eliminate the overhead associated with the casts). Calling foo() required considerably more resources than calling bar(). This can all be attributed to the fact that bar() required more bytes (six versus four) to be passed, and that three of the bytes were passed in fixed memory locations. The obvious conclusion is to look very hard at any function that requires the passing of more than one generic pointer.

Passing other data types

The register argument rules are quite rigid. If you have the temerity to try and pass a bit or a structure as an argument (even if the structure consumes just two bytes), that parameter and all parameters to the right of it on the argument list are not passed in registers. Thus:

void foo(char a, int b, int *c, bit d);

will result in the first three parameters being passed in registers. However,

void bar(bit d, char a, int b, int *c);

will result in none of the parameters being passed in registers. The code to demonstrate this is Test Case 3 of the function calling project. The performance difference is significant and is shown in Table 2.

Similarly, the following code:

typedef struct
{
    char e;
    char f;

} STYPE;

void foo(STYPE a, int b);

also results in none of the parameters being passed in registers, since STYPE defines an unsupported type. The code to demonstrate this is in Listing 3.

In this case, foo() uses a massive amount of resources compared to bar(). Now admittedly, this has a lot to do with the compiler calling a function to copy the structure–rather than being smart enough to realize the structure is so small that it could easily fit into the available registers. However, I think it demonstrates that passing structures to a function is a really bad idea.

On a related note, I have also discovered that in certain circumstances, even when all the parameters may be passed in registers, changing the ordering of parameters sometimes makes a difference. Unfortunately, I am unable to come up with a rule of thumb to guide you on this one. All I can do is suggest you experiment on critical functions.

Return types

No discussion of function calling would be complete without mention of the impact of the data type that is returned. The Keil documentation shows that all basic types (char, int, float, *) will be returned in registers. A bit will be returned in the Carry flag. However, they make no mention of what happens if you try and return a structure. We’ll examine the case where you wish to get a copy of the data structure. Listing 4 examines this topic.

Thus, the simpler construct produced marginally better code. This is the first case of many where simplicity produces the better results.

As an aside, this example again demonstrates the inefficiency of the compiler’s data structure copy routine. For a trivial case like this, it would of course be better to return a pointer and perform an explicit element-by-element copy. I just find it surprising that the compiler doesn’t contain some form of heuristic algorithm for determining the optimal solution.

Function calling summary

Take a hard look at all of your function prototypes. The following changes will usually result in worthwhile performance increases:

1. Convert generic pointer arguments to typed pointer arguments

2. Make sure that any bit or structure arguments are passed last

3. Notwithstanding #2, don’t pass structures!

4. Minimize the number of arguments–but not at the expense of violating rule 3

5. Look at the ordering of your parameters to see if a reordering will result in all (or more) of the parameters being passed in registers

6. Experiment! If you have a function that is called regularly, experiment with the order of its parameters. You may be surprised at the gains you realize.

7. If you need to assign a structure from a function call, return the structure instead of a pointer to the structure.

Decision making

All programs are littered with if-else and switch-case constructs. On the face of it, there isn’t much one can do to change the system performance. However, certain questions that should be answered include:

  • Is the ternary operator more efficient than an if-else?
  • What’s the most efficient way to test something for being zero?
  • How does an index into an array of pointers to functions compare in performance to a switch statement?

The ternary operator

The ternary operator is terse, slightly cryptic and a favorite of mine. Examination of K&R shows that the authors use the ternary operator very liberally indeed. But does the ternary operator actually accomplish anything? Well, with the Keil compiler the answer is an emphatic no: the ternary version produced larger code and longer execution time in every test case I tried. Listing 5 shows Test Case 1 of the decision making project.

The code in main() is designed to test the functions “both ways.” Other tests, involving bits, floats, and so on, all produced similar results. This is the second example where a simpler coding style produces better code.

Evaluating a variable for TRUE / FALSE

This issue seems to occur frequently. In a nutshell, we are interested in determining if a variable is zero (FALSE) or nonzero (TRUE). This seems so trivial that you wouldn’t imagine that performance would vary widely depending upon the constructs used. Think again!

Test Code 2 from the decision-making project is shown in Listing 6. In this case, I have three functions whose sole task is to test a byte variable and return 0 if it is zero, and 1 if it is not. The result is represented as a bit. The first function solves the problem by using the most straightforward representation. For simplicity and clarity, it is hard to beat. It also has the possible benefit of having equal execution time regardless of the value of the parameter. The second example is more obfuscated while giving no real benefit (a byte saved versus increased and variable execution times). The third function is either incredibly simple or ridiculously obtuse, depending on your perspective. It does, however, give dramatically better performance and comes about from an exceedingly neat trick by the compiler.

This is an interesting example, since I think it demonstrates two points:

As we have seen in other examples, the compiler seems to generate better code when simpler constructs are used. Hence example 1 is better than example 2.

Example 3 gives the best result because it makes use of a type cast. Type casts occur so frequently that I suspect the compiler writers have put a lot of effort into generating optimal coding constructs for them.

Summary

From the preceding examples, we can come up with a few rules of thumb for decision-making:

  • Avoid the use of the ternary operator
  • When testing for TRUE / FALSE, omit all explicit tests and rely on the compiler to cast the variable to a Boolean

Calling one of N functions

Though this is a form of decision-making, the problem of calling one of N functions is interesting and complex enough to warrant its own section. Essentially, we wish to solve the problem of calling one of N functions based on some index value. There are three general approaches to this problem:

  • if-else chain
  • switch statement
  • Array of pointers to functions

If you are unfamiliar with arrays of pointers to functions, you can read my article “Arrays of Pointers to Functions” to find out more.

The test project (Pointer to Function) contains four test cases. In all cases, we are calling one of N functions, where the functions fna() … fnk() simply return a … k respectively.

In this case, we wish to call one of five functions based on an index whose range of legal values is 0…4. This is a small enough number of choices that any of the three techniques is reasonable. The results of the three techniques are summarized in Table 3.
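The test project's source is not reproduced here, but the three techniques for Test Case 1 can be sketched as follows. The function names fna() … fne() follow the article's convention; the bodies and dispatcher names are assumptions, and Keil's `code` memory-space qualifiers are omitted:

```c
static char fna(void) { return 'a'; }
static char fnb(void) { return 'b'; }
static char fnc(void) { return 'c'; }
static char fnd(void) { return 'd'; }
static char fne(void) { return 'e'; }

/* Technique 1: if-else chain */
static char else_fn(unsigned char i)
{
    if (i == 0)      return fna();
    else if (i == 1) return fnb();
    else if (i == 2) return fnc();
    else if (i == 3) return fnd();
    else if (i == 4) return fne();
    else             return 0;      /* illegal index */
}

/* Technique 2: switch statement */
static char switch_fn1(unsigned char i)
{
    switch (i)
    {
        case 0:  return fna();
        case 1:  return fnb();
        case 2:  return fnc();
        case 3:  return fnd();
        case 4:  return fne();
        default: return 0;          /* illegal index */
    }
}

/* Technique 3: array of pointers to functions, with limit checking */
static char ptof_fn1(unsigned char i)
{
    static char (* const pf[5])(void) = { fna, fnb, fnc, fnd, fne };

    return (i < 5U) ? pf[i]() : 0;
}
```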

Note that index 6 is an illegal value, and hence tests how the technique handles this problem. There are a number of interesting results here:

  • The else-if and switch techniques have identical code size. However, examination of the assembly language shows that completely different techniques are used.
  • The else-if construct execution time increases with the value of the index. This seems reasonable, since we first test for 0, then for 1 and so on.
  • The pointer-to-function code is considerably more compact than the other two techniques.
  • The switch construct execution time decreases with the value of the index. This comes about from a rather elegant code construct that the compiler generates. However, I’d consider this counter-intuitive.
  • The pointer to function code execution time is constant for all valid indices.
  • The pointer to function code execution time is considerably longer than the other techniques. However, a large portion of this is due to the limit checking on the index. If this can be eliminated (because the index is guaranteed to be within a valid range), then the penalty is less significant.

If the number of elements in the switch statement exceeds six, the Keil compiler uses a different technique for evaluating the switch statement. Test Case 2 examines this point.

Test Case 2 is identical to Test Case 1, except that the list of possible functions is extended to 11. I suspect that most people would eschew an else-if chain at this point, and so I have dropped it from the test.

The results of the two remaining techniques are summarized in Table 4.

Note that index 20 is an illegal value, and hence tests how the technique handles this problem.

Again, there are a number of interesting results.

  • The switch statement now produces consistent execution time (except for the default case).
  • The code size of the switch statement is significantly larger than the pointer to function approach.
  • The switch statement still has a significant edge in execution time. Again, if the limit checking code can be removed then this difference becomes much smaller.

Test Case 3 reproduces Test Case 2, except the indices are offset. The results are basically the same as Test Case 2, showing that the compiler is smart enough to handle nonzero-based contiguous indices. However, what happens when the indices are non-uniformly spaced? Test Case 4 addresses this question.

In Test Case 4, we still have 11 possible function calls. However, the indices for the functions are non-uniformly spaced. For the switch-based code, we simply fill in the requisite values as shown below:

char switch_fn(unsigned char i)
{
    char res; 

    switch(i)
    {
      case 0x30:
        res = fna();
        break; 

      case 0x35:
        res = fnb();
        break; 

      case 0x36:
        res = fnc();
        break; 

      case 0x39:
        res = fnd();
        break; 

      case 0x3C:
        res = fne();
        break; 

      case 0x41:
        res = fnf();
        break; 

      case 0x42:
        res = fng();
        break; 

      case 0x48:
        res = fnh();
        break; 

      case 0x4A:
         res = fni();
         break; 

      case 0x4D:
        res = fnj();
        break; 

      case 0x4E:
        res = fnk();
        break; 

      default:
        res = 0;
        break;
    } 

    return res;
}

For the pointer to function approach, we fill in the gaps with calls to a default function; in other words, a function that returns the default value of zero. Thus, the executable code looks the same. It’s just the lookup table that has increased in size. The code to do this follows:

char ptof_fn(unsigned char i)
{
    static char (code * code pf[])(void) = {
        fna, fnDefault, fnDefault, fnDefault,
        fnDefault, fnb, fnc, fnDefault,
        fnDefault, fnd, fnDefault, fnDefault,
        fne, fnDefault, fnDefault, fnDefault,
        fnDefault, fnf, fng, fnDefault,
        fnDefault, fnDefault, fnDefault, fnDefault,
        fnh, fnDefault, fni, fnDefault, fnDefault, fnj, fnk
    };

    char res;

    if ((i >= 0x30) && (i <= 0x4E))
    {
        res = pf[i-0x30]();
    }
    else
    {
        res = 0;
    }

    return res;
}

The results of this test are shown in Table 5.

Note that index 49 is an illegal value, and hence tests how the technique handles this problem.

Again, there are a number of really interesting results here:

  • The switch statement code now produces variable execution times
  • The execution time of the switch increases significantly with increasing index
  • The default value execution time of the switch is dramatically increased, which would be problematic if the default case was the most common
  • The switch footprint is considerably larger than the function pointer approach. However, much of this is attributable to library code, which presumably would be reused in other switch evaluations.

Summary

From the preceding examples, we can come up with a few rules of thumb for calling one of N functions:

  • If consistent execution time is important, use an array of function pointers. But if the number of indices is greater than six and the indices are contiguous, use a switch statement.
  • If code size is important, always use an array of function pointers.
  • If average execution time is of paramount concern, then you have to do some analysis:
    • For six or fewer choices, use an if-else chain.
    • For more than six choices, where the choices are contiguous, use a switch statement.
    • For all other cases, use an array of function pointers.

If you have a situation where the number of choices may be altered over time, and especially if the choices may become noncontiguous, then use an array of function pointers. This will prevent big changes in performance from small changes in code.

Pointer manipulation

Pointers are an integral part of most C programs. In addition to dereferencing them, it is common to increment or decrement them within the body of a loop. Anything that is regularly performed within a loop is worth looking at carefully to see if any optimizations are possible. To this end, I created the pointer manipulation project. This project attempts to answer the following questions:

  • How does combining the pointer manipulation with the dereference affect performance?
  • Is there any difference between incrementing and decrementing a pointer?
  • What effect does the memory space have on these issues?

The answers to these questions are not intuitively obvious because the 8051 has some weird characteristics:

1. When accessing data space via a pointer register (for example, R0), the register can be incremented or decremented with equal ease.

2. In contrast, when accessing external data space via the DPTR register, the instruction set only defines an increment operation. Thus, we might expect pointer decrements to be more problematic in this case.

The pointer manipulation project contains a dozen functions that all perform the equivalent of memset(). In each case we are trying to set the contents of a 20-byte array to a specified value.

Function calls att1(), att2(), and att3() examine the optimal way to combine a data space pointer dereference with a pointer increment. The results are summarized in Table 6, and are very interesting. There is a massive penalty to be paid for combining the pointer increment with the dereference as done in att1(). Surely, I thought to myself, this must be some manifestation of dealing with internal data space pointers? So I performed the same test using external data space pointers. Function calls att7(), att8(), and att9() were the test code. The results are summarized in Table 7. As you can see, we get the same basic result. This was quite a shock to me, since my coding style preference is to use *ptr++.

I know that there is a school of thought in which my preferred way of doing things is bad, as I’m doing too much in one line of code. Well, for those of you who subscribe to this school, here’s evidence that (at least for this compiler) your way of doing things is also more efficient.

It’s worth commenting on the fact that the pre-increment (att9()) offers a one cycle saving over the post-increment (att8()). This was also a surprise to me. For those of you who don’t like pre-increments (because it looks strange), go ahead and use the post-increment, since the penalty is miniscule. For those of you who absolutely have to have the best performance, use the pre-increment.

Here’s att9()’s implementation:

void
att9(unsigned char xdata *ptr,
    unsigned char val, unsigned char len)
{
    unsigned char i;

    for (i = len; i; i--)
    {
        *ptr = val;
        ++ptr;
    }
}

Note the use of the count-down for-loop to achieve optimal looping time.
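The count-down idiom pays off because the 8051's DJNZ instruction decrements a register and branches on non-zero in a single step, whereas a count-up loop needs an explicit compare against the limit every pass. A minimal sketch of the two styles (the function names are hypothetical):

```c
/* Count-up: each iteration compares i against len */
static unsigned char sum_up(const unsigned char *p, unsigned char len)
{
    unsigned char i;
    unsigned char sum = 0;

    for (i = 0; i < len; i++)
    {
        sum += p[i];
    }
    return sum;
}

/* Count-down: maps onto DJNZ, which decrements and tests for
   zero in one instruction */
static unsigned char sum_down(const unsigned char *p, unsigned char len)
{
    unsigned char i;
    unsigned char sum = 0;

    for (i = len; i; i--)
    {
        sum += p[i - 1U];
    }
    return sum;
}

static const unsigned char sum_sample[4] = { 1, 2, 3, 4 };
```

Both loops visit the same elements; only the loop-control code differs.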

Having obtained such a surprising result, I was obviously intrigued about other pointer manipulations. How does pointer incrementing compare to pointer decrementing?

Functions att4(), att5(), and att6() are identical to att1() … att3() except that we start at the top of the array and decrement the pointer. The results were identical. This was expected since the index register for data space access can be as easily decremented as incremented.

The same is not true for external data space access, however, where there is no provision for decrementing the DPTR register. Functions att10() … att12() are the equivalents of att7() … att9(). Once again the results were surprising; they’re summarized in Table 8.

In the case where we combine the pointer increment/decrement with the dereference, (att7() and att10()), we are better off using a pointer decrement. In contrast, where we split the instructions apart there is a significant penalty to be paid for decrementing the pointer. Finally, note that the pre-decrement also holds a one cycle advantage over the post-decrement, just as for the increment case.

A reasonable question to ask is “Do these results hold when we are pointing to anything other than a character?” I don’t know. Most of my 8051 pointer manipulation seems to occur on communication buffers, so I haven’t been motivated to find out the answer to this question.

As a final note on this topic, I also compared calling memset() to the optimal solutions. memset() was considerably faster, but required more code space. The copying memory project that we look at next examines this issue in more detail.

Copying memory

Copying one area of memory to another is a common requirement. Indeed, it is so common that ANSI defined memcpy() to do this for us. In just about every environment, one would expect that memcpy() would be the fastest way to copy memory. However, this isn’t always the case with the 8051. The reason is a combination of the ANSI standard, the 8051 architecture, and implementation details of the compiler.

Let’s start with our friends at ANSI. First, the memcpy() library routine is required to return a pointer. Although I understand the rationale for this, I almost never use the result of a call to memcpy(). Thus the compiler writers are forced to generate and execute code that is of no interest to me. Secondly, memcpy() is required to take a length argument of type size_t, which on the 8051 is a 16-bit unsigned integer. This would be OK, except that on most of my 8051 projects I normally need to copy fewer than 256 bytes, so that the length parameter could be an unsigned char. Thus, ANSI forces me to pass two bytes to memcpy() where one byte would do fine. In summary, ANSI handicaps the memcpy() writer in three ways:

  • Requires a pointer to be returned
  • Requires the length parameter to be an integer, resulting in more bytes being passed than is normally necessary
  • Requires the length parameter to be an integer, resulting in the looping variable being unnecessarily large

Of course, if a particular invocation does require copying more than 256 bytes, then the last two complaints are moot.

The length issue is nothing compared to the problems caused by the 8051. The 8051 has four different address spaces that make sense from a memcpy() perspective (using the Keil nomenclature: IDATA, PDATA, XDATA, and CODE). Three of these are legal destination spaces, while all four are legal as the source of the data. Thus, there are 12 possible memory space combinations that have to be supported. To accommodate this, Keil defines memcpy() to take two generic pointers. For those of you that are really paying attention, you’ll remember that the Keil parameter passing rules don’t allow two generic pointers to be passed in registers. However, examination of the memcpy() assembly language shows that Keil doesn’t follow its own rules. In fact it passes both pointers and the length integer in registers. (The cynic in me wants to know why they are allowed to do this and I’m not.)

Anyway, at this point, the memcpy() routine has two generic pointers it has to deal with. Keil’s solution is to determine at run time the memory spaces that the pointers point to and to call the appropriate function. I’m a little surprised that Keil cannot determine this information at compile time and simply call the requisite routine. However, I don’t write compilers for a living, so I’m sure there is a good reason.

To summarize, the memcpy() routine has the following handicaps:

  • Needs eight bytes to be passed in registers
  • Uses a 16-bit looping index, even when eight bits will do
  • Determines at run time the memory spaces to use
  • Has to return a value

Against these, the compiler has two advantages:

  • Hand optimized assembly language
  • If the 8051 derivative has two DPTR registers (a common extension to many 8051 families), the function can make use of this to dramatically improve performance for XDATA-XDATA or CODE-XDATA moves

Based on this information, it would seem reasonable that memcpy() may not be the optimal solution in all cases. The memcpy project was put together to examine this very issue. The project contains 12 tests. Some of these use memcpy(), while others use variations on for loops and array/pointer manipulations. These 12 tests are not exhaustive, but do serve to demonstrate a reasonable pattern. Tests from code space to data space are not shown, since this doesn’t seem to occur very frequently.

The detailed results are given in Table 9. A few comments are needed on some of the columns.

# DPTR registers: This specifies how many DPTR registers I told the compiler were available. The only time this is relevant is when doing an XDATA – XDATA (or CODE – XDATA) copy using memcpy(). In all other cases, the compiler simply ignores the availability of a second DPTR register.

Implementation: This gives a very brief summary of the basic technique used to achieve the result. For-loop/arrays means that the code looked something like this:

for (i = 0; i < SIZE; i++)
{
    to[i] = ex[i];
}

For-loop/pointers means that the code is based on constructs similar to:

for (i = 0, toptr = to, exptr = ex; i < SIZE; i++)
{
    *toptr++ = *exptr++;
}

You’ll have to look at the source code to get the definitive statement on each implementation.

Times: The 1-byte and 11-byte times allow me to compute the setup time (the time taken to set up all the looping constructs) together with the per byte time (the delta time over the setup time for each additional byte to be copied). This information allows me to compute the break-even point for comparing various techniques.

Several conclusions may be drawn from these results:

1. The memcpy() function requires about 56 to 64 cycles of setup time. This is a lot, and is attributable to the number of registers to be loaded and the fact that the memory spaces are determined at run time.

2. The most efficient method of coding one’s own copying data routine seems to be the most straightforward, namely a for-loop with arrays (test 8). This is another example of where the simplest coding construct works best. It’s also notable that the array-based code is faster than pointer-based code. So much for the adage of “pointers are faster than array subscripting.”

3. If we compare the for-loop to memcpy(), the break-even point occurs around seven bytes in most cases. So if you have to copy more than seven bytes, use memcpy(). Otherwise use a for-loop.

4. The big exception is when doing XDATA-XDATA moves on an 8051 with a single DPTR register. In this case, we are comparing the results of test 4a with test 8. Here memcpy() is always slower than the for-loop. Of course, if you need to copy more than 256 bytes, this result doesn’t hold. Notwithstanding this, I still find this result pretty amazing.

If speed is what you need, then use these results. However, if you are copying blocks of data around as part of system initialization, I urge you to consider code size, readability and maintainability before making your decision. Indeed, if code size is your prime consideration, you may want to avoid memcpy(), even if you use it in multiple places. The reason for this is quite subtle. It takes about 19 bytes of code to call memcpy() (the compiler has to load eight registers). In addition, the actual library code is 246 bytes. By comparison, the code size for Test 5 is only 16 bytes, while the code size for Test 8 is 30 bytes. You’d have to duplicate this code numerous times to equal the code required by memcpy.

Bit banging

By way of a conclusion, I thought we’d take a look at a reasonably realistic example that allows me to demonstrate a few of the techniques that have been discussed. The example consists of an external peripheral–perhaps a 16-bit digital to analog converter, which requires 24 bits of command and data to be clocked into it.

The bottom line: fancy coding tricks are no substitute for good design.

Source code

The code is available at ftp://ftp.embedded.com/pub/2002/11jones, where you’ll also find all of the other test code referenced in this article. In the final project, there are eight different C programs and one assembly language program. There is a 16:1 variation in speed and a significant variation in code size. However, if you bother to examine the code, you’ll find that the biggest improvement in speed came from a change in algorithm. The techniques described here are the icing on the cake.


This article was published in the October 2002 issue of Embedded Systems Programming. If you wish to cite the article in your own work, you may find the following MLA-style information helpful:

Jones, Nigel. “‘Optimal’ C Constructs for 8051 Microprocessors,” Embedded Systems Programming, October 2002

How to Use C’s Volatile Keyword

December 17th, 2009 by Nigel Jones

Also available in PDF version.

The proper use of C’s volatile keyword is poorly understood by many programmers. This is not surprising, as most C texts dismiss it in a sentence or two. This article will teach you how to use volatile correctly.

Have you experienced any of the following in your C or C++ embedded code?

  • Code that works fine—until you enable compiler optimizations
  • Code that works fine—until interrupts are enabled
  • Flaky hardware drivers
  • RTOS tasks that work fine in isolation–until some other task is spawned

If you answered yes to any of the above, it’s likely that you didn’t use the C keyword volatile. You aren’t alone. The use of volatile is poorly understood by many programmers. Unfortunately, most books about the C programming language dismiss volatile in a sentence or two.

C’s volatile keyword is a qualifier that is applied to a variable when it is declared. It tells the compiler that the value of the variable may change at any time–without any action being taken by the code the compiler finds nearby. The implications of this are quite serious. However, before we examine them, let’s take a look at the syntax.

volatile keyword syntax

To declare a variable volatile, include the keyword volatile before or after the data type in the variable definition. For instance both of these declarations will declare foo to be a volatile integer:

volatile int foo;
int volatile foo;

Now, it turns out that pointers to volatile variables are very common, especially with memory-mapped I/O registers. Both of these declarations declare pReg to be a pointer to a volatile unsigned 8-bit integer:

volatile uint8_t * pReg;
uint8_t volatile * pReg;

Volatile pointers to non-volatile data are very rare (I think I’ve used them once), but I’d better go ahead and give you the syntax:

int * volatile p;

And just for completeness, if you really must have a volatile pointer to a volatile variable, you’d write:

int volatile * volatile p;

Incidentally, for a great explanation of why you have a choice of where to place volatile and why you should place it after the data type (for example, int volatile * foo), read Dan Sak’s column “Top-Level cv-Qualifiers in Function Parameters” (Embedded Systems Programming, February 2000, p. 63).

Finally, if you apply volatile to a struct or union, the entire contents of the struct/union are volatile. If you don’t want this behavior, you can apply the volatile qualifier to the individual members of the struct/union.

Proper use of volatile

A variable should be declared volatile whenever its value could change unexpectedly. In practice, only three types of variables can change unexpectedly:

1. Memory-mapped peripheral registers

2. Global variables modified by an interrupt service routine

3. Global variables accessed by multiple tasks within a multi-threaded application

We’ll talk about each of these cases in the sections that follow.

Peripheral registers

Embedded systems contain real hardware, usually with sophisticated peripherals. These peripherals contain registers whose values may change asynchronously to the program flow. As a very simple example, consider an 8-bit status register that is memory mapped at address 0x1234. It is required that you poll the status register until it becomes non-zero. The naive and incorrect implementation is as follows:

uint8_t * pReg = (uint8_t *) 0x1234;

// Wait for register to become non-zero
while (*pReg == 0) { }

// Do something else

This will almost certainly fail as soon as you turn compiler optimization on, since the compiler will generate assembly language that looks something like this:

  mov ptr, #0x1234
  mov a, @ptr

loop:
  bz loop

The rationale of the optimizer is quite simple: having already read the variable’s value into the accumulator (on the second line of assembly), there is no need to reread it, since the value will always be the same. Thus, in the third line, we end up with an infinite loop. To force the compiler to do what we want, we modify the declaration to:

uint8_t volatile * pReg = (uint8_t volatile *) 0x1234;

The assembly language now looks like this:

  mov ptr, #0x1234

loop:
  mov a, @ptr
  bz loop

The desired behavior is achieved.

Subtler problems tend to arise with registers that have special properties. For instance, a lot of peripherals contain registers that are cleared simply by reading them. Extra (or fewer) reads than you are intending can cause quite unexpected results in these cases.

Interrupt service routines

Interrupt service routines often set variables that are tested in mainline code. For example, a serial port interrupt may test each received character to see if it is an ETX character (presumably signifying the end of a message). If the character is an ETX, the ISR might set a global flag. An incorrect implementation of this might be:

int etx_rcvd = FALSE;

void main()
{
    ...
    while (!etx_rcvd)
    {
        // Wait
    }
    ...
}

interrupt void rx_isr(void)
{
    ...
    if (ETX == rx_char)
    {
        etx_rcvd = TRUE;
    }
    ...
}

With compiler optimization turned off, this code might work. However, any half-decent optimizer will “break” the code. The problem is that the compiler has no idea that etx_rcvd can be changed within an ISR. As far as the compiler is concerned, the expression !etx_rcvd is always true, and, therefore, you can never exit the while loop. Consequently, all the code after the while loop may simply be removed by the optimizer. If you are lucky, your compiler will warn you about this. If you are unlucky (or you haven’t yet learned to take compiler warnings seriously), your code will fail miserably. Naturally, the blame will be placed on a “lousy optimizer.”

The solution is to declare the variable etx_rcvd to be volatile. Then all of your problems (well, some of them anyway) will disappear.
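The corrected pattern looks like this. The ISR body is modeled as an ordinary function so it can be exercised on a host; in the real system it would carry the compiler's interrupt keyword. The ETX value of 0x03 (ASCII end-of-text) is an assumption:

```c
#include <stdbool.h>

#define ETX 0x03  /* ASCII end-of-text; assumed value */

/* volatile tells the compiler the flag can change outside
   normal program flow, so every read goes to memory */
static volatile bool etx_rcvd = false;

/* Stand-in for the serial receive ISR */
static void rx_isr_model(char rx_char)
{
    if (ETX == rx_char)
    {
        etx_rcvd = true;
    }
}
```

With `etx_rcvd` declared volatile, the mainline `while (!etx_rcvd)` loop re-reads the flag on every iteration and the optimizer can no longer collapse it into an infinite loop.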

Multi-threaded applications

Despite the presence of queues, pipes, and other scheduler-aware communications mechanisms in real-time operating systems, it is still fairly common for two tasks to exchange information via a shared memory location (that is, a global). Even as you add a preemptive scheduler to your code, your compiler has no idea what a context switch is or when one might occur. Thus, another task modifying a shared global is conceptually identical to the problem of interrupt service routines discussed previously. So all shared global variables should be declared volatile. For example, this is asking for trouble:

int cntr;

void task1(void)
{
    cntr = 0; 

    while (cntr == 0)
    {
        sleep(1);
    }
    ...
}

void task2(void)
{
    ...
    cntr++;
    sleep(10);
    ...
}

This code will likely fail once the compiler’s optimizer is enabled. Declaring cntr to be volatile is the proper way to solve the problem.

Final thoughts

Some compilers allow you to implicitly declare all variables as volatile. Resist this temptation, since it is essentially a substitute for thought. It also leads to potentially less efficient code.

Also, resist the temptation to blame the optimizer or turn it off. Modern optimizers are so good that I cannot remember the last time I came across an optimization bug. In contrast, I come across failures by programmers to use volatile with depressing frequency.

If you are given a piece of flaky code to “fix,” perform a grep for volatile. If grep comes up empty, the examples given here are probably good places to start looking for problems.


This article was published in the July 2001 issue of Embedded Systems Programming. If you wish to cite the article in your own work, you may find the following MLA-style information helpful:

Jones, Nigel. “Introduction to the Volatile Keyword” Embedded Systems Programming, July 2001

Minimize Interrupt Service Routine Overhead

December 17th, 2009 by Nigel Jones

Also available in a PDF version.

With all the automation available today, it’s easy for programmers to overlook costly overhead introduced into machine code by the compiler. Interrupt handlers are one key area worthy of a closer inspection.

In the early days of embedded C compilers, interrupt service routines (ISRs) had to be written in assembly language. Today, most compilers let the developer identify a function as an ISR, with the compiler taking care of all of the hassles associated with the ISR. This could include placing the correct entry into the interrupt vector table, stacking and unstacking registers, and terminating the function with a return from interrupt instruction. There are even compilers available that know when an ISR needs to clear a particular flag before the ISR terminates and will insert the proper code.

Although these advances are most welcome, they have come at a price–namely that it’s so easy to write an interrupt handler in a high-level language that one can easily lose track of the overhead that the compiler is introducing. Doing so can unwittingly result in a high price for the convenience of using a high-level language.

ISR overhead can be split into two parts, fixed and variable. Fixed overhead is associated with the CPU detecting that an interrupt has occurred, vectoring to the ISR, clearing any required flags to prevent the ISR from being re-entered, and finally executing the return from interrupt instruction. Thus, an ISR that does nothing, such as the one shown in Listing 1, still incurs some CPU overhead every time it occurs. Naturally, there isn’t much a compiler can do about the fixed overhead.

__interrupt void timer_isr(void)
{
    /* Do nothing */
}

Listing 1. Do nothing ISR

ISR variable overhead is the time spent stacking and unstacking the registers used in the ISR. This is a key topic, as it’s not uncommon for the time spent stacking and unstacking registers to dwarf the time spent doing useful work. One of the ironies of using modern, register-rich CPUs is that the presence of all those registers encourages the compiler writers to use them! Normally, this leads to very fast, tight code. However, in an ISR, all those registers can work to our detriment.

Consider an ISR that uses 12 registers (and thus incurs 12 push and 12 pop operations). Even though the ISR’s body may be faster than code that uses just six registers, the overall execution time of the ISR may increase simply because it must stack and unstack so many registers. For infrequently occurring interrupts, this is irrelevant. However, for interrupts that occur frequently, this overhead can rapidly consume a very large portion of the CPU’s bandwidth.

For example, consider a full-duplex serial link running at 38,400 baud on a CPU with a 1-MHz instruction cycle. If the CPU is interrupted upon receipt and transmission of every character, then assuming 10 bits per byte, it’ll be interrupted every 10^6 / (38,400 / 10) / 2 ≈ 130 µs. If each interrupt requires 12 registers to be pushed and popped, then the processor spends 24/130 ≈ 18% of its time doing nothing more than stack operations. A coding change that requires only six registers to be stacked would free up 9% of the CPU time and save stack space.
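The arithmetic above can be reproduced in a few lines of C. Note that the one-instruction-cycle-per-push figure (1 µs at 1 MHz) is an assumption for illustration; the real cost of a push or pop varies by CPU:

```c
#include <stdint.h>

/* Microseconds between interrupts for a full-duplex serial link that
   interrupts once per character in each direction. */
static double isr_period_us(double baud, double bits_per_char)
{
    double chars_per_sec = baud / bits_per_char;
    return 1e6 / chars_per_sec / 2.0;      /* two channels: RX and TX */
}

/* Fraction of CPU time spent stacking registers, assuming one
   instruction cycle (1 us on a 1-MHz CPU) per push or per pop. */
static double stacking_overhead(int registers, double period_us)
{
    return (2.0 * registers) / period_us;  /* each register: one push + one pop */
}
```

For the 38,400-baud example, `isr_period_us(38400.0, 10.0)` comes out at roughly 130 µs, and `stacking_overhead(12, 130.2)` at roughly 0.18, matching the figures in the text.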

To determine how much overhead you are incurring in an ISR, have the compiler generate an assembly language listing of your C code, preferably with the C instructions interleaved. An example is shown in Listing 2.

    160 static __interrupt void timer0_CompareMatchAIsr(void)
   \   timer0_CompareMatchAIsr:
    161 {
   \   00000000   938A               ST      -Y, R24
   \   00000002   93FA               ST      -Y, R31
   \   00000004   93EA               ST      -Y, R30
   \   00000006   923A               ST      -Y, R3
   \   00000008   922A               ST      -Y, R2
   \   0000000A   921A               ST      -Y, R1
   \   0000000C   920A               ST      -Y, R0
   \   0000000E   937A               ST      -Y, R23
   \   00000010   936A               ST      -Y, R22
   \   00000012   935A               ST      -Y, R21
   \   00000014   934A               ST      -Y, R20
   \   00000016   933A               ST      -Y, R19
   \   00000018   932A               ST      -Y, R18
   \   0000001A   931A               ST      -Y, R17
   \   0000001C   930A               ST      -Y, R16
   \   0000001E   B78F               IN      R24, 0x3F
    162 TCCR0B = 0;    /* Stop the timer */
   \   00000020   E000               LDI     R16, 0
   \   00000022   BF03               OUT     0x33, R16
    163 fifo_AddEvent(Event);    /* Post the event */
   \   00000024   9100....           LDS     R16, Event
   \   00000028   ....               RCALL   fifo_AddEvent
    164 }
   \   0000002A   BF8F               OUT     0x3F, R24
   \   0000002C   9109               LD      R16, Y+
   \   0000002E   9119               LD      R17, Y+
   \   00000030   9129               LD      R18, Y+
   \   00000032   9139               LD      R19, Y+
   \   00000034   9149               LD      R20, Y+
   \   00000036   9159               LD      R21, Y+
   \   00000038   9169               LD      R22, Y+
   \   0000003A   9179               LD      R23, Y+
   \   0000003C   9009               LD      R0, Y+
   \   0000003E   9019               LD      R1, Y+
   \   00000040   9029               LD      R2, Y+
   \   00000042   9039               LD      R3, Y+
   \   00000044   91E9               LD      R30, Y+
   \   00000046   91F9               LD      R31, Y+
   \   00000048   9189               LD      R24, Y+
   \   0000004A   9518               RETI

Listing 2. Mixed C and generated assembly language listing of two-line ISR

Even if you aren’t familiar with your CPU’s instruction set, the push and pop operations are usually easy to spot. Note that this two-line ISR requires 15 registers to be stacked and unstacked.

I’ll present various techniques for minimizing ISR overhead. The order they’re presented in follows a project’s progression. The first suggestions are only applicable at the start of a project, whereas the final suggestions apply to the poor soul at the end of a project who needs to improve performance without changing anything.

CPU selection

If you’re a regular reader of Embedded Systems Design, you know that the latest CPUs offer amazing capabilities; yet survey after survey shows that huge numbers of us continue to work with ancient architectures such as the 8051. I previously wrote this off as a simple case of inertia until I went through an interesting exercise about a year ago, when I had to select a small CPU core to be embedded in an ASIC. The ASIC was to be used in a portable product where ultra low-power consumption was critical. As is the case for most low-power products, most of the code was to be executed under interrupt, and hence ISR overhead was a major concern.

To determine the best CPU for the application, example pieces of code were compiled using the best compilers I could find for each CPU candidate, and the energy consumed by each CPU was calculated. The results were quite clear: the older CPUs such as the 8051 and the HC08 from Freescale had a distinct advantage over the newer, register-rich architectures. This advantage was due in part to the low ISR overhead of these devices. In the case of the 8051, this advantage also had a lot to do with the fact that the 8051 allows one to allocate a register bank to a particular block of code–such as an ISR–and thus drastically reduce the number of registers that need to be stacked.

The bottom line is if you know your code will be dominated by interrupt handlers, don’t be in too much of a hurry to embrace the latest and greatest register-rich CPU.

Compiler selection

As will be shown shortly, a major factor in ISR performance is your compiler’s register allocation strategy. If you aren’t familiar with this term, it refers to the algorithm the compiler uses to allocate registers across function calls. In broad terms, a compiler has three options:

  • Assume that a called function will trash every register
  • Require that the called function preserve every register that it uses
  • Use a hybrid approach, whereby certain registers are deemed scratch registers–and hence may be trashed by a called function, and other registers are preserved registers–and thus must be preserved if used by a called function

Now, if your compiler has a sophisticated global register-coloring algorithm, whereby it can track register allocation across inter-module function calls (and so eliminate all unnecessary register-preservation instructions), then these three strategies effectively collapse into the same thing, and the choice is of little concern. However, if your compiler doesn’t do this, then you need to be aware of how its allocation strategy affects function calls made from within interrupt handlers.

Firmware design

Once you’ve chosen your CPU and compiler, you’ll probably turn your attention to the detailed firmware architecture. A key decision that you’ll have to make is what interrupts are needed and what gets done within them. In making these decisions, be aware of the consequences of making function calls from within an ISR.

Consider the example shown in Listing 3, which is the source code for the assembly language listing shown in Listing 2. On the face of it, it meets the criteria of keeping interrupt handlers short and simple. After all, it’s just two lines of code.

void fifo_AddEvent(uint8_t event);

__interrupt void timer_isr(void)
{
    TCCR0B = 0;			/* Stop the timer */
    fifo_AddEvent(Event);	/* Post the event */
}

Listing 3. A “simple” ISR

However, as Listing 2 shows, a huge number of registers have been stacked. The reason is quite simple: the compiler that generated this code uses a hybrid register allocation strategy. Thus, without any other information, the compiler has to assume that the called function will use all the scratch registers it’s allowed to use–and so stacks all of them. Now, if the called function is a complex function, then stacking all the registers is unfortunate but necessary. However, if the called function is simple, then most of the register stacking is probably unnecessary. In order to avoid this problem, don’t make function calls from within an ISR.

This advice falls into the same category as soldiers being advised to avoid dangerous situations for health reasons. It’s true, but useless most of the time. Presumably, if you’re making a function call, you’re doing it for a good reason. In which case, what can you do about it? In a nutshell, you must do something that lets the compiler know what registers the called function is using. There are three ways you can do this, but they all amount to the same thing.

If you are writing in C++ or C99, make the function that’s called from the ISR an inline function. This will result in copies of the function’s code being placed at each call site throughout the project, but it should result in only the required registers being stacked. It’s your call as to whether the tradeoff is acceptable.

If you are using C89, you can do something similar by using macros and global variables, with all the problems that this entails.

The third option is to ensure that the ISR and the called function reside in the same module. In this case, a good compiler can see exactly what you’re calling and thus stack only the required registers. Furthermore, if it’s possible to declare the called function as static, there’s a good chance that the compiler will inline the function and also eliminate the overhead of the actual function call.
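A minimal sketch of that third option is shown below. The `__interrupt` qualifier is compiler-specific, so it is stubbed out here to keep the fragment portable, and the FIFO details are purely illustrative:

```c
#include <stdint.h>

#ifndef __interrupt
#define __interrupt   /* compiler-specific ISR qualifier, stubbed for portability */
#endif

static volatile uint8_t Event;
static uint8_t fifo[16];
static uint8_t fifo_head;

/* static: visible only in this module, so the compiler can see exactly
   which registers the function uses -- and will quite possibly inline it. */
static void fifo_AddEvent(uint8_t event)
{
    fifo[fifo_head & 0x0Fu] = event;
    fifo_head++;
}

__interrupt void timer_isr(void)
{
    fifo_AddEvent(Event);    /* Post the event */
}
```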

Occasional function calls

The overhead of a function call is annoying enough in itself; it’s even more frustrating when you have code that looks like Listing 4. In this case, most of the time, the ISR simply decrements a counter. However, when the counter reaches zero, a function call is made. Unfortunately, many compilers will look at this ISR, see that a function call is made, and stack all the registers for the function call every time the ISR is invoked. This wastes CPU cycles. The long-term solution to this problem is to pressure the compiler vendors to improve their products such that the registers only get stacked when the function call is made. In the short term, use a software interrupt. The code for such an approach is shown in Listing 5.

__interrupt void timer_isr(void)
{
    if (--Cntr == 0)
    {
        post_event(TIMER_TICK);
    }
}

Listing 4. An occasional function call

__interrupt void timer_isr(void)
{
    if (--Cntr == 0)
    {
        __software_interrupt;
    }
}

__interrupt void software_isr(void)
{
    post_event(TIMER_TICK);
}

Listing 5. Use of software interrupt

In place of the function call, we generate a software interrupt. This results in another interrupt being generated. The requisite function call is made from within this interrupt. The advantage of this approach is that the timer ISR no longer needs to stack the registers associated with the function call. Of course the software interrupt handler does need to stack the registers, but now they’re only stacked when they’re really needed. The main cost of this approach is slightly increased latency from the time the interrupt occurs to the time the function call is made.

Depending on your perspective, this technique is either very elegant or a kludge. However, with adequate documentation, it should be possible to explain clearly what you’re doing and why. For me, the choice between doing this and resorting to assembly language is clear cut, but the choice is yours.

Incidentally, you may be unaware of the concept of a software interrupt. Some CPUs include in their instruction set a software interrupt instruction (SWI). Execution of the instruction causes an interrupt flag to be set (just as if a timer had rolled over, or a character had been received in a UART). If all the usual criteria are met, at some point the CPU will vector to the software interrupt handler and execute it just like any normal hardware based interrupt.

Don’t be despondent if your CPU doesn’t include a SWI because it’s possible to fashion a software interrupt in several ways using unused hardware resources. The easiest way is typically to configure an unused port pin as an output, while also configuring it to interrupt on change. Thus, simply toggling the port pin will generate an interrupt.
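Here is a hedged sketch of the port-pin trick for an AVR-class part. The register names are assumptions, and the memory-mapped hardware registers are replaced with plain variables so the fragment compiles on a host:

```c
#include <stdint.h>

/* Host stand-ins for memory-mapped I/O registers (names assumed) */
static uint8_t PORTB_reg, PCMSK0_reg, PCICR_reg, DDRB_reg;

#define SOFT_PIN (1u << 0)     /* the spare pin chosen as the "soft" source */

void soft_irq_init(void)
{
    DDRB_reg   |= SOFT_PIN;    /* drive the unused pin as an output */
    PCMSK0_reg |= SOFT_PIN;    /* enable pin-change detection on it */
    PCICR_reg  |= 1u;          /* enable the pin-change interrupt group */
}

void soft_irq_trigger(void)
{
    PORTB_reg ^= SOFT_PIN;     /* any edge on the pin pends the interrupt */
}
```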

A second technique is to employ a spare counter. In this case, one can simply load the counter with its maximum value, configure it to interrupt on roll over, then enable the counter to count up. With a bit of ingenuity, you’ll find that most of the hardware resources in a typical microcontroller can be configured to generate an interrupt upon software command.
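The spare-counter variant can be sketched the same way, again with host stand-ins for illustrative timer register names:

```c
#include <stdint.h>

/* Host stand-ins for an unused 8-bit timer's registers (names assumed) */
static uint8_t TCNT2_reg, TIMSK2_reg, TCCR2B_reg;

void soft_irq_via_timer(void)
{
    TCNT2_reg   = 0xFFu;       /* one count away from rolling over */
    TIMSK2_reg |= 0x01u;       /* enable the overflow interrupt */
    TCCR2B_reg |= 0x01u;       /* start counting: the overflow pends almost at once */
}
```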

A final warning for those of you whose CPU supports an SWI instruction: debuggers often make use of the SWI instruction to set breakpoints. So before using this technique, make sure you aren’t going to break your debugger.

Coding constructs

What if your CPU has been selected and your software architecture designed, such that the functionality of your interrupt handlers is set? In this case, it’s time to look at some of your coding constructs. Small changes in this area can make surprisingly large changes to the number of registers that need to be stacked. Most of the normal suggestions for making code tighter apply, of course, and won’t be reiterated here. However, there are a few areas to watch out for.

Switch statements are a bad idea for two reasons. First, the overhead associated with a switch statement is typically quite large, resulting in a lot of registers being stacked. Second, most compilers will implement switch statements in one of several ways depending on the values that are switched. Thus, a small change to a switch statement can result in a completely different set of instructions being generated (and hence registers stacked). The result is that an innocuous change to the code can make a dramatic difference to the CPU bandwidth consumed. This is not something that makes for a maintainable system.
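Where the cases map a dense range of values to results, one way to sidestep the problem is a small lookup table: the generated code, and therefore the set of stacked registers, stays the same no matter how many entries are added. The names here are illustrative:

```c
#include <stdint.h>

/* Table replacing a switch over modes 0..3 */
static const uint8_t prescale_bits[4] = { 0x01u, 0x02u, 0x03u, 0x04u };

uint8_t mode_to_bits(uint8_t mode)
{
    /* One bounded table read instead of a compiler-chosen jump table
       or compare chain. */
    return (mode < 4u) ? prescale_bits[mode] : 0u;
}
```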

Watch out for unexpected type promotions. These can have a devastating impact on an ISR in two ways. First, a type promotion can result in more registers being used to hold a variable than is necessary (which of course means that more registers need to be stacked). Second, a type promotion may result in the compiler making a function call to a library function to perform an operation rather than doing it in line. If you’re lucky, your compiler knows what registers are used by library functions. If you’re unlucky, you’ll be hit with the full overhead of a function call.
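As an illustration, dividing an 8-bit counter invites a promotion to int (and, on an 8-bit part, possibly a library call for the divide), whereas a down-counter does the same job in a single narrow register. The names and the codegen claims in the comments are typical-case assumptions, not guarantees:

```c
#include <stdint.h>

static uint8_t modulo_cnt;

/* Promotion-prone: modulo_cnt is promoted to int before the %, which
   an 8-bit compiler may hand to a 16-bit divide library routine. */
uint8_t tick_promoted(void)
{
    modulo_cnt++;
    return (uint8_t)((modulo_cnt % 10) == 0);
}

static uint8_t down_cnt = 10;

/* Promotion-free: decrement-and-reload typically stays in one 8-bit register. */
uint8_t tick_narrow(void)
{
    if (--down_cnt == 0u)
    {
        down_cnt = 10u;
        return 1u;
    }
    return 0u;
}
```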

Interrupts often test and/or set flags. Many processors contain special flag locations that can be operated on without going through a CPU register. Within an ISR these locations offer a double benefit: the bit operation itself is fast, and the overhead of stacking an intermediate register is avoided.
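A sketch of the idea, using AVR’s GPIOR0 as the flag register (stubbed as a plain variable here so the fragment compiles on a host; on the real part, the bit operations below can compile to single bit-set/bit-clear instructions with no intermediate register to stack):

```c
#include <stdint.h>

static uint8_t GPIOR0_reg;     /* host stand-in for a bit-addressable I/O register */

#define TICK_FLAG (1u << 0)

void timer_isr_body(void)
{
    GPIOR0_reg |= TICK_FLAG;            /* single bit-set instruction on the real part */
}

int tick_pending(void)
{
    return (GPIOR0_reg & TICK_FLAG) != 0;
}

void tick_clear(void)
{
    GPIOR0_reg &= (uint8_t)~TICK_FLAG;  /* single bit-clear instruction on the real part */
}
```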

If you have done all that you can to minimize overhead, and you are still facing an unacceptable burden, before reaching for the assembler, look at your compiler optimization settings, as the usual rules don’t always apply to interrupt handlers.

Most compilers will let you optimize for size or speed. Furthermore, most compiler manuals carry a notice that sometimes optimizing for speed will actually result in a smaller footprint than optimizing for size. This can occur because the act of inlining code, unrolling loops, and so forth results in the optimizer seeing a different view of the code and making optimizations that weren’t previously possible. Interestingly, I’ve never seen a compiler manual admonish the user that optimizing for size may result in faster executing code. However, this is exactly what can happen in the case of an ISR.

I’m not privy to the way compiler writers optimize interrupt handlers. However, what seems to happen is that the optimizer concentrates on optimizing the C language constructs and then simply adjusts the registers that need to be stacked based on the resulting code, rather than looking at the entire routine. By asking the compiler to optimize for size, you create the possibility of the compiler generating more compact code that uses fewer registers. Because it uses fewer registers, the stacking overhead is reduced and the overall execution time of the ISR is potentially reduced. The bottom line: when asking a compiler to optimize an ISR, try optimizing for both speed and size, as you may find that size-optimized code is actually faster.


This article was published in the January 2007 issue of Embedded Systems Programming. If you wish to cite the article in your own work, you may find the following MLA-style information helpful:

Jones, Nigel. “Minimize Your ISR Overhead.” Embedded Systems Programming, January 2007.