Archive for the ‘Publications’ Category

Uses for C’s offsetof() Macro

Thursday, December 17th, 2009 Nigel Jones

Also available in PDF version.

C’s seldom-used offsetof() macro can actually be a helpful addition to your bag of tricks. Here are a couple of places in embedded systems where the macro is indispensable, including packing data structures and describing how EEPROM data are stored.

If you browse through an ANSI C compiler’s header files, you’ll come across a very strange looking macro in stddef.h. The macro, offsetof(), has a horrid declaration. Furthermore, if you consult the compiler manuals, you’ll find an unhelpful explanation that reads something like this:

The offsetof() macro returns the offset of the element name within the struct or union composite. This provides a portable method to determine the offset.

At this point, your eyes start to glaze over, and you move on to something that’s more understandable and useful. Indeed, this was my position until about a year ago–whence the macro’s usefulness finally dawned on me. I now kick myself for not realizing the benefits earlier—the macro could have saved me a lot of grief over the years. However, I console myself by realizing that I wasn’t alone, since I’d never seen this macro used in any embedded code. Offline and online searches confirmed that offsetof() is essentially not used. I even found compilers that had not bothered to define it.

How offsetof() works

Before delving into the three areas where I’ve found the macro useful, it’s necessary to discuss what the macro does, and how it does it.

The offsetof() macro is an ANSI-required macro that should be found in stddef.h. Simply put, the offsetof() macro returns the number of bytes of offset before a particular element of a struct or union.

The declaration of the macro varies from vendor to vendor and depends upon the processor architecture. Browsing through the compilers on my computer, I found the example declarations shown in Listing 1. As you can see, the definition of the macro can get complicated.

// Keil 8051 compiler
#define offsetof(s,m) (size_t)&(((s *)0)->m)

// Microsoft x86 compiler (version 7)
#define offsetof(s,m) (size_t)(unsigned long)&(((s *)0)->m)

// Diab Coldfire compiler
#define offsetof(s,memb) \
    ((size_t)((char *)&((s *)0)->memb-(char *)0))

Listing 1. A representative set of offsetof() macro definitions

Regardless of the implementation, the offsetof() macro takes two parameters. The first parameter is the structure name; the second, the name of the structure element. (I apologize for using a term as vague as “structure name.” I’ll refine this shortly.) A straightforward use of the macro is shown in Listing 2.

typedef struct
    int   i;
    float f;
    char  c;


void main(void)
    printf("Offset of 'f' is %u", offsetof(SFOO, f));

Listing 2. A straightforward use of offsetof()

To better understand the magic of the offsetof() macro, consider the details of Keil’s definition. The various operators within the macro are evaluated in an order such that the following steps are performed:

  • ((s *)0): takes the integer zero and casts it as a pointer to s.
  • ((s *)0)->m: dereferences that pointer to point to structure member m.
  • &(((s *)0)->m): computes the address of m.
  • (size_t)&(((s *)0)->m): casts the result to an appropriate data type.

By definition, the structure itself resides at address 0. It follows that the address of the field pointed to (Step 3 above) must be the offset, in bytes, from the start of the structure. At this point, we can make several observations:

  • We can be a bit more specific about the term “structure name.” In a nutshell, if the structure name you use, call it s, results in a valid C expression when written as (s *)0->m, you can use s in the offsetof() macro. The examples shown in Listings 3 and 4 will help clarify that point.
  • The member expression, m, can be of arbitrary complexity; indeed, if you have nested structures, then the member field can be an expression that resolves to a parameter deeply nested within a structure
  • It’s easy enough to see why this macro also works with unions
  • The macro won’t work with bitfields, as you can’t take the address of a bitfield member of a structure or union

Listings 3 and 4 contain simple variations on the usage of this macro. These should help you get you comfortable with the offsetof() basics.

struct sfoo
    int   i;
    float f;
    char  c;


void main(void)
    printf("Offset of 'f' is %u", offsetof(struct sfoo, f));

Listing 3. A struct without a typedef

typedef struct
    long  l;
    short s;


typedef struct
    int   i;
    float f;
    SBAR  b;


void main(void)
    printf("Offset of 'l' is %u", offsetof(SFOO, b.l));

Listing 4. Nested structs

Now that you understand the semantics of the macro, it’s time to take a look at a few use examples.

struct padding bytes

Most 16-bit and larger processors require that data structures in memory be aligned on a multibyte (for example, 16-bit or 32-bit) boundary. Sometimes the requirement is absolute, and sometimes it’s merely recommended for optimal bus throughput. In the latter case, the flexibility is offered because the designers recognized that you may wish to trade off memory access time with other competing issues such as memory size and the ability to transfer (perhaps via a communications link or direct memory access) the memory contents directly to another processor that has a different alignment requirement.

For cases such as these, it’s often necessary to resort to compiler directives to achieve the required level of packing. As the C structure declarations can be quite complex, working out how to achieve this can be daunting. Furthermore, after poring over the compiler manuals, I’m always left with a slight sense of unease about whether I’ve really achieved what I set out to do.

The most straightforward solution to this problem is to write a small piece of test code. For instance, consider the moderately complex declaration given in Listing 5.

typedef union
    int   i;
    float f;
    char  c;

        float  g;
        double h;
    } b;


void main(void)
    printf("Offset of 'h' is %u", offsetof(UFOO, b.h));

Listing 5. A union containing a struct

If you need to know where field b.h resides in the structure, then the simplest way to find out is to write some test code such as that shown in Listing 5.

This is all well and good, but what about portability? Writing code that relies on offsets into structures can be risky—particularly if the code gets ported to a new target at a later date. Adding a comment is of course a good idea—but what one really needs is a means of forcing the compiler to tell you if the critical members of a structure are in the wrong place. Fortunately, one can do this using the offsetof() macro and the technique in Listing 6.

typedef union
    int   i;
    float f;
    char  c;

        float  g;
        double h;
    } b; 	   


static union
    char wrong_offset_i[offsetof(UFOO, i) == 0];
    char wrong_offset_f[offsetof(UFOO, f) == 0];
    char wrong_offset_h[offsetof(UFOO, b.h) == 2]; // Error

Listing 6. An anonymous union to check struct offsets

The technique works by attempting to declare a union of one-char arrays. If any test evaluates to false, its array will be of zero size, and a compiler error will result. One compiler I tested generated the specific error “Invalid dimension size [0]” on the line defining array wrong_offset_h[].

Thus the offsetof() macro can be used both to determine and to validate the packing of elements within C structs.

Nonvolatile memory layouts

Many embedded systems contain some form of nonvolatile memory, which holds configuration parameters and other device-specific information. One of the most common types of nonvolatile memory is serial EEPROM. Normally, such memories are byte addressable. The result is often a serial EEPROM driver that provides an API that includes a read function that looks like this:

ee_rd(uint16_t offset, uint16_t nBytes, uint8_t * dest)

In other words, read nBytes from offset offset in the EEPROM and store them at dest. The problem is knowing what offset in EEPROM to read from and how many bytes to read (in other words, the underlying size of the variable being read). The most common solution to this problem is to declare a data structure that represents the EEPROM and then declare a pointer to that struct and assign it to address 0x0000000. This technique is shown in Listing 7.

typedef struct
    int   i;
    float f;
    char  c;


EEPROM * const pEE = 0x0000000;

ee_rd(&(pEE->f), sizeof(pEE->f), dest);

Listing 7. Accessing data in serial EEPROM via a pointer

This technique has been in use for years. However, I dislike it precisely because it does create an actual pointer to a variable that supposedly resides at address 0. In my experience, this can create a number of problems including:

  • Someone maintaining the code can get confused into thinking that the EEPROM data structure really does exist
  • You can write perfectly legal code (for example, pEE->f = 3.2) and get no compiler warnings that what you’re doing is disastrous
  • The code doesn’t describe the underlying hardware well

A far better approach is to use the offsetof() macro. An example is shown in Listing 8

typedef struct
    int   i;
    float f;
    char  c;


ee_rd(offsetof(EEPROM,f), 4, dest);

Listing 8. Use offsetof() to access data stored in serial EEPROM

However, there’s still a bit of a problem. The size of the parameter has been entered manually (in this case “4”). It would be a lot better if we could have the compiler work out the size of the parameter as well. No problem, you say, just use the sizeof() operator. However, the sizeof() operator doesn’t work the way we would like it to. That is, we cannot write sizeof(EEPROM.f) because EEPROM is a data type and not a variable.

Normally, one always has at least one instance of a data type so that this is not a problem. In our case, the data type EEPROM is nothing more than a template for describing how data are stored in the serial EEPROM. So, how can we use the sizeof() operator in this case? Well, we can simply use the same technique used to define the offsetof() macro. Consider the definition:

#define SIZEOF(s,m) ((size_t) sizeof(((s *)0)->m))

This looks a lot like the offsetof() definitions in Listing 1. We take the value 0 and cast it to “pointer to s.” This gives us a variable to point to. We then point to the member we want and apply the regular sizeof() operator. The net result is that we can get the size of any member of a typedef without having to actually declare a variable of that data type.

Thus, we can now refine our read from the serial EEPROM as follows:

ee_rd(offsetof(EEPROM, f), SIZEOF(EEPROM, f), &dest);

At this point, we’re using two macros in the function call, with both macros taking the same two parameters. This leads to an obvious refinement that cuts down on typing and errors:

#define EE_RD(M,D) \
    ee_rd(offsetof(EEPROM,M), SIZEOF(EEPROM,M), D)

Now our call to the EEPROM driver becomes much more intuitive:

EE_RD(f, &dest);

That is, read f from the EEPROM and store its contents at location dest. The location and size of the parameter is handled automatically by the compiler, resulting in a clean, robust interface.

Protect nonvolatile memory

Many embedded systems contain directly addressable nonvolatile memory, such as battery-backed SRAM. It’s usually important to detect if the contents of this memory have been corrupted. I usually group the data into a structure, compute a CRC (cyclic redundancy check) over that structure, and append it to the data structure. Thus, I often end up with something like this:

struct nv
    short    param_1;
    float    param_2;
    char     param_3;
    uint16_t crc;

} nvram;

The intent of the crc field is that it contain a CRC of all the parameters in the data structure with the exception of itself. This seems reasonable enough. Thus, the question is, how does one compute the CRC? If we assume we have a function, crc16( ), that computes the CRC-16 over an array of bytes, then we might be tempted to use the following:

nvram.crc =
    crc16((char *) &nvram, sizeof(nvram)-sizeof(nvram.crc));

This code will only definitely work with compilers that pack all data on byte boundaries. For compilers that don’t do this, the code will almost certainly fail. To see that this is the case, let’s look at this example structure for a compiler that aligns everything on a 32-bit boundary. The effective structure could look like that in Listing 9.

struct nv
    short    param_1; 	// offset = 0
    char     pad1[2]; 	// 2 byte pad
    float    param_2; 	// offset = 4
    char     param_3; 	// offset = 8
    char     pad2[3]; 	// 3 byte pad
    uint16_t crc; 	 // offset = 12
    char     pad3[2]; 	// 2 byte pad

} nvram;

Listing 9. An example struct for a compiler that aligns everything on a 32-bit boundary

The first two pads are expected. However, why is the compiler adding two bytes onto the end of the structure? It does this because it has to handle the case when you declare an array of such structures. Arrays are required to be contiguous in memory, too. So to meet this requirement and to maintain alignment, the compiler pads the structure out as shown.

On this basis, we can see that the sizeof(nvram) is 16 bytes. Now our naive code in Listing 9 computes the CRC over sizeof(nvram) – sizeof(nvram.crc) bytes = 16 – 2 = 14 bytes. Thus the stored crc fieldnow includes its previous value in its computation! We certainly haven’t achieved what we set out to do.

struct nv
    struct data
        short param_1; 	// offset = 0
        float param_2; 	// offset = 4
        char  param_3; 	// offset = 8

    } data;

    uint 16_t crc;	 // offset = 12

} nvram;

Listing 10. Nested data structures

The most common solution to this problem is to nest data structures as shown in Listing 10. Now we can compute the CRC using:

nvram.crc =
    crc16((uint8_t *) &, sizeof(;

This works well and is indeed the technique I’ve used over the years. However, it introduces an extra level of nesting within the structure—purely to overcome an artifact of the compiler. Another alternative is to place the CRC at the top of the structure. This overcomes most of the problems but feels unnatural to many people. On the basis that structures should always reflect the underlying system, a technique that doesn’t rely on artifice is preferable—and that technique is to use the offsetof() macro.

Using the offsetof() macro, one can simply use the following (assuming the original structure definition):

nvram.crc =
    crc16((uint8_t *) &nvram, offsetof(struct nv, crc));

Keep looking

I’ve provided a few examples where the offsetof() macro can improve your code. I’m sure that I’ll find further uses over the next few years. If you’ve found additional uses for the macro I would be interested to hear about them.

This article was published in the March 2004 issue of Embedded Systems Programming. If you wish to cite the article in your own work, you may find the following MLA-style information helpful:

Jones, Nigel. “Learn a new trick with the offsetof() macro” Embedded Systems Programming, March 2004.

MISRA-C Guidelines for Safety Critical Software

Thursday, December 17th, 2009 Nigel Jones

Also available in PDF version.

In 1998, the UK’s Motor Industry Software Reliability Association established a set of 127 guidelines for the use of C in safety-critical systems. Here’s a look at the rules, what they mean, and how they can work for you.

The reasons for the popularity of C in the embedded realm are clear: easy access to hardware, low memory requirements, and efficient run-time performance being foremost among them. Equally well known are the problems with C: highly limited run-time checking, a syntax that is prone to stupid mistakes that are technically legal, and a myriad of areas where the ISO C standard explicitly states that the behavior is either implementation defined or undefined.

In the hands of a truly experienced programmer, most of the limitations of C can be avoided. Of course, that leaves the problem of inexperienced programmers. The problem is similar to the dilemma faced by parents when their offspring learn to drive. The solution that most parents adopt is to give their child a large, slow lumbering vehicle with excellent brakes; few hand over the keys to a high performance sports car.

ISO C is like a high performance vehicle–with very questionable brakes. Despite this, organizations have little choice other than to let their beginners loose with the language. The results are predictable and well documented.

So, what can be done? The UK-based Motor Industry Software Reliability Association (MISRA) realized that in many areas of an automobile design, safety is of paramount importance. They also recognized that C was the de facto language for coding such systems and that these systems are, therefore, vulnerable to C’s limitations. Rather than mandating the use of a safer language such as Ada, the organization looked at ways to make C safer.

The result was a set of “Guidelines for the Use of the C Language in Vehicle-Based Software,” or “MISRA C,” as they are more informally known. The guidelines, which were first published in 1998, comprise a 70-page document that describes a workable subset of C that avoids many of its well-known problems. (Unfortunately, you have to buy the document from MISRA. It is available at for about $50.)

The MISRA C document contains eight chapters and two appendices and was obviously written by experienced embedded programmers. Chapters 1 through 6 contain important information about the rationale for MISRA C and how to interpret the rules. These chapters should be read prior to diving into the actual rules, which are found in Chapter 7.

Safety guidelines

In all, MISRA C has 127 rules. Of these, 93 are required and the remaining 34 are advisory. The distinction between these two types of rules is important. C code that claims conformance to MISRA C must comply with all 93 required rules. Conforming code should adhere to the advisory rules as much as is practical. In other words, either you buy into the spirit of MISRA C or you don’t.

With this in mind, it’s time to discuss the rules themselves. The example rules are quoted verbatim, in part to illustrate the tone and style of the guidelines. Let’s start at the beginning:

Rule 1 (required): All code shall conform to ISO 9899 standard C, with no extensions permitted.

When I first read this, I laughed out loud. Because the C standard was not originally written with embedded systems in mind, it is impossible to write a non-trivial embedded program without resorting to a variety of compiler extensions. For example, ISO C provides no way to specify that a function is an interrupt service routine. (MISRA frowns upon assembly language, so you can’t use that as a way to skirt the issues.)

Fortunately, in reading the notes associated with Rule 1, one finds the following comment:

It is recognized that it may be necessary to raise deviations (as described in section 5.3.2) to permit certain language extensions, for example to support hardware specific features.

The deviations mentioned in the comment are, arguably, the strongest feature of MISRA C. In putting together the rules, the authors recognized that they could not possibly foresee all the circumstances associated with all embedded systems. Thus, they put in place a mechanism by which an organization can formally deviate from a given rule. Note that this does not mean that the programmer is free to violate the rules on a whim. Rather it means that all violations must be carefully considered, justified, and documented.

MISRA also recommends that deviations be localized as much as possible–such as, all I/O operations be performed in one file. Notwithstanding MISRA C, this is good programming practice anyway.

Most of the other rules are common sense and are succinctly stated. For instance:

Rule 20 (required): All object and function identifiers shall be declared before use.

Any programmer who has a problem with this rule probably needs to find a new line of work.

Some of the rules are more stylistic in their intent and, as such, are more likely to offend individual sensibilities. This is particularly true of the advisory rules, such as:

Rule 49 (advisory): Tests of a value against zero should be made explicit, unless the operand is effectively Boolean.

This rule goes against my own preferred style of coding. However, I see where the authors are going and wouldn’t exactly be compromising deeply held beliefs by conforming to their advice.

While Rule 49 isn’t going to change much in the way most people write code, Rule 104 could have a bigger impact.

Rule 104 (required): Non-constant pointers to functions shall not be used.

The use of pointers to functions is a favored technique of many experienced embedded programmers.

The bottom line: regardless of your experience level, if you choose to go down the MISRA C path, expect to have to change the way you code in both big and small ways.

Compliance checking

In writing these rules, the authors appear to have tried very hard to make compliance checking via static analysis possible for as many rules as possible. Static analysis is a fancy term meaning that a computer program could be written to check for violations of the rules.

The most obvious program to do the compliance checking is a compiler. In fact, several embedded compiler vendors, including Green Hills and Tasking (now Altium), do offer compilers with MISRA compliance switches. Does this mean that you have to use a subset of the compilers out there in order to get compliance checking? Fortunately, no! PC-Lint (from Gimpel) now offers MISRA compliance checking as well. (Read more about lint.)

Unfortunately, though, not all of the rules can be enforced automatically. For the other 23 rules, the only available enforcement mechanism is a code review. So, if nothing else, adherence to MISRA C guarantees that your organization will have to conduct code reviews.

MISRA recommends the use of a compliance matrix to track how your organization intends to check its conformance to each rule.

Final thoughts

Many organizations that develop software, particularly in the U.S., have not even heard of MISRA C–let alone require its use. However, this doesn’t mean that as an individual developer you shouldn’t consider adopting these guidelines on your own. After all, MISRA is a tool to help you write safer, more portable code.

For those of you who can’t handle the effort of formally adopting MISRA C, you should, at the very least, examine the rules. If you see rules that contradict your normal coding style, then I suggest you look long and hard at your practices. Chances are that the MISRA folks are older and wiser than you.

MISRA C does nothing for checking the validity of your algorithms. It doesn’t enforce a particular style, and it cannot stop you from writing completely idiotic code. What it will do-if you let it-is protect you from most of the darker corners of the C language.

This article was published in the July 2002 issue of Embedded Systems Programming. If you wish to cite the article in your own work, you may find the following MLA-style information helpful:

Jones, Nigel. “Introduction to MISRA C” Embedded Systems Programming, July 2002.

Use Lint for Static Code Analysis

Thursday, December 17th, 2009 Nigel Jones

Also available in PDF version.

Language specifications, including those for C and C++, are often loosely written. A static analysis tool called lint can help you find dangerous and non-portable constructs in your code before your compiler turns them into run-time bugs.

Anyone who has written a program has had to debug code. In many cases, after staring at the code for hours, we finally realize that we have done something stupid like read an uninitialized variable, or index beyond the end of an array. With experience, the incidence of these “doh!” bugs decreases; but they can still be extremely frustrating and costly.

What we need at such times is an infinitely patient, anal-retentive expert to inspect every inch of our code. This expert must be more thorough than any compiler, and should report back in a completely dispassionate way everything that is potentially wrong with our code. A development tool called lint is the closest you can get to this ideal.

Lint is a tool similar to a compiler in that it parses C or C++ source files. It checks these files for syntactical correctness. To lint a file called foo.c, one typically just types:

lint foo.c

at the command line. Naturally, though, there are also many optional command line switches.

Standard features

Whereas a compiler concerns itself primarily with code generation, lint is completely devoted to checking your code for a myriad of possible defects. The key word here is possible. Just because lint flags a section of your code for review, it doesn’t necessarily mean a problem will occur when you compile that code with your particular compiler.

Lint is designed to be compiler-agnostic and is, in fact, frequently in the business of focusing your attention on parts of the code that might result in different behavior depending on the specific compiler used.

The specific list of problems that lint checks varies by implementation and version of the tool. However, most flavors of lint will at least check for the following:

  • Possible indexing beyond array bounds
  • De-referencing of null pointers
  • Suspicious assignments (such as if (a = b))
  • Mismatches in variable types (such as foo declared as a double in one file and used as a long in another)
  • Potentially dangerous data type combinations
  • Unused variables
  • Unreachable code
  • Header files included multiple times and/or unnecessarily
  • Non-portable constructs

Lint checks so many things, in fact, that it’s common for the tool to initially produce as many errors and warnings as there are lines of code in the source file that’s input. This is a big turn-off for many potential users, since their attitude tends to be “this tool is so picky it’s ridiculous.” However, working through each warning and correcting it can be a rewarding exercise. As an example, consider this seemingly innocuous code:

    int i;

    for (i = 0; i < 10l; i++)

What will be the last number printed when you run this program? If you answered “100,” you’re wrong. This is a perfectly valid program, and will compile without warning on most any compiler. However, lint will complain about something. If you can’t see the problem via visual inspection, then I suggest you cut and paste the code above and run it through your favorite debugger. Note, and a big hint here: do not simply type this program into your editor. Observe how long it takes you to find this problem–and then ask yourself whether wading through lint warnings isn’t so bad.

Getting the lint out

What sort of real world benefits can you expect from addressing all of the warnings produced by lint? My experiences on a typical project involving a small microcontroller (total code size below 32KB) included the following:

  • Lint found two or three outright bugs–before I had even started testing the code
  • I learned something about the C language each time I ran lint
  • My final code was cleaner because lint informed me of unused variables, macros, and header file includes that could be safely removed
  • I was better informed of potential portability issues

Given the above, it will probably not surprise you to learn that organizations that are really serious about code quality often insist not only that all code compile without warnings (which is relatively trivial to achieve) but also that it be “lint free”-that is, generate no warnings with lint. This is a much more difficult criteria to achieve.

Figure 1: How lint fits into the development process

It’s worth looking at where lint fits into the development process. My general design flow is shown in Figure 1. Once I have code that compiles, I lint it. If the code gets through lint okay, it’s highly unlikely that I’ll be embarrassed in the code review. During the debug phase, it’s normal for changes to be made to the code. However, once the code is debugged, and before it is passed to the release test, I normally lint the code again. Why? I’m always amazed at the number of sloppy coding constructs that occur when code is being debugged. Lint is a great tool for identifying that ratty piece of code that was put in there to help debug something and then promptly forgotten.

Sources of lint

Lint is a standard tool on most Linux or Unix development systems. In the PC realm, however, you often have to go out and buy lint, or find a free or shareware version. If you do buy lint, rest assured it will likely be the best money you have ever spent in your embedded career. Most lint variants are relatively inexpensive (less than $1,000) and worth every penny.

Incidentally, you may be wondering how well lint handles all those nasty little compiler extensions that are so common in embedded development. This is an area where the commercial programs outshine the free and shareware offerings. In particular, some versions of lint allow considerable customization of lint’s rules, such that all those extensions are correctly handled. In some cases, the compiler definitions are even supplied by the lint vendor. In others, you may be able to get them from the compiler vendor.

If you don’t have access to lint, but are using the GNU tools (on Unix or a PC), simply use gcc’s -Wall flag to achieve about 80% of the same functionality.

So, for all the neophytes out there, get yourself a copy of lint, and use it. If nothing else, your boss will be impressed with the maturity of your code. For all the experienced hacks who aren’t using lint–watch out! The new guys who are using it might show you up.

Further reading

Wikipedia – Lint

Darwin, Ian F. Checking C Programs with Lint. Sebastopol, CA: O’Reilly, 1988.

This article was published in the May 2002 issue of Embedded Systems Programming. If you wish to cite the article in your own work, you may find the following MLA-style information helpful:

Jones, Nigel. “Introduction to Lint,” Embedded Systems Programming, May 2002

Optimal C Constructs for 8051 Microcontrollers

Thursday, December 17th, 2009 Nigel Jones

Also available in PDF version.

The limitations of an 8-bit microcontroller (MCU) can sometimes make conventional C constructs produce suboptimal code. In this article we look at common problems on the 8051 family and discuss workarounds in C.

The 8-bit MCU marketplace is massive. And, despite the inroads made by newer architectures, the ancient 8051 continues to hold significant market share. But old 8-bit architectures, such as the 8051 and HC11, were never designed to be programmed in a high-level language such as C.

Compiler vendors work hard to make good compilers for these 8-bit microcontrollers. But even with all the ingenuity in the world, compilers must eventually bump up against a target MCU’s fundamental limitations, producing some surprising results.

Those of you familiar with the 8051 know that it’s not exactly a great architecture from a programmer’s perspective. Accordingly, a lot of 8051-based projects end up hungry for every CPU cycle and memory byte that can be saved. Since I enjoy working on 8-bit designs, I’ve spent many years wringing the last drop of performance out of such systems–and learning a lot in the process.

If you have a poor software design, inefficient algorithms, or haven’t taken the basic steps advocated by the compiler vendor, nothing I can tell you here will make a difference. However, if you take all the steps advocated by the compiler vendor and have optimal algorithms and other necessary conditions, and you still need more performance (but don’t want to resort to assembly language) finding the optimal C coding constructs can help you get the job done.

On 32-bit systems with a plethora of registers and addressing modes, functionally equivalent coding constructs will probably produce identical performance. On 8-bit systems, however, a subtle change in coding style can affect performance significantly. Consider the following blocks of code, which are functionally equivalent.

// 8 bytes and 42 cycles
for (i = 0; i < 10; i++)

// 15 bytes and 120 cycles
i = 0;
while (i++ < 10)

As the comments indicate, the difference between these two fragments is significant. Given the large difference, you’d think the compiler vendors would publish a nice list of optimal techniques to which one could refer. Well, I have poured over dozens of compiler manuals and have never come across such a list. I can only assume that either the vendors aren’t aware of these differences (which seems hard to believe), or they consider it a sign of weakness in their product, and don’t want to publicize the information.


Working on the adage that “What they won’t tell me is probably really worth knowing,” I have created just such a list of optimal C constructs. This list was hard won and I intend to share it with you. However, a few caveats are in order.

First, this list is only definitive for an 8051 and version 6.02 of the Keil compiler using optimization level 8, favoring execution speed over size. I also used the no integer promote (NOIP) switch. Does this mean that if you are using something else that this article is a waste of your time? I hope not! Many of the results come about because of the 8051 architecture. Thus, if you are using another 8051 compiler, I suspect that the general results will hold true. If you are using another 8-bit machine, such as the HC11 or the PIC then at the very least, the article should give you some food for thought.

Second, I can’t hope to cover every possible permutation of each of the scenarios. If you think I’ve missed an important variant, please add it to the test code and let me know your results.

“Optimal” can mean many things. I think there is a tremendous case for claiming that optimal code is code that is easy to maintain. In this article, however, I use optimal to refer to code that is optimally fast, small, and consistent in its execution time; I err on the side of speed. Generally, however, I’ve found that the fastest code is also the smallest.

On many occasions I have seen compilers do weird things with subtle changes to code. As a result, the information I’ll present here should be taken as a “rule of thumb.” If you apply the techniques, most of the time you’ll get more optimal code. However, if you are trying to wring the last ounce of performance out of your system, you should really check the results for your specific case.

Finally, some of the tests reveal surprising weaknesses in the Keil compiler. It is not my intention to criticize their product. Indeed, I have used their compiler for over ten years, which says a lot about what I think of it. I also suspect that similar compilers would also show their idiosyncrasies when subjected to similar detailed inspection. I’d like to go on the record as saying I think Keil’s products are excellent–and put a lot of the bigger names in the industry to shame.

Coding conventions

A central problem in trying to demonstrate the various constructs is to keep the code very simple–but not so simple that the optimizer realizes the code is doing nothing useful, and hence optimizes it away to nothing. To this end, I make extensive use of the intrinsic “function call” _nop_().

In comparing various techniques, and to ensure that every technique gets a fair hearing, I make extensive use of conditional compilation. To try the various techniques, it is of course necessary to set the value of the various manifest constants in order to compile the code one way or another.

In most cases, I report the number of execution cycles, the code size, and the data size of the various fragments. The data size is instructive. However, given that this memory gets heavily overlaid on an 8051, don’t get too worked up about differences in this parameter. When relevant, I also report the size of the library code.

In some of the projects, I run a lot of variants in an attempt to deduce patterns and optimal constructs. I usually only report on my conclusions. I encourage you to study the results of the various test cases and see if you agree with my conclusions.

As a final note, the execution times are normally given in cycles assuming a standard 8051 (12 clocks per instruction cycle). When I am looking at the benefits of dual data pointers, I use a processor with a different number of clocks per cycle. In either case, what is really important is the ratio (or difference) of the execution times.

Code constructs

We’ll be looking at these areas:

  • Looping
  • Function calling
  • Decision making
  • Calling one of N functions
  • Pointer manipulation
  • Copying
  • Bit-banging

The next to last section, on copying, builds on the earlier topics, and the bit-banging section builds on everything. That is, we’ll apply some of the techniques to address the problem of addressing a serial device by twiddling port lines.


All nontrivial programs contain loops. Of these, the most common is the loop that has to execute a fixed number of times. C offers numerous ways to code this. To see if there is any significant difference between the various constructs, I put together the looping project. The looping project contains eight different constructs. After testing the eight variants, I achieved a variation in performance that is quite spectacular (2.5:1 in size and 3.5:1 in speed). Furthermore, the best performing construct is not exactly intuitive.

First, let’s look at the worst performing loop construct:

// Code: 15. Cycles: 124.
void fne(void)
    unsigned char i; 

    i = 0;
    while (i++ < 10)

This is a fairly innocuous construct, but for various reasons it causes the compiler to generate a lousy solution. In contrast, three different constructs produced the optimal solution. The following is the most “standard” looking:

// Code: 6. Cycles: 35.
void fnc(void)
    unsigned char i; 

    for (i = 10; i; i--)

Note that this solution (along with the other two optimal solutions) involves counting down. In contrast, let’s look at the results for a standard count-up for-loop: // Code: 8. Cycles: 46. void fna(void) { unsigned char i; for (i = 0; i < 10; i++) { _nop_(); } }

The above requires two more code bytes and an extra cycle per loop iteration. This is an interesting result, one in which the standard construct doesn’t produce the optimal code. This result comes about because the 8051 has a DJNZ (Decrement and Jump if Not Zero) instruction that is designed for efficient looping. However, it requires the loop variable to count down–hence the for-loop that counts down gives the best performance.

Note, however, that the for-loop that counts down is usually problematic if the loop variable is used within the body of the loop. For example, if we wanted to set the elements of an array to some value, we’d have a problem because the values of the looping variable would be 10 to 1 instead of 9 to 0. If you need to use the looping variable within the body of the loop (which is the more common case), then the standard count-up for-loop construct gives the best results.

It’s also worth noting that if you were to use a 16-bit variable as the loop index, the optimal construct would probably change. Since I almost never have to use a 16-bit variable as a loop index, I’ve decided to leave this case as an exercise for the reader. If anyone does try it, I’d be interested in the results.

Function calling

Calling a function incurs overhead. On modern processors, this typically consists of pushing the required parameters onto a stack and then executing a call. The called function pops the required parameters off of the stack, performs the required work, and then returns. For many of the modern processors, the penalty is not so much the pushing and popping of parameters, as it is the fact that a function call often results in a cache miss and the attendant system slow down.

In the 8051 world, however, parameters are usually passed via registers. (It’s horribly inefficient to use the compiler’s reentrant model with parameters pushed onto an external stack. Consequently, it should never be used except in a case of true need.) However, the 8051 only has a few registers to use to pass parameters. Keil’s rules for parameter passing appear in Table 1 (at

There are several things to note:

1. At most, only three parameters will be passed in registers

2. The order in which arguments are declared affects how many bytes are passed in registers

3. At most, only one long or float and only one generic pointer can be passed in registers

4. There is no provision for passing other data types such as a structure or a union in registers

Let’s look at the implications of each of these limitations.

Three parameter limit

Passing lots of parameters to a function on an 8-bit machine is not a great idea. Here are some things you can try to get around this limitation:

  • Group data into structures and pass a pointer to the structure
  • If you need to pass four chars, you can group them as a union of four chars and one long and then pass the long (not pretty, but it works)

Parameter ordering

Here’s a limitation that often surprises people. Consider a function foo() that requires a char, an int, and a generic pointer to be passed to it.

void foo(char a, int b, int *c);

This will result in all six bytes (one for a, two for b, and three for c) being passed in registers. However, rearranging the parameter order to:

void bar(int *c, int b, char a);

will result in only five bytes being passed in registers, with parameter a being passed in a fixed memory location. This results in a significant performance hit.

The code to demonstrate this is in Listing 1 (at, which displays Test Case 1 of the function calling project. The first method requires three fewer cycles, three fewer code bytes, and six fewer data bytes than the second method. If this function is called often, the overall savings can be significant.

The bottom line is this: if you have a function that is called repeatedly, it is worthwhile to check the parameter order to make sure as many parameters as possible are passed in registers.

Passing pointers

Table 1 shows that only one generic pointer can be passed in registers, whereas up to three memory-space-specific pointers may be passed in registers. This is an unsubtle hint not to use generic pointers. To give you an indication of how severe the penalty is, consider Listing 2, which shows the difference between passing two generic pointers and passing two typed pointers.

The casts in foo() are intended to eliminate the time difference associated with the expression evaluation (the Keil compiler is smart enough to eliminate the overhead associated with the casts). Calling foo() required considerably more resources than calling bar(). This can all be attributed to the fact that bar() required more bytes (six versus four) to be passed, and that three of the bytes were passed in fixed memory locations. The obvious conclusion is to look very hard at any function that requires the passing of more than one generic pointer.

Passing other data types

The register argument rules are quite rigid. If you have the temerity to try and pass a bit or a structure as an argument (even if the structure consumes just two bytes), that parameter and all parameters to the right of it on the argument list are not passed in registers. Thus:

void foo(char a, int b, int *c, bit d);

will result in the first three parameters being passed in registers. However,

void bar(bit d, char a, int b, int *c);

will result in none of the parameters being passed in registers. The code to demonstrate this is Test Case 3 of the function calling project. The performance difference is significant and is shown in the Table 2.

Similarly, the following code:

typedef struct
    char e;
    char f;


void foo(STYPE a, int b);

also results in none of the parameters being passed in registers, since STYPE defines an unsupported type. The code to demonstrate this is in Listing 3.

In this case, foo() uses a massive amount of resources compared to bar(). Now admittedly, this has a lot to do with the compiler calling a function to copy the structure–rather than being smart enough to realize the structure is so small that it could easily fit into the available registers. However, I think it demonstrates that passing structures to a function is a really bad idea.

On a related note, I have also discovered that in certain circumstances, even when all the parameters may be passed in registers, changing the ordering of parameters sometimes makes a difference. Unfortunately, I am unable to come up with a rule of thumb to guide you on this one. All I can do is suggest you experiment on critical functions.

Return types

No discussion of function calling would be complete without mention of the impact of the data type that is returned. The Keil documentation shows that all basic types (char, int, float, *) will be returned in registers. A bit will be returned in the Carry flag. However, they make no mention of what happens if you try and return a structure. We’ll examine the case where you wish to get a copy of the data structure. Listing 4 examines this topic.

Thus, the simpler construct produced marginally better code. This is the first case of many where simplicity produces the better results.

As an aside, this example again demonstrates the inefficiency of the compiler’s data structure copy routine. For a trivial case like this, it would of course be better to return a pointer and perform an explicit element-by-element copy. I just find it surprising that the compiler doesn’t contain some form of heuristic algorithm for determining the optimal solution.

Function calling summary

Take a hard look at all of your function prototypes. The following changes will usually result in worthwhile performance increases:

1. Convert generic pointer arguments to typed pointer arguments

2. Make sure that any bit or structure arguments are passed last

3. Notwithstanding #2, don’t pass structures!

4. Minimize the number of arguments-but not at the expense of violating rule 3

5. Look at the ordering of your parameters to see if a reordering will result in all (or more) of the parameters being passed in registers

6. Experiment! If you have a function that is called regularly, experiment with the order of its parameters. You may be surprised at the gains you realize.

7. If you need to assign a structure from a function call, return the structure instead of a pointer to the structure.

Decision making

All programs are littered with if-else and switch-case constructs. On the face of it, there isn’t much one can do to change the system performance. However, certain questions that should be answered include:

  • Is the ternary operator more efficient than an if-else?
  • What’s the most efficient way to test something for being zero?
  • How does an index into an array of pointers to functions compare in performance to a switch statement?

The ternary operator

The ternary operator is terse, slightly cryptic and a favorite of mine. Examination of K&R shows that the authors use the ternary operator very liberally indeed. But does the ternary operator actually accomplish anything? Well, the answer with the Keil compiler is an emphatic no, since it produces larger code and longer execution time in every test case I tried. Listing 5 shows Test Case 1 of the decision making project.

The code in main() is designed to test the functions “both ways.” Other tests, involving bits, floats, and so on, all produced similar results. This is the second example where a simpler coding style produces better code.

Evaluating a variable for TRUE / FALSE

This issue seems to occur frequently. In a nutshell, we are interested in determining if a variable is zero (FALSE) or nonzero (TRUE). This seems so trivial that you wouldn’t imagine that performance would vary widely depending upon the constructs used. Think again!

Test Code 2 from the decision-making project is shown in Listing 6. In this case, I have three functions whose sole task is to test a byte variable and return 0 if it is zero, and 1 if it is not. The result is represented as a bit. The first function solves the problem by using the most straightforward representation. For simplicity and clarity, it is hard to beat. It also has the possible benefit of having equal execution time regardless of the value of the parameter. The second example is more obfuscated while giving no real benefit (a byte saved versus increased and variable execution times). The third function is either incredibly simple or ridiculously obtuse, depending on your perspective. It does, however, give dramatically better performance and comes about from an exceedingly neat trick by the compiler.

This is an interesting example, since I think it demonstrates two points:

As we have seen in other examples, the compiler seems to generate better code when simpler constructs are used. Hence example 1 is better than example 2.

Example 3 gives the best result because it makes use of a type cast. Type casts occur so frequently that I suspect the compiler writers have put a lot of effort into generating optimal coding constructs for them.


From the preceding examples, we can come up with a few rules of thumb for decision-making:

  • Avoid the use of the ternary operator
  • When testing for TRUE / FALSE, omit all explicit tests and rely on the compiler to cast the variable to a Boolean

Calling one of N functions

Though this is a form of decision-making, the problem of calling one of N functions is interesting and complex enough to warrant its own section. Essentially, we wish to solve the problem of calling one of N functions based on some index value. There are three general approaches to this problem:

  • if-else chain
  • switch statement
  • Array of pointers to functions

If you are unfamiliar with arrays of pointers to functions, you can read my article “Arrays of Pointers to Functions” to find out more.

The test project (Pointer to Function) contains four test cases. In all cases, we are calling one of N functions, where the functions fna() … fnk() simply return a … k respectively.

In this case, we wish to call one of five functions based on an index whose range of legal values is 0…4. This is a small enough number of choices that any of the three techniques is reasonable. The results of the three techniques are summarized in Table 3.

Note that index 6 is an illegal value, and hence tests how the technique handles this problem. There are a number of interesting results here:

  • The else-if and switch techniques have identical code size. However, examination of the assembly language shows that completely different techniques are used.
  • The else-if construct execution time increases with the value of the index. This seems reasonable, since we first test for 0, then for 1 and so on.
  • The code is considerably more compact than the other two techniques.
  • The switch construct execution time decreases with the value of the index. This comes about from a rather elegant code construct that the compiler generates. However, I’d consider this counter-intuitive.
  • The pointer to function code execution time is constant for all valid indices.
  • The pointer to function code execution time is considerably longer than the other techniques. However, a large portion of this is due to the limit checking on the index. If this can be eliminated (because the index is guaranteed to be within a valid range), then the penalty is less significant.

If the number of elements in the switch statement exceeds six, the Keil compiler uses a different technique for evaluating the switch statement. Test Case 2 examines this point.

Test Case 2 is identical to Test Case 1, except that the list of possible functions is extended to 11. I suspect that most people would eschew an else-if chain at this point, and so I have dropped it from the test.

The results of the two remaining techniques are summarized in Table 4.

Note that index 20 is an illegal value, and hence tests how the technique handles this problem.

Again, there are a number of interesting results.

  • The switch statement now produces consistent execution time (except for the default case).
  • The code size of the switch statement is significantly larger than the pointer to function approach.
  • The switch statement still has a significant edge in execution time. Again, if the limit checking code can be removed then this difference becomes much smaller.
  • Test Case 3 reproduces Test Case 2, except the indices are offset. The results are basically the same as Test Case 2, showing that the compiler is smart enough to handle nonzero-based contiguous indices. However, what happens when the indices are non-uniformly spaced? Test Case 4 addresses this question

In Test Case 4, we still have 11 possible function calls. However, the indices for the functions are non-uniformly spaced. For the switch-based code, we simply fill in the requisite values as shown below:

char switch_fn(unsigned char i)
    char res; 

      case 0x30:
        res = fna();

      case 0x35:
        res = fnb();

      case 0x36:
        res = fnc();

      case 0x39:
        res = fnd();

      case 0x3C:
        res = fne();

      case 0x41:
        res = fnf();

      case 0x42:
        res = fng();

      case 0x48:
        res = fnh();

      case 0x4A:
         res = fni();

      case 0x4D:
        res = fnj();

      case 0x4E:
        res = fnk();

        res = 0;

    return res;

For the pointer to function approach, we fill in the gaps with calls to a default function; in other words, a function that returns the default value of zero. Thus, the executable code looks the same. It’s just the lookup table that has increased in size. The code to do this follows:

char ptof_fn(unsigned char i)
    static char (code * code pf[])(void) = {
        fna, fnDefault, fnDefault, fnDefault,
        fnDefault, fnb, fnc, fnDefault,
        fnDefault, fnd, fnDefault, fnDefault,
        fne, fnDefault, fnDefault, fnDefault,
        fnDefault, fnf, fng, fnDefault,
        fnDefault, fnDefault, fnDefault, fnDefault,
        fnh, fnDefault, fni, fnDefault, fnDefault, fnj, fnk

    char res;

    if ((i >= 0x30) && (i <= 0x4E))
        res = pf[i-0x30]();
        res = 0;

    return res;

The results of this test are shown in Table 5.

Note that index 49 is an illegal value, and hence tests how the technique handles this problem.

Again, there are a number of really interesting results here:

  • The switch statement code now produces variable execution times
  • The execution time of the switch increases significantly with increasing index
  • The default value execution time of the switch is dramatically increased, which would be problematic if the default case was the most common
  • The switch footprint is considerably larger than the function pointer approach. However, much of this is attributable to library code, which presumably would be reused in other switch evaluations.


From the preceding examples, we can come up with a few rules of thumb for calling one of N functions:

  • If consistent execution time is important, use an array of function pointers. But if the number of indices is greater than six and the indices are contiguous, use a switch statement.
  • If code size is important, always use an array of function pointers.
  • If average execution time is of paramount concern, then you have to do some analysis:
    • For six or fewer choices, use an if-else chain.
    • For more than six choices, where the choices are contiguous, use a switch statement.
    • For all other cases, use an array of function pointers.

If you have a situation where the number of choices may be altered over time, and especially if the choices may become noncontiguous, then use an array of function pointers. This will prevent big changes in performance from small changes in code.

Pointer manipulation

Pointers are an integral part of most C programs. In addition to dereferencing them, it is common to increment or decrement them within the body of a loop. Anything that is regularly performed within a loop is worth looking at carefully to see if any optimizations are possible. To this end, I created the pointer manipulation project. This project attempts to answer the following questions:

  • How does combining the pointer manipulation with the dereference affect performance?
  • Is there any difference between incrementing and decrementing a pointer?
  • What effect does the memory space have on these issues?

The answers to these questions are not intuitively obvious because the 8051 has some weird characteristics:

1. When accessing data space via a pointer register (for example, R0), the register can be incremented or decremented with equal ease.

2. In contrast, when accessing external data space via the DPTR register, the instruction set only defines an increment operation. Thus, we might expect pointer decrements to be more problematic in this case.

The pointer manipulation project contains a dozen functions that all perform the equivalent of memset(). In each case we are trying to set the contents of a 20-byte array to a specified value.

Function calls att1(), att2(), and att3() examine the optimal way to combine a data space pointer dereference with a pointer increment. The results are summarized in Table 6, and are very interesting. There is a massive penalty to be paid for combining the pointer increment with the dereference as done in att1(). Surely, I thought to myself, this must be some manifestation of dealing with internal data space pointers? So I performed the same test using external data space pointers. Function calls att7(), att8(), and att9() were the test code. The results are summarized in Table 7. As you can see, we get the same basic result. This was quite a shock to me, since my coding style preference is to use *ptr++.

I know that there is a school of thought in which my preferred way of doing things is bad, as I’m doing too much in one line of code. Well, for those of you who subscribe to this school, here’s evidence that (at least for this compiler) your way of doing things is also more efficient.

It’s worth commenting on the fact that the pre-increment (att9()) offers a one cycle saving over the post-increment (att8()). This was also a surprise to me. For those of you who don’t like pre-increments (because it looks strange), go ahead and use the post-increment, since the penalty is miniscule. For those of you who absolutely have to have the best performance, use the pre-increment.

Here’s att9()’s implementation:

att9(unsigned char xdata *ptr,
    unsigned char val, unsigned char len)
    unsigned char i;

    for (i = len; i; i--)
        *ptr = val; ++ptr;

Note the use of the count-down for-loop to achieve optimal looping time.

Having obtained such a surprising result, I was obviously intrigued about other pointer manipulations. How does pointer incrementing compare to pointer decrementing?

Functions att4(), att5(), and att6() are identical to att1() … att3() except that we start at the top of the array and decrement the pointer. The results were identical. This was expected since the index register for data space access can be as easily decremented as incremented.

The same is not true for external data space access, however, where there is no provision for decrementing the DPTR register. Functions att10() … att12() are the equivalents of att7() … att9(). Once again the results were surprising; they’re summarized in Table 8.

In the case where we combine the pointer increment/decrement with the dereference, (att7() and att10()), we are better off using a pointer decrement. In contrast, where we split the instructions apart there is a significant penalty to be paid for decrementing the pointer. Finally, note that the pre-decrement also holds a one cycle advantage over the post-decrement, just as for the increment case.

A reasonable question to ask is “Do these results hold when we are pointing to anything else other than a character?” I don’t know. Most of my 8051 pointer manipulation seems to occur on communication buffers, so I haven’t been motivated to find out the answer to this question.

As a final note on this topic, I also compared calling memset() to the optimal solutions. memset() was considerably faster, but required more code space. The copying memory project that we look at next examines this issue in more detail.

Copying memory

Copying one area of memory to another is a common requirement. Indeed, it is so common that ANSI defined memcpy() to do this for us. In just about every environment, one would expect that memcpy() would be the fastest way to copy memory. However, this isn’t always the case with the 8051. The reason is a combination of the ANSI standard, the 8051 architecture, and implementation details of the compiler.

Let’s start with our friends at ANSI. First, the memcpy() library routine is required to return a pointer. Although I understand the rationale for this, I almost never use the result of a call to memcpy(). Thus the compiler writers are forced to generate and execute code that is of no interest to me. Secondly, memcpy() is required to take a length argument that is a signed int (why anyone would want to pass a negative length to memcpy() is beyond my comprehension). Signed integers carry a heavy penalty on the 8051. This would be OK, except that on most of my 8051 projects, I normally need to copy much less than 256 bytes, so that the length parameter can be an unsigned char. Thus, ANSI forces me to pass two bytes to memcpy() where one byte would do fine. In summary, ANSI handicaps the memcpy() writer in three ways:

  • Requires a pointer to be returned
  • Requires the length parameter to be an integer, resulting in more bytes being passed than is normally necessary
  • Requires the length parameter to be an integer, resulting in the looping variable being unnecessarily large

Of course, if a particular invocation does require copying more than 256 bytes, then the last two complaints are moot.

The length issue is nothing compared to the problems caused by the 8051. The 8051 has four different address spaces that make sense from a memcpy() perspective (using the Keil nomenclature: IDATA, PDATA, XDATA, and CODE). Three of these are legal destination spaces, while all four are legal as the source of the data. Thus, there are 12 possible memory space combinations that have to be supported. To accommodate this, Keil defines memcpy() to take two generic pointers. For those of you that are really paying attention, you’ll remember that the Keil parameter passing rules don’t allow two generic pointers to be passed in registers. However, examination of the memcpy() assembly language shows that Keil doesn’t follow its own rules. In fact it passes both pointers and the length integer in registers. (The cynic in me wants to know why they are allowed to do this and I’m not.)

Anyway, at this point, the memcpy() routine has two generic pointers it has to deal with. Keil’s solution is to determine at run time the memory spaces that the pointers point to and to call the appropriate function. I’m a little surprised that Keil cannot determine this information at compile time and simply call the requisite routine. However, I don’t write compilers for a living, so I’m sure there is a good reason.

To summarize, the memcpy() routine has the following handicaps:

  • Needs eight bytes to be passed in registers
  • Uses a 16-bit looping index-even when eight bits will do
  • Determines at run time the memory spaces to use
  • Has to return a value

Against these, the compiler has two advantages:

  • Hand optimized assembly language
  • If the 8051 derivative has two DPTR registers (a common extension to many 8051 families), the function can make use of this to dramatically improve performance for XDATA-XDATA or CODE-XDATA moves

Based on this information, it would seem reasonable that memcpy() may not be the optimal solution in all cases. The memcpy project was put together to examine this very issue. The project contains 12 tests. Some of these use memcpy(), while others use variations on for loops and array/pointer manipulations. These 12 tests are not exhaustive, but do serve to demonstrate a reasonable pattern. Tests from code space to data space are not shown, since this doesn’t seem to occur very frequently.

The detailed results are given in Table 9. A few comments are needed on some of the columns.

># DPTR registers: This specifies how many DPTR registers I told the compiler were available. The only time this is relevant is when doing an XDATA – XDATA (or CODE – XDATA) copy using memcpy(). In all other cases, the compiler simply ignores the availability of a second DPTR register.

Implementation: This gives a very brief summary of the basic technique used to achieve the result. For-loop/arrays means that the code looked something like this:

for (i = 0; i < SIZE; i++)
    to[i] = ex[i];

For-loop/pointers means that the code is based on constructs similar to:

for (i = 0, toptr = to, exptr = ex; i++)
    *toptr++ = *exptr++;

You’ll have to look at the source code to get the definitive statement on each implementation.

Times: The 1-byte and 11-byte times allow me to compute the setup time (the time taken to set up all the looping constructs) together with the per byte time (the delta time over the setup time for each additional byte to be copied). This information allows me to compute the break-even point for comparing various techniques.

Several conclusions may be drawn from these results:

1. The memcpy() function requires about 56 to 64 cycles of setup time. This is a lot, and is attributable to the number of registers to be loaded and the fact that the memory spaces are determined at run time.

2. The most efficient method of coding one’s own copying data routine seems to be the most straightforward, namely a for-loop with arrays (test 8). This is another example of where the simplest coding construct works best. It’s also notable that the array-based code is faster than pointer-based code. So much for the adage of “pointers are faster than array subscripting.”

3. If we compare the for-loop to memcpy(), the break-even point occurs around seven bytes in most cases. So if you have to copy more than seven bytes, use memcpy(). Otherwise use a for-loop.

4. The big exception is when doing XDATA-XDATA moves on an 8051 with a single DPTR register. In this case, we are comparing the results of test 4a with test 8. Here memcpy() is always slower than the for-loop. Of course, if you need to copy more than 256 bytes, this result doesn’t hold. Notwithstanding this, I still find this result pretty amazing.

If speed is what you need, then use these results. However, if you are copying blocks of data around as part of system initialization, I urge you to consider code size, readability and maintainability before making your decision. Indeed, if code size is your prime consideration, you may want to avoid memcpy(), even if you use it in multiple places. The reason for this is quite subtle. It takes about 19 bytes of code to call memcpy() (the compiler has to load eight registers). In addition, the actual library code is 246 bytes. By comparison, the code size for Test 5 is only 16 bytes, while the code size for Test 8 is 30 bytes. You’d have to duplicate this code numerous times to equal the code required by memcpy.

Bit banging

By way of a conclusion, I thought we’d take a look at a reasonably realistic example that allows me to demonstrate a few of the techniques that have been discussed. The example consists of an external peripheral–perhaps a 16-bit digital to analog converter, which requires 24 bits of command and data to be clocked into it.

The bottom line: fancy coding tricks are no substitute for good design.

Source code

The code is available at, where you’ll also find all of the other test code referenced in this article. In the final project, there are eight different C programs and one assembly language program. There is a 16:1 variation in speed and a significant variation in code size. However, if you bother to examine the code, you’ll find that the biggest improvement in speed came from a change in algorithm. The techniques described here are the icing on the cake.

This article was published in the October 2002 issue of Embedded Systems Programming. If you wish to cite the article in your own work, you may find the following MLA-style information helpful:

Jones, Nigel. “‘Optimal’ C Constructs for 8051 Microprocessors,” Embedded Systems Programming, October 2002

How to Use C’s Volatile Keyword

Thursday, December 17th, 2009 Nigel Jones

Also available in PDF version.

The proper use of C’s volatile keyword is poorly understood by many programmers. This is not surprising, as most C texts dismiss it in a sentence or two. This article will teach you the proper way to do it.

Have you experienced any of the following in your C or C++ embedded code?

  • Code that works fine—until you enable compiler optimizations
  • Code that works fine—until interrupts are enabled
  • Flaky hardware drivers
  • RTOS tasks that work fine in isolation–until some other task is spawned

If you answered yes to any of the above, it’s likely that you didn’t use the C keyword volatile. You aren’t alone. The use of volatile is poorly understood by many programmers. Unfortunately, most books about the C programming language dismiss volatile in a sentence or two.

C’s volatile keyword is a qualifier that is applied to a variable when it is declared. It tells the compiler that the value of the variable may change at any time–without any action being taken by the code the compiler finds nearby. The implications of this are quite serious. However, before we examine them, let’s take a look at the syntax.

volatile keyword syntax

To declare a variable volatile, include the keyword volatile before or after the data type in the variable definition. For instance both of these declarations will declare foo to be a volatile integer:

volatile int foo;
int volatile foo;

Now, it turns out that pointers to volatile variables are very common, especially with memory-mapped I/O registers. Both of these declarations declare pReg to be a pointer to a volatile unsigned 8-bit integer:

volatile uint8_t * pReg;
uint8_t volatile * pReg;

Volatile pointers to non-volatile data are very rare (I think I’ve used them once), but I’d better go ahead and give you the syntax:

int * volatile p;

And just for completeness, if you really must have a volatile pointer to a volatile variable, you’d write:

int volatile * volatile p;

Incidentally, for a great explanation of why you have a choice of where to place volatile and why you should place it after the data type (for example, int volatile * foo), read Dan Sak’s column “Top-Level cv-Qualifiers in Function Parameters” (Embedded Systems Programming, February 2000, p. 63).

Finally, if you apply volatile to a struct or union, the entire contents of the struct/union are volatile. If you don’t want this behavior, you can apply the volatile qualifier to the individual members of the struct/union.

Proper use of volatile

A variable should be declared volatile whenever its value could change unexpectedly. In practice, only three types of variables could change:

1. Memory-mapped peripheral registers

2. Global variables modified by an interrupt service routine

3. Global variables accessed by multiple tasks within a multi-threaded application

We’ll talk about each of these cases in the sections that follow.

Peripheral registers

Embedded systems contain real hardware, usually with sophisticated peripherals. These peripherals contain registers whose values may change asynchronously to the program flow. As a very simple example, consider an 8-bit status register that is memory mapped at address 0x1234. It is required that you poll the status register until it becomes non-zero. The naive and incorrect implementation is as follows:

uint8_t * pReg = (uint8_t *) 0x1234;

// Wait for register to become non-zero
while (*pReg == 0) { } // Do something else

This will almost certainly fail as soon as you turn compiler optimization on, since the compiler will generate assembly language that looks something like this:

  mov ptr, #0x1234 mov a, @ptr

  bz loop

The rationale of the optimizer is quite simple: having already read the variable’s value into the accumulator (on the second line of assembly), there is no need to reread it, since the value will always be the same. Thus, in the third line, we end up with an infinite loop. To force the compiler to do what we want, we modify the declaration to:

uint8_t volatile * pReg = (uint8_t volatile *) 0x1234;

The assembly language now looks like this:

  mov ptr, #0x1234

  mov a, @ptr
  bz loop

The desired behavior is achieved.

Subtler problems tend to arise with registers that have special properties. For instance, a lot of peripherals contain registers that are cleared simply by reading them. Extra (or fewer) reads than you are intending can cause quite unexpected results in these cases.

Interrupt service routines

Interrupt service routines often set variables that are tested in mainline code. For example, a serial port interrupt may test each received character to see if it is an ETX character (presumably signifying the end of a message). If the character is an ETX, the ISR might set a global flag. An incorrect implementation of this might be:

int etx_rcvd = FALSE;

void main()
    while (!ext_rcvd)
        // Wait

interrupt void rx_isr(void)
    if (ETX == rx_char)
        etx_rcvd = TRUE;

With compiler optimization turned off, this code might work. However, any half decent optimizer will “break” the code. The problem is that the compiler has no idea that etx_rcvd can be changed within an ISR. As far as the compiler is concerned, the expression !ext_rcvd is always true, and, therefore, you can never exit the while loop. Consequently, all the code after the while loop may simply be removed by the optimizer. If you are lucky, your compiler will warn you about this. If you are unlucky (or you haven’t yet learned to take compiler warnings seriously), your code will fail miserably. Naturally, the blame will be placed on a “lousy optimizer.”

The solution is to declare the variable etx_rcvd to be volatile. Then all of your problems (well, some of them anyway) will disappear.

Multi-threaded applications

Despite the presence of queues, pipes, and other scheduler-aware communications mechanisms in real-time operating systems, it is still fairly common for two tasks to exchange information via a shared memory location (that is, a global). Even as you add a preemptive scheduler to your code, your compiler has no idea what a context switch is or when one might occur. Thus, another task modifying a shared global is conceptually identical to the problem of interrupt service routines discussed previously. So all shared global variables should be declared volatile. For example, this is asking for trouble:

int cntr;

void task1(void)
    cntr = 0; 

    while (cntr == 0)

void task2(void)

This code will likely fail once the compiler’s optimizer is enabled. Declaring cntr to be volatile is the proper way to solve the problem.

Final thoughts

Some compilers allow you to implicitly declare all variables as volatile. Resist this temptation, since it is essentially a substitute for thought. It also leads to potentially less efficient code.

Also, resist the temptation to blame the optimizer or turn it off. Modern optimizers are so good that I cannot remember the last time I came across an optimization bug. In contrast, I come across failures by programmers to use volatile with depressing frequency.

If you are given a piece of flaky code to “fix,” perform a grep for volatile. If grep comes up empty, the examples given here are probably good places to start looking for problems.

This article was published in the July 2001 issue of Embedded Systems Programming. If you wish to cite the article in your own work, you may find the following MLA-style information helpful:

Jones, Nigel. “Introduction to the Volatile Keyword” Embedded Systems Programming, July 2001