Archive for the ‘Efficient C/C++’ Category

Efficient C Tips #4 – Use Speed Optimization

Monday, September 8th, 2008 Nigel Jones

Back in July 2008 I promised that the next blog post would be on why you should use speed optimization instead of size optimization. Well four other posts somehow got in the way – for which I apologize. Anyway, onto the post!

In “Efficient C Tips #2” I made the case for always using full optimization on your released code. Back when I was a lad, the conventional wisdom when it came to optimization was to use the following algorithm:

1. Use size optimization by default
2. For those few pieces of code that get executed the most, use speed optimization.

This algorithm was based on the common observation that most code is executed infrequently and so in the grand scheme of things its execution time is irrelevant. Furthermore since memory is constrained and expensive, this code that is rarely executed should consume as little resource (i.e. memory) as possible. On first blush, this approach seems reasonable. However IMHO it was flawed back then and is definitely flawed now. Here is why:

1. In an embedded system, you typically are not sharing memory with other applications (unlike on a general purpose computer). Thus there are no prizes for using less than the available memory. Of course, if by using size optimization you can fit the application into a smaller memory device then use size optimization and use the smaller and cheaper part. However in my experience this rarely happens. Instead typically you have a system that comes with say 32K, 64K or 128K of Flash. If your application consumes 50K with speed optimization and 40K with size optimization, then you’ll still be using the 64K part and so size optimization has bought you nothing. Conversely, speed optimization will also cost you nothing – but your code will presumably run faster, and consume less power.

2. In an interesting quirk of optimization technology, it turns out that in some cases speed optimization can result in a smaller image than size optimization! It is almost never the case that the converse is true. See however this article that I wrote which discusses one possible exception. Thus even if you are memory constrained, try speed optimization.

3. Size optimization comes with a potentially very big downside. After a compiler has done all the usual optimizations (constant folding, strength reduction etc), a compiler that is set up to do size optimization will usually perform “common sub-expression elimination”. What this consists of is looking at the object code and identifying small blocks of assembly language that are used repeatedly throughout the application. These “common sub-expressions” are converted into sub routines. This process can be repeated ad nauseum such that one subroutine calls another which calls another and so on. As a result an innocuous looking piece of C code can be translated into a call tree that nests many levels deep – and there is the rub. Although this technique can dramatically reduce code size it comes at the price of increasing the call stack depth. Thus code that runs fine in debug mode may well suffer from a call stack overflow when you turn on size optimization. Speed optimization will not do this to you!

4. As I mentioned in “Efficient C Tips #2” one downside of optimization is that it can rearrange instruction sequences such that the special access requirements often needed by watchdogs, EEPROM etc are violated. In my experience, this only happens when one uses size optimization – and never with speed optimization. Note that I don’t advocate relying on this; it is however a bonus if you have forgotten to follow the advice I give in “Efficient C Tips #2” for these cases.

The bottom line – speed optimization is superior to size optimization. Now I just have to get the compiler vendors to select speed optimization by default!

Next Tip
Previous Tip

Efficient C Tips #3 – Avoiding post increment / decrement

Friday, August 1st, 2008 Nigel Jones

It always seems counter intuitive to me, but post increment / decrement operations in C / C++ often result in inefficient code, particularly when de-referencing pointers. For example

for (i = 0, ptr = buffer; i < 8; i++)
 *ptr++ = i;

This code snippet contains two post increment operations. With most compilers, you’ll get better code quality by re-writing it like this:

for (i = 0, ptr = buffer; i < 8; ++i)
 *ptr = i;

Why is this you ask? Well, the best explanation I’ve come across to date is this one on the IAR website:

Certainly taking the time to understand what’s going on is worthwhile. However, if it makes your head hurt then just remember to avoid post increment / decrement operations.

Incidentally, you may find that on your particular target it makes no difference. However, this is purely a result of the fact that your target processor directly supports the required addressing modes to make post increments efficient. If you are interested in writing code that is universally efficient, then avoid the use of post increment / decrement.

You may also wonder just how much this saves you. I’ve run some tests on various compilers / targets and have found that this coding style cuts the object code size down from zero to several percent. I’ve never seen it increase the code size. More to the point, in loops, using a pre-increment can save you a load / store operation per increment per loop iteration. These can add up to some serious time savings.

Next Tip
Previous Tip

Efficient C Tips #2 – Using the optimizer

Saturday, July 5th, 2008 Nigel Jones

In my first post on “Efficient C” I talked about how to use the optimal integer data type to achieve the best possible performance. In this post, I’ll talk about using the code optimization settings in your compiler to achieve further performance gains.

I assume that if you are reading this, then you are aware that compilers have optimization settings or switches. Invoking these settings usually has a dramatic effect on the size and speed of the compiled image. Typical results that I have observed over the years is a 40% reduction in code size and a halving of execution time for fully optimized versus non-optimized code. Despite these amazing numbers, I’d say about half of the code that I see (and I see a lot) is released to the field without full optimization turned on. When I ask developers about this, I typically get one of the following explanations:

1. I forgot to turn the optimizer on.
2. The code works fine as is, so why bother optimizing it?
3. When I turned the optimizer on, the code stopped working.

The first answer is symptomatic of a developer that is just careless. I can guarantee that the released code will have a lot of problems!

The second answer on the face of it has some merit. It’s the classic “if it aint broke don’t fix it” argument. However, notwithstanding that it means that your code will take longer to execute and thus almost certainly consume more energy (see my previous post on “Embedded Systems and the Environment”), it also means that there are potential problems lurking in your code. I address this issue below.

The third answer is of course the most interesting. You have a “perfectly good” piece of code that is functioning just fine, yet when you turn the optimizer on, the code stops working. Whenever this happens, the developer blames the “stupid compiler” and moves on. Well, after having this happen to me a fair number of times over my career, I’d say that the chances that the compiler is to blame are less than 1 in 10. The real culprit is normally the developer’s poor understanding of the rules of the programming language and how compilers work.

Typically when a compiler is set up to do no optimization, it generates object code for each line of source code in the order in which the code is encountered and then simply stitches the result together (for the compiler aficionados out there I know it’s more involved than this – but it serves my point). As a result, code is executed in the order in which you write it, constants are tested to see if they have changed, variables are stored to memory and then immediately loaded back into registers, invariant code is repeatedly executed within loops, all the registers in the CPU are stacked in an ISR and so on.

Now, when the optimizer is turned on, the optimizer rearranges code execution order, looks for constant expressions, redundant stores, common sub-expressions, unused registers and so on and eliminates everything that it perceives to be unnecessary. And therein dear reader lies the source of most of the problems. What the compiler perceives as unnecessary, the coder thinks is essential – and indeed is relying upon the “unnecessary” code to be executed.

So what’s to be done about this? Firstly, you have to understand what the key word volatile means and does. Even if you think you understand volatile, go and read this article I wrote a number of years back for Embedded Systems Programming magazine. I’d say that well over half of the optimization problems out there relate to failure to use volatile correctly.

The second problematic area concerns specialized protective hardware such as watchdogs. In an effort to make inadvertent modification of certain registers less likely, the CPU manufacturers insist upon a certain set of instructions being executed in order within a certain time. An optimizer can often break these specialized sequences. In which case, the best bet is to put the specialized sequences into their own function and then use the appropriate #pragma directive to disable optimization of that function.

Now what to do if you are absolutely sure that you are using volatile appropriately and correctly and that specialized coding sequences have been protected as suggested, yet your code still does not work when the optimizer is turned on? The next thing to look for are software timing sequences, either explicit or implicit. The explicit timing sequences are things such as software delay loops, and are easy to spot. The implicit ones are a bit tougher and typically arise when you are doing something like bit-banging a peripheral, where the instruction cycle time implicitly acts as a setup or hold time for the hardware being addressed.

OK, what if you’ve checked for software timing and things still don’t work? In my experience you are now in to what I’ll call the “Suspect Code / Suspect Compiler (SCSC)” environment. With an SCSC problem, the chances are you’ve written some very complex, convoluted code. With this type of code, two things can happen:

1. You are working in a grey area of the language (i.e. an area where the behavior is not well specified by the standard). Your best defense against this is to use Lint from Gimpel. Lint will find all your questionable coding constructs. Once you have fixed them, you’ll probably find your optimization problems have gone away.
2. The optimizer is genuinely getting confused. Although this is regrettable, the real blame may lie with you for writing knarly code. The bottom line in my experience is that optimizers work best on simple code. Of course, if you have written simple code and the optimizer is getting it wrong, then do everyone a favor and report it to the compiler vendor.

In my next post I’ll take on the size / speed dichotomy and make the case for using speed rather than size as the “usual” optimization method.

Next Tip

Previous Tip


Efficient C Tips #1 – Choosing the correct integer size

Sunday, June 15th, 2008 Nigel Jones

From time to time I write articles for Embedded Systems Design magazine. A number of these articles have concentrated on how to write efficient C for an embedded target. Whenever I write these articles I always get emails from people asking me two questions:

1. How did you learn this stuff?
2. Is there somewhere I can go to learn more?

The answer to the first question is a bit long winded and consists of:
1. I read compiler manuals (yes, I do need a life).
2. I experiment.
3. Whenever I see a strange coding construct, I ask the author why they are doing it that way. From time to time I pick up some gems.
4. I think hard about what the compiler has to do in order to satisfy a particular coding construct. It’s really helpful if you know assembly language for this stage.

The answer to the second question is short: No!

To help rectify this, in my copious free time I’ll consider putting together a one day course on how to write efficient C for embedded systems. If this is of interest to you then please contact me .

In the interim, I’d like to offer up my first tip on how to choose the correct integer size.

In my experience in writing programs for both embedded systems and computers, I’d say that greater than 95% of all the integers used by those programs could fit into an 8 bit variable. The question is, what sort of integer should one use in order to make the code the most efficient? Most computer programmers who use C will be puzzled by this question. After all the data type ‘int’ is supposed to be an integer type that is at least 16 bits that represents the natural word length of the target system. Thus, one should simply use the ‘int’ data type.

In the embedded world, however, such a trite answer will quickly get you into trouble – for at least three reasons.
1. For 8 bit microcontrollers, the natural word length is 8 bits. However you can’t represent an ‘int’ data type in 8 bits and remain C99 compliant. Some compiler manufacturer’s eschew C99 compliance and make the ‘int’ type 8 bits (at least one PIC compiler does this), while others simply say we are compliant and if you are stupid enough to use an ‘int’ when another data type makes more sense then that’s your problem.
2. For some processors there is a difference between the natural word length of the CPU and the natural word length of the (external) memory bus. Thus the optimal integer type can actually depend upon where it is stored.
3. The ‘int’ data type is signed. Much, indeed most, of the embedded world is unsigned, and those of us that have worked in it for a long time have found that working with unsigned integers is a lot faster and a lot safer than working with signed integers, or even worse a mix of signed and unsigned integers. (I’ll make this the subject of another blog post).

Thus the bottom line is that using the ‘int’ data type can get you into a world of trouble. Most embedded programmers are aware of this, which is why when you look at embedded code, you’ll see a veritable maelstrom of user defined data types such as UINT8, INT32, WORD, DWORD etc. Although these should ensure that there is no ambiguity about the data type being used for a particular construct, it still doesn’t solve the problem about whether the data type is optimal or not. For example, consider the following simple code fragment for doing something 100 times:


for (i = 0; i < 100; i++)
 // Do something 100 times

Please ignore all other issues other than what data type should the loop variable ‘i’ be?Well evidently, it needs to be at least 8 bits wide and so we would appear to have a choice of 8,16,32 or even 64 bits as our underlying data type. Now if you are writing code for a particular CPU then you should know whether it is an 8, 16, 32 or 64 bit CPU and thus you could make your choice based on this factor alone. However, is a 16 bit integer always the best choice for a particular 16 bit CPU? And what about if you are trying to write portable code that is supposed to be used on a plethora of targets? Finally, what exactly do we mean by ‘optimal’ or ‘efficient’ code?I wrestled with these problems for many years before finally realizing that the C99 standards committee has solved this problem for us. Quite a few people now know that the C99 standard standardized the naming conventions for specific integer types (int8_t, uint8_t, int16_t etc). What isn’t so well known is that they also defined data types which are “minimum width” and also “fastest width”. To see if your compiler is C99 compliant, open up stdint.h. If it is compliant, as well as the uint8_t etc data types, you’ll also see at least two other sections – minimum width types and fastest minimum width types.

An example will help clarify the situation:

Fixed width unsigned 8 bit integer: uint8_t

Minimum width unsigned 8 bit integer: uint_least8_t

Fastest minimum width unsigned 8 bit integer: uint_fast8_t

Thus a uint8_t is guaranteed to be exactly 8 bits wide. A uint_least8_t is the smallest integer guaranteed to be at least 8 bits wide. An uint_fast8_t is the fastest integer guaranteed to be at least 8 bits wide. So we can now finally answer our question. If we are trying to consume the minimum amount of data memory, then our TBD_DATATYPE should be uint_least8_t. If we are trying to make our code run as fast as possible then we should use uint_fast8_t. Thus the bottom line is this. If you want to start writing efficient, portable embedded code, the first step you should take is start using the C99 data types ‘least’ and ‘fast’. If your compiler isn’t C99 compliant then complain until it is – or change vendors. If you make this change I think you’ll be pleasantly surprised at the improvements in code size and speed that you’ll achieve.

Next Tip


Integer Log functions

Sunday, May 11th, 2008 Nigel Jones

A few months ago I wrote about a very nifty square root function in Jack Crenshaw’s book “Math Toolkit for Real-time Programming”. As elegant as the square root function is, it pails in comparison to what Crenshaw calls his ‘bitlog’ function. This is some code that computes the log (to base 2 of course) of an integer – and does it in amazingly few cycles and with amazing accuracy. The code in the book is for a 32 bit integer; the code I present here is for a 16 bit integer. Although you are of course free to use this code as is, I strongly suggest you buy Crenshaw’s book and read about this function. You’ll see it truly is a work of art. BTW, one of the things I really like about Crenshaw is that he takes great pains to note that he didn’t invent this algorithm. Rather he credits Tom Lehman. Kudos to Lehman.

 FUNCTION: bitlog

 Computes 8 * (log(base 2)(x) -1).

 -    The uint16_t value whose log we desire

 -    An approximation to log(x)


uint16_t bitlog(uint16_t x)
    uint8_t    b;
    uint16_t res;

    if (x <=  8 ) /* Shorten computation for small numbers */
        res = 2 * x;
        b = 15; /* Find the highest non zero bit in the input argument */
        while ((b > 2) && ((int16_t)x > 0))
            x <<= 1;
        x &= 0x7000;
        x >>= 12;

        res = x + 8 * (b - 1);

    return res;