Archive for the ‘Efficient C/C++’ Category

Efficient C Tip #13 – use the modulus (%) operator with caution

Tuesday, February 8th, 2011 Nigel Jones

This is the thirteenth in a series of tips on writing efficient C for embedded systems.  As the title suggests, if you are interested in writing efficient C, you need to be cautious about using the modulus operator.  Why is this? Well a little thought shows that C = A % B is equivalent to C = A – B * (A / B). In other words the modulus operator is functionally equivalent to three operations. As a result it’s hardly surprising that code that uses the modulus operator can take a long time to execute. Now in some cases you absolutely have to use the modulus operator. However in many cases it’s possible to restructure the code such that the modulus operator is not needed. To demonstrate what I mean, some background information is in order as to how this blog posting came about.

Converting seconds to days, hours, minutes and seconds

In Embedded Systems Design there is an increasing need for some form of real time clock. When this is done, the designer typically implements the time as a 32 bit variable containing the number of seconds since a particular date. When this is done, it’s not usually long before one has to convert the ‘time’ into days, hours, minutes and seconds. Well I found myself in just such a situation recently. As a result, I thought a quick internet search was in order to find the ‘best’ way of converting ‘time’ to days, hours, minutes and seconds. The code I found wasn’t great and as usual was highly PC centric. I thus sat down to write my own code.

Attempt #1 – Using the modulus operator

My first attempt used the ‘obvious’ algorithm and employed the modulus operator. The relevant code fragment appears below.

void compute_time(uint32_t time)
{
 uint32_t    days, hours, minutes, seconds;

 seconds = time % 60UL;
 time /= 60UL;
 minutes = time % 60UL;
 time /= 60UL;
 hours = time % 24UL;
 time /= 24UL;
 days = time;  
}

This approach has a nice looking symmetry to it.  However, it contained three divisions and three modulus operations. I thus was rather concerned about its performance and so I measured its speed for three different architectures – AVR (8 bit), MSP430 (16 bit), and ARM Cortex (32 bit). In all three cases I used an IAR compiler with full speed optimization. The number of cycles quoted are for 10 invocations of the test code and include the test harness overhead:

AVR:  29,825 cycles

MSP430: 27,019 cycles

ARM Cortex: 390 cycles

No that isn’t a misprint. The ARM was nearly two orders of magnitude more cycle efficient than the MSP430 and AVR. Thus my claim that the modulus operator can be very inefficient is true for some architectures – but not all.  Thus if you are using the modulus operator on an ARM processor then it’s probably not worth worrying about. However if you are working on smaller processors then clearly something needs to be done  – and so I investigated some alternatives.

Attempt #2 – Replace the modulus operator

As mentioned in the introduction,  C = A % B is equivalent to C = A – B * (A / B). If we compare this to the code in attempt 1, then it should be apparent that the intermediate value (A/B) computed as part of the modulus operation is in fact needed in the next line of code. Thus this suggests a simple optimization to the algorithm.

void compute_time(uint32_t time)
{
 uint32_t    days, hours, minutes, seconds;

 days = time / (24UL * 3600UL);    
 time -= days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = time / 3600UL;
 time -= (hours * 3600UL);
 /* time now contains the number of seconds in the last hour */
 minutes = time / 60U;
 seconds = time - minutes * 60U;
 }

In this case I have replaced three mods with three subtractions and three multiplications. Thus although I have replaced a single operator (%) with two operations (- *) I still expect an increase in speed because the modulus operator is actually three operators in one (- * /).  Thus effectively I have eliminated three divisions and so I expected a significant improvement in speed. The results however were a little surprising:

AVR:  18,720 cycles

MSP430: 14,805 cycles

ARM Cortex: 384 cycles

Thus while this technique yielded a roughly order of two improvements for the AVR and MSP430 processors, it had essentially no impact on the ARM code.  Presumably this is because the ARM has native support for the modulus operation. Notwithstanding the ARM results, it’s clear that at least in this example, it’s possible to significantly speed up an algorithm by eliminating the modulus operator.

I could of course just stop at this point. However examination of attempt 2 shows that further optimizations are possible by observing that if seconds is a 32 bit variable, then days can be at most a 16 bit variable. Furthermore, hours, minutes and seconds are inherently limited to an 8 bit range. I thus recoded attempt 2 to use smaller data types.

Attempt #3 – Data type size reduction

My naive implementation of the code looked like this:

void compute_time(uint32_t time)
{
 uint16_t    days;
 uint8_t     hours, minutes, seconds;
 uint16_t    stime;

 days = (uint16_t)(time / (24UL * 3600UL));    
 time -= (uint32_t)days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = (uint8_t)(time / 3600UL);
 stime = time - ((uint32_t)hours * 3600UL);
 /*stime now contains the number of seconds in the last hour */
 minutes = stime / 60U;
 seconds = stime - minutes * 60U;
}

All I have done is change the data types and to add casts where appropriate. The results were interesting:

AVR:  14,400 cycles

MSP430: 11,457 cycles

ARM Cortex: 434 cycles

Thus while this resulted in a significant improvement for the AVR & MSP430, it resulted in a significant worsening for the ARM. Clearly the ARM doesn’t like working with non 32 bit variables. Thus this suggested an improvement that would make the code a lot more portable – and that is to use the C99 fast types. Doing this gives the following code:

Attempt #4 – Using the C99 fast data types

void display_time(uint32_t time)
{
 uint_fast16_t    days;
 uint_fast8_t    hours, minutes, seconds;
 uint_fast16_t    stime;

 days = (uint_fast16_t)(time / (24UL * 3600UL));    
 time -= (uint32_t)days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = (uint_fast8_t)(time / 3600UL);
 stime = time - ((uint32_t)hours * 3600UL);
 /*stime now contains the number of seconds in the last hour */
 minutes = stime / 60U;
 seconds = stime - minutes * 60U;
}

All I have done is change the data types to the C99 fast types. The results were encouraging:

AVR:  14,400 cycles

MSP430: 11,595 cycles

ARM Cortex: 384 cycles

Although the MSP430 time increased very slightly, the AVR and ARM stayed at their fastest speeds. Thus attempt #4 is both fast and portable.

Conclusion

Not only did replacing the modulus operator with alternative operations result in faster code, it also opened up the possibility for further optimizations. As a result with the AVR & MSP430 I was able to more than halve the execution time.

Converting Integers for Display

A similar problem (with a similar solution) occurs when one wants to display integers on a display. For example if you are using a custom LCD panel with say a 3 digit numeric field, then the problem arises as to how to determine the value of each digit. The obvious way, using the modulus operator is as follows:

void display_value(uint16_t value)
{
 uint8_t    msd, nsd, lsd;

 if (value > 999)
 {
 value = 999;
 }

 lsd = value % 10;
 value /= 10;
 nsd = value % 10;
 value /= 10;
 msd = value;

 /* Now display the digits */
}

However, using the technique espoused above, we can rewrite this much more efficiently as:

void display_value(uint16_t value)
{
 uint8_t    msd, nsd, lsd;

 if (value > 999U)
 {
  value = 999U;
 }

 msd = value / 100U;
 value -= msd * 100U;

 nsd = value / 10U;
 value -= nsd * 10U;

 lsd = value;

 /* Now display the digits */
}

If you benchmark this you should find it considerably faster than the modulus based approach.

Previous Tip

Efficient C Tip #12 – Be wary of switch statements

Saturday, April 10th, 2010 Nigel Jones

This is the twelfth in a series of tips on writing efficient C for embedded systems. Like the previous topic, I suspect that this will be a bit controversial. As the title suggests, if you are interested in writing efficient C, you need to be wary of switch statements. Before I explain why, a little background will be useful. I did all of my early embedded systems programming in assembly language. This wasn’t out of some sense of machismo, it was simply a reflection of the fact that there were no high level languages available (with the possible exception of PL/M). Naturally as well as programming embedded systems I also did computer programming, initially in Pascal and BASIC, and later in C. One of the major differences I found in using the HLL was the wonderful switch / case statement. I found it to be a beautiful tool – with a few lines of source code I could do all sorts of powerful things that were simply very difficult or tedious to do in assembly language. Fast forward a number of years and C compilers began to become available for small embedded systems and so I naturally started using them, together with of course the attendant switch statement. All was well in paradise until the day I used a switch statement in an interrupt service routine and found to my horror that the ISR was taking about ten times longer to execute than I thought was reasonable.

This precipitated an investigation into how exactly switch statements are implemented by the compiler. When I did this, I discovered a number of things that should give one pause.

Heuristic Algorithms

The first thing I discovered is that compilers typically have a number of ways of implementing a switch statement. They seem to be loosely divided into the following trichotomy:

  1. An if-else-if-else-if chain. In this implementation, the switch statement is treated as syntactic sugar for an if-else-if chain.
  2. Some form of jump or control tables, or as they are sometimes called a computed goto. This is a favorite technique of assembly language programmers and the compiler writers can use it to great effect.
  3. A hybrid of 1 & 2.

Where it gets interesting is how the compiler decides which approach to use. If the case values are contiguous (e.g. zero through ten), then it’s likely the compiler will use some form of jump table. Conversely if the case values are completely disjointed (e.g. zero, six, twenty, four hundred and a thousand) then an if-else implementation is likely. However what does the compiler do when, for example, you have a bifurcated set of ranges such as zero-ten and ninety – one hundred? Well the answer is, that each compiler seems to have some form of heuristic algorithm for determining what is the ‘best’ way of implementing a given set of cases. Although some compilers allow you to force a particular implementation, for the most part you are at the mercy of the compiler.

Comparative Execution Speeds

If you think about it, it should become apparent that a jump table approach is likely to give a highly consistent time of execution through the decision tree, whereas the if-else -if chain has a highly variable time of execution depending upon the particular value of the switched variable.  Notwithstanding this, the jump table approach has a certain amount of execution overhead associated with it. This means that although its  mean execution time (which is normally the same as its worst and best execution time) may be dramatically better than the mean execution time of the if-else-if chain, the if-else-if chain’s best execution time may be considerably better. So what you say! Well in some cases, a particular value is far more likely to occur than the other values, thus it would be very nice if this value was tested first. However, as you will now see, this isn’t guaranteed…

Order of Execution

For many years I wrote switch statements under the assumption that the case values would be evaluated from top to bottom. That is, if the compiler chose to implement the switch statement as an if-else-if chain, then it would first test the first case, then the second case and so on down to the default case at the bottom of my source code. Well it turns out that my assumption was completely wrong. The compiler is under no such obligation, and indeed will often evaluate the values bottom to top. Furthermore, the compiler will often evaluate the default value first. For example consider a defaulted switch statement with contiguous case values in the range zero to ten. If the index variable is an unsigned int, then there are at least 65525 possible values handled by the default case, and so it makes sense to eliminate them first. Now if you know that the index variable can only possibly take on the values zero to ten, then you can of course eliminate the default statement – and then get excoriated by the coding standards / MISRA folks.

Maintenance

This is the area where I really get worried. Consider the case where you have a switch statement in an ISR. The code is working with no problems until one day it is necessary to make a change to the switch statement – by for example adding an additional case value. This simple change can cause the compiler to completely change the implementation of the switch statement. As a result, you may find that:

  1. The worst case execution time has jumped dramatically.
  2. The mean execution time has jumped dramatically.
  3. The stack space required by the ISR has jumped dramatically.

Any of these three possibilities can cause your program to fail catastrophically. Now of course one could argue ‘that’s why you test all changes’. However, in my opinion it’s far better to be proactive and to avoid putting yourself in this situation in the first place.

I’d also be remiss in not noting the dreaded missing break statement maintenance problem. However as a religious user of Lint, I’m not normally too concerned about this.

Switch statement alternatives

If performance and stability is your goal then I strongly recommend that you implement your code, the way you want it executed. This means either explicitly use an if-else-if chain or use function pointers. If function pointers scare you, then you might want to read this article I wrote on the subject.

Recommendations

Based on my experience, I have a number of things that I do when it comes to switch statements. If you find my analysis compelling, you may want to adopt them:

  1. Switch statements should be the last language construct you reach for – and not the first.
  2. Learn how to use function pointers. Once you do you’ll find a lot of the reasons for using switch statements go away.
  3. Try to keep all the case values contiguous.
  4. If you can’t keep the case values contiguous, go to the other extreme and make them disparate – that way you are less likely to have the compiler change the algorithm on you.
  5. If your compiler supports it, consider using pragmas to lock in a particular implementation.
  6. Be very wary of using switch statements in interrupt service routines or any other performance critical code.
  7. Use Lint to guard against missing break statements.

Comments welcome!

Next Tip

Previous Tip

Home

Efficient C Tip #11 – Avoid passing parameters by using more small functions

Saturday, February 6th, 2010 Nigel Jones

This is the eleventh in a series of tips on writing efficient C for embedded systems. Today’s topic will, I suspect, be slightly controversial. This post is based upon two basic observations:

  1. Passing parameters to functions is costly.
  2. Conditional branch instructions can be very costly on CPUs that have instruction caches (even with branch prediction).

I don’t think that too many people will disagree with me on the above. Despite this I too often see a style of coding that incurs these costs unnecessarily. I think it’s best illustrated by a (real world) example. The issue is one that will be familiar to most of you.  An embedded system contains a number of discrete LEDs (say 3), and the requirement is to write some code to allow higher level code to either turn on, turn off, or toggle a particular LED. The way I often see this coded is as follows:

typedef enum
{
 LED1, LED2, LED3
} LED_NO;
typedef enum
{
 LED_OFF, LED_ON, LED_TOGGLE
} LED_ACTION;
void led(LED_NO led_no, LED_ACTION led_action)
{
 switch (led_no)
 {
  case LED1:
   switch (led_action)
   {
    case LED_OFF:
     PORTB_PORTB0 = 0;
     break;
    case LED_ON:
     PORTB_PORTB0 = 1;
     break;
    case LED_TOGGLE:
     PORTB_PORTB0 ^= 1;
     break;
    default:
     break;
   }
  break;
 case LED2:
  ...
}

So what’s wrong with this you ask? Well in a nutshell the parameters passed to the function are used strictly to control the order of execution. There is no code common to any pair or group of parameters. When faced with a situation such as this, I instead implement the code as a large number of very small functions. For example:

void led1_Off(void)
{
 PORTB_PORTB1 = 0;
}
void led1_On(void)
{
 PORTB_PORTB1 = 1;
}
void led1_Toggle(void)
{
 PORTB_PORTB1 ^= 1;
}
...

Let’s compare the two approaches.

Efficiency

This blog posting is supposedly about efficiency, so let’s start with the results. I coded these two approaches up together with a main() function that exercised all 9 possible combination’s. I then turned full speed optimization on and looked at the results for an AVR processor.

Single function approach: 78 bytes for main(), 94 bytes for the LED code. Execution time 208 cycles.

Multiple function approach: 42 bytes for main(), 54 bytes for the LED code. Execution time 96 cycles.

Clearly my approach is significantly more efficient.

Usability

By usability I’m referring to the case where someone else needs to use your code. They know they need to say toggle LED2 so they hunt around and find the file led.h. The question is, once they have opened up led.h, how quickly can they determine what they have to do in order to toggle LED2? In the single function case they are presented with just one function (which is a plus), but then they have to locate the enumerations and work out the parameters that need to be passed to the function (which is a minus). In the multiple function case, they have to search through a list of functions looking for the correct one. However once they have found it, it’s very clear what the function does.

For me, I think it is a toss up between the two approaches as to which is more usable.

Maintainability

In this case the multiple function approach is the big winner. To see how this is, consider what happens to the single function case when one adds an LED or adds an action. The single function case just explodes in size, whereas with the multi-function approach one simply adds more very simple functions.

Conclusions

If you buy my analysis then clearly the multi-function approach is superior in both efficiency and maintainability – two areas that are dear to my heart. Now granted this is a fairly extreme example. However in my experience if you look through a reasonable amount of code you will soon discover a function that essentially does one thing or another based upon a function parameter. When you locate such a function you might want to try breaking it into two functions in the manner described here – I think you’ll be pleased with the results.

Next Tip

Previous Tip

****
As the readership of this blog has grown I must say I have been really impressed with the many insightful comments that have been posted. I know I learn a lot from them, and so I suspect, do a lot of the other readers. Thus for those of you that have commented in the past – thank you. For those of you yet to post a comment, I encourage you to take the plunge!

Home

Minimizing memory use in embedded systems Tip #3 – Don’t use printf()

Thursday, September 24th, 2009 Nigel Jones

This is the third in a series of tips on minimizing memory consumption in embedded systems.

If you are like me, the first C program you saw was K&R’s famous ‘hello, world’ code, reproduced below:

main()
{
 printf(“hello, world\n”);
}

In my opinion, this program has done incalculable harm to the realm of embedded systems programming! I appreciate that this is a rather extreme statement – but as is usual I have my reasons …

The interesting thing about this code is that it introduces printf() – and as such gives the impression that printf() is an important (and useful) part of the C language. Well I suppose it is / was for those programming computers. However for those programming embedded systems, printf() and its brethren (sprintf, vsprintf, scanf etc) are in general a disaster waiting to happen for the unwary. Here is why:

Code Size

The printf() functions are immensely sophisticated functions, and as such consume an incredible amount of code space. I have clear memories of an early 8051 compiler’s printf() function consuming 8K of code space (and this was at a time when an 8K program was a decent size). Since then, compiler vendors have put a lot of effort into addressing this issue. For example IAR allows you to specify the functionality (and hence size) of printf() as a library option. Notwithstanding this, if your available code space is less than 32K the chances are you really shouldn’t be using printf(). But what if you need some of the features of printf()? Well in that case I recommend you write your own formatting function. For example I often find that I have a small microcontroller project that needs to talk over a serial link using an ASCII protocol. In cases like these, the easy thing to do is to generate the requisite string using a complex format string with sprintf(). However, with a little bit of ingenuity you should be able to create the string using a series of calls to simple formatting routines. I can guarantee that you’ll end up with more compact code.

Stack Size

Barely a day goes by that someone doesn’t end up on this blog because they have a stack overflow caused by printf(), sprintf() or vsprintf(). Why is this? Well if you are ever feeling bored one day, try and write the printf() function. If you do, you’ll soon find that it is not only difficult, but also that it requires a large amount of space for the function arguments, a lot of temporary buffer space for doing the formatting as well as a large number of intermediate variables. In short, it needs a tremendous amount of stack space. Indeed I have had embedded systems that need a mere 32 bytes of stack space prior to using printf() – and 200+ bytes after I’ve added in printf(). The bottom line is that for small embedded systems, formatted output needs a ridiculous amount of stack space – and that as a result stack overflow is a real possibility.

Variable length arguments

I’m sure most people use sprintf() etc without fully appreciating that these functions use a variable length argument list. I’ll leave for another day the full implications of this. However for now you should just consider that MISRA bans the use of variable length arguments – and that you should take this as a strong hint to avoid these functions in embedded systems.

Execution time

The execution time of printf() can be spectacularly long. For example the ‘hello world’ program given in the introduction requires 1000 cycles on an AVR CPU. Changing it to the almost as trivial function shown below increases the execution to 6371 cycles:

int main( void )
{
 int i = 89;

 printf("hello, world %d\n", i);
}

Lest you think this is an indictment of the AVR processor, the same code for a generic ARM processor still takes a whopping 1738 cycles. In short, printf() and its brethren can take a really long time to execute.

Now do the above mean you should always eschew formatted output functions? No! Indeed I recommend the use of vsprintf() here for certain classes of problem. What I do recommend is that you think long and hard before using these functions to ensure that you really understand what you are doing (and getting) when you use them.

Previous Tip
Home

Minimizing memory use in embedded systems tip#2 – Be completely consistent in your coding style

Friday, September 4th, 2009 Nigel Jones

This is the second in a series of postings on how to minimize the memory consumption of an embedded system.

As the title suggests, you’ll often get a nice reduction in code size if you are completely consistent in your HLL coding style. To show how this works, its necessary to take a trip into assembly language.

When you write in assembly language you soon find that you perform the same series of instructions over and over again. For example, to add two numbers together, you might have pseudo assembly language code that looks something like this:

LD X, operand1 ; X points to operand 1
LD Y, operand2 ; Y points to operand 2
LD R0,X        ; Get operand 1
LD R1,Y        ; Get operand 2
ADD            ;
ST R0          ; Store the result in R0

After you have done this a few times, it becomes clear that the only thing that changes from use to use is the address of the operands. As a result, assembly language programmers would typically define a macro. The exact syntax varies from assembler to assembler, but it might look something like this:

MACRO ADD_BYTES(P1, P2)
LD X, P1  ; X points to parameter 1
LD Y, P2  ; Y points to parameter 2
LD R0,X   ; Get operand 1
LD R1,Y   ; Get operand 2
ADD       ;
ST R0     ; Store the result in R0
ENDM

Thereafter, whenever it is necessary to add two bytes together, one would simply enter the macro together with the name of the operands of interest. However, after you have invoked the macro a few dozen times, it probably dawns on you that you are chewing up memory un-necessarily and that you can save a lot by changing the macro to this:

MACRO ADD_BYTES(P1, P2)
LD X, P1  ; X points to parameter 1
LD Y, P2  ; Y points to parameter 2
CALL LDR0R1XY
ENDM

It is of course necessary to now define a subroutine ‘LDR0R1XY’ that looks like this:

LDR0R1XY:
LD R0,X  ; Get operand 1
LD R1,Y  ; Get operand 2
ADD      ;
ST R0    ; Store the result in R0
RET

Clearly this approach starts to save a few bytes per invocation, such that once one has used ADD_BYTES several times one achieves a net saving in memory usage. If one uses ADD_BYTES dozens of times then the savings can be substantial.

So how does this help if you are programming in a HLL? Well, decent compilers will do exactly the same optimization when told to perform full size optimization. However, in this case, the optimizer looks at all the code sequences generated by the compiler and identifies those code sequences that can be placed in a subroutine. A really good compiler will do this recursively in the sense that it will replace a code sequence with a subroutine call, and that subroutine call will in turn call another subroutine and so on. The results can be a dramatic reduction in code size – albeit at a potentially big increase in demand on the call stack.

Now clearly in order to take maximal advantage of this compiler optimization, it’s essential that the compiler see the same code sequences over and over again. You can maximize the likelihood of this occurring by being completely consistent in your coding style. Some examples:

  • When making function calls, keep the parameter orders consistent. For example if you call a lot of functions with two parameters such as a uint8_t and a uint16_t, then ensure that all your functions declare the parameters in the same order.
  • If most of your variables are 16 bit, with just a handful being 8 bit, then you may find you get a code size reduction if you convert all your variables to 16 bits.
  • Don’t flip randomly between case statements and if-else-if chains.

Note that notwithstanding the fact that being completely consistent can save you a lot of code space, I also think that code that is extremely consistent in its style has other merits as well, not the least of which is readability.As a final note, does anyone know the formal name for this type of optimization?

Next Tip

Previous Tip

Home