Archive for the ‘Efficient C/C++’ Category

Peak detection of a time series

Friday, September 18th, 2015 Nigel Jones

I’ve been doing embedded work for so long now that it’s rare that I come across a need that I haven’t run into before. Well, it happened the other day, so I thought I’d share it with you.

Here’s the situation. I have a transducer whose role is to determine a binary condition (essentially X is present / absent). The transducer is AC coupled and pass-band filtered such that what I’m looking for is the presence of a noisy AC signal. My first inclination was to take the pass-band filtered signal, rectify it and average it over time. While this worked reasonably well, the delta between X being present / absent was not as large as I’d have liked. Accordingly I investigated other metrics. Somewhat to my surprise I found that the peak signal values (i.e. the ratio of peak ‘X present’ to peak ‘X absent’) gave a far better discriminator. Thus I found myself needing to find the peak (i.e. maximum) value of a moving time series of data values.

To prove the concept, I used a brute force approach. That is, every time I received a new reading I’d store it in a buffer, the size of which was determined by the amount of time I wished to look back over. I’d then search the entire buffer looking for the maximum value. This approach of course worked fine – but it was hopelessly inefficient. Given that I am receiving a new value to process every few milliseconds, this inefficiency is unacceptable in the real product.  Accordingly, I needed a better algorithm.

My first port of call was of course the internet. If you do a search for peak detector algorithms then you’ll find a plethora of algorithms. However they are all way too sophisticated for what I need, as they are aimed at finding *all* the local maxima within an N-Dimensional array (apparently an important problem in physics, image processing etc).  I just want to know what’s the maximum value in a buffer of values – and I want to know it as efficiently as possible.

After pondering the problem for a while, my thoughts first turned to an efficient median filtering algorithm. I discussed median filtering here. My rationale was quite simple: the efficient median filtering algorithm inherently keeps the time series data sorted, and so I should be able to easily adapt it to return a maximum value rather than a median value. Well, it turned out to be quite trivial. Indeed it’s so trivial that I wonder why I haven’t modified it before. Here’s the code (note, as explained in my original posting on median filtering, this algorithm and code are mostly courtesy of Phil Ekstrom):

#define STOPPER 0                                /* Smaller than any datum */
#define MEDIAN_FILTER_SIZE    (31)

void median_filter(uint16_t datum, uint16_t *calc_median, uint16_t *calc_peak)
{
    struct pair
    {
        struct pair   *point;        /* Pointers forming list linked in sorted order */
        uint16_t  value;             /* Values to sort */
    };
    static struct pair buffer[MEDIAN_FILTER_SIZE] = {0}; /* Buffer of nwidth pairs */
    static struct pair *datpoint = buffer;   /* Pointer into circular buffer of data */
    static struct pair small = {NULL, STOPPER};          /* Chain stopper */
    static struct pair big = {&small, 0}; /* Pointer to head (largest) of linked list.*/
    
    struct pair *successor;         /* Pointer to successor of replaced data item */
    struct pair *scan;              /* Pointer used to scan down the sorted list */
    struct pair *scanold;           /* Previous value of scan */
    struct pair *median;            /* Pointer to median */
    
    uint16_t i;
    
    if (datum == STOPPER)
    {
        datum = STOPPER + 1;                             /* No stoppers allowed. */
    }
    
    if ( (++datpoint - buffer) >= MEDIAN_FILTER_SIZE)
    {
        datpoint = buffer;                  /* Increment and wrap data in pointer.*/
    }
    
    datpoint->value = datum;                /* Copy in new datum */
    successor = datpoint->point;            /* Save pointer to old value's successor */
    median = &big;                          /* Median initially to first in chain */
    scanold = NULL;                         /* Scanold initially null. */
    scan = &big;              /* Points to pointer to first (largest) datum in chain */
    
    /* Handle chain-out of first item in chain as special case */
    if (scan->point == datpoint)
    {
        scan->point = successor;
    }
    scanold = scan;                                     /* Save this pointer and   */
    scan = scan->point ;                                /* step down chain */
    
    /* Loop through the chain, normal loop exit via break. */
    for (i = 0 ; i < MEDIAN_FILTER_SIZE; ++i)
    {
        /* Handle odd-numbered item in chain  */
        if (scan->point == datpoint)
        {
            scan->point = successor;                      /* Chain out the old datum.*/
        }
        
        if (scan->value < datum)         /* If datum is larger than scanned value,*/
        {
            datpoint->point = scanold->point;             /* Chain it in here.  */
            scanold->point = datpoint;                    /* Mark it chained in. */
            datum = STOPPER;
        }
        
        /* Step median pointer down chain after doing odd-numbered element */
        median = median->point;                       /* Step median pointer.  */
        if (scan == &small)
        {
            break;                                      /* Break at end of chain  */
        }
        scanold = scan;                               /* Save this pointer and   */
        scan = scan->point;                           /* step down chain */
        
        /* Handle even-numbered item in chain.  */
        if (scan->point == datpoint)
        {
            scan->point = successor;
        }
        
        if (scan->value < datum)
        {
            datpoint->point = scanold->point;
            scanold->point = datpoint;
            datum = STOPPER;
        }
        
        if (scan == &small)
        {
            break;
        }
        
        scanold = scan;
        scan = scan->point;
    }
    *calc_median = median->value;
    *calc_peak = big.point->value;
}

To use this code, simply set the MEDIAN_FILTER_SIZE to your desired value, and then call median_filter every time you have a new value to process. The advantage of this code is that it gives you two parameters – the median and the maximum.
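
For illustration, here’s a minimal usage sketch; adc_read() is a hypothetical stand-in for whatever produces your samples:

#include <stdint.h>

extern uint16_t adc_read(void);    /* Hypothetical sample source */
void median_filter(uint16_t datum, uint16_t *calc_median, uint16_t *calc_peak);

void process_sample(void)
{
    uint16_t median;
    uint16_t peak;

    median_filter(adc_read(), &median, &peak);
    /* median and peak now reflect the last MEDIAN_FILTER_SIZE samples */
}
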
However, when I benchmarked it against the brute force approach I discovered that the brute force algorithm was actually faster – by a reasonable amount. After reflecting upon this for a while I realized that the median filtering approach was of course keeping all of the data sorted – which was way more than I needed for a simple peak detector. Thus it was time for another approach.

After thinking about it for a while, it was clear that the only datum I cared about in my buffer was the current maximum value. Thus upon receipt of a new value, I essentially had to decide:

  1. Is this new value >= current maximum value? If it is then it’s the new maximum value. Note that I really do mean >= and not >. The reason is explained below.
  2. Otherwise, is this new value about to overwrite the existing maximum value? If so, overwrite it and use a brute force search to find the new maximum value.

In order to make the algorithm as efficient as possible we need to minimize the number of times a brute force search is required. Now if your buffer contains only one instance of the maximum value, then you’ll simply have to do the brute force search whenever you overwrite that maximum. However, if your buffer contains multiple copies of the maximum value, then you’ll minimize the number of brute force searches by declaring the operative maximum to be the copy furthest away from where we will next write into the buffer. It’s for this reason that we use a >= comparison in step 1.

In a similar vein, if we do have to perform a brute force search and there are multiple copies of the maximum value, then once again we want to choose the copy furthest away from where we will next write into the buffer.

Anyway, here’s the code:

#define WINDOW_SIZE 31

uint16_t efficient_peak_detector(uint16_t value)
{
    static uint16_t wbuffer[WINDOW_SIZE] = {0U};
    static uint16_t wr_idx = 0U;
    static uint16_t peak_value = 0U;
    static uint16_t peak_value_idx = 0U;
    
    if (value >= peak_value)
    {
        peak_value = value;            /* New peak value, so record it */
        peak_value_idx = wr_idx;
        wbuffer[wr_idx] = value;
    }
    else
    {
        wbuffer[wr_idx] = value;        /* Not a new peak value, so just store it */
        if (wr_idx == peak_value_idx)    /* Have we overwritten the peak value? */
        {
            /*  Yes, so we need to do a brute force search to find the new
                maximum. Note that for efficiency reasons, if there are multiple
                copies of the new peak value, then we want to choose the one
                whose index is as far away as possible from the current index */
            uint16_t idx;
            uint16_t cnt;
            
            for (cnt = 0U, idx = wr_idx, peak_value = 0U; cnt < WINDOW_SIZE; ++cnt)
            {
                if (wbuffer[idx] >= peak_value)
                {
                    peak_value = wbuffer[idx];    /* Record new peak */
                    peak_value_idx = idx;
                }
                if (++idx >= WINDOW_SIZE)
                {
                    idx = 0;
                }
            }
        }
    }
    if (++wr_idx >= WINDOW_SIZE)
    {
        wr_idx = 0;
    }
        
    return peak_value;
}

Obviously, if you make your buffer size a power of two then you can optimize the buffer wrapping code, as shown below. There are a couple of other minor optimizations that can be made. For example, on 8-bit processors, making the various indices and counters (wr_idx, peak_value_idx, idx, cnt) 8 bits will speed things up a lot. Similarly, on a 32-bit processor you will likely gain a bit by using 32-bit values.
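
Here’s a sketch of that wrapping optimization. Note that the window grows from 31 to 32 samples – the size really must be a power of two for the mask to work:

#include <stdint.h>

#define WINDOW_SIZE 32U                     /* Must be a power of two */
#define WINDOW_MASK (WINDOW_SIZE - 1U)

static uint16_t wr_idx = 0U;

/* Replaces: if (++wr_idx >= WINDOW_SIZE) { wr_idx = 0; } */
static void advance_write_index(void)
{
    wr_idx = (wr_idx + 1U) & WINDOW_MASK;   /* Wrap with a single AND */
}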

Notwithstanding the aforementioned optimizations, this code is on average considerably faster than the brute force approach and not much larger. Its main limitation is that its run time is highly variable depending upon whether we have to determine a new maximum value using the brute force approach or not.

Clearly it’s trivial to modify this code to perform a ‘trough detector’ (aka minimum detector), or indeed have it do both functions.


Optimizing for the CPU / compiler

Sunday, June 3rd, 2012 Nigel Jones

It is well known that standard C language features map horribly onto the architecture of many processors. While the mapping is obvious and appalling for some processors (low-end PICs and the 8051 spring to mind), it’s still not necessarily great at the 32-bit end of the spectrum, where processors without floating point units can be hit hard by C’s floating point promotion rules. While this is all obvious stuff, it’s essentially about what those CPUs are lacking. Where it gets really interesting in the embedded space is when you have a processor that has all sorts of specialized features that are great for embedded systems – but which simply do not map onto the C language view of the world. Some examples will illustrate my point.

Shifting vs. rotation

The C language does of course have support for performing shift operations. However, these are straight shifts: bits shifted off the end of an integer type are simply lost. Rotation, also known as a circular shift, is different in that the bits wrap back around (often through the carry bit, but not always). Now while straight shifts are great for arithmetic operations, there are plenty of occasions on which I find myself wanting to perform a rotation. Now can I write a rotation function in C – sure – but it’s a real pain in the tuches.
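
For the record, here’s what such a function looks like in portable C – a sketch of an 8-bit rotate left. A good compiler may recognize the idiom and emit the CPU’s native rotate instruction, but there is no guarantee:

#include <stdint.h>

static uint8_t rotl8(uint8_t value, uint8_t count)
{
    count &= 7U;    /* Keep the rotate count in the range 0..7 */

    /* Bits shifted out of the top re-enter at the bottom; the '& 7U'
       keeps the right-hand shift count in range when count is zero. */
    return (uint8_t)((value << count) | (value >> ((8U - count) & 7U)));
}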

Saturated addition

If you have ever had to design and implement an integer digital filter, I am sure you found yourself yearning for an addition operator that will saturate rather than overflow. [In this form of arithmetic, if the integral type would overflow as the result of an operation, then the processor simply returns the minimum or maximum value as appropriate].  Processors that the designers think might be required to perform digital filtering will have this feature built directly into their instruction sets.  By contrast the C language has zero direct support for such operations, which must be coded using nasty checks and masks.
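
To give a flavor of those checks, here’s a sketch of a 16-bit saturating addition in portable C. On a DSP-flavored instruction set this whole function is a single instruction:

#include <stdint.h>

static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* Widen so the true sum cannot overflow */

    if (sum > INT16_MAX)
    {
        sum = INT16_MAX;                    /* Clamp at the rails instead of wrapping */
    }
    else if (sum < INT16_MIN)
    {
        sum = INT16_MIN;
    }

    return (int16_t)sum;
}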

Nibble swapping

Swapping the upper and lower nibbles of a byte is a common operation in cryptography and related fields. As a result many processors include this ever so useful instruction in their instruction sets. While you can of course write C code to do it, it’s horrible looking and grossly inefficient when compared to the built in instruction.
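
For comparison, here’s the portable C idiom – innocuous looking, but it typically compiles to several shift and OR instructions where a single SWAP-style instruction would do:

#include <stdint.h>

static uint8_t swap_nibbles(uint8_t value)
{
    /* The cast back to uint8_t discards the bits shifted above bit 7 */
    return (uint8_t)((value << 4) | (value >> 4));
}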

Implications

If you look over the examples quoted I’m sure you noticed a theme:

  1. Yes I can write C code to achieve the desired functionality.
  2. The resultant C code is usually ugly and horribly inefficient when compared to the intrinsic function of the processor.

Now in many cases, C compilers simply don’t give you access to these intrinsic features other than by resorting to the inline assembler. Unfortunately, using the inline assembler causes a lot of problems. For example:

  1. It will often force the compiler to not optimize the enclosing function.
  2. It’s really easy to screw it up.
  3. It’s banned by most coding standards.

As a result, the intrinsic features can’t be used anyway. However, there are embedded compilers out there that support intrinsic functions. For example, here’s how to swap nibbles using IAR’s AVR compiler:

foo = __swap_nibbles(bar);

There are several things to note about this:

  1. Because it’s a compiler intrinsic function, there are no issues with optimization.
  2. Similarly because one works with standard variable names, there is no particular likelihood of getting this wrong.
  3. Because it looks like a function call, there isn’t normally a problem with coding standards.

This then leads to one of the essential quandaries of embedded systems: is it better to write completely standard (and hence presumably portable) C code, or should one take every advantage of the neat features offered by your CPU and (if it is any good) your compiler?

I made my peace with this decision many years ago and fall firmly into the camp of take advantage of every neat feature offered by the CPU / compiler – even if it is non-standard. My rationale for doing so is as follows:

  1. Porting code from one CPU to another happens rarely. Thus to burden the bulk of systems with this mythical possibility seems weird to me.
  2. End users do not care. When was the last time you heard someone extol the use of standard code in the latest widget? Instead, end users care about speed, power consumption and battery life – all things that come about by having the most efficient code possible.
  3. It seems downright rude not to use those features that the designer built into the CPU just because some purist says I should not.

Having said this, I do of course understand completely if you are in the business of selling software components (e.g. an AES library), where using intrinsic / specialized instructions could be a veritable pain. However for the rest of the industry I say use those intrinsic functions! As always, let the debate begin.
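
For what it’s worth, when portability does matter, one compromise is to fence the intrinsic behind a macro so that only a single header knows about the compiler. Here’s a sketch – I believe IAR’s AVR compiler predefines __ICCAVR__ and declares its intrinsics in intrinsics.h, but check your own compiler’s documentation:

#include <stdint.h>

#if defined(__ICCAVR__)        /* Assumed IAR AVR predefined macro */
    #include <intrinsics.h>
    #define SWAP_NIBBLES(x)    __swap_nibbles(x)
#else                          /* Portable fallback; note x is evaluated twice */
    #define SWAP_NIBBLES(x)    ((uint8_t)(((uint8_t)(x) << 4) | ((uint8_t)(x) >> 4)))
#endif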

 

Efficient C Tip #13 – use the modulus (%) operator with caution

Tuesday, February 8th, 2011 Nigel Jones

This is the thirteenth in a series of tips on writing efficient C for embedded systems. As the title suggests, if you are interested in writing efficient C, you need to be cautious about using the modulus operator. Why is this? Well a little thought shows that C = A % B is equivalent to C = A - B * (A / B); for example, 17 % 5 = 17 - 5 * (17 / 5) = 17 - 15 = 2 in integer arithmetic. In other words the modulus operator is functionally equivalent to three operations: a division, a multiplication and a subtraction. As a result it’s hardly surprising that code that uses the modulus operator can take a long time to execute. Now in some cases you absolutely have to use the modulus operator. However in many cases it’s possible to restructure the code such that the modulus operator is not needed. To demonstrate what I mean, some background information is in order as to how this blog posting came about.

Converting seconds to days, hours, minutes and seconds

In embedded systems design there is an increasing need for some form of real-time clock. The designer typically implements the time as a 32-bit variable containing the number of seconds since a particular date, and once that is done it’s not usually long before one has to convert the ‘time’ into days, hours, minutes and seconds. Well I found myself in just such a situation recently, so I thought a quick internet search was in order to find the ‘best’ way of converting ‘time’ to days, hours, minutes and seconds. The code I found wasn’t great and as usual was highly PC-centric. I thus sat down to write my own code.

Attempt #1 – Using the modulus operator

My first attempt used the ‘obvious’ algorithm and employed the modulus operator. The relevant code fragment appears below.

void compute_time(uint32_t time)
{
 uint32_t    days, hours, minutes, seconds;

 seconds = time % 60UL;
 time /= 60UL;
 minutes = time % 60UL;
 time /= 60UL;
 hours = time % 24UL;
 time /= 24UL;
 days = time;  
}

This approach has a nice-looking symmetry to it. However, it contains three divisions and three modulus operations. I was thus rather concerned about its performance, and so I measured its speed on three different architectures – AVR (8-bit), MSP430 (16-bit) and ARM Cortex (32-bit). In all three cases I used an IAR compiler with full speed optimization. The cycle counts quoted are for 10 invocations of the test code and include the test-harness overhead:

AVR:  29,825 cycles

MSP430: 27,019 cycles

ARM Cortex: 390 cycles

No, that isn’t a misprint. The ARM was nearly two orders of magnitude more cycle-efficient than the MSP430 and AVR. Thus my claim that the modulus operator can be very inefficient is true for some architectures – but not all. If you are using the modulus operator on an ARM processor then it’s probably not worth worrying about. However, if you are working on smaller processors then clearly something needs to be done – and so I investigated some alternatives.

Attempt #2 – Replace the modulus operator

As mentioned in the introduction, C = A % B is equivalent to C = A - B * (A / B). If we compare this to the code in attempt 1, then it should be apparent that the intermediate value (A / B) computed as part of the modulus operation is in fact needed in the next line of code. This suggests a simple optimization to the algorithm.

void compute_time(uint32_t time)
{
 uint32_t    days, hours, minutes, seconds;

 days = time / (24UL * 3600UL);    
 time -= days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = time / 3600UL;
 time -= (hours * 3600UL);
 /* time now contains the number of seconds in the last hour */
 minutes = time / 60U;
 seconds = time - minutes * 60U;
 }

In this case I have replaced three modulus operations with three subtractions and three multiplications. Although that trades a single operator (%) for two (- and *), remember that the modulus operator is really three operations in one (-, * and /), and the division it implies is already being computed on the neighboring line. Effectively, then, I have eliminated three divisions, and so I expected a significant improvement in speed. The results however were a little surprising:

AVR:  18,720 cycles

MSP430: 14,805 cycles

ARM Cortex: 384 cycles

Thus while this technique yielded roughly a factor-of-two improvement for the AVR and MSP430 processors, it had essentially no impact on the ARM code. Presumably this is because the ARM has a hardware divide instruction, so the modulus operation was cheap to begin with. Notwithstanding the ARM results, it’s clear that, at least in this example, it’s possible to significantly speed up an algorithm by eliminating the modulus operator.

I could of course just stop at this point. However, examination of attempt 2 shows that further optimizations are possible, by observing that if time is a 32-bit count of seconds, then days fits comfortably in a 16-bit variable (2^32 seconds is only about 49,710 days). Furthermore, hours, minutes and seconds are inherently limited to an 8-bit range. I thus recoded attempt 2 to use smaller data types.

Attempt #3 – Data type size reduction

My naive implementation of the code looked like this:

void compute_time(uint32_t time)
{
 uint16_t    days;
 uint8_t     hours, minutes, seconds;
 uint16_t    stime;

 days = (uint16_t)(time / (24UL * 3600UL));    
 time -= (uint32_t)days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = (uint8_t)(time / 3600UL);
 stime = time - ((uint32_t)hours * 3600UL);
 /*stime now contains the number of seconds in the last hour */
 minutes = stime / 60U;
 seconds = stime - minutes * 60U;
}

All I have done is change the data types and add casts where appropriate. The results were interesting:

AVR:  14,400 cycles

MSP430: 11,457 cycles

ARM Cortex: 434 cycles

Thus while this resulted in a significant improvement for the AVR & MSP430, it resulted in a significant worsening for the ARM. Clearly the ARM doesn’t like working with non-32-bit variables. This suggested an improvement that would also make the code a lot more portable – use the C99 fast types. Doing this gives the following code:

Attempt #4 – Using the C99 fast data types

void compute_time(uint32_t time)
{
 uint_fast16_t    days;
 uint_fast8_t    hours, minutes, seconds;
 uint_fast16_t    stime;

 days = (uint_fast16_t)(time / (24UL * 3600UL));    
 time -= (uint32_t)days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = (uint_fast8_t)(time / 3600UL);
 stime = time - ((uint32_t)hours * 3600UL);
 /*stime now contains the number of seconds in the last hour */
 minutes = stime / 60U;
 seconds = stime - minutes * 60U;
}

All I have done is change the data types to the C99 fast types. The results were encouraging:

AVR:  14,400 cycles

MSP430: 11,595 cycles

ARM Cortex: 384 cycles

Although the MSP430 time increased very slightly, the AVR and ARM both turned in their best times. Thus attempt #4 is both fast and portable.

Conclusion

Not only did replacing the modulus operator with alternative operations result in faster code, it also opened up the possibility for further optimizations. As a result with the AVR & MSP430 I was able to more than halve the execution time.

Converting Integers for Display

A similar problem (with a similar solution) occurs when one wants to display integers on a display. For example, if you are using a custom LCD panel with, say, a 3-digit numeric field, then the problem arises of how to determine the value of each digit. The obvious way, using the modulus operator, is as follows:

void display_value(uint16_t value)
{
 uint8_t    msd, nsd, lsd;

 if (value > 999)
 {
  value = 999;
 }

 lsd = value % 10;
 value /= 10;
 nsd = value % 10;
 value /= 10;
 msd = value;

 /* Now display the digits */
}

However, using the technique espoused above, we can rewrite this much more efficiently as:

void display_value(uint16_t value)
{
 uint8_t    msd, nsd, lsd;

 if (value > 999U)
 {
  value = 999U;
 }

 msd = value / 100U;
 value -= msd * 100U;

 nsd = value / 10U;
 value -= nsd * 10U;

 lsd = value;

 /* Now display the digits */
}

If you benchmark this you should find it considerably faster than the modulus based approach.


Efficient C Tip #12 – Be wary of switch statements

Saturday, April 10th, 2010 Nigel Jones

This is the twelfth in a series of tips on writing efficient C for embedded systems. Like the previous topic, I suspect that this will be a bit controversial. As the title suggests, if you are interested in writing efficient C, you need to be wary of switch statements. Before I explain why, a little background will be useful. I did all of my early embedded systems programming in assembly language. This wasn’t out of some sense of machismo, it was simply a reflection of the fact that there were no high level languages available (with the possible exception of PL/M). Naturally as well as programming embedded systems I also did computer programming, initially in Pascal and BASIC, and later in C. One of the major differences I found in using the HLL was the wonderful switch / case statement. I found it to be a beautiful tool – with a few lines of source code I could do all sorts of powerful things that were simply very difficult or tedious to do in assembly language. Fast forward a number of years and C compilers began to become available for small embedded systems and so I naturally started using them, together with of course the attendant switch statement. All was well in paradise until the day I used a switch statement in an interrupt service routine and found to my horror that the ISR was taking about ten times longer to execute than I thought was reasonable.

This precipitated an investigation into how exactly switch statements are implemented by the compiler. When I did this, I discovered a number of things that should give one pause.

Heuristic Algorithms

The first thing I discovered is that compilers typically have a number of ways of implementing a switch statement. They seem to be loosely divided into the following trichotomy:

  1. An if-else-if-else-if chain. In this implementation, the switch statement is treated as syntactic sugar for an if-else-if chain.
  2. Some form of jump or control table – sometimes called a computed goto. This is a favorite technique of assembly language programmers, and compiler writers can use it to great effect.
  3. A hybrid of 1 & 2.

Where it gets interesting is how the compiler decides which approach to use. If the case values are contiguous (e.g. zero through ten), then it’s likely the compiler will use some form of jump table. Conversely, if the case values are completely disjoint (e.g. zero, six, twenty, four hundred and a thousand) then an if-else implementation is likely. However, what does the compiler do when, for example, you have a bifurcated set of ranges such as zero to ten and ninety to one hundred? Well the answer is that each compiler seems to have some form of heuristic algorithm for determining the ‘best’ way of implementing a given set of cases. Although some compilers allow you to force a particular implementation, for the most part you are at the mercy of the compiler.

Comparative Execution Speeds

If you think about it, it should become apparent that a jump table approach is likely to give a highly consistent execution time through the decision tree, whereas the if-else-if chain has a highly variable execution time depending upon the particular value of the switched variable. Notwithstanding this, the jump table approach has a certain amount of execution overhead associated with it. This means that although its mean execution time (which is normally the same as its worst and best execution times) may be dramatically better than the mean execution time of the if-else-if chain, the if-else-if chain’s best execution time may be considerably better. So what, you say? Well, in some cases a particular value is far more likely to occur than the others, and it would be very nice if this value were tested first. However, as you will now see, this isn’t guaranteed…

Order of Execution

For many years I wrote switch statements under the assumption that the case values would be evaluated from top to bottom. That is, if the compiler chose to implement the switch statement as an if-else-if chain, then it would first test the first case, then the second case, and so on down to the default case at the bottom of my source code. Well, it turns out that my assumption was completely wrong: the compiler is under no such obligation, and indeed will often evaluate the values bottom to top. Furthermore, the compiler will often evaluate the default value first. For example, consider a defaulted switch statement with contiguous case values in the range zero to ten. If the index variable is an unsigned int, then there are at least 65,525 possible values handled by the default case, and so it makes sense to eliminate them first. Now if you know that the index variable can only possibly take on the values zero to ten, then you can of course eliminate the default statement – and then get excoriated by the coding standards / MISRA folks.

Maintenance

This is the area where I really get worried. Consider the case where you have a switch statement in an ISR. The code is working with no problems until one day it is necessary to make a change to the switch statement – for example, adding an additional case value. This simple change can cause the compiler to completely change the implementation of the switch statement. As a result, you may find that:

  1. The worst case execution time has jumped dramatically.
  2. The mean execution time has jumped dramatically.
  3. The stack space required by the ISR has jumped dramatically.

Any of these three possibilities can cause your program to fail catastrophically. Now of course one could argue ‘that’s why you test all changes’. However, in my opinion it’s far better to be proactive and to avoid putting yourself in this situation in the first place.

I’d also be remiss in not noting the dreaded missing break statement maintenance problem. However as a religious user of Lint, I’m not normally too concerned about this.

Switch statement alternatives

If performance and stability are your goals then I strongly recommend that you implement your code the way you want it executed. This means either explicitly using an if-else-if chain or using function pointers. If function pointers scare you, then you might want to read this article I wrote on the subject.
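
As a sketch of the function pointer approach for contiguous values, you effectively write the jump table yourself (the state names and handlers here are purely illustrative):

#include <stdint.h>

typedef void (*handler_fn)(void);

static void handle_idle(void) { /* ... */ }
static void handle_rx(void)   { /* ... */ }
static void handle_tx(void)   { /* ... */ }

/* One entry per contiguous state value, in order */
static handler_fn const handlers[] = { handle_idle, handle_rx, handle_tx };

#define N_HANDLERS (sizeof(handlers) / sizeof(handlers[0]))

void dispatch(uint8_t state)
{
    if (state < N_HANDLERS)    /* The explicit 'default' case */
    {
        handlers[state]();     /* Constant-time dispatch, whatever the compiler's mood */
    }
}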

Recommendations

Based on my experience, I have a number of things that I do when it comes to switch statements. If you find my analysis compelling, you may want to adopt them:

  1. Switch statements should be the last language construct you reach for – and not the first.
  2. Learn how to use function pointers. Once you do you’ll find a lot of the reasons for using switch statements go away.
  3. Try to keep all the case values contiguous.
  4. If you can’t keep the case values contiguous, go to the other extreme and make them disparate – that way you are less likely to have the compiler change the algorithm on you.
  5. If your compiler supports it, consider using pragmas to lock in a particular implementation.
  6. Be very wary of using switch statements in interrupt service routines or any other performance critical code.
  7. Use Lint to guard against missing break statements.

Comments welcome!


Efficient C Tip #11 – Avoid passing parameters by using more small functions

Saturday, February 6th, 2010 Nigel Jones

This is the eleventh in a series of tips on writing efficient C for embedded systems. Today’s topic will, I suspect, be slightly controversial. This post is based upon two basic observations:

  1. Passing parameters to functions is costly.
  2. Conditional branch instructions can be very costly on CPUs that have instruction caches (even with branch prediction).

I don’t think that too many people will disagree with me on the above. Despite this, I too often see a style of coding that incurs these costs unnecessarily. I think it’s best illustrated by a (real world) example. The issue is one that will be familiar to most of you. An embedded system contains a number of discrete LEDs (say 3), and the requirement is to write some code to allow higher level code to turn on, turn off, or toggle a particular LED. The way I often see this coded is as follows:

typedef enum
{
    LED1, LED2, LED3
} LED_NO;

typedef enum
{
    LED_OFF, LED_ON, LED_TOGGLE
} LED_ACTION;

void led(LED_NO led_no, LED_ACTION led_action)
{
    switch (led_no)
    {
        case LED1:
            switch (led_action)
            {
                case LED_OFF:
                    PORTB_PORTB0 = 0;
                    break;
                case LED_ON:
                    PORTB_PORTB0 = 1;
                    break;
                case LED_TOGGLE:
                    PORTB_PORTB0 ^= 1;
                    break;
                default:
                    break;
            }
            break;
        case LED2:
            ...
    }
}

So what’s wrong with this, you ask? Well, in a nutshell, the parameters passed to the function are used strictly to steer the path of execution – there is no code common to any pair or group of parameter values. When faced with a situation such as this, I instead implement the code as a large number of very small functions. For example:

void led1_Off(void)
{
    PORTB_PORTB0 = 0;
}

void led1_On(void)
{
    PORTB_PORTB0 = 1;
}

void led1_Toggle(void)
{
    PORTB_PORTB0 ^= 1;
}
...

Let’s compare the two approaches.

Efficiency

This blog posting is supposedly about efficiency, so let’s start with the results. I coded up both approaches together with a main() function that exercised all 9 possible combinations. I then turned full speed optimization on and looked at the results for an AVR processor.

Single function approach: 78 bytes for main(), 94 bytes for the LED code. Execution time 208 cycles.

Multiple function approach: 42 bytes for main(), 54 bytes for the LED code. Execution time 96 cycles.

Clearly my approach is significantly more efficient.

Usability

By usability I’m referring to the case where someone else needs to use your code. They know they need to say toggle LED2 so they hunt around and find the file led.h. The question is, once they have opened up led.h, how quickly can they determine what they have to do in order to toggle LED2? In the single function case they are presented with just one function (which is a plus), but then they have to locate the enumerations and work out the parameters that need to be passed to the function (which is a minus). In the multiple function case, they have to search through a list of functions looking for the correct one. However once they have found it, it’s very clear what the function does.

For me, it’s a toss-up between the two approaches as to which is more usable.

Maintainability

In this case the multiple function approach is the big winner. To see why, consider what happens when one adds an LED or adds an action. The single function case just explodes in size, whereas with the multi-function approach one simply adds more very simple functions.

Conclusions

If you buy my analysis then clearly the multi-function approach is superior in both efficiency and maintainability – two areas that are dear to my heart. Now granted this is a fairly extreme example. However in my experience if you look through a reasonable amount of code you will soon discover a function that essentially does one thing or another based upon a function parameter. When you locate such a function you might want to try breaking it into two functions in the manner described here – I think you’ll be pleased with the results.


****
As the readership of this blog has grown I must say I have been really impressed with the many insightful comments that have been posted. I know I learn a lot from them, and so I suspect, do a lot of the other readers. Thus for those of you that have commented in the past – thank you. For those of you yet to post a comment, I encourage you to take the plunge!
