Archive for March, 2009

Efficient C Tips #8 – Use const

Monday, March 30th, 2009 Nigel Jones

One of the easiest ways to make your code more efficient is to use const wherever feasible. Just like declaring local functions as static, this is one of those changes that makes your code more robust, more maintainable and faster – a true win-win situation. So how does this work? Well you get the most benefit when passing pointers as parameters to functions. Here’s an example of a function whose job it is to compute the sum of an array of integers. The naive implementation would look something like this:

uint32_t sum(uint16_t *ptr, uint16_t n_elements)
{
    uint16_t lpc;
    uint32_t sum = 0;

    for (lpc = 0; lpc < n_elements; lpc++)
    {
        sum += *ptr++;
    }

    return sum;
}

I’ll ignore the issues of post increment and counting up (for now). Instead, consider the declaration of ptr. As it stands, the caller of this function has no idea whether sum() will modify the data or not, and hence must assume that it does. This has obvious implications for the compiler when it comes to optimization. To overcome this, it is necessary to declare ptr as pointing to const. The function prototype for sum() now becomes:

uint32_t sum(uint16_t const *ptr, uint16_t n_elements);

You’ll notice that I prefer to use what I call Saks notation for where I place the const modifier. The more conventional, albeit less sensible way of writing the declaration is:

uint32_t sum(const uint16_t *ptr, uint16_t n_elements);

Regardless of the style, by doing this you are indicating to the compiler that you will not be modifying the data that ptr points to. As a result, the optimizer can make assumptions that will typically lead to tighter code.

Before I give you the final code, I’d like to make a few other observations.

  • As well as potentially making your code more efficient, use of const also makes your code more readable and maintainable. That is, someone examining your code will know something extra about the function simply by looking at the prototype. Personally I find this very useful.
  • If you examine the C standard library, you’ll find very liberal use of the const modifier. You should take this as a strong hint that it’s a good idea.
  • PC-Lint will very helpfully tell you if a pointer can be declared as pointing to const. Yet another reason for using Lint!

So what does my sum() function look like? Well, incorporating my previous hints on post increment and counting down, it looks something like this:

uint32_t sum(uint16_t const *ptr, uint16_t n_elements)
{
    uint32_t sum = 0;

    for (; n_elements != 0; --n_elements)
    {
        sum += *ptr;
        ++ptr;
    }

    return sum;
}

Next Tip
Previous Tip

Editorial Note

I’ve been following my own advice and have been on a short vacation. As a result I’ve been tardy in responding to some of your comments. I’ll try and rectify this over the next few days.

Demand more time off!

Sunday, March 22nd, 2009 Nigel Jones

I’ve been posting on a lot of technical issues lately and so I thought I’d turn to a less cerebral topic – but one which I feel quite passionate about. First off – some background. I’m British by birth and was raised in Europe (UK & Germany) before moving to the USA in my early twenties. Upon arrival in the USA I was struck by many things; however professionally what amazed me was the number of hours the typical engineer works in the USA compared to their European counterparts. When I left the UK, the standard work week was 37.5 hours and the typical amount of paid time off was 4 weeks for new hires, quickly increasing to 6 weeks or more with length of service. To this was added 8 bank holidays. Perhaps more importantly, employers seemed to think that this was a good thing. For example, my employer at the time had the following policies in effect:

  • Employees were encouraged to work their 37.5 hours in such a way that the work week ended at lunchtime on Friday, effectively ensuring that employees had 2.5 day weekends.
  • Employees were strongly encouraged to take at least 2 weeks off as a block, thus ensuring that they got at least one long break from work every year.

By contrast, when I arrived in the USA, I discovered that the norm was quite different. Indeed the policies I encountered were as follows (and this from the American branch of the same firm as I had worked for in the UK):

  • Work week of 40 hours.
  • Engineers were routinely expected to put in unpaid overtime, with 10 hours being the norm.
  • Annual vacation of two weeks, which only started accruing after 6 months service.
  • Very long serving employees might get 3 weeks vacation a year.
  • Taking more than one week off at a time was actively discouraged.

So what to make of this? If you do the mathematics, a typical engineer in the USA would be working about 50 * 50 = 2500 hours a year (ignoring bank holidays – which are about the same), whereas a typical engineer in the UK would be working 37.5 * 48 = 1800 hours – a 39% difference. Now the question is, did I perceive the engineers in the USA to generate more output? I’d say yes, but only by a few percent, and certainly nowhere near the 39% more hours that they worked.
I’m sure other people’s experience will differ. However it’s clear to me why there isn’t a big difference in productivity. I solve most of my toughest technical problems when I’m not at work. Indeed, there is nothing like taking a stroll, going for a bike ride, or even sitting down for a beer with friends for clearing the mind and allowing you to look at issues from a fresh perspective. I know this experience isn’t unique to me, so why don’t employers see the light and realize that everyone benefits from requiring engineers (and other professions – but that’s outside my bailiwick) to take more time off?

Maybe it’s just me, but a start in changing this situation could be for more engineers to demand more time off. Some companies are starting to see the light. For example, Netrino offers its employees 5 weeks of vacation. Let’s make that the norm – not the exception!

As a final note, I know I have regular readers from other parts of the world – South America, Australasia, and the former Eastern Bloc. I’d be interested to hear what your working conditions are like.


Sorting (in) embedded systems

Sunday, March 15th, 2009 Nigel Jones

Although countless PhDs have been awarded on sorting algorithms, it’s not a topic that seems to come up much in embedded systems (or at least the kind of embedded systems that I work on). Thus it was with some surprise recently that I found myself needing to sort an array of integers. The array wasn’t very large (about twenty entries) and I was eager to move on to the real problem at hand, so I just dropped in a call to the standard C routine qsort(). I didn’t give it a great deal of thought because I ‘knew’ that a ‘Quick Sort’ algorithm is in general fast and well behaved, and with so few entries to sort I wasn’t too concerned about it being ‘optimal’. Anyway, with the main task at hand solved, on a whim I decided to take another look at qsort(), just to make sure that I wasn’t being too cavalier in my approach. Boy did I get a shock! My call to qsort() was increasing my code size by 1500 bytes and it wasn’t giving very good sort times either. For those of you programming big systems, this may seem acceptable. In my case, the target processor had 16K of memory and so 1500 bytes was a huge hit.

Surely there had to be a better solution? Well there’s always a better solution, but in my case in particular, and for embedded systems in general, what is the optimal sorting algorithm?

Well, after thinking about it for a while, I think the optimal sorting algorithm for embedded systems has these characteristics:

  1. It must sort in place.
  2. The algorithm must not be recursive.
  3. Its best, average and worst case running times should be of similar magnitude.
  4. Its code size should be commensurate with the problem.
  5. Its running time should increase linearly or logarithmically with the number of elements to be sorted.
  6. Its implementation must be ‘clean’ – i.e. free of breaks and returns in the middle of a loop.

Sort In Place

This is an important criterion, not just because it saves memory, but most importantly because it obviates the need for dynamic memory allocation. In general, dynamic memory allocation should be avoided in embedded systems because of problems with heap fragmentation and allocation performance. If you aren’t aware of these problems, then read this series of articles by Dan Saks on the topic.


No Recursion

Recursion is beautiful and solves certain problems amazingly elegantly. However, it’s not fast and it can easily lead to problems of stack overflow. As a result, it should never be used in embedded systems.

Running Time Variability

Even the softest of real time systems have some time constraints that need to be met. As a result a function whose execution time varies enormously with the input data can often be problematic. Thus I prefer code whose execution time is nicely bounded.

Code Size

This is often a concern. Suffice to say that the code size should be reasonable for the target system.

Data Size Dependence

Sorting algorithms are usually classified using ‘Big O notation’ to denote how sensitive they are to the amount of data to be sorted. If N is the number of elements to be sorted, then an algorithm whose running time is N log N is usually preferred to one whose running time is N². However, as you shall see, for small N the advantage of the more sophisticated algorithms can be lost to the overhead of the sophistication.

Clean Implementation

I’m a great proponent of ‘clean’ code. Thus code where one exits from the middle of a loop isn’t as acceptable as code where everything proceeds in an orderly fashion. Although this is a personal preference of mine, it is also codified in, for example, the MISRA C requirements, to which many embedded systems are built.

Anyway, to determine the optimal sorting algorithm, I went to the Wikipedia page on sorting algorithms and initially selected the following for comparison to the built-in qsort(): Comb, Gnome, Selection, Insertion, Shell & Heap sorts. All of these are sort-in-place algorithms. I originally eschewed the Bubble & Cocktail sorts as they really have nothing to commend them. However, several people posted comments asking that I include them – so I did. As predicted, they have nothing to commend them. In all cases, I used the Wikipedia code pretty much as is, compiled optimized for maximum speed. (I recognize that the implementations in Wikipedia may not be optimal – but they are the best I have.) For each algorithm, I sorted arrays of 8, 32 & 128 signed integers. In every case I sorted the same random array, together with a sorted array and an inverse-sorted array. First, the code sizes in bytes:

qsort()      1538
Gnome()        76
Selection()   130
Insertion()   104
Shell()       242
Comb()        190
Heap()        200
Bubble()      104
Cocktail()    140

Clearly, anything is a lot better than the built-in qsort(). However, we are comparing apples and oranges, because qsort() is a general purpose routine, whereas the others are designed explicitly to sort integers. Leaving aside qsort(), the Gnome, Insertion and Bubble sorts are clearly the code size leaders. Having said that, in most embedded systems, 100 bytes here or there is irrelevant and so we are free to choose based upon other criteria.

Execution times for the 8 element array

Name        Random  Sorted  Inverse Sorted
qsort()     3004     832    2765
Gnome()     1191     220    2047
Selection() 1120    1120    1120
Insertion() 544      287    756
Shell()     1233    1029    1425
Comb()      2460    1975    2480
Heap()      1265    1324    1153
Bubble()     875     208    1032
Cocktail()  1682     927    2056

In this case, the Insertion sort is the clear winner. Not only is it dramatically faster in almost all cases, it also has reasonable variability and almost the smallest code size. Notice that the Bubble sort, for all its vaunted simplicity, consumes as much code as the Insertion sort and runs considerably slower. Notice also that the Selection sort’s running time is completely consistent – and not too bad when compared to the other methods.

Execution times for the 32 element array

Name        Random  Sorted  Inverse Sorted
qsort()     23004    3088   19853
Gnome()     17389     892   35395
Selection() 14392   14392   14392 
Insertion()  5588    1179   10324
Shell()      6589    4675    6115
Comb()      10217    8638   10047
Heap()       8449    8607    7413
Bubble()    13664     784   16368
Cocktail()  17657    3807   27634

In this case, the winner isn’t so clear cut. Although the insertion sort still performed well, it’s showing a very large variation in running time now. By contrast the shell sort has got decent times with small variability. The Gnome, Bubble and Cocktail sorts are showing huge variability in execution times (with a very bad worst case), while the Selection sort shows consistent execution time. On balance, I’d go with the shell sort in most cases.

Execution times for the 128 element array

Name         Random  Sorted  Inverse Sorted
qsort()      120772   28411   77896
Gnome()      316550    3580  577747
Selection()  217420  217420  217420
Insertion()   88475    4731  158020
Shell()       41661   25611   34707
Comb()        50858   43523   48568
Heap()        46959   49215   43314
Bubble()     231294    3088  262032
Cocktail()   271821   15327  422266

In this case the winner is either the Shell sort or the Heap sort, depending on how you weigh raw performance against performance variability. The Gnome, Bubble and Cocktail sorts are hopelessly outclassed.

So what to make of all this? Well, in any comparison like this there are a myriad of variables that one should take into account, and so I don’t believe these data should be treated as gospel. What is clear to me is that:

  1. Being a general purpose routine, qsort() is unlikely to be the optimal solution for an embedded system.
  2. For many embedded applications, a shell sort has a lot to commend it – decent code size, fast running time, well behaved and a clean implementation. Thus if you don’t want to bother with this sort of investigation every time you need to sort an array, then a shell sort should be your starting point. It will be for me henceforth.

A quick update: Check out this link for a very cool visualization of different sorting algorithms.


Efficient C Tips #7 – Fast loops

Thursday, March 5th, 2009 Nigel Jones

Every program at some point requires some set of actions to be taken a fixed number of times. Indeed this is such a common occurrence that we typically code it without really giving it much thought. For example, if I asked you to call a function foo() ten times, I’m sure that most of you would write something like this:

for (uint8_t lpc = 0; lpc < 10; ++lpc)
{
    foo();
}

While there is nothing wrong with this, per se, it is sub optimal on just about every processor. Instead you are better off using a construct which counts down to zero. Here are two alternative ways of doing this:

for (uint8_t lpc = 10; lpc != 0; --lpc)
{
    foo();
}

uint8_t lpc = 10;
do
{
    foo();
} while (--lpc);

Which one you think is more natural is entirely up to you.

So how does this efficiency arise? Well in the count up case, the assembly language generated by the compiler typically looks something like this:

INC lpc      ; Increment loop counter
CMP lpc, #10 ; Compare loop counter to 10
BNZ loop     ; Branch if loop counter not equal to 10

By contrast, in the count down case the assembly language looks something like this

DEC lpc   ; Decrement loop counter
BNZ loop  ; Branch if non zero

Evidently, because of the ‘specialness’ of zero, more efficient code can be generated.

So why don’t you see C programs littered with these count down constructs? Well counting down has a major limitation. If you need to use the loop variable as an index into an array then you have a problem. For example, let’s say I wanted to zero the elements of an array. Using the count down technique you might be tempted to do this:

uint8_t bar[10];
uint8_t lpc = 10;

do
{
    bar[lpc] = 0; // Error! First time through results in index beyond end of array
} while (--lpc);

Evidently it doesn’t work. You can of course modify the code to make it work. However doing so typically loses you all the efficiency gains, such that you are better off with a standard up-counting for loop.

As a parting thought, concepts such as these are second nature to assembly language programmers – all of whom do this sort of thing instinctively. As a result, if you are really interested in getting the best out of your C compiler, you could do a lot worse than learning how to program your target processor in assembly language. Does this defeat one of the objectives in programming in a high level language – yes. However, for giving you insight in terms of what is going on under the hood it cannot be beaten.

Next Tip
Previous Tip

Computing your stack size

Sunday, March 1st, 2009 Nigel Jones

Many of the folks that come to this blog by way of search engines do so because they are having problems with stack overflow. I’ve already given my take on the likely causes of a stack overflow. Today I’d like to offer some hints on a related topic – how to set about computing the stack size for your application. This is an extremely difficult problem, which can be approached in one of three ways – experimentally, analytically or randomly. The latter is by far the most common technique, which consists essentially of choosing a number and seeing whether it works! In an effort to reduce the use of the random approach, I’ll try and summarize the other two methods.


The Experimental Method

In the experimental method, a typically very large stack size is selected. The stack is then filled with an arbitrary bit pattern such as 0xFC. The code is then executed for a ‘suitable’ amount of time, and then the stack is examined to see how far it has grown. Most people will typically take this number, add a safety margin and call it the required stack size. The main advantage of this approach is that it’s easy to do (indeed many good debuggers have this feature built in to them). It also has the advantage of being ‘experimental data’. However, there are two big problems with this approach, which will catch the unwary.

The biggest single problem with the experimental approach is the implicit assumption that the experiment that is run is representative of the worst case conditions. What do I mean by the worst case conditions? Well, the maximum stack size occurs in an embedded system when an interrupt that uses the most stack size occurs at a point in the code that the foreground application is also using the maximum stack size. On the assumption that most interrupts are asynchronous to the foreground application, the problem should be clear. How exactly do you know after your testing whether or not the interrupt that uses the most stack size did indeed trigger at the worst (best?) possible moment? Thus even if your testing had 100% code coverage, it still isn’t possible to know for sure whether you have covered all possible scenarios. If, as is the normal case, you don’t even begin to approach full code coverage, it should be clear to you that testing tends to reveal the typical ‘worst-case’ condition, rather than the genuine worst case condition.

The second major issue with testing is that it tends to be done when the code is close to being completed, rather than when it is completed. The problem is that small changes in the source code can have a huge impact on the required stack size. For example, let’s say that during testing it is discovered that an interrupt service routine is taking so long to complete that another interrupt is being occasionally missed. A ‘quick fix’ is to simply enable interrupts in the long interrupt handler, so that the other interrupts can do their thing. This one line change can lead to a dramatic increase in stack usage. (If you aren’t cognizant of the stack usage of interrupt handlers, you should read this article I wrote).


The Analytical Approach

In the analytical approach, the idea is to examine the source code and from the analysis work out the maximum stack usage of the foreground application, and then to add to this the worst case interrupt handler usage. This is obviously a daunting task for anything but the simplest of applications. You will not be surprised to hear that computer programs have been written to perform this analysis. Indeed good quality linkers will now do this for you as a matter of course. Furthermore, my favorite third party tool, PC-Lint from Gimpel, will also now do this starting with version 9. However, be warned that it takes a lot of work to set up PC-Lint to perform the analysis.

Although analysis can theoretically give an accurate answer, it does have several problems.


Recursion

It’s almost impossible for an analytical approach to compute the stack usage of a program that uses recursion. It’s because of this unbounded effect on stack size that recursion is a really bad idea in embedded systems. Indeed MISRA bans it, and I personally banned it about twenty years ago.

Indirect Function Calls

Pointers to functions are something that I use extensively and heartily recommend (for a discussion see this article I wrote). Although they don’t have a deleterious effect on stack size, they do make it quite difficult for analysis programs to track what is going on. Indeed PC-Lint cannot handle pointers to functions when it comes to computing stack usage. Thus if you use an analytical approach and you use pointers to functions, then make absolutely sure that the analysis program can track all the indirect calls.


Compiler Optimization

Code optimization can play havoc with stack usage. Some optimizations reduce stack usage (e.g. by placing function parameters in registers), while others can increase it. I should note that it’s only third party tools that should be bamboozled by the optimizer. The linker that makes up part of the compilation package should be aware of everything that the compiler has done.


Interpreting the Results

Even if you have a linker that will compute stack usage, interpreting the output of the linker is always a daunting task. For example, the linker from IAR will compute your stack usage. However, it isn’t nice enough to simply say: you need 279 bytes of stack space. Instead you have to study the linker output carefully to glean the requisite information.

A Practical Approach

It’s clear from the above that it isn’t easy to determine the stack size for an application. So how exactly does one set about this in practice? Well here’s what I do.

  1. Locate the stack at the beginning / end of memory (depending upon how the stack grows) and place all variables at the other end of the memory. This essentially means that you are implicitly allowing the maximum amount of memory possible for the stack. Note that many good compilers / linkers will do this automatically for you.
  2. As a starting point, I allocate 10% of the available memory for stack use. If I know I will be using functions that are huge users of the stack (such as printf, scanf and their brethren), then I’ll typically set it to 20% of available memory.
  3. I set up the debug environment from day 1 to monitor and report stack usage. This way as I progress through the development process I get a very good feel for the application’s stack consumption. This also helps in spotting changes to the code that have big impacts on the required stack size.
  4. Once I have ‘all the code written’, I start to make use of the information in the linker report. The tighter I am on memory, the closer I examine the linker output. In particular, what I often find is that there is one and only one function call chain that leads to a stack usage much greater than that of all the other call chains. In which case, I look to see if I can restructure that call chain so as to bring the maximum stack usage more in line with the typical stack usage.

If you stumbled upon this blog courtesy of a search engine, then I hope you found the above useful. I invite you to check out some of my other posts, which you may find useful. If you are a regular reader, then as always, thanks for stopping by.