embedded software boot camp

Efficient C Tips #7 – Fast loops

Thursday, March 5th, 2009 by Nigel Jones

Every program at some point requires some set of actions to be taken a fixed number of times. Indeed this is such a common occurrence that we typically code it without really giving it much thought. For example, if I asked you to call a function foo() ten times, I’m sure that most of you would write something like this:

for (uint8_t lpc = 0; lpc < 10; ++lpc)
{
 foo();
}

While there is nothing wrong with this per se, it is suboptimal on just about every processor. Instead, you are better off using a construct that counts down to zero. Here are two alternative ways of doing this:

for(uint8_t lpc = 10; lpc != 0; --lpc)
{
 foo();
}

uint8_t lpc = 10;
do
{
 foo();
} while (--lpc);

Which one you think is more natural is entirely up to you.

So how does this efficiency arise? Well, in the count-up case the assembly language generated by the compiler typically looks something like this:

INC lpc      ; Increment loop counter
CMP lpc, #10 ; Compare loop counter to 10
BNZ loop     ; Branch if loop counter not equal to 10

By contrast, in the count-down case the assembly language looks something like this:

DEC lpc   ; Decrement loop counter
BNZ loop  ; Branch if non zero

Evidently, because of the ‘specialness’ of zero, more efficient code can be generated.

So why don’t you see C programs littered with these count-down constructs? Well, counting down has a major limitation: if you need to use the loop variable as an index into an array, then you have a problem. For example, let’s say I wanted to zero the elements of an array. Using the count-down technique you might be tempted to do this:

uint8_t bar[10];
uint8_t lpc;

do
{
 bar[lpc] = 0; // Error! First time through results in index beyond end of array
} while (--lpc);

Evidently it doesn’t work. You can of course modify the code to make it work. However doing so typically loses you all the efficiency gains, such that you are better off with a standard up-counting for loop.
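
For reference, a corrected count-down version might look something like the sketch below. This is just one possible repair, not necessarily the one your compiler will turn into the tightest code:

uint8_t bar[10];
uint8_t lpc = 10;

do
{
 --lpc;        // Pre-decrement so the indices used run from 9 down to 0
 bar[lpc] = 0;
} while (lpc != 0);

The decrement and the zero test are now separated by the array write, and the loop carries the extra bookkeeping inside its body; whether the compiler can still exploit the zero flag depends on the target, and as noted above you often end up no better off than with the plain up-counting for loop.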

As a parting thought, concepts such as these are second nature to assembly language programmers, all of whom do this sort of thing instinctively. As a result, if you are really interested in getting the best out of your C compiler, you could do a lot worse than learning how to program your target processor in assembly language. Does this defeat one of the objectives of programming in a high-level language? Yes. However, for giving you insight into what is going on under the hood it cannot be beaten.


17 Responses to “Efficient C Tips #7 – Fast loops”

  1. Konstantin Zertsekel says:

    Let me contradict you, since it is not that simple IMHO. At least three additional things should be taken into consideration:

    1. Compiler
    2. Compiler optimization
    3. CPU architecture (ARM, Xscale, Atom etc.)

    Let's have some specific examples, since I happen to have ARMv5 and WindRiver GCC. The devil is in the details, but the truth lies there as well…

    C code:

    void foo()
    {
    }

    void testing_loop()
    {
        int i;
        for (i = 0; i < 100; i++)
            foo();
    }

    Asm code with -Os optimization (loop starts with 99):

    00065e78 [foo]:
       65e78: e1a0f00e  mov pc, lr

    00065e7c [testing_loop]:
       65e7c: e92d4010  stmdb sp!, {r4, lr}
       65e80: e3a04063  mov r4, #99            ; 0x63
       65e84: ebfffffb  bl 65e78 [foo]
       65e88: e2544001  subs r4, r4, #1        ; 0x1
       65e8c: 5afffffc  bpl 65e84 [testing_loop+0x8]
       65e90: e8bd8010  ldmia sp!, {r4, pc}

    Asm code with -O0 optimization (loop starts with 0):

    0009930c [foo]:
       9930c: e1a0c00d  mov r12, sp
       99310: e92dd800  stmdb sp!, {r11, r12, lr, pc}
       99314: e24cb004  sub r11, r12, #4       ; 0x4
       99318: e91ba800  ldmdb r11, {r11, sp, pc}

    0009931c [testing_loop]:
       9931c: e1a0c00d  mov r12, sp
       99320: e92dd800  stmdb sp!, {r11, r12, lr, pc}
       99324: e24cb004  sub r11, r12, #4       ; 0x4
       99328: e24dd004  sub sp, sp, #4         ; 0x4
       9932c: e1a00000  nop (mov r0,r0)
       99330: e3a03000  mov r3, #0             ; 0x0
       99334: e50b3010  str r3, [r11, -#16]
       99338: e51b3010  ldr r3, [r11, -#16]
       9933c: e3530063  cmp r3, #99            ; 0x63
       99340: da000000  ble 99348 [testing_loop+0x2c]
       99344: ea000004  b 9935c [testing_loop+0x40]
       99348: ebffffef  bl 9930c [foo]
       9934c: e51b3010  ldr r3, [r11, -#16]
       99350: e2832001  add r2, r3, #1         ; 0x1
       99354: e50b2010  str r2, [r11, -#16]
       99358: eafffff6  b 99338 [testing_loop+0x1c]
       9935c: e91ba800  ldmdb r11, {r11, sp, pc}

    C code – 'i' is the parameter for foo():

    void foo(int dummy)
    {
    }

    void testing_loop()
    {
        int i;
        for (i = 0; i < 100; i++)
            foo(i);
    }

    Asm code with -Os optimization (loop starts with 0):

    00065e78 [foo]:
       65e78: e1a0f00e  mov pc, lr

    00065e7c [testing_loop]:
       65e7c: e92d4010  stmdb sp!, {r4, lr}
       65e80: e3a04000  mov r4, #0             ; 0x0
       65e84: e1a00004  mov r0, r4
       65e88: e2844001  add r4, r4, #1         ; 0x1
       65e8c: ebfffff9  bl 65e78 [foo]
       65e90: e3540063  cmp r4, #99            ; 0x63
       65e94: dafffffa  ble 65e84 [testing_loop+0x8]
       65e98: e8bd8010  ldmia sp!, {r4, pc}

    Asm code with -O0 optimization (loop starts with 0):

    0009930c [foo]:
       9930c: e1a0c00d  mov r12, sp
       99310: e92dd800  stmdb sp!, {r11, r12, lr, pc}
       99314: e24cb004  sub r11, r12, #4       ; 0x4
       99318: e24dd004  sub sp, sp, #4         ; 0x4
       9931c: e50b0010  str r0, [r11, -#16]
       99320: e91ba800  ldmdb r11, {r11, sp, pc}

    00099324 [testing_loop]:
       99324: e1a0c00d  mov r12, sp
       99328: e92dd800  stmdb sp!, {r11, r12, lr, pc}
       9932c: e24cb004  sub r11, r12, #4       ; 0x4
       99330: e24dd004  sub sp, sp, #4         ; 0x4
       99334: e1a00000  nop (mov r0,r0)
       99338: e3a03000  mov r3, #0             ; 0x0
       9933c: e50b3010  str r3, [r11, -#16]
       99340: e51b3010  ldr r3, [r11, -#16]
       99344: e3530063  cmp r3, #99            ; 0x63
       99348: da000000  ble 99350 [testing_loop+0x2c]
       9934c: ea000005  b 99368 [testing_loop+0x44]
       99350: e51b0010  ldr r0, [r11, -#16]
       99354: ebffffec  bl 9930c [foo]
       99358: e51b3010  ldr r3, [r11, -#16]
       9935c: e2832001  add r2, r3, #1         ; 0x1
       99360: e50b2010  str r2, [r11, -#16]
       99364: eafffff5  b 99340 [testing_loop+0x1c]
       99368: e91ba800  ldmdb r11, {r11, sp, pc}

    Conclusion:

    When a simple counting loop is used and 'i' is not referenced except for the loop count, it is the compiler which should optimize the loop to count down; I truly believe that every decent compiler does so. When 'i' is used for something other than the loop count, the counting starts at zero in the assembly anyway, no matter what optimization is used.

    So, the programmer (even the most real-time programmer) should NOT engage himself in assembly implementation details, because it is counterproductive and NOT portable across different CPU architectures. The only constructive lesson to be learnt from this experience is: USE A DECENT COMPILER! 10x.

  2. Nigel Jones says:

    While I agree with you that one should use a decent compiler, I'm not sure I agree with much else, Konstantin. (I'll leave for another day whether GCC is a decent compiler.) While just about all compilers will recognize that it's more efficient to count down in the trivial cases that we have used here, I know it is not the case when the bodies of the loops start getting large. Your assertion that programmers should not concern themselves with implementation details is IMHO wrong in several ways.

    1. In the particular circumstances of this example, I know of no instances where counting down produces worse code than counting up. The converse is not true. Thus my approach is arguably more portable across different architectures.

    2. The argument that a programmer need not concern himself with the underlying architecture of the CPU is one that is regularly advanced by those with a CS background. Although I admire the purity of the argument, it simply doesn't stand up to inspection in real-time embedded systems. Why? Well, the performance of embedded systems is judged by multiple criteria. For example, in hard real-time systems the correct answer delivered too late is useless. Similarly, in portable systems the correct answer delivered using too many Joules is also useless. In fact this whole concept of whether one should code to the target platform is a fascinating topic in its own right; I'll endeavor to address it in a future posting.

    In the interim, thanks for joining the debate. I'd be interested to hear from other readers who have recoded a loop based on this posting – and what they found out. If you do post your results, please include the target processor, compiler and optimization settings. Thanks!

  3. Tom Evans says:

    > I'd be interested to hear from
    > other readers who have recoded
    > a loop based on this posting

    Yes, but at least 18 years ago on a 68000, where "int", "short" and "unsigned short" gave quite different results.

    But NOTHING beats unrolling the loop. Unless, that is, the innards of the loop are duplicating a library function like memset() or strchr(), and the ones in the library are able to use tricky CPU instructions. Especially the cache management tricks in a good memcpy() on CPUs that have a data cache.

    Over the lifetime of one product based on a 40MHz PPC I improved memcpy() from 1.6MB/s to 4MB/s (fixing the CPU initialisation, turning the caches on), then 8MB/s (copying 32 bits at a time instead of the library's byte-at-a-time), then 17MB/s and finally 27MB/s with proper cache handling. The memcpy ended up as 146 lines of C and 120 lines of assembler, with more special cases than you'd believe.
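
    A simplified sketch of the 32-bits-at-a-time copy step described above (the helper name copy_words() is made up for illustration; a production memcpy() also has to deal with misaligned heads and tails, overlapping regions and, as noted, the cache):

    #include <stdint.h>
    #include <stddef.h>

    /* Copy len bytes, one 32-bit word at a time.
     * Assumes both pointers are 4-byte aligned and len is a multiple of 4. */
    static void copy_words(void *dst, const void *src, size_t len)
    {
        uint32_t       *d = dst;
        const uint32_t *s = src;
        size_t          n = len / 4u;

        while (n--)        /* Count down to zero, as per the article */
        {
            *d++ = *s++;
        }
    }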

  4. ashleigh says:

    In the case of embedded systems programming (and frequently, but not always, for big systems), there is nothing to be lost and much to be gained by looking at the generated code. And then THINKING. And then trying a few ideas out.

    If portability is imperative, then this approach may not be a good move. If portability is not, then it can save your bacon.

    I've managed to do 9 hand optimisations on some embedded code that HAD TO FIT in 48K. Each time, I had about 100 bytes of spare space. After each hand optimisation (looking at the generated code and changing the C source), I was able to keep the same functionality and gain 1K to 2K of free space to add new features into. Each time I thought "that's it, nothing more can be squeezed from this", and each time I got more. That means the original approx 48K of code was able to be shrunk to about 35K to 38K, just by looking at the generated code and having a think and a fiddle around. [And I should add – this was with space optimisation at the highest setting on the compiler.]

    I agree that the CS mantra is "don't look, let the compiler work, and if it does not fit get a bigger machine." When you are 2 weeks from product shipment and your code is 200 bytes TOO BIG to fit on your embedded micro, your boss will not thank you for parroting the CS lecturer's attitude. You have to fix it, and you have to fix it now, and putting in a bigger processor will slip delivery 6 months. So you look at the generated code, and you go fixing.

    When you find a construct like loops, or if-else statements, that can save 2 or 3 or 5 bytes each time (and you have a couple of hundred of them), it might take a week's editing, but all those little tiny savings add up to saving the schedule, and saving your job!

  5. ashleigh says:

    Another comment about count-down loops. I have used LOTS of timers, and these are really easy when you have a regular tick going off – in a poor man's round-robin scheduler. A timer is just a byte used to count.

    Counting down is easy – you test for 0 and when you hit it, the timer has expired and you go do stuff. It is also easy to set the timer to a special value (UNUSED), which is 255 (0xFF) for byte-sized counters. Then you do a test – if the timer is UNUSED, ignore it; otherwise decrement it, and when it hits 0, do stuff.

    The test against 0xFF is not always very efficient. The test against 0 (just after a decrement) is, though.
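
    A minimal sketch of the kind of tick handler described above (the names TIMER_UNUSED, N_TIMERS, timers and timer_expired are made up for illustration):

    #include <stdint.h>
    #include <stdbool.h>

    #define TIMER_UNUSED 0xFFu                /* Sentinel: this timer is not running */
    #define N_TIMERS     4u

    static volatile uint8_t timers[N_TIMERS];         /* Tick counts, set by the application */
    static volatile bool timer_expired[N_TIMERS];     /* Flags polled by the main loop */

    /* Called from the regular tick interrupt. */
    void tick_handler(void)
    {
        for (uint8_t i = N_TIMERS; i-- != 0; )
        {
            if (timers[i] == TIMER_UNUSED)
            {
                continue;                     /* Unused - ignore it */
            }
            if (--timers[i] == 0)             /* Count down; the zero test is cheap */
            {
                timer_expired[i] = true;      /* Expired - the main loop goes and does stuff */
                timers[i] = TIMER_UNUSED;     /* Mark it free again */
            }
        }
    }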

  6. Nigel Jones says:

    Ashleigh: I completely agree about looking at the generated code. This is a technique I use all the time. Alas, to really make use of it you have to understand the instruction set of the target – and this is becoming increasingly rare. Indeed, companies are now making a virtue of not having to understand the instruction set / resort to assembly language with their CPU – see Luminary Cortex for an example of this. While I see the appeal of this, divorcing oneself from what's going on under the hood is not a great idea IMHO.

  7. Greg Nelson says:

    Guess I'm just an "under the hood" kind of guy. My first undergrad CS class had us compile something with -S and look at what came out; I guess I never stopped.

    I'm a huge fan of loop unrolling. There are also times when rewriting memcpy() is exactly what you want to do. One current application needs to erase 1024 contiguous, word-aligned bytes MANY times, very fast. Creating a specialized 'bzero' that (1) assumes word alignment, (2) takes a length in words, not bytes, and (3) gives up in disgust if the length isn't a multiple of the unrolling, allowed this to go from a 77us operation to an under-5us operation.

    Which, to paraphrase what Nigel might say, lets the CPU save 94% of the family joules.
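
    A sketch of that kind of specialised clear routine (the name bzero_words() and the unroll factor of four are made up for illustration; the real gain depends on the compiler and the bus):

    #include <stdint.h>
    #include <stddef.h>

    /* Clear n_words 32-bit words starting at p.
     * Assumes p is word aligned and n_words is a multiple of 4 (the unroll factor);
     * a real version would 'give up in disgust' otherwise. */
    static void bzero_words(uint32_t *p, size_t n_words)
    {
        size_t n = n_words / 4u;

        while (n--)
        {
            p[0] = 0;
            p[1] = 0;
            p[2] = 0;
            p[3] = 0;
            p += 4;
        }
    }

    /* e.g. clearing 1024 contiguous bytes: bzero_words(buffer, 1024u / 4u); */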

  8. Doug says:

    Thanks for this post! I also started programming in assembly and always prefer to use a count-down loop when possible. Yes, there are places where it doesn’t work but lots of times when it does, and it should be tried first. I’ve been known to use this as an interview question to see how low-level a programmer thinks.

  9. It is actually important to consider "under the hood" things while creating code. However, while assembler is the first "under the hood" thing, the compiler is the second.

     The real deal is to write code whose structure allows optimization. It is really simple for a compiler to change the counting direction when the only purpose of the loop is to execute code multiple times; much more complex is what you mentioned – addressing using the loop counter, and so on.

     I think that the most important way to optimise C code is to fit the code to the target hardware, as in the bzero implementation described in an earlier comment – that can provide much more gain than the simple tweaks every decent compiler can do.

  10. Eric Miller says:

    You intended for your last example to be incorrect, but it’s incorrect for the wrong reason. It does not initialize lpc.

    A fully corrected version of the code would look something like this:

    #define DIMOF(a) (sizeof(a)/sizeof(a[0]))

    uint8_t bar[10];
    uint8_t lpc = DIMOF(bar) - 1;

    do
    {
    bar[lpc] = 0;
    } while (--lpc);

  11. vishal says:

    Thank you for such a nice post.

  12. Randall says:

    I am brushing up on my embedded knowledge in preparation for an interview and came across this site. Excellent, well done! Couple of things I would add.

    1. We should expect some questions where we are asked to write a code snippet that waits for a bit in a status register to indicate that a peripheral has data in its input queue – then we read the data… and the complementary case where we wait for a peripheral’s output queue to be empty by reading the status register flag and then writing the data (a sketch follows this list). See the author’s volatile, bit manipulation and Accessing fixed memory locations articles.

    2. Questions about calling conventions (i.e. what happens when a function call is invoked). This is very important to know when going to and from assembly and C.

    3. In the author’s question about interrupt service routines where he adds a printf into the code, the candidate should also immediately recognize – and this applies to ANY ISR, not just those used on constrained embedded systems – that it is very important to get out of the ISR as quickly as possible because the whole system can be blocked. printf is a rather large block of code, not suitable for being executed in an ISR. The candidate should also know about interrupt priorities.
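
    For point 1, a typical busy-wait might be sketched like this (the register addresses, names and bit masks here are invented purely for illustration):

    #include <stdint.h>

    #define UART_STATUS (*(volatile uint8_t *)0x40001000u)  /* hypothetical status register */
    #define UART_DATA   (*(volatile uint8_t *)0x40001004u)  /* hypothetical data register   */
    #define RX_READY    0x01u   /* status bit: receive data available  */
    #define TX_EMPTY    0x02u   /* status bit: transmit queue is empty */

    uint8_t uart_get_byte(void)
    {
        while ((UART_STATUS & RX_READY) == 0)
        {
            ;   /* Spin until the peripheral has data in its input queue */
        }
        return UART_DATA;
    }

    void uart_put_byte(uint8_t b)
    {
        while ((UART_STATUS & TX_EMPTY) == 0)
        {
            ;   /* Spin until there is room in the output queue */
        }
        UART_DATA = b;
    }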

    In the following, there is a very obvious reason the code did not work: lpc is not initialized, and this code needs it to start at 9. We are looping from 9 down to 0.

    uint8_t bar[10];
    uint8_t lpc;

    do
    {
    bar[lpc] = 0; // Error! First time through results in index beyond end of array
    } while (--lpc);

  13. Steve says:

    How about this when an index is required:

    for (uint8_t lpc = 10; lpc--; )
    {
    foo();
    }

    It should loop through the function with an lpc value of 9 down to 0.

  14. Zibri says:

    If you want really fast code, you should avoid loops altogether.
    Repeating a small piece of code is WAY faster than ANY loop.

    • Nigel Jones says:

      I think your use of the phrase ‘small piece of code’ is key. On higher performance processors with an instruction cache, unrolling loops can actually cause an instruction cache miss. It’s one of the things that makes optimizing code to run on multiple targets such a difficult thing to do.
