<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Efficient C Tips #7 &#8211; Fast loops</title>
	<atom:link href="http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/feed/" rel="self" type="application/rss+xml" />
	<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/</link>
	<description>Thoughts on embedded systems by Nigel Jones</description>
	<lastBuildDate>Wed, 28 Jul 2010 00:59:13 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Doug</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-1309</link>
		<dc:creator>Doug</dc:creator>
		<pubDate>Tue, 27 Jul 2010 23:45:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-1309</guid>
		<description>Thanks for this post! I also started programming in assembly and always prefer to use a count-down loop when possible. Yes, there are places where it doesn&#039;t work but lots of times when it does, and it should be tried first. I&#039;ve been known to use this as an interview question to see how low-level a programmer thinks.</description>
		<content:encoded><![CDATA[<p>Thanks for this post! I also started programming in assembly and always prefer to use a count-down loop when possible. Yes, there are places where it doesn&#8217;t work but lots of times when it does, and it should be tried first. I&#8217;ve been known to use this as an interview question to see how low-level a programmer thinks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Greg Nelson</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-151</link>
		<dc:creator>Greg Nelson</dc:creator>
		<pubDate>Wed, 24 Feb 2010 20:00:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-151</guid>
		<description>Guess I&#039;m just an &quot;under the hood&quot; kind of guy. My first undergrad CS class had us compile something with -S and look at what came out; I guess I never stopped. I&#039;m a huge fan of loop unrolling. There are also times when rewriting memcpy() is exactly what you want to do. One current application needs to erase 1024 contiguous, word-aligned bytes MANY times, very fast. Creating a specialized &#039;bzero&#039; that (1) assumes word alignment, (2) takes a length in words, not bytes, and (3) gives up in disgust if the length isn&#039;t a multiple of the unrolling, allowed this to go from a 77us operation to an under 5us operation. Which, to paraphrase what Nigel might say, lets the CPU save 94% of the family joules.</description>
		<content:encoded><![CDATA[<p>Guess I&#39;m just an &quot;under the hood&quot; kind of guy. My first undergrad CS class had us compile something with -S and look at what came out; I guess I never stopped. I&#39;m a huge fan of loop unrolling. There are also times when rewriting memcpy() is exactly what you want to do. One current application needs to erase 1024 contiguous, word-aligned bytes MANY times, very fast. Creating a specialized &#39;bzero&#39; that (1) assumes word alignment, (2) takes a length in words, not bytes, and (3) gives up in disgust if the length isn&#39;t a multiple of the unrolling, allowed this to go from a 77us operation to an under 5us operation. Which, to paraphrase what Nigel might say, lets the CPU save 94% of the family joules.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nigel Jones</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-150</link>
		<dc:creator>Nigel Jones</dc:creator>
		<pubDate>Thu, 28 Jan 2010 12:17:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-150</guid>
		<description>Ashleigh: I completely agree about looking at the generated code. This is a technique I use all the time. Alas, to really make use of it you have to understand the instruction set of the target - and that understanding is becoming increasingly rare. Indeed, companies are now making a virtue of not having to understand the instruction set / resort to assembly language with their CPU - see Luminary Cortex for an example of this. While I see the appeal of this, divorcing oneself from what&#039;s going on under the hood is not a great idea IMHO.</description>
		<content:encoded><![CDATA[<p>Ashleigh: I completely agree about looking at the generated code. This is a technique I use all the time. Alas, to really make use of it you have to understand the instruction set of the target &#8211; and that understanding is becoming increasingly rare. Indeed, companies are now making a virtue of not having to understand the instruction set / resort to assembly language with their CPU &#8211; see Luminary Cortex for an example of this. While I see the appeal of this, divorcing oneself from what&#39;s going on under the hood is not a great idea IMHO.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ashleigh</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-149</link>
		<dc:creator>ashleigh</dc:creator>
		<pubDate>Tue, 26 Jan 2010 01:31:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-149</guid>
		<description>Another comment about countdown loops. I have used LOTS of timers, and these are really easy when you have a regular tick going off - in a poor man&#039;s round-robin scheduler. A timer is just a byte used to count. Counting down is easy - you test for 0, and when it is reached the counter has expired - you go do stuff. And it&#039;s also easy then to set the timer to a special value (UNUSED) which is 255 (0xFF) for byte-sized counters. Then you do a test - if the timer is UNUSED, ignore it. Otherwise, decrement and when 0, do stuff. The test against 0xFF is not always very efficient. The test against 0 (just after a decrement) is though.</description>
		<content:encoded><![CDATA[<p>Another comment about countdown loops. I have used LOTS of timers, and these are really easy when you have a regular tick going off &#8211; in a poor man&#39;s round-robin scheduler. A timer is just a byte used to count. Counting down is easy &#8211; you test for 0, and when it is reached the counter has expired &#8211; you go do stuff. And it&#39;s also easy then to set the timer to a special value (UNUSED) which is 255 (0xFF) for byte-sized counters. Then you do a test &#8211; if the timer is UNUSED, ignore it. Otherwise, decrement and when 0, do stuff. The test against 0xFF is not always very efficient. The test against 0 (just after a decrement) is though.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ashleigh</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-148</link>
		<dc:creator>ashleigh</dc:creator>
		<pubDate>Tue, 26 Jan 2010 01:27:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-148</guid>
		<description>In the case of embedded systems programming (and frequently but not always for big systems), there is nothing to be lost and much to be gained by looking at the generated code. And then THINKING. And then trying a few ideas out. If portability is imperative, then this approach may not be a good move. If portability is not, then it can save your bacon. I&#039;ve managed to do 9 hand optimisations on some embedded code that HAD TO FIT in 48K. Each time, I had about 100 bytes of spare space. After each hand optimisation (looking at the generated code and changing the C source), I was able to get the same function and get 1K to 2K of free space to add new features into. Each time I thought &quot;that&#039;s it, nothing more can be squeezed from this&quot;, and each time, I got more. That means the original approx 48K of code was able to be shrunk to about 35K to 38K, just by looking at the generated code and having a think and a fiddle around. [And I should add - this was with space optimisation at the highest setting on the compiler.] I agree that the CS mantra is &quot;don&#039;t look, let the compiler work, if it does not fit get a bigger machine.&quot; When you are 2 weeks from product shipment and your code is 200 bytes TOO BIG to fit on your embedded micro, your boss will not thank you for parroting the CS lecturer&#039;s attitude. You have to fix it, and you have to fix it now, and putting in a bigger processor will slip delivery 6 months. So you look at the generated code, and you go fixing. When you find a construct like loops, or if-else statements, that can save 2 or 3 or 5 bytes each time (and you have a couple of hundred of them) it might take a week&#039;s editing, but all those little tiny savings add up to saving the schedule, and saving your job!</description>
		<content:encoded><![CDATA[<p>In the case of embedded systems programming (and frequently but not always for big systems), there is nothing to be lost and much to be gained by looking at the generated code. And then THINKING. And then trying a few ideas out. If portability is imperative, then this approach may not be a good move. If portability is not, then it can save your bacon. I&#39;ve managed to do 9 hand optimisations on some embedded code that HAD TO FIT in 48K. Each time, I had about 100 bytes of spare space. After each hand optimisation (looking at the generated code and changing the C source), I was able to get the same function and get 1K to 2K of free space to add new features into. Each time I thought &quot;that&#39;s it, nothing more can be squeezed from this&quot;, and each time, I got more. That means the original approx 48K of code was able to be shrunk to about 35K to 38K, just by looking at the generated code and having a think and a fiddle around. [And I should add - this was with space optimisation at the highest setting on the compiler.] I agree that the CS mantra is &quot;don&#39;t look, let the compiler work, if it does not fit get a bigger machine.&quot; When you are 2 weeks from product shipment and your code is 200 bytes TOO BIG to fit on your embedded micro, your boss will not thank you for parroting the CS lecturer&#39;s attitude. You have to fix it, and you have to fix it now, and putting in a bigger processor will slip delivery 6 months. So you look at the generated code, and you go fixing. When you find a construct like loops, or if-else statements, that can save 2 or 3 or 5 bytes each time (and you have a couple of hundred of them) it might take a week&#39;s editing, but all those little tiny savings add up to saving the schedule, and saving your job!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tom Evans</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-147</link>
		<dc:creator>Tom Evans</dc:creator>
		<pubDate>Sat, 10 Oct 2009 06:55:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-147</guid>
		<description>&gt; I&#039;d be interested to hear from other readers who have recoded a loop based on this posting
Yes, but at least 18 years ago on a 68000, where &quot;int&quot;, &quot;short&quot; and &quot;unsigned short&quot; gave quite different results. But &lt;b&gt;NOTHING&lt;/b&gt; beats unrolling the loop. Unless the innards of the loop are duplicating a library function like memset() or strchr(), and the ones in the library are able to use tricky CPU instructions. Especially the cache management tricks in a good memcpy() on CPUs that have a data cache. Over the lifetime of one product based on a 40MHz PPC I improved &quot;memcpy()&quot; from 1.6MB/s to 4MB/s (fixing the CPU initialisation, turning caches on), then 8MB/s (copying 32 bits at a time instead of the library&#039;s byte-at-a-time), then 17MB/s and finally 27MB/s with proper cache handling. The &quot;memcpy&quot; ended up as 146 lines of C and 120 lines of assembler, with more special cases than you&#039;d believe.</description>
		<content:encoded><![CDATA[<p>&gt; I&#39;d be interested to hear from other readers who have recoded a loop based on this posting</p><p>Yes, but at least 18 years ago on a 68000, where &quot;int&quot;, &quot;short&quot; and &quot;unsigned short&quot; gave quite different results. But <b>NOTHING</b> beats unrolling the loop. Unless the innards of the loop are duplicating a library function like memset() or strchr(), and the ones in the library are able to use tricky CPU instructions. Especially the cache management tricks in a good memcpy() on CPUs that have a data cache. Over the lifetime of one product based on a 40MHz PPC I improved &quot;memcpy()&quot; from 1.6MB/s to 4MB/s (fixing the CPU initialisation, turning caches on), then 8MB/s (copying 32 bits at a time instead of the library&#39;s byte-at-a-time), then 17MB/s and finally 27MB/s with proper cache handling. The &quot;memcpy&quot; ended up as 146 lines of C and 120 lines of assembler, with more special cases than you&#39;d believe.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nigel Jones</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-146</link>
		<dc:creator>Nigel Jones</dc:creator>
		<pubDate>Sun, 15 Mar 2009 17:55:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-146</guid>
		<description>While I agree with you that one should use a decent compiler, I&#039;m not sure I agree with much else, Konstantin. (I&#039;ll leave for another day whether GCC is a decent compiler). While just about all compilers will recognize that it&#039;s more efficient to count down in the trivial cases that we have used here, I know it is not the case when the body of the loop starts getting large. Now your assertion that programmers should not concern themselves with implementation details is IMHO wrong in several ways. 1. In the particular circumstances of this example, I know of no instances where counting down produces worse code than counting up. The converse is not true. Thus my approach is arguably more portable across different architectures. 2. The argument that a programmer need not concern himself with the underlying architecture of the CPU is one that is regularly advanced by those with a CS background. Although I admire the purity of the argument, it simply doesn&#039;t stand up to inspection in real time embedded systems. Why? Well, the performance of embedded systems is judged by multiple criteria. For example, in hard real time systems the correct answer delivered too late is useless. Similarly, in portable systems the correct answer delivered using too many Joules is also useless. In fact this whole concept of whether one should code to the target platform is a fascinating topic in its own right. I&#039;ll endeavor to address this in a future posting. In the interim, thanks for joining the debate. I&#039;d be interested to hear from other readers who have recoded a loop based on this posting - and what they found out. If you do post your results, please include the target processor, compiler and optimization settings. Thanks!</description>
		<content:encoded><![CDATA[<p>While I agree with you that one should use a decent compiler, I&#8217;m not sure I agree with much else, Konstantin. (I&#8217;ll leave for another day whether GCC is a decent compiler). While just about all compilers will recognize that it&#8217;s more efficient to count down in the trivial cases that we have used here, I know it is not the case when the body of the loop starts getting large. Now your assertion that programmers should not concern themselves with implementation details is IMHO wrong in several ways. 1. In the particular circumstances of this example, I know of no instances where counting down produces worse code than counting up. The converse is not true. Thus my approach is arguably more portable across different architectures. 2. The argument that a programmer need not concern himself with the underlying architecture of the CPU is one that is regularly advanced by those with a CS background. Although I admire the purity of the argument, it simply doesn&#8217;t stand up to inspection in real time embedded systems. Why? Well, the performance of embedded systems is judged by multiple criteria. For example, in hard real time systems the correct answer delivered too late is useless. Similarly, in portable systems the correct answer delivered using too many Joules is also useless. In fact this whole concept of whether one should code to the target platform is a fascinating topic in its own right. I&#8217;ll endeavor to address this in a future posting. In the interim, thanks for joining the debate. I&#8217;d be interested to hear from other readers who have recoded a loop based on this posting &#8211; and what they found out. If you do post your results, please include the target processor, compiler and optimization settings. Thanks!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Konstantin Zertsekel</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-145</link>
		<dc:creator>Konstantin Zertsekel</dc:creator>
		<pubDate>Wed, 11 Mar 2009 16:44:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-145</guid>
		<description>Let me contradict you, since it is not that simple IMHO.
At least three additional things should be taken into consideration:
1. Compiler
2. Compiling optimization
3. CPU architecture (ARM, Xscale, Atom etc.)

Let&#039;s look at some specific examples, since I happen to have ARMv5 and WindRiver GCC at hand.
The devil is in the details, but the truth lies there as well...

C code:
-------
void foo()
{
}

void testing_loop()
{
    int i;
    for (i = 0; i &lt; 100; i++)
        foo();
}

Asm code with -Os optimization (loop starts with 99):
-----------------------------------------------------
00065e78 [foo]:
   65e78:       e1a0f00e        mov     pc, lr
00065e7c [testing_loop]:
   65e7c:       e92d4010        stmdb   sp!, {r4, lr}
   65e80:       e3a04063        mov     r4, #99 ; 0x63
   65e84:       ebfffffb        bl      65e78 [foo]
   65e88:       e2544001        subs    r4, r4, #1      ; 0x1
   65e8c:       5afffffc        bpl     65e84 [testing_loop+0x8]
   65e90:       e8bd8010        ldmia   sp!, {r4, pc}

Asm code with -O0 optimization (loop starts with 0):
----------------------------------------------------
0009930c [foo]:
   9930c:       e1a0c00d        mov     r12, sp
   99310:       e92dd800        stmdb   sp!, {r11, r12, lr, pc}
   99314:       e24cb004        sub     r11, r12, #4    ; 0x4
   99318:       e91ba800        ldmdb   r11, {r11, sp, pc}
0009931c [testing_loop]:
   9931c:       e1a0c00d        mov     r12, sp
   99320:       e92dd800        stmdb   sp!, {r11, r12, lr, pc}
   99324:       e24cb004        sub     r11, r12, #4    ; 0x4
   99328:       e24dd004        sub     sp, sp, #4      ; 0x4
   9932c:       e1a00000        nop                     (mov r0,r0)
   99330:       e3a03000        mov     r3, #0  ; 0x0
   99334:       e50b3010        str     r3, [r11, -#16]
   99338:       e51b3010        ldr     r3, [r11, -#16]
   9933c:       e3530063        cmp     r3, #99 ; 0x63
   99340:       da000000        ble     99348 [testing_loop+0x2c]
   99344:       ea000004        b       9935c [testing_loop+0x40]
   99348:       ebffffef        bl      9930c [foo]
   9934c:       e51b3010        ldr     r3, [r11, -#16]
   99350:       e2832001        add     r2, r3, #1      ; 0x1
   99354:       e50b2010        str     r2, [r11, -#16]
   99358:       eafffff6        b       99338 [testing_loop+0x1c]
   9935c:       e91ba800        ldmdb   r11, {r11, sp, pc}

C code - &#039;i&#039; is the parameter for foo():
========================================
void foo(int dummy)
{
}

void testing_loop()
{
    int i;
    for (i = 0; i &lt; 100; i++)
        foo(i);
}

Asm code with -Os optimization (loop starts with 0):
====================================================
00065e78 [foo]:
   65e78:       e1a0f00e        mov     pc, lr
00065e7c [testing_loop]:
   65e7c:       e92d4010        stmdb   sp!, {r4, lr}
   65e80:       e3a04000        mov     r4, #0  ; 0x0
   65e84:       e1a00004        mov     r0, r4
   65e88:       e2844001        add     r4, r4, #1      ; 0x1
   65e8c:       ebfffff9        bl      65e78 [foo]
   65e90:       e3540063        cmp     r4, #99 ; 0x63
   65e94:       dafffffa        ble     65e84 [testing_loop+0x8]
   65e98:       e8bd8010        ldmia   sp!, {r4, pc}

Asm code with -O0 optimization (loop starts with 0):
====================================================
0009930c [foo]:
   9930c:       e1a0c00d        mov     r12, sp
   99310:       e92dd800        stmdb   sp!, {r11, r12, lr, pc}
   99314:       e24cb004        sub     r11, r12, #4    ; 0x4
   99318:       e24dd004        sub     sp, sp, #4      ; 0x4
   9931c:       e50b0010        str     r0, [r11, -#16]
   99320:       e91ba800        ldmdb   r11, {r11, sp, pc}
00099324 [testing_loop]:
   99324:       e1a0c00d        mov     r12, sp
   99328:       e92dd800        stmdb   sp!, {r11, r12, lr, pc}
   9932c:       e24cb004        sub     r11, r12, #4    ; 0x4
   99330:       e24dd004        sub     sp, sp, #4      ; 0x4
   99334:       e1a00000        nop                     (mov r0,r0)
   99338:       e3a03000        mov     r3, #0  ; 0x0
   9933c:       e50b3010        str     r3, [r11, -#16]
   99340:       e51b3010        ldr     r3, [r11, -#16]
   99344:       e3530063        cmp     r3, #99 ; 0x63
   99348:       da000000        ble     99350 [testing_loop+0x2c]
   9934c:       ea000005        b       99368 [testing_loop+0x44]
   99350:       e51b0010        ldr     r0, [r11, -#16]
   99354:       ebffffec        bl      9930c [foo]
   99358:       e51b3010        ldr     r3, [r11, -#16]
   9935c:       e2832001        add     r2, r3, #1      ; 0x1
   99360:       e50b2010        str     r2, [r11, -#16]
   99364:       eafffff5        b       99340 [testing_loop+0x1c]
   99368:       e91ba800        ldmdb   r11, {r11, sp, pc}

Conclusion:
***********
When a simple counting loop is used and &#039;i&#039; is not referenced except for the loop count, it is the compiler that should optimize the loop to count down. I truly believe that every decent compiler does so.
When &#039;i&#039; is used for something other than the loop count, the counting in assembly starts with zero anyway, no matter what optimization is used.
So, the programmer (even the most Real-Time programmer) should NOT engage himself in assembly implementation details, because it is counterproductive and NOT portable over different CPU architectures.
The only constructive lesson to be learnt through this experience is USE A DECENT COMPILER!
10x.</description>
		<content:encoded><![CDATA[<p>Let me contradict you, since it is not that simple IMHO.</p>
<p>At least three additional things should be taken into consideration:<br />
1. Compiler<br />
2. Compiling optimization<br />
3. CPU architecture (ARM, Xscale, Atom etc.)</p>
<p>Let&#39;s look at some specific examples, since I happen to have ARMv5 and WindRiver GCC at hand.<br />
The devil is in the details, but the truth lies there as well&#8230;</p>
<pre>
C code:
-------
void foo()
{
}

void testing_loop()
{
    int i;
    for (i = 0; i &lt; 100; i++)
        foo();
}

Asm code with -Os optimization (loop starts with 99):
-----------------------------------------------------
00065e78 [foo]:
   65e78:       e1a0f00e        mov     pc, lr
00065e7c [testing_loop]:
   65e7c:       e92d4010        stmdb   sp!, {r4, lr}
   65e80:       e3a04063        mov     r4, #99 ; 0x63
   65e84:       ebfffffb        bl      65e78 [foo]
   65e88:       e2544001        subs    r4, r4, #1      ; 0x1
   65e8c:       5afffffc        bpl     65e84 [testing_loop+0x8]
   65e90:       e8bd8010        ldmia   sp!, {r4, pc}

Asm code with -O0 optimization (loop starts with 0):
----------------------------------------------------
0009930c [foo]:
   9930c:       e1a0c00d        mov     r12, sp
   99310:       e92dd800        stmdb   sp!, {r11, r12, lr, pc}
   99314:       e24cb004        sub     r11, r12, #4    ; 0x4
   99318:       e91ba800        ldmdb   r11, {r11, sp, pc}
0009931c [testing_loop]:
   9931c:       e1a0c00d        mov     r12, sp
   99320:       e92dd800        stmdb   sp!, {r11, r12, lr, pc}
   99324:       e24cb004        sub     r11, r12, #4    ; 0x4
   99328:       e24dd004        sub     sp, sp, #4      ; 0x4
   9932c:       e1a00000        nop                     (mov r0,r0)
   99330:       e3a03000        mov     r3, #0  ; 0x0
   99334:       e50b3010        str     r3, [r11, -#16]
   99338:       e51b3010        ldr     r3, [r11, -#16]
   9933c:       e3530063        cmp     r3, #99 ; 0x63
   99340:       da000000        ble     99348 [testing_loop+0x2c]
   99344:       ea000004        b       9935c [testing_loop+0x40]
   99348:       ebffffef        bl      9930c [foo]
   9934c:       e51b3010        ldr     r3, [r11, -#16]
   99350:       e2832001        add     r2, r3, #1      ; 0x1
   99354:       e50b2010        str     r2, [r11, -#16]
   99358:       eafffff6        b       99338 [testing_loop+0x1c]
   9935c:       e91ba800        ldmdb   r11, {r11, sp, pc}

C code - &#39;i&#39; is the parameter for foo():
========================================
void foo(int dummy)
{
}

void testing_loop()
{
    int i;
    for (i = 0; i &lt; 100; i++)
        foo(i);
}

Asm code with -Os optimization (loop starts with 0):
====================================================
00065e78 [foo]:
   65e78:       e1a0f00e        mov     pc, lr
00065e7c [testing_loop]:
   65e7c:       e92d4010        stmdb   sp!, {r4, lr}
   65e80:       e3a04000        mov     r4, #0  ; 0x0
   65e84:       e1a00004        mov     r0, r4
   65e88:       e2844001        add     r4, r4, #1      ; 0x1
   65e8c:       ebfffff9        bl      65e78 [foo]
   65e90:       e3540063        cmp     r4, #99 ; 0x63
   65e94:       dafffffa        ble     65e84 [testing_loop+0x8]
   65e98:       e8bd8010        ldmia   sp!, {r4, pc}

Asm code with -O0 optimization (loop starts with 0):
====================================================
0009930c [foo]:
   9930c:       e1a0c00d        mov     r12, sp
   99310:       e92dd800        stmdb   sp!, {r11, r12, lr, pc}
   99314:       e24cb004        sub     r11, r12, #4    ; 0x4
   99318:       e24dd004        sub     sp, sp, #4      ; 0x4
   9931c:       e50b0010        str     r0, [r11, -#16]
   99320:       e91ba800        ldmdb   r11, {r11, sp, pc}
00099324 [testing_loop]:
   99324:       e1a0c00d        mov     r12, sp
   99328:       e92dd800        stmdb   sp!, {r11, r12, lr, pc}
   9932c:       e24cb004        sub     r11, r12, #4    ; 0x4
   99330:       e24dd004        sub     sp, sp, #4      ; 0x4
   99334:       e1a00000        nop                     (mov r0,r0)
   99338:       e3a03000        mov     r3, #0  ; 0x0
   9933c:       e50b3010        str     r3, [r11, -#16]
   99340:       e51b3010        ldr     r3, [r11, -#16]
   99344:       e3530063        cmp     r3, #99 ; 0x63
   99348:       da000000        ble     99350 [testing_loop+0x2c]
   9934c:       ea000005        b       99368 [testing_loop+0x44]
   99350:       e51b0010        ldr     r0, [r11, -#16]
   99354:       ebffffec        bl      9930c [foo]
   99358:       e51b3010        ldr     r3, [r11, -#16]
   9935c:       e2832001        add     r2, r3, #1      ; 0x1
   99360:       e50b2010        str     r2, [r11, -#16]
   99364:       eafffff5        b       99340 [testing_loop+0x1c]
   99368:       e91ba800        ldmdb   r11, {r11, sp, pc}
</pre>
<p>Conclusion:<br />
***********</p>
<p>When a simple counting loop is used and &#39;i&#39; is not referenced except for the loop count, it is the compiler that should optimize the loop to count down. I truly believe that every decent compiler does so.</p>
<p>When &#39;i&#39; is used for something other than the loop count, the counting in assembly starts with zero anyway, no matter what optimization is used.</p>
<p>So, the programmer (even the most Real-Time programmer) should NOT engage himself in assembly implementation details, because it is counterproductive and NOT portable over different CPU architectures.</p>
<p>The only constructive lesson to be learnt through this experience is USE A DECENT COMPILER!</p>
<p>10x.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
