<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Efficient C Tips #7 &#8211; Fast loops</title>
	<atom:link href="http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/feed/" rel="self" type="application/rss+xml" />
	<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/</link>
	<description>Thoughts on embedded systems by Nigel Jones</description>
	<lastBuildDate>Sun, 06 May 2012 10:34:02 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Eric Miller</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-14425</link>
		<dc:creator>Eric Miller</dc:creator>
		<pubDate>Wed, 15 Feb 2012 11:11:16 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-14425</guid>
		<description>You intended for your last example to be incorrect, but it&#039;s incorrect for the wrong reason: it never initializes lpc.

A fully corrected version of the code would look something like this:

#define DIMOF(a) (sizeof(a)/sizeof((a)[0]))

uint8_t bar[10];
uint8_t lpc = DIMOF(bar);

do
{
    bar[--lpc] = 0;   /* pre-decrement so bar[0] is cleared too */
} while (lpc);</description>
		<content:encoded><![CDATA[<p>You intended for your last example to be incorrect, but it&#8217;s incorrect for the wrong reason: it never initializes lpc.</p>
<p>A fully corrected version of the code would look something like this:</p>
<p>#define DIMOF(a) (sizeof(a)/sizeof((a)[0]))</p>
<p>uint8_t bar[10];<br />
uint8_t lpc = DIMOF(bar);</p>
<p>do<br />
{<br />
    bar[--lpc] = 0;   /* pre-decrement so bar[0] is cleared too */<br />
} while (lpc);</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: C Programming Tips &#124; Mechanics, Electronics &#38; Computing</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-6338</link>
		<dc:creator>C Programming Tips &#124; Mechanics, Electronics &#38; Computing</dc:creator>
		<pubDate>Thu, 08 Sep 2011 02:50:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-6338</guid>
		<description>[...] Efficient C Tips #7 – Fast loops [...]</description>
		<content:encoded><![CDATA[<p>[...] Efficient C Tips #7 – Fast loops [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Krzysztof Wesołowski</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-6067</link>
		<dc:creator>Krzysztof Wesołowski</dc:creator>
		<pubDate>Mon, 29 Aug 2011 01:17:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-6067</guid>
		<description>It is actually important to consider &quot;under the hood&quot; things while writing code. However, while assembler is the first &quot;under the hood&quot; thing, the compiler is the second.

The real deal is to write code whose structure allows optimization. It is really simple for a compiler to change the counting direction when the only purpose of the loop is to execute code multiple times. What you mentioned is much more complex: addressing using the loop counter, etc.

I think the most important way to optimise C code is to fit the code to the target hardware, as in the bzero implementation described in another comment. That can provide much more gain than the simple tweaks every decent compiler can do.</description>
		<content:encoded><![CDATA[<p>It is actually important to consider &#8220;under the hood&#8221; things while writing code. However, while assembler is the first &#8220;under the hood&#8221; thing, the compiler is the second.</p>
<p>The real deal is to write code whose structure allows optimization. It is really simple for a compiler to change the counting direction when the only purpose of the loop is to execute code multiple times. What you mentioned is much more complex: addressing using the loop counter, etc.</p>
<p>I think the most important way to optimise C code is to fit the code to the target hardware, as in the bzero implementation described in another comment. That can provide much more gain than the simple tweaks every decent compiler can do.</p>
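<p>A minimal sketch of the distinction drawn above, using two hypothetical functions (<code>repeat_step</code> and <code>fill_ramp</code> are illustrative names, not taken from the original post). In the first, the counter only says how many times to run the body, so a compiler may be free to count in whichever direction is cheaper. In the second, the counter is also the array index and the stored value, which ties the generated code more closely to its actual value.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counter is only a repeat count: nothing in the body reads i. */
uint32_t repeat_step(uint32_t x, uint8_t times)
{
    for (uint8_t i = 0; i < times; ++i)
    {
        x = x * 5u + 1u;   /* arbitrary work, independent of i */
    }
    return x;
}

/* Counter is used for addressing and as data: the body depends on i. */
void fill_ramp(uint8_t *buf, uint8_t n)
{
    assert(buf != NULL);
    for (uint8_t i = 0; i < n; ++i)
    {
        buf[i] = i;
    }
}
```

<p>Whether a given compiler actually reverses the first loop depends on the target and the optimization settings, so the generated code is the thing to check.</p>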
]]></content:encoded>
	</item>
	<item>
		<title>By: Doug</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-1309</link>
		<dc:creator>Doug</dc:creator>
		<pubDate>Tue, 27 Jul 2010 23:45:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-1309</guid>
		<description>Thanks for this post! I also started programming in assembly and always prefer to use a count-down loop when possible. Yes, there are places where it doesn&#039;t work but lots of times when it does, and it should be tried first. I&#039;ve been known to use this as an interview question to see how low-level a programmer thinks.</description>
		<content:encoded><![CDATA[<p>Thanks for this post! I also started programming in assembly and always prefer to use a count-down loop when possible. Yes, there are places where it doesn&#8217;t work but lots of times when it does, and it should be tried first. I&#8217;ve been known to use this as an interview question to see how low-level a programmer thinks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Greg Nelson</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-151</link>
		<dc:creator>Greg Nelson</dc:creator>
		<pubDate>Wed, 24 Feb 2010 20:00:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-151</guid>
		<description>Guess I&#039;m just an &quot;under the hood&quot; kind of guy. My first undergrad CS class had us compile something with -S and look at what came out; I guess I never stopped.

I&#039;m a huge fan of loop unrolling. There are also times when rewriting memcpy() is exactly what you want to do. One current application needs to erase 1024 contiguous, word-aligned bytes MANY times, very fast. Creating a specialized &#039;bzero&#039; that (1) assumes word alignment, (2) takes a length in words, not bytes, and (3) gives up in disgust if the length isn&#039;t a multiple of the unrolling, allowed this to go from a 77us operation to an under 5us operation.

Which, to paraphrase what Nigel might say, lets the CPU save 94% of the family joules.</description>
		<content:encoded><![CDATA[<p>Guess I&#39;m just an &quot;under the hood&quot; kind of guy. My first undergrad CS class had us compile something with -S and look at what came out; I guess I never stopped.</p>
<p>I&#39;m a huge fan of loop unrolling. There are also times when rewriting memcpy() is exactly what you want to do. One current application needs to erase 1024 contiguous, word-aligned bytes MANY times, very fast. Creating a specialized &#39;bzero&#39; that (1) assumes word alignment, (2) takes a length in words, not bytes, and (3) gives up in disgust if the length isn&#39;t a multiple of the unrolling, allowed this to go from a 77us operation to an under 5us operation.</p>
<p>Which, to paraphrase what Nigel might say, lets the CPU save 94% of the family joules.</p>
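<p>For readers who want to see the shape of such a routine, here is a rough sketch under stated assumptions. It is not the code from the application described above. The name <code>zero_words</code>, the unroll factor of 8 and the use of <code>uint32_t</code> as the machine word are all assumptions made for illustration.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNROLL 8u   /* words cleared per pass (assumed value) */

/* Zero n_words 32-bit words starting at dst.
 * (1) dst is assumed to be word aligned,
 * (2) the length is given in words, not bytes,
 * (3) returns false and writes nothing if n_words is not a multiple
 *     of UNROLL. */
bool zero_words(uint32_t *dst, size_t n_words)
{
    if (n_words % UNROLL != 0)
    {
        return false;
    }
    assert(((uintptr_t)dst % sizeof(uint32_t)) == 0);

    /* Count passes down to zero; each pass clears UNROLL words. */
    for (size_t passes = n_words / UNROLL; passes != 0; --passes)
    {
        dst[0] = 0; dst[1] = 0; dst[2] = 0; dst[3] = 0;
        dst[4] = 0; dst[5] = 0; dst[6] = 0; dst[7] = 0;
        dst += UNROLL;
    }
    return true;
}
```

<p>With these assumptions, the 1024-byte buffer mentioned above would be cleared with <code>zero_words(buf, 256)</code>. Any speed-up depends on the target and compiler, so it needs to be measured.</p>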
]]></content:encoded>
	</item>
	<item>
		<title>By: Nigel Jones</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-150</link>
		<dc:creator>Nigel Jones</dc:creator>
		<pubDate>Thu, 28 Jan 2010 12:17:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-150</guid>
		<description>Ashleigh: I completely agree about looking at the generated code. This is a technique I use all the time. Alas, to really make use of it you have to understand the instruction set of the target - and this is becoming increasingly rare. Indeed, companies are now making a virtue of not having to understand the instruction set or resort to assembly language with their CPU - see Luminary Cortex for an example of this. While I see the appeal of this, divorcing oneself from what&#039;s going on under the hood is not a great idea IMHO.</description>
		<content:encoded><![CDATA[<p>Ashleigh: I completely agree about looking at the generated code. This is a technique I use all the time. Alas, to really make use of it you have to understand the instruction set of the target &#8211; and this is becoming increasingly rare. Indeed, companies are now making a virtue of not having to understand the instruction set or resort to assembly language with their CPU &#8211; see Luminary Cortex for an example of this. While I see the appeal of this, divorcing oneself from what&#39;s going on under the hood is not a great idea IMHO.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ashleigh</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-149</link>
		<dc:creator>ashleigh</dc:creator>
		<pubDate>Tue, 26 Jan 2010 01:31:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-149</guid>
		<description>Another comment about countdown loops.

I have used LOTS of timers, and these are really easy when you have a regular tick going off - in a poor man&#039;s round-robin scheduler. A timer is just a byte used to count.

Counting down is easy - you test for 0, and when it is 0 the counter has expired - you go do stuff. And it&#039;s also easy then to set the timer to a special value (UNUSED), which is 255 (0xFF) for byte-sized counters. Then you do a test - if the timer is UNUSED, ignore it. Otherwise, decrement and when 0, do stuff.

The test against 0xFF is not always very efficient. The test against 0 (just after a decrement) is, though.</description>
		<content:encoded><![CDATA[<p>Another comment about countdown loops.</p>
<p>I have used LOTS of timers, and these are really easy when you have a regular tick going off &#8211; in a poor man&#39;s round-robin scheduler. A timer is just a byte used to count.</p>
<p>Counting down is easy &#8211; you test for 0, and when it is 0 the counter has expired &#8211; you go do stuff. And it&#39;s also easy then to set the timer to a special value (UNUSED), which is 255 (0xFF) for byte-sized counters. Then you do a test &#8211; if the timer is UNUSED, ignore it. Otherwise, decrement and when 0, do stuff.</p>
<p>The test against 0xFF is not always very efficient. The test against 0 (just after a decrement) is, though.</p>
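<p>The scheme above can be sketched roughly as follows. The names <code>TIMER_UNUSED</code> and <code>timer_tick</code> are hypothetical, and parking the timer at UNUSED once it expires is one possible reading of the description rather than a confirmed detail. Start values are assumed to lie in the range 1 to 254.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TIMER_UNUSED 0xFFu   /* sentinel: this timer is not running */

/* Call once per scheduler tick for each timer byte.
 * Returns true only on the tick at which the timer reaches 0. */
bool timer_tick(uint8_t *timer)
{
    assert(timer != NULL);

    if (*timer == TIMER_UNUSED)
    {
        return false;            /* idle timer: ignore it */
    }
    if (--(*timer) != 0)
    {
        return false;            /* still counting down */
    }
    *timer = TIMER_UNUSED;       /* expired: park it until restarted */
    return true;                 /* caller goes and does stuff */
}
```

<p>A timer is started simply by writing a tick count into the byte, for example <code>timer = 50;</code> for 50 ticks.</p>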
]]></content:encoded>
	</item>
	<item>
		<title>By: ashleigh</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-148</link>
		<dc:creator>ashleigh</dc:creator>
		<pubDate>Tue, 26 Jan 2010 01:27:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-148</guid>
		<description>In the case of embedded systems programming (and frequently but not always for big systems), there is nothing to be lost and much to be gained by looking at the generated code.

And then THINKING.

And then trying a few ideas out.

If portability is imperative, then this approach may not be a good move. If portability is not, then it can save your bacon.

I&#039;ve managed to do 9 hand optimisations on some embedded code that HAD TO FIT in 48K.

Each time, I had about 100 bytes of spare space. After each hand optimisation (looking at the generated code and changing the C source), I was able to get the same function and get 1K to 2K of free space to add new features into. Each time I thought &quot;that&#039;s it, nothing more can be squeezed from this&quot;, and each time, I got more. That means the original approx 48K of code was able to be shrunk to about 35K to 38K, just by looking at the generated code and having a think and a fiddle around. [And I should add - this was with space optimisation at the highest setting on the compiler.]

I agree that the CS mantra is &quot;don&#039;t look, let the compiler work, if it does not fit get a bigger machine.&quot;

When you are 2 weeks from product shipment and your code is 200 bytes TOO BIG to fit on your embedded micro, your boss will not thank you for parroting the CS lecturer&#039;s attitude. You have to fix it, and you have to fix it now, and putting in a bigger processor will slip delivery 6 months. So you look at the generated code, and you go fixing.

When you find a construct like loops, or if-else statements, that can save 2 or 3 or 5 bytes each time (and you have a couple of hundred of them) it might take a week&#039;s editing, but all those little tiny savings add up to saving the schedule, and saving your job!</description>
		<content:encoded><![CDATA[<p>In the case of embedded systems programming (and frequently but not always for big systems), there is nothing to be lost and much to be gained by looking at the generated code.</p>
<p>And then THINKING.</p>
<p>And then trying a few ideas out.</p>
<p>If portability is imperative, then this approach may not be a good move. If portability is not, then it can save your bacon.</p>
<p>I&#39;ve managed to do 9 hand optimisations on some embedded code that HAD TO FIT in 48K.</p>
<p>Each time, I had about 100 bytes of spare space. After each hand optimisation (looking at the generated code and changing the C source), I was able to get the same function and get 1K to 2K of free space to add new features into. Each time I thought &quot;that&#39;s it, nothing more can be squeezed from this&quot;, and each time, I got more. That means the original approx 48K of code was able to be shrunk to about 35K to 38K, just by looking at the generated code and having a think and a fiddle around. [And I should add - this was with space optimisation at the highest setting on the compiler.]</p>
<p>I agree that the CS mantra is &quot;don&#39;t look, let the compiler work, if it does not fit get a bigger machine.&quot;</p>
<p>When you are 2 weeks from product shipment and your code is 200 bytes TOO BIG to fit on your embedded micro, your boss will not thank you for parroting the CS lecturer&#39;s attitude. You have to fix it, and you have to fix it now, and putting in a bigger processor will slip delivery 6 months. So you look at the generated code, and you go fixing.</p>
<p>When you find a construct like loops, or if-else statements, that can save 2 or 3 or 5 bytes each time (and you have a couple of hundred of them) it might take a week&#39;s editing, but all those little tiny savings add up to saving the schedule, and saving your job!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tom Evans</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-147</link>
		<dc:creator>Tom Evans</dc:creator>
		<pubDate>Sat, 10 Oct 2009 06:55:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-147</guid>
		<description>&gt; I&#039;d be interested to hear from
&gt; other readers who have recoded
&gt; a loop based on this posting

Yes, but at least 18 years ago on a 68000, where &quot;int&quot;, &quot;short&quot; and &quot;unsigned short&quot; gave quite different results.

But &lt;b&gt;NOTHING&lt;/b&gt; beats unrolling the loop.

Unless the innards of the loop are duplicating a library function like memset() or strchr(), and the ones in the library are able to use tricky CPU instructions.

Especially the cache management tricks in a good memcpy() on CPUs that have a data cache.

Over the lifetime of one product based on a 40MHz PPC I improved &quot;memcpy()&quot; from 1.6MB/s to 4MB/s (fixing the CPU initialisation, turning caches on), then 8MB/s (copying 32 bits at a time instead of the library&#039;s byte-at-a-time), then 17MB/s and finally 27MB/s with proper cache handling. The &quot;memcpy&quot; ended up as 146 lines of C and 120 lines of assembler, with more special cases than you&#039;d believe.</description>
		<content:encoded><![CDATA[<p>&gt; I&#39;d be interested to hear from<br />
&gt; other readers who have recoded<br />
&gt; a loop based on this posting</p>
<p>Yes, but at least 18 years ago on a 68000, where &quot;int&quot;, &quot;short&quot; and &quot;unsigned short&quot; gave quite different results.</p>
<p>But <b>NOTHING</b> beats unrolling the loop.</p>
<p>Unless the innards of the loop are duplicating a library function like memset() or strchr(), and the ones in the library are able to use tricky CPU instructions.</p>
<p>Especially the cache management tricks in a good memcpy() on CPUs that have a data cache.</p>
<p>Over the lifetime of one product based on a 40MHz PPC I improved &quot;memcpy()&quot; from 1.6MB/s to 4MB/s (fixing the CPU initialisation, turning caches on), then 8MB/s (copying 32 bits at a time instead of the library&#39;s byte-at-a-time), then 17MB/s and finally 27MB/s with proper cache handling. The &quot;memcpy&quot; ended up as 146 lines of C and 120 lines of assembler, with more special cases than you&#39;d believe.</p>
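<p>As a rough illustration of just the &quot;copy 32 bits at a time&quot; step combined with a little unrolling, here is a sketch under stated assumptions. It is not the memcpy described above and makes no attempt at cache handling. The name <code>copy_words</code> and the unroll factor of 4 are assumptions, and the buffers are assumed to be word aligned and not to overlap.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n_words 32-bit words from src to dst, four words per pass. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t n_words)
{
    assert(dst != NULL && src != NULL);

    /* Main loop: unrolled by four, counting passes down to zero. */
    for (size_t blocks = n_words / 4u; blocks != 0; --blocks)
    {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
    }

    /* Tail: the 0 to 3 words left over after the unrolled passes. */
    for (size_t tail = n_words % 4u; tail != 0; --tail)
    {
        *dst++ = *src++;
    }
}
```

<p>A production memcpy also has to cope with unaligned and byte-granular lengths, which is presumably where many of the special cases mentioned above come from.</p>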
]]></content:encoded>
	</item>
	<item>
		<title>By: Nigel Jones</title>
		<link>http://embeddedgurus.com/stack-overflow/2009/03/efficient-c-tips-7-fast-loops/comment-page-1/#comment-146</link>
		<dc:creator>Nigel Jones</dc:creator>
		<pubDate>Sun, 15 Mar 2009 17:55:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2009/03/05/efficient-c-tips-7-fast-loops/#comment-146</guid>
		<description>While I agree with you that one should use a decent compiler, I&#039;m not sure I agree with much else, Konstantin. (I&#039;ll leave for another day whether GCC is a decent compiler.) While just about all compilers will recognize that it&#039;s more efficient to count down in the trivial cases that we have used here, I know it is not the case when the body of the loop starts getting large. Now your assertion that programmers should not concern themselves with implementation details is IMHO wrong in several ways.

1. In the particular circumstances of this example, I know of no instances where counting down produces worse code than counting up. The converse is not true. Thus my approach is arguably more portable across different architectures.

2. The argument that a programmer need not concern himself with the underlying architecture of the CPU is one that is regularly advanced by those with a CS background. Although I admire the purity of the argument, it simply doesn&#039;t stand up to inspection in real time embedded systems. Why? Well, the performance of embedded systems is judged by multiple criteria. For example, in hard real time systems the correct answer delivered too late is useless. Similarly, in portable systems the correct answer delivered using too many Joules is also useless. In fact this whole concept of whether one should code to the target platform is a fascinating topic in its own right. I&#039;ll endeavor to address this in a future posting.

In the interim, thanks for joining the debate. I&#039;d be interested to hear from other readers who have recoded a loop based on this posting - and what they found out. If you do post your results, please include the target processor, compiler and optimization settings. Thanks!</description>
		<content:encoded><![CDATA[<p>While I agree with you that one should use a decent compiler, I&#8217;m not sure I agree with much else, Konstantin. (I&#8217;ll leave for another day whether GCC is a decent compiler.) While just about all compilers will recognize that it&#8217;s more efficient to count down in the trivial cases that we have used here, I know it is not the case when the body of the loop starts getting large. Now your assertion that programmers should not concern themselves with implementation details is IMHO wrong in several ways.</p>
<p>1. In the particular circumstances of this example, I know of no instances where counting down produces worse code than counting up. The converse is not true. Thus my approach is arguably more portable across different architectures.</p>
<p>2. The argument that a programmer need not concern himself with the underlying architecture of the CPU is one that is regularly advanced by those with a CS background. Although I admire the purity of the argument, it simply doesn&#8217;t stand up to inspection in real time embedded systems. Why? Well, the performance of embedded systems is judged by multiple criteria. For example, in hard real time systems the correct answer delivered too late is useless. Similarly, in portable systems the correct answer delivered using too many Joules is also useless. In fact this whole concept of whether one should code to the target platform is a fascinating topic in its own right. I&#8217;ll endeavor to address this in a future posting.</p>
<p>In the interim, thanks for joining the debate. I&#8217;d be interested to hear from other readers who have recoded a loop based on this posting &#8211; and what they found out. If you do post your results, please include the target processor, compiler and optimization settings. Thanks!</p>
]]></content:encoded>
	</item>
</channel>
</rss>

