<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Barr Code</title>
	<atom:link href="http://embeddedgurus.com/barr-code/feed/" rel="self" type="application/rss+xml" />
	<link>http://embeddedgurus.com/barr-code</link>
	<description>A Blog by Michael Barr</description>
	<lastBuildDate>Wed, 18 Aug 2010 18:38:24 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>3 Things Every Programmer Should Know About RMA</title>
		<link>http://embeddedgurus.com/barr-code/2010/08/3-things-every-programmer-should-know-about-rma/</link>
		<comments>http://embeddedgurus.com/barr-code/2010/08/3-things-every-programmer-should-know-about-rma/#comments</comments>
		<pubDate>Wed, 18 Aug 2010 18:38:24 +0000</pubDate>
		<dc:creator>Michael Barr</dc:creator>
				<category><![CDATA[Firmware Bugs]]></category>
		<category><![CDATA[RTOS Multithreading]]></category>

		<guid isPermaLink="false">http://embeddedgurus.com/barr-code/?p=379</guid>
		<description><![CDATA[
This post was originally posted in the wrong blog.  I&#8217;m reposting it here.
Real-time systems design and RMA go together like peanut butter and jelly.  So why is it that wherever I go in the embedded community, engineers are developing real-time systems without applying RMA?  This is a dangerous situation, but one that is easily remedied [...]]]></description>
			<content:encoded><![CDATA[<div style="text-align: left">
<p>This post was originally posted in the wrong blog.  I&#8217;m reposting it here.</p>
<p><em>Real-time systems design and RMA go together like peanut butter and jelly.  So why is it that wherever I go in the embedded community, engineers are developing real-time systems without applying RMA?  This is a dangerous situation, but one that is easily remedied by ensuring every programmer knows three things about RMA.</em></p>
<p>In case you are entirely unfamiliar with RMA, there&#8217;s a handy primer on the technique at <a href="http://www.netrino.com/Embedded-Systems/How-To/RMA-Rate-Monotonic-Algorithm/" target="_blank">http://www.netrino.com/Embedded-Systems/How-To/RMA-Rate-Monotonic-Algorithm/</a>. I’ve tried to write this blog post so that you can read the primer before or after it, at your option.</p>
<p><strong>#1: RMA is Not Just for Academics</strong></p>
<p>You have probably heard of RMA.  Maybe you can even expand the acronym.   Maybe you also know that the theoretical underpinnings of RMA were developed largely at <a href="http://www.sei.cmu.edu">Carnegie Mellon University’s Software Engineering Institute</a> and/or that the technique has been known for about three decades.</p>
<p>If, however, you are like the vast majority of the thousands of firmware engineers I have communicated with on the subject during my years as a writer/editor, consultant, and trainer, you probably think RMA is just for academics.  I also thought that way years ago—but here’s the straight dope:</p>
<ul>
<li>All of the popular commercial real-time operating systems (e.g., <a href="http://www.windriver.com/products/vxworks/">VxWorks</a>, <a href="http://www.rtos.com/page/product.php?id=2">ThreadX</a>, and <a href="http://micrium.com/page/products/rtos">MicroC/OS</a>) are built upon fixed-priority preemptive schedulers.<a href="#_ftn1">[1]</a></li>
<li>RMA is the optimal method of assigning fixed priorities to RTOS tasks.  That is to say that if a set of tasks cannot be scheduled using RMA, it can’t be scheduled using any fixed-priority algorithm.</li>
<li>RMA provides convenient rules of thumb regarding the percentage of CPU you can safely use while still meeting all deadlines.  If you don’t use RMA to assign priorities to your tasks, there is no rule of thumb that will ensure all of their deadlines will be met.<a href="#_ftn2">[2]</a></li>
</ul>
<p>A key feature of RMA is the ability to prove a priori that a given set of tasks will always meet its deadlines—even during periods of transient overload.  Dynamic-priority operating systems cannot make this guarantee.  Nor can fixed-priority RTOSes running tasks prioritized in other ways.</p>
<p>Too many of today&#8217;s real-time systems built with an RTOS are working by luck. Excess processing power may be masking design and analysis sins or the worst-case just hasn’t happened—yet.</p>
<p><em>Bottom line</em>: You’re playing with fire if you don’t use RMA to assign priorities to critical tasks; it might be just a matter of time before your product’s users get burned.<a href="#_ftn3">[3]</a></p>
<p><strong>#2: RMA Need Not Be Applied to Every Task</strong></p>
<p>As any programmer that’s already put RMA into practice will tell you, the hardest part of the analysis phase is establishing an upper bound for the worst-case execution time of each task. The CPU utilization of each task is computed as the ratio of its worst-case execution time to its worst-case period. <a href="#_ftn4">[4]</a></p>
<p>There are three ways to place an upper bound on execution time: (1) by measuring the actual execution time during the tested worst-case scenario; (2) by performing a top-down analysis of the code in combination with a cycle-counter; or (3) by making an educated guess based on big-O notation.  I call these alternatives measuring, analyzing, and budgeting, respectively, and note that the decision of which to use involves tradeoffs of precision vs. level of effort. Measurement can be extremely precise, but requires the ability to instrument and test the actual working code—which must be remeasured after every code change.  Budgeting is easiest and can be performed even at the beginning of the project, but it is necessarily imprecise (in the conservative direction of requiring more CPU bandwidth than is actually required).</p>
<p>But there is at least some good news about the analysis.  RMA need not be performed across the entire set of tasks in the system.  It is possible to define a smaller (often much smaller, in my experience) critical set of tasks on which RMA needs to be performed, with the remaining non-critical tasks simply assigned lower priorities.</p>
<p>This critical set should contain every task with a deadline that must never be missed.  In addition, it should contain any other task with which those tasks share a mutex or from which they require timely semaphore or message queue posts.  Every other task is considered non-critical.</p>
<p>RMA can be meaningfully applied to the critical set alone, so long as we ensure that all of the non-critical tasks have priorities below the entire critical set.  We then need only determine worst-case periods and worst-case execution times for the critical set.  Furthermore, we need only follow the rate monotonic algorithm for assignment of priorities within the critical set.</p>
<p><em>Bottom line</em>: Anything goes at lower priorities where there are no deadlines.</p>
<p><strong>#3: RMA Applies to Interrupt Service Routines Too</strong></p>
<p>With few exceptions, books, articles, and papers that mention RMA describe it as a technique for prioritizing the tasks on a preemptive fixed-priority operating system.  But the technique is also essential for correctly prioritizing interrupt handlers.</p>
<p>Indeed, even if you have designed a real-time system that consists only of interrupt service routines (plus a do-nothing background loop in main), you should use the rate monotonic algorithm to prioritize them with respect to their worst-case frequency of occurrence.  Then you can use rate monotonic analysis to prove that they will all meet their real-time deadlines even during transient overload.</p>
<p>Furthermore, if you have a set of critical tasks in addition to interrupt service routines, the prioritization and analysis associated with RMA need to be performed across the entire set of those entities.<a href="#_ftn5">[5]</a> This can be complicated, as there is an arbitrary “priority boundary” imposed by the CPU hardware: even the lowest priority ISR is deemed more important than the highest priority task.</p>
<p>For example, consider the conflict in the set of ISRs and tasks in Table 1.  RMA dictates that the priority of Task A should be higher than the priority of the ISR, because Task A can occur more frequently.  But the hardware demands otherwise, by limiting our ability to move ISRs down in priority.  If we leave things as they are, we cannot simply sum the CPU utilization of this set of entities to see if they are below the schedulable bound for four entities.</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="89" valign="top">Runnable Entity</td>
<td width="83" valign="top">Priority by RMA</td>
<td width="99" valign="top">Worst-Case Execution Time</td>
<td width="90" valign="top">Worst-Case Period</td>
<td width="82" valign="top">CPU Utilization</td>
</tr>
<tr>
<td width="89" valign="top">ISR</td>
<td width="83" valign="top">2</td>
<td width="99" valign="top">500 us</td>
<td width="90" valign="top">10 ms</td>
<td width="82" valign="top">5%</td>
</tr>
<tr>
<td width="89" valign="top">Task A</td>
<td width="83" valign="top">1</td>
<td width="99" valign="top">750 us</td>
<td width="90" valign="top">3 ms</td>
<td width="82" valign="top">25%</td>
</tr>
<tr>
<td width="89" valign="top">Task C</td>
<td width="83" valign="top">3</td>
<td width="99" valign="top">300 us</td>
<td width="90" valign="top">30 ms</td>
<td width="82" valign="top">1%</td>
</tr>
<tr>
<td width="89" valign="top">Task B</td>
<td width="83" valign="top">4</td>
<td width="99" valign="top">8 ms</td>
<td width="90" valign="top">40 ms</td>
<td width="82" valign="top">20%</td>
</tr>
</tbody>
</table>
<p style="text-align: center"><em>Table 1.  A Misprioritized Interrupt Handler</em></p>
<p>So what should we do in a conflicted scenario like this?  There are two options.  We can change the program’s structure by moving the ISR code into a polling loop that operates as a 10 ms task at priority 2, in which case total utilization is 51%.  Or we can treat the ISR, for purposes of proof via rate monotonic analysis anyway, as though it actually has a worst-case period of 3 ms.  In the latter option, the ISR has an appropriate top priority by RMA but the CPU bandwidth dedicated to the ISR increases from 5% to 16.7%, bringing the new total up to 62.7%.  Either way, the full set is provably schedulable.<a href="#_ftn6">[6]</a></p>
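<p>The arithmetic behind both options can be checked with a few lines of C.  The sketch below is illustrative only: the names <code>entity_t</code>, <code>rma_bound()</code>, <code>total_utilization()</code>, and <code>rma_schedulable()</code> are hypothetical, and the test it applies is the classic utilization bound n(2^(1/n) - 1), which is sufficient but not necessary (a set that exceeds the bound may still be schedulable).  The n-th root is found by bisection only to avoid assuming a math library is available.</p>

```c
#include <stddef.h>

/* One runnable entity (task or ISR); all times in microseconds. */
typedef struct {
    const char *name;
    double wcet_us;    /* worst-case execution time    */
    double period_us;  /* worst-case (shortest) period */
} entity_t;

/* Option 1: the ISR code is polled by a 10 ms task at priority 2. */
const entity_t table1_polled[] = {
    { "Task A",        750.0,  3000.0 },
    { "Polling task",  500.0, 10000.0 },
    { "Task C",        300.0, 30000.0 },
    { "Task B",       8000.0, 40000.0 },
};

/* Option 2: the ISR is analyzed as if its worst-case period were 3 ms. */
const entity_t table1_isr_3ms[] = {
    { "ISR",           500.0,  3000.0 },
    { "Task A",        750.0,  3000.0 },
    { "Task C",        300.0, 30000.0 },
    { "Task B",       8000.0, 40000.0 },
};

/* x raised to a small non-negative integer power. */
static double ipow(double x, unsigned n)
{
    double r = 1.0;
    while (n-- > 0u) {
        r *= x;
    }
    return r;
}

/* Utilization bound for n entities: n * (2^(1/n) - 1).
 * The n-th root of 2 lies in [1, 2] and is located by bisection. */
double rma_bound(unsigned n)
{
    double lo = 1.0;
    double hi = 2.0;

    if (n == 0u) {
        return 0.0;
    }
    for (int i = 0; i < 60; i++) {
        double mid = 0.5 * (lo + hi);
        if (ipow(mid, n) > 2.0) {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    return (double)n * (lo - 1.0);
}

/* Sum of (execution time / period) over the whole set. */
double total_utilization(const entity_t *set, size_t n)
{
    double u = 0.0;
    for (size_t i = 0; i < n; i++) {
        u += set[i].wcet_us / set[i].period_us;
    }
    return u;
}

/* Returns 1 if total utilization is at or below the bound, else 0. */
int rma_schedulable(const entity_t *set, size_t n)
{
    return total_utilization(set, n) <= rma_bound((unsigned)n);
}
```

<p>For four entities the bound works out to roughly 75.7%, so both the 51% total of the polling option and the 62.7% total of the 3 ms option pass the test.</p>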
<p><em>Bottom line</em>: Interrupt handlers must be considered part of the critical set, with RMA used to prioritize them in relation to the tasks they might steal the CPU away from.</p>
<p><strong>Conclusion</strong></p>
<p>Every programmer should know three key things about RMA.  First, RMA is a technique that should be used to analyze any preemptive system with deadlines; it is not just for academics after all.  Second, the amount of effort involved in RMA analysis can be reduced by ignoring tasks outside the critical set; non-critical tasks can be assigned an arbitrary pattern of lower priorities and need not be analyzed.  Finally, if interrupts can preempt critical set tasks or even just each other, RMA should be used to analyze those too.</p>
<hr size="1" /><p><a href="#_ftnref">[1]</a> Schedulers that tweak task priorities dynamically, as desktop flavors of Windows and Linux do, may miss deadlines indiscriminately during transient overload periods.   They should thus not be used in the design of safety-critical real-time systems.</p>
<p><a href="#_ftnref">[2]</a> For example, it is widely rumored that a system less than 50% loaded will always meet its deadlines.  Unfortunately, there is no such rule of thumb that’s correct.  By contrast, when you do use RMA there is a simple rule of thumb ranging from a high of 82.8% for 2 tasks to a low of 69.3% as the number of tasks grows large.</p>
<p><a href="#_ftnref">[3]</a> Perhaps your failure to use RMA to prioritize tasks and prove they’ll meet deadlines explains one or more of those “glitches” your customers have been complaining about?</p>
<p><a href="#_ftnref">[4]</a> Establishing the worst-case period of a task is both easier and more stable.</p>
<p><a href="#_ftnref">[5]</a> Note this is necessary even if one or more of the interrupts doesn’t have a real-time deadline of its own.  That’s because the interrupts may occur during the transient overload and thus prevent one or more critical set tasks from meeting its real-time deadline.</p>
<p><a href="#_ftnref">[6]</a> However, switching the code to use polling actually consumes cycles that are only reserved for the worst-case in the other solution.  That could mean failing to find CPU time for low priority non-critical tasks in the average case.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://embeddedgurus.com/barr-code/2010/08/3-things-every-programmer-should-know-about-rma/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Design for the Worst Case</title>
		<link>http://embeddedgurus.com/barr-code/2010/08/design-for-the-worst-case/</link>
		<comments>http://embeddedgurus.com/barr-code/2010/08/design-for-the-worst-case/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 13:02:00 +0000</pubDate>
		<dc:creator>Michael Barr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://embeddedgurus.com/barr-code/?p=376</guid>
		<description><![CDATA[In real-time systems, as in life, anything that can go wrong will!  A nurse could be using a GUI task to change system parameters on a ventilator just as the attached patient’s lungs demand the most help from another task.  Or an interrupt signal could start acting funny, generating a stream of unexpected [...]]]></description>
			<content:encoded><![CDATA[<p>In real-time systems, as in life, anything that can go wrong will!  A nurse could be using a GUI task to change system parameters on a ventilator just as the attached patient’s lungs demand the most help from another task.  Or an interrupt signal could start acting funny, generating a stream of unexpected ISR invocations.  Or all of those at once.  And something else.</p>
<p>The designers of hard real-time systems must design for such a worst-case.  They must ensure that sufficient CPU and memory bandwidth are present to handle the worst-case demands that could be placed on the software—simultaneously.  In simple terms, we must size the processor bandwidth to the worst-case scenario.</p>
<p>Safety for the users of our products emerges as a side effect of buying a faster (read &#8220;higher priced&#8221;) CPU.  Rate Monotonic Analysis helps ensure we’ve specified the right processor clock rate, so the users are safe.  RMA is also the optimal fixed-priority scheduling algorithm, which prevents us from over-paying for clock rate.  If a set of tasks cannot be scheduled using RMA, it can’t be scheduled using any fixed-priority algorithm.</p>
<p>The basics of RMA are well covered in many places, including my article <a href="http://www.netrino.com/Embedded-Systems/How-To/RMA-Rate-Monotonic-Algorithm" target="_blank">Introduction to Rate Monotonic Scheduling</a>. In summary, Rate Monotonic Analysis gives us mathematics to prove all deadlines are always met when you’ve followed the Rate Monotonic Algorithm to assign priorities.</p>
<p>The Rate Monotonic Algorithm is a procedure for assigning fixed priorities to tasks and ISRs to maximize their schedulability.  A particular set of tasks and ISRs is considered schedulable if all deadlines will be met even in the worst-case scenario.  The algorithm is simple:  “Assign the priority of each task and ISR according to its worst-case period, so that the shorter the period the higher the priority.”  For example, if Task 1 and Task 2 have periods of 50 ms and 100 ms, respectively, then Task 1 is given higher priority.  This ensures that a long Task 2 job can’t cause Task 1 to miss its more frequent deadline.</p>
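<p>As a rough illustration, the assignment rule amounts to sorting by period.  The sketch below is hypothetical: <code>rm_task_t</code> and <code>rm_assign()</code> are made-up names, and it assumes the convention that priority 1 is the highest, which differs between RTOSes.  Entities with equal periods end up in an arbitrary relative order.</p>

```c
#include <stddef.h>
#include <stdlib.h>

/* A task or ISR to be prioritized (hypothetical descriptor). */
typedef struct {
    const char   *name;
    unsigned long period_us;  /* worst-case (shortest) period            */
    unsigned      priority;   /* 1 = highest; filled in by rm_assign()   */
} rm_task_t;

/* qsort() comparator: shorter period sorts first. */
static int by_period(const void *a, const void *b)
{
    const rm_task_t *ta = a;
    const rm_task_t *tb = b;
    return (ta->period_us > tb->period_us) - (ta->period_us < tb->period_us);
}

/* Sort the set by period, then number the priorities from 1 upward. */
void rm_assign(rm_task_t *tasks, size_t n)
{
    qsort(tasks, n, sizeof tasks[0], by_period);
    for (size_t i = 0; i < n; i++) {
        tasks[i].priority = (unsigned)(i + 1u);
    }
}
```

<p>Given Task 2 (100 ms) and Task 1 (50 ms) in that order, <code>rm_assign()</code> moves Task 1 to the front and gives it priority 1.</p>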
<p>Too many of today&#8217;s real-time systems built with an RTOS are working by luck. Excess processing power may be masking design and analysis sins or the worst-case simply hasn’t happened—yet.  Bottom line: You’re playing with fire if you don’t use RMA to assign priorities to safety-critical tasks; it might be just a matter of time before your product’s users get burned.  Perhaps your failure to use RMA to prioritize tasks and prove they’ll meet deadlines explains one or more of those “glitches” your customers have been complaining about?</p>
]]></content:encoded>
			<wfw:commentRss>http://embeddedgurus.com/barr-code/2010/08/design-for-the-worst-case/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>How to Set the Size of your C Stack</title>
		<link>http://embeddedgurus.com/barr-code/2010/03/how-to-set-the-size-of-your-c-stack/</link>
		<comments>http://embeddedgurus.com/barr-code/2010/03/how-to-set-the-size-of-your-c-stack/#comments</comments>
		<pubDate>Wed, 24 Mar 2010 18:16:43 +0000</pubDate>
		<dc:creator>Michael Barr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://embeddedgurus.com/barr-code/?p=358</guid>
		<description><![CDATA[A reader of my monthly Firmware Update newsletter recently sent an e-mail to ask:

I am a firmware engineer.  I read your recent blog post regarding the C stack, about which I have two questions: First, how can I increment or decrement the size of the stack in my code?  Second, what size should [...]]]></description>
			<content:encoded><![CDATA[<p><em>A reader of my monthly <a href="http://www.firmwareupdate.net">Firmware Update newsletter</a> recently sent an e-mail to ask:</em></p>
<blockquote><p>
I am a firmware engineer.  I read your <a href="/barr-code/2010/03/firmware-specific-bug-4-stack-overflow/">recent blog post regarding the C stack</a>, about which I have two questions: First, how can I increment or decrement the size of the stack in my code?  Second, what size should I choose?
</p></blockquote>
<p><em>Here&#8217;s what I told him:</em></p>
<p>The size of the stack is set either in the linker command file or in the C or C++ <a href="http://www.netrino.com/Embedded-Systems/Glossary-S#startup_code">startup code</a>.  You should be able to learn more about how to change the stack size from your specific compiler vendor&#8217;s manual or customer support.</p>
<p>Identifying the minimum stack size required for your specific application is made challenging by these stubborn facts:</p>
<p>- MEASURING the maximum stack growth during testing may not be sufficient.  If you test for half a year, the product is sure to be run for a year or longer in the field.  Have you really tested all possible cases?  What about all possible series of interrupt service routines on top of that worst case use by main()?</p>
<p>- TOP DOWN ANALYSIS of the compiled code can be done to determine the number of function calls and interrupt service routines at maximum depth; their individual parameter and local variable use, etc.  Unfortunately, these things may keep changing whenever you change the code and recompile.</p>
<p>The best approach is usually to perform a conservative top down analysis of the source code; when in doubt, always round up.  Don&#8217;t forget about nested interrupt service routines.  Double that conservative estimate to set your initial stack budget.  Then measure actual stack utilization during testing, preferably with code coverage analysis tools running&#8211;to ensure that you&#8217;ve tested all possible paths (except interrupts, which may run at different times in the field).</p>
<p>Then if you need to reclaim memory to ship the product, start shrinking the stack.  But also put into place a high water mark system (e.g., 0xDEADBEEF) complete with supervisor code to put the product into a failsafe state if more than, for example, 80% of the stack is ever used.</p>
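<p>A high water mark check of this kind might look roughly like the following.  This is a minimal sketch under stated assumptions: the function names are hypothetical, the stack is assumed to grow downward (so untouched words sit at the low-address end of the region), and the region is passed in as a plain array.  In real firmware the base and size would come from the linker command file, and the painting is typically done in startup code before the stack is in use.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu  /* fill pattern written to unused stack */

/* Paint the entire stack region with the fill pattern (done once, early). */
void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_FILL;
    }
}

/* Percentage of the region that has ever been overwritten.
 * Assumes a descending stack: scan up from the lowest address
 * until the first word that no longer holds the fill pattern. */
unsigned stack_percent_used(const uint32_t *base, size_t words)
{
    size_t untouched = 0;

    if (words == 0u) {
        return 100u;
    }
    while (untouched < words && base[untouched] == STACK_FILL) {
        untouched++;
    }
    return (unsigned)(((words - untouched) * 100u) / words);
}

/* Returns 1 if usage has ever exceeded limit_percent (e.g., 80). */
int stack_over_limit(const uint32_t *base, size_t words, unsigned limit_percent)
{
    return stack_percent_used(base, words) > limit_percent;
}
```

<p>Supervisor code could call <code>stack_over_limit()</code> periodically and enter the failsafe state when it returns nonzero.  Note that a stack frame that happens to leave words unwritten, or to store the fill value itself, can make the measurement slightly optimistic.</p>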
]]></content:encoded>
			<wfw:commentRss>http://embeddedgurus.com/barr-code/2010/03/how-to-set-the-size-of-your-c-stack/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Toyota&#8217;s Embedded Software Image Problem</title>
		<link>http://embeddedgurus.com/barr-code/2010/03/toyotas-embedded-software-image-problem/</link>
		<comments>http://embeddedgurus.com/barr-code/2010/03/toyotas-embedded-software-image-problem/#comments</comments>
		<pubDate>Fri, 19 Mar 2010 21:02:17 +0000</pubDate>
		<dc:creator>Michael Barr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[embedded]]></category>
		<category><![CDATA[firmware]]></category>
		<category><![CDATA[safety]]></category>
		<category><![CDATA[trends]]></category>

		<guid isPermaLink="false">http://embeddedgurus.com/barr-code/?p=342</guid>
		<description><![CDATA[It remains unclear whether Toyota&#8217;s higher-than-industry-average number of complaints regarding sudden unintended acceleration (SUA) is caused (in whole or in part) by an embedded software problem.  But whether it is or it isn&#8217;t actually firmware, the company has clearly denied it and yet still developed an embedded software &#8220;image problem&#8221;.  They&#8217;ve brought some [...]]]></description>
			<content:encoded><![CDATA[<p>It remains unclear whether Toyota&#8217;s <a href="http://www.washingtonpost.com/wp-dyn/content/graphic/2010/03/10/GR2010031004046.html">higher-than-industry-average number of complaints</a> regarding <a href="http://en.wikipedia.org/wiki/Sudden_unintended_acceleration">sudden unintended acceleration</a> (SUA) is caused (in whole or in part) by an embedded software problem.  But whether it is or it isn&#8217;t actually firmware, the company has clearly denied it and yet still developed an embedded software &#8220;image problem&#8221;.  They&#8217;ve brought some of this on themselves.</p>
<p><em>Side Note</em>: I think it is a net positive that journalists, the mass media, and a broader swath of the general public are increasingly aware that there is software embedded inside cars, airplanes, medical devices, and just about everything else with a power supply or batteries.  Firmware has been inside these products for many years, of course.  But as I wrote in <a href="http://electronicdesign.com/article/embedded-software/faulty_code_will_lead_to_an_era_of_firmware_related_litigation.aspx">a recent article in Electronic Design</a>, my experience working with companies across many industries leads me to believe there is <a href="/barr-code/2010/02/embedded-software-is-the-future-of-product-quality-and-safety/">a looming firmware quality crisis</a>.  Greater public awareness is sure to bring <a href="http://embedded.com/columns/barrcode/221901488">litigation</a>.  This will force engineering management to care more about firmware quality than they currently do.</p>
<p><strong>Toyota&#8217;s Firmware Image Problem</strong></p>
<p>Long before the &#8220;floor-mat recall&#8221;, <a href="http://www.nhtsa.dot.gov">NHTSA</a> had logged a higher number of unintended acceleration complaints (4.51 complaints per 100,000 cars sold for the 2005 to 2010 model years) for Toyota than for any other company.  (A recent Washington Post <a href="http://www.washingtonpost.com/wp-dyn/content/graphic/2010/03/10/GR2010031004046.html">graphic</a> has more data.)  Apparently, NHTSA and Toyota were investigating the reports&#8211;but hadn&#8217;t yet taken any action.</p>
<p>It seems that what set that first Toyota recall in motion was a high-profile <a href="http://www.washingtonpost.com/wp-dyn/content/article/2010/01/28/AR2010012803971_pf.html">fatal August 2009 crash involving an off-duty California Highway Patrol officer</a>, his family, a runaway Lexus, and a <a href="http://www.entertonement.com/clips/fmnjpnzgmb--Chris-Lastrella-911-Call-Before-Crashing-911-Calls-Chris-Lastrella">disturbing 911 call</a>.  Given the context of that specific crash, I&#8217;m not convinced the floor mat recall made much sense.  In particular, I find it hard to believe that a police officer with adrenaline pumping through his veins and his family&#8217;s life on the line wouldn&#8217;t just rip a stuck floor mat out of the way like the Incredible Hulk. (Or that he would choose running off the road at 125 mph vs. shutting the vehicle off entirely.)  But I don&#8217;t have all the facts about either that specific accident or the reasoning behind the floor mat recall.</p>
<p>The broader recalls that have happened since have focused on also adding mechanical strength to the accelerator pedals in a number of different makes and models.  To this day, Toyota categorically denies any sort of electrical problem.  Yet some cars that have been modified in this way have since been reported to experience unintended acceleration!  Besides which, mechanical parts generally fail visibly or entirely once they first fail&#8211;rather than intermittently.  Intermittent failures are far more common with electronics (think EMI) and firmware.</p>
<p>Toyota&#8217;s firmware image problem stems from two things.  First, they have separately recalled the Prius for a braking-related firmware upgrade.  Other possible <a target="_blank" href="http://www.toyota.com/prius-hybrid/">Prius</a> software issues have been identified by <a target="_blank" href="http://www.youtube.com/watch?v=hc2_yLXy9O4">Steve Wozniak</a> and <a target="_blank" href="http://www.youtube.com/watch?v=Rr6dm0qFRTw">Jim Sikes</a>, but these have not yet been confirmed.  Second, the continued reliance (by Toyota and NHTSA) on theories such as &#8220;we can&#8217;t reproduce the problem and we haven&#8217;t been able to see it during testing&#8221; as proof that there&#8217;s not a software bug is simply unbelievable.</p>
<p>Anyone who works with software knows from experience that lots of bugs can&#8217;t be easily reproduced.  The fact that these incidents can&#8217;t be reproduced is not a proof of anything.</p>
<p><strong>Software in Cars: The Future</strong></p>
<p>Don&#8217;t get me wrong.  I want more software in my car not less.  I very much look forward to the day that an in-car computer takes over the driving for me.  After all, some cars already have more sensor data to make decisions on than the driver does.  Imagine what a car with an integrated GPS navigation system, auto-follow cruise control, and collision avoidance systems could do.  While I guess that I should move left one lane to avoid a crash, the computer is capable of seeing in all directions at once, calculating all of the trajectories of near-by cars, including instantaneous changes in their acceleration or deceleration.</p>
<p>Additionally, I suspect that even with bugs in a car&#8217;s drive-by-wire software the car may be much safer overall for its electronic traction control and anti-lock braking systems.</p>
<p>I just wish that Toyota would own up to the fact that the inability to reproduce a problem doesn&#8217;t rule out a software (or EMI) flaw.</p>
]]></content:encoded>
			<wfw:commentRss>http://embeddedgurus.com/barr-code/2010/03/toyotas-embedded-software-image-problem/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>Firmware-Specific Bug #5: Heap Fragmentation</title>
		<link>http://embeddedgurus.com/barr-code/2010/03/firmware-specific-bug-5-heap-fragmentation/</link>
		<comments>http://embeddedgurus.com/barr-code/2010/03/firmware-specific-bug-5-heap-fragmentation/#comments</comments>
		<pubDate>Mon, 15 Mar 2010 16:55:48 +0000</pubDate>
		<dc:creator>Michael Barr</dc:creator>
				<category><![CDATA[Firmware Bugs]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[education]]></category>
		<category><![CDATA[firmware]]></category>
		<category><![CDATA[rtos]]></category>
		<category><![CDATA[safety]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://embeddedgurus.com/barr-code/?p=336</guid>
		<description><![CDATA[Dynamic memory allocation is not widely used by embedded software developers—and for good reasons.  One of those is the problem of fragmentation of the heap.
All data structures created via C’s malloc() standard library routine or C++’s new keyword live on the heap.  The heap is a specific area in RAM of a pre-determined [...]]]></description>
			<content:encoded><![CDATA[<p>Dynamic memory allocation is not widely used by embedded software developers—and for good reasons.  One of those is the problem of fragmentation of the heap.</p>
<p>All data structures created via C’s malloc() standard library routine or C++’s <code>new</code> keyword live on the heap.  The heap is a specific area in RAM of a pre-determined maximum size.  Initially, each allocation from the heap reduces the amount of remaining “free” space by the same number of bytes.  For example, the heap in a particular system might span 10 KB starting from address 0x20200000.  An allocation of a pair of 4-KB data structures would leave 2 KB of free space.</p>
<p>The storage for data structures that are no longer needed can be returned to the heap by a call to free() or use of the <code>delete</code> keyword.  In theory this makes that storage space available for reuse during subsequent allocations.  But the order of allocations and deletions is generally at least pseudo-random—leading the heap to become a mess of smaller fragments.</p>
<p>To see how fragmentation can be a problem, consider what would happen if the first of the above 4-KB data structures is freed.  Now the heap consists of one 4-KB free chunk and another 2-KB free chunk; they are not adjacent and cannot be combined.  So our heap is already fragmented.  Despite 6 KB of total free space, allocations of more than 4 KB will fail.</p>
<p>Fragmentation is similar to entropy: both increase over time.  In a long running system (i.e., most every embedded system ever created), fragmentation may eventually cause some allocation requests to fail.  And what then?  How should your firmware handle the case of a failed heap allocation request?</p>
<p><em>Best Practice</em>: Avoiding all use of the heap is a sure way of preventing this bug.  But if dynamic memory allocation is either necessary or convenient in your system, there is an alternative way of structuring the heap that will prevent fragmentation.  The key observation is that the problem is caused by variable sized requests.</p>
<p>If all of the requests were of the same size, then any free block is as good as any other, even if it happens not to be adjacent to any of the other free blocks.  Thus it is possible to use multiple “heaps”, each for allocation requests of a specific size, by using a “memory pool” data structure.</p>
<p>If you like you can write your own fixed-sized memory pool API.  You’ll just need three functions:</p>
<ul>
<li>handle = pool_create(block_size, num_blocks) &#8211; to create a new pool (of num_blocks chunks of block_size bytes each);</li>
<li>p_block = pool_alloc(handle) &#8211; to allocate one chunk (from a specified pool); and</li>
<li>pool_free(handle, p_block) &#8211; to return one chunk (to the pool it came from).</li>
</ul>
<p>But note that many real-time operating systems (RTOSes) feature a fixed-size memory pool API.  If you have access to one of those, use it instead of the compiler&#8217;s malloc() and free() or your own implementation.</p>
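<p>For readers who do want to roll their own, the three functions listed above could be sketched along these lines.  This is one possible implementation under stated assumptions, not a definitive one: the arena size, the pool limit, and the <code>pool_t</code> and <code>pool_align_t</code> types are hypothetical; all storage is carved from a single static arena so that malloc() is never called; and nothing here is protected against concurrent use by multiple tasks or ISRs.</p>

```c
#include <stddef.h>
#include <string.h>

#define POOL_ARENA_BYTES 4096u  /* hypothetical total storage for all pools */
#define POOL_MAX_POOLS   4u     /* hypothetical limit on number of pools    */

/* Union used only to give blocks an alignment suitable for common types. */
typedef union { void *p; long long ll; double d; } pool_align_t;

typedef struct {
    void  *free_list;   /* singly linked list threaded through free blocks */
    size_t block_size;  /* block size after rounding up for alignment      */
} pool_t;

static pool_align_t arena[POOL_ARENA_BYTES / sizeof(pool_align_t)];
static size_t arena_used;
static pool_t pools[POOL_MAX_POOLS];
static size_t pools_used;

/* Return a block to its pool by pushing it onto the free list. */
void pool_free(pool_t *pool, void *p_block)
{
    if (pool == NULL || p_block == NULL) {
        return;
    }
    /* The link to the next free block is stored inside the block itself. */
    memcpy(p_block, &pool->free_list, sizeof pool->free_list);
    pool->free_list = p_block;
}

/* Take one block from the pool; returns NULL when the pool is empty. */
void *pool_alloc(pool_t *pool)
{
    void *p_block;

    if (pool == NULL || pool->free_list == NULL) {
        return NULL;
    }
    p_block = pool->free_list;
    memcpy(&pool->free_list, p_block, sizeof pool->free_list);
    return p_block;
}

/* Create a pool of num_blocks blocks of at least block_size bytes each.
 * Returns a handle, or NULL if the arena or pool table is exhausted. */
pool_t *pool_create(size_t block_size, size_t num_blocks)
{
    const size_t unit = sizeof(pool_align_t);
    unsigned char *mem;
    pool_t *pool;
    size_t size;

    if (block_size == 0u || num_blocks == 0u ||
        block_size > sizeof arena || pools_used == POOL_MAX_POOLS) {
        return NULL;
    }
    size = ((block_size + unit - 1u) / unit) * unit;   /* round up */
    if (num_blocks > (sizeof arena - arena_used) / size) {
        return NULL;
    }

    pool = &pools[pools_used++];
    pool->free_list  = NULL;
    pool->block_size = size;

    mem = (unsigned char *)arena + arena_used;
    arena_used += size * num_blocks;
    for (size_t i = 0; i < num_blocks; i++) {
        pool_free(pool, mem + i * size);   /* seed the free list */
    }
    return pool;
}
```

<p>Because every block in a pool is the same size, allocation and release are both constant-time list operations, and a freed block can always satisfy the next request from that pool.  Pools in this sketch are never destroyed, which suits firmware that creates them all at startup.</p>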
<p><a href="/barr-code/2010/03/firmware-specific-bug-4-stack-overflow/">Firmware-Specific Bug #4</a></p>
<p>Firmware-Specific Bug #6 (coming soon)</p>
]]></content:encoded>
			<wfw:commentRss>http://embeddedgurus.com/barr-code/2010/03/firmware-specific-bug-5-heap-fragmentation/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Firmware-Specific Bug #4: Stack Overflow</title>
		<link>http://embeddedgurus.com/barr-code/2010/03/firmware-specific-bug-4-stack-overflow/</link>
		<comments>http://embeddedgurus.com/barr-code/2010/03/firmware-specific-bug-4-stack-overflow/#comments</comments>
		<pubDate>Thu, 11 Mar 2010 19:52:51 +0000</pubDate>
		<dc:creator>Michael Barr</dc:creator>
				<category><![CDATA[Firmware Bugs]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[embedded]]></category>
		<category><![CDATA[firmware]]></category>
		<category><![CDATA[rtos]]></category>
		<category><![CDATA[safety]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://embeddedgurus.com/barr-code/?p=334</guid>
		<description><![CDATA[Every programmer knows that a stack overflow is a Very Bad Thing™.  The effect of each stack overflow varies, though.  The nature of the damage and the timing of the misbehavior depend entirely on which data or instructions are clobbered and how they are used.  Importantly, the length of time between a [...]]]></description>
			<content:encoded><![CDATA[<p>Every programmer knows that a stack overflow is a Very Bad Thing™.  The effect of each stack overflow varies, though.  The nature of the damage and the timing of the misbehavior depend entirely on which data or instructions are clobbered and how they are used.  Importantly, the length of time between a stack overflow and its negative effects on the system depends on how long it is before the clobbered bits are used.</p>
<p>Unfortunately, stack overflow afflicts embedded systems far more often than it does desktop computers.  This is for several reasons, including: </p>
<ol>
<li>embedded systems usually have to get by on a smaller amount of RAM;</li>
<li>there is typically no virtual memory to fall back on (because there is no disk);</li>
<li>firmware designs based on RTOS tasks utilize multiple stacks (one per task), each of which must be sized to accommodate its own worst-case stack depth;</li>
<li>and interrupt handlers may try to use those same stacks.</li>
</ol>
<p>Further complicating this issue, there is no amount of testing that can ensure that a particular stack is sufficiently large.  You can test your system under all sorts of loading conditions but you can only test it for so long.  A stack overflow that only occurs “once in a blue moon” may not be witnessed by tests that run for only “half a blue moon.”  Demonstrating that a stack overflow will never occur can, under algorithmic limitations (such as no recursion), be done with a top-down analysis of the control flow of the code.  But a top-down analysis will need to be redone every time the code is changed.</p>
<p><em>Best Practice</em>: On startup, paint an unlikely memory pattern throughout the stack(s).  (I like to use hex <code>23 3D 3D 23</code>, which looks like a fence ‘<code>#==#</code>’ in an ASCII memory dump.)  At runtime, have a supervisor task periodically check that none of the paint above some pre-established high water mark has been changed.  If something is found to be amiss with a stack, log the specific error (e.g., which stack and how high the flood) in non-volatile memory and do something safe for users of the product (e.g., controlled shut down or reset) before a true overflow can occur.  This is a nice additional safety feature to add to the watchdog task.</p>
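<p>The painting and checking steps might look something like the C sketch below.  It makes several assumptions that are not part of the practice itself: each stack is treated as an array of 32-bit words, the stack grows downward (so the paint nearest the lowest address is the last to be overwritten), and the function names <code>stack_paint()</code>, <code>stack_unused_words()</code>, and <code>stack_is_healthy()</code> are made up for the example.  On a real target the painting would normally be done in startup code or at task creation, before the stack is in use.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes 23 3D 3D 23 ("#==#" in ASCII).  The byte sequence is a palindrome,
 * so the word reads the same on big- and little-endian CPUs. */
#define STACK_PAINT  0x233D3D23u

/* Fill an entire (not yet used) stack region with the paint pattern. */
void stack_paint(uint32_t *p_low, size_t num_words)
{
    assert(p_low != NULL);

    for (size_t i = 0u; i < num_words; i++) {
        p_low[i] = STACK_PAINT;
    }
}

/* Count the words, starting from the lowest address, that still hold paint.
 * Assumes a descending stack, so this is the never-yet-used headroom. */
size_t stack_unused_words(const uint32_t *p_low, size_t num_words)
{
    assert(p_low != NULL);

    size_t unused = 0u;
    while ((unused < num_words) && (p_low[unused] == STACK_PAINT)) {
        unused++;
    }
    return unused;
}

/* True if the paint beyond the high water mark is still intact, i.e. at
 * least min_unused_words of headroom have never been touched. */
bool stack_is_healthy(const uint32_t *p_low, size_t num_words,
                      size_t min_unused_words)
{
    return stack_unused_words(p_low, num_words) >= min_unused_words;
}
```

<p>A supervisor task could call <code>stack_is_healthy()</code> for each task stack on every pass and, on a failure, log the stack identity and the value returned by <code>stack_unused_words()</code> before taking the safe action.</p>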
<p><a href="/barr-code/2010/02/firmware-specific-bug-3-missing-volatile-keyword/">Firmware-Specific Bug #3</a></p>
<p>Firmware-Specific Bug #5 (coming soon)</p>
]]></content:encoded>
			<wfw:commentRss>http://embeddedgurus.com/barr-code/2010/03/firmware-specific-bug-4-stack-overflow/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Embedded Gurus &#8211; Site Redesign</title>
		<link>http://embeddedgurus.com/barr-code/2010/03/embedded-gurus-site-redesign/</link>
		<comments>http://embeddedgurus.com/barr-code/2010/03/embedded-gurus-site-redesign/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 21:20:27 +0000</pubDate>
		<dc:creator>Michael Barr</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://embeddedgurus.com/barr-code/2010/03/embedded-gurus-site-redesign/</guid>
		<description><![CDATA[I am pleased to announce that the EmbeddedGurus website has been redesigned.  Among the new features of the site are:
1.  A dynamically updating home page, featuring the most recent posts from all of our bloggers.  If you prefer, you may view these posts by category.
2.  A common look and feel to [...]]]></description>
			<content:encoded><![CDATA[<p>I am pleased to announce that the <a href="http://www.embeddedgurus.com">EmbeddedGurus</a> website has been redesigned.  Among the new features of the site are:</p>
<p>1.  A dynamically updating home page, featuring the most recent posts from all of our bloggers.  If you prefer, you may view these posts <a href="/categories">by category</a>.</p>
<p>2.  A common look and feel to all of the individual blogs.</p>
<p>3.  The ability to search individual blogs, as well as to easily browse from one post to the next and via tags and categories.</p>
<p>4.  A sixth guru named <a href="/gurus/gary-stringham">Gary Stringham</a> with a blog called <a href="/embedded-bridge/">Embedded Bridge</a>.</p>
<p>A number of other minor improvements have also been made.</p>
<p>We hope you like the new look and continue to find our blogs about embedded systems design both readable and informative. </p>
]]></content:encoded>
			<wfw:commentRss>http://embeddedgurus.com/barr-code/2010/03/embedded-gurus-site-redesign/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Challenge of Debugging Cache Coherency Problems</title>
		<link>http://embeddedgurus.com/barr-code/2010/02/the-challenge-of-debugging-cache-coherency-problems/</link>
		<comments>http://embeddedgurus.com/barr-code/2010/02/the-challenge-of-debugging-cache-coherency-problems/#comments</comments>
		<pubDate>Fri, 19 Feb 2010 16:18:00 +0000</pubDate>
		<dc:creator>Michael Barr</dc:creator>
				<category><![CDATA[Firmware Bugs]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[firmware]]></category>

		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2010/02/19/the-challenge-of-debugging-cache-coherency-problems/</guid>
		<description><![CDATA[The following is an example of a cache-related embedded software bug that is a real challenge to solve for several reasons, not the least of which is the fact that the actual problem was masked in the debugger&#8217;s view of memory.
One nasty bug that came up recently for us was the realization that we were [...]]]></description>
			<content:encoded><![CDATA[<p>The following is an example of a cache-related embedded software bug that is a real challenge to solve for several reasons, not the least of which is the fact that the actual problem was masked in the debugger&#8217;s view of memory.</p>
<blockquote><p>One nasty bug that came up recently for us was the realization that we were not flushing the instruction cache after leaving the bootloader which had a very confusing effect when running our application. In our design our code pretty much runs out of flash. Our bootloader is in the lowest part of flash and our 2 images sit in their own higher memory ranges of flash. So we never realized we should do this.</p>
<p>Well, we had to copy a small piece of code into RAM for the purpose of allowing firmware upgrades to be written to flash. This piece of code would be executing when the actual erases and writes took place (i.e. we couldn&#8217;t execute from AND write to flash at the same time). This code would get copied out of flash both when the bootloader started execution AND when the image would start execution because they shared the startup code that we inherited from a board development kit (BDK).</p>
<p>Another thing we didn&#8217;t realize was that the RAM code optimized differently for the bootloader image and the application images. The end result is that the instruction cache would in certain cases have a hit and return the wrong instructions for us. For instance, when we tried to perform an upgrade while running from our image, it would erase a completely different area of flash than we intended. To make things somewhat more confusing, it did NOT help to step through the code using the debugger. The debugger was not showing us that the instruction cache was providing different lines of code than the lines of source it was showing.</p>
<p>This was ultimately one of the more frustrating bugs we have chased recently. Imagine the confusion when sometimes a firmware upgrade would work fine and other times it would completely brick your board (they could be salvaged with a JTAG programmer at least).</p>
</blockquote>
<p>Thanks to Richard von Lehe of <a href="http://www.starkey.com">Starkey Labs</a> for sharing this.</p>
]]></content:encoded>
			<wfw:commentRss>http://embeddedgurus.com/barr-code/2010/02/the-challenge-of-debugging-cache-coherency-problems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Firmware-Specific Bug #3: Missing Volatile Keyword</title>
		<link>http://embeddedgurus.com/barr-code/2010/02/firmware-specific-bug-3-missing-volatile-keyword/</link>
		<comments>http://embeddedgurus.com/barr-code/2010/02/firmware-specific-bug-3-missing-volatile-keyword/#comments</comments>
		<pubDate>Thu, 18 Feb 2010 09:21:00 +0000</pubDate>
		<dc:creator>Michael Barr</dc:creator>
				<category><![CDATA[Firmware Bugs]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[embedded]]></category>
		<category><![CDATA[firmware]]></category>
		<category><![CDATA[safety]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2010/02/18/firmware-specific-bug-3-missing-volatile-keyword/</guid>
		<description><![CDATA[Failure to tag certain types of variables with C’s &#8216;volatile&#8217; keyword can cause a number of symptoms in a system that works properly only when the compiler’s optimizer is set to a low level or disabled.  The volatile qualifier is used during variable declarations, where its purpose is to prevent optimization of the reads [...]]]></description>
			<content:encoded><![CDATA[<p>Failure to tag certain types of variables with C’s &#8216;volatile&#8217; keyword can cause a number of symptoms in a system that works properly only when the compiler’s optimizer is set to a low level or disabled.  The volatile qualifier is used during variable declarations, where its purpose is to prevent optimization of the reads and writes of that variable.</p>
<p>For example, if you write code that says:</p>
<p><code><br />&nbsp;&nbsp; &nbsp;g_alarm = ALARM_ON; &nbsp; &nbsp;// Patient dying--get nurse!<br />&nbsp;&nbsp; &nbsp;// Other code; with no reads of g_alarm state.<br />&nbsp;&nbsp; &nbsp;g_alarm = ALARM_OFF; &nbsp; // Patient stable.<br /></code></p>
<p>the optimizer will generally try to make your program both faster and smaller by eliminating the first line above&#8211;to the detriment of the patient.  However, if g_alarm is declared as volatile this optimization will not take place.</p>
<p><i>Best Practice</i>: The &#8216;volatile&#8217; keyword should be used to declare any: (a) global variable shared by an ISR and any other code; (b) global variable accessed by two or more RTOS tasks (even when race conditions in those accesses have been prevented); (c) pointer to a memory-mapped peripheral register (or register set);  or (d) delay loop counter.</p>
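<p>Note that in addition to ensuring all reads and writes take place for a given variable, the use of volatile also constrains the order in which the compiler may emit them.  Each volatile access is a side effect that must be complete by the next sequence point, so accesses to multiple volatiles must be executed in the order they are written in the code.  Accesses to non-volatile variables, however, may still be reordered around them.</p>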
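<p>For illustration, here is a small C sketch of three of the declaration cases listed above: (a) a flag shared with an ISR, (c) a pointer to a peripheral register, and (d) a delay loop counter.  All of the names (<code>alarm_isr()</code>, <code>alarm_is_on()</code>, <code>alarm_clear()</code>, <code>reg_set_bits()</code>, <code>delay_loop()</code>) are hypothetical, and the register is passed in as a pointer because a real register address would come from the datasheet of a specific part.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ALARM_OFF  0u
#define ALARM_ON   1u

/* (a) Global shared by an ISR and task-level code. */
static volatile uint8_t g_alarm = ALARM_OFF;

/* Hypothetical interrupt handler; how it is hooked to a vector is
 * compiler- and target-specific and is not shown here. */
void alarm_isr(void)
{
    g_alarm = ALARM_ON;
}

/* Task-level code: because g_alarm is volatile, every call re-reads it
 * from memory instead of reusing a value cached in a register. */
bool alarm_is_on(void)
{
    return g_alarm == ALARM_ON;
}

void alarm_clear(void)
{
    g_alarm = ALARM_OFF;
}

/* (c) Pointer to a memory-mapped register: the pointed-to object is
 * volatile, so the read-modify-write below cannot be optimized away. */
void reg_set_bits(volatile uint32_t *p_reg, uint32_t mask)
{
    assert(p_reg != NULL);
    *p_reg |= mask;
}

/* (d) Delay loop counter: without volatile, the optimizer is free to
 * delete this empty loop entirely. */
void delay_loop(uint32_t count)
{
    for (volatile uint32_t i = 0u; i < count; i++) {
        /* Intentionally empty. */
    }
}
```

<p>Note that volatile by itself does not make a read-modify-write such as <code>*p_reg |= mask</code> atomic; if an ISR or another task can touch the same register, the access still needs to be protected against preemption.</p>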
<p><a href="/barr-code/2010/02/firmware-specific-bug-2-non-reentrant.html">Firmware-Specific Bug #2</a></p>
<p><a href="/barr-code/2010/03/firmware-specific-bug-4-stack-overflow/">Firmware-Specific Bug #4</a></p>
]]></content:encoded>
			<wfw:commentRss>http://embeddedgurus.com/barr-code/2010/02/firmware-specific-bug-3-missing-volatile-keyword/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Firmware-Specific Bug #2: Non-Reentrant Function</title>
		<link>http://embeddedgurus.com/barr-code/2010/02/firmware-specific-bug-2-non-reentrant-function/</link>
		<comments>http://embeddedgurus.com/barr-code/2010/02/firmware-specific-bug-2-non-reentrant-function/#comments</comments>
		<pubDate>Mon, 15 Feb 2010 11:01:00 +0000</pubDate>
		<dc:creator>Michael Barr</dc:creator>
				<category><![CDATA[Firmware Bugs]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[firmware]]></category>
		<category><![CDATA[safety]]></category>

		<guid isPermaLink="false">http://www.gfcdev.org/test-stack/2010/02/15/firmware-specific-bug-2-non-reentrant-function/</guid>
		<description><![CDATA[Technically, the problem of a non-reentrant function is a special case of the problem of a race condition.  For that reason the run-time errors caused by a non-reentrant function are similar and also don’t occur in a reproducible way, making them just as hard to debug.  Unfortunately, a non-reentrant function is also more difficult to spot [...]]]></description>
			<content:encoded><![CDATA[<p>Technically, the problem of a non-reentrant function is a special case of the problem of a <a href="http://www.embeddedgurus.net/barr-code/2010/02/firmware-specific-bug-1-race-condition.html">race condition</a>.  For that reason the run-time errors caused by a non-reentrant function are similar and also don’t occur in a reproducible way, making them just as hard to debug.  Unfortunately, a non-reentrant function is also more difficult to spot in a code review than other types of race conditions.</p>
<p>The figure below shows a typical scenario.  Here the software entities subject to preemption are RTOS tasks.  But rather than manipulating a shared object directly, they do so by way of function call indirection.  For example, suppose that Task A calls a sockets-layer protocol function, which calls a TCP-layer protocol function, which calls an IP-layer protocol function, which calls an Ethernet driver.  In order for the system to behave reliably, all of these functions must be reentrant.</p>
<p><a href="http://embeddedgurus.com/barr-code/files/2010/02/TCPIP.png"><img class="aligncenter size-medium wp-image-374" title="TCPIP" src="http://embeddedgurus.com/barr-code/files/2010/02/TCPIP-300x201.png" alt="" width="300" height="201" /></a></p>
<p>But the functions of the driver module manipulate the same global object in the form of the registers of the Ethernet Controller chip.  If preemption is permitted during these register manipulations, Task B may preempt Task A after the Packet A data has been queued but before the transmit is begun.  Then Task B calls the sockets-layer function, which calls the TCP-layer function, which calls the IP-layer function, which calls the Ethernet driver, which queues and transmits Packet B.  When control of the CPU returns to Task A, it finally requests its transmission.  Depending on the design of the Ethernet controller chip, this may either retransmit Packet B or generate an error.  Either way, Packet A&#8217;s data is lost and does not go out onto the network.</p>
<p>In order for the functions of this Ethernet driver to be callable from multiple RTOS tasks (near-)simultaneously, those functions must be made reentrant.  If each function uses only stack variables, there is nothing to do; each RTOS task has its own private stack.  But drivers and some other functions will be non-reentrant unless carefully designed.</p>
<p>The key to making functions reentrant is to suspend preemption around all accesses of peripheral registers, global variables (including static local variables), persistent heap objects, and shared memory areas.  This can be done either by disabling one or more interrupts or by acquiring and releasing a <a href="http://www.netrino.com/Embedded-Systems/Glossary-M#mutex">mutex</a>; the specifics of the type of shared data usually dictate the best solution.</p>
<p><em>Best Practice</em>: Create and hide a mutex within each library or driver module that is not intrinsically reentrant.  Make acquisition of this mutex a pre-condition for the manipulation of any persistent data or shared registers used within the module as a whole.  For example, the same mutex may be used to prevent race conditions involving both the Ethernet controller registers and a global (or static local) packet counter.  All functions in the module that access this data must follow the protocol to acquire the mutex before manipulating these objects.</p>
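<p>A minimal sketch of this pattern in C follows.  It uses a POSIX mutex purely as a stand-in for whatever mutex primitive your RTOS provides, and the driver itself is invented for the example: <code>eth_send()</code>, <code>eth_packets_sent()</code>, and the variable standing in for a controller register are assumptions, not the API of any real Ethernet driver.  The point is only that the mutex is private to the module and that every function touching the shared objects takes it first.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Private to this module: callers never see the mutex. */
static pthread_mutex_t g_eth_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Shared objects guarded by g_eth_mutex.  g_tx_length_reg stands in for a
 * register of the (hypothetical) Ethernet controller. */
static uint32_t g_tx_length_reg;
static uint32_t g_packets_sent;

/* Queue and transmit one packet.  Returns 0 on success, -1 on failure. */
int eth_send(const uint8_t *p_data, size_t length)
{
    assert(p_data != NULL);

    if (pthread_mutex_lock(&g_eth_mutex) != 0) {
        return -1;
    }

    /* Queue the packet and start the transmit as one uninterrupted step,
     * so another task cannot slip in between the two. */
    g_tx_length_reg = (uint32_t)length;
    g_packets_sent++;

    (void)pthread_mutex_unlock(&g_eth_mutex);
    return 0;
}

/* Even a simple read of the shared counter follows the same protocol. */
uint32_t eth_packets_sent(void)
{
    uint32_t count;

    (void)pthread_mutex_lock(&g_eth_mutex);
    count = g_packets_sent;
    (void)pthread_mutex_unlock(&g_eth_mutex);

    return count;
}
```

<p>With this structure, Task B in the scenario above would block inside <code>eth_send()</code> until Task A had finished both queuing and transmitting Packet A.  A mutex is not usable from an ISR, so data shared with an interrupt handler would call for disabling that interrupt instead.</p>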
<p>Beware that non-reentrant functions may come into your code base as part of third party middleware, legacy code, or device drivers.  Disturbingly, non-reentrant functions may even be part of the standard C or C++ library provided with your compiler.  For example, if you are using the <a href="http://gcc.gnu.org/">GNU compiler</a> to build RTOS-based applications, take note that you should be using the reentrant “<a href="http://sourceware.org/newlib/">newlib</a>” standard C library rather than the default.</p>
<p><a href="http://www.embeddedgurus.net/barr-code/2010/02/firmware-specific-bug-1-race-condition.html">Firmware-Specific Bug #1</a></p>
<p><a href="http://www.embeddedgurus.net/barr-code/2010/02/firmware-specific-bug-3-missing.html">Firmware-Specific Bug #3</a></p>
]]></content:encoded>
			<wfw:commentRss>http://embeddedgurus.com/barr-code/2010/02/firmware-specific-bug-2-non-reentrant-function/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
