Posts Tagged ‘architecture’

Embedded Software Training in a Box

Friday, May 6th, 2011 Michael Barr

I am beaming with pride. I think we have finally achieved the holy grail of firmware training: Embedded Software Training in a Box. Priced at just $599, the kit includes Everything-You-Need-to-Know-to-Develop-Quality-Reliable-Firmware-in-C, including software for real-time safety-critical systems such as medical devices.

In many ways, this product is the culmination of about the last fifteen years of my career. The knowledge and skills imparted in the kit are drawn from my varied experiences as:

This kit also–at long last–answers the question I’ve been receiving from around the world since I first started writing articles and books about embedded programming: “Where/How can I learn to be a great embedded programmer?” I believe the answer is now as easy as: “Embedded Software Boot Camp in a Box!”

What NHTSA/NASA Didn’t Consider re: Toyota’s Firmware

Wednesday, March 2nd, 2011 Michael Barr

In a blog post yesterday (Unintended Acceleration and Other Embedded Software Bugs), I wrote extensively on the report from NASA’s technical team regarding their analysis of the embedded software in Toyota’s ETCS-i system. My overall point was that it is hard to judge the quality of their analysis (and thereby the overall conclusion that the software isn’t to blame for unintended accelerations) given the large number of redactions.

I need to put the report down and do some other work at this point, but I have a few other thoughts and observations worth writing down.

Insufficient Explanations

First, some of the explanations offered by Toyota, and apparently accepted by NASA, strike me as insufficient. For example, on pages 129-132 of Appendix A to the NASA Report there is a discussion of recursion in the Toyota firmware: “The question then is how to verify that the indirect recursion in the ETCS-i does in fact terminate (i.e., has no infinite recursion) and does not cause a stack overflow.”

“For the case of stack overflow, [redacted phrase], and therefore a stack overflow condition cannot be detected precisely. It is likely, however, that overflow would cause some form of memory corruption, which would in turn cause some bad behavior that would then cause a watchdog timer reset. Toyota relies on this assumption to claim that stack overflow does not occur because no reset occurred during testing.” (emphasis added)

I have written before about what really happens during a stack overflow (Firmware-Specific Bug #4: Stack Overflow); that post explains why a reset may not result and also why it is so hard to trace a stack overflow back to that root cause. (From page 20, in NASA’s words: “The system stack is limited to just 4096 bytes, it is therefore important to secure that no execution can exceed the stack limit. This type of check is normally simple to perform in the absence of recursive procedures, which is standard in safety critical embedded software.”)
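One common way to see how close a stack comes to overflowing is stack painting: fill the stack region with a known pattern at startup, then later scan for the deepest word that has been overwritten. The C sketch below is illustrative only. It simulates the 4096-byte stack NASA mentions with a static array, assumes a downward-growing stack, and uses hypothetical names (`stack_paint()`, `stack_unused_words()`, `stack_margin_violated()`) and a hypothetical fill value. Note what it cannot do: it reports only the deepest usage actually observed, so it measures a test run rather than proving a worst-case bound, which is exactly why recursion makes the analysis hard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS   1024u        /* 4096-byte stack of 32-bit words */
#define PAINT_PATTERN 0xDEADBEEFu  /* hypothetical fill value */

/* Stand-in for the real stack region. On a target this would be the
 * linker-defined stack, assumed here to grow down from the top. */
static uint32_t stack_region[STACK_WORDS];

/* Fill the whole stack with a known pattern before any code uses it. */
void stack_paint(void)
{
    for (size_t i = 0u; i < STACK_WORDS; i++) {
        stack_region[i] = PAINT_PATTERN;
    }
}

/* Count still-painted words from the bottom (lowest address) up: the
 * headroom left at the deepest point the stack has reached so far. */
size_t stack_unused_words(void)
{
    size_t n = 0u;
    while (n < STACK_WORDS && stack_region[n] == PAINT_PATTERN) {
        n++;
    }
    return n;
}

/* True if the observed headroom has fallen below a safety margin. */
bool stack_margin_violated(size_t margin_words)
{
    assert(margin_words <= STACK_WORDS);
    return stack_unused_words() < margin_words;
}
```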

Similarly, “Toyota designed the software with a high margin of safety with respect to deadlines and timeliness. … [but] documented no formal verification that all tasks actually meet this deadline requirement.” and “All verification of timely behavior is accomplished with CPU load measurements and other measurement-based techniques.” It’s not clear to me if the NASA team is saying it buys those Toyota explanations or merely wanted to write them down. However, I do not see a sufficient explanation in this wording from page 132:

“The [worst case execution time] analysis and recursion analysis involve two distinctly different problems, but they have one thing in common: Both of their failure modes would result in a CPU reset. … These potential malfunctions, and many others such as concurrency deadlocks and CPU starvation, would eventually manifest as a spontaneous system reset.” (emphasis added)

Might not a deadlock, starvation, priority inversion, or infinite recursion be capable of producing a bit of “bad behavior” (perhaps even unintended acceleration) before that “eventual” reset? Or might not a stack overflow corrupt just one or a few important variables slightly, and thereby cause bad behavior instead of, or before, a reset? These kinds of possibilities, even at very low probabilities, are important to consider in light of NASA’s calculation that the U.S.-owned Camry 2002-2007 fleet alone is running this software a cumulative one billion hours per year.
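To make that scale concrete, suppose (purely hypothetically; the report gives no such figure) that some fault sequence has a probability p of 10^-9 of occurring in any given operating hour. With H = 10^9 fleet hours per year, the expected number of occurrences is:

```latex
E[\text{occurrences per year}] = p \cdot H
  = \left(10^{-9}\ \text{h}^{-1}\right) \times \left(10^{9}\ \text{h/yr}\right)
  = 1\ \text{per year}
```

In other words, an event that would almost never show up in bench testing could still be expected across the fleet roughly once a year.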

Paths Not Taken

My second observation is based upon reflection on the steps NASA might have taken in its review of Toyota’s ETCS-i firmware, but apparently did not. Specifically, there is no mention anywhere (unless it was entirely redacted) of:

  • rate monotonic analysis, which is a technique that Toyota could have used to validate the critical set of tasks with deadlines and higher priority ISRs (and that NASA could have applied in its review),
  • cyclomatic complexity, which NASA might have used as an additional winnowing tool to focus its limited time on particularly complex and hard-to-test routines,
  • hazard analysis and mitigation, as those terms are defined by FDA guidelines regarding software contained in medical devices, nor
  • any discussion or review of Toyota’s specific software testing regimen and bug tracking system.
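For readers unfamiliar with the second item: McCabe’s cyclomatic complexity counts the linearly independent paths through a routine, and for structured code it equals the number of binary decision points plus one. The C routine below is hypothetical, written only to show the counting; it has nothing to do with Toyota’s code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical routine: sum a block of ADC samples and clamp the result.
 * Counted at the source level, its binary decision points are:
 *   1. the for-loop condition
 *   2. if (sum > limit)
 *   3. else if (sum == 0u)
 * so its cyclomatic complexity is M = 3 + 1 = 4. The assert() call is a
 * macro invocation with no && or ||, so it adds no decision point here.
 */
uint16_t sum_and_clamp(const uint16_t *samples, size_t n, uint16_t limit)
{
    assert(samples != NULL);
    uint32_t sum = 0u;
    for (size_t i = 0u; i < n; i++) {
        sum += samples[i];
    }
    if (sum > limit) {
        return limit;
    } else if (sum == 0u) {
        return 0u;
    }
    return (uint16_t)sum;
}
```

Four independent paths means basis path testing would call for four test cases to cover this one small routine, which is why complexity scores are a reasonable way to rank routines for closer review.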

Importantly, there is also a complete absence of discussion of how Toyota’s ETCS-i firmware versions evolved over time. Which makes and models (and model years) had which versions of that firmware? (Presumably there were also hardware changes worthy of note.) Were updates or patches ever made to cars once they were sold, say while at the dealer during official recalls or other types of service?

Firmware-Specific Bug #10: Jitter

Thursday, December 2nd, 2010 Michael Barr

Some real-time systems demand not only that a set of deadlines always be met but also that additional timing constraints be observed in the process. One such constraint is a limit on jitter.

An example of jitter is shown in Figure 1. Here a variable amount of work (blue boxes) must be completed before every 10 ms deadline. As illustrated in the figure, the deadlines are all met. However, there is considerable timing variation from one run of this job to the next. Such jitter is unacceptable in some systems, which must instead start or end each 10 ms run more precisely.

Jitter Figure 1
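Jitter can be quantified in more than one way. One simple measure, assumed here for illustration, is the peak-to-peak spread between the longest and shortest intervals separating consecutive start times. A C sketch, with a hypothetical function name and timestamps in microseconds:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Peak-to-peak jitter of a periodic job: the difference between the
 * longest and shortest interval between consecutive start timestamps. */
uint32_t jitter_peak_to_peak(const uint32_t *start_us, size_t n)
{
    assert(n >= 2u);
    uint32_t min_iv = UINT32_MAX;
    uint32_t max_iv = 0u;
    for (size_t i = 1u; i < n; i++) {
        /* Unsigned subtraction still gives the right interval if the
         * timestamp counter wrapped once between the two samples. */
        uint32_t iv = start_us[i] - start_us[i - 1u];
        if (iv < min_iv) {
            min_iv = iv;
        }
        if (iv > max_iv) {
            max_iv = iv;
        }
    }
    return max_iv - min_iv;
}
```

For start times of 0, 12000, 21000, and 38000 us (each run starting within its own 10 ms window), the intervals are 12, 9, and 17 ms, so the peak-to-peak jitter is 8 ms.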

If the work to be performed involves sampling a physical input signal, such as reading an analog-to-digital converter, a precise sampling period will often lead to higher accuracy in derived values. For example, variations in the time between reads of an optical encoder’s pulse count will lower the precision of the computed velocity of the attached rotating shaft.
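A small worked example makes the effect concrete. The sketch assumes a hypothetical 1000-count-per-revolution encoder on a shaft turning at exactly 10 revolutions per second. If a sample is taken 12 ms after the previous one instead of the nominal 10 ms, the count delta is 120, and code that divides by the nominal period reports 12 rev/s, a 20% error.

```c
#include <assert.h>
#include <stdint.h>

#define COUNTS_PER_REV 1000.0   /* hypothetical encoder resolution */

/* Shaft speed in revolutions per second from a pulse-count delta and
 * the time that elapsed between the two counter reads. */
double shaft_rev_per_s(int32_t delta_counts, double interval_s)
{
    assert(interval_s > 0.0);
    return (double)delta_counts / COUNTS_PER_REV / interval_s;
}
```

Here `shaft_rev_per_s(120, 0.012)` returns about 10.0, the true speed, while `shaft_rev_per_s(120, 0.010)` returns about 12.0, because it divides by a period that the sample did not actually span.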

Best Practice: The single most important factor in the amount of jitter is the relative priority of the task or ISR that implements the recurrent behavior. The higher the priority, the lower the jitter. The periodic reads of those encoder pulse counts should thus typically be made in a timer tick ISR rather than in an RTOS task.
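A minimal sketch of that arrangement follows, assuming a 10 ms hardware timer whose interrupt calls `timer_tick_isr()`. The encoder counter register is hypothetical and simulated here by an ordinary variable; the ISR does nothing but latch the count into a small ring buffer, leaving the velocity math to task-level code. On real hardware, a task reading the buffer would also need to guard against a concurrent ISR update, for example by briefly disabling that interrupt.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoder counter register, simulated as a plain variable.
 * On a real MCU this would be a volatile access to a memory-mapped address. */
volatile uint16_t g_encoder_count_reg;

#define SAMPLE_BUF_LEN 8u

static volatile uint16_t s_samples[SAMPLE_BUF_LEN];
static volatile uint8_t  s_head;   /* index of the next slot to write */

/* Runs every 10 ms from the timer interrupt. Latching the count here ties
 * the sample instant to the hardware timer, not to the RTOS scheduler. */
void timer_tick_isr(void)
{
    s_samples[s_head] = g_encoder_count_reg;
    s_head = (uint8_t)((s_head + 1u) % SAMPLE_BUF_LEN);
}

/* Task-level accessor: age 0 is the most recent sample, 1 the one before. */
uint16_t encoder_sample(uint8_t age)
{
    assert(age < SAMPLE_BUF_LEN);
    uint8_t idx = (uint8_t)((s_head + SAMPLE_BUF_LEN - 1u - age) % SAMPLE_BUF_LEN);
    return s_samples[idx];
}
```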

Figure 2 shows how the interval of three different 10 ms recurring samples might be impacted by their relative priorities. At the highest priority is a timer tick ISR, which executes precisely on the 10 ms interval (unless there are higher-priority interrupts, of course). Below that is a high-priority task (TH), which may still be able to meet a recurring 10 ms start time precisely. At the bottom, though, is a low-priority task (TL) whose timing is greatly affected by what goes on at higher priority levels. As shown, the interval for the low-priority task is 10 ms +/- approximately 5 ms.

Jitter Figure 2


What Belongs in a C .h Header File?

Wednesday, November 10th, 2010 Michael Barr

What sorts of things should you (or should you not) put in a C language .h header file? When should you create a header file? And why?

When I talk to embedded C programmers about hardware interfacing in C or Netrino’s Embedded C Coding Standard, I often come to see that they lack basic skills and information about the C programming language. This is usually because we are mostly a gang of electrical engineers who are self-taught in C (and every other programming language we use).

When the subject of header files comes up, here’s my list of do’s and don’ts:

DO create one .h header file for each “module” of the system. A module may comprise one or more compilation units (e.g., .c or .asm source code files). But it should implement just one aspect of the system. Examples of well-chosen modules are: a device driver for an A/D converter; a communication protocol, such as FTP; and an alarm manager that is solely responsible for logging error conditions and alerting the user of the active errors.

DO include in the header file all of the function prototypes for the public interface of the module it describes. For example, a header file adc.h might contain function prototypes for adc_init(), adc_select_input(), and adc_read().

DON’T include in the header file any other function or macro that may lie inside the module source code. It is desirable to hide these internal “helper” functions inside the implementation. If it’s not called from any other module, hide it! (If your module spans several compilation units that need to share a helper function, then create a separate header file just for this purpose.) Module A should only call Module B through the public interface defined in moduleb.h.
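As a concrete (and hypothetical) illustration of the two rules above, here is how the adc module might be split. Both files are shown in one listing so that it compiles on its own; the `adc_convert()` helper, the channel count, and the fake conversion result are stand-ins, not real driver code.

```c
/* ===== adc.h : public interface only ===== */
#ifndef ADC_H
#define ADC_H

#include <stdint.h>

void     adc_init(void);
void     adc_select_input(uint8_t channel);
uint16_t adc_read(void);

#endif /* ADC_H */

/* ===== adc.c : implementation ===== */
#include <assert.h>
/* In the real adc.c, #include "adc.h" would appear here. */

#define ADC_NUM_CHANNELS 8u            /* hypothetical channel count */

static uint8_t s_channel;              /* private state, not in adc.h */

/* Helper used only inside this module: static and absent from adc.h. */
static uint16_t adc_convert(uint8_t channel)
{
    return (uint16_t)(channel * 100u); /* stand-in for a real conversion */
}

void adc_init(void)
{
    s_channel = 0u;
}

void adc_select_input(uint8_t channel)
{
    assert(channel < ADC_NUM_CHANNELS);
    s_channel = channel;
}

uint16_t adc_read(void)
{
    return adc_convert(s_channel);
}
```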

DON’T include any executable code or variable definitions in a header file. The one necessary exception is the bodies of some inline functions.

DON’T expose any variable in a header file, as is too often done by way of the ‘extern’ keyword. Proper encapsulation of a module requires data hiding: keeping any and all internal state data in private variables inside the .c source code files. Whenever possible these variables should also be declared with the keyword ‘static’ to enlist the linker’s help in hiding them.
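The alarm manager mentioned earlier makes a good example of the data-hiding rule. In this sketch (hypothetical names throughout), the set of active alarms lives in a `static` variable in alarm.c, and other modules query it only through functions declared in alarm.h instead of reading an `extern` global.

```c
/* ===== alarm.h ===== */
#ifndef ALARM_H
#define ALARM_H

#include <stdbool.h>

typedef enum {
    ALARM_OVERTEMP,
    ALARM_LOW_BATTERY,
    ALARM_COUNT                       /* number of alarm IDs */
} alarm_id_t;

void alarm_raise(alarm_id_t id);
void alarm_clear(alarm_id_t id);
bool alarm_is_active(alarm_id_t id);

#endif /* ALARM_H */

/* ===== alarm.c ===== */
#include <assert.h>
#include <stdint.h>

/* Private state. Rather than "extern uint32_t g_active_alarms;" in
 * alarm.h, other modules use the accessor functions declared above. */
static uint32_t s_active_mask;

void alarm_raise(alarm_id_t id)
{
    assert(id < ALARM_COUNT);
    s_active_mask |= (uint32_t)1u << id;
}

void alarm_clear(alarm_id_t id)
{
    assert(id < ALARM_COUNT);
    s_active_mask &= ~((uint32_t)1u << id);
}

bool alarm_is_active(alarm_id_t id)
{
    assert(id < ALARM_COUNT);
    return (s_active_mask & ((uint32_t)1u << id)) != 0u;
}
```

If alarms can be raised from interrupt context, the read-modify-write in these functions would also need a critical section; that is omitted here for brevity.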

DON’T expose the internal format of any module-specific data structure passed to or returned from one or more of the module’s interface functions. That is to say, there should be no “struct { … } foo;” code in any header file. If you do have a type you need to pass in and out of your module, so that client modules can declare pointers to instances of it, you can simply declare “typedef struct foo moduleb_type;” in the header file. Client modules should never know, and this way cannot know, the internal format of the struct.
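Here is one way the opaque-type rule can look in practice, using a hypothetical byte FIFO module. The header declares only an incomplete type, so clients can hold `fifo_t *` pointers but cannot allocate a `struct fifo` themselves or touch its fields; the module therefore hands out instances. Allocating from a small static pool rather than with `malloc()` is an assumption chosen here because many firmware projects avoid the heap.

```c
/* ===== fifo.h ===== */
#ifndef FIFO_H
#define FIFO_H

#include <stdbool.h>
#include <stdint.h>

typedef struct fifo fifo_t;           /* incomplete type: layout hidden */

fifo_t *fifo_create(void);
bool    fifo_put(fifo_t *f, uint8_t byte);
bool    fifo_get(fifo_t *f, uint8_t *byte);

#endif /* FIFO_H */

/* ===== fifo.c ===== */
#include <assert.h>
#include <stddef.h>

#define FIFO_SIZE 16u
#define FIFO_POOL 4u                  /* maximum number of FIFOs */

struct fifo {                         /* visible only inside fifo.c */
    uint8_t buf[FIFO_SIZE];
    uint8_t head;                     /* next byte to read */
    uint8_t tail;                     /* next free slot */
    uint8_t count;                    /* bytes currently stored */
};

static struct fifo s_pool[FIFO_POOL]; /* zero-initialized: all empty */
static uint8_t     s_used;

fifo_t *fifo_create(void)
{
    if (s_used >= FIFO_POOL) {
        return NULL;                  /* pool exhausted */
    }
    return &s_pool[s_used++];
}

bool fifo_put(fifo_t *f, uint8_t byte)
{
    assert(f != NULL);
    if (f->count == FIFO_SIZE) {
        return false;                 /* full */
    }
    f->buf[f->tail] = byte;
    f->tail = (uint8_t)((f->tail + 1u) % FIFO_SIZE);
    f->count++;
    return true;
}

bool fifo_get(fifo_t *f, uint8_t *byte)
{
    assert(f != NULL && byte != NULL);
    if (f->count == 0u) {
        return false;                 /* empty */
    }
    *byte = f->buf[f->head];
    f->head = (uint8_t)((f->head + 1u) % FIFO_SIZE);
    f->count--;
    return true;
}
```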

Though not really specific to embedded software development, I hope this advice on good C programming practices is useful to you. If it is please let me know and I will provide more C advice in future blog posts.

Rate Monotonic Analysis and Round Robin Scheduling

Friday, January 22nd, 2010 Michael Barr

Rate Monotonic Analysis (RMA) is a way of proving a priori via mathematics (rather than post-implementation via testing) that a set of tasks and interrupt service routines (ISRs) will always meet their deadlines–even under worst-case timing. In this post, I address what to do if two or more tasks or ISRs have equal priority, and whether round robin scheduling is necessary in an RTOS to handle that special case.

First, a little background. In order for the schedulability analysis portion of the RMA mathematics to provide meaningful results, the following assumptions must hold:

Under RMA, the relative priorities are assigned according to a simple rule: “The more often a task or ISR runs (in the worst case), the higher its priority.” Put another way, the task or ISR with the longest period between iterations (interarrival time, if you prefer) is assigned the lowest priority. This is because an infrequent but high-priority task could cause a more frequent task to miss an entire iteration.
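The priority rule can be sketched in C as a sort by period. For the schedulability check, this sketch uses the exact response-time test for fixed-priority scheduling rather than the simpler Liu and Layland utilization bound, n(2^(1/n) - 1), which is only a sufficient condition. The sketch assumes that each deadline equals its period, that tasks do not block one another, and that context-switch overhead is negligible; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* One periodic task or ISR. Units are arbitrary but must match (e.g. us). */
typedef struct {
    unsigned period;  /* worst-case interarrival time */
    unsigned wcet;    /* worst-case execution time */
} task_t;

/* qsort comparator: shorter period sorts first, i.e. gets higher priority. */
static int by_period(const void *a, const void *b)
{
    const task_t *ta = a;
    const task_t *tb = b;
    return (ta->period > tb->period) - (ta->period < tb->period);
}

/* Rate monotonic assignment: afterwards, index 0 is the highest priority.
 * qsort leaves the relative order of equal-period tasks unspecified. */
void rma_assign_priorities(task_t *tasks, size_t n)
{
    qsort(tasks, n, sizeof tasks[0], by_period);
}

/* Exact response-time test; tasks must already be in priority order. */
bool rma_schedulable(const task_t *tasks, size_t n)
{
    for (size_t i = 0u; i < n; i++) {
        assert(tasks[i].period > 0u);
        unsigned r = tasks[i].wcet;
        unsigned prev = 0u;
        /* Iterate R = C_i + sum over higher-priority j of ceil(R/T_j)*C_j
         * until it stops changing or exceeds the deadline (the period). */
        while (r != prev) {
            if (r > tasks[i].period) {
                return false;
            }
            prev = r;
            r = tasks[i].wcet;
            for (size_t j = 0u; j < i; j++) {
                unsigned releases = (prev + tasks[j].period - 1u) / tasks[j].period;
                r += releases * tasks[j].wcet;
            }
        }
    }
    return true;
}
```

With (period, execution time) pairs of (10, 3), (20, 5), and (40, 10), the worst-case response times work out to 3, 8, and 29, all within their periods, so `rma_schedulable()` returns true. For (10, 6) and (20, 9), total utilization exceeds 100% and it returns false.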

So what happens if you are using RMA to assign priorities and you wind up with two (or more) tasks or ISRs assigned equal priority? (Translation: they have the same worst-case interarrival times.) Must they be assigned equal priority in the real system? What if the RTOS (in the case of tasks) or hardware (in the case of interrupts) doesn’t support round-robin scheduling–or even equal priorities with run-to-completion?

Interestingly, it turns out not to matter a bit whether you:

  1. Merge the two tasks into one (i.e., execute the code for Task A, then the code for Task B).
  2. Give them equal priority, either with round robin or run-to-completion behavior.
  3. Give them adjacent unequal priorities (in either relative order).

If you run through the timing diagrams for each of the above scenarios, you’ll see that all three are equivalent. The one exception is that equal priority with round robin potentially suffers a performance penalty from unnecessary additional context switches.
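For options 1 and 3, the equivalence can also be checked algebraically with the standard fixed-priority response-time recurrence, an analysis tool brought in here as a complement to the timing diagrams. Let C_A and C_B be the worst-case execution times, T the shared period, and hp the set of tasks and ISRs with higher priority than the pair, and assume both are released together at the critical instant:

```latex
% Option 1: A and B merged into one task of cost C_A + C_B
R_{AB} = C_A + C_B + \sum_{j \in hp} \left\lceil \frac{R_{AB}}{T_j} \right\rceil C_j

% Option 3: adjacent priorities, A above B, common period T
R_B = C_B + \left\lceil \frac{R_B}{T} \right\rceil C_A
          + \sum_{j \in hp} \left\lceil \frac{R_B}{T_j} \right\rceil C_j

% Whenever 0 < R_B \le T, \lceil R_B / T \rceil = 1, so the two
% recurrences are identical and reach the same fixed point:
R_B = R_{AB}
```

Swapping A and B gives the same result with the roles exchanged, which matches the claim that the relative order of the adjacent priorities does not matter.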