Posts Tagged ‘architecture’

Building Reliable and Secure Embedded Systems

Tuesday, March 13th, 2012 Michael Barr

In this era of 140 characters or less, it has been well and concisely stated that, “RELIABILITY concerns ACCIDENTAL errors causing failures, whereas SECURITY concerns INTENTIONAL errors causing failures.” In this column I expand on this statement, especially as regards the design of embedded systems and their place in our network-connected and safety-concious modern world.

As the designers of embedded systems, the first thing we must accomplish on any project is to make the hardware and software work. That is to say we need to make the system behave as it was designed to. The first iteration of this is often flaky; certain uses or perturbations of the system by testers can easily dislodge the system into a non-working state. In common parlance, “expect bugs.”

Given time, tightening cycles of debug and test can get us past the bugs and through to a shippable product. But is a debugged system good enough? Neither reliability nor security can be tested into a product. Each must be designed in from the start. So let’s take a closer look at these two important design aspects for modern embedded systems and then I’ll bring them back together at the end.

Reliable Embedded Systems

A product can be stable yet lack reliability. Consider, for example, an anti-lock braking computer installed in a car. The software in the anti-lock brakes may be bug-free, but how does it function if a critical input sensor fails?

Reliable systems are robust in the face of adverse run-time environments. Reliable systems are able to work around errors encountered as they occur to the system in the field–so that the number and impact of failures are minimized. One key strategy for building reliable systems is to eliminate single-points-of-failure. For example, redundancy could be added around that critical input sensor–perhaps by adding a second sensor in parallel with the first.

Another aspect of reliability that is under the complete control of designers (at least when they consider it from the start) are the “fail-safe” mechanisms. Perhaps a suitable but lower-cost alternative to a redundant sensor is detection of the failed sensor with a fall back to mechanical braking.

Failure Mode and Effect Analysis (FMEA) is one of the most effective and important design processes used by engineers serious about designing reliability into their systems. Following this process, each possible failure point is traced from the root failure outward to its effects. In an FMEA, numerical weights can be applied to the likelihoods of each failure as well as the seriousness of consequences. An FMEA can thus help guide you to a cost effective but higher reliability design by highlighting the most valuable places to insert the redundancy, fail-safes, or other elements that reinforce the system’s overall reliability.

In certain industries, reliability is a key driver of product safety. And that is why you see these techniques and FMEA and other design for reliability processes being applied by the designers of safety-critical automotive, medical, avionics, nuclear, and industrial systems. The same techniques can, of course, be used to make any type of embedded system more reliable.

Regardless of your industry, it is typically difficult or impossible to make your product as reliable via patches. There’s no way to add hardware like that redundant sensor, so your options may reduce to a fail-safe that is helpful but less reliable overall. Reliability cannot be patched or tested or debugged into your system. Rather, reliability must be designed in from the start.

Secure Embedded Systems

A product can also be stable yet lack security. For example, an office printer is the kind of product most of us purchase and use without giving a minute of thought to security. The software in the printer may be bug-free, but is it able to prevent a would-be eavesdropper from capturing a remote electronic copy of everything you print, including your sensitive financial documents?

Secure systems are robust in the face of persistent attack. Secure systems are able to keep hackers out by design. One key strategy for building secure systems is to validate all inputs, especially those arriving over an open network connection. For example, security could be added to a printer by ensuring against buffer overflows and encrypting and digitally signing firmware updates.

One of the unfortunate facts of designing secure embedded systems is that the hackers who want to get in only need to find and exploit a single weakness. Adding layers of security is good, but if even any one of those layers remains fundamentally weak, a sufficiently motivated attacker will eventually find and breach that defense. But that’s not an excuse for not trying.

For years, the largest printer maker in the world apparently gave little thought to the security of the firmware in its home/office printers, even as it was putting tens of millions of tempting targets out into the world. Now the security of those printers has been breached by security researchers with a reasonable awareness of embedded systems design. Said one of the lead researchers, “We can actually modify the firmware of the printer as part of a legitimate document. It renders correctly, and at the end of the job there’s a firmware update. … In a super-secure environment where there’s a firewall and no access — the government, Wall Street — you could send a résumé to print out.”

Security is a brave new world for many embedded systems designers. For decades we have relied on the fact that the microcontrollers and Flash memory and real-time operating systems and other less mainstream technologies we use will protect our products from attack. Or that we can gain enough “security by obscurity” by keeping our communications protocols and firmware upgrade processes secret. But we no longer live in that world. You must adapt.

Consider the implications of an insecure design of an automotive safety system that is connected to another Internet-connected computer in the car via CAN; or the insecure design of an implanted medical device; or the insecure design of your product.

Too often, the ability to upgrade a product’s firmware in the field is the very vector that’s used to attack. This can happen even when a primary purpose for including remote firmware updates is motivated by security. For example, as I’ve learned in my work as an expert witness in numerous cases involving reverse engineering of the techniques and technology of satellite television piracy, much of that piracy has been empowered by the same software patching mechanism that allowed the broadcasters to perform security upgrades and electronic countermeasures. Ironically, had the security smart cards in those set-top boxes had only masked ROM images the overall system security may have been higher. This was certainly not what the designers of the system had in mind. But security is also an arms race.

Like reliability, security must be designed in from the start. Security can’t be patched or tested or debugged in. You simply can’t add security as effectively once the product ships. For example, an attacker who wished to exploit a current weakness in your office printer or smart card might download his hack software into your device and write-protect his sectors of the flash today so that his code could remain resident even as you applied security patches.

Reliable and Secure Embedded Systems

It is important to note at this point that reliable systems are inherently more secure. And that, vice versa, secure systems are inherently more reliable. So, although, design for reliability and design for security will often individually yield different results–there is also an overlap between them.

An investment in reliability, for example, generally pays off in security. Why? Well, because a more reliable system is more robust in its handling of all errors, whether they are accidental or intentional. An anti-lock braking system with a fall back to mechanical braking for increased reliability is also more secure against an attack against that critical hardware input sensor. Similarly, those printers wouldn’t be at risk of fuser-induced fire in the case of a security breach if they were never at risk of fire in the case of any misbehavior of the software.

Consider, importantly, that one of the first things a hacker intent on breaching the security of your embedded device might do is to perform a (mental, at least) fault tree analysis of your system. This attacker would then target her time, talents, and other resources at one or more single points of failure she considers most likely to fail in a useful way.

Because a fault tree analysis starts from the general goal and works inward deductively toward the identification of one or more choke points that might produce the desired erroneous outcome, attention paid to increasing reliability such as via FMEA usually reduces choke points and makes the attacker’s job considerably more difficult. Where security can break down even in a reliable system is where the possibility of an attacker’s intentionally induced failure is ignored in the FMEA weighting and thus possible layers of protection are omitted.

Similarly, an investment in security may pay off in greater reliability–even without a directed focus on reliability. For example, if you secure your firmware upgrade process to accept only encrypted and digitally signed binary images you’ll be adding a layer of protection against an inadvertently corrupted binary causing an accidental error and product failure. Anything you do to improve the security of communications (i.e., checksums, prevention of buffer overflows, etc.) can have a similar effect on reliability.

The Only Way Forward

Each year it becomes increasingly important for all of us in the embedded systems design community to learn to design reliable and secure products. If you don’t, it might be your product making the wrong kind of headlines and your source code and design documents being poured over by lawyers. It is no longer acceptable to stick your head in the sand on these issues.

Embedded Software Training in a Box

Friday, May 6th, 2011 Michael Barr

Embedded Software Training in a BoxI am beaming with pride. I think we have finally achieved the holy grail of firmware training: Embedded Software Training in a Box. Priced at just $599, the kit includes Everything-You-Need-to-Know-to-Develop-Quality-Reliable-Firmware-in-C, including software for real-time safety-critical systems such as medical devices.

In many ways, this product is the culmination of about the last fifteen years of my career. The knowledge and skills imparted in the kit are drawn from my varied experiences as:

This kit also–at long last–answers the question I’ve been receiving from around the world since I first started writing articles and books about embedded programming: “Where/How can I learn to be a great embedded programmer?” I believe the answer is now as easy as: “Embedded Software Boot Camp in a Box!”

What NHTSA/NASA Didn’t Consider re: Toyota’s Firmware

Wednesday, March 2nd, 2011 Michael Barr

In a blog post yesterday (Unintended Acceleration and Other Embedded Software Bugs), I wrote extensively on the report from NASA’s technical team regarding their analysis of the embedded software in Toyota’s ETCS-i system. My overall point was that it is hard to judge the quality of their analysis (and thereby the overall conclusion that the software isn’t to blame for unintended accelerations) given the large number of redactions.

I need to put the report down and do some other work at this point, but I have a few other thoughts and observations worth writing down.

Insufficient Explanations

First, some of the explanations offered by Toyota, and apparently accepted by NASA, strike me as insufficent. For example, at pages 129-132 of Appendix A to the NASA Report there is a discussion of recursion in the Toyota firmware. “The question then is how to verify that the indirect recursion in the ETCS-i does in fact terminate (i.e., has no infinite recursion) and does not cause a stack overflow.”

“For the case of stack overflow, [redacted phrase], and therefore a stack overflow condition cannot be detected precisely. It is likely, however, that overflow would cause some form of memory corruption, which would in turn cause some bad behavior that would then cause a watchdog timer reset. Toyota relies on this assumption to claim that stack overflow does not occur because no reset occurred during testing.” (emphasis added)

I have written about what really happens during stack overflow before (Firmware-Specific Bug #4: Stack Overflow) and this explains why a reset may not result and also why it is so hard to trace a stack overflow back to that root cause. (From page 20, in NASA’s words: “The system stack is limited to just 4096 bytes, it is therefore important to secure that no execution can exceed the stack limit. This type of check is normally simple to perform in the absence of recursive procedures, which is standard in safety critical embedded software.”)

Similarly, “Toyota designed the software with a high margin of safety with respect to deadlines and timeliness. … [but] documented no formal verification that all tasks actually meet this deadline requirement.” and “All verification of timely behavior is accomplished with CPU load measurements and other measurement-based techniques.” It’s not clear to me if the NASA team is saying it buys those Toyota explanations or merely wanted to write them down. However, I do not see a sufficient explanation in this wording from page 132:

“The [worst case execution time] analysis and recursion analysis involve two distinctly different problems, but they have one thing in common: Both of their failure modes would result in a CPU reset. … These potential malfunctions, and many others such as concurrency deadlocks and CPU starvation, would eventually manifest as a spontaneous system reset.” (emphasis added)

Might not a deadlock, starvation, priority inversion, or infinite recursion be capable of producing a bit of “bad behavior” (perhaps even unintended acceleration) before that “eventual” reset? Or might not a stack overflow just corrupt one or a few important variables a little bit and that result in bad behavior rather than or before a result? These kinds of possibilities, even at very low probabilities, are important to consider in light of NASA’s calculation that the U.S.-owned Camry 2002-2007 fleet alone is running this software a cumulative one billion hours per year.

Paths Not Taken

My second observation is based upon reflection on the steps NASA might have taken in its review of Toyota’s ETCS-i firmware, but apparently did not. Specifically, there is no mention anywhere (unless it was entirely redacted) of:

  • rate monotonic analysis, which is a technique that Toyota could have used to validate the critical set of tasks with deadlines and higher priority ISRs (and that NASA could have applied in its review),
  • cyclomatic complexity, which NASA might have used as an additional winnowing tool to focus its limited time on particularly complex and hard to test routines,
  • hazard analysis and mitigation, as those terms are defined by FDA guidelines regarding software contained in medical devices, nor
  • any discussion or review of Toyota’s specific software testing regimen and bug tracking system.

Importantly, there is also a complete absence of discussion of how Toyota’s ETCS-i firmware versions evolved over time. Which makes and models (and model years) had which versions of that firmware? (Presumably there were also hardware changes worthy of note.) Were updates or patches ever made to cars once they were sold, say while at the dealer during official recalls or other types of service?

Firmware-Specific Bug #10: Jitter

Thursday, December 2nd, 2010 Michael Barr

Some real-time systems demand not only that a set of deadlines be always met but also that additional timing constraints be observed in the process. Such as managing jitter.

An example of jitter is shown in Figure 1. Here a variable amount of work (blue boxes) must be completed before every 10 ms deadline. As illustrated in the figure, the deadlines are all met. However, there is considerable timing variation from one run of this job to the next. This jitter is unacceptable in some systems, which should either start or end their 10 ms runs more precisely.

Jitter Figure 1

If the work to be performed involves sampling a physical input signal, such as reading an analog-to-digital converter, it will often be the case that a precise sampling period will lead to higher accuracy in derived values. For example, variations in the inter-sample time of an optical encoder’s pulse count will lower the precision of the velocity of an attached rotation shaft.

Best Practice: The most important single factor in the amount of jitter is the relative priority of the task or ISR that implements the recurrent behavior. The higher the priority the lower the jitter. The periodic reads of those encoder pulse counts should thus typically be in a timer tick ISR rather than in an RTOS task.

Figure 2 shows how the interval of three different 10 ms recurring samples might be impacted by their relative priorities. At the highest priority is a timer tick ISR, which executes precisely on the 10 ms interval. (Unless there are higher priority interrupts, of course.) Below that is a high-priority task (TH), which may still be able to meet a recurring 10-ms start time precisely. At the bottom, though, is a low priority task (TL) that has its timing greatly affected by what goes on at higher priority levels. As shown, the interval for the low priority task is 10 ms +/- approximately 5 ms.

Jitter Figure 2

Firmware-Specific Bug #9

What Belongs in a C .h Header File?

Wednesday, November 10th, 2010 Michael Barr

What sorts of things should you (or should you not) put in a C language .h header file? When should you create a header file? And why?

When I talk to embedded C programmers about hardware interfacing in C or Netrino’s Embedded C Coding Standard, I often come to see that they lack basic skills and information about the C programming language. This is usually because we are mostly a gang of electrical engineers who are self-taught in C (and every other programming language we use).

When the subject of header files comes up, here’s my list of do’s and don’ts:

DO create one .h header file for each “module” of the system. A module may comprise one or more compilation units (e.g., .c or .asm source code files). But it should implement just one aspect of the system. Examples of well-chosen modules are: a device driver for an A/D converter; a communication protocol, such as FTP; and an alarm manager that is solely responsible for logging error conditions and alerting the user of the active errors.

DO include in the header file all of the function prototypes for the public interface of the module it describes. For example a header file adc.h might contain function prototypes for adc_init(), adc_select_input(), and adc_read().

DON’T include in the header file any other function or macro that may lie inside the module source code. It is desirable to hide these internal “helper” functions inside the implementation. If it’s not called from any other module, hide it! (If your module spans several compilation units that need to share a helper function, then create a separate header file just for this purpose.) Module A should only call Module B through the public interface defined in moduleb.h.

DON’T include any executable lines of code in a header file, including variable declarations. But note it is necessary to make an exception for the bodies of some inline functions.

DON’T expose any variable in a header file, as is too often done by way of the ‘extern’ keyword. Proper encapsulation of a module requires data hiding: any and all internal state data in private variables inside the .c source code files. Whenever possible these variables should also be declared with keyword ‘static’ to enlist the linker’s help in hiding them.

DON’T expose the internal format of any module-specific data structure passed to or returned from one or more of the module’s interface functions. That is to say there should be no “struct { … } foo;” code in any header file. If you do have a type you need to pass in and out of your module, so client modules can create instances of it, you can simply “typedef struct foo moduleb_type” in the header file. Client modules should never know, and this way cannot know, the internal format of the struct.

Though not really specific to embedded software development, I hope this advice on good C programming practices is useful to you. If it is please let me know and I will provide more C advice in future blog posts.