Archive for the ‘Firmware Bugs’ Category

An Update on Toyota and Unintended Acceleration

Saturday, October 26th, 2013 Michael Barr

In early 2011, I wrote a couple of blog posts (here and here) as well as a later article (here) describing my initial thoughts on skimming NASA’s official report on its analysis of Toyota’s electronic throttle control system. Half a year later, I was contacted and retained by attorneys for numerous parties involved in suing Toyota for personal injuries and economic losses stemming from incidents of unintended acceleration. As a result, I got to look at Toyota’s engine source code directly and judge for myself.

Since January 2012, I’ve led a team of seven experienced engineers, including three others from Barr Group, in reviewing Toyota’s electronic throttle and some other source code, as well as related documents, in a secure room near my home in Maryland. This work proceeded in two rounds. The first round of expert reports and depositions, issued in July 2012, led to a billion-dollar economic-loss settlement as well as an undisclosed settlement of the first personal injury case set for trial in U.S. Federal Court. The second round began with my formal written expert report of more than 750 pages in April 2013 and culminated this week in an Oklahoma jury’s decision that the multiple defects in Toyota’s engine software directly caused a September 2007 single-vehicle crash that injured the driver and killed her passenger.

It is significant that this was the first and only jury so far to hear any opinions about Toyota’s software defects. Earlier cases either predated our source code access, applied a non-software theory, or were settled by Toyota for an undisclosed sum.

In our analysis of Toyota’s source code, we built upon the prior analysis by NASA. First, we looked more closely at more lines of the source code for more vehicles for more man months. And we also did a lot of things that NASA didn’t have time to do, including reviewing Toyota’s operating system’s internals, reviewing the source code for Toyota’s “monitor CPU”, performing an independent worst-case stack depth analysis, running portions of the main CPU software including the RTOS in a processor simulator, and demonstrating–in 2005 and 2008 Toyota Camry vehicles–a link between loss of throttle control and the numerous defects we found in the software.

In a nutshell, the team led by Barr Group found what the NASA team sought but couldn’t find: “a systematic software malfunction in the Main CPU that opens the throttle without operator action and continues to properly control fuel injection and ignition” that is not reliably detected by any fail-safe. To be clear, NASA never concluded software wasn’t at least one of the causes of Toyota’s high complaint rate for unintended acceleration; they just said they weren’t able to find the specific software defect(s) that caused unintended acceleration. We did.

Now it’s your turn to judge for yourself. Though I don’t think you can find my expert report outside the Court system, here are links to the trial transcript of my expert testimony to the Oklahoma jury and a (redacted) copy of the slides I shared with the jury in Bookout, et al. v. Toyota.

Note that the jury in Oklahoma found that Toyota owed each victim $1.5 million in compensatory damages and also found that Toyota acted with “reckless disregard”. The latter legal standard meant the jury was headed toward deliberations on additional punitive damages when Toyota called the plaintiffs to settle (for yet another undisclosed amount). It has been reported that an additional 400+ personal injury cases are still working their way through various courts.

Updates

On December 13, 2013, Toyota settled the case that was set for the next trial, in West Virginia in January 2014, and announced an “intensive” settlement process to try to resolve approximately 300 of the remaining personal injury cases, which are consolidated in U.S. and California courts.

Toyota continues to publicly deny there is a problem and seems to have no plans to address the unsafe design and inadequate fail-safes in its drive-by-wire vehicles–the electronics and software design of which is similar in most of the Toyota and Lexus (and possibly Scion) vehicles manufactured over roughly the last ten model years. Meanwhile, incidents of unintended acceleration continue to be reported in these vehicles (see also the NHTSA complaint database), and these new incidents, when injuries are severe, continue to result in new personal injury lawsuits against Toyota.

In March 2014, the U.S. Department of Justice announced a $1.2 billion settlement in a criminal case against Toyota. As part of that settlement, Toyota admitted that it had lied to NHTSA, Congress, and the public about unintended acceleration and that it had put its brand before public safety. Yet Toyota still has made no safety recalls for the defective engine software.

On April 1, 2014, I gave a keynote speech at the EE Live conference, which touched on the Toyota litigation in the context of lethal embedded software failures of the past and the coming era of self-driving vehicles. The slides from that presentation are available for download at http://www.barrgroup.com/killer-apps/.

On September 18, 2014, Professor Phil Koopman, of Carnegie Mellon University, presented a talk about his public findings in these Toyota cases entitled “A Case Study of Toyota Unintended Acceleration and Software Safety”.

On October 30, 2014, Italian computer scientist Roberto Bagnara presented a talk entitled “On the Toyota UA Case and the Redefinition of Product Liability for Embedded Software” at the 12th Workshop on Automotive Software & Systems, in Milan.

Building Reliable and Secure Embedded Systems

Tuesday, March 13th, 2012 Michael Barr

In this era of 140 characters or less, it has been well and concisely stated that, “RELIABILITY concerns ACCIDENTAL errors causing failures, whereas SECURITY concerns INTENTIONAL errors causing failures.” In this column I expand on this statement, especially as regards the design of embedded systems and their place in our network-connected and safety-conscious modern world.

As the designers of embedded systems, the first thing we must accomplish on any project is to make the hardware and software work. That is to say we need to make the system behave as it was designed to. The first iteration of this is often flaky; certain uses or perturbations of the system by testers can easily dislodge the system into a non-working state. In common parlance, “expect bugs.”

Given time, tightening cycles of debug and test can get us past the bugs and through to a shippable product. But is a debugged system good enough? Neither reliability nor security can be tested into a product. Each must be designed in from the start. So let’s take a closer look at these two important design aspects for modern embedded systems and then I’ll bring them back together at the end.

Reliable Embedded Systems

A product can be stable yet lack reliability. Consider, for example, an anti-lock braking computer installed in a car. The software in the anti-lock brakes may be bug-free, but how does it function if a critical input sensor fails?

Reliable systems are robust in the face of adverse run-time environments. Reliable systems are able to work around errors as they occur in the field, so that the number and impact of failures are minimized. One key strategy for building reliable systems is to eliminate single points of failure. For example, redundancy could be added around that critical input sensor–perhaps by adding a second sensor in parallel with the first.
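As a hedged sketch of that kind of redundancy (the sensor read functions and agreement threshold below are hypothetical, not drawn from any real braking system), software might cross-check two parallel sensors and report a fault when they disagree:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_DISAGREEMENT  5   /* hypothetical plausibility threshold, in raw counts */

extern int16_t read_sensor_a(void);   /* primary wheel-speed sensor (assumed)   */
extern int16_t read_sensor_b(void);   /* redundant sensor in parallel (assumed) */

/* Returns true and stores a trusted reading when the two sensors agree;
 * returns false so the caller can enter its fail-safe when they don't. */
bool read_wheel_speed(int16_t * p_speed)
{
    int16_t const a = read_sensor_a();
    int16_t const b = read_sensor_b();

    if (abs(a - b) > MAX_DISAGREEMENT)
    {
        return false;   /* sensors disagree: don't guess, let the caller fail safe */
    }

    *p_speed = (int16_t)((a + b) / 2);
    return true;
}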

Another aspect of reliability that is under the complete control of designers (at least when they consider it from the start) is the set of “fail-safe” mechanisms. Perhaps a suitable but lower-cost alternative to a redundant sensor is detection of the failed sensor with a fallback to mechanical braking.

Failure Mode and Effect Analysis (FMEA) is one of the most effective and important design processes used by engineers serious about designing reliability into their systems. Following this process, each possible failure point is traced from the root failure outward to its effects. In an FMEA, numerical weights can be applied to the likelihoods of each failure as well as the seriousness of consequences. An FMEA can thus help guide you to a cost effective but higher reliability design by highlighting the most valuable places to insert the redundancy, fail-safes, or other elements that reinforce the system’s overall reliability.
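For the arithmetic side of that weighting, here is a minimal sketch of the conventional Risk Priority Number calculation (the 1-to-10 scales are the common convention; the failure modes and scores are invented for illustration):

#include <stdint.h>
#include <stdio.h>

/* One row of a (hypothetical) FMEA worksheet. */
typedef struct
{
    char const * failure_mode;
    uint8_t      severity;     /* 1 (negligible) .. 10 (catastrophic)        */
    uint8_t      occurrence;   /* 1 (rare)       .. 10 (frequent)            */
    uint8_t      detection;    /* 1 (always detected) .. 10 (never detected) */
} fmea_row_t;

/* Risk Priority Number: a higher product marks a more valuable place to add
 * redundancy, a fail-safe, or better detection. */
static uint16_t rpn(fmea_row_t const * p_row)
{
    return (uint16_t)(p_row->severity * p_row->occurrence * p_row->detection);
}

int main(void)
{
    fmea_row_t const rows[] =
    {
        { "wheel-speed sensor open circuit", 9, 4, 3 },   /* RPN = 108 */
        { "ADC reference drift",             6, 3, 8 },   /* RPN = 144 */
    };

    for (size_t i = 0u; i < sizeof(rows) / sizeof(rows[0]); i++)
    {
        printf("%-33s RPN = %u\n", rows[i].failure_mode, (unsigned) rpn(&rows[i]));
    }

    return 0;
}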

In certain industries, reliability is a key driver of product safety. And that is why you see these techniques and FMEA and other design for reliability processes being applied by the designers of safety-critical automotive, medical, avionics, nuclear, and industrial systems. The same techniques can, of course, be used to make any type of embedded system more reliable.

Regardless of your industry, it is typically difficult or impossible to retrofit this kind of reliability via patches once the product has shipped. There’s no way to patch in hardware like that redundant sensor, so your options may reduce to a fail-safe that is helpful but leaves the system less reliable overall. Reliability cannot be patched or tested or debugged into your system. Rather, reliability must be designed in from the start.

Secure Embedded Systems

A product can also be stable yet lack security. For example, an office printer is the kind of product most of us purchase and use without giving a minute of thought to security. The software in the printer may be bug-free, but is it able to prevent a would-be eavesdropper from capturing a remote electronic copy of everything you print, including your sensitive financial documents?

Secure systems are robust in the face of persistent attack. Secure systems are able to keep hackers out by design. One key strategy for building secure systems is to validate all inputs, especially those arriving over an open network connection. For example, security could be added to a printer by ensuring against buffer overflows and encrypting and digitally signing firmware updates.
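As a minimal sketch of input validation (the length-prefixed packet layout here is an assumption for illustration, not any particular printer’s protocol), the core idea is to check every length and range before copying anything into a fixed-size buffer:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a length-prefixed job name out of a received packet, refusing any
 * input that would overflow the destination or run past the packet itself. */
bool parse_job_name(uint8_t const * p_packet, size_t packet_len,
                    char * p_name, size_t name_size)
{
    if ((p_packet == NULL) || (p_name == NULL) || (packet_len < 1u) || (name_size < 1u))
    {
        return false;
    }

    size_t const claimed_len = p_packet[0];

    /* Reject lengths that exceed either the packet or the destination buffer. */
    if ((claimed_len > (packet_len - 1u)) || (claimed_len >= name_size))
    {
        return false;
    }

    memcpy(p_name, &p_packet[1], claimed_len);
    p_name[claimed_len] = '\0';

    return true;
}

The same discipline applies to every field of every message that arrives over the network.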

One of the unfortunate facts of designing secure embedded systems is that the hackers who want to get in only need to find and exploit a single weakness. Adding layers of security is good, but if even any one of those layers remains fundamentally weak, a sufficiently motivated attacker will eventually find and breach that defense. But that’s not an excuse for not trying.

For years, the largest printer maker in the world apparently gave little thought to the security of the firmware in its home/office printers, even as it was putting tens of millions of tempting targets out into the world. Now the security of those printers has been breached by security researchers with a reasonable awareness of embedded systems design. Said one of the lead researchers, “We can actually modify the firmware of the printer as part of a legitimate document. It renders correctly, and at the end of the job there’s a firmware update. … In a super-secure environment where there’s a firewall and no access — the government, Wall Street — you could send a résumé to print out.”

Security is a brave new world for many embedded systems designers. For decades we have relied on the fact that the microcontrollers and Flash memory and real-time operating systems and other less mainstream technologies we use will protect our products from attack. Or that we can gain enough “security by obscurity” by keeping our communications protocols and firmware upgrade processes secret. But we no longer live in that world. You must adapt.

Consider the implications of an insecure design of an automotive safety system that is connected to another Internet-connected computer in the car via CAN; or the insecure design of an implanted medical device; or the insecure design of your product.

Too often, the ability to upgrade a product’s firmware in the field is the very vector that’s used to attack. This can happen even when a primary motivation for including remote firmware updates is security. For example, as I’ve learned in my work as an expert witness in numerous cases involving reverse engineering of the techniques and technology of satellite television piracy, much of that piracy has been empowered by the same software patching mechanism that allowed the broadcasters to perform security upgrades and electronic countermeasures. Ironically, had the security smart cards in those set-top boxes contained only masked ROM images, the overall system security might have been higher. This was certainly not what the designers of the system had in mind. But security is also an arms race.

Like reliability, security must be designed in from the start. Security can’t be patched or tested or debugged in. You simply can’t add security as effectively once the product ships. For example, an attacker who wished to exploit a current weakness in your office printer or smart card might download his hack software into your device and write-protect his sectors of the flash today so that his code could remain resident even as you applied security patches.

Reliable and Secure Embedded Systems

It is important to note at this point that reliable systems are inherently more secure and that, vice versa, secure systems are inherently more reliable. So although design for reliability and design for security will often individually yield different results, there is also an overlap between them.

An investment in reliability, for example, generally pays off in security. Why? Well, because a more reliable system is more robust in its handling of all errors, whether they are accidental or intentional. An anti-lock braking system with a fall back to mechanical braking for increased reliability is also more secure against an attack against that critical hardware input sensor. Similarly, those printers wouldn’t be at risk of fuser-induced fire in the case of a security breach if they were never at risk of fire in the case of any misbehavior of the software.

Consider, importantly, that one of the first things a hacker intent on breaching the security of your embedded device might do is to perform a (mental, at least) fault tree analysis of your system. This attacker would then target her time, talents, and other resources at one or more single points of failure she considers most likely to fail in a useful way.

Because a fault tree analysis starts from the general goal and works inward deductively toward the identification of one or more choke points that might produce the desired erroneous outcome, attention paid to increasing reliability such as via FMEA usually reduces choke points and makes the attacker’s job considerably more difficult. Where security can break down even in a reliable system is where the possibility of an attacker’s intentionally induced failure is ignored in the FMEA weighting and thus possible layers of protection are omitted.

Similarly, an investment in security may pay off in greater reliability–even without a directed focus on reliability. For example, if you secure your firmware upgrade process to accept only encrypted and digitally signed binary images, you’ll be adding a layer of protection against an inadvertently corrupted binary causing an accidental error and product failure. Anything you do to improve the security of communications (e.g., checksums, prevention of buffer overflows, etc.) can have a similar effect on reliability.
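As a sketch of that overlap (the image layout is hypothetical, and a production design would verify a digital signature rather than relying on a bare integrity check), even a plain CRC over a downloaded firmware image rejects accidentally corrupted updates as a side effect:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected form): slow but tiny. */
static uint32_t crc32(uint8_t const * p_data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0u; i < len; i++)
    {
        crc ^= p_data[i];
        for (unsigned int bit = 0u; bit < 8u; bit++)
        {
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }

    return ~crc;
}

/* Accept a downloaded image only if its computed CRC matches the expected
 * value carried in the (hypothetical) image header. The same check that
 * frustrates casual tampering also rejects inadvertently corrupted binaries. */
bool image_is_intact(uint8_t const * p_image, size_t image_len,
                     uint32_t expected_crc)
{
    return (crc32(p_image, image_len) == expected_crc);
}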

The Only Way Forward

Each year it becomes increasingly important for all of us in the embedded systems design community to learn to design reliable and secure products. If you don’t, it might be your product making the wrong kind of headlines and your source code and design documents being pored over by lawyers. It is no longer acceptable to stick your head in the sand on these issues.

Combining C’s volatile and const Keywords

Tuesday, January 24th, 2012 Michael Barr

Does it ever make sense to declare a variable in C or C++ as both volatile (i.e., “ever-changing”) and const (“read-only”)? If so, why? And how should you combine volatile and const properly?

One of the most consistently popular articles on the Netrino website is about C’s volatile keyword. The volatile keyword, like const, is a type qualifier. These keywords can be used by themselves or together in variable declarations.

I’ve written about volatile and const individually before. If you haven’t previously used the volatile keyword, I recommend you read How to Use C’s volatile Keyword before going on. As that article makes plain:

C’s volatile keyword is a qualifier that is applied to a variable when it is declared. It tells the compiler that the value of the variable may change at any time–without any action being taken by the code the compiler finds nearby.

How to Use C’s volatile Keyword

By declaring a variable volatile you are effectively asking the compiler to be as inefficient as possible when it comes to reading or writing that variable. Specifically, the compiler should generate object code to perform each and every read from a volatile variable and each and every write to a volatile variable–even if you write it twice in a row or read it and ignore the result. No read or write can be skipped. Effectively no optimizations are allowed with respect to volatile variables.

The use of volatile variables also creates additional sequence points in C and C++ programs. The order of accesses of volatile variables A and B in the object code must be the same as the order of those accesses in the source code. The compiler is not allowed to reorder volatile variable accesses for any reason.

Here are a couple of examples of declarations of volatile variables:

int volatile g_flag_shared_with_isr;

uint8_t volatile * p_led_reg = (uint8_t *) 0x00080000;

The first example declares a global flag that can be shared between an ISR and some other part of the code (e.g., a background processing loop in main() or an RTOS task) without fear that the compiler will optimize away (i.e., “delete”) the code you write to check for asynchronous changes to the flag’s value. It is important to use volatile to declare all variables that are shared by asynchronous software entities; this matters in any kind of multithreaded programming. (Remember, though, that access to global variables shared by tasks or with an ISR must always also be controlled via a mutex or interrupt disable, respectively.)
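Here is a minimal sketch of that pattern (the interrupt enable and disable routines are placeholders for whatever intrinsics your compiler or RTOS provides):

#include <stdint.h>

/* Placeholders: substitute your compiler's or RTOS's interrupt intrinsics. */
extern void disable_interrupts(void);
extern void enable_interrupts(void);

static int volatile g_flag_shared_with_isr = 0;

/* Interrupt service routine: signal the background loop that work is pending. */
void some_isr(void)
{
    g_flag_shared_with_isr = 1;
}

/* Background loop: because the flag is volatile, the compiler must re-read it
 * on every pass; the brief interrupt disable makes the test-and-clear atomic
 * with respect to the ISR. */
void background_task(void)
{
    for (;;)
    {
        if (g_flag_shared_with_isr)
        {
            disable_interrupts();
            g_flag_shared_with_isr = 0;
            enable_interrupts();

            /* ... handle the event here ... */
        }
    }
}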

The second example declares a pointer to a hardware register at a known physical memory address (80000h)–in this case to manipulate the state of one or more LEDs. Because the pointer to the hardware register is declared volatile, the compiler must always perform each individual write. Even if you write C code to turn an LED on followed immediately by code to turn the same LED off, you can trust that the hardware really will receive both instructions. Because of the sequence point restrictions, you are also guaranteed that the LED will be off after both lines of the C code have been executed. The volatile keyword should always be used when creating pointers to memory-mapped I/O such as this.

[See Coding Standard Rule #4: Use volatile Whenever Possible for more on the use of volatile by itself.]

How to Use C’s const Keyword

The const keyword can be used to modify function parameters as well as in variable declarations. Here we are only interested in the use of const as a type qualifier, as in:

uint16_t const max_temp_in_c = 1000;

This declaration creates a 16-bit unsigned integer value of 1,000 with a scoped name of max_temp_in_c. In C, this variable will exist in memory at run-time, but will typically be located, by the linker, in a non-volatile memory area such as ROM or flash. Any reference to the const variable will read from that location. (In C++, a const integer may no longer exist as an addressable location in run-time memory.)

Any attempt the code makes to write to a const variable directly (i.e., by its name) will result in a compile-time error. To the extent that the const variable is located in ROM or flash, an indirect write (i.e., via a pointer to its address) will also be thwarted–though at run-time, obviously.

Another use of const is to mark a hardware register as read-only. For example:

uint8_t const * p_latch_reg = (uint8_t *) 0x10000000;

With the pointer declared this way, any attempt to write to that physical memory address via the pointer (e.g., *p_latch_reg = 0xFF;) should result in a compile-time error.

[See Coding Standard Rule #2: Use const Whenever Possible for more on the use of const by itself.]

How to Use const and volatile Together

Though the essence of the volatile (“ever-changing”) and const (“read-only”) decorators may seem at first glance opposed, there are some times when it makes sense to use them both to declare one variable. The scenarios I’ve run across have involved pointers to memory-mapped hardware registers and shared memory areas.

(#1) Constant Addresses of Hardware Registers

The following declaration uses both const and volatile in the frequently useful scenario of declaring a constant pointer to a volatile hardware register.

uint8_t volatile * const p_led_reg = (uint8_t *) 0x00080000;

The proper way to read a complex declaration like this is from the name of the variable back to the left, as in:

p_led_reg IS A constant pointer TO A volatile 8-bit unsigned integer.

Reading it that way, we can see that the keyword const modifies only the pointer (i.e., the fixed address 80000h), which does not change at run-time, whereas the keyword volatile modifies only the 8-bit integer the pointer points to. This is actually quite useful and is a much safer version of the declaration of p_led_reg that appears at the top of this article. In particular, adding const means that the simple typo of a missed pointer dereference (‘*’) will be caught at compile time. That is, the mistaken code p_led_reg = LED1_ON; won’t overwrite the address with the non-80000h value of LED1_ON. The compiler error leads us to correct this to *p_led_reg = LED1_ON;, which is almost certainly what we meant to write in the first place.

(#2) Read-Only Shared-Memory Buffer

Another use for a combination of const and volatile is where you have two processors communicating via a shared memory area and you are coding the side of this communications that will only be reading from a shared memory buffer. In this case you could declare variables such as:

int const volatile comm_flag;

uint8_t const volatile comm_buffer[BUFFER_SIZE];

Of course, you’d usually want to instruct the linker to place these global variables at the correct addresses in the shared memory area or to declare the above as pointers to specific physical memory addresses. In the case of pointers, the use of const and volatile may become even more complex, as in the next category.
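For instance, the pointer form of those declarations might look like the following on the read-only side (the base address is invented for illustration):

#include <stdint.h>

#define SHARED_MEM_BASE   0x20000000u   /* hypothetical shared-memory address */

/* This CPU may never write through these pointers (const), but the other
 * CPU can change the contents at any time (volatile). */
static int const volatile * const p_comm_flag =
    (int const volatile *) SHARED_MEM_BASE;

static uint8_t const volatile * const p_comm_buffer =
    (uint8_t const volatile *) (SHARED_MEM_BASE + sizeof(int));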

(#3) Read-Only Hardware Register

Sometimes you will run across a read-only hardware register. In addition to enforcing compile-time checking so that the software doesn’t try to overwrite the memory location, you also need to be sure that each and every requested read actually occurs. By declaring your variable IS A (constant) pointer TO A constant and volatile memory location you request all of the appropriate protections, as in:

uint8_t const volatile * const p_latch_reg = (uint8_t *) 0x10000000;

As you can see, the declarations of variables that involve both the volatile and const decorators can quickly become complicated to read. But the technique of combining C’s volatile and const keywords can be useful and even important. This is definitely something you should learn to master to be a master embedded software engineer.

Firmware Forensics: Best Practices in Embedded Software Source Code Discovery

Tuesday, September 27th, 2011 Michael Barr

This article is now located here: https://experts.barrgroup.com/articles/firmware-forensics-best-practices-embedded-software-source-code-discovery

Don’t Follow These 5 Dangerous Coding Standard Rules

Tuesday, August 30th, 2011 Michael Barr

Over the summer I happened across a brief blog post by another firmware developer in which he presented ten C coding rules for better embedded C code. I had an immediate strong negative reaction to half of his rules and later came to dislike a few more, so I’m going to describe what I don’t like about each. I’ll refer to this author as BadAdvice. I hope that if you have followed rules like the five below my comments will persuade you to move away from those toward a set of embedded C coding rules that keep bugs out. If you disagree, please start a constructive discussion in the comments.

Bad Rule #1: Do not divide; use right shift.

As worded, the above rule is way too broad. It’s not possible to always avoid C’s division operator. First of all, right shifting only works as a substitute for division when it is integer division and the denominator is a power of two (e.g., right shift by one bit to divide by 2, two bits to divide by 4, etc.). But I’ll give BadAdvice the benefit of the doubt and assume that he meant to say you should “use right shift as a substitute for division whenever possible”.

For his example, BadAdvice shows code to compute an average over 16 integer data samples, which are accumulated into a variable sum, during the first 16 iterations of a loop. On the 17th iteration, the average is computed by right shifting sum by 4 bits (i.e., dividing by 16). Perhaps the worst thing about this example code is how much it is tied to a pair of #defines for the magic numbers 16 and 4. A simple but likely refactoring to average over 15 instead of 16 samples would break the example entirely; you’d have to change from the right shift to a proper divide. It’s also easy to imagine someone changing AVG_COUNT from 16 to 15 without realizing the shift count must change too, in which case the sum of 15 samples would still be right shifted by 4 bits.
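A reconstruction of the kind of code in question (the names and types here are mine, not BadAdvice’s) shows how the shift couples two magic numbers that a plain division would not:

#include <stdint.h>

#define AVG_COUNT   16    /* number of samples averaged                      */
#define AVG_SHIFT    4    /* must be kept equal to log2(AVG_COUNT): fragile  */

/* Fragile: silently wrong if AVG_COUNT ever becomes a non-power-of-two. */
uint32_t average_by_shift(uint32_t sum)
{
    return (sum >> AVG_SHIFT);
}

/* Clear: says what it means, and a decent compiler will still emit a shift
 * whenever AVG_COUNT happens to be a power of two. */
uint32_t average_by_divide(uint32_t sum)
{
    return (sum / AVG_COUNT);
}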

Better Rule: Shift bits when you mean to shift bits and divide when you mean to divide.

There are many sources of bugs in software programs. The original programmer creates some bugs. Other bugs result from misunderstandings by those who later maintain, extend, port, and/or reuse the code. Thus coding rules should emphasize readability and portability most highly. The choice to deviate from a good coding rule in favor of efficiency should be taken only within a subset of the code. Unless there is a very specific function or construct that needs to be hand optimized, efficiency concerns should be left to the compiler.

Bad Rule #2: Use variable types in relation to the maximum value that variable may take.

BadAdvice gives the example of a variable named seconds, which holds integer values from 0 to 59. And he shows choosing char for the type over int. His stated goal is to reduce memory use.

In principle, I agree with the underlying practices of not always declaring variables int and choosing the type (and signedness) based on the maximum range of values. However, I think it essential that any practice like this be matched with a corresponding practice of always declaring specifically sized variables using C99’s portable fixed-width integer types.

It is impossible to understand the reasoning of the original programmer from unsigned char seconds;. Did he choose char because it is big enough or for some other reason? (Remember too that a plain char may be naturally signed or unsigned, depending on the compiler. Perhaps the original programmer even knows his compiler’s chars are default unsigned and omits that keyword.) The intent behind variables declared short and long is at least as difficult to decipher. A short integer may be 16-bits or 32-bits (or something else), depending on the compiler; a width the original programmer may have (or may not have) relied upon.

Better Rule: Whenever the width of an integer matters, use C99’s portable fixed-width integer types.

A variable declared uint16_t leaves no doubt about the original intent as it is very clearly meant to be a container for an unsigned integer value no wider than 16-bits. This type selection adds new and useful information to the source code and makes programs both more readable and more portable. Now that C99 has standardized the names of fixed-width integer types, declarations involving short and long should no longer be used. Even char should only be used for actual character (i.e., ASCII) data. (Of course, there may still be int variables around, where size does not matter, such as in loop counters.)
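For example (the variable names here are illustrative):

#include <stdint.h>

uint8_t  seconds;          /* holds 0..59: unsigned and 8 bits by intent    */
uint16_t adc_reading;      /* a 10- or 12-bit conversion result             */
int32_t  position_error;   /* signed value that genuinely needs 32 bits     */

int      loop_counter;     /* plain int is fine where width doesn't matter  */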

Bad Rule #3: Avoid >= and use <.

As worded above, I can’t say I understand this rule or its goal sufficiently, but to illustrate it BadAdvice gives the specific example of an if-else if wherein he recommends if (speed < 100) ... else if (speed > 99) instead of if (speed < 100) ... else if (speed >= 100). Say what? First of all, why not just use a plain else for that specific scenario, since speed must be either below 100 or at least 100.

Even if we assume we need to test for less than 100 first and then for greater than or equal to 100 second, why would anyone in their right mind prefer to use greater than 99? That would be confusing to any reader of the code. To me it reads like a bug and I need to keep going back over it to find the logical problem with the apparently mismatched range checks. Additionally, I believe that BadAdvice’s terse rationale that “Benefits: Lesser Code” is simply untrue. Any half decent compiler should be able to optimize either comparison as needed for the underlying processor.

Better Rule: Use whatever comparison operator is easiest to read in a given situation.

One of the very best things any embedded programmer can do is to make their code as readable as possible to as broad an audience as possible. That way another programmer who needs to modify your code, a peer doing code review to help you find bugs, or even you years later, will find the code hard to misinterpret.

Bad Rule #4: Avoid variable initialization while defining.

BadAdvice says that following the above rule will make initialization faster. He gives the example of unsigned char MyVariable = 100; (not preferred) vs:


#define INITIAL_VALUE 100
unsigned char MyVariable;
// Before entering forever loop in main
MyVariable = INITIAL_VALUE;

Though it’s unclear from the above, let’s assume that MyVariable is a local stack variable. (It could also be global, the way his pseudo code is written.) I don’t think there should be a (portably) noticeable efficiency gain from switching to the latter. And I do think that following this rule creates an opening to forget to do the initialization or to unintentionally place the initialization code within a conditional clause.

Better Rule: Initialize every variable as soon as you know the initial value.

I’d much rather see every variable initialized on creation, with the creation of the variable perhaps postponed as long as possible. If you’re using a C99 or C++ compiler, you can declare a variable anywhere within the body of a function.
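A short sketch of the contrast (read_adc() is a hypothetical helper):

#include <stdint.h>

uint16_t read_adc(void);   /* hypothetical helper that returns a fresh sample */

void process_sample(void)
{
    /* Preferred: the variable comes into existence already holding a
     * meaningful value, as late in the function as C99 allows. */
    uint16_t const sample = read_adc();

    /* Risky alternative, per Bad Rule #4:
     *
     *     uint16_t sample;
     *     ...                    // a later edit can return or branch here
     *     sample = read_adc();   // ... leaving sample uninitialized below
     */

    (void) sample;   /* placeholder for the real processing */
}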

Bad Rule #5: Use #defines for constant numbers.

The example given for this rule is of defining three constant values, including #define ON 1 and #define OFF 0. The rationale is “Increased convenience of changing values in a single place for the whole file. Provides structure to the code.” And I agree that using named constants instead of magic numbers elsewhere in the code is a valuable practice. However, I think there is an even better way to go about this.

Better Rule: Declare constants using const or enum.

C’s const keyword can be used to declare a variable of any type as one that may not be changed at run-time. This is a preferable way of declaring constants because they are thereby given a type, which allows comparisons to be made properly and lets the compiler type-check them when they are passed as parameters to function calls. Enumeration sets may be used instead for integer constants that come in groups, such as enum { OFF = 0, ON };.
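For example (the names are illustrative):

#include <stdint.h>

/* A typed, scoped constant the compiler can check in expressions and at
 * function-call boundaries. */
uint16_t const max_temp_in_c = 1000u;

/* Related integer constants grouped as an enumeration. */
enum output_state
{
    OFF = 0,
    ON  = 1
};

void set_heater(enum output_state state);   /* hypothetical API */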

Final Thoughts

There are two scary things about these and a few of the other rules on BadAdvice’s blog. First is that they are out there on the Internet to be found with a search for embedded C coding rules. Second is that BadAdvice’s bio says he works on medical device design. I’m not sure which is worse. But I do hope the above reasoning and proposed better rules get you thinking about how to develop more reliable embedded software with fewer bugs.