Archive for the ‘Firmware Bugs’ Category

Is it a Bug or an Error?

Wednesday, January 31st, 2018 Michael Barr

Probably you’ve heard the story of how Adm. Grace Hopper attached a moth that had been dislodged from a relay in the Harvard Mark II computer to an engineering notebook and labeled it the “First actual case of bug being found.”

[Photo: the moth taped into the Mark II log book]

Designers of electronics, including Thomas Edison, had been using the term “bug” for decades. But it was mostly after this amusing 1947 event that the use of words like “bugs” and “debugging” took off in the emerging software realm.

So why is it that if a bridge collapses we say it was a failure of the design and not attributable to a mere “bug”? As if it were an external force or an act of God that caused the failure? Why do only software engineers get this linguistic pass when failures are caused by their mistakes, just as they are for other types of engineers?

Failures of software are commonplace, everyday events. Yet such failures are not typically the result of a moth or other “actual bug”. Each such failure is instead caused by human error: some mistake has been made either in the requirements or in the implementation, and these human mistakes then have real-world consequences, including sometimes compromising the safety and security of product users.

Should we, as a community of professionals, stop using the word “bug” and instead replace it with some other, more honest term such as “error” or “mistake”? Might this help to raise the seriousness with which we approach our work and thereby the safety of the users of our products? What do you think? Comment below.

Cyberspats on the Internet of Things

Thursday, April 6th, 2017 Michael Barr

When you hear the words “weaponization” and “internet” in close proximity you naturally assume the subject is the use of hacks and attacks by terrorists and nation-state actors.

But then comes today’s news about an IoT garage door startup that remotely disabled a customer’s opener in response to a negative review. In a nutshell, a man bought the startup’s Internet-connected opener, installed it in his home, was disappointed with the quality, and wrote negative reviews on the company’s website and Amazon. In response, the company disabled his unit.

In context of the explosion of Internet connections in embedded systems, this prompts several thoughts.

First and foremost: What does it mean to buy or own a product that relies for some functionality on a cloud-based server that you might not always be able to access? Is it your garage door opener or the manufacturer’s? And how much is that determined by fine print in a contract you’ll need a lawyer to follow?

Additionally: What if, in this specific situation, the company hadn’t made any public statements at all and had just remotely made the customer’s garage door opener less functional? There’d have then been no fodder for a news story. The company would’ve gotten its “revenge” on the customer. And the customer might never have known anything except that the product wasn’t to his liking. Investigating might cost him time and money he did not have.

It’s almost certainly the case that this company would have seen better business outcomes if it had quietly disabled the unit in question. And there are so many other, more insidious ways to go about it, including bricking the unit, refusing it future firmware updates, or even subtly downgrading its functionality.

Which brings us back to the weaponization of the Internet. Consumers have no choice but to trust the makers of their products, who have complete knowledge of the hardware and software design (and maybe also the digital signatures needed to make secure firmware updates). And these companies typically have all kinds of identifying data about individual customers: name, geographic location, phone number and email address, product usage history, credit card numbers, etc. So what happens when the makers of those products are unhappy with one or more customers: from those posting bad product reviews all the way up to politicians and celebrities they may dislike?

Perhaps private companies are already attacking specific customers in subtle ways… How would we know?

A Look Back at the Audi 5000 and Unintended Acceleration

Friday, March 14th, 2014 Michael Barr

I was in high school in the late 1980’s when NHTSA (pronounced “nit-suh”), Transport Canada, and others studied complaints of unintended acceleration in Audi 5000 vehicles. Looking back on the Audi issues, and in light of my own recent role as an expert investigating complaints of unintended acceleration in Toyota vehicles, there appears to be a fundamental contradiction between the way that Audi’s problems are remembered now and what NHTSA had to say officially at the time.

Here’s an example from a pretty typical remembrance of what happened, from a 2007 article written “in defense of Audi”:

In 1989, after three years of study[], the National Highway Traffic Safety Administration (NHTSA) issued their report on Audi’s “sudden unintended acceleration problem.” NHTSA’s findings fully exonerated Audi… The report concluded that the Audi’s pedal placement was different enough from American cars’ normal set-up (closer to each other) to cause some drivers to mistakenly press the gas instead of the brake.

And here’s what NHTSA’s official Audi 5000 report actually concluded:

Some versions of Audi idle-stabilization system were prone to defects which resulted in excessive idle speeds and brief unanticipated accelerations of up to 0.3g. These accelerations could not be the sole cause of [long-duration unintended acceleration incidents], but might have triggered some [of the long-duration incidents] by startling the driver.

Contrary to the modern article, NHTSA’s original report most certainly did not “fully exonerate” Audi. Similarly, though there were differences in pedal configuration compared to other cars, NHTSA seems to have concluded that the first thing that happened was a sudden unexpected surge of engine power that startled drivers and that pedal misapplication sometimes followed.

This sequence of, first, a throttle malfunction and, then, pedal confusion was summarized in a 2012 review study by NHTSA:

Once an unintended acceleration had begun, in the Audi 5000, due to a failure in the idle-stabilizer system (producing an initial acceleration of 0.3g), pedal misapplication resulting from panic, confusion, or unfamiliarity with the Audi 5000 contributed to the severity of the incident.

The 1989 NHTSA report elaborates on the design of the throttle, which included an “idle-stabilization system” and documents that multiple “intermittent malfunctions of the electronic control unit were observed and recorded”. In a nutshell, the Audi 5000 had a main mechanical throttle control, wherein the gas pedal pushed and pulled on the throttle valve with a cable, as well as an electronic throttle control idle adjustment.

It is unclear whether the “electronic control unit” mentioned by NHTSA was purely electronic or if it also had embedded software. (ECU, in modern lingo, includes firmware.) It is also unclear what percentage of the Audi 5000 unintended acceleration complaints were short-duration events vs. long-duration events. If there was software in the ECU and short-duration events were more common, that would lead to some interesting questions. Did NHTSA and the public learn all of the right lessons from the Audi 5000 troubles?

Lethal Software Defects: Patriot Missile Failure

Thursday, March 13th, 2014 Michael Barr

During the Gulf War, twenty-eight U.S. soldiers were killed and almost one hundred others were wounded when a nearby Patriot missile defense system failed to properly track a Scud missile launched from Iraq. The cause of the failure was later found to be a programming error in the computer embedded in the Patriot’s weapons control system.

On February 25, 1991, Iraq successfully launched a Scud missile that hit a U.S. Army barracks near Dhahran, Saudi Arabia. The 28 deaths from that one Scud constituted the single deadliest incident of the war for American soldiers. Interestingly, the “Dhahran Scud”, which killed more people than all 70 or so of the earlier Scud launches, was apparently the last Scud fired in the Gulf War.

Unfortunately, the “Dhahran Scud” succeeded where the other Scuds failed because of a defect in the software embedded in the Patriot missile defense system. This same bug was latent in all of the Patriots deployed in the region. However, the presence of the bug was masked by the fact that a particular Patriot weapons control computer had to be continuously running for several days before the bug could cause the hazard of a failure to track a Scud.

There is a nice concise write-up of the problem, with a prefatory background on how the Patriot system is designed to work, in the official post-failure analysis report by the U.S. General Accounting Office (GAO IMTEC-92-26) entitled “Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia”.

The hindsight explanation is that:

a software problem “led to an inaccurate tracking calculation that became worse the longer the system operated” and states that “at the time of the incident, the [Patriot] had been operating continuously for over 100 hours” by which time “the inaccuracy was serious enough to cause the system to look in the wrong place [in the radar data] for the incoming Scud.”

Detailed Analysis

The GAO report does not go into the technical details of the specific programming error. However, I believe we can infer the following based on the information and data that is provided about the incident and about the defect.

A first important observation is that the CPU was a 24-bit integer-only CPU “based on a 1970s design”. Befitting the time, the code was written in assembly language.

A second important observation is that real numbers (i.e., those with fractions) were apparently manipulated as a whole number in binary in one 24-bit register plus a binary fraction in a second 24-bit register. In this fixed-point numerical system, the real number 3.25 would be represented as binary 000000000000000000000011:010000000000000000000000, in which the : is my marker for the separator between the whole and fractional portions of the real number. The first half of that binary represents the whole number 3 (i.e., bits are set for 2 and 1, the sum of which is 3). The second portion represents the fraction 0.25 (i.e., 0/2 + 1/4 + 0/8 + …).
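
As a concrete illustration of that split representation, here is a minimal sketch in C, assuming a 24-bit whole part and a 24-bit binary fraction held in separate variables. The type and helper names are mine for illustration only; the actual Patriot code was assembly on a 24-bit machine, not C.

#include <stdint.h>
#include <stdio.h>

#define FRACTION_BITS 24

/* Hypothetical model of the split representation described above:
 * one 24-bit register for the whole part, one for the binary fraction.
 * Positive values only, for illustration. */
typedef struct {
    uint32_t whole;     /* lower 24 bits used                          */
    uint32_t fraction;  /* lower 24 bits used; value = fraction / 2^24 */
} fixed24_24_t;

static fixed24_24_t to_fixed(double value)
{
    fixed24_24_t f;
    f.whole    = (uint32_t)value & 0x00FFFFFFu;
    f.fraction = (uint32_t)((value - (uint32_t)value) * (1u << FRACTION_BITS))
                 & 0x00FFFFFFu;
    return f;
}

static double to_double(fixed24_24_t f)
{
    return (double)f.whole + (double)f.fraction / (double)(1u << FRACTION_BITS);
}

int main(void)
{
    /* 3.25 -> whole = 0x000003, fraction = 0x400000 (i.e., 0.25 * 2^24) */
    fixed24_24_t x = to_fixed(3.25);
    printf("whole = 0x%06X  fraction = 0x%06X  value = %f\n",
           (unsigned)x.whole, (unsigned)x.fraction, to_double(x));
    return 0;
}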

A third important observation is that system [up]time was “kept continuously by the system’s internal clock in tenths of seconds [] expressed as an integer.” This is important because the fraction 1/10 cannot be represented exactly in a 24-bit binary fraction: its binary expansion (0.0001100110011…) repeats forever, so any fixed number of bits must truncate it.

I understand that the missile-interception algorithm that did not work that day is approximately as follows (a rough sketch in code appears after the list):

  1. Consider each object that might be a Scud missile in the 3-D radar sweep data.
  2. For each, calculate an expected next location at the known speed of a Scud (+/- an acceptable window).
  3. Check the radar sweep data again at a future time to see if the object is in the location a Scud would be.
  4. If it is a Scud, engage and fire missiles.
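
To make the steps above concrete, here is a minimal sketch in C of the kind of range-gate check being described. This is illustrative only, not Patriot source code; every name, type, and constant is hypothetical. The point to notice is that the predicted position depends on the elapsed time between two radar sweeps, which is derived from system uptime.

/* Hypothetical illustration of a range-gate check; not Patriot code. */
typedef struct {
    double range_m;      /* object range from a radar return           */
    double velocity_mps; /* expected closing speed of a Scud           */
    double time_s;       /* system uptime when the return was observed */
} radar_return_t;

#define GATE_HALF_WIDTH_M  50.0   /* illustrative acceptance window */

/* Predict where a Scud-speed object should appear at the next sweep and
 * report whether the new return falls inside that range gate. */
static int is_in_range_gate(const radar_return_t *first,
                            const radar_return_t *next)
{
    double elapsed_s   = next->time_s - first->time_s;
    double predicted_m = first->range_m - (first->velocity_mps * elapsed_s);
    double offset_m    = next->range_m - predicted_m;

    return (offset_m > -GATE_HALF_WIDTH_M) && (offset_m < GATE_HALF_WIDTH_M);
}

If the two uptime values feeding elapsed_s carry errors that grow at different rates, the gate drifts away from where the target actually is, which is exactly the failure mode described next.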

Furthermore, the GAO reports that the problem was an accumulating linear error of .003433 seconds per 1 hour of uptime that affected every deployed Patriot equally. This was not a clock-specific or system-specific issue.
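
A quick back-of-the-envelope check shows that the GAO’s figure is consistent with chopping 1/10 to a 24-bit register. The per-tick truncation error of roughly 0.000000095 seconds and the Scud closing speed of about 1,676 m/s used below are commonly cited approximations, not figures I am attributing to the GAO report itself:

#include <stdio.h>

/* Back-of-the-envelope check of the accumulating clock error. The
 * per-tick error and Scud speed are approximations for illustration. */
int main(void)
{
    const double error_per_tick_s = 0.000000095; /* chop of 1/10 to ~24 bits */
    const double ticks_per_hour   = 36000.0;     /* 10 ticks/s * 3600 s/hr   */
    const double scud_speed_mps   = 1676.0;      /* roughly Mach 5           */

    double drift_per_hour_s = error_per_tick_s * ticks_per_hour; /* ~0.0034 s */
    double drift_100h_s     = drift_per_hour_s * 100.0;          /* ~0.34 s   */

    printf("drift per hour:  %.6f s\n", drift_per_hour_s);
    printf("drift at 100 h:  %.3f s\n", drift_100h_s);
    printf("range error:    ~%.0f m at Scud speed\n",
           drift_100h_s * scud_speed_mps);
    return 0;
}

That works out to roughly a third of a second of clock error after 100 hours of uptime, which at Scud speeds corresponds to a range-gate placement error of several hundred meters: easily enough to “look in the wrong place” in the radar data.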

Given all of the above, I reason that the problem was that one part of the Scud-interception calculations used time in its decimal representation while another used the fixed-point binary representation. When uptime was still low, the two representations agreed closely enough that targets were found where they were expected to be, and the latent software bug remained hidden.

Of course, all of the above detail is specific to the Patriot hardware and software design that was in use at the time of the Gulf War. As the Patriot system has since been modernized by Raytheon, many details like these will have likely changed.

According to the GAO report:

Army officials [] believed the Israeli experience was atypical [and that] other Patriot users were not running their systems for 8 or more hours at a time. However, after analyzing the Israeli data and confirming some loss in targeting accuracy, the officials made a software change which compensated for the inaccurate time calculation. This change allowed for extended run times and was included in the modified software version that was released [9 days before the Dhahran Scud incident]. However, Army officials did not use the Israeli data to determine how long the Patriot could operate before the inaccurate time calculation would render the system ineffective.

Four days before the deadly Scud attack, the “Patriot Project Office [in Huntsville, Alabama] sent a message to Patriot users stating that very long run times could cause [targeting problems].” That was about the time of the last reboot of the Patriot battery that failed.

Note that if the time samples had all been taken in the decimal timebase or all in the binary timebase, then the two compared timestamps would have carried the same error, the computed interval between radar samples would have stayed accurate, and the error would not have accumulated with uptime. That is the likely fix that was implemented.

Firmware Updates

Here are a few tangentially interesting tidbits from the GAO report:

  • “During the [Gulf War] the Patriot’s software was modified six times.”
  • “Patriots had to be shut down for at least 1 to 2 hours to install each software modification.”
  • “Rebooting[] takes about 60 to 90 seconds” and sets the “time back to zero.”
  • The “[updated] software, which compensated for the inaccurate time calculation, arrived in Dhahran” the day after the deadly attack.

Public Statements

In hindsight, there are some noteworthy quotes from the 1991 news articles initially reporting on this incident. For example,

Brig. Gen. Neal, United States Command (2 days after):

The Scud apparently fragmented above the atmosphere, then tumbled downward. Its warhead blasted an eight-foot-wide crater into the center of the building, which is three miles from a major United States air base … Our investigation looks like this missile broke apart in flight. On this particular missile it wasn’t in the parameters of where it could be attacked.

U.S. Army Col. Garnett, Patriot Program Director (4 months after):

The incident was an anomaly that never showed up in thousands of hours of testing and involved an unforeseen combination of dozens of variables — including the Scud’s speed, altitude and trajectory.

Importantly, the GAO report states that, a few weeks before the Dhahran Scud, Israeli soldiers reported to the U.S. Army that their Patriot had a noticeable “loss in accuracy after … 8 consecutive hours.” Thus, apparently, all of this “thousands of hours” of testing involved frequent reboots. (I can envision the test documentation now: “Step 1: Power up the Patriot. Step 2: Check that everything is perfect. Step 3: Fire the dummy target.”) The GAO reported that “an endurance test has [since] been conducted to ensure that extended run times do not cause other system difficulties.”

Note too that the quote about “thousands of hours of testing” was also misleading in that the Patriot software was, also according to the GAO report, hurriedly modified in the months leading up to the Gulf War to track Scud missiles going about 2.5 times faster than the aircraft and cruise missiles it was originally designed to intercept. Improvements to the Scud-specific tracking/engagement algorithms were apparently even being made during the Gulf War.

These specific theories and statements about what went wrong, or why it must have been a problem outside the Patriot itself, were fully discredited once the source code was examined. When computer systems may have misbehaved in a lethal manner, it is important to remember that newspaper quotes from those on the side of the designers are not scientific evidence. Indeed, the humans who offer those quotes often have conscious and/or subconscious motives and blind spots that lead them to be falsely overconfident in the computer systems. A thorough source code review takes time but is the scientific way to go about finding the root cause.

As a New York Times editorial dated 4 months after the incident explained:

The Pentagon initially explained that Patriot batteries had withheld their fire in the belief that Dhahran’s deadly Scud had broken up in midflight. Only now does the truth about the tragedy begin to emerge: A computer software glitch shut down the Patriot’s radar system, blinding Dhahran’s anti-missile batteries. It’s not clear why, even after Army investigators had reached this conclusion, the Pentagon perpetuated its fiction

At least in this case, it was only a few months before the U.S. Army admitted the truth about what happened, to itself and to the public. That is to the U.S. Army’s credit. Other actors in other lethal software defect cases have been far more reluctant to admit what has later become clear about their systems.

Apple’s #gotofail SSL Security Bug was Easily Preventable

Monday, March 3rd, 2014 Michael Barr

If programmers at Apple had simply followed a couple of the rules in the Embedded C Coding Standard, they could have prevented the very serious “gotofail” SSL bug from entering the iOS and OS X operating systems. Here’s a look at the programming mistakes involved and the easy-to-follow coding standard rules that could have prevented the bug.

In case you haven’t been following the computer security news, Apple last week posted security updates for users of devices running iOS 6, iOS 7, and OS X 10.9 (Mavericks). This was prompted by a critical bug in Apple’s implementation of the SSL/TLS protocol, which has apparently been lurking for over a year.

In a nutshell, the bug is that a bunch of important C source code lines containing digital signature certificate checks were never being run because an extraneous goto fail; statement in a portion of the code was always forcing a jump. This is a bug that put millions of people around the world at risk for man-in-the-middle attacks on their apparently-secure encrypted connections. Moreover, Apple should be embarrassed that this particular bug also represents a clear failure of software process at Apple.

There is debate about whether this may have been a clever insider-enabled security attack against all of Apple’s users, e.g., by a certain government agency. However, whether it was an innocent mistake or an attack designed to look like an innocent mistake, Apple could have and should have prevented this error by writing the relevant portion of code in a simple manner that would have always been more reliable as well as more secure. And thus, in my opinion, Apple was clearly negligent.

Here are the lines of code at issue (from Apple’s open source code server); the extraneous goto is the second of the two consecutive goto fail; statements:

static OSStatus
SSLVerifySignedServerKeyExchange(SSLContext *ctx, bool isRsa, SSLBuffer signedParams, ...)
{
    OSStatus  err;
    ...

    if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
        goto fail;
    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
        goto fail;
        goto fail;
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
        goto fail;
    ...

fail:
    SSLFreeBuffer(&signedHashes);
    SSLFreeBuffer(&hashCtx);
    return err;
}

The code above violates at least two rules from Barr Group’s Embedded C Coding Standard book. Importantly, had Apple followed the first of these rules in particular, this dangerous bug would almost certainly have been prevented from ever getting into even a single device.

Rule 1.3.a

Braces shall always surround the blocks of code (a.k.a., compound statements), following if, else, switch, while, do, and for statements; single statements and empty statements following these keywords shall also always be surrounded by braces.

Had Apple not violated this always-braces rule in the SSL/TLS code above, there would have been either just one set of curly braces after each if test or a very odd-looking, hard-to-miss chunk of code with two sets of curly braces after the if with two gotos. Either way, this bug was preventable by following this rule and performing code review.
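
For comparison, here is how the three checks from the excerpt would look with the always-braces rule applied (the rest of the function is unchanged; this is my reconstruction, not Apple’s actual fix):

    if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
    {
        goto fail;
    }
    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    {
        goto fail;
    }
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
    {
        goto fail;
    }

With this layout, an accidentally duplicated goto fail; would either land harmlessly inside the braces or stand out as an obviously stray statement between two brace-delimited blocks.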

Rule 1.7.c

The goto keyword shall not be used.

Had Apple not violated this never-goto rule in the SSL/TLS code above, there would not have been a duplicated goto fail; line to create the unreachable-code situation. And if eliminating the gotos had forced each error check to be handled with more than one line of code, that alone would have pushed the programmers toward using curly braces.
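
One of several possible goto-free restructurings of the same checks is sketched below; it is not Apple’s code, just an illustration of the rule. The first non-zero status short-circuits the remaining steps, and the buffer cleanup always runs before returning:

    OSStatus err = SSLHashSHA1.update(&hashCtx, &serverRandom);

    if (err == 0)
    {
        err = SSLHashSHA1.update(&hashCtx, &signedParams);
    }
    if (err == 0)
    {
        err = SSLHashSHA1.final(&hashCtx, &hashOut);
    }
    /* ... the checks elided in the excerpt would follow the same pattern ... */

    SSLFreeBuffer(&signedHashes);
    SSLFreeBuffer(&hashCtx);
    return err;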

On a final note, Apple should be asking its engineers and engineering managers about the failures of process (at several layers) that must have occurred for this bug to have made it into end users’ devices. Specifically:

  • Where was the peer code review that should have spotted this, or how did the reviewers fail to spot this?
  • Why wasn’t a coding standard rule adopted to make such bugs easier to spot during peer code reviews?
  • Why wasn’t a static analysis tool, such as Klocwork, used, or how did it fail to detect the unreachable code that followed? Or was it users of such a tool, at Apple, who failed to act?
  • Where was the regression test case for a bad SSL certificate signature, or how did that test fail?

Dangerous bugs, like this one from Apple, often result from a combination of accumulated errors in the face of flawed software development processes. Too few programmers recognize that many bugs can be kept entirely out of a system simply by adopting (and rigorously enforcing) a coding standard that is designed to keep bugs out.