Lethal Software Defects: Patriot Missile Failure

Thursday, March 13th, 2014 by Michael Barr

During the Gulf War, twenty-eight U.S. soldiers were killed and almost one hundred others were wounded when a nearby Patriot missile defense system failed to properly track a Scud missile launched from Iraq. The cause of the failure was later found to be a programming error in the computer embedded in the Patriot’s weapons control system.

On February 25, 1991, Iraq successfully launched a Scud missile that hit a U.S. Army barracks near Dhahran, Saudi Arabia. The 28 deaths caused by that one Scud made it the single deadliest incident of the war for American soldiers. Interestingly, the “Dhahran Scud”, which killed more people than all 70 or so of the earlier Scud launches, was apparently the last Scud fired in the Gulf War.

Unfortunately, the “Dhahran Scud” succeeded where the other Scuds failed because of a defect in the software embedded in the Patriot missile defense system. This same bug was latent in all of the Patriots deployed in the region. However, the presence of the bug was masked by the fact that a particular Patriot weapons control computer had to be continuously running for several days before the bug could cause the hazard of a failure to track a Scud.

There is a nice, concise write-up of the problem, with a prefatory background on how the Patriot system is designed to work, in the official post-failure analysis report by the U.S. General Accounting Office (GAO IMTEC-92-26) entitled “Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia”.

In hindsight, the report explains that:

a software problem “led to an inaccurate tracking calculation that became worse the longer the system operated.” The report adds that “at the time of the incident, the [Patriot] had been operating continuously for over 100 hours” by which time “the inaccuracy was serious enough to cause the system to look in the wrong place [in the radar data] for the incoming Scud.”

Detailed Analysis

The GAO report does not go into the technical details of the specific programming error. However, I believe we can infer the following based on the information and data that is provided about the incident and about the defect.

A first important observation is that the CPU was a 24-bit integer-only CPU “based on a 1970s design”. Befitting the time, the code was written in assembly language.

A second important observation is that real numbers (i.e., those with fractions) were apparently manipulated as a whole number in binary in one 24-bit register plus a binary fraction in a second 24-bit register. In this fixed-point numerical system, the real number 3.25 would be represented as binary 000000000000000000000011:010000000000000000000000, in which the : is my marker for the separator between the whole and fractional portions of the real number. The first half of that binary represents the whole number 3 (i.e., bits are set for 2 and 1, the sum of which is 3). The second portion represents the fraction 0.25 (i.e., 0/2 + 1/4 + 0/8 + …).
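The whole:fraction layout described above can be reproduced with a short sketch. The function names (`to_fixed`, `format_fixed`) and the choice to truncate rather than round are my own assumptions for illustration; the GAO report does not describe the actual register handling.

```python
from fractions import Fraction

FRACTION_BITS = 24  # width of the fractional register, as inferred above


def to_fixed(value):
    """Split a non-negative value into (whole, fraction) register contents.

    The fraction is truncated, not rounded (an assumption for illustration).
    """
    whole = int(value)  # integer part
    frac = int((value - whole) * (1 << FRACTION_BITS))  # scaled fractional part
    return whole, frac


def format_fixed(whole, frac):
    """Render both registers as 24-bit binary strings separated by ':'."""
    return f"{whole:024b}:{frac:024b}"


# 3.25 is exactly representable: 3 in the first register, 1/4 in the second.
print(format_fixed(*to_fixed(Fraction(13, 4))))
```

Values such as 3.25 convert without loss because their fractional part is a finite sum of powers of 1/2.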

A third important observation is that system [up]time was “kept continuously by the system’s internal clock in tenths of seconds [] expressed as an integer.” This is important because the fraction 1/10 cannot be perfectly represented in 24 bits of binary fraction: its binary expansion (a sum of terms, each a 0 or 1 divided by 2^n) does not terminate.
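A minimal sketch of why 1/10 cannot be stored exactly: generating the binary digits one at a time shows the pattern 0011 repeating forever, so any finite register must cut it off. The helper name `binary_fraction_bits` is hypothetical, and truncation is assumed rather than confirmed by the report.

```python
from fractions import Fraction


def binary_fraction_bits(value, n_bits):
    """Return the first n_bits binary digits of a value in [0, 1)."""
    bits = []
    for _ in range(n_bits):
        value *= 2
        bit = int(value)  # next digit is the integer part after doubling
        bits.append(str(bit))
        value -= bit
    return "".join(bits)


tenth = Fraction(1, 10)
bits = binary_fraction_bits(tenth, 24)
print(bits)  # 000110011001100110011001

# The stored value is slightly smaller than 1/10; the remainder is never zero.
stored = Fraction(int(bits, 2), 1 << 24)
print(float(tenth - stored))
```

However many bits are kept, the stored value falls short of 1/10 by a small positive amount, and that shortfall is what gets multiplied by the tick count.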

I understand that the missile-interception algorithm that did not work that day is approximately as follows:

  1. Consider each object that might be a Scud missile in the 3-D radar sweep data.
  2. For each, calculate an expected next location at the known speed of a Scud (+/- an acceptable window).
  3. Check the radar sweep data again at a future time to see if the object is in the location a Scud would be.
  4. If it is a Scud, engage and fire missiles.
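The steps above can be sketched in one dimension. Everything here is hypothetical: the `Track` type, the function names, and the numbers (speed, gate width, time error) are illustrative values I chose, not figures from the GAO report or the Patriot software.

```python
from dataclasses import dataclass


@dataclass
class Track:
    position_m: float    # last observed position along the flight path
    velocity_mps: float  # assumed speed of the candidate object


def predict_position(track, dt_s):
    """Step 2: where the object should be after dt_s seconds."""
    return track.position_m + track.velocity_mps * dt_s


def is_in_gate(observed_m, track, dt_s, gate_half_width_m):
    """Step 3: is the later observation inside the expected window?"""
    return abs(observed_m - predict_position(track, dt_s)) <= gate_half_width_m


candidate = Track(position_m=0.0, velocity_mps=1600.0)

# Correct elapsed time: the object is found where a Scud would be.
print(is_in_gate(805.0, candidate, dt_s=0.5, gate_half_width_m=50.0))

# Same observation, but the elapsed time is wrong by about a third of a second:
# the system looks in the wrong place and the object is rejected.
print(is_in_gate(805.0, candidate, dt_s=0.84, gate_half_width_m=50.0))
```

The point of the sketch is only that the window's location depends directly on elapsed time, so a timing error shifts where the system looks.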

Furthermore, the GAO reports that the problem was an accumulating linear error of 0.003433 seconds per hour of uptime that affected every deployed Patriot equally. This was not a clock-specific or system-specific issue.
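The GAO figure can be reproduced with simple arithmetic, under one assumption that is mine and not the report's: that 1/10 was truncated to 23 fractional bits. That width is chosen only because it matches the reported 0.003433 seconds per hour; the names below are hypothetical.

```python
from fractions import Fraction

TICKS_PER_HOUR = 36_000  # tenths of a second in one hour


def truncated_tenth(fraction_bits):
    """1/10 truncated to the given number of binary fraction bits."""
    scale = 1 << fraction_bits
    return Fraction(scale // 10, scale)


def drift_seconds(hours, fraction_bits=23):
    """Accumulated clock error after the given uptime (assumed bit width)."""
    per_tick_error = Fraction(1, 10) - truncated_tenth(fraction_bits)
    return float(per_tick_error * TICKS_PER_HOUR * hours)


print(drift_seconds(1))    # error after one hour of uptime
print(drift_seconds(100))  # error after 100 hours of uptime
```

Because the per-tick error is constant, the drift grows linearly: roughly a third of a second after 100 hours of continuous operation.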

Given all of the above, I reason that the problem was that one part of the Scud-interception calculations used time in its decimal representation and another used the fixed-point binary representation. While the uptime was still low, the discrepancy between the two was negligible, so targets were found where and when they were expected and the latent software bug stayed hidden.

Of course, all of the above detail is specific to the Patriot hardware and software design that was in use at the time of the Gulf War. As the Patriot system has since been modernized by Raytheon, many details like these have likely changed.

According to the GAO report:

Army officials [] believed the Israeli experience was atypical [and that] other Patriot users were not running their systems for 8 or more hours at a time. However, after analyzing the Israeli data and confirming some loss in targeting accuracy, the officials made a software change which compensated for the inaccurate time calculation. This change allowed for extended run times and was included in the modified software version that was released [9 days before the Dhahran Scud incident]. However, Army officials did not use the Israeli data to determine how long the Patriot could operate before the inaccurate time calculation would render the system ineffective.

Four days before the deadly Scud attack, the “Patriot Project Office [in Huntsville, Alabama] sent a message to Patriot users stating that very long run times could cause [targeting problems].” That was about the time of the last reboot of the Patriot system that failed, which by the day of the attack had been running for over 100 hours (a little more than four days).

Note that if time samples were all in the decimal timebase or all in the binary timebase then the two compared radar samples would always be close in time and the error would not accumulate with uptime. And that is the likely fix that was implemented.
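A sketch of that reasoning, assuming (as I infer above, not as the GAO states) that one timestamp was converted with an exact tenth and the other with a truncated binary tenth. The 23-bit truncation and all names are assumptions for illustration.

```python
from fractions import Fraction

TENTH_EXACT = Fraction(1, 10)
TENTH_TRUNC = Fraction((1 << 23) // 10, 1 << 23)  # assumed 23-bit truncation


def seconds_exact(ticks):
    """Convert integer ticks to seconds using an exact tenth."""
    return ticks * TENTH_EXACT


def seconds_trunc(ticks):
    """Convert integer ticks to seconds using the truncated binary tenth."""
    return ticks * TENTH_TRUNC


def interval_error(ticks_then, ticks_now, convert_then, convert_now):
    """Error in the measured interval between two radar samples, in seconds."""
    measured = convert_now(ticks_now) - convert_then(ticks_then)
    true = (ticks_now - ticks_then) * TENTH_EXACT
    return float(abs(measured - true))


uptime_ticks = 100 * 36_000     # 100 hours of uptime
later_ticks = uptime_ticks + 10  # second sample taken one second later

# Mixed timebases: the error scales with total uptime.
print(interval_error(uptime_ticks, later_ticks, seconds_trunc, seconds_exact))

# Consistent timebase: the error scales only with the one-second interval.
print(interval_error(uptime_ticks, later_ticks, seconds_trunc, seconds_trunc))
```

With mixed conversions the two samples disagree by about a third of a second after 100 hours; with a single conversion the errors cancel in the subtraction and the residual is under a microsecond.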

Firmware Updates

Here are a few tangentially interesting tidbits from the GAO report:

  • “During the [Gulf War] the Patriot’s software was modified six times.”
  • “Patriots had to be shut down for at least 1 to 2 hours to install each software modification.”
  • “Rebooting[] takes about 60 to 90 seconds” and sets the “time back to zero.”
  • The “[updated] software, which compensated for the inaccurate time calculation, arrived in Dhahran” the day after the deadly attack.

Public Statements

In hindsight, there are some noteworthy quotes from the 1991 news articles initially reporting on this incident. For example,

Brig. Gen. Neal, United States Command (2 days after):

The Scud apparently fragmented above the atmosphere, then tumbled downward. Its warhead blasted an eight-foot-wide crater into the center of the building, which is three miles from a major United States air base … Our investigation looks like this missile broke apart in flight. On this particular missile it wasn’t in the parameters of where it could be attacked.

U.S. Army Col. Garnett, Patriot Program Director (4 months after):

The incident was an anomaly that never showed up in thousands of hours of testing and involved an unforeseen combination of dozens of variables — including the Scud’s speed, altitude and trajectory.

Importantly, the GAO report states that, a few weeks before the Dhahran Scud, Israeli soldiers reported to the U.S. Army that their Patriot had a noticeable “loss in accuracy after … 8 consecutive hours.” Thus, apparently, all of these “thousands of hours” of testing involved frequent reboots. (I can envision the test documentation now: “Step 1: Power up the Patriot. Step 2: Check that everything is perfect. Step 3: Fire the dummy target.”) The GAO reported that “an endurance test has [since] been conducted to ensure that extended run times do not cause other system difficulties.”

Note too that the quote about “thousands of hours of testing” was also misleading in that the Patriot software was, also according to the GAO report, hurriedly modified in the months leading up to the Gulf War to track Scud missiles going about 2.5 times faster than the aircraft and cruise missiles it was originally designed to intercept. Improvements to the Scud-specific tracking/engagement algorithms were apparently even being made during the Gulf War.

These specific theories and statements about what went wrong, or why it must have been a problem outside the Patriot itself, were fully discredited once the source code was examined. When computer systems may have misbehaved in a lethal manner, it is important to remember that newspaper quotes from those on the side of the designers are not scientific evidence. Indeed, the humans who offer those quotes often have conscious and/or subconscious motives and blind spots that predispose them to be overconfident in the computer systems. A thorough source code review takes time but is the scientific way to go about finding the root cause.

As a New York Times editorial dated 4 months after the incident explained:

The Pentagon initially explained that Patriot batteries had withheld their fire in the belief that Dhahran’s deadly Scud had broken up in midflight. Only now does the truth about the tragedy begin to emerge: A computer software glitch shut down the Patriot’s radar system, blinding Dhahran’s anti-missile batteries. It’s not clear why, even after Army investigators had reached this conclusion, the Pentagon perpetuated its fiction

At least in this case, it was only a few months before the U.S. Army admitted, to itself and to the public, the truth about what happened. That is to the U.S. Army’s credit. Other actors in other lethal software defect cases have been far slower to admit what later became clear about their systems.

6 Responses to “Lethal Software Defects: Patriot Missile Failure”

  1. Rick Kwiatkowski says:

    Interesting. Using 23 bits to represent 0.1 is:
    :00011001100110011001100
    which is approximately 0.0999999046325684
    which gives a difference from 0.1 of: 0.0000000953674316
    which multiplied by 36000 (the number of tenths of a second in one hour) is: 0.0034332275377635
    which matches the GAO error mentioned in the article above.
    If they used another bit (24 bits) for the fraction, the error would have been: 0.0012874603273482

  2. DonQ says:

    Keeping time in 1/10ths of a second could/should have been done as an integer if the *units* of the value was 1/10ths of a second. This is the way I read “kept continuously by the system’s internal clock in tenths of seconds [] expressed as an integer.” For example: if you keep time in 1/60th of a minute, you don’t have to use fractions, you just keep an integer of the number of seconds. In a similar way, if your integer says 17, then there have been 17 tenths of a second since the timer started. This method happens all the time with timer “ticks”, which may be any of a variety of units. The original PC value was 18.206509677 times per second, and most of my current devices use some non-binary fraction of a second, even if running the clock off of 60 Hz. Of course, all the calculations based on this value would have to account for the units, but this should be a standard part of integer programming.

  3. groovyd says:

    agreed… writing assembly for a fixed point cpu and counting tenths of a second it would be pretty safe to assume the programmer would hold that as an int, not as a fraction of a second.

    • groovyd says:

      I think the real lesson to be learned here is to test safety of life systems for more than an hour before actually trying to use them.

  4. Atherton says:

    I just want to know where in the software life cycle the problem was. Analysis, design, coding, or testing? Which team is to be blamed for the disaster? It was a calculation mistake, so ideally the developer should be responsible, in my view. Correct me if I am wrong.

  5. MattS says:

    I use this as a case study for a course on system safety that I teach, so here are some additional details.

    Technical

    1. The volume of space in which the system expects to see a track is called the Range Gate Area (RGA); there’s an associated velocity, so RGA=f(Position,Velocity,Time).
    2. Time is expressed as an integer with no clock rollover, but to predict position, time and velocity are expressed as real numbers.
    3. Patriot ECS calculations are done in floating point; the floating point registers were only 24 bits long, so time conversion from integer to real is precise only to 24 bits.

    Timeline

    1. The problem was introduced by an earlier software mod (see below).
    2. The Israeli Patriot batteries first saw the problem (they were operating continuously)
    3. By Feb 11 the program office had the Israeli data
    4. By Feb 22 the program office had released a fix.
    5. By Feb 26 the fix tapes had reached Dhahran (they didn’t air freight it, a logistics error).

    A little bit about the error itself.

    The problem of 24 bit register representation limits was not a major problem in the original code because what’s used in calculations is T (next) relative to T(now) so errors cancel out. What introduced the problem was a modification to deal with modified SCUDs with higher velocities. This mod affected the time calculation but in one instance the old value was subtracted from the new value which introduced the drift error. BTW when I say patch that’s exactly what I mean, old school patch-space.

    And some thoughts on root causes

    The program office understood the problem, had got a fix in play but failed to communicate clearly to the user group what the issue was and also failed to expedite the change aggressively. At a deeper level the system (including the people and organisation) did not handle a rapidly evolving change in the operational context.

    Safety is always context dependent, change the context and all bets are off…