
Firmware Disasters

Tuesday, June 23rd, 2009 by Michael Barr

First, an Airbus A330 fell out of the sky. Then two D.C. Metro trains collided. Several hundred people have been killed and injured in these disastrous system failures. Did bugs in embedded software play a role in either or both disasters?

An incident on an earlier (October 2008) Airbus A330 flight may offer clues to the crash of Air France 447:

Qantas Flight 72 had been airborne for three hours, flying uneventfully on autopilot from Singapore to Perth, Australia. But as the in-flight dinner service wrapped up, the aircraft’s flight-control computer went crazy. The plane abruptly entered a smooth 650-ft. dive (which the crew sensed was not being caused by turbulence) that sent dozens of people smashing into the airplane’s luggage bins and ceiling. More than 100 of the 300 people on board were hurt, with broken bones, neck and spinal injuries, and severe lacerations splattering blood throughout the cabin. (Article, Time Magazine, June 3, 2009)

Authorities have blamed a pair of simultaneous computer failures for that event aboard the fly-by-wire A330. First, one of three redundant air data inertial reference units (ADIRUs) began producing bad angle of attack (AOA) data. Simultaneously, a voting algorithm intended to handle precisely such a failure in one of the three units, by relying only on the data from the two units that agreed, failed to work as designed; the flight computer instead made decisions based solely on the one failed ADIRU!
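
To make the voting idea concrete, here is a minimal sketch in C of the kind of two-out-of-three median vote described above. The names and the agreement threshold are hypothetical illustrations, not Airbus's actual design; the point is simply that the median of three readings always lies between the two that agree, so a single wild ADIRU should be outvoted rather than obeyed.

    #include <math.h>
    #include <stdbool.h>

    #define AOA_AGREEMENT_DEG 2.0   /* hypothetical agreement threshold */

    /* Median of three: always lies between the two values that agree,
       so one wild reading cannot pull the result away on its own. */
    static double median3(double a, double b, double c)
    {
        if ((a >= b && a <= c) || (a >= c && a <= b)) return a;
        if ((b >= a && b <= c) || (b >= c && b <= a)) return b;
        return c;
    }

    /* Vote three AOA readings and flag any unit that strays too far
       from the median, so later frames can exclude the suspect unit. */
    double vote_aoa(const double aoa[3], bool suspect[3])
    {
        double m = median3(aoa[0], aoa[1], aoa[2]);
        for (int i = 0; i < 3; i++)
            suspect[i] = fabs(aoa[i] - m) > AOA_AGREEMENT_DEG;
        return m;
    }

The failure mode described above is what happens when logic of this sort, however implemented, trusts the wrong unit anyway.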

(A later analysis by Airbus “found data fingerprints suggesting similar ADIRU problems had occurred on a total of four flights. One of the earlier instances, in fact, included a September 2006 event on the same [equipment] that entered the uncommanded dive in October [2008].” Ibid.)

Much of the attention in the publicly disclosed details of the Air France 447 crash has focused on the failure of one of several airspeed indicators. Were there three of those as well? If so, was the same flight computer to blame for failing to recognize which readings to trust and which were unreliable?

It is too early in the investigation of yesterday’s collision between two D.C. Metro Red Line trains, in which a stopped train was rear-ended and heavily damaged by a moving train on the same track, to assign blame. But a WashingtonPost.com article headlined “Collision Was Supposed to Be Impossible” says it all:

Metro was designed with a fail-safe computerized signal system that is supposed to prevent trains from colliding.

and

During morning and afternoon rush hours, all trains except longer eight-car trains typically operate in automatic mode, meaning their movements are controlled by computerized systems and the central Operations Control Center. Both trains in yesterday’s crash [about 5pm] were six-car trains. (Article, Washington Post, June 23, 2009)
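
The Post’s phrase “fail-safe” has a specific engineering meaning, one a commenter below also probes: any loss, dropout, or ambiguity in the train-detection input must itself force the safe outcome. Here is a minimal sketch of that default-to-stop principle in C, with hypothetical names that are in no way WMATA’s actual logic:

    #include <stdbool.h>

    typedef enum { BLOCK_CLEAR, BLOCK_OCCUPIED, BLOCK_UNKNOWN } block_state_t;

    /* Fail-safe movement authority: permit motion only on a positive,
       valid "clear" indication. A lost or corrupt track-circuit input
       is treated exactly like an occupied block: the train stops. */
    bool movement_authorized(block_state_t block_ahead, bool input_valid)
    {
        if (!input_valid)
            return false;                  /* sensor failure -> stop */
        return block_ahead == BLOCK_CLEAR; /* anything else -> stop  */
    }

The catch, of course, is that logic like this protects against missing data, not against data that is wrong but plausible.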

Are bugs in embedded software to blame for these two disasters? You can bet the lawyers are already looking into it.


2 Responses to “Firmware Disasters”

  1. Lisa says:

    I suspect the embedded code could very well have been to blame – it seems nearly impossible to test every condition, but a collision like this one seems an obvious test case to simulate. I'd also be looking at the similarities between automatic mode vs. other modes. A smoking gun (for firmware anyway) is the idea of an "A" car vs. a "B" car and which is in front, and the algorithms when these cars are reversed – and whether anyone bothered to think such a situation might happen.

    Boy, I hate to prematurely slam embedded code, but we aren't very good about consistent, comprehensive, and logical testing.

    I can't wait to see the final report on the firmware, if we ever get to see it. It's too bad we all as a community can't dig through it – engineers love puzzles – just too bad this is such a sad one.

    Lisa
    "Real Life Debugged" Technology Blog
    www.lisaksimone.com/phoneonfire/

  2. st4rbux says:

    Maybe it's semantics, and I'm not an embedded engineer, but what is the definition of "fail-safe"? I'm assuming that means if the sensors fail, the system defaults to a safe state (like stopping the train ASAP). If the sensors provide bad data, that's not a failed state, so all the system can do is process the data it has available.

    I'm also curious what defines a bug. I thought buggy code fails to implement the design (shame on the embedded coders). Is it me, or are these examples more a case of failure in design?

    Ultimately, dependent components (inputs to the embedded logic, like the airspeed sensors) fail, right? Don't we have to accept that four panic-dives out of millions of A330 flights is as close to perfect as we're going to get?
