Embedded Software Boot Camp

Boeing Dreamliner ‘Bug’

Friday, May 1st, 2015 by Nigel Jones

There’s an all too familiar story in the press today. The headline at the Guardian reads “US aviation authority: Boeing 787 bug could cause ‘loss of control’”. As usual with these kinds of stories, it’s light on technical details, other than to reveal that the Dreamliner’s generators will fall into a fail-safe mode if kept continuously powered for 248 days. In this fail-safe mode, the generator apparently doesn’t generate power. Thus if all four of the plane’s generators were powered on at the same time and kept powered for 248 days, then suddenly – no power. That’s what I’d call an unfortunate state of affairs if you were at 40,000 feet over the Atlantic.

So what’s special about 248 days? Well, 248 days = 248 * 24 * 3600 = 21,427,200 seconds. Hmm, that number looks familiar. Sure enough, 2^31 / 100 ~= 21,427,200. From this I can deduce the following.

Boeing’s generators likely contain a signed 32-bit timer tick counter that is being incremented every 10 ms (or possibly an unsigned 32-bit counter being incremented every 5 ms – but that would be unusual). On the 248th day after start-up, this counter overflows. What happens next is unclear, but Boeing must have some sort of error detection in place to detect that something bad has happened – and thus takes what on the face of it is a reasonable action and shuts down the generator.

However, what has really happened here is that Boeing has fallen into the trap of assuming that redundancy (i.e. four generators) leads to increased reliability. In this line of thinking, if the probability of one system failing is q, then the probability of all four systems failing is q*q*q*q – a number that is multiple orders of magnitude smaller than the probability of just one system failing. For example, if q is 0.001, then q^4 is 1,000,000,000 times smaller. At this point, the probability of a complete system failure is essentially zero. However, this is only the case if the systems are statistically independent. If they are not, then you can have as many redundant systems as you like, and the probability of failure will be stuck at q. Or to put it another way, you may as well eschew having any redundancy at all.

Now in the days of purely electro-mechanical systems, you could be excused for arguing that there’s a reasonable amount of statistical independence between redundant systems. However, once computer control comes into the mix, the degree of statistical dependence skyrockets (if you’ll pardon the pun). By having four generators presumably running the same software, Boeing made itself completely vulnerable to this kind of failure.

Anyway, I gave a talk on this very topic, as it applies to life-support systems, at a conference a few years ago – so if you have half an hour to waste, have a watch. If you’d rather read something a bit more coherent than the presentation, the proceedings are available here. My paper starts at page 193.

The bottom line is this. If you are trying to design highly reliable systems and redundancy is an important part of your safety case, then you’d damn well better give a lot of thought to common mode failures of the software system. In this case, Boeing clearly had not, resulting in a supposedly fail-proof system that actually has a trivially simple catastrophic failure mode.

18 Responses to “Boeing Dreamliner ‘Bug’”

  1. Isaac Rabinovitch (@isaac32767) says:

    Great analysis. Thanks for clearing up my confusion. I do have to nitpick this statement:

    “That’s what I’d call an unfortunate state of affairs if you were at 40,000 feet over the Atlantic.”

    If you read the news reports, you’ll notice that the concerns are over this failure happening during a maneuver or takeoff/landing. If it happened while you were cruising along at 40,000 feet, you’d just glide for about a hundred miles. I assume that’s enough time to restart the generators.

    See also Gimli Glider.

  2. Fernando says:

    No problem, just disconnect the plane’s batteries from time to time.

    We’re used to doing it all the time, especially with our computers 🙂

  3. Ryan Fox says:

    Why would an unsigned 5ms counter be more unusual than a signed 10ms counter? Using a signed counter, you open yourself up to potential undefined behaviour problems when the counter overflows, assuming it’s written in C. (Not sure how it works in Ada, or whatever else might be used in aviation.)

    • Nigel Jones says:

      That’s a fair question, as I agree the comment is a bit obtuse. One of life’s enduring mysteries is the optimal time slice for an RTOS. Clearly, if you make it too slow, your system can be unresponsive; make it too fast and your RTOS overhead skyrockets. As a result, the industry seems to have settled on the Goldilocks values of 50 Hz – 100 Hz tick rates. I don’t think I’ve ever seen a 200 Hz tick rate, although I did see a 1000 Hz system once. With that out of the way, there’s another thing that makes me think it’s a 100 Hz system with a signed counter. If the RTOS were detecting the fact that its tick counter had overflowed, then I suspect there would be no shutdown; instead the problem would be handled seamlessly. For me, the description points to an exception-handler type shutdown, in which an illegal event (i.e. overflowing a signed integer) was trapped, presumably in hardware, resulting in the shutdown. If I’m right, then I seriously doubt that there’s any form of protection against an unsigned integer overflowing, since this is behaviour that lots of code relies upon.

      Of course, this leads into the bigger question of why a signed counter was used for what is an unsigned function. I’ve written a lot about signed vs. unsigned integers. If I’m right and Boeing used a signed integer here, then I’d suggest it was a mistake.

      • Henk de Leeuw says:

        I think representing time as a signed integer is not a mistake; it is perfectly all right.
        How else can we distinguish the past from the future?
        It is the implementation of time comparison that is tricky.
        A naive approach to a timeout function would be:
        clock_t timeout = clock() + 1000;
        while (clock() < timeout) { … }
        This would result in exactly the behavior we see here, malfunction as soon as clock() overflows.
        A correct implementation would be:
        clock_t start = clock();
        while (clock()-start < 1000) { … }
        This implementation is immune to timer wrap-around. Note that you can only do this with a signed clock_t type.
        As a side note, I know of one computer system that used a 200 Hz clock. Admittedly, it was not an embedded RTOS, but the venerable Atari ST home computer.

  4. Wally says:

    In aviation it might use Ada, or C. The mandating of Ada seems to be long gone. But even Ada allows you to use large unsigned quantities (I wrote Ada for aviation/aerospace for a number of years).

    It’s an odd thing for a programmer not to have some kind of wrap-around handling for cases like this. Or at least upper limit handling. Wrap around (using signed subtraction) is actually not very hard to do.

    There is a common programmer’s mindset of “oh, that case will NEVER happen in practice”. And as a boss of mine used to say when channeling James Bond: never say never.

    • EtienneH says:

      This story reminds me of the famous Patriot missile bug (http://www.ima.umn.edu/~arnold/455.f96/disasters.html).

      To my mind, the question was not even “will an overflow ever happen”. The programmer most probably never took a second to think about the possibility of an overflow happening. We need to count time? OK, let’s use an int, as that’s the most standard variable size. And of course no system testing ran long enough to detect the fault.
      Had the team taken only a minute to think about wrap-around problems, they would quickly have found an easy work-around based, as suggested, on integer wrap-around.

      Or… maybe the RTOS consistently used tick-count wrap-around, but at some point a task programmer forgot to use signed-integer comparison of ticks, e.g. in an over-engineered bit of defensive programming like the following:
      last_ticks = ticks;
      ticks = os_get_tick_count();
      if (ticks < last_ticks)
      {
      fatal_error("inconsistent time");
      }

      Although in the end it boils down to a well-known programming mistake, the actual explanation is certainly a complex combination of insufficient/incoherent requirements, design misunderstanding, programming misconception, etc. And wrong assumptions about safety by redundancy.

  5. Karl says:

    Don’t they listen to their static analysis tool? But then again, Apple doesn’t listen to their compiler warnings (#gotofail).

  6. Lundin says:

    I would question why the programmer used a signed integer for a timer counter. Why would you ever want a timer to hold a negative value? It smells of laziness and a general lack of concern about data types – just declare everything as “int”, like the average beginner programmer. So I doubt they used a safe subset of the language (such as MISRA-C or SPARK), even though I believe DO-178 requires you to use a safe subset of whatever language is used(?).

    It also seems strange that there is no check against the upper limit, but static analysis would not likely have found that bug.

    The comment about redundancy giving a false sense of safety is interesting. Lots of safety-standard bureaucrats love you to add redundancy all over, as if you would then magically get increased safety, while in many cases you have just reduced the probability of the inevitable failure happening. But what really makes a system safe is the ability to detect and deal with errors when they happen.

    • Anonymous says:

      To my knowledge, MISRA-C wouldn’t mandate anything around this issue, but a good code review process should have caught it.
      One far-out issue I can think of that may have gotten around an upper-bounds check is if the counter was stored in NVM and something else (cosmic rays or whatever) modified it past the bounds.
      I’ve heard of a case recently (in the last year or so) where the check for the upper bound was only against equality, not greater-than. Given that, the value kept incrementing once it had been modified to be above the boundary.

  7. ChrisH says:

    There are times when you will want a negative timer. You count down to a start (-ve to zero) and then count positive time on from that… Other times you may count down from a positive value to zero and again record the time until another event after zero.

    The problem isn’t a signed or unsigned integer but what happens when you reach the limit at either end of the legal range.

    I know, let’s call it range checking! I should patent that idea before someone else thinks of it… 🙂

  8. Narayan says:

    Hi Nigel,
    Though not on the same lines, this has led me to ask your opinion on the recent ‘leap second’. How could it have caused issues (were there any?), and how do we address such problems? It is extremely rare, but I am curious to know whether it could have caused any serious damage (to life/property).

    Cheers,
    Narayan

  9. David Bakin says:

    Lots of comments on the timer-overflow aspect, yet nobody has mentioned the Windows 98 49-day crash bug? Was that too long ago?

    Anyway, the principal takeaway from this post is the folly of not properly identifying dependent or correlated systems when you’re trying to achieve reliability via redundancy (and that’s well written here).

    I’ve seen that in datacenter situations before – e.g., multiple switches used to provide multiple paths to servers – unfortunately all switches controlled via one piece of software that made it easy for a simple operator error to affect all switches at once!

    (I distinctly remember a backhoe taking out several fiber lines at once in the Bay Area a decade or two back – surprising several companies – possibly Visa? – who thought they were getting reliability by paying for redundant fiber – unfortunately all of their redundant connections went through the same trench alongside the train tracks … but Google won’t find this for me now …)
