Archive for the ‘Firmware Bugs’ Category

Boeing Dreamliner ‘Bug’

Friday, May 1st, 2015 Nigel Jones

There’s an all too familiar story in the press today. The headline at the Guardian reads “US aviation authority: Boeing 787 bug could cause ‘loss of control’”. As usual with these kinds of stories, it’s light on technical details, other than to reveal that the Dreamliner’s generators will fall into a fail-safe mode if kept continuously powered for 248 days. In this fail-safe mode, the generator apparently doesn’t generate power. Thus if all four of the plane’s generators were powered on at the same time and kept powered for 248 days, then suddenly – no power. That’s what I’d call an unfortunate state of affairs if you were at 40,000 feet over the Atlantic.

So what’s special about 248 days? Well, 248 days = 248 * 24 * 3600 = 21,427,200 seconds. Hmm, that number looks familiar. Sure enough, 2^31 / 100 = 21,474,836 – a shade over 248.5 days’ worth of seconds. From this I can deduce the following.

Boeing’s generators likely contain a signed 32-bit timer tick counter that is incremented every 10 ms (or possibly an unsigned 32-bit counter incremented every 5 ms – but that would be unusual). Roughly 248 days after start-up, this counter overflows. What happens next is unclear, but Boeing must have some sort of error detection in place that notices something bad has happened – and it thus takes what on the face of it is a reasonable action and shuts down the generator.
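To make the arithmetic concrete, here’s a minimal sketch of the kind of counter I’m speculating about. Everything in it – the names, the 100 Hz tick, the choice of a signed counter – is my guess, not anything taken from Boeing’s actual code:

#include <stdint.h>
#include <stdio.h>

static int32_t g_tick_count;     /* hypothetical signed 32-bit tick counter        */

void timer_tick_isr(void)        /* assume this fires every 10 ms (100 Hz)         */
{
    g_tick_count++;              /* wraps (signed overflow!) after 2^31 ticks      */
}

int main(void)
{
    /* 2^31 ticks at 100 Hz is 21,474,836 seconds                                  */
    double days = 2147483648.0 / 100.0 / (24.0 * 3600.0);
    printf("Counter overflows after %.2f days\n", days);   /* prints ~248.55 days  */
    return 0;
}

If the fault-detection logic treats a negative (wrapped) tick count as evidence that something has gone badly wrong, then shutting the channel down into its fail-safe mode is exactly the behaviour you would expect to observe.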

However, what has really happened here is that Boeing has fallen into the trap of assuming that redundancy (i.e. four generators) automatically leads to increased reliability. In this line of thinking, if the probability of one system failing is q, then the probability of all four systems failing is q*q*q*q – a number that is many orders of magnitude smaller than the probability of just one system failing. For example, if q is 0.001, then q^4 is 1,000,000,000 times smaller. At this point, the probability of a complete system failure is essentially zero. However, this is only the case if the failures are statistically independent. If they are not, then you can have as many redundant systems as you like, and the probability of failure will remain stuck at q. Or to put it another way, you may as well eschew having any redundancy at all.
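In symbols (this is just the textbook reliability arithmetic – nothing here comes from Boeing’s actual safety case):

\[
P(\text{all four fail}) = q^4 = (10^{-3})^4 = 10^{-12} \quad \text{(failures statistically independent)}
\]
\[
P(\text{all four fail}) \approx q_{\text{common}} = 10^{-3} \quad \text{(all four channels share the same software fault)}
\]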

Now in the days of purely electro-mechanical systems, you could be excused for arguing that there’s a reasonable amount of statistical independence between redundant systems. However, once computer control comes into the mix, the degree of statistical dependence skyrockets (if you’ll pardon the pun). By having four generators presumably running the same software, Boeing made itself completely vulnerable to this kind of failure.

Anyway, I gave a talk on this very topic, in the context of life support systems, at a conference a few years ago – so if you have half an hour to waste, have a watch. If you’d rather read something that’s a bit more coherent than the presentation, then the proceedings are available here. My paper starts on page 193.

The bottom line is this. If you are trying to design highly reliable systems and redundancy is an important part of your safety case, then you’d damn well better give a lot of thought to common mode failures of the software system. In this case, Boeing clearly had not, resulting in a supposedly fail-proof system that actually has a trivially simple catastrophic failure mode.


How to lockup the in-flight entertainment system on a Boeing 777

Saturday, June 18th, 2011 Nigel Jones

I have recently returned from a short trip to the UK. I flew both ways on what appeared to be a relatively new Boeing 777, courtesy of United Airlines. As is now commonplace on trans-Atlantic wide-body aircraft, my seat came with its own in-flight entertainment system. After I took my seat for the flight to London, I was a little surprised to see the in-flight entertainment system suddenly reboot. How did I know this? Well, there was a cute Linux penguin in the top left-hand corner, plus I (and the rest of the plane) was treated to the always delightful task of watching hundreds of lines of startup script scrolling across the screen. After a minute or two the reboot came to an end and I was presented with the now-standard touch-screen user interface. This should have given me a hint that the in-flight entertainment system wasn’t the most stable of applications.

Now flights from the east coast of the USA to Europe tend to be night flights, and this flight was no exception. Being a seasoned traveler, I have my trans-Atlantic night flight routine down pat: get settled in, have something to eat, put on the noise-cancelling headphones, select a classical music channel and then try to sleep for the rest of the flight. I followed my routine, and after selecting the appropriate music channel, I selected the volume control icon. This resulted in a pop-up window from which I adjusted the volume to something reasonable. I then made the fateful mistake. Rather than exiting the volume adjust window, I simply hit the ‘on/off’ button. This of course doesn’t turn the system off; it merely blanks the display. I then settled in for my nap. Several hours later, I woke up, wondered how far into the flight we were, and decided to take a look at the real-time route map. Accordingly I turned the display ‘on’ and was rewarded with what I expected to see, namely the volume control screen. A photograph is shown below:

When I touched the Back button, nothing happened. Hmmm, thought I, has my system died? Answer – no. I could still adjust the volume, and I could still mute and un-mute the audio. I could also turn the system ‘on’ and ‘off’. Clearly the problem was that there was no ‘back’ behaviour associated with the Back button. Needless to say, United Airlines wasn’t going to reboot the entire system just so that I could experiment further, plus I was tired. I resolved instead to investigate some more on the return flight.

Thus a week later I’m on a Boeing 777 with the same in-flight entertainment system. I brought up the volume control page and did nothing. After about 30 seconds, the pop-up window auto-cleared. This was the clue I needed. I brought up the volume control page again and immediately turned the unit ‘off’ and then, a few seconds later, ‘on’. When I did this the Back button worked correctly. I repeated the exercise a few more times, and it always worked. I then repeated the exercise, but this time I waited longer than the screen timeout period before turning the unit back ‘on’. Voila – lockup. Clearly this was a case where there was a gaping hole in the state machine driving the user interface. At this point I found some interesting thoughts crossing my mind:

  • Idiot. Now you can’t watch any movies.
  • I bet the designer didn’t use a formal state machine tool such as visualSTATE or QP
  • Whenever there are two distinct ways of exiting a state (in this case, user action or the passage of time), life gets complicated – see the sketch after this list
  • Preserving state across a ‘power down’ is always difficult
  • I hope the guy who wrote the in-flight entertainment system had nothing to do with the flight control systems on the plane!
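For what it’s worth, here is a minimal sketch of the kind of state machine I suspect was mishandled. The states, events and names are all mine – I obviously have no access to the real code – but it shows why the pop-up timeout and the ‘power’ button both have to be handled explicitly in the pop-up state:

typedef enum { ST_BROWSE, ST_VOLUME_POPUP, ST_DISPLAY_OFF } state_t;
typedef enum { EV_VOLUME_ICON, EV_BACK, EV_TIMEOUT, EV_POWER } event_t;

static state_t state        = ST_BROWSE;
static state_t resume_state = ST_BROWSE;   /* state to restore after 'power on'   */

void ui_dispatch(event_t ev)
{
    switch (state)
    {
    case ST_BROWSE:
        if (ev == EV_VOLUME_ICON)
        {
            state = ST_VOLUME_POPUP;       /* start the ~30 s pop-up timer here   */
        }
        else if (ev == EV_POWER)
        {
            resume_state = ST_BROWSE;
            state = ST_DISPLAY_OFF;
        }
        break;

    case ST_VOLUME_POPUP:
        if ((ev == EV_BACK) || (ev == EV_TIMEOUT))
        {
            state = ST_BROWSE;             /* two distinct exits - both must work */
        }
        else if (ev == EV_POWER)
        {
            resume_state = ST_BROWSE;      /* collapse the pop-up before blanking */
            state = ST_DISPLAY_OFF;
        }
        break;

    case ST_DISPLAY_OFF:
        if (ev == EV_POWER)
        {
            state = resume_state;          /* never wake up into a stale pop-up   */
        }
        break;
    }
}

The behaviour I saw is exactly what you would get if the timeout path cleared the pop-up from the screen but left the touch handling stuck in the pop-up state – in other words, if one of these transitions was missing.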

Anyway if you find yourself on a plane with an in-flight entertainment system in the near future, see if you too can crash the system – and let us know how you did it.

P.S. I woke up this morning to read that a computer ‘glitch’ effectively grounded United Airlines yesterday. See this for thoughts on United’s computer system.

Sanity checking data

Sunday, May 29th, 2011 Nigel Jones

One of the major differences between embedded systems engineers and the general public is that we tend to notice embedded systems a lot more – both when they do something very well, and of course when they do things not so well. The latter happened to me recently while I was pursuing one of my hobbies, namely riding my bicycle. I’m quite a keen cyclist, in part because I happen to live in a part of the world that has some terrific riding country. (To see what I mean, check out the photos of the ride in question. They were taken by my cycling (and skiing) buddy Bill – who I think you’ll agree is a very fine photographer. Click here to see more of his published work. If you hunt carefully, you’ll find some pictures of yours truly.)

Anyway, at about 25 miles into the ride, I noticed that my Incite bike computer was showing that I had ridden 34 miles. This struck me as unlikely, and so I asked Bill what mileage he had on his computer – 25 being the answer. I thus cycled through the computer’s screens until I came to this one (photo courtesy of Bill Tan).

Apparently the computer thought that I had hit 132.6 mph at some point – and obviously sustained it for quite a while for me to gain about 9 miles. To understand how this could come about, you need to know a little about my bike computer. The computer consists of two parts: the user interface (shown above) and the pickup. The pickup senses wheel rotation via a magnet passing close to what I assume is a Hall-effect sensor. Now whereas many bike computers transfer the signal along a cable to the display, mine transmits it over an RF link – and this I suspect was the root cause of the problem. My guess is that at some point in the ride I rode into an area of RF interference that the display interpreted as signals from the pickup. The firmware in the bike computer appears to have blithely accepted the RF data as valid and thus produced the ridiculous result shown here.

Now I have often been faced with this kind of problem – and the solution is not easy. However, IMHO Incite really fell down on the job here. If I had been writing the code, I suspect that I’d have done the following:

  1. Median filter the data to remove random outliers. (Incite may be doing this).
  2. Sanity check the output of the median filter. If the answer is ‘impossible’ (like a human pedaling a bicycle at 132.6 mph), then reject the data and let the user know that something is amiss. Incite did neither of these things.

Rejecting the data is actually a little harder than it sounds. If you reject data, what do you replace it with? Common choices are:

  1. Zero
  2. The most recent valid data
  3. The average of the last N readings.

Each of these has its place; the right choice is application dependent.

Letting the user know that something is amiss is usually straightforward – flashing the erroneous value is the normal solution.
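Pulling those pieces together, here is a minimal sketch of the kind of defensive processing I have in mind. The 100 mph ceiling, the three-sample median and all of the names are my own arbitrary choices – this is an illustration, not Incite’s code. Speed is assumed to arrive in tenths of a mile per hour:

#include <stdbool.h>
#include <stdint.h>

#define MAX_PLAUSIBLE_SPEED_X10   (1000U)   /* 100.0 mph - generous for a bicycle */

static uint16_t median3(uint16_t a, uint16_t b, uint16_t c)
{
    if (a > b) { uint16_t t = a; a = b; b = t; }   /* now a <= b                  */
    if (b > c) { b = c; }                          /* b = min(max(a, b), c)       */
    return (a > b) ? a : b;                        /* median of the three         */
}

/* Returns the speed to display. Implausible readings are replaced with the last
   valid value, and *suspect is set so that the UI can flash the digits.         */
uint16_t filter_speed(uint16_t raw_x10, bool *suspect)
{
    static uint16_t history[3];
    static uint16_t last_valid;
    uint16_t filtered;

    history[2] = history[1];                       /* 1. median filter to remove  */
    history[1] = history[0];                       /*    isolated outliers        */
    history[0] = raw_x10;
    filtered = median3(history[0], history[1], history[2]);

    if (filtered > MAX_PLAUSIBLE_SPEED_X10)        /* 2. sanity check the result  */
    {
        *suspect = true;                           /* let the user know           */
        return last_valid;                         /* substitute last valid data  */
    }

    *suspect = false;
    last_valid = filtered;
    return filtered;
}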

Anyway, the bottom line is that a wise embedded engineer always sanity checks the incoming (and outgoing) data. If you do, you are less likely to end up as the subject of a blog posting.

A personal note

I apologize for the abysmal rate at which I have been posting. I moved house this spring, which coupled with me being extremely busy has resulted in there simply not being enough hours in the day. When that happens, something has to give. I hope to return to my normal blog posting rate in July.

Classic race conditions and thoughts on testing

Monday, August 30th, 2010 Nigel Jones

At a rather fundamental level this blog is about how I do embedded systems. Implicit in a lot of the articles is the concept that I believe what I’m doing is ‘right’, or at least ‘better’. Well today I thought I’d write about something I got wrong (at least on the first pass).

This is the scenario. I’m currently working on an NXP LPC17xx ARM Cortex design. Like all modern processors, the LPC17xx has a number of sophisticated timers with all sorts of operating modes. Well, it so happens that I am using four (out of a possible six) interrupt sources for one particular timer. The hardware architecture of the processor routes all of these interrupts to one vector and thus one interrupt handler. Here’s what I wrote:

void TMR3_IRQHandler(void)
{
 if (T3IR_bit.MR0INT)
 {                                            
  /* Do stuff */
 }

 if (T3IR_bit.CR0INT)
 {        
  /* Do stuff */        
 }

 if (T3IR_bit.MR1INT)
 {                                            
  /* Do stuff */
 }

 if (T3IR_bit.CR1INT)
 {                                            
  /* Do stuff */
 }

 T3IR = 0x3F;            /* Acknowledge all interrupts */

 ...
}

Thus in the ISR I tested each of my interrupt sources, took the appropriate actions in the sections marked ‘Do stuff’, acknowledged the interrupts, did a bit of cleanup and I was done. The ‘Do stuff’ sections were quite complicated and so this was where I spent my time. Anyway, having finished coding the ISR, I took a short break and came back to re-examine the code. As I was re-reading the code, I realized that I had made a classic mistake. In case you haven’t spotted it, the problem is in the line where I acknowledge all interrupts. Consider the following sequence of events:

  1. Interrupt source CR1INT is asserted and the CPU vectors to this ISR.
  2. I test the various interrupt flags and discover that CR1INT is set and do the requisite work.
  3. While I’m doing the requisite work, interrupt source MR1INT becomes active.
  4. I clear all interrupt sources (including MR1INT) and terminate the ISR.
  5. As a result, I have missed the MR1INT interrupt.

The way this should have been coded is to acknowledge each interrupt source individually, i.e. like this:

void TMR3_IRQHandler(void)
{
 if (T3IR_bit.MR0INT)
 {                                            
  /* Do stuff */
  T3IR_bit.MR0INT = 1;                    /* Clear the interrupt */
 }

 if (T3IR_bit.CR0INT)
 {        
  /* Do stuff */    
  T3IR_bit.CR0INT = 1;                    /* Clear the interrupt */
 }

 if (T3IR_bit.MR1INT)
 {                                            
  /* Do stuff */
  T3IR_bit.MR1INT = 1;                    /* Clear the interrupt */
 }

 if (T3IR_bit.CR1INT)
 {                                            
  /* Do stuff */
  T3IR_bit.CR1INT = 1;                    /* Clear the interrupt */        
 }

 ...
}

So how did this mistake come about? I think there were two culprits:

Mistake 1

The first mistake I made was in using another timer ISR as a template. The code I copied had just a single interrupt source, and thus acknowledging all of the sources was reasonable.

Mistake 2

I was too concerned with the ‘real work’ of the ISR. I should have written the ISR outline first and only then worried about the real work.

Notwithstanding the above, I did do one thing correctly – and that was to finish the code, walk away, and then come back to re-examine it. At no time did I reach for the debugger to test my code – which was just as well because quite frankly the chances of this bug being caught by testing are vanishingly small. Indeed just about the only way a bug like this would get caught is via code inspection – which is why I’m such a firm believer in code inspection as a debugging tool.

Anyway if you found this informative, you may find this account of another mistake I made equally enlightening.

A taxonomy of bug types in embedded systems

Wednesday, October 7th, 2009 Nigel Jones

Over the next few months I’ll be touching upon the subjects of debugging and testing embedded systems. Although much has been written about these topics (often by companies looking to sell you something), I’ve always been struck by the fact that many of these discussions treat errors as if they were all cut from the same cloth. Clearly this is foolhardy, as it’s my experience that understanding what class of error you have is key to adopting an effective debugging and testing strategy. With that being said, my taxonomy of embedded systems errors appears below, arranged roughly in the order that one encounters them in an embedded project. I might also add that the difficulty in solving these problems also roughly follows the order I’ve listed, with syntax errors being trivial to identify and fix, while race conditions can be extremely difficult to identify (even if the fix is fairly easy).

Group 1 – Building a linked image
Syntax errors
Language errors
Build environment problems (makefile dependencies, linker configurations)

Group 2 – Getting the board up and running
Hardware configuration errors (failure to setup peripherals correctly)
Run-time environment errors (stack & heap allocation, memory models, etc.)
Software configuration errors (failure to use library code correctly)

Group 3 – Knocking off the obvious mistakes
Coding errors (initialization, pointer dereferencing, off-by-one (N + 1) issues, etc.)
Algorithmic errors

Group 4 – Background / Foreground issues
Re-entrancy
Atomicity
Interrupt response times

Group 5 – Timing related
Resource allocation mistakes
Priority / scheduling issues
Deadlocks
Priority inversion
Race conditions

It’s my intention over the next few months to discuss how I set about solving these sorts of problems, so it’s important that I’ve got the groups right. Thus if anyone thinks this taxonomy is missing an important group, then perhaps you could let me know via the comments section or email.
