Posts Tagged ‘safety’

Public Course on Multithreaded Programming

Tuesday, November 6th, 2007 Michael Barr

This coming January, I’ll travel from chilly Baltimore to sunny Miami to teach an in-depth training course about the proper use of real-time operating systems to design multithreaded firmware. The aim of the class is to clarify the safe and correct use of RTOS primitives, such as mutexes, semaphores, and mailboxes.

The two-day course, called Multithreaded Programming with uC/OS-II, will be held January 22-23, 2008 at the Weston, Florida headquarters of RTOS vendor Micrium, just east of the Everglades. Registration is open to the public, but the total number of seats is limited.

The hands-on course involves a mix of lectures and a coordinated series of programming exercises. The target hardware is an ARM9 development board from STMicro. The increasingly popular uC/OS-II real-time operating system will serve as the reference API with compiler and debug tools from IAR Systems.

Full details, including registration instructions, are available at the Micrium website: http://www.micrium.com/support/training.html

Building Reliable Systems

Wednesday, September 19th, 2007 Michael Barr

I’m writing this live from the Embedded Systems Conference in Boston, while participating in a birds of a feather discussion moderated by Jack Ganssle. The subject of the session is Building Reliable Systems.

The discussion here amongst perhaps 80 engineers (about 75% electrical engineers by education) initially focused on resources and schedules and the inevitability of bugs, but has now turned to what seems to be a more productive thread: specific processes and tools that produce higher reliability.

One gentleman had a great way of summarizing what needs to be done at a high level:

* Prioritization
* Process
* Metrics

In this view, Prioritization is an input to the Process: the prioritization of features relative to one another, along with accurate definitions of properties such as quality and reliability, is provided by the customer or by engineering management. The Process used for design, development, and testing is then guided by these input parameters. Metrics are an output of such a Process, if you strive to generate them. Metrics include the schedule time it took to complete specific feature implementations, as well as bug rates.

On the specific point of process and tools, we’ve discovered that just a few companies represented in the room are using code coverage tools and test-driven development, though most appear to agree these might be helpful in raising reliability.

My bottom-line fear: one of the products designed by someone in the room with me right now will eventually kill or maim someone!

The Limits of Knowledge

Friday, November 7th, 2003 Michael Barr

The practice of engineering has often been likened to a form of art. It is, I think, the art of making scientific tradeoffs. As scientists with a practical, rather than academic or theoretical, focus, we are often challenged to build things on the basis of information at or very near the boundaries of what is known to man.

In virtually all endeavors of engineers, there are unknowns, subtleties, and complexities over which we exercise limited control. The cost, in engineering time and resources, to fully comprehend everything about a system is in some cases unbounded; at the very least, such thorough analysis is generally cost-prohibitive. If the product works, we often can't afford to do much more than ship it and move on to the next project.

Just as tradeoffs are made in the area of features or implementation techniques, so too must tradeoffs be made in the area of knowledge. It is rarely possible to build a saleable product (that will also earn our employer profits) while at the same time completely understanding all of the possible implications of our numerous design and implementation decisions.

Simply put: components fail. And when individual components fail they can take even carefully-designed systems down with them. Such system failures do sometimes also take the lives of their operators or other people. Catastrophes like this are unfortunate—and bound to increase as people rely increasingly on technological solutions to everyday problems.

The designers of each system must decide how much time and money to spend investigating the dark corners. Those designing pacemakers and airplanes, for example, are responsible for shining the light of knowledge brightly into every corner of their designs; whereas the designers of stereos and televisions can leave a great deal more to chance.

There are, of course, areas of engineering that suffer from the need for thorough analysis but are not profit driven. Manned missions to space, such as those conducted by NASA, are of this nature. Tremendous efforts are made by the engineers working at NASA to understand all of the complexities and potential failure points of the Space Shuttles. Unfortunately, there is likely an unbounded amount of work to be done; these systems have millions of individual components and operate in unforgiving and poorly understood environments. And there’s only limited time to show results.

As the losses of the Challenger and Columbia have demonstrated, sometimes it is a part of a design that is thought to be reasonably well understood that is actually the most dangerous. In both cases, very similar past failures had been observed, documented, and discussed by engineers, yet the true problem and the danger it posed were not fully comprehended until after each catastrophe struck.

I don’t blame the engineers at NASA for the loss of either shuttle; in both cases they knew there was a problem but had too many other, seemingly more important, concerns. I’m willing to let NASA administrators and their overseers decide if managerial mistakes were made and, if so, how to correct them. But all engineers everywhere should learn from NASA’s mission failures: What is the true source of the problem in your system? What danger does it pose? How can you overcome organizational challenges to see the proper solution through?

Emergency

Friday, November 23rd, 2001 Michael Barr

In the days immediately following September 11 a pair of articles, one in EE Times and the other in The Washington Post, about emergency cell phone location technology caught my attention. Both articles focused on renewed lobbying efforts on Capitol Hill aimed at forcing cellular providers to meet the FCC’s deadline for implementing Phase II of the Enhanced 911 standard. (At press time, most cellular carriers have requested waivers for the October 1 deadline.)

The contradiction underlying the timing of these new lobbying efforts is that the technology, as proposed, would have helped very few, if any, of the thousands of victims of the terrorist attacks. Setting aside the likelihood that most of those in the twin towers did not survive the collapses, consider these more technical issues.
Handset-based locators:
  • In order for a GPS receiver in a handset to determine the owner's location to within the required 50 m (for 67% of 911 callers) or 150 m (for 95%), a clear view of several satellites is required. It can be difficult to get an adequate view of the sky in the downtown section of a city (a big part of the reason the requirements are stated as percentages of callers) even on a normal day. Imagine trying to acquire a signal from even one GPS satellite while buried in a pile of rubble ten stories deep and mostly underground (or in a subway system, a traffic tunnel, or many other likely emergency sites, for that matter).
  • Even if your handset could somehow manage to acquire a sufficient number of satellites, it’s questionable whether the mandated accuracy range would have been adequate in this disaster. With literally millions of tons of debris to move, even 50 m accuracy is nowhere near precise enough to point rescuers in the proper direction to dig. The larger 150 m radius could put you anywhere within the base of one tower. And how deeply should they look? (Start digging with heavy equipment or hands?)
Network-based locators:
  • Perhaps, in this disaster, network-based triangulation would have been more useful to rescuers. At least it wouldn't have required victims to have recently upgraded their phones or to have a clear view of the sky. Yet the lower required accuracy for this technology (100 m for 67% of 911 callers, 300 m for 95%) would have made the data that much less useful to the rescuers.

In either case, both technologies would require that the victim's phone also be: still in her possession after the collapse; still in working condition; and at least partly charged. In addition, the victim turning on her phone would have to be lucky enough to be greeted by something other than a lack of signal (several base stations in the immediate vicinity of the World Trade Center were destroyed) or a network-busy response (cellular and land-line telephone traffic surged even on networks clear across the country).

Rather than pointing to the need to implement the current generation of E-911 technology more quickly, this tragedy only points to the complete inadequacy of the current requirements for certain kinds of disasters. The current E-911 technology may, in fact, be useful in some sorts of emergencies. But we can’t stop there. In addition to implementing the current technology, other technologies and approaches need to be considered as well. For example, handheld devices that pinpoint the location and distance of handset signals should be available en masse within hours of such disasters.  
Surely someone in our industry is in a position to help solve this problem before the next disaster strikes.

Safety Patrol

Thursday, September 20th, 2001 Michael Barr

When I was in the sixth grade, I was a member of my school’s Safety Patrol. It was my responsibility to ensure that younger children got on and off the school bus safely. “Safeties” wear bright orange sashes and help other kids cross streets adjacent to their bus stops. This is just one measure in a complex web of overlapping steps taken to protect the most vulnerable members of our communities.

As children and adults alike increasingly place their lives in the hands of computer hardware and software, we need to add layers of safety there as well. No software bug or hardware glitch (or combination) can ever be allowed to bring down an aircraft, whether there are hundreds of passengers on board or just a pilot. The failure of many other systems must be similarly prevented. But software and hardware do fail—perhaps inevitably. As engineers, we use system partitioning, redundancy, protection mechanisms, and other techniques to contain and work around failures when they do occur.
As software’s role in safety-critical systems continues to expand, I expect we’ll see a rapid increase in the number of civil lawsuits filed against companies that design and manufacture embedded systems. (Adding several new levels of meaning to the phrase project post mortem.) Indeed, there is anecdotal evidence that lawsuits of this sort may already be on the rise. With most of the action in hush-hush settlements outside the courtroom, though, the media hasn’t yet noticed the trend.
One organization that has definitely taken notice of the hazards posed by software in products is Underwriters Laboratories. An independent, not-for-profit product safety certification and ANSI-accredited standards organization, UL initiated a “Standard for Software in Programmable Components” in 1994. The resulting ANSI/UL-1998 standard addresses “the detailed safety-related characteristics of specific software in a product.”
In addition to focusing on top-down design and development processes, it may also be beneficial to use an operating system that was designed with safety-critical systems in mind. Above all else, an RTOS should not compromise the stability of the system. But an operating system can go beyond that and do many things to reduce the risks inherent in your application code. Keeping software tasks from overwriting each other's data and stacks is merely the beginning.
In your rush to select an RTOS for use in a mission-critical system or life-critical medical device, do make sure you know what you're getting, though. It turns out that one prominent new operating system marketed specifically for inclusion in products of these sorts has a potentially dangerous hole in its “innovative” protection mechanism. You don't want to wind up on the wrong side of something like that in court.
Ultimately, the key to designing safety-critical systems is to include multiple layers of protection. The hardware, the operating system, and your application software must each do everything they can to prevent catastrophe—even if the fault itself lies outside that subsystem.