
Is Reliable Multithreaded Software Possible?

Wednesday, December 23rd, 2009 by Michael Barr

Until earlier this month, I’d overlooked a most interesting May 2006 article in Embedded Systems Design magazine by Mark Bereit titled “Escape the Software Development Paradigm Trap”. The article opines that the methods we use to design embedded software, particularly multitasked software with interrupt service routines and/or real-time operating systems, are fundamentally incompatible with reliability.

Here’s the critical analogy:

Imagine for a minute that I’ve invented the Universal Bolt. This is a metal object for joining threaded holes that can extend or collapse to fit a variety of lengths. It can expand or contract to fit holes of different diameters. The really cool feature is that I have replaced the bolt’s spiral ridge with a series of extendable probes that can accommodate different thread pitches. You no longer need to stock a variety of bolts of different sizes and lengths and thread spacings because my Universal Bolt can be used in place of any of them.

Because it’s able to change configurations extremely quickly, a single Universal Bolt can take the place of many conventional bolts simultaneously. What we do is rig up a clever and very fast dispatcher device that quickly moves the [Universal Bolt] from hole to hole. If the dispatcher is fast enough, my Universal Bolt can spend a moment in each hole in turn and get the whole way through your [mechanical] product so fast that it returns to each hole before the joint has had a chance to separate.

You’d have to be crazy to fly in an airplane designed this way. “If anything caused the dispatcher to derail, the entire product would collapse in a second.” Yet this analogy describes the design of most products powered by embedded computers.

A fast and complex thread dispatcher keeps moving one simple and stupid integer-computation unit all over a big system tending to tasks [and ISRs] rapidly enough that they all get done. And if that dispatcher ever once leads the CPU into an invalid memory address the whole thing crashes to a halt.

Clearly, we need a new paradigm for reliable embedded software architecture. My thoughts on that are coming to this space in 2010.


7 Responses to “Is Reliable Multithreaded Software Possible?”

  1. tedmar says:

    You are right! Real-time multitasking is a powerful tool and, as such, you can blow yourself up with it if you don't take extreme care. Moreover, many people use it in scenarios where it is not needed; if the RTM is used in preemptive mode, the risk of damage is high. In many cases, the RTM breaks the time ordering of events, allowing race conditions. I've found that the old foreground-background design is better behaved in most cases.

  2. peterbushellwp says:

    (1) The analogy seems spurious but I've neither the time nor the inclination to argue this point.

    (2) Like many, I'm very concerned about the quality of some embedded software. There is a whole "safe software" industry out there but it seems to be totally fixated on the process of software development, especially testing. Important, yes, but I believe the biggest problem we have (and not just in this industry) is declining knowledge and skills, and lax ("best-case" design?) attitudes, among some designers and many programmers.

    (3) The problems pervade all kinds of software and in this respect multithreading/multitasking (or otherwise) is irrelevant. However, the development of multitasking software requires single-minded concentration and skills in greater measure than does the development of a sequential program, so (2) rears its head again.

    (4) I believe there are enough people around who are experienced enough to provide reliable RTOSes, etc. but we are short of people who understand them well enough to use them safely. We don't need a new paradigm but there are two things we do need: better education and training all round and RTOSes which are easier to use safely and difficult to abuse. I'm working, in my small way, on both!

    Peter Bushell
    http://software-integrity.com/

  3. Paul Newton says:

    What is the popular alternative to multi-threading? Round-robin service. So instead of a dispatcher switching the bolt around in real time, we have a sequential plate spinner trying to put that bolt in each slot… The obvious alternative is extreme multi-processing: either with many smaller processors or with fancy micro-controllers that have complex dedicated subsystems, each running some kind of program on a "kernel" (as it's called on the Infineon XC16x family).

  4. TennesseeCarmelVeilleux says:

    Right now, every new passenger aircraft that gets designed is equipped with many different avionics subsystems, the most important of which run on a microprocessor.

    For instance, the Airbus A380, currently the largest passenger aircraft in the world, has a main avionics stack composed of a real-time network of 8 computers, each running 3 processors in a triple-redundant configuration. The flight characteristics of this aircraft depend on the reliable execution of more than 20 different software systems, INDEPENDENTLY developed.

    What keeps this beast running is the concept of Integrated Modular Avionics (IMA). IMA is an integration paradigm where software applications are safely partitioned, most often using the ARINC-653 software standard on DO-178B-certified operating systems. DO-178B is a mature software development process guideline that must be followed by anyone wishing to have avionics software certified for use in a plane by the FAA. It specifies the level of testing, QA and software architecture planning that equipment manufacturers must follow to ensure all possible software fault mitigation has been included, according to the application's safety level.

    ARINC-653 is a standard defining both an Application Executive and a software API for space/time partitioning of multiple applications and the communication between them. Each "application" is basically segregated in a memory partition (space partitioning) and is allotted STATIC time slices for its processing, according to a predetermined schedule (time partitioning). ARINC-653 provides sampling and queuing ports for safe inter-application communication, which are also statically defined.

    Using these mechanisms, every "application" can run its own (certified) RTOS or run on bare metal, and each thinks it "owns" the entire computer, albeit in a sandbox. Whenever something goes astray, several levels of health monitoring are provided to give deterministic resolution paths to the system.

    IMA and ARINC-653 are used in the Boeing 787, Airbus A380 and several other aircraft from different manufacturers. These planes carry countless passengers. Together with hardware redundancy and strict integration guidelines, these planes have extremely low MTBF and are trusted by worldwide aviation authorities to fly in all airspaces.

    It's interesting to think that the multithreaded model of computing is being adopted, along with what is equivalent to virtual machine partitioning, in some of the world's most safety-critical systems. People designing these systems are not super-human. Rather, they spend significant amounts of time and effort to specify, architect, develop and then test these systems to ensure safety. The guidelines and methods exist. It's only a question of money.

    The consumer electronics market cannot bear the budgetary pressure of architecting and testing systems with the level of quality expected of safety-critical applications (automotive, space, aviation, medical, factory automation).

    The methods are here, the PRODUCTS are here, the engineers are here. It's just that economic pressure prevents most projects from spending the time required to do it right.

    For more info:
    * http://www.aviationtoday.com/av/issue/feature/8420.html
    * http://www.windriver.com/products/platforms/safety_critical_arinc_653/
    * http://avi.pennnet.com/display_article/370613/143/ARTCL/none/NWPRD/1/SYSGO-combines-PikeOS-RTOS-and-AFDX-on-Freescale%27s-dual-core-PQIII/
    * http://www.lynuxworks.com/rtos/rtos-178.php
    * http://air.di.fc.ul.pt/air-ii/?Publications

    – Tennessee Carmel-Veilleux, B.Eng, EIT
    Electrical Engineering Masters Student, ETS (http://www.etsmtl.ca)
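
To make the idea of static time partitioning in the comment above more concrete, here is a minimal sketch of a fixed, table-driven schedule. It is not the ARINC-653 API and is not taken from any real avionics system: the partition names, the 100 ms frame length, and the `Window`, `validate` and `partition_at` helpers are all hypothetical, chosen only to illustrate that the owner of the CPU at any instant is decided by a table fixed ahead of time rather than by run-time events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    partition: str   # hypothetical partition name
    start_ms: int    # offset from the start of the repeating frame
    length_ms: int   # fixed duration of this window


# Assumed frame length; the whole table repeats every MAJOR_FRAME_MS.
MAJOR_FRAME_MS = 100

# Hypothetical schedule, fixed at configuration time and never changed at run time.
SCHEDULE = (
    Window("flight_control", 0, 40),
    Window("navigation", 40, 30),
    Window("display", 70, 20),
    # 90..100 ms is deliberately left idle as spare time.
)


def validate(schedule, major_frame_ms):
    """Raise ValueError unless windows are ordered, non-overlapping and fit the frame."""
    end = 0
    for w in schedule:
        if w.length_ms <= 0 or w.start_ms < end:
            raise ValueError(f"window for {w.partition} overlaps or is empty")
        end = w.start_ms + w.length_ms
    if end > major_frame_ms:
        raise ValueError("schedule exceeds the major frame")


def partition_at(t_ms, schedule=SCHEDULE, major_frame_ms=MAJOR_FRAME_MS):
    """Return the partition that owns the CPU at time t_ms, or None if the slot is idle."""
    offset = t_ms % major_frame_ms
    for w in schedule:
        if w.start_ms <= offset < w.start_ms + w.length_ms:
            return w.partition
    return None


# The table is checked once, before the system would start running.
validate(SCHEDULE, MAJOR_FRAME_MS)
```

Because the answer to "who runs at time t" depends only on the table, a misbehaving partition in this sketch cannot steal time from its neighbours; a real system would also rely on hardware timers and memory protection to enforce this.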

  5. Luke Teyssier says:

    Disappointing, Michael. Or maybe you were trying to be provocative. The part you left out is "What if that bolt was designed by a high-school drop-out who used a single napkin for all of his analysis and whose entire formal training included reading a few books on AutoCAD?" Multi-threading is not the problem. The problem is that multi-threaded systems are more likely to be complex, and hard real-time multi-threaded systems require discipline, skill, and experience to build correctly. The ad-hoc "good enough to ship" development style of "the source code is the design document" that leads to PC software crashing once a day just won't cut it. Furious code hacking based on napkin sketches is useful in some applications. This is not one of them. A capable team of practitioners can architect, design, and build absolutely reliable systems, but it's not simple, easy, or cheap. You __MUST__ be willing to commit to "This system will not ship until we can prove that all the important details are correct". Read about Rate Monotonic Analysis, for example: http://www.sei.cmu.edu/library/abstracts/reports/91tr006.cfm, or read "A Practitioner's Handbook for Real-Time Analysis: Guide to Rate Monotonic Analysis for Real-Time Systems" if you can get a copy. If someone is architecting a mission-critical hard-real-time system for you, they should be familiar, if not expert, with all the concepts in this book.
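
As a small illustration of the kind of analysis the comment above points to, the sketch below applies the classic Liu and Layland utilization test from rate monotonic theory. It assumes independent periodic tasks, deadlines equal to periods, and fixed priorities assigned by rate. The task sets and the function names are hypothetical, and the test is only sufficient: a task set that fails it may still be schedulable and would need a more exact analysis.

```python
def utilization(tasks):
    """Total CPU utilization of (worst_case_exec_time, period) pairs in the same time unit."""
    return sum(c / t for c, t in tasks)


def liu_layland_bound(n):
    """Utilization bound n * (2**(1/n) - 1) for n rate-monotonic tasks."""
    return n * (2 ** (1 / n) - 1)


def passes_rm_utilization_test(tasks):
    """True if the task set meets the sufficient (not necessary) utilization test."""
    return utilization(tasks) <= liu_layland_bound(len(tasks))


# Hypothetical task sets: (worst-case execution time, period), both in milliseconds.
light_load = [(1, 4), (2, 8), (3, 20)]   # utilization 0.65, below the 3-task bound
heavy_load = [(2, 4), (3, 6)]            # utilization 1.0, above the 2-task bound

light_ok = passes_rm_utilization_test(light_load)
heavy_ok = passes_rm_utilization_test(heavy_load)
```

Here `light_ok` is true, so the first set is guaranteed to meet its deadlines under the stated assumptions, while `heavy_ok` is false, which means the simple test is inconclusive for the second set.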

  6. poolorpond says:

    Although the arguments above about the importance of well-trained engineers and properly followed processes are certainly valid, I think the general problem is more insidious. To borrow your scenario, too much time is spent trying to invent that morph-able bolt, when the real thing to do is what has been done in the "bolt world." Over time, certain sizes of bolts have emerged that have become standard for one application or another. Once standardized, they are generally readily available, cheap and, most importantly, they work… every time.

    In my experience, the large majority of embedded projects are hamstrung by legacy requirements and implementation. This generally leads to a morphing bolt scenario. Because the basic design and implementation went the way of Luke's description above (succinctly stated, "the source code is the design document"), the system often ends up having little upfront design effort. The notion of code as the design document, in and of itself, does not necessarily represent a bad thing (divergence between design specs and code can be just as bad as no design doc at all). The bad thing is that most systems are designed with source code rather than a thought-out architecture.

    If the architecture is thought out and planned well, one should be able to determine what kinds of bolts are needed and build them. Implementation is unimportant with regard to which language is used or whether the paradigm employed is object-oriented, functionally decomposed or auto-generated. What is important is to have well-defined bolts and the holes that receive them. All too often this evolves rather than being planned. It is this scenario that often engenders the need for a universal bolt.

    As with many things in the engineering world, the “best” solution is often sought. Unfortunately, this often stands in the way of getting anything done. This has hampered the multi-core world considerably, as people keep looking for a language or a method that will isolate the programmer from the underlying architecture. This is a good goal, but it is impractical in the short term because we don’t really know enough about how to make that translation, largely, I think, because we don’t have enough working examples to understand what translation needs to be made. Rather, we should employ solid modular designs in which the modules are built to run independently of the processor they are executing on. We would then begin to acquire real examples of how multi-core technologies can be employed efficiently for optimal implementations. From these examples, we may begin to understand what parts of a multi-core architecture can be hidden from the developer and what parts the developer needs to know about.

    We don’t need a new paradigm yet; first we need to use effectively the paradigms that have been tested, tried and true.

  7. Ben Voigt says:

    Tennessee said "these planes have extremely low MTBF". No, thankfully they have very high MTBF. The rest of his information is very good though.
