Posts Tagged ‘programming’

Robust Embedded Software Architecture in 5 Easy Steps

Thursday, September 17th, 2009 Michael Barr

Over the past few years, I’ve spent a large amount of my time consulting with and training software development teams that are in the midst of rearchitecture. These teams have already developed the firmware inside successful long-lived products or product families. But to keep moving forward, reduce bugs, and speed new feature development, they need to take the best of their old code and plug it into a better firmware architecture.

In the process, I have collected substantial anecdotal evidence that few programmers, technical managers, or teams truly understand what good firmware architecture is, how to achieve it, or even how to recognize it when they see it. That includes the most experienced individual developers on a team. Yet, despite the fact that these teams work in a range of very different industries (including safety-critical medical devices), the rearchitecture process is remarkably similar from my point of view. And there are numerous ways that our clients’ products and engineering teams would have benefited from getting their firmware architecture right from the beginning.

Though learning to create solid firmware architecture and simultaneously rearchitecting legacy software may take a team months of hard work, five key steps are easily identified. So whether you are designing firmware architecture from scratch for a new product or launching a rearchitecture effort of your own, here is a step-by-step process to help your team get started on the right foot.

Step 1: Identify the Requirements

Before we can begin to (re)architect an embedded system or its firmware, we must have clear requirements. Properly written requirements define the WHAT of a product. WHAT, specifically, does the product do for the user? For example, if the product is a ventilator, the list of WHAT it does may include a statement such as:

If power is lost during operation, the ventilator shall resume operation according to its last programmed settings within 250 ms of power up.

Note that a properly written requirement is silent about HOW this particular part of the overall WHAT is to be achieved. The implementation could be purely electronics or a combination of electronics and firmware; the firmware, if present, might contain an RTOS or it might not. From the point of view of the requirement writer, then, there may as well be a gnome living inside the product that fulfills the requirement. (So long as the gnome is trustworthy and immortal, of course!)

Each requirement statement must also be two other things: unambiguous and testable. An unambiguous statement requires no further explanation. It is as clear and as concise as possible. If the requirement includes a mathematical model of expected system behavior, it is helpful to include the equations.

Testability is key. If a requirement is written properly, a set of tests can be easily constructed to verify that requirement is met. Decoupling the tests from the particulars of the implementation, in this manner, is of critical importance. Many organizations perform extensive testing of the wrong stuff. Any coupling between the test and the implementation is problematic.

A proper set of requirements is a written list of statements each of which contains the key phrase “… the [product] shall …” and is silent about how it is implemented, unambiguous, and testable. This may seem like a subject unrelated to architecture, but too often it is poor requirements that constrain architecture. Thus good architecture depends in part on good requirements.

Step 2: Distinguish Architecture from Design

Over the years, I have found that many engineers (as well as their managers) struggle to separate the various elements or layers of firmware engineering. For example, Netrino is barraged with requests for “design reviews” that turn out to be “code reviews” because the customer is confused about the meaning of “design”. This even happens in organizations that follow a defined software development lifecycle. We need to clear this up.

The architecture of a system is the outermost layer of HOW. Architecture describes persistent features; the architecture is hard to change and must be got right through careful thinking about intended and permissible uses of the product. By analogy, an architect describes a new office building only very broadly. A scale model and drawings show the outer dimensions, foundation, and number of floors. The number of rooms on each floor and their specific uses are not part of the architecture.

Architecture is best documented via a collection of block diagrams, with directional arrows connecting subsystems. The system architecture diagram identifies data flows and shows partitioning at the hardware vs. firmware level. Drilling down, the firmware architecture diagram identifies subsystem-level blocks such as device drivers, RTOS, middleware, and major application components. These architectural diagrams should not have to change even as roadmap features are added to the product—at least for the next few years. Architectural diagrams should also pass the “six pack test,” which says that even after drinking a six pack every member of the team should still be able to understand the architecture; it is devoid of confusing details and has as few named components as possible.

The design of a system is the middle layer of HOW. The architecture does not include function or variable names. A firmware design document identifies these fine-grained details, such as the names and responsibilities of tasks within the specific subsystems or device drivers, the brand of RTOS (if one is used), and the details of the interfaces between subsystems. The design documents class, task, function/method, parameter, and variable names that must be agreed upon by all implementers. This is similar to how a design firm hired by the tenant of a floor in the office building describes the interior and exterior of the new building in finer detail than the architect. Designers locate and name rooms and give them specific purposes (e.g., cube farm, corner office, or conference room).

An implementation is the lowest layer of HOW. There need be no document, other than the source code or schematics, to describe the implementation details. If the interfaces are defined sufficiently at the design level above, individual engineers are able to begin implementation of the various component parts in parallel. This is similar to the way that a carpenter, plumber, and electrician work in parallel in nearby space, applying their own judgment about the finer details of component placement, after the design has been approved by the lessee.

Of course, there is architecture and there is good architecture. Good architecture makes the most difficult parts of the project easy. These difficult parts vary in importance somewhat from industry to industry, but always center on three big challenges that must be traded off against each other: meeting real-time deadlines, testing, and diversity management. Addressing those issues comprises the final three steps.

Step 3: Manage CPU Time

Some of your product’s requirements will mention explicit amounts of time. For example, consider the earlier ventilator requirement about doing something “within 250 ms of power up.” That is a timeliness requirement. “Within 250 ms of power up” is just one deadline for the ventilator implementation team to meet. (And something to be tested under a variety of scenarios.) The architecture should make it easy to meet this deadline, as well as to be certain it will always be met.

Most products feature a mix of non-real-time, soft-real-time, and hard-real-time requirements. Soft deadlines are usually the most challenging to define in an unambiguous manner, test, and implement. For example, in set-top box design it may be acceptable to drop a frame of video once in a while, but never more than two in a row, and never any audio, which arrives in the same digital input stream. The simplest way to handle soft deadlines is to treat them as hard deadlines that must always be met.

With deadlines identified, the first step in architecture is to push as many of the timeliness requirements as possible out of the software and onto the hardware. Figure 1 shows the preferred placement of real-time functionality. As indicated, an FPGA or a dedicated CPU is the ideal place to put real-time functionality (irrespective of the length of the deadline). Only when that is not possible, should an interrupt service routine (ISR) be used instead. And only when an ISR won’t work should a high-priority task be used.

Where to Put Real-Time Functionality

Figure 1. Where to Put Real-Time Functionality
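The ISR-plus-task split described above can be sketched in C. This is a minimal illustration, not a definitive implementation: all names are hypothetical, and the "ISR" is an ordinary function so the pattern can be exercised on a host machine rather than registered with a real interrupt controller.

```c
#include <stdint.h>

/* Sketch of the split: the ISR does only the deadline-critical work
 * (capture the sample), while a lower-priority task does everything
 * else later. Names (adc_isr, task_poll) are hypothetical. */

#define BUF_SIZE 8

static volatile uint16_t sample_buf[BUF_SIZE]; /* shared ISR -> task */
static volatile uint8_t  head = 0;             /* written only by the ISR  */
static uint8_t           tail = 0;             /* written only by the task */

/* Deadline-critical work: grab the sample and get out fast. */
void adc_isr(uint16_t raw_sample)
{
    sample_buf[head] = raw_sample;
    head = (uint8_t)((head + 1) % BUF_SIZE);
}

/* Non-real-time work: drains the buffer whenever the task gets CPU time.
 * Returns the number of samples processed on this call. */
int task_poll(uint32_t *accumulator)
{
    int processed = 0;
    while (tail != head) {
        *accumulator += sample_buf[tail];
        tail = (uint8_t)((tail + 1) % BUF_SIZE);
        processed++;
    }
    return processed;
}
```

Because the ISR touches only the buffer and the head index, its worst-case execution time is tiny and easy to bound, which is exactly what the schedulability analysis in the next paragraphs needs.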

Keeping the real-time functionality separate from the bulk of the software is valuable for two important reasons. First, it simplifies the design and implementation of the non-real-time software. With timeliness requirements architected out of the bulk of the software, code written by novice implementers can be used without affecting user safety.

A second advantage of keeping the real-time functionality together is it simplifies the analysis involved in proving all deadlines are always met. If all of the real-time software is segregated into ISRs and high-priority tasks, the amount of work required to perform rate monotonic analysis (RMA) is significantly reduced. Additionally, once the RMA analysis is completed, it need not be revised every time the non-real-time code is tweaked or added to.

Step 4: Design for Test

Every embedded system needs to be tested. Generally, it is also valuable or mandatory that testing be performed at several levels. The most common levels of testing are:

System Tests verify that the product as a whole meets or exceeds the stated requirements. System tests are generally best developed outside of the engineering department, though they may fit into a test harness developed by engineers.

Integration Tests verify that a subset of the subsystems identified in the architecture diagrams interact as expected and produce reasonable outcomes. Integration tests are generally best developed by a testing group or person within software engineering.

Unit Tests verify that individual software components identified at the intermediate design level perform as their implementers expect. That is, they test at the level of the public API the component presents to other components. Unit tests are generally best developed by the same people that write the code under test.

Of the three, system tests are most easily developed, as those test the product at its exposed hardware interfaces to the world (e.g., does the dialysis machine perform as required). Of course, a test harness may need to be developed for engineering and/or factory acceptance tests. But this is generally still easier than integration and unit tests, which demand additional visibility inside the device as it operates.

To make the development, use, and maintenance of integration and unit tests easy it is valuable to architect the firmware in a manner compatible with a software test framework. The single best way to do this is to architect the interactions between all software components at the levels you want to test so they are based on publish-subscribe event passing (a.k.a., message passing).

Interaction based on a publish-subscribe model allows a lightweight test framework like the one shown in Figure 2 to be inserted alongside the software component(s) under test. Any interface between the test framework and the outside world, such as a serial port, provides an easy way to inject or log events. A test engine on the other side of that communications interface can then be designed to accept test “scripts” as input, log subscribed event occurrences, and off-line check logged events against valid result sequences. Adding timestamps to the event logger and scripting language features like delay(time) and waitfor(event) significantly increases testing capability.

A Test Framework Based on a Publish-Subscribe Event Bus

Figure 2. A Test Framework Based on a Publish-Subscribe Event Bus
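A publish-subscribe bus of the kind shown in Figure 2 can be quite small. The sketch below assumes a single-threaded dispatch loop and hypothetical names throughout; the point is that a test framework subscribes to events to log them and publishes events to inject stimuli, all without modifying the components under test.

```c
#include <stddef.h>

/* Minimal publish-subscribe event bus (all names hypothetical).
 * Components interact only by publishing events and subscribing
 * handlers, so a test framework can sit alongside them invisibly. */

#define MAX_SUBSCRIBERS 8

typedef struct {
    int   type;     /* event type identifier */
    void *payload;  /* event-specific data   */
} event_t;

typedef void (*handler_t)(const event_t *);

static struct { int type; handler_t fn; } subs[MAX_SUBSCRIBERS];
static size_t n_subs = 0;

/* Register a handler for one event type. Returns 0 on success. */
int subscribe(int type, handler_t fn)
{
    if (n_subs >= MAX_SUBSCRIBERS)
        return -1;
    subs[n_subs].type = type;
    subs[n_subs].fn   = fn;
    n_subs++;
    return 0;
}

/* Deliver an event to every subscriber registered for its type. */
void publish(const event_t *ev)
{
    for (size_t i = 0; i < n_subs; i++)
        if (subs[i].type == ev->type)
            subs[i].fn(ev);
}

/* A test-framework subscriber that simply counts matching events;
 * a real framework would timestamp and log them instead. */
static int log_count = 0;
static void test_logger(const event_t *ev) { (void)ev; log_count++; }
```

A test engine on the other side of a serial port would call publish() to run a script step and read the logger's output to verify the resulting event sequence.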

It is unfortunate that the publish-subscribe component interaction model is at odds with proven methods of analyzing software schedulability (e.g., RMA). The sheer number of possible message arrival orders, queue depths, and other details make the analysis portion of guaranteeing timeliness difficult and fragile against minor implementation changes. This is, in fact, why it is important to separate the code that must meet deadlines from the rest of the software. In this architecture, though, the real-time functionality remains difficult to test other than at the system level.

Step 5: Plan for Change

The third key consideration during the firmware architecture phase of the project is the management of feature diversity and product customizations. Many companies use a single source code base to build firmware for a family of related products. For example, consider microwave ovens; though one high-end model may feature a dedicated “popcorn” button, another may lack this. The architecture of any new product’s firmware will also soon be tested and stretched in the direction of foreseeable planned feature additions along the product road map.

To plan for change, you must first understand the types of changes that occur in your specific product. Then architect the firmware so that those sorts of changes are the easiest to make. If the software is architected well, feature diversity can be managed through a single software build with compile-time and/or run-time behavioral switches in the firmware. Similarly, new features can be added easily to a good architecture without breaking the existing product’s functionality.

An architectural approach that handles product family diversity particularly well is one in which groups of related software components are collected into “packages”. Each such package is effectively an internal widget from which larger products can be built. The source code and unit tests for each particular package should be maintained by a team of “package developers” focused primarily on their stability and ease of use.

Teams of “product developers” combine stable releases of packages that contain the features they need, customize each as appropriate (e.g., via compile- or run-time mechanisms, or both) to their particular product, and add product-specific “glue.” Typically, all of the products in a related product family are built upon a common “Foundation” package (think API). For example a Model X microwave might be built from Foundation + Package A + Package B; whereas Model Y might consist of Foundation + A’ + B + C, where package A’ is a compile-time variant of package A and package C contains optional high-level cooking features, such as “Popcorn.”
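A compile-time variant of the microwave example might look like the sketch below. Everything here is hypothetical (the feature flag, the names, the modes); the point is that one shared source file builds both the base model and the "Popcorn" model, with the build configuration, not the code, deciding which features are present.

```c
#include <string.h>

/* Hypothetical sketch of compile-time feature diversity. The Model X
 * build defines FEATURE_POPCORN in its makefile/build configuration;
 * the base-model build does not. One source tree, two products. */

#ifndef FEATURE_POPCORN
#define FEATURE_POPCORN 1   /* defined here only so the sketch compiles */
#endif

typedef struct {
    const char *name;       /* label shown on the front panel */
    int         seconds;    /* default cook time              */
} cook_mode_t;

static const cook_mode_t cook_modes[] = {
    { "Reheat",  60  },
    { "Defrost", 180 },
#if FEATURE_POPCORN
    { "Popcorn", 150 },     /* present only in builds that set the flag */
#endif
};

#define NUM_COOK_MODES (sizeof(cook_modes) / sizeof(cook_modes[0]))
```

Run-time switches (reading a model ID from EEPROM, say) trade a single binary for a larger one; compile-time switches like this keep each product's image minimal, at the cost of more build configurations to test.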

Using this approach in a large organization, a new product built from a selection of stable bug-free packages can be brought to market quickly—and all products share an easy upgrade path as their core packages are improved. The main challenge in planning for change of this sort is in striking the right balance between packages that are too small and packages that are too large. Like many of the details of firmware architecture, achieving that balance for a number of years is more of an art than a science.

Next Steps

I hope the five-step “architecture road map” presented here is useful to you. I plan to drill down into more of the details in articles and blog posts over the coming months. Meanwhile your constructive feedback is welcome via the comment form or e-mail.

Take the Mutual Exclusion Challenge

Thursday, September 10th, 2009 Michael Barr

If you’ve been reading my articles or blog for a while, you’ve probably noticed a few pieces about the differences between mutexes and semaphores. The most concise presentation of these issues that I’ve made was published last year in Embedded Systems Design. That article, called Mutexes and Semaphores Demystified, is also available at http://www.netrino.com/Embedded-Systems/How-To/RTOS-Mutex-Semaphore.

A new blogger in the embedded software area (Niall Cooling) is revisiting the mutex vs. semaphore subject and reading that caused me to come across a few other sources on the subject. (You can find his blog at http://www.feabhas.com/blog.) The “Toilet Example” that he cites via a link to another website contains one of the worst explanations of the use of semaphores I have seen. I don’t even know where to start rewriting it.

So I challenge you, dear RSS subscribers, can you individually or collectively (a) identify the flaws in the Toilet Example explanation at http://koti.mbnet.fi/niclasw/MutexSemaphore.html and (b) propose a proper implementable solution by way of a rewrite? I suggest we do this via the comment mechanism provided at the end of this blog post.
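As a baseline refresher before you tackle the challenge (this sketch doesn't answer it), here is the core distinction in POSIX terms: a mutex is locked and unlocked by the same thread to protect a shared resource, whereas a semaphore is posted by one thread and waited on by another as a signal.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

/* Baseline sketch using POSIX APIs (not an RTOS): the mutex guards the
 * shared counter and is locked/unlocked by the SAME thread; the
 * semaphore is posted by the worker and waited on by ANOTHER thread
 * purely as a "work is done" signal. */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t ready;
static int shared_counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);      /* mutual exclusion: same thread ... */
    shared_counter++;
    pthread_mutex_unlock(&lock);    /* ... locks and unlocks             */
    sem_post(&ready);               /* signaling: wake the waiter        */
    return NULL;
}

int run_demo(void)
{
    pthread_t t;
    sem_init(&ready, 0, 0);         /* starts at 0: nothing signaled yet */
    pthread_create(&t, NULL, worker, NULL);
    sem_wait(&ready);               /* block until the worker signals    */
    pthread_join(t, NULL);
    return shared_counter;
}
```

Notice that no thread ever "unlocks" the semaphore it waited on; that asymmetry is exactly what the Toilet Example muddles.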

To C++ or Not to C++ – That is the question…

Friday, August 28th, 2009 Michael Barr

There are raging discussions about my latest column, Real Men Program in C, going on at Techonline.com and Reddit.com. Though it was never my intent to malign C++, some of the forum participants have headed off in that direction. Even Dan Saks has been compelled to weigh in, in his latest column.

For the record, I agree with Dan Saks about the following: (1) that C++ has some strengths vs. C and (2) that the top two of these are “classes with private access control” and “initialization by constructors.” I have developed embedded software in both C and C++. Both languages can be abused, especially by those who don’t have a full understanding of all the features they are using. However, both languages can also be used to develop reliable embedded software.

Real Men Program in C

Monday, August 3rd, 2009 Michael Barr

A couple of months ago, I ate a pleasant lunch with a couple of young entrepreneurs in Baltimore. The two are recent computer science graduates from Johns Hopkins University with a fast-growing consulting business. Their firm specializes in writing software for web-centric databases using Ruby on Rails, a web framework built on the Ruby language. As we discussed many of the similarities and a few of the differences in our respective businesses over lunch, one of the young men made a comment I won’t soon forget, “Real men program in C.”[1]

Clever though he is, the young man admitted he wasn’t making that quote up on the spot. That “real men program in C” is part of a lingo he and his fellow computer science students developed while categorizing the usefulness of the various programming languages available to them. Exploring a bit, I learned the quiche-like phrase assigns both a high difficulty factor to the C language and a certain age group to C programmers. Put simply, C was too hard for programmers of their generation to bother mastering.

Is C a dead language?

For today’s computer science students, learning C is like taking an elective class in Latin. But C is anything but history and not at all a dead language. And C remains the dominant language in the fast growing field of embedded software development. Figure 1 summarizes 13 years of relevant annual survey data collected by the publishers of Embedded Systems Design.

The discontinuity after 2004 is necessary because the phrasing of the question and permissible answers were changed in 2005. Prior to 2005, the question was phrased, “For your embedded development, which of the following programming languages have you used in the last 12 months?” In 2005, the phrasing became, “My current embedded project is programmed mostly in ____?” Prior to 2005, multiple selections were permitted. This meant that the aggregate data was allowed to sum to over 100% (the average sum was 209%, implying many respondents made two or more selections).

The biggest impact of the survey change from multiple selections to one selection was on the numbers reported for assembly language. Prior to 2005, assembly language was present in an average of 62% of all responses to this question. This should not be surprising, as it is well known that most firmware projects require at least small quantities of assembly code.

After 2004, assembly becomes a minor player–averaging just 7% of all responses across five survey years.[2] This data more closely represents the percentage of projects written mostly or entirely in assembly. The data also show a decline in the popularity of that programming style, from 8% in 2005 to 5% in 2009.

Turning our attention back to C, give Figure 1 a new look with an eye toward the dominance of that language throughout the 13 years of survey data. C was the most used language for embedded software development in the 1997 survey and in the 2009 survey and in every year in between. C has dominated when multiple languages could be chosen (averaging 81%) as clearly as when only the one most-used language could be chosen (averaging 57%).

Remarkably, C appears to have spent the last five years stealing share from assembly as well as from C++. This recent C vs. C++ data defies the expected movement toward ever higher-level languages. C++ is clearly a part of many embedded software projects–and the primary language for about 27% of those coded in the last five years. But my read on the entire 13-year data set is that use of C++ increased rapidly in the late 1990s, peaked in 2001, and has been stable to slightly declining since.[3]

The bottom line is that embedded programmers aren’t going to stop using C anytime soon. There are several reasons for this. First, C compilers are available for the vast majority of 8-, 16-, and 32-bit CPUs. Second, C offers just the right mix of low-level and high-level language features for programming at the processor and driver level. Until the use of C starts to turn down in future such surveys, C programming skills will remain important.

Ten billion and counting

Of course, C will not survive as an important programming language if it is widely used by a shrinking subset of programmers. For C to remain important, the number of embedded software developers must not shrink. For better or worse, I believe the opposite is happening. Around 98% of the new CPUs produced each year are embedded. And the number of new CPUs per year is on a long-term upward trend.

Figure 2 shows the rise in the number of new CPUs per year next to the Nasdaq Composite stock index. As anyone can see, the number of new CPUs per year more than doubled in the decade shown. By comparison, the Nasdaq index was down in the same interval. There is a palpable disconnect between the growth in numbers of embedded computers and the prices of technology stocks generally.

Based on this data and various other observations over the past 15 years, I conclude that (a) the practice of embedding software into products is on a fast growth curve, and (b) the number of people writing embedded software is growing too. It is important to note that 8-bit processor sales remain a large and growing segment and that these tend to require only one- to two-person programming teams. As processors become cheaper, new applications emerge.

The embedded software education gap

At the same time that C becomes increasingly important to the world, fewer learn how to use that language in school. This is part of a larger “education gap” affecting all organizations that make embedded systems. American institutions of higher learning largely fail to teach the practical skills necessary to design and implement reliable embedded software. From the importance of C’s volatile keyword to reentrancy to task prioritization within real-time operating systems to state machine implementation, the trustworthy development processes and firmware architectures for embedded software must be learned on the job.
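One of the on-the-job lessons listed above can be made concrete. The sketch below shows the classic volatile pitfall; the "ISR" is an ordinary function here (names hypothetical) so the pattern can be demonstrated on a host machine, but the declaration is exactly what real firmware needs for a flag shared with an interrupt handler.

```c
#include <signal.h>

/* A flag shared between an ISR and the main loop must be declared
 * volatile: without it, the compiler may cache the flag in a register
 * and the polling loop below would never observe the ISR's write. */

static volatile sig_atomic_t data_ready = 0;

/* In real firmware this would be wired to, e.g., a UART interrupt. */
void rx_isr(void)
{
    data_ready = 1;
}

/* Poll the flag up to max_polls times; returns its final value. */
int wait_for_data(int max_polls)
{
    int polls = 0;
    while (!data_ready && polls < max_polls)
        polls++;    /* with volatile, each test re-reads memory */
    return data_ready;
}
```

It is precisely this kind of detail, invisible in a CS curriculum built on desktop Java, that today's graduates must pick up on the job.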

Figure 3 shows the education gap in a Venn diagram, a tool most of us did learn in university. Only a little bit of what is studied in electrical engineering curricula is applicable to embedded software development–typically in a lab class requiring assembly programming near the end. And only a little bit more of the applicable stuff is taught in computer science curricula–typically in early classes on computer architecture and now mostly elective courses in C and C++. At least computer science students are taught some of what we must know about software lifecycle management.

Though there has been a positive trend toward the addition of computer engineering degrees at many universities, these tend to be too little too late. From what I have seen, these CE programs mostly draw courses and professors from the existing EE and CS departments–adding little new content specific to embedded software development. Though the CE program is aimed squarely at educating chip designers and system software developers (a field that includes more than just embedded software), at least one program I am familiar with gives its freshmen the choice between a C and a Java track!

Unfortunately, on-the-job learning is also poorly organized in embedded software. It is possible, even common, to start writing firmware with only an EE degree and to begin by making the mistakes of any novice, to receive little code review if any, and to ship a “glitchy” product, only to be rewarded with a new product to work on. Where is the critical feedback that a bug you created frustrated or even injured a user?

Solutions needed

If you accept from the evidence I’ve presented here that C shall remain important for the foreseeable future and that embedded software is of ever-increasing importance, then you’ll begin to see trouble brewing. Although they are smart and talented computer scientists, my younger friends don’t know how to competently program in C. And they don’t care to learn.

But someone must write the world’s ever-increasing quantity of embedded software. New languages could help, but will never be retrofitted onto all the decades-old CPU architectures we’ll continue to use for decades to come. As turnover is inevitable, our field needs to attract a younger generation of C programmers.

What is the solution? What will happen if these trends continue to diverge?

Notes

1. I’m sure he wasn’t trying to be sexist. Real women surely program in C, too!

2. The 7% average is a pretty strong showing for assembly. By comparison, all other languages averaged less than 5% combined during the same period, including Java (2%) and Basic (1%).

3. The use of Java has never been more than a blip in embedded software development, and peaked during the telecom bubble–in the same year as C++.

4. While the numbers for the Nasdaq index are accurate, the numbers of CPUs per year are my interpolations between a small number of data points of varying reliability.

Firmware Disasters

Tuesday, June 23rd, 2009 Michael Barr

First, an Airbus A330 fell out of the sky. Then two D.C. Metro trains collided. Several hundred people have been killed and injured in these disastrous system failures. Did bugs in embedded software play a role in either or both disasters?

An incident on an earlier (October 2006) Airbus A330 flight may offer clues to the crash of Air France 447:

Qantas Flight 72 had been airborne for three hours, flying uneventfully on autopilot from Singapore to Perth, Australia. But as the in-flight dinner service wrapped up, the aircraft’s flight-control computer went crazy. The plane abruptly entered a smooth 650-ft. dive (which the crew sensed was not being caused by turbulence) that sent dozens of people smashing into the airplane’s luggage bins and ceiling. More than 100 of the 300 people on board were hurt, with broken bones, neck and spinal injuries, and severe lacerations splattering blood throughout the cabin. (Article, Time Magazine, June 3, 2009)

Authorities have blamed a pair of simultaneous computer failures for that event in the fly-by-wire A330. First, one of three redundant air data inertial reference units (ADIRUs) began giving bad angle of attack (AOA) data. Simultaneously, a voting algorithm intended to handle precisely such a failure in one of the three units, by relying only on the two whose data agreed, failed to work as designed; the flight computer instead made decisions on the basis of the one failed ADIRU alone!

(A later analysis by Airbus “found data fingerprints suggesting similar ADIRU problems had occurred on a total of four flights. One of the earlier instances, in fact, included a September 2006 event on the same [equipment] that entered the uncommanded dive in October [2006].” Ibid.)
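The 2-of-3 voting idea that failed here is simple to state in code. The sketch below only illustrates the principle; the actual Airbus algorithm is far more involved, and the function name is hypothetical.

```c
/* 2-of-3 voting by median selection: with three redundant sensors,
 * take the middle value so a single wild reading is outvoted by the
 * two readings that (approximately) agree. Principle only; real
 * avionics voters also track persistence, staleness, and validity. */

double vote_median3(double a, double b, double c)
{
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}
```

The Qantas 72 failure was not in this arithmetic but in the surrounding logic that decided which unit's data reached it; the lesson is that the voter is only as trustworthy as the plumbing that feeds it.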

Much of the attention in the publicly disclosed details of the Air France 447 crash has focused on the failure of one of several air speed indicators. Were there three of those as well? If so, was the same flight computer to blame for failing to recognize which to trust and which was unreliable?

It is far too early in the investigation of yesterday’s collision between two D.C. Metro red line trains, in which a stopped train was rear-ended and heavily damaged by a moving train on the same track, to place blame. But a WashingtonPost.com article headlined “Collision was Supposed to be Impossible” says it all:

Metro was designed with a fail-safe computerized signal system that is supposed to prevent trains from colliding.

and

During morning and afternoon rush hours, all trains except longer eight-car trains typically operate in automatic mode, meaning their movements are controlled by computerized systems and the central Operations Control Center. Both trains in yesterday’s crash [about 5pm] were six-car trains. (Article, Washington Post, June 23, 2009)

Are bugs in embedded software to blame for these two disasters? You can bet the lawyers are already looking into it.