Posts Tagged ‘bugs’

Proposed Rule Changes for Embedded C Coding Standard

Wednesday, June 20th, 2018 Michael Barr

My book Embedded C Coding Standard began as an internal coding standard of a consulting company and was first published in 2008 by that company (Netrino). In 2013, the book’s cover was given a new look and the standard became known as “Barr Group’s Embedded C Coding Standard”. A 2018 update to the book will be released soon and this will be the first time the substance of the standard has changed in over a decade.

The primary motivation for making changes is to better harmonize the rules with the MISRA-C Guidelines. The older Barr Group standard made reference to MISRA C:2004, which was superseded by MISRA C:2012. The few known direct conflicts between BARR-C:2013 and MISRA C (stemming from our earlier embrace of ISO C99) were effectively eliminated with their 2012 update. But we wanted to do more than just remove those noted conflicts; we wanted to push even further in our embrace of harmonization.

In a sentence, MISRA C is a safer subset of the C language plus a set of guidelines for using what is left of the language more safely. In about as many words, BARR-C is a style guide for the C language that reduces the number of defects introduced during the coding phase by increasing readability and portability. It turns out there is quite a bit of value in combining rules from both standards. For example, MISRA’s guidelines do not provide stylistic advice, so applying BARR-C’s stylistic rules to the MISRA C subset further increases safety.

The rest of this post is a preview of the specific rule changes and additions we will make in BARR-C:2018. (Omitted from this list are rules reworded simply for greater clarity.)

Rule 1.1.d (new)

The following rule will be added:

Preprocessor directive #define shall not be used to alter or rename any keyword or other aspect of the programming language.
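For illustration, here is a minimal sketch (macro and type names invented, not from the standard) of the kind of preprocessor misuse the new rule prohibits, along with a compliant alternative:

    /* Non-compliant under new Rule 1.1.d: the preprocessor is used to
    ** rename keywords and reshape the syntax of the language itself.
    */
    #define begin   {
    #define end     }
    #define forever while (1)

    /* Compliant: if an alternate type name is wanted, use a typedef
    ** (e.g., built on the fixed-width names from <stdint.h>), not #define.
    */
    #include <stdint.h>
    typedef int32_t sample_t;   /* 'sample_t' is a hypothetical name */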

Rule 1.6.b (new)

The following rule will be added:

Conversion from a pointer to void to a pointer to another type shall be cast.
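A minimal sketch of what this rule requires, using the void * returned by malloc() as the familiar example:

    #include <stdint.h>
    #include <stdlib.h>

    void buffer_example(void)
    {
        /* Non-compliant under new Rule 1.6.b: implicit conversion from
           the void * returned by malloc(). */
        uint8_t * p_bad = malloc(64u);

        /* Compliant: the conversion from void * is made explicit. */
        uint8_t * p_good = (uint8_t *) malloc(64u);

        free(p_bad);
        free(p_good);
    }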

Rule 1.7.c (change)

The earlier rule:

The goto keyword shall not be used.

will be replaced with:

It is a preferred practice to avoid all use of the goto keyword. If goto is used it shall only jump to a label declared later and in the same or an enclosing block.

This is being done so that the BARR-C standard does not restrict uses of goto that are permissible under MISRA C.

Rule 1.7.d (change)

The earlier rule:

The continue keyword shall not be used.

will be replaced with:

It is a preferred practice to avoid all use of the continue keyword.

This is being done so that the BARR-C standard does not restrict uses of continue that are permissible under MISRA C.

Rule 1.7.e (eliminated)

There will no longer be any restriction on the use of the break keyword. This is being done so that the BARR-C standard does not restrict uses of break that are permissible under MISRA C.

Rule 4.2.d (change)

The earlier rule:

No header file shall contain a #include statement.

will be replaced with:

No public header file shall contain a #include of any private header file.

This is being done because apparently no one (other than me) actually valued the old rule. 🙂
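To make the new distinction concrete, here is a hedged sketch (file, module, and function names hypothetical) of what the revised rule allows and forbids:

    /* adc.h -- a module's PUBLIC header                                  */
    #ifndef ADC_H
    #define ADC_H

    #include <stdint.h>          /* OK: standard and other public headers */
    /* #include "adc_private.h"     NON-COMPLIANT: no private headers     */

    uint16_t adc_read(uint8_t channel);

    #endif /* ADC_H */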

Rule 5.1.c (new)

The following rule will be added:

The name of all public data types shall be prefixed with their module name and an underscore.
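For example (module and member names hypothetical), the public types of an “adc” module would each carry the module’s name as a prefix:

    #include <stdint.h>

    /* Public types of a hypothetical "adc" module, per new Rule 5.1.c. */
    typedef enum
    {
        ADC_MODE_SINGLE,
        ADC_MODE_CONTINUOUS
    } adc_mode_t;

    typedef struct
    {
        adc_mode_t mode;
        uint16_t   sample_rate_hz;
    } adc_config_t;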

Rule 5.4.b.v (new)

The following rule will be added:

Always invoke the isfinite() macro to check that prior calculations have resulted in neither infinity nor NaN.
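A minimal sketch of the pattern this rule calls for, using the standard isfinite() macro from <math.h> (the recovery strategy shown is just a placeholder):

    #include <math.h>

    double safe_divide(double numerator, double denominator)
    {
        double result = numerator / denominator;

        /* Per new Rule 5.4.b.v: confirm the prior calculation produced
           neither infinity nor NaN before trusting the result. */
        if (!isfinite(result))
        {
            result = 0.0;   /* placeholder recovery; application-specific */
        }

        return (result);
    }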

Rule 5.6.a (new)

The following rule will be added:

Boolean variables shall be declared as type bool.

Rule 5.6.b (new)

The following rule will be added:

Non-Boolean values shall be converted to Boolean via use of relational operators (e.g., < or !=), not via casts.
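Taken together, a short sketch of Rules 5.6.a and 5.6.b (variable names hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    void rx_check(uint32_t rx_count)
    {
        /* Rule 5.6.a: Boolean variables are declared as type bool.  */
        bool b_data_ready;

        /* Rule 5.6.b: convert via a relational operator ...         */
        b_data_ready = (rx_count != 0u);

        /* ... not via a cast (non-compliant):
           b_data_ready = (bool) rx_count;                            */

        (void) b_data_ready;   /* quiet unused-variable warnings      */
    }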

Rule 6.2.c (changed)

The earlier rule:

All functions shall have just one exit point and it shall be at the bottom of the function. That is, the keyword return shall appear a maximum of once.

will be replaced with:

It is a preferred practice that all functions shall have just one exit point and it shall be via a return at the bottom of the function.
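A brief sketch of the single-exit shape this preferred practice describes (a linear search, chosen arbitrarily for illustration):

    #include <stdint.h>

    int32_t find_index(int32_t const items[], int32_t count, int32_t wanted)
    {
        int32_t result = -1;    /* default result: not found          */

        for (int32_t i = 0; i < count; i++)
        {
            if (items[i] == wanted)
            {
                result = i;
                break;          /* no early return; flow to the exit  */
            }
        }

        return (result);        /* the one exit point, at the bottom  */
    }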

Rule 6.3.b.iv (new)

The following rule will be added:

Never include a transfer of control (e.g., return keyword).

Rule 7.1.n (new)

The following rule will be added:

The names of all variables representing non-pointer handles for objects, e.g., file handles, shall begin with the letter ‘h’.

Rule 7.1.o (new)

The following rule will be added:

In the case of a variable name requiring multiple of the above prefixes, the order of their inclusion before the first underscore shall be [g][p|pp][b|h].
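Some hypothetical declarations showing Rule 7.1.n’s ‘h’ prefix and Rule 7.1.o’s required ordering:

    #include <stdbool.h>
    #include <stddef.h>

    int     h_logfile;               /* 'h': a non-pointer handle             */
    bool    gb_self_test_passed;     /* 'g' then 'b': a global Boolean        */
    bool  * gpb_shutdown_req = NULL; /* 'g', 'p', 'b': global ptr to Boolean  */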

Rule 7.2.c (new)

The following rule will be added:

If project- or file-global variables are used, their definitions shall be grouped together and placed at the top of a source code file.

Rule 7.2.d (new)

The following rule will be added:

Any pointer variable lacking an initial address shall be initialized to NULL.
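A small sketch combining Rules 7.2.c and 7.2.d at the top of a hypothetical source file:

    #include <stddef.h>
    #include <stdint.h>

    /* Rule 7.2.c: file-global definitions grouped at the top of the file. */
    static uint32_t  g_tick_count  = 0u;

    /* Rule 7.2.d: a pointer lacking an initial address starts as NULL.    */
    static uint8_t * g_p_rx_buffer = NULL;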

Rule 8.2.a (changed)

The earlier rule:

The shortest (measured in lines of code) of the if and else if clauses should be placed first.

will be replaced with:

It is a preferred practice that the shortest (measured in lines of code) of the if and else if clauses should be placed first.

Rule 8.3.c (new)

The following rule will be added:

Any case designed to fall through to the next shall be commented to clearly explain the absence of the corresponding break.
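For instance, a hedged sketch of a commented fall-through (the states and processing are invented):

    typedef enum { STATE_INIT, STATE_IDLE, STATE_FAULT } state_t;

    void state_step(state_t state)
    {
        switch (state)
        {
            case STATE_INIT:
                /* ... one-time configuration ... */

                /* Intentional fall-through per new Rule 8.3.c: after
                   configuration, the idle processing below runs too. */
            case STATE_IDLE:
                /* ... idle processing ... */
                break;

            default:
                /* ... fault handling ... */
                break;
        }
    }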

Rule 8.5.b (new)

The following rule will be added:

The C Standard Library functions abort(), exit(), and setjmp()/longjmp() shall not be used.

If you have questions about any of these draft changes or suggestions for better or other changes, please comment below.

C’s goto Keyword: Should we Use It or Lose It?

Wednesday, June 6th, 2018 Michael Barr

In the 2008 and 2013 editions of my bug-killing Embedded C Coding Standard, I included this rule:

Rule 1.7.c. The goto keyword shall not be used.

Despite this coding standard being followed by about 1 in 8 professional embedded systems designers, none of us at Barr Group have heard any complaints that this rule is too strict. (Instead we hear lots of praise for our unique focus on reducing intra-team stylistic arguments by favoring above all else C coding rules that prevent bugs.)

However, there exist other coding standards with a more relaxed take on goto. I’ve thus been revisiting this (and related) rules for an updated 2018 edition of the BARR-C standard.

Specifically, I note that the current (2012) version of the MISRA-C Guidelines for the Use of the C Programming Language in Critical Systems merely advises against the use of goto:

Rule 15.1 (Advisory): The goto statement should not be used

though at least the use of goto is required to be appropriately narrowed:

Rule 15.2 (Required): The goto statement shall jump to a label declared later in the same function

Rule 15.3 (Required): Any label referenced by a goto statement shall be declared in the same block, or in any block enclosing the goto statement

Generally speaking, the rules of the MISRA-C:2012 standard are either the same as or stricter than those in my BARR-C standard. Beyond the overlaps, MISRA-C has more and stricter coding rules, while BARR-C adds stylistic advice. It is precisely because MISRA-C’s rules are stricter and only BARR-C includes stylistic rules that these two most popular embedded programming standards are frequently and easily combined.

Which all leads to the key question:

Should the 2018 update to BARR-C relax the prior ban on all use of goto?

According to the authors of the C programming language, “Formally, [the goto keyword is] never necessary” as it is “almost always easy to write code without it”. They go on to recommend that goto “be used rarely, if at all.”

Having, in recent days, reviewed the arguments for and against goto across more than a dozen C programming books as well as other C/C++ coding standards and computer science papers, I believe there is just one best bug-killing argument for each side of this rule. And I’ll have to decide between them this week to keep the new edition on track for a June release.

Allow goto: For Better Exception Handling

There seems to be universal agreement that goto should never be used to branch UP to an earlier point in a function. Likewise, branching INTO a deeper level of nesting is a universal no-no.

However, branching DOWN in the function and OUT to an outer level of nesting are often described as powerful uses of the goto keyword. For example, a single goto statement can be used to escape from two or more levels of nesting, which is not a behavior that can be accomplished with a single break or continue. (To accomplish the same behavior without using goto, one could, e.g., set a flag variable in the inner block and check it in each outer block.)

Given this, the processing of exceptional conditions detected inside nested blocks is a potentially valuable application of goto that could be implemented in a manner compliant with the three MISRA-C:2012 rules quoted above. Such a jump could even proceed from two or more points in a function to a common block of error recovery code.
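To make that concrete, here is a minimal sketch (names invented, not from either standard) of the error-handling pattern in question. Both gotos jump downward and outward to a label in an enclosing block, satisfying MISRA-C:2012 Rules 15.2 and 15.3:

    #include <stdbool.h>
    #include <stdlib.h>

    bool process(size_t n_rows, size_t n_cols)
    {
        bool   b_ok   = false;
        char * p_rows = NULL;
        char * p_cols = NULL;

        p_rows = (char *) malloc(n_rows);
        if (NULL == p_rows)
        {
            goto cleanup;    /* jump DOWN and OUT of the nesting ...      */
        }

        p_cols = (char *) malloc(n_cols);
        if (NULL == p_cols)
        {
            goto cleanup;    /* ... from two points to one recovery block */
        }

        /* ... main processing using p_rows and p_cols ... */
        b_ok = true;

    cleanup:
        free(p_cols);        /* free(NULL) is safe, so no flags needed    */
        free(p_rows);
        return (b_ok);
    }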

Because good exception handling is a property of higher reliability software and is therefore a potential bug killer, I believe I must consider relaxing Rule 1.7.c in BARR-C:2018 to permit this specific use of goto. A second advantage of relaxing the rule would be increasing the ease of combining rules from MISRA-C with those from BARR-C (and vice versa), which is a primary driver for the 2018 update.

Ban goto: Because It’s Even Riskier in C++

As you are likely aware, there are lots of authors who opine that use of goto “decreases readability” and/or “increases potential for bugs”. And none other than Dijkstra (50 years ago now!) declared that the “quality of programmers is [inversely proportional to their number] of goto statements”.

But, even if all true, none of these assertions is a winning argument for banning the use of goto altogether versus permitting the narrow “exception handling exception” suggested above.

The best bug-killing argument I have heard for keeping the current Rule 1.7.c is that the constructors and destructors of C++ create a new hazard for goto statements. That is, if a goto jumps over a line of code on which an object would have been constructed or destructed, then that step never occurs. This could, for example, lead to a very subtle and difficult-to-detect type of bug, such as a memory leak.
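A tiny C sketch of the underlying jump hazard (the C++ constructor case is analogous; here the skipped step is a plain initialization):

    #include <stdbool.h>

    void hazard_sketch(bool b_abort)
    {
        if (b_abort)
        {
            goto done;       /* jumps over the initialization below      */
        }

        int count = 42;      /* skipped entirely when the goto is taken  */
        count++;

    done:
        /* In C, 'count' is in scope here yet holds garbage if the goto
           was taken. In C++, the same jump past an initialized variable
           is ill-formed, precisely because construction would have been
           bypassed.                                                     */
        return;
    }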

Given that C++ is a programming language intentionally backward compatible with legacy C code, and that elsewhere in BARR-C there is, e.g., a rule that variables (which could be objects) should be declared as close as possible to their scope of use, relaxing the existing ban on all use of goto could foreseeably create new hazards for followers of the coding standard who later migrate to C++. Indeed, we already have many programmers following our C coding rules while crafting C++ programs.

So what do you think? What relevant experiences can you bring to bear on this issue in the comments area below? Should I relax the ban on goto or maintain it? Are there any better (bug-killing) arguments for or against goto?

Is it a Bug or an Error?

Wednesday, January 31st, 2018 Michael Barr

Probably you’ve heard the story of how Adm. Grace Hopper attached a moth that was dislodged from a relay in the Harvard Mark II mainframe to an engineering notebook and labeled it the “First actual case of bug being found.”

[Photo: the moth taped into the Mark II engineering notebook]

Designers of electronics, including Thomas Edison, had been using the term bug for decades. But it was mostly after this amusing 1947 event that the use of words like “bugs” and “debugging” took off in the emerging software realm.

So why is it that if a bridge collapses we say it was a failure of the design and not attributable to a mere “bug”, as if an external force or act of God had caused the failure? Why do only software engineers get this linguistic pass when failures are caused by their mistakes, the same as for other types of engineers?

Failures of software are commonplace everyday events. Yet such failures are not typically the result of a moth or other “actual bug”. Each such failure is instead caused by human error: some mistake has been made either in the requirements or in the implementation, and these human mistakes then have real-world consequences, including sometimes compromising the safety and security of product users.

Should we, as a community of professionals, stop using the word “bug” and instead replace it with some other more honest term such as “error” or “mistake”? Might this help to raise the seriousness with which we approach our work and thereby the safety of the users of our product? What do you think? Comment below.

Cyberspats on the Internet of Things

Thursday, April 6th, 2017 Michael Barr

When you hear the words “weaponization” and “internet” in close proximity you naturally assume the subject is the use of hacks and attacks by terrorists and nation-state actors.

But then comes today’s news about an IoT garage door startup that remotely disabled a customer’s opener in response to a negative review. In a nutshell, a man bought the startup’s Internet-connected opener, installed it in his home, was disappointed with the quality, and wrote negative reviews on the company’s website and Amazon. In response, the company disabled his unit.

In context of the explosion of Internet connections in embedded systems, this prompts several thoughts.

First and foremost: What does it mean to buy or own a product that relies for some functionality on a cloud-based server that you might not always be able to access? Is it your garage door opener or the manufacturer’s? And how much is that determined by fine print in a contract you’ll need a lawyer to follow?

Additionally: What if, in this specific situation, the company hadn’t made any public statements at all and had just remotely made the customer’s garage door opener less functional? There’d have then been no fodder for a news story. The company would’ve gotten its “revenge” on the customer. And the customer might never have known anything except that the product wasn’t to his liking. Investigating might cost him time and money he did not have.

It’s almost certainly the case that this company would have seen better business outcomes if it had quietly disabled the unit in question. And there are so many other insidious ways to go about it, including: bricking the unit, refusing it future firmware updates, or even subtly downgrading its functionality.

Which brings us back to the weaponization of the Internet. Consumers have no choice but to trust the makers of their products, who have complete knowledge of the hardware and software design (and maybe also the digital signatures needed to make secure firmware updates). And these companies typically have all kinds of identifying data about individual customers: name, geographic location, phone and email address, product usage history, credit card numbers, etc. So what happens when the makers of those products are unhappy with one or more customers: from those posting bad product reviews all the way up to politicians and celebrities they may dislike?

Perhaps private companies are already attacking specific customers in subtle ways… How would we know?

Lethal Software Defects: Patriot Missile Failure

Thursday, March 13th, 2014 Michael Barr

During the Gulf War, twenty-eight U.S. soldiers were killed and almost one hundred others were wounded when a nearby Patriot missile defense system failed to properly track a Scud missile launched from Iraq. The cause of the failure was later found to be a programming error in the computer embedded in the Patriot’s weapons control system.

On February 25, 1991, Iraq successfully launched a Scud missile that hit a U.S. Army barracks near Dhahran, Saudi Arabia. The 28 deaths caused by that one Scud constituted the single deadliest incident of the war for American soldiers. Interestingly, the “Dhahran Scud”, which killed more people than all 70 or so of the earlier Scud launches combined, was apparently the last Scud fired in the Gulf War.

Unfortunately, the “Dhahran Scud” succeeded where the other Scuds failed because of a defect in the software embedded in the Patriot missile defense system. This same bug was latent in all of the Patriots deployed in the region. However, the presence of the bug was masked by the fact that a particular Patriot weapons control computer had to be continuously running for several days before the bug could cause the hazard of a failure to track a Scud.

There is a nice concise write-up of the problem, with a prefatory background on how the Patriot system is designed to work, in the official post-failure analysis report by the U.S. General Accounting Office (GAO IMTEC-92-26) entitled “Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia”.

The hindsight explanation is that:

a software problem “led to an inaccurate tracking calculation that became worse the longer the system operated” and states that “at the time of the incident, the [Patriot] had been operating continuously for over 100 hours” by which time “the inaccuracy was serious enough to cause the system to look in the wrong place [in the radar data] for the incoming Scud.”

Detailed Analysis

The GAO report does not go into the technical details of the specific programming error. However, I believe we can infer the following based on the information and data that is provided about the incident and about the defect.

A first important observation is that the CPU was a 24-bit integer-only CPU “based on a 1970s design”. Befitting the time, the code was written in assembly language.

A second important observation is that real numbers (i.e., those with fractions) were apparently manipulated as a whole number in binary in one 24-bit register plus a binary fraction in a second 24-bit register. In this fixed-point numerical system, the real number 3.25 would be represented as binary 000000000000000000000011:010000000000000000000000, in which the : is my marker for the separator between the whole and fractional portions of the real number. The first half of that binary represents the whole number 3 (i.e., bits are set for 2 and 1, the sum of which is 3). The second portion represents the fraction 0.25 (i.e., 0/2 + 1/4 + 0/8 + …).

A third important observation is that system [up]time was “kept continuously by the system’s internal clock in tenths of seconds [] expressed as an integer.” This is important because the fraction 1/10 cannot be represented exactly in 24 bits of binary fraction: its binary expansion (0.0001100110011…) repeats forever rather than terminating.

I understand that the missile-interception algorithm that did not work that day is approximately as follows:

  1. Consider each object that might be a Scud missile in the 3-D radar sweep data.
  2. For each, calculate an expected next location at the known speed of a Scud (+/- an acceptable window).
  3. Check the radar sweep data again at a future time to see if the object is in the location a Scud would be.
  4. If it is a Scud, engage and fire missiles.

Furthermore, the GAO reports that the problem was an accumulating linear error of .003433 seconds per 1 hour of uptime that affected every deployed Patriot equally. This was not a clock-specific or system-specific issue.
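Here is a small program (my own reconstruction, not Patriot code) that reproduces those numbers. It assumes the stored constant for 1/10 kept 23 fraction bits, which is what published analyses of the failure suggest and which matches the GAO’s drift figure:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* 1/10 truncated to a 23-bit binary fraction (assumed register
           format), as the Patriot stored it.                           */
        uint32_t const scale       = 1u << 23;
        uint32_t const fixed_tenth = (uint32_t)(0.1 * scale);  /* truncates */
        double const   per_tick    = 0.1 - ((double) fixed_tenth / scale);

        printf("error per 0.1 s tick: %.10f s\n", per_tick);          /* ~0.000000095 */
        printf("drift per hour:       %f s\n", per_tick * 36000.0);   /* ~0.003433    */
        printf("drift after 100 h:    %f s\n", per_tick * 3600000.0); /* ~0.34        */

        return 0;
    }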

Given all of the above, I reason that the problem was that one part of the Scud-interception calculations utilized time in its decimal representation and another used the fixed-point binary representation. When the uptime was still low, targets were found in the expected locations when they were supposed to be and the latent software bug was hidden.

Of course, all of the above detail is specific to the Patriot hardware and software design that was in use at the time of the Gulf War. As the Patriot system has since been modernized by Raytheon, many details like these will have likely changed.

According to the GAO report:

Army officials [] believed the Israeli experience was atypical [and that] other Patriot users were not running their systems for 8 or more hours at a time. However, after analyzing the Israeli data and confirming some loss in targeting accuracy, the officials made a software change which compensated for the inaccurate time calculation. This change allowed for extended run times and was included in the modified software version that was released [9 days before the Dhahran Scud incident]. However, Army officials did not use the Israeli data to determine how long the Patriot could operate before the inaccurate time calculation would render the system ineffective.

Four days before the deadly Scud attack, the “Patriot Project Office [in Huntsville, Alabama] sent a message to Patriot users stating that very long run times could cause [targeting problems].” That was about the time of the last reboot of the Patriot system that failed.

Note that if time samples were all in the decimal timebase or all in the binary timebase then the two compared radar samples would always be close in time and the error would not accumulate with uptime. And that is the likely fix that was implemented.

Firmware Updates

Here are a few tangentially interesting tidbits from the GAO report:

  • “During the [Gulf War] the Patriot’s software was modified six times.”
  • “Patriots had to be shut down for at least 1 to 2 hours to install each software modification.”
  • “Rebooting[] takes about 60 to 90 seconds” and sets the “time back to zero.”
  • The “[updated] software, which compensated for the inaccurate time calculation, arrived in Dhahran” the day after the deadly attack.

Public Statements

In hindsight, there are some noteworthy quotes from the 1991 news articles initially reporting on this incident. For example,

Brig. Gen. Neal, United States Command (2 days after):

The Scud apparently fragmented above the atmosphere, then tumbled downward. Its warhead blasted an eight-foot-wide crater into the center of the building, which is three miles from a major United States air base … Our investigation looks like this missile broke apart in flight. On this particular missile it wasn’t in the parameters of where it could be attacked.

U.S. Army Col. Garnett, Patriot Program Director (4 months after):

The incident was an anomaly that never showed up in thousands of hours of testing and involved an unforeseen combination of dozens of variables — including the Scud’s speed, altitude and trajectory.

Importantly, the GAO report states that, a few weeks before the Dhahran Scud, Israeli soldiers reported to the U.S. Army that their Patriot had a noticeable “loss in accuracy after … 8 consecutive hours.” Thus, apparently, all of these “thousands of hours” of testing involved frequent reboots. (I can envision the test documentation now: “Step 1: Power up the Patriot. Step 2: Check that everything is perfect. Step 3: Fire the dummy target.”) The GAO reported that “an endurance test has [since] been conducted to ensure that extended run times do not cause other system difficulties.”

Note too that the quote about “thousands of hours of testing” was also misleading in that the Patriot software was, also according to the GAO report, hurriedly modified in the months leading up to the Gulf War to track Scud missiles going about 2.5 times faster than the aircraft and cruise missiles it was originally designed to intercept. Improvements to the Scud-specific tracking/engagement algorithms were apparently even being made during the Gulf War.

These specific theories and statements about what went wrong, or why it must have been a problem outside the Patriot itself, were fully discredited once the source code was examined. When computer systems may have misbehaved in a lethal manner, it is important to remember that newspaper quotes from those on the side of the designers are not scientific evidence. Indeed, the humans who offer those quotes often have conscious and/or subconscious motives and blind spots that incline them to be falsely overconfident in the computer systems. A thorough source code review takes time but is the scientific way to go about finding the root cause.

As a New York Times editorial dated 4 months after the incident explained:

The Pentagon initially explained that Patriot batteries had withheld their fire in the belief that Dhahran’s deadly Scud had broken up in midflight. Only now does the truth about the tragedy begin to emerge: A computer software glitch shut down the Patriot’s radar system, blinding Dhahran’s anti-missile batteries. It’s not clear why, even after Army investigators had reached this conclusion, the Pentagon perpetuated its fiction

At least in this case, it was only a few months before the U.S. Army admitted the truth, to itself and to the public, about what had happened. That is to the U.S. Army’s credit. Other actors in other lethal software defect cases have been far more stubborn to admit what has later become clear about their systems.