
What NHTSA/NASA Didn’t Consider re: Toyota’s Firmware

Wednesday, March 2nd, 2011 by Michael Barr

In a blog post yesterday (Unintended Acceleration and Other Embedded Software Bugs), I wrote extensively on the report from NASA’s technical team regarding their analysis of the embedded software in Toyota’s ETCS-i system. My overall point was that it is hard to judge the quality of their analysis (and thereby the overall conclusion that the software isn’t to blame for unintended accelerations) given the large number of redactions.

I need to put the report down and do some other work at this point, but I have a few other thoughts and observations worth writing down.

Insufficient Explanations

First, some of the explanations offered by Toyota, and apparently accepted by NASA, strike me as insufficient. For example, at pages 129-132 of Appendix A to the NASA Report there is a discussion of recursion in the Toyota firmware. “The question then is how to verify that the indirect recursion in the ETCS-i does in fact terminate (i.e., has no infinite recursion) and does not cause a stack overflow.”

“For the case of stack overflow, [redacted phrase], and therefore a stack overflow condition cannot be detected precisely. It is likely, however, that overflow would cause some form of memory corruption, which would in turn cause some bad behavior that would then cause a watchdog timer reset. Toyota relies on this assumption to claim that stack overflow does not occur because no reset occurred during testing.” (emphasis added)

I have written before about what really happens during stack overflow (Firmware-Specific Bug #4: Stack Overflow), which explains why a reset may not result and why it is so hard to trace the resulting misbehavior back to that root cause. (From page 20, in NASA’s words: “The system stack is limited to just 4096 bytes, it is therefore important to secure that no execution can exceed the stack limit. This type of check is normally simple to perform in the absence of recursive procedures, which is standard in safety critical embedded software.”)
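For illustration, here is one deterministic detection technique of the sort the report does not describe: a stack guard zone painted with a known pattern at boot and checked periodically. This is a generic sketch with invented symbol names, not Toyota’s code; only the 4096-byte stack size comes from the report.

```c
#include <stdint.h>
#include <stddef.h>

/* Generic stack-guard sketch (invented names, not Toyota's code).
 * At boot, before the stack is deep, paint a guard zone at the far
 * end of the stack (the end an overflow reaches first); then check
 * it periodically from a monitor task. */
#define GUARD_BYTES  64u        /* assumed guard-zone size */
#define FILL_PATTERN 0xA5u

extern uint8_t stack_limit[];   /* lowest stack address; assumed to
                                   come from the linker script */

void stack_guard_init(void)     /* call early, before tasks run */
{
    for (size_t i = 0; i < GUARD_BYTES; i++)
        stack_limit[i] = FILL_PATTERN;
}

/* Returns nonzero if the stack has grown into the guard zone,
 * i.e., overflow is imminent or has already happened. On failure,
 * force a safe state deliberately instead of hoping that random
 * memory corruption will eventually trip the watchdog. */
int stack_guard_breached(void)
{
    for (size_t i = 0; i < GUARD_BYTES; i++)
        if (stack_limit[i] != FILL_PATTERN)
            return 1;
    return 0;
}
```

The point is that overflow can be caught deterministically and handled in a defined way, rather than trusting corruption to produce a watchdog reset before it produces bad behavior.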

Similarly, “Toyota designed the software with a high margin of safety with respect to deadlines and timeliness. … [but] documented no formal verification that all tasks actually meet this deadline requirement.” and “All verification of timely behavior is accomplished with CPU load measurements and other measurement-based techniques.” It’s not clear to me if the NASA team is saying it buys those Toyota explanations or merely wanted to write them down. However, I do not see a sufficient explanation in this wording from page 132:

“The [worst case execution time] analysis and recursion analysis involve two distinctly different problems, but they have one thing in common: Both of their failure modes would result in a CPU reset. … These potential malfunctions, and many others such as concurrency deadlocks and CPU starvation, would eventually manifest as a spontaneous system reset.” (emphasis added)

Might not a deadlock, starvation, priority inversion, or infinite recursion be capable of producing a bit of “bad behavior” (perhaps even unintended acceleration) before that “eventual” reset? Or might not a stack overflow corrupt just one or a few important variables, producing bad behavior rather than, or before, a reset? These kinds of possibilities, even at very low probabilities, are important to consider in light of NASA’s calculation that the U.S.-owned 2002-2007 Camry fleet alone runs this software a cumulative one billion hours per year.

Paths Not Taken

My second observation is based upon reflection on the steps NASA might have taken in its review of Toyota’s ETCS-i firmware, but apparently did not. Specifically, there is no mention anywhere (unless it was entirely redacted) of:

  • rate monotonic analysis, a technique Toyota could have used to verify that its critical set of tasks and higher-priority ISRs always meet their deadlines (and that NASA could have applied in its review; a minimal schedulability check is sketched just after this list),
  • cyclomatic complexity, which NASA might have used as an additional winnowing tool to focus its limited time on particularly complex and hard-to-test routines,
  • hazard analysis and mitigation, as those terms are defined by FDA guidelines regarding software contained in medical devices, nor
  • any discussion or review of Toyota’s specific software testing regimen and bug tracking system.
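To make the first of those concrete, here is a minimal sketch of the Liu and Layland utilization test at the heart of rate monotonic analysis. The task periods and execution times below are invented for illustration; they do not reflect the actual (redacted) ETCS-i task set.

```c
#include <stdio.h>
#include <math.h>

/* Minimal sketch of the Liu & Layland rate-monotonic
 * schedulability test: n periodic tasks, with priorities assigned
 * by rate, are guaranteed to meet their deadlines if total CPU
 * utilization U <= n * (2^(1/n) - 1). The task set below is
 * invented for illustration only. */
typedef struct {
    double period_ms;   /* task period (assumed equal to deadline) */
    double wcet_ms;     /* worst-case execution time */
} task_t;

int main(void)
{
    const task_t tasks[] = {    /* hypothetical task set */
        {  1.0, 0.2 },
        {  4.0, 1.0 },
        { 16.0, 4.0 },
    };
    const int n = (int)(sizeof tasks / sizeof tasks[0]);

    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += tasks[i].wcet_ms / tasks[i].period_ms;

    const double bound = n * (pow(2.0, 1.0 / n) - 1.0);

    printf("U = %.3f, bound = %.3f: %s\n", u, bound,
           (u <= bound) ? "provably schedulable"
                        : "inconclusive; do exact response-time analysis");
    return 0;
}
```

The bound is sufficient but not necessary; a task set that fails it may still be schedulable, which exact response-time analysis can decide. Either way, this is a far stronger form of verification than the CPU load measurements Toyota relied upon.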

Importantly, there is also a complete absence of discussion of how Toyota’s ETCS-i firmware versions evolved over time. Which makes and models (and model years) had which versions of that firmware? (Presumably there were also hardware changes worthy of note.) Were updates or patches ever made to cars once they were sold, say while at the dealer during official recalls or other types of service?


9 Responses to “What NHTSA/NASA Didn’t Consider re: Toyota’s Firmware”

  1. Lundin says:

    Surely the discussion must be about recursive interrupts? That is, re-enabling global interrupts from inside an ISR?

    Otherwise, why would they need recursion, as in functions calling themselves? In my experience, recursion is only justified for certain searching and sorting algorithms, and even there it can be unrolled into a non-recursive version. It doesn’t make much sense to use searching or sorting in a car brake system anyway.
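    For illustration (my own sketch, nothing from the report), this is the kind of unrolling I mean: a classically recursive algorithm such as binary search rewritten as a loop, so its worst-case stack depth is constant and trivially verifiable.

    ```c
    #include <stddef.h>

    /* Binary search unrolled into a loop: no recursion, constant
     * worst-case stack depth. Returns the index of key in the sorted
     * array a[0..n), or -1 if absent. */
    int binary_search(const int *a, size_t n, int key)
    {
        size_t lo = 0, hi = n;   /* invariant: key in a[lo..hi) if present */
        while (lo < hi)
        {
            size_t mid = lo + (hi - lo) / 2;  /* avoids overflow of lo + hi */
            if (a[mid] == key)
                return (int)mid;
            else if (a[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        return -1;
    }
    ```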

    Whether this is about recursive interrupts or actual functions calling themselves, it is bad practice that doesn’t belong in safety-critical software. Pretty much every software-related standard for any form of safety-related application bans recursion (MISRA-C for example).

    And surely they must have some sort of hazard analysis included in their specifications, at least an informal one? Perhaps project/quality management wasn’t studied by NESC?

  2. Michael Barr says:

    The recursion is reported to be indirect, a la “function A calls function B, B calls C, and C calls A.” (NESC Report, Appendix A, p. 129) “As to why recursion was present in the ETCS-i software, Toyota reported that it was a deliberate design choice in order to simplify the quality assurance process and reduce the total size of the executable. The recursion made this possible by allowing part of a newly implemented state machine to be linked to a state machine that was already present in the code. (The [redacted] function is just one of three sites where recursion is present.) This linkage [allowed code] to be reused, unmodified, and therefore did not require additional testing nor contribute to an increase in code size.” (p. 130)
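    For readers unfamiliar with the term, indirect recursion looks something like the contrived fragment below; this is only an illustration of the pattern, not the redacted Toyota code. Verifying it requires proving both that every cycle terminates and that the worst-case stack depth of the whole cycle is bounded.

    ```c
    /* Contrived illustration of indirect recursion: A calls B,
     * B calls C, and C calls A again. Termination hinges on the
     * depth argument strictly decreasing on every trip around
     * the cycle. */
    void state_b(int depth);
    void state_c(int depth);

    void state_a(int depth)
    {
        if (depth <= 0)
            return;             /* the termination condition */
        state_b(depth);
    }

    void state_b(int depth) { state_c(depth); }

    void state_c(int depth) { state_a(depth - 1); }
    ```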

    • Lundin says:

      Oh I see… how horrible. In other words, it is a quick & dirty hack made by the programmer to minimize the no doubt burdensome, unglamorous quality assurance and static analysis tests.

      And arguing about code size reduction through recursion, when you have already managed to blow up 100k flash, doesn’t seem particularly relevant.

      • David says:

        “it is a quick & dirty hack made by the programmer.” I don’t think so. Programmers do what managers require them to do, most of the time.

  3. David says:

    May I mention that 56 people were killed in various runaway Toyota events? It is easy to lose sight of this when discussing the minutiae of floor mats or firmware. I wanted to remind readers that this was no trivial issue and involved multiple deaths.

    After one of these events, I remember hearing a news report that a Toyota-affiliated person would go to the vehicle and connect some type of device “to gather vehicle information.” I wonder how long it takes to reprogram Toyota’s firmware?

    I also remember hearing some lawyers discuss this on a panel. They said, if it were firmware, that Toyota’s losses could be so high, the company may not be able to survive the lawsuits.

  4. Phil Koopman says:

    Thanks for publishing your thoughts on all this Mike.

    One thing I didn’t find looking through that appendix was any description of how the watchdog timer is actually used in the system. In the majority of design reviews I do (and I have done many) the watchdog timer is used incorrectly. For example, it might be kicked inside a loop within a long-running task or — horrors — kicked by a hardware timer ISR. Or it might be kicked in a way that lets some tasks die without the watchdog “noticing” they have hung. In any of those cases it is perfectly possible for the system to be mostly crashed without tripping a watchdog reset. They don’t mention that they confirmed the watchdog is being used properly, and yet they argue the watchdog timer would save the day.
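    For reference, sound practice looks something like the minimal sketch below (names invented; this is a generic pattern, not any particular product’s code): each task sets an aliveness flag from its own main loop, and a single low-priority supervisor kicks the hardware watchdog only when every flag is set.

    ```c
    #include <stdint.h>

    /* Generic task-aware watchdog pattern (invented names). Each
     * monitored task periodically calls task_checkin() with its own
     * ID; the supervisor kicks the hardware watchdog only when every
     * task has checked in since the last pass. */
    #define NUM_TASKS 3u
    #define ALL_ALIVE ((1u << NUM_TASKS) - 1u)

    extern void kick_hardware_watchdog(void);  /* hypothetical HAL call */

    static volatile uint32_t alive_flags = 0u;

    void task_checkin(unsigned task_id)  /* called from each task's loop */
    {
        /* NB: in real code this read-modify-write must be atomic. */
        alive_flags |= (1u << task_id);
    }

    void watchdog_supervisor(void)       /* run periodically, low priority */
    {
        if (alive_flags == ALL_ALIVE)
        {
            kick_hardware_watchdog();
            alive_flags = 0u;            /* demand fresh check-ins */
        }
        /* else: withhold the kick, so a single hung task forces a reset */
    }
    ```

    Kicking the watchdog from a timer ISR, by contrast, proves only that the timer interrupt still fires; the rest of the system can be dead.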

    If you have insight into that it would be interesting to hear.

  5. Dan says:

    In my experience, way too much trust is placed in watchdogs that are often poorly implemented. To suggest that a stack overflow would always cause a watchdog reset is foolish.
    As an automotive firmware engineer, I’m also surprised by the code size and apparent complexity. KISS!

    These types of problems usually turn out to be misunderstandings of the system-level requirements and/or miscommunication between disparate groups handling different functions of the vehicle. Most embedded firmware engineers have difficulty envisioning unusual interactions between the cruise control, ABS, pedal position sensor, etc., because they have to focus on the bits and bytes as they work.
    The proverbial forest vs. trees…

    David,
    I don’t know anyone in my field that believes this is trivial. Quite the opposite, we are all more concerned about how to make sure our systems are as safe as possible.

  6. Mary says:

    “bad behavior that would then cause a watchdog timer reset”

    “would eventually manifest as a spontaneous system reset”

    Perhaps this question is answered elsewhere, but did anyone say what a watchdog timer or system reset would entail? It seems like they are relying on those to ensure a properly functioning system. I come from the medical diagnostic device world, and in my world a system reset is comparable to rebooting a computer. What does it mean to reset a car while it’s traveling at a high rate of speed? Perhaps I am missing something and this is obvious to those in the auto industry, but I am not certain I would like to rely on that as a recovery mechanism for my car, or at least for certain key subsystems. I would certainly prefer that the designers do their job to begin with and ensure that those fail-safe mechanisms are never required.

  7. Ark says:

    A theme not discussed, to my knowledge, is memory glitches. Cars drive through EMI and ESD and, yes, radiation. A glitch in the state of a stable control algorithm may be very unpleasant but is usually manageable. A glitch in the state of a state machine is usually fatal. Is the engine computer sufficiently protected? How many people checksum their data objects these days?
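    One common mitigation, sketched generically below (this is not from the report), is to store each critical variable together with its bitwise complement and verify the pair on every read, so that a single-event upset is detected instead of acted upon.

    ```c
    #include <stdint.h>

    /* Generic sketch of redundant storage for a critical variable:
     * the value is kept alongside its bitwise complement, and every
     * read verifies the pair. A mismatch means memory corruption.
     * The fail-safe handler is a hypothetical placeholder. */
    typedef struct {
        uint16_t value;
        uint16_t check;                 /* always ~value */
    } protected_u16;

    extern void enter_fail_safe(void);  /* hypothetical */

    void protected_write(protected_u16 *p, uint16_t v)
    {
        p->value = v;
        p->check = (uint16_t)~v;
    }

    uint16_t protected_read(const protected_u16 *p)
    {
        if (p->check != (uint16_t)~p->value)
            enter_fail_safe();          /* corruption detected */
        return p->value;
    }
    ```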
