Over the next few months I’ll be touching upon the subjects of debugging and testing embedded systems. Although much has been written about these topics (often by companies looking to sell you something), I’ve always been struck by the fact that many of these discussions treat errors as if they were all cut from the same cloth. Clearly this is foolhardy, as it’s my experience that understanding what class of error you have is key to adopting an effective debugging and testing strategy. With that being said, my taxonomy of embedded systems errors appears below, arranged roughly in the order that one encounters them in an embedded project. I might add that the difficulty of solving these problems roughly follows the same order, with syntax errors being trivial to identify and fix, while race conditions can be extremely difficult to identify (even if the fix is fairly easy).
Group 1 – Building a linked image
Syntax errors
Language errors
Build environment problems (makefile dependencies, linker configurations)
Group 2 – Getting the board up and running
Hardware configuration errors (failure to set up peripherals correctly)
Run time environment errors (stack & heap allocation, memory models etc.)
Software configuration errors (failure to use library code correctly)
Group 3 – Knocking off the obvious mistakes
Coding errors (initialization, pointer dereferencing, N + 1 issues etc.)
Algorithmic errors
Group 4 – Background / Foreground issues
Re-entrancy
Atomicity
Interrupt response times
Group 5 – Timing related
Resource allocation mistakes
Priority / scheduling issues
Deadlocks
Priority inversion
Race conditions
It’s my intention over the next few months to discuss how I set about solving these sorts of problems, so it’s important that I’ve got the groups right. Thus if anyone thinks this taxonomy is missing an important group, then perhaps you could let me know via the comments section or email.
Projects are increasingly moving up in scale of CPU types and count, e.g. PowerPC multi-cores, which brings into prominence another bug type: thread safety violations. This is not quite the same thing as re-entrancy bugs. The latter usually involve unnecessary global state, while threading issues usually revolve around poorly- (or un-) protected critical regions around access to global or shared resources. Interrupts must be considered as pre-emptive threads, even with co-operative tasking architectures, as must resource handlers shared amongst multiple CPU cores.
I had my exemplar experience in this with a bug in a buffer queue running from an interrupt handler to the routing layer. It did not rear its head until three years after the product started shipping. Up till then all uses had been stochastically well distributed, but in this one instance it was being used for printing pick-tickets in an automated warehouse. It just so happened that the interval between tickets, at full load, raised the probability of failure from insignificant to one in twenty messages, because the processing time between one chunk interrupt and the next lined up with the task-level access to the same queue. I spent three days sitting on the factory floor glued to an oscilloscope looking for this one. In the end I just had to DI – EI around two assembly instructions to fix it.
Lesson learned.
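For readers who have not met this class of bug, here is a minimal sketch in C of the kind of fix described above: a queue filled by an interrupt handler and drained at task level, with the task-level access wrapped in a disable/enable pair. Everything in it is an assumption made for illustration. The comment does not say which two instructions were protected, so the sketch guesses at a shared count being updated, and `disable_interrupts()`, `enable_interrupts()`, `queue_t` and the function names are hypothetical stand-ins rather than anything from the original product. The stubs only track nesting so the code can run on a host; on a real target they would map to the CPU's DI and EI instructions or the equivalent compiler intrinsics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the target's DI / EI instructions.
   Here they only count nesting so the sketch can run on a host PC. */
static volatile int irq_lock_depth = 0;

static void disable_interrupts(void) { irq_lock_depth++; }

static void enable_interrupts(void)
{
    assert(irq_lock_depth > 0);   /* catch an unbalanced EI */
    irq_lock_depth--;
}

#define QUEUE_SIZE 8u

typedef struct {
    uint8_t data[QUEUE_SIZE];
    volatile uint8_t head;    /* next slot written by the interrupt handler */
    volatile uint8_t tail;    /* next slot read at task level */
    volatile uint8_t count;   /* touched by BOTH contexts: the shared state */
} queue_t;

/* Producer, called from the interrupt handler. Assumption: interrupts do not
   nest on this target, so the handler cannot itself be pre-empted. */
static bool queue_put_from_isr(queue_t *q, uint8_t byte)
{
    if (q->count == QUEUE_SIZE) {
        return false;                                  /* queue full */
    }
    q->data[q->head] = byte;
    q->head = (uint8_t)((q->head + 1u) % QUEUE_SIZE);
    q->count++;
    return true;
}

/* Consumer, called at task level. The decrement of count is a
   read-modify-write that typically compiles to several instructions, so an
   interrupt arriving in the middle of it can corrupt the count. The DI / EI
   pair closes that window. */
static bool queue_get_from_task(queue_t *q, uint8_t *out)
{
    bool ok = false;

    disable_interrupts();                              /* DI */
    if (q->count > 0u) {
        *out = q->data[q->tail];
        q->tail = (uint8_t)((q->tail + 1u) % QUEUE_SIZE);
        q->count--;
        ok = true;
    }
    enable_interrupts();                               /* EI */

    return ok;
}
```

The point of the sketch is how small the protected region is. Keeping it to a handful of instructions limits the effect on interrupt response time, which is itself one of the Group 4 items in the taxonomy.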
You make some interesting points GrayGaffer. I essentially have no experience of inter-processor communications in a multi-core design, which probably goes a long way to explaining why this class of problem doesn't appear in my taxonomy. I found the problem you described very interesting, particularly as it illustrates something that has long been apparent to me: completely testing a product is virtually impossible. I think most of us would treat with great skepticism a bug report three years after the product shipped. While I've done my fair share of sitting on factory floors, I long ago came to the conclusion that the only real way to prevent bugs of this type is via a formal code review that explicitly looks for these sorts of race conditions.