embedded software boot camp

Classic race conditions and thoughts on testing

Monday, August 30th, 2010 by Nigel Jones

At a rather fundamental level this blog is about how I do embedded systems. Implicit in a lot of the articles is the concept that I believe what I’m doing is ‘right’, or at least ‘better’. Well today I thought I’d write about something I got wrong (at least on the first pass).

This is the scenario. I’m currently working on an NXP LPC17xx ARM Cortex design. Like all modern processors, the LPC17xx has a number of sophisticated timers with all sorts of operating modes. Well it so happens that I am using  four (out of a possible six) interrupt sources for one particular timer. The hardware architecture of the processor routes all of these interrupts to one vector and thus one interrupt handler. Here’s what I wrote:

void TMR3_IRQHandler(void)
{
 if (T3IR_bit.MR0INT)
 {                                            
  /* Do stuff */
 }

 if (T3IR_bit.CR0INT)
 {        
  /* Do stuff */        
 }

 if (T3IR_bit.MR1INT)
 {                                            
  /* Do stuff */
 }

 if (T3IR_bit.CR1INT)
 {                                            
  /* Do stuff */
 }

 T3IR = 0x3F;            /* Acknowledge all interrupts */

 ...
}

Thus in the ISR I tested each of my interrupt sources, took the appropriate actions in the sections marked ‘Do stuff’, acknowledged the interrupts, did a bit of clean up and I was done.  The ‘Do Stuff’ sections were quite complicated and so this was where I spent my time. Anyway having finished coding the ISR, I took a short break and came back to re-examine the code. As I was re-reading the code, I realized that I had made a classic mistake. In case you haven’t spotted it, the problem is in the line where I acknowledge all interrupts. Consider the following sequence of events:

  1. Interrupt source CR1INT is asserted and the CPU vectors to this ISR.
  2. I test the various interrupt flags and discover that CR1INT is set and do the requisite work.
  3. While I’m doing the requisite work, interrupt source MR1INT becomes active.
  4. I clear all interrupt sources (including MR1INT) and terminate the ISR
  5. As a result I have missed an interrupt.

The way this should have been coded is to acknowledge each interrupt bit individually. I.e. like this:

void TMR3_IRQHandler(void)
{
 if (T3IR_bit.MR0INT)
 {                                            
  /* Do stuff */
  T3IR_bit.MR0INT = 1;                    /* Clear the interrupt */
 }

 if (T3IR_bit.CR0INT)
 {        
  /* Do stuff */    
  T3IR_bit.CR0INT = 1;                    /* Clear the interrupt */
 }

 if (T3IR_bit.MR1INT)
 {                                            
  /* Do stuff */
  T3IR_bit.MR1INT = 1;                    /* Clear the interrupt */
 }

 if (T3IR_bit.CR1INT)
 {                                            
  /* Do stuff */
  T3IR_bit.CR1INT = 1;                    /* Clear the interrupt */        
 }

 ...
}

So how did this mistake come about? I think there were two culprits:

Mistake 1

The first mistake I made was in using another timer ISR as a template. The code I copied had just a single interrupt source, and thus acknowledging all of the sources was reasonable.

Mistake 2

I was too concerned with the ‘real work’ of the ISR. I should have written the ISR outline first and only then worried about the real work.

Notwithstanding the above, I did do one thing correctly – and that was to finish the code, walk away, and then come back to re-examine it. At no time did I reach for the debugger to test my code – which was just as well because quite frankly the chances of this bug being caught by testing are vanishingly small. Indeed just about the only way a bug like this would get caught is via code inspection – which is why I’m such a firm believer in code inspection as a debugging tool.

Anyway if you found this informative, you may find this account of another mistake I made equally enlightening.

16 Responses to “Classic race conditions and thoughts on testing”

  1. Carroll Robinson says:

    Another interesting article, Nigel. When reading it, I was reminded of a related problem that I have seen with some processors, which your readers might be interested in knowing.

    On some systems that use bit fields, accessing a register containing multiple interrupt acknowledge bits can be problematic. An instruction such as REG.INTACK.bit.XX = 1, which is supposed to clear the XX interrupt flag, may be implemented as a read-modify-write operation. Thus, when the register is read, if multiple bits are set to 1, then writing a single 1 bit will have the side effect of writing 1s to any other bits that were read as 1s. The result is that multiple flags are cleared, and interrupts are missed. In this situation, the entire flag register must be written using a bit mask rather than bit fields.

    I assume that this is not a problem with the LPC17xx, since your code writes individual bits of the register.

    But, my point is that when writing to a register using bit fields, the programmer must be aware of how the write operation will be implemented, and the side effects that may occur.

    • Nigel Jones says:

      An excellent point. As I hadn’t got around to testing this code it made me wonder how exactly the compiler was handling the construct. Well it turns out that it’s doing exactly what you are warning of! Clearly I wasn’t having a very good day – so thanks!

    • Bernhard Weller says:

      Very nice pointing that out, especially for the new ones in embedded programming.

      But I wonder if you actually read a 1 out of an acknowledge register which you haven’t set yourself.

      The way I set and reset bits until now was always to use: to set REG |= 0x01; (or after learning the benefits of defining bits rather REG |= BIT0;) and to reset REG &= ~(BIT0);
      But I guess this would have the same problems, as it reads the value from the register applies bitwise and or or and writes it back, preserving any bit you have read but not specifically modified, so the 1s would still be there.

      So in this case should I just write to the register e.g. INTACK = BIT0;?

      Now as I think about it a bit more, I’ve actually never had a MCU which used an acknowledge register in the first place, I always had to modify the interruptflags directly. Is there any benefit in using an acknowledge register? Was it implemented to prevent people from setting interrupts while only wanting to reset one (Interrupt Flag Register = ~BIT0 instead of IFREG &= ~BIT0)?

    • Gauthier says:

      In your example, the register INTACK is used only to acknowledge the interrupt by writing 1 to a specific bit.
      I fail to understand why the reading operation of read-modify-write would read a one (besides design flaw).

      It would not make sense to read such an acknowledge register, would it? Now I understand that a compiler might effectively achieve a read-modify-write anyway, of course.
      If the processor is well designed, a read-write (without modify) should not have any effect, right?

      The question is, if writing 1 to bits of this register has an effect, and writing 0 does not, and if reading the register does not make sense, why would reading the register anyway give any bit at 1?

      Write-only registers use to have defined behaviour when the program attempts a read on them. In that case I would expect the read to only result in 0.

      Did I miss something?

    • Lundin says:

      On some architectures this problem would appear even if you used bit-wise operators, and then in the opposite way of what you describe. The typical C programmer would try to clear a flag using bit-wise operators like this:

      REG |= FLAG; /* write 1 to clear */

      But this could also translate into read-modify write. For example, on all compilers for Freescale MCUs I’ve seen, the above code will be translated into 3 assembler instructions:

      store REG into CPU register
      manipulate contents of CPU register
      write contents of CPU register back to REG

      And that will erase all set flags in the register. The proper way to clear the flag in this case is less intuitive:

      REG = FLAG;

      which translates to a single “bit set” instruction.

      It is all CPU dependent! So I guess the lesson learnt here is to always disassemble your ISR whenever it is clearing some flags, and see what the C code actually does behind the lines.

      And btw, bit fields == problems. Never use them. Ever. I could probably write several pages of reasons why you shouldn’t, but that’s another story…

      • Gauthier says:

        The point here is if an interrupt arrives between the read and the write of the rmw, is it?

        My point was more that writing 0 to a bit of a clear interrupt register should not have any effect.
        If that is the case (and I agree that it is platform dependent, but I would dislike a platform not fulfilling this), then
        REG |= FLAG;
        should not be a problem?

        – store REG into CPU register
        bit at 0 must be read as 0 (possibly all bit read as 0, it does not matter).

        – manipulate contents of CPU register
        set to 1 only the bits for which you want to clear the interrupt.

        – write contents of CPU register back to REG
        only the bits at 1 in the CPU register clear an interrupt.

        About your comments on bit fields, do you mean the C bit fields in a struct? It would be interesting to read your “other story”. I feel that there might be compiler compatibility issues with that…

  2. groovyd says:

    depends entirely on the device how interrupts are cleared and sometimes even the specific peripheral causing the interrupt. on 32-bit avr it has separate set and clear registers for almost all status registers so you would write to AVR32_TC.src=bit to clear and sometimes you just need to read a register to clear it.

  3. Nigel,

    This area always gives me the heebie-jeebies.

    I agree with everything you’ve said, and I did indeed go NOOOOOOOOOO! when I looked at the code, so not all brain cells are yet dead.

    But even your “fixed” version doesn’t necessarily fix everything, depending on the underlying hardware.

    If the original interrupt signal is edge-triggered, it MIGHT be good enough – one edge gives you one service routine, and then you clear it….. Of course you have to know in your heart that if the input signal went down and back up again before you finally clear it, then in all probability your second signal will be lost forever.

    That was the bugbear with edge-triggered ints, and I think ISA busses used to be like that – giving rise to much agonising of whether the worst could happen.

    The worst case of course is that an edge-triggered input sits high, but you have failed to deal with it – which is a deadly-embrace to the end of the world, as you’ll never get another rising edge to unjam it.

    Do you know we once built a bit of kit that XOred the interrupt inputs with a 1khz pulse – on the basis that it’s better to service 1000 interrupts with no apparent cause than miss the 1 that jams up your hardware forever.

    The alternative of course, is that the interrupts are level-sensitive – and that’s all fine. Provided that each interrupt source will keep on hammering you with a high-level if you reset it while it still wants your attention, and that your interrupt controller is correctly set-up to go right on back into the interrupt code if the input line is still high when it exits.
    This issue of course is much worse if the peripheral is sharing one interrupt line between 4 causes, as UARTs have a tendency to do.

    By default I always write interrupt routines to manually check the attached peripheral before exiting to make sure it really has been satisfied – it doesn’t make it impossible for an interrupt to arrive during the reti sequence ( so you still have to be sure it will be dealt with ) but it avoids too many exit-reenter sequences.

    Life is a bit easier these days where we rarely stuff an interrupt signal into a CPU pin without a sophisticated interrupt controller in between – but I’ve been bitten by raw Z80s, and the wound still smarts.

  4. On the read-modify-write subject….

    Of course an interrupt-clear register should always read as zero, so that it CAN be accessed as rmw without causing the grief suggested.

    The compiler is at the mercy of the hardware and instruction-set of course as actual bit-wise addressing of registers is rare.
    Where it does truly exist it may only apply to a limited set of CPU-internal resgisters, but be unavailable, or worse dummied-up on other registers. Providing true bit-addressing requires a whole pile of gates that don’t normally exist.

    Am I right in thinking that some of the internal registers on some 8051-family stuff are truly bit-addressable?

    In the end, a programmer should assume that all accesses to values are done with rmw unless they have a deep groking of the underlying hardware and a promise in blood from the compiler-writer to use the facilities.

    If a standard rmw cycle on a register to set one bit has an unpleasant side-effect on another, then that really is a hardware design flaw.

    • Bernhard Weller says:

      At least the 80C517A I worked with had some bit-adressable registers, I know this for sure as I programmed them in assembler and there was a special instruction SETB and CLR to set or reset a single bit in the register, the accumulator did support this, some special function registers and most of the ports as well.

      If you are really interested have a look at the datasheet here: http://www.keil.com/dd/docs/datashts/infineon/80c5x7_um.pdf

    • Carroll Robinson says:

      My point on the read-modify-write subject was that clearing flags of the interrupt acknowledge register using C language bitfields could be problematic. But, if the programmer simply writes to the entire register with a bit mask containing a single 1, then only a single acknowledge flag will be cleared (because writing zeroes typically has no effect on the acknowledge flags).

      So, clearing an acknowledge bit using a bitfield, such as in REG.INTACK.bit.6 = 1, may cause multiple flags to be cleared, due to the read-modify-write side effect mention in my original reply.

      But, clearing an acknowledge bit by writing the entire register, such as in REG.INTACK = 0x0040, would cause only a single flag to be cleared. This is the approach that Nigel may want to take.

      (Of course, I would define a constant for the bitmask, such as #define BITMASK_CR0INT (1 << 6), rather than use a "magic number" like 0x0040).

      • Gauthier says:

        The difference between:

        REG.INTACK.bit.6 = 1

        and

        REG.INTACK = 0×0040

        is that in the first case, other bits that bit 6 are undefined upon writing to the register. Is that correct?

        • The problem is that if any bit read ‘1’ in the register, then writing the ‘1’ back accidentally will clear those interrupts.

          So imagine our register value is 0x3F – 01011111

          We want to clear the ‘1’ in position 0, so we read and get the value – 01011111
          Then, the next step will be to modify the given value to write a ‘1’ to what we read
          Modified, we get 01011111
          Our constraint (‘1’ in the LSB position) is preserved, but wait! We actually had additional constraints – namely, ‘0”s in the other positions. The value we WANT to write is 00000001
          So now we go to write the 01011111 value back to the register. Since it’s special, writing the ‘1’s everywhere wipes out all ‘1’s so now the register is 00000000 or 0x00.
          All of our interrupts look like they haven’t occurred, so we don’t process them and leave the ISR.

          Ah, embedded. So many ways to fail.

          • Ah crap, that’s 0x5F, not 0x3F. I fail, I know it….

          • Gauthier says:

            I thought we were talking about separated interrupt clear registers and interrupt flag registers.
            If the register in question is a clear register (not a flag register), why would reading it yield any 1 in any bit?

            That is what I meant with:

            “Write-only registers use to have defined behaviour when the program attempts a read on them. In that case I would expect the read to only result in 0.”

            Since an interrupt clear register is functionally a write-only register, reading it anyway should read 0 only. I would see an architecture failing to do that as flawed (but I am sure there are plenty of such micro-controllers out there).

Leave a Reply

You must be logged in to post a comment.