embedded software boot camp

What’s the state of your Cortex?

Monday, September 26th, 2011 by Miro Samek

Recently, I’ve been involved in a fascinating bug hunt related to a very peculiar behavior of the ARM Cortex-M3 core. Given the incredible popularity of this core, I thought that digging a little deeper into the mysteries of ARM Cortex could be interesting and informative.

First, I need to provide some background. The bug was related to a unique ARM Cortex-M exception type called PendSV. This is an exception triggered by software, but unlike a regular software interrupt, PendSV is an asynchronous exception. This means that PendSV typically does not run immediately after it is triggered, but only after the Nested Vectored Interrupt Controller (NVIC) determines that the priority of the currently executing code has dropped below the priority associated with PendSV.

At this point, you might wonder: why and where would such a “Pended Software Interrupt” be useful? Well, it turns out that PendSV is the only reliable way on ARM Cortex-M to find out when all (possibly nested) interrupt service routines (ISRs) have completed. And this determination is essential to run the scheduler in any preemptive real-time kernel.

Virtually all preemptive RTOSes for ARM Cortex-M processors work as follows. Upon initialization, the priority associated with PendSV is set to the lowest of all exceptions (0xFF). All ISRs in the system, prioritized above PendSV, trigger the PendSV exception by writing 1 to the PENDSVSET bit in the NVIC ICSR register, like this:

*((uint32_t volatile *)0xE000ED04) = 0x10000000;
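For completeness, the other half of the setup, programming the lowest priority for PendSV, can be written in the same register-poking style (a minimal sketch: PendSV is exception 14, so its priority byte is the third byte of the SHPR3 register at 0xE000ED20; the helper name and the pointer parameter are mine, added so the logic can also be exercised off-target):

```c
#include <stdint.h>

/* PendSV is exception number 14; its 8-bit priority field is the third
 * byte of the System Handler Priority Register 3 (SHPR3, 0xE000ED20),
 * i.e. the byte at address 0xE000ED22. */
static inline void pendsv_set_lowest_prio(uint8_t volatile *prio_byte) {
    *prio_byte = 0xFFu;  /* lowest possible priority */
}

/* on target:
 *     pendsv_set_lowest_prio((uint8_t volatile *)0xE000ED22);
 */
```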

Now, the heavy lifting is left entirely to the NVIC hardware. The NVIC will activate PendSV only after the last of all nested interrupts completes and is about to return to the preempted task context. This is exactly the right time for a context switch. In other words, the PendSV exception is designed to call the scheduler and perform the task preemption. ARM Cortex is so smart that it eliminates the overhead of exiting one exception (the last nested interrupt) and activating another (the PendSV) with a trick called “tail-chaining”.

Everything looks easy so far, but ARM Cortex has one more trick up its sleeve, and this optimization, called “late-arrival”, has interesting side effects related to PendSV. This subtle interaction between PendSV and late-arrival leads essentially to a hardware race condition I’ve recently had the pleasure of chasing down.

To illustrate the events that lead up to the bug, I’ve prepared a distilled hardware trace available for viewing at ARM-Cortex-M3_bug.txt. Please go ahead and click on this link to follow along.

The trace starts with an interrupt entry (labelled as Exception 83). This system runs under the preemptive kernel called QK, so the ISR calls QK_ISR_ENTRY() and later QK_ISR_EXIT() macros to inform the kernel about the interrupt. At trace index 069545 the QK_ISR_EXIT() macro triggers the PendSV exception by writing 0x10000000 into the ICSR register.

After this, the Exception 83 runs to completion and eventually tail-chains to Exception 14 (PendSV). This is all as expected.

However, the real problem starts at trace index 069618, at which the execution of the first instruction of PendSV (CPSID i) is cancelled due to the arrival of a higher-priority Exception 36 (another interrupt).

This cancellation of the low-priority Exception 14 in favor of the higher-priority Exception 36 is an ARM Cortex special called late arrival. The ARM core optimizes the interrupt entry (which is identical for all exceptions): instead of entering the low-priority exception and then immediately entering the high-priority exception, it simply enters the high-priority exception.

The problem is that just before the late arrival, the PENDSVSET bit in the NVIC-ICSR register is already cleared.

However, the late-arriving Exception 36 sets this bit again in QK_ISR_EXIT(), which is normal for any interrupt (trace index 070126).

The Exception 36 eventually exits to the original PendSV (trace index 070130), but this is not the usual tail-chaining (the trace indicates tail-chaining by the pair Exception Exit/Exception Entry). This time around the trace shows only Exception Exit, but no entry.

This difference has a very important implication, which is that the PENDSVSET bit in the NVIC-ICSR register is not cleared (remember that it has just been set again).

What unfolds next is the consequence of the PENDSVSET bit being set. PendSV executes, fakes its own return to the QK scheduler, and eventually it unlocks interrupts. But before SVCall (Exception 11) can execute, the PendSV Exception 14 is taken again (because it is triggered by the PENDSVSET bit). This makes no sense and should never happen, because PendSV should never be in the triggered state at this point.

So, what are the consequences of this behavior and what is the fix?

Well, as you can see, due to late-arrival PendSV can occasionally be entered with the PENDSVSET bit set, so it will be triggered again immediately after it completes. This might or might not have adverse consequences. In the case of the QK kernel, this was unacceptable and led to a Hard Fault. In other RTOSes it might simply cause another scheduler call, waste some CPU time, and delay the task-level response, but perhaps not a catastrophic failure.

The actual fix of the problem is very simple. Since you cannot rely on the automatic clearing of the PENDSVSET bit in the NVIC-ICSR register, you need to clear it manually (by writing 1 to the PENDSVCLR bit in the NVIC-ICSR register). Of course this is wasteful, because only about one time in a million is this bit actually not cleared automatically.
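In code, the proposed fix is a single extra write from the PendSV handler, right before invoking the context switch (a sketch: PENDSVCLR is bit 27 of the ICSR register at 0xE000ED04; the register address is passed as a parameter only so the bit logic can be exercised off-target):

```c
#include <stdint.h>

#define ICSR_PENDSVCLR (1UL << 27)   /* write 1 to un-pend PendSV */

/* Clear a possibly re-pended PendSV; on target you would call
 *     pendsv_clear((uint32_t volatile *)0xE000ED04);
 * from the PendSV handler, right before the context switch. */
static inline void pendsv_clear(uint32_t volatile *icsr) {
    *icsr = ICSR_PENDSVCLR;
}
```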

Interestingly, I have not seen such writing to the PENDSVCLR bit in open source RTOSes for ARM Cortex-M (such as FreeRTOS.org). Recently, I’ve come across some posts to the ARM Community Forums indicating that this problem exists for the Freescale MQX RTOS (see PendSV pending inside PendSV handler? (Cortex-M4)).

If you use a preemptive kernel on ARM Cortex-M0/M3, perhaps you could check how your kernel handles PendSV. If you don’t see an explicit write to the PENDSVCLR bit, I would recommend that you think through the consequences of re-entering PendSV. I’d be very interested to collect a survey of how the existing kernels for ARM Cortex-M handle this situation.


48 Responses to “What’s the state of your Cortex?”

  1. Nick Merriam says:

    Great post Miro. I love this kind of thing. It reminds me of a tricky behaviour of the C167 family (it could still be there for all I know), where you could enter an interrupt and the state on the stack showed that the interrupted context was running at a higher priority than the interrupt. This happened when you interrupted an instruction that was writing the interrupt level bits of the PSW. The processor decided to handle the interrupt but also completed the instruction to raise the interrupted context’s interrupt level. The fix was to wrap such writes to the PSW with an ATOMIC sequence.

    Our gliwOS microkernel does not support pre-emptive task switching because (a) most automotive applications do not actually require it and (b) we have seen soooo many cases like this of nasty race conditions that affect a preemptive OS. Also, there are cases where a non-preemptive OS outperforms a preemptive OS.

  2. Richard Barry says:

    I’m always grateful for a mention of FreeRTOS in any article. I need to triple read what you are saying in this article, but from my first read, it appears you are implying there is a bug in how FreeRTOS interacts with the Cortex-M core – when in fact there is no such thing. The worst case scenario would be that the PendSV runs twice, which has zero consequence over and above wasting a tiny amount of time.

    FreeRTOS has no special interrupt entry or exit code, does not pend a PendSV unless it is necessary (that is if the ISR unblocked a task with a priority higher than the task currently in the Running state), and never globally disables interrupts (CPSID i). In the standard version the only SVC call used is to start the scheduler, although the MPU version makes multiple use of SVC calls.

    Author, FreeRTOS

    • Miro Samek says:

      I have mentioned FreeRTOS.org only because it is popular and truly open source, so it was very easy for me to go and check its source code without needing to register or relinquish my personal information (as is the case with so many other RTOSes out there). As an author of the open source QP framework I also constantly deal with such scrutiny of the source code. I see it as a very good thing, because it contributes to constant improvement of the quality of the code.

      I am really glad to hear that after performing analysis of the FreeRTOS.org port to ARM Cortex-M you conclude that occasional entry to the PendSV exception with the PENDSVSET bit set causes no problem for FreeRTOS. This is all I was asking for.

      The real question is whether you *knew* about this peculiar interference between late-arrival and the PENDSVSET bit, so that you *know* that you need to think through the consequences of this particular scenario.

      I certainly didn’t know about it and I would have been very grateful to anybody who would share this piece of information with me. As far as I know this behavior is undocumented and is untestable with a single-step debugger where you trigger various exceptions at various places in the code. The bug raises its ugly head only in the highly dynamic situation of late-arrival, so it can be revealed only in a system running at full speed and heavily hammered by interrupts.

      • Paul Kimelman says:

        Miro, you are focused on the wrong issue. I am not even sure this is late arrival – it looks more like preemption, but without time or stack info, no way to verify. But, the simpler case is that PendSV has entered and then your ISR 36 fires. The exact same scenario would occur as you describe. I assume your PendSV handler protects its critical data, so ISR 36 would occur before or after.
        In either case, the side effect of ISR 36 needs to be verified by PendSV. Since ISR36 could fire anywhere in PendSV, it makes sense for the core to re-run PendSV to make sure the change is handled. Since you are using critical sections, there is no risk of corruption, so it is only time.

        • Richard Barry says:

          Well, Paul Kimelman is obviously far more knowledgeable on this than I am, so I won’t say too much, but I don’t understand your use of the word “bug”. How you are describing the core working is exactly how I would expect it to work, and I think how it is intended to work, and how I would infer it worked from the documentation. I would go as far as to say that if the core did not exhibit this behaviour, it would be a bug in the core, not if it does exhibit it. For example, if the pend bit were cleared, and a preempting interrupt could not re-set it, then that would be a problem. There is no race condition, is there? Then again, I have been using Cortex-M devices for many years, so maybe my knowledge (?) of its workings comes more from experience than the documentation.

        • Dan says:

          Hi Richard,

          When Miro said “the bug raises its ugly [head]”, I don’t think Miro was referring to the CM3 core, I think he was referring to the QK kernel code.

          (Miro, feel free to correct me if I have mis-represented your intent.)

          • Paul Kimelman says:

            Dan, he is treating it as a core problem: “actual fix of the problem is very simple. Since you cannot rely on the automatic clearing of the PENDSVSET bit in the NVIC-ICSR register, you need to clear it manually (by writing 1 to the PENDSVCLR bit in the NVIC-ICSR register.)”.

            If he cannot tolerate PendSV being called as soon as it exits, then he has a big problem since this “fix” will not fix that. As I noted, the high priority ISR could come in at instruction 0x08023278 (where PendSV enables interrupts) and it will of course re-execute.

            But, the trace does not make sense anyway. He has an SVC instruction in his PendSV handler apparently. For this to not fault, the SVCall exception must be higher priority than PendSV. If it is, then it should execute before PendSV can be re-executed. Further, PendSV will not re-enter until you return. That is, the PendSV bit does *not* cause PendSV to pre-empt itself – that cannot happen. So, his trace is not reflecting something, but we do not have enough info to see what is going on.

          • Dan says:

            Hi Paul,

            (Staying at current level of indentation for readability)…

            I hear you – I was only responding to Richard’s remark about the use of “bug”, I thought Miro was only referring to the QK implementation, not the core, when he used that term. Just how I read it.

            This is totally my interpretation, but I read the “you cannot rely on the automatic clearing of the PENDSVSET bit” not as a slam on the CM3, but more pointing out a situation where the QK doesn’t handle the particular sequence of events properly (gotta love the asynchronous nature of interrupts, right?). My experience with Miro is that he’s usually the first to assume the problem is in his own code rather than blame someone else’s code (or processor design).

            By the way, thanks for joining in the conversation. I’d say you’re uniquely qualified to weigh in on this topic, so any insight & suggestions are welcome by all of us.

        • Miro Samek says:

          I certainly realize that many millions or perhaps now even billions of ARM Cortex-M processors have been deployed in all sorts of products, so it is quite safe to say that the core doesn’t have bugs. The point of my post was rather to describe an interesting, and non-obvious (at least to me) behavior related to the interference of two features (PendSV and late-arrival), both being unique to ARM Cortex-M.

          Perhaps I should also note that the hardware trace referenced in the original post contains very useful end-of-line comments, which were generated by the trace probe, not by me. These comments provide additional information from the hardware, from which you can clearly see that Exception 14 was entered (index 069617), but it was cancelled and Exception 36 was entered instead. Also, the exit from Exception 36 (index 070130) is *not* followed by the “Exception Entry” comment, but the first instruction of PendSV is executed. These two pieces of evidence make me still think that my interpretation of this trace is correct, namely that we are dealing here not with “normal” preemption, but with late-arrival.

          The non-obvious behavior to me is that late-arrival causes entry to *both* exceptions simultaneously, including execution of the side effects of both exceptions, such as clearing the pending bits (PENDSVSET bit in case of PendSV and some other pending bit for Exception 36). But then again, perhaps the right way of thinking about late-arrival is that it *is* the preemption of a low-priority exception by a higher-priority exception only compressed to one machine instruction (exception entry). With this interpretation I admit that I should have *expected* that the PENDSVSET bit won’t be cleared as part of resuming the preempted PendSV. (Although, interestingly, the cancelled first instruction of PendSV *is* recalled and executed.)

          But this whole discussion brings into focus one important point, I hope. And this is that occasionally PendSV is triggered while another instance of PendSV is already active. When this happens and when the PENDSVSET bit is *not* explicitly cleared right before the context switch, another instance of PendSV is executed immediately after the first one. I understand that this is not catastrophic in the FreeRTOS.org implementation. However, the amount of code executed from PendSV is non-trivial and delays the task-level response. My gut feeling is that adding an explicit write to the PENDSVCLR bit right before invoking the context switch would improve the determinism of the kernel. Of course, the average execution time would increase by three short instructions, but the *worst-case* task-level response would improve. I believe this is something to think about.

          • Paul Kimelman says:

            I think we should clarify a few things. Late arrival simply means that while we are pushing regs, a higher priority interrupt comes along and so we pass control to it vs. the original ISR – when it returns, the one that was prevented from running will be entered by tail chaining (which just means skip the pop and then push). This works by keeping the pend bit set until the ISR is “activated” (1st location loaded into PC).
            I do not know what the trace tool you are using does, but the PendSV pending bit is not cleared until the 1st instruction is about to start (after it is too late for late arrival) and the active bit is also set. However, if a higher-pri interrupt comes in, that 1st instruction is not executed (canceled) and the interrupt is taken. The registers are all stacked again since this is a pre-emption. So, if you see it cleared, then it must have entered PendSV but had the 1st instruction canceled by pre-emption.
            If your PendSV code does a lot of work before anything that would be affected by another interrupt, then by all means clear the PendSV pend bit. But, you cannot do it after that critical point and be safe, of course. For most kernels, this is near or at the start. For you, it is worse since you disable interrupts right away. So, the only place it can occur is just before that 1st instruction, and the value of clearing it as one of the 1st few instructions is very limited: the odds of catching it on its 1st instruction are low. However, when it does happen, you only see it because it crashes your code.
            My guess is that you are creating a lot of these effects by global interrupt disable for whole ISRs. This causes interrupts to align to where the interrupts are enabled.
            I still do not understand your trace showing SVC being used in PendSV. I also do not understand the next instruction being PendSV – CM3 will not pre-empt an ISR with itself. Are you using the same function for SVCall and PendSV?

  3. Paul Kimelman says:

    This is intentional. First of all, as Richard Barry notes above, ISRs only need to set PendSV if they do something that would need rescheduling (that is, they change a task state or a resource task is waiting on). Second of all, PendSV can be set from an ISR which has pre-empted PendSV – this is very intentional. If this happens, there is a potential race since it could happen after you finished updating the task list (and released the critical section). So, it runs again (by tail chaining). If the 2nd run finds nothing new to do, that is OK. But, it could.
    The reason you are confused is that you think the PendSV bit is lost on the late arriving high pri interrupt, but that is not the case. My guess is that this is not a late arrival case but a pre-emption. You likely pre-empted it (after it entered). The PendSV bit is not cleared until the PendSV handler is active (stacked). You can tell by looking at the active bits, by looking to see if the higher pri int will be returning to the task level, or by looking at the stack.
    Also as Richard says, the best way to do a critical section is BASEPRI (and BASEPRI_MAX) so you only impact ISRs which use or affect the same critical data. So, if the top 3 pri levels do not make OS calls, then set BASEPRI to the 4th level so that the top 3 can run unimpeded but the priorities below are protected from corrupting your data. You can even use different levels for different resources. That is, tasks may be the 4th pri level, but mailbox CBs may be the 5th level since the 4th does not touch mailbox CBs.
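    Paul’s BASEPRI scheme can be sketched as follows (a sketch under the assumption of 3 implemented priority bits, which is vendor-specific; the encoding helper is mine, while __set_BASEPRI_MAX and __set_BASEPRI are the standard CMSIS intrinsics):

```c
#include <stdint.h>

/* Assumption: 3 implemented priority bits (8 preemption levels), as on
 * many Cortex-M3 parts; the actual number is vendor-specific. */
#define NVIC_PRIO_BITS 3u

/* Cortex-M priority registers keep the level in the TOP bits of an 8-bit
 * field, so a logical level must be shifted up before it is written. */
static inline uint8_t basepri_encode(uint8_t level) {
    return (uint8_t)(level << (8u - NVIC_PRIO_BITS));
}

/* With CMSIS intrinsics, masking priority level 3 and below (numerically
 * higher value = lower priority) while leaving the top three levels
 * (0, 1, 2) unimpeded would then look like:
 *
 *     __set_BASEPRI_MAX(basepri_encode(3));   // enter critical section
 *     ...                                     // touch the shared data
 *     __set_BASEPRI(0);                       // exit critical section
 */
```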
    Regards, Paul

  4. Paul Kimelman says:

    In response to Nick, I will say that Cortex-M3/M4 (really ARMv7-M) was designed to support OSes including pre-emptive ones. So, it provides a number of facilities to ensure better performance in many cases, faster scheduling, avoidance of critical sections and avoidance of global critical sections, etc. It also supports a protected OS using the MPU, which Richard Barry has added support for in FreeRTOS – this includes fast update of the MPU during scheduling.
    There are still many tricks up its sleeve that are not yet being used by any OS/kernel (that I know of). This includes minimally:
    – Using bit banding to support task sleep/wake with no critical sections. That is, ISRs set an atomic bit and pend PendSV, and the scheduler can then perform the task operation without needing a critical section at all.
    – Use of LDREX/STREX (exclusives) for non-blocking and non-locking FIFOs between ISRs and tasks, such as for peripheral feeds (RX and TX).
    – Support of supervisor and user tasks. Best with MPU, but can be useful in other cases.
    – You can run user code within an ISR context. This was done for Autosar, but can be useful for security or safety.
    – Many local fault handlers to allow for better handling of problems in a layered way with a global catch (Hard fault) for panics.
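    The LDREX/STREX FIFO idea from the list above can be modeled off-target with C11 atomics (a sketch with invented names; on ARMv7-M a compiler lowers atomic read-modify-write operations to LDREX/STREX, and a single-producer/single-consumer ring such as one between an RX ISR and a task needs only atomic loads and stores of the indices, as shown here):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_SIZE 16u                 /* must be a power of two */

typedef struct {
    uint8_t buf[FIFO_SIZE];
    atomic_uint head;  /* written only by the producer (e.g. an RX ISR) */
    atomic_uint tail;  /* written only by the consumer (task level)     */
} spsc_fifo;

/* producer side: returns false when the FIFO is full */
static bool fifo_put(spsc_fifo *f, uint8_t byte) {
    unsigned h = atomic_load_explicit(&f->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (h - t == FIFO_SIZE) {
        return false;
    }
    f->buf[h % FIFO_SIZE] = byte;
    atomic_store_explicit(&f->head, h + 1u, memory_order_release);
    return true;
}

/* consumer side: returns false when the FIFO is empty */
static bool fifo_get(spsc_fifo *f, uint8_t *byte) {
    unsigned t = atomic_load_explicit(&f->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&f->head, memory_order_acquire);
    if (h == t) {
        return false;
    }
    *byte = f->buf[t % FIFO_SIZE];
    atomic_store_explicit(&f->tail, t + 1u, memory_order_release);
    return true;
}
```

    Neither side ever blocks or disables interrupts: the producer only writes head, the consumer only writes tail, and the acquire/release pairs order the data accesses against the index updates.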

    Regards, Paul

    P.S. To Miro, PendSV is not the only way to ensure a handler runs before returning to task level. You can also use the SET PEND bit on any ISR (system or interrupt) to get this effect. PendSV was added to make it really easy and standard (vs. having to do a port for each MCU variant) for the same reason SysTick was added to the core.

  5. 42Bastian says:

    Hi Miro,
    on first read it sounded like you were right. But the more often I read your article and look at the trace, the more I get the feeling you are doing something wrong.
    One comment: not every ISR should trigger PendSV, only those that create a need for re-scheduling.

    I also do not see, where SVC comes into the play.

    In our RTOS, I do not clear the PendSV bit. I actually think it would be a wrong action unless one really wants to cancel a pending scheduling.
    I never got a report about the behavior you describe before. And our customers have built some high-ISR-rate applications.

  6. Miro Samek says:

    Thank you all for your comments.

    It’s becoming clear to me that I should have given more background information about the particular preemptive kernel that I was working on, because it is very different from all traditional (blocking) kernels.

    So, my kernel, called QK, is fully preemptive and priority based, but it can only manage single-shot tasks that run to completion and cannot block. (This class of kernels is known in the OSEK/VDX terminology as the BCC1-class kernels.) In exchange for the inability to block, such a run-to-completion (RTC) kernel can be very simple and very fast, because it can use a *single stack* for all tasks. The RTC kernel works in the same way as a prioritized interrupt controller, such as the NVIC, which keeps the context of all nested interrupts on a single stack. An RTC kernel implements in software the same policy as NVIC implements in hardware.
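    The single-stack, run-to-completion policy described above can be sketched in portable C (a toy model with invented names; a real QK port does the equivalent inside PendSV with interrupts locked, and the priority lookup is typically a single CLZ instruction):

```c
#include <stdint.h>

typedef void (*task_fn)(void);

static task_fn  tasks[8];      /* one run-to-completion task per priority  */
static unsigned ready_set;     /* bit n set = the priority-n task is ready */
static int      cur_prio = -1; /* priority in progress; -1 means idle      */

/* index of the highest set bit (a real port would use CLZ) */
static unsigned highest_prio(unsigned set) {
    unsigned p = 0;
    while (set >>= 1) {
        ++p;
    }
    return p;
}

/* example tasks, used only to illustrate the scheduling order */
static int run_order[8];
static int run_count;
static void blinky(void) { run_order[run_count++] = 5; }
static void logger(void) { run_order[run_count++] = 1; }

/* Run every ready task whose priority is above the preempted one, each to
 * completion, all on the SAME stack: the software analog of NVIC nesting. */
void rtc_schedule(void) {
    while (ready_set != 0) {
        unsigned p = highest_prio(ready_set);
        if ((int)p <= cur_prio) {
            break;             /* nothing above the current priority */
        }
        int saved = cur_prio;
        cur_prio  = (int)p;
        ready_set &= ~(1u << p);
        tasks[p]();            /* run to completion, like a nested ISR */
        cur_prio  = saved;
    }
}
```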

    At this point, you might question the usefulness of such a crippled kernel, but it turns out that an RTC kernel is ideal for execution of state machines, which need exactly the RTC semantics but never need to block in the middle of RTC processing. My ESD article “Build a Super-Simple Tasker”, which I wrote together with Robert Ward, explains the inner workings of an RTC kernel and its benefits for executing state machines (http://www.eetimes.com/design/embedded/4025691/Build-a-Super-Simple-Tasker).

    Going back to the ARM Cortex-M, I found it tricky to implement the RTC kernel, because I couldn’t find a way to send the End-Of-Interrupt (EOI) command to the NVIC. The implementation I ended up with is to fake the PendSV exception return to get to the QK scheduler at the task level, but then I needed to return to the preempted task context, at which point I needed another exception return, for which I employed SVCall. The details of the QK port to ARM Cortex-M are described in the Application Note “QP and ARM Cortex-M with IAR” available at http://www.state-machine.com/arm/AN_QP_and_ARM-Cortex-M-IAR.pdf. The QK code, including the PendSV assembly code is available from SourceForge.net at http://sourceforge.net/projects/qpc/files/QP-nano/4.2.04/ (The QP-nano framework is simplest to experiment with).

    @Paul Kimelman: I hope that the provision of the context clarifies some of your concerns about my sanity. The BCC1-class QK kernel I employ to execute state machines lies off the beaten path of traditional kernels, so perhaps my PendSV/SVCall code doesn’t look right to you at first glance. But I hope that if you delve into it just a little deeper, you might find the ultra-simple and ultra-fast QK kernel interesting, in which case I’d love to enlist your help. My basic question is this: Is it possible to send the EOI command to NVIC in any simpler way than I am currently doing?

    • Paul Kimelman says:

      I am somewhat confused by what the EOI needs to do. Traditionally, the popup thread model (run to completion) works just like any other scheme except that all tasks must yield by returning. They behave like polling-loop handlers: they hold all state locally and make decisions via non-blocking calls that return either way, e.g.
      if (MailBoxHasData(&data)) { /* process data… */ }
      So, the stack is created and discarded each time (so it is the same). In Cortex-M3, this could be the process stack or the main stack, but the process stack works well for this. Normally the tasks are set up with a return link (LR) which invokes SVC. So, the task is started by “returning” from SVCall to a function like:
      void TaskStart(void (*task_start)(void)) {
          task_start();  // run the task to completion
          svc(OS_DONE);  // use SVC to re-enter the SVCall handler
      }

      The PendSV mechanism was designed for what you want: the last ISR to return back to thread level returns into PendSV because the pend bit is set. The PendSV routine can do what it wants and then return into the task level, including messing with the return frame (as a normal kernel does). If you want to add pre-emption on top of what I show above, then you normally would create a fake frame over the real frame in PendSV and return to that TaskStart. When it is finished, the SVCall handler can toss the finished tasks “frame” and get back to the pre-empted task. Since it created the entry frame, it knows how big it is and how to get rid of it.
      In this model, SVCall is only invoked by the task ending and is the same priority as PendSV. That was my intent anyway.
      The only thing to watch for is the stack getting too deep. I guess you can also have some concept of priority inversion but that is not solvable. By that I mean if the lower pri task (below on the stack) has claimed a mutex, the higher pri task obviously cannot “wait” on it and somehow has to delay being run next. Traditional popup schemes use a task invoke list. So, the high pri task registers with the mutex (like a pend list) and when the mutex is freed up, the next task is invoked.
      So, what did I miss about how you want to use EOI?
      Regards, Paul

    • Paul Kimelman says:

      As a quick side note: I did envision these popup threads (or state machine polling loop handlers). I used a scheme like that on 8051 years ago to replace ladder logic.
      But, I did not consider pre-emptive tasks. However, I do not see it as a problem if you use the PendSV/SVCall scheme that creates a fake return frame to invoke the task, with task “exit” being SVCall re-entry. That method allows user tasks but also allows “pre-emption by nesting (stacking)” to work just fine.

    • 42Bastian says:

      I think the problem is that you use SVC for scheduling. Actually, your scheduling looks strange to me. Why don’t you call the scheduler from PendSV_Handler?
      If the scheduler is to be called from some OS function, it should not call QK_schedule_() directly; instead it should pend PendSV.
      Then you have no problem returning from the scheduler.

  7. Fred Roeber says:

    Some good discussion of this issue here. And it is one of those very tricky edge cases that those of us who do OS code have to worry about. I see what Paul is saying, but from the trace that Miro shows I tend to agree that this is a bug. The PendSV bit is supposed to allow interrupt handlers to trigger processing that is deferred until all interrupts have completed. It is not unreasonable to have an OS design that assumes that an interrupt handler did trigger a request for such processing and that the processing has not been performed upon entry to the PendSV handler. Because of the bug Miro points out, one can’t make that assumption, since there could be a case where the processing request signaled by an interrupt handler has already been processed. Specifically, the PendSV handler is running a second time even though no interrupt occurred since the last time the handler was entered. And it’s not hard to imagine that there is such a bug. I’ve seen very similar bugs in several other processors over the years. And just look at the errata for the Cortex M3 and the bugs that keep popping up (there is a new one for the r2p1 release that has me worried but that’s another story).

    In the “bare metal” Cortex M3 application we are wrapping up, our code used the PendSV interrupt to provide a sort of “high priority” signal handling capability along the lines Paul mentions in his last reply, based on bitbanding (a wonderful feature) and PendSV. I was going to “fix” our code to clear the PendSV bit as Miro suggests. Not because the extra invocation of our code would break anything, but just because it would waste time. Then I got worried that clearing the PendSV bit at the start of the interrupt handler could introduce an even worse sort of race condition. That would be the case where the PendSV interrupt handler gets invoked normally, but before it executes the instruction to clear the PendSV bit in the ICSR, the handler gets interrupted by a higher priority interrupt that again signals the PendSV bit. Upon exiting that interrupt and properly returning to the PendSV handler, the instruction to clear the PendSV bit would clear the notification that a valid new interrupt had occurred since the start of the function. But I realized this is actually ok, based on the assumption that you clear the ICSR PendSV bit before doing any of your real work, since you will end up handling the late-arriving request at the same time as the earlier request.

    I came to the conclusion that for our code I’m not going to make any changes. Since an extra invocation doesn’t hurt our operation at all and only wastes a small amount of time, why waste even just a few instructions to save time for the “one in a million” case. But it sounds like with Miro’s design he does need the fix he proposes.

    And that’s just my $.02.

    • Paul Kimelman says:

      Fred, I am not following your point on PendSV. This is not a bug, but is intentional. There are two choices: either PendSV would block all interrupts once it starts to ensure no interrupt could come in the entire time, or it can be re-pended if an interrupt comes in after it has started (and that ISR sets the PendSV pend bit). I chose the latter for simple reasons. Think about this logically: if you blocked all interrupts for the whole duration of PendSV, then any pending interrupt will fire as soon as PendSV leaves. So, you would end up running it again. Therefore there is no difference between the two cases except that the case we have does not block high pri interrupts for long periods. The only “wasted time” case is if the interrupt just snuck in before the PendSV handler happened to start its decision process; but that usually happens right away, so the period where it can do this is tiny; if not tiny, then you can manually clear PendSV bit just before you do start it. But, normally the PendSV handler just finds the highest pri task and checks if that is the same as the current task. If so, it returns to the current task; if not, it switches the tasks. If a high pri interrupt changes the task list just after that lookup or decision, then you have to run PendSV again. If you disabled all interrupts that time (as Miro does), then that interrupt will simply fire as soon as you release it and so you have to run it again anyway.
      The problem is that the trace looks very messed up for 2 reasons:
      1. It implies PendSV pre-empted itself which is not true. PendSV will exit and re-enter (tail chained). But, I do not see the return in the trace. So, something is messed up in how he trimmed the trace I think (or SVCall and PendSV vector to the same code).
      2. He uses SVC at the end of PendSV. As 42Bastian says, he should just call that handler if same priority. You cannot embed SVC in a handler unless the SVCall priority is higher – this is because SVCall has to vector instantly. This is why I created PendSV. So, I am not sure why this is there and I am not sure the intent.

      I am not sure which bugs you are referring to with CM3. LDRD is the main one of concern. The r2p1 bug on the SP load should not impact anyone since SP is not normally loaded from volatile “memory” but from a TCB or other memory-like area. Setting SP at all is rare other than a scheduler and I cannot see it getting the SP from memory that would give a different result if you read it twice.

      • Paul Kimelman says:

        Actually, it is good to make sure everyone working on an RTOS is aware of that SP bug, although it has been in all versions of CM3. The risk would be if you are loading the SP from the TCB and you use ! (update) in that LDR. That is:
        LDR SP,[R0,#4]! or LDR SP,[R0],#4
        The above two cases update R0 after loading the stack pointer. Although the advantage of doing this is unclear, it is possible. More likely you have:
        LDR SP,[R0,#20] ; load task’s stack from the TCB’s SP save area
        But, if you use the update-base-register mechanism, then this would affect you (in all versions of CM3).
        Note that when I said LDRD is the main one of concern, that would be only if you used it in your own code via assembly. It is a narrow case and a strange use (1st destination is also the base register).

      • Fred Roeber says:

        Paul, First, about your last point. I was referring to the newly reported LDRD bug in the r2p1 errata sheet. I haven’t fully thought through the implications of that bug since I just fetched that newer errata sheet this morning as part of thinking about Miro’s issue and impact on code we are readying for release. But I am always concerned with instruction execution problems since they are sometimes hidden behind compiler generated code. (In fact I was amazed at how little assembler was needed with the Cortex M3 since it has so many nice features catering to compiled OS type code). From what you say it sounds like I probably don’t have to worry that the IAR compiler we are using will generate code that will suffer from the bug.

        As to the PendSV issue, I totally agree that you would never want execution of the PendSV handler to hold off other interrupts, particularly since it makes most sense to handle that interrupt at the lowest priority. Also, when reading Miro’s post I got a bit wrapped up in the theory of PendSV rather than the reality of using it. Not knowing the microarchitecture of the CM3 the way you do, I thought Miro’s trace did show a case where the PendSV handler gets invoked (i.e., reaches its first instruction) twice following the last higher-priority interrupt handler that sets PendSV. And theoretically, that seems wrong.

        But, as you say, as a practical matter it doesn’t really matter. There is no difference writing code for the PendSV handler between an interrupt sneaking in during the PendSV dispatch process and one arriving after the dispatch process. If there is some critical interaction between something interrupt handlers do and what the PendSV handler does then I can see that you would just have to use a critical section in the PendSV handler to disable interrupts, clear PendSV, and do whatever critical operations there were before enabling interrupts (assuming the disable call wasn’t nested).

        I haven’t looked at Miro’s specific issue and why he couldn’t use a critical section of the type I mention or just code his PendSV handler to tolerate multiple invocations (which is how our PendSV handler operates). Sorry for missing your point.

        • Paul Kimelman says:

          What newly reported bug in r2p1? As far as I know, the latest errata is still from Feb and v3. Are you saying there is a new one? (I no longer work at ARM, so I am not up on this stuff, especially as we now use CM4-F only).
          As far as I know, r2p1 fixed the LDRD bug, but they found a writeback problem when loading SP (which is what I was referring to). The LDRD bug has to be accounted for simply because it affects the previous versions.
          The compilers all had to deal with it over a year ago. LDRD is not generally very useful for CM3 and was not widely used (vs. LDM). It was only being used by compilers because of codegens for R4/A8/A9 where it was more useful with the wider bus. But, I know IAR corrected it quite a while back.

          The sad reality on the whole PendSV thing is that I had written a document at ARM on strategies for different OSes. But, ARM would not release it because of concerns about “support” and wanting to run it through tech pubs (who would strip all useful content out). They have loosened their rules so that these documents can be done, which is good, but sadly that doc never made it out. I had diagrams to show the stacking models and the side effects of different critical section strategies. Sigh.

          Anyway, as you say, the rule has to be that at the end of any critical section, all critical data should be back to a known state since interrupts can and will come in. It is important that any kernel or OS can tolerate that. Lessening the chance of it (e.g. clearing PendSV) but not eradicating the chance is worse because it means you may not catch it while testing, only in the field.

          Thanks, Paul

          • Fred Roeber says:

            Paul, you are right as far as I know. The last errata I found is the one from February. Too bad the document you mention didn’t make it out with info on OS design recommendations particular to the M3 internal design. Such writeups are frequently quite useful.

  8. Miro Samek says:

    @Paul Kimelman: As I described in my earlier comment, the kernel I’m using is *different* than the traditional blocking kernels. For example, the SVCall really happens *after* the exception-return from PendSV, so SVCall is triggered from the *task-level*.

    Actually, if you only looked at the trace more carefully, you would notice the Exception Exit from PendSV at index 070142, just before calling QK_schedule_. This is done specifically to run the scheduler at the task level, because the scheduler launches the tasks.

    The bottom line is that you really need to get a little deeper if you are genuinely interested in helping here. I attach the assembly code with PendSV and SVCall, so you can take a look (see http://www.state-machine.com/attachments/qk_port.s). But even this piece of assembly is not sufficient to get the whole picture. You also need to see the QK scheduler and understand the general philosophy behind the RTC kernels.

    • 42Bastian says:


      Let me describe briefly how _I_ would do it:

      1. enter main()
      2. call PendSV
      3. PendSV-Handler == Scheduler: Prepare stack frame for highest Pri task
      4. return from PendSV handler =>highest Pri task runs.
      5a. task runs to completion => return just behind 2) entering idle-loop
      5b. interrupt happens
      6a. interrupt triggers a re-scheduling => PendSV, but only if new pri > current pri !!
      6b. interrupt does not trigger a re-scheduling
      7. on return either PendSV handler is called(goto 3) or old task is running (goto 5)

      So the only “messing” with the stack is the exception-frame you have to prepare for calling the next task.

      See, no need at all for SVC, only if you also want to add user/supervisor mode.
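The decision in steps 6a/6b might be condensed into C along these lines. This is a sketch with invented names, not 42Bastian’s actual code; a counter stands in for the ICSR PENDSVSET write so the logic can be exercised on a host.

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of steps 6a/6b (my naming, not 42Bastian's code). A counter
 * stands in for the ICSR PENDSVSET write so the decision logic can be
 * checked off-target. */
static uint32_t pendsv_pend_count;

static void maybe_reschedule(int new_pri, int current_pri)
{
    if (new_pri > current_pri)      /* step 6a: preempt only upward */
        pendsv_pend_count++;        /* real code: ICSR = PENDSVSET  */
    /* step 6b: otherwise do nothing; the old task keeps running */
}
```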


      • Paul Kimelman says:

        42Bastian, remember that he is pre-empting tasks. This does not mean swap (since not possible), but rather stacking a new task over the old one the way nested interrupts stack over each other. So, task 1 is running, an interrupt comes in that causes task 8 to be ready and it is higher pri, so he wants to go to PendSV, start task 8 by pushing its newly created frame just below the last stack used by task 1 and returning to it. When task 8 finishes, it then needs to allow task 1 to continue where it left off.
        The reason for using SVC besides supervisor code would be that there is a frame containing the scratch registers of task 1 that need to be popped back (so task 1 returns properly). So, he invokes SVC which tosses its frame and so returns to the original task 1 frame. A bit confusing in words, but very simple in pictures. Note that you cannot do this by direct return since the core knows there is no active ISR and so an LR of 0xFFFFFFFx would cause a fault. You could force it by setting an active bit, but that is rather crude. I allowed active bits to be set/cleared manually only for process models and cleanup after a panic.

        • Miro Samek says:

          @Paul Kimelman: This last comment is a very good explanation of what’s going on in the QK kernel. I’m really glad to see that you wrapped your mind around it so quickly.

          I’m intrigued by your mentioning setting/clearing active bits. Could you please elaborate how this works?

          • Paul Kimelman says:

            At the risk of regretting it I am talking about the SHCSR’s PENDSVACT bit (and the rest of them). As I say, I do not recommend using this. I made these RW to allow making system handlers part of a task if you need. So, if a task uses SVCall and is in the middle of a long OS function, the scheduler can save that state and restore it later. I made all of them settable to be orthogonal and to allow unwinding more complex cases.
            So, I still recommend you follow the model I suggest and I think you will be happy with the results.
            I also recommend you use BASEPRI_MAX and BASEPRI rather than CPS, so that you can have very high-pri ISRs that do not make system calls and whose latency is not impacted.
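The BASEPRI behavior Paul recommends can be modeled off-target like this (the function name is mine; on hardware this is the MSR BASEPRI instruction, and a lower priority number means higher priority):

```c
#include <stdint.h>
#include <assert.h>

/* Host-side model of BASEPRI masking (my naming; on hardware this is
 * the MSR BASEPRI instruction). BASEPRI = 0 masks nothing; otherwise
 * it masks every interrupt whose priority number is >= BASEPRI, while
 * numerically lower (i.e. higher) priorities keep firing. */
static int irq_masked(uint8_t basepri, uint8_t irq_prio)
{
    return basepri != 0u && irq_prio >= basepri;
}
```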

        • 42Bastian says:


          I don’t think I missed a thing. Whenever rescheduling is needed, PendSV has to be called. So after task 8 finishes, PendSV has to be called, which will return to task 1 unless there is a higher-prio task ready.
          But task 8 might have done something to activate task 4. Therefore PendSV has to be called.

          But I think I should just try it out, to prove myself either wrong or right.
          I just think: why should an RTC kernel need SVC where a “normal” kernel does not? IMHO an RTC kernel is just a special usage of a GP-(RT)OS.

          BTW: The fact that one cannot easily switch modes caused me some pain when I added support for user mode to my kernel (SVC handling).

          • Paul Kimelman says:

            You can use PendSV that way but there are two issues:
            1. PendSV is not “called”. It has to be invoked by pending its bit, which means you cannot control exactly when it activates (SVC does not come back until SVCall has happened).
            2. PendSV would need to know it is handling that case. That is, it is normally entered when a running task was interrupted. It would need to have the task tell it that it is done (via a static address variable) so PendSV knows to pop off its full frame.

            I agree that a flaw in Miro’s setup is that SVCall is not setting PendSV to see if another high pri task is ready to run before task 1 is allowed. That is, if task 6 were also made ready when task 8 is running but task 6 is lower pri than task 8, it would not run. When task 8 finishes, the SVCall handler should use PendSV to see if task 6 is higher pri than task 1. It is possible that he does that in exit code for the task (directly invoke task 6), but that would not work in a user mode scheme.

            As to your comment on other GP RTOSes: I intentionally tried to make use of SVC so easy and fast that you would switch to that model vs. the horrible in-place call code used on an ARM7/9/11/etc. That code is used on ARM7 and later because the SWI is so painful and slow. The SVC scheme is designed to be cheap enough for real system calls but certainly for yield (since it is pushing half the needed registers for you anyway).

            One can easily change from supervisor to user, but obviously not the other way. If you could change the other way, there would be no protection. I agree it is harder than I would like for the bootstrap case (1st start) but is easy after that. SVC is commonly used for yield (from a system function) and maybe the whole system call (e.g. PendMBox could just pass the args via SVC and handle the whole thing in ISR level). But, I agree that if the system code has been called from a user task, then it will need to pay an ugly price (up to 24 total cycles) to be switched to supervisor mode to access control blocks. There was no obvious way to get around this. I toyed with an MPU marker which indicated that the corresponding execute region should be allowed to self-switch to supervisor (writing the CONTROL reg), but that has issues with fetch ahead and other cases of “barriers”.

      • 42Bastian says:


        I wrote a small RTC kernel, and it seems to work as described, with one exception: upon “return”, i.e. when the task is finished, the PendSV handler has to clean the stack.
        A first shot for ek-lm3s6965 with IAR is at http://www.monlynx.de/download/interrupts.7z


        • Paul Kimelman says:

          Excellent and congratulations. My point was only that SVC can do that cleanup after the task completes, so PendSV stays more pure. But, I certainly agree that you can do it all with PendSV.
          Glad to see it is on an LM3 board 😉

    • 42Bastian says:

      In addition to my last post: With “prepare stack frame”, I did mean “push” everything on the stack to call the next task.
      There is _no_ “pop” needed at all!

  9. Paul Kimelman says:

    OK, I was not reading your trace as much as your comments. Now that I look at it in terms of code flow, I understand better: you are returning into the code just after the PendSV handler, so it looks like this is still in the handler, but it has done an exception return and so acts like task code.
    Anyway, you are doing the opposite of what I envisioned originally. I envisioned a return to the task and SVC to exit.
    But, I do not see why you would fault. Your PendSV handler disables all interrupts, determines whether a new task to stack (vs. return to current task), and if so, creates a fake frame so it can return to a stacked scheduler. This begs two obvious questions:
    1. Why not do the scheduling in the PendSV handler (or call it from that handler) so you can “return” into the scheduled in newly stacked task?
    2. I would expect that the scheduler left the context such that rerunning PendSV would find QK_readySet_ is 0, so you just return and do nothing. There is no reason to crash under that model. If there was a problem, consider that the CPSIE would also allow high pri interrupts. So, one of those could come in, set the PendSV bit, and you would have the same exact situation. That can happen now with your “fix”. So, you need to understand why you would crash on PendSV running before the SVC happens (and pops the fake frame).

    As I say, I would strongly suggest you look at why you even do it this way. I think this is a holdover from ARM7 type schemes where you needed to go to System mode to do this. In this case, PendSV can handle the scheduling in line and never leaving its context. Then, no fake frame to pop via SVCall. You just return into the newly stacked task. This is what I suggested above and how I envisioned the popup thread model working (and I wrote a simple example to prove it – I wrote 7 mini-kernels back in 2005 to test different kernel schemes to be sure this would all work well).

    • Miro Samek says:

      @Paul Kimelman: Thanks a lot for your numerous thoughtful comments. I am trying to understand your recommendations for my run-to-completion (RTC) kernel. Please bear with me as I’m trying to wrap my mind around this.

      On multiple occasions in your comments (as well as comments by 42Bastian) I see recommendations that PendSV should “return into a newly stacked task”. The problem is that my high priority task does not even exist yet. Tasks under this kernel are literally ordinary single-shot function calls made from the scheduler. I don’t know the stack layout of my tasks, because each one is potentially different (this is determined by the compiler, who knows better than me which registers are used in which task). Could you please elaborate what you mean by “return to the high-priority task” in the case of an RTC kernel?

      • 42Bastian says:


        Take a look at my example implementation. You do not have to know the stack layout, just what the exception frame looks like.
        The scheduler just prepares the exception stack frame for the new process. This frame contains two essential values: LR and the return address.
        LR is used to call a special “end of task” function
        The “return address” is the actual task-function.

        This simple design allows passing 4 parameters to a task (r0..r3).
        I started with simple tasks, but the current example even uses sprintf().
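The frame 42Bastian describes, with LR pointing at “end of task” code and the return address pointing at the task function, might look like this in C. Everything here is my own naming, and addresses are plain `uint32_t` values so the layout can be checked off-target; this is a sketch, not 42Bastian’s actual code.

```c
#include <stdint.h>
#include <assert.h>

/* Basic Cortex-M exception frame: 8 words stacked by hardware. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exc_frame_t;

/* Hypothetical helper (my naming): carve a fake frame below the current
 * stack top so an exception return starts the task, and the task's own
 * return lands in the "end of task" code at exit_addr. */
static exc_frame_t *make_fake_frame(uint32_t *stack_top,
                                    uint32_t task_entry,
                                    uint32_t arg0,
                                    uint32_t exit_addr)
{
    exc_frame_t *f = (exc_frame_t *)(stack_top - 8); /* stack grows down */
    f->r0 = arg0;                 /* up to four args go in r0..r3 */
    f->r1 = f->r2 = f->r3 = f->r12 = 0u;
    f->lr = exit_addr;            /* runs when the task function returns */
    f->pc = task_entry;           /* the task function itself */
    f->xpsr = 1UL << 24;          /* Thumb bit must be set or the return faults */
    return f;                     /* this becomes the new PSP */
}
```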

        One caveat about the example: it does not care for critical sections (here, the setting of the ready-mask). Since it is just setting/clearing bits, bit-banding should be used to make this atomic.
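For reference, the SRAM bit-band address math works out as below (the constants are architectural; the helper name is mine). Writing 1 or 0 to the alias word sets or clears that single bit in one store, with no read-modify-write, which is why no critical section is needed around the ready-mask update.

```c
#include <stdint.h>
#include <assert.h>

/* SRAM bit-band alias math (architectural constants; helper name mine).
 * Each bit of the 0x20000000 SRAM region has its own word in the
 * 0x22000000 alias region. */
static uint32_t bitband_sram_alias(uint32_t byte_addr, uint32_t bit)
{
    return 0x22000000UL
         + ((byte_addr - 0x20000000UL) * 32UL)  /* 32 alias words per byte */
         + (bit * 4UL);                         /* one word per bit */
}
```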

        BTW: Don’t be confused by the GPIO interrupts; I left them in in case I wanted to extend the example. The whole thing was derived from an IAR example.

        @Paul: The LM3 kits are really nice. With everything on a credit-card sized PCB, I have an eval board with me wherever I go :-)

        • Paul Kimelman says:

          Yes, I carry around eval boards too, but I try not to admit it in polite company 😉

          By the way, I agree that bit-band can be used to get around some critical sections. I wrote a small kernel back in 2005 on CM3 (on FPGA – no Si) that used bitband for sleep and ready “lists” and used CLZ to find the highest pri task in one instruction. I had ISRs write to a bitbanded “wake list” and used PendSV after (so no race possible). It did mean each task had a separate priority (no round robin) and if you wanted more than 32 tasks, it used a simple two level version of this (bit banding for the directory and then page) but was really designed for 32 or less.
          I used LDREX/STREX to handle non-blocking and non-locking queues to send data between ISRs and Tasks so no critical data issues there either. Purpose was to show that you could build a powerful kernel which had no critical sections at all.
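The CLZ lookup Paul mentions might be sketched like this (my names throughout; `__builtin_clz` stands in for the CM3 CLZ instruction, and the convention assumed here is that bit 31 of the ready set marks priority 0, the highest):

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of the one-instruction highest-priority lookup (my naming).
 * With bit 31 = priority 0 = highest, the count of leading zeros of
 * the ready set IS the number of the highest-priority ready task. */
static int highest_ready(uint32_t ready_set)
{
    if (ready_set == 0u)
        return -1;                    /* nothing ready: idle */
    return __builtin_clz(ready_set);  /* 0 = highest priority */
}
```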

      • Paul Kimelman says:

        Yes. The concept is that an exception frame has PC, LR, R0-R3, etc. You can create a frame on the PSP stack (allocate space and fill in the stack contents) just below (stack grows down) the current task’s last used area. In that frame, you fill in the PC of the new task’s function entry, an LR which goes to some special exit code (see below), and R0-R3 filled in as arguments if you want. Then you “return” to it which in effect creates it.
        As an example, the stack looks like (shown growing down):
        – oldest
        – what PendSV would return to.
        Now, PendSV will subtract 8 words (32 bytes) from the PSP to add a new frame below:
        – oldest
        – what PendSV would return to.
        – what PendSV will now return to

        This new frame will cause it to start executing task6’s function without knowing or caring what is above it on the stack (it will not corrupt it). The LR will point to some “exit code” which may be an SVC instruction or a PendSV set with a delay loop (what 42Bastian is doing). The SVC will push a (real) exception frame with PC pointing to the instruction after the SVC:
        – oldest
        – what PendSV would return to.

        So, that frame can be tossed by adding 8 words back to the PSP. This gets you back to task1’s exception frame:
        – oldest
        – what PendSV would return to.

        If you then return, it will re-enter task1 where it was pre-empted and it can run to completion (or get interrupted again).

      • Paul Kimelman says:

        Wow, the comment had my info stripped in the examples! :-(
        They were:
        (task1’s prologue created stack area for the functions; from pushing regs and the like) – oldest
        (exception frame for task1 from when interrupt pre-empted it) – what PendSV would return to

        Then after you subtract PSP and create a fake frame:

        (task1’s prologue created stack area for the functions; from pushing regs and the like) – oldest
        (exception frame for task1 from when interrupt pre-empted it) – what PendSV would return to
        (fake exception frame for task6 with PC to start function) – what PendSV will now return to

        Then, after task6 ends and returns through LR to a piece of code with SVC:

        (task1’s prologue created stack area for the functions; from pushing regs and the like) – oldest
        (exception frame for task1 from when interrupt pre-empted it)
        (exception frame for task6 from SVC exception)

        You then add back 8 words to the PSP to get:

        (task1’s prologue created stack area for the functions; from pushing regs and the like) – oldest
        (exception frame for task1 from when interrupt pre-empted it) – what PendSV will return to

        and then return to it. Task1 continues normally – never aware that task6 ran.
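The PSP arithmetic in the walkthrough above can be replayed on a host like this (names are mine; a basic frame is 8 words, i.e. 32 bytes: r0-r3, r12, lr, pc, xpsr):

```c
#include <stdint.h>
#include <assert.h>

/* Host-side replay of the PSP moves in the diagram above (my naming).
 * PendSV moves the PSP down to create task6's fake frame; the SVC exit
 * code moves it back up to discard it, exposing task1's frame again. */
enum { FRAME_WORDS = 8 };

static uint32_t *create_frame(uint32_t *psp)  /* in PendSV */
{
    return psp - FRAME_WORDS;  /* new frame sits below task1's frame */
}

static uint32_t *toss_frame(uint32_t *psp)    /* in the SVC exit code */
{
    return psp + FRAME_WORDS;  /* task1's frame is on top again */
}
```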

      • Miro Samek says:

        @Paul Kimelman: Thanks a lot again. I’m terribly sorry for being so slow, but I think that I already do everything you describe. In my PendSV handler, which I posted several comments back (at http://www.state-machine.com/attachments/qk_port.s), I construct exactly the stack frame you describe and I *return* to the scheduler. I don’t return to the specific task, because for reasons of portability I prefer to keep the scheduler in C, but the idea is exactly the same.

        After the scheduler returns I activate SVC, which discards its own fake stack frame and returns via exception return to the preempted task. This is all exactly as you describe.

        Please note that my PendSV/SVCall handlers in qk_port.s seem to work correctly. After the addition of clearing the PENDSVSET bit, which was the original reason for my blog post, the kernel code apparently survives a very heavy onslaught of interrupts.

        So, I guess what I am asking for is a review of my PendSV/SVCall handlers in qk_port.s (all in all only 21 assembly instructions) and pointing out what you would do differently. This would be much more productive for me than trying to describe the process in words.

        @42Bastian: Thanks a lot for the code, but I got lost in it :-( The PendSV handler in helper.s consists of some 44 instructions (twice the length of my code) and I don’t know the context. As I said, I apologize for being so slow.

      • Paul Kimelman says:

        OK, I looked at your code and it looks generally OK, but you have two bugs:
        1. You need to clear the 1st byte of QK_readySet_ when you decide to create a new frame. That is why the crash. If it is not cleared and PendSV is rerun, you will create a 2nd frame. If you clear it or otherwise indicate you have created this new task, then QK_schedule_ can do the rest. I do not know what else you do in QK_schedule_, so it is hard to say otherwise.

        2. The SVCall handler should be checking whether a next-highest-pri task is waiting. I do not know how QK_readySet_ works, but once the highest-pri task finishes, you would want to check whether there is a next ready task and whether it is higher pri than the stacked one. For example, if task1 is current but an interrupt (or set of interrupts) readied task6 and task5 such that both are higher pri, then when task6 finishes, you need to run task5, not return to task1. You may do that in QK_schedule_, which would be OK. If so, then just fixing the QK_readySet_ 1st byte will solve your crash.

        Note that your PendSV “fix” does hide bug 1. But, I think fixing bug 1 and removing the PendSV clear would be the preferred scheme, although the PendSV clear is harmless in that model.

        • Paul Kimelman says:

          I should also note that I prefer critical sections to be as small as possible to avoid latency on interrupts. Further, I was suggesting that PendSV could return directly into the new task vs. your schedule/QK_schedule_ scheme. I understand your point about C – I did everything I could to make Cortex-M (ARMv7-M) as C friendly as possible. But, I am not sure that the few extra instructions to extract the function start and clean up the ready list are that big a deal, and it would keep the critical section as small as possible.
          I also continue to suggest use of BASEPRI (and BASEPRI_MAX) vs. CPS.

  10. Konrad Anton says:

    Thanks Miro for sending me this link in the ARM forum (I’m konrada there). I’ve encountered a similar problem in Freescale MQX, where the priority of PendSV is computed from the current task’s priority… unless PendSV is already pending.

    My Freescale forum posting detailing that problem is at http://forums.freescale.com/t5/Freescale-MQX-trade-Software/Possible-scheduler-bug-in-MQX-3-7-for-Kinetis/td-p/86385

  11. Miro Samek says:

    @Paul Kimelman: Thank you for taking a look at my code. It’s not every day that one has an opportunity to pick the brains of the engineer who actually designed the Cortex-M core.

    I guess, I now better understand your suggestions for re-designing my current implementation of the RTC kernel for Cortex-M. I also recognize some of your suggestions in the code contributed by @42Bastian.

    It seems to me that to implement your recommendations I would need to re-partition the kernel completely for the Cortex-M core, so any portability to other processors would be essentially lost. Luckily, an RTC kernel is by nature so small that the scheduler could be written entirely in assembly, if need be. In the process of creating a CM-specific RTC kernel I can also make use of bit-banding and other goodies available only in CM.

    But before I move on, I’d like to make sure that this is really the best we can do for an RTC kernel on Cortex-M. From the aesthetic point of view I find the RTC kernel design (both mine and any of the proposed ones) rather unpleasing. I mean, the tail-chaining to PendSV is brilliant. But once inside the PendSV, the stack is already perfectly set up to either launch a higher-priority task or return to the preempted task. This is so because an RTC kernel works with the machine’s *natural* stack protocol. So, all one needs to do is to tell the NVIC to drop to the task level (this is what I mean by a generically understood EOI command) and after this, to exception-return to the preempted task. On most processors this takes only a few machine instructions and *no* stack manipulation.

    Unfortunately, on Cortex-M this “telling it to the NVIC” takes pushing and popping two exception stack frames (which also wastes 32 bytes of stack) as well as several assembly instructions. I find it aesthetically unpleasing, because all this stack manipulation accomplishes exactly nothing. It must accomplish nothing, because the stack *is* already set up correctly before all the pushing and popping of exception stack frames.

    I’d greatly appreciate any comments. My ultimate goal is to come up with the simplest possible RTC kernel implementation on Cortex-M. I just don’t want to sweat the little details (like bit-banding) while losing many more clock cycles and stack space on pushing and popping exception stack frames.

    Finally, before moving on to re-implementing the RTC kernel for Cortex-M, I’d really like to understand the failure mode illustrated in the trace discussed in my original blog. Please correct me if I misrepresent your diagnosis, but you essentially suggested that the Hard Fault at the end of the trace is due to normal preemption of the PendSV by a higher-priority interrupt, which has set the PENDSVSET bit. I tested this scenario several times (with different settings of the kernel’s ready-set and the current priority ceiling) in a debugger. My tests were done as follows. I set a breakpoint on the first instruction of PendSV. As soon as the breakpoint was hit, I removed it and placed it on the very next instruction. I also triggered a specifically instrumented interrupt (by manually writing to the PEND bit in the debugger). I then hit “go” in the debugger and watched the preemptions. The point is that while I could reproduce every instruction in the original trace, I could *not* reproduce the Hard Fault. The code handled the preemption (including setting the PENDSVSET bit) in the PendSV handler itself correctly every time.

    Maybe, as you say, the provided hardware trace is insufficient to provide an accurate diagnosis. However, it seems to me that the one thing I couldn’t test in a single-step debugger is the dynamic condition of the late-arrival scenario. So, I’m left to believe that something is different with late arrival. I speculate further that my use of SVCall is also implicated. Other kernels simply don’t do this. Using a single PendSV exception with global variables for directing the flow of control would most likely mask the problem. Again, I would appreciate any comments or suggestions what else can be done to get to the bottom of this.

  12. Paul Kimelman says:

    I will start with your last question. I said that the bug is that your code cannot tolerate the PendSV running twice. This is because you really do half of the PendSV work in the task and your PendSV code does not know or check whether the latter part has finished. This can be fixed in one of two ways: either you mark the state so you know that the PendSV has run (and created a new stack frame) or you do what you do and clear PendSV once interrupts are disabled. Your code crashed because it made two fake frames (from PendSV running twice) and when the 1st finished, the return to the “original” task went into the “scheduler” instead, which blew up.

    Now, I want to be clear that I do not agree with your comment about other processors. If you need to create a new task stacked over an old task, you have to have saved enough registers of the original task to ensure it can return safely. This means that you manually did what the Cortex-M core is doing for you – either the ISRs do it in their prologue or you do it in code or they go into some sort of shadow regs. If you want nested interrupts, all of those end up having to deal with that too. But, no matter what method is used, if you are creating a new task, enough regs have to be preserved for the current task – you cannot escape that. For many processors you pay for a pop and then a re-push to do this. For Cortex-M, you get the savings on average and certainly for this case (the old task’s regs are on the frame for you, which is why you create a new one).
    Creating a new frame for the new task is not really a waste – you always had to set the new PC and a return link of some sort. The fact that you do not pass any args means it is just stack math of 32 bytes but no pushing. The popping does happen and could be considered a waste, but this still averages out as a win overall; yes, you can cheat this (ACT bit), but do you really need to? It is like people who refuse to use an RTOS because they can save a few cycles here and there. The point is that the new frame is really about preserving the old frame for the current task – invert it and think of it as saving the context of the old task and it seems OK.
    So, you can skip the SVC trick if your tasks are all supervisor and you really want to. Just pop the context back from the saved frame instead of using the machinery of the return link. Again, you can cheat this (by setting the ACT bit and returning to do the pop), but does it really buy you enough?
    If you really need more of the methods of other processors, there is another way. Your PendSV does not change the frame at all. Instead it puts the original frame PC (of the current task) in a variable (e.g. current_task_PC) and replaces it with a “new_task_create” function, which is like your schedule function. That function immediately pushes R0-R3, R12, LR, current_task_PC, and xPSR – basically just what the hardware does on exception. Then it calls the new task. When the new task returns, the new_task_create function first checks whether another higher-priority task should run above the stacked one, and if not, it just pops those registers and so returns into the old task. That is how you did it on other processors. I am not sure you save much time, since the extra instructions are used anyway. But if this would make you happier, you can do it.
    My point about the scheme I described is that it is cheap and easy and supports user tasks if you want them. The overhead is there (an extra push set (SVC) and partly an extra pop (from the PendSV return into the new task) for a new task), but it is relatively small, and there is no extra cost if another higher-priority task is waiting. Also, there is no cost if a task runs to completion and a new task is started (since the SVC frame is the new task’s frame). That is, the only overhead is when a task preempts another.

  13. Miro Samek says:

    Thank you for the reply. But, now I’m not sure whether we are on the same page regarding the structure of a run-to-completion kernel. I can’t describe this kernel any better than I already did in the aforementioned article in Embedded Systems Design “Build a Super Simple Tasker” (http://www.eetimes.com/design/embedded/4025691/Build-a-Super-Simple-Tasker). But if you have no time to take a look, here is the slightly simplified code of the scheduler:

    void QK_schedule_(void) { /* entered with interrupts locked!!! */
    ....uint8_t pin = QK_currPrio_; /* the initial QK-nano priority */
    ....uint8_t p; /* highest-priority ready to run */
    ....QEvent const *e; /* pointer to the event to dispatch */

    ....while ((p = log2Lkup[QF_readySet_]) > pin) { /* above threshold? */
    ........QActive *a;
    ........QK_currPrio_ = p; /* update the current priority */
    ........QF_INT_UNLOCK(); /* it's safe to leave critical section */

    ........a = (QActive *)QF_active[p].act; /* map prio. to active obj. */
    ........e = QEQueue_get(&a->queue); /* obtain the event */
    ........QHsm_dispatch((QHsm *)a, e); /* dispatch to state machine */

    ........QF_INT_LOCK(); /* lock interrupts for next loop or exit */
    ....}
    ....QK_currPrio_ = pin; /* restore the initial priority */
    } /* scheduler exits with interrupts locked!!! */

    My main point is that launching a task is just a simple function call and *no* additional registers need to be saved above and beyond what the C compiler already does. Quite specifically, I really don’t need to save the registers clobbered by the APCS (r0-r3,r12,lr) to launch a task. The only time I care for these registers is when an exception preempts a task, but Cortex-M does this automatically for me. So, by the time I’m inside PendSV, I already sit on top of the exception stack frame that preserves the APCS-clobbered registers for the preempted task. Any additional saving of these APCS-clobbered registers would be saving them *twice*, which is harmless, but incurs cost both in CPU time and stack space. I hope you agree that it would be nice to avoid this unnecessary overhead.
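
    For reference, the hardware-stacked exception frame Miro refers to can be written out as a struct. The struct and field names below are illustrative; the layout itself is the architecturally defined 8-word Cortex-M frame (without FPU state):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Cortex-M hardware-stacked exception frame, lowest address first. */
    typedef struct {
        uint32_t r0, r1, r2, r3; /* APCS-clobbered argument registers */
        uint32_t r12;            /* intra-procedure scratch register */
        uint32_t lr;             /* task's link register at preemption */
        uint32_t pc;             /* return address into the preempted task */
        uint32_t xpsr;           /* program status; bit 24 is the Thumb bit */
    } ExceptionFrame;

    int main(void) {
        /* 8 words = 32 bytes, matching the "stack math of 32 bytes" above */
        assert(sizeof(ExceptionFrame) == 32u);
        return 0;
    }
    ```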

    So, do other processors handle this better than Cortex-M? I think so. Looking only at CPUs that can prioritize interrupts, Coldfire or M16C require just one assembly instruction to drop the IPL (interrupt priority level) to the task. Here is an example ISR for M16C:

    #pragma INTERRUPT ta0_isr (vect = 21) /* system clock tick ISR */
    void ta0_isr(void) {
    ....++QK_intNest_; /* inform QK about entering the ISR */
    ...._asm("FSET I"); /* unlock the interrupts */

    ....QF_tick(); /* ISR processing */
    ..../* perform other ISR work . . . */

    ...._asm("LDC #0,FLG"); /* lock interrupts and set IPL to 0 */
    ....--QK_intNest_; /* inform QK about exiting the ISR */
    ....if (QK_intNest_ == (uint8_t)0) { /* last nested interrupt? */
    ........QK_schedule_(); /* handle the preemptions */
    ....}
    }

    I’d like to achieve similar performance on Cortex-M. Is it possible?

    • 42Bastian says:


      Just a side note: locking interrupts on ColdFire is very costly. The lock/unlock pair needs up to 15 cycles (including preserving the old state).


  14. Paul Kimelman says:

    I am very well aware of the run-to-completion kernel model.
    You keep missing the point that, yes, PendSV is sitting on the current task’s frame (registers), but if you return from PendSV, it is ***gone***.
    All of this is about creating a new task ***stacked*** over an existing task (due to priority). Like I said, we can look at 3 scenarios and maybe it will be clearer:
    1. Start a task and when it finishes, start another. This could be chained by the end of the task (so it never goes to an exception), or it can be done by PendSV or SVCall. In either of those cases, it just returns into the new task.
    2. A task is running, but an interrupt comes in and readies a higher-priority task. This is the case we are talking about. You need to create a new frame so you can return via it (so the original frame is left in place to preserve the running task). I am not sure why you are not getting this. This is what I showed before with the representation of the stack state. If you just returned, you would go back to the task that was running. If you modified its PC on the frame, then you would lose the running task’s registers!
    3. A higher-priority task is running stacked over another (due to priority) and it finishes. The code needs to determine whether another high-priority task should run or whether it should return to the original context.
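
    Scenario 2, building a synthetic frame over the preempted task’s frame so the exception return launches the new task, might be sketched like this. The function and variable names are mine, and on real hardware the stack pointer would be the process stack pointer read inside PendSV; here a plain array simulates the stack so the sketch runs on a host:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Push a synthetic 8-word exception frame so that the exception
       return "launches" task(), while the preempted task's own
       hardware-stacked frame stays intact underneath it. */
    static void build_launch_frame(uint32_t **psp, void (*task)(void)) {
        uint32_t *sp = *psp;
        sp -= 8;                           /* r0-r3, r12, lr, pc, xPSR */
        sp[7] = 1U << 24;                  /* xPSR: Thumb bit must be set */
        sp[6] = (uint32_t)(uintptr_t)task; /* pc: exception-return target */
        sp[5] = 0xFFFFFFFFu;               /* lr: fault if task plainly returns */
        /* r0-r3 and r12 (sp[0]..sp[4]) need no particular values */
        *psp = sp;
    }

    static void demo_task(void) { /* the higher-priority task to launch */ }

    int main(void) {
        uint32_t stack[16];
        uint32_t *sp = &stack[16];         /* simulated process stack pointer */
        build_launch_frame(&sp, demo_task);
        assert(sp == &stack[8]);           /* frame consumed exactly 8 words */
        assert(sp[6] == (uint32_t)(uintptr_t)demo_task);
        assert(sp[7] == (1U << 24));
        return 0;
    }
    ```

    Because only the synthetic frame is consumed by the exception return, the original frame underneath stays untouched, which is exactly the point about preserving the running task.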

    My point is that you can use a variety of approaches to handle all 3 cases cleanly. The two obvious choices are:
    – You have a “launcher” running at task level.
    – You do via handlers as we have been discussing.

    A launcher looks like:

    void Launcher(void) {
    ....while (top_ready.function) { // a task is ready to run, so run it
    ........top_ready.function(); // run to completion; when done, see if more to run
    ....}
    ....return; // pop back to the SVC instruction (LR points to SVC when PendSV created the frame)
    }

    This handles back-to-back, nested back-to-back, and so on. When a nested stack finishes and no other high-priority task is ready, it returns to allow the “frame” to be popped by the task underneath. You *could* do this by manually popping the frame yourself from return-link code instead of using SVC.
    If there is nothing to run, then you can use SLEEP_ON_EXIT. If you do not want to do that, you can have the launcher never leave if it is the lowest priority:

    void Launcher(void) {
    ....while (top_ready.function // a task is ready to run, so run it
    ...........|| nothing_stacked) {
    ........if (top_ready.function)
    ............top_ready.function(); // run to completion
    ........else // no nested tasks, so run the idle task
    ............idle_task(); // sleeps or whatever you do
    ....}
    ....return; // pop back to the SVC instruction (LR points to SVC when PendSV created the frame)
    }
