Archive for the ‘Algorithms’ Category

Peak detection of a time series

Friday, September 18th, 2015 Nigel Jones

I’ve been doing embedded work for so long now that it’s rare that I come across a need that I haven’t run into before. Well, it happened the other day, so I thought I’d share it with you.

Here’s the situation. I have a transducer whose role is to determine a binary condition (essentially X is present / absent).  The transducer is AC coupled and pass band filtered such that what I’m looking for is the presence of a noisy AC signal. My first inclination was to take the pass-band filtered signal, rectify it and average it over time. While this worked reasonably well, the delta between X being present / absent was not as large as I’d have liked. Accordingly I investigated other metrics. Somewhat to my surprise I found that the peak signal values (i.e peak X present : peak X absent) gave a far better ratio. Thus I found myself needing to find the peak (i.e. maximum value) of a moving time series of data values.

To prove the concept, I used a brute force approach. That is, every time I received a new reading I’d store it in a buffer, the size of which was determined by the amount of time I wished to look back over. I’d then search the entire buffer looking for the maximum value. This approach of course worked fine – but it was hopelessly inefficient. Given that I am receiving a new value to process every few milliseconds, this inefficiency is unacceptable in the real product.  Accordingly, I needed a better algorithm.

My first port of call was of course the internet. If you do a search for peak detector algorithms then you’ll find a plethora of algorithms. However they are all way too sophisticated for what I need, as they are aimed at finding *all* the local maxima within an N-Dimensional array (apparently an important problem in physics, image processing etc).  I just want to know what’s the maximum value in a buffer of values – and I want to know it as efficiently as possible.

After pondering the problem for awhile, my thoughts first turned to an efficient median filtering algorithm. I discussed median filtering here. My rationale was quite simple. The efficient median filtering algorithm inherently keeps the time series data sorted and so I should be able to easily adapt it to return a maximum value rather than a median value. Well it turned out to be quite trivial. Indeed it’s so trivial that I wonder why I haven’t modified it before. Here’s the code (note, as explained in my original posting on median filtering, this algorithm and code is mostly courtesy of Phil Ekstrom):

#define STOPPER 0                                /* Smaller than any datum */
#define MEDIAN_FILTER_SIZE    (31)

void median_filter(uint16_t datum, uint16_t *calc_median, uint16_t *calc_peak)
{
    struct pair
    {
        struct pair   *point;        /* Pointers forming list linked in sorted order */
        uint16_t  value;             /* Values to sort */
    };
    static struct pair buffer[MEDIAN_FILTER_SIZE] = {0}; /* Buffer of nwidth pairs */
    static struct pair *datpoint = buffer;   /* Pointer into circular buffer of data */
    static struct pair small = {NULL, STOPPER};          /* Chain stopper */
    static struct pair big = {&small, 0}; /* Pointer to head (largest) of linked list.*/
    
    struct pair *successor;         /* Pointer to successor of replaced data item */
    struct pair *scan;              /* Pointer used to scan down the sorted list */
    struct pair *scanold;           /* Previous value of scan */
    struct pair *median;            /* Pointer to median */
    
    uint16_t i;
    
    if (datum == STOPPER)
    {
        datum = STOPPER + 1;                             /* No stoppers allowed. */
    }
    
    if ( (++datpoint - buffer) >= MEDIAN_FILTER_SIZE)
    {
        datpoint = buffer;                  /* Increment and wrap data in pointer.*/
    }
    
    datpoint->value = datum;                /* Copy in new datum */
    successor = datpoint->point;            /* Save pointer to old value's successor */
    median = &big;                          /* Median initially to first in chain */
    scanold = NULL;                         /* Scanold initially null. */
    scan = &big;              /* Points to pointer to first (largest) datum in chain */
    
    /* Handle chain-out of first item in chain as special case */
    if (scan->point == datpoint)
    {
        scan->point = successor;
    }
    scanold = scan;                                     /* Save this pointer and   */
    scan = scan->point ;                                /* step down chain */
    
    /* Loop through the chain, normal loop exit via break. */
    for (i = 0 ; i < MEDIAN_FILTER_SIZE; ++i)
    {
        /* Handle odd-numbered item in chain  */
        if (scan->point == datpoint)
        {
            scan->point = successor;                      /* Chain out the old datum.*/
        }
        
        if (scan->value < datum)         /* If datum is larger than scanned value,*/
        {
            datpoint->point = scanold->point;             /* Chain it in here.  */
            scanold->point = datpoint;                    /* Mark it chained in. */
            datum = STOPPER;
        };
        
        /* Step median pointer down chain after doing odd-numbered element */
        median = median->point;                       /* Step median pointer.  */
        if (scan == &small)
        {
            break;                                      /* Break at end of chain  */
        }
        scanold = scan;                               /* Save this pointer and   */
        scan = scan->point;                           /* step down chain */
        
        /* Handle even-numbered item in chain.  */
        if (scan->point == datpoint)
        {
            scan->point = successor;
        }
        
        if (scan->value < datum)
        {
            datpoint->point = scanold->point;
            scanold->point = datpoint;
            datum = STOPPER;
        }
        
        if (scan == &small)
        {
            break;
        }
        
        scanold = scan;
        scan = scan->point;
    }
    *calc_median = median->value;
    *calc_peak = big.point->value;
}

To use this code, simply set the MEDIAN_FILTER_SIZE to your desired value, and then call median_filter every time you have a new value to process. The advantage of this code is that it gives you two parameters – the median and the maximum.
However, when I bench marked it against the brute force approach I discovered that the brute force algorithm was actually faster – by a reasonable amount. After reflecting upon this for awhile I realized that the median filtering approach was of course keeping all of the data sorted – which was way more than I needed for a simple peak detector. Thus it was time for another approach.

After thinking about it for awhile, it was clear that the only datum I cared about in my buffer was the current maximum value. Thus upon receipt of a new value, I essentially had to decide:

  1. Is this new value >= current maximum value? If it is then it’s the new maximum value. Note that I really do mean >= and not >. The reason is explained below.
  2. Otherwise, is this new value about to overwrite the existing maximum value? If so, overwrite it and use a brute force search to find the new maximum value.

In order to make the algorithm as efficient as possible we need to minimize the number of times a brute force search is required. Now if your buffer contains only one incidence of the maximum value, then you’ll simply have to do the brute force search when you replace the maximum value. However, if your buffer contains multiple copies of the maximum value, then you’ll minimize the number of brute force searches by declaring the operative maximum value as the one furthest away from where we will write to in the buffer next. It’s for this reason that we use a >= comparison in step 1.

In a similar vein, if we have to do a brute force search then if there are multiple copies of the maximum value then once again we want to choose the maximum value furthest away from where we will write to in the buffer next.

Anyway, here’s the code

 
#define WINDOW_SIZE 31

uint16_t efficient_peak_detector(uint16_t value)
{
    static uint16_t wbuffer[WINDOW_SIZE] = {0U};
    static uint16_t wr_idx = 0U;
    static uint16_t peak_value = 0U;
    static uint16_t peak_value_idx = 0U;
    
    if (value >= peak_value)
    {
        peak_value = value;            /* New peak value, so record it */
        peak_value_idx = wr_idx;
        wbuffer[wr_idx] = value;
    }
    else
    {
        wbuffer[wr_idx] = value;        /* Not a new peak value, so just store it */
        if (wr_idx == peak_value_idx)    /* Have we over written the peak value ? */
        {
            /*  Yes, so we need to do a brute force search to find the new
                maximum. Note that for efficiency reasons, if there are multiple
                values of the new peak value, then we want to chose the one
                whose index value is as far away as possible from the current index */
            uint16_t idx;
            uint16_t cnt;
            
            for (cnt = 0U, idx = wr_idx, peak_value = 0U; cnt < WINDOW_SIZE; ++cnt)
            {
                if (wbuffer[idx] >= peak_value)
                {
                    peak_value = wbuffer[idx];    /* Record new peak */
                    peak_value_idx = idx;
                }
                if (++idx >= WINDOW_SIZE)
                {
                    idx = 0;
                }
            }
        }
    }
    if (++wr_idx >= WINDOW_SIZE)
    {
        wr_idx = 0;
    }
        
    return peak_value;
}

Obviously, if you make your buffer size a power of two then you can optimize the buffer wrapping code. There are a couple of other minor optimizations that can be made. For example on eight bit processors, making the various indices and counters  (wr_idx, peak_value_idx, idx, cnt) 8 bits will speed things up a lot. Similarly on a 32 bit processor you will likely gain a bit by using 32 bit values.

Notwithstanding the aforementioned optimizations, this code is on average considerably faster than the brute force approach and not much larger. Its main limitation is that its run time is highly variable depending upon whether we have to determine a new maximum value using the brute force approach or not.

Clearly it’s trivial to modify this code to perform a ‘trough detector’ (aka minimum detector), or indeed have it do both functions.

<I’ve noticed that comments are shown as closed. I’m not sure why this is, but am working with the system administrator to get it fixed. Apologies>

Boeing Dreamliner ‘Bug’

Friday, May 1st, 2015 Nigel Jones

There’s an all too familiar story in the press today. The headline at the Guardian reads “US aviation authority: Boeing 787 bug could cause ‘loss of control’. As usual with these kinds of stories, it’s light on technical details other than to reveal that the Dreamliner’s generators will fall into a fail safe mode if kept continuously powered for 248 days. In this fail-safe mode, the generator doesn’t apparently generate power. Thus if all four of the planes generators were powered on at the same time and kept powered for 248 days, then suddenly – no power. That’s what I’d call an unfortunate state of affairs if you were at 40,000 feet over the Atlantic.

So what’s special about 248 days? Well 248 days = 248 * 24 * 3600 = 21427200 seconds. Hmm that number looks familiar. Sure Enough, 2^31 / 100 ~= 21427200. From this I can deduce the following.

Boeing’s generators likely contain a signed 32 bit timer tick counter that is being incremented every 10ms (or possibly an unsigned 32 bit counter being incremented every 5ms – but that would be unusual). On the 248th day after start up, this counter overflows. What happens next is unclear, but Boeing must have some sort of error detection in place to detect that something bad has happened – and thus takes what on the face of it is a reasonable action and shuts down the generator.

However, what has really happened here is that Boeing has fallen into the trap of assuming that redundancy (i.e. four generators) leads to increased reliability. In this line of thinking, if the probability of a system failing is q, then the probability of all four systems failing is q*q*q*q – a number that is multiple orders of magnitude smaller than the probability of just one system failing. For example if q is 0.001, then q^4 is 1,000,000,000 times smaller. At this point, the probability of a complete system failure is essentially zero. However, this is only the case if the systems are statistically independent. If they are not, then you can have as many redundant systems as you like, and the probability of failure will be stuck at q. Or to put it another way, you may as well eschew having any redundancy at all.

Now in the days of purely electro-mechanical systems, you could be excused for arguing that there’s a reasonable amount of statistical independence between redundant systems. However, once computer control comes into the mix, the degree of statistical dependence skyrockets (if you’d pardon the pun). By having four generators presumably running the same software, then Boeing made itself  completely vulnerable to this kind of failure.

Anyway, I gave a talk on this very topic on life support systems at a conference a few years ago – so if you have half an hour to waste have a watch. If you’d rather read something that’s a bit more coherent than the presentation, then the proceedings are available here. My paper starts at page 193.

The bottom line is this. If you are trying to design highly reliable systems and redundancy is an important part of your safety case, then you’d damn well better give a lot of thought to common mode failures of the software system. In this case, Boeing clearly had not, resulting in a supposedly fail-proof system that actually has a trivially simple catastrophic failure mode.

 

 

Idling along, (or what to do in the idle task)

Sunday, April 14th, 2013 Nigel Jones

If you are using an RTOS in your latest design then no doubt you have an idle task. (Most of the time, the idle task is explicit and is the user task with the lowest priority; sometimes it’s built into the RTOS). It’s been my experience that the idle task is an interesting beast. On the one hand it is what gets executed when there’s nothing else to do, and thus inherently contains nothing directly related the product. On the other hand it’s this wonderful resource that you can exploit to do all sorts of interesting things to improve your product without having to worry too much about it impacting the running of your product.

With that being said, here are some suggestions for what your idle task can do, starting with one thing it shouldn’t do.

Do Nothing

If your idle task consists of a do-nothing loop, then you are almost certainly missing an opportunity. Hopefully the suggestions below will serve to spark your creative juices.

Watchdog

In any RTOS based design, the idle task should play a role in the overall watchdog supervisor. Without going into a treatise on watchdog design, suffice it to say if the idle task isn’t being run frequently then you’ve got a problem. Thus if the idle task doesn’t feature in your watchdog supervisor then you are doing something wrong. Note that if the idle task is featured in your watchdog, then it doesn’t necessarily mean you are doing it right either!

Power Save

For power constrained systems, the idle task is usually the place to put the microprocessor to sleep and / or indulge in other power saving strategies. I often find that my idle tasks consist of some of the features described here, plus power save. In other words, the idle task takes care of some housekeeping and then takes a nap.

Load calculation

Used in conjunction with hardware timers and the task switch hook function, it’s normally possible to construct a system that gives a decent indication of both overall CPU load and also the CPU utilization of each of the tasks. The idle task isn’t a bad place to do all the calculations. I find this a very useful diagnostic aid as I’m developing a system. Once you are done using it as a development aid, with a bit more work it can be modified to be part of your overall watchdog strategy, in that it can provide useful information about how tasks are (mis)behaving.

Flash Check

Just about every embedded system I’ve looked at in the last decade or two performs a CRC check on program memory on start up. However, if you are designing a system that is safety critical and / or expected to run a long time between power cycles, then you should seriously consider running a Flash CRC check in the idle task. Because no writing of memory is involved and one is instead reading memory that is supposed to be constant, there are are no real race-conditions to worry about and thus there’s no need to be entering critical sections to perform the reads. Of course if you are using an MMU or MPU then things might get a little more challenging. Naturally such a challenge is nothing for a reader of your ability! [As an aside, one of my electrical engineering professors used to say to me, “Nigel, this is nothing for a man of your ability!” One is of course simultaneously flattered, irritated and motivated. I’ve never forgotten it.]

RAM Check

This is of the course the evil twin to the Flash check. However this time you need to perform both reads and writes from locations that are being used by higher priority tasks and interrupts. You can of course only do this safely by executing a suitable lock / unlock procedure on each RAM location. Now doing this in the idle task could seriously change your system’s response time, so you need to think very seriously about how to structure such a test. A good starting point is to do just one locked read / write per idle task invocation. Of course if that results in a 10 year RAM test on your system, then you’ll need to rethink the strategy.

If you have other good ideas for idle task work then please leave them in the comments.

Efficient C Tip #13 – use the modulus (%) operator with caution

Tuesday, February 8th, 2011 Nigel Jones

This is the thirteenth in a series of tips on writing efficient C for embedded systems.  As the title suggests, if you are interested in writing efficient C, you need to be cautious about using the modulus operator.  Why is this? Well a little thought shows that C = A % B is equivalent to C = A – B * (A / B). In other words the modulus operator is functionally equivalent to three operations. As a result it’s hardly surprising that code that uses the modulus operator can take a long time to execute. Now in some cases you absolutely have to use the modulus operator. However in many cases it’s possible to restructure the code such that the modulus operator is not needed. To demonstrate what I mean, some background information is in order as to how this blog posting came about.

Converting seconds to days, hours, minutes and seconds

In Embedded Systems Design there is an increasing need for some form of real time clock. When this is done, the designer typically implements the time as a 32 bit variable containing the number of seconds since a particular date. When this is done, it’s not usually long before one has to convert the ‘time’ into days, hours, minutes and seconds. Well I found myself in just such a situation recently. As a result, I thought a quick internet search was in order to find the ‘best’ way of converting ‘time’ to days, hours, minutes and seconds. The code I found wasn’t great and as usual was highly PC centric. I thus sat down to write my own code.

Attempt #1 – Using the modulus operator

My first attempt used the ‘obvious’ algorithm and employed the modulus operator. The relevant code fragment appears below.

void compute_time(uint32_t time)
{
 uint32_t    days, hours, minutes, seconds;

 seconds = time % 60UL;
 time /= 60UL;
 minutes = time % 60UL;
 time /= 60UL;
 hours = time % 24UL;
 time /= 24UL;
 days = time;  
}

This approach has a nice looking symmetry to it.  However, it contained three divisions and three modulus operations. I thus was rather concerned about its performance and so I measured its speed for three different architectures – AVR (8 bit), MSP430 (16 bit), and ARM Cortex (32 bit). In all three cases I used an IAR compiler with full speed optimization. The number of cycles quoted are for 10 invocations of the test code and include the test harness overhead:

AVR:  29,825 cycles

MSP430: 27,019 cycles

ARM Cortex: 390 cycles

No that isn’t a misprint. The ARM was nearly two orders of magnitude more cycle efficient than the MSP430 and AVR. Thus my claim that the modulus operator can be very inefficient is true for some architectures – but not all.  Thus if you are using the modulus operator on an ARM processor then it’s probably not worth worrying about. However if you are working on smaller processors then clearly something needs to be done  – and so I investigated some alternatives.

Attempt #2 – Replace the modulus operator

As mentioned in the introduction,  C = A % B is equivalent to C = A – B * (A / B). If we compare this to the code in attempt 1, then it should be apparent that the intermediate value (A/B) computed as part of the modulus operation is in fact needed in the next line of code. Thus this suggests a simple optimization to the algorithm.

void compute_time(uint32_t time)
{
 uint32_t    days, hours, minutes, seconds;

 days = time / (24UL * 3600UL);    
 time -= days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = time / 3600UL;
 time -= (hours * 3600UL);
 /* time now contains the number of seconds in the last hour */
 minutes = time / 60U;
 seconds = time - minutes * 60U;
 }

In this case I have replaced three mods with three subtractions and three multiplications. Thus although I have replaced a single operator (%) with two operations (- *) I still expect an increase in speed because the modulus operator is actually three operators in one (- * /).  Thus effectively I have eliminated three divisions and so I expected a significant improvement in speed. The results however were a little surprising:

AVR:  18,720 cycles

MSP430: 14,805 cycles

ARM Cortex: 384 cycles

Thus while this technique yielded a roughly order of two improvements for the AVR and MSP430 processors, it had essentially no impact on the ARM code.  Presumably this is because the ARM has native support for the modulus operation. Notwithstanding the ARM results, it’s clear that at least in this example, it’s possible to significantly speed up an algorithm by eliminating the modulus operator.

I could of course just stop at this point. However examination of attempt 2 shows that further optimizations are possible by observing that if seconds is a 32 bit variable, then days can be at most a 16 bit variable. Furthermore, hours, minutes and seconds are inherently limited to an 8 bit range. I thus recoded attempt 2 to use smaller data types.

Attempt #3 – Data type size reduction

My naive implementation of the code looked like this:

void compute_time(uint32_t time)
{
 uint16_t    days;
 uint8_t     hours, minutes, seconds;
 uint16_t    stime;

 days = (uint16_t)(time / (24UL * 3600UL));    
 time -= (uint32_t)days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = (uint8_t)(time / 3600UL);
 stime = time - ((uint32_t)hours * 3600UL);
 /*stime now contains the number of seconds in the last hour */
 minutes = stime / 60U;
 seconds = stime - minutes * 60U;
}

All I have done is change the data types and to add casts where appropriate. The results were interesting:

AVR:  14,400 cycles

MSP430: 11,457 cycles

ARM Cortex: 434 cycles

Thus while this resulted in a significant improvement for the AVR & MSP430, it resulted in a significant worsening for the ARM. Clearly the ARM doesn’t like working with non 32 bit variables. Thus this suggested an improvement that would make the code a lot more portable – and that is to use the C99 fast types. Doing this gives the following code:

Attempt #4 – Using the C99 fast data types

void display_time(uint32_t time)
{
 uint_fast16_t    days;
 uint_fast8_t    hours, minutes, seconds;
 uint_fast16_t    stime;

 days = (uint_fast16_t)(time / (24UL * 3600UL));    
 time -= (uint32_t)days * 24UL * 3600UL;
 /* time now contains the number of seconds in the last day */
 hours = (uint_fast8_t)(time / 3600UL);
 stime = time - ((uint32_t)hours * 3600UL);
 /*stime now contains the number of seconds in the last hour */
 minutes = stime / 60U;
 seconds = stime - minutes * 60U;
}

All I have done is change the data types to the C99 fast types. The results were encouraging:

AVR:  14,400 cycles

MSP430: 11,595 cycles

ARM Cortex: 384 cycles

Although the MSP430 time increased very slightly, the AVR and ARM stayed at their fastest speeds. Thus attempt #4 is both fast and portable.

Conclusion

Not only did replacing the modulus operator with alternative operations result in faster code, it also opened up the possibility for further optimizations. As a result with the AVR & MSP430 I was able to more than halve the execution time.

Converting Integers for Display

A similar problem (with a similar solution) occurs when one wants to display integers on a display. For example if you are using a custom LCD panel with say a 3 digit numeric field, then the problem arises as to how to determine the value of each digit. The obvious way, using the modulus operator is as follows:

void display_value(uint16_t value)
{
 uint8_t    msd, nsd, lsd;

 if (value > 999)
 {
 value = 999;
 }

 lsd = value % 10;
 value /= 10;
 nsd = value % 10;
 value /= 10;
 msd = value;

 /* Now display the digits */
}

However, using the technique espoused above, we can rewrite this much more efficiently as:

void display_value(uint16_t value)
{
 uint8_t    msd, nsd, lsd;

 if (value > 999U)
 {
  value = 999U;
 }

 msd = value / 100U;
 value -= msd * 100U;

 nsd = value / 10U;
 value -= nsd * 10U;

 lsd = value;

 /* Now display the digits */
}

If you benchmark this you should find it considerably faster than the modulus based approach.

Previous Tip

Median Filter Performance Results

Tuesday, November 9th, 2010 Nigel Jones

In my earlier post on median filtering I made the claim that for filter sizes of 3, 5 or 7 that using a simple insertion sort is ‘better’ than using Phil Ekstrom’s technique.  It occurred to me that this claim was based upon my testing with 8 bit processors quite a few years ago, and that the results might not be valid for 32 bit processors with their superior pointer manipulation.  Accordingly I ran some bench marks comparing an insertion sort based approach with Ekstrom’s method.

The procedure was as follows:

  1. I generated an array of random integers on the interval 900 – 1000. The idea is that these would represent data from a typical 10 bit ADC found on many microcontrollers.
  2. I then put together a base line project which performed all the basic house keeping functions, but without performing any filtering. The idea was to try and get a feel for the non-algorithm specific overhead.
  3. I then put together a project which median filtered using an insertion sort, for sizes, 3, 5, 7, 9, 11, and 13. Note that I elected to take a copy of the data prior to sorting. See this comment thread for a discussion of whether this is necessary or not.
  4. I put together another project which median filtered using Ekstrom’s method.
  5. I compiled the above for an ARM Cortex M3 target using an IAR compiler with full speed optimization.

The results were a clear win for Ekstrom. His code size was 132 bytes versus 224. His code was 5%, 32%, 61%, 89%,113% and 146% faster than the insertion sort for filters sizes of 3, 5, 7, 9, 11 and 13 respectively. To be fair to the insertion sort technique, I have made no effort to optimize it. Notwithstanding this, I think I can say that for 32 bit targets, you may as well just use Ekstrom’s approach for all filter sizes.

I’ll endeavor to update this post with results for a 16 bit target (MSP430) in the next few days.

Well I finally got around to running the tests on an MSP430 target. In this case Ekstrom’s method produced a larger code size (186 bytes versus 160). Much to my surprise, Ekstrom’s method was dramatically superior to the insertion sort approach, with speeds of 69% faster for a filter size of 3, going up to a whopping 250% faster with a filter size of 13.  The bottom line: I think my original claim is bunk. Use Ekstrom’s method by default!