Although countless PhD’s have been awarded on sorting algorithms, it’s not a topic that seems to come up much in embedded systems (or at least the kind of embedded systems that I work on). Thus it was with some surprise recently that I found myself needing to sort an array of integers. The array wasn’t very large (about twenty entries) and I was eager to move on to the real problem at hand and so I just dropped in a call to the standard C routine qsort(). I didn’t give it a great deal of thought because I ‘knew’ that a ‘Quick Sort’ algorithm is in general fast and well behaved and that with sorting so few entries I wasn’t too concerned about it being ‘optimal’. Anyway, with the main task at hand solved, on a whim I decided to take another look at qsort(), just to make sure that I wasn’t being too cavalier in my approach. Boy did I get a shock! My call to qsort() was increasing my code size by 1500 bytes and it wasn’t giving very good sort times either. For those of you programming big systems, this may seem acceptable. In my case, the target processor had 16K of memory and so 1500 bytes was a huge hit.

Surely there had to be a better solution? Well there’s always a better solution, but in my case in particular, and for embedded systems in general, what is the optimal sorting algorithm?

Well, after thinking about it for a while, I think the optimal sorting algorithm for embedded systems has these characteristics:

- It must sort in place.
- The algorithm must not be recursive.
- Its best, average and worst case running times should be of similar magnitude.
- Its code size should be commensurate with the problem.
- Its running time should increase linearly or logarithmically with the number of elements to be sorted.
- Its implementation must be ‘clean’ – i.e. free of breaks and returns in the middle of a loop.

##### Sort In Place

This is an important criterion not just because it saves memory, but most importantly because it obviates the need for dynamic memory allocation. In general dynamic memory allocation should be avoided in embedded systems because of problems with heap fragmentation and allocation performance. If you aren’t aware of this issue, then read this series of articles by Dan Saks on the issue.

##### Recursion

Recursion is beautiful and solves certain problems amazingly elegantly. However, it’s not fast and it can easily lead to problems of stack overflow. As a result, it should never be used in embedded systems.

##### Running Time Variability

Even the softest of real time systems have some time constraints that need to be met. As a result a function whose execution time varies enormously with the input data can often be problematic. Thus I prefer code whose execution time is nicely bounded.

##### Code Size

This is often a concern. Suffice to say that the code size should be reasonable for the target system.

##### Data Size Dependence

Sorting algorithms are usually classified using ‘Big O notation’ to denote how sensitive they are to the amount of data to be sorted. If N is the number of elements to be sorted, then an algorithm whose running time is N Log N is usually preferred to one whose running time is N^{2}. However, as you shall see, for small N the advantage of the more sophisticated algorithms can be lost by the the overhead of the sophistication.

##### Clean Implementation

I’m a great proponent of ‘clean’ code. Thus code where one exits from the middle of a loop isn’t as acceptable as code where everything proceeds in an orderly fashion. Although this is a personal preference of mine, it is also codified in for example the MISRA C requirements, to which many embedded systems are built. Anyway to determine the optimal sorting algorithm, I went to the Wikipedia page on sorting algorithms and initially selected the following for comparison to the built in qsort: Comb, Gnome, Selection, Insertion, Shell & Heap sorts. All of these are sort in place algorithms. I originally eschewed the Bubble & Cocktail sorts as they really have nothing to commend them. However, several people posted comments asking that I include them – so I did. As predicted they have nothing to commend them. In all cases, I used the Wikipedia code pretty much as is, optimized for maximum speed. (I recognize that the implementations in Wikipedia may not be optimal – but they are the best I have). For each algorithm, I sorted arrays of 8, 32 & 128 signed integers. In every case I sorted the same random array, together with a sorted array and an inverse sorted array. First the code sizes in bytes:

qsort() 1538 Gnome() 76 Selection() 130 Insertion() 104 Shell() 242 Comb() 190 Heap() 200 Bubble() 104 Cocktail() 140

Clearly, anything is a lot better than the built in qsort(). However, we are not comparing apples and oranges, because qsort() is a general purpose routine, whereas the others are designed explicitly to sort integers. Leaving aside qsort(), the Gnome sort Insertion sort and Bubble sorts are clearly the code size leaders. Having said that, in most embedded systems, a 100 bytes here or there is irrelevant and so we are free to choose based upon other criteria.

#### Execution times for the 8 element array

Name Random Sorted Inverse Sorted qsort() 3004 832 2765 Gnome() 1191 220 2047 Selection() 1120 1120 1120 Insertion() 544 287 756 Shell() 1233 1029 1425 Comb() 2460 1975 2480 Heap() 1265 1324 1153 Bubble() 875 208 1032 Cocktail() 1682 927 2056

In this case, the Insertion sort is the clear winner. Not only is it dramatically faster in almost all cases, it also has reasonable variability and it has almost the smallest code size. Notice that the bubble sort for all its vaunted simplicity consumes as much code and runs considerably slower. Notice that the Selection sort’s running time is completely consistent – and not too bad when compared to other methods.

#### Execution times for the 32 element array

Name Random Sorted Inverse Sorted qsort() 23004 3088 19853 Gnome() 17389 892 35395 Selection() 14392 14392 14392 Insertion() 5588 1179 10324 Shell() 6589 4675 6115 Comb() 10217 8638 10047 Heap() 8449 8607 7413 Bubble() 13664 784 16368 Cocktail() 17657 3807 27634

In this case, the winner isn’t so clear cut. Although the insertion sort still performed well, it’s showing a very large variation in running time now. By contrast the shell sort has got decent times with small variability. The Gnome, Bubble and Cocktail sorts are showing huge variability in execution times (with a very bad worst case), while the Selection sort shows consistent execution time. On balance, I’d go with the shell sort in most cases.

#### Execution times for the 128 element array

Name Random Sorted Inverse Sorted qsort() 120772 28411 77896 Gnome() 316550 3580 577747 Selection() 217420 217420 217420 Insertion() 88475 4731 158020 Shell() 41661 25611 34707 Comb() 50858 43523 48568 Heap() 46959 49215 43314 Bubble() 231294 3088 262032 Cocktail() 271821 15327 422266

In this case the winner is either the shell sort or the heap sort depending on whether you want raw performance more or less when compared to performance variability. The Gnome, Bubble and Cocktail sorts are hopelessly outclassed. So what to make of all this? Well in any comparison like this there are a myriad of variables that one should take into account, and so I don’t believe these data should be treated as gospel. What is clear to me is that:

- Being a general purpose routine, qsort() is unlikely to be the optimal solution for an embedded system.
- For many embedded applications, a shell sort has a lot to commend it – decent code size, fast running time, well behaved and a clean implementation. Thus if you don’t want to bother with this sort of investigation every time you need to sort an array, then a shell sort should be your starting point. It will be for me henceforth.

A quick update: Check out this link for a very cool visualization of different sorting algorithms.

Tags: Sorting

Of course, the system library typically implements a highly optimized qsort(), which leads to lots of code overhead (it may fall back to insertion sort for the last n entries, for instance). I can see your point about preferring non-recursive algorithms (although you could code qsort with an explicit stack and it’ll work fine), but the code size comparison is unfair.In general, though, your sample sizes are *very* small – sorting 32 or even 128 integers should be simple with any algorithm.Also, in cases where it is appropriate, bucket sort is very simple and very fast.

For this small number of items, bubble sort could be fast enough and very small code sizeint array[SIZE];int i,j;for (i=0;i<(SIZE-1);i++){ for(j=i;j<(SIZE-1);j++) { int tmp = intArray[j]; if( intArray[j+1] > intArray[j]) { intArray[j] = intArray[j+1]; intArray[j+1] = tmp; } }}

I’m not sure how a highly optimized qsort leads to lots of code overhead (in general I’d expect highly optimized to lead to the exact opposite). If instead you mean highly

generalizedthen I agree. I also agree my sample sizes are small – in fact that’s my point. In embedded systems (at least the ones I work in), there is absolutely no call for sorting million element entities. Instead it’s usually a small number of entities that need to be sorted – hence my sample sizes. I guess my point is that when you have such a small number of things to be sorted the classic metrics become less relevant, and the overhead of the algorithm becomes more relevant. You can see this in the numbers I quote. They also show that your assertion that pretty much any algorithm will do is simply not the case.I’m not familiar with a bucket sort. As and when I get the time I’ll track it down and see how it compares to the others.Highly optimized can be optimized for speed or memory use or code space.Like the old adage, “Fast, Good or Cheap, pick two”, with optimization, you have to make decisions which of the three you want to be really good

>its running time should increase linearly …good luck with that. O(NlogN) is about the best you can do, unless you can constrain the input so that pigeonhole sort will work.http://en.wikipedia.org/wiki/Sorting_algorithm

Perhaps I misunderstand.Aren’t you doing a array of known length, where you’re having greater issues with code size then runtime?Just use the sort that has the smallest footprint, then calculate what array data would be the worst case runtime, and run it a thousand times on that data while measuring your timing.If that’s satisfactory, problem solved.

What you suggest would of course work admirably for a specific case. However I’m trying to generalize the solution for embedded systems, since different things come in to play when compared to sorting data in your typical database application. In many cases, the code footprint is important (as it was when I found that qsort was very large compared to my available memory). In other cases the running time is the most important. Depending upon the application, it could be the worst case running time or it might be the variability between best and worst case that is of interest. Finally the issues of dynamic memory allocation and recursion that I mentioned are also important in many (most?) embedded systems.

“I’m not sure how a highly optimized qsort leads to lots of code overhead”I think he is talking about the qsort implementing special cases. Robust implementations of algorithms can have a lot more code than a basic implementation.

Very fascinating!RTwww.anon-tools.cz.tc

shell sort has long been my “favorite”, since it’s easy enough to mostly remember (it’s just a modified bubble sort) and is very predictable. Michael Langford: yes, brute force bubble sort works well for a VERY small number of items, but.. shell only has maybe 10-20% more code, and it works well for a MUCH larger range of items.We had a project where someone had coded a bubble sort for a small number of items, because it was quick. Then, we ended up putting in a few hundred or thousand items later. And it took MINUTES to run on a fast machine. Moral of the story is: don’t use an algorithm which degrades horribly unless you have no choice..

Bubble sort was the right answer 30 years ago, with insertion sort a close second. It’s still the right answer. Glad you saw the light. ðŸ™‚

void shellsort(long *a, long n){ /* Sedgewick-Incerpi sequence */ long step[16] = {1, 3, 7, 21, 48, 112, 336, 861, 1968, 4592, 13776, 33936, 86961, 198768, 463792, 1391376}; long i, j, k, s, t; k = (n < 256) ? 6 : 15; do { for (s=step[k], i=s-1; i < n; i++) { t = a[i]; for (j=i; j >= s && t > a[j-s]; a[j] = a[j-s], j -= s); a[j] = t; } } while (k–);}

First off, any recursive algorithm can be made iterative by keeping an explicit state stack instead of relying on the call stack to preserve state (and if you know the stack size at compile time, as you often do, you can allocate that data structure on the stack; at worst, there's calloc).Second, I would have liked to see bubblesort/cocktail sort in your analysis. They perform very well for small n.Third, integers of a known width (e.g. 32-bit) can be sorted in linear time. See radix sort.

Looking over some of the comments it's interesting to see how many people claim to know what the best algorithm is – with a strong emphasis on bubble sorting. I went ahead and added the bubble and cocktail sorts – and as expected they had nothing to commend them. Gregory's additional comment that a radix sort can be performed in linear time prompted me to take a look at it. Maybe it's just the Wikipedia implementation but it appears to require dynamic memory allocation, random number generation and integer divide by 10 and integer modulo 10 arithmetic. In short it looks like a complete non starter on an embedded system.

It seems to me that you intended this research for in-memory arrays, correct? What happens for sorting data that exists in Flash Memory where a concern is the number of writes (possibly reads also)? Any data on each of these algorithms and how they impact that aspect?

True. However you will note that the entire point of the article is that for small data sets the overhead of some algorithms is significant. Thus if you have a small amount of data in flash, I’d copy it to RAM, sort it, and then write it back on an as needed basis. To your larger question as to whether there are algorithms that minimize the number of writes (stores), I’m sure there are. However I don’t know of any research on the topic.

qsort() is pretty horrible for sorting integers because it uses function pointers to do the comparisons in a generic manner. What could be a single instruction to compare 2 registers now becomes an indirect function call. There have been several articles showing how C++ templated std::sort beats qsort for exactly this reason.

I’ve found a way how to speed up shellsort by 30% if a bigger memory footprint (ROM) is ok. Checked on Cortex-M4. The key is “computed goto”.

The idea is to remove the outer loop (where the gaps get fetched) and copy/paste the body of the outer loop (i.e. the inner loops) while hardcoding the gaps in the duplicated code fragments. Then I put a label before each of them. To catch N<=1, a do-nothing-label at the end of the function is useful – inserting a "return" at the end can be necessary because there has to be something after a label.

A jump table with the label addresses gets filled at compile time. This must be function local because the label scope is local. The table should be declared with the static const qualifiers so that the linker can put it into ROM.

Then there is computed goto for getting to the right point upon entering the sorting. The list length (maybe clipped using the ternary operator) serves as jump table index.

Of course, the compiler needs to support computed goto (not standard C), which is the case for GCC and CLANG. Checking for the "__GNUC__" define and using an elseif for compilers without the GNU extensions keeps the code portable.

Admittedly, this solution is somewhat hacky, but if the priority is maximum performance, then it is a good way to go. This is unlikely to pass any certification code review, but neither GCC nor CLANG are used in these domains anyway.