Archive for the ‘Efficient C/C++’ Category

Efficient C Tips #10 – Use unsigned integers

Friday, July 31st, 2009 Nigel Jones

This is the tenth in a series of tips on writing efficient C for embedded systems. Today I consider the topic of whether one should use signed integers or unsigned integers in order to produce faster code. Well the short answer is that unsigned integers nearly always produce faster code. Why is this you ask? Well there are several reasons:

Lack of signed integer support at the op code level

Many low end microprocessors lack instruction set support (i.e. op codes) for signed integers. The 8051 is a major example. I believe low end PICs are also another example. The Rabbit processor is sort of an example in that my recollection is that it lacks support for signed 8 bit types, but does have support for signed 16 bit types! Furthermore some processors will have instructions for performing signed comparisons, but only directly support unsigned multiplication.

Anyway, so what’s the implication of this? Well lacking direct instruction set support, use of a signed integer forces the compiler to use a library function or macro to perform the requisite operation. Clearly this is not very efficient. But what if you are programming a processor that does have instruction set support for signed integers? Well for most basic operations such as comparison and addition you should find no difference. However this is not the case for division…

Shift right is not the same as divide by two for signed integers

I doubt there is a compiler in existence that doesn’t recognize that division by 2N is equivalent to a right shift N places for unsigned integers. However this is simply not the case for signed integers, since the issue of what to do with the sign bit always arises. Thus when faced with performing a division by 2N on a signed integer, the compiler has no choice other than to invoke a signed divide routine rather than a simple shift operation. This holds true for every microprocessor I have ever looked at in detail.

There is a third area where unsigned integers offer a speed improvement over signed integers – but it comes about by a different mechanism…

Unsigned integers can often save you a comparison

From time to time I find myself writing a function that takes as an argument an index into an array or a file. Naturally to protect against indexing beyond the bounds of the array or file, I add protection code. If I declare the function as taking a signed integer type, then the code looks like this:

void foo(int offset)
{
 if ((offset >= 0) && (offset < ARRAY_SIZE))
 {
  //Life is good...
 }
}

However, if I declare the function as taking an unsigned integer type, then the code looks like this:

void foo(unsigned int offset)
{
 if (offset < ARRAY_SIZE)
 {
  //Life is good...
 }
}

Clearly it’s nonsensical to check whether an unsigned integer is >=0 and so I can dispense with a check. The above are examples of where unsigned integer types are significantly more efficient than signed integer types. In most other cases, there isn’t usually any difference between the types. That’s not to say that you should choose one over the other on a whim. See this for a discussion of some of the other good reasons to use an unsigned integer. Before I leave this topic, it’s worth asking whether there are situations in which a signed integer is more efficient than an unsigned integer? Off hand I can’t think of any. There are situations where I could see the possibility of this occurring. For example when performing pointer arithmetic, the C standard requires that subtraction of two pointers return the data type ptrdiff_t. This is a signed integral type (since the result may be negative). Thus if after subtracting two pointers, you needed to add an offset to the result, it’s likely that you’ll get better code if the offset is a signed integral type. Of course this touches upon the nasty topic of mixing signed and unsigned integral types in an expression. I’ll address this another day.

Next Tip

Previous Tip

Home

Division of integers by constants

Friday, June 5th, 2009 Nigel Jones

An issue that comes up frequently in embedded systems is division of an integer by a constant. Of course most of the time we try and arrange things such that the divisor is a power of two such that the division may be performed by shift operations. However, all too often we have to divide an integer by some non power of two value. Divisors that seem to crop up a lot are 10 & 100 (for obvious reasons), 3 (for no good reason), 60 (when dealing with time) and of course various combination’s of pi and root 2. In cases like these you can of course just code it ‘normally’ and let the compiler do the work for you. However, when you feel the need for speed, there are other techniques that are spectacularly good.

I learned about this subject in dribs and drabs over the years without ever coming across a good summary – until I located this paper by Douglas Jones (no relationship). It does a nice job of explaining most of what you need to know in order to perform division of an integer by a constant. I particularly like the fact that he has algorithms for CPUs that contain barrel shifters – and those that do not. I strongly recommend that you read the paper. One note of caution however – Jones like many academics is used to working on CPUs with 32 bit word lengths. As such, his code assumes that integers are 32 bits. If you use his code as is, then it will fail on 16 bit word length machines. It’s for reasons such as this that I really recommend everyone would use the C99 data types.

For those of you too lazy to read the paper, its basic premise is based upon the fact that division by a constant is equivalent to multiplication by the reciprocal of that constant. There is nothing of course earth shattering about this observation. However, Jones then goes ahead and explains about binary points, rounding etc in order to achieve the desired result. Since I had to reduce his paper to practice, I thought I’d go ahead and share the ‘recipe’ with you. Before doing so I should note that I work mostly with 8 & 16 bit CPUs that do not contain barrel shifters. As a result I am most interested in the techniques that use multiplication. If you are working with a 32 bit processor with a barrel shifter and an instruction cache then you should seriously look at his other implementations.

Division of a uint16_t by a constant K

In the steps that follow, there is no requirement that K be integer. It must however be greater than 1.
There are two recipes. The first works for many divisors – but not all and is the faster of the two. The second recipe will give better results for all inputs – but produces less efficient code. While I am sure that there is some analytical way of making the determination ahead of time, I’ve found it easier to use the first recipe and exhaustively test it. If it works – great. If not then switch to the second recipe.

In the following descriptions, Q is the quotient (i.e. the result) of dividing an unsigned integer A by the constant K.

Recipe #1
  1. Convert 1 / K into binary. There is a nice web based calculator here that will do the job.
  2. Take all the bits to the right of the binary point, and left shift them until the bit to the right of the binary point is 1. Record the required number of shifts S.
  3. Take the most significant 17 bits and add 1 and then truncate to 16 bits. This effectively rounds the result.
  4. Express the remaining 16 bits to the right of the binary point as a 4 digit hexadecimal number M of the form hhhh.
  5. Q = (((uint32_t)A * (uint32_t)M) >> 16) >> S
  6. Perform an exhaustive check for all A & Q. If necessary adjust M or try recipe #2.

Incidentally, you may be wondering why I don’t use the form espoused by Jones, namely:Q = (((uint32_t)A * (uint32_t)M) >> (16 + S))
The answer is that this requires a left shift 16 + S places of a 32 bit integer. By splitting the shift into two as shown and by making use of the C integer promotion rules, the expression becomes:

  1. Right shift a 32 bit integer 16 places and convert to a 16 bit integer. This effectively means just use the top half of the 32 bit integer.
  2. Right shift the 16 bit integer S places.

This is dramatically more efficient on an 8 or 16 bit processor. On a 32 bit processor it probably is not.

Recipe #2

  1. Convert 1 / K into binary.
  2. Take all the bits to the right of the binary point, and left shift them until the bit to the right of the binary point is 1. Record the required number of shifts S.
  3. Take the most significant 18 bits and add 1 and then truncate to 17 bits. This effectively rounds the result.
  4. Express the 17 bit result as 1hhhh. Denote the hhhh portion as M
  5. Q = ((((uint32_t)A * (uint32_t)M) >> 16) + A) >> 1) >> S;
  6. Perform an exhaustive check for all A & Q. If necessary adjust M.

Again I split the shifts up as shown for efficiency on an 8 / 16 bit machine.

Example 1 – Divide by 30

In this case I wish to divide a uint16_t by 30.

  1. Convert to binary. 1 / 30 = 0.000010001000100010001000100010001000100010001000100010001
  2. Left shift until there is a 1 to the right of the binary point. In this case it requires 4 shifts and we get 0.10001000100010001000100010001000100010001000100010001. S is thus 4.
  3. Take the most significant 17 bits: 1000 1000 1000 1000 1
  4. Add 1: giving 1000 1000 1000 1000 1 + 1 = 1000 1000 1000 1001 0
  5. Truncate to 16 bits: 1000 1000 1000 1001
  6. Express in hexadecimal: M = 0×8889
  7. Q = (((uint32_t)A * (uint32_t)0×8889) >> 16) >> 4

An exhaustive check confirms that this expression does indeed do the job for all 16 bit values of A. It is also about 10 times faster than the compiler division routine on an AVR processor.

Example 2 – Divide by 100

In this case I wish to divide a uint16_t by 100. This is one of those cases where we need 17 bit resolution

  1. Convert to binary. 1 / 100 = 0.00000010100011110101110000101000111101011100001010001111011
  2. Left shift until there is a 1 to the right of the binary point. In this case it requires 6 shifts and we get 0.10100011110101110000101000111101011100001010001111011. S is thus 6.
  3. Take the most significant 18 bits: 1 0100 0111 1010 1110 0
  4. Add 1: 1 0100 0111 1010 1110 0 + 1 = 1 0100 0111 1010 1110 1
  5. Truncate to 17 bits: 1 0100 0111 1010 1110
  6. Express in hexadecimal: M = 1 47AE
  7. Q = ((((uint32_t)A * (uint32_t)0x47AE) >> 16) + A) >> 1) >> 6;

An exhaustive check shows that the division is not exact for all A. I thus incremented M to 0x47AF and got exact results for all A. This code was about twice as fast as the compiler division routine on an AVR processor.

Example 3 – Divide by π

This is an example where the resultant expression results in an approximate result. The approximation is very good though, with a quotient that is off by at most 1 for all A.

  1. Convert to binary: 1 / π = 0.010100010111110011000001101101110010011100100010001001
  2. Left shift until there is a 1 to the right of the binary point. In this case it requires 1 shift and we get
    10100010111110011000001101101110010011100100010001001. S is thus 1.
  3. Take the most significant 18 bits: 1 0100 0101 1111 0011 0
  4. Add 1: 1 0100 0101 1111 0011 0 + 1 = 1 0100 0101 1111 0011 1
  5. Truncate to 17 bits: 1 0100 0101 1111 0011
  6. Express in hexadecimal: M = 1 45F3
  7. Q = ((((uint32_t)A * (uint32_t)0x45F3) >> 16) + A) >> 1) >> 1;

An exhaustive check that compared the result of this expression to (float)A * 0.31830988618379067153776752674503f showed that the match was exact for all but 263 values in the range 0 – 0xFFFF. Where there was a mismatch it is off by at most 1. It’s also 23 times faster than converting to floating point. Not a bad trade off.

Example 4 – Divide by 10 on an 8 bit value

This technique is obviously usable on 8 bit values. One just has to adjust the number of bits. Here’s an example

  1. Convert to binary. 1 / 10 = 0.0001100110011001100110011001100110011001100110011001101
  2. Left shift until there is a 1 to the right of the binary point. In this case it requires 3 shifts and we get 0.1100110011001100110011001100110011001100110011001101. S is thus 3.
  3. Take the most significant 9 bits: 1100 1100 1
  4. Add 1: giving 110011001 + 1 = 110011010
  5. Truncate to 8 bits: 1100 1101
  6. Express in hexadecimal: M = 0xCD
  7. Q = (((uint16_t)A * (uint16_t)0xCD) >> 8) >> 3

An exhaustive check confirms that this expression does indeed do the job for all 8 bit values of A. It is also about 8 times faster than the compiler division routine on an AVR processor.

Summary

Using the values generated by Jones, together with some of the values I have computed, here’s a summary of some common divisors for unsigned 16 bit integers.

Divide by 3: (((uint32_t)A * (uint32_t)0xAAAB) >> 16) >> 1
Divide by 5: (((uint32_t)A * (uint32_t)0xCCCD) >> 16) >> 2
Divide by 6: (((uint32_t)A * (uint32_t)0xAAAB) >> 16) >> 2
Divide by 7: ((((uint32_t)A * (uint32_t)0x2493) >> 16) + A) >> 1) >> 2
Divide by 9: (((uint32_t)A * (uint32_t)0xE38F) >> 16) >> 3
Divide by 10: (((uint32_t)A * (uint32_t)0xCCCD) >> 16) >> 3
Divide by 11: (((uint32_t)A * (uint32_t)0xBA2F) >> 16) >> 3
Divide by 12: (((uint32_t)A * (uint32_t)0xAAAB) >> 16) >> 3
Divide by 13: (((uint32_t)A * (uint32_t)0x9D8A) >> 16) >> 3
Divide by 14: ((((uint32_t)A * (uint32_t)0x2493) >> 16) + A) >> 1) >> 3
Divide by 15: (((uint32_t)A * (uint32_t)0x8889) >> 16) >> 3
Divide by 30: (((uint32_t)A * (uint32_t)0x8889) >> 16) >> 4
Divide by 60: (((uint32_t)A * (uint32_t)0x8889) >> 16) >> 5
Divide by 100: (((((uint32_t)A * (uint32_t)0x47AF) >> 16U) + A) >> 1) >> 6
Divide by PI: ((((uint32_t)A * (uint32_t)0x45F3) >> 16) + A) >> 1) >> 1
Divide by √2: (((uint32_t)A * (uint32_t)0xB505) >> 16) >> 0

Hopefully you have spotted the relationship between divisors that are multiples of two. For example compare the expressions for divide by 15, 30 & 60.

If someone has too much time on their hands and would care to write a program to compute the values for all integer divisors, then I’d be happy to post the results for everyone to use.

Update

Alan Bowens has risen to the challenge and has generated some nifty programs for generating coefficients for arbitrary 8 and 16 bit values. He’s also generated header files for all 8 and 16 bit integer divisors that you can just include and use. You’ll find it all at his blog. Nice work Alan.

Efficient C Tips #9 – Use lookup tables

Thursday, May 28th, 2009 Nigel Jones

This the ninth in a series of tips on how to make your C code more efficient.

(Note if you are looking for basic information on lookup tables, you should read this).

Typically the fastest ways to compute something on a microcontroller is to not compute it all – but to simply read the result from a lookup table. For example this is regularly done as part of CRC calculations. Despite this I’ve noticed over the years what I’ll call the ‘look up tables are boring’ syndrome. What do I mean by this? Well when having to code a solution to a problem, it seems that most of us would rather code something that involves crunching numbers, rather than generate a table where we just look up the result. I’m sure that many of you are thinking that I’m dead wrong and that you use lookup tables all the time. Well I’m sure many of you do. However the question is whether you make full use of this capability?

What started me thinking about this is the person who ended up on this blog looking for an efficient algorithm for determining the day of the year. I have no idea if they were coding for an embedded system or not, nor whether they were looking for a fast solution, a minimal memory solution, or something in between. However, it did make me realize that it would be a simple albeit slightly contrived way of demonstrating my point about look up tables.

First off, I imposed some constraints

  • The Gregorian calendar is to be used.
  • Days, months and years are numbered from 1 and not zero, such that January 1 is day 1 and not day 0.

Here’s my first solution, that makes use of a small lookup table:

#define JAN_DAYS (31)
#define FEB_DAYS (28)
#define LY_FEB_DAYS (29)
...
#define NOV_DAYS (30)
#define MONTHS_IN_A_YEAR (12+1)

uint16_t day_of_year(uint8_t day, uint8_t month, uint16_t year)
{
 static const uint16_t days[2][MONTHS_IN_A_YEAR] =
 {
  { /* Non leap year table */
   0,    /* Padding because first month is not zero */
   0,    /* If month is january, then no days before it */
   JAN_DAYS,  JAN_DAYS + FEB_DAYS,...JAN_DAYS + FEB_DAYS + ... + NOV_DAYS
  },
  { /* Leap year lookup table */
   0,    /* Padding because first month is not zero */
   0,    /* If month is January, then no days before it */
   JAN_DAYS,  JAN_DAYS + LY_FEB_DAYS,  ...JAN_DAYS + LY_FEB_DAYS + ... + NOV_DAYS
  }
 };

 uint16_t day_of_year;

 if ((year % 4 == 0) && (year % 100 != 0) || (year % 400 == 0))
 {
   /* Leap year */
   day_of_year = days[1][month] + day;
 }
 else
 {
  /* Non leap year */
  day_of_year = days[0][month] + day;
 }
 return day_of_year;
}

For most applications I think this is an optimal solution in that it handles a very wide range of dates, uses a small amount of storage for the lookup tables and requires minimal computational effort to achieve the result. (On an ARM7 it requires 128 bytes of code space, 64 bytes for the lookup table and executes in about 40 cycles). However, what about if the code had to run as fast as possible? I’d guess that most folks would work on optimizing the details of the implementation and leave it at that. I’m not sure that many people would consider a gigantic look up table so that the code looks like this:

#define LAST_YEAR (2400 + 1)/* Last year to worry about */
uint16_t day_of_year(uint8_t day, uint8_t month, uint16_t year)
{
 static const uint16_t days[LAST_YEAR][[MONTHS_IN_A_YEAR] =
 {
  {
   /* Padding because first year is not zero */ 0,
   /* Padding because first month is zero */ 0,
   /* If month is January, then no days before it */
   JAN_DAYS,  JAN_DAYS + FEB_DAYS,...JAN_DAYS + FEB_DAYS + ... + NOV_DAYS
  },

  {
    /* Year 1 - non leap year */0,
    /* Padding because first month is zero */0,
    /* If month is January, then no days before it */
    JAN_DAYS,  JAN_DAYS + FEB_DAYS,...JAN_DAYS + FEB_DAYS + ... + NOV_DAYS
  },

 ... 

  {
   /* Last year - a leap year */ 0,
   /* Padding because first month is zero */ 0,
   /* If month is January, then no days before it */
   JAN_DAYS,  JAN_DAYS + LY_FEB_DAYS,...JAN_DAYS + LY_FEB_DAYS + ... + NOV_DAYS
  },
};

return days[year][month] + day;
}

The lookup table would of course require at least 2401 * 13 * 2 = 62426 bytes. Evidently this would likely be unreasonable on an 8 bit processor. On a 32 bit processor with 8 Mbytes of Flash – not so unreasonable. I first learned this lesson many years ago in an application that required an 8051 processor to perform a complicated refresh of multiplexed LEDs at about 1 kHz (a significant load for the 8051). The initial implementation consisted of pure code. Over the next year or so the two of us working on it realized that we could speed it up by using lookup tables. We started off with a small look up table, and by the time we were done, the table was 48K (out of the 64K available to the 8051) while the execution time was a fraction of what it had been before. Thus next time you are faced with making something run faster consider using a look up table – even if it is huge. Sometimes it’s just the best way to go.

Next Tip

Previous Tip

Home

Efficient C Tips #8 – Use const

Monday, March 30th, 2009 Nigel Jones

One of the easiest ways to make your code more efficient is to use const wherever feasible. Just like declaring local functions as static, this is one of those changes that makes your code more robust, more maintainable and faster – a true win-win situation. So how does this work? Well you get the most benefit when passing pointers as parameters to functions. Here’s an example of a function whose job it is to compute the sum of an array of integers. The naive implementation would look something like this:

uint32_t sum(uint16_t *ptr, uint16_t n_elements)
{
 uint16_t lpc;
 uint32_t sum = 0;

 for (lpc = 0; lpc < n_elements; lpc++)
 {
  sum += *ptr++;
 }
 return sum;
}

I’ll ignore the issues of post increment and counting up (for now). Instead, consider the declaration of ptr. As it stands, the caller of this function has no idea whether sum() will modify the data or not, and hence must assume that it does. This has obvious implications for the compiler when it comes to optimization. To overcome this, it is necessary to declare ptr as pointing to const. The function prototype for sum() now becomes:

uint32_t sum(uint16_t const *ptr, uint16_t n_elements);

You’ll notice that I prefer to use what I call Saks notation for where I place the const modifier. The more conventional, albeit less sensible way of writing the declaration is:

uint32_t sum(const uint16_t *ptr, uint16_t n_elements);

Regardless of the style, by doing this you are indicating to the compiler that you will not be modifying the data that ptr points to. As a result, the optimizer can make assumptions that will typically lead to tighter code.

Before I give you the final code, I’d like to make a few other observations.

  • As well as potentially making your code more efficient, use of const also makes your code more readable and maintainable. That is, someone examining your code will know something extra about the function simply by looking at the prototype. Personally I find this very useful.
  • If you examine the C standard library, you’ll find very liberal use of the const modifier. You should take this as a strong hint that it’s a good idea.
  • PC-Lint will very helpfully tell you if a pointer can be declared as pointing to const. Yet another reason for using Lint!

So what does my sum() function look like? Well, incorporating my previous hints on post increment and counting down, it looks something like this:

uint32_t sum(uint16_t const *ptr, uint16_t n_elements)
{
 uint32_t sum = 0;

 for (; n_elements != 0; --n_elements)
 {
  sum += *ptr;
  ++ptr;
 }
 return sum;
}

Next Tip
Previous Tip
Home

Editorial Note

I’ve been following my own advice and have been on a short vacation. As a result I’ve been tardy in responding to some of your comments. I’ll try and rectify this over the next few days.

Efficient C Tips #7 – Fast loops

Thursday, March 5th, 2009 Nigel Jones

Every program at some point requires some set of actions to be taken a fixed number of times. Indeed this is such a common occurrence that we typically code it without really giving it much thought. For example, if I asked you to call a function foo() ten times, I’m sure that most of you would write something like this:

for (uint8_t lpc = 0; lpc < 10; ++lpc)
{
 foo();
}

While there is nothing wrong with this, per se, it is sub optimal on just about every processor. Instead you are better off using a construct which counts down to zero. Here are two alternative ways of doing this:

for(uint8_t lpc = 10; lpc != 0; --lpc)
{
 foo();
}

uint8_t lpc = 10;
do
{
 foo();
} while (--lpc);

Which one you think is more natural is entirely up to you.

So how does this efficiency arise? Well in the count up case, the assembly language generated by the compiler typically looks something like this:

INC lpc      ; Increment loop counter
SUB lpc, #10 ; Compare loop counter to 10
BNZ loop     ; Branch if loop counter not equal to 10

By contrast, in the count down case the assembly language looks something like this

DEC lpc   ; Decrement loop counter
BNZ loop  ; Branch if non zero

Evidently, because of the ‘specialness’ of zero, more efficient code can be generated.

So why don’t you see C programs littered with these count down constructs? Well counting down has a major limitation. If you need to use the loop variable as an index into an array then you have a problem. For example, let’s say I wanted to zero the elements of an array. Using the count down technique you might be tempted to do this:

uint8_t bar[10];
uint8_t lpc;

do
{
 bar[lpc] = 0; // Error! First time through results in index beyond end of array
} while (--lpc);

Evidently it doesn’t work. You can of course modify the code to make it work. However doing so typically loses you all the efficiency gains, such that you are better off with a standard up-counting for loop.

As a parting thought, concepts such as these are second nature to assembly language programmers – all of whom do this sort of thing instinctively. As a result, if you are really interested in getting the best out of your C compiler, you could do a lot worse than learning how to program your target processor in assembly language. Does this defeat one of the objectives in programming in a high level language – yes. However, for giving you insight in terms of what is going on under the hood it cannot be beaten.

Next Tip
Previous Tip
Home