An issue that comes up frequently in embedded systems is division of an integer by a constant. Of course most of the time we try and arrange things such that the divisor is a power of two such that the division may be performed by shift operations. However, all too often we have to divide an integer by some non power of two value. Divisors that seem to crop up a lot are 10 & 100 (for obvious reasons), 3 (for no good reason), 60 (when dealing with time) and of course various combination’s of pi and root 2. In cases like these you can of course just code it ‘normally’ and let the compiler do the work for you. However, when you feel the need for speed, there are other techniques that are spectacularly good.

I learned about this subject in dribs and drabs over the years without ever coming across a good summary – until I located this paper by Douglas Jones (no relationship). It does a nice job of explaining most of what you need to know in order to perform division of an integer by a constant. I particularly like the fact that he has algorithms for CPUs that contain barrel shifters – and those that do not. I strongly recommend that you read the paper. One note of caution however – Jones like many academics is used to working on CPUs with 32 bit word lengths. As such, his code assumes that integers are 32 bits. If you use his code as is, then it will fail on 16 bit word length machines. It’s for reasons such as this that I really recommend everyone would use the C99 data types.

For those of you too lazy to read the paper, its basic premise is based upon the fact that division by a constant is equivalent to multiplication by the reciprocal of that constant. There is nothing of course earth shattering about this observation. However, Jones then goes ahead and explains about binary points, rounding etc in order to achieve the desired result. Since I had to reduce his paper to practice, I thought I’d go ahead and share the ‘recipe’ with you. Before doing so I should note that I work mostly with 8 & 16 bit CPUs that do not contain barrel shifters. As a result I am most interested in the techniques that use multiplication. If you are working with a 32 bit processor with a barrel shifter and an instruction cache then you should seriously look at his other implementations.

#### Division of a uint16_t by a constant K

In the steps that follow, there is no requirement that K be integer. It must however be greater than 1.

There are two recipes. The first works for many divisors – but not all and is the faster of the two. The second recipe will give better results for all inputs – but produces less efficient code. While I am sure that there is some analytical way of making the determination ahead of time, I’ve found it easier to use the first recipe and exhaustively test it. If it works – great. If not then switch to the second recipe.

In the following descriptions, Q is the quotient (i.e. the result) of dividing an unsigned integer A by the constant K.

##### Recipe #1

- Convert 1 / K into binary. There is a nice web based calculator here that will do the job.
- Take all the bits to the right of the binary point, and left shift them until the bit to the right of the binary point is 1. Record the required number of shifts S.
- Take the most significant 17 bits and add 1 and then truncate to 16 bits. This effectively rounds the result.
- Express the remaining 16 bits to the right of the binary point as a 4 digit hexadecimal number M of the form hhhh.
- Q = (((uint32_t)A * (uint32_t)M) >> 16) >> S
- Perform an exhaustive check for all A & Q. If necessary adjust M or try recipe #2.

Incidentally, you may be wondering why I don’t use the form espoused by Jones, namely:Q = (((uint32_t)A * (uint32_t)M) >> (16 + S))

The answer is that this requires a left shift 16 + S places of a 32 bit integer. By splitting the shift into two as shown and by making use of the C integer promotion rules, the expression becomes:

- Right shift a 32 bit integer 16 places and convert to a 16 bit integer. This effectively means just use the top half of the 32 bit integer.
- Right shift the 16 bit integer S places.

This is dramatically more efficient on an 8 or 16 bit processor. On a 32 bit processor it probably is not.

#### Recipe #2

- Convert 1 / K into binary.
- Take all the bits to the right of the binary point, and left shift them until the bit to the right of the binary point is 1. Record the required number of shifts S.
- Take the most significant 18 bits and add 1 and then truncate to 17 bits. This effectively rounds the result.
- Express the 17 bit result as 1hhhh. Denote the hhhh portion as M
- Q = ((((uint32_t)A * (uint32_t)M) >> 16) + A) >> 1) >> S;
- Perform an exhaustive check for all A & Q. If necessary adjust M.

Again I split the shifts up as shown for efficiency on an 8 / 16 bit machine.

#### Example 1 – Divide by 30

In this case I wish to divide a uint16_t by 30.

- Convert to binary. 1 / 30 = 0.000010001000100010001000100010001000100010001000100010001
- Left shift until there is a 1 to the right of the binary point. In this case it requires 4 shifts and we get 0.10001000100010001000100010001000100010001000100010001. S is thus 4.
- Take the most significant 17 bits: 1000 1000 1000 1000 1
- Add 1: giving 1000 1000 1000 1000 1 + 1 = 1000 1000 1000 1001 0
- Truncate to 16 bits: 1000 1000 1000 1001
- Express in hexadecimal: M = 0x8889
- Q = (((uint32_t)A * (uint32_t)0x8889) >> 16) >> 4

An exhaustive check confirms that this expression does indeed do the job for all 16 bit values of A. It is also about 10 times faster than the compiler division routine on an AVR processor.

#### Example 2 – Divide by 100

In this case I wish to divide a uint16_t by 100. This is one of those cases where we need 17 bit resolution

- Convert to binary. 1 / 100 = 0.00000010100011110101110000101000111101011100001010001111011
- Left shift until there is a 1 to the right of the binary point. In this case it requires 6 shifts and we get 0.10100011110101110000101000111101011100001010001111011. S is thus 6.
- Take the most significant 18 bits: 1 0100 0111 1010 1110 0
- Add 1: 1 0100 0111 1010 1110 0 + 1 = 1 0100 0111 1010 1110 1
- Truncate to 17 bits: 1 0100 0111 1010 1110
- Express in hexadecimal: M = 1 47AE
- Q = ((((uint32_t)A * (uint32_t)0x47AE) >> 16) + A) >> 1) >> 6;

An exhaustive check shows that the division is not exact for all A. I thus incremented M to 0x47AF and got exact results for all A. This code was about twice as fast as the compiler division routine on an AVR processor.

#### Example 3 – Divide by π

This is an example where the resultant expression results in an approximate result. The approximation is very good though, with a quotient that is off by at most 1 for all A.

- Convert to binary: 1 / π = 0.010100010111110011000001101101110010011100100010001001
- Left shift until there is a 1 to the right of the binary point. In this case it requires 1 shift and we get

10100010111110011000001101101110010011100100010001001. S is thus 1. - Take the most significant 18 bits: 1 0100 0101 1111 0011 0
- Add 1: 1 0100 0101 1111 0011 0 + 1 = 1 0100 0101 1111 0011 1
- Truncate to 17 bits: 1 0100 0101 1111 0011
- Express in hexadecimal: M = 1 45F3
- Q = ((((uint32_t)A * (uint32_t)0x45F3) >> 16) + A) >> 1) >> 1;

An exhaustive check that compared the result of this expression to (float)A * 0.31830988618379067153776752674503f showed that the match was exact for all but 263 values in the range 0 – 0xFFFF. Where there was a mismatch it is off by at most 1. It’s also 23 times faster than converting to floating point. Not a bad trade off.

#### Example 4 – Divide by 10 on an 8 bit value

This technique is obviously usable on 8 bit values. One just has to adjust the number of bits. Here’s an example

- Convert to binary. 1 / 10 = 0.0001100110011001100110011001100110011001100110011001101
- Left shift until there is a 1 to the right of the binary point. In this case it requires 3 shifts and we get 0.1100110011001100110011001100110011001100110011001101. S is thus 3.
- Take the most significant 9 bits: 1100 1100 1
- Add 1: giving 110011001 + 1 = 110011010
- Truncate to 8 bits: 1100 1101
- Express in hexadecimal: M = 0xCD
- Q = (((uint16_t)A * (uint16_t)0xCD) >> 8) >> 3

An exhaustive check confirms that this expression does indeed do the job for all 8 bit values of A. It is also about 8 times faster than the compiler division routine on an AVR processor.

### Summary

Using the values generated by Jones, together with some of the values I have computed, here’s a summary of some common divisors for unsigned 16 bit integers.

Divide by 3: (((uint32_t)A * (uint32_t)0xAAAB) >> 16) >> 1 Divide by 5: (((uint32_t)A * (uint32_t)0xCCCD) >> 16) >> 2 Divide by 6: (((uint32_t)A * (uint32_t)0xAAAB) >> 16) >> 2 Divide by 7: ((((uint32_t)A * (uint32_t)0x2493) >> 16) + A) >> 1) >> 2 Divide by 9: (((uint32_t)A * (uint32_t)0xE38F) >> 16) >> 3 Divide by 10: (((uint32_t)A * (uint32_t)0xCCCD) >> 16) >> 3 Divide by 11: (((uint32_t)A * (uint32_t)0xBA2F) >> 16) >> 3 Divide by 12: (((uint32_t)A * (uint32_t)0xAAAB) >> 16) >> 3 Divide by 13: (((uint32_t)A * (uint32_t)0x9D8A) >> 16) >> 3 Divide by 14: ((((uint32_t)A * (uint32_t)0x2493) >> 16) + A) >> 1) >> 3 Divide by 15: (((uint32_t)A * (uint32_t)0x8889) >> 16) >> 3 Divide by 30: (((uint32_t)A * (uint32_t)0x8889) >> 16) >> 4 Divide by 60: (((uint32_t)A * (uint32_t)0x8889) >> 16) >> 5 Divide by 100: (((((uint32_t)A * (uint32_t)0x47AF) >> 16U) + A) >> 1) >> 6 Divide by PI: ((((uint32_t)A * (uint32_t)0x45F3) >> 16) + A) >> 1) >> 1 Divide by √2: (((uint32_t)A * (uint32_t)0xB505) >> 16) >> 0

Hopefully you have spotted the relationship between divisors that are multiples of two. For example compare the expressions for divide by 15, 30 & 60.

If someone has too much time on their hands and would care to write a program to compute the values for all integer divisors, then I’d be happy to post the results for everyone to use.

### Update

Alan Bowens has risen to the challenge and has generated some nifty programs for generating coefficients for arbitrary 8 and 16 bit values. He’s also generated header files for all 8 and 16 bit integer divisors that you can just include and use. You’ll find it all at his blog. Nice work Alan.

Tags: division

Some time ago, I needed to implement an efficient and effective way (in terms of program size, RAM usage and speed) to convert an uint32_t to a BCD on a PIC16F690. After trying and finetuning such different approaches as the compiler built-in division, sprintf, successive subtractions, and some faster-but-obscure algorithms, a reciprocal multiplication implemented with shifts and additions resulted in a non-convoluted method, and the fastest one I benchmarked. Truly, Douglas Jones' paper was a lifesaver for me. Really good stuff.

A fellow fan! Did you implement it in assembly language or C? Also what was the magnitude of the variation between the various methods? I ask because I'm intrigued to see how the method ports to other architectures.

Hello Nigel!In order to give you the exact figures, I have to undust my code. But I can tell you from the top of my head that the reciprocal multiplication is more than 10 times faster than the naive ones (i.e. sprintf and built-in division by 10)All the approaches were implemented in C, using HI-TECH PICC standard (not PRO) compiler.By the way, just yesterday Embedded.com published a nice article about this very same topic. Check it out.http://www.embedded.com/217900224?printable=true