http://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer

helped in finding yet another implementation of the CLZ algorithm:

static inline uint32_t CLZz(register uint32_t x)

{

// Note: clz(x_32) = 32 minus Hamming_weight(x_32),

// provided the leading zeros are followed by 1-bits only

// the next 5 statements turn all bits after the

// leading zeros into 1-bits

x |= x >> 1;

x |= x >> 2;

x |= x >> 4;

x |= x >> 8;

x |= x >> 16;

// compute the Hamming weight of x and return 32

// minus the computed result

x -= ((x >> 1) & 0x55555555U);

x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);

return 32U - ( (((x + (x >> 4)) & 0x0F0F0F0FU) * 0x01010101U) >> 24 );

}

The last 3 lines can also be written as follows (i.e. without the multiply statement):

x -= (x >> 1) & 0x55555555U;

x = ((x >> 2) & 0x33333333U) + (x & 0x33333333U);

x = ((x >> 4) + x) & 0x0f0f0f0fU;

x += x >> 8;

x += x >> 16;

return 32U - (x & 0x0000003fU);

Or (even less optimized), as follows:

x = (x & 0x55555555U) + ((x >> 1 ) & 0x55555555U);

x = (x & 0x33333333U) + ((x >> 2 ) & 0x33333333U);

x = (x & 0x0F0F0F0FU) + ((x >> 4 ) & 0x0F0F0F0FU);

x = (x & 0x00FF00FFU) + ((x >> 8 ) & 0x00FF00FFU);

x = (x & 0x0000FFFFU) + ((x >> 16) & 0x0000FFFFU);

return 32U - x;

The Hamming weight (count of the 1-bits) is computed using sideways addition (and divide and conquer), and is explained here:

http://en.wikipedia.org/wiki/Hamming_weight
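
As a rough sketch of the idea above (not the code from the post verbatim), the two steps can be separated and checked against a naive bit-by-bit loop. The names `clz_reference`, `popcount32`, `clz_smear` and `clz_smear_self_check` are hypothetical, chosen here for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Reference: count leading zeros one bit at a time (slow but obvious). */
uint32_t clz_reference(uint32_t x) {
    uint32_t n = 0U;
    uint32_t mask;
    for (mask = 0x80000000U; mask != 0U && (x & mask) == 0U; mask >>= 1) {
        ++n;
    }
    return n;
}

/* Sideways addition: sum adjacent 1-, 2-, 4-, 8- and 16-bit fields. */
uint32_t popcount32(uint32_t x) {
    x = (x & 0x55555555U) + ((x >> 1)  & 0x55555555U);
    x = (x & 0x33333333U) + ((x >> 2)  & 0x33333333U);
    x = (x & 0x0F0F0F0FU) + ((x >> 4)  & 0x0F0F0F0FU);
    x = (x & 0x00FF00FFU) + ((x >> 8)  & 0x00FF00FFU);
    x = (x & 0x0000FFFFU) + ((x >> 16) & 0x0000FFFFU);
    return x;
}

/* Smear the highest 1-bit into every lower position, then count the ones. */
uint32_t clz_smear(uint32_t x) {
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return 32U - popcount32(x);
}

/* Spot checks against the reference loop. */
void clz_smear_self_check(void) {
    assert(clz_smear(0U) == clz_reference(0U));
    assert(clz_smear(0xCCU) == clz_reference(0xCCU));
    assert(clz_smear(0x50000000U) == clz_reference(0x50000000U));
}
```

After the smearing step the word has the form 0…01…1, so the number of leading zeros is simply 32 minus the number of 1-bits.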

Whether or not this implementation is useful in your use case, I cannot really tell. Yes, the algorithm is linear (no branch statements), but only you can tell.

Henri

static inline uint32_t CLZ3(uint32_t x) {

…

x >>= n;

#if 1 // ADDED STATEMENTS

x |= (x >> 1);

x |= (x >> 2);

x |= (x >> 4);

x++;

#endif

return clz_lkup[(x & -x) >> 1] - n; // MODIFIED STATEMENT

}

The first three of the added statements turn all bits after the leading zeros into ones …

And yes, the modified algorithm will certainly need more statements than cli2(). No doubt.
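
To make the effect of the added statements concrete, here is a minimal self-contained sketch that combines them with the byte-selection logic of CLZ3(). It is an illustration under assumptions, not the exact code that was measured: the name `clz3_modified` is hypothetical, and the 129-entry table is written with C99 designated initializers, which leave all other entries at zero.

```c
#include <assert.h>
#include <stdint.h>

uint32_t clz3_modified(uint32_t x) {
    /* Only index 0 and the powers of two up to 128 are ever read. */
    static uint8_t const clz_lkup[129] = {
        [0] = 32U, [1] = 31U, [2] = 30U, [4] = 29U, [8] = 28U,
        [16] = 27U, [32] = 26U, [64] = 25U, [128] = 24U
    };
    uint32_t n;

    /* Select the most significant non-zero byte. */
    if (x >= (1U << 16)) {
        n = (x >= (1U << 24)) ? 24U : 16U;
    } else {
        n = (x >= (1U << 8)) ? 8U : 0U;
    }
    x >>= n;        /* x now fits in 8 bits */

    x |= x >> 1;    /* smear the highest 1-bit downwards ... */
    x |= x >> 2;
    x |= x >> 4;    /* ... so x == 2^k - 1, k = bit length of x */
    x++;            /* x == 2^k: a single 1-bit (1 when x was 0) */

    /* x & -x equals x here; the shift maps 2^k to index 2^(k-1). */
    return clz_lkup[(x & -x) >> 1] - n;
}

/* Spot checks on a few easy inputs. */
void clz3_modified_self_check(void) {
    assert(clz3_modified(0U) == 32U);
    assert(clz3_modified(0xCCU) == 24U);
    assert(clz3_modified(0x80000000U) == 0U);
}
```

Because the increment turns the smeared value into a single power of two, the lookup index depends only on the highest 1-bit of the selected byte.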

Henri

********

This means that CLZ6() proposed by Bob is entirely **INCORRECT**. It just calculates a different thing and, in fact, it should NOT even be called CLZ.

********

–Miro

********

This means that CLZ3() proposed by Bob is entirely **INCORRECT**. It just calculates a different thing and, in fact, it should NOT even be called CLZ.

********

–Miro

I am not sure whether I really understand clz3(), because as far as I can tell clz3() does not report the same result as clz1() in ALL cases …

Executing both cases on my Intel computer resulted in the following output:

@@ ./clz1

CLZ1: 1, number of leading zero’s: 31

CLZ1: 3, number of leading zero’s: 30

CLZ1: cc, number of leading zero’s: 24

CLZ1: 80000000, number of leading zero’s: 0

CLZ1: 5, number of leading zero’s: 29

CLZ1: 50000000, number of leading zero’s: 1

CLZ1: a0000000, number of leading zero’s: 0

@@ ./clz3

CLZ3: 1, number of leading zero’s: 31

CLZ3: 3, number of leading zero’s: 31

CLZ3: cc, number of leading zero’s: 29

CLZ3: 80000000, number of leading zero’s: 0

CLZ3: 5, number of leading zero’s: 31

CLZ3: 50000000, number of leading zero’s: 3

CLZ3: a0000000, number of leading zero’s: 2

Again, the output from clz3() is not the same as that of clz1() in ALL cases. Where do I go wrong?
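
The differing rows are consistent with `x & -x` isolating the lowest 1-bit rather than the highest one: for 0xcc the lowest 1-bit is bit 2, which matches the reported 29. A minimal sketch of the two operations, with hypothetical helper names `lowest_set_bit`, `highest_set_bit` and `set_bit_self_check`:

```c
#include <assert.h>
#include <stdint.h>

/* x & -x keeps only the LOWEST set bit (two's-complement identity). */
uint32_t lowest_set_bit(uint32_t x) {
    return x & (0U - x);
}

/* Smear the highest set bit downwards, then drop everything below it. */
uint32_t highest_set_bit(uint32_t x) {
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x - (x >> 1);
}

/* The two only agree when x has at most one 1-bit. */
void set_bit_self_check(void) {
    assert(lowest_set_bit(0xCCU) == 0x04U);
    assert(highest_set_bit(0xCCU) == 0x80U);
    assert(lowest_set_bit(0x80000000U) == highest_set_bit(0x80000000U));
}
```

This would explain why the two programs agree for 1 and 80000000, which contain a single 1-bit, and disagree for the other inputs.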

Henri

I’ve tested your CLZ6() code under the same conditions as all other implementations (IAR EWARM 7.10, ARM Cortex-M0, highest level of optimization). The results are as follows (not very good): CLZ6() completes in 24 instructions, of which 3 are NOPs. Indeed, the code is completely linear and contains not a single branch. The code takes 64 bytes plus 4*132 bytes for the lookup tables, which totals 592 bytes. That is something of a record for size.

Overall, I would say that CLZ6() is not optimal.

–Miro

*********

Miro,

I tried on two occasions to post this to the Embedded Gurus website as a comment, but it was rejected without any message. I can only assume that it exceeded the length limit.

Here’s another version of the CLZ algorithm that I thought of today. This one uses more memory, and probably results in more instructions. But it completely avoids branch instructions, thereby preventing pipeline cache misses (which are non-deterministic).

Bob Snyder

——————————————————

static inline uint32_t CLZ6(uint32_t x) {

static uint8_t const b0[] = {

0,1,2,0,3,0,0,0,

4,0,0,0,0,0,0,0,

5,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

6,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

7,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

8};

static uint8_t const b1[] = {

0,9,10,0,11,0,0,0,

12,0,0,0,0,0,0,0,

13,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

14,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

15,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

16};

static uint8_t const b2[] = {

0,17,18,0,19,0,0,0,

20,0,0,0,0,0,0,0,

21,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

22,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

23,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

24};

static uint8_t const b3[] = {

0,25,26,0,27,0,0,0,

28,0,0,0,0,0,0,0,

29,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

30,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

31,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

32

};

x &= -x; /* isolate leftmost 1-bit in x*/

/* Note: at most one of the b[n] values will be non-zero */

return 32U

- b3[(x >> 24) & 0xFFU]

- b2[(x >> 16) & 0xFFU]

- b1[(x >> 8 ) & 0xFFU]

- b0[x & 0xFFU];

}
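
For comparison, here is a sketch of how the same four-table, branch-free idea could be combined with a step that isolates the highest 1-bit instead of using `x & -x`. This is an assumption-laden illustration, not Bob's code: the name `clz6_highest` is hypothetical, the smear-and-subtract isolation step is swapped in, and the tables use C99 designated initializers with the same non-zero entries as the listing above.

```c
#include <assert.h>
#include <stdint.h>

uint32_t clz6_highest(uint32_t x) {
    /* Each table maps a byte holding a single 1-bit to that bit's
     * 1-based position in the 32-bit word; all other entries are 0. */
    static uint8_t const b0[129] = {
        [1] = 1, [2] = 2, [4] = 3, [8] = 4,
        [16] = 5, [32] = 6, [64] = 7, [128] = 8 };
    static uint8_t const b1[129] = {
        [1] = 9, [2] = 10, [4] = 11, [8] = 12,
        [16] = 13, [32] = 14, [64] = 15, [128] = 16 };
    static uint8_t const b2[129] = {
        [1] = 17, [2] = 18, [4] = 19, [8] = 20,
        [16] = 21, [32] = 22, [64] = 23, [128] = 24 };
    static uint8_t const b3[129] = {
        [1] = 25, [2] = 26, [4] = 27, [8] = 28,
        [16] = 29, [32] = 30, [64] = 31, [128] = 32 };

    /* Isolate the highest 1-bit: smear it downwards, then subtract
     * the lower half of the resulting run of ones. */
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x -= x >> 1;

    /* At most one of the four lookups is non-zero. */
    return 32U
         - b3[(x >> 24) & 0xFFU]
         - b2[(x >> 16) & 0xFFU]
         - b1[(x >> 8) & 0xFFU]
         - b0[x & 0xFFU];
}

/* Spot checks on a few easy inputs. */
void clz6_highest_self_check(void) {
    assert(clz6_highest(0U) == 32U);
    assert(clz6_highest(0xCCU) == 24U);
    assert(clz6_highest(0x80000000U) == 0U);
}
```

The extra shift-and-OR statements add instructions, so this variant trades even more code for a result that matches the usual definition of CLZ.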

An interesting idea. So, for completeness, here is your modification applied to CLZ1():

static inline uint32_t CLZ3(uint32_t x) {

static uint8_t const clz_lkup[] = {

32U, 31U, 30U, 0U, 29U, 0U, 0U, 0U,

28U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

27U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

26U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

25U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

24U

};

uint32_t n;

if (x >= (1U << 16)) {

if (x >= (1U << 24)) {

n = 24U;

}

else {

n = 16U;

}

}

else {

if (x >= (1U << 8)) {

n = 8U;

}

else {

n = 0U;

}

}

x >>= n;

return clz_lkup[x & -x] - n;

}

The CLZ3() code executes in 15 instructions and takes 44 bytes of code and 129 bytes of lookup table in ROM. In practice, the lookup table will be padded to a multiple of 4 bytes, so it will take 132 bytes. With this, the code takes 176 bytes of ROM. I would say that is pretty good: it saves 124 bytes compared to CLZ1() at the cost of two additional instructions, which probably execute in 1 clock cycle each, since they don’t break the instruction pipeline (like the branch instructions do).

I’m not sure about the real practical use of the gaps in the lookup table (shown as zeros in the listing above). I would say, these bytes are wasted.

–Miro

I haven’t come up with anything better. If I do, I will let you know.
