A:

32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U,
28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U,
27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U,
27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U,
…

I am developing an MP3 & ACELP decoder for the M0+, so a LOT of CLZ operations will be carried out (the M0+ has no CLZ instruction). I found this unusual approach on just one site. Currently I am testing which byte of the 32 bits the leading one-bit falls in and then using a table. The ARMv6-M allows PC-relative addressing, but I need to put it in-line. To that end, the core has a zero-page-like area, as in the old 8-bit days: an immediate 8-bit value lets me shave off a cycle. But I really don't want to waste that on this one cycle-saver, because multiple other, smaller tables are more efficient.

function mssb30(x)
{
    var C = b('00 100000 100000 100000 100000 100000');

    // Check whether the high bit of each block is set.
    var y1 = x & C;

    // Check whether any of the lower bits of each block are set.
    var y2 = ~(C - (x & ~C)) & C;

    var y = y1 | y2;

    // Shift the result bits down to the lowest 5 bits.
    var z = ((y >>> 5) * b('0000 10000 10000 10000 10000 10000000')) >>> 27;

    // Compute the bit index of the most significant set block.
    var b1 = 6 * mssb5(z);

    // Compute the most significant set bit inside the most significant
    // set block.
    var b2 = mssb6((x >>> b1) & b('111111'));

    return b1 + b2;
}

function mssb32(x)
{
    // Check the high duplet and fall back to mssb30 if it's not set.
    var h = x >>> 30;
    return h ? (30 + mssb5(h)) : mssb30(x);
}

As you can see, no tables. Now, the CPU I'm using can conveniently place a few constants on its zero-page, but constantly rewriting them to accommodate the evolving optimizations is a pain, so I'm going to give this approach a try. The 80x86 has a relatively low number of registers but is heavily optimized for speculative execution, so in that case it may actually be faster if it IS a subroutine.

For me, an extra 6 cycles of call & return overhead on a 14-cycle routine isn't so useful. Other cores will likely fall in between the two. On more powerful ARM cores, it is ideal for placing in TCM (Tightly Coupled Memory).

Thanks a lot!

–MMS

static inline unsigned stdclz(uint64_t x)
{
    unsigned base;
    unsigned ms_oct;
    uint64_t tmp1, tmp2;

    if ((tmp1 = x >> 32))
        if ((tmp2 = tmp1 >> 16))
            if ((tmp1 = tmp2 >> 8))
                base = 0, ms_oct = tmp1;
            else
                base = 8, ms_oct = tmp2;
        else
            if ((tmp2 = tmp1 >> 8))
                base = 16, ms_oct = tmp2;
            else
                base = 24, ms_oct = tmp1;
    else
        if ((tmp2 = x >> 16))
            if ((tmp1 = tmp2 >> 8))
                base = 32, ms_oct = tmp1;
            else
                base = 40, ms_oct = tmp2;
        else
            if ((tmp2 = x >> 8))
                base = 48, ms_oct = tmp2;
            else
                base = 56, ms_oct = x;

    return base + Clz_8b[ms_oct];
}

One minor portability issue with your implementation is that you assume unsigned int (i.e., your integer literals) has 32 bits, which is often untrue on embedded platforms. So, rather than using expressions like (1U << 16), which invokes undefined behaviour when unsigned int has only 16 bits (a shift by the full type width; a decent compiler would probably warn about this or might just quietly paper over it), use fully expressed constants like 0x10000U, or just throw a cast in there, like ((uint32_t) 1 << 16).

Rather than >=, I used bitwise AND with appropriate masks. So, instead of if (x >= ((uint32_t) 1 << 16)), I use if (x & 0xFFFF0000U). A smart compiler might translate your >= into my bitwise ANDs anyway.

Finally, a number of platforms have poor shift performance when the size of the shift is variable. Maybe a smart compiler would translate your variable shift into constant shifts, but I just did it explicitly.

Here’s what my version of software clz looks like:

#include <stdio.h>
#include <inttypes.h>

#define REPEAT_2x(X)   (X), (X)
#define REPEAT_4x(X)   REPEAT_2x(X), REPEAT_2x(X)
#define REPEAT_8x(X)   REPEAT_4x(X), REPEAT_4x(X)
#define REPEAT_16x(X)  REPEAT_8x(X), REPEAT_8x(X)
#define REPEAT_32x(X)  REPEAT_16x(X), REPEAT_16x(X)
#define REPEAT_64x(X)  REPEAT_32x(X), REPEAT_32x(X)
#define REPEAT_128x(X) REPEAT_64x(X), REPEAT_64x(X)

static const unsigned char Clz_8b[256] =
{
    8,
    7,
    REPEAT_2x(6),
    REPEAT_4x(5),
    REPEAT_8x(4),
    REPEAT_16x(3),
    REPEAT_32x(2),
    REPEAT_64x(1),
    REPEAT_128x(0)
};

static inline unsigned clz_64b(uint64_t x)
{
    unsigned base, ms_oct;

    if (x & 0xFFFFFFFF00000000U)
        if (x & 0xFFFF000000000000U)
            if (x & 0xFF00000000000000U)
                base = 0, ms_oct = x >> 56;
            else
                base = 8, ms_oct = x >> 48;
        else
            if (x & 0x0000FF0000000000U)
                base = 16, ms_oct = x >> 40;
            else
                base = 24, ms_oct = x >> 32;
    else
        if (x & 0x00000000FFFF0000U)
            if (x & 0x00000000FF000000U)
                base = 32, ms_oct = x >> 24;
            else
                base = 40, ms_oct = x >> 16;
        else
            if (x & 0x000000000000FF00U)
                base = 48, ms_oct = x >> 8;
            else
                base = 56, ms_oct = x >> 0;

    return base + Clz_8b[ms_oct];
}

int main()
{
    uint64_t x = (uint64_t) -1;

    do
    {
        fprintf(stdout, "%llu -> %u\n", (unsigned long long) x + 1, clz_64b(x + 1));
        fprintf(stdout, "%llu -> %u\n", (unsigned long long) x, clz_64b(x));
        fprintf(stdout, "%llu -> %u\n\n", (unsigned long long) x - 1, clz_64b(x - 1));
    } while (x >>= 1);

    return 0;
}

http://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer

helped in finding yet another implementation of the CLZ algorithm:

static inline uint32_t CLZz(register uint32_t x)
{
    // Note: clz(x_32) = 32 minus the Hamming weight of x_32,
    // in case the leading zeros are followed by 1-bits only.

    // The next 5 statements turn all bits after the
    // leading zeros into 1-bits.
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;

    // Compute the Hamming weight of x ... and return 32
    // minus the computed result.
    x -= ((x >> 1) & 0x55555555U);
    x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);
    return 32U - ( (((x + (x >> 4)) & 0x0F0F0F0FU) * 0x01010101U) >> 24 );
}

The last 3 lines can also be written as follows (i.e. without the multiply statement):

x -= (x >> 1) & 0x55555555U;
x = ((x >> 2) & 0x33333333U) + (x & 0x33333333U);
x = ((x >> 4) + x) & 0x0f0f0f0fU;
x += x >> 8;
x += x >> 16;
return 32U - (x & 0x0000003fU);

Or (even less optimized), as follows:

x = (x & 0x55555555U) + ((x >> 1) & 0x55555555U);
x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);
x = (x & 0x0F0F0F0FU) + ((x >> 4) & 0x0F0F0F0FU);
x = (x & 0x00FF00FFU) + ((x >> 8) & 0x00FF00FFU);
x = (x & 0x0000FFFFU) + ((x >> 16) & 0x0000FFFFU);
return 32U - x;

The Hamming weight (count of the 1-bits) is computed using sideways addition (and divide and conquer), and is explained here:

http://en.wikipedia.org/wiki/Hamming_weight

Whether or not this implementation is useful in your use case, I cannot really tell. Yes, the algorithm is straight-line code (no branch statements), but whether that wins on your core, only you can tell.

Henri

static inline uint32_t CLZ3(uint32_t x) {
    …
    x >>= n;
#if 1 // ADDED STATEMENTS
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x++;
#endif
    return clz_lkup[(x & -x) >> 1] - n; // MODIFIED STATEMENT
}

The first three of the added statements turn all bits after the leading zeros into ones …

And yes, the statement count of the modified algorithm will certainly turn out to be higher than that of clz2(). No doubt.

Henri

********

This means that CLZ6() proposed by Bob is entirely **INCORRECT**. It just calculates a different thing and, in fact, it should NOT be even called CLZ.

********

–Miro
