## Posts Tagged ‘CLZ’

### Fast, Deterministic, and Portable Counting Leading Zeros

Monday, September 8th, 2014 Miro Samek

Counting leading zeros in an integer number is a critical operation in many DSP algorithms, such as normalization of samples in sound or video processing, as well as in real-time schedulers, which use it to quickly find the highest-priority task that is ready to run.

In most such algorithms, it is important that the count-leading-zeros operation be fast and deterministic. For this reason, many modern processors provide the CLZ (count leading zeros) instruction, sometimes also called LZCNT (leading-zero count), BSR (bit scan reverse), FF1L (find first one-bit from left) or FBCL (find bit change from left).

Of course, if your processor supports CLZ or an equivalent in hardware, you definitely should take advantage of it. In C you can often use a built-in function provided by the embedded compiler. The examples below illustrate the calls for various CPUs and compilers:

```
y = __CLZ(x);          // ARM Cortex-M3/M4, IAR compiler (CMSIS standard)
y = __clz(x);          // ARM Cortex-M3/M4, ARM-KEIL compiler
y = __builtin_clz(x);  // ARM Cortex-M3/M4, GNU-ARM compiler
y = _clz(x);           // PIC32 (MIPS), XC32 compiler
y = __builtin_fbcl(x); // PIC24/dsPIC, XC16 compiler
```
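
One caveat with the compiler intrinsics: GCC's and Clang's `__builtin_clz()` leaves the result undefined for a zero argument, so portable code typically guards that case explicitly. Here is a minimal sketch of such a wrapper (the `clz32()` name is mine, and the fallback shift loop is for illustration only, as it is not deterministic):

```
#include <stdint.h>

/* Illustrative portable wrapper: use the compiler intrinsic when available,
 * guarding the zero case, because GCC/Clang __builtin_clz() is undefined
 * for a zero argument. Assumes a 32-bit unsigned int. */
static inline uint32_t clz32(uint32_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return (x == 0U) ? 32U : (uint32_t)__builtin_clz(x);
#else
    uint32_t n = 32U;   /* portable fallback: shift loop (not deterministic) */
    while (x != 0U) {
        x >>= 1;
        --n;
    }
    return n;
#endif
}
```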

However, what if your CPU does not provide the CLZ instruction? For example, the ARM Cortex-M0 and M0+ cores do not support it. In this case, you need to implement CLZ() in software, typically as an inline function or a macro.

The Internet offers many algorithms for counting leading zeros and for the closely related binary logarithm (log-base-2(x) = 32 – 1 – clz(x)).
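
The log-base-2 relation can be sanity-checked with a short, portable sketch (the `ilog2()` name is illustrative; a shift loop stands in for CLZ here):

```
#include <stdint.h>

/* floor(log2(x)) for x > 0, via the relation log2(x) = 32 - 1 - clz(x).
 * The inner loop counts leading zeros portably (illustration only). */
static inline uint32_t ilog2(uint32_t x) {
    uint32_t clz = 32U;
    while (x != 0U) {   /* one fewer leading zero per shift */
        x >>= 1;
        --clz;
    }
    return 31U - clz;   /* i.e., 32 - 1 - clz */
}
```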

But, unfortunately, most of the published algorithms are either incomplete, sub-optimal, or both. So, I thought it could be useful to post here a complete and, I believe, optimal CLZ(x) function, which is both deterministic and outperforms most of the published implementations, including all of the “Hacker’s Delight” algorithms.

Here is the first version:

```
static inline uint32_t CLZ1(uint32_t x) {
    static uint8_t const clz_lkup[] = {
        32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U,
        28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U,
        27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U,
        27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U
    };
    uint32_t n;
    if (x >= (1U << 16)) {
        if (x >= (1U << 24)) {
            n = 24U;
        }
        else {
            n = 16U;
        }
    }
    else {
        if (x >= (1U << 8)) {
            n = 8U;
        }
        else {
            n = 0U;
        }
    }
    return (uint32_t)clz_lkup[x >> n] - n;
}
```

This algorithm uses a hybrid approach: bi-section determines which 8-bit chunk of the 32-bit number contains the first 1-bit, and then the lookup table clz_lkup[] locates the first 1-bit within that byte.
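
As a side note, the 256-entry table need not be typed in by hand; it can be generated (for example, by a small host-side program at build time) with a sketch like this (illustrative, not part of the original code):

```
#include <stdint.h>

/* Illustrative generator for the 256-entry clz_lkup[] table: entry i
 * holds the number of leading zeros of i viewed as a 32-bit value
 * (so clz_lkup[0] == 32, clz_lkup[1] == 31, ..., clz_lkup[255] == 24). */
static uint8_t clz_lkup[256];

static void build_clz_lkup(void) {
    for (uint32_t i = 0U; i < 256U; ++i) {
        uint32_t n = 32U;
        for (uint32_t v = i; v != 0U; v >>= 1) {
            --n;    /* one fewer leading zero per occupied bit position */
        }
        clz_lkup[i] = (uint8_t)n;
    }
}
```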

The CLZ1() function is deterministic in that it always completes in 13 instructions, when compiled with the IAR EWARM compiler for the ARM Cortex-M0 core at the highest level of optimization.

The CLZ1() implementation takes about 40 bytes for code plus 256 bytes of constant lookup table in ROM. Altogether, the algorithm uses some 300 bytes of ROM.

If that ROM footprint is too high for your application, you can reduce the lookup table to only 16 bytes at the cost of running the bi-section for one more step. The CLZ2() function below illustrates this tradeoff:

```
static inline uint32_t CLZ2(uint32_t x) {
    static uint8_t const clz_lkup[] = {
        32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U,
        28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U
    };
    uint32_t n;

    if (x >= (1U << 16)) {
        if (x >= (1U << 24)) {
            if (x >= (1U << 28)) {
                n = 28U;
            }
            else {
                n = 24U;
            }
        }
        else {
            if (x >= (1U << 20)) {
                n = 20U;
            }
            else {
                n = 16U;
            }
        }
    }
    else {
        if (x >= (1U << 8)) {
            if (x >= (1U << 12)) {
                n = 12U;
            }
            else {
                n = 8U;
            }
        }
        else {
            if (x >= (1U << 4)) {
                n = 4U;
            }
            else {
                n = 0U;
            }
        }
    }
    return (uint32_t)clz_lkup[x >> n] - n;
}
```

The CLZ2() function always completes in 17 instructions, when compiled with the IAR EWARM compiler for the ARM Cortex-M0 core.

The CLZ2() implementation takes about 80 bytes for code plus 16 bytes of constant lookup table in ROM. Altogether, the algorithm uses some 100 bytes of ROM.
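
Whichever variant you use, it is worth cross-checking it exhaustively against a naive reference before trusting it in a scheduler or a DSP loop. Here is a minimal test sketch (the `clz_ref()` and `check_clz()` names are mine):

```
#include <stdint.h>

/* Naive reference implementation: shift until x is exhausted. */
static uint32_t clz_ref(uint32_t x) {
    uint32_t n = 32U;
    while (x != 0U) {
        x >>= 1;
        --n;
    }
    return n;
}

/* Cross-check a candidate CLZ over every bit position, plus the
 * all-important zero input; returns 1 on success, 0 on failure. */
static int check_clz(uint32_t (*clz)(uint32_t)) {
    if (clz(0U) != 32U) {
        return 0;
    }
    for (uint32_t b = 0U; b < 32U; ++b) {
        uint32_t x = 1U << b;
        if ((clz(x) != clz_ref(x)) || (clz(x | 1U) != clz_ref(x | 1U))) {
            return 0;
        }
    }
    return 1;
}
```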

I wonder if you can beat the CLZ1() and CLZ2() implementations. If so, please post it in the comments. I would be really interested to find an even better way.

NOTE: In case you wish to use the published code in your projects, the code is released under the “Do What The F*ck You Want To Public License” (WTFPL).