
Fast, Deterministic, and Portable Counting Leading Zeros

Monday, September 8th, 2014 by Miro Samek

Counting leading zeros in an integer is a critical operation in many DSP algorithms, such as normalization of samples in sound or video processing, as well as in real-time schedulers, where it is used to quickly find the highest-priority task that is ready to run.

In most such algorithms, it is important that the count-leading-zeros operation be fast and deterministic. For this reason, many modern processors provide the CLZ (count leading zeros) instruction, sometimes also called LZCNT, BSR (bit scan reverse), FF1L (find first one-bit from left), or FBCL (find bit change from left).

Of course, if your processor supports CLZ or an equivalent in hardware, you should definitely take advantage of it. In C you can often use a built-in function (intrinsic) provided by the embedded compiler. The examples below illustrate the calls for various CPUs and compilers:


y = __CLZ(x);          // ARM Cortex-M3/M4, IAR compiler (CMSIS standard)
y = __clz(x);          // ARM Cortex-M3/M4, ARM-KEIL compiler
y = __builtin_clz(x);  // ARM Cortex-M3/M4, GNU-ARM compiler
y = _clz(x);           // PIC32 (MIPS), XC32 compiler
y = __builtin_fbcl(x); // PIC24/dsPIC, XC16 compiler
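
To keep application code portable across these toolchains, one common approach is to hide the intrinsic behind a single macro and select the proper built-in at compile time. Here is a minimal sketch of such a wrapper (the software fallback CLZ_SOFT() is just a placeholder for one of the implementations shown later in this post, and the preprocessor symbols are the usual compiler-identification macros):

/* portability wrapper: use the compiler intrinsic when available,
 * otherwise fall back to a software implementation;
 * note: __builtin_clz(0) is undefined, so the GNU variant must not
 * be called with a zero argument
 */
#if defined(__GNUC__) || defined(__clang__)
    #define CLZ(x_)  ((uint32_t)__builtin_clz(x_))   /* GCC, Clang, GNU-ARM */
#elif defined(__ICCARM__)
    #define CLZ(x_)  ((uint32_t)__CLZ(x_))           /* IAR EWARM with CMSIS */
#else
    #define CLZ(x_)  CLZ_SOFT(x_)                    /* software fallback */
#endif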

However, what if your CPU does not provide the CLZ instruction? For example, the ARM Cortex-M0 and M0+ cores do not support it. In that case, you need to implement CLZ() in software, typically as an inline function or a macro.

The Internet offers many algorithms for counting leading zeros and for the closely related binary logarithm (floor(log2(x)) = 31 - clz(x) for x > 0; a short code sketch of this relationship follows the list below). Here is a sample of the most popular search results:

http://en.wikipedia.org/wiki/Find_first_set
http://aggregate.org/MAGIC/#Leading Zero Count
http://www.hackersdelight.org/hdcodetxt/nlz.c.txt (from Hacker’s Delight (2nd Edition), Section 5-3 “Counting Leading 0’s”)
http://www.codingforspeed.com/counting-the-number-of-leading-zeros-for-a-32-bit-integer-signed-or-unsigned/
http://stackoverflow.com/questions/9041837/find-the-index-of-the-highest-bit-set-of-a-32-bit-number-without-loops-obviously
http://stackoverflow.com/questions/671815/what-is-the-fastest-most-efficient-way-to-find-the-highest-set-bit-msb-in-an-i
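
For reference, here is a minimal sketch of how the binary logarithm falls out of CLZ(); the name LOG2() is arbitrary and the function assumes a CLZ() implementation (intrinsic or software) is available:

/* floor(log2(x)) expressed through CLZ(); valid only for x != 0,
 * because CLZ(0) == 32 and log2(0) is undefined
 */
static inline uint32_t LOG2(uint32_t x) {
    return 31U - CLZ(x);  /* e.g. LOG2(1) == 0, LOG2(0x80000000) == 31 */
}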

Unfortunately, most of the published algorithms are incomplete, sub-optimal, or both. So I thought it would be useful to post here a complete and, I believe, optimal CLZ(x) function, which is deterministic and outperforms most of the published implementations, including all of the “Hacker’s Delight” algorithms.

Here is the first version:

static inline uint32_t CLZ1(uint32_t x) {
    static uint8_t const clz_lkup[] = {
        32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U,
        28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U,
        27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U,
        27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U,
        24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U
    };
    uint32_t n;
    if (x >= (1U << 16)) {
        if (x >= (1U << 24)) {
            n = 24U;
        }
        else {
            n = 16U;
        }
    }
    else {
        if (x >= (1U << 8)) {
            n = 8U;
        }
        else {
            n = 0U;
        }
    }
    return (uint32_t)clz_lkup[x >> n] - n;
}

This algorithm uses a hybrid approach: bi-section to find which 8-bit chunk of the 32-bit number contains the first 1-bit, followed by a lookup in the clz_lkup[] table to locate the first 1-bit within that byte.
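
Incidentally, the 256-entry clz_lkup[] table does not need to be typed in by hand. Each entry is simply the CLZ of its own index viewed as an 8-bit value, so a small host-side program (a sketch, meant to be run offline to generate the initializer) can print it:

#include <stdio.h>
#include <stdint.h>

/* generate the clz_lkup[] initializer: entry i holds 32 minus the
 * 1-based position of the highest 1-bit of i; entry 0 holds 32
 */
int main(void) {
    for (uint32_t i = 0U; i < 256U; ++i) {
        uint32_t clz = 32U;
        for (uint32_t b = 0U; b < 8U; ++b) {
            if ((i & (1U << b)) != 0U) {
                clz = 31U - b; /* highest set bit found so far */
            }
        }
        printf("%2uU,%s", (unsigned)clz, ((i % 8U) == 7U) ? "\n" : " ");
    }
    return 0;
}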

The CLZ1() function is deterministic in that it always completes in 13 instructions, when compiled with the IAR EWARM compiler for the ARM Cortex-M0 core at the highest level of optimization.

The CLZ1() implementation takes about 40 bytes of code plus a 256-byte constant lookup table in ROM. Altogether, the algorithm uses some 300 bytes of ROM.

If this ROM footprint is too high for your application, you can reduce the lookup table to only 16 bytes at the cost of running the bi-section for one more step. Here is the CLZ2() function that illustrates this tradeoff:

static inline uint32_t CLZ2(uint32_t x) {
    static uint8_t const clz_lkup[] = {
        32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U,
        28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U
    };
    uint32_t n;

    if (x >= (1U << 16)) {
        if (x >= (1U << 24)) {
            if (x >= (1U << 28)) {
                n = 28U;
            }
            else {
                n = 24U;
            }
        }
        else {
            if (x >= (1U << 20)) {
                n = 20U;
            }
            else {
                n = 16U;
            }
        }
    }
    else {
        if (x >= (1U << 8)) {
            if (x >= (1U << 12)) {
                n = 12U;
            }
            else {
                n = 8U;
            }
        }
        else {
            if (x >= (1U << 4)) {
                n = 4U;
            }
            else {
                n = 0U;
            }
        }
    }
    return (uint32_t)clz_lkup[x >> n] - n;
}

The CLZ2() function always completes in 17 instructions, when compiled with the IAR EWARM compiler for the ARM Cortex-M0 core.

The CLZ2() implementation takes about 80 bytes of code plus a 16-byte constant lookup table in ROM. Altogether, the algorithm uses some 100 bytes of ROM.
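
Whichever variant you choose, it is worth verifying it against a trivial, obviously correct reference. Below is a minimal test sketch (the reference function clz_ref() is my own helper, and the code assumes that CLZ1() and CLZ2() from above are visible in the same file); an exhaustive sweep over all 32-bit values is also feasible on a desktop machine:

#include <assert.h>
#include <stdint.h>

/* slow but obviously correct reference: shift until x is exhausted */
static uint32_t clz_ref(uint32_t x) {
    uint32_t n = 32U;
    while (x != 0U) {
        x >>= 1;
        --n;
    }
    return n;
}

int main(void) {
    static uint32_t const tests[] = {
        0U, 1U, 3U, 0xCCU, 0x5U, 0x50000000U, 0xA0000000U,
        0x80000000U, 0xFFFFFFFFU
    };
    for (unsigned i = 0U; i < sizeof(tests) / sizeof(tests[0]); ++i) {
        assert(CLZ1(tests[i]) == clz_ref(tests[i]));
        assert(CLZ2(tests[i]) == clz_ref(tests[i]));
    }
    return 0;
}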

I wonder if you can beat the CLZ1() and CLZ2() implementations. If so, please post your version in the comments. I would be really interested to find an even better way.

NOTE: In case you wish to use the published code in your projects, the code is released under the “Do What The F*ck You Want To Public License” (WTFPL).

 


11 Responses to “Fast, Deterministic, and Portable Counting Leading Zeros”

  1. Bob Snyder says:

    Here’s a way to reduce the effective size of the lookup table to as little as 9 bytes, at the cost of a few more instructions.

    You can use y = x & (-x) to isolate the rightmost bit that is set in a byte. The result will be either 0, 1, 2, 4, 8, 16, 32, 64, or 128. Use that number as an index into a 129-byte lookup table. All of the unused regions of the table can be used to store data needed by other parts of the program. The lookup table could be implemented as a C struct, like this:


    struct HybridLookupTable {
        uint8_t clz0;
        uint8_t clz1;
        uint8_t clz2;
        uint8_t unused1[1];
        uint8_t clz4;
        uint8_t unused3[3];
        uint8_t clz8;
        uint8_t unused7[7];
        uint8_t clz16;
        uint8_t unused15[15];
        uint8_t clz32;
        uint8_t unused31[31];
        uint8_t clz64;
        uint8_t unused63[63];
        uint8_t clz128;
    };
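
    For illustration, here is a minimal standalone sketch (not part of the CLZ code itself) that prints x & (-x) for a few byte values, using unsigned arithmetic so the negation is well defined:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* x & (0U - x) keeps only the lowest set bit,
         * e.g. 0xCC -> 0x04, 0x50 -> 0x10, 0x80 -> 0x80
         */
        static uint8_t const samples[] = { 0x00U, 0x01U, 0x03U, 0xCCU, 0x50U, 0x80U };
        for (unsigned i = 0U; i < sizeof(samples); ++i) {
            uint32_t x = samples[i];
            printf("x = 0x%02X -> x & -x = 0x%02X\n",
                   (unsigned)x, (unsigned)(x & (0U - x)));
        }
        return 0;
    }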

    • Miro Samek says:

      Hi Bob,
      An interesting idea. So, for completeness, here is your modification applied to CLZ1():


      static inline uint32_t CLZ3(uint32_t x) {
          static uint8_t const clz_lkup[] = {
              32U, 31U, 30U, 0U, 29U, 0U, 0U, 0U,
              28U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              27U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              26U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              25U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,
              24U
          };
          uint32_t n;

          if (x >= (1U << 16)) {
              if (x >= (1U << 24)) {
                  n = 24U;
              }
              else {
                  n = 16U;
              }
          }
          else {
              if (x >= (1U << 8)) {
                  n = 8U;
              }
              else {
                  n = 0U;
              }
          }
          x >>= n;
          return clz_lkup[x & -x] - n;
      }

      The CLZ3() code executes in 15 instructions and takes 44 bytes of code plus 129 bytes of lookup table in ROM. In practice, the lookup table will be aligned to the nearest 4-byte boundary, so it will take 132 bytes. With this, the code takes 176 bytes of ROM. I would say it’s pretty good for saving 124 bytes compared to CLZ1(), at the cost of two additional instructions, which probably execute in 1 clock cycle each, since they don’t break the instruction pipeline (as the branch instructions do).

      I’m not sure about the real practical use of the gaps in the lookup table (shown as zeros in the listing above). I would say, these bytes are wasted.

      –Miro

      • Henri says:

        Hi Miro,

        I am not sure whether or not I really understand clz3(), because as far as I can tell clz3() does not report the same result as clz1() does, for ALL cases …

        Executing both cases on my Intel computer resulted in the following output:

        @@ ./clz1
        CLZ1: 1, number of leading zero’s: 31
        CLZ1: 3, number of leading zero’s: 30
        CLZ1: cc, number of leading zero’s: 24
        CLZ1: 80000000, number of leading zero’s: 0
        CLZ1: 5, number of leading zero’s: 29
        CLZ1: 50000000, number of leading zero’s: 1
        CLZ1: a0000000, number of leading zero’s: 0

        @@ ./clz3
        CLZ3: 1, number of leading zero’s: 31
        CLZ3: 3, number of leading zero’s: 31
        CLZ3: cc, number of leading zero’s: 29
        CLZ3: 80000000, number of leading zero’s: 0
        CLZ3: 5, number of leading zero’s: 31
        CLZ3: 50000000, number of leading zero’s: 3
        CLZ3: a0000000, number of leading zero’s: 2

        Again, the output from clz3() is not the same as that of clz1() in ALL cases. Where do I go wrong?

        Henri

        • Miro Samek says:

          Thank you Henri for catching this (now obvious!) error.

          ********
          This means that CLZ3() proposed by Bob is entirely INCORRECT. It just calculates a different thing and, in fact, it should NOT be even called CLZ.
          ********

          –Miro

          • Henri says:

            … although clz3() can be easily modified into a true “counting leading zeros” algorithm, as follows:

            static inline uint32_t CLZ3(uint32_t x) {
                /* ... clz_lkup[] table and bi-section to compute n, as in CLZ3() above ... */
                x >>= n;
            #if 1 // ADDED STATEMENTS
                x |= (x >> 1);
                x |= (x >> 2);
                x |= (x >> 4);
                x++;
            #endif
                return clz_lkup[(x & -x) >> 1] - n; // MODIFIED STATEMENT
            }

            The first three of the added statements turn all bits after the leading zeros into ones …
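
            For example, tracing the added statements on the byte value 0x35 (after the shift by n): the three OR-shifts turn 0x35 into 0x3F, the increment gives 0x40, and (0x40 & -0x40) >> 1 = 0x20, so the lookup reads clz_lkup[32] = 26, which is exactly the value CLZ1() would have read directly at index 0x35.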

            And yes, the number of statements in the modified algorithm will certainly turn out to be higher than the number of statements in clz2(). No doubt.

            Henri

  2. Many thanks for this, Miro. I have looked at this problem (I even bought the Kindle version of Hacker’s Delight) for just the same reason you have (real-time scheduler).

    I haven’t come up with anything better. If I do I will let you know.

  3. Miro Samek says:

    Attached below is another comment from Bob Snyder. (NOTE: I’ve corrected a few bugs in Bob’s original code.)

    *********
    Miro,

    I tried on two occasions to post this to the Embedded Gurus website as a comment, but it was rejected without any message. I can only assume that it exceeded the length limit.

    Here’s another version of the CLZ algorithm that I thought of today. This one uses more memory, and probably results in more instructions. But it completely avoids branch instructions, thereby preventing pipeline stalls (which are non-deterministic).

    Bob Snyder
    ——————————————————
    static inline uint32_t CLZ6(uint32_t x) {
        static uint8_t const b0[] = {
            0,1,2,0,3,0,0,0,
            4,0,0,0,0,0,0,0,
            5,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            6,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            7,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            8
        };

        static uint8_t const b1[] = {
            0,9,10,0,11,0,0,0,
            12,0,0,0,0,0,0,0,
            13,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            14,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            15,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            16
        };

        static uint8_t const b2[] = {
            0,17,18,0,19,0,0,0,
            20,0,0,0,0,0,0,0,
            21,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            22,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            23,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            24
        };

        static uint8_t const b3[] = {
            0,25,26,0,27,0,0,0,
            28,0,0,0,0,0,0,0,
            29,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            30,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            31,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,
            32
        };

        x &= -x; /* isolate leftmost 1-bit in x */

        /* Note: at most one of the b[n] values will be non-zero */
        return 32U
               - b3[(x >> 24) & 0xFFU]
               - b2[(x >> 16) & 0xFFU]
               - b1[(x >> 8) & 0xFFU]
               - b0[x & 0xFFU];
    }

    • Miro Samek says:

      Hi Bob,

      I’ve tested your CLZ6() code under the same conditions as all the other implementations (IAR EWARM 7.10, ARM Cortex-M0, highest level of optimization). The results are as follows (not very good): CLZ6() completes in 24 instructions, of which 3 are NOPs. Indeed, the code is completely linear and there is not a single branch in it. The code takes 64 bytes plus 4*132 bytes for the lookup tables, which totals 592 bytes. This is something of a record for size.

      Overall, I would say that CLZ6() is not optimal.

      –Miro

      • Miro Samek says:

        Per the earlier comment by Henri, CLZ6() is INCORRECT for the same reasons as CLZ3().

        ********
        This means that CLZ6() proposed by Bob is entirely INCORRECT. It just calculates a different thing and, in fact, it should NOT be even called CLZ.
        ********

        –Miro

  4. Henri says:

    Hi Miro,

    http://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer

    helped in finding yet another implementation of the CLZ algorithm:


    static inline uint32_t CLZz(register uint32_t x)
    {
        // Note: clz(x_32) = 32 minus the Hamming_weight(x_32),
        // in case the leading zeros are followed by 1-bits only

        // the next 5 statements turn all bits after the
        // leading zeros into 1-bits
        x |= x >> 1;
        x |= x >> 2;
        x |= x >> 4;
        x |= x >> 8;
        x |= x >> 16;

        // compute the Hamming weight of x ... and return 32
        // minus the computed result
        x -= ((x >> 1) & 0x55555555U);
        x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);
        return 32U - ((((x + (x >> 4)) & 0x0F0F0F0FU) * 0x01010101U) >> 24);
    }

    The last 3 lines can also be written as follows (i.e. without the multiply statement):


    x -= (x >> 1) & 0x55555555U;
    x = ((x >> 2) & 0x33333333U) + (x & 0x33333333U);
    x = ((x >> 4) + x) & 0x0f0f0f0fU;
    x += x >> 8;
    x += x >> 16;
    return 32U - (x & 0x0000003fU);

    Or (even less optimized), as follows:


    x = (x & 0x55555555U) + ((x >> 1) & 0x55555555U);
    x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);
    x = (x & 0x0F0F0F0FU) + ((x >> 4) & 0x0F0F0F0FU);
    x = (x & 0x00FF00FFU) + ((x >> 8) & 0x00FF00FFU);
    x = (x & 0x0000FFFFU) + ((x >> 16) & 0x0000FFFFU);
    return 32U - x;

    The Hamming weight (count of the 1-bits) is computed using sideways addition (and divide and conquer), and is explained here:

    http://en.wikipedia.org/wiki/Hamming_weight
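
    For example, for the byte 0b11010110 the pairwise sums are 10 01 01 01 (that is, 2, 1, 1, 1 one-bits per pair), the nibble sums are 0011 and 0010 (3 and 2), and the final sum is 5, which is indeed the number of 1-bits in the original byte.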

    Whether or not this implementation is useful in your use case, I cannot really tell. Yes, the algorithm is linear (no branch statements), but only you can judge.

    Henri

  5. […] A blog post by Miro Samek about counting leading zeros […]
