Counting leading zeros in an integer number is a critical operation in many DSP algorithms, such as normalization of samples in sound or video processing, as well as in real-time schedulers to quickly find the highest-priority task ready-to-run.

In most such algorithms, it is important that the count-leading zeros operation be fast and deterministic. For this reason, many modern processors provide the CLZ (count-leading zeros) instruction, sometimes also called LZCNT, BSF (bit scan forward), FF1L (find first one-bit from left) or FBCL (find bit change from left).

Of course, if your processor supports CLZ or equivalent in hardware, you definitely should take advantage of it. In C you can often use a built-in function provided by the embedded compiler. A couple of examples below illustrate the calls for various CPUs and compilers:

y = __CLZ(x); // ARM Cortex-M3/M4, IAR compiler (CMSIS standard)

y = __clz(x); // ARM Cortex-M3/M4, ARM-KEIL compiler

y = __builtin_clz(x); // ARM Cortex-M3/M4, GNU-ARM compiler

y = _clz(x); // PIC32 (MIPS), XC32 compiler

y = __builtin_fbcl(x); // PIC24/dsPIC, XC16 compiler

However, what if your CPU does not provide the CLZ instruction? For example, ARM Cortex-M0 and M0+ cores do not support it. In this case, you need to implement CLZ() in software (typically an inline function or as a macro).

The Internet offers a lot of various algorithms for counting leading zeros and the closely related binary logarithm (log-base-2(x) = 32 – 1 – clz(x)). Here is a sample list of the most popular search results:

http://en.wikipedia.org/wiki/Find_first_set

http://aggregate.org/MAGIC/#Leading Zero Count

http://www.hackersdelight.org/hdcodetxt/nlz.c.txt (from Hacker’s Delight (2nd Edition), Section 5-3 “Counting Leading 0’s”)

http://www.codingforspeed.com/counting-the-number-of-leading-zeros-for-a-32-bit-integer-signed-or-unsigned/

http://stackoverflow.com/questions/9041837/find-the-index-of-the-highest-bit-set-of-a-32-bit-number-without-loops-obviously

http://stackoverflow.com/questions/671815/what-is-the-fastest-most-efficient-way-to-find-the-highest-set-bit-msb-in-an-i

But, unfortunately, most of the published algorithms are either incomplete, sub-optimal, or both. So, I thought it could be useful to post here a complete and, as I believe, optimal CLZ(x) function, which is both deterministic and outperforms most of the published implementations, including all of the “Hacker’s Delight” algorithms.

Here is the first version:

static inline uint32_t CLZ1(uint32_t x) { static uint8_t const clz_lkup[] = { 32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U, 28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U, 27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U, 27U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 26U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 25U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U, 24U }; uint32_t n; if (x >= (1U << 16)) { if (x >= (1U << 24)) { n = 24U; } else { n = 16U; } } else { if (x >= (1U << 8)) { n = 8U; } else { n = 0U; } } return (uint32_t)clz_lkup[x >> n] - n; }

This algorithm uses a hybrid approach of bi-section to find out which 8-bit chunk of the 32-bit number contains the first 1-bit, which is followed by a lookup table clz_lkup[] to find the first 1-bit within the byte.

The CLZ1() function is deterministic in that it completes always in 13 instructions, when compiled with IAR EWARM compiler for ARM Cortex-M0 core with the highest level of optimization.

The CLZ1() implementation takes about 40 bytes for code plus 256 bytes of constant lookup table in ROM. Altogether, the algorithm uses some 300 bytes of ROM.

If the ROM footprint is too high for your application, at the cost of running the bi-section for one more step, you can reduce the size of the lookup table to only 16 bytes. Here is the CLZ2() function that illustrates this tradeoff:

static inline uint32_t CLZ2(uint32_t x) { static uint8_t const clz_lkup[] = { 32U, 31U, 30U, 30U, 29U, 29U, 29U, 29U, 28U, 28U, 28U, 28U, 28U, 28U, 28U, 28U }; uint32_t n; if (x >= (1U << 16)) { if (x >= (1U << 24)) { if (x >= (1 << 28)) { n = 28U; } else { n = 24U; } } else { if (x >= (1U << 20)) { n = 20U; } else { n = 16U; } } } else { if (x >= (1U << 8)) { if (x >= (1U << 12)) { n = 12U; } else { n = 8U; } } else { if (x >= (1U << 4)) { n = 4U; } else { n = 0U; } } } return (uint32_t)clz_lkup[x >> n] - n; }

The CLZ2() function completes always in 17 instructions, when compiled with IAR EWARM compiler for ARM Cortex-M0 core.

The CLZ2() implementation takes about 80 bytes for code plus 16 bytes of constant lookup table in ROM. Altogether, the algorithm uses some 100 bytes of ROM.

I wonder if you can beat the CLZ1() and CLZ2() implementations. If so, please post it in the comment. I would be really interested to find an even better way.

NOTE: In case you wish to use the published code in your projects, the code is released under the “Do What The F*ck You Want To Public License” (WTFPL).

Tags: ARM Cortex-M0, CLZ

Here’s a way to reduce the effective size of the lookup table, to as little as 9 bytes, at the cost of a few more instructions.

You can use

`y = x & (-x)`

to isolate the rightmost bit that is set in a byte. The result will be either 0, 1, 2, 4, 8, 16, 32, 64, or 128. Use that number as an index into a 129-byte lookup table. All of the unused regions of the table can be used to store data needed by other parts of the program. The lookup table could be implemented as a C struct, like this:struct HybridLookupTable {

byte clz0;

byte clz1;

byte clz2;

char unused1[1];

byte clz4;

char unused3[3];

byte clz8;

char unused7[7];

byte clz16;

char unused15[15];

byte clz32;

char unused31[31];

byte clz64;

char unused63[63];

byte clz128;

}

Hi Bob,

An interesting idea. So, for completeness, here is your modification applied to CLZ1():

static inline uint32_t CLZ3(uint32_t x) {

static uint8_t const clz_lkup[] = {

32U, 31U, 30U, 0U, 29U, 0U, 0U, 0U,

28U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

27U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

26U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

25U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

0U, 0U, 0U, 0U, 0U, 0U, 0U, 0U,

24U

};

uint32_t n;

`if (x >= (1U << 16)) {`

if (x >= (1U << 24)) {

n = 24U;

}

else {

n = 16U;

}

}

else {

if (x >= (1U << 8)) {

n = 8U;

}

else {

n = 0U;

}

}

x >>= n;

return clz_lkup[x & -x] - n;

}

The CLZ3() code executes in 15 instructions and takes 44 bytes of code and 129 bytes of lookup table in ROM. In practice, the lookup table will be aligned to the nearest 4-bytes so that it will take 132 bytes. With this, the code takes 176 bytes of ROM. I would say, it’s pretty good for saving 124 bytes from CLZ1() at the cost of two additional instructions, which probably execute in 1 clock cycle each, since they don’t break the instruction pipeline (like the branch instructions do).

I’m not sure about the real practical use of the gaps in the lookup table (shown as zeros in the listing above). I would say, these bytes are wasted.

–Miro

Hi Miro,

I am not sure whether or not I really understand clz3(), because I far as I can tell clz3() does not report the same result as clz1() does, for ALL cases …

Executing both cases on my Intel computer resulted in the following output:

@@ ./clz1

CLZ1: 1, number of leading zero’s: 31

CLZ1: 3, number of leading zero’s: 30

CLZ1: cc, number of leading zero’s: 24

CLZ1: 80000000, number of leading zero’s: 0

CLZ1: 5, number of leading zero’s: 29

CLZ1: 50000000, number of leading zero’s: 1

CLZ1: a0000000, number of leading zero’s: 0

@@ ./clz3

CLZ3: 1, number of leading zero’s: 31

CLZ3: 3, number of leading zero’s: 31

CLZ3: cc, number of leading zero’s: 29

CLZ3: 80000000, number of leading zero’s: 0

CLZ3: 5, number of leading zero’s: 31

CLZ3: 50000000, number of leading zero’s: 3

CLZ3: a0000000, number of leading zero’s: 2

Again, the output from clz3() is not the same as that of clz1() in ALL cases. Where do I go wrong?

Henri

Thank you Henri for catching this (now obvious!) error.

********

This means that CLZ3() proposed by Bob is entirely

INCORRECT. It just calculates a different thing and, in fact, it should NOT be even called CLZ.********

–Miro

… although clz3() can be easily modified into a true “counting leading zero’s algorithm”, as follows:

static inline uint32_t CLZ3(uint32_t x) {

…

x >>= n;

#if 1 // ADDED STATEMENTS

x |= (x >> 1);

x |= (x >> 2);

x |= (x >> 4);

x++;

#endif

return clz_lkup[(x & -x) >> 1] – n; // MODIFIED STATEMENT

}

The first three of the added statements turn all bits after the leading zero’s into an one …

And yes, the number of statements of the modified algorithm will certainly turn out to be higher than the number of statements of cli2(). No doubt.

Henri

Many thanks for this, Miro. I have looked at this problem (I even bought the Kindle version of Hacker’s Delight) for just the same reason you have (real-time scheduler).

I haven’t come up with anything better. If I do I will let you know.

Attached below is another comment from Bob Snyder. (NOTE: I’ve corrected a few bugs in the original Bob’s code)

*********

Miro,

I tried on two occasions to post this to the Embedded Gurus website as a comment, but it was rejected without any message. I can only assume that it exceeded the length limit.

Here’s another version of the CLZ algorithm that I thought of today. This one uses more memory, and probably results in more instructions. But it completely avoids branch instructions, thereby preventing pipeline cache misses (which are non-deterministic).

Bob Snyder

——————————————————

`static inline uint32_t CLZ6(uint32_t x) {`

static uint8_t const b0[] = {

0,1,2,0,3,0,0,0,

4,0,0,0,0,0,0,0,

5,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

6,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

7,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

8};

static uint8_t const b1[] = {

0,9,10,0,11,0,0,0,

12,0,0,0,0,0,0,0,

13,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

14,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

15,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

16};

static uint8_t const b2[] = {

0,17,18,0,19,0,0,0,

20,0,0,0,0,0,0,0,

21,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

22,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

23,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

24};

static uint8_t const b3[] = {

0,25,26,0,27,0,0,0,

28,0,0,0,0,0,0,0,

29,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

30,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

31,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

0,0,0,0,0,0,0,0,

32

};

x &= -x; /* isolate leftmost 1-bit in x*/

`/* Note: at most one of the b[n] values will be non-zero */`

return 32U

- b3[(x >> 24) & 0xFFU]

- b2[(x >> 16) & 0xFFU]

- b1[(x >> 8 ) & 0xFFU]

- b0[x & 0xFFU];

}

Hi Bob,

I’ve tested your CLZ6() code under the same conditions as all other implementations (IAR EWARM 7.10, ARM Cortex-M0, highest level of optimization). The results are as follows (not very good): CLZ6() completes in 24 instructions out of which 3 are NOPs. Indeed, the code is completely linear and there is no single branch there. The code takes 64 bytes plus 4*132 bytes for the lookup tables, which totals 592 bytes. This is some kind of a record in big size.

Overall, I would say that CLZ6() is not optimal.

–Miro

Per the earlier comment by Henri, the CLZ6() is

INCORRECTfor the same reasons as CLZ3().********

This means that CLZ6() proposed by Bob is entirely

INCORRECT. It just calculates a different thing and, in fact, it should NOT be even called CLZ.********

–Miro

Hi Miro,

http://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer

helped in finding yet another implementation of the CLZ algorithm:

static inline uint32_t CLZz(register uint32_t x)

{

// Note: clz(x_32) = 32 minus the Hamming_weight(x_32),

// in case the leading zero's are followed by 1-bits only

x |= x >> 1;

// the next 5 statements will turn all bits after the

// leading zero's into a 1-bit

x |= x >> 2;

x |= x >> 4;

x |= x >> 8;

x |= x >> 16;

`// compute the Hamming weight of x ... and return 32`

// minus the computed result'

x -= ((x >> 1) & 0x55555555U);

x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);

return 32U - ( (((x + (x >> 4)) & 0x0F0F0F0FU) * 0x01010101U) >> 24 );

}

The last 3 lines can also be written as follows (i.e. without the multiply statement):

x -= (x >> 1) & 0x55555555U;

x = ((x >> 2) & 0x33333333U) + (x & 0x33333333U);

x = ((x >> 4) + x) & 0x0f0f0f0fU;

x += x >> 8;

x += x >> 16;

return 32U - (x & 0x0000003fU);

Or (even less optimized), as follows:

x = (x & 0x55555555U) + ((x >> 1 ) & 0x55555555U);

x = (x & 0x33333333U) + ((x >> 2 ) & 0x33333333U);

x = (x & 0x0F0F0F0FU) + ((x >> 4 ) & 0x0F0F0F0FU);

x = (x & 0x00FF00FFU) + ((x >> 8 ) & 0x00FF00FFU);

x = (x & 0x0000FFFFU) + ((x >> 16) & 0x0000FFFFU);

return 32U - x;

The Hamming weight (count of the 1-bits) is computed using sideways addition (and divide and conquer), and is explained here:

http://en.wikipedia.org/wiki/Hamming_weight

Whether or not this implementation is useful in your use case, I cannot really tell. Yes, the algorithm is linear (no branch statements), but only you can tell.

Henri

[…] A blog post by Miro Samek about counting leading zeros […]

CLZ function (inlined, no branches ) can be implemented in 8 instructions (16 bytes) + 128 byte LUT . Thus it takes 8 cycles to execute on the Cortex-M0 beating all posted methods by lightyears.

Sounds fantastic. Could you perhaps substantiate your claim by posting the code or by providing a link to the original source?

Thanks a lot!

–MMS

I recently implemented a 64b log_base_2() that uses the same general approach with a few minor twists.

One minor portability issue with your implementation is that you assume unsigned int (i.e. – your integer literals) has 32 bits, which is often untrue on embedded platforms. So, rather than using expressions like (1U << 16), which could invoke UB when unsigned int only has 16 bits (although a decent compiler would probably warn of or might just automatically paper over this situation), instead use fully expressed constants like 0x10000U or just throw a cast in there, like ((uint32_t) 1 << 16).

Rather than >=, I used bitwise AND, with appropriate masks. So, instead of if (x >= ((uint32_t) 1 << 16), I use if (x & 0xFFFF0000U). A smart compiler might translate your >= to my bitwise AND’s anyway.

Finally, a number of platforms have poor shift performance when the size of the shift is variable. Maybe a smart compiler would translate your variable shift into constant shifts, but I just did it explicitly.

Here’s what my version of software clz looks like:

`#include <stdio.h>`

#include <inttypes.h>

#define REPEAT_2x(X) (X), (X)

#define REPEAT_4x(X) REPEAT_2x(X), REPEAT_2x(X)

#define REPEAT_8x(X) REPEAT_4x(X), REPEAT_4x(X)

#define REPEAT_16x(X) REPEAT_8x(X), REPEAT_8x(X)

#define REPEAT_32x(X) REPEAT_16x(X), REPEAT_16x(X)

#define REPEAT_64x(X) REPEAT_32x(X), REPEAT_32x(X)

#define REPEAT_128x(X) REPEAT_64x(X), REPEAT_64x(X)

static const unsigned char Clz_8b[256] =

{

8,

7,

REPEAT_2x(6),

REPEAT_4x(5),

REPEAT_8x(4),

REPEAT_16x(3),

REPEAT_32x(2),

REPEAT_64x(1),

REPEAT_128x(0)

};

static inline unsigned clz_64b(uint64_t x)

{

unsigned base, ms_oct;

if (x & 0xFFFFFFFF00000000U)

if (x & 0xFFFF000000000000U)

if (x & 0xFF00000000000000U)

base = 0, ms_oct = x >> 56;

else

base = 8, ms_oct = x >> 48;

else

if (x & 0x0000FF0000000000U)

base = 16, ms_oct = x >> 40;

else

base = 24, ms_oct = x >> 32;

else

if (x & 0x00000000FFFF0000U)

if (x & 0x00000000FF000000U)

base = 32, ms_oct = x >> 24;

else

base = 40, ms_oct = x >> 16;

else

if (x & 0x000000000000FF00U)

base = 48, ms_oct = x >> 8;

else

base = 56, ms_oct = x >> 0;

return base + Clz_8b[ms_oct];

}

int main()

{

uint64_t x = (uint64_t) -1;

do

{

fprintf(stdout, "%llu -> %u\n", (unsigned long long) x + 1, clz_64b(x + 1));

fprintf(stdout, "%llu -> %u\n", (unsigned long long) x, clz_64b(x));

fprintf(stdout, "%llu -> %u\n\n", (unsigned long long) x - 1, clz_64b(x - 1));

} while (x >>= 1);

`return 0;`

}

Or, if you don’t like encoding all those 8 byte constants (e.g. – code bloat):

`static inline unsigned stdclz(uint64_t x)`

{

unsigned base;

unsigned ms_oct;

uint64_t tmp1, tmp2;

if ((tmp1 = x >> 32))

if ((tmp2 = tmp1 >> 16))

if ((tmp1 = tmp2 >> 8))

base = 0, ms_oct = tmp1;

else

base = 8, ms_oct = tmp2;

else

if ((tmp2 = tmp1 >> 8))

base = 16, ms_oct = tmp2;

else

base = 24, ms_oct = tmp1;

else

if ((tmp2 = x >> 16))

if ((tmp1 = tmp2 >> 8))

base = 32, ms_oct = tmp1;

else

base = 40, ms_oct = tmp2;

else

if ((tmp2 = x >> 8))

base = 48, ms_oct = tmp2;

else

base = 56, ms_oct = x;

`return base + Clz_8b[ms_oct];`

}

Hi,

I am developing an MP3 & ACELP decoder for the M0+ so a LOT of CLZ instructions will be carried out. I found this unusual approach on just 1 site. Currently I am testing to find which byte of the 32 bits the leading zero is in and then using a table. The ARM v6 allows PC-relative addressing but I need to put it in-line. To that end, the core has a zero-page like the old 8-bit days. An immediate 8-bit value so I can shave off a cycle but I really don’t want to waste this as the cycle-saver because multiple other, smaller tables are more efficient.

function mssb30(x)

{

var C = b(’00 100000 100000 100000 100000 100000′);

// Check whether the high bit of each block is set.

var y1 = x & C;

// Check whether the lower bits of each block is set.

var y2 = ~(C – (x & ~C)) & C;

var y = y1 | y2;

// Shift the result bits down to the lowest 5 bits.

var z = ((y >>> 5) * b(‘0000 10000 10000 10000 10000 10000000’)) >>> 27;

// Compute the bit index of the most significant set block.

var b1 = 6 * mssb5(z);

// Compute the most significant set bit inside the most significant

// set block.

var b2 = mssb6((x >>> b1) & b(‘111111’));

return b1 + b2;

}

function mssb32(x)

{

// Check the high duplet and fall back to mssb30 if it’s not set.

var h = x >>> 30;

return h ? (30 + mssb5(h)) : mssb30(x);

}

As you can see, no tables. Now the CPU I’m using can conveniently place a few constants on it’s zero-page but constantly rewriting to accommodate the evolving optimizations is a pain so I’m going to give it a try. The 80×86 has a relatively low number of registers but is very optimized for speculative execution so in that case it may be faster if it IS a subroutine.

For me, an extra 6 cycles for a 14 cycle routine (call & return) isn’t so useful. Other cores will likely fall in between the two. On more powerful ARM cores, it is ideal for placing in the TCM (Tightly Coupled Memory).