embedded software boot camp

Optimizing for the CPU / compiler

Sunday, June 3rd, 2012 by Nigel Jones

It is well known that standard C language features map horribly on to the architecture of many processors. While the mapping is obvious and appalling for some processors (low-end PICs and the 8051 spring to mind), it’s still not necessarily great at the 32-bit end of the spectrum, where processors without floating point units can be hit hard by C’s floating point promotion rules. While this is all obvious stuff, it’s essentially about what those CPUs are lacking. Where it gets really interesting in the embedded space is when you have a processor that has all sorts of specialized features that are great for embedded systems – but which simply do not map on to the C language view of the world. Some examples will illustrate my point.

Arithmetic vs. Logical shifting

The C language does of course have support for performing shift operations. However, these are strictly arithmetic shifts. That is, when bits get shifted off the end of an integer type, they are simply lost. Logical shifting, sometimes known as rotation, is different in that bits simply get rotated back around (often through the carry bit but not always). Now while arithmetic shifting is great for, well, arithmetic operations, there are plenty of occasions in which I find myself wanting to perform a rotation. Now can I write a rotation function in C? Sure, but it’s a real pain in the tuches.
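
A minimal sketch of such a rotation in plain C, assuming a 32-bit unsigned operand. The names rotl32 and rotr32 are illustrative only. Some compilers recognise this idiom and emit a single rotate instruction, but that is not guaranteed.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bit positions (n is reduced modulo 32). */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    /* The (-n & 31u) form avoids an undefined shift by 32 when n is 0. */
    return (x << n) | (x >> (-n & 31u));
}

/* Rotate a 32-bit value right by n bit positions (n is reduced modulo 32). */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << (-n & 31u));
}
```

Note that this rotates within the word only; a rotate through the carry bit has no equivalent here because C does not expose the carry flag.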

Saturated addition

If you have ever had to design and implement an integer digital filter, I am sure you found yourself yearning for an addition operator that will saturate rather than overflow. [In this form of arithmetic, if the integral type would overflow as the result of an operation, then the processor simply returns the minimum or maximum value as appropriate.] Processors that the designers think might be required to perform digital filtering will have this feature built directly into their instruction sets. By contrast the C language has zero direct support for such operations, which must be coded using nasty checks and masks.
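
To make the checks concrete, here is a sketch of a saturating 16-bit signed addition in standard C. It assumes a wider type (int32_t) is available to hold the intermediate sum; the function name sat_add16 is made up for this example.

```c
#include <stdint.h>

/* Add two signed 16-bit values, clamping the result instead of wrapping. */
static inline int16_t sat_add16(int16_t a, int16_t b)
{
    /* int32_t can hold any sum of two int16_t values. */
    int32_t sum = (int32_t)a + (int32_t)b;

    if (sum > INT16_MAX) {
        return INT16_MAX;
    }
    if (sum < INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)sum;
}
```

When the operands are already the widest type on the target, the widening trick is unavailable and the overflow test has to be done on the operands before adding, which is where the code gets unpleasant.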

Nibble swapping

Swapping the upper and lower nibbles of a byte is a common operation in cryptography and related fields. As a result many processors include this ever so useful instruction in their instruction sets. While you can of course write C code to do it, it’s horrible looking and grossly inefficient when compared to the built in instruction.
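
For reference, the usual plain C formulation looks something like the sketch below (the name swap_nibbles is illustrative). How well it compiles depends entirely on the toolchain.

```c
#include <stdint.h>

/* Exchange the upper and lower 4 bits of a byte. */
static inline uint8_t swap_nibbles(uint8_t x)
{
    /* x is promoted to int before shifting, so the cast back to uint8_t
       discards the bits that were shifted above bit 7. */
    return (uint8_t)((x << 4) | (x >> 4));
}
```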

Implications

If you look over the examples quoted I’m sure you noticed a theme:

  1. Yes I can write C code to achieve the desired functionality.
  2. The resultant C code is usually ugly and horribly inefficient when compared to the intrinsic function of the processor.

Now in many cases, C compilers simply don’t give you access to these intrinsic functions, other than resorting to the inline assembler. Unfortunately, using the inline assembler causes a lot of problems. For example:

  1. It will often force the compiler to not optimize the enclosing function.
  2. It’s really easy to screw it up.
  3. It’s banned by most coding standards.

As a result, the intrinsic features can’t be used anyway. However, there are embedded compilers out there that support intrinsic functions. For example here’s how to swap nibbles using IAR’s AVR compiler:

foo = __swap_nibbles(bar);

There are several things to note about this:

  1. Because it’s a compiler intrinsic function, there are no issues with optimization.
  2. Similarly because one works with standard variable names, there is no particular likelihood of getting this wrong.
  3. Because it looks like a function call, there isn’t normally a problem with coding standards.

This then leads to one of the essential quandaries of embedded systems. Is it better to write completely standard (and hence presumably portable) C code, or should one take every advantage of neat features that are offered by your CPU and (if it is any good) your compiler?

I made my peace with this decision many years ago and fall firmly into the camp of take advantage of every neat feature offered by the CPU / compiler – even if it is non-standard. My rationale for doing so is as follows:

  1. Porting code from one CPU to another happens rarely. Thus to burden the bulk of systems with this mythical possibility seems weird to me.
  2. End users do not care. When was the last time you heard someone extol the use of standard code in the latest widget? Instead end users care about speed, power and battery life. All things that can come about by having the most efficient code possible.
  3. It seems downright rude not to use those features that the CPU designer built in to the CPU just because some purist says I should not.

Having said this, I do of course understand completely if you are in the business of selling software components (e.g. an AES library), where using intrinsic / specialized instructions could be a veritable pain. However for the rest of the industry I say use those intrinsic functions! As always, let the debate begin.

 

37 Responses to “Optimizing for the CPU / compiler”

  1. OJ says:

    I have actually been in a situation where code that uses intrinsic functions was required to be portable, in this case between three different DSP processor architectures from two different manufacturers. Our solution was to create a “DSP math library”, a set of platform-specific macros that wrapped the intrinsics or, in the cases where one of the platforms did not have an intrinsic function, C implementations.
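
    A sketch of how such a wrapper might look for one operation. HAVE_VENDOR_SAT_ADD32 and vendor_sat_add32() are hypothetical placeholders, not names from any real toolchain; each platform header would substitute whatever its compiler actually provides.

    ```c
    #include <stdint.h>

    #if defined(HAVE_VENDOR_SAT_ADD32)
    /* Hypothetical: map the library macro straight onto the vendor intrinsic. */
    #define DSP_SAT_ADD32(a, b) vendor_sat_add32((a), (b))
    #else
    /* Plain C fallback for platforms without a saturating add intrinsic. */
    static inline int32_t dsp_sat_add32_c(int32_t a, int32_t b)
    {
        int64_t sum = (int64_t)a + (int64_t)b;

        if (sum > INT32_MAX) {
            return INT32_MAX;
        }
        if (sum < INT32_MIN) {
            return INT32_MIN;
        }
        return (int32_t)sum;
    }
    #define DSP_SAT_ADD32(a, b) dsp_sat_add32_c((a), (b))
    #endif
    ```

    Application code then calls DSP_SAT_ADD32() everywhere and never mentions the platform.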

  2. Walter Banks says:

    As a compiler developer for embedded systems, your list is a recurring bad dream. We provide processor intrinsics and trap as many sequences as we can find and translate them into the processor intrinsics. Part of this is the reason for creating a file that makes every instruction able to be generated from C.
    http://www.bytecraft.com/C_versus_Assembly

    One of the important points in this post is pointing out that there is no standardized syntax for many common new operators. Saturated arithmetic, common in many applications, remains an implementation hack.

    Walter..

    • Nigel Jones says:

      I feel your pain. When I said it was rude not to use the features that the chip designer had included, I should also have acknowledged the efforts and contributions of the compiler writer.

      • Lundin says:

        On the other hand, one should be able to expect that modern MCUs are designed with an instruction set that has the C language in mind. If the engineer picks some ancient core like PIC or 8051, they only have themselves to blame.

        As for the specific case of nibble swapping, I’m guessing the same code could be written in several ways. The sensible thing for the compiler implementation would then be to write: “if you want fast nibble-swapping, write the code as x = (x>>4) | (x<<4);, if you write it in any other form, then that's not the compiler's problem". I think such an enforced way of coding would be much better than some ugly non-portable __swap() function.

        • Ian says:

          @Lundin -“some ugly non-portable __swap() function”

          Having all the code that makes use of the processor-specific functionality wrapped up in a number of intrinsic function calls makes porting far, far easier. You need to swap nibbles and your new processor does not have a built in swap function, just write your own function to replace the __swap() function in the code. Your function will not be as efficient but it can be a direct replacement and your application has been ported to the new target. Job done.

        • Talulah says:

          I disagree Lundin, your construct seems a lot uglier to me than __swap_nibbles(), and prone to more mistakes, notwithstanding that the syntax on a non-nibble-swapping processor would not actually achieve the desired operation without some masking in there as well!

          Nigel’s other example of saturated arithmetic is a more interesting case – where the ability to do it (and indeed the need to do it – mostly in DSP applications) is more recent than the use of C in embedded applications. For modern applications such as DSP it is extremely likely that useful features will be required that are not supported by ANSI C.

        • Ben Voigt says:

          What do you do when one compiler recognizes “x = (x & 0xF0) >> 4 | (x & 0x0F) << 4” and the other recognizes “x = (x << 4) & 0xF0 | (x >> 4) & 0x0F”?

          The “portable” code is not a more universal way to access the intrinsics.

    • Tim Wescott says:

      Hey Walter. I was going to get onto comp.arch.embedded and see if I could goad you into commenting on this article — I guess I didn’t have to.

      I have (five years ago, with Code Composter for the TMS320F2812 processor) tried various ways to get the compiler to understand the ‘vector dot product with shift, and a saturate at the end’ that’s inherent in the ‘F28xx machine code. I had absolutely no luck. We ended up having one magic file that was written in assembly with our library code to do vector dot products and block floating point matrix multiplies.

      I’m with Lundin on wishing that in such cases there were guidance from the compiler vendor on just what to write that would get the attention of the optimizer. While I’d like an optimizer that could figure out any old thing I spewed onto the page and turn it into the correct code, I’d accept something along the lines of “cut and paste this code” — as long as it takes advantage of the processor’s intrinsic capabilities and still coughs up the same result (however slowly) on any other ANSI-strict processor/toolchain combination, I would achieve at least a qualified level of joy.

  3. hpro says:

    Actually, the example regarding arithmetic or shifting is not really correct. Most decent compilers will *know* which is faster for your particular architecture and do that when you write x = y / 2.

    Check out http://ridiculousfish.com/blog/posts/will-it-optimize.html and especially point no. 5 there.

    Cheers.

    • Nigel Jones says:

      I’m not sure of your point. I wasn’t discussing using shift operations for arithmetic operations (such as divide by 2). Rather I was pointing out that rotation (a related but different operation) is a useful feature in some circumstances.

  4. Lundin says:

    I would say that the arguments against portable code would have been more valid ten years ago. Because somewhere around that time, the MCU market exploded: every major MCU manufacturer now has thousands of different MCUs available and they keep spitting them out at a frantic pace. Regardless of whatever guarantees they may give us about future support for the part, common sense says that they will soon reach the point where they have to start phasing out old MCUs at the same pace as they spit out new ones. Keeping thousands of old MCUs in production and stock is simply bad business. This is the main reason why I think portability will become far more important in the future.

    There is also the issue of re-using and porting code between projects. Suppose you have written a lot of code for some project and that project in itself will never ever get ported. However, there might be a lot of useful code in that project that you could re-use in new projects. If the code is portable, you can simply share it between projects. And you only have to maintain one file instead of two if the projects are supposed to have regular updates.

    Another advantage of portable code is C language and coding standard conformance. Code that is written in standard C can be compiled by any compiler and diagnosed by any analyser tool.

    If you find out that a particular compiler performs nibble swapping poorly, then simply get a new one with better support for the specific MCU. If a compiler performs poorly or comes with bugs, raise a support ticket demanding that it gets fixed. Don’t accept bad quality from your single most important tool! Programmers only have themselves to blame for all the bad compilers out there; if you accept poor quality then poor quality is what you get. The embedded branch is lagging far behind in this aspect: mainstream compilers for PC etc. have far better and more advanced optimizers than the average embedded compiler. Therefore, a PC programmer who writes inline assembler or manual optimization (shift instead of division, down-counting loops etc.) will be dismissed as naive and/or outdated.

    Regarding shifting:

    I come from the Motorola side of things, so I might have strange, biased beliefs… but I always speak of three different shift operations: logical, arithmetic and rotation. All Motorola-ish architectures have instructions for all 3 of them. Logical shift would be the C way: shift in zeroes no matter if it is a left or right shift. Arithmetic shift would also mean shifting in zeroes, but preserving the sign in the MSB during right shifts, i.e. allowing shifting on negative 2’s complement numbers. And rotate would be rotating through a carry bit. At least Wikipedia seems to agree with me, for what that is worth.

    • Talulah says:

      Regarding shifting: You’re right about the nomenclature. However, the C compiler should use an arithmetic shift for signed integers and logical for unsigned integers – at least IAR does.

      • Tim Wescott says:

        Perhaps the C compiler _should_ do an arithmetic shift for signed integers, but not all compilers _do_ — and they are specification-correct, because (at least the last time I checked) in the case of right-shifting signed integers, the result of the shift is “implementation defined”.

        (Actually, I think it’s signed integers with negative values — details, details!!).

        At any rate, to be portable you must check sign and always shift on a positive value — and always shifting on a positive value that’s been cast to unsigned is probably a Really Good idea:

        y = (x < 0) ? -static_cast<int>(static_cast<unsigned>((-x) >> shift)) : (x >> shift);

        There, now isn’t that nasty? (And is it even correct, since it just came out of the Brain of Questionable Accuracy?) And more nasty yet if you can’t use the ? : construction?
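
        For comparison, a C sketch of one way to get a sign-propagating right shift without relying on implementation-defined behaviour. It assumes 32-bit operands and a shift count of 0 to 31; the name asr32 is illustrative. It rounds toward negative infinity, as a hardware arithmetic shift does, which is not the same as negating, shifting and negating again.

        ```c
        #include <stdint.h>

        /* Sign-propagating right shift; shift must be in the range 0..31. */
        static inline int32_t asr32(int32_t x, unsigned shift)
        {
            uint32_t ux = (uint32_t)x;

            if (x < 0) {
                /* ~ux equals -x - 1, which is non-negative and fits in int32_t,
                   so every step below stays within defined behaviour. */
                return -(int32_t)(~ux >> shift) - 1;
            }
            return (int32_t)(ux >> shift);
        }
        ```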

    • Ben Voigt says:

      Rotation may or may not go through a carry, but you’re absolutely right that logical shift is not at all the same as rotation.

      Nigel is wrong when he writes “Logical shifting, sometimes known as rotation”.

  5. Anders says:

    Hi Nigel,
    Just a quick comment on your first rationale: I think it happens quite often for producers of high-volume products to actually change CPU even several times mid-production. From an economical point of view it makes sense; if you can negotiate a better price for a comparable CPU, why not use that CPU instead? (And if you are buying train loads of silicon you definitely have a good bargaining position…) I’ve seen this on at least a handful of occasions with some truly high-volume manufacturers where even a cent up or down in BOM cost makes a huge difference on the bottom line.
    This way of working of course requires some really good planning up front to separate “business logic” from hardware dependencies. But if done right it is simply yet another data point in favor of your position. I.e. if you have gone through the trouble of creating some form of hardware abstraction layer which is ported to every CPU you plan to use there is absolutely no reason to not use the available intrinsic functions to optimize the HAL for each CPU.

  6. Anonymous says:

    For short snippets of embedded assembler (rotations, mask operations etc, too short to be coded as separate functions) I find that macros can give you neat looking source and cross platform consistency. Importing MCU specific headers (which we usually have to do anyway) can contain the appropriate macros. For longer code sequences we can build a library (coded in assembler) to have same effect (I’ve used this to also solve data marshaling issues, differences in low level HW, comms etc). Not built into the language but a viable enough solution.

  7. Eduardo Martinez says:

    I think the same as Anders; trying to write portable code in a portable language, such as C, is very helpful in embedded projects. First of all, trying to write portable code is an effort to think about what is hardware and platform independent, resulting in much neater code; as a matter of experience, the portion of code that changes with hardware and/or platform is not more than 10%.
    Second, as Anders says, it leaves our hands free to change the processor and/or platform of a product or a line of products with speed and ease; it’s true that a product may have a short life in the market, but it is also true that there are products that inherit from old products.
    Third, given that a great portion of the cost of a hardware product is in software, portable and good code can be reused in other projects, speeding up the development process of a development group.
    Last but not least, I may agree that the C language is far from being the “ultimate” language and has some inconsistencies in its construction, but I don’t know of any other language that can replace it for embedded programming.

  8. Steve Karg says:

    Hi Nigel,

    I enjoyed the article about the challenges of using C for embedded development!

    It is fairly trivial to use the intrinsic functions and still create portable C code if the intrinsic functions are documented somewhere as to what they do and their limitations. Simply include a conversion file or header file for the compiler that is missing the intrinsic functions. As long as the intrinsic functions have a C translation, the compilers missing them can use the translation.

    For example, in an open source library code that I write, I have an iar2gcc.h file that includes some intrinsic functions, as well as common macros to encompass the non-standard methods of marking embedded code (i.e. flash, eeprom, interrupts). One such conversion is:

    #define __multiply_unsigned(x,y) ((x)*(y))

    Best Regards,

    Steve

  9. Dave Kellogg says:

    I agree that porting for product purposes might not be done frequently. However, it *is* useful to have C code that can be dual-targeted to be either natively compiled on a PC (for unit testing purposes), or cross-compiled for the embedded target. As has been mentioned earlier, suitable planning and an abstraction layer should make this relatively painless. When testing in the PC environment, performance is rarely important. So inefficient C hacks to work around missing functionality when targeting the PC environment seem to be acceptable.

    Regarding Nigel’s original points: Another ‘missing link’ between the C level and the CPU level is the ability to detect basic overflow (ie, the carry bit) after addition, multiplication, etc. I suppose that a compiler could provide some sort of extension capability to query the recent state of the carry bit, such as via a magic symbol name. But it certainly would not be standard.
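
    For unsigned addition, the carry can at least be recovered after the fact in standard C, as in this sketch (the function name is illustrative). GCC and Clang also offer __builtin_add_overflow, which covers signed types as well, though that is a compiler extension rather than standard C.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Add two unsigned values and report whether a carry out occurred. */
    static inline bool add_u32_carry(uint32_t a, uint32_t b, uint32_t *sum)
    {
        *sum = a + b;      /* unsigned arithmetic wraps modulo 2^32 */
        return *sum < a;   /* a wrapped result is smaller than either operand */
    }
    ```

    Whether the compiler turns the comparison into a simple test of the carry flag is, of course, up to the compiler.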

  10. Harold says:

    The GNU folks have solved part of this problem with a useful set of intrinsics (called builtins) that often map directly into hardware instructions. Things like counting the number of 1-bits in a word, or the number of leading/trailing 0-bits, and also atomic operations for multi-threaded applications. If you stick to these intrinsics, your code will at least be portable to all architectures that have a gcc compiler. The full list is at http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
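
    A small sketch of how these might be used, assuming a GCC-compatible compiler. The macro names are made up for the example; the __builtin_ functions are the ones documented on the page above.

    ```c
    #if defined(__GNUC__)
    /* Number of 1-bits in an unsigned int. */
    #define COUNT_ONES(x)       ((unsigned)__builtin_popcount(x))
    /* Leading / trailing 0-bits in an unsigned int; undefined for x == 0. */
    #define COUNT_LEADING_0(x)  ((unsigned)__builtin_clz(x))
    #define COUNT_TRAILING_0(x) ((unsigned)__builtin_ctz(x))
    #else
    #error "This sketch assumes a GCC-compatible compiler"
    #endif
    ```

    On a target with matching instructions these can compile to one instruction each; elsewhere gcc falls back to a library routine, so the source stays the same either way.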

  11. Bradley says:

    I think this relates more to our notions of what portability is. Special CPU features are just another hardware dependency that can be encapsulated and contained appropriately.

    • Nigel Jones says:

      Thanks for the comment Bradley. I have to say I’m a little surprised at how much support for intrinsic functions has been expressed here. I’m not sure if this is a sampling issue in that people that read this blog do so because it validates their existing world view, or whether there is indeed widespread support for such practices. What I’m particularly intrigued about is that in general I see people trying to abstract the hardware at the low levels of the code. By the time one has got up to the application layer, the hardware is normally a distant fuzzy memory. The intrinsic functions I mentioned can and will be used at the application layer, and so folks’ willingness to use them in this context is very interesting.

      • Rasmus says:

        “What I’m particularly intrigued about is that in general I see people trying to abstract the hardware at the low levels of the code.”

        Depending on the project, there can be good reasons to do so. Right now, I got a private bare metal MCU project going with a very complex application, and I’m developing cross-platform. On the target, I may have a debugger, but that’s it. But since 95% of the application isn’t tied to a specific hardware, I can use GCC on other platforms and concentrate the hardware dependencies in one single file, the HAL. Things that aren’t there in the PC version, like buzzer, LEDs etc, just end in empty stub functions, but the application doesn’t know this.

        That in turn enables me to use GDB, GPROF and the awesome sanitiser feature of recent GCC versions, not to mention the comfort of all the debug printfs. I get far more from the portability for the testing and debug phase than from some intrinsics here and there. Especially since half of the code isn’t even by me (that’s fun to debug..) and the logics are extremely complicated. I might go industrial with the project, but I need a rock-solid prototype since any stability issue during a demonstration would pretty much end the project.

        If that were a commercial project, a huge advantage would be that the SW would be 90% working by the time the HW department would come out with the first PCB. Five years from now, I may consider a HW redesign with the more powerful MCUs that will be available then, and I won’t have too much trouble with porting the stuff.

        Btw, your article on sorting in embedded systems was very useful for that project – the gained speedup was really welcome, so thanks for that one.

  12. Yaniv says:

    Thanks for the article, Nigel. A couple of comments:

    1. This is the first time I have encountered this distinction between logical and arithmetic shift operations, as you described them. From three decades of programming on many architectures, I always regarded logical shift as pushing zero bits into the newly vacant places while discarding the pushed-out bits. Arithmetic shifts are similar, except for duplicating the sign bit (msb) if the shift is to the right. A rotate operation would push the pushed-out bit into the newly vacant position. Alternatively, an extra bit (e.g., the carry bit) is used to receive the pushed-out bit, while its old content is pushed into the register (this arrangement allows the implementation of wide additions, wider than the native machine register size).

    2. Of the cases you described, I think saturated arithmetics is probably the one most difficult to implement (let alone, efficiently implement) in standard C. These operations – filters being a great example for the need – really cry for intrinsic calls, or language extensions.

    3. As of the arguments for using builtins over inline assembly, note that with modern compilers you can use Extended inline assembly that eliminates some of the limitations you mentioned. With this syntax you actually use your C variables instead of the machine registers and let the compiler take care of register selections, interfacing and optimizations.
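
    A sketch of what that looks like with GCC-style extended inline assembly, using a 32-bit rotate on x86 purely as an illustration. The function name is hypothetical, and other targets take the plain C path.

    ```c
    #include <stdint.h>

    /* Rotate x left by n bits (n modulo 32). */
    static inline uint32_t rotl32_asm(uint32_t x, uint8_t n)
    {
    #if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        /* "+r": x is read and written in a register the compiler chooses.
           "c":  n is placed in CL, which the x86 ROL instruction requires.
           "cc": the instruction alters the flags. */
        __asm__("roll %%cl, %0" : "+r"(x) : "c"(n) : "cc");
        return x;
    #else
        /* Portable fallback for every other compiler or architecture. */
        n &= 31u;
        return (x << n) | (x >> (-(unsigned)n & 31u));
    #endif
    }
    ```

    The C variables appear directly in the operand list, and the constraints tell the compiler what the instruction needs, so it can still allocate registers and inline the function.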

    • Nigel Jones says:

      Yaniv – I was a bit too lackadaisical in my nomenclature when it came to the whole shifting / rotating issue. Your understanding is correct.

      • Sidi says:

        Actually, your description between “logical shifting aka rotation” has some historical correlations. There were processors that did NOT have logical shift operation (Rockwell 6502 as example), where the way of performing logical shift right was through pair of CLC and ROR instructions.

  13. David Haile says:

    Sheesh! This is a good article that makes me realize that even after 28 years of embedded software I still have a lot to learn. Probing around on the gcc page linked above I found “typeof”. What could that be? More research is required!

    My assumption is the concepts described here apply to products that ship in the millions and not in the hundreds as mine always do. In the meantime my new products will continue to have an mcu that is two sizes larger than required so that I rarely have to worry about code space or execution time limitations.

    • Tim Wescott says:

      Sometimes one’s low-production project also requires small board space or super-low power, which in turn leaves one painted into a corner vis-a-vis processor performance.

      But on the whole, when I am working on a low production volume project, I also tend to buy big on the processor — saving $2.50 on the processor doesn’t buy much engineering time when it’s only amortised over 1000 units.

  14. Sivan Ramachandran says:

    These days, even the meaning of optimization seems to have morphed into looking entirely within the context of C. Today’s engineers have not been exposed enough to understand / appreciate the value of intrinsics or inline assembly. The debate on maintainability of code usually crushes any such thought, though I personally feel that intrinsics / inline assembly code can be gracefully modularized so as to provide for scalability / portability. Another reason probably is that software engineers (even embedded software engineers) do not understand the underlying processor architecture thoroughly (their project timelines do not account for this in most cases) in order for them to think differently about optimization.

  15. PMeilleur says:

    Every code design should be made as efficient as possible, and this implies using the processor’s hardware capabilities. In my case I created specialized libraries per processor type, and these include all specialized functions (like the hardware CRC generator in an MSP430); whenever hardware “accelerators” are not available, a portable C equivalent is available.
    Portability has more to do with structuring the code than with using or not using specialized hardware features. With today’s versioning tools that support sub-dependencies (not to mention Mercurial subrepos), writing code in non-library style for things that will be reused between projects is simply a waste of time and money.

  16. david collier says:

    I have a lovely file called “typedefs-for-sized-values.h”

    It actually does a bit more than the title suggests.

    First thing it does is to try to use predefined macros to decide:
    what compiler am I using?
    what is the target CPU?

    that allows me to set up…
    #undef SHORT_SHORT_IS_SUPPORTED /* GCC barfs on short short */
    #undef LONG_LONG_IS_SUPPORTED /* CodeVision barfs on long long */
    #undef U64_AND_S64_ARE_SUPPORTED /* assume that u64 and s64 are both supported, or neither is */
    #undef F64_IS_SUPPORTED /* set if 64-bit floats are supported */

    then it defines a set of u8, s8 …. u32,s32,f32
    and, where appropriate u64, f64 etc.
    plus things like:

    #define s16_MAX_as_represented_in_s16 ( 0x7fff )

    then ( isn’t this fun )

    #define U8_PRINT_DESCRIPTOR_TAIL "hhu"
    #define SIZE_PRINT_DESCRIPTOR_TAIL "zu" // linux.die.net says "for size_t or ssize_t"

    for all printf and scanf types supported by the compiler. That really points out the subtle differences between scanf and printf !!

    I’m a bit anal about relying on C re-using bit-patterns as a different type in its own way, so I define a few things like

    typedef union
    {
    u16 viewed_as_u16;
    s16 viewed_as_s16;
    } U16S16;

    so I can show the workings where I do it.

    I’m also paranoid about serialising things for issue on byte-wide serial lines, so I go on to create

    typedef union
    {
    u32 viewed_as_u32;
    u8 viewed_as_u8Array[ 4 ];
    } U32ArrayU8;

    Now that of course is useless unless I can define the big and little endianess on the target CPU…

    So now I define some bricks like:

    /* amount u8 or s8 array index changes for next most significant entry in a little-endian CPU */
    #define BYTE_BYTES_STEP_LITTLE_ENDIAN ( 1 )
    #define BYTE_BYTES_STEP_BIG_ENDIAN ( -1 )

    then I can set up

    #define BYTE_STEP_X86 STEP_LITTLE_ENDIAN
    #define WORD_STEP_X86 STEP_LITTLE_ENDIAN
    #define LWORD_STEP_X86 STEP_LITTLE_ENDIAN

    and from those after a bit of fiddling, I can build

    /* array index for bits 0- 7 of the u16 or s16 value in u8 or s8 array */
    #define X16ArrayX8_BYTE0_INDEX ( BYTE_BYTES_START_MULTIPLIER )
    /* array index for bits 8-15 of the u16 or s16 value in u8 or s8 array */
    #define X16ArrayX8_BYTE1_INDEX ( X16ArrayX8_BYTE0_INDEX + BYTE_BYTES_STEP )

    I told you this was fun…
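
    An alternative that sidesteps the endianness bookkeeping, shown here as a sketch rather than as what the header above does, is to serialise with shifts so the byte order on the wire is fixed by the code and not by the CPU. The function names are illustrative.

    ```c
    #include <stdint.h>

    /* Write value into out[0..3], least significant byte first, regardless
       of the byte order of the CPU running the code. */
    static inline void put_u32_le(uint8_t out[4], uint32_t value)
    {
        out[0] = (uint8_t)(value);
        out[1] = (uint8_t)(value >> 8);
        out[2] = (uint8_t)(value >> 16);
        out[3] = (uint8_t)(value >> 24);
    }

    /* Rebuild the value from four bytes stored least significant byte first. */
    static inline uint32_t get_u32_le(const uint8_t in[4])
    {
        return (uint32_t)in[0]
             | ((uint32_t)in[1] << 8)
             | ((uint32_t)in[2] << 16)
             | ((uint32_t)in[3] << 24);
    }
    ```

    The cost is that it depends on the compiler to notice when the shifts amount to a plain load or store.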

    And now we come to the latest addition to the file – the question is “how big a chunk of memory can the CPU load or store in a single, uninterruptible operation.”
    That matters desperately if you have an interrupt routine incrementing a 16 or 32-bit milliseconds counter.
    Can you read the value outside the interrupt routine in one go, or do you have to turn off interrupts around the operation to avoid getting half the old value and half the new?

    I have MSP430 code which uses 16-bit buffer offsets. If I move that to an ATmega it will fail once-in-a-blue-moon because it assumes a 16-bit load is atomic.
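
    One portable way to read such a counter without touching the interrupt enable, sketched under the assumption that the interrupt only ever increments it and fires far less often than the loop takes to run. The names tick_ms and read_tick_ms are made up for the example.

    ```c
    #include <stdint.h>

    /* Incremented by the timer interrupt (not shown). */
    static volatile uint32_t tick_ms;

    /* Read tick_ms twice and retry if the two reads differ, which means an
       increment landed between (or in the middle of) the reads. */
    static inline uint32_t read_tick_ms(void)
    {
        uint32_t first;
        uint32_t second;

        do {
            first  = tick_ms;
            second = tick_ms;
        } while (first != second);

        return first;
    }
    ```

    On a CPU where the access is already atomic the loop simply runs once, so the same source works on both kinds of target at the price of an extra read.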

    So I now define

    #define volatile_atomic_u16 volatile u16

    and if I am defining a 16-bit variable which my code will assume I can read with interrupts enabled, I use the macro.

    and on a cpu where it isn’t available I define it as

    #define volatile_atomic_u32 #error 32-bit operations not atomic on this CPU

    I COULD disguise this better by using a standard subroutine to access anything 16-bit that is touched by an interrupt, but on the CodevisionAVR that can’t be inlined, and I have decided on balance that it’s clunky.

    I have a file called “typedefs-for-sized-values.c” which includes a very unpleasant routine to check that sizeof( u8 ) really is 1, sizeof( f64 ) is 8, and so on. You can never be too careful.
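
    With a C11 compiler the same checks can be made at compile time instead of at run time. This is a sketch only: the typedefs below are stand-ins for the real header, which is not shown, and f64 is assumed to be double.

    ```c
    #include <stdint.h>

    /* Stand-ins for the typedefs described above. */
    typedef uint8_t  u8;
    typedef uint16_t u16;
    typedef uint32_t u32;
    typedef double   f64;   /* assumes the target's double is 64 bits wide */

    /* The build fails if any size assumption is wrong for this target. */
    _Static_assert(sizeof(u8)  == 1, "u8 must be 1 byte");
    _Static_assert(sizeof(u16) == 2, "u16 must be 2 bytes");
    _Static_assert(sizeof(u32) == 4, "u32 must be 4 bytes");
    _Static_assert(sizeof(f64) == 8, "f64 must be 8 bytes");
    ```

    Older compilers without _Static_assert need a substitute, such as a typedef of an array whose size goes negative when the condition is false.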

    And all of this before I have written a line of portable code….

  17. david collier says:

    Yes, I have understood that C99 covers some of this, but I prefer u16 to uint16_t, and I can’t face all the global exchanges in old code 🙂

  18. david collier says:

    on a point more tightly connected to the OP. I have a need to do some scaling on the ATmega, and have decided to use the fact that it has a hardware multiply, though no hardware divide.
    So I’ve been trying to write code that scales a value by k, using

    scaled = x *( k* 65536 ) / 65536

    Now you would hope that

    u32 x;
    u16 y;

    y = x/65536;

    would simply copy the upper 16 bits of x into y…. well you would wouldn’t you. Fat chance 🙂

    and it is no better if you use >> 16

    I liken this exercise to pushing wet string. Except that you have to have a PhD in string before you even start.
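
    For clarity, the intended computation written out as a sketch, assuming k is a fraction in the range [0, 1) supplied as a 16-bit fixed-point constant (k times 65536). The function name is illustrative, and whether a given compiler turns the final shift into a simple register copy is exactly the open question above.

    ```c
    #include <stdint.h>

    /* Scale x by k, where k_q16 = k * 65536 rounded to an integer. */
    static inline uint16_t scale_u16(uint16_t x, uint16_t k_q16)
    {
        uint32_t product = (uint32_t)x * k_q16;   /* 16 x 16 -> 32-bit multiply */
        return (uint16_t)(product >> 16);         /* keep the upper 16 bits */
    }
    ```

    For example, scaling by one half means passing 32768 as k_q16.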

  19. david collier says:

    I remain grateful that I started out writing 8080 assembler.

    It took a while to learn the difference between logical and arithmetic shifts, and rotates through and past carry.

    The worst bug I ever wrote ( 36 hours with a teletype printout ) was the result of using the wrong one. But at least each had a different name and I had to learn them.

    It’s like “why do we learn Latin/French/German/Spanish/Japanese” – the answer is that you never get forced to work out what a past participle or a gerund, or a pluperfect tense is in your own language these days… the only way you get names for such things is by having to look at the whirling cogs of someone else’s.

  20. This was a good article, and just opinionated enough to fire up discussion.

    The solution I have seen somewhere and then also used myself (I think it was for IP checksums or CRCs or something like this) is to have two implementations of the critical functions, both a standard C implementation as reference and portable fallback, and an optimized assembly version for the current target hardware.

    The standard version is used on any new target hardware and on the PC for reference, to check the assembly code.

    Switching between them was done with an architecture-dependent #define.

  21. Simon Haworth says:

    An interesting article – one which certainly started me thinking…

    We’ve recently embraced TDD as the answer to all of our coding problems (!), which makes some of the comments above regarding portability very helpful.

    However (there’s always a but!), one of the points which was raised by the guy who ran the TDD course was to only optimise when required, thus making us give consideration to only using intrinsics and compiler optimisation when absolutely necessary. This wasn’t really from the viewpoint of portable code, more from not trying to second guess where the bottleneck in the system will be until you actually start testing with real hardware. He gave an interesting example from here – http://www.flounder.com/optimization.htm

    It’s never simple, is it?

    • Nigel Jones says:

      No it isn’t. However if you think of intrinsics as making certain operations easier and thus less prone to error rather than as an optimization strategy then perhaps you can reconcile the two paradigms. I think this is especially true with for example an algorithm that can use saturated addition to greatly simplify the implementation.
