embedded software boot camp

Signed versus unsigned integers

Saturday, May 9th, 2009 by Nigel Jones

If you are looking for some basic information on signed versus unsigned integers, you may also find this post useful. That being said, on to the original post…

Jack Ganssle’s latest newsletter arrived the other day. Within it is an extensive set of comments from John Carter, in which he talks about and quotes from a book by Derek Jones (no relation of mine). The topic is unsigned versus signed integers. I have to say I found it fascinating in the same way that watching a train wreck is fascinating. Here’s the entire extract – I apologize for its length – but you really have to read it all to understand my horror.

“Suppose you have a “Real World (TM)” always and forever positive value. Should you represent it as unsigned?

“Well, that’s actually a bit of a step that we tend to gloss over…

“As Jones points out in section 6.2.5 the real differences as far as C is concerned between unsigned and signed are…

” * unsigned has a larger range.

” * unsigned does modulo arithmetic on overflow (which is hardly ever what you intend)

” * mixing signed and unsigned operands in an expression involves arithmetic conversions you probably don’t quite understand.

“For example I have a bit of code that generates code … and uses __LINE__ to tweak things so compiler error messages refer to the file and line of the source code, not the generated code.

“Thus I must do integer arithmetic with __LINE__, including subtraction of offsets and multiplication.

“* I do not care if my intermediate values go negative.

“* It’s hard to debug (and frightening) if they suddenly go huge.

“* the constraint is the final values must be positive.

“Either I must be _very_ careful to code and test for underflows _before_ each operation to ensure intermediate results do not underflow. Or I can say tough, convert to 32bit signed int’s and it all just works. I.e. Line numbers are constrained to be positive, but that has nothing to do with representation. Use the most convenient representation.

“C’s “unsigned” representation is useless as a “constrain this value to be positive” tool. E.g. A device that can only go faster or slower, never backwards:

unsigned int speed; // Must be positive.
unsigned int brake(void)
{
--speed;
}

“Was using “unsigned” above any help to creating robust error free code? NO! “speed” may now _always_ be positive… but not necessarily meaningful!

“The main decider in using “unsigned” is storage. Am I going to double my storage requirements by using int16_t’s or pack them all in an array of uint8_t’s?

“My recommendation is this…

” * For scalars use a large enough signed value. eg. int_fast32_t
” * Treat “unsigned” purely as a storage optimization.
” * Use typedef’s (and splint (or C++)) for type safety and accessor functions to ensure constraints like strictly positive. E.g.

typedef int_fast32_t velocity; // Can be negative
typedef int_fast32_t speed; // Must be positive.
typedef uint8_t dopplerSpeedImage_t[MAX_X][MAX_Y]; // Storage optimization

I read this, and quite frankly my jaw dropped. Now the statements made by Carter / Jones concerning differences between signed and unsigned are correct – but to call them the real differences is completely wrong. To make my point, I’ll first of all address his specific points – and then I’ll show you where the real differences are:

Unsigned has a larger range

Yes it does. However, if this is the reason you are using an unsigned type you’ve probably got other problems.

Unsigned does modulo arithmetic on overflow (which is hardly ever what you intend)

Yes it does, and au contraire – this is frequently what I want (see for example this). However, far more important is the question – what does a signed integer do on overflow? The answer is that it is undefined. That is, if you overflow a signed integer, the generated code is at liberty to do anything – including deleting your program or starting world war 3. I found this out the hard way many years ago. I had some PC code written for Microsoft’s Version 7 compiler. The code was inadvertently relying upon signed integer overflow to work a certain way. I then moved the code to Watcom’s compiler (Version 10 I think) and the code failed. I was really ticked at Watcom until I realized what I had done and that Watcom was perfectly within their rights to do what they did.

Note that this was not a case of porting code to a different target. This was the same target – just a different compiler.

Now let’s address his comment about modulo arithmetic. Consider the following code fragment:

uint16_t a,b,c, res;

a = 0xFFFF; //Max value for a uint16_t
b = 1;
c = 2;

res = a;
res += b; //Overflow
res -= c;

Does res end up with the expected value of 0xFFFE? Yes it does – courtesy of the modulo arithmetic. Furthermore it will do so on every conforming compiler.
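The fragment above can be wrapped into a small function to make the guarantee checkable. This is a minimal sketch; the names add_then_sub_u16 and check_add_then_sub_u16 are hypothetical and not from the original post. The explicit casts only make visible the narrowing that the compound assignments in the fragment perform anyway.

```c
#include <assert.h>
#include <stdint.h>

/* Add b and then subtract c using 16-bit unsigned storage.
   Each intermediate result is reduced modulo 65536 when it is
   narrowed back to uint16_t, so a temporary wrap is harmless. */
uint16_t add_then_sub_u16(uint16_t a, uint16_t b, uint16_t c)
{
    uint16_t res = a;
    res = (uint16_t)(res + b);  /* may wrap past 0xFFFF to a small value */
    res = (uint16_t)(res - c);  /* wraps back again */
    return res;
}

/* Hypothetical self-check mirroring the fragment in the text. */
void check_add_then_sub_u16(void)
{
    assert(add_then_sub_u16(0xFFFFu, 1u, 2u) == 0xFFFEu);
    assert(add_then_sub_u16(0u, 1u, 2u) == 0xFFFFu);
}
```

The second assertion shows the same property from the other side: going below zero and coming back is just as well defined as going above the maximum and coming back.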

Now if we repeat the exercise using signed data types.

int16_t a,b,c, res;

a = 32767; //Max value for a int16_t
b = 1;
c = 2;

res = a;
res += b; //Overflow - WW3 starts
res -= c;

What happens now? Who knows? On your system you may or may not get the answer you expect.
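One detail worth spelling out: on a target where int is wider than 16 bits, the operands are promoted to int before the addition, so the sum itself does not overflow and the hazard becomes the implementation-defined narrowing back to int16_t. On a 16-bit int target the addition overflows outright. Either way the result is not portable. If signed arithmetic really is required, one common defence is to test before the operation rather than after. The following is a sketch only; checked_add_int is a hypothetical helper name, not something from the original post.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: adds a and b only if the sum is representable.
   The range test is done before the addition, so no signed overflow
   ever takes place. */
bool checked_add_int(int a, int b, int *sum)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b)) {
        return false;           /* would overflow: *sum is left untouched */
    }
    *sum = a + b;
    return true;
}
```

Note how much ceremony is needed to do safely what the unsigned version gets for free.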

Mixing signed and unsigned operands in an expression involves arithmetic conversions you probably don’t quite understand

Well whether I understand them or not is really between me and Lint. However, the key thing to know is that if you use signed integers by default, then it is really hard to avoid combining signed and unsigned operands. How is this you ask? Well consider the following partial list of standard ‘functions’ that return an unsigned integral type:

  • sizeof()
  • offsetof()
  • strcspn()
  • strlen()
  • strspn()
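To see why this matters, here is a small sketch of what happens when a signed value meets the size_t returned by strlen(). The function names longer_than_naive and longer_than are hypothetical, chosen for illustration. The cast in the naive version is exactly the conversion the compiler applies implicitly when you write strlen(s) > limit; it is spelled out here only to make it visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Naive comparison. Writing `strlen(s) > limit` converts limit to
   size_t implicitly; the cast below just makes that conversion visible.
   A negative limit therefore becomes a huge unsigned value. */
bool longer_than_naive(const char *s, int limit)
{
    return strlen(s) > (size_t)limit;
}

/* Explicit version: deal with the negative case first, then compare
   two unsigned quantities. */
bool longer_than(const char *s, int limit)
{
    return limit < 0 || strlen(s) > (size_t)limit;
}
```

With limit equal to -1, the naive version reports that "hello" is not longer than -1 characters, which is mathematically absurd. The explicit version gives the expected answer.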

In addition memcpy(), memset(), strncpy() and others also use unsigned integral types in their parameter lists. Furthermore in embedded systems, most compiler vendors typedef IO registers as unsigned integral types, so any expression involving a register also includes unsigned quantities. Thus if you use any of these in your code, then you run a very real risk of running into signed / unsigned arithmetic conversions. IMHO the usual arithmetic conversions issue is therefore an argument for avoiding signed types – not the other way around!

So what are the real reasons to use unsigned data types? I think these reasons are high on my list:

  • Modulus operator
  • Shifting
  • Masking

Modulus Operator

One of the relatively unknown but nasty corners of the C language concerns the modulus operator. In a nutshell, under C89/C90, using the modulus operator on signed integers when one or both of the operands is negative produces an implementation-defined result. C99 and later pin the behaviour down (division truncates toward zero, so the remainder takes the sign of the dividend), but a negative remainder still catches a lot of code out. Here’s a great example in which they purport to show how to use the modulus operator to determine if a number is odd or even. The code is reproduced below:

int main(void)
{
 int i;

 printf("Enter a number: ");
 scanf("%d", &i);

 if( ( i % 2 ) == 0) printf("Even");
 if( ( i % 2 ) == 1) printf("Odd");

 return 0;
}

When I run it on one of my compilers, and enter -1 as the argument, nothing gets printed, because on my system -1 % 2 = -1. The bottom line – using the modulus operator with signed integral types is a disaster waiting to happen.
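For comparison, a sketch of an odd/even test that does not depend on the sign of the remainder. The names is_odd and is_odd_unsigned are hypothetical. The first compares against zero only, which holds whether -1 % 2 comes out as -1 (C99 and later) or as 1 (permitted under the older rules). The second sidesteps the question by working on an unsigned value.

```c
#include <assert.h>
#include <stdbool.h>

/* Works for negative values: the remainder of an odd number divided
   by 2 may be 1 or -1 depending on the language version and compiler,
   but it is never 0. */
bool is_odd(int i)
{
    return (i % 2) != 0;
}

/* Alternative that tests the lowest bit of an unsigned value. */
bool is_odd_unsigned(unsigned int u)
{
    return (u & 1u) != 0u;
}
```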

Shifting

Performing a shift right on a signed integer is implementation-defined when the value is negative. What this means is that when you shift right you have no idea whether the vacated high-order bits are filled with zeros or with copies of the sign bit. The implications of this are quite profound. For example, if foo is an unsigned integral type, then a shift right by one bit is equivalent to a divide by 2. However, if foo is a signed type, then a shift right is most certainly not the same as a divide by 2 – and will generate different code. It’s for this reason that Lint, MISRA and most good coding standards will reject any attempt to right shift a signed integral type. BTW while left shifts on signed types are safer, I really don’t recommend them either.
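A short sketch of the difference, using the hypothetical names half_u32 and half_i32. The unsigned version can use a shift with a clear conscience. The signed version simply divides, which states the intent and has a defined result.

```c
#include <assert.h>
#include <stdint.h>

/* For unsigned operands, shifting right by one is exactly division by 2. */
uint32_t half_u32(uint32_t x)
{
    return x >> 1;
}

/* For signed operands, say what you mean and divide. Since C99 the
   division truncates toward zero, so -7 / 2 is -3. An arithmetic right
   shift of -7, where the implementation provides one, yields -4 instead. */
int32_t half_i32(int32_t x)
{
    return x / 2;
}
```

The -3 versus -4 discrepancy is why a compiler cannot quietly turn a signed divide by 2 into a bare shift, and why the generated code differs between the two cases.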

Masking

A similar class of problems occurs if you attempt to perform masking operations on a signed data type.
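One illustration of this class of problem, chosen here as an example rather than taken from the post, is sign extension when a signed byte takes part in a bitwise expression. The sketch below assembles a 16-bit word from two bytes. The names make_word_naive and make_word are hypothetical, and the naive version assumes a two's complement int of at least 32 bits, which is what most current targets provide.

```c
#include <assert.h>
#include <stdint.h>

/* Signed low byte: lo is promoted to int with sign extension, so a
   negative lo sets every upper bit and wipes out the high byte. */
uint16_t make_word_naive(uint8_t hi, int8_t lo)
{
    return (uint16_t)((hi << 8) | lo);
}

/* Unsigned throughout: no sign extension, so the bytes simply sit
   side by side in the result. */
uint16_t make_word(uint8_t hi, uint8_t lo)
{
    return (uint16_t)(((unsigned int)hi << 8) | lo);
}
```

With hi equal to 0x12 and a low byte whose bit pattern is 0xFF, the unsigned version yields 0x12FF, while the signed version (where that bit pattern means -1) yields 0xFFFF.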

Finally…

Before I leave this post, I just have to comment on this quote from Carter

“Either I must be _very_ careful to code and test for underflows _before_ each operation to ensure intermediate results do not underflow. Or I can say tough, convert to 32bit signed int’s and it all just works”.

Does anyone else find this scary? He seems to be advocating that rather than think about the problem at hand, he’d rather switch to a large signed data type – and trust that everything works out OK. He obviously thinks he’s on safe ground. However consider the case where he has a 50,000 line file (actually 46342 to be exact). Is this an unreasonably large file – well yes for a human generated file. However for a machine generated file (e.g. an embedded image file), it is not unreasonable at all. Furthermore let’s assume that his computations involve for some reason a squaring of the number of lines in the file: i.e. we get something like this:

int32_t lines, result;

lines = 46342;
result = lines * lines + some_other_expression;

Well 46342 * 46342 overflows a signed 32 bit type – and the result is undefined. The bottom line – using a larger signed data type to avoid thinking about the problem is not recommended. At least if you use an unsigned type you are guaranteed a consistent answer.
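For reference, the largest value whose square still fits in an int32_t is 46340; 46341 squared is 2147488281, which already exceeds INT32_MAX (2147483647). A sketch of both approaches follows, with hypothetical names square_fits_i32 and square_u32. The unsigned version assumes int is no wider than 32 bits, as on typical embedded targets, so that the uint32_t operands are not promoted to a signed type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if x * x fits in an int32_t.
   The check divides instead of multiplying, so it cannot overflow. */
bool square_fits_i32(int32_t x)
{
    if (x == 0) {
        return true;
    }
    if (x == INT32_MIN) {
        return false;           /* negating INT32_MIN would itself overflow */
    }
    int32_t mag = (x < 0) ? -x : x;
    return mag <= INT32_MAX / mag;
}

/* Unsigned 32-bit square: the result is the true product reduced
   modulo 2^32 (assuming int is at most 32 bits wide). */
uint32_t square_u32(uint32_t x)
{
    return x * x;
}
```

For 46342 the unsigned square is 2147580964, which fits in 32 unsigned bits without wrapping at all. For larger inputs it wraps, but it wraps the same way on every compiler.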

17 Responses to “Signed versus unsigned integers”

  1. steve says:

    A very interesting and thought-provoking article! I prefer to use unsigned values as much as possible, mainly to avoid ambiguities (and hence Lint warnings) when doing masking and shifting, as you mention. Will you be sending any of your comments to Jack Ganssle for inclusion in his next newsletter? I also think this would make an interesting discussion point on the other Stack Overflow website.

  2. Nigel Jones says:

    Hi Steve. I’ve sent an email to Jack Ganssle, pointing him to this posting. Since Jack has linked to this blog in the past I expect that there will be some sort of reply in the next newsletter.

  3. Peter B says:

    Singed ints? Whenever possible I use unsigned int. But then again I do mostly assembly on $0.40 8-bit machines that don’t have enough flash ROM to use a debugger. Has anybody looked at signed and unsigned 16-bit addition? The code is the same. 100% the same. In every compiler I’ve seen it’s the same. The fun stuff comes when signed 16-bit numbers are needed on an 8-bit machine. I remember writing DTMF decode using the Goertzel Algorithm on an 8051. PITA. A compiler could not generate code that ran fast enough for the application. Then later on a 72MHz ARM Cortex-M3 I used (software emulated) floating point. Piece of cake. And it was fast enough to handle 4 separate telephones. (And the 32-bit machine only cost ~2x the fast 8051.)

  4. AndersH says:

    A very good article, as always! I’m baffled, to say the least by the content of the article you quote…

  5. Nigel Jones says:

    I’m glad I’m not alone! I’ve heard from Jack Ganssle and so I expect he will mention this in next month’s newsletter. It will be interesting to see what his readers think.

  6. GroovyD says:

    i almost exclusively use unsigned unless i have to generate a signed result and only in those specific calculations i pay mind to the mixing of the datatypes (typically i convert unsigned to signed knowing the unsigned are small enough to convert without overflow). Since I have started using unsigned it is surprising to me actually how little needs to be done signed. In fact using unsigned is usually faster and cleaner since you do not have to check for < 0 on array indexes. I credit lint for pushing me in this direction.

  7. TK says:

    I found this discussion while sanitizing my code for release. I stick to unsigned integer types as much as possible to avoid undefined/implementation-dependent behavior. It simplifies the rules, a la MISRA. Then I treat all signed operations as exceptions that need to be scrutinized more rigorously. I noticed a typo that should be corrected: The words “integer” and “integral” both pass through a spell-checker, but they are very different animals. Only integers, be they signed/unsigned are under consideration. If I had a dime for every time that I mistakenly swapped the two terms… Anyhow, integrals belong to a completely different programming/engineering/mathematics discussion. Please edit the article to avoid confusion.

  10. Matt says:

    (Moderator: apologies, please disregard the previous two attempts)

    The following is a good example of one of the implications of signed overflow being undefined (asm output via gcc):

    long foo_s(long x){while(x < x + 1) {x += 1;} return x;}

    compiles to:

    foo_s:
    .L6:
    jmp .L6

    whereas:

    unsigned long foo_u(unsigned long x){while(x < x + 1){x += 1;} return x;}

    compiles to:

    foo_u:
    movq %rdi, %rax
    jmp .L2
    .L3:
    movq %rdx, %rax
    .L2:
    leaq 1(%rax), %rdx
    cmpq %rdx, %rax
    jb .L3
    rep
    ret

  11. Narayan says:

    Hi Nigel,
    You have mentioned in the article that you also do not recommend left shifting of signed variables.
    Can you explain why exactly?

    • Nigel Jones says:

      I think MISRA summarizes it nicely:

      Rule 12.7 (required): Bitwise operators shall not be applied to operands whose underlying type is signed.

      Bitwise operations (~, <<, <<=, >>, >>=, &, &=, ^, ^=, |, |=) are not normally meaningful on signed integers. Problems can arise if, for example, a right shift moves the sign bit into the number or a left shift moves a numeric bit into the sign bit.

  12. Alexander says:

    First I have to say that your blog is awesome. I always use unsigned data types where possible since I had a problem with my compiler converting -1 to unsigned 255 without any reason plausible for me. I just wonder why you do not also use unsigned notation 0u, 0xFFu, … and so on because it makes it easy to find numbers that should be signed and also makes sure the compiler uses the right data type.

  13. Steve Holt says:

    In C99 and C++11 the behaviour of % with signed integers is defined by the standard. So -1 % 2 = -1 is correct according to the current standards for C and C++. See Section 6.5.5 item 6 of http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

    That said, I would still avoid code using modulo on signed integers since although the result is well defined, I’d still be surprised to get remainders that are negative.

  14. John Carter says:

    Ah… My ancient sins are uncovered.

    Yup. That was me commenting on Jack’s article back in 2009.

    So it’s now 2018 and I have just discovered your reply…..

    Hmm. Have I changed my mind?

    I first discovered “Nasal Daemons” thanks to John Regehr’s excellent 2010 article…

    https://blog.regehr.org/archives/213

    …I think I had the warm comfortable feeling, that every compiler I had used in the previous decades, had done something consistent and vaguely sane with signed integer overflow… surely, surely no committee of sober wise men would standardize on Nasal Daemons? Surely not?

    I can only conclude they were neither sober nor wise.

    Surely no compiler writer would predicate an optimization on this sad fact?

    Alas, as the decade progressed, they have done so more and more.

    So now I’m fully aware of this fact, how would I modify my advice?

    Having done a lot in the Ruby / Scheme type languages where overflows just do the obvious sane correct thing (promote to a type that can hold the result)… and then working quite a bit in the C/C++/D world where do I stand now?

    Can I point at any place where the Ruby / Scheme numeric tower saved my bacon?

    Hmm. It certainly gives the mathematician in me very nice warm fuzzy feelings, it certainly slows all of my code down very slightly, but nope.

    Can’t recall a case where it saved my bacon.

    I can recall one bug I wrote in C around clock constants and timers where the numeric tower would have saved me.

    In C/C++ world, UBSan has come along, gcc’s static warnings have become hugely stricter.

    …And yes, I have run all our code (millions of lines) through the most severe checks I can find, inspected what they were and cleaned up the result.

    Have I saved the company’s bacon with UBSan and stricter warnings?

    Hmm. If I’m brutally honest with myself, the bugs I found and fixed were all, every one, zero customer impact. (In the sense none were currently being actively complained about by real customers).

    So what is happening here, why is a decade of experience at war with these big red flags around undefined behaviour?

    Whenever I try to _prove_ code is correct, overflow is a total nightmare; it turns something that is hard to do into something that is nightmarishly complicated. Proving and reasoning about code is just so so so much simpler in a Scheme-like numeric tower.

    Here are some thoughts about what is driving the practicalities of this issue….

    * (Almost) nothing in the real world grows exponentially without bound, so doubling the word size until it works…. let’s face it. It Works in everyday life. You can prove that the code is buggy, or has reduced range, but for all physically possible cases… it now just works. (This is especially true for any code dealing with real hardware.)

    * Given the “double word size until it works” approach used by ye average programmer, unsigned underflows are much much much more common in daily practice than overflows.

    * The effort put into testing in the last decade (continuous integration, unit testing, fuzzing..) has been more effective than UBSan at catching bugs customers care about.

    * The value of unit tests has been massively amplified, especially by valgrind, but also the sanitizers.

    So where are we now?

    * Signed overflow/underflow is _never_ what we want.
    * Unsigned overflow/underflow is what we want, in daily practice, only in a very very rare set of circumstances.
    * Unsigned underflow, in daily practice, is probably several orders of magnitude more likely than overflow of any flavour.

    So what do I recommend now?

    Prime Recommendation: Where possible, choose a language where the language designer has some guts. eg. Scheme / Ruby / D. (Yes, D’s integer rules are the same mess, but they are at least trying to make it safer)

    C and C++ is like juggling with chain saws.

    Seriously. I bet, even here in 2018, if I did a survey of good practicing C/C++ programmers, most of them _would_ get integer promotion rules wrong.

    Recommendation 2: Unless you explicitly want the unsigned overflow/underflow behaviour, or as a storage optimization, signed int’s will, in daily practice, burn you less often.

    Recommendation 3: Don’t trust yourself, let the compiler tell you via warnings when you’re on thin ice, and then fix it.

    Recommendation 4: Unit test and valgrind.

    Recommendation 5: In practice I use unsigned for things that are intrinsically unsigned like addresses and sizes, gcc’s warnings about mixing signed and unsigned are pretty fierce these days. (Possibly this behaviour of mine is driven by size_t and warnings!)

    Recommendation 6: Use the UBSans.

    If I were to climb into a time machine and hide the C committee’s whisky bottles, what would I ask for before I gave them back?

    Let’s take a step back.

    Consider the humble “assert(expr)”.

    What does it mean?

    Personally I take a hardline Design by Contract view.

    It means, “Unless expr is true, nothing beyond this point will work anyway, you may as well dump and die right here and now and fix your code, and everything beyond this point must be coded under the expectation that expr is true”.

    Implicitly after every modification of an int is an “assert( (INT_MIN <= i)&& (i <=INT_MAX))"

    When you are compiling and running with UBSan, it will do that check for you.

    When you are compiling with optimization on, it elides the checks, and optimizes _based_ on the assumption that the assert holds true.

    Personally I wish _all_ asserts were treated with the same respect by compilers.

    Ye olde Pascal range types were a fairly sane idea….. and would have solved the insane "how big is an int" insanity.

    The failure of pascal range types to "take over the world" reveals something about how we program.

    We are actually more desperately concerned with storage "sizeof" than dynamic range. We clearly spend more brain cycles on how many bytes something is than what its actual range is.

  15. John Carter says:

    ps: I got interested in what I was responding to when I replied to Jack…. ie. The context.

    I believe it was http://www.ganssle.com/tem/tem178.htm

    > Paul Carpenter sent this:
    …..
    > “My main bug bear with types is signedness and lack of using it correctly. For instance:

    > “1/ Standard libraries and language usage have signed when they should have unsigned:
    > File Handles
    > __LINE__ macro
    > Array indices (the actual memory index for associative arrays)
    > Database record numbers
    > Social Security numbers
    > Badge numbers
    > ID numbers

    > “Have you ever seen a negative one?

    > “I would love to see the negative line number for source code! What is a negative file handle?

    > “2/ Variables relating to real world physics that are impossible when something is Absolute it is UNSIGNED, and Relative it is SIGNED, e.g. Light is absolute, Reflection of light is relative to source and reflective, surface, taken from ABSOLUTE measurements/values.

    ie. My opening statement was to point out “open()” does indeed return negative file handles. Hideous, but true.

    My main point is whatever our real world physical intent with a type is, it’s divorced from CPU reality by being “whatever the C standard and compiler writer has defined and implemented that type as”. ie. Declaring a variable as “unsigned” may express our intent… but that intent is secondary in fact to whatever the compiler does with it. The speed of light is unsigned, but doesn’t wrap like an unsigned int. ie. A cpu int type is always a faulty model for physical reality. Therefore your choices in choosing types should always be dominated by the CPU reality, not physical reality.
