The type of char + char (knatten.org)
57 points by alexchamberlain on Oct 16, 2019 | 45 comments

I wouldn't describe this as "no-one knows" the type of char + char.

I know what the type of char + char is. I know that it's either int or unsigned int, depending on the ranges of values supported by types char and int. I know what it is for any given implementation. And I know that it's int, not unsigned int, for every implementation I've ever used or am likely to use.

Implementation-defined features are not some unsolvable mystery. They're just implementation-defined.
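To make that concrete, here is a minimal C++11 sketch that asks the compiler for the type of char + char. The alias names `CharSum` and `Expected` are made up for illustration. The rule it encodes is the one described above: char promotes to int when int can represent every char value, and to unsigned int otherwise.

```cpp
#include <climits>
#include <type_traits>

// The type the compiler actually gives to char + char.
using CharSum = decltype(char{} + char{});

// What the promotion rule predicts: int if int can hold every char value,
// otherwise unsigned int.
using Expected = std::conditional<(CHAR_MAX <= INT_MAX), int, unsigned int>::type;

static_assert(std::is_same<CharSum, Expected>::value,
              "char + char has the promoted type");
```

On the common desktop and server implementations `Expected` should come out as int, but the check itself does not assume that.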

And we can count 99.999% of used implementations on one hand. If you’re on a strange platform, there’s a reason for that, and chances are you’re uniquely aware of any differences or will be writing assembly.

This confirms one of the guidelines I've always been taught: char is not an arithmetic type, and never treat it as such. It represents ASCII characters, and nothing else.

I disagree, mostly.

char is an arithmetic type, but it rarely makes sense to treat it as one, because its signedness is implementation-defined. If you want a very narrow integer type, both signed char and unsigned char are arithmetic types, and can reasonably be used that way. (Arrays of unsigned char are also used for raw memory.)

And you should understand how char, signed char, and unsigned char behave when you do use them as arithmetic types.

Promotion to int or unsigned int, depending on the range of the type, can be confusing. The same applies to all integer types with lower rank than int, including short, unsigned short, and intN_t and uintN_t for N==8 (and probably for N==16, and maybe for larger N).
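A small C++11 sketch of that rule for several narrow types. `Promoted` is a hypothetical helper alias, not a standard one; it relies on unary plus applying the integer promotions and nothing else.

```cpp
#include <climits>
#include <type_traits>
#include <utility>

// Hypothetical helper: the type T has after integer promotion.
template <typename T>
using Promoted = decltype(+std::declval<T>());

// int can always represent every signed char and short value.
static_assert(std::is_same<Promoted<signed char>, int>::value, "signed char -> int");
static_assert(std::is_same<Promoted<short>, int>::value, "short -> int");

// unsigned short goes to int only if int can hold USHRT_MAX.
static_assert(std::is_same<Promoted<unsigned short>,
                           std::conditional<(USHRT_MAX <= INT_MAX), int, unsigned int>::type>::value,
              "unsigned short -> int or unsigned int");

// Types of rank int and above are left alone.
static_assert(std::is_same<Promoted<unsigned int>, unsigned int>::value, "no promotion");
```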

Note also that this:

    char c = '0';
    c++;
is guaranteed to set c to '1'. (This guarantee applies only to decimal digits, not to letters.)

The compiler doesn't agree with you.

    #include <type_traits>
    static_assert(std::is_arithmetic<char>::value, "char is arithmetic");

'0' through '9' are guaranteed to be contiguous, so doing arithmetic around that fact is legitimate. And char does not generally represent ASCII characters; it could be some other charset.
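As an illustration of digit arithmetic that leans only on that contiguity guarantee, here is a minimal sketch. The function names are invented for this example, and the preconditions in the comments are the caller's responsibility.

```cpp
// Hypothetical helpers that rely only on '0'..'9' being contiguous.
constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Precondition: is_digit(c).
constexpr int digit_value(char c) { return c - '0'; }

// Precondition: 0 <= d && d <= 9.
constexpr char digit_char(int d) { return static_cast<char>('0' + d); }

static_assert(digit_value('7') == 7, "distance from '0' is the digit's value");
static_assert(digit_char(3) == '3', "and the mapping works in reverse");
```

The same trick with 'a' through 'z' would not be portable, since letters carry no such guarantee.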

(u)int8_t have the same problems, including aliasing, because they are just aliases of (un)signed char. Sometimes it's nice to have modular arithmetic mod 256, or a compact memory layout for e.g. count sketches.

If int8_t exists[1], then you know that char is 8 bits[2] and therefore know that char in char + char always promotes to int because int must have at least 16 value bits and a 16-bit int can represent any char value regardless of signedness.

[1] int8_t is not required.

[2] char is the fundamental unit of addressability. sizeof(char) always evaluates to 1, sizeof(int8_t) must be non-0, char must be at least 8 bits, and int8_t must be precisely 8 bits, therefore sizeof(int8_t) == sizeof(char) and CHAR_BIT == 8.
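That chain of reasoning can be written down as compile-time checks. This is only a sketch: it uses the `INT8_MAX` macro as the test for whether int8_t exists, and `kHasInt8` is a name invented here.

```cpp
#include <climits>
#include <cstdint>
#include <type_traits>

#ifdef INT8_MAX  // defined by <cstdint> only when int8_t is provided
static_assert(CHAR_BIT == 8, "int8_t exists, so char is exactly 8 bits");
static_assert(sizeof(std::int8_t) == sizeof(char), "int8_t occupies one char");
static_assert(std::is_same<decltype(char{} + char{}), int>::value,
              "an 8-bit char always fits in int, so it promotes to int");
constexpr bool kHasInt8 = true;
#else
constexpr bool kHasInt8 = false;  // nothing can be concluded on this platform
#endif
```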

I would use this for anything < int/unsigned. A short * short can result in a signed integer overflow and that is UB.

Why do we need to be able to add two characters again?

  char toupper(char c) {
    if (c >= 'a' && c <= 'z') {
      return (c - 'a') + 'A';
    } else {
      return c;
    }
  }

It's worth mentioning explicitly that while `c - 'a'` is the more obvious application of character addition in your example, `c >= 'a'` is another one that's even more common. Pretty much everyone immediately understands that we want to be able to sort characters.

Yeah- but the problem of poorly-defined result type isn't present for comparison operators, since a bool is a b... oh, hold on, C. Since an int is an int. Sigh.

In C, 'a' is an int so most of those are not char/char operations.

In C++, 'a' is a char and the comparison result is a bool, though it doesn't really make a difference in that function.

When you're sorting, you're generally comparing two variables to each other, as opposed to comparing one variable to a literal constant.

Adding two characters isn't strictly needed for that -- you're relying on the assumption that (c - 'a') is of type character, but it's actually the offset between two characters. The rules for those two types would be:

char + char = invalid

char + offset = offset + char = char

offset + offset = offset

char - char = offset

char - offset = char

offset - char = invalid

offset - offset = offset

Given that, (c - 'a') + 'A' is perfectly valid without adding two characters.

edit: formatting
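Those rules can be sketched in C++ with two wrapper types. `Char`, `Offset`, and `to_upper` are hypothetical names invented for this example, and `to_upper` assumes an ASCII-like charset and a lowercase input.

```cpp
// Hypothetical wrapper types for the rules listed above.
struct Char   { char value; };  // a point: one character
struct Offset { int value; };   // a distance between two characters

constexpr Offset operator-(Char a, Char b)     { return Offset{a.value - b.value}; }
constexpr Char   operator+(Char c, Offset o)   { return Char{static_cast<char>(c.value + o.value)}; }
constexpr Char   operator+(Offset o, Char c)   { return c + o; }
constexpr Char   operator-(Char c, Offset o)   { return Char{static_cast<char>(c.value - o.value)}; }
constexpr Offset operator+(Offset a, Offset b) { return Offset{a.value + b.value}; }
constexpr Offset operator-(Offset a, Offset b) { return Offset{a.value - b.value}; }
// No operator+(Char, Char) and no operator-(Offset, Char): those do not compile.

// (c - 'a') + 'A' with the types spelled out.
constexpr Char to_upper(Char c) { return (c - Char{'a'}) + Char{'A'}; }

static_assert(to_upper(Char{'c'}).value == 'C', "offset + char yields a char");
```

With these definitions, `Char{'a'} + Char{'A'}` is rejected at compile time because no overload accepts two `Char` operands.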

That relationship is valid for ASCII, and for character sets derived from ASCII, but it's not guaranteed by the language. In particular, in EBCDIC the alphabet is non-contiguous.

Absolutely- this code only works with ASCII-ish charsets.

You showed an example of subtracting a character from another. The GP asked for an example of adding two characters.

  (c - 'a') + 'A'
contains both

I thought about saying

   (c + 'A') - 'a'
to make this more clear, but I think that's actually UB with signed chars- e.g. for c='a', 'a'+'A' exceeds the range of a signed 8-bit value!

Promotion should save us here, but that's a bit too yikes-y for my comfort.

It does not. Subtracting a char from a char involves usual arithmetic conversions as well and the result is typically an int. Next, you have addition between an int and a char.
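A short sketch of the point about conversions, in C++11. `PromotedChar` and `to_upper_reordered` are names made up for this example. Note that in C++ the literals 'a' and 'A' have type char, so both checks really are char-with-char operations.

```cpp
#include <climits>
#include <type_traits>

// Promotion rule: int when int can hold every char value, else unsigned int.
using PromotedChar = std::conditional<(CHAR_MAX <= INT_MAX), int, unsigned int>::type;

// Both intermediate results already have the promoted type, not char.
static_assert(std::is_same<decltype(char{} - 'a'), PromotedChar>::value, "char - char");
static_assert(std::is_same<decltype(char{} + 'A'), PromotedChar>::value, "char + char");

// The reordered expression from the thread. The sum c + 'A' is computed in
// the promoted type; only the final conversion narrows back to char.
constexpr char to_upper_reordered(char c) {
    return static_cast<char>((c + 'A') - 'a');
}

static_assert(to_upper_reordered('a') == 'A', "the intermediate sum does not overflow");
```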

The Erlang is Perlilous:

  toUpper(char) -> $ + char.   % It's "$ "

That's a good question, now that we have [u]int8_t, [u]int16_t, etc. for explicit bitness values. (Although both can have more than the specified number of bits on some platforms.)

But `uint16_t + uint16_t` has exactly the same problem -- if it's a typedef for `unsigned short`, there will be promotion to either `int` or `unsigned int`.

A multiplication `uint16_t * uint16_t` can still cause an overflow after promotion to signed int, which is undefined behavior! So "unsigned types wrap around" doesn't apply to `uintN_t`, because you can never know for sure whether those types are "smaller than int" and thus get promoted to signed types when you do any arithmetic.

Of course, in practice this just means: every C and C++ program relies on tons of implementation-defined behavior. An `int` wider than 32 bits would break most code in existence (e.g. hash code computations using `uint32_t`).
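One common way to sidestep the `uint16_t * uint16_t` problem is to convert the operands to unsigned int before multiplying, so the arithmetic never happens in a signed type. A minimal sketch, with `mul_u16` as a hypothetical helper name:

```cpp
#include <cstdint>

// Hypothetical helper: multiply two uint16_t values modulo 2^16.
// Converting to unsigned int first keeps the multiplication unsigned, so it
// wraps around instead of overflowing a signed int.
constexpr std::uint16_t mul_u16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(static_cast<unsigned int>(a) *
                                      static_cast<unsigned int>(b));
}

static_assert(mul_u16(0xFFFF, 0xFFFF) == 0x0001, "65535 * 65535 mod 65536 == 1");
```

Writing `a * b` directly would promote both operands to int on a platform with 32-bit int, and 65535 * 65535 does not fit there.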

> A multiplication `uint16_t * uint16_t` can still cause an overflow after promotion to signed int

The last time something similar bit me was when the platform had a 16-bit int... so just adding two int16_t values can very well cause int overflow.

> because you can never know for sure whether those types are "smaller than int" and thus get promoted to signed types when you do any arithmetic

You can deduce the width (number of sign + value bits) of the standard integer types from their limits (e.g. INT_MAX, INT_MIN, etc). The problem has been that this is non-trivial if not impossible to do from the preprocessor. The next C standard will include width constants (e.g. INT_WIDTH) for the standard integer types.
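In C++ the same deduction is available at compile time through `std::numeric_limits`, though that does not help the C preprocessor case described above. A sketch, with `width_of` as a hypothetical helper name:

```cpp
#include <climits>
#include <limits>

// Hypothetical helper: width (sign bit plus value bits) of an integer type.
// numeric_limits<T>::digits counts value bits only.
template <typename T>
constexpr int width_of() {
    return std::numeric_limits<T>::digits +
           (std::numeric_limits<T>::is_signed ? 1 : 0);
}

static_assert(width_of<unsigned char>() == CHAR_BIT, "unsigned char uses every bit");
static_assert(width_of<int>() >= 16, "int has at least 16 bits");
```

Comparing `width_of<T>()` against `width_of<int>()` then tells you whether a given narrow type is small enough to be promoted to signed int.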

In practice programs on these weird large-int machines would just use `-fwrapv` and move on.

No, uint8_t is guaranteed to be exactly 8 bits wide (unlike uint_fast8_t or uint_least8_t).

String forming.

How would you form strings by adding 'char' type values together? This is not about concatenation operation, we're talking about C/C++. They do not have syntactic sugar concat for chars.

(In C++, you of course have operator overloading, that's how std::string concat sugar works.)

'1' + '1' == 'b'. Because 49 + 49 == 98. ASCII '1' == 49, and 'b' == 98.
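Spelled out as a compile-time check, assuming an ASCII-compatible execution character set (the checks are written so they pass trivially on any other charset):

```cpp
// '1' + '1' is evaluated after promotion, so the result is a number,
// not a two-character string.
constexpr int sum = '1' + '1';

// Assumption: ASCII-compatible charset, where '1' == 49.
static_assert('1' != 49 || sum == 98, "49 + 49 == 98");
static_assert('1' != 49 || sum == 'b', "98 is the code for 'b' in ASCII");
```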

I found some insightful comments below the post:

>I think that char – char should definitely be legal. The distance between characters is well defined. Same for char + numeric. Both logically makes sense. I think a good analogy might be floors in a building. Asking what’s the distance between the second and seventh floor makes sense, or what’s two floors above the 4th. But the question ‘what’s the 5th floor plus the 6th floor’ doesn’t make sense.

>Affine space describes these kinds of relationships in mathematics. E.g. position and displacement in n dimensions, or count and offset in buffers, even timestamp and duration.

I agree with the article. Here are discussions of related problems with C/C++ arithmetic promotions and overflow:



I've been writing C for 25 years... and while I technically know "the answer," it's effectively a closed door in my mind because I don't always know where my code will end up.

A sadistic part of me would prefer if it was interpreted as a bitwise and... not because that's good or reasonable or smart... but to punish the behavior. But then that backfires when people use it for underhanded code.

Yes, yes. The spec is filled with anachronisms that are no longer pertinent in today's machines. char + char gets promoted to int every time in today's compilers. Try it out here: https://godbolt.org/z/V5HEvV

There is what the standard says, and there is what people actually do. If everyone promotes char to int in practice, then any machine where this doesn't happen is going to have a tough time running the bulk of code out there.

In a standards committee, the standard is the standard, in practice common practice is the standard.

50-50 anachronisms vs flexibility to allow C to run on novel machines that we don't currently envision. Sure, it would be nice for developers on today's machines to reduce it to the conventional subset.

If by novel you also imply compatible, then sure. The moment you create a machine that's incompatible with the conventions adopted by the most popular compilers and architectures, you break a ton of software built upon those conventions, and sink your hardware in the market because it's a portability nightmare.

Specs don't matter beyond the conventions they inspire.

A machine where char is as large as int is unlikely in practice as it isn't very useful. C11 (at least) defines INT_MIN/MAX as covering at least the range of an int16_t type.

That said, the int promotion alone may be surprising / nonobvious to some people (it was to me, when I learned about it!).

Also, if char is signed, char + char may be UB and with known overflowing values the compiler may deduce it's a can't-happen situation, generating code accordingly. Or when encountered at runtime, it may hose your program state arbitrarily, etc.

There are hardly any systems relevant today for which adding two char would result in an unsigned int. So basically just treat it as int and call it a day.

Or if you're using one, you're likely very aware of that fact and don't suddenly discover it from a blog post.

Does it matter if it is signed or unsigned int or char? Bitwise it contains the same amount of information. That's the most important thing.

