
Does anyone know why "char" is unsigned on ARM/gcc?

To me it seems like a weird design choice that only complicates porting software from x86.




Why would you expect char to be signed?

If you mean just because it's signed on x86, fair enough; but it sounds as if you think signed is just a more natural option. My intuition goes the other way, for what it's worth.

Anyway, here is one possible reason. It used to be that if you wanted to load a single byte from memory on ARM, the only way to do it always treated it as unsigned. So if you wanted to work with signed chars, you needed explicit extra instructions to do the sign-extension. This isn't true for more recent versions of the ARM ISA -- there's an LDRSB instruction to go with the older LDRB -- but it may be one reason why that choice was originally made.
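
Roughly, the difference looks like this. The function names are just illustrative, and the instruction sequences in the comments are my recollection of older ARM code generation rather than actual compiler output:

    int load_unsigned(const unsigned char *p)
    {
        return *p;  /* zero-extends; maps straight onto LDRB */
    }

    int load_signed(const signed char *p)
    {
        return *p;  /* on pre-LDRSB cores this needed LDRB plus explicit
                       sign-extension (e.g. a shift left followed by an
                       arithmetic shift right) */
    }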


I'd expect char to be signed, because all other integer types are signed by default.


"char" has something else unusual going on. If you compare it to a "unsigned char" in an if-condition, and compile with -Wall -Wextra, you'll get a warning about casting. If you compare it to a "signed char" on the same system, you'll still get the warning! In fact "char" is not considered to be exactly the same as "signed char" or "unsigned char", it has 3 variants! Even though logically it must be one or the other on a particular platform. So you could think of char as mostly characters, whether ascii, or iso8859-1, or utf8 code-units ...

Functions like toupper() take an int but say "the value must be representable as an unsigned char", so technically you need to cast to "unsigned char" on most platforms, though I doubt anybody uses these functions for anything but ascii. They may work for single-byte locales like iso8859-1 etc. if you have the locale env vars set right, but they won't work for non-ascii chars encoded as utf-8, which is generally what you want to use these days. (There's towupper(), which works with 2-byte locales like UCS-2, i.e. utf-16 without surrogate pairs, which can't represent all the newer unicode chars; but you probably don't want to go there, you want a modern unicode library that works properly with utf-8 or UCS-4.)
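
The cast in question looks like this (just a minimal sketch; the function name is made up):

    #include <ctype.h>

    int upper(char c)
    {
        /* without the cast, a byte >= 0x80 in a string is a negative
           value when char is signed, and passing a negative value
           (other than EOF) to toupper() is undefined behaviour */
        return toupper((unsigned char)c);
    }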


towupper() works with wide chars, which aren't necessarily 2 bytes. In fact, on UNIX-like systems wchar_t tends to be 32 bits and wide chars are usually UCS-4.


I honestly didn't know that ... I've never actually used the libc wchar_t functions :)


Good, because nobody should use wchar_t at all. It's the API that was thought up by people who got drunk and asked themselves "how can we make this char situation even worse?" wchar_t is widely recognized as one of those huge mistakes from the 90s, along with UCS-2. Today you should store strings as bytes using UTF-8, and if you need to handle them in a fixed-width format you should choose an explicit 32-bit-wide type.
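
Concretely, something like this (assuming C11, which is where char32_t and the u8/U literal prefixes come from; the variable names are just illustrative):

    #include <uchar.h>   /* char32_t */

    static const char     *bytes = u8"na\u00efve";  /* UTF-8 encoded bytes */
    static const char32_t *fixed = U"na\u00efve";   /* one 32-bit code point
                                                       per element */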


Except when you’re on Windows and have to use WCHAR to handle Unicode characters, because they use UCS-2, not UTF-8.


On Windows WCHAR is defined to hold a 16-bit unicode character, and is defined to be unsigned. In standard C wchar_t can be any damn thing, and isn't even guaranteed to be wider than char. It can be signed or unsigned. It is useless.


I know you're technically right, but it still seems bizarre to me to ever use char as an integer. If I wanted a byte-sized integer, I would use int8_t/uint8_t (today, anyway).

The only use for char that ever seemed intuitively reasonable to me was to hold ascii characters.


char has another use. I've always figured char pointers were the proper way to provide byte-level access to other objects, since char pointers are allowed to alias other pointer types. That is, they're not bound by strict aliasing rules. I don't think int8_t or uint8_t have the same special exception.

This means you could implement your own versions of memcpy, fread and fwrite by casting the void * arguments to char *, but if you cast them to uint8_t *, your code might not be correct.
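
Something along these lines (just a sketch of the idea, not a tuned implementation; the name is made up):

    #include <stddef.h>

    void *my_memcpy(void *dst, const void *src, size_t n)
    {
        char *d = dst;        /* char * may alias any object type, so    */
        const char *s = src;  /* copying the bytes of *src this way does
                                 not break strict aliasing               */
        while (n--)
            *d++ = *s++;
        return dst;
    }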


That rule also applies to unsigned char and signed char. In practice I think uint8_t and int8_t are usually just typedefs for these respectively, but in principle they needn't be, so you're correct that the aliasing exemption might not apply to those.

I would tend to prefer explicitly using unsigned char or signed char rather than plain char though, partly to signal that I am treating the bytes as integers rather than characters. (Actually I would still use uint8_t even though I just learnt it might not be unsigned char, because it looks clearer to my eyes, but I'm not sure I should admit it here...)


Prior to C99 adoption, char was usually how you got an int8_t. And since ascii characters are technically 7 bits, should the extra bit be a sign bit?

That's all just having fun, I like the consistency argument and the fact (is it a fact?) that char is signed on most platforms.


> char was usually how you got an int8_t

Not sure if that's just a typo, but you would use a signed char, which is a different type to char even on implementations where char is signed. Part of the reason for this, of course, is because char can be unsigned so if you want a signed integer you have to specify that. But more philosophically, unsigned char and signed char are numerical types that are not meant to be characters (despite their names), whereas char is a character type that just happens to be backed by an integer.

Indeed I believe that int8_t is almost always just a typedef for signed char (but I still would use int8_t where available for clarity).
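
If you're curious what int8_t actually is on a given implementation, C11's _Generic makes it easy to check. Every implementation I've seen prints "signed char", but the standard doesn't promise it:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* _Generic dispatches on the (distinct) type of the expression */
        puts(_Generic((int8_t)0,
            signed char: "signed char",
            char:        "plain char",
            default:     "some other signed 8-bit type"));
        return 0;
    }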


I'd expect them to go by the standard. "The C Programming Language" has this to say:

> Whether plain chars are signed or unsigned is machine-dependent, but printable characters are always positive.

Kernighan, Brian W. The C Programming Language (p. 36). Pearson Education. Kindle Edition.


Well, char has two semantic meanings. Either as a raw byte, or as an ASCII value. Both are represented as unsigned values (at least conceptually), so making them unsigned-by-default is fairly reasonable. Integers in mathematics and common usage are signed, so making them signed-by-default is also fairly reasonable.

But if you think of char as a typedef for [u]int8_t, then I do get the consistency argument.


> I'd expect char to be signed, because all other integer types are signed by default.

But why would you expect char to represent a generic integer in the first place?

It's just a wrapper for bytes, which by their nature are just bits, devoid of traditional mathematical, numerical value.


Performance. Historically, ARM didn’t have a “load byte and sign extend” instruction (http://www.drdobbs.com/architecture-and-design/portability-t...), making loading a signed char and promoting it to an int slower than loading an unsigned char and promoting it to an int.

In C, a function argument or return value of type char gets promoted to int. So, code that uses char a lot does a lot of such promotions.


> In C, a function argument or return value of type char gets promoted to int. So, code that uses char a lot does a lot of such promotions.

I think you're confusing standard C and machine/implementation-specific behavior. (If not for yourself, for people who read your comment.)


It's worse than that: arithmetic is never done in types narrower than int, so even using + with a char type will do such a promotion.
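
You can see the promotion directly (the "1 4" in the comment assumes a typical system where int is 4 bytes):

    #include <stdio.h>

    int main(void)
    {
        char a = 1, b = 2;
        /* the operands of + are promoted to int, so a + b has type int */
        printf("%zu %zu\n", sizeof a, sizeof (a + b));  /* e.g. "1 4" */
        return 0;
    }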


Signedness of "char" is purely platform-specific. If you want to write portable code, you always have to specify "signed" or "unsigned" before "char".


Why not use {,u}int{8,16,32,64}_t from stdint.h? Then you don't have to think twice whether some type is signed or unsigned.


stdint.h was introduced in C99, yes? There's a lot of code that started before that and hasn't been converted to newfangled things.


> newfangled things

stdint.h is nearing 20 years old now, I don't think it counts as a "newfangled thing" anymore.

But yes, there is plenty of very obsolete software still out there.


> stdint.h is nearing 20 years old now, I don't think it counts as a "newfangled thing" anymore.

In the world of C, it is. Custom compilers for particular embedded systems, such as MCUs, often only speak C89 and nothing else. C99 still hasn't found complete and widespread acceptance.


The compiler is irrelevant here, as stdint.h can come from a variety of sources. There's no reason not to have stdint.h, even on ancient-as-dirt embedded systems stuck on C89. Even in the worst case where there's no vendor-provided one, you can still define it yourself using a copy from almost anywhere else; there's very little to it.


stdint.h can't reliably be premade. A compiler is completely free to, say, define sizeof(unsigned int) to anything as long as it can represent at least 65535. stdint.h has to work with the C base types, so it does have to be customized for a particular compiler.
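
To be clear about what "customized" means: a hand-rolled subset is basically just a handful of typedefs like the hypothetical ones below, but every line encodes an assumption about one specific compiler (here: 8-bit char, 16-bit short, 32-bit long), which is exactly what you have to verify per target:

    /* hypothetical hand-rolled subset of stdint.h for one specific compiler */
    typedef signed char      int8_t;
    typedef unsigned char    uint8_t;
    typedef signed short     int16_t;
    typedef unsigned short   uint16_t;
    typedef signed long      int32_t;
    typedef unsigned long    uint32_t;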


That's assuming your compiler even ships any kind of library outside of a header for your hardware's specific functions.


Because these are optional. Consider {,u}int_least{8,16,32,64}_t instead.


Technically optional, yes. Any implementation that lacks them is in for a very bad time, though.


> Why not use (...)

Because sometimes you are not the author of the software you want to port. Of course, when you start porting the software, you'd use a (predefined) typedef.


The only argument I can see for using other types is for int, when you want the natural-length integer; in any other case it seems to me that the stdint.h types are better?


Using the natural integer whenever possible keeps the C code abstract. The program becomes less limited as it is ported to more capable machines with bigger integers, instead of continuing to pretend that everything is a 32 bit i386.

You need some low-level justification for using an uint32_t and such: like conforming to some externally imposed data format or memory-mapped register bank.

The justification for <stdint.h> is that it's better to have one way of defining these types in the language, than every program and every library in every program rolling its own configuration system for detecting types and the typedefs which name them. Let's see, for calling Glib, we use guint32, for OpenMax we use OMX_U32, ...

Funny how these situations persist almost 20 years after stdint.h was standardized (and several more years after it first appeared in draft form).


Well, except that I don't think most software authors actually know where it is possible to safely use an int versus one of the fixed-width stdint types. In particular, you now need to make sure that your code works correctly no matter what the actual size of an int is. This involves complicated knowledge like the integer promotion rules and how they interact with different sizes of int, long int, etc. So instead of having portable software, you just have software that may fail in unexpected ways on different types of machines. I don't actually know the rationale, but I would think that making it easier to write portable software was possibly one of the goals of introducing stdint.
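
The classic example of the kind of surprise I mean (everything here is standard C, but the outcome depends on the width of int; the function name is made up):

    #include <stdint.h>

    uint32_t mul16(uint16_t a, uint16_t b)
    {
        /* written naively as "return a * b;" this is well-defined where
           int is 16 bits (the operands stay unsigned and wrap), but it is
           undefined behaviour where int is 32 bits, because both values
           promote to signed int and 0xFFFF * 0xFFFF overflows it */
        return (uint32_t)a * b;
    }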


The promotion rules only get worse when you use a type alias like int32_t, which could be a typedef for short, int, long or conceivably even char.

An expression of type int doesn't undergo any default promotion to anything, period; the widening promotion is only applied to the char and short types.

An int operand may convert to the type of an opposite operand in an expression, and an int argument or return value will convert to the parameter or return value type. That applies to int32_t also.

Anyway, you really have to know the rules to be working in C. Someone who uses fixed-width types for everything doesn't know what they are doing and is just taking swipes at imaginary ghosts in the dark out of fear.


> The program becomes less limited as it is ported to more capable machines with bigger integers, instead of continuing to pretend that everything is a 32 bit i386.

I'm guessing you've never ported code that was written for 32-bit ints back to a 16-bit architecture.

Always better to make your type sizes crystal-clear to the reader, IMO, even if it risks using a suboptimal word size on some other platform down the line.


> You need some low-level justification for using an uint32_t and such: like conforming to some externally imposed data format or memory-mapped register bank.

This is backwards. You should use uint32_t etc. like any higher-level language would, unless you have specific reasons to know you need a machine-sized type. Making your code behave differently in some arbitrary, untested way on machines with different default int sizes isn't going to make it "less limited", it's just going to make it broken.


Can you cite one page in the K&R2 where the authors are using some 32-bit-specific integer? Or any other decent C book: Harbison and Steele? C Traps and Pitfalls?


If you don't care about signedness of char, using it might be better for performance. Otherwise, yes, you're right.


Because a char is meant to be used as a character type, not a numerical type, and sign doesn't mean anything in that context? Yes, I know people use it to mean byte (even though in most cases the compiler will promote it to 16 bits or whatever the word length of the platform is), but if you mean to use an integer, you should use an integer. The only time I can think of when you would use char as an integer is on 8-bit microprocessors with limited RAM/storage (which, given they're 8-bit, is probably the case). But if there are other use-cases, or my understanding is wrong (I haven't done any serious C/low-level work in 20 years), please do correct me.



