There's one more potential gotcha in your example that has burned me a few times.
You have a function that takes char *endptr and passes *endptr to isspace(). At least on the particular version of the MSVC compiler I usually use, char is signed by default, and isspace uses an ASCII lookup table internally, presumably 256 bytes long.
So if anything passes a so-called "negative ASCII" value to your routine, memory will be accessed prior to the beginning of the table. Usually that just causes a crash, but the implications could go well beyond that. Whenever I use any of the is...() functions, I've learned to cast the value to unsigned char so the index can't go negative. The bug caused a lot of reports from non-US-based users, who end up with ANSI (extended, or 'negative' ASCII) characters in various places.
Obviously there are plenty of facepalms to go around here, ranging from the C standard's failure to disallow signed chars, to certain compilers' adoption of them, to Microsoft's failure to protect their standard C library functions' inputs, to my own negligence in not realizing the implications of the foregoing. Definitely something to watch out for in your own code.
As I write this, it occurs to me that they might actually use a 128-byte LUT. So I should either stop using iswhatever() functions altogether or write my own wrapper for them....
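For concreteness, here's a minimal sketch of what such a wrapper might look like (safe_isspace is just an illustrative name, not a standard function):

    #include <ctype.h>

    /* Convert the possibly-signed char to unsigned char before calling
     * isspace(), so a "negative ASCII" value can never turn into a
     * negative table index. */
    static int safe_isspace(char c)
    {
        return isspace((unsigned char)c);
    }

The same one-line cast works for the rest of the is...() family.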
Isspace takes an int (signed) and its documentation says:
The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF.
IOW, you invoked undefined behavior. That's one of the problems with the C standard library: it expects callers to be much more careful than the language encourages you to be. So many footguns. It doesn't help that implementations of the standard library assume callers have done their due diligence, mostly for speed reasons that were valid 20+ years ago.
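To make the contract concrete, here's a hedged sketch of the two call patterns it anticipates: values coming from getchar() are already either EOF or an unsigned-char value, while a plain char has to be converted first.

    #include <ctype.h>
    #include <stdio.h>

    void demo(const char *s)
    {
        /* Fine: getchar() returns either EOF or an unsigned-char value,
         * which is exactly the domain isspace() is specified for. */
        int c = getchar();
        if (c != EOF && isspace(c))
            puts("read a whitespace character");

        /* Undefined on signed-char platforms when *s is negative:
         *     isspace(*s);
         * Defined: convert to unsigned char first. */
        while (*s != '\0' && isspace((unsigned char)*s))
            s++;
    }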
If isspace is a function designed to take a character, and the default character type is incompatible with that function's argument, then that is arguably a bug in the specification.
And given that behavior, the documentation's failure to call it out explicitly is borderline malicious. That assumes the authors even knew about it, which I guess is doubtful.
The C type system isn't really set up to deliver what you'd actually want for safety. The rational response is to not use C, but it took until relatively recently for people to really begin that migration.
As another example, notice that abs(number) has Undefined Behaviour for INT_MIN or whatever C calls it.
It is not, but then again a lot of people don't use a static type system at all.
I don't consider abs() comparable. It is a direct consequence of the underlying representation, which you are kind of expected to know; the consequences are known and considered.
Is it bad? Depends on context.
isspace on the other hand, as described, just seems like bad design.
No, the consequences aren't "known and considered" as I explained; they are Undefined Behaviour. Absolutely anything might happen, anywhere in your program.
This is an extraordinary choice, made in the name of convenience, decades ago.
Wrapping here would match your claim that it's "a consequence of underlying representation". Saturating would be rather odd here, but that too would be explicable. A runtime error? Perhaps too heavyweight for a language like C, but otherwise reasonable. But Undefined Behaviour is extraordinary.
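For the sake of argument, here's a minimal C sketch of the saturating and runtime-error alternatives (saturating_abs and checked_abs are made-up names, not library functions); a wrapping version would simply return INT_MIN for INT_MIN.

    #include <limits.h>
    #include <stdlib.h>

    /* Saturating: clamp the one unrepresentable case to INT_MAX. */
    static int saturating_abs(int x)
    {
        if (x == INT_MIN)
            return INT_MAX;
        return x < 0 ? -x : x;
    }

    /* Runtime error: trap on the unrepresentable case instead of leaving
     * it undefined. */
    static int checked_abs(int x)
    {
        if (x == INT_MIN)
            abort();
        return x < 0 ? -x : x;
    }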
I meant that the consequences of the two's-complement representation are known.
There are tons of quirks in C, but abs() doesn't make the list in my book. It is just that the underlying representation (in the two's-complement case) doesn't allow the result to exist.
> I meant that the consequences of the two's-complement representation are known.
How does that matter? We're calling abs(), which is not defined over the "two's-complement system"; it's defined on C's signed integer type, and the way it's defined is that if you call it with the valid integer INT_MIN you get Undefined Behaviour.
Let me underscore that: you don't necessarily get zero, or INT_MAX, or five, or even INT_MIN (which might feel pretty unexpected given that you just called abs and now you've got a negative number again, but would at least be a defined behaviour). You get Undefined Behaviour. All bets are off.
The programmer who gets bitten by this isn't thinking about INT_MIN; they're thinking about how to check that the input is in range. They've written abs(x) < 100 and they think this means they've checked that x is between -99 and +99 inclusive, but they're wrong: what it actually did was introduce Undefined Behaviour into their program whenever x is INT_MIN.
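A hedged sketch of that exact trap, next to a check that avoids abs() entirely and is defined for every int (the function names are made up for illustration):

    #include <limits.h>
    #include <stdlib.h>

    /* Looks like "x is between -99 and +99", but is undefined when
     * x == INT_MIN. */
    static int in_range_buggy(int x)
    {
        return abs(x) < 100;
    }

    /* Never calls abs(); defined for all int values, including INT_MIN. */
    static int in_range_ok(int x)
    {
        return x > -100 && x < 100;
    }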
This argument is a nonstarter. abs() can't return a defined answer if you pass INT_MIN to it because there is no such thing as abs(INT_MIN) in the two's-complement system used by virtually all real-world systems.
Meanwhile, -2322 is a valid int on every C system but not an ASCII whitespace character on any of them. Therefore, it would be not only possible but perfectly reasonable for isspace() to return 0 for it. Classic violation of the principle of least surprise.
Because C is a low level language. It is not a mistake; it by design exposes hardware characteristics. If you don't care about that, you probably shouldn't be using C.
I am the first to agree that undefined behavior in C is pretty bad. But this is trivially understood and one of the most basic examples. Can it still bite you in the ass? Sure.
I would prefer it to be implementation defined, but relying on that blindly isn't much better anyway.
Undefined Behaviour doesn't "expose hardware characteristics". I assure you that the effect in hardware is not undefined.
It would be interesting to survey Rust programs here.
Rust's abs() on the normal signed integer types panics on overflow, except in release builds (where it does the same as wrapping_abs).
Rust's wrapping_abs() gives MIN because that's what happens in two's-complement arithmetic when you wrap around beyond MAX; this is probably undesirable for most people most of the time.
Rust's saturating_abs() gives MAX here, I would expect lots of programmers want that.
The Wrapping<T> and Saturating<T> types have abs() do what wrapping_abs() or saturating_abs(), respectively, do on the basic integer type.
I didn't say that undefined behavior exposes hardware characteristics, though I suppose in a lot of cases it really does - but the compiler won't assume so. I said that C exposes hardware characteristics. And C also has several architectures in mind.
Signed integer overflow is trivial on a two's-complement system but not necessarily on others, so the lowest common denominator dictates what can be done. If the behavior differs or is expensive, you might shift that decision to the programmer.
Negating INT_MIN isn't cheap to handle correctly in a two's-complement system.
C is full of things that are different from one architecture to the next but don't have Undefined Behaviour. Instead they are: "Implementation Defined".
The documentation does call it out explicitly. The message you responded to pointed out where the documentation called it out. You're attacking the authors' intentions and competence when there is no reason for that.
They do not explicitly call out that the signedness of the system's default char type is implementation-specific, and that the function as written might be incompatible with the default behavior (on, as it turns out, the vast majority of systems).
Just saying the value needs to fit in an unsigned char and calling it a day is hardly clear documentation, when for most C programmers it isn't even obvious that the signedness of the character type is implementation-defined.
It almost reads as a deliberate trap (of course it isn't, but imo it is not good documentation).
> mostly for speed reasons that were valid 20+ years ago.
I think it is more likely the same issue POSIX had: the official standard came after the implementations, so it had to unify all the existing implementations under one definition. That means the standard could not require sane behavior if even a single implementation failed to validate its input, or if two implementations disagreed on the exact error.
Yep, that's what I meant by plenty of facepalms to go around. The entire class of functions is "unsafe at any speed." Code which seems perfectly safe on one compiler will blow up on another, literally without warning.
I do wonder why they didn't have the function take an unsigned int, or at least do a simple bitwise AND with 0xFF before the table lookup.
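The masking idea is easy to picture. A minimal sketch, assuming a plain ASCII whitespace set and the "C" locale (ws_table and masked_isspace are made-up names, not how any vendor's isspace() is actually written):

    /* 256-entry table indexed by the low 8 bits of the argument, so no
     * input can reach outside it. EOF (typically -1) masks to 255, which
     * is not marked as whitespace here, so it still yields 0. */
    static const unsigned char ws_table[256] = {
        [' '] = 1, ['\t'] = 1, ['\n'] = 1, ['\v'] = 1, ['\f'] = 1, ['\r'] = 1
    };

    static int masked_isspace(int c)
    {
        return ws_table[(unsigned)c & 0xFF];
    }

Of course, masking silently accepts out-of-range arguments instead of rejecting them, which is arguably its own kind of trap.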
Nitpick: EOF is only guaranteed to be unequal to any character and less than zero.
(And no, I don't know of any implementation where it isn't minus one. I would love to see a devilish C compiler/standard library that made all kinds of standard-conforming yet unconventional implementation choices: EOF not minus one, RAND_MAX a fairly small prime, etc. The goal would be that no single major program runs correctly on it.)
Good point, but if the standard defines EOF as a negative value while at the same time permitting implementations to support default-unsigned char types, then the sanity train obviously left the tracks at that time, or at some time prior to it.