Fputc semantics for conforming C implementations (thephd.dev)
52 points by goranmoomin on Dec 14, 2022 | 42 comments



> unsigned char cread = (unsigned char)fgetc(f);

Don't do this. The return type of fgetc is int for a reason.

When you do this, you will have no ability to distinguish between EOF, which is typically -1, and (unsigned char)EOF, which is typically 0xff. So if the user types '\xff' or has some binary data you may treat it as end of file.
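
A minimal sketch of the usual pattern (copy_stream is just an illustrative name):

    #include <stdio.h>

    /* Copy a stream byte by byte. The fgetc return value is kept in an int and
       compared against EOF before narrowing, so a 0xFF byte in the input is
       never mistaken for end of file. */
    static void copy_stream(FILE *in, FILE *out) {
        int ch;                                /* full int, not char */
        while ((ch = fgetc(in)) != EOF)
            fputc((unsigned char)ch, out);     /* safe to narrow once EOF is excluded */
    }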

In practice that issue is far more common than the subject of this article: a theoretical compiler where somebody has chosen to make char exceed 8 bits, which in the 21st century I would classify as crazy town.


This is just pretty bad design, and it means that if char is 16 bits and so is int - which as far as I can tell is allowed - the interface is unusable, since there is no int value left over to distinguish EOF from a valid character.


You've just illustrated one of the reasons I wrote that it would be crazy town to make such a choice.

The C standard and POSIX are full of things like this. Areas where an implementation could make different choices, but actually making certain choices would be very foolish, so by and large people do not.

Re-reading the article, it seems like Texas Instruments has made such a choice. I think it'd be quite challenging to find others.

Edit: I found this documentation from TI confirming this. https://www.ti.com/document-viewer/lit/html/SPRAC71A/GUID-3F... "There are no 8-bit objects on C28x devices. This presents unique challenges for implementing the ELF object file format on C28x devices." Int is also 16 bits.


There is a whole world of such systems; they are usually called DSPs, and TI is just one example. Basically, microprocessors optimized for math operations and lower cost, at the expense of compatibility.

I have programmed them in the past, and yeah, you have to tread carefully. Many existing libraries are not ready for the wide chars.


The systems where this is the case aren't intended for text processing.


This doesn't mean that fwrite should only write every 2nd or every 4th byte when you give it a pointer to some non-text data.


The fine author wrote that they spoke to numerous engineers who have been writing code targeting these compilers for decades though?


The C char type is an aberration. (and always has been)

The simple fact that a char can be defined as unsigned is telling enough. C should have defined a byte type long ago.

Overall, I think the whole C standard library has aged poorly: the semantics, the somewhat cryptic and overly abbreviated naming style, and all the corner cases with old types like char being used in place of the missing byte type.

I very rarely use C on top of the standard C library, I prefer using a framework or at least a single-header with a list of saner/more modern types and wrappers.


Except we’ve had stdint.h since C99 (23 years now), so we’ve had a byte type for a long time now. Are you saying C99 historically took too long, or that you don’t get what you need from C99?


stdint.h was a huge improvement. But uint8_t is kind of ugly compared to "byte".

I personally use:

byte u16 u32 u64 i32 etc. and u32x4 f32x4 for vectorized/simd types

Not perfect but a good-enough compromise that would not take long to understand for anyone reading my code.
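
For what it's worth, such aliases are typically a handful of typedefs over stdint.h; the exact mapping below (and the omission of the SIMD types) is an assumption:

    #include <stdint.h>

    typedef uint8_t  byte;
    typedef uint16_t u16;
    typedef uint32_t u32;
    typedef uint64_t u64;
    typedef int32_t  i32;
    /* u32x4 / f32x4 would wrap a compiler vector extension or intrinsic type,
       which is platform-specific and left out here. */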


I'm wading through a sea of third party projects right now that variously use uint8_t, u8, byte, tchar, BYTE, and endless variations that prepend one or two underscores for no apparent reason other than simple paranoia. Now, on to other types: Care to guess how many bits WORD is? (Trick question: It depends on the project you're in. You can imagine the adaptation layers when someone thinks that a bool is 32 bits).

It's even more fun when they #define these types, rather than use typedef or "using". ("Remember, children, every time you use #define, God kills a startup").
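
The classic way #define'd types bite, as a minimal illustration (BYTE_PTR / byte_ptr are made-up names):

    #define BYTE_PTR unsigned char *     /* textual substitution only */
    typedef unsigned char *byte_ptr;     /* an actual type */

    BYTE_PTR a, b;    /* expands to: unsigned char *a, b;  -- b is NOT a pointer */
    byte_ptr c, d;    /* both c and d are pointers */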

These types don't make things safer, or in any way better. They are long-term damage.

For the love of Schmoo, use the standards, I'm losing my hair.


I’m a bit surprised that this needed an entire essay when the first section title (“no”) seems rather obvious?

It feels a bit like when ddevault got pissy about “int” not meaning “codepoint”, when the minimum size of int is 16 bits (though in that case what they were doing was literally listed as UB in the docs).


I don't feel it's obvious at all. Is it not surprising that you can't reliably store chars to a file using fgetc/fputc?

OTOH I'm not surprised at all that embedded compilers play fast and loose with the standard.


> Is it not surprising that you can't reliably store chars to a file using fgetc/fputc?

It was surprising until I needed I/O between two systems with different CHAR_BITs to work.

Many systems need to access octet-based fields and communicate over octet-based buses, and CHAR_BIT>8 machines are rare.

If the entire char is serialized, interoperability becomes a much larger mess.


I understand the reasoning. If you have a string, each letter is an individual character. You wouldn't pack two ASCII letters into a char just because your char is 16 bits.

On such a platform, if you had the string "Hello" and wrote the chars un-truncated, you'd end up with "H\0e\0l\0l\0o\0" in the file.

However, if you are not writing a string but binary data, this would mean you now lose every 2nd 8-bit byte.


If you are writing binary data from memory, _and_ want to make sure any system can read it no matter what endianness it uses, you might need something like this:

    /* Assumes data is a 32-bit unsigned value and buf is an array of unsigned
       char; bytes are emitted most-significant first (big-endian). */
    buf[0] = (data >> 24) & 0xFF;
    buf[1] = (data >> 16) & 0xFF;
    buf[2] = (data >> 8) & 0xFF;
    buf[3] = data & 0xFF;
    
and if you want this code to work on all systems (including wide-char ones), you need to make sure fwrite truncates too.


> I don't feel it's obvious at all. Is it not surprising that you can't reliably store chars to a file using fgetc/fputc?

Chars 2 to 4 times larger than what the standard requires? Not really. It means one platform would be reading / writing 4 bytes (octets) at a time, at which point endianness rears its ugly head.

That sounds a lot worse than clamping values to a byte.


Where is the ddevault bit?



I’ve never had the displeasure of programming on one of these large-char systems, but:

> The fwrite() function shall write, from the array pointed to by ptr, up to nitems elements whose size is specified by size, to the stream pointed to by stream.

So I guess fwrite writes nitems * size things of whatever size fputc writes. So, if fputc truncates, how large is the resulting file? Specifically, if I write ten chars (e.g. “Mic check\n”), do I get a ten-char file? A ten-byte file? If I have two-byte chars and I write:

{0xab, 0xab}

Do I get a four-byte file containing 0xab, 0, 0xab, 0? Do I get a two-byte file? Does every file on the entire system consist of 50% zeros? Or do I end up with 0xab, 0xab, so the file is two bytes, filesystems have one-byte characters, but C’s CHAR_BIT is nonetheless 16?

(The latter actually seems like it makes a tiny bit of sense on a system that doesn’t have 8-bit hardware types at all.)

But this is all pretty ridiculous, and none of the choices are really good. I can understand having bona fide 16-bit characters, but having this mismatch is nasty.

To be clear, basically everything on old embedded designs is nasty. I have used some of these compilers, and they were unbelievably buggy messes.


> (The latter actually seems like it makes a tiny bit of sense on a system that doesn’t have 8-bit hardware types at all.)

> But this is all pretty ridiculous, and none of the choices are really good. I can understand having bona fide 16-bit characters, but having this mismatch is nasty.

I’d say the latter (and the behaviour observed by TFA) is the only sensible behaviour; otherwise it's plainly not possible to do IO portably, and instead you would have to write different codepaths based on CHAR_BIT.

Not only that but:

> Do I get a four-byte file containing 0xb, 0, 0xa, 0? Do I get a two-byte file?

There's actually a third choice because multibyte IO means endianness becomes a factor. So given char[]{0xab, 0xab} the output of writing memory contents literally with CHAR_BIT=16 could be

    0x00 0xAB 0x00 0xAB
or

    0xAB 0x00 0xAB 0x00


Previous "discussion": https://news.ycombinator.com/item?id=31077574

As it has only one comment, let me copy-paste it in full, below.

---

I have read the article, and my first reaction is to disagree with the conclusions. The whole problem stems from the fact that the units in which the data is accessed by the ("strange") CPU and by the external storage can be different. Let's suppose that the CPU-character is 16-bit wide (i.e. that the CPU can't address or process 8-bit data units individually), while the storage-character is 8-bit wide. In that case, both the behavior of writing each CPU-character as two storage-characters, _and_ the behavior of writing only the 8 bits of each CPU-character to storage, might make sense, for different applications. This is not really about truncation, this is about choosing an interleaved or non-interleaved representation of the storage content in memory as the source layout. And yes, for the representation where each storage-character is represented by one CPU-character, it is logical that the higher 8 bits can't matter, must be ignored, and so the program can return 2 in this case.

The real bug here is the attempt to stipulate one and the same storage representation for all applications.

This should ideally be a flag to fopen(), just like we have a "b" flag to denote binary streams on some systems. I think it is too late to rule on the default, though.


IMHO the point of the author is backwards.

unsigned char c = UCHAR_MAX; // e.g. 65'535 for CHAR_BIT = 16

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the line that feels wrong to me.

CHAR_BIT = 16 is a corner case; I don't even understand why it is allowed by the standard.


There is apparently a bunch of DSPs and other small hardware with CHAR_BIT at 16 or 32. Someone would have to tell them that their compilers are no longer conforming.


This is where the standard should have been more strict.

For historical reasons the bit-widths of the original C types were not strongly defined, as most hardware of the time had "exotic" registers and addresses by today's standards.

But that was a long time ago. It is reasonably safe to assume that 14-bit ints and 21-bit pointers are deprecated.

Why would DSPs need to change the bit-width of chars? We already have shorts and ints, which are better candidates for 16-bit and 32-bit widths.

I would not mind if the C standard were made both stricter and simpler, and I maintain that the author is wrong: fputc (and all other I/O libraries depending on it) should not be modified to take this kind of absurd corner case into account.

Doing so will introduce more bugs and complexity that would have to be maintained.


"char" is supposed to match the smallest addressable object in the memory, which is why "sizeof" returns the ratio between the size of an object and the size of "char".

So if a DSP cannot address bytes separately, but it can address 16-bit words, the size of a "char" must be 16 bits. This is independent of choosing to define "short" as a 16-bit integer.
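
A small illustration of that relationship; the values in the comments are what one would expect on such a DSP and are an assumption:

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        /* sizeof counts in units of char, so on a CHAR_BIT == 16 part a
           16-bit short can have sizeof(short) == 1. */
        printf("CHAR_BIT      = %d\n", CHAR_BIT);
        printf("sizeof(short) = %u\n", (unsigned)sizeof(short));   /* 1 on such a DSP */
        printf("sizeof(int)   = %u\n", (unsigned)sizeof(int));     /* 1 on such a DSP */
        return 0;
    }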

It is useful to have a type corresponding to the smallest addressable part of the memory, but it should not have been named "char", which is a name that should have been reserved for ASCII or UTF-8 data, and never used for numbers or bit strings.


If the hardware can't directly write or read 8-bit chars in memory, it is easy to solve by the compiler with some bit masks, not by creating an artificial CHAR_BIT constraint that will annoy every other implementer down the line.

This is beyond ridiculous.
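
For concreteness, a rough sketch of the kind of masking a compiler (or support library) could emit on a machine that only addresses 16-bit words; word_mem, the byte-address split, and the packing order are all assumptions:

    #include <stdint.h>

    static uint16_t word_mem[1024];    /* the only directly addressable storage */

    /* Read one octet from a simulated byte address. */
    static uint8_t load_u8(uint32_t byte_addr) {
        uint16_t w = word_mem[byte_addr >> 1];
        return (byte_addr & 1) ? (uint8_t)(w >> 8) : (uint8_t)(w & 0xFF);
    }

    /* Write one octet, preserving the other half of the word. */
    static void store_u8(uint32_t byte_addr, uint8_t v) {
        uint16_t *w = &word_mem[byte_addr >> 1];
        if (byte_addr & 1)
            *w = (uint16_t)((*w & 0x00FF) | ((uint16_t)v << 8));
        else
            *w = (uint16_t)((*w & 0xFF00) | v);
    }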


There is even some precedent; for example, the original Alpha AXP instruction set only supports 32-bit and 64-bit memory reads/writes, while still having an 8-bit char.

Although, I guess the bigger problem is how to represent pointers to 8-bit values - on Alpha a pointer with value 1 would point to the octet at offset 1 in memory, whereas presumably on this 16-bit DSP such a pointer would point at the octet at offset 2.


Because char is overloaded here: it is also the minimum addressable size. So they need to do it because it is faster and cheaper to address larger chunks. The issue in the article is that calls to fputc/fwrite on a binary file will still truncate to 8 bits on some of those platforms.

A byte type that is always 8 bits would be a better approach, I think. But C++ bungled this too: std::byte is still the size of a char. I don't think it should exist on systems that cannot address 8 bits.


Either the char type should not exist on hardware that can't address 8 bits, or better yet, the compiler can help with something like 2-3 asm instructions instead of one.

Slower, maybe, but saner and conformant to a pre-existing standard.


The main point of contention seems to be that fputc is defined to return the value that was written to the stream. However, TI's implementation of it returns the character passed in as a parameter (e.g. 0x1600), but actually writes (unsigned char)(0x1600 & 0xff) to the stream.


That doesn’t seem to be the case. The tests are strangely interleaved, but if you read carefully:

- cread and cwritten are always the same, which is required by the standard

- they differ from c

Meaning fputc does return the value actually written, not the value input.

And OP spends the entire essay freaking out about that. Well, really, about fputc operating on octets when the implementation’s char is larger than that. You can see it in the second section: TFA complains that TI / DSPs return 2. Which, mind, is the entire point of fputc returning what it wrote.


Ah, OK. Well if that's the case then that's correct; fputc does operate on octets, right? The only reason it returns an int is to distinguish EOF, AFAIK. To be honest I found the article hard to follow.


> Well if that's the case then that's correct, fputc does operate on octets right?

That is what I would think obvious here, as it's generally how char-based functions operate, and doing CHAR_BIT-wide IO would be a complete mess for portable IO (and would introduce ambiguities due to endianness).

Although technically what it says is:

> The fputc() function shall write the byte specified by c (converted to an unsigned char) to the output stream pointed to by stream

So if you have CHAR_BIT=16 and WORD_BIT=16 things get dicey. But indeed the easiest way to resolve the ambiguity is to assume that, like everywhere else, it only operates on the "minimum" char boundaries (an octet), and `int` is only used to allow returning `EOF`. This is what usually occurs in other char-based functions.

Though in all honesty I don't know how a platform with CHAR_BIT=16 handles `char*` / `char[]`.


Incidentally, this is the exact same problem that we have in the political/legal world: sure, the words of the constitution are written with the unspoken expectation that they have a specific meaning, but in practice, whoever is in power can decide the meaning of the words to be what they want them to be:

"As a starting point you need to understand that the C specification is not written in a rigorous language. It relies on making sense of natural language to draw any conclusions from it. If you insist that means you can just interpret it however you want, well, no one can stop you, but it's not a productive activity. The point is human consensus on what it means." [0]

Or, how Alice In Wonderland put it:

“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean — neither more nor less.” “The question is,” said Alice, “whether you can make words mean so many different things.” “The question is,” said Humpty Dumpty, “which is to be master – – that’s all.”

As for the (less interesting) technical argument, I think this one nails it on why it is illegal to truncate:

" "Characters" here are the values representable in unsigned char as fputc is specified to write." [1]

[0] https://twitter.com/RichFelker/status/1451040776855638019

[1] https://twitter.com/RichFelker/status/1451038290006663176


I'm not really a C programmer, but in terms of weird C behaviour, this doesn't sound that insane (relative to usual C craziness).

So if I understand, a char type is allowed to be bigger than a "character", but it will still get truncated to be a character when written. Sounds reasonable enough to me if you have some weird architecture.


No matter how frustrating this might be, it seems several points are clear from this post:

1. Implementations claiming to be 'standards conforming' can and do exhibit this behaviour.

2. The groups that wrote the standards do not, as a whole, assert that implementations which behave this way do not conform to the standard.

3. There seems to be consensus that the standard documents themselves as works of natural language produced by the members of the relevant standards bodies are open to a certain amount of interpretation.

It would be good if members of the standards body could unify what the standard documents say with what is actually going on somehow.

But it would be easy for TI to update their fputc to return the value that was actually written. Why don't they just do it?

If the author feels the implementation is not standards conforming, why can't they write a shim to make it so?
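
A minimal sketch of what such a shim could look like; fputc_octet is a made-up name and the & 0xFF mirrors the truncation the article observes:

    #include <stdio.h>

    /* Forward to the platform fputc, but report back the octet that actually
       ends up in the stream rather than the full (possibly 16-bit) char. */
    static int fputc_octet(int c, FILE *stream) {
        if (fputc(c, stream) == EOF)
            return EOF;
        return c & 0xFF;
    }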


If TI updates their fputc, then files written on TI microcontrollers would not be compatible with any other system; see other comments for plenty of examples.

Most people who actually write for TI systems want fputc to truncate. And the blog author got incensed right away without bothering to figure out _why_ the truncation happens, or what the implications of his proposed change would be for real-world programs.


I am making a personal C standard library of sorts, and I made the decision to only use the low-level I/O routines on each platform and implement buffering and file semantics myself. I did this to make it easier to have an implementation that works the same on POSIX and Windows. (I wanted to use the native CreateFile(), WriteFile(), etc. on Windows.)

This article justifies that decision in a way I never thought would happen. I didn't think I would have other reasons for doing so. Now I do.
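
For illustration, the sort of thin platform layer that approach implies might look like this (io_handle and io_write are hypothetical names; buffering would then be built on top in portable code):

    #include <stddef.h>

    #if defined(_WIN32)
    #include <windows.h>
    typedef HANDLE io_handle;

    /* Raw write via the native Windows API. */
    static ptrdiff_t io_write(io_handle h, const void *buf, size_t len) {
        DWORD written = 0;
        if (!WriteFile(h, buf, (DWORD)len, &written, NULL))
            return -1;
        return (ptrdiff_t)written;
    }
    #else
    #include <unistd.h>
    typedef int io_handle;

    /* Raw write via the POSIX low-level I/O routine. */
    static ptrdiff_t io_write(io_handle h, const void *buf, size_t len) {
        return (ptrdiff_t)write(h, buf, len);
    }
    #endif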


This is a great article, but C is full of ways to easily make mistakes with memory. I think, with Rust being more mature, everyone should consider using it, or another memory-safe language, instead of starting to write new programs in C.


I gave up after two pages of the article not getting to the point.


> code full of #ifdefs is not portable code - it’s code that has been ported a lot

Definitely true - the more preprocessor magic you rely on, the harder it is to figure out what’s actually going into the compiler.



