
EOF is not a character - UkiahSmith
https://ruslanspivak.com/eofnotchar/
======
charlysl
This is very well explained in the classic book _The UNIX Programming
Environment_, by Kernighan and Pike, on page 44:

 _Programs retrieve the data in a file by a system call ... called read. Each
time read is called, it returns the next part of a file ... read also says how
many bytes of the file were returned, so end of file is assumed when a read
says "zero bytes are being returned" ... Actually, it makes sense not to
represent end of file by a special byte value, because, as we said earlier,
the meaning of the bytes depends on the interpretation of the file. But all
files must end, and since all files must be accessed through read, returning
zero is an interpretation-independent way to represent the end of a file
without introducing a new special character._

Read what follows in the book if you want to understand Ctrl-D down cold.
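
For anyone who wants to see the "zero bytes" rule in action, here is a minimal Python sketch (Unix-flavoured, using `os.read`, which is a thin wrapper over the system call). The helper name `drain` and the chunk size are made up for illustration. Note that bytes 0x04 and 0x1A pass through as ordinary data; only closing the write end produces the empty read.

```python
import os

def drain(fd, chunk_size=4):
    """Collect chunks from fd until read() reports zero bytes."""
    chunks = []
    while True:
        chunk = os.read(fd, chunk_size)
        if chunk == b"":  # zero bytes returned: end of file, not a byte value
            break
        chunks.append(chunk)
    return chunks

r, w = os.pipe()
os.write(w, b"hello\x04\x1a")  # ^D and ^Z travel through as ordinary data
os.close(w)                    # closing the write end is what produces EOF
chunks = drain(r)
os.close(r)
print(b"".join(chunks))
```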

------
rectang
Like NULL, confusion over EOF is a problem which can be eliminated via
algebraic types.

What if instead of a char, getchar() returned an Option<char>? Then you can
pattern match, something like this Rust/C mashup:

    
    
       match getchar() {
         Some(c) => putchar(c),
         None => break,
       }
    

Magical sentinels crammed into return values — like EOF returned by getchar()
or -1 returned by ftell() or NULL returned by malloc() — are one of C's
drawbacks.
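
A rough Python approximation of the same idea, assuming a hypothetical helper `read_byte` that returns `Optional[int]`. Python's `Optional` is only a type hint (checked by tools such as mypy, not at run time), so this is a sketch of the shape rather than a real algebraic type. The `is not None` test matters because byte 0 is falsy.

```python
import io
from typing import Optional

def read_byte(stream) -> Optional[int]:
    """Return the next byte as an int in 0..255, or None at end of input."""
    data = stream.read(1)
    if not data:
        return None
    return data[0]

def copy_bytes(src, dst) -> int:
    """Copy src to dst one byte at a time; return how many bytes were copied."""
    count = 0
    # Compare against None explicitly: a zero byte is valid data but falsy.
    while (b := read_byte(src)) is not None:
        dst.write(bytes([b]))
        count += 1
    return count

src = io.BytesIO(b"a\x00\xffz")
dst = io.BytesIO()
copied = copy_bytes(src, dst)
print(copied, dst.getvalue())
```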

~~~
nothrabannosir
What always annoyed me about C is that it has all the tools to simulate
something approaching this, save for some purely syntactical last-mile
shortcomings. We can already return structs; if only there were a way to
neatly define a function returning an anonymous struct, and immediately
destructure on the receiving end. Something like:

    
    
      #include <stdio.h>
    
      struct { int err; char c; } myfunc() {
        return { 0, 'a' };
      }
    
      int main(int argc, const char *argv[]) {
        { int err; char c; } = myfunc();
        if (err) {
          // handle
          return err;
        }
        printf("Hello %c\n", c);
    
        return 0;
      }
    

This is (semantically) perfectly possible today, you just have to jump through
some syntactic hoops explicitly naming that return struct type (because among
others anonymous structs, even when structurally equivalent, aren't equivalent
types unless they're named...). Compilers could easily do that for us! It
would be such a simple extension to the standard with, imo, huge benefits.

Every time I have to check for in-band errors in C, or pass a pointer to a
function as a "return value", I think of this and cringe.

~~~
juped
Sounds like you'd like Go, which works this way.

~~~
apta
Which is a strictly inferior and botched way to go about it, especially since
golang was designed from scratch.

------
Animats
In the beginning, there was the int. In K&R C, before function prototypes, all
functions returned "int". ("float" and "double" were kludged in, without
checking, at some point.) So the character I/O functions returned a 16-bit
signed int. There was no way to return a byte, or a "char". That allowed room
for out of band signals such as EOF.

It's an artifact of that era. Along with "BREAK", which isn't a character
either.

~~~
bhaak
You can still today declare a function without a return type like this: "a()
{ return 1; }".

GCC only outputs a warning by default: "warning: return type defaults to ‘int’
[-Wimplicit-int]"

------
reidacdc
Seems like the confusion arises because getchar() (or its equivalent in
languages other than C) can produce an out-of-band result, EOF, which is not a
character.

Procedural programmers don't generally have a problem with this -- getchar()
returns an int, after all, so of course it can return non-characters, and did
you know that IEEE-754 floating point can represent a "negative zero" that you
can use for an error code in functions that return float or double?

Functional programmers worry about this much more, and I got a bit of an
education a couple of years ago when I dabbled in Haskell, where I engaged
with the issue of what to do when a nominally-pure function gets an error.

I'm not sure I really _got_ it, but I started thinking a lot more clearly
about some programming concepts.

~~~
nixpulvis
What does "Procedural" vs "Functional" have to do with this? It's a choice in
data type.

If by procedural you mean, nonsense, then sure... I agree that a function
named `getchar` returning an `int` is procedural. :P

~~~
eyegor
What they mean to say is: when I was working with a language that enforced
pure functions, I had to actually consider purity. It's rare to see a way to
enforce purity in procedural languages, whereas most fp langs support it.

~~~
nixpulvis
Are we talking about even roughly the same concept of functional purity [1]?
Nothing is stopping a pure function from representing EOF as -1.

Implementing IO in a "pure" way, is however another discussion.

[1]:
[https://en.wikipedia.org/wiki/Pure_function](https://en.wikipedia.org/wiki/Pure_function)

~~~
eyegor
Mostly, do you know of a single procedural language with a concept of IO
monads in its stdlibs?

------
anonymousiam
CP/M and DOS use ^Z (0x1A) as an EOF indicator. More modern operating systems
use the file length (if available). Unix/Linux will treat ^D (0x04) as EOF
within a stream, but only if the source is "cooked" and not "raw". (^D is
ASCII "End Of Transmission or EOT" so that seems appropriate, except in the
world of unicode.)

~~~
pwdisswordfish2
That is a common misconception.

[http://jdebp.info/FGA/dos-character-26-is-not-special.html](http://jdebp.info/FGA/dos-character-26-is-not-special.html)

~~~
unilynx
I'm pretty sure the DOS TYPE command (its version of cat) would stop at the
first ^Z it encountered, even if the file was longer.

It was sometimes used to have TYPE print something human readable and stop
before the remaining (binary) file data would scroll everything away

~~~
cesarb
> It was sometimes used to have TYPE print something human readable and stop
> before the remaining (binary) file data would scroll everything away

Notably, in the PNG file format (created back when MS-DOS was still very
relevant):

"The first eight bytes of a PNG file always contain the following values:
[...] The control-Z character stops file display under MS-DOS. [...]"
([http://www.libpng.org/pub/png/spec/1.2/PNG-Rationale.html#R....](http://www.libpng.org/pub/png/spec/1.2/PNG-Rationale.html#R.PNG-file-signature))
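
To make the signature concrete, a small Python sketch follows. The eight byte values are the ones given in the PNG specification; the helper name `looks_like_png` is just for illustration.

```python
# The eight-byte PNG signature: 0x89, "PNG", CR, LF, Ctrl-Z (0x1A), LF.
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

def looks_like_png(data: bytes) -> bool:
    """Check only the leading signature, not the rest of the file."""
    return data.startswith(PNG_SIGNATURE)

# Position of the Ctrl-Z byte that stops file display under MS-DOS.
ctrl_z_offset = PNG_SIGNATURE.index(0x1A)
print(PNG_SIGNATURE, ctrl_z_offset)
```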

------
combatentropy
The kernel returns EOF "if k is the current file position and m is the size of
a file, performing a read() when k >= m..."

So, is the length of each file stored as an integer, along with the other
metadata? This reminds me of how in JavaScript the length of an array is a
property, instead of a function that counts it right then, like say in PHP.

Apparently it works. I've never heard of a situation where the file size
number did not match the actual file size, nor of a time when the JavaScript
array length got messed up. But it seems fragile. File operations would need
to be ACID-compliant, like database operations (and likewise do JavaScript
array operations). It seems like you would have to guard against race
conditions.

Does anyone have a favorite resource that explains how such things are
implemented safely?

~~~
JdeBP
You are not thinking about it clearly. Ask yourself this: Filesystem formats
use blocking and deblocking. How would a filesystem know the file size
_without_ having metadata for it?
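
On POSIX systems the length is indeed kept with the file's metadata, and `fstat()` reports it as `st_size`. A minimal Python sketch of the "k >= m" rule quoted above, using a throwaway temporary file (the file name is arbitrary):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(b"0123456789")

    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size        # m: length kept in file metadata
        os.lseek(fd, size, os.SEEK_SET)    # k: move the file position to m
        at_end = os.read(fd, 16)           # k >= m, so zero bytes come back
        os.lseek(fd, size - 3, os.SEEK_SET)
        tail = os.read(fd, 16)             # only the 3 bytes before m
    finally:
        os.close(fd)

print(size, at_end, tail)
```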

------
chrisseaton
So what is CP/M-style character 26? Isn’t that documented as end-of-file?

~~~
jcrawfordor
Perhaps a marginally better title would be "EOF is not a character [on Unix]".
There are some OS that have an explicit EOF character, but it seems to have
been the less common approach historically. CP/M featured an explicit end of
file marker because the file system didn't bother to handle the problem of
files which were not block-aligned, so the application layer needed to detect
where the actual end of the file was located (lest it read the contents of the
rest of the block). This is a pretty unusual thing to do, and was definitely a
hassle for developers, so CP/M descendants like MS-DOS fixed it.

~~~
mark-r
I think CP/M copied that convention from an even older OS but I can't remember
which one.

~~~
jcrawfordor
CP/M was developed on TOPS-10 and copied a lot of concepts from it. I can't
immediately tell whether or not this is an example, but for any given
eccentricity of CP/M it's a good bet that it came from TOPS-10.

It's amusing that almost the same can be said about NT: for any given
eccentricity of Windows NT it's a good bet that it came from VMS, since the
two had the same principal designer.

------
IndexPointer
Of course it isn't, you couldn't have arbitrary binary files if one of the 256
possible bytes was reserved. That's why getchar returns int and not char; one
char wouldn't be enough for 257 possible values (256 possible char values +
eof).
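
The counting argument can be checked with a short Python sketch. `read_like_getchar` is a hypothetical stand-in for getchar(), and -1 is used as the EOF value because that is what C implementations typically pick (the standard only requires a negative int).

```python
import io

ALL_BYTES = bytes(range(256))  # every possible byte value is legitimate data

def read_like_getchar(stream) -> int:
    """Mimic getchar(): 0..255 for data, -1 (standing in for EOF) otherwise."""
    data = stream.read(1)
    return data[0] if data else -1

stream = io.BytesIO(ALL_BYTES)
seen = [read_like_getchar(stream) for _ in range(257)]
print(len(set(seen)))  # distinct results the return type has to represent
```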

------
schoen
Recently (though mine was the only comment):
[https://news.ycombinator.com/item?id=22461647](https://news.ycombinator.com/item?id=22461647)

~~~
nixpulvis
Well then try explaining ctrl+c vs ctrl+d to someone who's never touched a
terminal at all. Starts off so easily... "see, one tells the program to stop,
the other, well, if you're in a shell... or some programs... oh god. IDK
anymore, just assume it works. What was the question?"

~~~
ChristianBundy
Maybe you can correct me if I'm wrong, but I've always considered Ctrl+C and
Ctrl+D to be signals that you can send a process rather than explicit
characters. You might _also_ get some stdout for those key combinations
because ???, but they should be thought of as signals rather than as
characters you're sending via stdin.

Hoping Cunningham's Law comes into play with this comment. :)

~~~
rgoulter
I liked this explanation.
[https://www.linusakesson.net/programming/tty/](https://www.linusakesson.net/programming/tty/)

When the TTY device takes (by default) Ctrl+C, its line discipline sends SIGINT
to the foreground program; Ctrl+D sends no signal but makes the pending read
return, with zero bytes if the line is empty. The TTY's 'line discipline' (the
policy for when the program's STDIN can read from a line of input) can be
changed from the default 'cooked' mode to 'raw' mode. With the line discipline
in raw mode, Ctrl+C doesn't send the signal. Presumably that's why e.g. vi or
emacs don't just close on Ctrl+C.

~~~
nixpulvis
> Now you press ^Z. Since the line discipline has been configured to intercept
> this character (^Z is a single byte, with ASCII code 26), you don't have to
> wait for the editor to complete its task and start reading from the TTY
> device. Instead, the line discipline subsystem instantly sends SIGTSTP to
> the foreground process group.

This helps me, thanks for pointing me back at this great write-up.

------
nixpulvis
I find it interesting that Rust's `Read` API for `read_to_end` [1] states that
it "Read[s] all bytes until EOF in this source, placing them into buf", and
stops on conditions of either `Ok(0)` or various kinds of `ErrorKind`s,
including `UnexpectedEof`, which should probably never be the case.

[1]: [https://doc.rust-lang.org/std/io/trait.Read.html#method.read...](https://doc.rust-lang.org/std/io/trait.Read.html#method.read_to_end)

~~~
comex
The reason for that is that, for simplicity's sake, all of the I/O functions
share the same error type. `UnexpectedEof` should never be returned from
`read_to_end`, but it can be returned from `read_exact`.

------
badrabbit
Banged my head against the wall once after trying to figure out why Ctrl+D
generates some character in bash but I can't send that character in a pipe to
simulate termination.

~~~
kylek
Fun fact, ctrl-v in bash sets "verbatim insert" mode for the next character,
so you can type a ^D "character" by doing "ctrl-v ctrl-d".

~~~
pwdisswordfish2
It’s not bash, it’s the tty device driver. Applications can switch between the
‘cooked’ mode (which recognises it as EOF) and ‘raw’ mode (which passes it
through) by performing some ioctl I don’t really want to look up right now.
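
For the curious, Python's `termios` and `tty` modules wrap the relevant calls, so the switch can be inspected without writing the ioctl by hand. This is a Unix-only sketch; it assumes a pseudo-terminal can be opened and uses it in place of a real terminal. The helper name `line_discipline_flags` is made up.

```python
import os
import pty
import termios
import tty

def line_discipline_flags(fd):
    """Report whether canonical ('cooked') input and signal keys are enabled."""
    lflag = termios.tcgetattr(fd)[3]  # index 3 holds the local-mode flags
    return {
        "canonical": bool(lflag & termios.ICANON),
        "signals": bool(lflag & termios.ISIG),
    }

# A pseudo-terminal stands in for a real terminal so this runs without one.
master_fd, slave_fd = pty.openpty()
try:
    cooked = line_discipline_flags(slave_fd)
    # Index 6 holds the control characters; VEOF is the key treated as EOF.
    eof_key = termios.tcgetattr(slave_fd)[6][termios.VEOF]
    tty.setraw(slave_fd)  # clears ICANON and ISIG, among other flags
    raw = line_discipline_flags(slave_fd)
finally:
    os.close(master_fd)
    os.close(slave_fd)

print(cooked, eof_key, raw)
```

In cooked mode the EOF key is interpreted by the driver and never reaches the program as a byte; once ICANON is cleared, the same key arrives as plain data.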

------
jwilk
Um, no, you can't use Python to infer that "EOF (as seen in C programs) is not
a character".

The exception even tells you that "chr() arg not in range(0x110000)" which has
nothing to do with range of C's character types.
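
A quick Python sketch of the point: `chr()` happily accepts 256 and far beyond, so its limit is the Unicode code point range and says nothing about what fits in a C char. The helper `chr_error` is only for illustration.

```python
import sys

def chr_error(value):
    """Return the error message chr() raises for value, or None if it works."""
    try:
        chr(value)
    except (ValueError, OverflowError) as exc:
        return str(exc)
    return None

for candidate in (-1, 255, 256, sys.maxunicode, sys.maxunicode + 1):
    print(candidate, chr_error(candidate))
```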

------
unnouinceput
For me EOF is a boolean state. Either I am at the end of file (stream / memory
mapped etc) or not. That's how I was taught when I started programming. Never
occurred to me to think of it like a character.

------
agumonkey
And this is why I failed C IO classes. Lack of information and improper
abstraction.

------
Thorrez
Another weird thing is that sometimes you can read an EOF, then keep reading
more real bytes. So EOF doesn't necessarily mean the permanent end.

~~~
jwilk
The EOF condition for stdio functions is supposed to be sticky, although glibc
didn't implement it correctly until 2.28:

[https://sourceware.org/bugzilla/show_bug.cgi?id=1190](https://sourceware.org/bugzilla/show_bug.cgi?id=1190)

[https://sourceware.org/legacy-ml/libc-alpha/2018-08/msg00003...](https://sourceware.org/legacy-ml/libc-alpha/2018-08/msg00003.html)

> _All stdio functions now treat end-of-file as a sticky condition. If you
> read from a file until EOF, and then the file is enlarged by another
> process, you must call clearerr or another function with the same effect
> (e.g. fseek, rewind) before you can read the additional data. This corrects
> a longstanding C99 conformance bug. It is most likely to affect programs
> that use stdio to read interactive input from a terminal._

~~~
Thorrez
Wow, very interesting! That sounds like a somewhat significant change, and I
wonder how much stuff will be broken by it.

Although interestingly somehow I'm still seeing the old behavior in Debian
Buster with glibc 2.28 with python3.

    
    
        import sys
        while True:
            b = sys.stdin.read(1)
            print(repr(b))
    

With old glibc with both python2 and python3 the EOF isn't sticky (as
expected). With 2.28 with python2 the EOF is sticky (like you said). With 2.28
with python3 it's not sticky for some reason.

~~~
jwilk
In Python 3, file I/O is implemented using POSIX read(), write() etc.,
rather than C stdio.

~~~
Thorrez
Interesting, and EOF on POSIX read() isn't supposed to be sticky?

That seems like a weird situation, that EOF is sticky in some cases but not
others.
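
At the read() level each call just reports the state of the file at that moment, so there is no flag to stay stuck. A minimal sketch with `os.read` on a regular file (the file name is arbitrary, and the append stands in for "another process enlarging the file"):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "growing.log")
    with open(path, "wb") as f:
        f.write(b"first")

    reader = os.open(path, os.O_RDONLY)
    try:
        before = os.read(reader, 100)  # the existing contents
        eof = os.read(reader, 100)     # empty: at end of file for now
        with open(path, "ab") as f:    # a separate writer enlarges the file
            f.write(b"second")
        after = os.read(reader, 100)   # same descriptor, nothing was reset
    finally:
        os.close(reader)

print(before, eof, after)
```

The sticky behaviour described in the glibc note lives in the stdio layer's own end-of-file indicator, which is why clearerr() is needed there but nothing equivalent is needed here.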

------
cjohansson
Interesting read, I suspected it was like this but I didn’t know for sure

------
jes5199
yeah, this author doesn’t know the history. Unix I/O was defined in opposition
to practices in other OSes, that no longer exist

~~~
guerrilla
Clearly, since they barely know the system they are talking about, but could
you elaborate instead of leaving it vague? Which systems?

~~~
jes5199
there’s plenty of other comments that explain it, but, CP/M, VAX,
teletypewriters, punch cards - all used in-band control characters rather than
an external signal

------
ineedasername
This strikes me as the sort of pedantic and "I'm witty" click bait that
occasionally percolates upwards on HN, especially considering the specifics of
"EOF" are very much contingent on operating context.

------
1996
\r \n (0x0d 0x0a, or just one of them, or the combination of them, depending
on your OS) is EOL

^D (0x04) is EOT and 0x03 is ETX (End of Text):
[https://www.systutorials.com/ascii-table-and-ascii-code/](https://www.systutorials.com/ascii-table-and-ascii-code/)

So, kinda, but somehow I'm happy it never got turned into a weird combination
depending on the OS.

~~~
mark-r
Those are just conventions, and they aren't consistent from one OS to another
at all. ASCII tried to standardize it but failed.

~~~
1996
That should have been my point - I'm happy there aren't 3 standards to specify
what EOFile is.

