Technically the “newline” character is actually a line _terminator_. Hence “A\n”...

wtetzner · 2024-03-20T13:03:47 1710939827

So if you have "A" in a file with no newline, there are no lines in that file?

jepler · 2024-03-20T13:16:21 1710940581

Yes, that is a file with zero lines that ends with an "incomplete line". Processing of such files by standard line-oriented utilities is undefined in the opengroup spec. So, for instance, the effect of "grep"ping such a file is not defined. Heck, even "cat"ting such a file gives non-ideal results, such as colliding with the regular shell prompt. For this reason, a lot of software projects I work on check and correct this condition whenever creating a commit.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1... ("text file")
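
A check like the one described above could look roughly like this in Python. This is only a sketch: the function names are made up for illustration and are not taken from any particular project's commit hooks, and it assumes an empty file is left alone.

```python
def ends_with_newline(data: bytes) -> bool:
    """Return True if data is empty or its last byte is a newline."""
    return data == b"" or data.endswith(b"\n")


def fix_final_newline(data: bytes) -> bytes:
    """Append a newline only when the file ends in an incomplete line."""
    return data if ends_with_newline(data) else data + b"\n"


print(fix_final_newline(b"A"))    # b'A\n'
print(fix_final_newline(b"A\n"))  # b'A\n' (already terminated, unchanged)
```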

rovr138 · 2024-03-20T13:31:59 1710941519

> Yes, that is a file with zero lines that ends with an "incomplete line".

It's a file with zero complete lines. But it has 1 line, that's incomplete, right?

The file starts empty. Anything in it starts "a line". So it's 1 incomplete line.

I hate weird states.

xyzzy_plugh · 2024-03-20T13:56:31 1710942991

No, it is valid for a file to have content but no lines.

Semantically, many libraries treat that as a line because, while \n<EOF> means "the end of the last line", having just <EOF> adds complexity the user has to handle to read the remaining input. But by the book it's not "a line".

If I said "ten buckets of water" does that mean ten full buckets? Or does a bucket with a drop in it count as "a bucket of water?" If I asked for ten buckets of water and you brought me nine and one half-full, is that acceptable? What about ten half-full buckets?

A line ends in a newline. A file with no newlines in it has no lines.

joshjje · 2024-03-20T16:49:37 1710953377

That's beyond ridiculous. In most languages, when you read a line from a file and it doesn't have a \n terminator, the read is going to give you that line, not say, oops, this isn't a line, sorry.

int_19h · 2024-03-20T22:14:13 1710972853

I don't think you can meaningfully generalize to "most languages" here. To give an example, two extremely popular languages are C and Python. Both have a standard library function to read a line from a text stream - fgets() for C, readline() for Python. In both cases, the behavior is to read up to and including the newline character, but also to stop if EOF is encountered before then. Which means that the return value is different for terminated vs unterminated final lines in both languages - in particular, if there's no \n before EOF, the value returned is not a line (as it does not end with a newline), and you have to explicitly write your code to accommodate that.
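
For the Python half of that, a small sketch using an in-memory stream (io.StringIO stands in for a real file here, and read_all_lines is just a helper name made up for the example):

```python
import io


def read_all_lines(text: str) -> list:
    """Call readline() until it returns '' (EOF) and collect the results."""
    stream = io.StringIO(text)
    results = []
    while True:
        line = stream.readline()
        if line == "":  # readline() returns an empty string at EOF
            break
        results.append(line)
    return results


terminated = read_all_lines("a\nb\n")
unterminated = read_all_lines("a\nb")
print(terminated)    # ['a\n', 'b\n']
print(unterminated)  # ['a\n', 'b']
```

The last value in the second case has no trailing "\n", so the caller has to check for that explicitly if the distinction matters.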

squeaky-clean · 2024-03-20T18:35:34 1710959734

Most languages but not all. I've even been bit by this recently in cron.

Assuming that EOF is identical to \nEOF will end up causing trouble for you one day, because it's not actually identical.

LK5ZJwMwgBbHuVI · 2024-03-20T17:36:03 1710956163

That's a relatively recent invention compared to tools like `wc` (or your favorite `sh` for that matter). See also: https://perldoc.perl.org/functions/chop wherein the norm was "just cut off the last character of the line, it will always be a newline"

nativeit · 2024-03-20T18:47:33 1710960453

I get this is largely a semantic debate, but I find it a little ironic that so many programmers seem put off by the idea of a line count that starts at “0”.

DougBTX · 2024-03-20T15:54:02 1710950042

Another way to look at it is that concatenating files should sum the line count. Concatenating two empty files produces an empty file, so 0 + 0 = 0. If “incomplete lines” are not counted as lines, then the maths still works out. If they counted as lines, it would end up as 1 + 1 = 1.
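
To make the arithmetic concrete, here is a rough Python sketch with two counting conventions. Both function names are invented for this example, and the two "files" are assumed to each hold a single unterminated fragment.

```python
def count_terminated(data: bytes) -> int:
    """Count only newline-terminated lines, i.e. count newline bytes."""
    return data.count(b"\n")


def count_with_incomplete(data: bytes) -> int:
    """Also count a trailing unterminated fragment as one line."""
    trailing = 0 if data == b"" or data.endswith(b"\n") else 1
    return data.count(b"\n") + trailing


a, b = b"A", b"B"  # two files, each with one incomplete line

# Terminated-only counting is additive under concatenation: 0 + 0 == 0.
print(count_terminated(a), count_terminated(b), count_terminated(a + b))  # 0 0 0

# Counting incomplete lines is not additive: 1 + 1, but the result has 1.
print(count_with_incomplete(a), count_with_incomplete(b), count_with_incomplete(a + b))  # 1 1 1
```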

akdev1l · 2024-03-20T13:58:25 1710943105

No, a line is defined as a sequence of characters (bytes?) with a line terminator at the end.

Technically, per POSIX, a file like the one you describe is actually a binary file without any lines. Basically just random binary data that happens to kind of look like a line.

pxc · 2024-03-23T19:38:53 1711222733

Here's another way to think about this:

This isn't a weird state. It's a language problem. An 'incomplete line' isn't a type of line, it's an unfortunate name for a thing that is not a line. Just like how 'wor' is an incomplete word (the word 'word'), but 'wor' is, of course, not a word.

Same thing for formalisms like equations in algebra or formulas in propositional logic— we have the phrase 'well-formed formula', and we might describe some sequences of terms as 'incomplete formulas' or perhaps 'ill-formed formulas', but those phrases don't describe anything that meets the formal system's definition of 'formula' at all— they are not formulas. 'Ill-formed formula' is not a compositional phrase where 'ill-formed' describes a feature of a 'formula'. It's a bit of convenient language for what we can intuitively or metaphorically recognize as a formula-ish thing.

coryrc · 2024-03-20T16:19:25 1710951565

Pedantically, if it doesn't end with a newline, it's considered a binary file and not a text file. Binary files don't have lines.

In practice, most utilities expecting text files will still operate on it.

wtetzner · 2024-03-22T13:54:57 1711115697

That's a weird way to look at it. Binary files might not have "lines", but there's no reason they couldn't include a byte with value 10 (the ASCII value for \n). Software reading that file wouldn't know the difference, right?

Also, why couldn't you have a text file without any lines?

coryrc · 2024-03-22T22:35:52 1711146952

All I'm addressing is GP's comment:

    It's a file with zero complete lines. But it has 1 line, that's incomplete, right?

Because the Unix definition of text file requires the file to end with a newline. "Lines" only exist in the context of text files. If there's no terminating newline, it's (pedantically) not a text file and so has no lines. Now, in practice, if you open() that file in text mode, it doesn't TMK return an error if the terminating newline isn't present, but it's undefined behaviour.

And if you do have a terminating newline, then you have at least one line :).

PaulDavisThe1st · 2024-03-20T17:41:30 1710956490

No file has lines.

"Lines" are a convention established by (or not) software reading a data stream.

coryrc · 2024-03-20T19:16:59 1710962219

Ackshully

mort96 · 2024-03-20T15:36:02 1710948962

It's a file with 0 lines and some trailing garbage.

rerdavies · 2024-03-20T15:41:45 1710949305

The opengroup spec says no such thing.

simonh · 2024-03-20T16:35:04 1710952504

3.206 Line

A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

See also ‘3.403 Text File’ for the definition of a text file. No new line characters, no lines. No lines, not a text file.
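
Read literally, the 3.206 definition quoted above can be sketched in Python like this. split_posix_lines is a hypothetical helper, not part of any standard tool: it returns the complete lines plus whatever is left over after the last newline.

```python
def split_posix_lines(data: bytes):
    """Split data into (lines, remainder).

    Each entry in lines is zero or more non-newline bytes plus the
    terminating newline. remainder holds any bytes after the last
    newline (the "incomplete line"), or b"" if there are none.
    """
    pieces = data.split(b"\n")
    remainder = pieces.pop()  # text after the final newline, possibly empty
    lines = [piece + b"\n" for piece in pieces]
    return lines, remainder


print(split_posix_lines(b"A"))       # ([], b'A')
print(split_posix_lines(b"A\n"))     # ([b'A\n'], b'')
print(split_posix_lines(b"A\nB"))    # ([b'A\n'], b'B')
```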

wtetzner · 2024-03-22T13:56:38 1711115798

> No lines, not a text file.

That seems like a broken (maybe just bad?) definition/specification to me. A blob of JSON in a file isn't "text" if there's no newline character trailing it?

simonh · 2024-03-23T21:42:17 1711230137

There are other definitions of a text file than the opengroup spec, particularly for specific OS platforms. I’m not sure what convention JSON follows.

As a spec it’s fine. It defines a text file in such a way that you can easily write code to process such a file deterministically.

mbrubeck · 2024-03-20T14:27:35 1710944855

    $ echo -n "A" | wc --lines
    0

keybored · 2024-03-20T16:09:38 1710950978

Yep, since wc(1) apparently strictly adheres to the definition of a newline-terminated text file. This is why plaintext files should end with a newline. :)

https://stackoverflow.com/questions/729692/why-should-text-f...

LK5ZJwMwgBbHuVI · 2024-03-20T17:33:56 1710956036

Why don't you go ask?

    $ echo -n foo | wc -l
    0

wtetzner · 2024-03-22T13:57:58 1711115878

wc just counts newline characters. I'm not sure why it would be the ultimate authority on anything.

Gormo · 2024-03-20T14:11:52 1710943912

Suddenly the DOS/Windows solution of using \r\n instead of just \n seems to offer some advantages.

samatman · 2024-03-20T14:29:09 1710944949

This does precisely nothing to solve the ambiguity issue when a final line lacks a newline. The representation of that newline isn't relevant to the problem.

Izkata · 2024-03-20T17:09:40 1710954580

It's actually slightly worse: Windows defines newline as a delimiter, not a terminator. So this:

    foo\nbar\n

Would be 2 lines in *nix and 3 lines in Windows.
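
The two counting conventions are easy to see side by side in Python. This only illustrates delimiter versus terminator semantics on that string; it doesn't check what Windows itself actually does.

```python
text = "foo\nbar\n"

# Newline as a delimiter (separator): the trailing newline yields an
# extra, empty final element.
as_separator = text.split("\n")

# Newline as a terminator: str.splitlines() does not produce an empty
# element for the final newline.
as_terminator = text.splitlines()

print(len(as_separator), as_separator)    # 3 ['foo', 'bar', '']
print(len(as_terminator), as_terminator)  # 2 ['foo', 'bar']
```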

Gormo · 2024-03-25T12:45:56 1711370756

The point is that having a sequence of two delimiters to signal the end of the logical line allows you to have single instances of either delimiter included within the text. This allows visual line breaks to be included within the same line as understood by the regex parser.

danbruc · 2024-03-21T14:56:41 1711033001

Despite the downvotes your comment received, I think you have a good point. There are two uses for a newline, first to signal the end of a line, for example when sending text over a serial connection, and second to separate two lines, for example in a text file.

To indicate that a serially received line is complete, the interpretation as a terminator makes perfect sense - abcd\n is a complete line, abc is a still incomplete line. In a text file the interpretation as a separator might be preferable because that gets rid of the issue of the last line not having a newline - a\nb\nc are three lines separated by two newlines, a\nb\nc\n are four lines separated by three newlines and the last line is empty.

But then it might also be useful to have a terminator in a text file to be able to detect an incompletely written line. So using two characters, one for each purpose, could solve the problem. \r means the line is complete, \n means it follows a next line. abc is an incomplete line, abcd\r is a complete line and no line follows, abcd\r\n is a complete line and a second incomplete line follows which is currently empty. abcd\r\n\r are two complete lines, the second one empty. abcd\r\nefg is a complete line followed by an incomplete line. abcd\r\nefg\r are two complete lines. You could even have two incomplete lines abc\nefg.
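
A rough Python sketch of how a reader for that two-character scheme might work, just to check the examples above. The function name is invented, and it assumes a piece counts as complete only if \r is its last character; a stray \r in the middle of a piece is left in the content.

```python
def parse_cr_lf_scheme(text: str) -> list:
    """Parse text where \\n separates lines and a trailing \\r marks a
    line as complete. Returns a list of (content, complete) pairs."""
    result = []
    for piece in text.split("\n"):       # \n means "another line follows"
        complete = piece.endswith("\r")  # \r means "this line is complete"
        content = piece[:-1] if complete else piece
        result.append((content, complete))
    return result


print(parse_cr_lf_scheme("abc"))           # [('abc', False)]
print(parse_cr_lf_scheme("abcd\r\n"))      # [('abcd', True), ('', False)]
print(parse_cr_lf_scheme("abcd\r\nefg"))   # [('abcd', True), ('efg', False)]
```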

But I think Windows always uses \r\n because this is how you get to a newline on a typewriter or really old printer, you return the carriage and feed the paper one line. I do not think that they had the idea of differentiating between terminator and separator, otherwise you could have only \r and maybe even only \n sometimes. But in principle this could work quite nicely, I guess. You could start a line with \n and end it with \r, this would give you \r\n between lines and \r after the final line. Or nothing if the final line is incomplete or \r\n if the final line is incomplete and currently empty. The odd thing would be a newline as the very first character, maybe one could suppress that. This would also be compatible with Windows and *nix, it would just consider all *nix lines incomplete. Only abc\rdef\r would not really make sense, two complete lines but the second one is not a new line.

If I ever get to write a new operating system, I will inflict this on humanity.

deaddodo · 2024-03-20T16:14:06 1710951246

The "Windows way" is the "right way" for a few reasons.

This is definitely not one of them.

int_19h · 2024-03-20T22:15:50 1710972950

Which are the valid reasons, legacy meanings of those characters aside?

deaddodo · 2024-03-24T00:37:22 1711240642

I mean, it was what everyone had agreed upon previously. Microsoft was the only party to follow through. For all the guff they get for not following standards, it was the one standard they did.

You don't have to love a company to acknowledge they did something right.

joshjje · 2024-03-20T16:41:37 1710952897

“A\n” is two lines.

LK5ZJwMwgBbHuVI · 2024-03-20T17:40:13 1710956413

Factually incorrect.

rerdavies · 2024-03-20T15:40:21 1710949221

Technically, that is one of two possible interpretations, and you seem to have invented a "by definition" out of thin air.

Very very technically a "newline" character indicates the start of a new line, which is why it is not called the "end-of-line" character.

LK5ZJwMwgBbHuVI · 2024-03-20T17:39:33 1710956373

It doesn't indicate the start of a new line, or files would start with it. Files end with it, which is why it is a line terminator. And it is by definition: by the standard, by the way cat and/or your shell and/or your terminal work together, and by the way standard utilities like `wc` treat the file.

cortesoft · 2024-03-20T16:01:29 1710950489

I mean, the person you are responding to didn't invent the definition out of thin air... the POSIX standard did:

3.206 Line A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

https://pubs.opengroup.org/onlinepubs/9699919799.2018edition...

mabster · 2024-03-20T22:02:26 1710972146

I don't know why no-one here sees this as a bad design...

If a line is missing a newline then we just disregard it?!

A much better way to deal with newline is to treat it as a separator, like a comma. And, as in modern languages, we allow a final separator but ignore it, so that it is easier for tools to generate files.

Now all combinations of characters, including newline characters, have an interpretation without dropping anything.

danbruc · 2024-03-21T17:09:43 1711040983

I also always preferred interpreting a newline in files as a separator rather than a terminator, because I never liked the final newline causing a new empty line in the editor and, like you, thought it was bad design that you can have a somewhat invalid file.

But if you look beyond files, the interpretation as a terminator also makes perfect sense, when you receive text over a serial connection it signals that the line is complete which does not necessarily imply that another line will follow. The same in a file, if the terminating newline is missing, you can deduce that an incomplete write occurred and some data might be missing. If you decide to have a newline as a separator after the last line but to ignore it, then you can not represent an empty last line.

I guess you would need two different characters, one terminator and one separator. You could start a line with \n and end it with \r. The \n separates the line from the one before, then \r terminates the line and marks it as complete. You would get \r\n between lines as on Windows and the last line would only have \r if complete or would otherwise count as incomplete. Then again you could almost get the same thing with \n only, you would just have to change the interpretation, instead of \n giving you a line and no \n giving you not a line, you would have to say that \n gives you a complete line and no \n gives you an incomplete line. With that you could however not have an incomplete empty line.

pepa65 · 2024-03-29T06:56:10 1711695370

This effort of building in redundancy is pointless. We just need a newline to know where to start the output on a new line. If you want to safeguard the proper content of a file, a whole lot more is needed.

nomel · 2024-03-20T16:44:38 1710953078

POSIX getline() includes EOF as a line terminator:

    getline() reads an entire line from stream, storing the address
       of the buffer containing the text into *lineptr.  The buffer is
       null-terminated and includes the newline character, if one was
       found.
    ...
    ... a delimiter character is not added if one was
       not present in the input before end of file was reached.

EOF seems same as end-of-string.

lsaferite · 2024-03-22T15:55:30 1711122930

Your quoted documentation says otherwise. It says that a 'line' includes the delimiter, '\n', in the line buffer. It also says that if no delimiter is found before the EOF is reached, the line buffer will not include the delimiter. That means the line buffer can clearly indicate an incomplete line by the absence of the delimiter. To be clear, EOF isn't a 'line terminator', it's the end of the data stream.

nomel · 2024-03-22T22:24:27 1711146267

Yes, "EOF seems same as end-of-string."

pepa65 · 2024-03-29T06:58:38 1711695518

No, getline() will stop reading at the newline, even if more (non-NUL) characters follow. EOF is end-of-file.

pepa65 · 2024-03-29T06:52:59 1711695179

So this is what "3.403 Text File" says:

A file that contains characters organized into zero or more lines [so characters with no newlines are OK]

No NUL, and lines (delimited by and including newline) not exceeding LINE_MAX bytes.

pepa65 · 2024-03-29T06:45:01 1711694701

How about a null-byte then? That's not a newline character, but all POSIX tools will treat it as EOF.