RFC 4180: Common Format and MIME Type for CSV Files (2005) (ietf.org)
48 points by tosh on June 3, 2021 | 64 comments



(I'm the author of the RFC)

Not sure why this is trending now, but this RFC is currently being revised: https://datatracker.ietf.org/doc/html/draft-shafranovich-rfc...

Suggestions and comments are welcome here: https://github.com/nightwatchcybersecurity/rfc4180-bis


Is Microsoft somehow on board with this? In my experience there are CSV files, and then there are CSV files that Excel understands, and as long as you can't get your file format into Excel, that will seriously hinder adoption.


In my experience Excel doesn't even understand its own CSV files. If you save one using the Swedish locale, it uses semicolons as field separators, since the comma is used as the decimal separator. Trying to open the resulting file using an English locale results in garbage.


When saving as CSV, Excel will use the regional "List separator" setting. You can override this in Windows 7 with Region and Language > Additional Settings > List separator.

If you are trying to generate a file that plays nice with Excel, there is a way to force a specific delimiter with the "sep" pragma:

    sep=|
    a|b|c
    1|2|3
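
If you're generating the file programmatically, something along these lines works (a minimal Python sketch; the filename and data are made up, and note that only Excel understands the "sep=" line, so other CSV readers will see it as an ordinary data row):

    import csv

    rows = [["a", "b", "c"], ["1", "2", "3"]]

    # "report.csv" is just a placeholder filename.
    with open("report.csv", "w", newline="", encoding="utf-8") as f:
        f.write("sep=|\n")                     # Excel-only pragma; other tools treat it as data
        writer = csv.writer(f, delimiter="|")  # match the delimiter declared above
        writer.writerows(rows)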


Please give up on that and make CSV itself nice and simple. :) Let's have it, err, comma separated.

Simple is good.

If it's really simple even Microsoft will be able to implement it.


Would that work if you try to use the ASCII record separator character with the sep pragma?


Just open it in LibreOffice and save it as .xlsx. It's the only way I've found that actually works reliably.


I think we should just stop using commas and newlines and start using the ASCII unit separator and record separator. It would alleviate most quoting and escaping issues because those two rarely appear in actual text.
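
For illustration, a rough sketch of what reading and writing such "ASCII delimited" text could look like (nothing standard here, just plain string handling; the example data is made up):

    # Unit separator (0x1F) between fields, record separator (0x1E) between records.
    US, RS = "\x1f", "\x1e"

    rows = [["name", "notes"], ["Alice", "likes commas, and\nnewlines"]]

    # Encode: join fields with US and records with RS. No quoting or escaping is needed,
    # as long as the content never contains US/RS themselves.
    encoded = RS.join(US.join(fields) for fields in rows)

    # Decoding is just the reverse split.
    decoded = [record.split(US) for record in encoded.split(RS)]
    assert decoded == rows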


Until I can type those ASCII separators on my keyboard, they’re not going to happen. That’s why CSV won out: the comma key already existed.


For terminal mode applications, the unit separator is Control+Underscore and the record separator is Control+Caret (Ctrl+Shift+6 on a US English keyboard).

However, many terminal mode applications intercept those keys and expect them to be commands rather than data. Often there is some kind of quote key which you can press first. In readline and vim, the quote key is Ctrl+V. In emacs, it is Ctrl+Q instead.

GUI applications are more variable. But vim and emacs still support their respective quote keys in GUI mode.

And if you have to type this a lot, you can modify your vim/emacs/readline/etc configuration to remove the requirement for the quote key to be pressed first.


Sure, maybe, but that won’t be CSV.


Maybe that would be an "SSV"? Separator Separated Values file.


On the other hand, if the characters can still technically exist in the content, the rareness just makes the problems harder to spot.

In this sense, sticking to relatively common separators is good, because they encourage you to do the right thing from the start.


That would just make it easier to implement the standard in a way that works most of the time but breaks on valid input. Why would that be desirable?


Significant whitespace is nasty; by definition it's hard to see errors. Not convenient if you can't edit the format with a normal keyboard.


Oh interesting, it includes a specification for comments. That is likely going to be the most controversial part.

Does any implementation support/generate comments that way? The most I've seen so far is an oversized multiline header.


Wadler's Law strikes again https://wiki.c2.com/?WadlersLaw


Part of me thinks that a new CSV standard with technical breaking changes (lines starting with #) is not needed because we have JSON array-of-arrays files.

The other part thinks this is quite cool and wishes your efforts well.


The new RFC allows for names for the fields: "3. The first record in the file MAY be an optional header with the same format as normal records. This header will contain names corresponding to the fields in the file". Is there a reliable way to tell whether this first record has names or actual data?


IIRC, text/csv; header=present


Oh, that's new to me! But not available after downloading, of course.


'Tis a pity most platforms don't (in practice) support saving MIME types as a file attribute.

(Even when some platforms have the facility to store the MIME type in an extended attribute, few applications will actually support retrieving and acting on that attribute.)


I feel the simple recommendation for all CSV usecases should be "use JSON".


JSON is easier for people to consume, but formats like CSV can be much more space-efficient and easier for an application to consume.


Space doesn't matter when most of your documents are a few KB. And when it does, zip it.


If you're dealing with files that small then JSON is a great solution.

In the past I've used CSV to handle files that are several GB after being compressed and encrypted. Formats such as JSON would have added a lot more to the total file size.


> In the past I've used CSV to handle files that are several GB after being compressed and encrypted. Formats such as JSON would have added a lot more to the total file size.

If you store tables as lists of objects, sure, but that's comparing the JSON way of doing something CSV can't handle at all (lists of heterogeneous objects) to CSV doing a much simpler thing.

A compact, dedicated JSON table representation (list of lists with the first list as header, or a single flat list with the header length — number of columns — as the first item, then the headers, then the data) is pretty closely comparable to CSV in size.
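
For a rough size comparison, a quick sketch (illustrative only, with made-up data):

    import csv, io, json

    header = ["id", "name", "qty"]
    data = [[1, "widget", 3], [2, "gadget", 5]]

    # List of lists, with the header as the first list.
    list_of_lists = json.dumps([header] + data)

    # Single flat list: column count first, then the headers, then the row data.
    flat = json.dumps([len(header)] + header + [v for row in data for v in row])

    # Equivalent CSV, for comparison.
    buf = io.StringIO()
    csv.writer(buf).writerows([header] + data)

    print(len(list_of_lists), len(flat), len(buf.getvalue()))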


Sadly "ASCII delimited" text, using ASCI 28 to 31, doesn't seem to be a thing:

* https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text

* http://en.wikipedia.org/wiki/Unit_separator


I don't really see the point... Using special characters makes it harder to manipulate by hand, and you still don't solve conflicts (where content is able to corrupt the container structure). Ultimately, the only safe way is to use Pascal string-style length prefixes.

Escaping is too easy to ignore, until it comes back to bite you in the ass. Balanced delimiters break as soon as the content contains invalid data (or is delimited according to a different convention than the container).
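
To make "length prefix" concrete, a toy sketch (my own illustration, not any real format): each field is stored as a 4-byte length followed by the raw bytes, so content can never collide with the framing, and the decoder can bounds-check every prefix.

    import struct

    def encode_fields(fields):
        # Each field: 4-byte big-endian length prefix, then the raw bytes.
        return b"".join(struct.pack(">I", len(f)) + f for f in fields)

    def decode_fields(blob):
        fields, pos = [], 0
        while pos < len(blob):
            (length,) = struct.unpack_from(">I", blob, pos)
            pos += 4
            if pos + length > len(blob):
                raise ValueError("length prefix exceeds remaining data")  # bounds check
            fields.append(blob[pos:pos + length])
            pos += length
        return fields

    payload = encode_fields([b"hello, world", b'contains "quotes" and \x1f too'])
    assert decode_fields(payload) == [b"hello, world", b'contains "quotes" and \x1f too']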


It's much less likely that content is going to include ASCII 28 to 31 than that it will include a comma or a tab. But you're right, certainly not impossible.

I think the human-editability you mention may be a major reason why comma-separated has been such a practical winner?

Since ASCII 28 to 31 are invisible or indistinguishable in most editors, I suspect a format relying on them would be subject to routine corruption.

(I have some experience working with a "legacy" "binary" format which DOES use those ASCII separator values... https://en.wikipedia.org/wiki/MARC_standards)


In machine encoding formats, "much less likely" is the same as "very likely". You're either OK for something to happen or you categorically make it impossible to happen.

Invisible characters also mean you can no longer call a data format "human readable".

The special ASCII characters you're referring to are mostly a historical oddity today.


How much have you worked with actually existing CSV data? If it was as easy as "making it categorically impossible to happen", working with CSV wouldn't be such a mess! You can say whatever you want in whatever standard you want, data bugs still happen.

Any CSV format description anywhere will of course say you need to use some internal escaping mechanism for a literal comma, or forbid them entirely. Does that "categorically make it impossible to happen"? Obviously not; words in a standard can't make something categorically impossible. How likely a protocol/standard violation is to happen of course matters to the actual practical pain.

Yes, of course the ASCII separator values are a historical path not taken. This thread, not started by me, was musing about why, and what if, by people who, believe it or not, have plenty of experience dealing with data, and bad data.


You've misinterpreted my comment, and are arguing about unrelated things.


> It's much less likely that content is going to include ASCII 28 to 31 than that it will include a comma or a tab. But you're right, certainly not impossible.

Yeah, that's the same problem as UTF-8's almost-compatibility with ASCII. Things work, until they break spectacularly.

Ultimately, you either have to solve the issue properly, or keep it visible enough that people are forced to handle it.


> Escaping is too easy to ignore

This is why we have editors that do syntax checking as we type. Editing CSV by hand is just not important enough for such editors to be commonplace. CSV files are exported and imported.

And since we have JSON, whose spec is stable, known, and at this point consistently implemented, CSV should really just be sunset rather than "fixed".

Also, the simplest escaping format is to wrap fields in quotes and represent embedded quotes by typing them twice. That's easy to remember and relatively hard to screw up, even when editing by hand.
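
For instance, a quick illustration with Python's csv module, which uses this quote-doubling convention by default:

    import csv, io

    buf = io.StringIO()
    csv.writer(buf).writerow(['she said "hi"', 'a, b'])
    print(buf.getvalue())  # "she said ""hi""","a, b"

    # Reading it back recovers the original values.
    assert next(csv.reader(io.StringIO(buf.getvalue()))) == ['she said "hi"', 'a, b']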


If you're editing by hand, then length prefixes are also broken easily, and counting the length of strings is cumbersome unless you're using a "technical" text editor like VS Code, in which case it's just as easy to write a delimiter like \x1F (unit separator) as it is to highlight the text and look at the character count.

Unless "editing by hand" means using something like excel?


There's a difference between "take on this inconvenience and gain nothing" and "take on this inconvenience in exchange for solving this problem".


> Ultimately, the only safe way is to use Pascal string-style length prefixes.

Note that length prefixes aren't necessarily safer when nested. For example, `line(10 bytes, [cell(1000 bytes, "xxxxxxx...`. This can lead to vulnerabilities like buffer overflows (naively allocating 10 bytes for the line, then trying to put 1000 bytes into it).


Sure, without bounds-checking all hope is lost, but then you're in a hopeless place anyway.


> Using special characters makes it harder to manipulate by hand

I generally use CSV as an import/export format, and would like for software to at least support the option (even if CSV/TSV remains the default).


Manipulating by hand is easily solved with Alt sequences and by adding support for distinctly displaying those characters to text editors. It's easier than figuring out the corner cases of YAML.


Anyone who has written software to import CSV files knows that in general it's a complete mess.

Also, geneticists have been forced to rename certain genes because Excel messed their names up [0].

It would be nice if this standard were generally accepted. It is from 2005, and it is pretty clear that it is not universally used.

[0] https://www.theregister.com/2020/08/06/excel_gene_names/


Excel is the ultimate bane of CSV files. If Microsoft would fix Excel and distribute the fixes to old Excels, then life would be good again.


The "General" data column behavior thing that caused the incorrect parsing of gene names as Dates goes all the way back to the original Visicalc. Are you suggesting Microsoft should patch Visicalc and Lotus 1-2-3 as well?


It is not just Excel; there are a lot of systems that export CSV. Many of them are broken.

Examples:

- double quotes in text fields

- using the wrong decimal separator, should it be "," or "."

- using the wrong column separator, should it be "," or ";"


But they will never do that because that would break existing systems that depend on the old behaviour.


> But they will never do that because that would break existing systems that depend on the old behaviour.

Which is an inevitability for anything that's been a dependency for long enough.

Either you admit that things have to change and sometimes shit that's been secretly broken for years will have to change too, or you give in to the shit and admit that you can never solve any problems that made it to production.

Both answers suck, but one is a lot more suck, so I choose the "fix it anyway, and if something else turns out to have depended on the brokenness then add that to the list of shit to fix" approach whenever possible.


We've had similar issues with clients using 1-2-1 as a meeting type name. When data about those records is exported to CSV and loaded by Excel, it gets seen as the date 2001-02-01 (or, presumably, 2001-01-02 for those using a US locale in their setup), even if the value is quoted (which we hoped would be interpreted as "this value is definitely nothing fancier than a basic string", but it was not).

> This standard is from 2005 and it is pretty clear that it is not universally used.

Unfortunately, a lot of non-conforming software can't change, or it risks compatibility problems with old versions of itself. 2005 is relatively recent in the history of CSV files.


Opening a CSV in Excel rather than importing it is pretty much broken by design. Import data from text and you can set the column types; open by default and suffer the consequences.


You are not at all wrong, but unfortunately "you are opening it wrong" did not fly with the client for us in this case.


By that logic though, all text formats are a mess because text processors will autocorrect and break the formatting.


I recommend CSVY (CSV with YAML frontmatter, https://csvy.org/), which solves several problems in CSV.

Currently R's data.table supports reading and writing the CSVY format. https://www.rdocumentation.org/packages/data.table/versions/...
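
For anyone unfamiliar: a CSVY file is, roughly, YAML frontmatter between "---" lines followed by ordinary CSV. A rough sketch of reading one (illustrative only; it uses PyYAML, the metadata is invented, and the real details are at csvy.org and in the data.table docs):

    import csv, io
    import yaml  # PyYAML, a third-party dependency

    # A tiny example document, built line by line to keep it explicit.
    document = "\n".join([
        "---",
        "name: sales",
        "fields: [region, total]",
        "---",
        "region,total",
        "north,42",
        "south,17",
        "",
    ])

    # Split the YAML frontmatter (between the two "---" lines) from the CSV body.
    _, frontmatter, body = document.split("---\n", 2)
    metadata = yaml.safe_load(frontmatter)
    rows = list(csv.reader(io.StringIO(body)))

    print(metadata["fields"], rows)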


CSV in the abstract has no problems. The problem is two tools attempting to interact by using subtly incompatible versions of CSV. And CSVY just adds one more of those.


OK, so now that I know text/csv is the proper MIME type, I have to ask: why don't browsers offer a native way to view CSVs as an [HTML] table, complete with sorting and filtering?


This is timely for me - I was just sharing this 2014 article with my team.

http://thomasburette.com/blog/2014/05/25/so-you-want-to-writ...

It made an impact on my approach to CSV handling, and helped me understand, "just because I can, doesn't mean I should".


The reason I don't like CSV is that CSV files can use any text encoding: ASCII, ISO-8859, UTF-8... It's not nice to have to guess the encoding when reading files.

RFC 4180 also fails to define a default text encoding. It seems that the text encoding can be given only when a MIME type is given for the file.

JSON is so much easier.


The default encoding is now UTF-8; due to the way the RFC process works, it is not in the original document: https://www.iana.org/assignments/media-types/text/csv


Excel generates RFC-compliant CSV files if you are using the en-us/en-uk locales, but not for many other locales. Of course, this is a system-wide setting, and Excel has no option to generate a standard CSV independent of locale.


I used Excel recently and it added a BOM, which is pretty awful. Is there any way to stop that nonsense?


AFAIK: Nope. Microsoft loves that BOM, for reasons I cannot comprehend.


Almost certainly for backwards compatibility. Excel in particular uses the BOM as a signal to open a file in a Unicode encoding, rather than cp1252 etc.


I use a script to strip it from the file. Definitely not ideal.
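
Something along these lines, for anyone who needs it (a minimal Python sketch; the filenames are placeholders):

    # Strip a leading UTF-8 BOM if present; otherwise copy the file unchanged.
    BOM = b"\xef\xbb\xbf"

    with open("export.csv", "rb") as src:
        data = src.read()

    with open("export_nobom.csv", "wb") as dst:
        dst.write(data[len(BOM):] if data.startswith(BOM) else data)

(When reading in Python, the "utf-8-sig" codec also handles the BOM transparently.)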


> the en-us/en-uk locales

Though be careful if you communicate with people using both locales. Things interpreted as dates will get corrupted in fun ways.


Oh heck. I'm cited reference #5 (almost last) but the link to my site is broken. I'll have to put a 301 on there.



