Is Microsoft somehow on board with this? In my experience there are CSV files and there are CSV files that Excel understands, and if you can't get Excel to understand your file format, that will seriously hinder adoption.
In my experience Excel doesn't even understand its own CSV files. If you save one using the Swedish locale, it uses semicolons as field separators, since the comma is used as the decimal separator. Trying to open the resulting file using an English locale results in garbage.
When saving as CSV, Excel will use the regional "List separator" setting. You can override this in Windows 7 with Region and Language > Additional Settings > List separator.
If you are trying to generate a file that plays nice with Excel, there is a way to force a specific delimiter with the "sep" pragma:
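For example, putting a "sep" line as the very first line of the file (this is Excel-specific; anything else reading the file will see it as a data row):

    sep=;
    id;name;amount
    1;Widget;3,50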
I think we should just stop using commas and newlines and start using the ASCII unit separator and record separator. It would alleviate most quoting and escaping issues because those two rarely appear in actual text.
For terminal mode applications, unit separator is Control+Underscore and record separator is Control+Caret (Ctrl+Shift+6 on a US English keyboard).
However, many terminal mode applications intercept those keys and expect them to be commands rather than data. Often there is some kind of quote key which you can press first. In readline and vim, the quote key is Ctrl+V. In emacs, it is Ctrl+Q instead.
GUI applications are more variable. But vim and emacs still support their respective quote keys in GUI mode.
And if you have to type this a lot, you can modify your vim/emacs/readline/etc configuration to remove the requirement for the quote key to be pressed first.
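If the data is generated and consumed by programs rather than typed, emitting the separators is trivial anyway. A minimal Python sketch, purely illustrative (no quoting or escaping needed, which is the point):

    # ASCII unit separator between fields, record separator between records
    US = "\x1f"
    RS = "\x1e"

    rows = [["id", "note"], ["1", "contains, a comma and a\nnewline"]]

    encoded = RS.join(US.join(fields) for fields in rows)
    decoded = [record.split(US) for record in encoded.split(RS)]
    assert decoded == rows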
Part of me thinks that a new CSV standard with technical breaking changes (lines starting with #) is not needed because we have JSON array-of-arrays files.
The other part thinks this is quite cool and wishes your efforts well.
The new RFC allows for names for the fields:
"3. The first record in the file MAY be an optional header with the same format as normal records. This header will contain names corresponding to the fields in the file"
Is there a reliable way to tell if this first record has names or actual data?
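(The closest thing I know of is heuristic sniffing, e.g. Python's stdlib csv.Sniffer, which compares the first row against the rest and guesses; it can easily misfire on all-text tables:)

    import csv

    sample = "name,age\nalice,30\nbob,25\n"
    # has_header() is a heuristic, not a guarantee
    print(csv.Sniffer().has_header(sample))  # True for this sample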
'Tis a pity most platforms don't (in practice) support saving MIME types as a file attribute.
(Even when some platforms have the facility to store the MIME type in an extended attribute, few applications will actually support retrieving and acting on that attribute.)
JSON is easier for people to consume, but formats like CSV can be much more space-efficient and easier for an application to consume.
If you're dealing with files that small then JSON is a great solution.
In the past I've used CSV to handle files that are several GB after being compressed and encrypted. Formats such as JSON would have added a lot more to the total file size.
> In the past I've used CSV to handle files that are several GB after being compressed and encrypted. Formats such as JSON would have added a lot more to the total file size.
If you store tables as a list of objects, sure, but that's comparing the JSON way of doing something CSV can't handle at all (lists of heterogeneous objects) to CSV doing a much simpler thing.
A compact, dedicated JSON table representation (list of lists with the first list as header, or a single flat list with the header length — number of columns — as the first item, then the headers, then the data) is pretty closely comparable to CSV in size.
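To illustrate, a rough Python sketch comparing the two encodings of the same small table (exact sizes obviously depend on the data):

    import csv, io, json

    header = ["id", "name", "qty"]
    rows = [[1, "bolt", 40], [2, "nut", 55]]

    # list-of-lists JSON: header first, then the data rows
    as_json = json.dumps([header] + rows, separators=(",", ":"))

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    as_csv = buf.getvalue()

    print(len(as_json), len(as_csv))  # same ballpark for flat tables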
I don't really see the point... Using special characters makes it harder to manipulate by hand, and you still don't solve conflicts (where content is able to corrupt the container structure). Ultimately, the only safe way is to use Pascal string-style length prefixes.
Escaping is too easy to ignore, until it comes back to bite you in the ass. Balanced delimiters break as soon as the content contains invalid data (or is delimited according to a different convention than the container).
It's much less likely that content is going to include ASCII 28 to 31 than that it will include a comma or a tab. But you're right, certainly not impossible.
I think the human-editability you mention may be a major reason why comma-separated has been such a practical winner?
Since ASCII 28 to 31 are invisible or indistinguishable in most editors, I suspect a format relying on them would be subject to routine corruption.
In machine encoding formats, "much less likely" is the same as "very likely". You're either OK for something to happen or you categorically make it impossible to happen.
Invisible characters also mean you can no longer call a data format "human readable".
The special ASCII characters you're referring to are mostly a historical oddity today.
How much have you worked with actually existing CSV data? If it was as easy as "making it categorically impossible to happen", working with CSV wouldn't be such a mess! You can say whatever you want in whatever standard you want, data bugs still happen.
Any CSV format description anywhere will of course say you need to use some internal escaping mechanism for literal commas, or forbid them entirely. Does that "categorically make it impossible to happen"? Obviously not; words in a standard can't make something categorically impossible. How likely a protocol/standard violation is to happen of course matters to the actual practical pain.
Yes, of course the ASCII separator values are a historical path not taken. This thread, which I didn't start, was musing about why, and what if, by people who, believe it or not, have plenty of experience dealing with data, and with bad data.
> It's much less likely that content is going to include ASCII 28 to 31 than that it will include a comma or a tab. But you're right, certainly not impossible.
Yeah, that's the same problem as UTF-8's almost-compatibility with ASCII. Things work, until they break spectacularly.
Ultimately, you either have to solve the issue properly, or keep it visible enough that people are forced to handle it.
This is why we have editors that do syntax checking as we type. Editing CSV by hand is just not important enough for such editors to be commonplace. CSV files are exported and imported.
And since we have JSON, whose spec is stable, well known, and at this point consistently implemented, CSV should really just be sunset rather than "fixed".
Also, the simplest escaping format is wrap in quotes and represent quotes by typing quotes twice. That's easy to remember and relatively hard to screw up even when editing by hand.
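That is exactly the RFC 4180 rule, and most libraries already apply it for you; a quick Python example:

    import csv, sys

    # Fields containing commas, quotes or newlines get wrapped in quotes;
    # embedded quotes are doubled.
    writer = csv.writer(sys.stdout)
    writer.writerow(["plain", 'say "hi"', "a, b"])
    # Output: plain,"say ""hi""","a, b"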
If you're editing by hand then length prefixes are also easily broken, and counting the length of strings is cumbersome unless you're using a "technical" text editor like VS Code, in which case it's just as easy to type a delimiter like \x1F (unit separator) as it is to highlight the text and look at the character count.
Unless "editing by hand" means using something like excel?
> Ultimately, the only safe way is to use Pascal string-style length prefixes.
Note that length prefixes aren't necessarily safer when nested. For example, `line(10 bytes, [cell(1000 bytes, "xxxxxxx...`. This can lead to vulnerabilities like buffer overflows (naively allocating 10 bytes for the line, then trying to put 1000 bytes into it).
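Put differently, a parser has to cross-check nested lengths rather than trust them. A Python sketch of the check a naive parser skips, over a made-up format where a line is a length-prefixed sequence of length-prefixed cells:

    import struct

    def parse_line(buf: bytes) -> list[bytes]:
        # hypothetical format: u32 line length, then cells of u32 length + payload
        (line_len,) = struct.unpack_from("<I", buf, 0)
        end = 4 + line_len
        if end > len(buf):
            raise ValueError("line length exceeds buffer")
        pos, cells = 4, []
        while pos < end:
            if pos + 4 > end:
                raise ValueError("truncated cell header")
            (cell_len,) = struct.unpack_from("<I", buf, pos)
            pos += 4
            if pos + cell_len > end:  # the check a naive parser omits
                raise ValueError("cell length exceeds its line")
            cells.append(buf[pos:pos + cell_len])
            pos += cell_len
        return cells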
Manipulating by hand is easily solved with Alt sequences and by adding support to text editors for displaying those characters distinctly. It's easier than figuring out the corner cases of YAML.
The "General" data column behavior thing that caused the incorrect parsing of gene names as Dates goes all the way back to the original Visicalc. Are you suggesting Microsoft should patch Visicalc and Lotus 1-2-3 as well?
> But they will never do that because that would break existing systems that depend on the old behaviour.
Which is an inevitability for anything that's been a dependency for long enough.
Either you admit that things have to change and sometimes shit that's been secretly broken for years will have to change too, or you give in to the shit and admit that you can never solve any problems that made it to production.
Both answers suck, but one sucks a lot more, so I choose the "fix it anyway, and if something else turns out to have depended on the brokenness then add that to the list of shit to fix" approach whenever possible.
We've had similar issues with clients using 1-2-1 as a meeting type name. When data about those records are exported to CSV and loaded by Excel, it gets seen as the date 2001-02-01 (or, presumably, 2001-01-02 for those using a US locale in their setup) even if the value is quoted (which we hoped would be interpreted as "this value is definitely nothing fancier than a basic string", but it was not).
> This standard is from 2005 and it is pretty clear that it is not universally used.
Unfortunately a lot of non-conforming software can't change or they risk compatibility problems with old versions of themselves. 2005 is relatively recent in the history of CSV files.
Opening a CSV in Excel rather than importing it is pretty much broken by design. Import the data from text and you can set the column types; open it by default and suffer the consequences.
CSV in the abstract has no problems. The problem is two tools attempting to interact by using subtly incompatible versions of CSV. And CSVY just adds one more of those.
Ok, so now that I know text/csv is the proper MIME type, I have to ask: why don't browsers offer a native way to view CSVs as an [HTML] table, complete with sorting and filtering?
The reason I don't like CSV is that CSV files can use any text encoding: ASCII, ISO-8859, UTF-8... It's not nice having to guess the encoding when reading files.
RFC 4180 also fails to define a default text encoding. It seems that the text encoding can only be given when a MIME type is given for the file.
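In practice you end up guessing. A sketch using the third-party chardet package (assuming it is installed; the detection is statistical, not guaranteed, and the file name is just a placeholder):

    import chardet

    raw = open("unknown.csv", "rb").read()
    guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")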
Excel generates RFC-compliant CSV files if you are using the en-us/en-uk locales, but not for many other locales. Of course, this is a system-wide setting, and Excel has no option to generate a standard CSV independent of locale.
Almost certainly for backwards compatibility. Excel in particular uses the BOM as a signal to open a file in a Unicode encoding, rather than cp1252 etc.
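That is also why the usual trick for getting UTF-8 CSV into Excel is to write the BOM yourself; in Python the "utf-8-sig" codec does it (a sketch, file name is just an example):

    import csv

    # 'utf-8-sig' prepends the UTF-8 BOM, which Excel takes as the cue
    # to decode the file as UTF-8 instead of the legacy ANSI code page.
    with open("export.csv", "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerow(["name", "städer"])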
Not sure why this is trending now, but this RFC is currently being revised: https://datatracker.ietf.org/doc/html/draft-shafranovich-rfc...
Suggestions and comments are welcome here: https://github.com/nightwatchcybersecurity/rfc4180-bis