> CSV uses decimal representations of numeric data, which means you are getting 3.5 bits of data for every 8 bits of storage space (and that's assuming you are using a reasonably compact text encoding... if you are using UTF-16, it's 16 bits). Using a binary representation you can store 8 bits of data for every 8 bits of storage space.
XML, JSON, and YAML all have this issue, too.
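For concreteness, here's a minimal Python sketch of the density gap the quoted point is describing (the value is arbitrary):

```python
import struct

n = 1234567890
as_text = str(n)                  # decimal digits, as CSV/JSON/XML would store them
as_binary = struct.pack("<q", n)  # 8-byte little-endian signed integer

print(len(as_text.encode("utf-8")))  # 10 bytes of digits (20 in UTF-16)
print(len(as_binary))                # always 8 bytes, regardless of magnitude
```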
> CSV uses a variety of date-time formats, but a prevalent one is YYYY-MM-DDThh:mm:ss.sssZ. I'll leave it as an exercise for the reader to determine whether that is as compact as an 8-byte millis since the epoch value.
This is also identical to XML, YAML and JSON.
And I know what you're about to argue, but JSON's datetime format is not in the spec. The common JSON datetime format is convention, not standard.
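Either way, the size comparison the quote leaves "as an exercise" is easy to sketch in Python (the timestamp itself is arbitrary):

```python
import struct
from datetime import datetime, timezone

dt = datetime(2024, 1, 15, 12, 34, 56, 789000, tzinfo=timezone.utc)

iso = dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"
millis = struct.pack("<q", round(dt.timestamp() * 1000))

print(iso, len(iso))  # 2024-01-15T12:34:56.789Z -> 24 bytes as text
print(len(millis))    # 8 bytes as a binary epoch-millis value
```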
> CSV also requires escaping of separator characters, or quoting of strings (and escaping of quotes), despite ASCII (and therefore UTF-8) having a specific unit separator character already reserved. So you're wasting space for each escape, and effectively wasting symbol space as well (and that's ignoring the other bits of space for record separators, group separators, etc.).
This is also identical to XML (escaping XML entities, sometimes having to resort to CDATA), YAML (escaping dashes) and JSON (escaping double quotes).
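A quick Python sketch of the trade-off both sides are pointing at, CSV quoting versus the reserved ASCII separators (the field contents are made up):

```python
import csv, io

# ASCII reserves control characters for exactly this job:
US = "\x1f"  # unit separator (between fields)
RS = "\x1e"  # record separator (between rows)

rows = [["name", "remark"], ["Ada", 'said "hi", then left']]

# CSV has to quote the second field and double its embedded quotes...
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(repr(buf.getvalue()))

# ...whereas US/RS need no escaping at all, provided the data
# never contains the separator bytes themselves:
print(repr(RS.join(US.join(fields) for fields in rows)))
```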
All you've shown is that CSV has the same limitations that XML, YAML, and JSON have, and those three formats were specifically designed and intended for data serialization. Yes, the other formats do have other advantages, but they don't eliminate those three limitations, either.
This is for data serialization, which means it's going to potentially be used with data systems that are wholly foreign, separated by great distances or great timespans. What data serialization format are you comparing CSV to? What do you think CSV is actually used for?
Are you arguing for straight binary? You know that CSV, XML, YAML and JSON all grew out of the reaction to how inscrutable both binary files and fixed width files were in the 80s and 90s, right? Binary has all sorts of lovely problems you get to work with like endianness and some systems getting confused if they encounter a mid-file EOF. If you don't like the fact that two systems can format text differently, you're going to have a whole lot of fun when you see how they can screw up binary formatting. Never mind things like, "Hey, here's a binary file from 25 years ago... and nothing can read it and nobody alive knows the format," that you just don't get with plain text.
Yes, you do end up with wasted space, but the file is in plain text and ZIP compression is a thing if that's actually a concern.
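For instance, a rough sketch with gzip (the same DEFLATE family as ZIP) over a made-up, repetitive CSV payload:

```python
import gzip

# A repetitive CSV payload, the kind CSV is typically used for:
csv_text = "id,price,ts\n" + "\n".join(
    f"{i},19.99,2024-01-15T12:34:56.789Z" for i in range(1000)
)
raw = csv_text.encode("utf-8")

print(len(raw))                 # tens of kilobytes of plain text
print(len(gzip.compress(raw)))  # a small fraction of that after compression
```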
Yes. Though to their credit, some of those work with hex numbers, which at least gets you 4 bits out of every 8 bits.
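That density is easy to see: a hex digit carries exactly one nibble per character.

```python
data = bytes([0x00, 0x01, 0x02, 0x03])
print(data.hex())       # "00010203": 8 text characters carrying 4 bytes,
print(len(data.hex()))  # i.e. 4 bits of payload per 8-bit character
```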
> And I know what you're about to argue, but JSON's datetime format is not in the spec. The common JSON datetime format is convention, not standard.
I'm not sure what argument you thought I was making, or why that comment is relevant.
> All you've shown is that CSV has the same limitations that XML, YAML, and JSON have, and those three formats were specifically designed and intended for data serialization. Yes, the other formats do have other advantages, but they don't eliminate those three limitations, either.
I'm not sure what you mean by "eliminate", or why you think it matters that there are other formats with the same design trade-offs.
> This is for data serialization, which means it's going to potentially be used with data systems that are wholly foreign, separated by great distances or great timespans. What data serialization format are you comparing CSV to? What do you think CSV is actually used for?
CSV is used for a variety of purposes. The context of the article is using it for data transfer.
The claim was that it was a compact format for data transfer, which is demonstrably not true.
> Are you arguing for straight binary? You know that CSV, XML, YAML and JSON all grew out of the reaction to how inscrutable both binary files and fixed width files were in the 80s and 90s, right?
I'm not sure what "straight binary" means to you. JSON is, for the most part, a binary encoding standard (just not a particularly good one).
You've got the heritage a bit wrong, as XML was not originally designed for data transfer at all. It was an attempt to simplify the SGML document markup language, and the data transfer aspects were subsequently grafted on. JSON & YAML have a slightly more complicated heritage, but neither was intended as a data transfer format. They've all been pressed into service for that purpose, for a variety of reasons that can charitably be described as tactically advantageous but strategically flawed.
> Binary has all sorts of lovely problems you get to work with like endianness and some systems getting confused if they encounter a mid-file EOF.
I don't know how to break this to you, but text formats can have endianness too (UTF-16 comes in big- and little-endian flavors, and, insanely, even UTF-8 files sometimes carry a byte-order mark), and systems can get just as confused about whether they are at EOF.
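A quick sketch of that encoding point: the same three characters serialize to different bytes depending on UTF-16 byte order, and UTF-8 files can carry the BOM signature.

```python
text = "CSV"
print(text.encode("utf-16-le").hex())     # 430053005600
print(text.encode("utf-16-be").hex())     # 004300530056 (same text, other byte order)
print("\ufeffCSV".encode("utf-8").hex())  # efbbbf435356 (the UTF-8 BOM signature)
```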
> Yes, you do end up with wasted space, but the file is in plain text and ZIP compression is a thing if that's actually a concern.
Wouldn't ZIP be a binary format, with all the problems and concerns you have with binary formats?
So to summarize what you are saying... "CSV is a compact format because you can compress it if you are concerned about all the space it wastes".
Would it be fair to say then that any binary format is a text format because you can convert the binary into a text representation of the data? ;-)