I don't agree with giving up csvs until the following conditions are met:
1) A truly open format is available and accessible. Csvs are textfiles. There is no system around that cannot open a textfile. If the format is binary or requires patents or whatever, then it's a non-starter.
2) Applications have a speed increase from using csvs. To wit, I loved csvs because often they finish preparing much faster than a "formatted" output in excel, etc. and sometimes I just want to see the numbers, not have everything colored or weird features like merged cells in excel, etc.
3) The new format should not be grossly larger than the one it is replacing. Extracts in Excel format are orders of a magnitude larger than csv in filesize. This affects run-time to prepare the extract, open it (memory constraints, etc.)
Is there truly a problem? The author is not forced to use csvs.
You're comparing CSVs to other spreadsheet document formats. But a CSV is not a spreadsheet. A CSV is raw data. (It's data that is restricted to a shape that enables it to be easily imported into a spreadsheet—but data nevertheless.) As such, it should be compared to other data formats—e.g. YAML, JSON Lines, etc.
These other data formats all win on your #2 against CSV, as CSV is actually horrible at parse-time vs. other data formats — the fact that both of its separators (newlines and commas) can appear as-is inside column values, with a different meaning, if those column-values are quoted, means that there's no way to parallelize CSV processing, because there's no way to read-ahead and "chunk" a CSV purely lexically. You actually need to fully parse it (serially), and only then will you know where the row boundaries are. If you've ever dealt with trying to write ETL logic for datasets that exist as multi-GB CSV files, vs. as multi-GB any-other-data-format files, you'll have experienced the pain.
> The new format should not be grossly larger than the one it is replacing.
Self-describing formats like JSON Lines are big... but when you compress them, they go back to being small. General-purpose compressors like deflate/LZMA/etc. are very good at shearing away the duplication of self-describing rows.
As such, IMHO, the ideal format to replace ".csv" is ".jsonl.gz" (or, more conveniently, just ".jsonl" but with the expectation that backends will offer Transport-Encoding and your computer will use filesystem compression to store it — with this being almost the perfect use-case for both features.)
-----
There's also Avro, which fails your point #1 (it's a binary format) but that binary format is a lossless alternate encoding of what's canonically a JSON document, and there are both simple CLI tools / and small, free, high-quality libraries that can map back and forth between the "raw" JSON document and the Avro-encoded file. At any time, you can decode the Avro-encoded file to text, to examine/modify it in a text editor.
The data-warehouse ecosystem already standardized on Avro as its data interchange format. And spreadsheets are just tiny data warehouses. So why not? ;)
"Self-describing formats like JSON Lines are big... but when you compress them, they go back to being small."
For CSV file replacements, I'd expect something like "one JSON array per line, all values must be JSON scalars". In that case, it's not much larger than a CSV, especially one using quotes already for string values.
But this demonstrates the problem with JSON for CSV, I suppose. Is each line an object? Is it wrapped in a top-level array or not? If it is objects, do the objects have to be one line? If it is an object, where do we put field order? The whole problem we're trying to solve with CSV is that it's not a format, it's a family of formats, but without some authority coming in and declaring a specialized JSON format we end up with a family of JSON formats to replace CSV as well. I'd still say it's a step up; at least the family of JSON formats is unambiguously parseable and the correct string values will pop out. But it's less of a full solution than I'd like.
(It wouldn't even have to be that much of an authority necessarily, but certainly more than "The HN user named jerf declares it to be thus." Though I suppose if I registered "csvjson.org" or some obvious variant and put up a suitably professional-looking page that might just do the trick. I know of a few other "standards" that don't seem to be much more than that. Technically, even JSON itself wasn't much more than that for a lot of its run, though it is an IETF standard now.)
To be fair it's not an objectionable format. Using line breaks to separate objects makes it streamable, and you don't need to enclose the whole thing in an array to make it a valid JSON document.
That is not quite a CSV replacement. I use it for things with objects and stuff all the time. To be a CSV replacement you really need to add that each line must be a JSON array, and that it can only have scalars in it (no sub-arrays or objects). That would be a decent enough replacement for CSV itself. Not perfect, but the CSV "standard" is already a nightmare at the edge anyhow and honestly a lot of it can't be fixed anyway, so, this is probably as good as it could get.
> that it can only have scalars in it (no sub-arrays or objects)
I see CSV files that contain JSON arrays/objects in their fields all the time. Mainly from exporting Postgres tables that contain json/jsonb-typed columns. Are you saying that these aren't valid CSVs?
> But this demonstrates the problem with JSON for CSV, I suppose. Is each line an object?
How is that not a problem with every data serialization format? It does me no real good if I have an XML schema and a corresponding file. If I don't know what those elements and attributes represent I'm not really any better off.
It's not like JSON or XML can meaningfully be marshaled back into objects for use generically without knowledge of what is represented. There are generic JSON and XML readers that allow you to parse the data elements sure, but so, too, do generic CSV readers like C#'s CsvHelper or Python's csv. In all cases you have to know what the object turns into in the application before the serialized data is useful.
And, yes, CSV has slightly differing formats, but so does JSON. Date formats are conventionally ISO 8601, but that's not in the spec. That's why Microsoft got away with proprietary date formats in System.Text.Json. XML isn't really any better.
Sorry, that should be System.Runtime.Serialization.Json. System.Text.Json is the newer class that replaced it.
In .Net Framework 4.6 and earlier, the only built-in JSON serializer in the .Net Framework was System.Runtime.Serialization.Json.DataContractJsonSerializer.
You can still see it. If you're on Windows 10, run Windows Powershell v5.1 and run:
Microsoft didn't fix their weird JSON serialization until quite late. They may have back ported it to the .Net Framework, but they've deleted that documentation. Powershell v6 and v7 include the newer classes that are properly behaved. This is why Json.NET used to be so popular and ubiquitous for C# and ASP applications. It generated JSON like most web applications do, not the way Microsoft's wonky class did. Indeed, I believe it may be what System.Text.Json is based on.
Oh that one - yeah I've always steered clear of DataContractJsonSerializer. Never understood why they did it so weird.
To be fair, RFC 3339 wasn't even published back when this class was implemented (in .NET 3.5) so I guess they just went with whatever worked for their needs. ¯\_(ツ)_/¯
I'd be quicker to believe that it's because 2007 was still in the middle of Steve Ballmer's Microsoft, where embrace-extend-extinguish was their de jure practice.
I have wondered about a file format where a parser could be specified for [at the start of] each line. You could even have different json parsers with different well-characterized limits and relative speeds. Formats could change over time on a line-by-line basis, without being locked into a full-file IDL or similar.
JSON Lines is a specified format that answers those questions. https://jsonlines.org/ Seems like it qualifies to the level of authority you're requiring.
Take an Excel file and change the extension to .zip, then extract the contents. You will see that it is a collection of XML files. Therefore it should be reasonable to conclude that this approach can work for Excel sized datasets.
However it is not particularly readable/diff-able if this is part of your use case.
To be fair, xlsx came out with Office 2007, so it's not exactly 'new' anymore. Perhaps at this point it's reasonable for 'excel file' to mean the one that's been the default for 14 years?
>the fact that both of its separators (newlines and commas) can appear as-is inside column values, with a different meaning, if those column-values are quoted, means that there's no way to parallelize CSV processing, because there's no way to read-ahead and "chunk" a CSV purely lexically
Yes, this is a major pain. It can be avoided by using Tab separated value (TSV) files, which don't use escaping. But then you can't store Tabs or carriage returns in your data. Also there is no way to store metadata in TSV.
JSON is far from ideal for storing 2D data tables as it is a tree. This means it is much more verbose than it needs to be. The same is also true for XML.
ASCII has special delimiters 0x1E Record Separator and 0x1F Unit Separator to avoid conflicting with values, but they have never gained widespread adoption.
More generally than editor support, how is the average person expected to type them? I could easily add them to my keyboard, but if "editor support" turned out to mean "Edit > Insert > Record Separator Character" and I was now bound only to editors that support this, absolutely nobody is going to bother. They specifically need to be trivially typable in exactly the way commas and newlines are.
Realistically I don't think anyone (at least on HN) would presume that editor support would be so weirdly obtuse, but it's not exactly well-defined here, either.
I'm mostly just thinking out loud, here, not really critiquing anyone.
Actually, there is already a standard way for someone to type these with the control key. To be specific, they are:
FS: ^\ GS: ^] RS: ^^ US: ^_
Since these delimiters are not much used, on many terminals ^\ is mapped to send SIGQUIT. Use 'stty quit undef' and you can then type this at the keyboard. This works on Linux and OS X, and likely most other systems. Note that with most keyboards you are also using shift for RS and US. For emacs, prefix with C-q so that it inserts the literal.
Other posters have noted that they have ASCII separators recently, I have as recently as 2018 internally at a FAANG. So they are not completely unknown in the wild.
which modifier would that be, Alt/Ctrl or Cmd? Also there is a chance that this combo will not work on some platforms, some may already be in use as a hotkey.
I worked on a project that used these in a proprietary file format. It was actually quite nice. There was no escaping required because they simply stripped those characters out of the actual content.
The other advantage not being discussed here is that once you have a csv parser, it can be trivially generalized to use any character as a delimiter, and you can usually find a character that isn't found in the data you're working with unless you're working with binary data. I work on a legacy system where passing delimited files around is usually the integration method of the day, and if you can't delimit with commas (common) then you just delimit with tabs. If you can't delimit with tabs or commas, then delimit with pipes, which are rarely if ever used outside of code. Or use tildes, or grave accents, or...
Doing this avoids quoting, which avoids an entire class of problems. And yes, if you're passing multi-megabyte files around on a regular basis, or transferring giant blobs of text or binary data, or looking for a storage format with metadata then csvs are an absolutely awful choice, but as with any problem you pick the right tool for the job, and csvs are a great tool when you want to easily move some structured alphanumeric data around in a format that's trivial to parse in any language and widely supported.
>it can be trivially generalized to use any character as a delimiter
But then you have the additional problem of trying to work out what the delimiter is. You can take an educated guess, but this isn't going to be 100% reliable.
This comes back to "picking the right tool for the job". If you're ingesting csvs/tsvs, you're expecting a given format anyway. If you expect the wrong delimiter, the worst thing that happens is you detect that the file has the wrong number of columns and fail out. Even if you're using a more structured format, if you ingest a file containing the wrong dataset, your process will still fail because the correct fields aren't there. No matter what your format is, you can't recover from somebody sending you the wrong file.
If you interpret "CSV" as purely comma seperated values then maybe. But in my bubble "CSV" means textfiles that are separated by some separator. Be it tabs, spaces, commas, or any other ASCII character. Some are more usable then others, if you have commas in your data then use tabs. If you have tabs use Form Feed or Record Separator or vertical tabs ... and so on.
Of course this is not always applicable, since you sometimes don't control the format you get your data.
> But in my bubble "CSV" means textfiles that are separated by some separator. Be it tabs, spaces, commas, or any other ASCII character. Some are more usable then others, if you have commas in your data then use tabs. If you have tabs use Form Feed or Record Separator or vertical tabs ... and so on. Of course this is not always applicable, since you sometimes don't control the format you get your data.
When I see CSV parsers like https://www.papaparse.com/ that even try to support comments and empty lines in the format, I wonder if it'd really be that bad to just raise an error on anything that doesn't fit RFC 4180[1], with explanation on where and how the file is corrupted. Push the issue to the writers of such files.
Then I remember the Robustness Principle[2] and I chill a little.
It seems quite common in some European countries to use semi-colons as the delimiter instead of commas (because they use commas as the decimal separator?), adding a new level of fun to parsing. In Easy Data Transform we count the number of commas, semi-colons, tabs in the file to make an educated guess as the delimiter (which the user can override).
Semicolon-separated "CSV" is a quirk of Microsoft Excel, which they have never fixed. I figure Microsoft would prefer that people use XLS instead of CSV.
In theory, yes, to be pedantic, but for example, LibreOffice saves its exported CSVs by default as tab delimited. You actually have to manually specify you want commas to get those.
> LibreOffice saves its exported CSVs by default as tab delimited
Maybe it's actually presenting what you've selected last?
It's giving me comma as the default separator, and it's the first option in the dropdown. Tab is the 3rd option.
>> Those other things have different names like TSV
That depends on the writer. I've gotten what should be named PTVs (pipe-terminated values) as CSVs. I can understand how it happened. If the underlying software outputs PTVs, you don't want to bother converting that because you're working in a legacy language that's a pain to work with (the type where identifiers can't be longer than 4 chars), and you want the user to be able to double-click on it and have it open in a spreadsheet without prior configuration, you just push the issue to the reader of the file, since by tradition readers are already quite tolerant of format differences...
Of course, there'll always be the case where the reader is simply not tolerant enough, like when the escaping syntax differs. There doesn't seem to be a way to get LO Calc to interpret "foo|bar\|baz|" as cells "foo" and "bar|baz", for example.
And then Excel imports them all using heuristics when you select "Type: CSV". So you'll never train anyone on the demand end of these documents that they're called anything besides CSVs.
Technically true, but like I said "CSV" is more a term for human readable data, with some delimiter in them. Maybe it's a comma, maybe not. In every case you need someonne to look at it. If you want a machine-machine data protocol you can use XML or JSON if it needs to be somewhat human readable.
Commas, quotes, colons, newlines, and hyphens (JSON and YAML separators) can occur in content as well. How is that any different from separators occurring in csvs?
In practice, I used TSVs a lot, as tabs do not usually occur in most data. Alternatively, you could use pipes (|) or control characters as field or row separators.
CSV does not have date or currency types. Those are an Excel conventional microformat overlaid on top of CSV.
The semantics of CSV — how it decodes, if you don't have an application making additional configured assumptions or using heuristics to detect microformats — is just N columns, one of header names, N of raw text cells. Nothing more, nothing less. No schema defining what a column's type is.
You misunderstood my point. CSV, for all its faults (and I am intimately aware of them), is ubiquitous. You're not going to replace a widespread standard with something that only offers a tiny incremental improvement. Ubiquitous is a feature, and generally trumps all other features.
If you want something better than CSV to take off, at the very least it needs to solve the common pain points of CSV. JSON doesn't - all it brings to the table is the ability to distinguish number vs text. That's a yawn.
> If you want something better than CSV to take off...JSON doesn't...
Agreed. I don't ever see something better than CSV taking off (having worked on this problem for better part of a decade) unless you could get to the point where you've got a largely similar notation, but a richer semantic grammar powered by something like Schema.org.
In other words, a strongly typed CSV with a community backed schema. I think GPT-3 and the like are a big wild card here. I can't describe how, but I expect in 20 years we will be using something better for storing data and will look back and saw "CSV used to be the best but then GPT-N _____________".
> If you want something better than CSV to take off, at the very least it needs to solve the common pain points of CSV. JSON doesn't - all it brings to the table is the ability to distinguish number vs text. That's a yawn.
It brings the ability to parse in parallel - that's a big deal. And while number vs text might not be a huge difference in theory, in practice it eliminates what, 95% of real-world parse problems?
At least in my experience, I don't see a lot of trouble with number vs string in CSV data. You convert the string to a number or you don't. The pain points are usually dates or currencies - the same problem I usually have with JSON, because there's no standard format.
I think you could fix most of the pain of CSV simply by adding a second header row which defines the type of the column, using a common vocabulary of types. TEXT, NUMBER, ISO8861-DATE, etc.
> I don't see a lot of trouble with number vs string in CSV data. You convert the string to a number or you don't.
The trouble is that a lot of existing software (not just Excel) won't properly roundtrip text that looks like numbers and will e.g. strip the leading zero from phone numbers, or worse, change the last digit to make a number that exists in floating point.
> The pain points are usually dates or currencies - the same problem I usually have with JSON, because there's no standard format.
Hmm, I've never had a problem with ISO8861 dates - a string is either a valid date or not, it's very rare for someone to "accidentally" put data in ISO8861 format when it's not actually a date. Dates without timezones can cause problems, but that's more of a semantic issue than a serialization issue. What are the problems that you get?
I can see how currencies could be an issue with the lack of a standardised fixed-precision type. But in my experience they're an order of magnitude less common than issues with phone numbers, postal codes, and the like.
> I think you could fix most of the pain of CSV simply by adding a second header row which defines the type of the column, using a common vocabulary of types. TEXT, NUMBER, ISO8861-DATE, etc.
I'm sure you could. But at that point you're defining a new and incompatible format - you have to make it incompatible, or otherwise people will open these files with a tool that doesn't understand the header format and you're back to square 1 - so you'll pay all the same adoption costs as a completely new format. So it make sense to fix all the issues we can - and a format which can be split and parsed is definitely a major improvement for many use cases.
I envy your life of pristine CSVs and compatible implementations, where JSON doesn't solve any problems. But CSVs have a broad range of uses, and there are, in fact, plenty of implementations that do not follow RFC 4180, and even if everything did there would still be issues like the header row being optional (without an in-band way of telling). JSON is, in practice, much more well-defined.
On text vs numbers, at least some widely-used software (e.g. R, Excel) will try to guess for you. It should be obvious how this might cause problems. Maybe one should turn auto-conversion off (or not use things that don't let you turn it off) and specify which columns are numbers. Some datasets have a lot of columns, so this can be a PITA, even if you do know which ones should be numbers. But the bigger problem is if you have to deal with other people, or the files that they've touched. There are always going to be people that edit data in excel, don't use the right options when they import, etc.
> At least in my experience, I don't see a lot of trouble with number vs string in CSV data.
I definitely had my fair share of trouble with locale-defined number format. Importing a column where a thousand is spelled "1.000", any integer between 1000 and 999999 would be wrongly parsed as a float between 1 and 999, while for any other number (like "0,1" for one tenth, "1.000.000" for a million, "1.000,56" for a thousand euros and change) the parser would give up and keep the string.
I usually have had more luck importing as text, then doing some string replacement of separators before finally converting to number.
JSON solves tons and tons of pain points around CSV... for implementing anything that reads/imports/normalizes CSV data. Just the fact that CSV has multiple standards around the quote/escape/separator/NULL character-sequences, with the choice not get encoded into the document, means that any format that doesn't have that problem is instantly 100x better in terms of not experiencing a disaster of data-garbling-on-import.
The formats you describe sound interesting, but in the big corporate/government world the barriers to adoption are high. For example, the software package Cognos which is owned by IBM and connects to database servers. To extract data from it, you have a choice of csv, Excel or pdf. The pdf is only used if you have done a perfect job creating a report in Cognos and don't need to edit/adjust it further in Excel. So really, only 2 output formats.
I realize that this is a specific use case here, but I was on Cognos for years, and then when they started shifting over to Tableau, it wasn't any better.. csv, MS formats, proprietary formats, etc.
Cognos is a BI tool, and the formats you are talking about are for rendering reports, not data. If you're picking data formats based on being able to load & edit them in to Excel, you're already doomed.
The entire article is about replacing CSVs for exchanging data exported from Excel... so why wouldn't he be picking data formats based on being able to load and edit them in Excel? If you're trying to solve this problem in a way that EXCLUDES Excel, you're already doomed. The business world will laugh at you and continue on their merry CSV way.
>The biggest and most thorny problem to solve is the people problem: how do you convince people to stop creating new CSVs when they’ve never done things any other way? Fortunately - for this problem, anyway - most of the world’s business data is born in one of a handful of programs that are owned by an increasingly small number of companies. If Microsoft and Salesforce were somehow convinced to move away from CSV support in Excel and Tableau, a large portion of business users would move to a successor format as a matter of course. Of course, it’s debatable whether that kind of a change is in the best interest of those companies, but I’m cautiously optimistic.
> The entire article is about replacing CSVs for exchanging data exported from Excel...
No, it's not. It's about replacing CSVs for exchanging data. It mentions that CSVs often are the product of someone exporting data from a spreadsheet or doing a table dump, and how just doing that tends to create a ton of problems, but Excel is an example, not the subject matter of the article.
> The business world will laugh at you and continue on their merry CSV way.
The business world pays me a lot of money to teach them not to use CSVs.
Yes. You could get a long way with a text format in which:
-the first line is always a header
-fields are separated by Unit separator characters
-records are separated by Record separator characters
-encoding is UTF8
If you wanted to get fancy you could also have:
-comment lines
-column metadata (e.g. column 0 is an ISO date, column 2 is text, column 3 is an integer)
Both the above could start with a Unicode character unlikely to be used for anything else.
I think that would avoid 99% of the pain of CSV files. The downside is that the use of things like the Unit separator mean that it wouldn't be easy to create/edit manually.
I mean, you'd have to be using a pretty terrible tool for it not being able to handle that, and I suspect if such an approach were to become prevalent, that tool would either fix the glitch or become largely unused.
As with pretty much everything else computing, the world suffers because Microsoft has been dumping terrible tools on it for decades, and people just take their garbage as the way things have to be.
Yup. I mean, if you're going to go with a text encoding, you might want to, you know, use the features of the text encoding that were put there explicitly for said purpose...
...or you could invent abominations like CSV, TSV, etc. ;-)
If only I could type them on my keyboard. (I think this is a big part of why CSV is the way it is — people want to be able to hand-edit it, or at least hand-produce small test datasets to test the systems on the other end.)
The funny thing is you can type any character on a keyboard. It's the same weird disconnect about "text file formats are human-readable". Either way you need a tool that can render & read the format in question. It does't much matter what format you actually store the data in, because you don't read & write the bites directly.
> There's also Avro, which fails your point #1 (it's a binary format) but that binary format is a lossless alternate encoding of what's canonically a JSON document, and there are both simple CLI tools / and small, free, high-quality libraries that can map back and forth between the "raw" JSON document and the Avro-encoded file. At any time, you can decode the Avro-encoded file to text, to examine/modify it in a text editor.
Avro is not a lossless alternate encoding of what's canonically a JSON document. Yes Avro supports a JSON encoding, but it's not canonical.
In general though, you're point about Avro being able to be represented as text is valid, and applies to practically any binary formThere's also Avro, which fails your point #1 (it's a binary format) but that binary format is a lossless alternate encoding of what's canonically a JSON document, and there are both simple CLI tools / and small, free, high-quality libraries that can map back and forth between the "raw" JSON document and the Avro-encoded file. At any time, you can decode the Avro-encoded file to text, to examine/modify it in a text editor.at, which is why the whole "but it needs to be a text format" argument is garbage.
> The data-warehouse ecosystem already standardized on Avro as its data interchange format. And spreadsheets are just tiny data warehouses. So why not? ;)
I wish that the data-warehouse ecosystem standardized on anything. ;-)
That said, there are plenty of good reasons why a data-warehouse standard would not be advisable for spreadsheets.
> Yes Avro supports a JSON encoding, but it's not canonical.
To be clear, I'm not talking about using an Avro library to encode data to JSON. I'm saying that when you decode an Avro document, the result that comes out — presuming you don't tell the Avro decoder anything special about custom types your runtime supports and how it should map them — is a JSON document.
Where, by "JSON document" here, I don't mean "a JSON-encoded text string", but rather an in-memory ADT that has the exact set of types that exist in JSON, no more and no less. The sum-type of (JSONArray | JSONObject | String | Integer | Float | true | false | null). The "top" expression type recognized by a JSON parser. Some might call such a document a "JSON DOM." But usually it's a "JSON document", same as how the ADT you get by parsing XML is usually referred to as an "XML document."
Or, to put that another way, Avro is a way to encode JSON-typed data, just as "JSON text", or https://bsonspec.org/, is a way to encode JSON-typed data. They're all alternative encodings that have equivalent lossless encoding power over the same supported inputs.
> I'm saying that when you decode an Avro document, the result that comes out (presuming you don't tell the Avro decoder anything special about custom types your runtime supports and how it should map them) is a JSON document.
Semantic point: it's not a "document".
There are tools which will decode Avro and output the data in JSON (typically using the JSON encoding of Avro: https://avro.apache.org/docs/current/spec.html#json_encoding), but the ADT that is created is by no means a JSON document. The ADT that is created has more complex semantics than JSON; JSON is not the canonical representation.
> By which I don't mean JSON-encoded text, but rather an in-memory ADT that has the exact set of types that exist in JSON, no more and no less.
Except Avro has data types that are not the exact set of types that exist in JSON. The first clue on this might be that the Avro spec includes mappings that list how primitive Avro types are mapped to JSON types.
> Or, to put that another way, Avro is a way to encode JSON-typed data, just as "JSON text", or https://bsonspec.org/, is a way to encode JSON-typed data
BSON, by design, was meant to be a more efficient way to encode JSON data, so yes, it is a way to encode JSON-typed data. Avro, however, was not defined as a way to encode JSON data. It was defined as a way to encode data (with a degree of specialization for the case of Hadoop sequence files, where you are generally storing a large number of small records in one file).
A simple counter example: Avro has a "float" type, which is a 32-bit IEEE 754 floating point number. Neither JSON nor BSON have that type.
Technically, JSON doesn't really have types, it has values, but even if you pretend that JavaScript's types are JSON's types, there's nothing "canonical" about JavaScript's types for Avro.
Yes, you can represent JSON data in Avro, and Avro in JSON, much as you can represent data in two different serialization formats. Avro's data model is very much defined independently of JSON's data model (as you'd expect).
> The first clue on this might be that the Avro spec includes mappings that list how primitive Avro types are mapped to JSON types.
My understanding was always:
1. that the "primitive Avro types" are Avro's wire types, which are separate from its representable domain types. (Sort of like how RLE-ified data has wire types of "literal string" and "repeat literal N times".)
2. that any data that would not be valid as input to a JSON encoder, is not valid as input to an Avro encoder, because its wire types are defined in terms of their a mapping from a set of domain types that are exactly the set of domain types accepted by JSON encoders (whether they're explicitly noted as being those or not.)
Or, to put that another way: an Avro schema is — besides a validation step that constrains your data into a slightly-more-normalized/cleaned format — mostly a big fat hint for how to most-efficiently pack an (IMHO strictly JSONly-typed) value into a binary encoding. Differences between "long" and "int" on the wire aren't meant to decode to different domain types (at least, by default); they're just meant to restrict the data's allowed values (like a SQL DOMAIN constraint) in ways that allow it to be more predictable, and so to be wire-encoded more optimally.
Let me lay out some evidence for that assertion:
• Avro supports specifying e.g. "bytes" vs. {"array": "byte"} — there's literally no domain-type difference in those! But one is a wire-encoding optimization over the other.
• Avro has a "default" property, and this property—as part of the JSON-typed schema—can only take on JSON-typed values. Do you think this is an implementation constraint, or a design choice?
• Avro's enum type's "symbols" array? Once again, defined by (and therefore limited to) JSON string values.
• Avro doesn't implement an arbitrary-precision integer type, even though its wire-encoding for integers would support one just fine. Why? Seemingly only because JSON doesn't have an arbitrary-precision integer type (because JavaScript doesn't have a native BigNum type); nor does JavaScript/JSON have any obvious type to O(1)-efficiently deserialize a BigNum out into. (Deserializing BigNums to strings wouldn't be O(1).) Every other language offers a clean 1:1 mapping for bignums, but JavaScript doesn't, so JSON didn't, so Avro doesn't.
• And why do you think Avro schemas are stored as embedded explicitly-defined-to-be-JSON documents within the root-level record / .avsc file, anyway? This means that you are required to have a JSON decoder around (either at decode time, or at decoder codegen time) to decode Avro documents. Why would this be, if not because the Avro implementation is (ot at least originally was) expected to decode the Avro document's wire types into the JSON library's already-defined ADTs, relying on e.g. having those "default"-parameter values already loaded in in JSON-value format from the schema's decode-output, ready to be dropped seamlessly into the resulting Avro decode-output?
And the biggest knock-down argument I'm aware of:
• Avro "string" doesn't support "\u0000". Why not? Because as you've said, Avro has a "JSON encoding", which specifies one-to-one mapping for strings; and JSON doesn't support "\u0000" in strings. (Just ask Postgres's jsonb type about that.) Since an Avro string containing "\u0000" wouldn't round-trip losslessly between the JSON and binary wire-encodings, it's not allowed in strings in the binary encoding.
Since it is a serialization format, Avro's types are its wire types. However, the primitive types are just a subset of the types that Avro supports.
Based on these comments, my best guess is you got the idea that Avro was for encoding JSON because the schema declaration is encoded in JSON, but that's not nearly the same as the data model. There are some terrible implementations of Avro libraries out there that use JSON as some kind of middleware, but that's not how Avro actually works.
If there's a type model it is derived from at all, it's the Java type model.
"byte" is not a valid type in Avro. There is only "bytes", and the domain model reflects this. You can't work with individual "byte" of a "bytes" object.
Default values are encoded in the schema, and so that does limit what kind of default values you can have, but again this is a limitation derived from the schema being defined in JSON, and how the schema language was defined in general. So your defaults have to be represented as JSON literals, but they don't even necessarily share the type of the JSON literal (e.g. a field defined as: '{"name": "foo", "type": "long", "default":1}' does not have the same default value as '{"name": "bar", "type": "int", "default":1}", because "foo" has a default value that is a long while "bar" has one that is an "int"). Note that "default values" are a property of the type, and only apply to elements inside complex data types. JSON has no such equivalent concept.
Avro's type model does have an arbitrary precision type that doesn't correlate to anything in JSON: the "decimal" logical type.
You aren't required to use a JSON decoder to decode Avro documents, nor are you required to use a .avsc file. The Avro schema file is just the standard way to represent a schema. If you have the schema, you don't need the file. JSON schema files are one of the poorer choices in the Avro design, but you'll notice that the schema is defined the way it is specifically so that it can cover a type model well outside of JSON. You'll also notice the names of types in Avro don't directly correlate to names of types in JSON.
* The \u0000 thing is a bug in avro tools, but there is nothing about the spec that prohibits having \u0000 in your strings.
I feel like in general this is like a retcon exercise, where you've reached a conclusion and are drawing evidence to prove it, while ignoring the information that contradicts. I spoke with Cutting a fair bit when he came up with Avro, and I can assure you, while the schema language does very intentionally use JSON, Avro is not a representation for JSON types.
As someone who's used Avro [0], it's a pain because Avro records must be ordered, but JSON by definition is unordered. Avro's "JSON" format is more of a JSON-like format. At one point, when I was writing a script that would ingest an Avro record and then output a new one, I had to fiddle with things to make Python use an OrderedDict [1] so the new record would be output in the right order.
[0] though the last time I touched it was in 2014, so things might've changed since
[1] as per [0], this was before Python dicts were ordered by default
I'm not sure what you mean by Avro records "must" be ordered. If you mean that the serialization format specifies the ordering of the fields, then yes, that is true, but that's an advantage in terms of compactness and processing efficiency (https://avro.apache.org/docs/current/spec.html#order). If you don't like it though, there are other formats like protobuf and thrift that have no such requirement, at the cost of ~3 bits per field, which can be a comparatively efficient trade off.
The problem is that fields aren't ordered in the JSON spec, so fields being ordered in Avro's dialect of JSON automatically makes it non-standard JSON which makes it difficult to use standard tooling with Avro JSON.
OIC. Yeah, in general the JSON representation is really just for cases where some tool can't handle the binary. Think of it as an escape hatch more than a core feature.
If you're using Avro as a way to do JSON, that's definitely a bad choice. That's why I found the notion of Avro as a way to work with JSON documents as really... odd. (https://news.ycombinator.com/item?id=28222491) It's a decent way to manage data transfers (and I continue to argue a much better one that CSV or JSON), but as a "lossless alternate encoding of what's canonically a JSON document", it's a bad fit.
It's not possible to read ahead and chunk with 100% assurance it will always work, but libraries like pandas and R's data.table do a reasonable job of reading in the first X rows, doing some usually correct type inference on the columns, and then read ahead and chunking the rest of the rows.
For what it's worth, I totally agree something like compressed json lines is a better data exchange format, but part of why csv remains as universal and supported as it is is that so much existing data storage applications export to either csv or excel and that's about it. So any ETL system that can't strictly control the source of its input data has no choice but to support csv.
> means that there's no way to parallelize CSV processing, because there's no way to read-ahead and "chunk" a CSV purely lexically.
There's a subset of CSV that forbids escapes that is super fast to parse. All fast CSV parsers I'm aware of take advantage of this subset. I try to never ever publish a CSV that has quotes, and always aim for a more restrictive grammar that is cleaner, better thought out data.
>CSV is actually horrible at parse-time vs. other data formats
I find this really hard to believe given it's a simple enough syntax. And parsing is usually not the limiting factor, usually fast enough to not be noticed alongside interpreting or loading the source data. Every (much more sophisticated) compiler I can think of uses a linear parser based on this assumption.
CSV is far from perfect, but it's nice that I can easily work with them without needing any libraries. All I need is file I/O and the ability to split strings. It doesn't get much simpler.
I'll admit though that "import JSON" and then being able to essentially convert the entire file into a dictionary is nice if the data has more structure to it.
The real advantage of CSV, in my mind, is that if the CSV is valid and normal then it's going to be a rectangular dataset (ignoring semantics within the dataset).
If I import JSON data I have no idea what shape the result will be in, and it requires a separate standard to let me know about columns and rows and validation can get complicated.
CSV is that way too. There's nothing that says each row has to have the same number of what maps out to columns, or anything that tells me what those columns really represent (there's no schema). You could use the first line of the CSV doc to say what each tuple is named, but that's not a standard or anything. And without a schema, it certainly could be easy to lose the metadata of the info the CSV file is trying to represent. Is this column just a bunch of numbers or a date format? (for example). CSV is OK for importing and exporting data across systems that know what the format is without the help of a schema, but anything else and you run into a pile of edge cases. Even using a CSV file to import into a spreadsheet works usually but context is often lost.
and the result is we have 50 different rogue JSON libraries instead of an STL solution. Until the STL folks wake up, boost::split can deal with the CSV.
A string split function is a poor choice if there's any possibility the CSV file contains quoted fields. Robust handling of both CSV and JSON requires a parser. In my experience, CSV can actually be trickier than JSON to parse because there are so many edge cases, alternatives, and ambiguities.
> All I need is file I/O and the ability to split strings.
...until there is a newline inside a field.
The moronic quoting mechanism of CSV is one half of the problem; people like you, who try to parse it by "just splitting strings" is the other half. The third half is that it's locale dependent and after 30+ years, people still don't use Unicode.
As printed above, I'm not writing software where that would be a problem. In general if I was writing a commercial application or something in production where I don't control all the inputs I would agree 100%, but I'm lucky to not have those problems. I could use a bunch of libraries and additional code to try to catch non-existent errors I don't have, or fix my actual problems and move on.
I appreciate the perspective though for sure. I'm guilty of the same thing on HN, where I assume people have the same uses as me when they're writing code that has to be extremely robust or blazingly fast.
You've assumed an awful lot about my use cases. The data I deal with in .CSV form is always pre-processed and doesn't have any of the minefield occurrences you've mentioned. There can't be a newline or anything like that in an input. In my decade of using .CSV files daily, I've only had one tertiary system where that is a problem.
Also, when doing interactive work, it's a bit different than writing production IT software.
Define parsing. I'm still going through GB of data in thousands of files and building complex reports, and data structures for scientific analysis. Just because I don't need thousands of lines of code to navigate edge cases doesn't mean I'm not parsing .CSV files.
I personally believe that at least SQLite matches all those criteria :
"1) A truly open format is available" : sqlite is open-source, MIT-licensed, and well specified (even though I am usually not so happy with its weak typing approach, yet in this case this precisely enables a 100% correspondance between CSV and sqlite since CSV has also no typing at all...)
"2) Applications have a speed increase from using csvs" : I think it should be obvious to everyone that this is the case...
"3) The new format should not be grossly larger than the one it is replacing" : this is also the case
It's worth mentioning that SQLite is committed by its publishers to be backward-compatible out to 2050 and is the recommended by the Library of Congress as a safe long-term storage format (as are XML, JSON, and CSV). https://www.sqlite.org/locrsf.html
For single tables a database is probably overkill, but it's nice to have around when you need something reasonably powerful without being overly complex or hard to get started with.
I suppose if I was to recommend some "safe long-term storage formats" I, too, would choose things that are readable in plain ASCII and/or have open source roots.
Shoutout to the Library of Congress for doing the real heavy lifting here?
How about an app that opens an SQLite DB so that strings and numbers can easily be edited, and then on file save, SQLite's internal pointers and markers are updated ? Maybe an Emacs mode ?
There is a definite demand for some import format that you can trivially edit - CSV excels at this right now and JSON isn't that bad. Binary dump formats definitely do have a time and a place but there is also a separate need for trivially human readable formats.
In some incredibly pedantic sense, sure. But the difference is that every computer comes with what a human needs to read a CSV file. That's not true of an SQLite database.
Sure, could be. In about the same sense that it could be that every computer comes with a free and open source operating system. Could be; it's not theoretically impossible -- but...
The author appears to be a consultant selling data prep/transformation services. As long as the market is using CSVs, he’s forced to use CSVs, at least as end-of-pipeline inputs and outputs.
Of course, “people optimize their workflows for something other than making my job easy” is a common, but also rarely persuasive in motivating action from others with different jobs, complaint.
The consultant should be less concerned about what makes their life/job easier, but what makes their client's job easier by working with the consultant. If the client has to jump through hoops to serve consultant1 data in a format they are not used to, but consultant2 will accept the data they have already available or can be exported in a format familiar to them, the client will prefer consultant2.
Sure, but if everyone is blocked from using the format consultant1 opposes because the industry wakes up one day and abandons it utterly, then consultant2 no longer has an advantage, which is why someone who would be consultant1 except that they don't want to lose business to consultant2 lobbies for it.
My point is that the consultant needs to understand the capabilities of the client more so than dictating terms just because it is convenient to the consultant. The scenario you describe is not out of the realm of possibility, but it's also not unheard of to have a client with such antiquated systems that there is no using of modern anything. Hence, why CSV will never die.
No, SQLite's dynamic data types would silently coerce data just like opening a CSV directly with Excel does.
The advantage of CSV is that it's as accurate as your plain text representation of your data can be. Since binary data can be represented by character data, that's 100% accurate. As soon as you introduce a storage format that has made assumptions about the type of data being stored, you've lost flexibility.
SQLite is not intended for data serialization. It's intended for data storage to be read back by essentially the same application in the same environment.
Excel's default insistence on trying to coerce the data into what it expects the data to be is the source of many of the problems of people using CSV. Excel will even ignore quotes around values if it thinks that the value inside might be a number. I often work with CSV files that contain 17 digit IDs that use numeric characters "01233998742021925" which Excel will convert lossily into scientific notation rather than leaving it as a string. There are ways to override that but they are tedious and people don't do it by default.
I'd be satisfied if Excel was just a little less aggressive about it's data coercion on import.
> No, SQLite's dynamic data types would silently coerce data just like opening a CSV directly with Excel does.
SQLite's "dynamic data types" coerce data on input, not output. Once the data is in sqlite the way you wanted it, excel has no interpretation to perform, except insofar as really really wanting dates.
> The advantage of CSV is that it's as accurate as your plain text representation of your data can be.
Yeah nah.
> SQLite is not intended for data serialization. It's intended for data storage to be read back by essentially the same application in the same environment.
Data analysis, exchange, container, and archiving, are literally all use cases listed as "appropriate uses for sqlite" in "Situations Where SQLite Works Well" on the official sqlite website: https://www.sqlite.org/whentouse.html
How is a csv not the most accurate representation of the data? If you trust the other agent encoded it properly in the db, then sure. Your flippant dismissal was inappropriate in tone and detracted from the rest of your opinion.
Cockiness tells me that you’re insecure about your knowledge, not that you know more than GP.
> How is a csv not the most accurate representation of the data? If you trust the other agent encoded it properly in the db, then sure.
The idea that a CSV would be more likely to be correctly encoded than a DB is hilarious, thanks for the laugh. But that you were confident enough to seriously put it in writing shows how little experience you have with CSV.
Yep, you got me. I'm actually a trapeze artist moonlighting on these forums.
A CSV file represents the exact digits that are to be stored. You have unlimited precision. You could even store irrational numbers, equations, or mix data types in a column. OTOH, you have to make sure the delimiting character is not present in the data - that can be pretty easy, if you use the field-separator character in ASCII, or even just a \t. I've even seen people terminate fields with ^8675309| because they felt confident no data would contain Jenny's number.
A database, like Excel, likes to conform data. This is usually awesome! But sometimes, it's not.
"A bird, like homo sapiens, has two feet" doesn't mean that birds are humans or humans are birds either. It means that in this respect, birds and humans are alike. Which is what GP meant: With respect to conforming data, databases behave just like Excel.
Language note: I think it's the fact that there are two commas around the inserted-clause [terminology?] ", like Excel," that does it. If there were only one comma, before or after "like Excel" it would read the way you read it. Can be tricky for non-native (and sometimes also native) speakers.
It's strictly dependent on the table definitions you use. A column with no type (the simplest way to use them) will take data in any of the natively supported types—whatever you actually insert in your SQL statement—and will not attempt any translation.¹
If you do add column types to your tables as in traditional databases, then of course you should expect some varying behavior based on that.
I can imagine some potential for issues from people expecting to insert exact decimal numbers and choosing floating-point by accident, perhaps, or using integer types for actually-text numeric ID fields. If you go all-strings on input with no column affinity, then you'll get the same strings on output, though, so that's isomorphic to good-CSV.
¹ https://www.sqlite.org/datatype3.html – §3.1 “If the declared type for a column contains the string "BLOB" or if no type is specified then the column has affinity BLOB.” + §3 “A column with affinity BLOB does not prefer one storage class over another and no attempt is made to coerce data from one storage class into another.”
Just because most popular operating systems have libraries, doesn't yet mean it is portable.
Can I click on it and open it in Excel?
If not then it is not portable for me and for a lot of other people.
The main reason I use CSV is to produce reports that I can either open myself or send to other people so that they can click on it and open themselves and immediately start hacking away.
Excel is still corporate lingua franca and until this changes CSV is not going to retire.
CSV means for me that I can send it to anybody and everybody will know how to use it.
Try sending SQLite binary to people you work with, to your accountant, to your wife. Then share your experience.
I find files where the exact type of whitespace used is meaningful to be a sick joke.
Not all editors make it easy for you to use a tab character.
If only there was some other solution that would not require you to use a character that a lot of editors refuse to insert or even try to convert to spaces...
Agree. But a lot of this is tooling though, whether its adequate and whether its ubiquitously available. e.g. you can diff two directories in Linux, which is super useful, but directories are not text files. The issue is, can the differences be reasonably represented in text formatted output.
Besides, CSV can be written by a human and initially it was the way how data was entered and read, because it is a text file - database. If there is a need for something more convenient, then there are other file types. Nonissue, IMO.
A lot of things 'around here' run off spreadsheets, because relational databases weren't invented here and no-one ever needs more than a few million rows apparently.
I seem to spend half my job on the current project exporting stuff to CSV, running it through my code, then opening the resulting CSV in excel and formatting it a bit and saving as .xlsx again.
Still, at least I don't have to use Visual Basic that way.
There are mature CSV libraries for most major languages that handle 99% of the problems of CSV.
CSV should be better standardized, but ... whatever, what should be done to "fix" CSV is to advertise the proper use of the libraries and the nontrivial aspects of a superficially trivial format.
A format that is trivially useful in 99% of cases is far better than many other "worse is better" things in computing.
> There are mature CSV libraries for most major languages that handle 99% of the problems of CSV.
They really don't. In fact I'd go further and confidently state that they really can't, because tons of mis-parsed CSVs are heuristic judgement values, and those tools don't really have the ability to make those calls.
I've never seen a "mature CSV library for most major language" which'd guess encoding, separators, quoting/escaping, jaggedness, … to say nothing of being able to fix issues like mojibake.
No, they really do. But 99% isn't 100%. Compare Python's csv parser with Go's. The former will prefer a parse for any input, and this is done in a fairly consistent way that at least gives one the opportunity to deal with "malformed" csv. The latter is a strict parser and coughs up an error upon almost any deviation from RFC 4180. I couldn't use the latter in practice because of it. It would just choke on too many csv files I had.
Neither is 0. Python’s csv library does not handle any problem, it applies the separators / quoting / escapes it was given and that’s that. It gives the developer the opportunity to handle things, but if the developer isn’t aware nothing is handled.
If you're just going to quote me out of context and not bother responding to the actual substance of my comment (which was a high level but concrete comparison of two different CSV parsers, demonstrating an important qualitative difference with respect to csv parsers solving problems), then please, don't bother responding.
That's only true if you're trying to send all your data in a single, monolithic CSV.
If you're sending multiple CSVs, you're capable of representing data as well as a relational data store. Which is to say, you're representing your data using a system of data normalization specifically designed to minimalize data duplication. A single CSV represents a single table, and in most cases with intelligent delimiter selection you can represent an entire data set with no more than one character spent between fields or records.
Yes, you do have situations where you're storing losing data density due to using plain text strings, but that's not a limitation particularly unique to CSV for data serialization formats. Additionally, it is a problem that can largely be mitigated by simple text compression. Furthermore, once you switch to a non-text representation, you're limiting yourself to whatever that data representation is. It's easy to represent an arbitrary precision decimal number in plain text. It's hard to find a binary representation that universally represents the same data regardless of the system on the other end. Again, that's not a problem unique to CSVs.
If you're working with an API, object by object, then JSON is certainly going to be better, yes, because you can use the application's object representation. If you're working with bulk data of many disparate, unrelated, complex objects, however, or where you're transferring and entire system, you're not going to do much better than CSV.
> That's only true if you're trying to send all your data in a single, monolithic CSV.
No, that's true in general. There are exceptional cases where it might not be true, but it's true in general.
> Yes, you do have situations where you're storing losing data density due to using plain text strings, but that's not a limitation particularly unique to CSV for data serialization formats.
That there are other inefficient data serialization formats does not mean that it is not an inefficient data serialization format.
> Additionally, it is a problem that can largely be mitigated by simple text compression.
"Simple text compression"? Not sure what you are referring to there. If you are speaking of generalized compression, there's a bunch of information theory pointing out the problem with that thinking.
> Furthermore, once you switch to a non-text representation, you're limiting yourself to whatever that data representation is. It's easy to represent an arbitrary precision decimal number in plain text. It's hard to find a binary representation that universally represents the same data regardless of the system on the other end. Again, that's not a problem unique to CSVs.
It's pretty easy to represent arbitrary precision decimal numbers in non-text formats and they mean the same thing regardless of the system on the other end, but let's say you were right. Great, now represent arbitrary precision binary numbers (you know, the kind that computers work with), in text formats.
> If you're working with an API, object by object, then JSON is certainly going to be better,
Let me stop you right there. It is not "certainly going to be better". JSON is generally worse.
> If you're working with bulk data of many disparate, unrelated, complex objects, however, or where you're transferring and entire system, you're not going to do much better than CSV.
There's a whole world out there of formats that are better than CSV at this job, and in many cases they are A LOT better than CSV. Even Avro, for all its pitfalls, is MUCH better.
> > Additionally, it is a problem that can largely be mitigated by simple text compression.
>
> "Simple text compression"? Not sure what you are referring to there. If you are speaking of generalized compression, there's a bunch of information theory pointing out the problem with that thinking.
For most natural text-only data, .csv.gz is probably close to the limits of what we can do. For numerical or categorical text containing patterns or structure, I agree that better formats may exist (e.g. H.264 for CSV files containing video frames). Not sure how a more general format will solve anything you couldn't already do by converting to a more specific format and tarring everything afterwards.
CSV uses decimal representations of numeric data, which means you are getting 3.5 bits of data for every 8 bits of storage space (and that's assuming you are using a reasonably compact text encoding... if you are using UTF-16, it's 16 bits). Using a binary representation you can store 8 bits of data for every 8 bits of storage space.
CSV uses a variety of date-time formats, but a prevalent one is YYYY-MM-DDThh:mm:ss.sssZ. I'll leave it as an exercise for the reader to determine whether that is as compact as an 8-byte millis since the epoch value.
CSV also requires escaping of separator characters, or quoting of strings (and escaping of quotes), despite ASCII (and therefore UTF-8) having a specific unit separator character already reserved. So you're wasting space for each escape, and effectively wasting symbol space as well (and that's ignoring the other bits of space for record separators, group separators, etc.).
Then there's floating point data...
Honestly, this is a silly argument. CSV is, by design, a format that trades space for other factors. It should not be debatable whether it is a compact format or not.
> CSV uses decimal representations of numeric data, which means you are getting 3.5 bits of data for every 8 bits of storage space (and that's assuming you are using a reasonably compact text encoding... if you are using UTF-16, it's 16 bits). Using a binary representation you can store 8 bits of data for every 8 bits of storage space.
XML, JSON, and YAML all have this issue, too.
> CSV uses a variety of date-time formats, but a prevalent one is YYYY-MM-DDThh:mm:ss.sssZ. I'll leave it as an exercise for the reader to determine whether that is as compact as an 8-byte millis since the epoch value.
This is also identical to XML, YAML and JSON.
And I know what you're about to argue, but JSON's datetime format is not in the spec. The common JSON datetime format is convention, not standard.
> CSV also requires escaping of separator characters, or quoting of strings (and escaping of quotes), despite ASCII (and therefore UTF-8) having a specific unit separator character already reserved. So you're wasting space for each escape, and effectively wasting symbol space as well (and that's ignoring the other bits of space for record separators, group separators, etc.).
This is also identical to XML (escaping XML entities, sometimes having to resort to CDATA), YAML (escaping dashes) and JSON (escaping double quotes).
All you've shown is that CSV has the same limitations that XML, YAML, and JSON have, and those three formats specifically designed and intended for data serialization. Yes, the other formats do have other advantages, but they don't eliminate those three limitations, either.
This is for data serialization, which means it's going to potentially be used with data systems that are wholly foreign separating great distances or great timespans. What data serialization format are you comparing CSV to? What do you think CSV is actually used for?
Are you arguing for straight binary? You know that CSV, XML, YAML and JSON all grew out of the reaction to how inscrutable both binary files and fixed width files were in the 80s and 90s, right? Binary has all sorts of lovely problems you get to work with like endianness and some systems getting confused if they encounter a mid-file EOF. If you don't like the fact that two systems can format text differently, you're going to have a whole lot of fun when you see how they can screw up binary formatting. Nevermind things like, "Hey, here's a binary file from 25 years ago... and nothing can read it and nobody alive knows the format," that you just don't get with plain text.
Yes, you do end up with a wasted space, but the file is in plain text and ZIP compression is a thing if that's actually a concern.
Yes. Though to their credit, some of those work with her numbers, which at least gets you 4 bits out of every 8 bits.
> And I know what you're about to argue, but JSON's datetime format is not in the spec. The common JSON datetime format is convention, not standard.
I'm not sure what argument you thought I was making, or why that comment is relevant.
> All you've shown is that CSV has the same limitations that XML, YAML, and JSON have, and those three formats specifically designed and intended for data serialization. Yes, the other formats do have other advantages, but they don't eliminate those three limitations, either.
I'm not sure what your mean by "eliminate", or why you think it matters that there are other formats with the same design trade offs.
> This is for data serialization, which means it's going to potentially be used with data systems that are wholly foreign separating great distances or great timespans. What data serialization format are you comparing CSV to? What do you think CSV is actually used for?
CSV is used for a variety of purposes. The context of the article is using it for data transfer.
The claim was that it was a compact format for data transfer, which is demonstrably not true.
> Are you arguing for straight binary? You know that CSV, XML, YAML and JSON all grew out of the reaction to how inscrutable both binary files and fixed width files were in the 80s and 90s, right?
I'm not sure what "straight binary" means to you. JSON is, for the most part, a binary encoding standard (just not a particularly good one).
You've got the heritage a bit wrong, as XML was not originally designed for data transfer at all. It was an attempt to simplify the SGML document markup language, and the data transfer aspects were subsequently grafted on. JSON & YAML have a slightly more complicated heritage, but neither was intended as a data transfer format. They've all been pressed in to service for that purpose, for a variety of reasons, that can charitably described as tactically advantageous but strategically flawed.
> Binary has all sorts of lovely problems you get to work with like endianness and some systems getting confused if they encounter a mid-file EOF.
I don't know how to break this to you, but text formats can have endianess (in fact, insanely UTF-8 does!), and systems being confused and whether they are at EOF as well.
> Yes, you do end up with a wasted space, but the file is in plain text and ZIP compression is a thing if that's actually a concern.
Wouldn't ZIP be a binary format, with all the problems and concerns you have with binary formats?
So to summarize what you are saying... "CSV is a compact format because you can compress it if you are concerned and all the space it wastes".
Would it be fair to say then that any binary format is a text format because you can convert the binary into a text representation of the data? ;-)
Parquet (and a few other competitors) have open-source implementations in many languages, are much faster to load than csv, are natively compressed, are strongly typed and don't require parsing...
There are few reasons to continue using csv in this day and age.
The issue is CSV is not compatible with CSV. It's not possible to write a spec that covers all CSV files in the world. CSV means things that are mutually incompatible in the less common cases, and the only way to really parse them correctly is to know which variant generated it. But you can't even tag that variant in the file by your criteria, as existing CSV parsers won't understand it.
Couldn't tools that read CSV files scan them first and see which variant best matches the file? The questions arise about which end of line character(s) are used and how double-quotes and commas are handled. There can't be that many ways to escape them, and there are three sets of end of line characters mentioned - just see which one is used (i.e. don't assume only \n if you run into \r\n or \r alone).
The assumption that most software uses is that the import file will be in the same variant of the format as what that tool exports. That seems to be more of a problem than anything else.
> Couldn't tools that read CSV files scan them first and see which variant best matches the file?
Sure, and they sometimes do that if they have to ingest CSVs whose origin they don't control (although not every system implementor cares enough to do it).
But that's still just a bunch of shitty faillible heuristics which would not be necessary if the format was not so horrible.
It also doesn't prevent a human or other system doing:
cat input1.csv input2.csv > output.csv
resulting in a single file containing multiple formats.
Also, what variant is this:
1,5,Here is a string "" that does stuff,2021-1-1
What is the value of the third column?
Is this a CSV file without quoting? Then it's
Here is a string "" that does stuff
Or is it a CSV file with double quote escaping? Then it's
Here is a string " that does stuff
This is fundamentally undecidable without knowledge of what the format it is.
You can decide to just assume RFC compliant CSVs in the event of ambiguity, but then you absolutely will get bugs from users with non-RFC compliant CSV files.
That's true. You could scan the file and see if there are any other types of double quote escaping happening, but if there isn't any that wouldn't help either. It's also negated by the multiple formats in the same file point.
So, yeah. Can't really be done without making too many assumptions that will break later.
By this argument UTF8 can't exist. And yet here it is.
PS: I never said 100% forward/backward compatible with all variant at the same time and without any noticeable artifact. I meant compatible in a non blocking way.
ASCII was well defined, CSV was not. Therefore they could take the highest bit, which they could know that was unused per the ASCII spec, and use that to encode their extra UTF-8 information.
Also UTF-8/ascii compatibility is unidirectional. A tool that understands ASCII is going to print nonsense when it encounters emoji or whatever in UTF-8. Even the idea that tools that only understand ASCII won't mangle UTF-8 is limited - sure dumb passthroughs are fine, but if it manipulates the text at all, then you're out of luck - what does it mean to uppercase the first byte of a flag emoji?
To be fair, there is basically no way to manipulate arbitrary text at all without mangling it, UTF-8-aware or not. What does it mean to take the first 7 characters of a UTF-8 string which might contain combinator characters and left-to-right special chars? What if the text uses special shaping chars, such as arranging hieroglyphs in cartouches? You basically need a text-rendering aware library to manipulate arbitrary strings.
Given the infinity of CSV variants existing is the very reason why CSV blows goats, creating a spec compatible with that is
1. a fool's errand, CSV "variants" are not compatible with one another and regularly contradict one another (one needs not look any further than Excel's localised CSVs)
2. resulting in getting CSV anyway, which is a lot of efforts to do nothing
> If we manage to write a spec that meet this criteria we'll have a powerful standard with easy adoption.
So, a binary format consisting of:
(1) a text data segment
(2) and end of file character
(3) a second text data segment with structured metadata describing the layout of the first text data segment, which can be as simple (in terms of meaning; the structure should be more constrained for machine readability) as “It’s some kind of CSV, yo!” to a description of specific CSV variations (headers? column data types? escaping mechanisms? etc.) or even specify that the main body is JSON, YAML, XML, etc. (which would probably often be detectable by inspection, but this removes any ambiguity).
Almost any current CSV parser, even the bad ones, tolerate a header line.
So it should be possible to define a compact and standardized syntax that is appended before the real header of the first cell (separator,encoding,decimal separator (often disregarded by most parsers but crucial outside USA),quote character,escape character,etc...). Following headers would just use special notation to inform on (data-type,length,comment).
Newest parsers would use theses clues, older ones would just append some manageable junk to headers.
> So someone opens this CSV in Excel and there's garbage in A1?
Yeah, that's why I chose the “thing that looks like a text file—including optionally CSV—but has additional metadata after the EOF mark” approach instead of stuffing additional metadata in the CSV; there's no way to guarantee that existing implementations will safely ignore any added metadata the main CSV body. (My mechanism has some risk in that there are probably CSV readers that treat the file as a binary byte stream and use the file size rather than a text stream that ends at EOF, but I expect its far fewer than will do the wrong thing with additional metadata before the first header.
If by EOF char you mean Ctrl-Z, Python's `csv` module is at least one case where it will read past the EOF char and you'll get rows of garbage data for any content in the file after that.
That's ignoring two big points in the OP: that CSV variants are ill-defined and conflicting; and that existing CSV data is too poorly-specified.
There is no reason to try to be "backwards-compatible" with existing CSV files - we don't have a single definition of correctness to use to check that the compatibility is correct. Every attempt to be parse existing data would result in unexpected results or even data loss for some CSVs in the wild, because there is no way to reconcile all the different expectations and specifications that people have for their own CSV data.
This is true. The best way would be if everyone generated CSV files that were compatible with on one or two already existing CSV formats. (e.g. no-quote-plain-old-csv or quotes-forced-csv.) Hopefully, it would be easy to distinguish between the two in some well-defined manner. Maybe call it .csvs (CSV standardized). Or just introduce new extensions like .csvq ("CSV quoted").
Note the difference is that I am suggesting reducing the number of standards in-use by using only one already existing CSV format. :)
But some people are. There are entire industries built around the exchange of CSV files, and the producers and consumers don't necessarily talk to each other.
I'd add the constraint that it needs to be understandable by most people. A typical non-tech office worker can use CSV without any trouble. Sqlite has an extremely steep learning curve for someone unfamiliar with databases.
Exactly. Worst case a CSV can be edited and viewed or even created in MS Notepad or any other text editor which is pretty much guaranteed to be on a system. Something to deal with Sqlite? No such luck and the barrier is much higher.
> Worst case a CSV can be edited and viewed or even created in MS Notepad or any other text editor which is pretty much guaranteed to be on a system.
Good luck getting that garbage to be ingestible by anything. Odds are the system you're trying you shove it in will start by choking on the UTF8 BOM notepad insists on, then it will choke on the record separators, then on the incorrect escaping (or quoting).
> Something to deal with Sqlite? No such luck and the barrier is much higher.
SQLite is already present in most if not all environments (usually multiple times too, you have at least one instance of sqlite per browser on your machine, and probably a system one as well, and a few others), so it's really at most a matter of providing SQLite clients by default.
1) A truly open format is available and accessible. Csvs are textfiles. There is no system around that cannot open a textfile. If the format is binary or requires patents or whatever, then it's a non-starter.
2) Applications have a speed increase from using csvs. To wit, I loved csvs because often they finish preparing much faster than a "formatted" output in excel, etc. and sometimes I just want to see the numbers, not have everything colored or weird features like merged cells in excel, etc.
3) The new format should not be grossly larger than the one it is replacing. Extracts in Excel format are orders of a magnitude larger than csv in filesize. This affects run-time to prepare the extract, open it (memory constraints, etc.)
Is there truly a problem? The author is not forced to use csvs.