CSV 1.1 – CSV Evolved (for Humans) (csv11.github.io)
84 points by polm23 5 months ago | 77 comments

There are a ton of problems here:

* From the documentation: "No quotes needed for values ... Use dual quotes ... Use triple quotes"

* You haven't solved the problem of describing what the columns are. In fact, it's worse, because you are encouraging people to put units in the field

* Spaces matter! What if the data is literally " Word " vs "Word". This format makes them both the same.

* For some reason, the header row is removed. The only thing in a CSV that provides structure is gone. In the first "improved" example, I have no way of knowing what the columns mean.

The only good improvement in this is the "#" convention as a comment. If we could get every future CSV parser to agree (we already can't get them to agree now) to ignore lines that start with "#", then CSVs would be vastly improved.
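For illustration, a minimal sketch of the "skip # lines" convention, using Python's standard csv module - naive in that it filters raw lines before parsing, so it would mangle a quoted multiline field whose continuation happens to start with "#":

```python
import csv
import io

def read_csv_skip_comments(text):
    """Parse CSV text, ignoring comment lines that start with '#'.

    Naive sketch: filtering raw lines before parsing breaks quoted
    multiline fields whose continuation line starts with '#'.
    """
    lines = (line for line in io.StringIO(text)
             if not line.lstrip().startswith("#"))
    return list(csv.reader(lines))

rows = read_csv_skip_comments("# a comment\nkey,name\nbc,British Columbia\n")
# rows == [['key', 'name'], ['bc', 'British Columbia']]
```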

> Spaces matter! What if the data is literally " Word " vs "Word". This format makes them both the same.

If leading and/or trailing whitespace are significant, you can quote the value. Problem solved.

Having said that, almost always when you see data persisted with leading or trailing whitespace it should have been trimmed away before storage.
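A quick round trip (Python's csv module standing in for any RFC-style parser) shows quoting making the whitespace unambiguous:

```python
import csv
import io

# Force quoting so leading/trailing whitespace is explicit in the file.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow([" Word ", "Word"])
# buf.getvalue() == '" Word ","Word"\r\n'

# Reading it back: the spaces survive the round trip.
row = next(csv.reader(io.StringIO(buf.getvalue())))
# row == [' Word ', 'Word']
```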

Edit to add: your main points are spot on, IMO, but:

Original: I'm not even sure I agree that 'comments' are good, particularly if they're going to be intentionally ignored by GUI spreadsheet apps.

If you have meaningful information, put it in the furthest-right column, so it appears in spreadsheet editors to the right of the data.

If it's not meaningful to the person who will open the file, why include it?

Regarding quotes, they're optional (needed e.g. when a value contains a comma, leading whitespace, or a newline). With arguments-by-position possible, I guess colons have to be quoted too.

It's not clear to me that it's possible to convert this to a standard CSV; it can't guess the header and guessing whether a column contains units (consistent units?) is asking for trouble.

More pragmatically I can't share these with other people because they're almost like a csv, but incompatible.

It's great to try to make CSV more readable (I like the leading whitespace, but it'd be finicky to maintain with simple text editors), but not worth the technical risk of format confusion.

Hello, I'm the original author of CSV 1.1.

There are no problems :-). You're making up things / problems.

Nowhere does it say that the header row is removed. You CANNOT auto-detect a header; you have to tell your parser whether you have a header or not - it's optional (see the csvreader library as an example - https://github.com/csv11/csvreader - from my humble self).

If spaces matter, put them in quotes. By default you don't need quotes (and, thus, they're discouraged).

> You haven't solved the problem of describing what the columns are

That is solved / done by a schema with a (tabular) datapackage - see https://github.com/csv11/csvpack for a real-world example of how that works in practice, or use csvrecord, see https://github.com/csv11/csvrecord
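As a rough illustration of the idea (the field names and types below are made up for this sketch, not taken from the csvpack examples), a Table Schema-style descriptor is just metadata that travels next to the CSV and lets a tool check each row:

```python
# Illustrative Table Schema-style descriptor plus a naive row check.
schema = {
    "fields": [
        {"name": "key",      "type": "string"},
        {"name": "name",     "type": "string"},
        {"name": "capital",  "type": "string"},
        {"name": "area_km2", "type": "integer"},
        {"name": "tags",     "type": "string"},
    ]
}

def check_row(row, schema):
    """True if the row has the right arity and integer fields parse."""
    if len(row) != len(schema["fields"]):
        return False
    for value, field in zip(row, schema["fields"]):
        if field["type"] == "integer":
            try:
                int(value)
            except ValueError:
                return False
    return True

ok = check_row(["bc", "British Columbia", "Victoria", "922509",
                "en|western canada"], schema)
# ok == True
```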

> If spaces matter, put them in quotes. By default you don't need quotes (and, thus, they're discouraged).

So, just like a regular CSV?

Well, if you want to improve tabular data formats:

1. Add a version identifier / content-type on the first line!

2. Create a formal grammar for this CSV format

3. Specify preferred character-encoding

4. Provide some tooling (validation, CSV 1.1 => HTML, CSV => Excel)

5. Add the option to specify column type (string, int, date)

6. Specify ISO-8601 as the preferred date format

7. Allow 'reheading' the columns in the file itself. This is useful in streaming data.

8. Specify the format of the newlines.

CSVW: CSV on the Web https://w3c.github.io/csvw/

"CSV on the Web: A Primer" http://www.w3.org/TR/tabular-data-primer/

"Model for Tabular Data and Metadata on the Web" http://www.w3.org/TR/tabular-data-model/

"Generating JSON from Tabular Data on the Web" (csv2json) http://www.w3.org/TR/csv2json/

"Generating RDF from Tabular Data on the Web" (csv2rdf) http://www.w3.org/TR/csv2rdf/


N. Allow authors to (1) specify how many header rows are metadata and (2) what each row is. For example: 7 metadata header rows: {column label, property URI [path], datatype URI, unit URI, accuracy, precision, significant figures}

With URIs, we can merge, join, and concatenate data (when e.g. study control URIs for e.g. single/double/triple blinding/masking indicate that the https://schema.org/Dataset meets meta-analysis inclusion criteria).

"#LinkedReproducibility"; "#LinkedMetaAnalyses"

You just invented XML or JSON

Or how about including the parser with the file? That's self-specifying.

What problem is this trying to solve?

CSV's problems are the nature of a very flexible convention. It's so simple that everyone writes their own generators and parsers that are slightly different. That's what happens when you use a convention like csv.

Revving the spec won't help anything... because CSV is the kind of convention where no one reads the spec anyway!

I don't even see how it is better.

    bc,British Columbia,Victoria,922509,en|western canada

    # Oh, Canada! 10 provinces and 3 territories
    # see en.wikipedia.org/wiki/Provinces_and_territories_of_Canada
    # note: key is two-letter canadian postal code
    # for regions tags see
    #   en.wikipedia.org/wiki/List_of_regions_of_Canada

    bc, British Columbia,       Victoria,     922 509 km², en|western canada

In 1.1 I don't see what is what - what is `bc`? What is `en|western canada`? In vanilla CSV you can clearly see what is what because it shows you on top.

Why not just write a CSV formatter?

Right, where are the tag names specified? What if a new column is added, and my reader needs to handle both formats?

Also, seems like specification-by-example, unless I’m missing something.

To be honest, I'd be quite happy if CSV just had a 'frontmatter' block. Sometimes a bit of background info or description of what the data represents or its source is really useful.

Removing column headings would be a killer. Although efforts in the past to add type information are misguided, by that point you may as well just use XML.

I can't see this gaining any traction. There's nothing wrong with CSV as it is for basic data transfer. CSV has been unchanged for so long now that any attempt to update it will fail - there are simply too many CSV parsers already implemented.

If you are worried about people in the year 3000 understanding your data, add a .txt file alongside the CSV explaining the fields.

> There's nothing wrong with CSV as it is for basic data transfer.

I disagree. CSV is horribly underspecified and many parsers have conflicting ideas on how things should work. I've hit areas where, for example, an API was generating a CSV that another API could read, but NOT if it was opened and then re-saved by Excel first (just re-saving, not editing), because Excel was making some trivial change in the format that the consuming parser couldn't handle. Then there's issues with encoding, localisation, and escaping (see, eg, https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standa... although that's just scratching the surface).

That being said, this proposal seems worse in every way than existing CSVs, but I don't think the issue is "there's a well specified CSV standard with tons of parsers that won't change to support this new one"; it's more that there is no standard, all the parsers are incompatible, and no meaningful improvements are possible. :(

To your point, it's hard to see how the referenced article qualifies as a standard. It does not deal with escape characters or missing values, two places where CSV implementations tend to vary in random ways.

My comment may have been unclear, but I agree with this completely.

I've recently been dealing with some CSV data exported from Excel. One of the columns has lengths measured in inches or feet. A column that says

    14"

exports as

    , '"14"""',

It feels downright silly to me.

A cell w/

    14"

would be

    "14"""

And I verified that all four of Excel's CSV varieties write this, but this is also what RFC CSV writers would emit. Not saying you can't pick up oddities elsewhere, but as stated, it's not that bad. (Not that I condone use of CSV… it's an awful format.)
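For reference, Python's csv module - an RFC 4180-style writer - produces exactly this doubled-quote escaping:

```python
import csv
import io

# RFC 4180 escaping: a double quote inside a field is doubled, and the
# whole field is wrapped in quotes, so the cell 14" becomes "14""".
buf = io.StringIO()
csv.writer(buf).writerow(["length", '14"'])
# buf.getvalue() == 'length,"14"""\r\n'

# Reading it back recovers the original cell value.
row = next(csv.reader(io.StringIO(buf.getvalue())))
# row == ['length', '14"']
```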

Ah you're right, the spot I was looking at was after having read it into Python and split into a list on the commas, so the outer ' is my string container. Whoops.

What we need, I think, is an HTML5-like effort from the industry. That's the one format I can think of which went from "complete mess" to "parses the same everywhere".

1. Look at how everybody is extending CSV in the real world

2. Make a new format which does those things, as closely to all the existing implementations as is reasonable

3. Write a fully-specified new syntax for it, including a complete parser (state machine including error cases)

4. Make a complete set of tests, which make it very obvious when you don't pass, for shaming everyone into compliance

5. Get buy-in from a major standards organization, and all of the major corporate players

I don't see any improvements/replacements succeeding if they don't hit all of these points. "CSV 1.1" here hits #1 and #2 only.

I agree. The biggest problem with CSV is Excel's terrible support.

And while CSV is primarily used for data transfer, Excel could hypothetically be a great debug-viewer if it didn't do brain dead stuff like unrecoverably corrupt UPCs and other long numbers by default.

There's an Excel Uservoice about this very issue from 2015:


They're "looking into it." Just as they've been looking into it for the last twenty years. Any. Day. Now.

Excel's braindead CSV corruption is why I never use it for CSV now. Add GUIDs to the list of things it can corrupt. (If the first group is numeric, it parses as an expression, which then causes an error because the next group has hex digits - what!?) Also any long string that happens to be only digits. And two numbers separated by a "/".

It's so bad I check any csv file I get if it's been opened in Excel, and if I find any evidence that it has I reject it and make whoever I got it from give me the original source, or transform it with literally any other tool.

I was surprised by how bad Microsoft's CSV support is in Excel - so much so that I use Numbers or text editors for everything CSV; never do I use Excel. Excel is still great for a basic level of intensive data analysis though, so I can't get away from it.

IIRC there is a hack where you put a single quote after the comma and Excel will treat it as a string. Unfortunately it doesn't help when exporting to unknown systems, which likely don't do the same.

Fix it so now it's easier for customers to not use Excel? Heretic!

> If you are worried about people in the year 3000 understanding your data, add a .txt file alongside the CSV explaining the fields.

Good point. That's what (tabular) data packages are for, see https://github.com/csv11/csvpack for some real-world examples.

While it is nice to add some spacing and make things a bit easier to read, I don't really see the point in this.

If you want it to be easily parsable by a human then there are hundreds of applications designed for this. The most common being spreadsheets like Excel.

And what happens if you have a list of domains, but one contains a really long url (300 characters wide?). It will mess up the columns for every single row above and below it as well.

Also, adding comments will completely break any existing software. (I think most programs can handle some extra whitespace, but I don't think you could import the final example into Excel - untested.)

For those looking for a modern take on improving on the CSV format, I'd recommend Frictionless Data's Datapackage specification[1] which basically consists of a json file of metadata that accompanies a CSV file that describes column types, versions, sources, and how to validate correctness of the CSV's data. This allows for quite a lot of tooling and workflow improvements to CSV files without mucking with the CSV itself

Another hack to improve CSV workflows is OCHA's HXL[2] that is used by humanitarian organizations. Basically adding a row of hashtags in addition to column names, which is surprisingly useful considering the ease of adding these to a file.

[1] https://frictionlessdata.io/docs/tabular-data-package/ [2] http://hxlstandard.org/
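For a flavor of HXL, the hashtag row sits directly under the normal header row - the tags below are illustrative only; see hxlstandard.org for the actual tag vocabulary:

    Province,Capital
    #adm1+name,#adm1+capital
    British Columbia,Victoria

A plain CSV parser still reads this file fine; the hashtag row is just one extra data row unless the tool is HXL-aware.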

FYI: The (tabular) datapackage is great / fantastic but one layer up the stack.

See https://github.com/csv11/csvpack for real-world data package examples.

By the way, the tabular data CSV dialect specification is a great start / initiative (mostly a 1:1 copy from the Python parser :-), really), but it would need an update, with more options, to reflect the reality of the CSV formats out there. The big insight and breakthrough: CSV is NOT one spec or format but various flavors / formats / dialects - let the computer (that is, the csvreader library) handle it.

One of the authors of the Frictionless Data specifications here. The spec directly relevant to CSV is Table Schema [1], and we’ve also got some nice tools that leverage Table Schema and the family of specifications, such as goodtables [2].

[1] https://frictionlessdata.io/specs/table-schema/ [2] https://github.com/frictionlessdata/goodtables-py

This looks interesting and useful. It might be a feature that wouldn't get used a lot but I've always felt that it would be great to have a 'unit' field to specify the physical units of the values. Have you ever discussed adding this as an optional field to the specification?

We’ve looked into it:


Feel free to add your use cases to that issue.

I was extremely confused that neither the website nor the repository contains the specification. There is an informal specification scattered across a separate repo [1], and the whole organization [2] is dedicated to the documentation and associated Ruby implementations. So, well, good luck, but next time try to be more specific.

[1] https://github.com/csv11/csv-next

[2] https://github.com/csv11/

FYI: The readme (the website) is the specification :-), see https://github.com/csv11/csv11.github.io

PS: csv-next is NOT an informal specification - these are notes (collection of ideas).

PPS: How do I know? I'm the author of the website - I should know ;-).

Oh, hello! My gut feeling was that csv-next is pointing to the informal specification (and they read like it, but specific to a single problem point).

I guess you might not have understood my complaint (my bad), so let me give some examples how the specification should look like.

An informal specification looks like this [1]. You should give enough examples and rules to use your format, sufficient to implement most of it. You should give definitions for keywords that are specific to your specification (but you can omit common definitions). You should point out some (but probably not all) of the places where the specification can go wrong: Unicode whitespace, byte order marks, platform-specific newlines, escape sequences, duplicate keys in the front matter (well, actually there is no provision for how to write the front matter at all), numeric overflows, and so on.

A formal specification looks like this [2]. In addition to what an informal specification provides, you should give lots of examples and formalized rules (most frequently ABNF [3]) sufficient to implement everything. There should be clear and reasonable error-handling policies. The wording of the specification should be clear, unambiguous, and preferably standardized (there are specific meanings for "MUST", "SHOULD", etc. [4]). The specification should be honest about its pros and cons. You should be explicit about the flexibility of the format: give a list of what can be extended or modified later and what can't.

I strongly suggest you put together an informal specification at the least, and prepare for the eventual development of a formal specification by pondering the missing pieces (it is a daunting task, I know).

[1] https://github.com/toml-lang/toml/blob/bb47759841ac368d86eb7...

[2] https://tools.ietf.org/html/rfc7049

[3] https://en.wikipedia.org/wiki/Augmented_Backus%E2%80%93Naur_...

[4] https://tools.ietf.org/html/bcp14

I think you don't understand :-). A specification doesn't always have to be complicated - that's the point. I'm all for tech notes with all the details. Here's a challenge for you: write a CSV parser - I know, it's a daunting task. For inspiration, here's an example from your humble self - https://github.com/csv11/csvreader

You should stop using smilies in this passive-aggressive manner. You should also stop condescending to people. You may feel that you can't be wrong so they must not understand because they are too dumb and daunted by it, but you should not be revealing that to them in your comments, because it's so offputting.

> A specification doesn't always have to be complicated, that's the point.

That's right, but your "specification" was not even enough for writing a CSV 1.1 file. For example, I mentioned the front matter issue - there is no example using it, and I'm deeply confused about what the front matter looks like (or even where it goes). I see numeric units there, but your page doesn't say how they are handled (or, say, what happens if different units like m² appear).

I'm not saying I'm a good specification writer (and I'm not even a native speaker of English), but I at least try. I have once written an informal specification [1] which should be almost enough to write valid files and start implementing parsers, without being too complicated. You can avoid a complicated specification without being too vague.

> For inspiration, here's an example from your humble self - https://github.com/csv11/csvreader

That's frankly much better (in terms of explicitness) than the front page - why not link to it? :-)

But I believe that it is still not sufficient. Rather, I'm now unsure about the goal of your project: is it a set of Ruby libraries with fancy, modern but (by themselves) incompatible CSV extensions? Or is it hopefully going to be a universally used format? If your intention was the former, then your front page should have been clear about it - it is not about a format but about a library. And if you are going to have a format, then my suggestions hold.

[1] https://github.com/lifthrasiir/cson (with a bit of formal materials, just for the clarity)

The goal of the project is to improve CSV :-) - it's in the title e.g. CSV Evolved and in the version 1.1 (that is, it's not version 2.0). It's about not breaking things - it's about small changes (mostly) for humble humans (not machines).

It's NOT about one universal csv format and the ultimate specification to settle the matter until the end of history etc.

Thanks for the detailed suggestions. I see your points. I appreciate your helpfulness.

A specification has to be, um, specified. For example, the first thing to explain is whether the file format is actually text and if it is what the allowed characters and encodings are. And let's not forget the big one: how are records and fields delimited? Too boring to explain?

On the whole, some ideas to make CSV files human-friendly, but neglecting backward compatibility (comments, use of spaces...) and introducing major syntactic and semantic cans of worms (multiline values, named fields, defaults...). I think CSV files should evolve towards tighter constraints instead.

The best thing we have today for "CSV evolved" is JSON array of arrays, IMO. It's much better specified, everything agrees on how things should be quoted, it's perfectly human-readable if you do one array per line, its semantics follows straightforwardly from that of JSON, and we already have parsers for it that produce a reasonable output.

Is it slightly more verbose? Sure, but why does it matter? Having to quote all strings is really not a big deal.
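Sketched concretely - one JSON array per line, using nothing beyond Python's json module:

```python
import json

# "CSV evolved" as JSON arrays: quoting and escaping are fully
# specified by JSON, and one row per line keeps the file diff-friendly.
rows = [
    ["key", "name", "capital"],
    ["bc", "British Columbia", "Victoria"],
]
text = "\n".join(json.dumps(row) for row in rows)
# ["key", "name", "capital"]
# ["bc", "British Columbia", "Victoria"]

parsed = [json.loads(line) for line in text.splitlines()]
# parsed == rows — a lossless round trip with off-the-shelf parsers
```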

Good point. See the CSVJSON format - http://csvjson.org - Love it. I will add a new variant called CSV <3 JSON shortly to the CSV 1.1 repo and csvreader etc. too.

This is not VCS friendly. Resize any column and you have to rewrite 99% of the file, tracking changes becomes hell on earth.

Besides being VCS-unfriendly, the storage space for all the additional space characters provides little utility. For the most part, computers read CSV, not humans (though we might scan them).

What do you mean by "resize any column"?

I presume the parent means changing the data such that the max length of strings in a column changes; this would cause the padding after that column to change, so as to keep subsequent columns aligned and looking nice. This doesn't play well with VCS diffs, as a bunch of data that isn't changing semantically is now changing syntactically, and a simplistic textual diff reflects that syntactic change, whereas a human reader would want to see the semantic change.

Now… most diff tools also have options to ignore whitespace changes, for cases like this.

CSV is hell, glad to see improvements. Some idiot somewhere decided that comma-separated values in certain locales should be based on semicolons (who would have thought files would be shared across country borders!?), so when we open CSV files that are actually comma-separated, all the information is in the first cell (until a semicolon appears).

To get comma separated CSVs to show properly in Excel we have to mess around with OS language settings. CSV as a format should have died years ago, it's a shame so many apps/services only export CSV files. Many developers (mainly US/UK based) are probably not aware of how much of a headache they inflict on people in other countries by using CSV files.

Semicolons come from many countries using the decimal comma in lieu of the decimal period. Of course, locale-specific number formatting should be a big no-no in any kind of portable format anyway. But...

Some time ago I worked on a similar, CSV-like thing that improves on readability and canonical representation, and which can hold multiple tables in a single file or stream like a database: http://jstimpfle.de/projects/wsl/main.html . (I hope it's not rude to reference it here.)

No worries. Thanks for highlighting WSL - whitespace separated literals. The world needs more of this. Keep it up.

I like the intent "Easy-to-Write, Easy-to-Read", but the contents of the README leave much to be desired. There are far too many corner cases which the README doesn't address (many already mentioned below), and I can think of more - like, how does a parser know "city: Halifax" is "key: value" and not just the value "city: Halifax"?

Even if a `spec` is created which clearly defines these cases and standardizes how they should be handled, does it qualify to be called CSV 1.1 (or 2.0, really, as it probably won't play nice with existing CSV 1.0 implementations)? It is almost as good as creating a new format altogether. And there are many to compete with.

I also wonder if it REALLY solves the problems it aims to solve (even if the tech specs were in place). CSV in any form is not human-friendly. This is especially true when you have wide columns (say >30). Even if the records were spaced out in a human-friendly way in STATE-1, editing records where values are of highly varying width (within a single column) will soon mess up the justification by the time you get to STATE-2. If you skip the requirements for `fixed-number-of-fields` and `key:value` style named values, that hurts readability (by humans) even more!

I've had good success with `reading` CSVs using `csvtk` (https://github.com/shenwei356/csvtk). It provides excellent support for pretty-printing CSVs, filtering select fields, etc. `writing` is still a pain, but I still feel spreadsheets are the way to go. They have been around for decades. It's sad if the formatting by specific tools is screwing up CSVs, but solving the composition/modification requirement purely by way of formatted plain text is a really tall ask.

ASCII defines control codes for record and field separators, thus avoiding CSV's problems with quoting and escaping characters. The only hassle is entering and displaying the control characters.
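A sketch of that scheme in Python - US (0x1F, Unit Separator) between fields, RS (0x1E, Record Separator) between records:

```python
# ASCII delimiters: no quoting or escaping needed, because these
# control bytes never appear in ordinary text data.
US, RS = "\x1f", "\x1e"

def encode(rows):
    """Join fields with US and records with RS."""
    return RS.join(US.join(fields) for fields in rows)

def decode(data):
    """Split records on RS, then fields on US."""
    return [record.split(US) for record in data.split(RS)]

rows = [["key", "name"], ["bc", "British Columbia, Canada"]]
# Commas (and even newlines) inside fields round-trip untouched.
assert decode(encode(rows)) == rows
```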


Good effort. And I like the minimal documentation. However - and I would love to be proved wrong - I think you are about 20 years too late for this project to get any interest, aside from some ephemeral moments on Hacker News...

No worries. It's just a write-up of what gets used in football.db, beer.db, world.db, etc., day in, day out. Also remember - as they say, it's never too late :-).

[1] https://github.com/openfootball [2] https://github.com/openmundi [3] https://github.com/openbeer

If it ain't broken don't fix it.

It was already broken, but I don't think this will fix it either.

I’m sorry, but when do you ever need to read a CSV file in an environment where you don’t have a spreadsheet editor anyway?

I work in the public sector; we use a lot of CSV, even with non-tech-savvy people.

I’ve never heard the use case for this project.

Now that might not matter, but you're breaking CSV to fulfill a nonexistent use case.

How does this improve on RFC 4180? (I only _wish_ excel and many other parsers behaved predictably like that RFC, but it is a decent proposal. )

( https://www.ietf.org/rfc/rfc4180.txt )

The [strict] RFC 4180 CSV Format memo is way too simplistic and "specifies" a format for machine reading and writing. It's not a human CSV format. What's human? A format that you hand-edit and so on - thus, you want comments, blank lines, shortcuts, etc.

Just for the record to quote from the memo:


Just repeating it three times in case you missed it, see https://www.ietf.org/rfc/rfc4180.txt

Uh... so the bottom one of the two examples is CSV 1.1?

How do you define headers? There's no header for the bottom CSV version, so I have to manually type headers for my data frame?

I got a dataset with 87 obs and ~8700 columns here. I'm not going to manually name those columns. What's the solution?

Yes, you can add a header, of course. No worries. CSV 1.1 is still CSV. It's not CSV 2.0 :-).

[ ["If", "we're", "going", "to", "improve", "CSV"], ["Let's", "do", "it", "with", "some", "JSON"] ]

Good point. You're not the first one (even in this thread). To repost:

See the CSVJSON format - http://csvjson.org - Love it. I will add a new variant called CSV <3 JSON shortly to the CSV 1.1 repo and csvreader etc. too.

You can already do this in R with a fixed-width text file, comments, and the names of variables in a header. The commas serve no actual purpose AFAICT.

The number parsing is going to be a nightmare with this!

Reminds me of old IBM punch card fixed-column-aligned tables... Still in use in many systems (such as the US ACH wire system...)

Everything old is new again.

What if you want to start a row with a field that has a `#`? Not to mention how to handle backwards compatibility with the above concern...

If a field starts with #, you have to escape it - that is, put it in quotes, e.g. "# no comment".

I don't get the point. If you are going to replace your CSV im-/exporter, you can switch to a non-ridiculous format anyway.

Why not use tab-separated values, you get improved readability and less quoting needed, or am I missing something?

See the Awesome CSV page, for the discussion about tabs vs commas - https://github.com/csv11/awesome-csv#tab

Yes, ideally the tab is perfect (no escape rules needed, etc.). In practice, as a human you cannot tell if a tab is a space or a space is a tab, and then you will get into trouble reading / writing your data, etc. Anyways, both are great and needed - use tab2csv or csv2tab to convert or pipe (when using command line tools) :-)
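A tab-to-comma conversion in the spirit of tab2csv (a sketch using Python's csv module, not the actual tool): parse as TSV, re-emit as CSV, letting the writer add quoting where needed.

```python
import csv
import io

def tab2csv(tsv_text):
    """Convert tab-separated text to comma-separated text.

    Fields containing commas get quoted automatically on output.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        writer.writerow(row)
    return out.getvalue()

# tab2csv("bc\tBritish Columbia\n") == "bc,British Columbia\r\n"
```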

The perfect character is ASCII (& Unicode) 0x1E (30), "Record Separator" - except that no modern keyboard can naturally type it; the keyboards that could are long since gone.

Or just open the CSV in a spreadsheet?

Yeah... No. This doesn't solve anything.
