
CSV 1.1 – CSV Evolved (for Humans) - polm23
https://csv11.github.io/
======
legostormtroopr
There are a ton of problems here:

* From the documentation "No quotes needed for values ... Use dual quotes ... Use triple quotes"

* You haven't solved the problem of describing what the columns are. In fact, it's worse, because you're encouraging people to put units in the field

* Spaces matter! What if the data is literally " Word " vs "Word"? This format makes them both the same.

* For some reason, the header row is removed. The only thing in a CSV that provides structure is gone. In the first "improved" example, I have no way of knowing what the columns mean without them.

The only good improvement in this is the "#" convention for comments. If we
could get every future CSV parser to agree (we already can't get them to
agree now) to ignore lines that start with "#", then CSVs would be vastly
improved.
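A minimal sketch of that convention, assuming (as the parent suggests) that parsers simply skip blank lines and lines whose first non-space character is `#`:

```python
import csv

def read_csv_skipping_comments(text):
    """Parse CSV text, ignoring blank lines and lines whose first
    non-space character is '#'. Note the known ambiguity: an unquoted
    data field that starts with '#' would be swallowed as a comment."""
    lines = (line for line in text.splitlines()
             if line.strip() and not line.lstrip().startswith("#"))
    return list(csv.reader(lines))

sample = """\
# Canadian provinces (key is the two-letter postal code)
bc,British Columbia,Victoria
ab,Alberta,Edmonton
"""
rows = read_csv_skipping_comments(sample)
# rows == [['bc', 'British Columbia', 'Victoria'], ['ab', 'Alberta', 'Edmonton']]
```

A quoted field like `"# no comment"` survives the filter, since the raw line starts with `"` rather than `#`.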

~~~
geraldbauer
Hello, I'm the original author of CSV 1.1.

There are no problems :-). You're making up things / problems.

Nowhere does it say that the header row is removed. You CANNOT auto-detect a
header; you have to tell your parser whether you have a header or not (it's
optional). See the csvreader library as an example -
[https://github.com/csv11/csvreader](https://github.com/csv11/csvreader) \-
from my humble self.

If spaces matter, put them in quotes. By default you don't need quotes (and,
thus, they're discouraged).

> You haven't solved the problem of describing what the columns are

That is solved / done by a schema with a (tabular) datapackage; see
[https://github.com/csv11/csvpack](https://github.com/csv11/csvpack) for a
real-world example of how that works in practice, or use csvrecord, see
[https://github.com/csv11/csvrecord](https://github.com/csv11/csvrecord)

~~~
legostormtroopr
> If Spaces matter! but them in quotes. By default you don't need quotes (and,
> thus, discouraged).

So, just like a regular CSV?

------
edejong
Well, if you want to improve tabular data formats:

1. Add a version identifier / content-type on the first line!

2. Create a formal grammar for this CSV format

3. Specify a preferred character encoding

4. Provide some tooling (validation, CSV 1.1 => HTML, CSV => Excel)

5. Add the option to specify column types (string, int, date)

6. Specify ISO-8601 as the preferred date format

7. Allow 'reheading' the columns in the file itself. This is useful in
streaming data.

8. Specify the format of the newlines.
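A minimal sketch of what items 1 and 5 could look like in practice; the `#%csv` directive and the `name:type` header syntax are invented here purely for illustration, not part of any existing spec:

```python
import csv

def read_versioned_csv(text):
    """Sketch: the first line carries a hypothetical '#%csv <version>'
    identifier, the second a header whose cells are 'name:type'."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#%csv "):
        raise ValueError("missing version identifier")
    version = lines[0].split(maxsplit=1)[1]
    rows = list(csv.reader(lines[1:]))
    header = [cell.split(":") for cell in rows[0]]  # [['id', 'string'], ...]
    return version, header, rows[1:]

sample = "#%csv 1.1\nid:string,area:int\nbc,922509\n"
version, header, data = read_versioned_csv(sample)
# version == '1.1', header == [['id', 'string'], ['area', 'int']]
```

A reader that sees an unknown version could then fail loudly instead of silently misparsing.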

~~~
westurner
CSVW: CSV on the Web
[https://w3c.github.io/csvw/](https://w3c.github.io/csvw/)

"CSV on the Web: A Primer" [http://www.w3.org/TR/tabular-data-
primer/](http://www.w3.org/TR/tabular-data-primer/)

"Model for Tabular Data and Metadata on the Web"
[http://www.w3.org/TR/tabular-data-model/](http://www.w3.org/TR/tabular-data-
model/)

"Generating JSON from Tabular Data on the Web" (csv2json)
[http://www.w3.org/TR/csv2json/](http://www.w3.org/TR/csv2json/)

"Generating RDF from Tabular Data on the Web" (csv2rdf)
[http://www.w3.org/TR/csv2rdf/](http://www.w3.org/TR/csv2rdf/)

...

N. Allow authors to (1) specify how many header rows are metadata and (2) what
each row is. For example: 7 metadata header rows: {column label, property URI
[path], datatype URI, unit URI, accuracy, precision, significant figures}

With URIs, we can merge, join, and concatenate data (when e.g. study control
URIs for e.g. single/double/triple blinding/masking indicate that the
[https://schema.org/Dataset](https://schema.org/Dataset) meets meta-analysis
inclusion criteria).

"#LinkedReproducibility"; "#LinkedMetaAnalyses"

------
gwbas1c
What problem is this trying to solve?

CSV's problems are the nature of a very flexible convention. It's so simple
that everyone writes their own generators and parsers, each slightly
different. That's what happens when you use a convention like CSV.

Revving the spec won't help anything, because CSV is the kind of convention
where no one reads the spec anyway!

------
romanovcode
I don't even see how it is better.

    
    
        ID,Name,Capital,Area,Tags
        bc,British Columbia,Victoria,922509,en|western canada
    

vs

    
    
        ###########################
        # Oh, Canada! 10 provinces and 3 territories
        #
        # see en.wikipedia.org/wiki/Provinces_and_territories_of_Canada
        #
        # note: key is two-letter canadian postal code
        #
        # for regions tags see
        #   en.wikipedia.org/wiki/List_of_regions_of_Canada
    
        bc, British Columbia,       Victoria,     922 509 km², en|western canada
    
    

In 1.1 I don't see what is what - what is `bc`? What is `en|western canada`?
In vanilla CSV you can clearly see what is what, because the header row at the
top tells you.

Why not just write a CSV formatter?

~~~
Tempest1981
Right, where are the tag names specified? What if a new column is added, and
my reader needs to handle both formats?

Also, seems like specification-by-example, unless I’m missing something.

~~~
petepete
To be honest, I'd be quite happy if CSV just had a 'frontmatter' block.
Sometimes a bit of background info or description of what the data represents
or its source is really useful.

Removing column headings would be a killer. Efforts in the past to add type
information were misguided, though; by that point you may as well just use
XML.

------
Jaruzel
I can't see this gaining any traction. There's nothing wrong with CSV as it is
for basic data transfer. CSV has been unchanged for so long that any
attempt to update it will fail - there are simply too many CSV parsers already
implemented.

If you are worried about people in the year 3000 understanding your data, add
a .txt file alongside the CSV explaining the fields.

~~~
Someone1234
I agree. The biggest problem with CSV is Excel's terrible support.

And while CSV is primarily used for data transfer, Excel could hypothetically
be a great debug-viewer if it didn't do brain dead stuff like unrecoverably
corrupt UPCs and other long numbers by default.

There's an Excel Uservoice about this very issue from 2015:

[https://excel.uservoice.com/forums/304921-excel-for-
windows-...](https://excel.uservoice.com/forums/304921-excel-for-windows-
desktop-application/suggestions/10374741-stop-excel-from-changing-large-
numbers-actually?page=8&per_page=20)

They're "looking into it." Just as they've been looking into it for the last
twenty years. Any. Day. Now.

~~~
infogulch
Excel's braindead csv corruption is why I never use it for csv now. Add guids
to the list of things it can corrupt. (if the first group is numeric it parses
as an expression which then causes an error because the next group has hex
digits, what!?) Also any long string that happens to only be digits. And two
numbers separated by a "/".

It's so bad that I check any csv file I get for evidence that it's been opened
in Excel, and if I find any, I reject it and make whoever I got it from give
me the original source, or transform it with literally _any other tool_.

------
webdevetc
While it is nice to add some spacing and make things a bit easier to read, I
don't really see the point in this.

If you want it to be easily readable by a human, there are hundreds of
applications designed for that. The most common are spreadsheets like Excel.

And what happens if you have a list of domains, but one contains a really long
URL (300 characters wide)? It will mess up the column alignment for every
single row above and below it as well.

Also, adding comments will completely break existing software. (I think most
programs can handle some extra whitespace, but I don't think you could import
the final example into Excel (untested).)

------
ewheeler
For those looking for a modern take on improving the CSV format, I'd
recommend Frictionless Data's Datapackage specification[1], which basically
consists of a JSON file of metadata accompanying a CSV file, describing
column types, versions, sources, and how to validate the correctness of the
CSV's data. This allows for quite a lot of tooling and workflow improvements
without mucking with the CSV itself.

Another hack to improve CSV workflows is OCHA's HXL[2], which is used by
humanitarian organizations. It basically adds a row of hashtags in addition to
column names, which is surprisingly useful considering the ease of adding
these to a file.

[1] [https://frictionlessdata.io/docs/tabular-data-
package/](https://frictionlessdata.io/docs/tabular-data-package/) [2]
[http://hxlstandard.org/](http://hxlstandard.org/)

~~~
_pwalsh
One of the authors of the Frictionless Data specifications here. The spec
directly relevant to CSV is Table Schema [1], and we’ve also got some nice
tools that leverage Table Schema and the family of specifications, such as
goodtables [2].

[1] [https://frictionlessdata.io/specs/table-
schema/](https://frictionlessdata.io/specs/table-schema/) [2]
[https://github.com/frictionlessdata/goodtables-
py](https://github.com/frictionlessdata/goodtables-py)

~~~
robochat42
This looks interesting and useful. It might be a feature that wouldn't get
used a lot but I've always felt that it would be great to have a 'unit' field
to specify the physical units of the values. Have you ever discussed adding
this as an optional field to the specification?

~~~
_pwalsh
We’ve looked into it:

[https://github.com/frictionlessdata/specs/issues/537](https://github.com/frictionlessdata/specs/issues/537)

Feel free to add your use cases to that issue.

------
lifthrasiir
I was extremely confused that neither the website nor the repository contains
the specification. There is an informal specification scattered across a
separate repo [1], and the whole organization [2] is dedicated to the
documentation and associated Ruby implementations. So, well, good luck, but
next time try to be more specific.

[1] [https://github.com/csv11/csv-next](https://github.com/csv11/csv-next)

[2] [https://github.com/csv11/](https://github.com/csv11/)

~~~
geraldbauer
FYI: The readme (the website) is the specification :-), see
[https://github.com/csv11/csv11.github.io](https://github.com/csv11/csv11.github.io)

PS: csv-next is NOT an informal specification - these are notes (collection of
ideas).

PPS: How do I know? I'm the author of the website - I should know ;-).

~~~
lifthrasiir
Oh, hello! My gut feeling was that csv-next was _pointing_ to the informal
specification (and the notes read like one, though each is specific to a
single problem point).

I guess you might not have understood my complaint (my bad), so let me give
some examples of what a specification should look like.

An informal specification looks like this [1]. You should give enough
examples and rules to use your format and to implement most of it. You should
define keywords that are specific to your specification (common definitions
can be omitted). You should give some (but probably not all) ideas of where
the specification can go wrong: Unicode whitespace, byte order marks,
platform-specific newlines, escape sequences, duplicate keys in the front
matter (well, actually there is no provision for how to put the front matter
at all), numeric overflow, and so on.

A formal specification looks like this [2]. In addition to what an informal
specification provides, you should give lots of examples and formalized rules
(most frequently ABNF [3]) to implement all the things. There should be clear
and reasonable error handling policies. The wording of the specification
should be clear, unambiguous and preferably standardized (there are specific
meanings to "MUST", "SHOULD" etc. [4]). The specification should be honest
about its pros and cons. You should be explicit about the flexibility of the
format: you should give a list of what can be extended or modified later and
what can't.

I strongly suggest that you put up at least an informal specification, and
prepare for the eventual development of a formal specification by pondering
the missing pieces (it is a daunting task, I know).

[1] [https://github.com/toml-
lang/toml/blob/bb47759841ac368d86eb7...](https://github.com/toml-
lang/toml/blob/bb47759841ac368d86eb7a459bd7eea7162b9a80/README.md#user-
content-spec)

[2] [https://tools.ietf.org/html/rfc7049](https://tools.ietf.org/html/rfc7049)

[3]
[https://en.wikipedia.org/wiki/Augmented_Backus%E2%80%93Naur_...](https://en.wikipedia.org/wiki/Augmented_Backus%E2%80%93Naur_form)

[4] [https://tools.ietf.org/html/bcp14](https://tools.ietf.org/html/bcp14)

~~~
geraldbauer
I think you don't understand :-). A specification doesn't always have to be
complicated - that's the point. I'm all for tech notes with all the details.
Here's a challenge for you: write a CSV parser - I know, it's a daunting
task. For inspiration, here's an example from your humble self -
[https://github.com/csv11/csvreader](https://github.com/csv11/csvreader)

~~~
lifthrasiir
> A specification doesn't always have to be complicated, that's the point.

That's right, but your "specification" is not even enough to write a CSV
1.1 file. For example, I mentioned the front-matter issue: there is no
example using it, and I'm deeply confused about what the front matter looks
like (or even where it goes). I see numeric units there, but I can't tell
from your page how they are handled (or, say, what happens if different units
like m² appear).

I'm not saying I'm a good specification writer (I'm not even a native
speaker of English), but I at least try. I once wrote an informal
specification [1] that should be almost enough to write valid files and start
implementing parsers, without being too complicated. You can avoid a
complicated specification without being too vague.

> For inspiration, here's an example from your humble self -
> [https://github.com/csv11/csvreader](https://github.com/csv11/csvreader)

That's frankly much better (in terms of explicitness) than the front page -
why not link to it? :-)

But I believe it is still not sufficient. Rather, I'm now unsure about
the goal of your project: is it a set of Ruby libraries with fancy, modern
but (by themselves) incompatible CSV extensions? Or is it hopefully going to
be a universally used format? If your intention was the former, then your
front page should have been clear about it - it is not about a format but
about a library. And if you are going to have a format, then my suggestions
hold.

[1] [https://github.com/lifthrasiir/cson](https://github.com/lifthrasiir/cson)
(with a bit of formal materials, just for the clarity)

~~~
geraldbauer
The goal of the project is to improve CSV :-) - it's in the title, e.g. CSV
Evolved, and in the version number, 1.1 (that is, it's not version 2.0). It's
about not breaking things - it's about small changes (mostly) for humble
humans (not machines).

It's NOT about one universal CSV format and the ultimate specification to
settle the matter until the end of history etc.

Thanks for the detailed suggestions. I see your points. I appreciate your
helpfulness.

------
int_19h
The best thing we have today for "CSV evolved" is a JSON array of arrays, IMO.
It's much better specified, everyone agrees on how things should be quoted,
it's perfectly human-readable if you put one array per line, its semantics
follow straightforwardly from JSON's, and we already have parsers for it
that produce reasonable output.

Is it slightly more verbose? Sure, but why does it matter? Having to quote all
strings is really not a big deal.
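A minimal sketch of the one-array-per-line idea (JSON Lines style), showing that JSON's quoting rules handle embedded commas without any ad-hoc escaping:

```python
import json

rows = [
    ["id", "name", "capital"],
    ["bc", "British Columbia, eh", "Victoria"],  # embedded comma: no problem
]

# One JSON array per line keeps the file human-readable and diff-friendly.
encoded = "\n".join(json.dumps(row) for row in rows)
decoded = [json.loads(line) for line in encoded.splitlines()]
assert decoded == rows  # round-trips exactly, no quoting ambiguity
```

Any off-the-shelf JSON parser reads each line; no bespoke CSV dialect handling is needed.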

~~~
geraldbauer
Good point. See the CSVJSON format -
[http://csvjson.org](http://csvjson.org) - love it. I will shortly add a new
variant called CSV <3 JSON to the CSV 1.1 repo, csvreader, etc.

------
enygmata
This is not VCS friendly. Resize any column and you have to rewrite 99% of the
file, tracking changes becomes hell on earth.

~~~
jstimpfle
What do you mean by "resize any column"?

~~~
deathanatos
I presume the parent means changing the data s.t. the max length of strings in
a column changes; this would change the padding after that column, so as to
keep subsequent columns aligned and looking nice. This doesn't play well w/
VCS diffs, as a bunch of data that isn't changing semantically is now
changing syntactically, and a simplistic textual diff now reflects that
syntactic change, whereas a human reader would want to see the semantic
change.

Now… most diff tools also have options to ignore whitespace changes, for cases
like this.

------
kristofferR
CSV is hell; glad to see improvements. Some idiot somewhere decided that
Comma-Separated Values in certain locales should be based on semicolons (who
would have thought files would be shared across country borders!?), so when we
open CSV files that are actually comma-separated, all the information ends up
in the first cell (until a semicolon appears).

To get comma separated CSVs to show properly in Excel we have to mess around
with OS language settings. CSV as a format should have died years ago, it's a
shame so many apps/services only export CSV files. Many developers (mainly
US/UK based) are probably not aware of how much of a headache they inflict on
people in other countries by using CSV files.
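A small illustration of the locale problem, with made-up sample data: a semicolon-delimited export with decimal commas parses fine once you know the delimiter, but nothing in the file itself tells you which dialect you have.

```python
import csv
import io

# A European-locale export: decimal commas force ';' as the field separator.
data = io.StringIO("city;population_millions\nBerlin;3,7\nHamburg;1,8\n")
rows = list(csv.reader(data, delimiter=";"))

# The decimal comma still has to be undone by hand before parsing numbers.
berlin = float(rows[1][1].replace(",", "."))
# berlin == 3.7
```

Reading the same file with the default `,` delimiter would yield one giant field per row, exactly the first-cell symptom described above.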

~~~
int_19h
Semicolons come from many countries using the decimal comma in lieu of the
decimal period. Of course, locale-specific number formatting should be a big
no-no in any kind of a portable format anyway. But...

------
jstimpfle
Some time ago I worked on a similar, CSV-like thing that improves on
readability and canonical representation, and which can hold multiple tables
in a single file or stream, like a database:
[http://jstimpfle.de/projects/wsl/main.html](http://jstimpfle.de/projects/wsl/main.html)
(I hope it's not rude to reference it here.)

~~~
geraldbauer
No worries. Thanks for highlighting WSL - whitespace separated literals. The
world needs more of this. Keep it up.

------
pshirali
I like the intent "Easy-to-Write, Easy-to-Read", but the contents of the
README leave much to be desired. There are far too many corner cases the
README doesn't address (many already mentioned below), and I can think of
more - like, how does a parser know that "city: Halifax" is "key: value" and
not just the value "city: Halifax"?

Even if a `spec` is created that clearly defines these cases and standardizes
how they should be handled, does it qualify to be called CSV 1.1 (or 2.0,
really, as it probably won't play nicely with existing CSV 1.0
implementations)? It is almost as good as creating a new format altogether.
And there are many to compete with.

I also wonder if it REALLY solves the problems it aims to solve (even if the
tech specs were in place). CSV in any form is not human-friendly. This is
especially true when you have many columns (say >30). Even if the records were
spaced out in a human-friendly way in STATE-1, editing records whose values
vary widely in width (within a single column) will soon mess up the
justification by the time you get to STATE-2. If you drop the requirement for
a `fixed-number-of-fields` and allow `key:value`-style named values, that
hurts readability (by humans) even more!

I've had good success `reading` CSVs with `csvtk`
([https://github.com/shenwei356/csvtk](https://github.com/shenwei356/csvtk)).
It provides excellent support for pretty-printing CSVs, filtering select
fields, etc. `Writing` is still a pain, but I still feel spreadsheets are the
way to go; they have been around for decades. It's sad that formatting by
specific tools is screwing up CSVs, but solving the composition/modification
requirement purely by way of formatted plain text is a really tall ask.

------
cpeterso
ASCII defines control codes for record and field separators, thus avoiding
CSV's problems with quoting and escaping characters. The only hassle is
entering and displaying the control characters.

[https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Field_...](https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Field_separators)
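A minimal sketch of using the ASCII unit (0x1F) and record (0x1E) separators in place of commas and newlines; since these control characters never appear in ordinary text, no quoting layer is needed at all:

```python
US = "\x1f"  # ASCII unit separator: between fields
RS = "\x1e"  # ASCII record separator: between records

def encode(rows):
    # Join fields with US and records with RS; no escaping required.
    return RS.join(US.join(fields) for fields in rows)

def decode(blob):
    return [record.split(US) for record in blob.split(RS)]

rows = [["bc", "British Columbia, eh"], ["ab", "Alberta"]]
assert decode(encode(rows)) == rows  # commas and quotes pass through untouched
```

The entering/displaying hassle mentioned above is the trade-off: most editors render these separators as invisible or placeholder glyphs.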

------
lozzo
Good effort. And I like the minimal documentation. However - and I would love
to be proven wrong - I think you are about 20 years too late for this project
to get any interest aside from some ephemeral moments on Hacker News...

~~~
geraldbauer
No worries. It's just a write-up of what gets used in football.db, beer.db,
world.db, etc., day in, day out. Also remember - as they say, it's never too
late :-).

[1] [https://github.com/openfootball](https://github.com/openfootball) [2]
[https://github.com/openmundi](https://github.com/openmundi) [3]
[https://github.com/openbeer](https://github.com/openbeer)

------
keeguon
If it ain't broken don't fix it.

~~~
juststeve
It was already broken, but I don't think this will fix it either.

------
eksemplar
I’m sorry, but when do you ever need to read a CSV file in an environment
where you don’t have a spreadsheet editor anyway?

I work in the public sector; we use a lot of CSV, even with non-tech-savvy
people.

I’ve never heard the use case for this project.

Now that might not matter, but you're breaking CSV to fulfill a nonexistent
use case.

------
peapicker
How does this improve on RFC 4180? (I only _wish_ excel and many other parsers
behaved predictably like that RFC, but it is a decent proposal. )

( [https://www.ietf.org/rfc/rfc4180.txt](https://www.ietf.org/rfc/rfc4180.txt)
)

~~~
geraldbauer
The [strict] RFC 4180 CSV format memo is way too simplistic and "specifies" a
format for machine reading and writing. It's not a human CSV format. What's
human? A format that you hand-edit and so on - thus, you want comments,
blank lines, shortcuts, etc.

Just for the record to quote from the memo:

This memo provides information for the internet community. IT DOES NOT SPECIFY
AN INTERNET STANDARD OF ANY KIND. IT DOES NOT SPECIFY AN INTERNET STANDARD OF
ANY KIND. IT DOES NOT SPECIFY AN INTERNET STANDARD OF ANY KIND.

Just repeating it three times in case you missed it, see
[https://www.ietf.org/rfc/rfc4180.txt](https://www.ietf.org/rfc/rfc4180.txt)

------
digitalzombie
Uh... so the bottom one of the two is CSV 1.1?

How do you define headers? There's no header in the bottom CSV version, so I
have to manually type headers for my data frame?

I got a dataset with 87 obs and ~8700 columns here. I'm not going to manually
name those columns. What's the solution?

~~~
geraldbauer
Yes, you can add a header, of course. No worries. CSV 1.1 is still CSV. It's
not CSV 2.0 :-).

------
billpg
[ ["If", "we're", "going", "to", "improve", "CSV"], ["Let's", "do", "it",
"with", "some", "JSON"] ]

~~~
geraldbauer
Good point. You're not the first one (even in this thread). To repost:

See the CSVJSON format - [http://csvjson.org](http://csvjson.org) - love it.
I will shortly add a new variant called CSV <3 JSON to the CSV 1.1 repo,
csvreader, etc.

------
cntlzw
[https://xkcd.com/927/](https://xkcd.com/927/)

------
bachmeier
You can already do this in R with a fixed-width text file, comments, and the
names of variables in a header. The commas serve no actual purpose AFAICT.

------
worldofgreed
The number parsing is going to be a nightmare with this!

------
kalleboo
Reminds me of old IBM punch card fixed-column-aligned tables... Still in use
in many systems (such as the US ACH wire system...)

Everything old is new again.

------
mvanveen
What if you want to start a row with a field that has a `#`? Not to mention
how to handle backwards compatibility with the above concern...

~~~
geraldbauer
If a field starts with #, you have to escape it - that is, put it in quotes,
e.g. "# no comment".
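A sketch of that rule: an unquoted leading '#' marks a comment line, while a quoted field starting with '#' is ordinary data (standard CSV quoting does the escaping):

```python
import csv

def parse_line(line):
    """Return None for a comment line, else the parsed fields."""
    if line.lstrip().startswith("#"):
        return None  # unquoted leading '#': the whole line is a comment
    return next(csv.reader([line]))

# parse_line("# just a note")     -> None
# parse_line('"# no comment",x')  -> ['# no comment', 'x']
```

The backwards-compatibility concern remains: a plain RFC 4180 reader would hand the comment line back as data.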

------
CptKriechstrom
I don't get the point. If you're going to replace your CSV importer/exporter,
you can switch to a non-ridiculous format anyway.

------
ripperdoc
Why not use tab-separated values? You get improved readability and need less
quoting - or am I missing something?

~~~
geraldbauer
See the Awesome CSV page, for the discussion about tabs vs commas -
[https://github.com/csv11/awesome-csv#tab](https://github.com/csv11/awesome-
csv#tab)

Yes, ideally the tab is perfect (no escape rules needed, etc.). In practice,
as a human you cannot tell whether a tab is a space or a space is a tab, and
then you will get into trouble reading / writing your data. Anyway, both are
great and needed; use tab2csv or csv2tab to convert or pipe (when using
command-line tools) :-)

------
icholy
Or just open the CSV in a spreadsheet?

------
clatan
Yeah... No. This doesn't solve anything.

