
RFC4180 – Common Format and MIME Type for CSV Files (2005) - geezerjay
https://tools.ietf.org/html/rfc4180
======
deusex_
I don't see the semicolon separated CSV format going away. CSV is and will be
a mess.

The RFC should perhaps capture the current state of affairs, much as newer
Web standards codify existing practice.

~~~
tinus_hn
If someone comes up with a clever alternative it'll be gone really quickly.
Look at how quickly json caught on.

CSV is easy to write and easy to parse if you only care about 90% of the use
cases. If you need to do the other 10% it's much harder and you get a lot of
interoperability problems. And that 10% is only becoming more important due to
internationalization.
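A minimal Python sketch of where the easy 90% ends (the data here is invented for illustration): a naive split works until a field contains the delimiter itself.

```python
import csv

# A naive split handles the easy 90%...
row = "a,b,c"
assert row.split(",") == ["a", "b", "c"]

# ...but breaks as soon as a quoted field contains the delimiter.
row = '1,"Smith, John",2'
assert row.split(",") == ['1', '"Smith', ' John"', '2']  # wrong: four fields

# A real CSV reader understands the quoting.
assert next(csv.reader([row])) == ["1", "Smith, John", "2"]
```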

------
geezerjay
submitter here.

I've submitted the link to the formal specification of the comma-separated
value (CSV) file format as I've noticed that someone else had posted a link to
a CSV parser.

[https://news.ycombinator.com/item?id=12770373](https://news.ycombinator.com/item?id=12770373)

A CSV parser is trivial to implement, and the complexity of the endeavour is
so low that it doesn't even justify a tutorial on how to implement a proper
parser.
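For what it's worth, the core of an RFC 4180-style parser can be sketched in a few dozen lines of Python. This is a simplified illustration, not a production parser: it handles quoting and `""` escapes but does no error reporting.

```python
def parse_csv(text):
    """Sketch of an RFC 4180-style parser: comma-delimited fields,
    double-quoted fields, "" as an escaped quote, CRLF or LF rows."""
    rows, row, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if in_quotes:
            if c == '"':
                if i + 1 < n and text[i + 1] == '"':  # escaped quote
                    field.append('"')
                    i += 1
                else:
                    in_quotes = False
            else:
                field.append(c)  # delimiters and newlines are literal here
        elif c == '"':
            in_quotes = True
        elif c == ',':
            row.append(''.join(field))
            field = []
        elif c == '\n':
            row.append(''.join(field))
            field = []
            rows.append(row)
            row = []
        elif c != '\r':
            field.append(c)
        i += 1
    if field or row:  # last row without a trailing newline
        row.append(''.join(field))
        rows.append(row)
    return rows

assert parse_csv('a,"b,""c"""\r\nd,e\r\n') == [['a', 'b,"c"'], ['d', 'e']]
```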

~~~
twic
It's worth pointing out that nobody actually uses the RFC 4180 version of CSV.
And so ...

> A CSV parser is trivial to implement

A parser for any one dialect of CSV is trivial to implement. A parser which
can robustly parse many different dialects of CSV is quite a challenge. Have a
read of Python's:

[https://github.com/python/cpython/blob/master/Lib/csv.py](https://github.com/python/cpython/blob/master/Lib/csv.py)
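For example, the module's Sniffer guesses a dialect from a sample before parsing, which is exactly the kind of heuristic a multi-dialect parser ends up needing. A small sketch with invented data:

```python
import csv
import io

# A semicolon-delimited dialect, common in locales where the comma
# is the decimal separator.
data = 'name;city\n"Smith; John";Berlin\n'

# Guess the dialect from a sample, then parse with it.
dialect = csv.Sniffer().sniff(data)
assert dialect.delimiter == ';'

rows = list(csv.reader(io.StringIO(data), dialect))
assert rows == [['name', 'city'], ['Smith; John', 'Berlin']]
```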

EDIT: It's probably not fair to say that nobody actually uses the RFC 4180
version of CSV. Some programs use it explicitly, and many more conform to it
by chance because it codified the most common practices around CSV. For
example, it appears that Excel conforms to RFC 4180:

[http://superuser.com/questions/302334/true-difference-between-excel-csv-and-standard-csv](http://superuser.com/questions/302334/true-difference-between-excel-csv-and-standard-csv)

~~~
geezerjay
> A parser for any one dialect of CSV is trivial to implement.

That's what the formal definition is for. If a document doesn't match the CSV
description then it isn't a CSV document. It's something else.

> A parser which can robustly parse many different dialects of CSV is quite a
> challenge.

A hand-written parser that's less than 1k LOC is not a challenge. JSON is
also a trivial format to parse, yet a JSON lexer alone exceeds 500 LOC.

~~~
twic
> That's what the formal definition is for. If a document doesn't match the
> CSV description then it isn't a CSV document. It's something else.

Right, and if there was a formal definition of CSV, that's the situation we'd
be in. RFC 4180 isn't one. It's a description of _one dialect_ of CSV, which
was written decades after CSV came into use. There are dialects of CSV out
there which don't conform to RFC 4180, and which are perfectly valid in their
own right.

> JSON is also a trivial format to parse, and a JSON lexer alone exceeds
> 500loc

FWIW, I wrote a JSON tokenizer in 235 lines, and it really should be shorter:

[https://bitbucket.org/twic/jsonomic/src/e0cae9587b20def003b90d7b2f3a166213a692a3/src/main/java/li/earth/urchin/twic/json/parser/Tokeniser.java?at=default&fileviewer=file-view-default](https://bitbucket.org/twic/jsonomic/src/e0cae9587b20def003b90d7b2f3a166213a692a3/src/main/java/li/earth/urchin/twic/json/parser/Tokeniser.java?at=default&fileviewer=file-view-default)

~~~
geezerjay
> Right, and if there was a formal definition of CSV, that's the situation
> we'd be in. RFC 4180 isn't one.

You're mixing things up. You're confusing a "formal definition", which
RFC 4180 is, with an official standard, which no RFC is.

Yet, that doesn't stop the world from working based on RFCs such as this one.

> It's a description of one dialect of CSV

There you go.

> FWIW, i wrote a JSON tokenizer in 235 lines, and it really should be
> shorter:

I've browsed through your code and I have to say you did a good job. Your
coding style is very frugal with line breaks and with breaking up
statements, in a manner that doesn't affect readability much. OTOH, the
absence of comments should be addressed.

However, I should point out a couple of design flaws in your lexer.

First, your lexer handles any invalid token by throwing exceptions. Not only
is that bad form and a misuse of what exceptions are for (a document with a
parsing issue isn't an exceptional event), but any parser built on your
lexer is unable to catch syntactical errors, provide informative error
messages, and, most importantly, recover from parsing issues.

One standard technique is to represent frequent lexical/syntactical errors
as tokens and let the parser use them to report and recover from errors.

The other design flaw is that you failed to include length limits on your
strings. That's a potential source of problems, including security issues.
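A sketch of what such a limit might look like, in Python for brevity (the cap value and the helper name are invented for illustration; the linked tokenizer is Java):

```python
# Arbitrary illustrative cap; a real lexer would make this configurable.
MAX_STRING = 1 << 20

def read_string_chars(chars, limit=MAX_STRING):
    """Accumulate the body of a quoted string token, refusing unbounded
    input so a hostile document can't exhaust memory. Sketch only;
    assumes the opening quote has already been consumed."""
    out = []
    for c in chars:
        if c == '"':
            return ''.join(out)
        if len(out) >= limit:
            raise ValueError("string token longer than %d characters" % limit)
        out.append(c)
    raise ValueError("unterminated string")

assert read_string_chars(iter('hello"')) == 'hello'
```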

Another issue is that your lexer fails to properly parse JSON numbers, which
are the most complex part of the lexer. Your lexer simply dumps any
character matched by isNumber into a string, and then tries to convert it to
a number by calling Java's Double.parseDouble(). As you may know, JSON's
definition of a number doesn't match Java's specification. If your lexer
supported JSON's number format, it would easily double the number of states
it already has.

So, even though your lexer is a couple of hundred LOC shorter than my
estimate, it is in fact missing basic and fundamental chunks required by a
JSON lexer. Once you add those in, you can recount your LOC and see how far
it steers away from the 500 estimate I pointed out.

~~~
twic
Good point about the numbers! Double::parseDouble accepts a superset of JSON
numbers, so the risk here is that something which is not a valid JSON number
will be accepted. That could be tackled by building up the string, checking it
against a regular expression for JSON numbers, and then parsing it. It
wouldn't add any states to the parser, and would take two lines of code.
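The tokenizer under discussion is Java, but the same check can be sketched in Python (the regex transliterates the number production from the JSON grammar; `parse_json_number` is a hypothetical helper, not part of the linked code):

```python
import re

# Number production from the JSON grammar: int frac? exp?
JSON_NUMBER = re.compile(r'-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\Z')

def parse_json_number(text):
    """Validate against the JSON grammar before handing off to the
    permissive built-in float parser."""
    if not JSON_NUMBER.match(text):
        raise ValueError('not a valid JSON number: %r' % text)
    return float(text)

assert parse_json_number('-1.5e3') == -1500.0

# float() alone (like Double.parseDouble) happily accepts all of these,
# yet none of them are valid JSON numbers:
for bad in ('01', '1.', '.5', '+1', 'NaN', 'Infinity'):
    float(bad)  # no error from the permissive parser
    try:
        parse_json_number(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(bad)
```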

I would suggest that in a JSON parser, attempting to recover from parse
errors, or supporting frequent errors, would be a mistake. JSON, unlike CSV,
has a rigorous definition, and we wouldn't be doing the world any favours by
tolerating deviations from it.

I'm not sure what the security issues with long strings would be. Text gets
accumulated in a StringBuilder, which will handle strings up to 2 GB and then
throw an exception. I don't think there's any way to get into an incorrect
state.

Your other points relate to design decisions, and they're ones where I'm
quite happy with the options I chose.

------
1wd
Do CSV files really mainly use CRLF line endings?

~~~
jgalt212
That's a good question. I know email is supposed to use only CRLF line
endings, but we use LF across a number of vendors' email API endpoints and
have never had any issues.

