
Rust and CSV parsing - burntsushi
http://blog.burntsushi.net/csv/
======
pornel
> If you wanted to build your own csv-like library, you could build it on top
> of csv-core.

It's pretty cool that Rust's dependency handling is so easy (thanks Cargo!)
that even such a single-task library can come with a smaller, even more
reusable library inside.

~~~
burntsushi
AND...

* csv-core doesn't require use of the standard library. (Notably, zero allocations.)

* No use of `unsafe` and it's very very fast.

:-)

~~~
_ar7
Could you elaborate on the zero allocation part? I made a CSV parser based on
yours [0] before the 1.0.0-beta rewrite, and at the very least I have to
allocate as much memory to hold a single field in the CSV (and then reuse that
memory).

[0]: [https://github.com/AriaFallah/csv-parser](https://github.com/AriaFallah/csv-parser)

~~~
Manishearth
It's an iterator, so the parser will parse fields from the already-allocated
raw CSV string and yield them to you as it reads them as slices borrowed from
the string. It's up to you to store them in whatever structured format you
want.
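
To illustrate the idea (this is a hand-rolled sketch, not the crate's actual API): once the raw CSV text is in memory, an iterator can hand back fields as `&str` slices borrowed from that buffer, so the iteration itself allocates nothing. The `fields` helper below is hypothetical and ignores quoting, which a real parser must handle.

```rust
// Hypothetical sketch of zero-allocation field iteration: each yielded
// field is a slice borrowed from the already-loaded line, so no new
// memory is allocated per field. A real CSV parser would also handle
// quoted fields and escaped separators.
fn fields(line: &str) -> impl Iterator<Item = &str> {
    line.split(',')
}

fn main() {
    let line = "name,city,count";
    // Collecting here allocates a Vec, but only because *we* chose to
    // store the borrowed slices in a structured form.
    let parsed: Vec<&str> = fields(line).collect();
    assert_eq!(parsed, ["name", "city", "count"]);
}
```

Storing the fields somewhere (a `Vec`, a struct, etc.) is the caller's decision, which is where any allocation happens.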

------
userbinator
I've looked through all the docs and there's no mention of RFC 4180
compliance? That's one of the things a lot of people looking for a CSV library
would consider highly important.

~~~
jasode
RFC4180 is more helpful for _generating_ correct CSV files. It's not as useful
for _parsing_ files, since many (most?) csv writers won't have followed
RFC4180 in the first place.

Or to put it differently:

 _"Be conservative in what you send {use RFC4180!}, be liberal in what you
accept {parse broken, ambiguous csv that was ignorant of RFC4180}."_ -- from
[https://en.wikipedia.org/wiki/Robustness_principle](https://en.wikipedia.org/wiki/Robustness_principle)

It also doesn't help that RFC4180 is dated 2005: it tries to impose a
standard 20 to 30 years after csv files started being created ad hoc in the
wild without any formal specification.

~~~
pdkl95
> Robustness_principle

A slight variation on this is important for security. Be liberal in what you
accept, but _define_ what "liberal" means. It's important to define what
counts as a valid input. If you don't have a validator/recognizer verify that
the input is acceptable, you are probably creating a "weird machine"[1]
waiting to be exploited in ways you didn't expect.

I highly recommend Meredith and Sergey's talk[2][3] at 28c3, "The Science of
Insecurity", where they explain this from a language-theoretic perspective.

[1]
[http://www.cs.dartmouth.edu/~sergey/wm/](http://www.cs.dartmouth.edu/~sergey/wm/)

[2] [https://media.ccc.de/v/28c3-4763-en-the_science_of_insecurity](https://media.ccc.de/v/28c3-4763-en-the_science_of_insecurity)

[3] [http://www.cs.dartmouth.edu/~sergey/langsec/insecurity-theory-28c3.pdf](http://www.cs.dartmouth.edu/~sergey/langsec/insecurity-theory-28c3.pdf)

~~~
MichaelGG
Except that's not an interpretation of liberal anyone uses. That'd just be "by
the spec". The robustness principle in practice has made things a nightmare.
Consider SIP and HTTP. SIP explicitly encourages implementers to guess at the
meaning of messages, and delights in how complicated their parsing rules are.

But even without getting crazy, simple line endings end up being a security
issue. Some treat LF like CRLF. Or CRCR as a single line break. So you end up
with a setup where your SIP or HTTP proxy will interpret a message one way,
and your server interprets it differently. This lets you exploit things by
sneaking in headers that your proxy would otherwise deny.

~~~
pdkl95
The point is it's important to _define a new spec_ if you are being "liberal"
in what you accept. For example, while RFC 4180 defines line endings as CRLF,
it might be reasonable to "be liberal" and accept files with just LF _if and
only if_ you add a rule similar to

    <line-ending> ::= "\r\n" | "\n"

to your input grammar, _and_ properly recognize that the input matches that
grammar before using any parsed values. It's an extension ("being liberal") to
the official spec, but still strictly defined in the implemented parser.
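
As a rough sketch of that distinction (my own illustration, not from the csv crate): under the extended grammar, CRLF and bare LF are both accepted terminators, but anything outside the grammar, such as a bare CR, is rejected outright instead of being guessed at.

```rust
// Sketch of a strict line splitter implementing the extended grammar
//     <line-ending> ::= "\r\n" | "\n"
// CRLF and bare LF are accepted; a bare CR is outside the grammar, so
// the input is rejected rather than interpreted by guesswork.
fn split_lines(input: &str) -> Result<Vec<&str>, String> {
    let mut lines = Vec::new();
    let bytes = input.as_bytes();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\n' {
            // Trim a preceding CR so "\r\n" and "\n" both end the line.
            let end = if i > start && bytes[i - 1] == b'\r' { i - 1 } else { i };
            lines.push(&input[start..end]);
            start = i + 1;
        } else if bytes[i] == b'\r' && (i + 1 >= bytes.len() || bytes[i + 1] != b'\n') {
            // Bare CR: not in the grammar, so fail instead of guessing.
            return Err(format!("bare CR at byte {}", i));
        }
        i += 1;
    }
    if start < bytes.len() {
        lines.push(&input[start..]);
    }
    Ok(lines)
}

fn main() {
    assert_eq!(split_lines("a,b\r\nc,d\n").unwrap(), ["a,b", "c,d"]);
    assert!(split_lines("a,b\rc,d").is_err());
}
```

The point is that the "liberal" behavior is itself a defined grammar, so two implementations of it cannot disagree about where a line ends.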

> This lets you exploit things by sneaking in headers

In the talk from my previous [2][3], Sergey specifically mentions Travis
Goodspeed's packet-in-packet[4] exploit as an example of why you have to
define what "liberal" means if you want to implement Postel's principle. (You
should also limit the input grammar's complexity to deterministic context-free,
or the recognizer will be undecidable.)

> in practice has made things a nightmare

That's what [2][3] is trying to fix. Stop using Turing Complete input
languages, explicitly define your input grammar no matter what you're choosing
to parse, and formally recognize all input against that grammar. The data
either matches the defined grammar or it should be failed as invalid. Any
allowances for sloppy input must be baked into that process.

[4] [http://travisgoodspeed.blogspot.com/2011/09/remotely-exploiting-phy-layer.html](http://travisgoodspeed.blogspot.com/2011/09/remotely-exploiting-phy-layer.html)

------
chaosfox
Don't forget to check out xsv! It's the missing piece between standard unix
tools and SQL. I love it.

[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)

------
tankfeeder
csv parsing on PicoLisp:
[https://bitbucket.org/mihailp/tankfeeder/src/5a91a025d78eacf68655b6dbec5c2dd346624146/csv.l?at=default&fileviewer=file-view-default](https://bitbucket.org/mihailp/tankfeeder/src/5a91a025d78eacf68655b6dbec5c2dd346624146/csv.l?at=default&fileviewer=file-view-default)

