> I suspect the problem with CSV will be working around the myriad of broken implementations and making sense of malformed data.
Indeed. My CSV parser (and Python's) is pretty interesting in that regard. There are very few things that actually cause a parse error. You can see here[1] that the only two errors occur if there are records of unequal length (which can be disabled by enabling the "flexible" option) and invalid UTF-8 data (which can be avoided by reading everything into plain byte strings). That means that any arbitrary data gets parsed into something. There are various mechanisms in the CSV parser's state machine that make decisions for you. Mostly, I used the same types of decisions that Python makes. For example:
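Here is a minimal sketch of the kind of input I mean, run through Python's standard `csv` module with its default (non-strict) dialect. The two sample lines are illustrative picks for this sketch; each puts a quote somewhere RFC 4180 does not allow one.

```python
import csv

# Illustrative inputs (chosen for this sketch). Neither line is valid RFC 4180.
no_space = 'a,"b"c,d'     # quoted field followed by a stray character
with_space = 'a, "b"c,d'  # same thing, but a space precedes the opening quote

# csv.reader accepts any iterable of lines, so a one-element list is enough.
rows = [next(csv.reader([line])) for line in (no_space, with_space)]

for line, row in zip((no_space, with_space), rows):
    print(repr(line), "->", row)

# 'a,"b"c,d' -> ['a', 'bc', 'd']
# 'a, "b"c,d' -> ['a', ' "b"c', 'd']
```

In the first line the quotes are treated as quoting and dropped; in the second, the leading space starts an unquoted field, so the quotes are kept as literal data.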
Whaaaa? Yeah, if our CSV parsers were conformant with the spec, then both of these examples should fail. But they succeed and result in slightly different interpretations based on whether a space character precedes the quote. Therefore, "good" CSV parsers tend to implement a superset of RFC 4180 when parsing, but usually implement it strictly when writing.
(My CSV parser ends up with the same parse as Python here, because it seemed like a good decision to follow its lead since it is used ubiquitously.)
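To illustrate the "strict when writing" half, here is a small sketch with Python's `csv.writer` under its default dialect (again, the sample record is just something I picked). Fields containing a quote or a delimiter get quoted, and embedded quotes get doubled, so the output round-trips cleanly.

```python
import csv
import io

# Illustrative record: one field has an embedded quote, another a comma.
record = ['a', 'b"c', 'd,e']

buf = io.StringIO()
csv.writer(buf).writerow(record)
written = buf.getvalue()
print(repr(written))  # 'a,"b""c","d,e"\r\n'

# Reading the written text back yields the original fields.
reparsed = next(csv.reader(io.StringIO(written, newline='')))
print(reparsed == record)  # True
```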
[1] - http://burntsushi.net/rustdoc/csv/enum.ParseErrorKind.html