
CSV as a Data Source - thingsilearned
http://chartio.com/blog/2013/08/csvdata
======
jeremysmyth
MySQL has supported CSV as a data source for quite some time:
[http://dev.mysql.com/doc/refman/5.6/en/csv-storage-engine.ht...](http://dev.mysql.com/doc/refman/5.6/en/csv-storage-engine.html)

You can run full SQL queries directly against a text file as if it was a
table.

~~~
thingsilearned
Yeah, and that's how we used to direct customers who wanted to upload CSV
data. We even wrote some nice tutorials on how to set it up:

[https://chartio.com/education/databases/excel-to-mysql](https://chartio.com/education/databases/excel-to-mysql)

But we found that many people either aren't technical enough, or didn't want
to go through the hassle of setting up a MySQL instance, defining a schema,
and cleaning the data. So now we do that for them.

~~~
tryitnow
Good call. I'm a financial analyst who is becoming increasingly comfortable
with code in general, but I can tell you 80-90% of business users lose
interest as soon as anything SQL comes up. Heck, 70% of business users chicken
out if an Access db is involved.

So this should really get a much, much bigger audience to use chartio.

------
guard-of-terra
Data isn't "locked up" in CSV, rather it is available in CSV.

CSV is one of the most open things possible.

As a programmer I do a lot of one-off makeshift data reports for other
people, and I always use CSV (or, more precisely, tab-separated) because
that's what every program happily emits and consumes. If one doesn't, it's
trivial to transform thanks to UNIX sort, awk and uniq.

~~~
zbanks
Once your CSV (or TSV) files start having quoted fields, they become _very_
tricky to parse using standard multi-purpose tools like sort, awk, & uniq.

It's hard enough when you have delimiters in quoted fields, but dealing with
quoted newlines starts to become unreasonable, especially for line-based
tools.

CSV files, as you say, are absolutely _wonderful_ to create. Problems come up
when you try to parse files _other_ people write. Not everyone follows RFC
4180.
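
To make the failure mode concrete, a small Python sketch (not from the
thread, just an illustration; the csv module stands in for a "real parser"):

```python
import csv
import io

row = '1,"Acme, Inc.","line one\nline two"\n'

# Naive line-based splitting (effectively what awk/cut do) breaks on the
# quoted comma and can never treat the quoted newline as one field:
naive = row.split(',')  # ['1', '"Acme', ' Inc."', ...]

# A real CSV parser keeps quoted delimiters and newlines intact:
fields = next(csv.reader(io.StringIO(row)))
# fields == ['1', 'Acme, Inc.', 'line one\nline two']
```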

~~~
guard-of-terra
With tab separated values this is not a big problem in practice. On the other
hand, you can sort tsv, but you can't sort quoted csv.

~~~
sopooneo
Why not? I ask in earnest.

~~~
guard-of-terra
sort -n gets confused by quotes.

------
JulianMorrison
CSV is the biggest pile of nuisance you'd never expect from a seemingly simple
data format.

- Header line or none?

- "\n" or "\r\n"?

- Is there a newline at the last line? How about two?

- Escape quotes with doubling or backslash? How about both in the same file?
How about both, inconsistently, in different fields? How about quotes
including a newline and commas?

- Strings always quoted? Only if necessary? Is ,, a null or an empty string,
or an error?

- How about mixed line lengths? Are missing trailing entries nulls? How about
multiple data types in a file, with the first field being type, and line
length only fixed per type?

I have generally found "TSV with a rule that data cannot represent tabs or
newlines, period" to be vastly superior.
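
For what it's worth, Python's csv module treats most of those questions as
explicit dialect parameters rather than guessing; a quick illustrative
sketch:

```python
import csv
import io

data = 'a,"she said ""hi""",\r\n'

# Doubling vs backslash escaping, "\r\n" terminators, and quoting policy
# are all per-dialect decisions; here we use RFC 4180-ish defaults:
row = next(csv.reader(io.StringIO(data), doublequote=True))
# row == ['a', 'she said "hi"', '']
# Note the trailing comma comes back as an empty string: the format itself
# has no way to distinguish an empty string from a null.
```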

~~~
jwilliams
TSV has all those same issues. However, you just have less frequent need of a
tab, so most of the escaping edge cases never come up.

Those aspects are defined in RFC 4180 - just a lot of systems don't bother.
How would you define a simpler data format?

~~~
JulianMorrison
It's a lot easier to administratively ban tabs/newlines in your data than it
is to ban commas, and TSV doesn't have escape mechanisms or quoting. So you
actually can parse it with line.chomp().split("\t"), and that doesn't break
horrendously.

TSV is streamable and minimally wasteful, I rather approve of it. Netstrings
are better though if having sized data and nested data is needed. They are
proof against all the ills of quoting and escapes.
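
A minimal netstring round-trip, sketched here in Python, shows why: the
payload is length-prefixed, so no quoting or escaping is ever needed:

```python
def netstring_encode(data: bytes) -> bytes:
    # djb's netstring format: b"<length>:<payload>,"
    return str(len(data)).encode('ascii') + b':' + data + b','

def netstring_decode(blob: bytes) -> bytes:
    length, sep, rest = blob.partition(b':')
    if not sep:
        raise ValueError('missing length prefix')
    n = int(length)
    if rest[n:n + 1] != b',':
        raise ValueError('malformed netstring')
    return rest[:n]

# Tabs, newlines, commas, quotes: all pass through untouched.
encoded = netstring_encode(b'tabs\tand\nnewlines, "quotes"')
assert netstring_decode(encoded) == b'tabs\tand\nnewlines, "quotes"'
```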

~~~
jwilliams
Largely agree. I was just picking up on "vastly superior". All the issues of
line breaks, nulls, column lengths, headers still remain.

You're adding the constraint that you can't use tabs or newlines in your data
(to use a newline in your String, you'd need to escape it). In all other
cases, you need escaping, and once you've assumed escaping then CSV and TSV
aren't really any different.

------
choffstein
Is it possible to set this up so that you can point it to a remote URL to
fetch the CSV? For example, I'd love it if I could point it towards a Yahoo!
Finance API call that returns a CSV file and have my dashboard auto-update. Or
Dropbox URL. Or any remote URL. That would be awesome.
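
Not a Chartio feature, but for anyone scripting it themselves, a rough
Python sketch (the function names are placeholders, not any real API):

```python
import csv
import io
from urllib.request import urlopen

def fetch_csv(url):
    """Fetch a remote CSV (a finance API call, a Dropbox direct link,
    any URL) and parse it into a list of rows."""
    with urlopen(url) as resp:
        text = resp.read().decode('utf-8')
    return parse_csv(text)

def parse_csv(text):
    # Quoting and newline handling are left to csv.reader.
    return list(csv.reader(io.StringIO(text)))
```

A cron job re-running `fetch_csv` and re-uploading the result would
approximate the auto-updating dashboard described above.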

~~~
maspwr
That exact use case is not currently possible, but it's something we've talked
a bit about. We're still considering doing something like it in the future.

------
michaelrbock
As someone who has worked in an industry where FTP'ing massive CSV files was
the norm, I must say this looks really nice.

~~~
thingsilearned
Thanks!

------
tbrownaw
The biggest issue I have with csv files is when the people sending them
_don't get the $%^&@ quotes right_. The second biggest issue is when they
say "sorry, we're not able to do that" after I tell them exactly how to fix
what they're sending.

~~~
akinder
What is the proper way to encode the quotes? I'm playing around with writing a
Livejournal data dump reader and will have to do the quote dance pretty soon.
In LJ files they are encoded like so:

col1,col2,col3,"Longer data and somethin ""with"" quotes",col5

~~~
tbrownaw
Yes, that's how they're supposed to be. Any field with quotes, or commas, or
newlines in the data needs to have quotes around it. Any quotes in the data
need to be escaped by preceding them with another quote.
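
Python's csv writer produces exactly that doubling by default; a quick
sketch using the same example row:

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(
    ['col1', 'col2', 'col3', 'Longer data and somethin "with" quotes', 'col5'])
# QUOTE_MINIMAL (the default) quotes only the field that needs it and
# doubles the embedded quotes, per RFC 4180:
# buf.getvalue() ==
#   'col1,col2,col3,"Longer data and somethin ""with"" quotes",col5\r\n'
```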

~~~
justinator
From what I understand you're basically right, but only in the sense that
that's what most people do. There's no "CSV Data Format" spec; it's all just
what most people agree on, most of the time. Unless someone has another
idea.

~~~
steveklabnik
text/csv is defined in RFC 4180:
[https://tools.ietf.org/html/rfc4180](https://tools.ietf.org/html/rfc4180)

~~~
nemetroid
Strictly speaking, the definition in RFC 4180 mandates US-ASCII, which makes
it unusable for many purposes. I guess there's nothing practical stopping
you from using another encoding, though.

~~~
steveklabnik

> Common usage of CSV is US-ASCII, but other character sets defined
> by IANA for the "text" tree may be used in conjunction with the
> "charset" parameter.

[http://www.rfc-editor.org/rfc/rfc6657.txt](http://www.rfc-editor.org/rfc/rfc6657.txt)

[https://www.iana.org/assignments/character-sets/character-se...](https://www.iana.org/assignments/character-sets/character-sets.xhtml)

------
dergachev
If you use SQL, check out
[http://github.com/dergachev/csv2sqlite](http://github.com/dergachev/csv2sqlite)

It's a little Ruby script that, in one command, takes one or more CSV files,
infers simplistic SQLite table definitions from their structure, and then
creates a new SQLite database file populated with the structure and data
from those CSVs.
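
For illustration, a hypothetical Python equivalent of the same idea fits in
a few lines (every column simply typed TEXT, names taken from the header
row):

```python
import csv
import sqlite3

def csv_to_sqlite(csv_path, db_path, table):
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        conn = sqlite3.connect(db_path)
        # Simplistic schema: every column is TEXT, names from row one.
        cols = ', '.join('"%s" TEXT' % name for name in header)
        conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
        placeholders = ', '.join('?' * len(header))
        conn.executemany(
            'INSERT INTO "%s" VALUES (%s)' % (table, placeholders), reader)
        conn.commit()
    return conn
```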

~~~
eli
Hey neat. A csv2sqlite script is one of the first things I wrote in Python:
[https://github.com/elidickinson/csv-tools](https://github.com/elidickinson/csv-tools)

It's a terrible hack, but I actually still use it pretty frequently.

------
caissy
That is really nice. I remember implementing something similar to steps 2
and 3 on AppEngine a few years back for a project. It turned out quite nice,
and non-tech-savvy users used it without many problems. We managed to have
something that could easily work with different CSV formats (MS Excel and
others).

------
bdon
The type/error detection would be really cool paired with something like
DataWrangler:
[http://vis.stanford.edu/wrangler/](http://vis.stanford.edu/wrangler/)

~~~
thingsilearned
Yeah, DataWrangler is an awesome project. There are actually a few great
companies coming out with products based around it. We may just integrate with
one of those.

~~~
porker
Ooh, which companies? I haven't come across any yet.

~~~
dude_abides
Eg. [http://www.trifacta.com/](http://www.trifacta.com/) founded by the guys
who started DataWrangler :)

~~~
porker
Thanks!

------
AdrianRossouw
I'm rather fond of using [1] csvkit for doing my csv wrangling. That and
google refine (now open refine).

[1] [http://csvkit.readthedocs.org/](http://csvkit.readthedocs.org/)

------
funstr
The CSV format does not define an encoding. Customers also usually think of
CSV as an export/import format for Microsoft Excel. Unfortunately, each
Office localization uses a different output encoding and different
delimiters (and date format), and you cannot assume the encoding is UTF-8.
The most complete CSV importer I have seen is in OpenOffice. It is also
worth mentioning that Excel can import CSV with a locale other than the
system's and with different delimiters - but not through the Open File
command.
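
Python's csv.Sniffer shows what such a heuristic looks like in practice; a
sketch with a German-locale-style sample (';' delimiter because ',' is the
decimal separator):

```python
import csv

# German-locale Excel exports use ';' as the delimiter because ',' is
# the decimal separator:
sample = 'Datum;Wert\n01.08.2013;1,5\n02.08.2013;2,25\n'

dialect = csv.Sniffer().sniff(sample, delimiters=';,\t')
# ';' appears exactly once per line, while ',' counts vary per line,
# so the frequency heuristic settles on ';'.
# dialect.delimiter == ';'
```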

~~~
zbanks
This also detects the file encoding & delimiters.

~~~
funstr
Automatic file-encoding detection is only a heuristic and cannot, in
principle, be exact. But I do not want to criticize the import
functionality; I just wanted to say that CSV is an obsolete format and
should not be used for data interchange.

------
netvarun
This is really cool! Are there any open source libraries to do steps 2 and 3 -
auto type detection and error correction/detection?

~~~
thingsilearned
Not really. We utilize python-dateutil to help recognize the different date
formats, but everything else we built ourselves.

[http://labix.org/python-dateutil](http://labix.org/python-dateutil)
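
A stdlib-only stand-in for the general idea (try candidate formats until one
parses) might look like this; the format list is illustrative, not what
Chartio or dateutil actually uses:

```python
from datetime import datetime

# A few illustrative formats; a real detector like dateutil's parser
# covers far more:
CANDIDATE_FORMATS = ['%Y-%m-%d', '%d/%m/%Y', '%m/%d/%Y', '%b %d, %Y']

def guess_date(value):
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None  # not a date in any known format

guess_date('2013-08-22')    # datetime(2013, 8, 22, 0, 0)
guess_date('Aug 22, 2013')  # parsed via '%b %d, %Y'
guess_date('n/a')           # None
```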

------
hedonist
A vastly under-rated and under-appreciated art form, indeed.

