
Q: Run SQL Directly on CSV Files - devy
https://harelba.github.io/q/
======
rudolfwinestock
“Any file is a database if you awk hard enough.” —Bryan Horstmann-Allen

[https://twitter.com/neilkod/status/914217352564137984](https://twitter.com/neilkod/status/914217352564137984)

Furthermore, no thread on CSV files can be complete without mentioning this
infamous bit of fact-trolling: the ASCII standard has had unit- and record-
delimiters baked into it from the beginning.

[https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/](https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/)
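
A toy Python sketch of how those separators work (the data here is made up):

```python
# ASCII reserves 0x1F as the unit (field) separator and 0x1E as the
# record separator. Neither byte occurs in ordinary text, so no quoting
# or escaping is needed -- unlike commas and quotes in CSV.
US, RS = "\x1f", "\x1e"

rows = [["name", "notes"], ["Alice", 'said "hi", then left']]

# Serialize: join fields with US and records with RS.
blob = RS.join(US.join(fields) for fields in rows)

# Parse: two splits, no edge cases.
parsed = [record.split(US) for record in blob.split(RS)]
assert parsed == rows
```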

~~~
z92
Usability suffers, if you can't type the delimiters easily.

~~~
hyperbovine
I can’t stand it when I’m banging out a CSV by hand and have to manually
escape the field and record separators. Happens all the time.

~~~
z92
I use TSV for that reason. Most of the time you don't need to escape anything.
And from code you need to escape/unescape only four characters: \t, \n, NUL
and the backslash itself.
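
A rough sketch of that escaping in Python; the backslash convention here is
one common choice, not a standard:

```python
# Escape the four problem characters in a TSV field with backslashes.
ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\0": "\\0"}
UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "0": "\0"}

def escape_field(s: str) -> str:
    return "".join(ESCAPES.get(ch, ch) for ch in s)

def unescape_field(s: str) -> str:
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            # Unknown escapes pass through unchanged -- a sketch, not a spec.
            out.append(UNESCAPES.get(s[i + 1], s[i + 1]))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

field = "a\tb\nc\\d"
assert unescape_field(escape_field(field)) == field
```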

------
delinka
If you're on Windows, you've had the ability to do this for Quite Some Time®:
[https://support.microsoft.com/en-us/help/850320/creating-an-odbc-data-source-with-the-microsoft-text-driver](https://support.microsoft.com/en-us/help/850320/creating-an-odbc-data-source-with-the-microsoft-text-driver)

~~~
eli
Ah, I thought you were going to link to the fascinating skunkworks Log Parser
tool:
[https://www.microsoft.com/en-us/download/details.aspx?id=24659](https://www.microsoft.com/en-us/download/details.aspx?id=24659)
Wonder if it still runs on Windows 10?

~~~
NicoJuicy
If you need a UI, don't forget the tool: Log Parser Lizard ^^

[http://www.lizard-labs.com/log_parser_lizard.aspx](http://www.lizard-labs.com/log_parser_lizard.aspx)

~~~
gronne
Good of you to also accommodate the disabled.

~~~
guipsp
Is this a joke or is the tool actually accessible?

------
ppande
SQLite virtual tables. You can literally query any data source. All you need
is a module implementing the interface SQLite exposes for the data source of
interest. I once wrote a module that could query protocol-buffer-encoded
data.

------
joshumax
Fun fact: MySQL has actually supported CSV files as a storage backend for
quite some time now[1]

1: [https://dev.mysql.com/doc/refman/8.0/en/csv-storage-engine.html](https://dev.mysql.com/doc/refman/8.0/en/csv-storage-engine.html)

~~~
athenot
Came here to post the same. This can be used in very interesting ways, either
for quick ad-hoc queries or for import/export functionality (ETL), where the
core of your data is in InnoDB but you keep a CSV version for one of the
phases of your processing, to interface with other tools.

This may look like a dirty hack compared to a clean API, but if you have
performance considerations or need to interface with some legacy thing that
happens to understand CSV, this can be a good choice.

------
cbcoutinho
I've never heard of this before, but it reminded me of xsv, another
command-line tool for wrangling CSV files. q leans on SQL syntax much more
than xsv does, so there isn't 100% overlap between them.

[https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)

~~~
devy
Thanks! xsv is new to me. I love Andrew Gallant's ripgrep (rg) command-line
grep tool.

~~~
bhengaij
His name is Burnt Sushi I think.

------
yellowapple
Another trick along these lines is to cut out the middleman (middleprogram?
middleware?) and use SQLite to do it:

    
    
        $ sqlite3
        sqlite> .mode csv
        sqlite> .import foo.csv foo
        sqlite> SELECT * FROM foo WHERE bar = 'baz'; -- you get the gist...
        (a bunch of rows)
    

Q seems to be much easier to use (and certainly easier to remember), but I've
always found it handy to have the full power of SQL at my fingertips when
needing to do a bunch of CSV manipulations.

~~~
pixelmonkey
I think the csvkit CLI tool, csvsql, provides this functionality, too.

------
Tepix
The perl CPAN module DBD::CSV also lets you do this. It was started in 1998
and still gets updates.

~~~
walshemj
Text::CSV_XS is my go-to module for CSV. Just use that and whack the data
into MySQL or Postgres would be my advice.

~~~
singingfish
nah sqlite is what you want when you want a step up from CSV.

~~~
walshemj
Well all I want to do is ingest some Google analytics and combine it with the
data from SEmRush

------
codeulike
Real-world CSV files generally contain some or all of the following horrors:

\- some strings enclosed in speechmarks, but some not

\- empty fields

\- speechmarks within strings

\- commas within strings

\- carriage returns within strings

How does Q hold up against a CSV file with those traits?

~~~
setr
All of your “horrors” seem...correct? It's comma delimited, so anything that
is between two commas should be parsed without issue; if it's a string with a
comma in it, and unquoted, you simply have a broken csv file. If it's quoted,
then anything until the next (unescaped) quote is fine, including commas.

Unless you're trying to parse csv files with regexes, none of those should be
difficult, or even unexpected, to handle with a PEG parser, or any equivalent
device.

Ofc if you're accepting ambiguity then it's just arbitrary how you handle it,
but none of your examples afaict present any ambiguity (I'm assuming strings
are either quoted or unquoted, with the former primarily allowing
commas/newlines in strings; escaping exists as well; comma delimited columns,
newline delimited rows)
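
A quick check with Python's stdlib csv module, a conforming parser; the
sample data is invented to hit each case above:

```python
import csv
import io

# Quoted strings, empty fields, embedded quotes, commas and newlines --
# all legal per RFC 4180 and handled by any conforming parser.
data = '''name,quote,empty
"Smith, John","She said ""hello""",
plain,"line one
line two",x
'''

rows = list(csv.reader(io.StringIO(data)))
assert rows[1] == ['Smith, John', 'She said "hello"', '']
assert rows[2] == ['plain', 'line one\nline two', 'x']
```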

~~~
codeulike
Yes, it would be valid CSV. I suppose my point is that naive attempts to roll
your own CSV parser tend to fail on the points I listed. Hopefully Q does not
do that.

~~~
Dylan16807
They do? Commas and quotes are the two basic features of CSV, so it seems very
strange to forget to implement half.

~~~
83457
There is a lot of inconsistency out there. I have seen CSV files saved in
Excel not be importable by Access because the latter doesn't handle breaks in
fields correctly. I've seen CSVs saved from various systems, such as SQL
Server Management Studio's grid view and Wufoo exports, not generate CSV
correctly. There are many lazy attempts at CSV generators out there that just
throw breaks between records and commas between fields and call it a day.

~~~
83457
And even if they do wrap all fields in double quotes it is very common to
forget to escape double quotes in fields, then it depends on the parser as to
whether it can determine the proper structure of the record.
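
The lazy-generator failure mode is easy to demonstrate in Python (the row
here is made up):

```python
import csv
import io

# A made-up row containing an embedded quote and comma.
row = ['id', 'comment with "quotes", commas']

# The lazy approach: just throw commas between fields.
naive = ",".join(row)
# Re-parsing splits the second field apart: 3 fields, not 2.
assert len(next(csv.reader(io.StringIO(naive)))) == 3

# A real writer quotes the field and doubles the embedded quotes,
# so the record round-trips intact.
buf = io.StringIO()
csv.writer(buf).writerow(row)
assert next(csv.reader(io.StringIO(buf.getvalue()))) == row
```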

------
jannes
Excel has a lot of this stuff built in as well in the PowerQuery editor [0],
which supports CSV, JSON and XML data sources. Albeit not SQL, it allows you
to do almost everything SQL can do, but in a GUI.

After you're done preparing your data in PowerQuery you can run PowerPivot on
it for aggregations.

[0]: [https://cdn-5a6cb102f911c811e474f1cd.closte.com/wp-content/uploads/2018/02/Main-Areas-of-the-Power-Query-Editor.png](https://cdn-5a6cb102f911c811e474f1cd.closte.com/wp-content/uploads/2018/02/Main-Areas-of-the-Power-Query-Editor.png)

------
ashrk
Looks like PostgreSQL has a foreign data wrapper for CSV.

[https://wiki.postgresql.org/wiki/Foreign_data_wrappers](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)

In fact, seems it has several.

~~~
garyclarke27
Yes. Also, it's trivially easy and lightning fast to import CSV to a table
using the COPY command in Postgres.

------
daotoad
Perl in 1998.
[https://metacpan.org/pod/DBD::CSV](https://metacpan.org/pod/DBD::CSV) Okay,
2002 -- it took a while before anyone bothered to write a shell.
[https://metacpan.org/pod/release/TLOWERY/DBI-Shell-11.9/lib/DBI/Shell.pm](https://metacpan.org/pod/release/TLOWERY/DBI-Shell-11.9/lib/DBI/Shell.pm)

And there are drivers for everything from PostgreSQL and Oracle to text files
and the Azure WMI.

------
hprotagonist
Related, but with a free web interface
[https://github.com/simonw/datasette](https://github.com/simonw/datasette) ,
[https://github.com/simonw/csvs-to-sqlite](https://github.com/simonw/csvs-to-sqlite)

------
herdrick
csvkit includes csvsql, which does this. I’ve used it and liked it.
[https://csvkit.readthedocs.io/en/1.0.3/tutorial/3_power_tools.html#csvsql-and-sql2csv-ultimate-power](https://csvkit.readthedocs.io/en/1.0.3/tutorial/3_power_tools.html#csvsql-and-sql2csv-ultimate-power)

~~~
fiveFeet
Came here to mention csvkit and the csvsql that comes with it. Highly
recommended. It is one of the first things I install on a new system.

------
minxomat
So is this a wrapper around
[https://www3.sqlite.org/csv.html](https://www3.sqlite.org/csv.html) ? It
mentions sqlite.

~~~
alex_stoddard
Looks like it is implemented in python(2) with use of the stdlib sqlite3
module.
[https://github.com/harelba/q/blob/master/bin/q](https://github.com/harelba/q/blob/master/bin/q)

------
bonesss
We've had this for ages now: the F# CSV type provider gives strongly typed
access to CSV files, and the resulting object hierarchies are LINQ
compatible. There's even support for applying units of measure.

Veeeery nice for data munging :)

[http://fsharp.github.io/FSharp.Data/library/CsvProvider.html](http://fsharp.github.io/FSharp.Data/library/CsvProvider.html)

------
aaaaaaaaaab
The letter Q is already taken...

[https://en.m.wikipedia.org/wiki/Q_%28programming_language_from_Kx_Systems%29](https://en.m.wikipedia.org/wiki/Q_%28programming_language_from_Kx_Systems%29)

~~~
kkarakk
>It is proprietary software, commercialized by Kx Systems

q is available i'm thinking

~~~
aaaaaaaaaab
What? kdb, K and Q are part of a renowned array-processing toolkit/environment
widely used by data scientists _for decades_.

Calling this command-line tool Q, which happens to be used for the same
purpose as the Q programming language (i.e. querying data), invites confusion
and misunderstanding...

------
usgroup
I think tools like this don’t exist much because it typically makes more sense
to just put it into a database or at least SQLite ... then do whatever you
want from there.

Meanwhile, for all your streaming, filtering and aggregating needs there is awk.

I've never been bored enough to write my own little SQL library for awk, but
I'd be surprised if it doesn't exist.

~~~
TkTech
That's exactly what q (the linked tool) and almost all other tools like it
do: load the file into an SQLite database and provide some handy wrappers,
that's it.
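
A minimal sketch of that approach in Python -- the file, table and column
names here are hypothetical, and q itself does much more (type inference,
header detection, delimiter handling):

```python
import csv
import sqlite3

def query_csv(path: str, sql: str):
    """Load a CSV (assumed well-formed, with a header row) into an
    in-memory SQLite table named t, then run the given query."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        db = sqlite3.connect(":memory:")
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        db.execute(f"CREATE TABLE t ({cols})")
        db.executemany(f"INSERT INTO t VALUES ({placeholders})", reader)
        return db.execute(sql).fetchall()

# e.g. query_csv("foo.csv", "SELECT * FROM t WHERE bar = 'baz'")
```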

------
phaedrus
Many of the applications at my workplace use flat text files in CSV format
for logs and configuration. For new development I've been using SQLite to
replace some of these usages. Sometimes when I need to analyze legacy log
files I import the CSV data into SQLite tables. After doing this a few times
I hit upon the idea: why not skip the import (which balloons my
otherwise-small db files) and write an SQLite plugin that allows treating the
flat CSV file as a virtual table? It's neat to see there's some existing work
here!

------
mathiasrw
If you need to work with CSV or Excel or tab files from the command line I
suggest having a look at alasql. Can be installed via npm and gives you a good
list of export options.

------
shodan757
Eep, no Python 3 support? :(

~~~
f00_
I was about to recommend pandas, but I just realized it doesn't natively
support running SQL on dataframes. pyspark does, though!

------
jzelinskie
I've done a similar thing, but using jq syntax on any object-like file:
[https://github.com/jzelinskie/faq](https://github.com/jzelinskie/faq)

Ya know how some people want everything to be vim after they use it? I'm
surprised more people haven't done that with other tools.

~~~
aepiepaey
> faq is pronounced "fah queue".

So... "fuck you"?

------
iblaine
> Have you ever stared at a text file on the screen, hoping it would have been
> a database so you could ask anything you want about it?

Not really. I could see this being helpful for debugging, but at that point
you can get by with some simple bash string operations. Doing that same
operation w/SQL seems like overkill.

~~~
mjirv
Maybe my bash just isn't good enough anymore, but I do this all the time. In
fact, I was just doing some work this morning where I needed to load a CSV
into SQLite for analysis.

Since I still had the file, I ran some queries on the same file via Q to test
it out. And it works great (with some handholding to get quoting and
delimiters right)! The major downside is that it's pretty slow, presumably
because it reloads the file into a SQLite database each time you run a command. So,
I'll probably stick with loading CSVs into SQLite myself, but I could see this
being a useful tool for running one-off analysis on data from stdin.

------
abathur
Not sure how they compare, but this reminded me of a thread about textql from
earlier this year:

[https://news.ycombinator.com/item?id=16781294](https://news.ycombinator.com/item?id=16781294)

------
eismcc
Piling on with pre-existing tools: the demo for Apache Calcite is a
CSV-to-SQL wrapper.

[https://calcite.apache.org](https://calcite.apache.org)

------
rmbeard
This would make a nice complement to xsv and csvkit but really needs to
support python3. That is the main downside to using it.

~~~
joelthelion
csvkit actually supports SQL queries on csv files. But performance is pretty
poor.

------
chrisweekly
Is there any use for this not handled easily by lnav?

[https://lnav.org](https://lnav.org)

------
qwerty456127
Great! I just wonder why it took so long for somebody to finally invent
this...

~~~
gpvos
It didn't. See the other comments for about a dozen already-existing similar
tools.

------
iron0013
You can do something very similar to this using R's sqldf package.

------
purplezooey
We are doing this with Apache Drill on MapR and it works wonderfully.

------
JacKTrocinskI
What is the performance like compared to say Oracle External Tables?

------
therealmarv
Is there something similar for JSON? Thanks

~~~
dghf
There's jq, which doesn't support SQL but does allow you to apply filters and
reductions to data in JSON files:
[https://stedolan.github.io/jq/](https://stedolan.github.io/jq/)

