
Execute SQL against structured text like CSV or TSV - mhausenblas
https://github.com/dinedal/textql
======
rogerbinns
It is importing the data into SQLite and running the query there. I don't see
any attempt at type detection.

SQLite has virtual tables, where app-provided code supplies the underlying
data and you can then run SQL queries against it. This is by far the easiest
approach for foreign data formats.
[http://www.sqlite.org/vtab.html](http://www.sqlite.org/vtab.html)
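
For example, once an application has registered a (hypothetical) virtual table
module for CSV files, the SQL side looks like any other table:

    
    
        -- "csvfile" is a made-up module name; the application registers it first
        CREATE VIRTUAL TABLE people USING csvfile('people.csv');
        SELECT country, count(*) FROM people GROUP BY country;
    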

I'm the author of a Python wrapper for SQLite. It provides a shell compatible
with the main SQLite one invocable from the command line (ie you don't need to
go anywhere near Python) -
[http://rogerbinns.github.io/apsw/shell.html](http://rogerbinns.github.io/apsw/shell.html)

That shell has an autoimport command which automatically figures things out
for CSV-like data (separators, data types, etc.):

    
    
        sqlite> .help autoimport
        
        .autoimport FILENAME ?TABLE?  Imports filename creating a table and
                                      automatically working out separators and
                                      data types (alternative to .import command)
    

The .import command requires that you precisely set up the table and schema
beforehand, and specify the data separators (e.g. commas or tabs). In many
cases this information can be deduced automatically from the file contents,
which is what this command does. There must be at least two columns and two rows.

If the table is not specified then the basename of the file will be used.

Additionally, the type of the contents of each column is deduced, for example
whether it is a number or a date. Empty values are turned into NULLs. Dates
are normalized into YYYY-MM-DD format and datetimes into ISO 8601 format to
allow easy sorting and searching. Four-digit years must be used for dates to
be detected. US date ordering (month before day) versus rest-of-the-world
ordering is also detected, provided there is at least one value that resolves
the ambiguity.

Care is taken to ensure that columns looking like numbers are only treated as
numbers if they do not have unnecessary leading zeroes or plus signs. This
avoids treating phone numbers and similar number-like strings as integers.

This command can take quite some time on large files as they are effectively
read twice: the first pass determines the format and the types of each column,
while the second pass actually imports the data.
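
A session might look something like this (file and table names are just
placeholders):

    
    
        sqlite> .autoimport mydatafile.csv mydata
        sqlite> .schema mydata
        sqlite> select count(*) from mydata;
    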

~~~
tstack
I've written a log file viewer ([http://lnav.org](http://lnav.org)) that
exports log messages as SQLite virtual tables. I was hoping this project also
leveraged virtual tables since they're one of the coolest features in SQLite.

lnav tries to push things a little further and extract data from unstructured
log messages. For example, given the following message:

    
    
      "Registering new address record for 10.1.10.62 on eth0."
    

It will create a virtual table with two columns that capture the IP address
(10.1.10.62) and the device name (eth0). Queries against this table will
return the values from all messages with the same format (Registering new
address record for COL_0 on COL_1). See
[http://lnav.readthedocs.org/en/latest/data.html](http://lnav.readthedocs.org/en/latest/data.html)
for more info.
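
For example, a query against those extracted columns looks roughly like this
(treat the table name as illustrative; lnav exposes the focused message's data
through a generated table, and unnamed captures become col_0, col_1, ...):

    
    
        -- count how many messages registered each address, per device
        SELECT col_0 AS addr, col_1 AS device, count(*)
          FROM logline
         GROUP BY col_0, col_1;
    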

~~~
pfraze
Very cool! Looking forward to trying this out today.

------
nodata
Straight from 1998! Perl's CPAN Module DBD::CSV
[http://search.cpan.org/~hmbrand/DBD-
CSV-0.41/lib/DBD/CSV.pm](http://search.cpan.org/~hmbrand/DBD-
CSV-0.41/lib/DBD/CSV.pm)

~~~
stillbourne
I actually wrote a small data collection script that used this to parse the
data some time in 2008. Using DBD::CSV along with dbish:
[http://search.cpan.org/~tlowery/DBI-
Shell-11.95/lib/DBI/Shel...](http://search.cpan.org/~tlowery/DBI-
Shell-11.95/lib/DBI/Shell.pm) would give you pretty much the same
functionality.

------
mkching
Perl has a CSV driver for DBI: [http://search.cpan.org/~hmbrand/DBD-
CSV-0.41/lib/DBD/CSV.pm](http://search.cpan.org/~hmbrand/DBD-
CSV-0.41/lib/DBD/CSV.pm)

------
csmithuk
Surprisingly, we've been doing this on the Microsoft side of things for years
with ADO and csv data sources!

[http://msdn.microsoft.com/en-
us/library/ms974559.aspx](http://msdn.microsoft.com/en-
us/library/ms974559.aspx)

(article dated 2004)

~~~
dmethvin
Even better is LogParser, which lets you do SQL queries against just about any
data format you'll find in a file on Windows. It's been around for a decade
and continues to be improved.

[http://technet.microsoft.com/en-
us/scriptcenter/dd919274.asp...](http://technet.microsoft.com/en-
us/scriptcenter/dd919274.aspx)

~~~
burmask
Microsoft LogParser is not only an established utility, it's extremely fast,
too.

~~~
DEinspanjer
I remember LogParser from years back. I used to be very involved in that
nascent community, and I developed a plugin for it that let you parse an input
file using a regex with capture groups so you could use SQL queries on log
files that did not have records with simple delimited columns.

------
rosser
If you use PostgreSQL, consider also the "file_fdw" Foreign Data Wrapper
module:

[http://www.postgresql.org/docs/current/interactive/file-
fdw....](http://www.postgresql.org/docs/current/interactive/file-fdw.html)
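
Roughly, usage looks like this (the table definition and path are
placeholders):

    
    
        CREATE EXTENSION file_fdw;
        CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
        CREATE FOREIGN TABLE persons (id integer, name text)
          SERVER csv_files
          OPTIONS (filename '/path/to/persons.csv', format 'csv', header 'true');
        SELECT count(*) FROM persons;
    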

------
A_Beer_Clinked
An alternative approach is Microsoft Log Parser. While it is old in internet
time (2005), it works brilliantly on very large files; I've successfully used
it on 10+ gigabyte files. The documentation is a little sparse, but once you
get the hang of it it's pretty simple. Clearly it only works on Windows boxen.

Example, finding all unique values of col1:

    
    
        logparser -i:csv -o:csv -stats:off -dtlines:2000 -headers:off "select distinct col1 from input.csv" >out.csv
    

~~~
cbsmith
In general, Microsoft provided an ODBC driver over flat files that would at
least give you limited SQL query capabilities.

------
vsbuffalo
We use a format called Tabix[1] frequently in bioinformatics. It doesn't have
a full query language, but it is optimized for extracting regions from a text
file and supports compression. It's pretty clever: it's based on a
block-compressed (BGZF) tab-delimited file with virtual offsets, and an index
allows fast random access.

[1][http://samtools.sourceforge.net/tabix.shtml](http://samtools.sourceforge.net/tabix.shtml)

------
yummyfajitas
I can currently do this:

    
    
        sqlite> create table mydata (id integer, value1 integer, description text);
        sqlite> .separator ","
        sqlite> .import mydatafile.csv mydata
    

I assume textql is basically just running a similar SQL query, but perhaps
autodetecting column types for you? (Haven't looked at the code.) Is that the
value proposition, or does it do something more?

~~~
twotwotwo
[https://github.com/dinedal/textql/blob/master/main.go](https://github.com/dinedal/textql/blob/master/main.go)
around line 154: appears to make TEXT cols for everything. SQLite is pretty
liberal about types so some math works anyway:

    
    
      sqlite> select "3"+"3";
      6
    

though note:

    
    
      sqlite> select "24" < 300;
      0
      sqlite> select "9" < "10";
      0
    

(see
[http://www.sqlite.org/datatype3.html](http://www.sqlite.org/datatype3.html)
for more about SQLite typing.)

~~~
rogerbinns
Your examples should be using single quotes for strings, not double quotes.
The latter are for quoting identifiers (e.g. if you named a table "select")
and, through an accident of SQLite history, degrade to a string if no matching
identifier is found.

+ is the SQL numeric addition operator. || is string concatenation.

    
    
        sqlite> select 3 || 3 ;
        33
        sqlite> select 'a'+'b' ;
        0

~~~
twotwotwo
Interesting, thanks.

------
gadders
I was doing this with MS Query and the CSV ODBC driver 15+ years ago.

~~~
userbinator
Coincidentally that was the first thing I thought of when I read the title.

------
glogla
This is awesome, but please make that gif into an HTML5 video so the playback
can be controlled.

EDIT: here's a gfycat version:
[http://gfycat.com/FlusteredFarApatosaur#](http://gfycat.com/FlusteredFarApatosaur#)

------
sheetjs
I do something similar with Excel files using jq
([http://stedolan.github.io/jq/](http://stedolan.github.io/jq/)) and j
([https://npmjs.org/package/j](https://npmjs.org/package/j)):

    
    
        $ j -J sample_data.xlsb tbl | jq 'length'
        3
        $ j -J sample_data.xlsb tbl | jq '[.[]|.id]|max'
        3
        $ j -J sample_data.xlsb tbl | jq '[.[]|.id]|min'
        1
        $ j -J sample_data.xlsb tbl | jq '[.[]|.value]|add'
        18
    

jq syntax is very terse compared to the equivalent SQL.

~~~
spullara
At the expense of being nearly impossible to read (or write) without the
documentation open. I don't think anyone was trying to make SQL more terse.

------
kamaal
I do a great deal of work in DevOps, and I have a text file full of scripts
which come in handy for work like this during production issues.

Nearly every sysadmin/firefighter I know has text files full of command-line
perl/awk/sed magic like this.

Having said that, I would appreciate it if something like this were available
for XML. With SQL designs so often forcefully shoehorned into NoSQL databases
these days, I frequently run into situations where I have to query data from
XML dumps. Most of the time, I have to invent a Perl one-liner that
essentially does the job of a select query.

~~~
rcfox
I've done this a few times with XPath from an ipython shell, using the lxml
module to parse the XML and handle the XPath queries.

30-second XPath pitch:

    
    
        //foo/bar
    

This gets the 'bar' children from every 'foo', regardless of where it occurs
in the hierarchy.

    
    
        //baz[@name='Alice']
    

Gets everything like: <baz name="Alice">

There's lots of more complicated things you can do, but these are the most
frequent types of things I've had to do.
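
In case it helps anyone, the lxml side of this is only a few lines (the file
name here is made up):

    
    
        from lxml import etree
        
        tree = etree.parse("data.xml")                # parse the XML document
        bars = tree.xpath("//foo/bar")                # every 'bar' under any 'foo'
        alices = tree.xpath("//baz[@name='Alice']")   # every <baz name="Alice">
        print([b.text for b in bars])
    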

~~~
barrkel
I have a command-line thing put together with HtmlAgilityPack that represents
HTML as an XPath-navigable document. It lets me apply the same XPath queries
to HTML, for screen scraping.

~~~
rcfox
Oh, that reminds me: You can use XPath in Chrome with $x(). To get all of the
links on the current page, open up the console and type:

    
    
        $x('//a')

------
userbinator
Good to see that it's not using SQLite's own CSV parser, which the last time I
checked didn't seem to support the standard convention for quoting fields that
contain commas themselves ( _very_ common if your CSV stores text strings in
some columns):

[https://www.sqlite.org/cvstrac/wiki?p=ImportingFiles](https://www.sqlite.org/cvstrac/wiki?p=ImportingFiles)

(They also seem ignorant of the existence of RFC4180 when they claim there is
no CSV standard.)

------
collyw
I end up parsing a lot of Excel files at work (not my choice). I saw this and
thought it looked interesting: export Excel to a CSV and run queries against it.

At the end of the day the problem is that non-technical users don't fill
things in properly. They put a space before a number so it gets treated as
text, or in the same column Excel spits out an integer as a float. Basically
the things that a real database keeps in order for you. So unfortunately I
don't see it solving my problems. Shame.

~~~
iagooar
I was going to write something like this. From my experience parsing a lot of
CSV files, what's really tough isn't getting data out of a nicely formatted
CSV file, but cleaning and correcting the junk that is written into it.
Sometimes I'd have to run 3 or 4 different "correctors" and checkers to get an
acceptable CSV file.

~~~
collyw
I assume this is a common problem to solve. Are there any best practices,
libraries, or techniques that I could look at to help?

At the moment I parse, and update the parser every time a new error appears.
Most of the time it seems to be Excel presenting things to the user that
appear correct when they are not.

I have even been tempted to write my own "data import grid". It would look
like Excel and have very little of the functionality, but it would allow a
developer to set rules on what can and can't be entered in each column.

------
TheRubyist
[http://hackage.haskell.org/package/txt-
sushi](http://hackage.haskell.org/package/txt-sushi)

~~~
egonschiele
Also, the Haskell version parses the SQL itself, so no SQLite is required.

------
mkaufmann
On a related note there is a recently published paper exploring how to load
CSV data into a database efficiently[1].

The authors get a loading throughput (including index creation!) of over 1.5
GB/s with multithreading. They don't seem to resort to any dirty tricks when
loading the data, as they can immediately run hundreds of thousands of queries
per second against it.

There also seem to be more interesting papers on the research project page
[http://www.hyper-db.de/index.html](http://www.hyper-db.de/index.html)

[1] "Instant Loading for Main Memory Databases"
[http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf](http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf)

------
kyberias
I suggest another approach:

    
    
        textql -query "select count(*) from persons.csv"
    

...because csv files are basically tables. This would also allow you to do
joins on multiple files:

    
    
        textql -query "select p.name, c.name from persons.csv p inner join companies.csv c on p.companyid = c.id"
    

~~~
dinedal
[https://github.com/dinedal/textql/issues/6](https://github.com/dinedal/textql/issues/6)

------
vkb
What's the benefit (or tradeoff) of doing this over some variation of "command
line data science" of the type outlined in Greg's post [1]?

Would it theoretically be faster than the unix join/grep commands for small
CSV files, the kind you usually get via email?

I love SQL and prefer it for data work of this type, but I'm not entirely sure
of the benefit here, since it looks like you're just reading the data into
SQLite, which means having to format and normalize it anyway?

[1][http://www.gregreda.com/2013/07/15/unix-commands-for-data-
sc...](http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/)

~~~
blueblob
I think the benefit is that a lot of people are proficient with SQL and are
not very proficient with awk. I think whether it will be faster or not depends
on how complex your query/queries are relative to the amount of data that you
need to import.

------
ggrothendieck
The R package sqldf, together with the backend packages RSQLite (the default),
RH2, RMySQL and RPostgreSQL, can be used in a somewhat similar manner: it
automatically generates CREATE statements, uploads the referenced tables to a
database, and runs the specified SQL. Tables referenced in the SQL statement
are assumed to be R data frames or external files, depending on whether an R
data frame or a file description object of the same name exists in the R
workspace. The result is returned as an R data frame. See
[http://sqldf.googlecode.com](http://sqldf.googlecode.com)

------
stevekemp
Reminds me of something I wrote to import Apache logfiles into a temporary
SQLite database:

[http://www.steve.org.uk/Software/asql/](http://www.steve.org.uk/Software/asql/)

------
abhgh
I do this with the unix commands 'join' and 'awk'

------
qwerta
H2 SQL has this functionality as well. You can specify a CSV file at table
creation. It can even mix regular tables and CSV. And it has a nice web
interface built in.
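
For the curious, it's along these lines (from memory, so check the H2 docs):

    
    
        -- query a CSV file directly
        SELECT * FROM CSVREAD('data.csv');
        -- or materialize it as a regular table
        CREATE TABLE persons AS SELECT * FROM CSVREAD('persons.csv');
    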

------
dergachev
Cool stuff, and it's written in Go!

I made a simple Ruby version of this a while back:
[https://github.com/dergachev/csv2sqlite](https://github.com/dergachev/csv2sqlite)

Subsequently I also discovered csvkit, a Python implementation which is
probably more robust:

[https://github.com/onyxfish/csvkit](https://github.com/onyxfish/csvkit)

------
ansgri
I'd use this if it were in Python. For Go, I don't have a compiler handy on
any machine where I might need this.

~~~
jdp
I created a very similar project in Python a few weeks ago, based on
csvkit[1]. Check out the Gist[2].

[1]:
[https://pypi.python.org/pypi/csvkit](https://pypi.python.org/pypi/csvkit)
[2]:
[https://gist.github.com/jdp/8447221](https://gist.github.com/jdp/8447221)

------
julochrobak
If anyone is interested, there is a similar tool (called "comp"), but the
queries are expressed with list comprehension syntax. It also allows joining
data from JSON and XML.

It has two modes: 1) as a command line

    
    
        ./comp -f commits.json,authors.txt '[ i.commits | i <- commits, a <- author, i.commits.author.name == a ]'
    

2) as a service, to allow querying the files through a simple HTTP interface

    
    
        ./comp -f commits.json,authors.txt -l :9090
        
        curl -d '{"expr": "[ i.commits | i <- commits, a <- author, i.commits.author.name == a ]"}' http://localhost:9090/full
    

or through an interactive console at
[http://localhost:9090/console](http://localhost:9090/console)

Disclosure: I'm a co-author, and happy to get feedback or answer any
questions :)

~~~
julochrobak
well and here is the link -
[http://github.com/ostap/comp](http://github.com/ostap/comp)

:)

------
dan1234
I'd be interested to see how well this scales with a multi-megabyte CSV.

~~~
torrance
From what I can tell, this creates an SQLite instance each time it's executed,
and creates it in memory unless told not to do so. So I'd guess it wouldn't
scale well. Still really neat though!

~~~
slashcom
Even then, it should scale reasonably well if one added a REPL and kept the
database in memory. It would probably be fine up to at least a couple of
gigabytes, depending on how well SQLite handles in-memory tables (probably
very well).

------
ravenking
I have been doing this with the unix commands 'join' and 'awk'. For larger
files I import them into a database (e.g. SQLite). I am not sure what
advantages this tool provides.

------
ilitirit
I use LINQ with a few add-on assemblies to do stuff like this.

------
crorella
Looks like a light version of Optiq.
[https://github.com/julianhyde/optiq](https://github.com/julianhyde/optiq)

------
fibo
See also
[https://news.ycombinator.com/item?id=7176323](https://news.ycombinator.com/item?id=7176323)

------
dserban
A related project is this:

[http://www.steve.org.uk/Software/asql/](http://www.steve.org.uk/Software/asql/)

------
visarga
Offtopic: is there a better implementation of 'cut' that handles field
reordering and multi-character separators?

~~~
ancaster
awk. :-)

------
thorin
You can do this with external tables in Oracle (and probably other DB servers).
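
A sketch of the Oracle version, from memory (the directory object, columns and
file name are placeholders; double-check against the docs):

    
    
        CREATE TABLE persons_ext (id NUMBER, name VARCHAR2(100))
          ORGANIZATION EXTERNAL (
            TYPE ORACLE_LOADER
            DEFAULT DIRECTORY data_dir
            ACCESS PARAMETERS (
              RECORDS DELIMITED BY NEWLINE
              FIELDS TERMINATED BY ','
            )
            LOCATION ('persons.csv')
          )
          REJECT LIMIT UNLIMITED;
    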

------
solidsnack9000
Perhaps something similar could be implemented more directly in Bash.

------
infinii
Slightly OT but how can I create an animated gif like that?

~~~
masklinn
Screencast, then convert screencast to gif?

------
mbq
One can also import the CSV into R and use the sqldf package.

------
taejo
txt-sushi does this too. It doesn't use SQLite, but processes the files
directly, and supports joins.

------
duochrome
Ever used Google Refine?

------
guard-of-terra
Why not simplify the syntax to allow textql 'select * from sample-data.csv
where i = 0'?

Otherwise - already checking out, that's what I wanted for a while.

~~~
p4bl0
It seems that textql doesn't parse the query at all; it imports the CSV into
an sqlite3 db and then runs the query against it. What you ask for would be
doable, but it would be much more complex (parsing SQL is not that trivial,
and people would want to do more complex queries like JOINing multiple CSV
files, etc., so you can't just look for "from <filename>").

------
guard-of-terra
Can we please have -tab instead of -dlm="\t"? Because really.

Also, without support for multiple tables & joins this is not very useful.

~~~
untog
_Also, without support for multiple tables & joins this is not very useful._

Not even slightly true. Can we stop crapping on v1 releases? I'd rather a
developer release early and often than hold their code back until it has every
possible feature.

~~~
guard-of-terra
It's a wonderful concept, but I don't know whether its development will just
sit there. I guess I should offer pull requests.

