Hacker News new | comments | show | ask | jobs | submit login
Execute SQL against structured text like CSV or TSV (github.com)
130 points by mhausenblas on Feb 4, 2014 | hide | past | web | favorite | 91 comments



It is importing the data into SQLite and running the query there. I don't see any attempt at type detection.

SQLite has virtual tables where app provided code can provide the underlying data and you can then use SQL queries against it. This is by far the easiest approach for foreign data formats. http://www.sqlite.org/vtab.html

I'm the author of a Python wrapper for SQLite. It provides a shell compatible with the main SQLite one invocable from the command line (ie you don't need to go anywhere near Python) - http://rogerbinns.github.io/apsw/shell.html

That shell has an autoimport command which automatically figures things out for csv like data (separators, data types etc)

sqlite> .help autoimport

.autoimport FILENAME ?TABLE? Imports filename creating a table and automatically working out separators and data types (alternative to .import command)

The import command requires that you precisely pre-setup the table and schema, and set the data separators (eg commas or tabs). In many cases this information can be automatically deduced from the file contents which is what this command does. There must be at least two columns and two rows.

If the table is not specified then the basename of the file will be used.

Additionally the type of the contents of each column is also deduced - for example if it is a number or date. Empty values are turned into nulls. Dates are normalized into YYYY-MM-DD format and DateTime are normalized into ISO8601 format to allow easy sorting and searching. 4 digit years must be used to detect dates. US (swapped day and month) versus rest of the world is also detected providing there is at least one value that resolves the ambiguity.

Care is taken to ensure that columns looking like numbers are only treated as numbers if they do not have unnecessary leading zeroes or plus signs. This is to avoid treating phone numbers and similar number like strings as integers.

This command can take quite some time on large files as they are effectively imported twice. The first time is to determine the format and the types for each column while the second pass actually imports the data.


I've written a log file viewer (http://lnav.org) that exports log messages as SQLite virtual tables. I was hoping this project also leveraged virtual tables since they're one of the coolest features in SQLite.

lnav tries to push things a little further and extract data from unstructured log messages. For example, given the following message:

  "Registering new address record for 10.1.10.62 on eth0."
It will create a virtual table with two columns that capture the IP address (10.1.10.62) and the device name (eth0). Queries against this table will return the values from all messages with the same format (Registering new address record for COL_0 on COL_1). See http://lnav.readthedocs.org/en/latest/data.html for more info.


Very cool! Looking forward to trying this out today.


Cool.

I did a similar thing at work, but with virtual tables. Supports zip/gz files and multiple files with glob expansion. Might open source if there's interest. It's based on a sample CSV vtab source. Has type automatic detection. No indexing yet, though.


Straight from 1998! Perl's CPAN Module DBD::CSV http://search.cpan.org/~hmbrand/DBD-CSV-0.41/lib/DBD/CSV.pm


I actually wrote a small data collection script that used this to parse the data some time in 2008. Using DBD::CSV along with dbish: http://search.cpan.org/~tlowery/DBI-Shell-11.95/lib/DBI/Shel... would give you pretty much the same functionality.



Surprisingly, we've been doing this on the Microsoft side of things for years with ADO and csv data sources!

http://msdn.microsoft.com/en-us/library/ms974559.aspx

(article dated 2004)


Even better is LogParser, which lets you do SQL queries against just about any data format you'll find in a file on Windows. It's been around for a decade and continues to be improved.

http://technet.microsoft.com/en-us/scriptcenter/dd919274.asp...


Microsoft LogParser is not only an established utility, it's extremely fast, too.


I remember LogParser from years back. I used to be very involved in that nascent community, and I developed a plugin for it that let you parse an input file using a regex with capture groups so you could use SQL queries on log files that did not have records with simple delimited columns.


Even older, the ODBC text driver for Windows 3.1x supported this, 20 years ago: http://support.microsoft.com/kb/146220


Apropo of absolutely nothing, my advanced C++ final was a basic console CRUD application using ADO. It was a nightmare - fun, and I broke the database (and didn't get points for that) but a nightmare.

I'm so glad C# is a thing now.


For SQL Server OPENDATASOURCE and OPENQUERY can do the trick too


One can use Oracle "external tables" with delimited files; they are only so flexible, though, and you want to think hard about security.


NeXT DBKit, 1992 or so, had a CSV adapter as well.


If you use PostgreSQL, consider also the "file_fdw" Foreign Data Wrapper module:

http://www.postgresql.org/docs/current/interactive/file-fdw....


An alternative approach is Microsoft Log Parser. While it is old in internet time (2005), it works brilliantly on very large files. I've successfully used it on 10+ GigaByte files. The documentation is a little sparse but once you get the hang of it it's pretty simple. Clearly it only works on windows boxen.

Example: #Find all unique from col1 logparser -i:csv -o:csv -stats:off -dtlines:2000 -headers:off "select distinct col1 from input.csv" >out.csv


In general Microsoft provided an ODBC driver over flat files that would at least give you limited SQL query capabilities.


We use a format called Tabix[1] frequently in bioinformatics. It doesn't have a full query language, but is optimized for extracting regions from a text file and supports compression. It's pretty clever — it's based on a block compressed (BGZF) tab file with virtual offsets. An index allows for fast random access.

[1]http://samtools.sourceforge.net/tabix.shtml


I can currently do this:

    sqlite> create table mydata (id integer, value1 integer, description text);
    sqlite> .separator ","
    sqlite> .import mydatafile.csv mydata
I assume textql is basically just running a similar SQL query, but perhaps autodetecting column types for you? (Haven't looked at the code.) Is that the value proposition, or does it do something more?


https://github.com/dinedal/textql/blob/master/main.go around line 154: appears to make TEXT cols for everything. SQLite is pretty liberal about types so some math works anyway:

  sqlite> select "3"+"3";
  6
though note:

  sqlite> select "24" < 300;
  0
  sqlite> select "9" < "10";
  0
(see http://www.sqlite.org/datatype3.html for more about SQLite typing.)


Your examples should be using single quotes for strings not double quotes. The latter are for quoting identifiers (eg if you named a table "select") and through an accident of SQLite history degrade to a string if no matching identifier is found.

+ is the SQL numeric addition operator. || is string concatenation.

    sqlite> select 3 || 3 ;
    33
    sqlite> select 'a'+'b' ;
    0


Interesting, thanks.


If your description column has any embedded commas in it that are escaped in quoted fields as per the CSV spec, it won't work whereas textql will.


>>I assume textql is basically just running a similar SQL query

textql imports your data into SQLite. In other words its a import utility.


That capability is built into SQLite already -- it comes with an import utility. textql seems to just be a wrapper around this.


Author here:

Key differences: - sqlite import demands an on disk db, textql will use an in memory db if it can, increases performance - sqlite import will not accept stdin, breaking unix pipes. textql will happily do so.

I will make a note of this in the project readme, seems to crop up a lot.


I was doing this with MS Query and the CSV ODBC driver 15+ years ago.


Coincidentally that was the first thing I thought of when I read the title.


This is awesome, but please make that gif into a html5 video so the playback may be controlled.

EDIT: here's gfycat version: http://gfycat.com/FlusteredFarApatosaur#


I do something similar with Excel files using jq (http://stedolan.github.io/jq/) and j (https://npmjs.org/package/j):

    $ j -J sample_data.xlsb tbl | jq 'length'
    3
    $ j -J sample_data.xlsb tbl | jq '[.[]|.id]|max'
    3
    $ j -J sample_data.xlsb tbl | jq '[.[]|.id]|min'
    1
    $ j -J sample_data.xlsb tbl | jq '[.[]|.value]|add'
    18
jq syntax is very terse compared to the equivalent SQL


At the expense of being nearly unintelligible (or writable) without the documentation open. I don't think anyone was trying to make SQL more terse.


I do a great deal of work in DevOps and I have a text file full of scripts which come handy to do work like this during production issues.

Nearly every sysadmin/firefigher I know has command line perl/awk/sed magic text files full of this.

Having said that I would appreciate if something like this was available for XMLs. Having to deal with SQL designs force fully shoe horned into NoSQL databases so often these days. I frequently run into situations where I have to query data from XML dumps. Most of the time, I have to invent a Perl one liner that essentially does the job of a select query.


I've done this a few times with XPath from an ipython shell, using the lxml module to parse the XML and handle the XPath queries.

30-second XPath pitch:

    //foo/bar
This gets the 'bar' children from every 'foo', regardless of where it occurs in the hierarchy.

    //baz[@name='Alice']
Gets everything like: <baz name="Alice">

There's lots of more complicated things you can do, but these are the most frequent types of things I've had to do.


I've found xmlstarlet is wonderful if you are want to use XPath on the terminal. Great with curl + XML APIs like Jenkins.

xml sel --template --value-of "//foo/bar" \ --template --value-of "//baz[@name='Alice']" << EOF <foo> <bar>hello</bar> <baz name="Alice"> world</baz> </foo> EOF

http://xmlstar.sourceforge.net/docs.php

It also can edit with XPath which is really handy.


I have a command-line thing put together with HtmlAgilityPack that represents HTML in an XPath-navigable document. It lets me apply the same XPath queries to HTML, for screen scraping.


Oh, that reminds me: You can use XPath in Chrome with $x(). To get all of the links on the current page, open up the console and type:

    $x('//a')


Well there is XSL and XQL (XML Query Language). There doesn't seem to be many XQL implementations, but this looks interesting: http://www.cs.york.ac.uk/fp/Xtract/. Also check xmllint and xmlstar.


Author here.

XML/HTML support, via XPATH selectors, is coming. I would love to be able to pipe the output of curl through this!


Good to see that it's not using SQLite's own CSV parser, which the last time I checked didn't seem to support the standard convention for quoting fields that contain commas themselves (very common if your CSV stores text strings in some columns):

https://www.sqlite.org/cvstrac/wiki?p=ImportingFiles

(They also seem ignorant of the existence of RFC4180 when they claim there is no CSV standard.)


I end up parsing a lot of excels at work (not my choice). I saw this and thought it looked interesting - export Execl to a CSV and run a query there.

At the end of the day the problem is that non technical users don't fill in things properly. They have a space before a number so it gets treated as text. Or in the same column Excel spits out an integer as a float. Basically the things that a real database keeps in order for you, so unfortunately I don't see it solving my problems. Shame.


I was going to write something like this. From my experience parsing a lot of CSV files, what's really tough isn't parsing data out of a nicely formatted CSV file, but cleaning and correcting the junk that is written into it. Sometimes I'd have to pass 3 or 4 different "correctors" and checkers in order to get an acceptable CSV file.


I feel you. Cleaning data is always a bitch.

That said, if you are dealing with excel, you can use Spreadsheet-ParseExcel and Spreadsheet-WriteExcel to have perl handle excel .xls files directly (though I'm not sure about xls files including embedded vba macros).


I assume this is a common problem to be solved. Are there any best practices, or libraries, or techniques that I could look at to help.

At the moment I parse and update the parser every time a new error appears. Most of the time it seems to be excel presenting things to the user that appear correct when they are not.

I have even been tempted to write my own "data import grid". It would look like excel, have very little of the functionality, but could allow a developer to set rules on what can / can't be entered in each column.



Also, the Haskell version parses the SQL, so no sqlite required.


On a related note there is a recently published paper exploring how to load CSV data into a database efficiently[1].

The authors get a loading throughput (including index creation!) of over 1.5 GB/S with multithreading. They don't seem to do any dirty tricks when loading the data as they immediately can run hundreds of thousands of queries per second on the data.

There also seem to be more interesting papers on the research project page http://www.hyper-db.de/index.html

[1] "Instant Loading for Main Memory Databases" http://www.vldb.org/pvldb/vol6/p1702-muehlbauer.pdf


I suggest another approach:

textql -query "select count(*) from persons.csv"

...because csv files are basically tables. This would also allow you to do joins on multiple files:

textql -query "select p.name, c.name from persons.csv p inner join companies.csv c on p.companyid = c.id"



Author here:

I like this idea! I will consider it, but a problem is that '.' is not allowed as a table name, without escaping brackets. May have to drop file-ext for it to work.


What's the benefit (or tradeoff) of doing this over some variation of "command line data science" of the type outlined in Greg's post [1]?

Would it theoretically be faster than the unix join/grep commands for small CSV files, the kind you get usually via email?

I love SQL and prefer it for data work of this type, but I'm not entirely sure of the benefit of this, since it looks like you're just reading it into SQL, which means having to format and normalize anyway?

[1]http://www.gregreda.com/2013/07/15/unix-commands-for-data-sc...


I think the benefit is that a lot of people are proficient with SQL and are not very proficient with awk. I think whether it will be faster or not depends on how complex your query/queries are relative to the amount of data that you need to import.


The R package sqldf together with backend packages RSQLite (the default), RH2, RMySQL and RPostgreSQL can be used in a somewhat similar manner automatically generating create statements, uploading referenced tables to a database and performing the specified SQL. Tables referenced in the SQL statement are assumed to be R data frames or external files depending on whether an R data frame or file description object of the same name exists in the R work space. The result is returned as an R data frame. See http://sqldf.googlecode.com


Reminds me of something I wrote to import Apache logfiles into a temporary SQLite database:

http://www.steve.org.uk/Software/asql/


I do this with the unix commands 'join' and 'awk'


H2 SQL has this functionality as well. You can specify CSV file at table creation. It can even mix regular tables and CSV. And it has nice web interface build in.


Why not simplify the syntax to allow textql 'select * from sample-data.csv where i = 0'?

Otherwise - already checking out, that's what I wanted for a while.


It seems that textql doesn't parse the query at all, it imports the CSV into an sqlite3 db and then runs the query against it. What you ask for would be doable but it would be much more complex (parsing SQL is not that trivial, and people would want to do more complex queries like JOINing multiple CSV files etc., so you can't just look for "from <filename>").


Cool stuff, and it's written in go!

I made a simple ruby version of this a while back: https://github.com/dergachev/csv2sqlite

Subsequently also I discovered CSVKIT, a python implementation which is probably more robust:

https://github.com/onyxfish/csvkit


I'd use this if it was in Python. For go, I don't have a compiler handy on any machine I might need this.


I created a very similar project in Python a few weeks ago, based on csvkit[1]. Check out the Gist[2].

[1]: https://pypi.python.org/pypi/csvkit [2]: https://gist.github.com/jdp/8447221


if anyone is interested there is a similar tool (called "comp") but the queries are expressed with list comprehension syntax. It also allows to join data from json and xml.

It has two modes: 1) as a commnad line

./comp -f commits.json,authors.txt '[ i.commits | i <- commits, a <- author, i.commits.author.name == a ]'

2) as a service to allow querying of the files through simple http interface

./comp -f commits.json,authors.tx -l :9090

curl -d '{"expr": "[ i.commits | i <- commits, a <- author, i.commits.author.name == a ]"}' http://localhost:9090/full

or through an interactive console on http://localhost:9090/console

disclosure: I'm a co-author, and happy to get a feedback or answer any questions :)


well and here is the link - http://github.com/ostap/comp

:)


I'd be interested to see how well this scales with a multi megabyte CSV.


From what I can tell, this creates an SQLite instance each time it's executed, and creates it in memory unless told not to do so. So I'd guess it wouldn't scale well. Still really neat though!


Even then, it should scale reasonable well if one added a repl, and kept the database in memory. Probably would be fine then up until at least a couple gigabytes, depending on how well sqlite handles in-memory tables (probably very well).


It seems that it loads the CSV into sqlite and the queries it. Inside a transaction I remember sqlite inserts about 60.000 records/sec.


Is that 60 (int) or 60 (float), or is that 60000?


Even if you aren't familiar with the thousands separator, I don't think there's any ambiguity here.


I have been doing this with the unix commands 'join' and 'awk'. For larger files I import it to a database (ex sqllite). I am not sure what advantages this tool provides.


I use Linq with a few addon assemblies to do stuff like this.


Looks like a light version of Optiq. https://github.com/julianhyde/optiq



A related project is this:

http://www.steve.org.uk/Software/asql/


Can we please have -tab instead of -dlm="\t"? Because really.

Also, without support for multiple tables & joins this is not very useful.


Also, without support for multiple tables & joins this is not very useful.

Not even slightly true. Can we stop crapping on v1 releases? I'd rather a developer release early and often than hold their code back until it has every possible feature.


It's a wonderful concept but I don't know if its development would just sit there. I guess I should offer pull requests.


Author here.

- Multiple file/table support is coming. - I like the -tab setting. Will add. I also want to support -dlm=xFF format for arbitrary delimiters, which makes working with Hive text file formats easier.


Nice, I have a bunch of suggestions for you.

Basically I recommend you to look at xmlstarlet sel (and find) as an example of converting complex languages into command line.

Did you think of loading every sqlite file on boot (-load?)

How about textql -header -source data.csv -uniq id -tstmp created,updated -index created 'SELECT ...'? - i.e. changes to table structure from command line switches?

Conventions also won't hurt. I.e. my-data.csv turns to my_data table that is saved to my-data.sqlite3 without me mentioning so. -save should do that.

The idea is to create a re-entrant workspace in the current directory which becomes richer with every invocation.

Also, does CREATE TABLE ... SELECT ... FROM tbl work? Plus -save.


> Also, does CREATE TABLE ... SELECT ... FROM tbl work? Plus -save.

Yes. You can even modify and append to existing sqlite datebases. I'd love for this to become, eventually, a general purpose tool for working on data from the CLI.

I will look into xmlstarlet sel and find for sure. I like the -load tag idea too. There's lots of possibilities here.



Offtopic: is there a better implementation of 'cut' that handles field reordering and multi-character separators?


awk. :-)


You can do this with external tables in Oracle (and probably other db servers)


Perhaps something similar could be implemented more directly in Bash.


Slightly OT but how can I create an animated gif like that?


Screencast, then convert screencast to gif?


One can also import csv to R and use sqldf package.


txt-sushi does this too. It doesn't use SQLite, but processes the files directly, and supports joins.


Ever used Google Refine?




Applications are open for YC Winter 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: