It is importing the data into SQLite and running the query there. I don't see any attempt at type detection.
SQLite has virtual tables where app provided code can provide the underlying data and you can then use SQL queries against it. This is by far the easiest approach for foreign data formats. http://www.sqlite.org/vtab.html
I'm the author of a Python wrapper for SQLite. It provides a shell compatible with the main SQLite one invocable from the command line (ie you don't need to go anywhere near Python) - http://rogerbinns.github.io/apsw/shell.html
That shell has an autoimport command which automatically figures things out for csv like data (separators, data types etc)
sqlite> .help autoimport
.autoimport FILENAME ?TABLE? Imports filename creating a table and automatically working out
separators and data types (alternative to .import command)
The import command requires that you precisely pre-setup the table and schema, and set the data
separators (eg commas or tabs). In many cases this information can be automatically deduced from
the file contents which is what this command does. There must be at least two columns and two rows.
If the table is not specified then the basename of the file will be used.
Additionally the type of the contents of each column is also deduced - for example if it is a number
or date. Empty values are turned into nulls. Dates are normalized into YYYY-MM-DD format and
DateTime are normalized into ISO8601 format to allow easy sorting and searching. 4 digit years must
be used to detect dates. US (swapped day and month) versus rest of the world is also detected
providing there is at least one value that resolves the ambiguity.
Care is taken to ensure that columns looking like numbers are only treated as numbers if they do not
have unnecessary leading zeroes or plus signs. This is to avoid treating phone numbers and similar
number like strings as integers.
This command can take quite some time on large files as they are effectively imported twice. The
first time is to determine the format and the types for each column while the second pass actually
imports the data.
I've written a log file viewer (http://lnav.org) that exports log messages as SQLite virtual tables. I was hoping this project also leveraged virtual tables since they're one of the coolest features in SQLite.
lnav tries to push things a little further and extract data from unstructured log messages. For example, given the following message:
"Registering new address record for 10.1.10.62 on eth0."
It will create a virtual table with two columns that capture the IP address (10.1.10.62) and the device name (eth0). Queries against this table will return the values from all messages with the same format (Registering new address record for COL_0 on COL_1). See http://lnav.readthedocs.org/en/latest/data.html for more info.
I did a similar thing at work, but with virtual tables. Supports zip/gz files and multiple files with glob expansion. Might open source if there's interest. It's based on a sample CSV vtab source. Has type automatic detection. No indexing yet, though.
Even better is LogParser, which lets you do SQL queries against just about any data format you'll find in a file on Windows. It's been around for a decade and continues to be improved.
I remember LogParser from years back. I used to be very involved in that nascent community, and I developed a plugin for it that let you parse an input file using a regex with capture groups so you could use SQL queries on log files that did not have records with simple delimited columns.
Apropo of absolutely nothing, my advanced C++ final was a basic console CRUD application using ADO. It was a nightmare - fun, and I broke the database (and didn't get points for that) but a nightmare.
An alternative approach is Microsoft Log Parser. While it is old in internet time (2005), it works brilliantly on very large files. I've successfully used it on 10+ GigaByte files. The documentation is a little sparse but once you get the hang of it it's pretty simple. Clearly it only works on windows boxen.
Example:
#Find all unique from col1
logparser -i:csv -o:csv -stats:off -dtlines:2000 -headers:off "select distinct col1 from input.csv" >out.csv
We use a format called Tabix[1] frequently in bioinformatics. It doesn't have a full query language, but is optimized for extracting regions from a text file and supports compression. It's pretty clever — it's based on a block compressed (BGZF) tab file with virtual offsets. An index allows for fast random access.
I assume textql is basically just running a similar SQL query, but perhaps autodetecting column types for you? (Haven't looked at the code.) Is that the value proposition, or does it do something more?
Your examples should be using single quotes for strings not double quotes. The latter are for quoting identifiers (eg if you named a table "select") and through an accident of SQLite history degrade to a string if no matching identifier is found.
+ is the SQL numeric addition operator. || is string concatenation.
Key differences:
- sqlite import demands an on disk db, textql will use an in memory db if it can, increases performance
- sqlite import will not accept stdin, breaking unix pipes. textql will happily do so.
I will make a note of this in the project readme, seems to crop up a lot.
I do a great deal of work in DevOps and I have a text file full of scripts which come handy to do work like this during production issues.
Nearly every sysadmin/firefigher I know has command line perl/awk/sed magic text files full of this.
Having said that I would appreciate if something like this was available for XMLs. Having to deal with SQL designs force fully shoe horned into NoSQL databases so often these days. I frequently run into situations where I have to query data from XML dumps. Most of the time, I have to invent a Perl one liner that essentially does the job of a select query.
I have a command-line thing put together with HtmlAgilityPack that represents HTML in an XPath-navigable document. It lets me apply the same XPath queries to HTML, for screen scraping.
Well there is XSL and XQL (XML Query Language). There doesn't seem to be many XQL implementations, but this looks interesting: http://www.cs.york.ac.uk/fp/Xtract/. Also check xmllint and xmlstar.
Good to see that it's not using SQLite's own CSV parser, which the last time I checked didn't seem to support the standard convention for quoting fields that contain commas themselves (very common if your CSV stores text strings in some columns):
I end up parsing a lot of excels at work (not my choice). I saw this and thought it looked interesting - export Execl to a CSV and run a query there.
At the end of the day the problem is that non technical users don't fill in things properly. They have a space before a number so it gets treated as text. Or in the same column Excel spits out an integer as a float. Basically the things that a real database keeps in order for you, so unfortunately I don't see it solving my problems. Shame.
I was going to write something like this. From my experience parsing a lot of CSV files, what's really tough isn't parsing data out of a nicely formatted CSV file, but cleaning and correcting the junk that is written into it. Sometimes I'd have to pass 3 or 4 different "correctors" and checkers in order to get an acceptable CSV file.
That said, if you are dealing with excel, you can use Spreadsheet-ParseExcel and Spreadsheet-WriteExcel to have perl handle excel .xls files directly (though I'm not sure about xls files including embedded vba macros).
I assume this is a common problem to be solved. Are there any best practices, or libraries, or techniques that I could look at to help.
At the moment I parse and update the parser every time a new error appears. Most of the time it seems to be excel presenting things to the user that appear correct when they are not.
I have even been tempted to write my own "data import grid". It would look like excel, have very little of the functionality, but could allow a developer to set rules on what can / can't be entered in each column.
On a related note there is a recently published paper exploring how to load CSV data into a database efficiently[1].
The authors get a loading throughput (including index creation!) of over 1.5 GB/S with multithreading. They don't seem to do any dirty tricks when loading the data as they immediately can run hundreds of thousands of queries per second on the data.
I like this idea! I will consider it, but a problem is that '.' is not allowed as a table name, without escaping brackets. May have to drop file-ext for it to work.
What's the benefit (or tradeoff) of doing this over some variation of "command line data science" of the type outlined in Greg's post [1]?
Would it theoretically be faster than the unix join/grep commands for small CSV files, the kind you get usually via email?
I love SQL and prefer it for data work of this type, but I'm not entirely sure of the benefit of this, since it looks like you're just reading it into SQL, which means having to format and normalize anyway?
I think the benefit is that a lot of people are proficient with SQL and are not very proficient with awk. I think whether it will be faster or not depends on how complex your query/queries are relative to the amount of data that you need to import.
The R package sqldf together with backend packages RSQLite (the default), RH2, RMySQL and RPostgreSQL can be used in a somewhat similar manner automatically generating create statements, uploading referenced tables to a database and performing the specified SQL. Tables referenced in the SQL statement are assumed to be R data frames or external files depending on whether an R data frame or file description object of the same name exists in the R work space. The result is returned as an R data frame. See http://sqldf.googlecode.com
H2 SQL has this functionality as well. You can specify CSV file at table creation. It can even mix regular tables and CSV. And it has nice web interface build in.
if anyone is interested there is a similar tool (called "comp") but the queries are expressed with list comprehension syntax. It also allows to join data from json and xml.
It has two modes:
1) as a commnad line
./comp -f commits.json,authors.txt '[ i.commits | i <- commits, a <- author, i.commits.author.name == a ]'
2) as a service to allow querying of the files through simple http interface
./comp -f commits.json,authors.tx -l :9090
curl -d '{"expr": "[ i.commits | i <- commits, a <- author, i.commits.author.name == a ]"}' http://localhost:9090/full
From what I can tell, this creates an SQLite instance each time it's executed, and creates it in memory unless told not to do so. So I'd guess it wouldn't scale well. Still really neat though!
Even then, it should scale reasonable well if one added a repl, and kept the database in memory. Probably would be fine then up until at least a couple gigabytes, depending on how well sqlite handles in-memory tables (probably very well).
I have been doing this with the unix commands 'join' and 'awk'. For larger files I import it to a database (ex sqllite). I am not sure what advantages this tool provides.
It seems that textql doesn't parse the query at all, it imports the CSV into an sqlite3 db and then runs the query against it. What you ask for would be doable but it would be much more complex (parsing SQL is not that trivial, and people would want to do more complex queries like JOINing multiple CSV files etc., so you can't just look for "from <filename>").
Also, without support for multiple tables & joins this is not very useful.
Not even slightly true. Can we stop crapping on v1 releases? I'd rather a developer release early and often than hold their code back until it has every possible feature.
- Multiple file/table support is coming.
- I like the -tab setting. Will add. I also want to support -dlm=xFF format for arbitrary delimiters, which makes working with Hive text file formats easier.
Basically I recommend you to look at xmlstarlet sel (and find) as an example of converting complex languages into command line.
Did you think of loading every sqlite file on boot (-load?)
How about textql -header -source data.csv -uniq id -tstmp created,updated -index created 'SELECT ...'? - i.e. changes to table structure from command line switches?
Conventions also won't hurt. I.e. my-data.csv turns to my_data table that is saved to my-data.sqlite3 without me mentioning so. -save should do that.
The idea is to create a re-entrant workspace in the current directory which becomes richer with every invocation.
Also, does CREATE TABLE ... SELECT ... FROM tbl work? Plus -save.
> Also, does CREATE TABLE ... SELECT ... FROM tbl work? Plus -save.
Yes. You can even modify and append to existing sqlite datebases. I'd love for this to become, eventually, a general purpose tool for working on data from the CLI.
I will look into xmlstarlet sel and find for sure. I like the -load tag idea too. There's lots of possibilities here.
SQLite has virtual tables where app provided code can provide the underlying data and you can then use SQL queries against it. This is by far the easiest approach for foreign data formats. http://www.sqlite.org/vtab.html
I'm the author of a Python wrapper for SQLite. It provides a shell compatible with the main SQLite one invocable from the command line (ie you don't need to go anywhere near Python) - http://rogerbinns.github.io/apsw/shell.html
That shell has an autoimport command which automatically figures things out for csv like data (separators, data types etc)
sqlite> .help autoimport
.autoimport FILENAME ?TABLE? Imports filename creating a table and automatically working out separators and data types (alternative to .import command)
The import command requires that you precisely pre-setup the table and schema, and set the data separators (eg commas or tabs). In many cases this information can be automatically deduced from the file contents which is what this command does. There must be at least two columns and two rows.
If the table is not specified then the basename of the file will be used.
Additionally the type of the contents of each column is also deduced - for example if it is a number or date. Empty values are turned into nulls. Dates are normalized into YYYY-MM-DD format and DateTime are normalized into ISO8601 format to allow easy sorting and searching. 4 digit years must be used to detect dates. US (swapped day and month) versus rest of the world is also detected providing there is at least one value that resolves the ambiguity.
Care is taken to ensure that columns looking like numbers are only treated as numbers if they do not have unnecessary leading zeroes or plus signs. This is to avoid treating phone numbers and similar number like strings as integers.
This command can take quite some time on large files as they are effectively imported twice. The first time is to determine the format and the types for each column while the second pass actually imports the data.