Q – Run SQL Directly on CSV or TSV Files (harelba.github.io)
190 points by gjvc on Sept 21, 2022 | 58 comments



A satisfied user here. Found it very useful when tools like cut and sort weren't enough, usually when I need to do a join on two different tables (err, files). Left joins work, but I don't think right joins are supported.

I've used this in combination with jq as well. I'll use jq to convert json to CSV, and then use SQL to do whatever else.
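
Roughly like this (made-up file and field names; q reads stdin when you use - as the table name, and names headerless columns c1, c2, ...):

    jq -r '.[] | [.name, .city] | @csv' people.json \
      | q -d , "select c2, count(1) from - group by c2"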


Cut, sort, join and awk can be pretty powerful and fast. If it becomes too tedious to write them manually, you can also use BigBash [1] to convert a SQL query automatically into a one-liner that uses only these tools to execute the query.

[1] http://bigbash.it


Any experienced programmer learns to not use string processing on structured data, because that will bite them in the ass.

Meanwhile HN luddites: let me use awk, cut and whatnot despite the existence of a utility that explicitly sidesteps this issue.


/me runs the example on bigbash.it, cleaned up a bit:

    (
      trap "kill 0" SIGINT;
      export LC_ALL=C;
      find movies.dat.gz -print0
        | xargs -0 -i sh -c "gzip -dc {} | tail -n +2"
        | sed "s/::/;/g"
        | cut -d $';' -f2
        | sort -t$';'  -k 1,1
        | head -n10
        | awk -F ';' '{print $1}'
    )
Yeah, how about no. That's a very neat site and a clever hack, but there are clear escaping flaws in there for valid movie names.

bash and standard unix tools are terrible structured-data manipulators. it's part of why `jq` is so widely used and loved, despite being kinda slow and hard to remember at times - it does things correctly, unlike most glued-together tools.


Yep, pretty sure that this script doesn't handle quoted strings in any way, and would promptly mangle a title that contains a semicolon.


"structured data" usually means there are delimiting characters, states, etc. AWK can certainly handle this. It's a simple and powerful language.

I don't think I've ever used it to parse JSON, but I've definitely used it to output simple JSON.
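
E.g. something like this for trivially flat data (no escaping handled, field names made up):

    awk -F, 'NR > 1 { printf "{\"country\":\"%s\",\"city\":\"%s\"}\n", $1, $2 }' cities.csv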


Are you telling me that awk can correctly identify delimiters inside quoted strings? Escaped quotes inside quoted strings? Newlines inside quoted strings? I.e. that awk actually has a csv parser? Very cool if so.


Yeah, you can implement a basic FSM and use `next` to handle a fake `RS` (e.g. newlines inside quoted fields).

I'm not necessarily recommending it, but it's certainly possible and could be portable and really fast to run with a low memory footprint.
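
A rough sketch of the record-assembly part (field splitting left out; assumes RFC-ish CSV where quotes are doubled inside quoted fields):

    awk '
    {
      rec = (rec == "" ? $0 : rec "\n" $0)   # accumulate physical lines
      quotes += gsub(/"/, "\"")              # running count of quote chars
      if (quotes % 2) next                   # odd count: still inside a quoted field
      print "RECORD: " rec                   # one complete logical record
      rec = ""; quotes = 0
    }' data.csv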


Well, awk having a CSV parser by way of the user implementing that parser is not quite what I have in mind when I turn to awk for some quick field splitting, and I don't think it's what others in the thread meant either, as evidenced by the linked site.

Personally I prefer using a ready-made, tested library in whatever language I happen to be touching, so I can just do my own thing on top. Or, on the command line, a utility that employs such a library. I kind of hope I'm never so constrained that only awk is available and I can't even spin up Lua.


Powerful and fast, and also *portable*. It'll run on your low-privileged tools box, 20 year old beige box, vhost, you name it.


In addition to the usual cut/paste/sort/awk stuff, we've had really powerful "stream-operator" databases based on flat text data files for decades. They used to be somewhat slow. Not anymore, esp when running from RAMdisks.

One good one is Strozzi NoSQL (his use of the term NoSQL predates the current use of the term by many years...): http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/Home%20P...

Starbase is another, with interesting extensions for astronomical work.

Linux Journal article on the concept here: https://www.linuxjournal.com/article/3294

The article that started it all: http://www.linux.it/~carlos/nosql/4gl.ps

And there's even a book on the subject, centered on the /rdb implementation by the late RSW software. But I warn you, reading this WILL permanently change the way you think about databases: https://www.amazon.com/Relational-Database-Management-Prenti...


Depending on how complex the task gets, I too jump from sort/join/awk/sed over to the SQL train every so often. But if I've already taken that step, then I'd also like to have the full SQL(ite) power, and that would really blow up the command line. In such cases I usually write a Tcl script; the SQLite3 integration [0] is quasi-native, and besides the full SQLite3 functionality I also get flexible extensions (e.g. directly usable Tcl procedures of arbitrary complexity). Tools like Q represent a middle ground: although they build on SQLite, they lag behind in functionality [1]. But as long as I want to keep it simple on the command line while adhering to the UNIX philosophy, coreutils, sed, awk and possibly perl remain my best friends.

[0]: https://www.sqlite.org/tclsqlite.html [1]: http://harelba.github.io/q/#limitations


Same. You can go a long way with cut, sort, etc. and also awk with its pattern matching. But if you're handy with SQL, that can often feel more natural and certainly things like joins among separate CSV files, as well as sums and other aggregates, are easier.

If you have "unclean" CSV data, e.g. where the data contains delimiters and/or newlines in quoted fields, you might want to pipe it through csvquote.

https://github.com/dbro/csvquote
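
Roughly like this (hypothetical file; csvquote temporarily replaces the troublesome delimiters/newlines inside quoted fields, and -u restores them afterwards):

    csvquote worldcitiespop_mil.csv \
      | awk -F, '{ print $1 }' \
      | sort | uniq -c \
      | csvquote -u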


You also probably have GNU join installed: https://www.gnu.org/software/coreutils/manual/html_node/join...


Have you looked at the Unix command ‘join’? This is a cool tool, but I think join is pretty much everywhere.
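
For anyone who hasn't used it: join wants both inputs sorted on the key, so a quick (quoting-naive) equi-join on the first column of two made-up CSVs looks something like:

    sort -t, -k1,1 orders.csv    > orders.sorted
    sort -t, -k1,1 customers.csv > customers.sorted
    join -t, -1 1 -2 1 orders.sorted customers.sorted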


Hmm, I'll have to check it out!


Love seeing more projects focused on using SQL to query many things. It's a common and familiar language for querying and reporting.

I help lead an open source project, https://steampipe.io, which can query CSV with SQL, among 85+ other endpoints like cloud providers, SaaS APIs, code, logs and more, using SQL to query and join data: https://hub.steampipe.io/plugins

There is also an interesting dashboards as code concept where you can codify interactive dashboards with HCL + SQL: https://steampipe.io/blog/dashboards-as-code


I have been using Dremio to query large volumes of CSV files: https://docs.dremio.com/software/data-sources/files-and-dire...

Although having them in some columnar format is much better for fast responses.

GitHub: https://github.com/dremio/dremio-oss


They're sort of different categories.

Tools like Q are for command line use.

Dremio is a server/web application, right?

They accomplish the same thing but you might deploy a tool like Q on production servers for adhoc log analysis or install it in a docker container. (Not saying you should, just explaining the difference.)


I will give it a shot!

The web application is just a UI to get started. It acts like a database providing you with jdbc and odbc drivers and arrow flight protocol: https://github.com/dremio-hub/arrow-flight-client-examples


If you like the idea of this but aren't on Linux or want something with a little ui involved, check out the Rainbow CSV extension for VS Code, vim, Sublime, and Atom. It includes RBQL [0] which lets you query CSVs with a sql-like syntax.

(Not affiliated with Rainbow CSV or RBQL at all, just a happy user)

[0]: https://rbql.org/


You can do the same thing with csvsql from csvkit: https://towardsdatascience.com/analyze-csvs-with-sql-in-comm...

Since csvkit comes with so many other tools, I'm not sure I see a reason to use q over csvsql
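
For example (if I remember csvkit's conventions right, the table name defaults to the file name minus the extension):

    csvsql --query "select country, count(1) from worldcitiespop_mil group by country" worldcitiespop_mil.csv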


Big fan of csvsql here. I use it often to look at log files.


Previous discussion:

https://news.ycombinator.com/item?id=18453133 (284 points|devy|4 years ago|96 comments)

https://news.ycombinator.com/item?id=27423276 (121 points|thunderbong|1 year ago|63 comments)

https://news.ycombinator.com/item?id=24694892 (11 points|pcr910303|2 years ago|2 comments)


Nice work! I am a fan of tools like this.

However, in my first attempted query (version 3.1.6 on MacOS), I ran into significant performance limitations and more importantly, it did not give correct output.

In particular, running on a narrow table with 1mm rows (the same one used in the xsv examples) using the command "select country, count(1) from worldcitiespop_mil.csv group by country" takes 12 seconds just to get an incorrect error 'no such column: country'.

using sqlite3, it takes two seconds or so to load, and less than a second to run, and gives me the correct result.

Using https://github.com/liquidaty/zsv (disclaimer, I'm one of its authors), I get the correct results in 0.95 seconds with the one-liner `zsv sql 'select country, count(1) from data group by country' worldcitiespop_mil.csv`.


Hi, author of q here.

Regarding the error you got, q currently does not autodetect headers, so you'd need to add -H as a flag in order to use the "country" column name. You're absolutely correct on failing fast here - it's a bug which I'll fix.

In general regarding speed - q supports automatic caching of the CSV files (through the "-C readwrite" flag). Once it's activated, it will write the data into another file (with a .qsql extension) and use it automatically in further queries to speed things up considerably.

Effectively, the .qsql files are regular sqlite3 files (with some metadata), and q can be used to query them directly (or any regular sqlite3 file), including the ability to seamlessly join between multiple sqlite3 files.

http://harelba.github.io/q/#auto-caching-examples
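
E.g. roughly (the -d flag sets the comma delimiter; the first run writes the .qsql cache next to the CSV, and subsequent runs pick it up automatically):

    q -H -d , -C readwrite "select country, count(1) from worldcitiespop_mil.csv group by country"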


Ah, got it, thank you!

Just one minor suggestion/feedback point, in case you find it helpful: I had to also add the `-d` flag with a comma value. Otherwise, with just -H, I get the error "Bad header row" even though my header was simply "Country,City,AccentCity,Region,Population,Latitude,Longitude".

This suggests to me that `q` is not assuming the input to be a CSV file, but that seems at odds with the first example in the manual, which is `q "select * from myfile.csv"`, with no `-d` flag. Or perhaps the first example also isn't using a csv delimiter, but it doesn't matter because no specific column is being selected?

In addition, given that (from what I gather) a significant convenience of `q` is its auto-detection, I think it would make sense for it to notice when the input table name ends in ".csv" and, based on that, assume a comma delimiter.

Just my 2 cents. Great job!


Hi again, thanks a lot for the suggestions!

You're absolutely right about the auto-detection (and documentation) of both the header row and the delimiter, I was busy with the auto-caching ability in the last few months in order to provide generic sqlite3 querying, so never got around to it.

I will update the docs and also add the auto-detection capability soon.

Harel


Another solution would be to leverage the existing SQL engines.

Someone mentioned sqlite virtual table.

My home-made solution to the very same problem is a simple WinForms app that you can drag and drop spreadsheets and CSV files onto; it analyses them, creates an instance of LocalDB if there isn't one running, creates the table and uploads the data (so it reads the CSV file twice). Then I can use my loved and trusted SQL Server Management Studio with the MS SQL engine. The same UI lets me quickly delete tables and databases and create a new database in two clicks. (Future development: auto-normalise the table to reduce disk space and improve performance.)

What's lacking is good import tools. Most CSV import tools are super picky in terms of the format of the data (dates in particular) and have too many steps.


Q is built on top of sqlite. :)



It's fast. But it's also a 1-2 GB binary. And its SQL implementation is a work in progress and often makes up its own names for common functions.

If you can put up with both for an ad hoc CLI exploration tool, then yeah, it's incredible.

For analytics queries in general though (not talking about clickhouse-local) I don't think there's any OSS competition.
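
The kind of ad hoc exploration I mean, roughly (recent versions infer the schema from the CSV header):

    clickhouse-local --query "
      SELECT country, count()
      FROM file('worldcitiespop_mil.csv', 'CSVWithNames')
      GROUP BY country"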


The large binary size argument is a bit old.

It seems possible to reduce it below 50MB.

https://github.com/ClickHouse/ClickHouse/issues/29378


When I want to quickly query a CSV file (usually log files), I like to use lnav, which also supports running SQL queries and handles pretty much any log file format I've had to deal with.

https://lnav.org/


For Windows users who click here, this functionality is built in.

https://learn.microsoft.com/en-us/cpp/data/odbc/data-source-...


Temp DBs feel like a slightly underused technique too. (Not saying it's always better.)

I.e., just a quick script that adds a serial id as the first column, then imports into Postgres/MySQL, using the header for column names and the file name as the table name, into a brand-new DB.

Usually DBs are so long-lived and carefully designed that there's a bit of a mental block against just importing trash data and dropping the whole database later. I'm always 15 minutes into awk before I remember.

Also in postgres you can do it as a new schema in an existing database, and join with the existing data. Probably safest to not do that in production :-).

Like so: https://stackoverflow.com/questions/5712387/can-we-join-two-...

Then just drop the whole schema when you are done screwing around.
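
A rough sketch of the scratch-schema variant (made-up db/file/column names):

    psql mydb -c "CREATE SCHEMA scratch;"
    psql mydb -c "CREATE TABLE scratch.cities (id serial, country text, city text);"
    psql mydb -c "\copy scratch.cities (country, city) FROM 'cities.csv' WITH (FORMAT csv, HEADER true)"
    psql mydb -c "SELECT country, count(*) FROM scratch.cities GROUP BY country;"
    psql mydb -c "DROP SCHEMA scratch CASCADE;"   # throw the whole mess away when done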



When will people stop giving things one-character names? It's impossible to search for.


http://harelba.github.io/q/#requirements

"q is packaged as a compiled standalone-executable that has no dependencies, not even python itself."

This is not quite true, on MacOS:

"q: A full installation of Xcode.app 12.4 is required to compile this software. Installing just the Command Line Tools is not sufficient.

Xcode can be installed from the App Store. Error: q: An unsatisfied requirement failed this build."


I don't understand this objection. Being able to compile something with no dependencies is different from being able to run something with no dependencies.


I think you're confusing installing the compiled package and using it (which doesn't require anything else) with _compiling_ the package, which requires XCode on MacOS


Reminds me of the textQL extension that's available in Asciidoc.

Point it to an external CSV file, enable TextQL, and bam, there's your query returned as a table. Handy for parts lists, inventory, that kind of crap.

https://github.com/dinedal/textql

https://gist.github.com/mojavelinux/8856117


Is this similar to CSV virtual tables in sqlite?

https://www.sqlite.org/csv.html


It's easier to use than that. With virtual tables you must CREATE TABLE for every schema you have, which is very tedious if you're querying CSV files with differing columns.
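
For comparison, the virtual-table route looks roughly like this (assuming you've built the csv extension from the page above; without a header or schema the columns come out as c0, c1, ...):

    printf '%s\n' \
      ".load ./csv" \
      "CREATE VIRTUAL TABLE temp.cities USING csv(filename='cities.csv', header=yes);" \
      "SELECT country, count(*) FROM cities GROUP BY country;" \
      | sqlite3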



MariaDB sports several storage engines such as InnoDB, MyISAM, Aria, etc.

One of them is CSV, which uses plain text files as its backend: https://mariadb.com/kb/en/csv-overview/
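
A minimal sketch (the CSV engine needs every column NOT NULL and supports no indexes; the backing file ends up as cities.CSV inside the database directory):

    mysql -e "CREATE DATABASE IF NOT EXISTS scratch;
              CREATE TABLE scratch.cities (
                country VARCHAR(64)  NOT NULL,
                city    VARCHAR(128) NOT NULL
              ) ENGINE=CSV;"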


I've made a similar app, but I've aimed for it to be more convenient, especially for people who don't want to deal with installation and the command line.

Check it out: https://superintendent.app -- it is a paid app though.


Maybe relevant.

A while ago I wrote relational (https://ltworf.github.io/relational/) to do relational algebra queries… It can load csv files, has a gui and a cli.


Q runs SQLite under the hood. If you want real firepower, take a look at DuckDB; the codebase is also a work of art.

https://github.com/duckdb/duckdb
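
For the CSV case it's a one-liner too, something like:

    echo "SELECT country, count(*) FROM read_csv_auto('worldcitiespop_mil.csv') GROUP BY country;" | duckdb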


"Sqawk is an Awk-like program that uses SQL and can combine data from multiple files. It is powered by SQLite."

https://github.com/dbohdan/sqawk


These tools, useful as they are, appear to be only for querying files. AFAICT, insert and update are typically not supported as a way to modify the data.


I always just import the CSV into sqlite (a one-line command). Then I have fast, reliable queries on it and can join it to other data.
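
E.g. roughly (.import creates the table from the header row if it doesn't exist yet):

    printf '%s\n' \
      ".mode csv" \
      ".import worldcitiespop_mil.csv cities" \
      "SELECT country, count(*) FROM cities GROUP BY country;" \
      | sqlite3 cities.db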


Don't know if it's the first such implementation, but perl's DBD::CSV is 25 years old this year.


What is the difference between Q and csvsql?


sooo.. clickhouse? XD


Q is a great project! Here's a comparison of Q against some of the other tools out there [0] (including dsq, which I wrote). And there's a benchmark in there too [1].

Whichever tool you end up using, I'm sure it will help out with your CLI data exploration!

[0] https://github.com/multiprocessio/dsq#comparisons

[1] https://github.com/multiprocessio/dsq#benchmark
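
For anyone curious, dsq usage looks roughly like this (the `{}` stands in for the file):

    dsq worldcitiespop_mil.csv "SELECT country, count(1) FROM {} GROUP BY country"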


dsq is a great tool; I use it regularly to work against CSV files, and it's really nice to use.


Awesome to hear!



