A satisfied user here. Found it very useful when tools like cut and sort weren't enough, usually when I need to do a join on two different tables (err, files). Left joins work, but I don't think right joins are supported.
I've used this in combination with jq as well. I'll use jq to convert json to CSV, and then use SQL to do whatever else.
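Roughly, that pipeline looks like the following (file and field names are made up; if I remember right, q exposes stdin as a table named `-` and names header-less columns c1, c2, ...):

    # JSON array of flat objects -> CSV on stdout -> SQL over the CSV
    jq -r '.[] | [.name, .age] | @csv' people.json \
      | q -d , "select c1, count(1) from - group by c1"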
Cut, sort, join and awk can be pretty powerful and fast. If it becomes too tedious to write them manually, you can also use BigBash [1] to automatically convert a SQL query into a one-liner that only uses these tools to execute the query.
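For the simple cases it really is only a couple of commands. A sketch of an inner join on the first column of two quote-free, comma-separated files (file names made up; join(1) needs both sides sorted on the key):

    join -t, -1 1 -2 1 <(sort -t, -k1,1 users.csv) <(sort -t, -k1,1 orders.csv)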
Yeah, how about no. That's a very neat site and a clever hack, but there are clear escaping flaws in there for valid movie names.
bash and standard unix tools are terrible structured-data manipulators. it's part of why `jq` is so widely used and loved, despite being kinda slow and hard to remember at times - it does things correctly, unlike most glued-together tools.
Are you telling me that awk can correctly identify delimiters inside quoted strings? Escaped quotes inside quoted strings? Newlines inside quoted strings? I.e. that awk actually has a csv parser? Very cool if so.
Well, awk having a csv parser via the user implementing that parser is not quite what I have in mind when I turn to awk for some quick field splitting—and I don't think it's what others in the thread meant either, as evidenced by the linked site.
Personally I prefer using a ready-made and tested library in any language that I might touch, so I can just do my own thing on top. Or, on the command line, using a utility that employs such a library. Kind of hope that I'm never so constrained that only awk is available and I can't even spin up Lua.
In addition to the usual cut/paste/sort/awk stuff, we've had really powerful "stream-operator" databases based on flat text data files for decades. They used to be somewhat slow. Not anymore, esp when running from RAMdisks.
And there's even a book on the subject, centered on the /rdb implementation by the late RSW software. But I warn you, reading this WILL permanently change the way you think about databases: https://www.amazon.com/Relational-Database-Management-Prenti...
Depending on how complex the task is, I increasingly jump from sort/join/awk/sed over to the SQL train. But once I've gone that far, I'd also like to have the full power of SQL(ite), and that really blows up the command line.
In such cases I usually write a TCL script; the SQLite3 integration [0] is essentially native, and besides the full SQLite3 functionality I also have flexible extensions at my disposal (e.g. TCL procedures of arbitrary complexity, usable directly from SQL).
Tools like Q represent a middle ground: although they build on SQLite, they remain behind it in functionality [1]. But as long as I want to keep it simple on the command line while adhering to the UNIX philosophy, coreutils, sed, awk and possibly perl remain my best friends.
Same. You can go a long way with cut, sort, etc. and also awk with its pattern matching. But if you're handy with SQL, that can often feel more natural and certainly things like joins among separate CSV files, as well as sums and other aggregates, are easier.
If you have "unclean" CSV data, e.g. where the data contains delimiters and/or newlines in quoted fields, you might want to pipe it through csvquote.
Love seeing more projects focused on using SQL to query many things. It's a common and familiar language for querying and reporting.
I help lead an open source project, https://steampipe.io, which can query CSV with SQL, alongside 85+ other sources like cloud providers, SaaS APIs, code, logs and more, letting you query and join the data with SQL: https://hub.steampipe.io/plugins
They accomplish the same thing but you might deploy a tool like Q on production servers for adhoc log analysis or install it in a docker container. (Not saying you should, just explaining the difference.)
If you like the idea of this but aren't on Linux, or want something with a little UI involved, check out the Rainbow CSV extension for VS Code, vim, Sublime, and Atom. It includes RBQL [0], which lets you query CSVs with a SQL-like syntax.
(Not affiliated with Rainbow CSV or RBQL at all, just a happy user)
However, in my first attempted query (version 3.1.6 on macOS), I ran into significant performance limitations and, more importantly, it did not give correct output.
In particular, running on a narrow table with 1 million rows (the same one used in the xsv examples) with the command "select country, count(1) from worldcitiespop_mil.csv group by country" takes 12 seconds just to return the (incorrect) error 'no such column: country'.
Using sqlite3, it takes two seconds or so to load, less than a second to run, and gives me the correct result.
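(For reference, the sqlite3 run was roughly the following; the table name is arbitrary, and it assumes a reasonably recent sqlite3 CLI, which accepts dot-commands as arguments and takes the header row as column names when the target table doesn't exist yet:)

    sqlite3 :memory: \
      '.mode csv' \
      '.import worldcitiespop_mil.csv cities' \
      'select country, count(1) from cities group by country'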
Using https://github.com/liquidaty/zsv (disclaimer, I'm one of its authors), I get the correct results in 0.95 seconds with the one-liner `zsv sql 'select country, count(1) from data group by country' worldcitiespop_mil.csv`.
Regarding the error you got: q currently does not autodetect headers, so you'd need to add -H as a flag in order to use the "country" column name. You're absolutely correct on failing fast here - it's a bug which I'll fix.
In general regarding speed - q supports automatic caching of the CSV files (through the "-C readwrite" flag). Once it's activated, it will write the data into another file (with a .qsql extension) and will use it automatically in further queries to speed things up considerably.
Effectively, the .qsql files are regular sqlite3 files (with some metadata), and q can be used to query them directly (or any regular sqlite3 file), including the ability to seamlessly join between multiple sqlite3 files.
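So, combining the flags from this thread, something like the following should be the fast path after the first run (the first run writes the .qsql cache, later runs pick it up automatically):

    q -H -d , -C readwrite "select country, count(1) from worldcitiespop_mil.csv group by country"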
Just one minor suggestion/feedback point, in case you find it helpful: I had to also add the `-d` flag with a comma value. Otherwise, with just -H, I get the error "Bad header row" even though my header was simply "Country,City,AccentCity,Region,Population,Latitude,Longitude".
This suggests to me that `q` is not assuming the input to be a CSV file, but that seems at odds with the first example in the manual, which is `q "select * from myfile.csv"`, with no `-d` flag. Or perhaps the first example also isn't using a csv delimiter, but it doesn't matter because no specific column is being selected?
In addition, given that (from what I gather) a significant convenience of `q` is its auto-detection, I think it would make sense for it to notice when the input table name ends in ".csv" and, based on that, assume a comma delimiter.
You're absolutely right about the auto-detection (and documentation) of both the header row and the delimiter, I was busy with the auto-caching ability in the last few months in order to provide generic sqlite3 querying, so never got around to it.
I will update the docs and also add the auto-detection capability soon.
Another solution would be to leverage the existing SQL engines.
Someone mentioned sqlite virtual table.
My home-made solution to the very same problem is a simple WinForms app that you can drag and drop spreadsheets and CSV files onto. It analyses them, creates a LocalDB instance if there isn't one running, creates the table and uploads the data (so it reads the CSV file twice). Then I can use my loved and trusted SQL Server Management Studio with the MS SQL engine. The same UI lets me quickly delete tables and databases and create a new database in two clicks. (Future development: auto-normalise the table to reduce disk space and improve performance.)
What's lacking is good import tools. Most CSV import tools are super picky in terms of the format of the data (dates in particular) and have too many steps.
When I want to quickly query a CSV file (usually log files), I like to use lnav, which also supports running SQL queries and handles pretty much any log file format I've happened to deal with.
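For example, something like this works headlessly (the access_log table and c_ip column come from lnav's built-in web access log format; other formats expose their own tables, so treat this as a sketch):

    lnav -n -c ';SELECT c_ip, count(*) AS hits FROM access_log GROUP BY c_ip ORDER BY hits DESC' access.log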
Temp DBs feel like a slightly underused technique too. (Not saying it's always better.)
I.e., just a quick script that adds a serial id as the first column, then imports to postgres/mysql based on header names (column names) and file name (becomes table name) into a brand new db (rough sketch at the end of this comment).
Usually DBs are so long-lived and carefully designed that there's a bit of a mental block around just importing trash data and dropping the whole database later. I'm always 15 mins into awk before I remember.
Also in postgres you can do it as a new schema in an existing database, and join with the existing data. Probably safest to not do that in production :-).
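A rough sketch of that throwaway import, with made-up table and column names and assuming the CSV has a header row:

    createdb scratch
    psql scratch -c 'create table t (id serial, country text, city text, population bigint)'
    psql scratch -c "\copy t (country, city, population) from 'data.csv' with (format csv, header true)"
    psql scratch -c 'select country, sum(population) from t group by country order by 2 desc'
    dropdb scratch    # the whole point: throw it away when you're done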
I don't understand this objection. Being able to compile something with no dependencies is different from being able to run something with no dependencies.
I think you're confusing installing the compiled package and using it (which doesn't require anything else) with _compiling_ the package, which requires Xcode on macOS.
It's easier to use than that. With virtual tables you must CREATE TABLE for every schema you have which is very tedious if you are querying CSV files with differing columns.
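For illustration, the per-file boilerplate looks roughly like this with SQLite's csv extension (assuming you've compiled ext/misc/csv.c from the SQLite tree into a loadable ./csv module; argument names vary between CSV vtab implementations, so treat it as a sketch):

    sqlite3 :memory: \
      '.load ./csv' \
      "CREATE VIRTUAL TABLE temp.t1 USING csv(filename='data.csv');" \
      'SELECT count(*) FROM t1;'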
These tools, useful as they are, appear to be only for querying files. AFAICT, insert and update are typically not supported as a way to modify the data.
Q is a great project! Here's a comparison of Q against some of the other tools out there [0] (including dsq, which I wrote). And there's a benchmark in there too [1].
Whichever tool you end up using, I'm sure it will help out with your CLI data exploration!