There's a Relational Database in Your Unix CLI

orf · on June 17, 2019

https://github.com/BurntSushi/xsv is all you need here, not a mess of sort commands.

   xsv join enrollments.csv courses.csv | xsv table

Uses an index as well.

da_chicken · on June 17, 2019

Yeah, xsv is great for one time tasks like this. If you're going to need to do a lot of queries, though, it's easier just to .mode csv .import in SQLite.

6thaccount2 · on June 17, 2019

Doesn't this convert all data to strings? I think that makes some queries wonky.

Or does it search through each row and see that one element out of 1000 integers was a float, so the whole column should be floats?

da_chicken · on June 17, 2019

The format is .import CSVFILE TABLENAME [0]. If TABLENAME already exists, it would use that schema. Otherwise it will all be TEXT data using the first row as column names. However, SQLite uses dynamic typing (they call it "type affinity") so it kind of doesn't matter.

I like SQLite a lot, but one of the funniest things about it is how they try to sell you that type affinity is a great thing [1] when really it's not at all what you want in 99% of RDBMS situations. There's a reason that everybody from MySQL and PostgreSQL to MS SQL Server, Oracle, and DB2 use static typing.

[0]: https://sqlite.org/cli.html#csv_import

[1]: https://www.sqlite.org/datatype3.html

6thaccount2 · on June 17, 2019

Will doing a where clause (ex: where column_x > 3) actually work if column_x is now all strings? I recall my query executing and returning incorrect results. I figured out it would work if I explicitly caste the where clause as ( where float(column_x) > 3) or something like that (this was a few years ago).

Edit: I assume most databases like Oracle (it's SQL packages looked bizzare to me until I started learning Ada and figured out it has the same syntax) use static typing for performance, query optimization, and error checking? Note that although I use SQL almost daily, I am no expert on the innards. SQLite has a different use case. It seems to be for embedded and ad-hoc analysis where dynamic typing makes more sense. This is just a guess though.

jitl · on June 17, 2019

All the data in a CSV is already strings.

smitty1e · on June 17, 2019

The mighty SQLite is well worth considering. http://www.sqlitetutorial.net/sqlite-import-csv/

rongenre · on June 17, 2019

While that's cool, you can do it with way more functionality by loading comma or tab-separated data into sqlite.

Maxious · on June 17, 2019

It even will use the CSV headers to make the table schema for you https://sqlite.org/cli.html#csv_import

ineedasername · on June 17, 2019

There's quite a bit more to a relational database than the equivalent of multiple tables that might share a "key", and some utility to join.

Further, if you're going to steer clear of easy options like SQLlite, there are more robust commands than "join". Awk, for example, can do much more than just join. Though if we're keeping this to what can be done via CLI, you still have tools like python. Between Pandas and, if you really must have SQL syntax, Pandasql, you can simulate an RDB even more simply.

gumby · on June 17, 2019

> Did you know that there’s a relational database hiding in your Unix shell?

No I did not know this because there isn't.

There likely happens to be a 'join' program installed on your computer and like any program it can be invoked from your shell.

shakna · on June 17, 2019

> There likely happens to be a 'join' program installed on your computer and like any program it can be invoked from your shell.

'join' is a Linux coreutil, so though it might not always be there for strictly UNIX, it will be for Linux.

Calling a coreutil part of the shell does make sense in some contexts, and I'd suggest when talking about what you can do with a cli, this is one of those times.

HillaryBriss · on June 17, 2019

there's also this list: https://en.wikipedia.org/wiki/List_of_Unix_commands which includes join (and, importantly, awk), as you pointed out

HillaryBriss · on June 17, 2019

IDK, what about indexes? what about atomic transactions which add/update/delete rows to/from/from multiple tables?

PaulHoule · on June 17, 2019

It's more accurate to say that you can implement the relational algebra via the Unix command line

https://en.wikipedia.org/wiki/Relational_algebra

HillaryBriss · on June 17, 2019

ok. yeah. pursuing that, i'm wondering if there's a simple way to convince someone that this is true.

one idea I saw somewhere was that because awk is part of a Turing complete language, it can be used to implement the relational algebra. though I don't know how to prove that myself, except by actual construction: implementing each part of the relational algebra definition.

looking over the definition of relational algebra, at first I wondered: what about a theta join? it's not directly built into the CLI "join" command. but then the article points out that if you can obtain a Cartesian product of two tables then you can use a Select to filter out the rows that do not meet the criteria of your theta join.

and then there's division, which I can only guess would depend heavily on some careful awk scripting.

and so it goes. IDK. it seems a bit laborious to prove it all by construction.