Shameless plug, but if you’re wanting to load raw data like CSV into SQL then it might be worth looking at my shell, murex: https://GitHub.com/lmorg/murex
You can import CSV, jsonlines, even human formatted tabulated data like the output from ‘ls -l’ and ‘ps’. And you can import from files directly, gzip archives, pipes, and shell variables.
Eg
ps aux | select count(*), user GROUP BY user ORDER BY 1
(FROM is dropped here because you’re importing from STDIN)
It uses sqlite3 under the hood and was written to solve exactly the problem of needing to quickly validate a bunch of CSV files. But it proved so useful that I ended up making it an official part of the shell.
This looks terrific. But why create a new shell, why not a `select` executable (or, given that `select` seems to be a bash keyword for synchronous I/O multiplexing, an alias such as `sel`)?
The idea looks perfect for so many of my use cases, but the implementation as a shell is a significant roadblock to anyone invested in their current shell. I'd bet that's a significant portion of the people who would find this useful.
That's a fair question. The shell has been around for > 8 years (and been my primary shell for around 5 of those years), so it made sense to integrate this feature into the shell because I could then leverage the existing code that transparently handles different data formats. The shell also sends metadata about the documents being piped, which means you effectively have typed pipes (like PowerShell, but backwards compatible with POSIX tools). So it means you can do more with the table afterwards too.
And since the shell is already good for handling structured data (like `jq` et al but not limited to JSON) it also made sense to allow inlining of SQL for when you're working with relational data.
An example of this last point is a recent problem I had at work where I wanted to see which users had 2FA enabled. Unfortunately this service provided a user list API (returning a complex JSON document) and a separate user metadata API (again JSON). The metadata didn't have names on it and the user list didn't have 2FA details on it. So I had two JSON documents that weren't tables but ostensibly held relational data. Using the shell I could convert the two JSON documents into two tables and then use the SQL-inlining `select` command to do the relational query. The whole thing was 3 lines of code, whereas the equivalent in Python / TypeScript / etc would have been much more verbose.
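To make that concrete, the relational part boils down to an ordinary JOIN once the two JSON documents have been flattened into tables. A rough sketch in plain SQLite SQL, with hypothetical table and column names (murex's inlined `select` wraps something like this for you):

  -- users: from the user list API; user_meta: from the metadata API
  SELECT u.name, m.two_factor_enabled
  FROM users u
  JOIN user_meta m ON m.user_id = u.id
  WHERE m.two_factor_enabled = 1;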
> That shell also sends metadata about the documents being piped which means
> you effectively have typed pipes (like Powershell but backwards compatible
> with POSIX tools). So it means you can do more with the table afterwards too.
It looks like such a shell would greatly benefit from a stdmeta file descriptor, as described here:
The idea is that such a file descriptor would output lines that are not part of the data yet are not errors. One example usage would be to output the headers of commands such as ps, and murex could use such stdmeta lines as column names.
That would be awesome, but I think what we need is something a little more programmatic, e.g. a standard way of passing Content-Type / MIME types with fields for metadata. A hypothetical example (with purely illustrative field names) might look like this:
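  {
    "content-type": "text/csv",
    "source": "ps",
    "columns": ["USER", "PID", "%CPU", "%MEM", "COMMAND"]
  }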
I'm not suggesting that data should be JSON formatted though, nor that it should have fields identical to the above. But a standard, agreed base would mean applications can understand what's in the pipe without developers having to write a thousand different case statements for each different kind of stdmeta content, while still allowing some enhanced functionality for common commands.
In an ideal world this would be an API rather than a file descriptor, but that would be backwards incompatible, and even the FD use case feels like a nice thing that is always going to be out of reach. :(
Thanks for sharing that by the way. I hadn't seen that post before.
Piggybacking.. What's a good resource to learn SQL from scratch?
Maybe out of ignorance, but in all the years I've programmed I've never had a situation where I said to myself "I need a database"! I either have data in maps/vectors in memory or, if I need tables, I'll use some table data structure (like in R/Matlab). So reading a CSV file into a SQL table seems a bit weird - though maybe some people have multi-gig CSV files? I guess at a certain size (when it can't fit in RAM?) you've got to turn to SQL - or maybe I should be using SQL for smaller/simpler problems as well - but I don't really know where to start.
Otherwise SQL is best learned by solving problems and googling when you’re stuck.
Most developers only need a few SQL constructs like SELECT, WHERE, WITH, JOIN, INSERT, CREATE and GROUP BY. Most devs learn and forget because they don’t use SQL very often. It’s too bad because the relational data structure (or more generally the table) is a super useful abstraction and often much more powerful than lists and dicts.
Data analysts need a few more like OVER (PARTITION BY) and maybe PIVOT/UNPIVOT.
Data engineers of course need many more, specifically those related to database object manipulation, plus stuff around performance like INDEX.
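To make the everyday-constructs point concrete, one query can exercise most of them; something like this, with illustrative table and column names:

  -- order counts and totals per recently created customer
  WITH recent_customers AS (
    SELECT id, name FROM customers WHERE created_at >= '2024-01-01'
  )
  SELECT c.name, COUNT(*) AS orders, SUM(o.amount) AS total
  FROM recent_customers c
  JOIN orders o ON o.customer_id = c.id
  GROUP BY c.name
  ORDER BY total DESC;

  -- and the analyst flavour: each order's share of its customer's total
  SELECT customer_id, amount,
         amount * 1.0 / SUM(amount) OVER (PARTITION BY customer_id) AS share
  FROM orders;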
Oh, the course starts today. Is there a way to view the lectures without enrolling?
Just more generally, do you have some mental model for when to move from using something like a dataframe to using SQL?
For instance people working in R rarely talk about using an SQL database - even when they have huge datasets. And I guess I don't really see when you should be thinking about making the switch over.
The only distinct thing I see about databases is that they always go hand in hand with web tech - and they're more focused on mutability (whereas if you're doing stats your data is usually immutable).
I’m not aware of other ways to get at the material though I’m sure Widom’s website at Stanford might have some.
It’s not just a size thing but also a concurrency thing. Data frames are like tiny local databases. SQL databases are like central databases that many parties read and write from. There’s no switchover point as such. A common use case is to use SQL to query a subset of data from a central database, and that subset becomes a local data frame that an R user can work off of.
Data frames and relational databases are very similar. When you’re using dplyr many of the operations conceptually map 1-to-1 to SQL. SQL is just a data frame manipulation DSL for databases. It’s all about filtering, aggregation and projections.
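As a quick illustration of that 1-to-1 mapping: a dplyr pipeline that filters to one destination, groups by carrier and summarises a count and an average delay corresponds almost word for word to the SQL below (names borrowed from a typical flights-style example dataset):

  SELECT carrier, COUNT(*) AS n, AVG(dep_delay) AS avg_delay
  FROM flights
  WHERE dest = 'LAX'
  GROUP BY carrier;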
Also, the point about mutability and web tech isn't really true. Most SQL operations do not mutate. And SQL is more associated with data analytics, data engineering and report creation from data warehouses than with web tech. The web tech use case came about because of PHP/MySQL and Django etc., but there SQL is usually used as an API language for the underlying data store rather than the full query engine that it is.
Right! The dplyr functionality seems to match 1-to-1 with what a cursory look at SQL told me - hence my confusion as to why they reinvented the wheel instead of using an industry standard.
The async table access is an interesting angle - assuming you're doing mutations, that probably places some constraints. But in the case of in-app databases like SQLite that's generally not a concern. From what you've said, at the end of the day, it sounds like the people using tables/dataframes are just too lazy to set up and use SQL? I don't really see any distinction between the two domains then.
I think it's more of a case of preference and ergonomics.
For local dataframes, it's still more ergonomic to use a language's own constructs than to adopt a totally separate DSL for data manipulation. SQL is a different "language" which has its own cognitive overheads. So things like dplyr, pandas, and LINQ (in C#) were invented as language-native ways to manipulate tabular data structures without the overhead of context-switching [1]. After all, learning SQL and being good at it on top of mastering one's own programming language takes something extra.
That said, it's possible and indeed sometimes preferable to use pure SQL for local data tables. I use duckdb to do complex manipulations on dataframes in Python because it's a lot faster than Pandas (due to some columnar optimizations) and because I can express certain things more succinctly in SQL than in Pandas (it's true -- I appreciate Pandas for the achievement that it is, but its syntax can be verbose with all its .apply() and df[df["col"] == "abc"] incantations).
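For example, the sort of filter-and-aggregate that takes a chain of .loc/.groupby/.agg calls in pandas is a short query in duckdb's SQL dialect; duckdb can usually resolve an in-scope DataFrame by name, so with a hypothetical DataFrame called df and made-up columns:

  SELECT category, COUNT(*) AS n, AVG(price) AS avg_price
  FROM df                      -- duckdb looks up the pandas DataFrame 'df' from the calling scope
  WHERE col = 'abc'
  GROUP BY category
  ORDER BY n DESC;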
Using SQL via sqlite to manipulate local tables confers similar benefits.
[1] To be fair, pandas, dplyr and LINQ -- by virtue of being built into their respective languages -- can do things that SQL can't (or can't do easily), because they're not constrained by some of SQL's design.
I published a new introductory SQL tutorial last week, which I hope to keep improving and expanding over time. Would very much appreciate any feedback on what I've got so far!
https://selectstarsql.com/ is an interactive book which aims to be the best place on the internet for learning SQL. It helps you learn by running queries against a real-world dataset to complete projects of consequence. It is free of charge, free of ads and doesn't require registration or downloads.
Two major reasons I use SQLite are the built-in full text search (https://www.sqlite.org/fts5.html) and fuzzy matching (https://www.sqlite.org/spellfix1.html). These are two things you often want to do with your data frames, but other solutions are usually pretty slow whereas SQLite is fast.
Other reasons: every language can use a SQLite db, you can add indexes, you can have multiple tables in one file, you can store large amounts of data on disk, it has built-in JSON queries, there's a great ecosystem of SQLite libraries, and support for it is built into Python.
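For anyone who hasn't tried the first two: FTS5 is only a couple of statements (table and column names below are made up; spellfix1 has to be compiled and loaded as a run-time extension), and the JSON support is the json1 family of functions:

  CREATE VIRTUAL TABLE notes USING fts5(title, body);
  INSERT INTO notes SELECT title, body FROM raw_notes;   -- raw_notes is a hypothetical source table
  SELECT title FROM notes WHERE notes MATCH 'csv' ORDER BY rank;

  -- json1, against a hypothetical events table with a JSON payload column
  SELECT json_extract(payload, '$.user.name') FROM events;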
It's not really reinventing the wheel; I make use of SQL, pandas, PySpark, Spark SQL, data.table and BigQuery - it all depends on the use case.
Dataframe abstractions are really easy to understand, so I have worked with journalists and biologists who are able to work productively with pandas or R. Transforming and analyzing data is often easier to write with PySpark, while Spark SQL makes it easy to access and extract data from some data sources.
It all really depends on who your customer is, who your coworkers are, how much data you have, how much time you have to accomplish your task, etc.
I found that it depends on where your interest lies.
If you just want to learn practical SQL then I have historically found the Celko books (https://en.wikipedia.org/wiki/Joe_Celko) as well as, more recently, the No Starch Press books (https://nostarch.com/practical-sql-2nd-edition) very well written.
Everyone learns differently however. I have a colleague who really enjoyed the O’Reilly books specific to SQL from a data analytics perspective. (Apologies, no link for that)
You might frequently find the No Starch and O’Reilly books on Humble Bundle (https://www.humblebundle.com/books?hmb_source=navbar) if that is available in your location. There’s often loads of overlap between bundles. I’m sure I’ve bought the Python book about 5 times so far, but I don’t mind as it’s great value.
If you want to learn about database theory, however, as well as the practicalities of SQL, then I found that most of the resources I used when I did this at uni were online. The book we used was the Connolly/Begg book (https://www.pearson.com/us/higher-education/program/Connolly...)
I don’t think this effectively answered the “why” of your question, however. My guess is that SQL is a very well established domain language and when it comes to data normalised across many tuples it’s the standard for manipulation.
I don’t think that dataset size is the primary reason to manipulate data via SQL; I think that the moment your programme starts to need more than a single flat file data source, you naturally start to think about normalisation, indexes etc for performance and sanity.
If I may, however: I am coming at this from the other way, in that I am far more often manipulating CSV and Excel data and would like some good resources on how to quickly load that into a dataframe in pandas or similar vs using SQL. (If I’m responding to a thread hijack I may as well go all in and totally derail it.)
The SQLite shell supports importing and writing CSVs pretty easily - here are some snippets I reach for often enough to copy to a blog post: https://www.bbkane.com/blog/sqlite3-snippets/
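For anyone who doesn't want to click through, the core of it is a couple of sqlite3 dot-commands (file and table names here are just examples). If the target table doesn't already exist, .import creates it using the CSV header row for column names, and .once sends the next query's output to a file:

  .mode csv
  .headers on
  .import data.csv mytable
  SELECT COUNT(*) FROM mytable;
  .once out.csv
  SELECT * FROM mytable WHERE foo = 'bar';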
So uhh, when are you gunna use a dataframe and when will you stick to SQL?
My understanding is that SQL can sorta do a very powerful (and standardized/fast) subset of what a dataframe can
Not necessarily SQL, but to get a firm understanding of (relational) databases, I like to recommend "The Manga Guide to Databases". Yes, it's a comic. Yes, not everyone reacts well to the story of a teenage faux-fantasy princess being helped by the database fairy to get her fruit export business to run efficiently. But it teaches you the basic concepts you'll be using 95% of the time way better than any documentation I've ever read.
I don't want to advertise and am not affiliated in any way, but for a first cursory understanding I do recommend the 'Mode Analytics' tutorial [0] to juniors interested in starting SQL.
Never needed a database, huh? You and I live in different worlds. Every web app I've written has needed a database - you need somewhere to put user data and transaction logs.
I don't think it's that uncommon. I've done very little database work over a forty year career. Web development is a subset of the industry, there is a whole lot of other stuff going on.
All my data are in TSV files, separated by tabs instead of commas, and strings are not quoted. It makes parsing or manually editing data so much easier. You only need to escape about 5 characters (\0 \n \r \\ \t) and even binary data will fit in there.
I was checking whether SQLite's CSV support allows a custom separator and unquoted strings, so this would work. Found it doesn't.
This is gonna blow your mind: there are already ASCII (and Unicode) characters for exactly that - the file, group, record and unit separators (0x1C-0x1F).[1],[2] They have been there since forever; the inception of the ASCII standard. So it turns out everyone has been (ab)using commas, tabs, semi-colons, newlines, carriage returns, carriage-return-and-newlines, single quotes, double quotes, quotes around every field, quotes only around text fields, etc, etc... For fifty years -- totally for nothing.
Use the data-delimiter characters actually defined in the ASCII / Unicode specifications to, you know, delimit your frigging data, and you won't have to bother about whether to quote or not to quote, nor about painstakingly "escaping" (and un-escaping) whatever characters you're (mis)using for delimiters.
Sure as Hell blew away at least my mind when I came across it.
(Not that I got to implement it at work, of course.)
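Amusingly, the sqlite3 shell can already speak this dialect: its "ascii" output mode delimits columns with the unit separator (0x1F) and rows with the record separator (0x1E), so something like the following (table name is just an example) writes separator-delimited output with no quoting or escaping at all:

  .mode ascii
  .once out.dat
  SELECT * FROM mytable;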
Interesting - I think that makes it the first mainstream real-world CSV tool I've used that is truly comma-separated values then, rather than delimiter-separated-defaulting-to-comma.
Another cool thing about SQLite is that the virtual tables seen here can be extended through your own custom code.
You can then run SQL queries that in the end call your own data functions to generate data. I had a project with several diverse data files (each with its own set of functions to iterate over it, but no useful querying) where wrapping each data file in a SQL virtual table allowed me to use SQL to join them up and query them.
Postgres has some support for this (you run your SQL code but it really talks to some other service, or another postgres database) through foreign data wrappers. There's some Python library to also make adding that easy.
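For reference, the SQL side of a foreign data wrapper looks roughly like this (server, credential and table names are invented; postgres_fdw ships with Postgres):

  CREATE EXTENSION postgres_fdw;
  CREATE SERVER other_db FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'other-host', dbname 'reports');
  CREATE USER MAPPING FOR CURRENT_USER SERVER other_db
    OPTIONS (user 'reporter', password 'secret');
  CREATE FOREIGN TABLE remote_orders (id int, amount numeric)
    SERVER other_db OPTIONS (table_name 'orders');
  -- now queryable (and joinable against local tables) like any other table
  SELECT COUNT(*) FROM remote_orders;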
Interestingly, a few projects do it in reverse: you talk the Postgres protocol (using a library or the psql client) to your database server, but it's not Postgres - it's something else entirely (CockroachDB does this).
I like the idea of the entire world being just SQL, but you have to be careful with those wrappers and expose any indexing you might have, otherwise it's lots of expensive table scans.
>What about this is "virtual" and not a regular table?
The article's first sentence describes what makes it "virtual". Note the "as if" in the fragment:
>, and returns that content _as_ _if_ it were rows and columns of an SQL table.
In other words, instead of creating a real, empty SQLite table and then doing bulk INSERTs or ".import" from the csv file, you treat the existing .csv on disk _as_ _if_ you did that. The csv is treated as a "virtual" table you can execute SQL queries against.
>Is that an important difference? Who cares how the table is created.
You seem to be argumentative instead of trying to understand why they called it "virtual".
I'll try with another example. Let's pretend you have a 10 gigabyte csv file on disk:
Using a real SQLite table:
CREATE TABLE t1 (...); -- an empty db file of 8192 bytes
.import thefile.csv t1 -- total disk space consumed now 20 GB because .db file is now +10 GB larger
SELECT * FROM t1; -- SQLite reads from the .db file and not the .csv file
Using a "virtual" SQLite table:
CREATE VIRTUAL TABLE temp.t1 USING csv(filename='thefile.csv'); -- disk space stays the same
SELECT * FROM t1; -- SQLite reads directly from the csv file on disk
>Is it a table or not?
At the risk of stating the obvious, the extra "VIRTUAL" keyword in between "CREATE" and "TABLE" thus makes a virtual table.
It is a "table" in the sense that it "acts like a table" via an interface for SQL queries. However, it's not a real table in the sense that there are no data blocks within the .db file representing that table. It's virtual.
I'm not understanding what annoys you about their description.
I think you might be getting mixed up between the query language and the implementation. Virtual tables define an interface that you can use to write an extension allowing you to query resources using sql semantics. The underlying representation of those resources doesn't change.
So yeah the difference matters depending on your use. I don't think you can add indexes for example, unless they are specifically supported by the underlying resource, or implemented in the virtual table extension you're using.
Note that the interpretation of CSV files in the wild is all over the map; that RFC is not a very high authority on how CSV processing actually works in the software you encounter.
Unsurprisingly there's still a lot of data flowing around in these pure-text size-header-plus-contents structured files; I didn't know SQLite was in that camp as well.