Practical SQL for Data Analysis (hakibenita.com)
602 points by steve-chavez on May 3, 2021 | 191 comments



This is an excellent example of what I call the Copy-Object-Copy effect. It's particularly apparent in frameworks with ORMs like Django. In Pandas' case, devs will do 'SELECT *' and use pandas as a sort of pseudo-ORM.

You run a query to get your data as a bunch of objects, but you're copying the data over the db connection from the postgres wire protocol into Python objects in memory, which are typically garbage collected at the end of the transaction. Then you copy the result into a JSON buffer that is also garbage collected, and finally send the result to the browser, which has been waiting this whole time.

I regularly see Django apps require several gigs of RAM per worker and when you look at the queries, it's just a bunch of poorly rendered SELECTs. It's grotesque.

Contrast PostgREST: 100 megabytes of RAM per worker. I've seen Django DRF workers require 50x the memory of PostgREST for the same REST interface and the same result, with PostgREST being much faster. Since the database is generating the json, it can do so immediately upon generating the first result row, which is then streamed to the browser. No double buffering.
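A minimal sketch of what "the database is generating the json" can look like, assuming Postgres and psycopg2 (the employees table and columns here are made up):

  import psycopg2

  conn = psycopg2.connect("dbname=app")  # hypothetical database
  with conn, conn.cursor() as cur:
      # Postgres assembles the whole JSON document; Python never builds
      # per-row objects, it just relays one text value to the client.
      cur.execute("""
          SELECT coalesce(json_agg(t), '[]')::text
          FROM (
              SELECT id, name, department
              FROM employees
              WHERE department = %s
          ) AS t
      """, ("Sales",))
      body = cur.fetchone()[0]  # a ready-to-send JSON string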

Unlike Django, it's not that I don't like Pandas, it's that people way overuse it for SQL tasks, as this article points out. I've seen pandas scripts that could have been a simpler psql script. If psql can't do what you want, resist the temptation to 'SELECT *' into a data frame; instead, break the problem up into stages where you get the database to do the maximum work before it reaches the data frame.


Within the VA hospital system, the data admins often got onto their "soapboxes" to dress down the 1400+ data analysts for writing inefficient SQL queries. If you need 6 million patients, joining data from 15 tables and gathering 250 variables, a beginning SQL user can easily take 15-20 hours on a pull that could run in 1-2 if they do some up-front filtering and sub-grouping within SQL. If you already know you'll throw out NAs on certain variables, if you need a certain date or age range, or even the way you format these filters: these all lead to important savings. And this keeps R from running out of memory, which would often happen if you fed it way too much data.
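As a rough sketch of the difference (pandas plus SQLAlchemy; the patient/lab tables and columns here are invented):

  import pandas as pd
  import sqlalchemy as sa

  engine = sa.create_engine("postgresql:///warehouse")  # hypothetical database

  # Wasteful: every column for every patient, filtered later in R/pandas.
  #   df = pd.read_sql("SELECT * FROM labs", engine)

  # Better: push the date range, NULL handling and column selection into SQL.
  df = pd.read_sql(
      """
      SELECT p.patient_id, p.birth_date, l.test_code, l.result_value
      FROM patients p
      JOIN labs l ON l.patient_id = p.patient_id
      WHERE l.result_value IS NOT NULL
        AND l.test_date >= DATE '2020-01-01'
        AND l.test_date <  DATE '2021-01-01'
      """,
      engine,
  )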

Within a system of 1400 analysts, it makes a big difference if everyone's taking 2X or 4X the time pulling that they could be. Then, even the efficiently written pulls get slowed down, so you have to run things overnight...and what if you had an error on that overnight pull? Suffice to say, it'd have been much simpler if people wrote solid SQL from the start.


I don't disagree with writing solid SQL. I would go so far as to say some things (most) need to be in stored procedures that are reviewed by competent people. But, some folks don't think about usage sometimes.

This is one of those things I just don't get about folks setting up their databases. If you have a rather large dataset that keeps building via daily transactions, then it's time to recognize you really have some basic distinct scenarios and to plan for them.

The most common is adding or querying data about a single entity. Most application developers really only deal with this scenario since that is what most applications care about and how you get your transactional data. Basic database knowledge gets most people to do this ok with proper primary and secondary keys.

Next up is a simple rule: "if you put a state on an entity, expect someone to need to know all the entities with this state." This is a killer for application developers for some reason. It actually requires some database knowledge to set up correctly so it stays performant. If the data analysts have problems with those queries, then it's time to get the DBA to fix the damn schema and write some stored procedures for the app team.
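One common fix for the "all entities with this state" query is a (partial) index on the state column; a hedged sketch in Postgres terms, with invented table names:

  import sqlalchemy as sa

  engine = sa.create_engine("postgresql:///app")  # hypothetical database

  with engine.begin() as conn:
      # A partial index keeps "give me everything still open" cheap without
      # indexing the (much larger) set of finished rows.
      conn.execute(sa.text(
          "CREATE INDEX IF NOT EXISTS tickets_open_idx "
          "ON tickets (created_at) WHERE status = 'open'"
      ))

  # SELECT ticket_id, created_at FROM tickets WHERE status = 'open';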

At some point, you will need to do actual reporting, excuse me, business intelligence. You really should have some process that takes the transactional data and puts it into a form where the queries of data analysts can take place. In the old days that would be something to load up Red Brick or some equivalent. Transactional systems make horrid reporting systems. Running those types of queries on the same database as the transactional system is currently trying to work is just a bad idea.

Of course, if you are buying something like IBM DB2 EEE spreading queries against a room of 40 POWER servers, then ignore the above. IBM will fix it for you.


At its simplest it's OLTP vs OLAP. Separate the data entry/transactional side of things from the reporting part. Make it efficient for data analysts to do their jobs.


But sometimes, your line-of-business application does analytics queries. Which means you need your app developers to understand how to do OLAP, and you also need a schema design that can run arbitrary OLAP queries within a few orders of magnitude of OLTP speeds (e.g. <10s.)


I can't help thinking that with better processes and tools those 1400 analysts could instead be 50 analysts? (or even 5?)

And that building an aggregation ETL pipeline, maybe inspired by this post, could be the solution?


VA is a massive org, this is not a particularly large number of analysts for a healthcare system of this size (or honestly any complex org).


Yes, but the answer that it is a massive organization does not satisfy my curiosity regarding what they are actually doing.

I come from a small country, in total comparable to the 6 million veterans mentioned here.

1400 analysts is just a lot, and I wouldn't imagine, for example, that our national healthcare service could ever employ that many analysts?

Hm, or would they? It would make a pretty big dent in the total amount of persons in that field in our country.


6 million Veterans are just the subgroup from one of the studies I did. In reality, the VA system serves 15-20 million patients, given there are 17.4 million Vets, some who use private health care, and some whose families also use VA.

The reason there are 1400 analysts: research studies each require one or two analysts. At this very moment, there are thousands of research studies taking place in the US medical system. Without this number of analysts, you'd have to completely revamp the system, killing all current projects and all current code, and creating a HUGE HUGE headache for everyone, not to mention laying off 1000+ people through a system in which it is NOT easy to lay off individuals.

As a matter of fact, they want to transition to a new data infrastructure at the VA, but it's been delayed many times and the logistics have been very vague.


Is it even feasible to revamp the system efficiently?

We really should commission 2 more analysts to figure this out...


VA -> government -> bloat


I understand your point generally but I don't understand this example. If you need data from 15 tables, you need to do those joins, regardless of prefiltering or subgrouping, right?


Disclaimer: It's been a while, but my very first job out of uni was dealing with large queries like that for reporting and import/export out of our OLTP database directly.

15 tables would've been not very common for us, but something between 5 and 10 was normal. Many of those tables would've had millions upon millions of rows (while some were simple 'key tables' with only hundreds or a few thousand records)

The one thing that I learned really fast is that yes, "prefiltering/subgrouping", as the OP calls it, is very important. Cutting down on the number of rows that your query "starts out with" cuts down on the amount of data that needs to be dealt with in the rest of the query. This was for an old Sybase ASE based system. IIRC, ASE could only automatically optimize this across 4 clauses (I might misremember the number, and I left before they upgraded to the newer version with a better optimizer), so the ordering of your join clauses was important. If the filtering that cut down on the amount of data needed from other tables came first, your query would run way faster than if the filters were way down with the rest of the joins.

Just think about it: if you start out by getting 6 million patient records and then collect 6 million records from the next table for a join, and so on and so forth, that's way more data to read and churn through than if you can 'start at the other end', so to speak, and whittle it down to say 50,000 records that then need to be looked up in the patient table.


Well, yes but. Why do you need 15 tables for analysis queries and why isn't someone rolling those tables into something a bit easier with some backend process.


> why isn't someone rolling those tables into something a bit easier with some backend process.

You've identified why we're all going to have jobs in 100 years.

Automation sounds great. It's exponentially augmenting to some users in a defined user space.

Until you get to someone like me, who looks at the production structure and goes: "this is wholly insufficient for what I need to build, but it has good bones, so I'm going to strip it out and rebuild it for my use case, and only my use case, because to wait for a team to prioritize it according to an arcane schedule will push my team deadlines exceedingly far."

This is why you don't roll everything into backend processes. Companies set up for production (high automation value ROI) and analytics (high labor value ROI) and have a hard time serving the mid tail. EVERYTHING in either direction works against the mid-tail -- security policies, data access policies, software approvals, you name it.

People, policy, and technology. These are the three pillars. If your org isn't performing the way it should, then by golly work at one of these and remember that technology is only one of three.


Three pillars where you can only choose two to be perfect. Just like databases and the CAP theorem.


Technology is the easiest to adjust, ironically.


This is where I think the original data admins were deluding themselves. Expecting 1,400 analysts to write better code is a really non-trivial problem, but easy to proclaim.

An actual solution is creating pre-joined tables and having processes ("Hey your query took forever, have you considered using X?") or connectors (".getPrejoinedPatientTable()") that make sure those tables are being used in practice.


> why isn't someone rolling those tables into something a bit easier with some backend process.

To put it in more concrete terms(plain SQL): the tables could be aggregated on a set returning function or a view.


Yes, in these situations materialized views with indexes are generally the correct answer.
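A minimal sketch of that approach, assuming Postgres (the tables and columns are invented):

  import sqlalchemy as sa

  engine = sa.create_engine("postgresql:///reporting")  # hypothetical database

  with engine.begin() as conn:
      # Pre-join the hot tables once, then index for how analysts query them.
      conn.execute(sa.text("""
          CREATE MATERIALIZED VIEW IF NOT EXISTS patient_encounters AS
          SELECT p.patient_id, p.birth_date,
                 e.encounter_id, e.encounter_date, e.diagnosis_code
          FROM patients p
          JOIN encounters e ON e.patient_id = p.patient_id
      """))
      conn.execute(sa.text(
          "CREATE INDEX IF NOT EXISTS patient_encounters_date_idx "
          "ON patient_encounters (encounter_date)"
      ))
      # Refreshed on a schedule, e.g. nightly:
      #   REFRESH MATERIALIZED VIEW patient_encounters;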


May I introduce: Protomyth, meet DW and ETL.


How many TBs of data are we really talking here, if it's just 6M rows? Surely this processing could easily be done on a single machine.


It doesn't say it is just 6m rows. It is 6m patients, which only hints that one of the dimension tables has 6m rows. The facts gathered on those patients might be significantly larger. Also, my experience says that if you have hundreds of queries running simultaneously, it is not the volume of data that is the bottleneck. Depending on the system it can be anything, starting from acquiring a lock on a record or initiating a transaction.


True, the number of events / records is probably significantly larger than 6M. I still have a hard time believing any reasonably modern data warehouse would struggle with queries on this dataset.

If you've got 1400 analysts, I hope you've exported your data from an OLTP database to an OLAP database. Those generally are much easier to scale to many users.


I work for a large healthcare org and from personal experience these things can get large. 6m patients is prob 20m encounters, each encounter may have 10 different kinds of meta data (each one it’s own table) and each kind has 10-100 rows. So really you are doing joins between 10 tables each with ~1-10 billion rows. It actually does slow down even running on expensive hardware.


Ah yes that does sound painful. I'd generally recommend that data be denormalized in the warehouse so the joins don't need to be done at query time. That can make the queries more complex, however.


Hi, I saw your comment on the wechat page, I was wondering if you have the account and can verify me?


> it's that people way overuse it for SQL tasks as this article points out

I'm very confused by this. I've used pandas for over a decade, and in most cases it's a massive time saver. I can do a single query to bring data into local memory in a Jupyter notebook, and from there re-use that memory across hundreds or more executions of pandas functions to further refine an analysis or whatever task I'm up to.

Your "copy-object-copy" is not relevant in data analysis use cases, and exists in pretty much any system that pulls data from an external service be it SQL or not.


I know a team that would read multi-GB CSVs from an FTP site into an orchestrator server using Pandas, write to local CSV on the orchestrator, validate the data after the write, reload into pandas, painfully and manually check every data type, then use a pandas connector to a database.

Every step along the way here is wrong. The database came with SFTP connectors. The orchestrator should have simply told the database server to pull from the SFTP site into a landing table. Even if not, Pandas has datatypes you can specify on ingestion, which is heaven if your CSV or other files don't record data types yet are consistent in structure. Further, if you have custom validation you're applying (not a bad thing!) you likely rarely even need pandas; the native csv library (or xlrd/openpyxl for Excel) is fine.
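A small sketch of declaring the types at ingestion instead of letting pandas guess (file and column names invented):

  import pandas as pd

  # Explicit schema: no sampling-based type inference, no silent coercions.
  df = pd.read_csv(
      "daily_feed.csv",
      dtype={"account_id": "string", "amount_cents": "int64", "currency": "category"},
      parse_dates=["posted_at"],
  )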

Ultimately, it is a training issue. The toolspace is too complex, with too many buzzwords to describe simple things, that people get lost in the whirlwind.


Using pandas to infer datatypes leads to nasty coercions. Also, pandas --> objects --> chunk-based reading yields the same performance as any other library you mentioned.


Absolutely. On top of that, the Pandas object type (most people's default for strings) can house any scalar type, including ints and floats. This has hit a LOT of immature adoption of pyspark, for example, where the datatypes get screwy.


Pandas is great when you're working on the data manually because it lets you interleave python code and "queries", but it's strictly worse than a plain SQL statement if you're writing, say, a REST service that needs to output some data.

SQL executed by the database is orders of magnitude more efficient, way more expressive, and doesn't require you to memorize pandas' absurd API.

And I say this as someone who is by no means a SQL wizard, and fully acknowledging all of SQL's blemishes.


> a REST service

This is a post about data analysis, and everyone wants to point out that pandas isn't good at real-time service requests.

> SQL executed by the database is orders of magnitude more efficient

Compared to pandas? No, it's not. Once I have all the data I need locally, it's MUCH faster to use pandas locally than to re-issue queries to a remote database.


> Once I have all the data I need locally, it's MUCH faster to use pandas locally than to re-issue queries to a remote database.

Both of you are right.

Sure, if you need to grab a huge chunk of the entire database and then do tons of processing on every row that SQL simply cannot do, then you're right.

But when most people think SQL and database, they're thinking of grabbing a tiny fraction of rows, sped up by many orders of magnitude because it utilizes indexes, and doing all calculations/aggregations/joins server-side. Where it absolutely is going to be orders of magnitude more efficient.

Traditional database client API's often aren't going to be particularly performant in your case, because they're usually designed to read and store the entire result of a query in-memory before you can access it. If you're lucky you can enable streaming of results that bypasses this. Other times you'll be far better off accessing the data via some kind of command-line table export tool and streaming its output directly into your program.
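A minimal sketch of that kind of result streaming, assuming Postgres and psycopg2 (the table and the processing function are made up):

  import psycopg2

  conn = psycopg2.connect("dbname=warehouse")  # hypothetical database
  with conn, conn.cursor(name="stream") as cur:  # named => server-side cursor
      cur.itersize = 10_000  # rows pulled from the server per round trip
      cur.execute("SELECT user_id, event_type, event_ts FROM events")
      for row in cur:        # iterate without holding the full result in memory
          handle(row)        # hypothetical downstream processing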


I agree, they're both right.

I've been doing data science since around 2008 and it's a balance between local needs and repeated analysis and an often more efficient one off query. Sure the SQL optimizer is going to read off the index for a count(*), but it doesn't really help if I need all the rows locally for data mining anyway. The counts need to line up! So I'll take the snapshot of the data locally for the one off analysis and call it a day. If I need this type of report to be run nightly, it will be off of the data warehouse infrastructure not the production DB server.

Shrug. These things take time and experience to fully internalize and appreciate.


> But when most people think SQL and database

We're not talking about most people, but data analysts. If we're just doing simple sum/count/average aggregated by a column or two with some basic predicates, SQL and an RDBMS are your best friend. Always better to keep compute + storage as close together as possible.

> Traditional database client API's

I'm not sure what point you're trying to make with this. Most data analysts are not working against streaming data, but performing one-off analyses for business counterparts that can be as simple as "X,Y,Z by A,B,C" reports to "which of three marketing treatments for our snail mail catalogue was most effective at converting customers".


I'm not talking about streaming live data, I'm talking about streaming query results if you re-read my comment.

If you're reading many MB's or GB's of data from a database, it's a lot more performant to stream it from the database directly into your local data structure, rather than rely on default processing of query results as a single chunk which will be a lot worse for memory and speed.


Streaming aggregation relies on knowing the problem beforehand, so it's pointless in the context of the day-to-day work of analysts where most of the project is spent figuring out the right questions to ask.

If I need to stream data because the size of data is a constraint, I'm either working on the 0.1% of projects that require it or on an analytical product rather than an analytical project.


> Compared to pandas? No, it's not.

I'm not saying you shouldn't use pandas, it depends on the size of the data. I'm working right now on a project where a SELECT * of the entire fact table would be a couple hundred gigabytes.

The flow is SQL -> pandas -> manipulation, and as always in pipelines like those, the most work you can do at the earliest stage, the better.
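A hedged sketch of that flow, pushing the heavy aggregation into SQL and keeping only the small result in pandas (the warehouse, table and columns are invented):

  import pandas as pd
  import sqlalchemy as sa

  engine = sa.create_engine("postgresql:///warehouse")  # hypothetical database

  # The fact table is hundreds of GB; aggregate it in the database and only
  # bring the (small) result into pandas for the final manipulation.
  daily = pd.read_sql(
      """
      SELECT order_date, store_id,
             count(*)    AS orders,
             sum(amount) AS revenue
      FROM fact_orders
      WHERE order_date >= DATE '2021-01-01'
      GROUP BY order_date, store_id
      """,
      engine,
  )
  wide = daily.pivot(index="order_date", columns="store_id", values="revenue")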


Yeah, speaking as a data person, the SQL argument is correct. Python/R are much, much, much slower for this kind of work.

OTOH, SQL is super limiting for a lot of data analysis tasks and you'll inevitably need the data in weird forms that require lots of munging.

Personally, I'm a big fan of using SQL/Airflow/whatever to generate whatever data I'll need all the time at a high level of granularity (user/action etc), and then just run a (very quick) SQL query to get whatever you need into your analytics environment.

Gives you the best of both worlds, IME.


> OTOH, SQL is super limiting for a lot of data analysis tasks and you'll inevitably need the data in weird forms that require lots of munging.

OTGH, to some (I suspect often rather large) extent, that's because most people (I suspect including many "data scientists") are pretty bad at doing their data munging in SQL.


Perhaps (although this tends to be a core skill for any experienced DS - I 100% would not hire anyone without basic SQL, as it's just setting them up for failure). SQL is the only tool I've used everywhere I worked.

But, SQL is bad at lots of stuff. For example, if you need to compare id1 and id2 (with some kind of locality sensitive hash or something). This requires a full join, which is prohibitive in any large data environment.

And honestly, writing 50 case when statements when I could write a function in R or Python is not my idea of a good time.

I love SQL, and do a lot with it, but there are definitely cases where it's the wrong tool. As an example, retention analyses are really annoying to do in SQL, but pretty easy in R/Python (to be fair, the article actually provides a solution here, but it's not standard).


This is a great point and way of looking at it: pandas and SQL live at opposite ends of the maturation level of data wrangling pipelines.


The point is that if you only need to join and aggregate data it's easier and more efficient to do it directly in SQL. Also in production or when your database doesn't fit in memory the idea is to first use SQL to generate an optimized view of your data from the database before further analysis and transformation.


> The point is that if you only need to join and aggregate data it's easier and more efficient to do it directly in SQL

citation needed? I've seen plenty of cases where it would have taken the db ages to do something that pandas does fast, and I don't consider pandas to be particularly fast.


TFA


parent post says "devs do SELECT *" and ...

relational databases can have a lot more data in one table, or related tables, than one query needs. It is often very wasteful of RAM to get ALL and filter again


Not for data analysis, it's a pointless constraint unless you're having issues.

Most data analysis isn't against datasets larger than memory, and I'd rather have more data than I need to spend time waiting to bring it all local again because I forgot a few columns that turn out to be useful later on.


I totally get what you're saying because most of my datasets also used to be smaller than my laptop's memory.

Used to.

I'm in the process of moving more and more processing "upstream" from local pandas to the DBMS and it's already proving to be a bit of a superpower. I regret not learning it earlier, but the second-best time is now.


True, but in that kind of exploratory environment, you'd normally reduce data load and speed up iteration by using sampling, which is super easy in SQL.

> Not for data analysis, it's a pointless constraint unless you're having issues.

This will always happen, as time spent on the project scales up (unless you use some kind of autoscaling magic I suppose).


If your query does "*" it gets all the columns in all the tables. Often, when all the columns you need are on the index, the optimizer never visits the actual table (I remember the term covered index). "*" basically screws this up. "SELECT *" should never be used in anything in an actual production environment.
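A sketch of the covering-index point in Postgres terms (the orders table and index are invented):

  import sqlalchemy as sa

  engine = sa.create_engine("postgresql:///app")  # hypothetical database

  with engine.begin() as conn:
      # The index carries every column the hot query touches...
      conn.execute(sa.text(
          "CREATE INDEX IF NOT EXISTS orders_status_idx "
          "ON orders (status) INCLUDE (order_id, total)"
      ))

  # ...so this can be answered by an index-only scan:
  #   SELECT order_id, total FROM orders WHERE status = 'open';
  # whereas SELECT * forces a visit to the table for the remaining columns.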


Covering index is right.

Select * has a few valid use cases.

If you're selecting from a cte or subquery then writing the same column list twice is redundant and increases complexity/risk of mistakes.

* is also harmless for count and exists.


Data analysis workloads more often run into problems solved by partitioning rather than indexes.


Yes and no - I trained on "larger than RAM datasets" intentionally, and subsequently took on projects that require "larger than RAM datasets". Two things happened along the way: companies led by Google invented and deployed datasets previously impossible, e.g. BigTable, and secondly the equipment I worked on went from modest to mid-sized RAM (hard to define that). Granted, lots of very useful (and common?) tasks are not "larger than RAM datasets", and those then have very different characteristics.. which is starting to sound like "Excel problems look like this, they always fit in RAM, they need partitions not indexes" and the like..

There is a blind-man-and-the-Elephant drift here, which is not terrible, but might need calling out to improve..


> companies led by Google invented and deployed datasets previously impossible e.g. BigTable

BigTable, Redshift, Snowflake and every other "big data" storage system rely heavily on partitioning to achieve the scales they're able to achieve. Some of these systems don't even support indexes.


I got the impression the problem is not so much the "SELECT *" as the missing "WHERE" clause.


Interesting. I'm guessing the extra layer may be needed for some manipulations of data at the application layer, but perhaps that is almost always avoidable?

Have you ever tried Hasura? I'm thinking of giving it a go in a future project.


It's only avoidable to the degree that your software architecture and company architecture allow you to make the more efficient decision to move the logic upstream.

The cases where this pattern emerges are invariably because someone with insufficient access to the DB and its configuration bites the bullet and instead builds whatever they need in the environment they have complete access to.

I've seen similar problems lead to dead cycles in processor design, where the impact is critical. These problems aren't software technology problems. They're people coordination problems.


As a Hasura user, I highly recommend it for OLTP workloads. And for the specific subject at hand (memory usage) it is fantastic. Its (in-database) serialization speed just completely runs circles around anything we could do before in Ruby.


Sounds correct to me. Although I've heard Ruby is slow, I suspect json serialization is done in C? At least it is in Python, which I'm more familiar with. Simply cutting out the extra layers is where the big savings likely are.


If/as you give it a go in the future...do let us know what is working and what could be better. We do, in fact, care.

<disclaimer: I work at Hasura>


Well, it's true that bringing row(s) of data from the database to Python and then serializing it for any reason has extra overhead.

But in Django's case, the same arguments can be made as for Pandas. Django is a big complex framework and an application using it might consume more memory due to countless number of other reasons. There are also best practices to use Django ORM.

But to say if a given Django instance consumes more memory it is only because of "Copy-Object-Copy effect"... I don't think so.


Is it possible to do any kind of version control with PostgREST? I have no doubt that raw SQL offers huge performance advantages over my ORM, but I'm pretty fond of the ability to roll back to a previous state when I push a bad change to production. Performance is just one of a number of tradeoffs that I consider when I'm choosing tools.


It is customary to check in all DB assets, table creation scripts, queries, stored procedures, etc. No different than the rest of the development process. If something isn't being versioned, there is a huge problem.


I would appreciate pointers to any other low/no-copy db -> browser technologies.

I was thinking it would be cool to use parts of JavaScript db libraries to parse query results passed straight through in (compressed?) db wire protocol format via WebRTC.


> If psql can't do what you want, resist the temptation to 'SELECT *' into a data frame and break the problem up into stages where you get the database to do the maximum work before it gets to the data frame.

Why are we introducing an additional tool (psql) here unnecessarily? Sure, if it can be a simpler postgres query, that can be useful, but introducing psql and psql scripting into a workflow that is still going to use python is...utterly wasteful. And even simplifying by optimizing the use of SQL, while a good tool to have in the toolbox, is probably pretty low value for the effort for lots of data analysis tasks.


I'm pretty sure the premise here is that the many gigabytes/terabytes of data reside in an RDBMS to begin with, so we aren't introducing another tool -- we're saying it makes sense to leverage multiple decades' worth of database optimizations by doing a bunch of data filtering, processing, and analysis in the database software and then exporting intermediate results to a pandas environment for final tweaks, visualization, etc.


> I'm pretty sure the premise here is that the many gigabytes/terabytes of data reside in an RDBMS to begin with, so we aren't introducing another tool

psql is a separate scriptable client tool from the postgres database server.


The point was about doing more work using the SQL DB engine, the client library you do that with seems irrelevant to making that point.


> The point was about doing more work using the SQL DB engine

The specific reference was to “psql script”, which runs in the psql client program, not the DB engine. Since I suspected that might be an error intending to reference SQL running on the server, I separately addressed, in my initial response, both what was literally said (psql script) and sql running in the DB.


Any psql script would be building and running SQL queries so I fail to see how that has any impact of the point being made.

Edit: The argument is that there are operations that should be done in the SQL layer and it is worth time learning enough about that layer to understand when and how to use it for computation. Once you learn that, it isn't really relevant if you are using psql or some other client library to build those queries.


>Pandas is a very popular tool for data analysis. It comes built-in with many useful features, it's battle tested and widely accepted. However, pandas is not always the best tool for the job.

SQL is very useful, but there are some data manipulations which are much easier to perform in pandas/dplyr/data.table than in SQL. For example, the article discusses how to perform a pivot table, which takes data in a "long" format, and makes it "wider".

In the article, the pandas version is:

>pd.pivot_table(df, values='name', index='role', columns='department', aggfunc='count')

Compared to the SQL version:

>SELECT role, SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) as "R&D", SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as "Sales" FROM emp GROUP BY role;

Not only does the SQL code require you to know up front how many distinct columns you are creating, it requires you to write a line out for each new column. This is okay in simple cases, but is untenable when you are pivoting on a column with hundreds or more distinct values, such as dates or zip codes.

There are some SQL dialects which provide pivot functions like in pandas, but they are not universal.

There are other examples in the article where the SQL code is much longer and less flexible, such as binning, where the bins are hardcoded into the query.


I've been doing a lot of data analysis in Pandas recently. I started off thinking that for efficiency's sake, I should do as much initial processing in the DB as possible, and use Pandas just for the higher level functions that were difficult to do in SQL.

But after some trial and error, I find it much faster to pull relatively large, unprocessed datasets and do everything in Pandas on the local client. Faster both in total analysis time, and faster in DB cycles.

It seems like a couple of simple "select * from cars" and "select * from drivers where age < 30", and doing all the joining, filtering, and summarizing on my machine, is often less burdensome on the db than doing it up-front in SQL.

Of course, this can change depending on the specific dataset, how big it is, how you're indexed, and all that jazz. Just wanted to mention how my initial intuition was misguided.


I've always been disappointed by the SQL pivot. It's hardly useful for me if I have to know up-front all of the columns it's going to pivot out into. The solution would be to use another SQL query to generate a dynamic SQL query, but at that point I would rather just use Pandas
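For illustration, a sketch of that two-step approach using the emp/role/department example from the article, generating the pivot SQL from Python (this assumes trusted column values, since they are interpolated into the query):

  import pandas as pd
  import sqlalchemy as sa

  engine = sa.create_engine("postgresql:///app")  # hypothetical database

  # Step 1: discover the pivoted columns.
  departments = pd.read_sql("SELECT DISTINCT department FROM emp", engine)["department"]

  # Step 2: generate one CASE expression per column, then run the real query.
  # (Plain string interpolation -- fine for a trusted internal column,
  # not safe for arbitrary user input.)
  cases = ",\n       ".join(
      f"SUM(CASE department WHEN '{d}' THEN 1 ELSE 0 END) AS \"{d}\""
      for d in departments
  )
  wide = pd.read_sql(f"SELECT role,\n       {cases}\nFROM emp\nGROUP BY role", engine)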


> There are some SQL dialects which provide pivot functions like in pandas, but they are not universal.

And you need to define the columns in advance anyway, because query planners can’t handle columns that change at runtime.


Agreed. https://ibis-project.org/ and https://dbplyr.tidyverse.org/ can compile dataframe-like input to SQL, which might bridge the gap in tooling (although there still are small differences to the pure dataframe syntax)


Does `tablefunc.crosstab()` do what you want?

https://www.postgresql.org/docs/13/tablefunc.html


It is not much better than the canonical example given in the article. It still has the following usability issues:

-You still need to enumerate and label each new column and their types. This particular problem is fixed by crosstabN().

-You need to know upfront how many columns are created before performing the pivot. In the context of data analysis, this is often dynamic or unknown.

-The input to the function is not a dataframe, but a text string that generates the pre-pivot results. This means your analysis up to this point needs to be converted into a string. Not only does this disrupt the flow of an analysis, you also have to worry about escape characters in your string.

-It is not standard across SQL dialects. This function is specific to Postgres, and other dialects have their own version of this function with their own limitations.

The article contains several examples like this where SQL is much more verbose and brittle than the equivalent pandas code.


That's one of the non-standard ways to do it. MSSQL and Oracle also have a pivot function to do this. Unfortunately there is no standard way to do this.


agreed that pivoting can be a pain.

a pattern that i converged on --- at least in postgres --- is to aggregate your data into json objects and then go from there. you don't need to know how many attributes (columns) should be in the result of your pivot. you can also do this in reverse (pivot from wide to long) with the same technique.

so for example if you have the schema `(obj_id, key, value)` in a long-formatted table, where an `obj_id` will have data spanning multiple rows, then you can issue a query like

  SELECT obj_id, jsonb_object_agg(key, value) FROM table GROUP BY obj_id;

up to actual syntax...it's been awhile since i've had to do a task requiring this, so details are fuzzy but pattern's there.

so each row in your query result would look like a json document: `(obj_id, {"key1": "value", "key2": "value", ...})`

see https://www.postgresql.org/docs/current/functions-json.html for more goodies.


The more I learn SQL, the less I write Python.

Although the SQL syntax is weird and so dated, its portability across tools trumps everything else. You finished the EDA and decided to port the insights to a dashboard? With SQL it's trivial. With Python... well, you'll probably have to port it to SQL unless you have a Netflix-like, Jupyter-backed dashboard infrastructure in place. For many of us who only have a much-more-prevalent SQL-based dashboard platform, why not start from SQL? Copy-n-paste is your friend!

I still hate SQL as a programmer, but as a non-expert data analyst I have now accepted it.


After 15 years programming I have come to love SQL - 10 yrs ago I used to do everything in the book to avoid it out of hate, even though I knew how to use it, but over time I have moved more and more into the database / SQL and my code is just better and better as a result.


I'm neutral on SQL but very big on the right tool for the job. My rule of thumb is, if I'm querying data just to massage it and then stuff it back into the database, then I first try to do that as a stored procedure. I remember learning about windowing in SQL late into a project, and redoing some Python functions as stored procedures really improved performance.
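For instance, a hedged sketch of the kind of windowed query that can replace a Python loop (the schema here is invented):

  import pandas as pd
  import sqlalchemy as sa

  engine = sa.create_engine("postgresql:///app")  # hypothetical database

  # Running total and per-customer ordering computed in the database,
  # instead of fetching the rows and looping over them in Python.
  df = pd.read_sql(
      """
      SELECT customer_id, order_date, amount,
             sum(amount)  OVER w AS running_total,
             row_number() OVER w AS order_rank
      FROM orders
      WINDOW w AS (PARTITION BY customer_id ORDER BY order_date)
      """,
      engine,
  )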


Nice article. Pandas gets the job done but it's such a step backwards in terms of usability, API consistency and code feel. You can do anything you could possibly need with it, but you regularly have to look up things that you've looked up before, because the different parts of the library are patched together and don't work consistently in an intuitive way. And then you end up with long lines of().chained().['expressions'].like_this(0).


I much prefer writing actual functions in a real programming language to the verbose non-reusable non-composable relic that is SQL syntax.

Give me a better query language and I will gladly drop Pandas and Data.Table.


Yes, SQL does have the problem of bad composability, and some things are less explicit than they are when working with dataframes. Dplyr is probably the most sane API out of all of them.


+1 for dplyr. I've used both pandas and dplyr daily for a couple years at different times in my career, and there is no comparison in mind when it comes to usability/verbosity/number of times I need to look at the documentation.


I agree, the degree of expressiveness that dplyr achieves is impressive. The library has had a great impact on my work. The syntax is easy to remember so you don't have to constantly look up minor variations in usage. Also, the pipelining aspect makes it dead-simple to debug. It's especially powerful when combined with the purrr library: https://purrr.tidyverse.org/


SQL is a real programming language.


It's not turing complete which is I think why it doesn't feel like a "real" programming language.


SQL is turing complete. A demonstration with recursive CTEs is done here[1].

[1]: https://wiki.postgresql.org/wiki/Cyclic_Tag_System


Oh wow. I stand corrected. Thanks!


Recursive CTEs exist; while I haven't investigated thoroughly, I would be surprised if they didn't make SQL turing complete.

More usefully, most databases allow you to write user-defined functions in other languages, including imperative SQL dialects.


This really grinds my gears too. There's something about the pandas API that makes it impossible for me to do basic ops without tedious manual browsing to get inplace or index arguments right... assignments and conditionals are needlessly verbose too.

Pyspark on the other hand just sticks in my brain, somehow. Chained pyspark method calls looks much neater.


Pandas is just a bad API, to be honest. I know base-R very, very well (which was one of the inspirations, I believe) and I still spend most of my time looking stuff up.

It's such a shame that python doesn't have a better DF library.


I also switched from R to Python/pandas. I remember always being frustrated with pandas since it tries to emulate data.frame, but then just does its own thing without being consistent.


I've used pandas regularly for the past ~5 years and find its API intuitive enough not to complain. I can write and read decently long pandas chained expressions fluently, but I barely know any R. Am I unwittingly a hostage of an inferior API and don't know what I'm missing?


Yes.

Base R is OK, but dplyr is magical.

For instance, integer indexing in base R is df[row,col] rather than the iloc pandas stuff.

plot, print and summary (and generic function OOP more generally is really underappreciated).

Python is a better programming language, but R is a better data analysis environment.

And dplyr is an incredibly fluent DSL for doing data analysis (not quite as good for modelling though).

Seriously, I read the original vignette for dplyr in late 2013/early 2014 and within two weeks I'd switched most of my new analytical code over to it. So very, very good. Less idea-impedance match than any other environment, in my experience.


I was actually using data.table. The syntax can be a bit cryptic, but you get used to it.


This might not seem like a big issue but in the beginning, these were my issues

1. Not so easy way to rename columns during aggregation

2. The group by generates its own grouped by data and hence you almost always need `reset_index`

3. Sometimes group by can convert a dataframe to series

4. Now `.loc` provides somewhat consistent indexing/slicing, but earlier you had `.ix`, `.iloc` and what not

These are something I can remember from top of my head. Of course all of these have solutions, but it makes pandas much more verbose. In R, these are just much more succinct.
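For reference, points 1 and 2 look roughly like this with current pandas (named aggregation; the data is made up):

  import pandas as pd

  df = pd.DataFrame({"dept": ["R&D", "R&D", "Sales"], "salary": [100, 120, 90]})

  # Renaming during aggregation needs the named-aggregation syntax, and the
  # grouped result still needs reset_index() to get back a plain DataFrame.
  summary = (
      df.groupby("dept")
        .agg(avg_salary=("salary", "mean"), headcount=("salary", "size"))
        .reset_index()
  )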


In case you're interested in what's missing--I maintain a port of dplyr from R to python called siuba, and gave a talk recently on why pandas might be hard to use:

https://www.rstudio.com/resources/rstudioglobal-2021/bringin...


This is super cool, thank you!

Wait... I had no idea dplyr's database support was this good. https://db.rstudio.com/dplyr/


Setting inplace=True isn't too bad, but I definitely have had many issues with working with indexes in Pandas. I don't understand why they didn't design it so that the index can be referenced like any other columns. It overcomplicates things like having to know the subtle difference between join() and merge().


Setting inplace=True is not recommended and should be used with caution

https://github.com/pandas-dev/pandas/issues/16529


I've never seen anything about inplace being a poor idea to use. Is that documented anywhere or is that ticket the only info about it?


It's not particularly front-and-centre, but it's all over various discussion boards if you go looking for it directly. I'd like the debate to be more visible, personally, particularly whenever the argument gets deprecated for a particular method.

Essentially, `inplace=True` rarely actually saves memory, and causes problems if you like chaining things together. The people who maintain the library/populate the discussion boards are generally pro-chaining, so `inplace` is slowly and quietly on its way out.


I wish it wasn't something you had to go looking for. I never would have thought that it would be an issue. If you do Google searches for inplace, you get no results on the first page discouraging its use.

The documentation should clearly say if the inplace argument causes an internal copy, because the availability of it implies that it doesn't. I've used inplace many times with Pandas because I've had code that I know is working with large amounts of data and I've thought "well I probably shouldn't chain these and cause tons of unnecessary allocations just to have prettier code".


Can you honestly say you'd prefer to debug thousands of lines of SQL versus the usual <100 lines of Python/pandas that does the same thing? It's no contest. A histogram in pandas: df.col.hist(). Unique values: df.col.value_counts(). Off the top of your head, what is the cleanest way of doing this in SQL and how does it compare? How anyone can say that SQL is objectively better (in readability, density, any metric), other than the fact that they learned it first and are now frustrated that they have to learn another new tool, baffles me.

I learned Pandas first. I have no issue with indexing, different ways of referencing cells, modifying individual rows and columns, numerous ways of slicing and dicing. It gets a little sprawling but there's a method to the madness. I can come back to it months later and easily debug. With SQL, it's just madness and 10x more verbose.


> Can you honestly say you'd prefer to be debug thousands of lines of SQL versus the usual <100 lines of Python/pandas that does the same thing?

That's fair, I was using the opportunity to complain about pandas and didn't point out all the problems SQL has for some tasks. What I really want is a dataframe library for Python that's designed in a more sensible way than pandas.


DISTINCT count and GROUP BY in SQL will provide distinct count and histogram (numerical value) output. Like NoSQL vs relational, it seems there are use cases that lend themselves to new vs old technologies. It's hard for me to put much stock in opinions from people who have vastly more experience in one of the tools being compared and not the other.
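For comparison, a small sketch of the two spellings side by side (table and column invented):

  import pandas as pd
  import sqlalchemy as sa

  engine = sa.create_engine("postgresql:///app")  # hypothetical database
  df = pd.read_sql("SELECT col FROM some_table", engine)

  # pandas: unique values with counts
  counts_pd = df["col"].value_counts()

  # SQL: the same thing, computed in the database
  counts_sql = pd.read_sql(
      "SELECT col, count(*) AS n FROM some_table GROUP BY col ORDER BY n DESC",
      engine,
  )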


This is exactly my experience. I've found becoming proficient with Pandas to take much more time than other data manipulation libraries like dplyr or Pyspark dataframes. The Pandas way of doing things is just not memorable or intuitive.


You might be thinking of specific functionality that you find is being implemented in an overly long/verbose fashion.. But generally speaking, how are

> long lines of().chained().['expressions'].like_this(0)

a _bad thing_?

IMHO these pandas chains are easy to read and communicate quite clearly what's being done. If anything, I've found that in my day-to-day while reading pandas I parse the meaning of those chains at least as efficiently as from comments of any level of specificity, or from what other languages (that I have had experience with) would've looked like.


People don't like them because the information density of pandas chains is soooo much higher than the rest of the surrounding code. So, they're reading along at a happy place, consuming a few concepts per statement

...and then BOOM, pandas chain! One statement containing 29+ concepts and their implications.

Followed by more low density code. The rollercoaster leads to complaints because it feels harder. Not because of any actual change in difficulty.

/that's my current working theory, anyway


Hmmm, interesting. I don't mind the information density, coming from R which is even more terse, but the API itself is just not that well thought out (which is fair enough, he was learning as he went).


SQL syntax sucks for doing non-trivial data analysis. I've tried it. Verbose, no composability or code reuse, not really portable across different databases, no easy interoperability with other tools, limited editor/IDE support, etc.

I guess if you have huge amounts of data (10m+ rows) already loaded into a database then sure, do your basic summary stats in SQL.

For everything else, I'll continue using SQL to get the data from the database and use Pandas or Data.Table to actually analyze it.

That said, this is a very comprehensive review of SQL techniques which I think could be very useful for when you do have bigger datasets and/or just need to get data out of a database efficiently. Great writeup and IMO required reading for anyone looking to be a serious "independent" data scientist (i.e. not relying on data engineers to do basic ETL for you).

I'd be a huge fan of something like the PySpark DataFrame API that "compiles" to SQL* (but that doesn't require you to actually be using PySpark which is its own can of worms). I think this would be a lot nicer for data analysis than any traditional ORM style of API, at least for data analysis, while providing better composability and IDE support than writing raw SQL.

*I also want this for Numexpr: https://numexpr.readthedocs.io/en/latest/user_guide.html, but also want a lot of other things, like a Arrow-backed data frames and a Numexpr-like C library to interact with them.


Completely disagree. I have a choice of using Snowflake or Spark/pandas to do my EDA and I'll choose SQL every time. The code is significantly more readable once you get used to it, and you can most definitely do things one step at a time using UDFs and temp tables / CTEs. I've come back to EDA I did a year back and it's always easier to read a long SQL script than a notebook with pandas code.


There's pluses and minuses to both. That being said, how do I write a function in SQL which can abstract over something I do a lot (like the pivoting example earlier), or even some date function like iff(date>'important_date', 'before', 'after') as grouper.

Honestly, that's what ends up making me move away from doing analytics in SQL.


Edit: oops, I think I replied a level deeper than intended. I was responding to the composition/abstraction topic. I think I should just leave this here now though?

I assume you are talking about composition of generic set-processing routines, but I wonder if others realize that? It is easy enough to write a set-returning function and wrap it in arbitrary SQL queries to consume its output. But, it is not easy to write a set-consuming function that can be invoked on an arbitrary SQL query to define its input. Thus, you cannot easily build a library of set-manipulation functions and then compose them into different pipelines.

I think different RDBMS dialects have different approaches here, but none feel like natural use of SQL. You might do something terrible with cursors. Or you might start passing around SQL string arguments to EXECUTE within the generic function, much like an eval() step in other interpreted languages. Other workarounds are to do everything as macro-processing (write your compositions in a different programming language and "compile" to SQL you pass to the query engine) or to abuse arrays or other variable-sized types (abuse some bloated "scalar" value in SQL as a quasi-set).

What's missing is some nice, first-class query (closure) and type system. It would be nice to be able to write a CTE and use it as a named input to a function and to have a sub-query syntax to pass an anonymous input to a function. Instead, all we can do is expand the library of scalar and aggregate functions but constantly repeat ourselves with the boilerplate SQL query structures that orchestrate these row-level operations.


Yeah, that's exactly the issue. One can do this in Python, but not in SQL, and this leads to boilerplate.

I actually think that SparkSQL is a good solution here, as you can create easily reusable functions.


> SQL syntax sucks for doing non-trivial data analysis. I've tried it. Verbose, no composability or code reuse, not really portable across different databases, no easy interoperability with other tools, limited editor/IDE support, etc.

You're entitled to your opinion and tooling choices of course, but the problem is you don't know SQL.


"No composability or code reuse" is definitely a valid criticism of SQL. Check out the very first example of my personal project https://docs.racket-lang.org/plisqin/Read_Me_First.html and let me know how you would implement something akin to `(CategoryName p)` and `(TotalSales p)` in SQL. A view does not count, because then you either have one view per derived field and that is very cumbersome to consume, or you have one view with all the derived fields and you will end up paying performance costs for fields that you are not using.


Don't views and materialized views give you reuse?

Can't you do composibility with foreign references?


Sure, views may be fine for small projects, but using them for larger projects like data warehousing is usually a mistake you'll come to regret.

SQL views are rarely unit tested so you always end up in a regression nightmare when you need to make updates.

If you're going to go pure SQL, you should use something like dbt.


A more useful comment would be to illustrate how. I'd like to know as well as I need to reuse queries in SQL occasionally but am not an expert. Currently I believe learning pl/pgsql is the answer, but even it reeks of punch-cards and other sixtiesisms :-). Tough sell when you're used to Python etc.


That could be said about any tool, framework, language, library, etc. Invest enough time and you can do everything with malbolge but that doesn't mean it doesn't suck.


I am a pandas user considering refactoring part of my ETL pipeline to SQL. I see the trade off as memory efficiency vs expressiveness, and for simple queries on big data, SQL wins. Would you disagree that Pandas/Python is more expressive than SQL? I’m less experienced in SQL but based on my limited experience there, it seems Pandas is clearly more expressive. What is the SQL equivalent of Pandas .apply(lambda x) ?


ClickHouse has lambdas for arrays. They are very useful. Here's an example.

  WITH ['a', 'bc', 'def', 'g'] AS array
  SELECT arrayFilter(v -> (length(v) > 1), array) AS filtered
  
  ┌─filtered─────┐
  │ ['bc','def'] │
  └──────────────┘
The lambda in this case is a selector for strings with more than one character. I would not argue that they are as general as Pandas, but they are mighty useful. More examples are in the following article.

https://altinity.com/blog/harnessing-the-power-of-clickhouse...


Please define an example lambda function, in order to see whether there is an SQL equivalent. IMO >90% of the data issues I have seen can be solved with SQL queries. I have seen some in the finance sector which would require super complex UDFs, but other than those, SQL is the first choice for solving a data issue.


I do work in finance sector and need complex functions :)

And it’s interesting that pandas was invented at a hedge fund, AQR.

I agree that SQL may be better for vanilla BI.


> I guess if you have huge amounts of data (10m+ rows) already loaded into a database then sure, do your basic summary stats in SQL.

This is not huge. This is quite small. Pandas can't handle data that runs into billions of rows or sub-second response on arbitrary queries. Both are common requirements in many analytic applications.

I like Pandas. It's flexible and powerful if you have experience with it. But there's no question that SQL databases (especially data warehouses) handle large datasets and low latency response far better than anything in the Python ecosystem.


> This is not huge. This is quite small. Pandas can't handle data that runs into billions of rows or sub-second response on arbitrary queries. Both are common requirements in many analytic applications.

You're right. It's "huge" with respect to what you can expect to load into Pandas and get instant results from. But it's not even "medium" data on the small-medium-big spectrum.

> I like Pandas. It's flexible and powerful if you have experience with it. But there's no question that SQL databases (especially data warehouses) handle large datasets and low latency response far better than anything in the Python ecosystem.

I agree with this 100%, but I think a lot of people missed it in my post.


OTOH Pandas/Numpy are great for results. One pattern that I'm hoping to try is running heavy lifting in SQL, transfer results via Apache Arrow, and manipulate the result set in Python with aforesaid libraries.

I don't know enough about how Arrow handles streaming to understand how this would really work, but it would break away from the single-threaded connectivity with expensive ser/deser that databases have used since the days of Sybase DB-Library, the predecessor to ODBC. 36 years is probably long enough for that model.


Yeah but that's not even close to being Pandas' value proposition. It's for the data scientist and analyst, not the db admin, data engineer or production applications.


For apache arrow backed dataframes check out Polars(https://github.com/ritchie46/polars)


> (i.e. not relying on data engineers to do basic ETL for you).

Is this actually a thing? Surely it can't be a thing.


of course, that's my job... we do basic ETL on hundreds of data sources to provide data to the analysts and quants


In a sufficiently large org, why not?


My concern is not that DE's do ETL, just that a data scientist could actually avoid learning how to do this.

Certainly in all the successful orgs I've worked in, this would rarely be the case. Maybe it works if you have really well-standardized and static data, but I rarely work in this kind of environment.


I think for a lot of people, SQL is a skill that doesn't stick. You learn enough to do the queries you need for your project, they work, and then you forget about them as you work on the rest of the project. These skills are perishable. Left outer join? Yeah, I knew what that was some time ago, but not anymore.

The days of dedicated SQL programmers are mostly gone.


I started learning SQL twenty years ago and it's one of the only skills from the start of my career that has been consistently useful ever since.


I held this view before graduating and moving back to my 3rd world home country. For some reason, people here (the established ones with 5+ years of experience) see SQL as a general purpose programming language with which they do as much as possible.


> The days of dedicated SQL programers are mostly gone.

I've never met a "dedicated SQL programmer" - all the C++ programmers I've worked with in the investment banking world were also expected to know, and did know, SQL pretty well - asking questions about it were normally part of the interview process.


Exactly. I own a software consultancy, and I tried explaining this to an engineering manager and he just didn't get it.

For some reason, he couldn't understand that when you write SQL queries for a project, you typically do it once, and basically never again, with the exception of maybe adding or removing columns from the query.

The "hard work," if you can call it that, is all in the joins. Then, I completely forget about it.

You spend so much more time in development on everything else.


The engineering manager was right. SQL is the most pervasive programming language in existence. It doesn't get that way because nobody uses it. Maybe you don't use it much, but unless your consultancy specializes in some tiny niche where SQL isn't actually needed, you might want to rethink that. I've seen far too many projects where developers looked down on SQL as some sort of not-so-necessary evil, and they've all ended up reinventing it. Extremely poorly.


We have a lot more clueless developers on average, which makes it seem like no one uses SQL anymore beyond the basics. But we're still here, believe me. No one today is a dedicated X programmer, but that doesn't mean we're all clueless about X.


I have been working with data for more than 5 years now and my 2 cents is that Pandas shines with exploratory analysis. Because you have the data in memory it's easy to empirically arrive at the exact data transformation you need and figure out whether your analysis is worth pursuing further. At this stage of the development your ability to iterate and experiment quickly is most valuable. I see no value in worrying about the deployment of the work you are doing if you haven't got any results just yet.

Now when the analysis is done and there's a conclusion about putting some of it in production, I have yet to see a single company that relies on pandas to do data transforms "online", i.e. on the live production data stream. Typically in my experience I would end up rewriting my data transforms in either a compiled language or SQL - i.e. something much more performant and maintainable.

As a result, as you progress in your career, you become very proficient in both pandas and SQL - and you also start glancing at the source code of some of the transforms performed by pandas when you need to port them to the language of choice of the company you're working for.


This article is so useful. It starts out with SQL basics but then quickly leaps into all kinds of PostgreSQL tricks that I didn't know about - binning, efficient sampling, calculating, even linear regression.


I'm so glad to have been around long enough that SQL is now being seen as exotic again


Not long after I got into software engineering is when the first NoSQL craze began. I imagine that those who started not long after me and spent a few years wrangling with MongoDB queries (shudder) and trying to do joins, transactions and consistency guarantees in application code must see RDBMS as some sort of magical new technology.


I'm not a Pandas dev or a Pythonista for that matter, but I'm pretty sure a pandas DataFrame should be able to handle SQL cursors or materialize data just in time for the aggregation. Getting all rows in with a fetchall is an antipattern and just completely impractical for a lot of use cases! Doesn't pandas have a SQLAlchemy integration already? Although I'm not sure whether it would run the aggregation progressively or again with fetchall internally...
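
For what it's worth, pandas can do chunked reads: read_sql_query accepts a chunksize and yields DataFrames incrementally instead of one big fetchall. A minimal sketch, assuming a hypothetical SQLite file and table:

  import sqlite3
  import pandas as pd

  conn = sqlite3.connect("events.db")  # hypothetical database file

  total = 0.0
  # chunksize makes read_sql_query yield DataFrames lazily, never all rows at once.
  for chunk in pd.read_sql_query("SELECT amount FROM events", conn, chunksize=50_000):
      total += chunk["amount"].sum()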


Following similar observations, I was wondering whether one could actually execute SQL queries inside a Python process with access to native Python functions and NumPy as UDFs. Thanks to Apache Arrow, one can essentially combine a DataFrame API with SQL within data analysis workflows, without the need to copy the data or write operators in a mix of C++ and Python, all within the confines of the same Python process.

So I implemented Vinum, which allows you to execute queries that may invoke NumPy or Python functions as UDFs available to the interpreter. For example: "SELECT value, np.log(value) FROM t WHERE ..".

https://github.com/dmitrykoval/vinum

Finally, DuckDB is making great progress integrating pandas dataframes into its API, with UDF support coming soon. I would certainly recommend giving it a shot for OLAP workflows.
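
As a small illustration of the DuckDB/pandas integration, it can already run SQL directly over a DataFrame that's in scope (toy data):

  import duckdb
  import pandas as pd

  df = pd.DataFrame({"value": [0.5, 1.5, 2.5, 10.0]})

  # DuckDB resolves `df` by name from the surrounding Python scope.
  out = duckdb.sql("SELECT value, ln(value) AS log_value FROM df WHERE value > 1").df()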


Also, I think SQLite lets you call Python functions from SQL.


That's correct, but SQLite would have to serialize/deserialize the data sent to the Python function (from C to Python and back), while Arrow allows you to get a "view" of the same data without making a copy. This is probably not an issue in OLTP workloads, but may become more visible in OLAP.


> This benchmark does not mention the memory consumed by the database itself - this is intentional. [...] Whether you decide to use the database or not, the memory is already paid for, so you might as well use it!

This sentence is a big red flag for me. An analysis of a strategy that pushes work towards a subsystem, and then deliberately ignores the performance implications on that subsystem, is methodologically unsound.

Personally, I am all for using the DB and writing good SQL. But if I weren't, this argument would not convince me.


Oddly, I was having this conversation with a consulting client very recently. The client asked for an "in-memory datastore". Personally, I am fluent in PostgreSQL and had already done some early versions of a solution using Postgres. In conversation, I was trying to get across the idea that Postgres can be configured with a RAM balance such that it starts to act like an "in-memory datastore". You can think of the server instance as a "wrapper" around the action of tables and access. The client was/is skeptical and also knows no Postgres personally. Perhaps there are some architectural pieces missing in his mind that blur multi-node, cloudy datastores versus a single node.

Postgres certainly comes from the days of single-box, many-workload, many-user deployments. Currently, there are more varied scenarios, and that once-common case is a lot less common. As this article shows very well, Postgres is actually a very capable tool.
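
For the session-level part of that tuning, this is roughly the kind of thing I mean (a sketch with illustrative values and a hypothetical DSN; the server-wide knobs like shared_buffers and effective_cache_size live in postgresql.conf):

  import psycopg2

  conn = psycopg2.connect("dbname=analytics")  # hypothetical DSN
  with conn.cursor() as cur:
      # Per-session settings; values are illustrative, tune to the box.
      cur.execute("SET work_mem = '256MB'")     # sorts/hashes stay in RAM
      cur.execute("SET temp_buffers = '64MB'")  # temp tables stay in RAM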


Hmm, one thing that people forget about when it comes to Pandas vs SQL is that for the kind of data you would use in pandas, data that is a day (and sometimes even a week or more) old doesn't make a big difference. So say you are building some complicated data processing pipeline and you base it on a bunch of SQL queries. You are hitting the database every single time.

Whereas if you just pull the tables and do most (or at least some) of the merging and processing in pandas, you can load the data in the middle of the night, and that is the only time you are hitting the db.

I have personally experienced big analytical SQL queries hitting the db at busy times...


I think one of the less discussed advantages of SQL and reasons why it might be so prevalent in the analytics world is that users don't really need to spend much time to set up their environment. I can send a query to a less technical person with my analysis for them to run. They can put the processed data into a spreadsheet and do whatever they want.


I now do most of my data analysis in Julia instead of pandas, but used pandas for a long time. SQL is a valuable skill and useful when you need to optimize a query for performance, which can sometimes happen with pandas and is much less likely with Julia.

However, even if a pandas query takes 5X longer to run than a SQL query, you must consider developer efficiency. You can chain pandas commands together to accomplish something that takes 10x the lines of SQL. With SQL you're more likely to end up with many intermediate CTEs along the way. So while you can definitely save processor cycles by using SQL, I don't think you'll save clock-face time by using it for most one-off tasks. Data science is usually column-oriented, and Julia and pandas allow you to stay in that world.
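
To make the chaining point concrete, this is roughly the style I mean (toy data, made-up column names):

  import pandas as pd

  df = pd.DataFrame({"region": ["a", "a", "b"], "amount": [1.0, 2.0, 3.0]})

  # Filter, group, and aggregate in one readable chain.
  summary = (
      df[df["amount"] > 0]
        .groupby("region")["amount"]
        .agg(["mean", "max"])
        .reset_index()
  )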


For me, the main use case of a database server is so I can do operations that require more memory than I have on my local computer. Reading through this comment section makes me think my use case is rare or something. Maybe everyone is ok with chunked operations?
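
For simple reductions, chunking works fine; e.g. pandas can stream a file that doesn't fit in RAM (file and column names are hypothetical):

  import pandas as pd

  total = 0.0
  # Stream the file one million rows at a time instead of loading it whole.
  for chunk in pd.read_csv("bigger_than_ram.csv", chunksize=1_000_000):
      total += chunk["amount"].sum()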


Question: if I have a CSV file that I'd like to do some quick SQL queries on before moving the results into Pandas, what would be a good resource for doing this? Preferably compatible with the rest of the Python-dataframe ecosystem and as simple as pd.read_csv().


SQLite is what you're looking for. If you want to use the CLI,

  .mode csv
  .import <path to csv file> <table name>
Alternatively, read your data into pandas; there's extremely easy interop between a DB-API connection from the Python standard library sqlite3 module and pandas (to_sql, read_sql_query, etc.).
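
A minimal sketch of that interop (file and table names are made up):

  import sqlite3
  import pandas as pd

  conn = sqlite3.connect("scratch.db")

  # Push the CSV into SQLite once, then do the heavy querying in SQL.
  pd.read_csv("clicks.csv").to_sql("clicks", conn, index=False, if_exists="replace")

  top = pd.read_sql_query(
      "SELECT c3, count(*) AS n FROM clicks GROUP BY c3 ORDER BY n DESC LIMIT 10",
      conn,
  )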


I would second this suggestion. I was querying CSV data imported into SQLite this weekend, and it was extremely easy to get started. SQLite is pretty snappy even for tables with millions of rows.

SQLite supports defining columns without a type and will use TEXT by default, so you can take the first line of your CSV that lists the column names, put those inside the parentheses of a CREATE TABLE statement, and then run the .import described above (so just CREATE TABLE foo(x,y,z); if x,y,z are your column names).

After importing the data don't forget to create indexes for the queries you'll be using most often, and you're good to go.

Another suggestion for once your data is imported: have SQLite report it in table format:

    .mode column
    .headers on


If you want to try it in SQLite (pros: no need to run a server or install anything since it's in the Python standard library, cons: not nearly as many advanced statistical analysis features as PostgreSQL) my sqlite-utils CLI tool may help here: it can import from CSV/TSV/JSON into a SQLite database: https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...


The sqlite3 connection object in Python allows you to register callables which you can use as scalar or aggregate functions in your SQL queries. With this, you can fill some of the gaps compared to PostgreSQL by essentially importing Python libraries. I just found this nice tutorial while looking for relevant docs:

https://wellsr.com/python/create-scalar-and-aggregate-functi...

However, I think the limited type system in SQLite means you would still want to extract more data to process in Python, whether via pandas, numpy, or scipy stats functions. Rather than introducing new composite types, I think you might be stuck with just JSON strings and frequent deserialization/reserialization if you wanted to build up structured results and process them via layers of user-defined functions.
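
A minimal sketch of what registering those callables looks like (the stdev aggregate is just for illustration):

  import math
  import sqlite3
  import statistics

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE t(x REAL)")
  conn.executemany("INSERT INTO t VALUES (?)", [(1.0,), (2.0,), (4.0,)])

  # Scalar UDF: any Python callable.
  conn.create_function("py_log", 1, math.log)

  # Aggregate UDF: a class with step() and finalize().
  class Stdev:
      def __init__(self):
          self.values = []
      def step(self, value):
          self.values.append(value)
      def finalize(self):
          return statistics.stdev(self.values)

  conn.create_aggregate("stdev", 1, Stdev)

  print(conn.execute("SELECT py_log(x) FROM t").fetchall())
  print(conn.execute("SELECT stdev(x) FROM t").fetchall())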


Try DuckDB to query and even transform the data, then export to a pandas DataFrame.
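
For example, something like this (hypothetical file and column; read_csv_auto infers the schema):

  import duckdb

  # Query the CSV directly, then hand the small result to pandas.
  df = duckdb.sql(
      "SELECT c3, count(*) AS n FROM read_csv_auto('clicks.csv') GROUP BY c3"
  ).df()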


http://harelba.github.io/q/

q "SELECT COUNT(*) FROM ./clicks_file.csv WHERE c3 > 32.3"

It uses sqlite under the hood.


In addition to the recommendation for sqlite, I've found `csvkit` to be an extremely useful set of CLI tools for CSV munging. The `csvsql` [1] entrypoint is especially handy because it allows you to issue SQL queries against your CSV file directly vs. loading then querying.

1: https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html


I keep a non-commercial installation of SQL Server on my machine for this reason, but this is likely overkill for most purposes and requires a Windows machine.

It does have some nice import features for CSV data though.


Try https://bit.io/. You can drop a CSV file and query the data with SQL.


You can easily import the csv file into sqlite and you don't even have to create the table fields beforehand


Windows provides an ODBC driver for CSV files - I don't know how this would play with Pandas.


Try SQLAlchemy?


The code:

  WITH dt AS (
      SELECT unnest(array[1, 2]) AS n
  )
  SELECT * FROM dt;
Is more complex than necessary. This produces the same result:

  SELECT n FROM unnest(array[1, 2]) n;
  ┌───┐
  │ n │
  ├───┤
  │ 1 │
  │ 2 │
  └───┘
  (2 rows)
I think I see some other opportunities as well.

I know the code is from a section dealing with CTEs, but CTEs aren't needed for every situation; there are alternatives such as VALUES lists. Most people in the target audience (probably on a newer version of PostgreSQL) can ignore this next point, but older versions of PostgreSQL would materialize the CTE results prior to processing the main query, which is not immediately obvious.


I think that was a deliberate choice by the author to consistently demonstrate CTEs.


Sorry, I was editing to address this point, probably while you were typing your response. It's well taken, but we should be clear that it's not required.


For what it's worth, I maintain a library called siuba that lets you generate SQL code from pandas methods.

It's crazy to me how people use SELECT * -> pandas, but also how people in SQL type a ton of code over and over.

https://github.com/machow/siuba


It seems the only reason people on HN hate SQL is that they don't understand SQL and/or have already bought into another toolset. If you only have a hammer, everything looks like a nail.


Pure SQL porn. Very exciting!

Lately I am seeing some really creative blog posts on HN. Keep it up, guys.


OT but what is the best resource for learning SQL?


I think I do a healthy mixture of both.


tl;dr: if you hear "but we can do this in SQL", RUN!!!

My eyes hurt as I read this article. There are reasons why analysts don't use SQL to do their job, and they have nothing to do with saving RAM and memory.

1) Data analysis is not a linear process; it involves playing with and manipulating the data in different ways and letting your mind drift a bit. You want your project in an IDE made for that purpose, with the ability to create charts, source control, export and share the information, etc. Pandas is just a piece of that puzzle, which is not possible to replicate in pure SQL.

2) In 2020, there are numerical methods you want to try beyond a traditional regression. Most real-world data problems are not made for the stats-101 tools included in SQL. Kurtosis? Autocorrelation?

3) Politics. Most database administrators are control freaks who hate the idea of somebody else doing stuff in their DB. Right now where I work, we still have to use SSIS 2013 instead of stored procedures in order to avoid the DBA's bureaucratic refusal process.

4) Eventual professional development. If your analysis is good and creates value, chances are it will become a "real" program and you will have to explain what you are doing to turn it into an OOP tool. If you have CS 101, good coding in Python will make this process much easier than a 3000-line spaghetti SQL script.

5) Data cleaning. Dealing with outliers, NaNs, and all that jazz really depends on the problem you're trying to solve. The absence of a one-size-fits-all solution is a good case for R/pandas/etc. These issues will break an SQL script in no time.

6) Debugging in SQL. Hahahahahahahaha

If you are still preoccupied with the RAM usage of your PC for your project, here are two solutions which infuriate a lot of DBAs I've worked with.

A) https://www.amazon.ca/s?k=ram&__mk_fr_CA=%C3%85M%C3%85%C5%BD...

B) https://aws.amazon.com/


What is your problem with this example? Doesn't it get the job done?

  WITH temperatures AS ( /* ... */ )
  SELECT
      *,
      MAX(c) OVER (
          ORDER BY t
          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
      ) AS hottest_temperature_last_three_days
  FROM
      temperatures;
  
       t      │ c  │ hottest_temperature_last_three_days
  ────────────┼────┼─────────────────────────────────────
   2021-01-01 │ 10 │                                  10
   2021-01-02 │ 12 │                                  12
   2021-01-03 │ 13 │                                  13
   2021-01-04 │ 14 │                                  14
   2021-01-05 │ 18 │                                  18
   2021-01-06 │ 15 │                                  18
   2021-01-07 │ 16 │                                  18
   2021-01-08 │ 17 │                                  17
Why should I fetch all the data and then form the result with another tool?

What a great article.


I re-read my comment, and I think that I expressed myself too harshly. If you like SQL and it gets the job done for you, no problem. I would see myself using that SQL query.

However, in order to get there, you need to know why you need the "hottest_temperature_last_three_days". Why not 2 or 4 days? Why not use heating degree days? What about serial correlation with other regions, or metering malfunction? What if you are working directly with raw data from instruments, and still need to choose which cleaning method you will use? What if you want to check the correlation with another dataset which is not yet in your database (e.g. a private dataset in Excel from a potential vendor)?

If you know exactly what you want, sure, SQL is the way to go. However, the first step of a data project is to admit that you don't know what you want. You will perform a literature review and might have a general idea, but starting with a precise solution in mind is a recipe for failure. This is why you could need to fetch all the data and then form results with another tool. How do you even know which factors and features must be extracted before doing some exploration?

If you know the process you need and only care about RAM/CPU optimization, sure, SQL is the way to go. Your project is an ETL job and you have the job of a programmer who is coding a report. There is no "data analysis" there...


> If you know exactly what you want, sure SQL is the way to go.

I've been doing similar aggregations with MongoDB, simply because when I start a project where I quickly want to store data and then a week later check what I can do with it, it's a tool which is trivial to use. I don't need to think about so many factors like where and how to store the data. I use MongoDB for its flexibility (schemaless), compared to SQL.

But I also use SQL, mostly because of the power relations have, yet I use it only for storing stuff where it's clear how the data will look forever and what (mostly simple) queries I need to perform.

I've read your comment as if it were suggesting that these kinds of articles are not good, yet for me it was a nice overview of some interesting stuff that can be done with SQL. I don't chase articles on SQL, so I don't get many to read, but this one is among the best I've read.

To get back to the quote: how will I know whether SQL can do what I want if I don't read these kinds of articles?


If you know SQL well, then doing exploratory analysis in SQL is perfectly reasonable. Especially if you are using a high-performance analytic database.

Disclosure: I develop high-performance analytic database systems with SQL frontends.


What do you mean by exploratory analysis? Let's assume that my dataset contains features which are not normally distributed, and I want to check the coskewness and cokurtosis matrices before and after I remove outliers with a method specific to my field of research, found in an academic paper. Am I supposed to code all of this in SQL? What is your definition of exploratory analysis, and which metrics do you have in mind?

And what if I want to make graphs to present stuff to a colleague? Am I supposed to use... Excel to do that?


> Am I supposed to use... excel to do that?

This is gratuitous. You have a clear bias, granted, because it seems your domain is so specific that only a procedural language will do. But it seems you are unfamiliar with modern SQL tools. Some of the obvious ones that come to mind: Metabase[0] for visualisation or Apache MADlib[1] for in-database statistics and machine learning.

[0] https://github.com/metabase/metabase [1] http://madlib.apache.org/


I humbly disagree. It's not that "my domain is so specific", but that data analysis in itself is a discovery process which is not straightforward. If you want to explore the jungle, you need a machete.

As for Metabase or MADlib, you are right; I was not aware of these new tools. They look great, and I'm certain they can help a lot of people. However, you assume that they are available! Not all IT departments are open to the idea of buying new software, and if the suggestion comes from an outsider it will be perceived as an insult (been there, done that many, many times). And when they refuse, now what? You go back to the usual procedural languages (R, Python, Julia, etc.), which are free and don't require the perpetual oversight of some DBA who thinks that all you need is AVG(x) and GROUP BY, since kurtosis is domain-specific anyway.

I've met some Excel/VBA users who couldn't care less about pro devs or decorators since "they can already do everything by themselves". Same thing with the SQL-only, Python-only, Tableau-only, or whatever-only crowd.


I usually use Python and R for analysis. However, when dealing with larger datasets, e.g., 0.5 - 2 PB, I have to rely on SQL/BigQuery because I can't get Python and R to handle such workloads in a reasonable time. I tried Dask, but I couldn't resolve a few bugs it had at the time.

If you were to find outliers in a 1 PB table, what tools would you use?


Familiar. I think the biggest issue is understanding workflow. My hunch is that some developers view a data analysis project as a program... it needs to have these data processing steps performed, summarized in this certain way, and saved off in that certain format. Once those queries are defined, the program is run and the analysis is done.

The actual work the analyst is doing is far more ad-hoc than that. It's iterative and frequently will get torn down, rebuilt, and joined to other sources, as they discover new aspects to the data. They need their hands on the raw data, not on a specifically summarized version of it.


Yup. That is exactly it. You cannot believe the number of programmers I've met who think that "data science is easy" because "I can code a neural net too".

Data analysis is a soft skill that you usually learn by getting your hands dirty in a graduate or professional environment. It's not something you can put on a CV like "X years of experience with Y programming language".



