
What SQL Analysts Need to Know About Python (2016) - geordilaforge
https://segment.com/blog/sql-vs-python/
======
drej
I'm worried about the inverse - what Python data analysts should know about
SQL. Because I've met tons of analysts who wouldn't be able to even run a
basic select.

I've seen tons of (often non-reproducible) code written in place of a simple
SQL query. I really wish bootcamps and other learning platforms focused on SQL
a bit more.

(I am a Python data analyst who properly learned SQL only after several years
in the industry.)

~~~
oarabbus_
Or worse, they use cursors. Cursors should never be used in SQL. Ever. That's
my philosophy.

~~~
phumbe
What is a better approach? I have Python that directly connects to an Oracle
database, and the Oracle blog tutorial[0] for using their Python package
always uses a cursor.

[0] [https://blogs.oracle.com/oraclemagazine/perform-basic-crud-o...](https://blogs.oracle.com/oraclemagazine/perform-basic-crud-operations-with-cx-oracle-part-2)

~~~
ansgri
They're talking about client-side cursors, which are part of the Python DB API. What
you shouldn't use is a server-side cursor, which you create with DECLARE in
PostgreSQL. These keep their state on the server and are intended as
optimizations for special cases like streaming data processing or realtime
updates. Basically, for the deep internals of realtime systems, not regular
queries.
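
To make the distinction concrete, here's a rough sketch with psycopg2 (the
connection string and table name are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=example")  # placeholder connection string

    # Client-side cursor: the regular DB API object. The client library fetches
    # and buffers the result set for you.
    cur = conn.cursor()
    cur.execute("SELECT id, value FROM measurements")
    rows = cur.fetchall()

    # Server-side (named) cursor: state lives on the server and rows are
    # streamed in batches, which is the special case mentioned above.
    named_cur = conn.cursor(name="streaming_cursor")
    named_cur.itersize = 10000
    named_cur.execute("SELECT id, value FROM measurements")
    for row in named_cur:
        pass  # process one row at a time without loading everything into memory

    conn.close()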

~~~
goatinaboat
_What you shouldn't use is a server-side cursor, which you create with DECLARE
in PostgreSQL._

That entirely depends on the DB. Oracle for example always executes with a
server-side cursor; all declaring it does is give you a handle to what it was
doing anyway.

~~~
chrishynes
The problem isn't the cursor itself, but that it is usually a symptom of
procedural thinking vs. set-based thinking. In an RDBMS it's typically far
faster to puzzle out the joins, CTEs, and set-based expressions and functions
that winnow down a dataset than to run cursor-based procedural logic on a
row-by-row basis.
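
As a tiny illustration (table and column names are made up; the placeholders
assume a psycopg2-style driver, and `conn`/`cur` are an open connection and
cursor):

    # Row-by-row, procedural thinking: one round trip per row.
    cur.execute("SELECT id, amount FROM orders WHERE discounted_amount IS NULL")
    for order_id, amount in cur.fetchall():
        cur.execute(
            "UPDATE orders SET discounted_amount = %s WHERE id = %s",
            (amount * 0.9, order_id),
        )

    # Set-based thinking: one statement, and the database plans the whole thing.
    cur.execute(
        "UPDATE orders SET discounted_amount = amount * 0.9 "
        "WHERE discounted_amount IS NULL"
    )
    conn.commit()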

~~~
bshipp
Thank you for clarifying that; I was wondering where the concern was coming
from.

Without disagreeing with any of the above, one important consideration is what
you're going to do with the query. If all you want to know is a column's mean
or some other simplified statistical value, there's really no sense in pulling
all the data into Python just to calculate it. Do it inside the DB itself with
SQL.

On the other hand, if you need that data to do other work (e.g. populate a
table in a webpage, or generate a new descriptive data set, or whatever), then
the trade-off for pulling it into Python/pandas and running a mean in addition
to the other work becomes much smaller.

My approach is usually to do as much data filtering and parsing as possible
inside SQL, but things like complex parsing and string manipulation
(especially!) I'll do with Python. I can do some simple string work in SQL,
but I can almost always do it faster and cleaner in Python.
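
A minimal sketch of that trade-off with pandas.read_sql (the orders table is
invented and `conn` is any open database connection):

    import pandas as pd

    # Let the database do the work: only a single number crosses the wire.
    mean_price = pd.read_sql("SELECT AVG(price) AS mean_price FROM orders", conn)

    # Pull everything into pandas: only worth it if you need the rows anyway
    # for other work (a web page, plots, further transformations, ...).
    orders = pd.read_sql("SELECT * FROM orders", conn)
    mean_price_again = orders["price"].mean()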

------
sheepstrat
A few years ago I joined a Rails shop, and one thing that always struck me was
how many of the engineers didn't know SQL. Most of them had learned to code on
Rails, and had always had SQL abstracted away via ActiveRecord.

I know this is not the point of this article, but as data analyst/scientist
roles continue to climb in popularity, I'm curious if there won't be a similar
trend with Python.

~~~
tstrimple
Isn't that one of the selling points of Rails? Time to market is king.
Optimize your SQL query performance after you've released and proven that it's
an actual bottleneck. Why spend more time and money on an optimized product
that might never see the light of day?

~~~
gwbas1c
There's a difference between basic optimization, and needing to refactor most
of your business logic because your assumptions about databases are
boneheaded.

In general, it doesn't matter what you're doing: your basic design patterns
need to fit around how a database works. If you don't understand this, you'll
hit scalability issues far too soon, and it'll take too long to fix them.

> Optimize your SQL query performance after you've released and proven that
> it's an actual bottleneck.

I've seen one project fail because the design patterns around using the ORM
were incorrect. Then I joined another project where the bottleneck (from
incorrect use of the ORM) was so bad the product couldn't scale beyond being a
demo. It took 2 months to refactor, all because one of the programmers used an
ORM incorrectly to save a few hours at the beginning.
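
The classic example of the kind of ORM misuse I mean (not necessarily what
happened on those projects) is the N+1 query pattern: the ORM quietly issues
one query per object in a loop instead of letting the database do a join. In
raw DB API terms (sqlite3-style placeholders, invented tables):

    # N+1: one query for the orders, then one more query per order.
    cur.execute("SELECT id, customer_id FROM orders")
    for order_id, customer_id in cur.fetchall():
        cur.execute("SELECT name FROM customers WHERE id = ?", (customer_id,))
        customer_name = cur.fetchone()

    # One join: a single round trip, and the database plans the whole thing.
    cur.execute("""
        SELECT o.id, c.name
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
    """)
    rows = cur.fetchall()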

------
geebee
I've found that SQL and data frames are pretty exceptional in combination. I'm
a big fan of a Python module called PandaSQL
([https://pypi.org/project/pandasql](https://pypi.org/project/pandasql)). What
makes this work so well (for me, at least) is that I can combine pandas
operations that transform data frames with SQL operations that transform data
frames in the same pipeline, and if I _really_ need to, I can just break it
all apart with python and reassemble it back into a dataframe later. I don't
need to recreate Boolean algebra with complex and potentially buggy dataframe
operations[1], and I don't need to recreate loops, conditionals, stats
operations and so forth with complex and potentially buggy SQL.
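
For anyone who hasn't seen it, the basic pandasql pattern looks roughly like
this (the DataFrame and its columns are invented):

    import pandas as pd
    from pandasql import sqldf

    sales = pd.DataFrame({
        "region": ["east", "east", "west"],
        "amount": [100, 250, 80],
    })

    # Query the DataFrame as if it were a table; the result is another
    # DataFrame, so it drops straight back into a pandas pipeline.
    by_region = sqldf(
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
        locals(),
    )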

For clustered work, I've found that Spark SQL data frames give a lot of the
same functionality (not quite all, though I _think_ that's because there are
some pandas operations that require a full in-memory dataframe and don't lend
themselves to distributed solutions).

[1] There have been so many attempts to replace SQL with a different
relational-like language. The end result is a new syntax that doesn't work if
you try to pull your queries out and run them against a database
independently. I'm going on a tangent in a footnote here, but I remember
reading "your data will outlast your application." I personally strive for a
database that is usable outside the context of the application it was
originally created to support. Trust me, eventually someone will want a set of
reports that would (in many cases) be far easier to write as queries if you'd
made sure your back-end database was a properly designed relational database.

~~~
kumarvvr
PandaSQL was last updated 3 years ago. Do you know any other alternatives?

~~~
geebee
No, though it has been a while, and your question may prompt me to take a look
around the landscape. Not that I have had a problem with pandasql, it does
work nicely.

This is one of those technologies that I was hoping would make its way into
the framework, kind of like how connection pooling for databases was once an
external module but is now often built into the various web frameworks.
DataFrames with common columns are such a natural match for relational tables
that it seems a way to call SQL would be (OK, in my opinion, should be) part
of pandas (I think this is the case for R data frames).

I also just really enjoy writing SQL. To some extent, this is a personal
preference - some people just mentally line up with certain ways of thinking.
There are things that are clearly better to do in python, things that are
clearly better to do in SQL, and then a grey area. I personally lean much
farther to the SQL side of that grey area, but there's certainly nothing
incorrect about going the other direction.

------
mongol
Python + SQL seems like a good match for many analysis problems, but Excel +
SQL is not bad either. I like the ability to combine complex SQL views or SQL
functions with Excel pivot functionality, querying the database directly from
Excel.

~~~
merlincorey
Maybe I've not fully understood pivot tables (quite possible), but this
example[0] seems to support my understanding... aren't pivot tables basically
just aggregate functions such as `sum` and `avg` applied to a window of the
table data (pivoted by rows -> columns)?

Some of the executives I work with like to relate all their work in Excel and
they just love pivot tables. I showed one of them a table of data on a web
page backed by a database, and they asked if I could export the data to Excel
to make a pivot table that would then be displayed on the web page. Of course,
I implemented their pivot table as queries against the database and created a
new view to display it.

I've never understood why I would need Excel to make Pivot tables when I
already have SQL.

[0] [https://blogs.msdn.microsoft.com/spike/2009/03/03/pivot-tabl...](https://blogs.msdn.microsoft.com/spike/2009/03/03/pivot-tables-in-sql-server-a-simple-sample/)
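
Roughly, what I mean is that a basic pivot is just grouping plus conditional
aggregation; a sketch (table and columns invented, `cur` is an open DB cursor):

    query = """
        SELECT region,
               SUM(CASE WHEN year = 2018 THEN sales ELSE 0 END) AS sales_2018,
               SUM(CASE WHEN year = 2019 THEN sales ELSE 0 END) AS sales_2019,
               AVG(sales) AS avg_sales
        FROM orders
        GROUP BY region
    """
    cur.execute(query)
    pivot = cur.fetchall()  # one row per region, with the years pivoted into columns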

~~~
bshipp
You're absolutely right. As someone coming into Python and SQL from a purely
Excel background, the first realization I had was that I had been, in essence,
programming Excel. Not in a rational VBA kind of way, but in how I was linking
all of my cells (and sheets and books) and then using pivot tables, etc., to
extend Excel beyond a single column/row perspective. Microsoft PowerPivot (or
whatever it used to be called) was my gateway drug into the world of SQL.

I can generate similar utilities to Excel/PivotTables using Plotly Dash or
pandas or even DataTables, but even after a few years of learning and
practising, building an equivalent Excel pivot and chart and getting it
displayed in front of a group takes me less than a tenth of the time.

I'm not saying it's better, just that those execs are so comfortable with that
visual-feedback Excel approach that it'll be an uphill battle to convert them
to a programmatic one. Pivot tables provide a far more rapid means of
filtering, modifying output, and aggregating information than is otherwise
possible inside Excel. To those coming from a two-dimensional Excel
spreadsheet world, discovering pivot tables is like viewing the world in 3D.

I don't see why you would go to Excel pivot tables from SQL; you already have
a more powerful tool at your disposal, if you're comfortable with it. Going
from Excel to SQL? That hurdle is a bit higher.
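
For the programmatic route, the rough pandas equivalent of an Excel pivot is
pandas.pivot_table; a minimal sketch with invented data:

    import pandas as pd

    df = pd.DataFrame({
        "region": ["east", "east", "west", "west"],
        "product": ["a", "b", "a", "b"],
        "sales": [100, 250, 80, 120],
    })

    # Rows become the index, one column fans out into new columns, and the
    # values are aggregated: the same mental model as an Excel pivot.
    pivot = pd.pivot_table(df, index="region", columns="product",
                           values="sales", aggfunc="sum")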

~~~
iagovar
Well, because Excel is quick. I know R and SQL, and I still use Excel a lot.

If I want a quick overview of a problem: query the DB from Excel -> PowerQuery
-> PivotTables. It's easy to use, it's very easy to find PowerQuery solutions
on Google, and it just feels slick.

If I need in-depth analysis, that's when I switch to programming.

------
danso
Tangential question: I'm curious how many people (here on HN, or in general)
learned SQL before they learned more traditional programming (e.g. Python,
Java, C)? I learned traditional programming (through college) and only
stumbled upon SQL years later (someone left a "How to use Microsoft Access"
book around at work).

I absolutely love SQL, and am in the middle of writing a book on how to use
just SQLite for large, complicated data work. My target audience is
programmers, but also non-programmers, because I think SQL is vastly easier to
learn than something like Python. The tradeoff of course is that SQL is much
more limited a framework, but it's more than enough to do data work (all the
stuff that isn't visualization or complex stats).

When I lived around Palo Alto, I met a few recent grads who didn't study STEM
in college, but ended up working for tech firms in content and analytics
roles. They'd know Excel and Tableau, but through co-workers, would pick up
SQL to make their data work easier. But they hadn't yet learned general
programming, or tried Python/JavaScript.

But I can't imagine many other professional or academic career paths in which
someone who ends up learning and using both Python and SQL started out with
SQL, then learned Python (i.e. general programming). What would they have
done with SQL in school, or as a hobby?

~~~
Whut
I'm a data engineer and I learned SQL before I learned python. I started out
as a marketing specialist (mostly working in Google AdWords), then I graduated
up to BI Analyst, and eventually to data engineer. I've picked up 100% of my
python in my current job.

~~~
danso
What did you think of Python, and was it difficult to grasp the differences
between the two languages/paradigms? I ask because, for myself, any general
language I pick up (Ruby, Python, JS, R), I have a good idea of underlying
concepts like memory pointers and garbage collection, even if it's all
abstracted by the language. With SQL, I have an incredible ignorance of the
most basic programmatic concepts, like how to define variables or custom
functions. The deepest underlying concept I have of SQL is that it has a query
planner that does all the thinking for me.

~~~
Whut
Yeah, the learning curve was pretty steep for me. I've been learning some C#
for my job, and it's been a lot easier after learning basic Python. SQL is
kind of weird in the sense that starting out is very very very easy, but
mastering it is incredibly hard. I've been writing SQL every day for about 3
years and I'm still blown away by what the SQL experts at my company can do
with the language. I see presentations at conferences that make me feel like
an absolute beginner.

------
oarabbus_
SQL+Python is extremely powerful, and the author makes some good points
(multivariate regression should be done in Python and not SQL, for example),
but the query example in the blog is not a good one. Every modern DB has
aggregate and statistical functions like ntile, percent_rank, median, min,
max, etc. I'd honestly rather run these functions against the database than do
it in Python, especially if you're working with billion+ row datasets. In
fact, if all I'm doing is these kinds of statistical/aggregate functions, then
I definitely wouldn't use Python at all.
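
A sketch of what that looks like in practice: push the window functions into
the query and only pull the result back (table and columns invented, `conn` is
an open connection):

    import pandas as pd

    query = """
        SELECT user_id,
               revenue,
               NTILE(4)       OVER (ORDER BY revenue) AS revenue_quartile,
               PERCENT_RANK() OVER (ORDER BY revenue) AS revenue_pct_rank
        FROM purchases
    """
    # The heavy lifting happens in the database; pandas only receives the result.
    ranked = pd.read_sql(query, conn)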

~~~
bshipp
I wonder if people time the difference between running a function in the DB
vs. in Python? Unless your queries are so small (like individual records) that
a round trip to the DB is dominated by network lag, I find it's almost always
faster to ask the DB to do as much of the lifting as possible.

The downside, of course, is that your code potentially becomes unreadable,
harder to migrate between database backends, and more difficult to debug.
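
A quick-and-dirty way to check for a given workload (made-up query, `conn` is
an open connection, and the usual caveats about warm caches apply):

    import time
    import pandas as pd

    t0 = time.perf_counter()
    pd.read_sql("SELECT category, AVG(price) FROM orders GROUP BY category", conn)
    print("in-database:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    pd.read_sql("SELECT category, price FROM orders", conn).groupby("category").mean()
    print("in-pandas:", time.perf_counter() - t0)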

~~~
teej
Pushing computation back into the database has been a major trend in the last
few years. Tools like DBT have made that transition easy in terms of clean
code and testable SQL.

------
bshipp
One thing that hasn't been mentioned yet, that I found when I started using
Python and SQL together, is the importance of well-considered indices. ORMs
aren't always too clever when they generate an index, and often (for repeated
queries) you can dramatically cut down the processing time by thoughtfully
generating an index.

If I know I'm going to be asking the database a certain question a bunch of
times I'll even generate a temporary index to speed up my analysis, and then
delete it when I'm done. No sense running a SQL query for an hour if I can cut
it down to 5 minutes with an index on the target column.
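
Something like this, as a sketch (index, table, and column names are
placeholders; on PostgreSQL you may prefer CREATE INDEX CONCURRENTLY so you
don't block writers):

    cur.execute("CREATE INDEX tmp_orders_customer_idx ON orders (customer_id)")
    conn.commit()

    # ... run the repeated analysis queries that filter or join on customer_id ...

    cur.execute("DROP INDEX tmp_orders_customer_idx")
    conn.commit()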

EDIT: I know this only applies to PostgreSQL, and there are numerous
alternatives, but a big shout-out to pghero which helped me identify a bunch
of duplicate and missing indexes in one of my databases, saving both time and
hard drive space. Incidentally, it also sent me down the road to learn (but
definitely not master) index optimization.
[https://github.com/ankane/pghero](https://github.com/ankane/pghero)

~~~
alexhutcheson
I recommend "Use the Index, Luke"[1] for people who are relatively new to SQL
performance issues and want to learn how to make their queries fast.

[1] [https://use-the-index-luke.com/](https://use-the-index-luke.com/)

------
DannyB2
uMatrix prevents it from loading.

Against my better judgement, I turned off uMatrix and loaded the page. Then
had a look at uMatrix.

Wow, I think we have a winner for a site with the most scripts, most 3rd party
domains, and the sheer number of XHRs.

Sorry for this being off topic.

~~~
eximius
Do you have a sane setup you can share with uMatrix? Last time I looked at it,
I was a bit overwhelmed and shelved it for later.

~~~
arminiusreturns
So, I don't know about setting rules beforehand that work well universally,
but what I personally think is the better method is to just slowly build up
your ruleset as you use it every day. I don't think many people know how to do
this, so here it is:

1) On a website you frequent, enable one root at a time until the site
functions properly (or you can do the more granular per-grid-block enable).

2) Click uMatrix, then click at the top where it says "uMatrix $ver", which
will take you to the dashboard.

3) In the dashboard, click on "my rules".

4) On the right, click the "commit" button.

What this does is commit your changes to the permanent ruleset, and websites
you frequent will start to "just work". A word of warning: if you blindly
commit after browsing for a while, when you may have done some temporary
allows on random websites, those will be committed too. So I suggest either
reviewing the commits first (always a good idea) and deselecting the ones you
don't want, or having a new session for each of your frequented websites that
you then commit from.

I have been planning to write up a tutorial on things like this for family and
friends; maybe I'll post it to Show HN.

~~~
DannyB2
When I visit a site that doesn't work, I usually carefully whitelist things
that I am willing to whitelist until it does work. If I can't get it working,
then in most cases I don't consider the content of that site valuable enough
to whitelist things that seem sketchy.

In the case of this site, uMatrix just flat-out blocked the entire site. It's
not that the page didn't load, or that nothing happened because of, say,
missing JavaScript. I got a huge warning page in Firefox saying that uMatrix
was flat-out blocking it. Probably because the primary site is already
blacklisted?

~~~
arminiusreturns
Yes some sites will be completely blacklisted, and you have to manually
whitelist them if you trust them enough to try them. I usually just skip them
if they do that though.

------
csours
I had a problem where there was some fixed-width data in a SQL database -
basically someone had put mainframe data in a database.

There were 3 fields that had multiple entries in them, fixed width delimited.
I had to split the fields by width and re-combine them, then also recombine
them with another set of data with weird delimiters and rules.

It took a day and a half (not full time) to figure it out in Python. I can't
even imagine tackling the problem in a non-REPL language.

~~~
Doxin
Turns out a good database is _really_ good at data munging. A solution for
PostgreSQL might look something like this:

    CREATE VIEW my_table_improved AS
    SELECT SUBSTRING(the_column, 1, 8)  AS col1,
           SUBSTRING(the_column, 9, 8)  AS col2,
           SUBSTRING(the_column, 17, 8) AS col3
    FROM my_table;

After which you can query my_table_improved as a normal table. col1, col2, and
col3 contain the data split out from the_column. In practice you'd probably
also want to do some type conversion, e.g. if col1 is supposed to be an
integer you can simply update the view to select `CAST(SUBSTRING(the_column,
1, 8) AS INTEGER) AS col1` instead. In production use you might find this to
be slow, at which point you will want to create indexes on your new columns.
Something like this (adjusted to whichever queries you're running, of course)
should work:

    CREATE INDEX ON my_table ((SUBSTRING(the_column, 1, 8)));

Of course this is a lot of work if you're unfamiliar with SQL, and the above
examples aren't quite complete for your use case, but it should give you an
idea of how SQL is exactly the right tool for the job here. Which is somewhat
the point of many commenters here: get yourself familiar with SQL and save
yourself a metric tonne of work in the future.

An alternate approach might be to write a query that migrates the fixed-width
format into a format where each entry is in its own column. The ease of this
mostly depends on whether applications depend on that column staying in the
original format.

Side note: the above SQL is untested but should be roughly correct.

~~~
csours
If it was just one column, that would be easy enough. Or if one object was on
one row, it would be easy enough. In my case, up to 6 rows could be required
to represent one object, and I had to slice 3 columns with an arbitrary number
of slices.

Oh, and there were two types of sub-record per object, and they had to be
processed in database order.

~~~
Doxin
Well, as long as you can build a query to get the data in the right format
(which you almost inevitably can) you can make a view out of it. But honestly
the true solution here would be to migrate away from such a brain damaged
format.

------
gamesbrainiac
If you are working a lot with databases and would like to automate some of the
workflow in Python, I'd highly recommend Pony ORM, which really feels like
LINQ for Python and stays very close to the original SQL.
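
A minimal sketch of what that looks like, assuming Pony's generator-expression
query syntax (the entity and fields are made up):

    from pony.orm import Database, Required, db_session, select

    db = Database()

    class Product(db.Entity):
        name = Required(str)
        price = Required(float)

    db.bind(provider="sqlite", filename=":memory:")
    db.generate_mapping(create_tables=True)

    with db_session:
        Product(name="widget", price=9.99)
        # Pony translates the generator expression into a single SQL query.
        cheap = select(p for p in Product if p.price < 20)[:]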

~~~
whalesalad
Pony is the LAST product I’d ever recommend to anyone, ever. It’s probably the
worst open source software I’ve ever used.

That being said, thanks to Pony I was called in to fix a shitshow and it
likely earned me an additional 25% in consulting fees... so maybe I do like it
in that regard.

But in all seriousness, stay far away.

~~~
gamesbrainiac
I suggest Pony because it would be quite good for people who have a deep
understanding of SQL; the translation is quite nice.

The alternative, SQLAlchemy, is really overkill for most things.

~~~
whalesalad
SQLAlchemy is a buzzkill, but it's not as heinous as Pony, which is a
straight-up ticking time bomb in your project. Reading the source code for 2
minutes should be enough to make you shy away.

As far as lightweight Python ORMs, I really enjoy Peewee and it’s usually my
first choice.

~~~
gamesbrainiac
I tried Peewee, but it does not have support for SQL Server, which I need to
work with on a daily basis.

------
fsloth
Is 'analyst' an actual job title or just one skill required in a job? What
does an 'analyst' do?

~~~
teej
This article is talking about a role typically called "data analyst". They
help product managers, marketing, operations, etc. by running reports and
building dashboards (business intelligence), building data models, running
exploratory analyses, analyzing A/B tests (product analytics), or doing things
like building marketing attribution models and time-series forecasting.

It's the job that's 80% of what a data scientist does, but without Python or
R. But mostly it depends, because no one can decide what these roles should be
called or what their responsibilities should be.

~~~
noobiemcfoob
A Data Analyst is a Data Scientist who can't prove Bayes Theorem.

------
RocketSyntax
What, that psycopg2 is busted?

