I've seen tons of (often non-reproducible) code written in place of a simple SQL query. I really wish bootcamps and other learning platforms focused on SQL a bit more.
(I am a Python data analyst who properly learned SQL only after several years in the industry.)
Asking for what you want is considerably faster than saying how you want it done.
I understand this use case, but this is in actual application code!
Forget Da Vinci, the first man to be cloned should be Hadley Wickham
For a server reading from and writing to a database, you paginate your queries. On Postgres that would mean using "limit" and doing small bulk inserts, but under Oracle your options for inserting are limited, so there is still value in cursors.
There is the odd occasion where you'll iterate through the entire result set, do something fast for each and every row, and only get something useful at the end. Those are classic problems where cursors are more efficient than the alternatives. But even then, once you add error handling and recovery, pagination may yield better results.
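A rough sketch of the keyset-pagination approach, using sqlite3 only so it's self-contained; the "events" table, its columns, and the page size are all made-up placeholders:

import sqlite3

conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)")

PAGE_SIZE = 1000
last_id = 0
while True:
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, PAGE_SIZE),
    ).fetchall()
    if not rows:
        break
    for row_id, payload in rows:
        pass  # do something fast with each row
    last_id = rows[-1][0]  # resume after the last key seen

Keyset pagination (WHERE id > last seen) tends to hold up better than OFFSET as the table grows, since OFFSET still has to skip over all the earlier rows.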
That entirely depends on the DB. Oracle for example always executes with a server-side cursor; all declaring it does is give you a handle to what it was doing anyway.
Without disagreeing with any of the above, one important consideration is what you're going to do with the query. If all you want to know is a column's mean or some other simplified statistical value, there's really no sense in pulling all the data into Python just to calculate it. Do it inside the DB itself with SQL.
On the other hand, if you need that data to do other work (i.e. populate the table in a webpage, or generate a new descriptive data set or whatever), then the trade-off for pulling it into Python/pandas and running a mean in addition to the other work becomes much smaller.
My approach is usually to do as much data filtering and parsing as possible inside SQL, but things like complex parsing and string manipulation (especially!) I'll do with Python. I can do some simple string work in SQL, but I can almost always do it faster and cleaner in Python.
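As a concrete version of the "do the aggregate in the database" point above, a minimal sketch assuming a hypothetical "orders" table with an "amount" column:

import sqlite3
import pandas as pd

conn = sqlite3.connect("example.db")

# Let the database do the work: one tiny row comes back over the wire.
mean_in_db = pd.read_sql_query("SELECT AVG(amount) AS mean_amount FROM orders", conn)

# Versus pulling every row into Python just to compute the same number.
all_rows = pd.read_sql_query("SELECT amount FROM orders", conn)
mean_in_python = all_rows["amount"].mean()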
But I do know that if you didn't want a 20,000 row response to your query to be HTML-ized, transmitted, and then displayed (sometimes over a 56kbps modem), you used cursors in the NSAPI.
I have no idea about whether that was a good design decision at the time, and even less idea now, but I kind of incorporated the practice, and didn't know it was "bad".
It is relevant for large web-API end-point responses (pagination) even now, no?
So! Why is that bad? :-)
EDIT: Please don't be snarky - I spend more time figuring out clock-skew on high-frequency mixed-signal boards than I do talking to a database, and I'd like to learn to be better at the latter.
With an in-memory database or running it on an NVMe drive you can get some ridiculous performance out of Sqlite using APSW cursors.
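For anyone curious, the shape of that is roughly this (a sketch, not a benchmark; apsw is a third-party SQLite wrapper, and the table is a throwaway example):

import apsw

con = apsw.Connection(":memory:")  # in-memory database
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])
# APSW cursors are iterable, so you can stream results without building a list.
total = next(cur.execute("SELECT SUM(x) FROM t"))[0]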
I know this is not the point of this article, but as data analyst/scientist roles continue to climb in popularity, I'm curious if there won't be a similar trend with Python.
In general, it doesn't matter what you're doing, your basic design patterns need to fit around how a database works. If you don't know this, you'll hit scalability issues far too soon, and it'll take too long to fix them.
> Optimize your SQL query performance after you've released and proven that it's an actual bottleneck.
I've seen one project fail because the design patterns around using the ORM were incorrect. Then I joined another project where the bottleneck (from incorrect use of the ORM) was so bad the product couldn't scale beyond being a demo. It took 2 months to refactor, all because one of the programmers used an ORM incorrectly to save a few hours at the beginning.
No need to check the SQL logs directly.
First thing I learned about optimising Django was to check django-debug-toolbar to see how many queries were being generated per page. This is fairly common knowledge.
Look at Django's select_related. It's one of those things where, if you don't understand what's happening under the hood, you're probably querying way more than you should be.
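The classic illustration, with made-up Author/Book models (a sketch; assumes they live in a normal Django app):

from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)

class Book(models.Model):
    title = models.CharField(max_length=200)
    author = models.ForeignKey(Author, on_delete=models.CASCADE)

# One extra query per book to fetch its author -- fine on 10 rows, painful on a million.
for book in Book.objects.all():
    print(book.author.name)

# One query with a JOIN instead.
for book in Book.objects.select_related("author"):
    print(book.author.name)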
It's even more important if your production application has millions of rows. Too easy to create a system that runs perfectly on 10 rows but will crash your production server as soon as you deploy it. Forgetting to create an index on the FKs is a classic one.
Eventually I was forced to pop open the hood and was horrified to find this spaghetti bowl of nested, duplicate queries that took a fair bit of work to simplify and optimize. I was not so lucky as to have a DBA I could dump my problems on and was forced to learn that lesson the hard way.
I'm just building out a Django app now and using select_related, or rather prefetch_related, for retrieving tags (m2m relationship). Seems to work well so far, but I'm sure I'll run into a similar need to optimize all these queries soon.
Once my database hit 100GB and a few hundred million rows I had no choice but to sit down and actually learn what each of my ORM commands was asking my database to do. Sometimes I removed a select_related. Sometimes I replaced it with prefetch_related. Sometimes I eliminated an entire filter operation or moved it elsewhere. A few times I injected a greatly simplified raw SQL query instead of relying on complex ORM generated SQL. In four instances I replaced expensive join operations with periodically rebuilt "materialized views" to reduce CPU usage and DB I/O. All was timed with django-debug-toolbar and/or pghero to minimize database impacts and network congestion.
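For anyone who hasn't used them, a periodically rebuilt materialized view is roughly this (a sketch with psycopg2; the connection string and table names are invented):

import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # Build the expensive join once...
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS order_totals AS
        SELECT c.id AS customer_id, SUM(o.amount) AS total
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id
    """)
    # ...and refresh it from a periodic job (cron, Celery) instead of on every request.
    cur.execute("REFRESH MATERIALIZED VIEW order_totals")

One caveat: a plain REFRESH locks out reads while it runs, so REFRESH ... CONCURRENTLY (which needs a unique index on the view) is worth a look once the view gets large.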
So select_related was sneaky in the sense that I thought I had solved the problem very early on, when I had merely delayed it until much later. If your database always remains small you'll likely never encounter this issue.
The solution is to not fire and forget the application, but to install something like django-debug-toolbar and monitor what your program is doing as the database grows in size. But for heaven's sake, don't worry about that problem today. Get your app working so you can make money. Once it's done, however, remember that your ORM has put a thick collar on your new puppy, and as it grows you'll need to expand that collar or you'll slowly strangle your pet.
Not necessarily. It's the specifics of the queries not the number of rows.
Which is why you measure before you optimize.
In addition, the documentation does a pretty good job of highlighting some of the common gotchas. Unlike your environment, my page/js weight is very low but I'm querying against a few hundred million records joined across many tables. Even using materialized views to eliminate the impact of joins in postgres, it's required a fairly delicate touch to make the delay for page loads tolerable.
In that respect I would likely redo the project in flask and sqlalchemy, if only because then I wouldn't have to remember the syntax nuances of two separate ORMs. They're similar, but not identical, and it's infuriating at times. Plus I'm very comfortable dipping down into raw SQL in sqlalchemy, and it hasn't been as intuitive for me with Django.
For clustered work, I've found that Spark sql data frames give a lot of the same functionality (not quite all, though I think that's because there are some pandas operations that require a full in-memory dataframe and don't lend themselves to distributed solutions).
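A small sketch of what that looks like in practice (assumes a local Spark install; the data and names are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sketch").getOrCreate()

df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["key", "value"])

# The same groupby/mean you'd reach for in pandas, but distributed:
df.groupBy("key").agg(F.avg("value").alias("mean_value")).show()

# Or register it and write plain SQL against it:
df.createOrReplaceTempView("t")
spark.sql("SELECT key, AVG(value) AS mean_value FROM t GROUP BY key").show()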
There have been so many attempts to replace SQL with a different relational-like language; the end result is a new syntax that doesn't work if you try to pull your queries out and run them against a database independently. I'm going on a tangent in a footnote here, but I remember reading "your data will outlast your application." I personally strive for a database that remains usable outside the context of the application it was originally created to support. Trust me, eventually someone will want a set of reports that would (in many cases) be far easier to write as queries if you'd made sure your back-end database was a properly designed relational database.
This is one of those technologies that I was hoping would make its way into the framework, kind of how connection pooling for databases was once an external module but is now often built into the various web frameworks. DataFrames with common columns are such a natural match to relational tables that it seems that a way to call SQL would be (ok, in my opinion, should be) part of pandas (I think that this is the case for R data frames).
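For what it's worth, you can get most of the way there today by round-tripping through an in-memory SQLite database, which is roughly what packages like pandasql (or sqldf in R) do under the hood; a sketch with made-up data:

import sqlite3
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", "east"], "amount": [10, 20, 5]})

conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False)  # push the frame into SQLite
result = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", conn
)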
I also just really enjoy writing SQL. To some extent, this is a personal preference - some people just mentally line up with certain ways of thinking. There are things that are clearly better to do in python, things that are clearly better to do in SQL, and then a grey area. I personally lean much farther to the SQL side of that grey area, but there's certainly nothing incorrect about going the other direction.
But as a sketchpad for handling a problem? Excel is pretty hard to beat.
I'd recommend trying J. Download here - https://code.jsoftware.com/wiki/System/Installation#Installa... . There is, of course, the famous barrier to entry, but the result, unsurprisingly, is also respectable.
Some of the executives I work with like to do all their work in Excel and they just love pivot tables.
I showed one of them the output of a table of data on web page backed by a database and they asked if I could export the data to Excel to make a pivot table that would then be displayed on the web page. Of course, I implemented their pivot table as queries against the database and created a new view to display it.
I've never understood why I would need Excel to make Pivot tables when I already have SQL.
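Most of what a basic pivot table does is just conditional aggregation; a sketch against a hypothetical "sales" table:

import sqlite3

conn = sqlite3.connect("example.db")
rows = conn.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2,
           SUM(CASE WHEN quarter = 'Q3' THEN amount ELSE 0 END) AS q3,
           SUM(CASE WHEN quarter = 'Q4' THEN amount ELSE 0 END) AS q4
    FROM sales
    GROUP BY region
""").fetchall()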
I can generate similar utilities to Excel/PivotTables using Plotly Dash or Pandas or even DataTables, but even after a few years of learning and practising it would take me less than 1/10th the time to generate an equivalent excel pivot and chart and have it displayed in front of a group.
I'm not saying it's better, just that those execs are so comfortable with that visual-feedback Excel approach that it'll be an uphill battle to convert them to a programmatic one. Pivot tables provide a far more rapid means of filtering, modifying output, and aggregating information than is otherwise possible inside Excel. To those coming from a two-dimensional Excel spreadsheet world, discovering pivot tables is like viewing the world in 3D.
I don't see why you would go to Excel pivot tables from SQL; you already have a more powerful tool at your disposal, if you're comfortable with it. Going from Excel to SQL? That hurdle is a bit higher.
If I want a quick overview of a problem, query db from Excel -> PowerQuery -> PivotTables. It's easy to use, it's very easy to find solutions in google for PowerQuery, and it just feels slick.
If I need in depth analysis then is when I switch to programming.
If the data allows that, then a data export (usually requiring a bunch of aggregation and preprocessing) to a format suitable for excel pivot tables or some of the many business analytics tools often is quite useful to users who want to analyze the data.
However, it should be expected that a side-effect of such ad-hoc data analysis often is specific reports that become understood to be useful, and which can then be re-implemented "properly" i.e. as a sql query/report that gets run in an automated, tested, reliable way and delivered where needed without that manual analysis step.
I absolutely love SQL, and am in the middle of writing a book on how to use just SQLite for large, complicated data work. My target audience is programmers, but also non-programmers, because I think SQL is vastly easier to learn than something like Python. The tradeoff of course is that SQL is much more limited a framework, but it's more than enough to do data work (all the stuff that isn't visualization or complex stats).
But I can't imagine many other professional or academic career paths in which someone who ends up learning and using both Python and SQL started out with SQL, then learned Python (i.e. general programming). What would they have done with SQL in school, or as a hobby?
I've forgotten some of the more advanced SQL that I've written over the years. I look back at SQL I've written as if I'm reading someone else's work.
I enjoy working with postgres but miss the temporary table workflow in sybase.
The downside, of course, is that your code potentially becomes unreadable, harder to migrate between database backends, and more difficult to debug.
If I know I'm going to be asking the database a certain question a bunch of times I'll even generate a temporary index to speed up my analysis, and then delete it when I'm done. No sense running a SQL query for an hour if I can cut it down to 5 minutes with an index on the target column.
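The throwaway-index workflow, sketched with sqlite3 and made-up table/column names:

import sqlite3

conn = sqlite3.connect("example.db")

# Build an index just for this batch of analysis queries...
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_user ON events (user_id)")

for user_id in range(1000):
    count = conn.execute(
        "SELECT COUNT(*) FROM events WHERE user_id = ?", (user_id,)
    ).fetchone()[0]

# ...then drop it once the analysis is done to reclaim the space.
conn.execute("DROP INDEX idx_events_user")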
EDIT: I know this only applies to PostgreSQL, and there are numerous alternatives, but a big shout-out to pghero which helped me identify a bunch of duplicate and missing indexes in one of my databases, saving both time and hard drive space. Incidentally, it also sent me down the road to learn (but definitely not master) index optimization.
Against my better judgement, I turned off uMatrix and loaded the page. Then had a look at uMatrix.
Wow, I think we have a winner for a site with the most scripts, most 3rd party domains, and the sheer number of XHRs.
Sorry for this being off topic.
1) Upon using a website you frequent, enable one root at a time until it functions properly (or you can do the more granular per-grid-block enable).
2) Click uMatrix, then click at the top where it says "uMatrix $ver", which will take you to the dashboard.
3) In the dashboard, click on "My rules".
4) On the right, click the "Commit" button.
What this does is commit your changes to the permanent ruleset, and websites you frequent will start to "just work". A word of warning: if you blindly commit after browsing for a while, when you may have done some temporary allows on random websites, those will be committed too. So I suggest either reviewing the commits first (always a good idea) and deselecting the ones you don't want, or having a new session for each of your frequented websites that you then commit from.
I have been planning to write up a tutorial on things like this for family and friends, maybe I'll post it to show hn.
Most sites will still mostly work, and the few that don't usually just need a CDN to be enabled (you can do this per-site).
There were 3 fields that had multiple entries in them, fixed width delimited. I had to split the fields by width and re-combine them, then also recombine them with another set of data with weird delimiters and rules.
It took a day and half (not full time) to figure it out in python. I can't even imagine tackling the problem in a non-repl language.
CREATE VIEW my_table_improved AS
SELECT SUBSTRING(the_column, 1, 8) AS col1,
       SUBSTRING(the_column, 9, 8) AS col2,
       SUBSTRING(the_column, 17, 8) AS col3
FROM my_table;

CREATE INDEX ON my_table ((SUBSTRING(the_column, 1, 8)));
An alternate approach might be to write a query that migrates the fixed-width format to a format where each entry is in their own column. The ease of this mostly depends on if applications depend on that column being in that format.
side note: above sql code is untested but should be roughly correct.
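A sketch of that alternate migration approach, equally untested, using SQLite syntax and the same made-up names; only sensible if nothing still depends on the packed column:

import sqlite3

conn = sqlite3.connect("example.db")
conn.executescript("""
    ALTER TABLE my_table ADD COLUMN col1 TEXT;
    ALTER TABLE my_table ADD COLUMN col2 TEXT;
    ALTER TABLE my_table ADD COLUMN col3 TEXT;
    UPDATE my_table SET
        col1 = SUBSTR(the_column, 1, 8),
        col2 = SUBSTR(the_column, 9, 8),
        col3 = SUBSTR(the_column, 17, 8);
""")
conn.commit()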
Oh, and there were two types of sub-record per object, and they had to be processed in database order.
That being said, thanks to Pony I was called in to fix a shitshow and it likely earned me an additional 25% in consulting fees... so maybe I do like it in that regard.
But in all seriousness, stay far away.
The alternative, SQLAlchemy, is really overkill for most things.
As far as lightweight Python ORMs, I really enjoy Peewee and it’s usually my first choice.
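For anyone who hasn't tried it, a minimal Peewee sketch (the model is a throwaway example):

from peewee import SqliteDatabase, Model, CharField

db = SqliteDatabase(":memory:")

class Person(Model):
    name = CharField()

    class Meta:
        database = db

db.connect()
db.create_tables([Person])
Person.create(name="Ada")
for person in Person.select().where(Person.name == "Ada"):
    print(person.name)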
It’s the job that’s 80% of what a data scientist does but without Python or R. But mostly it depends because no one can decide what these roles should be called or what their responsibilities should be.
On your last point, I don't think that will ever happen. There is a lot of benefit in your role being ambiguous.