Hacker News new | past | comments | ask | show | jobs | submit login
Advanced SQL and database books and resources (neilwithdata.com)
440 points by eatonphil on May 11, 2020 | hide | past | favorite | 47 comments

If you are an application programmer needing to query the database, I wholeheartedly recommend "SQL Performance Explained" by Markus Winand, or his website: https://use-the-index-luke.com/

that was eye-opening indeed. Before reading his blog, I had no clue how indexes worked and how to write SQL that match the index structure (for instance the order of conditions in the where clause matter if you want to leverage a multi-column index).

> for instance the order of conditions in the where clause matter if you want to leverage a multi-column index

Ups, if this is you take away, then I've done something wrong.

Let me correct that: The order of columns in an index matters, not the order of conditions in the where clause.

This is the best part of HN.

Not only you did a comment on previous post, but it's a comment from the author of an awesome guide [https://use-the-index-luke.com].

Markus, I can assure you, that your guide has saved a lot of machine-computed-hours over the world with less energy consumption/wasted. Great work. :)

Ah thanks for the clarification !

This book has easily been one of the most influential on my career. Having excellent SQL skills has been my secret weapon for a while now.

And IIRC this holds MongoDB and I'd assume other non-SQL DBs.

If you have an index on <companyId, userId>, and you query with a userID only, that index wont be used. But if the index was <userId, companyId> then that index would be used. Or if you supplied both userId and companyId in your query, then either index would work.

And yes, this occasionally means that adding a

    where id in (select id from company)
sometimes will switch your query from doing a full table scan to using an index, fixing your problem for long enough to prepare a fix to add the appropriate index. Not that I've ever had to do something like that or anything.

> If you have an index on <companyId, userId>, and you query with a userID only, that index wont be used.

Surprisingly, it could be used in some circumstances, just not for the regular seek.

If the index is small (compared to the base table), the DBMS may decide to perform a full index scan (instead of the full table scan), especially if your SELECT list doesn't contain columns which are not in the index.

And Oracle can employ so called "skip scan" if it realizes that the number of distinct companies is small. This is essentially a separate seek under each distinct company.

I think you just need to realize that an index on <companyId, userId> is a single index

Ha-ha, man, that is a gem comment I am getting your book

I have also enjoyed his blog. Here's another way to understand indexes by implementing them in a simple SQL database. This includes both storing/retrieving indexed values and also deciding whether or not a query has an applicable index.


The PostgreSQL exercises web [1] is an awesome intro to basic and medium level topics.

It has been discussed previously in HN [2].

[1] https://pgexercises.com/

[2] https://news.ycombinator.com/item?id=12022953

I don't see Celko books


they're a must have resource for advanced SQL.

Do you suggest any one in particular? I've been daunted by their size.

SQL for Smarties tends to have a nice coverage of common problems encountered and how to resolve them.

That book was great and I learned a huge amount by reading it. But, unless it's been updated (I read it years ago but it's also an older book), there were numerous typos and even some code errors.


I couldn't dislike the marketing of "The Art of PostgreSQL" more ...

A website that looks like a SaaS.

Hugely overpriced paperback version.

No book summary available.

A free chapter banner that doesn't link anywhere.

No possibility to buy on different stores.

I found the book apprisal somewhere on Reddit in "advanced books on postgres". I expected to find postgres optimization and at least a few pages on how to scale it (or where to look for solutions), instead I was dissapointed.

While it's still a very decent book on SQL syntaxes and advanced queries (I bet even 10y+ PostgreSQL users don't know half of what PostgreSQL can do) here is why I don't recommend it:

- poorly written instructions on how to set up Postgres 11 with all those crazy extensions

- No epub version (it's simply broken and nobody responds on my emails, instead I get a reminder to continue my learning course which has 2 links, a short intro and a link to the book. I've stopped at 50% just because the book was downloaded from another browser and now I get these irritating emails)

- There is a lot of Python in the book, I want to learn SQL and PostgreSQL, not Python or any other sort of languages. I use Py professionally but it does not mean I want it to be tied to SQL.

- There are unrelated interviews, for instance, a guy who worked on Django ORMs (how is this related to my SQL knowledge)

Don't get me wrong, don't treat this comment as a rant about how bad the book is, the SQL tricks are really awesome there, however if you think about buying this book, go for 49$ version and expect to have some pain to setup everything.

I am like 50/50 on asking for refund

Fully agree with you. I bought it, just to discover, that it is somehow just a simple introduction to SQL. Also what I found irritating: Nowhere on the marketing page of the book I found a reference to the PostgreSQL-version for which it was written or applicable. Most probably because it is only very loosely coupled with PostgreSQL.

I bought it through work and it is very missable to be honest. Very shallow.

If you want something more advanced and fundamental, I'd recommend classic Database System Implementation[0] by Hector Garcia-Molina; Jennifer D. Widom; Jeffrey D. Ullman. It's still relevant. Technologies change, but fundamental principles and algorithms... change order of magnitude slower.

And yes, it's that Ullman.

0. https://www.amazon.com/dp/0130402648

Rob Conery's A Curious Moon (https://bigmachine.io/products/a-curious-moon/) is an interesting read for a PostgreSQL book. I wouldn't say it tries to be an advanced book by design, but I learned a lot of modern SQL from it.

I think the number of people needing truly advanced books is small whereas a lot of people need to jump from beginners to intermediate, or intermediate to advanced.

Advanced PostgreSQL and NASA data ... sold. Shut up and take my $30. Instabuy :-)

I have long thought that the best training for database is a course on set theory and group theory, coupled with a close reading of William Kent's "Data and Reality".

The book is a mess, but its structure illustrates its several points.

Cmu course is a must watch really

OMG yes! Andy Pavlo is such a good teacher. I watched his entire lecture series, and it really helped me prepare for an interview. I couldn't recommend his MOOC enough if you really want to know how a database works. I just wish that the homework assignments were available to the public.

Yeah, it's good but somewhat diverged from the practical part unless you are an onsite student.

I wonder what are alternatives (video lectures or textual resources) with more hands-on approach implementing database systems.

Not sure if this is what you are looking for but I found this paper useful on architecting databases. I read it as a complement to the CMU videos. https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf

Otherwise there is surprisingly a lack of information about architecting databases online and you have to go to the source code of SQLite and PostgreSQL. I suspect this is because databases are some of the most complex systems in computer science and involve many trade secrets.

It's awesome, but not for learning how to solve problems with SQL.

Particularly given that the professor is doing his lecture from his bathtub!

It is, but not if you're trying to get better at SQL.

I am a bit biased here in terms of resources, I worked with a few people to create a free resource that covers intro to sql optimization: https://dataschool.com/sql-optimization

I also created a bunch of animations to help people build mental models for joins, aggregations, window functions, and subqueries which I find is the main blocker for people using SQL optimally: https://dataschool.com/how-to-teach-people-sql/

A billion "Up" for The Art of SQL by Faroult

There is something to be said for having an advanced knowledge of SQL and then carefully staying a very long distance from complicated queries. I can't think of a language where the syntax buckles worse than SQL when things get complicated.

Sometimes it makes sense to roll up the sleeves and do unspeakable things with SQL, but that isn't some sort of win.

Having occasionally wrangled with relativley complicated queries and jungles of views ans procedures, I have to ask that what would be a better way to get complicated stuff out from databases?

I can think of two other options, which I personally both dislike:

1. Some fancy graphical ETL monster which takes ages to learn and where the learned skills are more or less untransferable to anywhere else. And which, at the end, makes same things as raw SQL but just in a more opaque way.

2. Build the complexity outside database. Yes, you avoid complex SQL, but you also lose quite a lot. SQL prohibits you from doing quite a many different stupid mistakes with your data. At least 99 times out of 100 databases have better performance making the complicated calculations than your home brewed solution outside the database (Yes, I agree. The exceptions can be notable...) Finally, reusability of the results is way easier if you keep as much calculations as possible in the database.

But I can't say that I love hairy SQL, so if there are better ways, I would be keen to have a look.

Re: point 2. I'm not necessarily advocating it, but one advantage of doing work outside of SQL is testability. Esp. unit tests. It's still relatively difficult to test SQL, for a variety of reason. I'm working with dbt now, which offers some simple validation out of the box (verify such and such is non-null, for example). More complex tests, such as "verify that some rows/columns have some expected values," require writing a query (returning 0 rows on success, 1 or more on failure.) It works, but still rather clumsy because a) requires an actual database (ie, not really unit testing) and b) writing tests this way is pretty tedious and error prone compared to using a typical assertion framework a la python's unittest.

I totally agree the options are all somewhat unpleasant. I've found dbt to be a nice step in the right direction.

Yep, and another is that version control support of databases is still a bit limited. (There are some solutions, at least redgate is doing something for SQL server on this.) I have a workflow where I recreate all views and procedures in the beginning of the query batch from text files and that gives some ways to version control my views, but as you say, it is clumsy.

I've been having a bit of joy using Dataform [] recently to "modularise" SQL code into more manageable views. It brings some of the best practices of software development to the SQL hellscape, making management of complicated SQL a pleasure.

[] https://dataform.co/product

(this is a plug from a happy customer!)

Completely agree with this.

It's a pity that Ingres's QUEL[1] has disappeared; the competition would be good for SQL.

[1] https://en.wikipedia.org/wiki/QUEL_query_languages

I recall seeing some gnarly SQL at British Telecom over 2k lines.

Id agree with you about keeping it as simple as possible

People do dumb things with SQL, but depending on the query/set of columns, 2k is a pretty quick line count to reach. Just dont use select *, have a wide table, do more than a few trivial operations moving data around, and you'll get to 2k pretty quick.

That being said, people do dumb stuff with SQL all the time; The worst query I have ever seen was 11MB, megabytes. The worst query plan I have ever seen generated a plan that was ~130MB. I regularly have people ask questions in #sqlhelp (https://sqlps.io/slack/ SQL Server focused) like "Why can I only have 2100 parameters to a stored procedure?" (dude was super mad that I asked if his design might use some rework.)

I've told SQL learners before

There is no Advanced SQL; there's only "a very large amount of SQL", "exotic SQL", and "SQL no one should have written."

Are you sure it wasn't either:

+ a stored procedure


+ the end-product of an ORM tool?

Really enjoy SQL Antipatterns by Bill Karwin.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact