Hacker News new | past | comments | ask | show | jobs | submit login
Advanced SQL window function quiz (windowfunctions.com)
163 points by andy_boot on Feb 8, 2018 | hide | past | favorite | 79 comments

Why are they recalculating the sum of the weights in the Q0 query?

Their answer:

   select age, sum(weight) as total_weight
   from cats
   group by age
   having sum(weight) > 12;
Which is equivalent to:

   select age, sum(weight) as total_weight
   from cats
   group by age
   having total_weight > 12;

The HAVING clause is evaluated before the SELECT so you have to specify the full clause `having sum(weight) > 12` and not just the alias. Most SQL engines are probably smart enough to not actually recalculate everything.

Yes, SQL engines tend to be much smarter than the SQL languages as such.

Most RDBMSes won't let you use a column alias in the same scope as it is defined (pretty sure the SQL standard itself doesn't specify that it should be allowed). You can get around this by using a CTE.

Also, optimisers aren't stupid, the actual sum() function will not be recalculated.

Author here. - As others have said this SQL Engine makes me write the SQL out the long way. For some reason the shorter version (which I wrote first) is rejected by my postgres client.


Answer from core PostgreSQL dev, Tom Lane

Good find. Thank you.

SQL is a declarative language not a procedural language.

Q6: Ranking has an error, or I misunderstood. It says "Order by: ranking, name" but in truth one should not order by name!

Anyway, overall very nice!

you are correct. I'll fix that.

We've got course that cover window function, not sure if I can post the url here. grrr.

Lovely website, I'm finally learning about window functions as I do this. Thanks a bunch.

One question / bug report? Q7 rejects this:

select name, weight, weight - (min(weight) over (order by weight asc rows between 1 preceding and current row) ) as weight_to_lose from cats

is that much worse than lag() or is it just non-idiomatic?

(edit: and Q10 asks about "window", but it was quite hard to google. if you could provide another one of those links to the relevant docs, that could help a bunch :D)

That query looks fine to me, I'll tweak the code.

Yep I need to fix up Q10 a bit.

Meanwhile I'm stuck with mysql 5.4 without foreign keys, cuz, you know, performance. (The main db for a billion dollar security company)

MySQL foreign keys add locks to referenced rows, and can have surprising effects on concurrent queries, particularly if you aren't running at read committed or lower isolation level.

There's also this: https://bugs.mysql.com/bug.php?id=48652 - still open after all these years (and yes I know there are much longer open bugs).

InnoDB storage engine is available in 5.4, you don't have to use only MyISAM[1].

However, depending on your application, you may be enforcing these constraints elsewhere and not need them at the engine level, even if it's a nice-to-have.

[1] http://download.nust.na/pub6/mysql/doc/refman/5.4/en/storage...

I know a database that's using InnoDB tables but also has no foreign keys and, indeed, JOINs are forbidden.

(And yes, the workflow does include actions that involve multiple tables in one go.)

Check out the new MySQL 8.0 release candidate. Is has windowing functions now.


I really really like this creation of yours, and I've already learned a couple things from it.

However, you really really need to be careful about SQL injection. I can see that you tried to lock it down as much as possible, as far as I can tell the account the queries are running under only has SELECT permission (no update/delete/etc). However I was still able to get some data you probably don't want me to have...

This query returns a list of all tables in your database:

SELECT c.relname||'' FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace

And this query returns a list of users and their privileges:

SELECT usename||'', usecreatedb||'', usesuper||'', usecatupd||'' FROM pg_user

I guess your name is Andy based on your HN username, but who is Rusty? :) And that's just what I've been able to get in 5 minutes of trying, let alone a determined attacker.

Noted, I am currently running on the assumption that there is nothing interesting in that PG (There isn't). But yes I would like to lock that user down further it really shouldn't be able to access those things.

It is now more locked down (at least "pg_catalog, public, information_schema" are no longer available). Thanks.

||'' is another way to do ::text ? :-)

I've used a few of these and it's great to practice using more!

Q7 includes output:

  Misty   5.7  0.2
  Oscar   6.1  0.4
  Smokey  6.1  0.0
How would one show 0.4 for both Oscar and Smokey?


I'm not sure how you'd do this with windowing functions, but at least you can use common table expressions (WITH) to achieve what you're asking for.

The site is down right now, so I had to test using my own db instance. Column and table names might not match.

      distinct_weights AS (
        SELECT DISTINCT weight
        FROM cats),
      weights AS (
          OVER (
            ORDER BY weight ) AS lag
        FROM distinct_weights)
    cats.weight - weights.lag
  FROM cats, weights
  WHERE cats.weight = weights.weight
  ORDER BY weight;

  Any     5.5   null
  Misty   5.7   0.2
  Oscar   6.1   0.4
  Smokey  6.1   0.4

I see, it's joining with the lagged weight. Same without CTE:

  SELECT c.name, c.weight, c.weight - lw.lower_weight AS weight_to_lose
  FROM cats c
  JOIN (SELECT weight, lag(weight, 1, weight) over (order by weight) 
        lower_weight FROM cats 
        GROUP BY weight
        ORDER BY weight) lw ON c.weight = lw.weight
  ORDER BY weight

Author here - I'm scrolling thru the logs and it is interesting to see that on the first question lots of people try to write "select name, total_weight" despite there only being a weight column which you have to use an aggregate on to get the total weight.

I guess I need to rethink my design a little.

I did that and to be honest I am not quite sure why.

I think it's due to lack of reading the question properly and impatience.

But it also might be due to how programmers sometimes think/work. Often, when I am trying to solve a problem I like writing something (anything) down in broad strokes first so that I can then go back in and 'fix' each smaller part of the larger problem.

The vast majority of developers don't know SQL beyond simple CRUD queries. The surge in popularity of NoSQL over the past decade has made SQL become a lost art. Even when using relational databases, ORMs often alleviate the need for understanding the principles.

Thank you for making this.

Q6 has a flaw. You request "Order by: ranking, name" but the accepted solution sorts by rank only (Alfie should come before Ashes)

I suspect there's a need to keep the design simple, but the source table schema should probably be more prominent. I suspect most people try to guess the names of the columns before they ever click "Show table schema".

Nice quiz, by the way, and I like the topic. In my experience, too few devs / dbas make good use of window functions.

I thought the underlying schema was pretty obvious. However I made lots of mistakes due to trying it out from my phone, and it didn't cooperate as well as one would've liked.

But I think the design was good. I also noticed the option to look at the schema, if needed.

It took me a while to see the 'show schema' link. Maybe this should always be shown.

Quick bit of feedback: question 10 refers to the "WINDOW command" which I think should be called the "WINDOW clause" to help clueless folks (ie me) Google for it.

good idea. I'll fix that.

You had me on the plural for your table names. Have not seen it in a while. Personally believe it is a preference thing.

Despite being the author I think you are correct. It should be table: 'cat' not 'cats'

My feedback after answering the first question:

1. Q0's success messages says "Group by and Having are commonly used aggregate functions". This terminology is incorrect. SUM() is an aggregate function. GROUP BY and HAVING are clauses. 2. Without an ORDER BY clause, there's no guarantee of the order of the result, but the quiz appears to require the output rows to be in the same (in the case of Q0, undefined) order as the expected rows. While this is unlikely to spontaneously change, it's like depending on the order of an unordered hash map. As long as you're teaching people things, it's good to instill the habit of always ordering their results.

Q1: It's not apparently from the way the schema is shown whether "name" is a unique key. The suggested SQL will produce incorrect results in the case of a duplicate name as the window framing defaults to RANGE UNBOUNDED PRECEDING, so following peer rows are included in the aggregation.

True. Well observed. How would you resolve this?

I'm not an expert in window functions and I don't know what's idiomatic. I think explicitly specifying ROWS UNBOUNDED PRECEDING makes it do the right thing, but as you noted regarding ORDER BY, this is making people type a lot of SQL and maybe detracting from the concept. But then, this is a problem they could easily run into in the real world.

I could see perhaps having a question that comes right after this one in which the OVER ORDER BY contains duplicates, that can be used as an opportunity to teach the "gotcha".

Alternately, if there's a guarantee of uniqueness, this can't happen.. but again it might be setting people up to try to apply what they think they learned to the real world and then finding out that their sums are wrong...

1) Yes. I'll re-word that part. 2) I debated with myself on that. On one hand it is clearly better to ORDER BY. On the other hand I'm trying to teach people the window functions and I don't want them typing in more SQL than they have to.

Great job!

And here is a small suggestion, you might want to change the name of output column "unique" in Q3 because.

  *expression* as unique --does work and is recommended answer
  *expression* unique --fail because of use of a restricted word
  *expression* "unique" --work because of double quote
Using a restricted word that need quotation in some context is globally discouraged and in this context can keep someone stuck independently of the problem at stack.

PS: also apply to "position" in Q4 maybe thats intentional?

Good points. I'll update Q3 and look at Q4 it was not intentional to use keywords.

I guess window functions are advanced features, but I found it a little bit basic. Some things you could do to make it more advanced:

    * add questions about range mode
    * have questions which discuss default behavior regarding ordering and partitioning
    * Have questions which rely on things like RANGE 3 PRECEDING or other constructs

Last time I checked, PG didn't support RANGE <value> PRECEDING, only RANGE UNBOUNDED PRECEDING. The new MySQL 8.0windowing functions do, though.

I'm stuck on the first question. I'm not sure why this is wrong. Can anyone help me?

> select name, sum(weight) OVER (PARTITION BY name) as running_total_weight from cats order by name


It's asking for a running total of all weights, so you shouldn't partition by name. Use an ORDER BY instead of PARTITION BY.

Ah, I get it. Thanks!

Thank you for doing this.

Sidenote: either HN killed your backend or I did with a recursive CTE. If so: sorry :)

No problem, just restarted it.

Its a brand new project so I may need to do some hardening.

This is cool! How many questions long is it?

(it's 10 questions long)

Nice. I rarely use window functions and find them a bit arcane. This was really accessible.

select version();

PostgreSQL 9.3.20 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4, 64-bi

I wonder why so old version is being used. I will be not supported by the community pretty soon.

I mean, in more recent Postgres versions, more things related to window functions are available – like CUBE, GROUPING SETS, etc. 9.3 was released back in 2013.

There are some big flaws in window functions I'd love to see fixed. The main one being the very limited ways in which you can refer to the 'current' row in window and filter expressions. The second being... you're almost always better off just doing all this work in code, in a programming language with dependable performance characteristics.

I don't know, I think most performance gained from pulling data out and processing it is generally an illusion. Windowing functions can have terrible performance, but that's pretty easy to determine and mitigate (indexing, multiple queries in a procedure). Pulling data down and 'doing something' with it starts getting into the world of round trips, serialization, concurrency models...

Plus there is the whole, "My report wants to use the same logic as this web form", oh it's in Java? Guess we'll just pull the data out, process it, populate a temp table, report it, clear those values....

If you have business data in SQL, just write the logic in SQL.

I generally haven't had a problem with window function performance - I was mostly using MSSQL.

That said, the single biggest performance drain I've seen is people not using the database, and instead pulling data out to process in code. I don't think I've ever seen a case where it was faster to process data outside of the database - though there has also been plenty of times when the pertinent SQL to had to be written "smarter." Think in sets, not iteratively! There's a terrifying amount of loops in a lot of legacy SQL statements.

If you ever want to see some interesting thoughts on this, lookup "Toon Koppelaars thickdb". Granted he works for Oracle, but I think his point is quite well articulated.

Even if you write optimized C code, you can't tend to beat well written DB code because the IO completely overwhelms the processing time. And somewhat counterintuitively, moving processing outside the database tends to increase overall database CPU usage, because you tend to force more IO through the system.

That makes sense, since IO costs more than most cpu processing, and if you calculate in the DB Engine, you don't tend to need as much raw data out, you only need the results.

Basically I want a modern, cost effective mainframe, that uses a reasonable language for the data layer. The earlier suggestions of Python + Postgres or C# + MSSql are nearer the mark, but never quite feel first class.

Another Toon fan here. HE also co-wrote a book called Applied Mathematics for Database Professionals that is superb. It's dry, like a textbook. But the clarity and depth that they have on the underpinnings of how all this works and why it does is excellent.

I think we could be friends. As an appdev turned database/BI dev, there are a lot of intricacies of databases that are just not intuitive and take a while to get down right.

We're getting away from it by either moving to new tech or the tech is improving (adaptive query processing, auto tuning indexes, writable columnstore).

Even when I was an app dev I said you live and die by your database, either it's data model or by not putting processing in the right place. It's an architecture thing and something that (generally) has to be addressed early on.

I was given a report written in T-SQL that had multiple nested loops and a select statement inside the inner loop.

Rewriting it as one SQL query made it 4 orders of magnitude faster.

The point is you dont have business data yet, you need to create it ... using business logic. All the data is in sql but for some reason you dont store value x (probably an aggregate of sorts) and sql environment doesnt have a good generalized syntax for logic.

I think the problem is maintenance instead of performance. When you really want performance (in data reads) youll be precomputing and storing the values. Then your report can filter and be indexed properly.

I dont think that any tech solution will give you fast reads and fast writes you need to pick one.

Your server probably has alot of down time ... use it , make a cron job and make a seperate reporting table ... reporting on raw data does not scale , window functions do not scale

Using window functions is equivalent to using a better cpu. Maybe the environment you work in is faster but you have not factored the logic to make the best use of the resources.

You can take a crap approach in java and port it to sql. But the approach is crap and you should create a seperate reporting process or at least use hadoop.

Rather than taking the data out of the database, how about plugging your programming language of choice into the database?

For Postgres, for example, you can write stored procedures using any of these: https://www.postgresql.org/docs/9.1/static/external-pl.html

Not positive in PostGres, in MSSQL you still wind up copying the data and using additional resources, it's just handled for you. I'm a big fan of scripting in DBs becoming a thing (and a little scared) but it's not all roses (yet). You would also loose optimizer "fixes" right?

The PHP version looks like it hasn't been touched since 2011 and the Python link is dead. Doesn't give much confidence that this is a viable and supported pathway to Db <-> Application communication.

AWS RDS doesn't support any of the PL extensions for Postgres. And unfortunately, that's a real killer for a lot of people.

But when I've been able to use Python functions directly inside my SQL code, it is a thing of absolute beauty.

I agree, too little is done in SQL.

One good reason to avoid it and stored procedures is for parallelism. But you need a lot of data to make that worthwhile. Especially in these days of 3+Tb RAM machines.

Access to the current row was improved in the SQL:2011 standard. See section "8.2 Navigation within a window" in this paper: https://sigmodrecord.org/publications/sigmodRecord/1203/pdfs...

This is great news, although I'm not sure it gives me the thing I _most_ want, which is just to be able to refer to the 'current' row within a filter's where clause.

Not always the case, especially with columnstore indexes and batch mode, even traditionally, rownum/lead/lag/ordering functions are extremely handy along with aggregates. I can't actually think of many, result set/batch, scenarios that would be better replicated in code, not counting the speed on extra delivery (every row for a sum) and possible miscalculations. Add in the ability to index views and some other things for specific scenarios, then again, I'm mostly MSSQL and lack in other offerings, or knowing where they might lack.

Also, the site is blocked for me currently...

I think often conversations about this are really people talking past each other. If you work in the enterprise, and spend tons of time dealing with 'business logic' and bespoke systems, using SQL seems preferable because many of your headaches come from systems being out of sync, or reports showing different values. Databases represent a way to guard against a lot of problems that tend to occur in that environment. It's only when you branch into other realms of development where it starts to feel really backwards and archaic.

But is getting correct results reliably really "backwards and archaic", or does it only seem so because it is not the latest shiny toy with a trendiest name? (edited: spelling)

SQL has dependable performance characteristics... What is not dependable about how window functions are computed?

Where clause pushdown through views, or worse CTEs. Views with window functions querying views with window functions. Queries with multiple windows. I didn't just randomly make the statement, modern query planners are an absolute arse to predict in the way that you can 100% of the time in external code. The moment you're doing anything really 'sequency' instead of just calculating some dumb running totals, window functions can become a nightmare, and I just made the statement so people don't automatically think there's a friction-free story for doing general sequence processing in SQL. Not sure why this has offended people so much.

If you have a small amount of data, yes; but if you have any significant amount of data, you are losing multiple orders of magnitude of performance when you pull all that data out of the database.

I lose orders of magnitude of dev time every time Postgres can't work out how to push a where clause down into a view, tbh. In our case, doing processing on sequences at insert time is much more pleasant, performant and featureful than trying to do it after the fact with window functions.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact