
Advanced SQL window function quiz - andy_boot
http://www.windowfunctions.com/
======
gravypod
Why are they recalculating the sum of the weights in the Q0 query?

Their answer:

    
    
       select age, sum(weight) as total_weight
       from cats
       group by age
       having sum(weight) > 12;
    

Which is equivalent to:

    
    
       select age, sum(weight) as total_weight
       from cats
       group by age
       having total_weight > 12;

~~~
atarashi
The HAVING clause is evaluated before the SELECT so you have to specify the
full clause `having sum(weight) > 12` and not just the alias. Most SQL engines
are probably smart enough to not actually recalculate everything.

~~~
cema
Yes, SQL engines tend to be much smarter than the SQL languages as such.

------
zer
Q6: Ranking has an error, or I misunderstood. It says "Order by: ranking,
name" but in truth one should not order by name!

Anyway, overall very nice!

~~~
andy_boot
you are correct. I'll fix that.

------
emmelaich
See also Markus Winand's blog and [http://use-the-index-luke.com/sql/partial-
results/window-fun...](http://use-the-index-luke.com/sql/partial-
results/window-functions)

and Joe Celko [https://www.red-gate.com/simple-talk/sql/t-sql-
programming/w...](https://www.red-gate.com/simple-talk/sql/t-sql-
programming/window-functions-in-sql/)

~~~
nablaone
We've got course that cover window function, not sure if I can post the url
here. grrr.

------
nothrabannosir
Lovely website, I'm finally learning about window functions as I do this.
Thanks a bunch.

One question / bug report? Q7 rejects this:

select name, weight, weight - (min(weight) over (order by weight asc rows
between 1 preceding and current row) ) as weight_to_lose from cats

is that much worse than lag() or is it just non-idiomatic?

(edit: and Q10 asks about "window", but it was quite hard to google. if you
could provide another one of those links to the relevant docs, that could help
a bunch :D)

~~~
andy_boot
That query looks fine to me, I'll tweak the code.

Yep I need to fix up Q10 a bit.

------
drraid0
Meanwhile I'm stuck with mysql 5.4 without foreign keys, cuz, you know,
performance. (The main db for a billion dollar security company)

~~~
Alupis
InnoDB storage engine is available in 5.4, you don't have to use only
MyISAM[1].

However, depending on your application, you may be enforcing these constraints
elsewhere and not _need_ them at the engine level, even if it's a nice-to-
have.

[1]
[http://download.nust.na/pub6/mysql/doc/refman/5.4/en/storage...](http://download.nust.na/pub6/mysql/doc/refman/5.4/en/storage-
engines.html)

~~~
zimpenfish
I know a database that's using InnoDB tables but also has no foreign keys and,
indeed, JOINs are forbidden.

(And yes, the workflow does include actions that involve multiple tables in
one go.)

------
pzxc
I really really like this creation of yours, and I've already learned a couple
things from it.

However, you really really need to be careful about SQL injection. I can see
that you tried to lock it down as much as possible, as far as I can tell the
account the queries are running under only has SELECT permission (no
update/delete/etc). However I was still able to get some data you probably
don't want me to have...

This query returns a list of all tables in your database:

SELECT c.relname||'' FROM pg_catalog.pg_class c LEFT JOIN
pg_catalog.pg_namespace n ON n.oid = c.relnamespace

And this query returns a list of users and their privileges:

SELECT usename||'', usecreatedb||'', usesuper||'', usecatupd||'' FROM pg_user

I guess your name is Andy based on your HN username, but who is Rusty? :) And
that's just what I've been able to get in 5 minutes of trying, let alone a
determined attacker.

~~~
andy_boot
Noted, I am currently running on the assumption that there is nothing
interesting in that PG (There isn't). But yes I would like to lock that user
down further it really shouldn't be able to access those things.

------
karmakaze
I've used a few of these and it's great to practice using more!

Q7 includes output:

    
    
      Misty   5.7  0.2
      Oscar   6.1  0.4
      Smokey  6.1  0.0
    

How would one show 0.4 for both Oscar and Smokey?

[http://www.windowfunctions.com/questions/7](http://www.windowfunctions.com/questions/7)

~~~
dancek
I'm not sure how you'd do this with windowing functions, but at least you can
use common table expressions (WITH) to achieve what you're asking for.

The site is down right now, so I had to test using my own db instance. Column
and table names might not match.

    
    
      WITH
          distinct_weights AS (
            SELECT DISTINCT weight
            FROM cats),
          weights AS (
            SELECT
              weight,
              lag(weight)
              OVER (
                ORDER BY weight ) AS lag
            FROM distinct_weights)
      SELECT
        cats.name,
        cats.weight,
        cats.weight - weights.lag
      FROM cats, weights
      WHERE cats.weight = weights.weight
      ORDER BY weight;
    

yields

    
    
      Any     5.5   null
      Misty   5.7   0.2
      Oscar   6.1   0.4
      Smokey  6.1   0.4

~~~
karmakaze
I see, it's joining with the lagged weight. Same without CTE:

    
    
      SELECT c.name, c.weight, c.weight - lw.lower_weight AS weight_to_lose
      FROM cats c
      JOIN (SELECT weight, lag(weight, 1, weight) over (order by weight) 
            lower_weight FROM cats 
            GROUP BY weight
            ORDER BY weight) lw ON c.weight = lw.weight
      ORDER BY weight

------
andy_boot
Author here - I'm scrolling thru the logs and it is interesting to see that on
the first question lots of people try to write "select name, total_weight"
despite there only being a weight column which you have to use an aggregate on
to get the total weight.

I guess I need to rethink my design a little.

~~~
jabagawee
Quick bit of feedback: question 10 refers to the "WINDOW command" which I
think should be called the "WINDOW clause" to help clueless folks (ie me)
Google for it.

~~~
andy_boot
good idea. I'll fix that.

------
masto
My feedback after answering the first question:

1\. Q0's success messages says "Group by and Having are commonly used
aggregate functions". This terminology is incorrect. SUM() is an aggregate
function. GROUP BY and HAVING are clauses. 2\. Without an ORDER BY clause,
there's no guarantee of the order of the result, but the quiz appears to
require the output rows to be in the same (in the case of Q0, undefined) order
as the expected rows. While this is unlikely to spontaneously change, it's
like depending on the order of an unordered hash map. As long as you're
teaching people things, it's good to instill the habit of always ordering
their results.

~~~
masto
Q1: It's not apparently from the way the schema is shown whether "name" is a
unique key. The suggested SQL will produce incorrect results in the case of a
duplicate name as the window framing defaults to RANGE UNBOUNDED PRECEDING, so
following peer rows are included in the aggregation.

~~~
andy_boot
True. Well observed. How would you resolve this?

~~~
masto
I'm not an expert in window functions and I don't know what's idiomatic. I
think explicitly specifying ROWS UNBOUNDED PRECEDING makes it do the right
thing, but as you noted regarding ORDER BY, this is making people type a lot
of SQL and maybe detracting from the concept. But then, this is a problem they
could easily run into in the real world.

I could see perhaps having a question that comes right after this one in which
the OVER ORDER BY contains duplicates, that can be used as an opportunity to
teach the "gotcha".

Alternately, if there's a guarantee of uniqueness, this can't happen.. but
again it might be setting people up to try to apply what they think they
learned to the real world and then finding out that their sums are wrong...

------
Twisell
Great job!

And here is a small suggestion, you might want to change the name of output
column "unique" in Q3 because.

    
    
      *expression* as unique --does work and is recommended answer
      *expression* unique --fail because of use of a restricted word
      *expression* "unique" --work because of double quote
    

Using a restricted word that need quotation in some context is globally
discouraged and in this context can keep someone stuck independently of the
problem at stack.

PS: also apply to "position" in Q4 maybe thats intentional?

~~~
andy_boot
Good points. I'll update Q3 and look at Q4 it was not intentional to use
keywords.

------
einhverfr
I guess window functions are advanced features, but I found it a little bit
basic. Some things you could do to make it more advanced:

    
    
        * add questions about range mode
        * have questions which discuss default behavior regarding ordering and partitioning
        * Have questions which rely on things like RANGE 3 PRECEDING or other constructs

~~~
dwanvik
Last time I checked, PG didn't support RANGE <value> PRECEDING, only RANGE
UNBOUNDED PRECEDING. The new MySQL 8.0windowing functions do, though.

------
picardo
I'm stuck on the first question. I'm not sure why this is wrong. Can anyone
help me?

> select name, sum(weight) OVER (PARTITION BY name) as running_total_weight
> from cats order by name

[http://www.windowfunctions.com/questions/1](http://www.windowfunctions.com/questions/1)

~~~
papercrane
It's asking for a running total of all weights, so you shouldn't partition by
name. Use an ORDER BY instead of PARTITION BY.

~~~
picardo
Ah, I get it. Thanks!

------
Eremotherium
Thank you for doing this.

Sidenote: either HN killed your backend or I did with a recursive CTE. If so:
sorry :)

~~~
andy_boot
No problem, just restarted it.

Its a brand new project so I may need to do some hardening.

------
dbkaplun
Source:

[https://github.com/bootandy/window_funcs](https://github.com/bootandy/window_funcs)

------
rattray
This is cool! How many questions long is it?

~~~
rattray
(it's 10 questions long)

------
tofflos
Nice. I rarely use window functions and find them a bit arcane. This was
really accessible.

------
postila
select version();

PostgreSQL 9.3.20 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu
4.8.4-2ubuntu1~14.04.3) 4.8.4, 64-bi

I wonder why so old version is being used. I will be not supported by the
community pretty soon.

~~~
postila
I mean, in more recent Postgres versions, more things related to window
functions are available – like CUBE, GROUPING SETS, etc. 9.3 was released back
in 2013.

------
thom
There are some big flaws in window functions I'd love to see fixed. The main
one being the very limited ways in which you can refer to the 'current' row in
window and filter expressions. The second being... you're almost always better
off just doing all this work in code, in a programming language with
dependable performance characteristics.

~~~
nickpeterson
I don't know, I think most performance gained from pulling data out and
processing it is generally an illusion. Windowing functions can have terrible
performance, but that's pretty easy to determine and mitigate (indexing,
multiple queries in a procedure). Pulling data down and 'doing something' with
it starts getting into the world of round trips, serialization, concurrency
models...

Plus there is the whole, "My report wants to use the same logic as this web
form", oh it's in Java? Guess we'll just pull the data out, process it,
populate a temp table, report it, clear those values....

If you have business data in SQL, just write the logic in SQL.

~~~
Amezarak
I generally haven't had a problem with window function performance - I was
mostly using MSSQL.

That said, the single biggest performance drain I've seen is people _not_
using the database, and instead pulling data out to process in code. I don't
think I've ever seen a case where it was faster to process data outside of the
database - though there has also been plenty of times when the pertinent SQL
to had to be written "smarter." Think in sets, not iteratively! There's a
terrifying amount of loops in a lot of legacy SQL statements.

~~~
nickpeterson
If you ever want to see some interesting thoughts on this, lookup "Toon
Koppelaars thickdb". Granted he works for Oracle, but I think his point is
quite well articulated.

Even if you write optimized C code, you can't tend to beat well written DB
code because the IO completely overwhelms the processing time. And somewhat
counterintuitively, moving processing outside the database tends to increase
overall database CPU usage, because you tend to force more IO through the
system.

That makes sense, since IO costs more than most cpu processing, and if you
calculate in the DB Engine, you don't tend to need as much raw data out, you
only need the results.

Basically I want a modern, cost effective mainframe, that uses a reasonable
language for the data layer. The earlier suggestions of Python + Postgres or
C# + MSSql are nearer the mark, but never quite feel first class.

~~~
ianamartin
Another Toon fan here. HE also co-wrote a book called Applied Mathematics for
Database Professionals that is superb. It's dry, like a textbook. But the
clarity and depth that they have on the underpinnings of how all this works
and why it does is excellent.

