The HAVING clause is evaluated before the SELECT, so you have to specify the full expression `having sum(weight) > 12` and not just the alias. Most SQL engines are probably smart enough not to actually recalculate everything.
Most RDBMSes won't let you use a column alias in the same scope as it is defined (pretty sure the SQL standard itself doesn't specify that it should be allowed). You can get around this by using a CTE.
Also, optimisers aren't stupid; the actual sum() will not be recalculated.
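To make the workaround concrete, here's a minimal sketch using Python's sqlite3 as a stand-in for Postgres (SQLite ≥3.25 speaks the same dialect here); the cats table and its breed column are invented for illustration and may not match the quiz schema:

```python
import sqlite3

# Toy data; the real quiz schema may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cats (name TEXT, breed TEXT, weight REAL);
INSERT INTO cats VALUES
  ('Any','Tabby',5.5), ('Misty','Tabby',5.7),
  ('Oscar','Persian',6.1), ('Smokey','Persian',6.1),
  ('Felix','Siamese',4.0);
""")

# Portable form: repeat the aggregate expression in HAVING.
direct = conn.execute("""
    SELECT breed, sum(weight) AS total_weight
    FROM cats
    GROUP BY breed
    HAVING sum(weight) > 12
""").fetchall()

# CTE workaround: the alias is defined in an inner scope,
# so the outer query can filter on it with a plain WHERE.
via_cte = conn.execute("""
    WITH totals AS (
        SELECT breed, sum(weight) AS total_weight
        FROM cats
        GROUP BY breed
    )
    SELECT breed, total_weight FROM totals
    WHERE total_weight > 12
""").fetchall()

print(direct)
print(via_cte)
```

Both queries return the same rows; the CTE just moves the alias into a scope where it's visible to the filter.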
Author here. As others have said, this SQL engine makes me write the SQL out the long way. For some reason the shorter version (which I wrote first) is rejected by my postgres client.
Lovely website, I'm finally learning about window functions as I do this. Thanks a bunch.
One question / bug report: Q7 rejects this:
select
    name,
    weight,
    weight - min(weight) over (order by weight asc
                               rows between 1 preceding and current row) as weight_to_lose
from cats
is that much worse than lag() or is it just non-idiomatic?
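Not the quiz author, but one observable difference can be sketched with sqlite3 (whose window functions behave like Postgres's here); the data is made up: on the first row, min() over a `1 PRECEDING` frame still sees the row itself and yields 0, while lag() yields NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cats (name TEXT, weight REAL);
INSERT INTO cats VALUES ('Any',5.5), ('Misty',5.7), ('Oscar',6.1);
""")

# min() over a two-row frame: the first row's frame is just itself.
min_frame = conn.execute("""
    SELECT name, weight - min(weight) OVER
        (ORDER BY weight ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
    FROM cats ORDER BY weight
""").fetchall()

# lag(): the first row has no previous row, so the difference is NULL.
lagged = conn.execute("""
    SELECT name, weight - lag(weight) OVER (ORDER BY weight)
    FROM cats ORDER BY weight
""").fetchall()

print(min_frame)
print(lagged)
```

All later rows agree; only the first row differs (0 versus NULL), which may be why an exact-match checker rejects one form.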
(edit: and Q10 asks about "window", but it was quite hard to google. if you could provide another one of those links to the relevant docs, that could help a bunch :D)
MySQL foreign keys add locks to referenced rows, and can have surprising effects on concurrent queries, particularly if you aren't running at read committed or lower isolation level.
The InnoDB storage engine is available in 5.4; you don't have to use only MyISAM[1].
However, depending on your application, you may be enforcing these constraints elsewhere and not need them at the engine level, even if it's a nice-to-have.
I really really like this creation of yours, and I've already learned a couple things from it.
However, you really really need to be careful about SQL injection. I can see that you tried to lock it down as much as possible; as far as I can tell, the account the queries run under only has SELECT permission (no update/delete/etc). However, I was still able to get some data you probably don't want me to have...
This query returns a list of all tables in your database:
SELECT c.relname||'' FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
And this query returns a list of users and their privileges:
SELECT usename||'', usecreatedb||'', usesuper||'', usecatupd||'' FROM pg_user
I guess your name is Andy based on your HN username, but who is Rusty? :) And that's just what I've been able to get in 5 minutes of trying, let alone a determined attacker.
Noted. I am currently running on the assumption that there is nothing interesting in that PG (there isn't). But yes, I would like to lock that user down further; it really shouldn't be able to access those things.
I'm not sure how you'd do this with windowing functions, but at least you can use common table expressions (WITH) to achieve what you're asking for.
The site is down right now, so I had to test using my own db instance. Column and table names might not match.
WITH distinct_weights AS (
    SELECT DISTINCT weight
    FROM cats
),
weights AS (
    SELECT weight,
           lag(weight) OVER (ORDER BY weight) AS lag
    FROM distinct_weights
)
SELECT cats.name,
       cats.weight,
       cats.weight - weights.lag
FROM cats, weights
WHERE cats.weight = weights.weight
ORDER BY weight;
yields
Any 5.5 null
Misty 5.7 0.2
Oscar 6.1 0.4
Smokey 6.1 0.4
I see, it's joining with the lagged weight. Same without CTE:
SELECT c.name, c.weight, c.weight - lw.lower_weight AS weight_to_lose
FROM cats c
JOIN (SELECT weight,
             lag(weight, 1, weight) OVER (ORDER BY weight) AS lower_weight
      FROM cats
      GROUP BY weight
      ORDER BY weight) lw ON c.weight = lw.weight
ORDER BY weight
Author here - I'm scrolling thru the logs and it is interesting to see that on the first question lots of people try to write "select name, total_weight" despite there only being a weight column which you have to use an aggregate on to get the total weight.
I did that and to be honest I am not quite sure why.
I think it's due to lack of reading the question properly and impatience.
But it also might be due to how programmers sometimes think/work. Often, when I am trying to solve a problem I like writing something (anything) down in broad strokes first so that I can then go back in and 'fix' each smaller part of the larger problem.
The vast majority of developers don't know SQL beyond simple CRUD queries. The surge in popularity of NoSQL over the past decade has made SQL a lost art. Even when using relational databases, ORMs often remove the need to understand the principles.
Thank you for making this.
Q6 has a flaw. You request "Order by: ranking, name" but the accepted solution sorts by rank only (Alfie should come before Ashes)
I suspect there's a need to keep the design simple, but the source table schema should probably be more prominent; most people likely try to guess the names of the columns before they ever click "Show table schema".
Nice quiz, by the way, and I like the topic. In my experience, too few devs / dbas make good use of window functions.
I thought the underlying schema was pretty obvious. However I made lots of mistakes due to trying it out from my phone, and it didn't cooperate as well as one would've liked.
But I think the design was good. I also noticed the option to look at the schema, if needed.
Quick bit of feedback: question 10 refers to the "WINDOW command" which I think should be called the "WINDOW clause" to help clueless folks (ie me) Google for it.
1. Q0's success message says "Group by and Having are commonly used aggregate functions". This terminology is incorrect. SUM() is an aggregate function. GROUP BY and HAVING are clauses.
2. Without an ORDER BY clause, there's no guarantee of the order of the result, but the quiz appears to require the output rows to be in the same (in the case of Q0, undefined) order as the expected rows. While this is unlikely to spontaneously change, it's like depending on the order of an unordered hash map. As long as you're teaching people things, it's good to instill the habit of always ordering their results.
Q1: It's not apparent from the way the schema is shown whether "name" is a unique key. The suggested SQL will produce incorrect results in the case of a duplicate name, as the window framing defaults to RANGE UNBOUNDED PRECEDING, so following peer rows are included in the aggregation.
I'm not an expert in window functions and I don't know what's idiomatic. I think explicitly specifying ROWS UNBOUNDED PRECEDING makes it do the right thing, but as you noted regarding ORDER BY, this is making people type a lot of SQL and maybe detracting from the concept. But then, this is a problem they could easily run into in the real world.
I could see perhaps having a question that comes right after this one in which the OVER ORDER BY contains duplicates, that can be used as an opportunity to teach the "gotcha".
Alternatively, if there's a guarantee of uniqueness, this can't happen... but again it might be setting people up to try to apply what they think they learned to the real world and then finding out that their sums are wrong...
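The RANGE-versus-ROWS gotcha described above can be sketched with Python's sqlite3 (SQLite's defaults match the standard here); the tied-weight data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cats (name TEXT, weight REAL);
INSERT INTO cats VALUES ('Oscar',6.1), ('Smokey',6.1), ('Misty',5.7);
""")

# Default frame with ORDER BY is RANGE UNBOUNDED PRECEDING:
# peer rows (tied weights) are included, so both 6.1 cats
# get the same "running" total.
by_range = conn.execute("""
    SELECT name, sum(weight) OVER (ORDER BY weight)
    FROM cats ORDER BY weight, name
""").fetchall()

# ROWS UNBOUNDED PRECEDING accumulates one row at a time,
# which is what most people expect a running total to do.
by_rows = conn.execute("""
    SELECT name, sum(weight) OVER (ORDER BY weight ROWS UNBOUNDED PRECEDING)
    FROM cats ORDER BY weight, name
""").fetchall()

print(by_range)
print(by_rows)
```

With RANGE, the two tied rows both show the grand total; with ROWS, the intermediate sum appears.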
1) Yes. I'll re-word that part.
2) I debated with myself on that. On one hand it is clearly better to ORDER BY. On the other hand I'm trying to teach people the window functions and I don't want them typing in more SQL than they have to.
And here is a small suggestion: you might want to change the name of the output column "unique" in Q3, because:
*expression* as unique -- works, and is the recommended answer
*expression* unique -- fails because of the use of a reserved word
*expression* "unique" -- works because of the double quotes
Using a reserved word that needs quoting in some contexts is generally discouraged, and in this context it can keep someone stuck independently of the problem at stake.
PS: this also applies to "position" in Q4; maybe that's intentional?
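A quick sketch of the quoting behaviour, using Python's sqlite3 (keyword lists differ between engines, so the unquoted forms are shown without asserting how a given engine treats them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def try_query(sql):
    """Run a query; report whether the parser accepted it."""
    try:
        conn.execute(sql).fetchall()
        return "ok"
    except sqlite3.OperationalError:
        return "error"

# Double-quoted identifiers are always legal, even for reserved words.
cur = conn.execute('SELECT 1 AS "unique"')
print(cur.description[0][0])  # column really is named: unique

# Unquoted forms depend on the engine's keyword list;
# may raise a syntax error in SQLite and Postgres alike.
print(try_query('SELECT 1 unique'))
print(try_query('SELECT 1 AS unique'))
```

The double-quoted form is the only one the SQL standard guarantees for reserved words, which is why using one as a suggested column name can trip people up.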
I guess window functions are advanced features, but I found it a little bit basic. Some things you could do to make it more advanced:
* add questions about RANGE mode
* have questions which discuss default behavior regarding ordering and partitioning
* have questions which rely on things like RANGE 3 PRECEDING or other constructs
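The second point is worth a concrete sketch; here's one possible question's worth of behaviour, shown via Python's sqlite3 (which follows the standard defaults), with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cats (name TEXT, weight REAL);
INSERT INTO cats VALUES ('Any',5.5), ('Misty',5.7), ('Oscar',6.1);
""")

# OVER () with no ORDER BY: the default frame is the whole
# partition, so every row sees the grand total.
grand = conn.execute(
    "SELECT name, sum(weight) OVER () FROM cats ORDER BY name"
).fetchall()

# Adding ORDER BY shrinks the default frame to
# "start of partition up to the current row": a running total.
running = conn.execute(
    "SELECT name, sum(weight) OVER (ORDER BY weight) FROM cats ORDER BY weight"
).fetchall()

print(grand)
print(running)
```

The mere presence of ORDER BY in the window changes the default frame, which surprises a lot of people the first time.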
I mean, in more recent Postgres versions, more advanced grouping and windowing features are available – like CUBE, GROUPING SETS, etc.
9.3 was released back in 2013.
There are some big flaws in window functions I'd love to see fixed. The main one being the very limited ways in which you can refer to the 'current' row in window and filter expressions. The second being... you're almost always better off just doing all this work in code, in a programming language with dependable performance characteristics.
I don't know, I think most performance gained from pulling data out and processing it is generally an illusion. Windowing functions can have terrible performance, but that's pretty easy to determine and mitigate (indexing, multiple queries in a procedure). Pulling data down and 'doing something' with it starts getting into the world of round trips, serialization, concurrency models...
Plus there is the whole "My report wants to use the same logic as this web form." Oh, it's in Java? Guess we'll just pull the data out, process it, populate a temp table, report it, clear those values...
If you have business data in SQL, just write the logic in SQL.
I generally haven't had a problem with window function performance - I was mostly using MSSQL.
That said, the single biggest performance drain I've seen is people not using the database, and instead pulling data out to process in code. I don't think I've ever seen a case where it was faster to process data outside of the database, though there have also been plenty of times when the pertinent SQL had to be written "smarter." Think in sets, not iteratively! There's a terrifying amount of loops in a lot of legacy SQL statements.
If you ever want to see some interesting thoughts on this, look up "Toon Koppelaars thickdb". Granted he works for Oracle, but I think his point is quite well articulated.
Even if you write optimized C code, you tend not to beat well-written DB code, because the IO completely overwhelms the processing time. And somewhat counterintuitively, moving processing outside the database tends to increase overall database CPU usage, because you tend to force more IO through the system.
That makes sense, since IO costs more than most CPU processing, and if you calculate in the DB engine, you don't need as much raw data out; you only need the results.
Basically I want a modern, cost effective mainframe, that uses a reasonable language for the data layer. The earlier suggestions of Python + Postgres or C# + MSSql are nearer the mark, but never quite feel first class.
Another Toon fan here. He also co-wrote a book called Applied Mathematics for Database Professionals that is superb. It's dry, like a textbook, but the clarity and depth they bring to the underpinnings of how all this works, and why, is excellent.
I think we could be friends. As an appdev turned database/BI dev, there are a lot of intricacies of databases that are just not intuitive and take a while to get down right.
We're getting away from it, either by moving to new tech or because the tech is improving (adaptive query processing, auto-tuning indexes, writable columnstore).
Even when I was an app dev I said you live and die by your database, either by its data model or by not putting processing in the right place. It's an architecture thing and something that (generally) has to be addressed early on.
The point is you don't have business data yet; you need to create it... using business logic. All the data is in SQL, but for some reason you don't store value x (probably an aggregate of sorts), and the SQL environment doesn't have a good generalized syntax for logic.
I think the problem is maintenance rather than performance. When you really want performance (in data reads) you'll be precomputing and storing the values. Then your report can filter and be indexed properly.
I don't think any tech solution will give you both fast reads and fast writes; you need to pick one.
Your server probably has a lot of downtime... use it: make a cron job and a separate reporting table. Reporting on raw data does not scale; window functions do not scale.
Using window functions is equivalent to using a better CPU. Maybe the environment you work in is faster, but you have not factored the logic to make the best use of the resources.
You can take a crap approach in Java and port it to SQL, but the approach is still crap; you should create a separate reporting process, or at least use Hadoop.
Not positive about Postgres; in MSSQL you still wind up copying the data and using additional resources, it's just handled for you. I'm a big fan of scripting in DBs becoming a thing (and a little scared), but it's not all roses (yet). You would also lose optimizer "fixes", right?
The PHP version looks like it hasn't been touched since 2011, and the Python link is dead. That doesn't give much confidence that this is a viable and supported pathway for DB <-> application communication.
One good reason to avoid it and stored procedures is for parallelism. But you need a lot of data to make that worthwhile. Especially in these days of 3+Tb RAM machines.
This is great news, although I'm not sure it gives me the thing I _most_ want, which is just to be able to refer to the 'current' row within a filter's where clause.
Not always the case, especially with columnstore indexes and batch mode. Even traditionally, row_number/lead/lag/ordering functions are extremely handy along with aggregates. I can't actually think of many result-set/batch scenarios that would be better replicated in code, not counting the cost of the extra delivery (every row for a sum) and possible miscalculations. Add in the ability to index views and some other things for specific scenarios. Then again, I'm mostly MSSQL and lacking in other offerings, or in knowing where they might fall short.
I think often conversations about this are really people talking past each other. If you work in the enterprise, and spend tons of time dealing with 'business logic' and bespoke systems, using SQL seems preferable because many of your headaches come from systems being out of sync, or reports showing different values. Databases represent a way to guard against a lot of problems that tend to occur in that environment. It's only when you branch into other realms of development where it starts to feel really backwards and archaic.
But is getting correct results reliably really "backwards and archaic", or does it only seem so because it is not the latest shiny toy with the trendiest name?
(edited: spelling)
Where-clause pushdown through views, or worse, CTEs. Views with window functions querying views with window functions. Queries with multiple windows. I didn't just randomly make the statement; modern query planners are an absolute arse to predict, in a way that external code never is. The moment you're doing anything really 'sequency' instead of just calculating some dumb running totals, window functions can become a nightmare, and I made the statement so people don't automatically think there's a friction-free story for doing general sequence processing in SQL. Not sure why this has offended people so much.
If you have a small amount of data, yes; but if you have any significant amount of data, you are losing multiple orders of magnitude of performance when you pull all that data out of the database.
I lose orders of magnitude of dev time every time Postgres can't work out how to push a where clause down into a view, tbh. In our case, doing processing on sequences at insert time is much more pleasant, performant and featureful than trying to do it after the fact with window functions.
Their answer:
Which is equivalent to: