
SQL Window Functions - helenanders26
http://www.helenanderson.co.nz/sql-window-functions-part-1/
======
gerhardi
I interview people quite often for data engineer positions where good
knowledge of SQL is also required and I'll usually have the applicant give me
a presentation of some system or data pipeline that they have worked on
earlier (any possible business or client secrets can of course be obfuscated
or omitted). I'll ask questions about that system or process "like I'm 12
years old" and then I'll make up on the fly some kind of problem statement
that I would like to have this and this kind of results from the data - how
would you do it? Very often I include problems that would be easily solvable
with window functions and more often than not those who know them get hired :)

IMHO it is really helpful to know these.

~~~
julienkervizic
Yeah, knowledge of window functions allows to solve quite complex data
manipulation problems fairly easily.

For some data-engineering type of work such as sessionization, doing it
without window functions would make the task really complicated.

In some platforms such as MySQL there are alternatives such as correlated
subqueries that also allow to do extensive data manipulation easily, but at
quite the cost penalty.

In my experience, people who know window functions, are already quite well
versed in SQL and thus can serve as a good proxy to gauge overall experience
in Analytical SQL.

------
serpix
The existence of LAG and LEAD have solved some seriously iffy requirements
twice in the past year now. One time for measuring / determining from two
consecutive GPS points some details about speed, route and distance (omitting
some details here), LAG allowed to see and compare the previous points
recorded and is IMO absolutely vital for PostGIS.

second time for calculating changes between two rows and combining that with
some business rules, the existence of LAG was absolutely crucial for keeping
it all in the database.

RANK / PARTITION BY is something that when you need it you really really need
it. Like if you join a bunch of rows to a single column but would like to just
have the first or last of those rows for results and disregard all the other
rows? RANK and order by

------
syntheticcdo
I would say you need _to know about the existence of_ window functions.
Knowing they exist and generally what they are used for means some time in the
future I can identify a problem I am trying to solve, and then go read up on
the documentation for window functions and use them in that place, and then
probably forget about the specifics.

~~~
theshrike79
"...there are known knowns; there are things we know we know. We also know
there are known unknowns; that is to say we know there are some things we do
not know. But there are also unknown unknowns—the ones we don't know we don't
know." \- Donald Rumsfeld

Basically if you know that something exists, but don't know about it you can
still go learn it and apply it to solve a problem.

But if you don't know about it, most likely you'll try to reinvent the wheel
and spend way too much time doing it.

~~~
Ma8ee
It always annoys me when that phrase is attributed to Donald Rumsfeld. The
concept is much older than him, and the attribution to him gives him an air of
wisdom that he doesn't deserve.

~~~
pmarreck
Then find someone else to credit it with? Seems futile to combat this
attribution without a replacement

~~~
Ma8ee
Does it need to be attributed to someone? It has long been a well known
concept in the intelligence community.

------
cam8001
The author explicitly states that she is using these in SQL Server, but I did
a bit of digging and they are supported in ANSI SQL2003/2008, which is
implemented by all the major vendors (MySQL, PostegreSQL, Oracle, DB2, SQL
Server).

Wikipedia has some data on support:
[https://en.wikipedia.org/wiki/Select_(SQL)#Window_function_s...](https://en.wikipedia.org/wiki/Select_\(SQL\)#Window_function_support_by_RDBMS_vendors)

~~~
michaelper22
MySQL only as of 8.0, which also added CTEs, more JSON features, and index
features relevant to indexing JSON.

------
jhoechtl
Date is hard. The example says

> BETWEEN '2018-01-01 00:00:00:000' and '2018-12-31 00:00:00:000'

So the last day of december will be missing.

~~~
quietbritishjim
BETWEEN is inclusive, so it will match '2018-12-31 00:00:00:000' but not
'2018-12-31 00:00:00:001'. So if the date column actually only stores dates
(not times) then it will actually work correctly - certainly using "2019-01-01
00:00:00:000" would give wrong results.

It would be safer, and clearer, by doing the half-open comparison explicitly.
But I would argue it's BETWEEN that's hard (because it's closed rather than
half-open as you'd expect) rather than dates, at least in this case.

orderdate >= '2018-01-01 00:00:00:000' AND orderdate < '2019-01-01
00:00:00:000'

~~~
tda
In postgres I would use tsrange to make it very explicit if I want to include
the edges or not: '[2018-01-01 00:00:00, 2019-01-01 00:00:00)'::tsrange @>
'2018-01-01 00:00:00'::timestamp

~~~
quietbritishjim
I didn't realise that Postgres has ranges, thanks for that. But, even though I
have a background as a pure mathematician, I strongly disagree that "[a, b)
includes x" is any more explicit than "x >= a AND x < b". Given that those
ranges aren't standard SQL and are _much_ less commonly used than simple
inequality operators, I'd stick with inequalities for clarity.

------
gigatexal
Not to tack on too much hyperbole but window functions are amazing and can
save you from doing a lot of self joins.

~~~
pixelbath
Same with CTEs. In the last year I've worked with both and they really change
the game for nontrivial queries.

------
zzo38computer
I have used window functions in SQLite. You don't need it a lot, but there are
some uses for it. However, I don't know what the EXCLUDE clause is good for;
it doesn't seems so useful.

~~~
linkerzx
Yeah the addition of window functions onto sqlite last year was quite a good
feature to enable more data use cases with sqlite.

The exclude clause is part of the "window" configuration which defines which
row to process, with it you can for instance define a range between all the
previous and next rows and check if your data-point looks abnormal.

------
just_myles
I remember using window functions to create complex data outputs for data
transformation pipelines. I didn't even know what they were but postgres just
started supporting them and it looked like the right solution at the time.

I don't find a whole of use for them now but, I believe simply being aware
they exist and an idea of what they do is easy enough to research and apply to
any project you are working on. This is from a analyst and dev perspective.

------
AdamM12
Anyone have an example of using window functions in a non time series data
set? Kinda seems like a bit of a flag in my head that if I'm asked to do some
analysis w/ time series data (depending on the question) a window function
might be a good the tool. Reading this you can tell I'm not particularly
fluent in them. We use them at work and I was asked during my interview about
them (did not know but got job anyway).

~~~
ianamartin
Say you have to grab a bunch of files a few times a day and zip them up and
drop them on and endpoint somewhere. But the endpoint has a max file size for
a single upload.

If you have a table of metadata about your files (zipped_size and location,
for example) you can window on the sum of the zipped_size to collect some
number of files you can get that will be under the file size limit.

~~~
AdamM12
Thank you

------
thom
As well as the basics, it really helps to understand all the different options
for window frames, allowing you to only look backwards or forwards a certain
amount, or skip the current row etc with more recent support in Postgres at
least.

Also if you want to get really fancy, it's always worth remembering that you
can write your own aggregates in Postgres, which can of course work as window
functions.

------
alexgmcm
In BigQuery I've found array_agg() to be preferable in most use cases.

Also, in Hive I used to use Custom Reducers for many things where now I have
to make do with Window Functions.

BigQuery is blazingly fast though so it seems a reasonable sacrifice.

~~~
adwww
I use pretty simple window functions with almost every BigQuery query I write,
I can't imagine going back to a database that doesn't support them now.

~~~
alexgmcm
Yeah - I've never had to use a db that didn't support them.

It's almost unimaginable for me!

------
eternalny1
Why, in the second example with RANK(), does the values for 1/2 go from rank
12 to 15 and then 6, 7?

Am I missing something obvious?

~~~
caf
It's because the query has an ORDER BY clause that orders it all by sale_date,
whereas the RANK() is calculated from subtotal order.

------
pmarreck
How is this different from LIMIT and OFFSET options?

~~~
breakingcups
Very.

~~~
pmarreck
...but how?

