
Window Functions in SQL (2013) - okket
https://blog.jooq.org/2013/11/03/probably-the-coolest-sql-feature-window-functions/
======
jnordwick
LAG/LEAD are basically SQL having to give in a little to the timeseries and
column db people. It obliterates the set theoretic nature of SQL, and they are
dog slow.

This is why any decent columnar/timeseries db has its own often arrish query
language to take advantages of ordering information inherent in the data.
There is an order in that index and SQL table, you just don't have good tools
to access and use it.

This touches on a conversation I was having a couple weeks ago where somebody
was saying that column/array storage is purely a backend consideration and
don't need to care about it in the front end of the database. That view gets
you ugly primitives like LAG/LEAD. Being able to exploit your backend decision
is paramount to performance.

Products like Informix or KDB+ just give you full access to the data on and
off disk as an array so you ca build complex windowing primitives (and even
store them and reuse them later).

(PS: I just find this really funny. KDB+ is probably is the only database I
know that literally does streaming analytics on a Raspberry Pi. No joke, the
full DB on Linux ARM/Pi. [https://kx.com/download/](https://kx.com/download/)
)

~~~
lukaseder
I would be very surprised if LEAD and LAG couldn't profit from indexing, if
you design your query / window function in a way that it actually can profit.

Perhaps you're hinting at the idea that it's too easy to screw this up as a
user, for it to be inefficient

~~~
jnordwick
The doesn't help that much since you are probably already using an index for
filter criteria, the table is probably unkeyed (fact table).

The case it does work is that you are ordered on timestamp (for example), the
query planner properly figures to use any filtering indexes and sequential
indexes, ... _magic occurs_

Now the job is making an SQL statement that the query planner can actual
understand and optimize since SQL lacks so much of the machinery for order-
dependent processing (this was a feature when SQL was conceived remember).

~~~
lukaseder
What if I'm designing a covering index to cover for both my filtering and
ordering needs? The SQL engine should pick that up in my opinion and avoid
needing to re-order anything.

Of course, SQL is by far more general purpose than producing ordered stuff. In
fact, ordering is a completely non-relational thing that was even criticised
in SQL in early days (such as offsets, limits, duplication, etc.). But SQL has
learned to shoehorn foreign concepts into the language / platform for a long
time, so I'm just curious about a concrete example where SQL window functions
fail (and couldn't be fixed) in SQL for you.

~~~
jnordwick
Try doing a VWAP in a group by 5000 stock symbol over a LAG of last 30000 -
basically a sample every 10 millis for 5 minutes. This is the simple case
where you can actually use LAG of a fixed amount.

(When the query returns tomorrow morning you realize that LAG was actually
useless because your observation arrivals weren't periodic and you need to
deal with / bucket for that to actually get usable results).

We used to run a lot of LEAD/LAG queries on a very, very large Oracle install
could never get the performance we wanted out of it. Basically turned it into
a very expensive file server that spat out HDF that we then processed by a mix
of shell and python. Should not have been faster than the overpriced heater in
the corner we called a database but was.

~~~
thom
Just to clarify, when you say LEAD/LAG here, do you actually mean ROWS/RANGE?
So, approximately:

    
    
        sum(price * quantity) over w / sum(quantity) over w as VWAP
        ⋮
        window w as (partition by stock order by time range 30000 preceding)
    

Totally willing to believe first-hand experience, but surprised and
disappointed if it's impossible to make that fast. (Apologies if I'm totally
misunderstanding the use case).

~~~
jnordwick
Is there really that much of a difference between LEAD/LAG and OVER+RANGE when
using an aggregation function? I'm sure there is, but it has been a few years
since I had to deal with some of this.

There are probably also difference between OVER when used in TSQL and PLSQL.

Last time I had to do this was actually with real-time advertising bidding,
and it was across keywords, not ticker symbols (many more, much sparser). We
had a 20 millisecond budget, and had to break the queries up to at least get
partial results to push out bid even if we didn't have all the results back in
from the database that we wanted.

We used OVER+RANGE for average price, but needed LAG for first derivative (is
the price trending up or down). We had to start doing roll ups of the data
hourly and be constantly trimming to keep the real-time performance high
enough, but that meant we needed to have have a separate off-line analytical
system. Common to do, but still I think it can be done without it if you use
the correct tech stack.

~~~
lukaseder
What you're doing certainly doesn't sound like an easy job for window
functions, although it might be possible with RANGE + interval - at least in
Oracle 12c.

How about Oracle 12c's MATCH_RECOGNIZE, though?

[http://www.oracle.com/ocom/groups/public/@otn/documents/webc...](http://www.oracle.com/ocom/groups/public/@otn/documents/webcontent/1965433.pdf)

~~~
jnordwick
Thanks. That was very interesting. Overall, a total band-aid of course, but
still better than nothing. I can completely see doing ad hoc queries with that
(and I can't wait to try it out) but I fear the performance hit of trying to
build a system around it :)

It didn't give any examples of queries going both directions, e.g. the down-up
V first example. You probably often also want the up-down ^ shapes too. It
would be nice if the PATTERN had alternation too "STRT (DOWN+ UP+)|(UP+
DOWN+)" so you didn't need to double your queries. Or else you're probably
better off creating a temp on CUR-LAG(1) to give you the sign and then pattern
match the sign (if that makes sense).

Oracle really does go out of their way to keep SQL modern though. I remember
the first time I saw a "CONNECT BY". lol. I had an interview a few months
later and was asked to write an SQL query to basically traverse a tree (the
gist of it). I being a young smart ass gave the correct answer (you can't) and
proceeded to write the START / CONNECT just to show off a little.

~~~
lukaseder
Yes, it's a band-aid. But perhaps it will feel less so, if you're looking at
the original invention by Esper Tech (I have no experience with it, just know
that I've seen MATCH_RECOGNIZE there first):
[http://www.espertech.com/esper/release-5.1.0/esper-
reference...](http://www.espertech.com/esper/release-5.1.0/esper-
reference/html/match-recognize.html)

Hah, yeah CONNECT BY is great for simple recursion, but for more complex
(probably less performing) cases, I prefer the standard approach with CTEs...

------
mmsmatt
> Once you know window functions, you risk putting them all over the place

Agreed.

A few years of writing expressive, powerful, performant SQL in big blocks will
really change your mental model of data across _all_ languages.

I highly recommend a deep dive into the relational world, and I feel super
grateful for the SQL-focused job I held previously. Brought those lessons back
with me to both FP and OO.

~~~
nickpeterson
They are extremely useful, although they seem to vary in performance quite a
bit (SQL Server experience here, so YMMV). I'm always a bit shocked how little
SQL most software developers know, especially because it's often the
bottleneck if used poorly.

------
thom
I think my only real complaint about window functions is that you can't access
the 'current' row in aggregates and 'filter' conditions. It means very complex
look aheads or look behinds aren't really possible.

It would also be great if windows could be function parameters in Postgres. I
like to abstract complicated conditionals and calculations into functions, but
it falls down if you're operating over a window.

~~~
lukaseder
Would you mind providing an example of what you'd like to do, which is too
complex for window functions?

~~~
thom
Apologies for contrived examples. First, it'd be nice to be able to use both
rows inside some aggregates, e.g.:

    
    
        min(ST_Distance(this.point, foo.point)) over (blah)
    

That is, for each 'foo', get me the minimum distance to another foo in the
current window.

The second peeve is to do with filters, e.g.:

    
    
        count(*) filter (where this.time - foo.time < 5 and type_id = 42) over (blah)
    

That is, how many events of type 42 are there up to 5 seconds before the
current event.

In both cases, it's sad not being able to address the current row as well as
the row in the window. I'm not saying either form is completely impossible to
get working somehow, I just think it'd be a nice feature.

~~~
lukaseder
Thanks for clarifying. Oh, I see yes indeed. We can only nest aggregate
functions inside of window functions, not the other way round (unless
resorting to derived tables, views, ctes, etc.). Indeed, it's logical to know
why it's not possible (the way the language was designed), but the language
could have been designed differently.

Do note that Oracle has the KEEP clause that might probably solve your issue
with aggregate functions only... But it's a rather esoteric, vendor-specific
extension.

Anyway, I do agree, what you're describing would be a very nice feature.

------
taeric
So, yeah. Window functions are new to me. I was actually asking the team about
a month ago if there was a way to do this in SQL. Consider my mind blown, and
thanks for posting!

------
okket
"Modern SQL" feature overview for MySQL + forks:

[https://dveeden.github.io/modern-sql-in-
mysql/](https://dveeden.github.io/modern-sql-in-mysql/)

(aka "Look, window functions are coming")

~~~
tracker1
Similarly, there's plv8 for PostgreSQL

~~~
koolba
> Similarly, there's plv8 for PostgreSQL

How is that similar? plv8 is a procedural language (javascript) for writing
user defined functions.

------
teilo
Windows functions are a feature I cannot live without. They achieve results
that otherwise could only be achieved with complex and ugly CASE and embedded
SELECTs. Another reason Postgres is superior to MySQL.

------
cygned
The most useful feature imho is temporal data. Too bad Postgres does not
support it.

