
A Little Known SQL Feature: Use Logical Windowing to Aggregate Sliding Ranges - mariuz
https://blog.jooq.org/2016/10/31/a-little-known-sql-feature-use-logical-windowing-to-aggregate-sliding-ranges/
======
datahead
This technique applies to many RDBMS, not just Oracle (as others have noted).
Teradata, PostgreSQL, MS SQL all have 'analytical' functions like this.
Analytical functions (over, partition by, etc) are extremely powerful and can
help simplify architecture/design for the data science/analytical communities.

One of the persistent issues on my team is the reliance upon a dataframe
representation w/ R or python to do this type of aggregation and windowing.
Most people will eschew learning the 'advanced' SQL and instead bring data
locally to do imperative style munging on it.

This creates a few issues, mainly adding complexity to the analytical stack:
\- Instead of querying the data and doing ETL/feature engineering in the db-
you are moving data around (usually to less powerful machines, such as a
laptop) for simple exploration.

\- This wastes time and usually results in more dependencies (dplyr for
example- no hate Hadley), sometimes even limiting you to single threaded
operations. Teradata, for example, is massively parallel and will perform
these operations in short order. I've seen Data Scientists wait 6hr for R to
do the same thing a SQL query against a prod system returns in 3min.

\- Code is not portable. A query can be executed and results retrieved through
ODBC, JDBC or native connections. Without these, data engineers are often
asked to install R (including libs) on some intermediate machine just to do
munging/ETL/feature engineering. If SQL driven, moving from quantitative
exploration to operational is quite easy (maybe just a query tune).

All that to say, I'm glad this post is highlighting some of the advanced SQL
that I hope more people rely upon. All of these ideas are better articulated
in MAD Skills [0]

[0]
[http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf](http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf)

~~~
SOLAR_FIELDS
One minor nit - normally a prod db is subject to availability restrictions
that limit your ability to run 3 minute-long queries on them that suck up a
bunch of resources. Nightly mirror to a staging/analysis db of sorts is a bit
better practice.

Window functions are extraordinarily powerful, as are Common Table Expressions
(CTE's). I encourage anyone who uses SQL with any regularity to learn them
immediately once they are comfortable with the more straightforward queries
and clauses SQL offers. Once you've mastered windowing and CTE's, you'll
wonder how you functioned without them.

~~~
bigger_cheese
The admins at my work have always told us (i.e. people wanting to run data
analysis) to avoid executing complex queries on production servers.

The workflow I use is to run simple select queries on prod databases. To bring
the subset of data I'm interested in down locally Once I have the "extracts"
I'm interested in I'll transform the data locally using SAS.

SAS has pretty powerful time interval routines which is what I'd typically use
to compute something like the example in the article.

My work has just started looking at "big data" we have a new hadoop databases
which is supposed to be used for that so I assume if I ever need to start
looking into data stored there it will make more sense to run the SQL "on the
database". So the article has some useful info there.

~~~
tomlock
One thing to note here might be that if network is the bottleneck you'd likely
be asked the opposite. In my experience if people pay attention to
indexes/partitioning they can win dba's hearts, but dbas often don't explain
these things because most people don't listen. So to them its easier to say
"don't run complex queries".

~~~
SOLAR_FIELDS
Part of the problem with people paying attention to indexes/partitioning is
that there's no very easy way to display that to the end user. A person
knowledgeable in SQL can query the right tables to determine this but at that
point you're pretty much thinking like a DBA in the first place. If major
vendors had a nice visualization of such a thing built into the interaction
tools like SSMS/SQL Developer/PGAdmin3 etc. then it might be easier (these
might exist, I'm just not aware of them).

So yeah, the end result is DBA's give the generic "don't run SELECT * and use
a WHERE clause when you're querying large tables"

------
teilo
I have made a lot of window function converts. It's amazing how many devs
don't know about them. When I see a whole series of CASE statements +
subqueries, it is almost always possible to dramatically simplify the query
with one or more window functions.

------
eranation
This (in perhaps a slightly different syntax flavor) works on
Hive/Presto/SparkSQL as well by the way. Also in Apache drill.

[https://drill.apache.org/docs/sql-window-functions-
introduct...](https://drill.apache.org/docs/sql-window-functions-
introduction/)

~~~
electrum
Presto implements the SQL standard, so the syntax is identical:
[https://prestodb.io/docs/current/functions/window.html](https://prestodb.io/docs/current/functions/window.html)

------
RyanHamilton
Windowing functions in standard SQL databases are like putting lipstick on a
pig. You created a design based on sets then later tried to slap on the idea
of order to sets. kdb is based on the concept of ordered lists, not sets and
makes windowing functions on time and order based queries ridiculously easy.
If this kind of thing interests you I recommend checking it out these
comparison queries between kdb and standard SQL:
[http://www.timestored.com/b/kdb-qsql-query-vs-
sql/](http://www.timestored.com/b/kdb-qsql-query-vs-sql/)

Unfortunately kdb is very expensive.

~~~
gshulegaard
> Windowing functions in standard SQL databases are like putting lipstick on a
> pig.

I have always thought the underlying Relational Algebra of SQL databases to be
quite beautiful
([https://en.wikipedia.org/wiki/Relational_algebra](https://en.wikipedia.org/wiki/Relational_algebra)).
This doesn't make it the right tool for every job, but let's try to shy away
from calling it a, "pig".

> You created a design based on sets then later tried to slap on the idea of
> order to sets.

I have some questions here.

It sounds like you are implying that sets and order are mutually exclusive,
but this is not my understanding of the math. Could you elaborate?

* [https://en.wikipedia.org/wiki/Partially_ordered_set](https://en.wikipedia.org/wiki/Partially_ordered_set)

* [https://en.wikipedia.org/wiki/Total_order](https://en.wikipedia.org/wiki/Total_order)

From the second reference on Total Order:

> "The lexicographical order on the Cartesian product of a family of totally
> ordered sets, indexed by a well ordered set, is itself a total order."

This quote seems to actually contradict this implication...especially in
respect to Relational Algebra which relies heavily on Cartesian Product.

Thanks in advance for your response! I enjoy improving my own understanding of
the underlying Computer Science/Mathematics whenever I can.

~~~
RyanHamilton
>>It sounds like you are implying that sets and order are mutually exclusive,
but this is not my understanding of the math. Could you elaborate?

The concept of sets and order may not be mutually exclusive in the
mathematical theory but in terms of how actual databases were conceived and
implemented they were very far apart. Early on many databases actually took
advantage of the fact the operations were set based and that order was not
guaranteed to achieve a number of speed optimizations. How long did many
databases take to get row number support? "partition by" support?

Take a look at this SO post for doing a running sums calculation:
[http://stackoverflow.com/questions/14953294/how-to-get-
runni...](http://stackoverflow.com/questions/14953294/how-to-get-running-sum-
of-a-column-in-sql-server)

This is the same kdb code: update runningSumB:sums b by a from t

Instead many people ended up turning to cursors to allow ordered calculations.

~~~
gshulegaard
> The concept of sets and order may not be mutually exclusive in the
> mathematical theory but in terms of how actual databases were conceived and
> implemented they were very far apart.

I may be wrong, but it was my understanding the original Relational Databases
were very much based on the Relational Algebra originally pioneered by E.F.
Codd[1]. So I think it is a stretch to say that they were "conceived" separate
from the mathematics.

Further, it was my understanding that E.F. Codd took his relational algebra
which he developed while working as a researcher at IBM and applied it to an
implementation of an RDBMS also at IBM (now known as IBM DB2)[2]. So
similarly, I think it is a stretch to claim that databases were originally
implemented "far apart" from the mathematics.

I think there is a valid point that actual DB implementations may be lacking,
but isn't that a reason not to use that particular implementation rather than
a knock against Relational models?

I may be wrong, but it seems like your problems aren't with the Relational
Model but rather with potentially lacking/incomplete implementations of it.

[1]
[https://en.wikipedia.org/wiki/Relational_database](https://en.wikipedia.org/wiki/Relational_database)

[2]
[https://en.wikipedia.org/wiki/IBM_DB2](https://en.wikipedia.org/wiki/IBM_DB2)

------
epalmer
I don't use SQL that much in my day to day work but I have a need that
Windowing will satisfy. Thanks to @mariuz for submitting this.

I use Oracle mostly so I don't even have to worry about syntax differences.

------
garyclarke27
Itzik Ben-Gan wrote a whole book just on SQL windows functions and it's
brilliant. He writes for SQL server usually, but he made this one more
generic, he had to because at the time SQL server 2012 was missing much of the
functionality he described, unlike Postgres or Oracle. Postgres has incredible
capability here - Window AND Frame partitions/ordering, also row based ranges
or value based ranges. And as a huge bonus unique to Postgres, the wondeful -
Filter by clause - so much more elegant than sub queries.

~~~
j_s
This one?

 _Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions
(Developer Reference)_

[https://www.amazon.com/dp/B00JDMPHKY](https://www.amazon.com/dp/B00JDMPHKY)

------
contingencies
_How many payments were there in the same hour as any given payment?

How many payments were there in the same hour before any given payment?

How many payments were there within one hour before any given payment?_

I have two serious questions.

(1) In which realistic scenario would you ever need to ask such questions in
the first place?

(2) Even if you did need to do so, for example for risk management, occasional
reporting or some kind of adaptive scaling supposition, then surely you would
be calculating once per time unit (eg. hour/day) or recomputing a sliding-
total?

As I have actually personally written a number of relatively high profile,
large scale, publicly deployed, major brand digital entertainment content
rental and purchase systems (including complexities resulting from things like
multi-device, multi-DRM cryptography) for brands like LG, Nokia, Samsung, etc.
- these are not flippant questions.

Looking further... uh-oh... author uses Java.

 _Sakila example database was originally developed by Mike Hillyer of the
MySQL AB documentation team. This project is designed to help database
administrators to decide which database to use for development of new
products. The user can run the same SQL against different kind of databases
and compare the performance._

So the Sakila project referenced is not supposed to be a real project, it's a
personal re-interpretation of a test project designed specifically by a
database nerd with zero domain knowledge to show off database features...

 _jOOQ is an innovative solution for a better integration of Java applications
with popular databases_

... and this is what the author is working on, the world's billionth RDBMS ORM
layer.

IMHO readability is far more important than efficiency. Programmers should
generally avoid using these types of obscure, frequently partly platform-
linked techniques and consider their drawbacks against other potential
solutions.

Five minutes of life wasted.

~~~
infinite8s
Huh? Windowing functions are part of the SQL standard (as of SQL2003).

Also, these types of queries are the bread and butter of analytical queries
(not really as useful for transactional purposes).

~~~
contingencies
There is 'standard' and there is 'reality'.

If it doesn't work everywhere (eg. on sqlite) then it's not 'standard' in that
it's not familiar to all developers (minimum subset).

The further you go from that subset, the higher price your developers pay to
grok your code. Yes, that includes stuff as simple as foreign keys, indices,
etc.

Contrary to your bread and butter assertions, in my experience this stuff is
end-of-the-branch, end-of-the-twig, end-of-the-leaf, only-present-at-winter-
solstice level common in web and CRUD applications.

~~~
smallnamespace
I've worked on multiple true relational SQL and 'analytics' backends and they
have all supported analytic functions of varying complexity. Everyone around
me was using them as well because the queries were orders of magnitude faster,
and smaller and _more readable_ than using nested SELECT ... GROUP BY
statements to achieve the same result.

Note that for reporting and analytics workloads, 1) even batch queries can be
much faster using analytics functions because the optimizer can do many of
them for essentially free as part of the scan rather than requiring a subquery
and 2) interactive workflows become much, much quicker in my experience. Data
exploration is much less useful if running a query takes 5 hours rather than
30 seconds.

Defining sqlite as the 'standard' SQL and refusing to use any other features
available on your platform out of fear of confusing new developers is
misguided. Sometimes analytics functions are the best tool for the task, and
good engineers should judiciously weigh their utility against the risk of
losing access to that tool.

Keep in mind though that the cost of migrating your db backend is generally so
high that if you _are_ forced to do so, rewriting queries that depended on
analytics functions will likely be a small rounding error within the total
migration cost.

------
est
tl;dr Oracle.

~~~
samch
The article highlights the Oracle flavor of this functionality, but similar
windowing operations are available in PostgreSQL[1] and SQL Server[2]. Edit to
add that I have no idea if JOOQ supports windowing functions in non-Oracle
systems.

1: [https://www.postgresql.org/docs/current/static/functions-
win...](https://www.postgresql.org/docs/current/static/functions-window.html)

2: [https://msdn.microsoft.com/en-
us/library/ms189461.aspx](https://msdn.microsoft.com/en-
us/library/ms189461.aspx)

~~~
lukaseder
_Logical_ windowing is not available in those other databases, i.e. only
Oracle can specify a window whose size depends on the content of the current
row, rather than a fixed number of rows.

~~~
samch
I'm not sure that's entirely correct. This is from the SQL Server MSDN docs,
"Alternatively, the RANGE clause logically limits the rows within a partition
by specifying a range of values with respect to the value in the current row."

Source: [https://msdn.microsoft.com/en-
us/library/ms189461.aspx](https://msdn.microsoft.com/en-
us/library/ms189461.aspx)

