
Distinct on: a confusing, unique and useful feature in Postgres - yogiUSA
https://www.yogeshchauhan.com/167/postgres/the-confusing-unique-and-useful-feature-in-postgres-distinct-on
======
Groxx
Seems like a saner option (in that behavior is defined) than MySQL's ability
to select columns in a group-by that are not aggregates / grouped-on.

e.g. this works in MySQL, and has for many years:

    
    
        select a, b, c from table group by a
    

and you'll get undefined (but valid) b and c items from the grouped results
per `a`. It's absurdly useful at times - if the non-grouped columns are just
denormalized data, it doesn't matter which row they came from, and there's no
need to spend time sorting/maxing/etc. And sometimes you truly don't care
which entry you get, you just want _an_ entry, e.g. to show a sample of what's
in the group.

~~~
munk-a
Thankfully, though, Postgres' DISTINCT ON does play nicely with ordering and
that's where I think it shines. If you have revisioned historical data this is
one of the best ways to clearly and expressively pull off most recent records
- it also works well for MIN/MAX based selections or more complicated ordering
schemas - in my workday setting we have a common ordered data set that we
order along a three component index and have ended up wrapping one of our most
common DISTINCT ON expressions in a view.

~~~
AmericanChopper
I had no idea this existed, and that was the first use case that came to mind.
I can see myself using this a lot now. I have one especially terrible dataset
I’m working with at the moment where table A has a primary key, and table B is
a set of denormalized transaction logs. If I want to know the status of a
record in table A, I have to JOIN on table B (SELECT ... MAX), to get the most
recent transaction for that PK. It’s ugly, confusing to anybody else who reads
it, and inefficient. DISTINCT ON seems like a a significant improvement for
that.

Note: it would be theoretically trivial to actually normalize this schema, but
sadly that’s beyond my remit. At first glance I can’t think of a scenario
where I would choose this design pattern unless I was forced to work within
similar constraints (which does happen all the time in the real world).

~~~
munk-a
If performance is a concern a MATERIALIZED VIEW might be a good tool to use -
I've worked in similar sounding legacy environments and usually, while the
underlying structures are immutable due to layers of code, adding a mat view
or two on top is do-able.

Maybe that'd be some help?

~~~
AmericanChopper
This particular use-case wouldn’t be suited to MVs, because it’s infrequently
accessed, and cannot tolerate stale data. So refreshing the view just kinda
kicks the can down the road a little bit. Other DB engines support a FAST or
INCREMENTAL refresh, which can even support queries with aggregates (as long
as certain conditions are met). For instance Oracle achieves this with
MATERIALIZED VIEW LOGs, which are essentially just transaction logs for the
MV.

It’s a bit of a moot point for me in this case though, as I’m not allowed to
make any DDL changes to this DB, I just have to produce periodic reports.
Which is perhaps a bit of a contrived situation, but it’s absolutely not my
first time dealing with it. In reality there’s other even more trivial
solutions to this problem. It could be solved by adding a row to table A (or
creating a new related table), and then adding an update to the code that
inserts the transaction logs (or even just with a table trigger - though
that’s a bit more spaghetti). That would be all that’s required to bring this
schema (or at least this particular portion of it) in line with a normal form.
This particular DB is a large component in a generally spaghttified
architecture, that lots of stakeholders write to and consume data from, so
over the years it’s grown too complex to reliably tolerate simple changes. I
imagine constraints similar to these is what drives a lot of use for features
like DISTINCT ON, where the more “correct” approach might just be to fix your
schema.

------
cosmie
One really useful call out is that it supports expressions, not just columns.
You can create derived entities via case statements (think binning, entity
normalization, etc) and leverage the super handy functionality against those
derived entities.

It also allows you to use multiple entities/expressions in the distinct on, so
you can pivot the logic between granularity levels. Which comes in handy when
needing to pull rollups latest/first/biggest/smallest style rollups from
varying perspectives.

~~~
rattray
This sounds pretty awesome. Anyone have any sample queries demonstrating this
lying around?

~~~
cosmie
I threw together a Fiddle that demonstrates it here[1]. It shows the basic
single column version, a multi-column distinct on, and an example that uses a
case statement for a derived column. The only thing with the third example is
that I don't believe you can alias the derived column within the distinct on
field itself, so have to repeat the expression - once within the distinct on
and again in the column list of the select statement.

[1] [https://www.db-fiddle.com/f/7CCYFf14UtkuGYfGZAt5PY/3](https://www.db-
fiddle.com/f/7CCYFf14UtkuGYfGZAt5PY/3)

------
Tostino
Distinct on is for sure one of the most useful Postgres extensions to SQL, as
far as how concise it makes queries which otherwise would suck to write.

I would have liked to see equivalent queries written with GROUP BY so people
could understand and translate that knowledge even if they don't have direct
Postgres experience.

Thanks for the article!

~~~
claytonjy
I believe the demo query

    
    
        SELECT DISTINCT ON (location) location, time, report
        FROM weather_reports
        ORDER BY location, time DESC;
    

could also be written with a self-join like so

    
    
        SELECT location, time, report
        FROM weather_reports
        JOIN (
            SELECT location, max(time) AS time
            FROM weather_reports
            GROUP BY location
        ) as _ USING (location, time)
        ORDER BY location;
    

though I'd be curious what the query-plan differences are. I suspect the
DISTINCT ON might translate more closely to a lateral join

    
    
        SELECT
        FROM (
            SELECT DISTINCT(location)
            FROM weather_reports
        ) AS locations
        JOIN LATERAL (
            SELECT location, time, report
            FROM weather_reports
            WHERE location = locations.location
            ORDER BY ts DESC
            LIMIT 1
        ) AS reports ON TRUE
        ORDER BY location;
    

But I really can't say.

~~~
paulddraper
The first only works if location,time is unique. In the example it may be, not
necessarily in general.

And yes, the JOIN LATERAL would be a good substitute.

P.S. More concise is CROSS JOIN ... rather than JOIN ... ON TRUE

------
beckingz
This removes the self join needed to ask the query
"(latest/first/biggest/smallest) record per (customer/user/entity)".

Nice!

~~~
caseyohara
I've been struggling with a similar query recently perhaps better suited for a
time series DB. Can DISTINCT ON be used to ask the query "latest record per
customer _per day_ ". For example, let's say we have a bunch of devices that
send sensor readings arbitrarily throughout each day. Could DISTINCT ON be
used to query the last reading of each day per device?

~~~
dagss
Also consider expressin this with a lateral join.

Scroll a bit down here for a similar example.

[https://stackoverflow.com/questions/11472790/postgres-
analog...](https://stackoverflow.com/questions/11472790/postgres-analogue-to-
cross-apply-in-sql-server)

------
abalashov
DISTINCT ON, when I stumbled upon it rather accidentally a few years ago, was
a life saver in some complicated queries in some even more complicated stored
functions in our telephony infrastructure product, where only the first route
offered by any vendor that otherwise matches a set of criteria is desired.

The alternative would have been a lot less concise and would have required a
much longer stored procedure.

------
Twisell
What's nice about PostgreSQL is that you quite often discover little useful
trick you have missed so far.

Guess that's however true for any well written language when you master it
enough to enjoy theses gems hidden in plain sight.

------
thibautg
DISTINCT ON is also very useful for nearest neighbor queries with PostGIS.

------
cosmotic
If you want your application to be portable, don't use this feature.

~~~
irrational
Do you mean portable as in able to run on different databases? Your
application would have to be fairly simple to achieve that goal. We moved an
application from Oracle to Postgres. It took about 2 years to complete.

~~~
cosmotic
Many applications are built on ORMs which are pretty portable. About the only
thing I regularly depend on that is proprietary per RDBMS is "last insert
autoincrement id" since it's not part of the SQL standard. Binary blobs is
another big one.

