

Using PostgreSQL Arrays the Right Way - afshin
http://blog.heapanalytics.com/dont-iterate-over-a-postgres-array-with-a-loop/

======
bonesmoses
Introducing the caveats of PostgreSQL arrays is informative. However, I can't
help but note that this app probably would have been better served by a
classic event log table. Why have each user forever carrying tens or hundreds
of thousands of events directly in their user record?

If data mining is necessary, it's _right there_ in the event log and can be
manipulated by much simpler (and faster, possibly by orders of magnitude) SQL.
Additionally, it's not bloating regular access of the user records.
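
To be concrete, a minimal sketch of the kind of event log I mean (table and
column names here are purely illustrative, not the app's actual schema):

      CREATE TABLE user_events (
          id          bigserial   PRIMARY KEY,
          user_id     integer     NOT NULL,
          event_type  text        NOT NULL,
          occurred_at timestamptz NOT NULL DEFAULT now()
      );
    
      -- per-user history stays out of the user row and is read only when needed
      CREATE INDEX ON user_events (user_id, occurred_at);
    
      -- e.g. "when did we first see each user" is plain SQL over the log
      SELECT user_id, min(occurred_at) AS first_seen
      FROM user_events
      GROUP BY user_id;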

I've been seeing these designs more and more recently. Is it a natural
extension of non-DBAs taking DDL design roles? Because this data manipulation
style doesn't really mix well with any RDBMS I'm familiar with.

~~~
mrits
A lot of times it's because you want a rollup table but also need to answer
cardinality questions. So instead of scanning millions of rows, you can have a
single rolled-up array and perform set operations against other rollups
(instead of a numeric aggregate, which doesn't have the appropriate
mathematical properties to do much with).
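
Roughly the shape I'm describing, assuming some raw events table with
(user_id, occurred_at) and the intarray extension (names are illustrative):

      CREATE EXTENSION IF NOT EXISTS intarray;
    
      -- one row per day, with the set of active user ids rolled up into an array
      CREATE TABLE daily_active AS
      SELECT occurred_at::date AS day,
             uniq(sort(array_agg(user_id))) AS user_ids
      FROM raw_events
      GROUP BY 1;
    
      -- cardinality questions become set operations on the arrays, e.g. users
      -- active on both days (intersection via intarray's & operator)
      SELECT icount(a.user_ids & b.user_ids) AS retained
      FROM daily_active a, daily_active b
      WHERE a.day = '2014-01-01' AND b.day = '2014-01-02';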

~~~
bonesmoses
Well... generally this is solved via indices. An event log indexed by user ID
would have the same effect. You get all the user info, and if you really need
to grab historical or event data, it's there when necessary for parts of the
app _that actually need it_.

I own a lot of stuff, but I don't drag my house everywhere in case I might
need something. :p

------
greggyb
I find it very interesting to see how seemingly minor rewrites can have huge
performance impacts.

I do have a question about the SQL style. I work in business intelligence, so
I write a pretty good amount of SQL, and get to see a lot of other people's
efforts, ranging from newbie analysts' first fumblings to experienced DBAs'
queries and procedures. It seems that nested subqueries in the style of this
article are used much more often than CTEs, and I'm curious why this is.

* I find CTEs much more readable than subqueries

* For naive uses, they are optimized the same

* For complex uses (e.g. recursion to traverse a parent-child hierarchy), CTEs are the only option (sketch below)
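
On the recursion point, here is a minimal sketch of a recursive CTE over a
hypothetical parent/child table:

      WITH RECURSIVE tree AS (
          SELECT id, parent_id, name, 1 AS depth
          FROM categories
          WHERE parent_id IS NULL               -- start at the roots
        UNION ALL
          SELECT c.id, c.parent_id, c.name, t.depth + 1
          FROM categories c
          JOIN tree t ON c.parent_id = t.id     -- descend one level per iteration
      )
      SELECT * FROM tree ORDER BY depth, id;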

The only good reason I see for subqueries is if you are using an old version
of your RDBMS that does not support CTEs.

I'm just wondering if anyone else can weigh in on the topic.

~~~
pgaddict
The thing is, CTEs are not named subqueries. They are supposed to execute just
once, even if used in multiple places in the query.

This causes a number of challenges for the planner, because:

* some of the places referencing the CTE may benefit from quickly producing the first row (e.g. in EXISTS) while others may need all the rows (e.g. when you pass the rows to GROUP BY or something)

* the places may add further clauses, but these can't be pushed down into the CTE because that would influence the other places

There may be databases with planners that handle this differently, but in
PostgreSQL CTEs act as optimization fences - each CTE is optimized on its own,
without considering the rest of the query (additional clauses etc.).

People who use CTEs as named subqueries are often surprised by the impact this
can have on plan performance.

Consider for example this:

    
    
      CREATE TABLE foo (a INT UNIQUE, b FLOAT);
      INSERT INTO foo SELECT i, random() FROM generate_series(1,1000000) s(i);
    
      CREATE TABLE bar (a INT);
      INSERT INTO bar SELECT i FROM generate_series(1,1000000) s(i);
    
      WITH foo_filtered AS (SELECT * FROM foo WHERE b < 0.5)
      SELECT * FROM bar JOIN foo_filtered USING (a) LIMIT 10;
    
                                             QUERY PLAN                                       
      ----------------------------------------------------------------------------------------
       Limit  (cost=36986.19..38236.83 rows=10 width=12)
         CTE foo_filtered
           ->  Seq Scan on foo  (cost=0.00..17906.00 rows=510375 width=12)
                 Filter: (b < '0.5'::double precision)
         ->  Hash Join  (cost=19080.19..63848915.94 rows=510375 width=12)
               Hash Cond: (bar.a = foo_filtered.a)
               ->  Seq Scan on bar  (cost=0.00..14425.00 rows=1000000 width=4)
               ->  Hash  (cost=10207.50..10207.50 rows=510375 width=12)
                     ->  CTE Scan on foo_filtered  (cost=0.00..10207.50 rows=510375 width=12)
      (9 rows)
    

and without the CTE:

    
    
      SELECT * FROM bar JOIN foo USING (a) WHERE b < 0.5 LIMIT 10;
    
                                          QUERY PLAN                                    
      ----------------------------------------------------------------------------------
       Limit  (cost=0.42..10.26 rows=10 width=12)
         ->  Nested Loop  (cost=0.42..502029.00 rows=510375 width=12)
               ->  Seq Scan on bar  (cost=0.00..14425.00 rows=1000000 width=4)
               ->  Index Scan using foo_a_key on foo  (cost=0.42..0.48 rows=1 width=12)
                     Index Cond: (a = bar.a)
                     Filter: (b < '0.5'::double precision)
      (6 rows)
    

All because the planner was unable to consider the join condition and the
LIMIT 10, when planning the CTE.

~~~
greggyb
Thanks very much for your feedback.

This is good to know. Most of my work is in Microsoft SQL Server, and its
query planner does allow other clauses to be pushed into the evaluation of the
CTE.

In MS SQL Server, there is no expectation that a CTE is evaluated only once,
just that it is defined once - it is just an in-line view.

------
dbenhur
The author could lose a layer of subselect (the one that filters on rank) by
using first_value over the window frame instead.
[https://gist.github.com/dbenhur/951b7bae4e07d7998268](https://gist.github.com/dbenhur/951b7bae4e07d7998268)
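
Schematically, and not the exact query from the article or the gist (table and
column names are placeholders):

      -- rank-based version: an extra subselect exists only to filter on the rank
      SELECT user_id, payload
      FROM (
          SELECT user_id, payload,
                 rank() OVER (PARTITION BY user_id ORDER BY occurred_at) AS r
          FROM events
      ) ranked
      WHERE r = 1;
    
      -- first_value version: one level flatter
      SELECT DISTINCT
             user_id,
             first_value(payload) OVER (PARTITION BY user_id ORDER BY occurred_at) AS payload
      FROM events;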

~~~
drob
Author here. I didn't know about first_value. Thanks for the tip!

I'll play with this later today and see if it performs any better. In any
case, it makes the query easier to read, so we'll probably use this unless it
somehow makes the function slower.

Thanks again!

~~~
drob
Profiled this. Using first_value instead of the additional sub-select made
this about 2x slower for arrays of 1M events.

Interesting, and very surprising! I'm going to dig around and see if I can't
figure out why this is the case.

I would have guessed there would be a minor performance improvement, because
the first_value approach effectively allows PostgreSQL to discard the data
sooner, instead of materializing it in an additional subquery.

~~~
dbenhur
Wow... I am surprised. If you gather some useful data, please send it on to
postgresql-dev and/or postgresql-performance so some core devs can fix it in
subsequent releases.

------
mrits
I had some pretty great success rewriting some of our longer-running queries
to use arrays. A couple of things:

I implemented several helper functions around this. Basically I have a
row-based function that makes sets unique, then I have aggregation functions
that give me the cardinality of these sets. I use intset when I can; otherwise
I use the "key" part of hstore as a substitute for the missing set structure.
I've been pushing hard to use:
[https://github.com/aggregateknowledge/postgresql-
hll](https://github.com/aggregateknowledge/postgresql-hll)

Being able to switch algorithms based on size IMO makes that one of the most
badass things about postgres right now.
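
The basic shape of an hll rollup, going by the extension's README (a sketch,
not our actual schema):

      CREATE EXTENSION IF NOT EXISTS hll;
    
      -- one fixed-size hll sketch per day instead of an exact array of user ids
      CREATE TABLE daily_uniques (
          day   date PRIMARY KEY,
          users hll
      );
    
      INSERT INTO daily_uniques
      SELECT occurred_at::date, hll_add_agg(hll_hash_integer(user_id))
      FROM raw_events
      GROUP BY 1;
    
      -- approximate distinct users over an arbitrary date range
      SELECT hll_cardinality(hll_union_agg(users))
      FROM daily_uniques
      WHERE day BETWEEN '2014-01-01' AND '2014-01-31';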

~~~
bonesmoses
That's some pretty good stuff, right there. Out of curiosity, any reason
you're using HSTORE instead of JSON? I've been watching what contexts JSON
gets used in PostgreSQL systems, and having another data point never hurts. :)

~~~
pgaddict
JSON is just a TEXT data type with validation, while HSTORE stores the data in
a structured way and allows more flexible indexing and so on. OTOH hstore is
just text=>text, while the JSON format allows additional data types.

That changed with JSONB, introduced in 9.4 - it's essentially HSTORE2, with
additional improvements.
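
A quick illustration of the difference (the jsonb parts need 9.4+; the schema
is purely an example):

      CREATE EXTENSION IF NOT EXISTS hstore;
    
      CREATE TABLE props (
          id     serial PRIMARY KEY,
          h      hstore,   -- text => text pairs, structured, GIN/GiST-indexable
          j_txt  json,     -- stored as text, validated on input, reparsed on access
          j_bin  jsonb     -- binary, structured, indexable (9.4+)
      );
    
      CREATE INDEX ON props USING gin (h);
      CREATE INDEX ON props USING gin (j_bin);
    
      -- containment lookups work on the structured types
      SELECT id FROM props WHERE h     @> 'color=>red';
      SELECT id FROM props WHERE j_bin @> '{"color": "red"}';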

