
Let Postgres Do the Work (2013) - Jarred
http://sorentwo.com/2013/12/30/let-postgres-do-the-work.html
======
phereford
At work, we have an internally generated report that was done purely in ruby.
It worked, barely, munging numbers for about 1000 records. When we hit 4000
records, the report took about 6 hours to run in ruby.

We looked for some obvious optimizations like eager loading and some map
reduce magic and implemented them. That brought us to a range of 45 minutes to
2 hours.

We decided that it was still not good enough. We moved ALL of the report into
SQL functions (Postgres) and we have the whole function running in 27 seconds.

I agree with some of the top comments that maintaining this is a little
trickier than ruby code, but here are some ideas we came up with for
maintenance: 1) Write the SQL in a ruby file with a class method of sorts that
returns the generated function. This lets version control track changes to
the SQL function. 2) Focus on tests at the interface level. We had great test
coverage on the end result prior to moving to SQL, and we were able to reuse
those tests to verify the accuracy of calculations that now live in SQL.

This really isn't for everyone, but for those that want to try something new,
it is worth it.

~~~
wsmith
I once reduced the running time of a report from 45 minutes to 3 seconds (900x
improvement) by moving the code inside the database.

If a programming language wants to stay fast it must eventually become a
database. I realize this is an unpopular opinion but popularity is the wrong
metric to judge by.

~~~
dom0
Hence Kx, ZODB.

~~~
takeda
I'm not too familiar with ZODB, but it looks like it tries to impose OO on a
database, while in reality the relational model works best with data, so
making the programming language able to interact with data that way would be
better. I think something like JOOQ[1] is close to that.

In order to get good performance you want to minimize the number of
back-and-forth requests over the network. So instead of making a request to
obtain a list of items and then fetching each item one by one (the so-called
N+1 select issue), you will get better performance if you make the database
send only the data you want: nothing more, nothing less.

[1] [http://www.jooq.org/](http://www.jooq.org/)
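A sketch of the contrast, with a hypothetical orders/line_items schema, might
look like this:

```sql
-- N+1 pattern: one query for the list, then one query per row returned.
SELECT id FROM orders WHERE customer_id = 42;
-- ...then, for each returned id:
--   SELECT * FROM line_items WHERE order_id = <id>;

-- Single round trip: the database joins and sends exactly the data wanted.
SELECT o.id, li.product_id, li.quantity
FROM orders o
JOIN line_items li ON li.order_id = o.id
WHERE o.customer_id = 42;
```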

~~~
dom0
> I'm not too familiar with ZODB, but it looks like it tries to impose OO on a
> database, while in reality the relational model works best with data, so
> making the programming language able to interact with data that way would be
> better.

ZODB itself is essentially a transactional single-object store, where that
"single object" is usually the root node of an arbitrary (possibly cyclic)
object graph. Storage depends on the backend, nowadays you'd usually use file
storage for development and relstorage for any kind of deployment.

It doesn't have a query language of its own; Python is the query language.
There are no indices beyond those you explicitly create (eg. a dictionary
would be a simple index, but other packages provide more complex indices, like
B-Trees). The nice thing here is that anything that goes into the DB is fully
transacted, which removes a lot of the headache you'd have with other
solutions (eg. flat XML files).

ZODB thrives when your data model is too complex to map easily or efficiently
into SQL, when you don't do or need efficient complex ad-hoc queries, and when
absolute performance and concurrency (although the latter was recently
improved through the MVCC reimplementation) aren't the highest priority. Since
it's pretty good at caching (much better than most ORMs), performance usually
doesn't suck.

------
memracom
This is an example of what I call "embracing PostgreSQL". Roughly speaking, it
means ditching the ORM layers and using a simple API to run any SQL query you
can think up. Then your application can leverage the many features that exist
in PostgreSQL rather than treating it as a dumb SQL server.

Using functions as an intermediary to raw queries also decouples the SQL
details from the application. What I mean is that if the application calls an
SQL function to read, update, insert data then you can revise the SQL
functions independently of application code. In other words, if you refactor
an SQL function, you can deploy the change to the database without rebuilding
and deploying the application.
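A hypothetical sketch of that decoupling: the application only ever calls the
function by name, and the body can later be revised with CREATE OR REPLACE
without touching application code (all names here are illustrative).

```sql
-- Hypothetical interface function: the application is unaware of the
-- tables or joins behind it.
CREATE OR REPLACE FUNCTION get_customer_balance(p_customer_id integer)
RETURNS numeric AS $$
  SELECT coalesce(sum(amount), 0)
  FROM ledger_entries
  WHERE customer_id = p_customer_id;
$$ LANGUAGE sql STABLE;

-- Application code stays a one-liner:
-- SELECT get_customer_balance(42);
```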

This is especially helpful in companies that have an onerous verification and
validation process for application changes, because often "performance
improvements" are classified as administrative changes which can be done with
less bureaucracy.

So, embrace PostgreSQL as a permanent part of your application and reap the
benefits of its many powerful features.

~~~
rtpg
As someone who's thought about embracing Postgres over using the ORM, here are
a couple of issues that come up:

- Version control. The SQL functions end up just being data in your DB. This
feels like an anti-feature in the world of continuous integration and code
review.

- Lack of composability. An ORM gives you the ability to easily compose
filters. This lets me write functions like "take this query, and make sure
it's properly filtered down to one user's data". Because of how SQL works,
each query needs to be hand-crafted, and it's hard to make parts of it
generic.

- Rewriting business logic. I have business logic in Python already, but if I
want to query off of that in the DB in SQL, now I need two implementations.
You can sort of get around this in an ORM by annotations.

- I'm not sure what the developer tooling environment is like. PyCharm is
pretttty nice.

To be honest, your citing that it's easy to get around the validation process
when using Postgres is a major red flag. Why would I want to circumvent code
review for things that are quite undoubtedly code?

~~~
gfody
The right way to get composability in an RDBMS is with things like
parameterized views or table-valued functions. This falls under embracing the
RDBMS and leveraging its modern capabilities.
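In Postgres, a set-returning function plays the role of a parameterized view
and composes with further clauses. A sketch, with hypothetical table and
column names:

```sql
-- A set-returning ("table-valued") function acts as a parameterized view.
CREATE OR REPLACE FUNCTION recent_orders(p_customer_id integer)
RETURNS SETOF orders AS $$
  SELECT * FROM orders WHERE customer_id = p_customer_id;
$$ LANGUAGE sql STABLE;

-- Callers compose it like a table: filter, join, sort, paginate.
SELECT *
FROM recent_orders(42)
WHERE total > 100
ORDER BY created_at DESC
LIMIT 10;
```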

------
kofejnik
I had to work on/with a rather large backend which used this approach. Most of
the logic was in stored procedures, with python/sqlalchemy used as a glue and
to process requests.

It was absolutely horrible to develop and a nightmare to test.

For one, there's nothing approaching ActiveRecord migrations in stability and
ease of use, so schema changes on dev/test/deploy were a huge pain. Then, you
can't really work in the normal "auto-run tests on save" mode you'd have with
saner frameworks, since you need to sync source from git into Postgres (the
principal dev preferred to edit directly in pgadmin and sync afterwards).
Then, SQL tends to be a lot harder to read than python/ruby, especially if it
is not well documented and has many join tables. And finally, you can't scale
out the heavy stuff; you're basically limited by your master instance. 2/5,
would not recommend.

~~~
phereford
With the correct abstractions, you can definitely run tests on save.

SQL is rather readable these days. It's obviously not as easy to read as ruby
or python and is slightly more verbose, but it is certainly readable.

Mileage will vary of course. There are cost/benefit trade offs with every
architectural decision made.

~~~
josteink
> SQL is rather readable these days. It's obviously not as easy to read as
> ruby

As someone who can't read Ruby but can handle "any level" of SQL fluently,
that's obviously not a very objective statement.

~~~
phereford
I made a very broad assumption based on my past experiences with engineers and
I apologize if I offended you.

Most engineers I have met in my career have really depended on an ORM to do
all of their heavy lifting. They can write some of the most eloquent
ActiveRecord queries but would be unable to create the same functionality
using raw SQL.

Again, deeply sorry if I offended. It was not my intention.

~~~
kofejnik
Wow, is this really the norm? I mean, not knowing SQL and working purely with
ORM. This seems weird to me, as I learned SQL first.

~~~
majewsky
Yes, that's definitely the norm. Having some experience with Rails, I'm pretty
sure that 90% of Rails developers have never written any SQL. If they need to
access the production database, most of them would use the rails console on
the production app server instead of psql etc.

And I can sympathize. For example, I wouldn't even know how Rails maps a many-
to-many relation to SQL tables, so I would have to use the rails console
myself to traverse such a relation.

~~~
kofejnik
Although Rails is very reasonable with its mappings and they are easy to use
directly, I prefer Rails for db manipulation; it is usually much easier and
faster to type 'User.find(1).favorites << Product.find(10)' than the
equivalent SQL.

~~~
bdcravens
Very true, but I've found that it's easy for those ActiveRecord abstractions
to kill performance (I did this on a dashboard that was performing some
serious calculations across hundreds of millions of records, often joining
other similarly sized tables)

------
gfodor
Cool demonstration of IMMUTABLE functions (TIL), but there's a reason people
have moved away from stored procedures/functions in RDBMSs in favor of moving
this stuff into much slower application code -- you now have business logic
living inside of the database. From there, you have to now think about how
this affects:

- Testing

- Deployment/Migrations

- Day-to-day tooling to work on it

- Switching costs (mental overhead, using multiple languages)

- Documentation

- Probably more stuff

The ruby code is just like all of your other ruby code and can be grokked and
worked on by anyone who is up to speed on your ruby stack. The DB function,
meanwhile, requires jumping through lots of hoops and will always feel like a
second-class citizen relative to the vast majority of your application code,
unless you do a lot of work and maintenance (unlikely to happen without
feeling distracting if this code is a relatively small slice of your code
base).

In some cases, this can be worth it for massive optimization wins. But the
article doesn't really touch on why this might be a bad idea.
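For reference, the technique the article demonstrates looks roughly like the
following sketch (the formula and names here are illustrative, not the
article's exact code): a function declared IMMUTABLE may be used in an
expression index, so ordering by it becomes an index scan.

```sql
-- IMMUTABLE promises the same output for the same inputs, which is what
-- lets Postgres precompute and index the result.
CREATE OR REPLACE FUNCTION post_rank(popularity integer, id integer)
RETURNS numeric AS $$
  SELECT log(greatest(abs(popularity), 1)) + id / 45000.0;
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX index_posts_on_rank ON posts (post_rank(popularity, id) DESC);

-- Served from the index rather than computed per row at query time:
SELECT * FROM posts ORDER BY post_rank(popularity, id) DESC LIMIT 10;
```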

I do think it highlights the potential of systems like MADLIB [1] where you
can push the data crunching into the database, while having that be abstracted
away in your primary programming language (in this case, R.) One could imagine
a world where you wrote the ruby code and it was somehow mapped in to be run
on the database. But that's not the world we live in today, unfortunately.
It's really a shame, because people using a database as capable as PostgreSQL
but with a fairly dumb ORM on top are under-utilizing its potential, for sure,
but there's not really a great path afaik to leverage it. (I remember years
ago MS introduced running C# inside of SQL Server; I am not sure if that ever
ended up panning out into something smarter than the dumb "upload code to the
database manually" model we are used to.)

[1] [http://madlib.incubator.apache.org/](http://madlib.incubator.apache.org/)

~~~
ianamartin
I may be in the minority here, and I'll happily accept that if I am. But my
rule of thumb is that anything that can be done inside the database should be
done inside the database.

But I'm a python/C#/.Net person.

The stored procs/functions go into a repo just like everything else does. It's
not that big a deal. Alembic for migrations is pretty solid. And if you're
dealing with a large code base, you log changes inside the DB itself. Use the
database for version control.

It's really manageable. It requires that your dev team really knows SQL. But I
don't think that's a bad thing.

It's fast, gets the job done, and if you're running a language that doesn't
make it easy to deal with concurrency, it's nice to just let the database
handle transactions.

For everyone who isn't facebook or google, this is the way to go, in my
experience.

Let the db do the hard work, let your preferred language be the middleman, and
have a reasonable front end framework do the presentation.

I can scale a web app to millions of users on a couple hundred bucks a month
of hardware with this approach. You can too.

~~~
kodablah
> my rule of thumb is that anything that can be done inside the database
> should be done inside the database

For non-distributable (or at least non-multi-master) DBs, you are often using
your most expensive and least distributable resource to do your cheapest and
most distributable work. My rule of thumb is that unless you need atomicity or
the performance (as profiled) makes a real difference, do it outside the DB.
Also has the side effect of avoiding system lock-in (at the expense of
language lock-in which I think is ok).

Granted, it's subjective, but I've had more trouble scaling DBs than other
systems, and it's usually the CPU/mem of business logic pushing the
boundaries.

~~~
kofejnik
Exactly; you really only need ACID for the important stuff. Other data you can
cache on the workers where you run most of the logic.

~~~
anewhnaccount
If you're doing reports you could run them against a read-only slave.

------
smac8
At my last job we finally got our front-end devs to move all client-side db
work to calls to stored procedures or views. After doing so, nearly all of
them came out completely in favor of this method. This is especially true with
large, complex DBs or data warehouses, where the model is subject to change
and reliance on the back-end devs' better knowledge of the database is more
important.

That said, despite some of the tradeoffs - for example with testing, or
migrating to a new db (which is extremely rare anyway) - it is ultimately far
better for client code not to have to know back-end implementation. I think
things like graphql are finally making this abundantly clear.

What I have been dying for is a real-time DB BaaS that is ACID compliant,
preferably relational, has a simple REST API, and allows me to write stored
procedures that I can call from my client code. Horizon is probably the
closest thing out there. Right now I'm using Firebase, and I'm sick of the
absurd amount of client-side code I have to commit to pull data from various
parts of the database. It requires huge amounts of overfetching, unnecessary
coupling of client-server code, horrible consistency support, overly loose
type structuring, etc.

If somebody writes a postgres version of horizon I will pay big money to use
it :)

~~~
oomkiller
Doesn't include everything that Horizon does, but it provides a good
foundation for building it [http://postgrest.com/](http://postgrest.com/)

~~~
smac8
Cool, hadn't seen that. There is also postgraphql, which I have become a big
fan of and is a great foundation:

[https://github.com/calebmer/postgraphql](https://github.com/calebmer/postgraphql)

That said, the real-time aspect is what I'm really dying for, plus a company
supporting it as a BaaS. The ease of Firebase is so nice, and you can make
really cool real-time apps with it very fast. I'm not sure whether switching
to a relational model would impose some technical limitation that json stores
don't have (besides the obvious complexity). Postgraphql + socket.io is kind
of what I'm thinking is the start.

~~~
ianbicking
Some clever use of NOTIFY might help with real-time? It's basically
publish/subscribe directly in Postgres itself, but I've never actually used
it.
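A minimal sketch of that idea (table, channel, and function names here are
hypothetical): a trigger broadcasts each change via pg_notify, and any client
session that has executed LISTEN on that channel receives the payload.

```sql
-- Broadcast row changes on the 'posts_changed' channel.
CREATE OR REPLACE FUNCTION notify_post_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('posts_changed', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_notify
AFTER INSERT OR UPDATE ON posts
FOR EACH ROW EXECUTE PROCEDURE notify_post_change();

-- In the listening client session:
LISTEN posts_changed;
```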

~~~
smac8
I had forgotten about this actually. Thanks for the reminder :) Curious if
anyone has tried using this for real time apps and what their experience was

~~~
oomkiller
Unfortunately NOTIFY doesn't work across nodes.

------
goflyapig
The article makes a good point that if you want ordering by a computed
'ranking' to be performant, your database will need to store and index that
value.

But then it takes another leap and says that this means your database should
actually _compute_ that value. I don't see why that's necessary, and it would
bring with it all sorts of issues that other people have mentioned
(maintainability, scaling, testing, coupling).

It seems you could equally solve the problem by adding an additional indexed
'ranking' column, and computing and storing the value at the same time you
insert the row [1]. It seems that's essentially what Postgres would do anyway.

Also, I'd note that the algorithm here is very simplistic, and even so, the
author had to make a _functional_ change in order to get it to be performant
in Postgres. You can't just substitute an ID for a timestamp, just because
both are monotonically increasing. The initial Ruby version of this algorithm
treats a given 'popularity' as a fixed jump in some amount of time (e.g. 3
hours). The Postgres version treats it as a fixed jump in _ranking_ (e.g. 3
ranks). Those are not equivalent.

[1] This does assume that you can control all the database manipulation from a
centralized place.

~~~
majewsky
> It seems that's essentially what Postgres would do anyway.

Yes, but if Postgres does it, it's literally impossible for the developer to
screw up, whereas manually putting the ranking in the table can easily cause
inconsistent records if you don't know what you're doing.

And even if you know what you're doing, next month's code change might not
remember that it's important to update the ranking with the rest of the record
(unless you're calculating it in a before_save hook or something).
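A BEFORE trigger is the database-side equivalent of that before_save hook: no
application code path can forget it. A sketch, with illustrative names and an
illustrative formula:

```sql
-- Compute and store the ranking on every write, inside the database.
ALTER TABLE posts ADD COLUMN ranking numeric;

CREATE OR REPLACE FUNCTION set_ranking() RETURNS trigger AS $$
BEGIN
  NEW.ranking := log(greatest(abs(NEW.popularity), 1)) + NEW.id / 45000.0;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_set_ranking
BEFORE INSERT OR UPDATE ON posts
FOR EACH ROW EXECUTE PROCEDURE set_ranking();

-- A plain index on the stored column then serves ORDER BY ranking DESC.
CREATE INDEX index_posts_on_ranking ON posts (ranking DESC);
```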

------
blowski
I tried doing this, but I found it hard to maintain as it needed more slaves
to cope with the load of the complex functions. Also, it was harder to tweak
the algorithm for things like "increment the score by x in case y, otherwise
increment by z". I guess it's possible, but I just found it too hard to work
with.

For me, it was easier to use Redis to keep a sorted set of article IDs. I
updated the scores in my application code using Redis native functions, and
then got a slice of IDs for a `SELECT ... WHERE IN (ids, from, redis)`.

It's probably because I'm a rubbish database engineer, so I stick with my
strengths.

~~~
toast0
> It's probably because I'm a rubbish database engineer, so I stick with my
> strengths.

This doesn't make you a rubbish database engineer. Scaling databases is
complex, so it makes sense to have your source-of-truth database do only the
things it must do (consistent updates, indexing the ranking value if needed
for queries) and the things it's good at, and move everything else
(calculating the ranking value) to another part of your system. If you can
keep database load low enough to avoid sharding, that's a big win; you don't
have to stop growing your DB, you just have to grow slower than Intel grows
CPU performance and accessible RAM.

------
sbuttgereit
PostgreSQL stored procedures can absolutely be a pain to manage, and change
management can be tricky. Having said that, these are not problems without
solutions, and there are some good wins if you can be organized enough to pull
it off.

First off, plan and organize things. Yes, it would be nice if there were
another level of abstraction, like Oracle's packages, but you can use schemas
a bit more freely for namespacing. A little time thinking instead of just
coding can avoid much of the spaghetti. Many good devs will think through the
rationality of their APIs; this is no different.

Second, use a tool that lets you manage the code-bits like code. Migration
tools like FlywayDB, etc. are not good for this because they sacrifice too
much to the stateful nature of the database: you end up organizing things on a
timeline rather than based on your structure. A better approach is the one in
Sqitch ([http://sqitch.org](http://sqitch.org)), where you organize files
according to the organization of the database rather than its history... a
very powerful approach. There are also verify steps which can incorporate
testing: testing the stored procedures is easy... it's managing the data in
tables across tests that can be hard, and that is true regardless of where
your business logic lives (insofar as those tests change data).

Third there are good testing tools available: The guy that made Sqitch also
makes pgTAP ([http://pgtap.org/](http://pgtap.org/)). Well worth a look. I
haven't done much myself with this one, but will likely do so soon.

------
liquidise
I've written about this before [1], but I think the higher-level message here
is about Ruby on Rails developers expanding their database understanding
and/or reliance. Given how cleanly ActiveRecord abstracts database porcelain,
many devs who started with Rails don't lean on the database enough.

This, along with validators, table indexing and foreign keys, is one of the
more common shortcomings I see in devs who learned on Rails.

1: [https://blog.benroux.me/be-the-better-rails-developer/](https://blog.benroux.me/be-the-better-rails-developer/)

~~~
vinceguidry
When you're starting out, you don't need these optimizations. Depending on
what you're trying to do, you may never need them, and instead find that a
caching solution is best.

~~~
hderms
Data integrity isn't an optimization.

------
justusw
What is the performance impact on INSERT when using tables with complex
indices like the one mentioned in the article?

~~~
coltonv
I'm not the author, but I'd guess almost nothing. Basically you just need to
calculate the result of the function and index it by the result, which should
be almost completely unnoticeable.

~~~
zerd
Unless it's a table with high write load. You're basically shifting the
expensive part from read time to write time.

------
combatentropy
For version control, I put .sql files right alongside my other source code,
and check them in. I might try snippets out interactively, but after a
statement grows so long that it wraps a couple of times, I usually move it to
a text file and edit from there. I run it in psql with the \i command:

    
    
       psql> \i test.sql
    

Usually I have one terminal window for the text file opened in vim, and
another window for psql where all I do is keep running "\i test.sql".

When the code is polished and ready to go live, I turn it into a view or
function and move it into a permanent file (views.sql, functions.sql, ddl.sql,
make.sql, or whatever you want to call it). This file can be run as many times
as I want, idempotently. This is because it starts off dropping all the views
or functions that it later defines. You can try instead saying "CREATE OR
REPLACE . . ." but PostgreSQL only lets you replace a view or function if its
outer appearance stays the same (for views, the output columns can't change;
for functions, the input and output can't change). So I find it simpler to
just drop all of them ("DROP VIEW IF EXISTS foo", etc.) and then create them
(you usually have to drop them in the opposite order you create them, if any
depend on one another). I wrap the whole file with BEGIN and COMMIT, so if
there is any error the whole thing aborts. Actually I wrap it in BEGIN and
ROLLBACK, and run it to see if there are any errors. Only when it's solid,
polished, and error-free do I temporarily change ROLLBACK to COMMIT, run it,
and immediately change it back to ROLLBACK. The file that gets checked into
version control ends in ROLLBACK, for safety.
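The skeleton of such a file might look like this (the view definition is
purely illustrative):

```sql
-- Idempotent rebuild script: drop everything it defines, then define it.
-- The checked-in copy ends in ROLLBACK; flip it to COMMIT only for the one
-- run that actually deploys the change.
BEGIN;

DROP VIEW IF EXISTS active_users;

CREATE VIEW active_users AS
SELECT * FROM users
WHERE last_seen_at > now() - interval '30 days';

-- ...more DROPs and CREATEs, dropping in reverse dependency order...

ROLLBACK;  -- change to COMMIT to apply for real
```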

Later if I want to change something, I always do it first by editing the .sql
file and then running it from psql. Then I check the changed file into version
control.

---

For views and functions this is easy, because they don't keep their own data.
You can drop and recreate them at any time. For tables, it's a little more
complicated. You can't just drop and recreate them, or you'll lose all your
data. So for tables I keep their definitions in a separate file. I run this
file at an app's inception but seldom afterward. If I need to make a change to
a table, I do it interactively, in the command-line tool ("ALTER TABLE foo ADD
COLUMN blah ..."). Then I will probably edit the tables.sql file to reflect
the change, just so it's in version control. But it's up to you. You can
always pull the table definition from Postgres with the pg_dump command.

------
z3t4
It's often much easier to scale via app "farms" than to scale the db backend.
Sometimes db procedures can lower the total db load, but be careful about
letting the db do too much work.

------
jklein11
I thought the point of using an ORM like Active Record was to decouple the
business logic and the database?

~~~
oftenwrong
It can decouple you from the specific database implementation. The ORM can
generate SQL that targets the database you are using by altering the generated
queries to account for dialect and feature differences. For example, it can,
potentially, make it possible to switch your application from using MySQL to
PostgreSQL rather seamlessly. However, in my experience, this is rarely
straightforward, and rarely done. It's not unusual for the application layer
to become dependent on some quirk or unique feature of the underlying
database, and sometimes that dependency is not obvious. The sibling reply is
not far off the mark when it comes to using ORMs "in practice".

------
eddieroger
In one of my gigs, it was normal operating procedure to put logic in sprocs
and call out to them from elsewhere. It makes so much sense: RDBMS engines are
good at doing math, so we should let them do as much math as possible. ORM
layers hide that away, and thinking about these problems in Ruby or Python
works, but those languages just aren't as good at plain math as a database can
be. I've done this type of thing in Rails and Express apps, and it makes
complete sense there. You can even keep the sprocs in source control by
creating and changing them as migrations. It's tops.

------
mooneater
"We can replace the timestamp with the primary key for the “recentness”
calculation". Very different properties, not a great replacement.

~~~
czep
I definitely agree that there may be some unexpected consequences of using the
primary key in this way. It certainly won't behave like a timestamp and could
distort the function outputs.

Would it be possible to substitute the integer count of seconds since epoch
(or some other arbitrary base timestamp) and use that instead? Then you'd
still have an immutable function to satisfy the index, but also you'd preserve
valuable information about the relative time intervals between posts.
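As a sketch of that substitution (column names illustrative): extract(epoch
from ...) on a timestamp without time zone is immutable, so the expression can
go straight into an expression index while preserving real time intervals.

```sql
-- Rank by log-damped popularity plus elapsed seconds, preserving the real
-- time intervals between posts instead of approximating them with the id.
CREATE INDEX index_posts_on_time_rank ON posts (
  ((log(greatest(abs(popularity), 1))
    + extract(epoch from created_at) / 45000.0)) DESC
);
```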

------
ankurdhama
The idea is: compute where your data is. And yes, data does have a location
when the computing has to be done by physical machines.

------
JBiserkov
>The integer value of a timestamp is just an increasing counter, ticking off
each microsecond. The database itself has another form of increasing counter,
the primary key.

This observation is either profound or trivial, depending on the mindset. The
hundred-fold improvements it brings (along with the immutability) are not
trivial, by any measure.

------
tsxxst
Is it really necessary to declare separate functions with all that boilerplate
just to multiply and add a few numbers in an SQL statement?

~~~
Sean1708
No.
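Right: the arithmetic can be inlined directly into an expression index with no
named function at all, so long as every operator and function involved is
itself immutable (column names here are illustrative).

```sql
-- Same ranking arithmetic, inlined; note the extra parentheses Postgres
-- requires around an expression in an index definition.
CREATE INDEX index_posts_on_rank_inline ON posts (
  ((log(greatest(abs(popularity), 1)) + id / 45000.0)) DESC
);

-- The index is usable when the ORDER BY expression matches it:
SELECT * FROM posts
ORDER BY log(greatest(abs(popularity), 1)) + id / 45000.0 DESC
LIMIT 10;
```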

------
alexnewman
The future is this but backed with super cool FDW for s3 and bigtable

------
mrwnmonm
What happens if I get 10 posts, and then, before I fetch the second 10, one of
the posts that should appear in the second query gets 50 likes? It will get a
higher ranking and I will miss it, right?

------
godmodus
He mentions 'logarithmic clamping' towards the end; can someone provide a link
explaining what that is? My google-fu is failing.

------
yandrypozo
I think ~0.442ms is very slow for a social media site or app, considering that
it's only 1 query; maybe that example isn't the best for that kind of query.

------
Walkman
It's unbelievable that an article can get to HN front page in which the author
doesn't know the difference between Rails and Ruby. So annoying!

This is NOT Ruby for God's sake:

    
    
        SYSTEM_EPOCH   = 1.day.ago.to_i

~~~
randall
People are downvoting I think because this is a rather pedantic point to be
annoyed by. The tone is one of superiority, rather than pointing out the
simple mistake.

Calling a rails function a ruby function is unlikely to invalidate the
entirety of the article.

~~~
Walkman
I know the point is somewhat pedantic, but I just don't really like to learn
from people who don't even know what is what. It somewhat invalidates the
article IMO, but I read it anyway, just very carefully, trying not to learn
anything, only to understand the basic idea the author wants to convey.

I think they downvote because they are so ignorant I can't even believe it.

~~~
randomdata
Perhaps they are downvoting because of your own ignorance? The aforementioned
code does depend on ActiveSupport (or similar library), but has absolutely
nothing to do with Rails.

~~~
Walkman
Thanks for explaining!

------
alexnewman
Speaking of this, I'm looking at hiring someone to do some OSS rust/postgresql
work, namely writing FDWs for s3 and bigtable.

