
Which is better? Performing calculations in sql or in your application? - hartleybrody
http://stackoverflow.com/q/7510092/625840
======
sitharus
This is something I've been through before.

The first version of an app I worked on was very SQL-heavy. Almost every
calculation was done in a stored proc and the app servers just formatted the
results.

As the product got popular this became the bottleneck. It's far easier to get
more app servers than DB servers.

So we restructured it to do straight index reads and aggregations in the DB,
but more complex calculations in the app itself.

It all depends on the circumstances, but I'd still advocate pushing as much in
to the DB as you can without making convoluted SQL - your average RDBMS has
amazing optimisations for aggregation, sorting and filtering.

~~~
codexon
But the problem with doing it in the app is that when you are joining huge
row sets, all of this unfiltered data gets sent over a relatively slow
100-1000 Mbit port.

~~~
awda
We push data over 10Gbit - 40Gbit ports, but the point still stands.

------
fiatmoney
From a straight performance perspective:

The rule of thumb is that DBs are bound on IO, so any calculation that you can
get with the same amount of IO as returning the records is likely to be
"free", and any calculation that requires more IO is likely to be "expensive"
(unless that allows you to avoid IO by restricting returned records, avoiding
future queries, etc. - it gets complicated fast).

So things like column-column calculations, simple aggregates, etc. are likely
to be good ideas on the DB; for anything else It Depends.

~~~
sliverstorm
_The rule of thumb is that DBs are bound on IO_

Isn't that just because of the way systems that make use of DBs are
architected? I would think you could trivially make them CPU bound if you
start tacking a bushel of FLOPs on to every request.

~~~
fiatmoney
It's a rule of thumb, not a rule of fist.

But actually, it's not uncommon that you might have enough caching (or an in-
memory DB) that you appear to be maxed out on CPU - but because of memory
latency, overhead between the DB and the network, etc. you can still get a
fair number of operations for "free".

Measure everything, but know where your starting points are and what to try
first.

------
sehrope
The big advantages for doing things "on the DB" are centralizing business
logic and eliminating serialization/transport/deserialization. Having a view
or stored proc with business logic lets it be shared across different
components of the same app or even across separate apps that share a database.
An extreme (yet common) example of calls best done on the DB itself is a SUM
of price X quantity or any other grouping operation. Rather than moving N rows
to the client, the server can aggregate it all and just send back the result.
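
To make that concrete, here's a minimal sketch using Python's built-in sqlite3
module (the line_items table and its values are invented for illustration):

```python
import sqlite3

# In-memory SQLite stand-in for a real server; prices are in cents.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE line_items (order_id INTEGER, price_cents INTEGER, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO line_items VALUES (?, ?, ?)",
    [(1, 999, 2), (1, 450, 1), (2, 2000, 3)],
)

# The DB aggregates: one row per order comes back, instead of every
# line item crossing the wire to be summed client-side.
rows = conn.execute(
    "SELECT order_id, SUM(price_cents * quantity) AS total_cents "
    "FROM line_items GROUP BY order_id"
).fetchall()
print(rows)  # [(1, 2448), (2, 6000)]
```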

If CPU usage on your DB really is your bottleneck ( _and seriously it's
probably not_ ) then you should look into federating logic out to your app.
Otherwise the centralization of app logic alone is worth it.

~~~
altcognito
You should write isolated libraries of business logic, not bake it into your
data layer.

~~~
sehrope
Sometimes it makes sense to make a common library shared between apps that
interacts with your database and sometimes it makes sense to put that code
itself in the database (as a view or stored proc). The latter has the
advantage of working in multiple languages, direct DB access (e.g. your fav DB
client), and external systems that also read from your database (e.g. a
reporting service that connects to your DB). A shared library would be useless
in the second and third situation and you'd end up duplicating logic.

~~~
mnutt
So when the time comes to change the business logic you have stored in the
database, do you have to ensure that every possible client that may use that
business logic is updated at the same time? You could version your business
logic in the database, but then it seems like you've implemented a version
controlled business logic library inside the database.

Incidentally, every time I see people attempting to put business logic in the
database it's usually manually updated views or stored procedures. It reminds
me of the days of yore of people shelling into production to edit php files.
Other than Rails migrations, South, etc, are there usable tools out there for
sanely writing software that runs inside the database?

~~~
dragonwriter
> So when the time comes to change the business logic you have stored in the
> database, do you have to ensure that every possible client that may use that
> business logic is updated at the same time?

No. If you change any of the interfaces (e.g., the structure or semantics of a
view) in a non-backward-compatible manner, rather than merely changing the
implementation, you have to assure that the _consumers_ of the specific
affected interfaces are updated. But if you've built a DB structure that
isolates applications well (each application uses its own set of views, which
reference the views implementing shared logic, which reference the base
tables), most changes to shared business logic should be completely
transparent to most applications: in the normal case they won't impact even
the application-specific views, and even when they do, they'll only require
changes to the app-specific view definitions, without touching the actual
application.
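
A toy sketch of that layering, using SQLite (the tables, view names, and the
"billable" rule are all invented): each app reads its own view, which sits on
a shared-logic view, which sits on the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Base table: the raw facts.
CREATE TABLE orders (id INTEGER, amount_cents INTEGER, status TEXT);
INSERT INTO orders VALUES (1, 1000, 'paid'), (2, 500, 'void'), (3, 250, 'paid');

-- Shared-logic view: one place defines what a billable order is.
CREATE VIEW billable_orders AS
    SELECT id, amount_cents FROM orders WHERE status = 'paid';

-- App-specific view: the reporting app only ever reads this.
CREATE VIEW report_revenue AS
    SELECT SUM(amount_cents) AS revenue_cents FROM billable_orders;
""")

revenue = conn.execute("SELECT revenue_cents FROM report_revenue").fetchone()
print(revenue)  # (1250,)
```

If the definition of "billable" changes, only billable_orders gets redefined;
report_revenue and the app code never notice.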

------
danielweber
I've written some pretty awesome SQL queries that give back exactly what the
application wants with the application sometimes not even needing to do
anything else afterwards.

I've almost always regretted those "awesome" queries.

I've known some very smart developers who've gotten lost in SQL queries, yet
show them the equivalent Ruby/Python/JavaScript/C code that parses through the
results and they can understand it in less than a minute.

~~~
bartonfink
One approach I've seen that works well for untangling gnarly SQL is to
refactor those mega-queries into smaller, component queries that are pieced
together by application code. This keeps the processing on the DB server but
avoids the pain that comes with programming in SQL. I rather like SQL, but it
doesn't handle complexity well at all (e.g. maintaining a "variable" to be
used in different parts of the query).

For example, I once worked on a Foursquare clone originally written by a
hardcore PostgreSQL nut. This system had a query that, if memory serves,
returned a list of places of a certain type within a geographic area along
with user activity on those places (votes, comments, etc). This was around a
75 line SQL query that actually wasn't that fast (response times from the DB
were roughly 1 second even with every join indexed). We rewrote that query
into 4 smaller queries (place ID's within that area, place ID's within that
category, hydrating those places from the filtered ID's and then getting the
user info), and that cut our DB response by about 70% in addition to making
the system easier to work with. This required roughly 10 lines of Java code
and a variable - a list of ID's that we got first and passed into each other
query. It also freed us up to do other things - for instance, if performance
were still a problem, queries 2-4 could have been done asynchronously behind a
latch. By lifting the "glue" out of SQL and into a better language, it freed
us to do new things, and it freed the database from having to juggle
unnecessary complexity while planning and executing its queries.
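
Roughly, the pattern looks like this in application code (the schema and data
are invented; the real system used PostgreSQL and Java, while this sketch uses
Python's sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE places (id INTEGER, name TEXT, category TEXT, region TEXT);
CREATE TABLE votes (place_id INTEGER, user_id INTEGER);
INSERT INTO places VALUES (1, 'Cafe', 'food', 'north'),
                          (2, 'Bar', 'drink', 'north'),
                          (3, 'Deli', 'food', 'south');
INSERT INTO votes VALUES (1, 10), (1, 11), (2, 10);
""")

# Query 1: the "variable" -- the filtered list of place IDs.
ids = [r[0] for r in conn.execute(
    "SELECT id FROM places WHERE region = 'north' AND category = 'food'")]

# Follow-up queries each take the ID list instead of re-deriving it
# inside one giant joined statement.
marks = ",".join("?" * len(ids))
places = conn.execute(
    f"SELECT id, name FROM places WHERE id IN ({marks})", ids).fetchall()
votes = conn.execute(
    f"SELECT place_id, COUNT(*) FROM votes WHERE place_id IN ({marks}) "
    "GROUP BY place_id", ids).fetchall()
print(places, votes)  # [(1, 'Cafe')] [(1, 2)]
```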

~~~
crazygringo
I've written gigantic, 500+ line queries for MySQL, with probably 40
subqueries within. Obviously, what MySQL receives is a monstrosity that nobody
in their right mind would try to understand.

But the query is assembled piece by piece, in separate functions, each
subquery responsible for its own contribution to the final query string, with
well-defined inputs and outputs. The entire file that generates the query
reads quite logically.

And there's simply no alternative -- many pieces of processing involve
100,000+ rows, so round-trips between db and app would be prohibitively slow.
The whole thing uses data from around 10 different tables, it's extremely
relational.

But because it's structured well and written correctly, the whole thing
executes in a small fraction of a second. (Trying to do it in a "NoSQL" style
would probably take ten minutes of back-and-forth network communications.)

I've known a lot of programmers who would shy away from such a thing -- but
that's because a lot of programmers don't bother to actually understand SQL
the way they understand Ruby or JavaScript or PHP. It can do amazing feats of
data processing, which is the whole point of a relational database. My advice
is, dig deep into SQL. It can work wonders, but it's true that its "best
practices" can be difficult to learn, and there's a lot of bad advice out
there.

~~~
boomzilla
I've never seen huge queries like that for OLTP (online transaction
processing) apps, but I've written ~1000-line SQL queries for offline batch
jobs. Especially with the recent popularity of Hive, which allows UDFs written
in Java, one can do wonders with SQL.

One thing SQL really helps with is that it forces you to think "data first".
Instead of thinking in algorithms, step by step, it makes you ask: what data
do I have, and what output do I want out of it? Not unlike functional
programming, but with more focus on data sets.

------
j45
Premature optimization is a bigger enemy than performance issues.

That being said, the DB server is what's optimized to do calculations.

The tradeoff is you have a second stack to maintain and performance tune now
beyond being a datastore. A positive is you can independently write and run
tests.

The question is, can you resist building the perfect empire on day 1? Move
stored procs and functions into the DB as they are needed. Whatever you're
working on (including who is working on it) isn't that important.

~~~
gaius
And to cache the results of those calculations for the next query. How does a
farm of app servers do that?

~~~
j45
Depending on the app server you are using, it should be able to cache results
of a query no problem. I've never considered this to be an issue :)

------
JulianMorrison
Move things to the DB if the calculation aggregates data (the input is much
larger than the output), to avoid materializing data and shipping it around.
(Map-reduce is a subtype of this.) Otherwise don't.

------
protomyth
If you are using numeric data (not floats) in the database, I would say do
your calculations in the database. Different languages and tools have
different arithmetic handling which can cause some very long and psychotic
debugging sessions. A single point of calculation is a good thing.

[edit] I should explain a bit. If you are in a multi-language environment[1]
and are doing financial or weight / volume calculations, be extremely careful
if you decide not to do all the calculations on the database.
Having results calculate differently in two different places will drive you
mad. I have noticed some serious problems with number handling in different
languages and some mistakes in calculation will get you sued.

1) SQL counts as one of the languages
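
A small Python illustration of the kind of drift being described: binary
floats accumulate representation error that an exact decimal type doesn't, so
the same sum can disagree between an app and a DB (or between two apps):

```python
from decimal import Decimal

# Ten cents, three times: binary float vs. exact decimal arithmetic.
float_total = sum([0.10] * 3)
exact_total = sum([Decimal("0.10")] * 3)

print(float_total)   # 0.30000000000000004
print(exact_total)   # 0.30
print(float_total == 0.3)              # False
print(exact_total == Decimal("0.30"))  # True
```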

~~~
chris_wot
Except if a database upgrade fixes a floating point issue. Or the engine may
not be as capable at doing floating point calculations!

~~~
protomyth
I specifically said non-floating point. I have not used any database that
would screw up a decimal type. It would lead to mass migration away in certain
very well paying industries.

~~~
chris_wot
Sorry, missed that.

------
jka
If you find yourself returning _very large numbers_ of rows to the client, you
might be doing something wrong - SQL databases are at their most efficient
when you select only the columns and rows you need.

If you perform aggregation/calculations in the DB, you can potentially save
on-the-wire data transfer time (and potentially CPU time on your clients...
though obviously that is shifting the CPU work to the database).

Similarly if you find yourself making multiple trips to the database, and then
using loops to combine different data sets, you're probably too far on the
'client-side' and should look at using some joins and combination logic on the
DB side to get what you need in a single (and likely more efficient) round-
trip.
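
For instance, a per-user loop of follow-up queries can usually collapse into
one joined, aggregated round-trip (a sketch with sqlite3; the schema and data
are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, name TEXT);
CREATE TABLE orders (user_id INTEGER, total_cents INTEGER);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO orders VALUES (1, 500), (1, 700), (2, 300);
""")

# One round-trip: the DB joins and aggregates, and only the combined
# rows come back -- no per-user loop of queries in the client.
rows = conn.execute("""
    SELECT u.name, SUM(o.total_cents)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.id
""").fetchall()
print(rows)  # [('ann', 1200), ('bob', 300)]
```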

------
georgemcbay
The top answer on the page says as much but the answer to pretty much any
"Which is a faster way to do this, ABC or XYZ?" is "try it both ways, measure
the results (adjust for load if necessary), and see". Doesn't matter if you're
talking different pure code algorithms to achieve the same result, stored
procs vs app code, network caching or not, etc. If doing it the absolute
fastest way matters to you, try multiple approaches and measure the results.
Even in situations where there are accepted "best practices" there may be
variables specific to your own project/tech stack that cause _your_ results to
be surprising!

------
soora
In practice, the answer is not always obvious.

Benchmarking multiple queries / approaches is generally worthwhile if
performance is important.

~~~
kabdib
... annnnd then the database technology changes, and your benchmark results of
three years ago are now actively hosing you. Or you have to change databases
but you can't because you embedded a lot of logic in your database layer that
can't be easily ported (somewhere, a salesman for your current database
product is crying tears of joy and shouting his cash-register shout, "Ka-
ching! Customer lock-in! Ka-CHING baby!". Avoid these people).

Unless you have really compelling reasons to get snuggly with a particular
vendor's technology, be conservative.

[This still applies if you're using a "free" database engine; you're just not
paying MS or Oracle or whomever, and it's "just" your own time]

~~~
ibejoeb
>Unless you have really compelling reasons to get snuggly with a particular
vendor's technology

I'll never understand this mentality. You bought it, so why not use it? Even
OSS databases have some _very_ compelling features. Take advantage. Use the
hell out of your tools.

This is tantamount to saying "I'm using Go, but I can't use goroutines,
because some day I might want to use Python instead." Don't want to get too
attached to the technology, right?

The only time I ever tried to go abstract is when I sold on-premises software
that had to support multiple enterprise-size clients' vendor choices. You know
what happened? It took forever, everyone was unhappy, and ultimately it turned
into 2 vendor-specific versions and dropping the least popular 3rd. Life was
better after that.

------
borlak
The top answer in that thread is excellent.

Basically, "it depends". Having dealt with extremely DB-intensive
applications, I have developed a personal motto of "be nice to the DB".

Let the DB be a secure storage of your data, not a calculating part of your
application. But like the top answer says, sometimes it is not practical to do
a calculation within the application. In my case, we had a few database
servers set aside just for reporting, so we could slam them with difficult
queries and not worry about affecting data.

------
johnwatson11218
I have seen sql queries that were 30 pages long if printed out. The developer
showing me this was looking for some other examples that he said were up to
100 pages long. These queries had multiple levels of correlated subqueries and
tons of decode statements ( oracle syntax). There looked to be tons of
duplication just in what I saw.

How are you going to write unit tests for that stuff? Refactor? etc. etc.

The examples always start out simple, like summing a bunch of rows that match
a predicate, but once the logic becomes too complex it is hard to rewrite it
as application code.

Also most databases are 20+ year old technologies and often have weird systems
in place for storing the code in the db or something else just as odd. No more
grep, no more static code analysis.

As far as I am concerned the db is a pile of facts or observations. I tell the
db something and later it tells me what I told it. When I am thinking about
what goes in the db I think about using the past perfect verb tense. On this
day such and such happened. That's it. Preferably that never changes; you
might get new info in the future, so just record that new info along with
everything else.

Ideally we should be getting to a point where resources are so cheap that
CRUD can become CR -- no more updates or deletes, just new facts.

~~~
ef4
The existence of terribly written queries proves nothing. Because there's also
plenty of terribly written code that takes 100 pages to do something simple.

In both cases it comes down to people blindly grasping when they don't
understand the fundamentals.

Relational logic can be extremely elegant, composable, and testable.
Unfortunately SQL is a pretty awful interface to expose those ideas, and most
attempts to wrap SQL in a better interface make the mistake of trying to
pretend to be object-oriented, when they should really let their true
relational nature shine through.

------
danielharan
Crucially, "performance" isn't defined.

The one that matters to me the most is developer time.

Write a damn SQL query. If it's too slow or the DB becomes a bottleneck, then
reconsider.

~~~
mistermann
As usual "it depends", but yes, this is imho the _most important_ aspect
(provided _reasonable_ performance and sustainability are carefully taken into
consideration).

The older I get, the more tired I get of developers writing in-house apps
that are _never_ going to have more than 10 concurrent users, but who
architect as if they are going to have 1000+ concurrent users, regardless of
the additional cost or complexity... which is how relatively simple projects
end up costing $100k+++ and becoming maintenance nightmares, and why simple
change requests are often rejected because they would be "too complex".

------
mzarate06
I default to relying on SQL. I've found that it helps ensure calculations
have a single source of truth, which aids maintenance and reliable sharing
throughout your app, or across multiple apps.

SQL also offers powerful aggregate functions to assist. Much simpler to use
something like AVG() or SUM() in a SQL query than having to worry about
deriving the same calcs in application code.
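
A quick sketch of the difference (sqlite3, invented data): the aggregate keeps
both the data transfer and the arithmetic on the DB side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (value INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(80,), (90,), (100,)])

# One aggregate row from the DB...
db_avg = conn.execute("SELECT AVG(value) FROM scores").fetchone()[0]

# ...versus pulling every row over the wire and averaging in the app.
values = [r[0] for r in conn.execute("SELECT value FROM scores")]
app_avg = sum(values) / len(values)

print(db_avg, app_avg)  # 90.0 90.0
```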

------
mmaunder
It comes down to performance, and sometimes to what your hardware constraints
are, e.g. your DB is a faster machine than your web server and you have no say
in the matter.

You may also benefit from precalculating stuff in the DB and storing it. I
wrote this 6 years ago which illustrates the point:

[http://markmaunder.com/2007/07/20/how-to-create-a-zip-
code-d...](http://markmaunder.com/2007/07/20/how-to-create-a-zip-code-
distance-lookup-table/)

------
InclinedPlane
Performance-wise? Maybe writing hand-tuned assembly code will be faster. Or
using CUDA. Don't optimize performance where you don't have to. Are small
calculations the long pole in your app that talks to a db? Almost certainly
not. Pick a method that maximizes robustness, ease of understanding,
servicing, openness to feature changes, and accountability. In my experience
that typically means client side code, not queries or sprocs.

------
benjaminwootton
I would always lean on the database to perform simple sums and aggregations
unless I was really concerned about turning the database into a performance
bottleneck.

If you need to add simple logic above and beyond this, stored procedures
aren't sexy but they can be a good compromise that avoids shipping data around
and re-implementing SQL in your application server.

------
adamconroy
It is interesting that nobody has mentioned testability, as in unit tests.
Some people (such as Thoughtworks) advocate removing all business logic from
the db server into app code so that it can be unit tested.

I buy that argument to some degree but in practice I am SQL junkie and always
implement calculations in SQL.

~~~
chris_wot
There are tools that allow for unit testing for SQL. One that pops to mind is
RedGate's tools. They wrote a white paper on it here:

[http://download.red-
gate.com/HelpPDF/DatabaseUnitTestingWith...](http://download.red-
gate.com/HelpPDF/DatabaseUnitTestingWithRedGateSQLTools.pdf)

P.S. I'm not affiliated with them in any way, I just love their products.

------
mathattack
Isn't there a size past which it's always good to do it on the DB? Or, to use
buzzwords, "At some point, shouldn't most Big Data calculations be done on the
database?"

------
_random_
The most important factor is whether the DBAs in your company are cooperative
and how easy it is to do a SQL release to production.

