
Application vs Database Programming - einhverfr
http://ledgersmbdev.blogspot.com/2012/02/application-vs-database-programming.html
======
protomyth
SQL databases are quite a lot like front-wheel-drive cars on icy roads: you
don't get a lot of warning before everything goes bad. The biggest problem is
a query that finally exceeds what can be cached. You can go from less than a
second to multiple minutes per query. It is much like the problem of doing
high-performance calculations, where you want to use the CPU cache very
efficiently and not have a cache miss.

Worse is when some optimizers suddenly switch plans to something that is
inefficient. Sybase had a really bad habit of this with big datasets.

Sadly, a lot of the problem comes down to schema design. A lot of designs are
just fine retrieving one item and its attributes from the database, but get
into trouble when you try to retrieve a group of items with the same attribute
from a database. Status reports (e.g. how many unpaid invoices do we have?)
tend to sniff these problems out. I generally follow the rule "if a single
attribute on an item is important to know, then knowing all of the items with
that attribute is going to come up and should be designed for".
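As a minimal sketch of designing for the set query up front, using Python's sqlite3 as a stand-in (the table, column, and index names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoices (
        id      INTEGER PRIMARY KEY,
        status  TEXT NOT NULL,   -- 'paid' or 'unpaid'
        amount  REAL NOT NULL
    )
""")
# If a single invoice's status matters, a report over all invoices with
# that status will come up too, so index the attribute from day one.
conn.execute("CREATE INDEX idx_invoices_status ON invoices (status)")

conn.executemany(
    "INSERT INTO invoices (status, amount) VALUES (?, ?)",
    [("paid", 100.0), ("unpaid", 250.0), ("unpaid", 75.0)],
)

# The status report: the index covers the predicate.
unpaid = conn.execute(
    "SELECT COUNT(*) FROM invoices WHERE status = 'unpaid'"
).fetchone()[0]
print(unpaid)  # 2
```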

------
zzzeek
Asking as someone who works mostly with relational DBs, where do nonrelational
databases fit along this spectrum? If you're dealing with a large dataset in
Cassandra or MongoDB, do you think in terms of instructions or sets, or
somewhere in between (or something else entirely)? My understanding is that
if you need to extract a dataset according to complex criteria you'd be using
something like map/reduce... or perhaps you'd be organizing the data on the
persistence side to produce that query naturally.

~~~
bkirwi
That's exactly correct.

If you're working with a pure key-value store, the only query you have is
get(), plus maybe a scan over all keys. Stores like Cassandra give a little
more structure to the data (column families, supercolumns, etc.) and you can
use that to make certain types of queries simpler, but there's still nothing
like the algebra that comes with relational DBs. Anything other than the most
primitive logic is going to have to be written in client-side code, which is
why you see things like MapReduce used to filter down large data sets or other
complex client-side transformations.

The advantage of writing on the client is that you get to use whatever the
hell programming paradigm you like: pretty much any language that can deal
with a hashmap can deal with NoSQL. (That's true of relational DBs, of course,
since you don't _have_ to write complex queries; but if you're not taking
advantage of the relational structure, NoSQL tends to be easier / faster.)
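To make the point concrete, here's a sketch with a plain dict standing in for a key-value store (the keys and fields are invented for the example):

```python
# A dict standing in for a key-value store: the only operations
# available are get(key) and a scan over all keys.
store = {
    "user:1": {"name": "ada", "active": True},
    "user:2": {"name": "bob", "active": False},
    "user:3": {"name": "cyd", "active": True},
}

# There is no WHERE clause and no relational algebra, so any filtering
# is client-side code: scan every key, apply the predicate yourself.
# This is the same shape as the "map" half of a MapReduce filter.
active_names = sorted(
    v["name"] for v in store.values() if v["active"]
)
print(active_names)  # ['ada', 'cyd']
```

Any language that can loop over a hashmap can do this, which is exactly the flexibility (and the burden) described above.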

~~~
zzzeek
Right, this is actually what I was getting at, without even being conscious of
it when I asked that. Part of the appeal of NoSQL is that programmers _don't_
need to learn a different paradigm of thinking other than the usual imperative
model.

I'm not sure if I like that rationale for using NoSQL. Strikes me as being
lazy up front. Or maybe you don't really need ad-hoc queries.

~~~
SkyMarshal
As a fan of Codd/Date/Pascal, relational algebra, and ACID, I used to feel the
same about NoSQL, but have since come to appreciate it as an option.
Schemaless is nicer for exploratory programming with frequent data model
changes, for one example.

Here's a good article on NoSQL data modeling that covers it better, along with
some good discussion in the comments:

[http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/](http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/)

------
dhconnelly
Funny, I had this conversation with one of my professors the other day. I've
never written straight SQL--only used ORM. He showed us a giant SQL query and
my first question was "do people really write it this way, instead of building
up from intermediate results?" It doesn't look maintainable. He said yes, and
that the SQL programmers are proud of writing huge queries. The query engine
will optimize the hell out of it, so it's the right thing to do.

~~~
dmethvin
SQL is just like any other language. If you're new to it then sure it's going
to look strange. But a large well-designed query can be _many_ times faster
than "building up from intermediate results". Your procedural code is off in
another process from the SQL server, maybe even in another server across the
country. Pulling the intermediate results back and forth between them is
expensive.

ORM is great when a database is being used as a simple program data
persistence layer, but if you limit yourself to that then you might as well
use a NoSQL solution.
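The cost difference is easy to demonstrate in miniature with sqlite3 (an in-process stand-in here, so the round-trip cost is invisible; the table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         total REAL);
    INSERT INTO customers VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# One declarative query: join and aggregation stay inside the engine.
one_query = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

# "Intermediate results": drag both tables to the client and join by
# hand. Same answer, but every row crosses the client/server boundary.
customers = dict(conn.execute("SELECT id, name FROM customers"))
totals = {}
for cid, total in conn.execute("SELECT customer_id, total FROM orders"):
    totals[customers[cid]] = totals.get(customers[cid], 0) + total
by_hand = sorted(totals.items())

print(one_query == by_hand)  # True
```

With the database across a network, the second version pays the transfer cost on every row instead of on the final result.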

~~~
mattmanser
The more I've worked with SQL, the more I'm starting to question this idea
that huge elegant queries are any good. I used to almost be fanatical about
only returning exactly the information I needed from the DB.

They are such a nightmare to read, and there's no exposure to them in the code
that your IDE can make sense of. If there's a performance problem, it's almost
always down to one of those huge queries. And in most software I work with,
the DB engine is either on the same box or on the same network, so pushing
more rows across to process in the app in a far more generic way doesn't cause
the sort of latency problems you describe (a server across the country seems a
pretty rare config, imo).

So I disagree.

Also, I disagree with your admittedly fairly throwaway comment about ORM &
NoSQL. Even if the database is just a data persistence layer, bunging
everything in a NoSQL solution is usually going to be the wrong call in the
long run. If you're dealing with data, it almost always pays off to keep it in
a relational structure, because if the app/business is a success, someone will
invariably start wanting to use the data in other ways. I think NoSQL is good
for a very small subset of data storage problems; I want to use it, but have
yet to have a reason to.

~~~
ldh
You make good points, but I'm not entirely sure that the possibility of
someday needing to do analytics on the data precludes one from using NoSQL
now. I've heard/read about a good deal of large scale sites running on NoSQL
alongside a relational database setup for reporting. Seems like it's not that
big a deal to move some of that data into a SQL store if/when it becomes
necessary? If NoSQL gets you the simplicity and performance characteristics
you want, maybe it's a premature optimization to limit your options based on
hypothetical future scenarios. I could see it from either perspective.

~~~
mattmanser
I'm challenging the idea that if you're just using an ORM for persisting
objects you're probably doing it wrong.

For the vast set of programming problems, having that user table relationally
linked to a person table, which is relationally linked to a company table, and
so on, gives you far more flexibility than having to pull objects out of a
key-value store and manually piece together complex relationships just to add
a simple page to the client's admin area saying how many people have set up
their accounts properly (a random real-world example, where the admin emails
their staff and they set up their own user details).

Once you get past the basic CRUD of a problem and start having to use the data
for functionality, basically past anything more than a toy program, the
relational model through an ORM is going to be much simpler than a NoSQL
solution.

I just don't get how this statement makes sense:

 _ORM is great when a database is being used as a simple program data
persistence layer, but if you limit yourself to that then you might as well
use a NoSQL solution_

I wouldn't, because as soon as you start adding fairly simple functionality,
the NoSQL solution would rapidly become a nightmare. To me, NoSQL would be the
go-to solution if I had a few very basic data structures that I had to do a
lot of work on, not, as dmethvin suggests, lots of different objects with
simple operations.

In concrete examples: Google, NoSQL; Instagram, probably NoSQL; something like
Freckle (Time Tracking), SQL; Salesforce, SQL all the way.

But as I said I've not had a project or personal project yet where I've gone
'Aha, now is the time! NoSQL here I come!' so push back if I'm doing it wrong!

~~~
dmethvin
It sounds like we are agreeing then. If you are using the data for ad-hoc
reporting via a SQL query tool for example, you're not just using your
database as a simple program data persistence layer.

If you're not careful though, your program-oriented database won't be very
useful outside the realm of the program. As an example, I worked on a database
where they persisted a good chunk of data in each record as a JSON string.
They did it because they didn't want the hassle of updating the database
schema. It worked fine for their needs inside the program, but it was kind of
hard to do an ad-hoc query on that.
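The JSON-blob-in-a-column problem can be sketched in a few lines (sqlite3 as a stand-in; the record shape is invented for the example):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, blob TEXT)")
conn.executemany(
    "INSERT INTO records (blob) VALUES (?)",
    [(json.dumps({"status": "open", "priority": 2}),),
     (json.dumps({"status": "closed", "priority": 1}),)],
)

# The engine sees an opaque string, so a plain WHERE clause can't reach
# inside it; the "ad-hoc query" degenerates into a full scan plus
# client-side parsing.
open_count = sum(
    1 for (blob,) in conn.execute("SELECT blob FROM records")
    if json.loads(blob)["status"] == "open"
)
print(open_count)  # 1
```

(Some engines do offer JSON operators these days, e.g. SQLite's json1 extension, but a generic SQL reporting tool still can't treat those blobs as columns.)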

------
dmethvin
It's really hard to say what the performance issue was here, but my experience
is that SQL servers do a really good job of handling data sets if you use them
correctly. If the problem was a "cache miss", it might mean that one or more
tables need proper indexes, for example. Or, if the code involves a cursor and
a code loop, that is a very bad sign.

~~~
einhverfr
The problem was that I was doing it as follows:

insert/insert/update. Wash, rinse, repeat.

The better way to do it (which solved the problem) was:

Insert a bunch of rows, insert a bunch of rows, update a bunch of rows, no
repeating.
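A sketch of the two shapes with sqlite3 (the table is made up; the point is one statement per set, not one statement per row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER)")

rows = [(i, 1) for i in range(1000)]

# Row-at-a-time: insert/insert/update, wash rinse repeat.
# (One round trip per row; fine for a handful, painful at scale.)
# for item_id, qty in rows:
#     conn.execute("INSERT INTO items VALUES (?, ?)", (item_id, qty))
#     conn.execute("UPDATE items SET qty = qty + 1 WHERE id = ?",
#                  (item_id,))

# Set-at-a-time: one bulk insert, then one bulk update, no repeating.
conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
conn.execute("UPDATE items SET qty = qty + 1")

total = conn.execute("SELECT SUM(qty) FROM items").fetchone()[0]
print(total)  # 2000
```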

------
thurn
I don't agree at all that this phenomenon is restricted to database
programming. Thinking in terms of higher-level operations is always more
readable and often more efficient (the logical conclusion being things like
Haskell and Prolog, where there aren't really operations at all).

For example, to find the common elements in two lists, you could use a for
loop, or you could follow the OP's advice and use set operations to do
something like:

    set(list1).intersection(set(list2))
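A quick side-by-side of the two approaches, with made-up sample lists:

```python
list1 = [1, 2, 3, 4]
list2 = [3, 4, 5]

# Imperative version: walk one list and probe the other by hand.
common_loop = []
for x in list1:
    if x in list2 and x not in common_loop:
        common_loop.append(x)

# Set version: state what you want, not how to walk the lists.
common_sets = set(list1).intersection(set(list2))

print(sorted(common_loop) == sorted(common_sets))  # True
```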

~~~
calibraxis
Yes, I often use clojure.set's relational operators. (select, project, join,
etc.) Because it's common to deal with sequences of tuple-ish data.

It's natural, once you lose this limiting notion of "objects" which aren't
represented in a generic way.

------
tom_b
Always test your data operations on data volumes (and user load) that mirror
what you expect to see in production.

This is important across the entirety of your dev stack.

~~~
InclinedPlane
One of the biggest problems in database operations is that there are many
potential query plans for a given query, but finding the optimal one can be
difficult. More so, which query plan is optimal varies depending on table
size.

If you let the DB pick its own query plan all the time you can end up with
inconsistent performance, often by an order of magnitude or more. However, if
you turn on query plan stability (every query after the first one will have
the same plan) then you can end up with a state where the query plan doesn't
change as the data change, which can lead to serious performance issues as a
table's size grows (if your table starts out small, the DB may choose a full
table scan, which is very fast at that size but bogs down badly when the table
grows to thousands or millions of rows).
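The scan-versus-index flip can be watched directly with SQLite's EXPLAIN QUERY PLAN (a sketch using sqlite3 as a stand-in; the exact plan text varies by engine and version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO t (status) VALUES (?)",
                 [("open",), ("closed",)] * 50)

def plan(sql):
    # The last column of EXPLAIN QUERY PLAN output describes the strategy.
    return " ".join(row[-1]
                    for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT COUNT(*) FROM t WHERE status = 'open'"

plan_before = plan(query)   # e.g. "SCAN t" -- full table scan
conn.execute("CREATE INDEX idx_t_status ON t (status)")
plan_after = plan(query)    # e.g. "SEARCH t USING COVERING INDEX ..."
print(plan_before)
print(plan_after)
```

The same query gets a different plan once the optimizer has a better option, which is exactly why a plan frozen on a small table can be a disaster on a big one.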

In short, performance is a hard problem, and if anything the tools for
analyzing and optimizing performance for DB operations are well behind the
state of the art for compiled code, for example.

~~~
gaius
Oracle is pretty smart about this: if a new plan is slower than the previous
one, it seamlessly flips back.

~~~
politician
JIT compilers have been doing hotspot recompilation for a while now. Is there
a reason why databases haven't adopted the same techniques? Is there some
barrier to caching multiple query plans optimized for different load
scenarios?

~~~
gaius
What do you mean by "load scenario"? The only thing that should affect the
plan is the characteristics of the underlying data (e.g. do we expect to
return so many rows that we should full-scan anyway, or can we use an index).

------
henrikschroder
Developer discovers there's a difference between the declarative paradigm and
the imperative paradigm, writes blog post, news at 11.

Snark aside, you can take the author's advice too far in the other direction
as well. Relational algebra is all fine on paper, but relational integrity,
foreign keys, strict normalization and n-way joins are horrible for
performance. At the end of the day, the declarative SQL you write has to
translate to a bunch of bulk reads from a hard drive somewhere, and you really
need to understand execution plans as well if you want any sort of
performance.

~~~
cbsmith
THIS. I was going to write the same thing. It's really about
declarative/functional programming vs. imperative/procedural programming.

He's right, though, that as you work with more and more data, the impact of
making the traditional imperative/procedural mistakes becomes a real problem.

