
Relational databases performed a task that didn't need doing (1991) - blasdel
http://home.pipeline.com/~hbaker1/letters/CACM-RelationalDatabases.html
======
davidw
> Biological systems follow the rule "ontogeny recapitulates phylogeny", which
> states that every higher-level organism goes through a developmental history
> which mirrors the evolutionary development of the species itself.

I learned that that was incorrect in high school.

<http://en.wikipedia.org/wiki/Recapitulation_theory>

~~~
euccastro
True, but the phrase you quote has a merely rhetorical use anyway. The OP wasn't
basing any of his argumentation on that broken thesis.

~~~
davidw
No, but it still takes something away from the whole when you know it's flat
out wrong. It's like starting out "Since the earth is flat, we believe files
should be too".

~~~
euccastro
That's a bad illustration, since the false fact is used in the argumentation.

I find bad spelling a better analogy. If the OP's spelling and grammar were
insufferable, it would have "taken something away from the whole" and still
not be worth commenting on.

That said, no big deal. I only felt compelled to reply to your comment because
I found it strange that it was the highest ranked.

------
jbarciauskas
Every organization I've ever worked at has huge data quality issues. Also, a
company's data is one of its most valuable assets, providing financial
reporting, forecasting, market analysis, and transactional functions.

The question I pose, then, is this: is the data available within corporations
today of a higher consistency and quality than it would be if all organizations
used only key-value stores, document stores, or CSV files?

I'm going to venture to say despite the generally abysmal quality of data in
large organizations today, it'd be much worse without the tools made available
through RDBMS's.

~~~
protomyth
I'm not so sure. It seems that the shoehorning that is done to get data into
the tables causes some problems. I am not sure if it is inherent in the model
or just a function of the standard two team arrangement (dev & DBA). The rigid
implementation requirements and mismatch with the languages developers use
seem to be a problem.

~~~
mattmcknight
The two team arrangement has caused me no end of problems over the past 10
years of my life, beyond shoehorning. The separation of the database team from
the development team introduces communication problems, priority mismatches,
and fixes done on one side or the other that should have been made on the
other side. If you are reading this and you have separate database and
development teams, merge them now, or at the least, put control of the schema
in the hands of the development team.

------
cpr
People probably don't realize who Henry Baker is.

<http://en.wikipedia.org/wiki/Henry_Baker_(computer_scientist)>

Yes, he's trolling here a bit, but he's brilliant (if quirky).

(Disclaimer: I knew him when I worked at MIT lo these many decades ago.)

------
bradgessler
I think this would be more appropriately titled, "I think RDBMS has set the
industry back by 10 years".

Some of the more interesting tidbits:

    
    
       I can categorically state that relational databases 
       set the commercial data processing industry back at 
       least ten years and wasted many of the billions of 
       dollars
    
       --
    
       Virtually all commercial applications in the 1960's
       were based on files of fixed-length records of 
       multiple fields, which were selected and merged. 
       Codd's relational theory dressed up these concepts 
       with the trappings of mathematics (wow, we lowly 
       Cobol programmers are now mathematicians!) by 
       calling files relations, records rows, fields 
       domains, and merges joins. 
    

I would love to hear HN's thoughts on this.

~~~
gchpaco
SQL being notoriously poor at hierarchical data has been known for quite a
while; he's obviously very correct there. Oracle goes so far as to offer a
non-standard CONNECT BY clause for tree traversal. And there's not a whole lot
of sophistication to relational theory on the ground level. Optimizing it is
hard, but that's always been the case.

To a large degree we have designed things so that the data we try to store is
as relational as possible, never mind the domain implications--then, the
mismatch is covered up. A shopping cart is a perfect stupid problem easily
solved by relational databases, until you start introducing the real world;
special discounts, group packages, "buy these items and pay less $$ than you
would getting them individually" is usually done through gross hackery, etc.

~~~
Retric
A database's job is to handle _Data_ and a relational database can do that just
fine. When you want to handle really complex relationships you need to write
code because the rules are not abstract. You can create a horribly complex
view / stored procedure and pretend it's the database's job, or you accept that
some rules are stored on the database but not implemented using the database.
Edit: None of the "database alternatives" help you solve the checkout problem.

PS: It's like complaining that HS level Calculus does not let you solve a
complex differential equation. Useful abstractions always have their limits
because they are abstractions.

~~~
gchpaco
Data is not simple rows. Data is complex, human generated stuff that has both
implicit and explicit structure and it does not collapse down well to any
single abstraction, not even object graphs. Fine, so we have to pick some
abstraction. That doesn't mean the abstraction we pick is necessarily
appropriate. Memory is an abstraction; we could just dump all of memory out to
a file and call that a database (and folks used to, and still do sometimes).
It has issues for persistent databases, like brittleness and pointer
swizzling, so we adopt a more high level one.

The relational view of data is a higher level abstraction _but that doesn't
mean it's an especially useful one_. Hierarchy is something that comes up
constantly in the real world and that is something very awkward to represent
in a relational database. You can try to use Codd's adjacency list, but
reassembling the links is costly and drilling down a hierarchy requires a
database query at every level. If you don't believe me, here's a use case: I
want everything there is about a sub-part of a Bill of Materials with one
query and without having to grope through the entire index. There's a reason
why Oracle implemented CONNECT BY, which is abjectly non-relational.
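
For readers who want to see the Bill of Materials use case concretely: one standard alternative to CONNECT BY is the recursive common table expression from SQL:1999, which SQLite accepts as `WITH RECURSIVE`. What follows is only a minimal sketch under stated assumptions. The `bom` table, its `part` and `parent` columns, the sample parts, and the `subtree` helper are all hypothetical, and the engine still walks the adjacency list level by level internally; the difference is that the application issues a single statement.

```python
import sqlite3

# Hypothetical adjacency-list schema: each part names its parent part.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bom (part TEXT PRIMARY KEY, parent TEXT)")
# Index on parent so each level is found by lookup, not a full scan.
conn.execute("CREATE INDEX bom_parent ON bom (parent)")
conn.executemany(
    "INSERT INTO bom (part, parent) VALUES (?, ?)",
    [
        ("bike", None),
        ("wheel", "bike"),
        ("frame", "bike"),
        ("spoke", "wheel"),
        ("rim", "wheel"),
        ("nipple", "spoke"),
    ],
)


def subtree(conn, root):
    """Return (part, depth) for root and all of its descendants in one query."""
    return conn.execute(
        """
        WITH RECURSIVE sub(part, depth) AS (
            SELECT part, 0 FROM bom WHERE part = ?
            UNION ALL
            SELECT bom.part, sub.depth + 1
            FROM bom JOIN sub ON bom.parent = sub.part
        )
        SELECT part, depth FROM sub ORDER BY depth, part
        """,
        (root,),
    ).fetchall()


wheel_parts = subtree(conn, "wheel")
print(wheel_parts)
```

Calling `subtree(conn, "wheel")` returns the wheel and everything beneath it without touching the `frame` branch. Whether this counts as "relational" is exactly the point under dispute, since transitive closure lies outside the classical relational algebra.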

Also the sheer amount of research and work that goes into making even a
simplistic relational database responsive is enormous; when Baker wrote this
it was not uncommon to lose two orders of magnitude throughput by going to a
relational database from a custom written one, and if you were very lucky be
able to claw back another order of magnitude by indexes. People today are
complaining greatly about the inefficiencies of using row-structured data when
you're usually interested in the columns. Even today the way to get maximum
throughput from a database is to denormalize and in doing so virtually set the
schema in stone.

A 1960s era hierarchical database could traverse a hierarchy as fast as the
computer could request data from memory. The models were brittle and didn't
always represent the real world well, but you could run an entire country's
airlines in the amount of computing power your watch has now. The banks today
rely on IBM's IMS, which is hierarchical.

~~~
Retric
As I said "Useful abstractions always have their limits" you can represent any
tree using the relational model. The basic problem is recursive data
structures need a higher level of abstraction. Consider the following
situation:

    
    
      Bob is managed by Ted
      Ted is managed by Bob
    

If that situation is allowed then all recursive queries need to deal with it.
On the other hand if it is NOT allowed then every commit needs to deal with
it. However, while the relational model does not understand those issues just
like it does not understand HTML it still stores the data just fine. Generally
it's a fairly moot point because it's a program talking to the database and
not a person, so that program can provide that level of abstraction just fine,
granted it's less efficient than a model that understood the abstraction.
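
The "every commit needs to deal with it" option can be sketched in application code. This is a minimal illustration, assuming the management links are available as a plain mapping from employee to manager; the `managed_by` dict and the `would_create_cycle` helper are hypothetical names, not part of any database API.

```python
def would_create_cycle(managed_by, employee, new_manager):
    """Return True if setting employee's manager to new_manager closes a loop.

    managed_by maps each employee to their current manager (or is missing
    the key for someone at the top). The walk follows the chain upward
    from new_manager; if it reaches employee, the new link would be
    circular. The seen set also guarantees termination if the existing
    data already contains a loop.
    """
    seen = {employee}
    current = new_manager
    while current is not None:
        if current in seen:
            return True
        seen.add(current)
        current = managed_by.get(current)
    return False


managed_by = {"Bob": "Ted"}  # Bob is managed by Ted
print(would_create_cycle(managed_by, "Ted", "Bob"))    # Ted managed by Bob?
print(would_create_cycle(managed_by, "Ted", "Alice"))  # Ted managed by Alice?
```

The first call reports a cycle and the second does not. The cost is one upward walk per write, which is the trade described above: either writes pay for the check, or every recursive read has to guard against loops.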

As to speed, Relational Databases are the x86 chips of the Database world.
There are plenty of custom systems that are faster at specific things, but
relational databases trade that speed for adaptability and backwards
compatibility. For example: A core 2 chip's instruction decoder and an SQL
database's query optimizer do basically the same thing. With a custom solution you can just
skip that step and directly say what to do, but the abstraction is normally
worth it so you can upgrade the hardware for years without changing the
program.

So the relational model makes the basic assumption that speed is not the
primary concern or you would be writing at a more basic level, yet people want
more speed so there are plenty of real world hacks to help things along like
CONNECT BY and _stored procedures_, and all those new x86 instructions. So,
99.9% of the time in the real world an off-the-shelf database running on an x86
chip is plenty fast if used correctly and it's also fairly cheap compared to
the total costs of the other solutions.

------
sfwc
Seemingly the same Henry Baker who wrote all those papers on garbage
collection.

<http://home.pipeline.com/~hbaker1/>

<http://www.bakercapital.com/team_baker_g.html>

Also, <http://c2.com/cgi/wiki?ResponseToBakersAntiRelationalPaper>

(all via <http://c2.com/cgi/wiki?HenryBaker>)

~~~
mahmud
Henry Baker = Hans Boehm + Zed Shaw.

~~~
ThinkWriteMute
Oh SNAP

------
amalcon
People act like non-relational data storage is some kind of new idea, but it's
not. Even if you discount flat files, there's still BerkeleyDB (early 90's) and
the things it was meant to improve on (earlier). Why, then, are relational
databases popular?

I find it hard to believe the author's claim that it's because people wanted
to claim they were doing fancy math with their data storage, over all other
considerations. If this were the case, Lisp would have been the programming
_lingua franca_ through the 70s and 80s, not C. Lisp is obviously closer to
its mathematical roots.

Maybe it's because relational databases got the nice consistency features
first; Interbase certainly predates BerkeleyDB. I'm more inclined to think
it's a combination of two factors: First, the database can do much of the
heavy lifting for you. This is nice to application programmers who are stuck
with the likes of C, FORTRAN, COBOL, and Pascal; it's less interesting with
functional abstractions available.

Second, you get the SQL prompt. It's a dangerous tool, and should almost never
be used for updates, but it's _really nice_ to be able to answer unanticipated
questions by simply writing up a query. Think of what the Django admin buys
you; this is like a miniature version of that.

~~~
blasdel
His point is not that "people wanted to claim they were doing fancy math with
their data storage, over all other considerations" -- but that they wanted to
reify their extant degenerate patterns as _theory_.

~~~
amalcon
You're right, that does make more sense. It would still seem that Lisp would
have been more prominent if the ability to claim good _theory_ (regardless of
what you're actually doing) was a primary consideration. The average rank-and-
file programmer (or manager) today just doesn't care about theory.

~~~
rbranson
Agreed. Most development organizations, especially those in large enterprises,
the primary users of RDBMS technology when this article was written, are more
concerned with coming up with the correct solution at the fastest clip. As far
as they're concerned, theory be damned. Relations provide a simple, easy-to-
understand data modeling concept and the query language is relatively quick to
pick up for most developers. RDBMS are like the Java of data containers.
You'll never get fired for going with an RDBMS.

------
chubbard
I think this is a very poignant essay given the place in history we are at. In
1991 this essay looked like sour grapes over forgotten history, but I'm more
interested in what lessons we can learn from previous systems. It's interesting
to hear about a time before SQL and relational systems, and how early
developers dealt with persistence problems. The historical account on the
controversy of relational technology rings true today given we are struggling
with it as well.

I heard someone say that when the industry was moving away from big iron to
mini-computers and PCs the programs they wrote were considered worthless
because they couldn't reuse them on the next platform. It was the data that
was the jewel. Relational technology embraced separating data from the program
to enable reusing data within another program or different hardware. Prior to
that data and program were tied together. They were fixed and non-portable.
Relational technology came to occupy separation of data and program to
facilitate data portability.

What's changed is the way in which we write programs. We aren't tied to
machine architecture like we once were. We use Python, Ruby or Java, which are
portable among many platforms. What was once trash is now treasure because we
aren't constrained to one architecture.

Does that mean let's go back to flat files? No. I think it's a combination of
the two. Features like data separation are important. Network access to
data is important. Query languages are important. Hierarchy has always been
important, but we've probably downplayed how important. Scalability and speed
are ever more important.

What's not as important anymore is relational. Those other features can be
separated.

------
xenoterracide
"With the recent arrival of object-oriented databases, the industry may
finally achieve some of the promises which were made 20 years ago about the
capabilities of computers to automate and improve organizations."

um... wow... this 1 sentence discredits the entire email. Object Oriented
Databases were a HUGE failure. I can't be sure but PostgreSQL might be the
only surviving one, and it survived because it went relational. just my 2
cents.

~~~
jbooth
No it doesn't.. ok, early 90s, people thought object oriented everything was
the way forward.

But his main point, that blasting through a flat file without a ton of row-
locking and transactional overhead could satisfy a number of needs better than
the actual DBMS, is still true.

If you need to be transactional, on the other hand, that's a different story.

------
ars
This:

> Biological systems follow the rule "ontogeny recapitulates phylogeny", which
> states that every higher-level organism goes through a developmental history
> which mirrors the evolutionary development of the species itself.

Is not true. And it had been known to not be true since 1922, and possibly
even 1890.

~~~
randallsquared
It was brought up as possibly-true in my 1980s high school, so I guess it was
probably something he learned in high school or college in the 1950s or 1960s
and never unlearned. People forget that so much more of what we thought we
knew before google, wikipedia, and snopes was wrong. You have to work harder
to be wrong on stuff like that, nowadays, since it's so easy to look up. In
1991, you had to go to or have a decent library to find out that that was
well-known to be wrong, so a non-specialist might never question what they
learned in long-ago school. Now, it's so easy to look up minor details as you
write an essay that it seems willful if someone doesn't, but if you apply that
standard to twentieth century essays, you'll come away with a view of the
writers that's often unwarranted...

Next up: spellcheck. ;)

------
lsb
The task they did do, that needed doing, was causing a shift in thinking that
was tantamount to the shift you undergo when you learn Prolog, having only
been exposed to Assembler.

By abstracting away the storage model, you can reuse the same database calls
in the browser (SQLite), on one machine (MySQL/Postgres), or on a cluster, and
you're free to think about more business-specific logic.

------
edw519
_I had great difficulty in controlling my mirth while I read the self-
congratulatory article "Database Systems: Achievements and Opportunities" in
the October, 1991, issue of the Communications,..._

I had little difficulty in controlling my mirth when I realized that in 18
years, some things haven't changed. Linkbait, drama, and trolling all still
look the same.

 _As a designer of commercial manufacturing applications on IBM mainframes in
the late 1960's and early 1970's..._

If you're going to wave your resume, make sure it's "wavable". If I had had a
hand as a designer of commercial manufacturing applications on IBM mainframes
in the late 1960's and early 1970's, I sure wouldn't brag about it. They were
an excellent example of what _not_ to do: so expensive, so difficult to deploy
and use, and so ineffective, that the whole world rushed out to write better
apps on mini-computers, and eventually, PCs. Ironically, the one thing they
_did_ do well was their relational database storage systems. If you owned a
multi-million dollar AMAPS installation in 1978, the COBOL apps were soon
worthless. The only thing salvageable was the TOTAL DBMS.

 _I can categorically state that relational databases set the commercial data
processing industry back at least ten years and wasted many of the billions of
dollars that were spent on data processing._

This conclusion is based upon what data? Maybe in 1991, you could bullshit the
ACM without supporting data, but 2009 readers demand citations. Wikis & google
have exposed the posers.

 _Unfortunately, relational databases performed a task that didn't need doing;
e.g., these databases were orders of magnitude slower than the "flat files"
they replaced,_

Again, based upon what data? From what planet? Just because someone
overnormalized a commercial database doesn't make it the fault of the
underlying technology. That would be like saying, "That webpage sucks,
therefore HTML sucks."

 _Why were relational databases such a Procrustean bed? Because organizations,
budgets, products, etc., are hierarchical..._

Organizations, budgets, products, etc. are data sources and sinks and can be
structured any number of ways, including hierarchical. But the lifeblood of
any business is its order flow and business processes which are almost always
ideally suited to be relational; they're "linked" to almost everything else.
Not everything has to be in 4th normal form, but flat files and hierarchical
databases are almost always a poor stepchild to RDBMS for business flow.

 _These databases could also respond quickly to "real-time" requests for
information, because the data was readily accessible through pointers and hash
tables--without performing "joins"._

I guess it's not really fair to "debate" with an OP from a generation ago.
Even with Moore's Law, he would have had a hard time wrapping his head around
where the real bottlenecks would be today. But one thing really hasn't changed
that much: throughput has rarely been on the critical path. Why sacrifice data
integrity, adherence to business rules, and effective delivery of user needs
for a few microseconds? I remember routinely witnessing subsecond
intercontinental response time on massive relational database installations as
early as 1981. Why didn't OP?

Oh, and RDBMS with pointers and hash tables have been around since 1965:

<http://en.wikipedia.org/wiki/Pick_operating_system>

and now also support object oriented technology:

<http://en.wikipedia.org/wiki/InterSystems_Cach%C3%A9>

 _I shudder to think about the large number of man-years that were devoted
during the 1970's and 1980's to "optimizing" relational databases to the point
where they could remotely compete in the marketplace._

I shudder to think about the large number of man-years lost by PHBs who read
drivel like this and waste the time of people who do real work with
initiatives derived from these drama-based conclusions.

 _Database research has produced a number of good results, but the relational
database is not one of them._

From someone who has built a career rescuing so many manufacturers and
distributors from flat file systems with relational database technology to
whoever posted this: thanks for the laugh. I really needed it on a tough
Monday.

~~~
giardini
_Arguably not a single statement_ that you have made in your overly hasty post
is true. To address a few:

Henry Baker is no troll. His letter was written to the ACM in 1991. Baker was
an established and respected computer scientist and iconoclast. To treat the
letter as if it were written yesterday is foolish and disrespectful. In any
case, had Baker's remarks been misguided or irrelevant then the ACM would not
have published them.

Baker was always in demand. No need to compare resumes: you likely would not
rank well against him.

"Flat files" were used often for batch processing. For sequential processing
and merging/splitting/sorting of data they are usually faster than RDBMS.
During that processing, temporary files might be created that are indexed,
hashed, or sequential.
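
To make the sequential-merge pattern concrete, here is a minimal sketch of a one-pass merge of two flat files that are already sorted on the same key. Everything in it is an assumption for illustration: the `merge_join` helper, the two-column CSV layout, the sample customer and balance records, and the requirement that each key appears at most once per file.

```python
import csv
import io


def merge_join(left_rows, right_rows):
    """Yield (key, left_value, right_value) for keys present in both inputs.

    Both inputs must be sorted by their first column, with unique keys.
    Each input is read exactly once, front to back, with no index.
    """
    left_iter, right_iter = iter(left_rows), iter(right_rows)
    left, right = next(left_iter, None), next(right_iter, None)
    while left is not None and right is not None:
        if left[0] < right[0]:
            left = next(left_iter, None)    # left key has no partner; skip it
        elif left[0] > right[0]:
            right = next(right_iter, None)  # right key has no partner; skip it
        else:
            yield (left[0], left[1], right[1])
            left = next(left_iter, None)
            right = next(right_iter, None)


# In-memory stand-ins for two sorted flat files.
customers = io.StringIO("c001,Acme\nc002,Bolt\nc004,Dyna\n")
balances = io.StringIO("c001,150\nc003,20\nc004,75\n")

joined = list(merge_join(csv.reader(customers), csv.reader(balances)))
print(joined)
```

Only `c001` and `c004` appear in both inputs, so those are the two rows produced. The pass is purely sequential, which is why this style suited tape and batch jobs; the price is that both files must be kept sorted on the join key beforehand.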

Today network-style databases remain faster by at least an order of magnitude
(and sometimes two) than relational databases. Ask any knowledgeable
mainframer about the relative speeds of their network-style vs relational
databases. The same holds for smaller computers too.

Baker would have no problem "wrapping his head around" almost any problem. An
archive of some of his research papers:

<http://home.pipeline.com/~hbaker1/>

You saw "subsecond intercontinental response time on massive relational
database installations as early as 1981", when IIRC the only relational
databases available were Oracle and possibly Ingres. My memories are that
Oracle's early versions were performance dogs by any standard. But perhaps you
can refresh our memories by citing performance statistics and showing us
benchmarks.

PICK was not a relational database.

TOTAL was not a relational database.

Your remarks remind me of how so often when an excellent programmer leaves a
shop, the remaining developers diss his work and argue for rewriting all of
his (excellent working) code.

But to miss on almost every statement? Oh, my!

~~~
chipsy
What bugs me is that even though Baker states in the comment that he proved a
particular deficiency in the relational algebra (the expression of transitive
closures), the assertions he uses to lead into this seem extremely vague:

 _they (relational databases) made trivial problems obviously trivial, but did
nothing to solve the really hard data processing problems..._

 _organizations, budgets, products, etc., are hierarchical; hierarchies
require transitive closures for their "explosions"..._

He never describes concretely what an "explosion" is in a real situation - not
in the comment, at least. I'll give the benefit of the doubt that perhaps he
brings it up in some paper.

But my understanding is that the relational algebra is useful primarily
because of its linguistic properties - instead of every problem needing a
customized data model and customized graph-traversal solution, it describes
how to use a generic query syntax and joins everywhere.

While this limits the theoretical power of the system and introduces order-of-
magnitude slowdowns, it still solves a major subset of business problems, and
it does it with a more compact description than the equivalent graph solution.
And that, in turn, means that the theoretical limitations aren't necessarily
going to matter for shipping and maintaining a solution, especially when
working with a short-term view and small scale.

Essentially, he banks on his credentials to turn an "academic" oriented
statement into an "industry" one. But from the 70s onward he was writing papers
on garbage collection, concurrency and PL theory, not going out and building
systems in the industry.

So my view is that he's being overly biased towards theoretical purity above
practicality, a tendency that has led many an academic astray.

(This is not a slam to academics and their pursuit of purity, though. It's a
great way to advance the computing arts, and complements well the messy
practicality of hackers.)

~~~
chubbard
> So my view is that he's being overly biased towards theoretical purity above
> practicality, a tendency that has led many an academic astray.

Not so fast. He contrasts working systems in practice compared to the pure
relational technology. He is directly saying relational technology was one of
purity backed up by mathematics to make it seem more rigorous and battle
tested. Where we have a perfectly practical solution that works well and
performs better. I don't see his academic purity leading him away from
practicality at all.

I think his academic resume is making you think he must be all about purity,
but in fact he's being very practical.

------
giardini
I wonder if relational databases would not be as popular had easy-to-use
transactional filesystems been commonly available at an earlier time.

To gain transactional integrity (ACID properties) in that day, one had to
either purchase a database management system (usually hierarchical or network)
and work within that system or be a heckuva programmer and write it oneself.

Had transactional filesystems been commonly available, then file processing
with those filesystems would have been akin to working with a relational
database management system (RDBMS).

But that leaves referential integrity (RI) as a programmer task. Unfortunately
most sites I've worked at do not seem to believe that RI should be strictly
enforced. So maybe it wouldn't make much difference.

------
JulianMorrison
Seems to me, messing with immense flat files is going to be faster iff you
have a Computer Scientist of an equal caliber to the ones who wrote <insert
professional SQL DB here> to optimize your indexing and storage formats and
disk access patterns and whatnot. That is, at the very least, going to be
expensive.

It may just be simpler to just throw Oracle/Hadoop and hardware at the
problem.

------
TheSOB88
What?

------
47
> With the recent arrival of object-oriented databases, the industry may
> finally achieve some of the promises which were made 20 years ago about the
> capabilities of computers to automate and improve organizations.

Oh no! I thought NoSQL/key-value databases were the future

