
Startups should use a relational database - raycmorgan
http://raycmorgan.com/e/startups-should-use-relational-database.html
======
robconery
One thing that could likely get you fired rather quickly is running analytics
on your live transactional system. Yes, your business needs to make decisions
based on data, this is not terribly new. To think that you only have one data
store is a bit short-sighted.

Many businesses (including startups) have moved to using document stores for
high read environments and scraping nightly drops to their backend analytics
systems. This is smart - you don't want to run summing/aggregation on a live
transactional system for (hopefully) obvious reasons.

EDIT: it's also worth noting that map/reduce is typically much more powerful
when aggregating large datasets. When trying to run analytics on top of a
transactional system, developers like Ray here would end up with multiple
joins and groupings - all of which slow _everything_ down. Map/reduce
certainly isn't perfect, but the author dismisses it as difficult witchcraft
when, in practice, parallel execution of MR queries can greatly decrease
resources and time to information.
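
For anyone unfamiliar with the shape of a map/reduce aggregation, here is a minimal single-process sketch in Python. The order records and field names are made up for illustration, and nothing here runs in parallel; a real MR system would spread the map and reduce phases across many workers.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical order records; the field names are assumptions for this sketch.
orders = [
    {"customer": "alice", "total": 30},
    {"customer": "bob", "total": 20},
    {"customer": "alice", "total": 12},
]

def map_phase(record):
    # Emit one (key, value) pair per input record.
    return (record["customer"], record["total"])

def shuffle(pairs):
    # Group emitted values by key, as a framework would between the phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    # Fold all values for one key into a single aggregate (here, a sum).
    return reduce(lambda a, b: a + b, values, 0)

grouped = shuffle(map(map_phase, orders))
totals = {key: reduce_phase(values) for key, values in grouped.items()}
```

Because each key is reduced independently, the reduce calls are the part a cluster can hand out to separate machines.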

I sort of think we've moved beyond this discussion.

~~~
nostromo
The disconnect between your comment and the article is the term "startup" now
means giant companies like Airbnb and tiny two person companies that haven't
yet created an MVP.

I think this article is targeted at the latter: pre-MVP and just post-MVP. For
those startups, having two databases with one dedicated to a backend analytics
system reeks of premature optimization.

~~~
VLM
"having two databases with one dedicated to a backend analytics system reeks
of premature optimization."

It's free software; you don't have to pay for two instances of Oracle.

One thing that will quickly kill a biz is combining the functions of PROD and
DEV/TEST. Making the DEV/TEST box the DEV/TEST/REPORTS box is not a big deal,
and you can't run a (real) biz without a DEV/TEST box.

~~~
waps
Having run the technical side of ~5 "small" businesses now (no more than $1.5M
revenue), I disagree.

Eventually, nothing can match the performance of storing binary blobs on a
cluster. But that only becomes worthwhile if your database is significantly
larger than a terabyte. And I'm only talking about the operational "core"
database, not your "data warehouse" (the log dumping ground, which should be
split off when your database gets to be a few dozen gigs).

Meanwhile, mysql has big advantages :

1) can do basic optimization with "ALTER TABLE", even (mostly) live.

2) you can mix PROD and DEV/TEST (though obviously you need to use good
judgement). Obviously you should also have a DEV/TEST instance for actual
testing. Sometimes you want to run a test quickly against PROD though. Adding
a slave, having it sync and then running against the slave is a joy.

3) creating reports is quick, customizable and everything you want.

4) It's "idiot-friendly". Employees can ramp up to the structure in a mysql db
in 2 weeks flat. Try that with custom document stores.

5) It's typesafe and relational safe (if correctly designed), with the
advantages that brings : significantly less weirdness in the database.

6) Phpmyadmin. Mysql workbench. Django. Php ...

I'm even going to argue that the GP's argument, that running analytics on PROD
can get you fired, is not just wrong, it's actually an advantage of using
mysql. (And the open source SAP database can run "live" analytics. You just
can't believe how great that is for dashboards)

------
lmm
The unhappy truth is that for many startups, relational integrity and
transaction safety are simply not very valuable. Customers of an early-stage
startup are by definition willing to take a risk on whatever they're getting
from that startup. So simply _not thinking_ about these problems - accepting
that occasionally a partial write will happen, or two writes will collide, or
a migration will not quite work correctly and your pages will crash until it's
fixed - is a worthwhile sacrifice to increase development speed.
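
To make the trade-off concrete, here is a rough sketch of the kind of partial write being accepted, and what a transaction does about it. It uses SQLite from Python's standard library purely as a stand-in; the `accounts` table and the `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        # "with conn" commits on success and rolls back if a statement raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint when src is short of funds.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 150)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Without the transaction, the credit to `bob` would survive the failed debit, which is exactly the "occasional partial write" in question.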

~~~
stiff
This is just a very bad excuse for building ridiculously shitty systems that
stay shitty way after the startup phase. If you bother at all to understand
your problem domain, and are not just a monkey at the typewriter, writing down
a good domain model and adding the constraints is not going to decrease
development speed. Quite the contrary, it's going to increase it. Web
developers often spend whole workdays digging through the logs to reconstruct
the "story" of some now-angry customer who happened to violate some
unformalized assumption of the system and got mishandled later in the process,
especially if this customer had already paid. Not to mention that with a good
domain model, the code for each piece of functionality flows out naturally,
while with a shitty one, you might end up with three times as much code for
the same thing.

Also, those mistakes in modelling the domain and in enforcing the constraints
are often there to stay and slowly become impossible to fix. Once you have
10000 records that do not fall into a few well-specified states, it's hard to
go through all of them, find some common denominators, and migrate the
database. Not to mention that with the mess people can make in the code, and
with the messy stack in use today, it's easy to introduce bugs that might be
hard for anyone to notice but seriously harm your business.
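
As an illustration of "a few well-specified states", here is a small sketch of constraints doing that job. It uses SQLite through Python's standard library only because it needs no setup; the `customers` and `orders` tables, the status values, and the `try_insert` helper are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        status      TEXT NOT NULL CHECK (status IN ('pending', 'paid', 'shipped'))
    );
""")
conn.execute("INSERT INTO customers (id, email) VALUES (1, 'a@example.com')")
conn.commit()

def try_insert(customer_id, status):
    """Insert an order and report whether the database accepted it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO orders (customer_id, status) VALUES (?, ?)",
                (customer_id, status),
            )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

results = [
    try_insert(1, "paid"),      # valid customer, valid state
    try_insert(1, "refnded"),   # typo in the state, caught by CHECK
    try_insert(99, "pending"),  # no such customer, caught by the foreign key
]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The bad rows are refused at write time, so they never become part of the 10000 records someone has to untangle later.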

The amount of fashionable nonsense in software engineering seems to be higher
than ever, unfortunately.

~~~
raverbashing
> If you at all bother to understand your problem domain

This is a startup. There is no problem domain. There is no spec. There's none
for a startup. You may not even have code.

If you're doing a "startup" but have this ironed out, great, everything you
said then applies, but it's not a startup, since you found your business
model, it's a small company.

And I would love to enforce constraints on the DB but unfortunately, I already
had "primary keys" that repeat, unbeknownst to the project customers.

~~~
marcosdumay
> This is a startup. There is no problem domain.

Is the startup just a game developers play in their free time? Because if it is
trying to solve a problem, there is a problem domain, even if it's the wrong
problem.

------
dclara
I'm totally with you. We've tried object databases, XML databases, and NoSQL
databases. Now we understand that a relational database is just the right way
to go for web applications, because it deals with structured data so well and
makes querying, sorting, and filtering seamless and effortless. It's a must.

It is the same when choosing a Linux distribution and JDK mode. See the
references here:

[http://bingobo.info/blog/table-of-contents.jsp](http://bingobo.info/blog/table-of-contents.jsp)

BTW, your title should have "relational" instead of "relation".

------
HorizonXP
Serious question: what are NoSQL databases really good for? I'm only really
used to relational DBs, and I'm unclear about which problems a NoSQL database
is useful for.

~~~
nl
_which problems a NoSQL database is useful for_

It depends on which NoSQL database. Depending on your problem, you can
probably find a NoSQL DB optimised for it. It will often be unclear if that
NoSQL DB is actually better than a relational database until you try it.

Examples:

High write throughput: Cassandra

Simple key-value: Redis

Text search: Solr/Elastic Search

etc..

~~~
twic
Redis does more than simple key-value - rather than just reading and writing
values, since values can be complex types like lists and dictionaries, it can
insert into them, append to them, etc. It's still key-value, it's just not
_simple_ key-value!

That said, i would hesitate to describe Redis as a database at all. A key
characteristic of databases is that they store every write in a durable way.
Redis can checkpoint its state periodically, but as i understand it, it either
can't or typically isn't used to safely keep every write. Redis is something
in between a database and memcached. I doubt there's ever a situation where
you have to choose between PostgreSQL/Cassandra/CouchDB and Redis; Redis is
something you would use in addition to a database.

As for text search - RDBMSs have full text search, and at least in the case of
PostgreSQL, it seems pretty good - see slide 49 in
[http://es.slideshare.net/billkarwin/full-text-search-in-
post...](http://es.slideshare.net/billkarwin/full-text-search-in-postgresql)
from 2009. You might not want to be leaning on your database for text search
when you're at scale (for operational reasons more than performance ones), but
it's a plausible way to start.

------
Gulthor
People too often forget about graph databases when talking about NoSQL
solutions. Graph databases offer an interesting and elegant alternative to
relational databases and I could definitely see a startup decide to use this
kind of technology.

As far as I know, most graph databases support transactions and offer great
scalability. Such databases are also schema-less and can be queried with
Gremlin, a powerful graph traversal language (see www.tinkerpop.com).

With respect to scalability and transactions, Titan
([http://thinkaurelius.com/](http://thinkaurelius.com/)) looks very promising:
it supports various backends for storage (Cassandra, HBase, etc.) and indexing
(currently Elastic Search and Lucene). Graph analytics can be done via Faunus
([http://thinkaurelius.github.io/faunus/](http://thinkaurelius.github.io/faunus/)),
backed by Hadoop.

There are other vendors out there (Neo4J, OrientDB, etc.) which offer
interesting solutions worth looking at - I'm just a bit less familiar with
them.

The major downside I see with graph databases is that most of them are fairly
recent and their ecosystem is tiny (though growing). Should a startup venture
on such young technologies, or stick to mature and battle-tested solutions
(i.e. relational databases)?

Could startups use this kind of graph "NoSQL" databases? I don't see why not.
If your startup is some kind of social network, graph databases are certainly
an option worth considering. If I were to create a startup, I'd hardly use a
document database like MongoDB, but I would seriously consider using a graph
database. In the end, it's all about having the right tool in hand, and
knowing how to assess what is "right" for you.

------
adamnemecek
A blog post titled "Startups should use NoSQL databases" in 3, 2, 1, ...

------
VLM
There can only be one DB much like the LOTR can only have one ring. Why? That's
the only area the linked article falls down on. It's a pretty good article
other than that.

So you properly normalized your entire system, customer billing transaction
records all the way up to article tags. Then article tags gets too huge. So
next version looks at RDBMS and Redis, and the next version after that only
looks at Redis. Customer billing transactions remains on a "real" DB and the
tag cloud lives on redis. And the problem with that is... what exactly?

It's obsolete thinking. I can't have two databases because we're a poor startup
and the only databases that exist are DB2 and Oracle and everyone knows
they're super expensive so super expensive times two is unaffordable. Dude,
it's almost 2014 not 1980, Postgres/mysql/redis, it's all free.

~~~
raycmorgan
Thank you for your comment. I agree that there is nothing wrong with using a
plurality of systems when needed. Your example of moving from one to another
is great! Start with a simple system, and once you find bottlenecks, optimize
with specialized stores.

------
jwilliams
This comes up every now and then on HN. There are plenty of NoSQL horror
stories.

Thing is. Most SQL databases at scale are a bit of a horror too. Have you seen
real-life production relational databases? Gawd. Hacks on hacks. Then you add
another database. And another analytics database. And a bunch of point to
point data feeds. Argh.

But hey. That's data.

If you think choosing SQL will solve your analytics woes down the line -- it's
just not true. You're in for some pain no matter what you do.

... That's unless you get a porcelain schema first time. Which, if you're in a
startup, probably means you're working on the wrong problem.

That's not an argument _for_ using NoSQL (I used MongoDB daily, but I've got
plenty of love for PostgreSQL). It's a rebuttal that SQL magically solves a
different problem.

------
luzero
Startups should use the right tool.

A relational database might be just as wrong as a nosql one if all you need is
redis.

------
sedlich
Strongly disagree with the article, as simplification always looks shiny.
Start-ups should sit back for a few hours or days and invest the work to
answer some serious questions such as these:
[http://nosql-database.org/select-the-right-database.html](http://nosql-database.org/select-the-right-database.html)
(there are other catalogs like this one).

Then you get a little closer to the truth.

------
lampe3
I often read the argument that NoSQL databases are schemaless, and yes, the
database is, but your data either has a structure or it doesn't. YOU must know
your data.

"All the while moving work onto the developers to standardize how they handle
different migration cases."

I know a startup is fast and bla bla... BUT your team should know the tools
that you are using... For me, SQL DBs force me to add a new field and some
kind of value, and I don't like being forced into a solution.

"In document stores, you have two choices: store related data as sub-
documents, or store related data as separate documents with references. It is
up to the developers to understand the trade-offs of both approaches.
Selecting one over the other can lead to performance gains or issues,
scalability issues and above all, make asking certain questions of the data a
lot harder."

Again, know the tools you are using. And MongoDB, for example, has good ORMs
too.

"But that takes much more forethought and is dependent on a particular
problem."

If your startup is doing something new and shiny, you don't have the knowledge
and forethought, and you often don't know what particular problem will come at
you.

Most of the points look like: you learned SQL at your university, so now you
know it (but in real life you don't), and now you use it because you know how
to normalize a database. This kind of argument is often used to say why Java is
so great or why JavaScript is bad.

I personally started with PHP, then moved to Rails and now to Meteor (which
uses MongoDB), and before Meteor we could never build a good prototype so fast,
which for a startup is very important.

So yeah, if you are comfy with SQL, use it; if you're comfy with NoSQL, use it.

------
guard-of-terra
Don't they? For some tasks relational databases are good, for some they are
worse. Call me captain.

However, relational databases will have a hard time with big data, because your
dataset is bigger than your database and you have no relational integrity.

------
RyanZAG
Depends what your startup is doing. If you are only using your database to
store some basic transactions, then a relational database is a very good fit.
This is really the case for most startups tackling common problems. However,
if your startup is tackling a problem with unique technical challenges, then
you can't just ignore the issue. For example, a geo-location startup tracking
the location in real time of users with a free app is simply not going to be
able to use a relational database.

~~~
charliesome
> _For example, a geo-location startup tracking the location in real time of
> users with a free app is simply not going to be able to use a relational
> database._

Why not?

~~~
nl
Rapidly growing, infrequently queried data is not the ideal scenario for most
relational databases.

1) Relational databases typically aren't optimised for write-throughput. It's
quite possible to do it, but you'll need fast and large disks (eg, FusionIO in
a SAN or something).

2) Location-tracking applications typically don't require interactive queries
- generally it is more a batch-based system that can be run offline.

Saying you are _not going to be able to use a relational database_ is
overstating it a bit in my view.

Clearly you can make it work, but something like Cassandra will give you
better write throughput, won't force you to rely on a SAN/NAS for data storage
and will let you use Map/Reduce to batch process the data.

------
yeukhon
I don't know. It's a hard question. MongoDB is used in pretty much any
hackathon simply because it's easy to set up, driver support is good, and it's
schemaless. The last one is really why people use MongoDB over a SQL DBMS. For
a startup, there might be a concern that schema migration is tough.

But one can argue that not being careful with schema design can break the API
and make the codebase messy.

I guess I will stick with the hard work now... I guess not being careful with
the schema will definitely bite me.

~~~
est
Another reason to choose MongoDB is built-in array and nested dict support,
with good enough indexing.

So you don't have to create bullshit m2m tables with tedious joins for a
fucking tagging system
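
For reference, the relational layout being complained about looks roughly like the sketch below. It uses SQLite from Python's standard library as a stand-in for any RDBMS; the table names, sample rows, and the `titles_tagged` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tags     (id INTEGER PRIMARY KEY, name  TEXT NOT NULL UNIQUE);
    -- The many-to-many join table: one row per (article, tag) pair.
    CREATE TABLE article_tags (
        article_id INTEGER NOT NULL REFERENCES articles(id),
        tag_id     INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (article_id, tag_id)
    );
    INSERT INTO articles VALUES (1, 'Use a relational DB'), (2, 'Use the right tool');
    INSERT INTO tags VALUES (1, 'sql'), (2, 'nosql'), (3, 'startups');
    INSERT INTO article_tags VALUES (1, 1), (1, 3), (2, 2), (2, 3);
""")

def titles_tagged(name):
    """Return the titles of all articles carrying the given tag."""
    rows = conn.execute(
        """
        SELECT a.title
        FROM articles a
        JOIN article_tags j ON j.article_id = a.id
        JOIN tags t         ON t.id = j.tag_id
        WHERE t.name = ?
        ORDER BY a.id
        """,
        (name,),
    )
    return [title for (title,) in rows]

startup_titles = titles_tagged("startups")
```

In a document store the same data would typically be a single record per article with a `tags` array on it, which is the convenience being argued for here.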

~~~
mwhite
Obligatory JSON and hstore in Postgres comment.

~~~
est
In the stable version, every value is a string. No atomic increment
operations, no nesting, shitty index.

~~~
mwhite
JSON has non-string values and nesting. [1]

I read somewhere that nesting in hstore is coming in the next version (Q3
2014?) and non-string types are on deck.

Compared to the nightmarish development workflows and processes I've had to
deal with resulting from using CouchDB as a main datastore, having to get the
entire JSON value in order to update one key seems like not that big of a
deal. What NoSQL databases even let you do incremental operations in that
sense?

Shitty index? It seems like you should be able to make an index on a value
inside the JSON just as easily as any other index.

Then maybe some advanced features of Postgres can really shine:
[http://www.postgresql.org/docs/8.3/static/indexes-bitmap-
sca...](http://www.postgresql.org/docs/8.3/static/indexes-bitmap-scans.html)
[https://wiki.postgresql.org/wiki/Index-
only_scans](https://wiki.postgresql.org/wiki/Index-only_scans)

I'm also exploring a solution for abstracting that as a normal, non-JSON table
for semi-structured data using views.

Basically, it seems like for _semi-structured data_ where you know what the
schema is, but maybe it just changes over time or isn't 100% certain, so it's
not possible to store it using a typical schema, JSON + indexes + views offers
the best of both worlds.

[1] [http://clarkdave.net/2013/06/what-can-you-do-with-
postgresql...](http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-
and-json/)

------
joshguthrie
"Blog titles should stop using should."

------
pkolaczk
He forgot one of the very important reasons to use (some) NoSQL databases:
high availability. Relational database systems are very poor at providing
that. Most often the availability options are limited to resistance to node
failures. RDBMSes have several SPOFs and must use failover, which is not
dependable, is hard to test, and often needs manual intervention. Forget
resistance to network partitions.

~~~
jahewson
CAP theorem tells us that you can't have availability without sacrificing
consistency or partition tolerance, which means that there isn't a NoSQL
database which can do that either.

It is not true that relational databases must have a single point of failure
(SPoF) or must use failover: MySQL Cluster is a sharded multi-master
distributed database without a SPoF.

On the other hand Redis, for example, is a master-slave failover NoSQL
datastore.

~~~
pkolaczk
CAP theorem says it cannot be done _at the same time_. But it is perfectly
fine to sacrifice consistency for availability at the time partition happens
and restore consistency once the partition is fixed. Still better than nothing
if revenue counts. Financial institutions do that all the time.
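
A rough sketch of what "restore consistency once the partition is fixed" can mean in practice. It assumes a last-write-wins rule keyed on timestamps, which is only one of several reconciliation strategies and not necessarily what any particular database or bank does; the replica contents and the `merge_last_write_wins` helper are invented for illustration.

```python
def merge_last_write_wins(replica_a, replica_b):
    """Merge two replicas that each map key -> (timestamp, value).

    For every key, the entry with the larger timestamp is kept.
    """
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Both sides kept accepting writes while they could not see each other.
side_a = {"cart:1": (10, ["book"]), "cart:2": (12, ["pen"])}
side_b = {"cart:1": (15, ["book", "lamp"]), "cart:3": (11, ["mug"])}

# After the partition heals, the replicas converge on one state.
healed = merge_last_write_wins(side_a, side_b)
```

Note the cost: the older write to `cart:1` is silently discarded, which is the consistency that was traded for staying available.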

