
Startup Mistakes: Choice of Datastore - StavrosK
https://www.stavros.io/posts/startup-mistakes-datastore/
======
onion2k
Just pick _something_ and build an MVP with it. Then get on with the hard part
of finding paying customers. You can fix the bad tech decisions later. Without
customers it won't matter which database you used before your startup failed.

~~~
mamcx
BIG ERROR.

Like, MONUMENTAL.

That _something_ (if chosen very wrong) will totally derail your progress and
will cost a lot to fix later.

It's incredible. Does nobody remember that the most cost-effective way to fix a
problem is in the early stages?

And, yes, the best overall primary datastore, in like 90% of the cases, is an
RDBMS. Very few actually need to use something else.

~~~
geophile
Yes and no.

Yes, choose carefully at the beginning, and use relational. More specifically,
use Postgres.

But choose the right things to worry about at the beginning. So worry about
something that can accommodate changing requirements, i.e., a relational
database. Don't worry about scalability. You will be very, very lucky to ever
have that problem. Worry about it then.

~~~
edneypitta
>something that can accommodate changing requirements, i.e., a relational
database

Wouldn't NoSQL be better suited for this scenario? Genuinely curious.

~~~
mamcx
There's a myth that RDBMSs are not flexible (because they have schemas). To the
contrary, the relational model is very flexible, and allows you to model
everything you could want. You can even model a NoSQL store :)

And the relational model is fairly simple. You can learn everything you need
in minutes, and just remember to add a few indexes here and there. For the small
amount of things you need to do, it's incredible how many features an RDBMS gives
you for free (like transactions, some basic axioms, and a query engine).

----

I don't remember the original quote, but it's something like:

"Novices worry for code, Experts focus on data structures".

Properly choosing your data structures, schemas and data layout will have a huge
net impact on your code. That is why a well-modeled database will perform well
and be easy to code against.

This is the hard part, and most novices like to "defer" it.

In my day, we started designing the app _at the DB layer_ first, including
the queries, reports, etc. Now most start with the front-end or back-end,
imagining they are focusing on the abstract logic (because some can't imagine the
datastore is part of it!), and ignore or reject the idea of learning what
the datastore's capabilities are.

That is like ignoring the documentation about arrays, building a layer on top
of them, rejecting the idea of using arrays to the full, and then wondering why
your code performs badly while re-creating, badly, what they already provide.

---

For fast prototypes, SQLite (stored in RAM) can be good. In the early stages
I erase the DB on each run of the code (after the initial DB design, while
tweaking it). I continue to do that as long as possible, and only worry about
migrations and all that when I start shipping to customers.

Also, don't be afraid to build several copies of the tables as experiments
(customerA, customerB, etc.), use views, and peek at your DB documentation.
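
A minimal sketch of that throwaway-database workflow, assuming Python's built-in `sqlite3` module. The `customer`/`invoice` schema and the `fresh_db` helper are made up for illustration; the point is only that the whole database is rebuilt from the schema text on every run, so there is nothing to migrate yet.

```python
import sqlite3

# Hypothetical schema for illustration; edit it freely between runs.
SCHEMA = """
CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE invoice (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    total_cents INTEGER NOT NULL
);
"""


def fresh_db() -> sqlite3.Connection:
    """Return a brand-new in-memory database; nothing survives between runs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = fresh_db()
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO invoice (customer_id, total_cents) VALUES (1, 1250)")

# Try out the queries and reports against the design before writing app code.
rows = conn.execute(
    "SELECT c.name, SUM(i.total_cents) "
    "FROM customer c JOIN invoice i ON i.customer_id = c.id "
    "GROUP BY c.id, c.name"
).fetchall()
print(rows)
```

Swapping `":memory:"` for a file path is the only change needed once the data has to outlive the process.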

~~~
SQLite
> I don't remember the original quote, but it's something like: "Novices worry for code,
> Experts focus on data structures".

The origin is Linus Torvalds on the Git mailing list. Here is a copy from LWN:
[https://lwn.net/Articles/193245/](https://lwn.net/Articles/193245/)

The quote is ironic considering that Git uses a bespoke and not particularly
well-designed key/value database, which has resulted in notorious usability
problems in Git.

------
jihadjihad
The reason MongoDB is simultaneously lauded and derided is because there is a
bimodal distribution of people using it: those who build databases for a
living, and those who query them.

There is something to be said for being able to rapidly prototype ideas,
especially if your primary skillset is jockeying JSON in the context of
web/app development. However, deciding whether or not NoSQL is the right fit
for you or your project/business depends on how much of your time you will be
spending getting down and dirty with the database.

~~~
maephisto
yes, yes, yes.

------
redwood
So he used a database 7 years ago that was about one year old at the time and
had a bad experience with it and now continues to judge it to this day without
tracking the evolution since, including schemas in the current release
candidate.

~~~
StavrosK
No, the article isn't about Mongo. Quite the opposite.

~~~
maephisto
Well, this discussion here is now about Mongo

~~~
StavrosK
Regrettably :(

~~~
maephisto
Uhmm, the article actually contained the phrases "Don't use Mongo" and "pick
PostgreSQL" so it's kinda asking for a duel between the two in the comments.

------
elvinyung
I feel like MongoDB/NoSQL is a horse that's been beaten so much in the last
few years that no one is actually making that choice nowadays. Hasn't everyone
already learned to stick with Postgres?

~~~
maephisto
MongoDB/NoSQL is deployed and used at large scale by companies like Facebook,
Ebay and many others. That's quite far from "no one is actually making that
choice nowadays".

~~~
alkonaut
> companies like Facebook, Ebay

They didn't choose NoSQL, they were forced to. I'm fairly convinced they
started with relational stores. If a company or product grows to a point where
relational data doesn't work, that's a problem you _want_ to have.

The mistake is either thinking you need to design for Facebook scale from the
beginning OR thinking that you can cut time in a startup by not having to
bother with those pesky schemas that just slow you down.

~~~
erik_seaberg
Facebook started with normalized tables, though they altered the schema really
frequently. They added more MySQL servers as they grew to more schools. Then
their users started graduating and moving around and things got complicated.

------
bmh_ca
FWIW, my personal anecdote: I chose Google App Engine for a large enterprise
application almost a decade ago and it's been absolutely, undeniably one of
the best choices we made.

The application is locked into Google, but that hasn't proven a problem yet
and can be designed around if need be.

~~~
StavrosK
My personal anecdote is the exact opposite of yours. I picked GAE for one of
my personal projects years ago and it has been terrible. You can't do certain
kinds of lookups unless you build the indexes first, there are no delete
cascades, no intra-table constraints, etc.

I suspect the difference between us is that you spent the time working around
these problems, whereas I expected it to just work.

------
codazoda
This article is a bit combative but I generally agree with the idea.

I use both MySQL and MongoDB in my daily work on a classifieds site that does
a few hundred million pageviews a month. Both are pretty solid performers. The
article is correct that with Mongo you just move the schema into the code (new
versions notwithstanding). I think it's nicer to have the schema on the
database side but it's really just user preference. We typically end up
creating a schema class and defining it up-front anyway. There is also a small
subset of cases where not having a schema at all is actually a benefit.

Starting with a popular SQL engine is a really good tried and tested method
though.

~~~
beagle3
It's more than user preference in the long run. When your schema is encoded in
the database (with triggers, constraints, foreign keys, the whole shebang),
you can be sure that whichever way you access the database, it is still
consistent.

When your schema is "in the code", even if you completely abstract it into a
library, it means that the quick-one-off Ruby/Perl/C#/Python script will not
have the integrity checks, and may corrupt your DB.
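
A small sketch of that difference, assuming SQLite through Python's `sqlite3` module and a hypothetical `account`/`payment` schema (neither comes from the comments above). The `try_sql` helper stands in for a quick one-off script that knows nothing about the application's validation code; the database still refuses the bad writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE account (
    id      INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0)
);
CREATE TABLE payment (
    id         INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES account(id),
    amount     INTEGER NOT NULL CHECK (amount > 0)
);
INSERT INTO account (id, balance) VALUES (1, 100);
""")


def try_sql(sql: str) -> bool:
    """Run one statement the way an ad-hoc script would; True if accepted."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


valid = try_sql("INSERT INTO payment (account_id, amount) VALUES (1, 25)")
orphan = try_sql("INSERT INTO payment (account_id, amount) VALUES (999, 25)")
negative = try_sql("UPDATE account SET balance = -50 WHERE id = 1")

print(valid, orphan, negative)
```

The valid payment goes through, while the payment for a non-existent account and the negative balance are both rejected by the database itself, whichever client sends them.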

------
tabeth
To the people saying "just do Postgres", what would you say to a startup that
wants to create offline first apps? CouchDB, for example is way better at that
than Postgres.

1) Use CouchDB and Postgres?

2) Somehow implement revisions and Postgres?

3) Use Postgres anyway and scrap the offline first?

~~~
StavrosK
If you want to create offline-first apps, I would use Postgres as the main
datastore and use Couch to sync data between client and server. You'd have to
decide which of your data would live where (or if you wanted to use Couch as a
way to transfer data from the server to the client).

Couch is a very good datastore for that use case, though, so I would
definitely use it in some capacity for your purpose.

~~~
tabeth
Can you expand on this? Do you mean you'd use Postgres as your main store and
periodically "flush" data from CouchDB to Postgres (probably as JSON)?

I don't see how that would work reliably. On the client would you source
CouchDB or Postgres? Presumably you'd access CouchDB directly, but then why
even use Postgres (for that subset of data, anyway).

~~~
StavrosK
It really depends on what your data looks like. If it's just game settings and
state, put them in Couch and that's it, and keep user data like payments and
activity there.

If it's data you're going to want to run analytics on and sync to the client,
you're probably going to have to store it in both places, I think.

------
jarym
I think the real mistake is one of:

1. I'm gonna use NoSQL because I've heard it's really cool.

2. I'm not going to use SQL because it's hard for me to write queries and change
stuff on the fly.

If either of those is true (gotta be honest with yourself when deciding) then
STOP. You have to think about the best tool for the job overall - sometimes
that will be NoSQL, sometimes it will be an RDBMS. If you can't decide which
of the two, then Postgres with its JSON support is IMO the best starting
point.

------
cocktailpeanuts
OP forgot the biggest mistake for choice of datastore: Blockchain.

~~~
relyio
It isn't an awfully common anti-pattern for startups though. Or at least, not
as common as picking Mongo<3.* instead of pq because things need to be
"webscale".

------
wonderous
@StavrosK: (aka the author) - what is the TLDR of the article? I ask since
a number of users, myself included, appear not to get the
intent of your article.

—

Meta-comment: Feel like if the poster is self-identifying as the author when
posting the link, it’s verified via say email/domain, an HN username has been
ID’d in the past as the author, etc. — it should be automatically obvious in
post, comments, etc.

~~~
StavrosK
It's what I explicitly call out as a TLDR in the article itself: "Think before
you pick a database. If you insist on not thinking, pick PostgreSQL. Trust
me."

Basically, "Postgres is a better default".

~~~
wonderous
There is no TLDR, summary, etc. clearly marked as such; by one estimate it
takes 140 seconds or 468 words to get to the start of the sentence you quoted
above; which is strange given your point is to stick to best practices unless
there’s a strong reason not to do so. Strongly recommend moving that info to
the top of your post.

------
mooreed
1) Read the image captions - they made my day.

2) Pick Postgres unless you have a compelling reason not to.

2b) Use the time gained by not over-engineering upfront, to focus on users and
business logic.

3) Optimize/Evolve away from Postgres later as needed.

4) Profit.

------
jlebrech
If your datastore has no schema, your application must be the schema: you have
to version your models and have those migrate when old data is accessed.
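
A rough sketch of what "migrate when old data is accessed" can look like, in plain Python with dicts standing in for stored documents. The `schema_version` field, the two migration steps and the field names are all assumptions invented for this example, not any particular datastore's API.

```python
from typing import Any, Callable, Dict

Doc = Dict[str, Any]
CURRENT_VERSION = 3


def v1_to_v2(doc: Doc) -> Doc:
    """v2 split the single 'name' field into first and last name."""
    doc = dict(doc)
    first, _, last = doc.pop("name").partition(" ")
    return {**doc, "first_name": first, "last_name": last}


def v2_to_v3(doc: Doc) -> Doc:
    """v3 added an 'email_verified' flag; old documents default to False."""
    return {**doc, "email_verified": False}


# Each entry upgrades a document from that version to the next one.
MIGRATIONS: Dict[int, Callable[[Doc], Doc]] = {1: v1_to_v2, 2: v2_to_v3}


def load(raw: Doc) -> Doc:
    """Upgrade a stored document step by step until it matches the current model."""
    doc = dict(raw)
    version = doc.get("schema_version", 1)  # unversioned data is treated as v1
    while version < CURRENT_VERSION:
        doc = MIGRATIONS[version](doc)
        version += 1
        doc["schema_version"] = version
    return doc


old = {"name": "Ada Lovelace"}  # written by the first release
print(load(old))
```

Every reader of the data has to go through `load` (or reimplement it), which is exactly the work a schema and migrations would otherwise do on the database side.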

------
lowbloodsugar
Wow. This fell off the front page in the time it took for me to drive to work.
Guess the NoSQL crowd ain't got time for that.

While I'm generally sympathetic to your post, there are some things that are
red flags.

>If you have ten services accessing the same database and sharing data between
themselves

Whoever accesses the database schema owns it. If you have ten systems accessing
your database then ten teams own it. And if ten teams own it then nobody owns
it. Nobody can change it. Seen this at a successful start-up that got big, but
then couldn't rev order management, because every team had their finger in the
pie, we couldn't do a schema update without breaking everyone, and of course
one team was "under tremendous pressure to hit a major milestone and we just
can't do that now" for over a year.

Now I would turn this around into a win for RDBMS by suggesting the use of
functions or stored procedures: with an RDBMS we can construct an API, and
then we can version those APIs. And then the team that owns the database can
do what they like. That said, we can do the same with NoSQL databases by not
allowing other teams to access them. The team that owns the NoSQL database is
required to maintain an API for it.
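
A sketch of the versioned-API idea, with one loud caveat: it uses SQLite views as a stand-in, because SQLite has no stored procedures, so this is a swapped-in technique rather than the functions or procedures mentioned above. The table and view names (`orders_storage`, `orders_v1`, `orders_v2`) are hypothetical. Other teams only ever query the views; the owning team reshapes the storage underneath.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders_storage (
    id          INTEGER PRIMARY KEY,
    total_cents INTEGER NOT NULL
);
INSERT INTO orders_storage VALUES (1, 1200), (2, 550);

-- The only thing other teams are allowed to query.
CREATE VIEW orders_v1 AS SELECT id, total_cents FROM orders_storage;
""")


def consumer_report(c: sqlite3.Connection) -> list:
    """Code owned by another team; it knows about orders_v1 and nothing else."""
    return c.execute(
        "SELECT id, total_cents FROM orders_v1 ORDER BY id"
    ).fetchall()


before = consumer_report(conn)

# The owning team splits the total into net and tax, then re-points v1
# at the new storage and publishes v2 alongside it.
conn.executescript("""
CREATE TABLE orders_storage2 (
    id        INTEGER PRIMARY KEY,
    net_cents INTEGER NOT NULL,
    tax_cents INTEGER NOT NULL
);
INSERT INTO orders_storage2 SELECT id, total_cents, 0 FROM orders_storage;
DROP VIEW orders_v1;
DROP TABLE orders_storage;
CREATE VIEW orders_v1 AS
    SELECT id, net_cents + tax_cents AS total_cents FROM orders_storage2;
CREATE VIEW orders_v2 AS
    SELECT id, net_cents, tax_cents FROM orders_storage2;
""")

after = consumer_report(conn)
print(before == after)
```

The consumer's query is untouched by the storage change, and it can move to `orders_v2` on its own schedule.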

I've only ever had nightmares with other teams coding against my schema.
GraphQL worries me in that respect and I'd love to hear how people here have
fared with long lasting GraphQL, in the real world.

>Django, for example, makes migrations trivial, as you just change your
application-level classes and the database gets migrated automatically.

I've had automated migration systems grind to a halt and leave the DB fucked
too.

>Priscilla used a graph database when her data was relational. Her husband,
furious, filed for divorce.

There are no schemas that are "relational" but not "a graph". However there are
plenty of schemas where a graph database is a natural fit but that require
either one-table-per-node-type or building a graph model on top of your RDBMS
(e.g. an Entity-Attribute schema). Oy.

There are also many schemas where there is only one entity type, but every join
is against itself, and we're looking to join all the way out to the clique. In
this case would you suggest an RDBMS, and then put the iteration in the
application? You suggested earlier that making up for the inadequacies of the
datastore in the application is a bad idea.

I've got a graph application that uses NoSQL and it was the right call. An
RDBMS would have allowed us to write something that worked for simple cases,
but that would have brought the system to its knees based on some customer
usage. The solution for the RDBMS would be the same as what we had to do for
NoSQL. But up to that moment, the NoSQL allowed us to iterate far faster than
an RDBMS.

>For example, if you later need to compile a list of all the brands of all the
products on your store, an RDBMS can easily do that by reading the “brands”
table

Only if you built a brands table. You can't argue that we can't predict the
future, so use an RDBMS because it's easy to change, but then make arguments
that require that the builder accurately predicted the future and built the
schema with that foresight. Sure, we could go and pull a brands table out of
the existing tables but that's work, and it might be work on a live database
that brings it down.
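
For concreteness, a sketch of what "pulling a brands table out of the existing tables" involves, assuming SQLite via Python's `sqlite3` and a hypothetical `product` table that stored the brand as free text. On a toy in-memory database it is a handful of statements; on a large live table the same backfill `UPDATE` is the part that can hurt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    brand TEXT NOT NULL          -- nobody predicted a brands table
);
INSERT INTO product (name, brand) VALUES
    ('Runner', 'Acme'), ('Hiker', 'Acme'), ('Loafer', 'Globex');
""")

# The ad-hoc question can still be answered without a brands table...
brands = [row[0] for row in conn.execute(
    "SELECT DISTINCT brand FROM product ORDER BY brand"
)]

# ...but getting a real one means a schema change plus a backfill.
conn.executescript("""
CREATE TABLE brand (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
INSERT INTO brand (name) SELECT DISTINCT brand FROM product ORDER BY brand;
ALTER TABLE product ADD COLUMN brand_id INTEGER REFERENCES brand(id);
UPDATE product
   SET brand_id = (SELECT id FROM brand WHERE brand.name = product.brand);
""")

unlinked = conn.execute(
    "SELECT COUNT(*) FROM product WHERE brand_id IS NULL"
).fetchone()[0]
print(brands, unlinked)
```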

A graph database would be just as likely to have 10 brand nodes since the
overhead of creating the first such node is far lower than creating an entire
table and updating the schema.

>Relational databases excel at easily providing answers to questions that
weren’t predicted at the time when the data model was designed.

Or a NoSQL database with Spark. "But Spark is something new to learn"

And this is the biggest flaw in your argument. You're pro RDBMS because you
know SQL and how to run RDBMS, create the schema, and write the queries. It is
incredibly easy to get started with MongoDB. That right there is why it is
popular. Not because it's good. But when you say "Just get something started
and use an RDBMS", you're actually saying "I know you know JavaScript, but I
need you to learn Modula-2 for this part of the system". Fundamentally
different syntax and strict types (or "schema").

>For all its unparallelizability, ACID is pretty damn nice.

Until someone holds transactions open across network calls and kills
throughput. I've seen that issue lose a company a multi-million dollar
contract because of contention on a single row. Or until someone chooses the
wrong isolation level ("But I used a transaction!") and two transactions
happily decrement non-atomically. This shit be hard, and part of the "hard" is
not knowing you're doing it wrong.
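
A deterministic sketch of the "two transactions happily decrement non-atomically" failure, assuming SQLite via Python's `sqlite3` with two connections to a temporary file. The `item` table and helper names are hypothetical, and the interleaving is scripted by hand rather than produced by real concurrency; it shows the read-then-write pattern that a too-weak isolation level lets through, not any specific engine's isolation settings.

```python
import os
import sqlite3
import tempfile


def read_stock(conn: sqlite3.Connection) -> int:
    # fetchall() exhausts the cursor, so no read lock lingers on the file.
    return conn.execute("SELECT stock FROM item WHERE id = 1").fetchall()[0][0]


def read_then_write(a: sqlite3.Connection, b: sqlite3.Connection) -> None:
    """Both clients read 10, then each writes back 'what I saw, minus one'."""
    seen_a = read_stock(a)
    seen_b = read_stock(b)
    a.execute("UPDATE item SET stock = ? WHERE id = 1", (seen_a - 1,))
    a.commit()
    b.execute("UPDATE item SET stock = ? WHERE id = 1", (seen_b - 1,))
    b.commit()  # silently overwrites a's decrement


def atomic_decrement(a: sqlite3.Connection, b: sqlite3.Connection) -> None:
    """Each client lets the database do the arithmetic in a single statement."""
    for conn in (a, b):
        conn.execute("UPDATE item SET stock = stock - 1 WHERE id = 1")
        conn.commit()


def final_stock(strategy) -> int:
    """Start from stock = 10, apply two decrements, return what is left."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stock.db")
        a, b = sqlite3.connect(path), sqlite3.connect(path)
        try:
            a.execute(
                "CREATE TABLE item (id INTEGER PRIMARY KEY, stock INTEGER NOT NULL)"
            )
            a.execute("INSERT INTO item VALUES (1, 10)")
            a.commit()
            strategy(a, b)
            return read_stock(a)
        finally:
            a.close()
            b.close()


print(final_stock(read_then_write))   # one decrement is lost
print(final_stock(atomic_decrement))  # both decrements land
```

Both versions "used a transaction" and neither raises an error, which is the uncomfortable part: the first one just ends up with the wrong number.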

>You’ll have plenty of time figuring out what to use when you know your exact
usage patterns, if the business manages to not die until then.

So start with something quick and easy then. You've already managed to
describe two scenarios where an RDBMS blew up in practice (multiple teams
hitting a DB causing schema lock; it's a network, not a flat hierarchy).

If you have an experienced SQL team, by all means go with RDBMS. But let's not
pretend they are a panacea. Honestly, I'd use a graph database pretty much all
the time - _if I could only trust them_. But it's the quality, reliability and
longevity that I have a problem with, not with the nature of how the database
organizes data.

------
williamstein
TL;DR: "just use Postgres"

~~~
frik
TL;DR: "just use MySQL or Postgres or SQLite"

~~~
StavrosK
Don't use SQLite for production. For all the love I have for it, the client
libraries usually lock on access and don't work properly with concurrent
reads/writes.

~~~
alkonaut
It's designed for single client, and works very well in production for single
client. Mobile apps, desktop apps, and anywhere you can serialize access it
works great.

Let's say "don't use it for production for a multiuser server app" then I
agree. But that isn't really a supported scenario at all.

~~~
emodendroket
But in all those scenarios the downsides of just writing JSON to disk or
something are smaller too.

------
gcb0
> startup mistakes

> .io domain

oh my.

~~~
dordoka
What's the problem with a .io domain? Honest question.

~~~
codazoda
There was a recent failure that affected everyone using the TLD and a popular
blog post explaining it and suggesting that you use something more reliable.
Some of the comments on the article pointed out that the same company runs
.org and a handful of other reliable TLDs.

~~~
dordoka
Thanks very much

