
DynamoDB to Postgres: Why and How - orf
https://containership.engineering/dynamodb-to-postgres-why-and-how-aa891681af4d
======
hesdeadjim
DynamoDB was a constant source of anxiety for me and my team. We fit the usage
pattern perfectly, so the author's points didn't apply to us.

Our primary problem was having to constantly manage and change our provisioned
capacity so that we could respond to usage spikes and keep our bill from
becoming stratospheric. Worse, any time we added new functionality -- such as
nighttime analytics jobs that scanned the table, or daily backups -- it meant
revisiting our auto-scaling parameters or risking getting throttled and having
our application randomly start spewing errors.
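
Concretely, the knob-turning amounted to calls like this before and after
every batch job (a minimal boto3 sketch; the table name and numbers are made
up):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Bump capacity ahead of the nightly scan job, then dial it back
    # afterwards so the bill doesn't explode.
    dynamodb.update_table(
        TableName="events",  # hypothetical table
        ProvisionedThroughput={
            "ReadCapacityUnits": 400,
            "WriteCapacityUnits": 50,
        },
    )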

DynamoDB's billing and provisioning model is awful to deal with. When your
team is having serious conversations about the repercussions of your data
values going from 4KB to 8KB (a read capacity unit only covers items up to
4KB, so crossing that boundary doubles the units every read consumes), you
very quickly begin questioning your decision to use DynamoDB in the first
place.

~~~
Zaheer
FYI, DynamoDB added auto-scaling in the last few weeks:
[https://aws.amazon.com/blogs/aws/new-auto-scaling-for-amazon-dynamodb/](https://aws.amazon.com/blogs/aws/new-auto-scaling-for-amazon-dynamodb/)
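
Enabling it is a couple of Application Auto Scaling calls per capacity
dimension -- roughly the following sketch (table name and thresholds are
hypothetical):

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    # Register the table's read capacity as a scalable target...
    autoscaling.register_scalable_target(
        ServiceNamespace="dynamodb",
        ResourceId="table/events",  # hypothetical table
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        MinCapacity=5,
        MaxCapacity=500,
    )

    # ...and attach a target-tracking policy at 70% utilization.
    autoscaling.put_scaling_policy(
        PolicyName="events-read-scaling",
        ServiceNamespace="dynamodb",
        ResourceId="table/events",
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
            },
        },
    )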

Additionally they somewhat recently added an optional caching layer:
[https://aws.amazon.com/dynamodb/dax/](https://aws.amazon.com/dynamodb/dax/)

~~~
hesdeadjim
Nice they finally added auto-scaling. It seems to function pretty similarly to
the tool we built, with the same (documented) limitations.

DAX would have definitely addressed many of our issues, had it been available
at the time. Glad they added that.

~~~
azinman2
So does that address all your issues?

~~~
hesdeadjim
No; for our usage, needing to worry about provisioned capacity at all was
becoming a deal-breaker. It became especially tedious to estimate and plan
for as the values we needed to store grew more variable in size.

------
linkmotif
> I’m not a database guy, I’m a node guy.

Is this really a distinction your "scrappy startup type" can afford to make?

> Although since version 3.4.x Mongo passes all Jepsen tests, when our
> original database was chosen the current version of Mongo at the time did
> not and was deemed unreliable by the team

Does not pass all Jepsen tests => deemed unreliable :( Wow. Of all possible
reasons, THAT'S the reason MongoDB wasn't chosen? Jepsen, man. Pretty sad so
many people just look at Jepsen results like this. Jepsen, really, is about
the tests more than the results. It highlights various failure modes that may
be difficult or uncommon to replicate in production. These failure modes are
important to know about to understand the product you are using, but are not
really the end-all when it comes to deciding whether that product is worth
using!

Anyway, show me a database that passes Jepsen and I'll show you a database
that only accepts writes to a single node.

~~~
andreimatei1
> Anyway, show me a database that passes Jepsen and I'll show you a database
> that only accepts writes to a single node.

I know one - CockroachDB :). Relational, strongly-consistent, SQL, elastic,
multi-active replication - the dream! (I work on it)

[https://www.cockroachlabs.com/blog/cockroachdb-beta-passes-jepsen-testing/](https://www.cockroachlabs.com/blog/cockroachdb-beta-passes-jepsen-testing/)
[http://jepsen.io/analyses/cockroachdb-beta-20160829](http://jepsen.io/analyses/cockroachdb-beta-20160829)

~~~
Thaxll
“For instance, on a cluster of five m3.large nodes, an even mixture of
processes performing single-row inserts and selects over a few hundred rows
pushed ~40 inserts and ~20 reads per second (steady-state).”

Can someone explain those abysmally low numbers?

~~~
andreimatei1
Well, those numbers, if I remember correctly, refer to the insert rate in the
face of extreme amounts of contention -- tons of overlapping reads and writes
(that's how the test was trying to trigger consistency violations). Moreover,
they were measured while the Jepsen test framework was messing with the
cluster. Contention is a problem for every database, and particularly so for
CRDB. In the absence of contention, we routinely see thousands of queries per
second per node.

Since the time of that analysis, we have done a significant amount of work on
speeding up these high-contention scenarios, with quite dramatic differences
in some cases we looked at. We pretty much changed our transaction execution
model from a more "optimistic" one, where transactions can abort each other
and induce thrashing, to something more closely resembling traditional row
locks. So, hopefully, even for these atypical use cases we should generally
perform much better.

------
neovintage
The really important part of this post is the section articulating why not
Dynamo. It gets at a big question I ask all of my customers when they're
trying to choose data stores at any point during the life of their
application.

Do you know your query patterns for your application?

If you know how your application/product needs to pull data from the data
store, you have way more information with which to pick the right data store
for the problem at hand.

When you don't have _any_ prior query patterns to go off of, picking a
relational database to start your application is a good choice. You need the
flexibility to allow your schema to change over time. The schema and query
patterns will eventually reach a point where changes don't happen all that
often and then you can look to other datastores if you need to scale. Plus,
are you really going to have THAT much data when you start?

Disclosure: I work on Heroku Postgres. Could be a little biased....

~~~
lilbobbytables
Similar to the article author, many just don't have experience with relational
db's and have only heard that they aren't flexible and don't scale.

They don't realize just how performant Postgres can be - often just by
putting basic indexes where they're needed. Similarly, they may not realize
the amount of data and load PG or MySQL can handle just fine.
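
For example, something as simple as this (hypothetical table and column
names) is often the difference between a sequential scan and a millisecond
lookup:

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection
    with conn, conn.cursor() as cur:
        # One B-tree index on the column you filter by is often all it takes.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS idx_events_user_id ON events (user_id)"
        )
        cur.execute(
            "SELECT * FROM events WHERE user_id = %s ORDER BY created_at LIMIT 50",
            (42,),
        )
        rows = cur.fetchall()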

------
paladin314159
> Assuming you know how your data needs to be queried when you create your
> table, then the hash/range key combination could be the answer to your query
> problems. However we did not.

DynamoDB's purpose is to allow you to easily scale a known query pattern. I
definitely don't recommend it as a first solution when you don't know how the
data is going to be accessed. Or when everything fits on one PostgreSQL box.

We've had a lot of success starting out with a big PostgreSQL box and
migrating use cases to DynamoDB when a table gets too big and has a query
pattern that fits (essentially K/V lookup with some simple indexes).
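
To be concrete, the pattern that fits is the single-partition query below (a
boto3 sketch with hypothetical table and key names); anything fancier and
you're fighting the database:

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    events = dynamodb.Table("events")  # hypothetical table

    # All items for one hash key, optionally narrowed by the range key.
    resp = events.query(
        KeyConditionExpression=Key("user_id").eq("42")
        & Key("created_at").begins_with("2017-06")
    )
    items = resp["Items"]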

~~~
gamesbrainiac
Good news is that Postgres is going to get multi-master replication soon in
Postgres 10, so a lot of the pain points when it comes to scaling our writes
are hopefully going to be solved.

~~~
bradhe
> are hopefully going to be solved.

Right, because multi-master replication schemes have such a long lineage of
being operationally simple. /s

~~~
paladin314159
Yeah, that's the real power of DynamoDB, despite all of its shortcomings. It's
operationally simple (given you have the money).

------
mark242
Dismissing CouchDB because it only supports pre-sharding is _extremely_
shortsighted. It takes very little time to replicate from one db with 2
shards to another db with 3 shards, even on N nodes. Copying the *.couch
files is the absolute wrong way to manage this, especially in a production
environment where those files will be constantly changing.

It is far, far easier to add nodes to a Cloudant/Couch2 cluster, set up
realtime replication from db1 -> db2, and then, once the replication has
caught up, switch your app's database connection at your leisure.
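
The whole resharding dance is basically two HTTP calls (hosts and db names
are hypothetical; `q` is the shard count in Couch 2.x):

    import requests

    # Create the target db on the new cluster with 3 shards...
    requests.put("http://db2:5984/mydb?q=3")

    # ...then continuously replicate into it until it catches up.
    requests.post(
        "http://db1:5984/_replicate",
        json={
            "source": "http://db1:5984/mydb",
            "target": "http://db2:5984/mydb",
            "continuous": True,
        },
    )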

------
idbehold
I'd also like to remind anyone using DynamoDB: it cannot store empty strings.
Let that sink in for a second. If you think you're going to be able to store
JSON documents you'll be in for a nasty surprise when you try to insert a
valid JSON document and either the request fails or it converts any empty
strings in your JSON document into nulls.
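
If you want to see it for yourself, this is all it takes (a boto3 sketch;
table and attribute names are made up). At the time of writing it fails with
a ValidationException:

    import boto3

    table = boto3.resource("dynamodb").Table("documents")  # hypothetical table

    # Rejected with a ValidationException, because the value of "foo"
    # is an empty string.
    table.put_item(Item={"id": "doc-1", "foo": ""})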

~~~
koolba
An empty string (zero bytes) isn't valid JSON.

'""' vs '' (single quotes to illustrate the wrapper for the total content).

~~~
idbehold

       {"foo":""}
    

That's valid JSON that cannot be stored in DynamoDB because the value of "foo"
is an empty string. See previous discussion:
[https://news.ycombinator.com/item?id=13170746](https://news.ycombinator.com/item?id=13170746)

~~~
koolba
Ah, I didn't notice the distinction between storing the serialized form
(which would be non-empty) and storing the record itself.

Yes, that does suck quite a bit and seems like a pretty bad decision.

------
luord
My own experience working with Postgres, discovering nifty features or trying
out new ones, and articles like this make it clear for me that I'm unlikely to
ever need anything else. By the time any of the applications I work on could
grow to the size in which Postgres wouldn't be a good option, I'd hopefully be
in the process of selling it anyway.

I'm quite ok with that. It's such an awesome piece of software.

------
aluskuiuc
I'm surprised the author didn't consider using DynamoDB Streams to replicate
to a PGSQL database for the relational queries while keeping the primary
key-value lookups on DDB.
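
Roughly: a Lambda subscribed to the table's stream mirrors each change into
Postgres. A minimal sketch, assuming a hypothetical `items` table and handler
wiring:

    import psycopg2
    import psycopg2.extras

    def handler(event, context):
        conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection
        with conn, conn.cursor() as cur:
            for record in event["Records"]:
                if record["eventName"] in ("INSERT", "MODIFY"):
                    # NewImage is in DynamoDB AttributeValue format;
                    # stored here as-is for simplicity.
                    item = record["dynamodb"]["NewImage"]
                    cur.execute(
                        "INSERT INTO items (id, doc) VALUES (%s, %s) "
                        "ON CONFLICT (id) DO UPDATE SET doc = EXCLUDED.doc",
                        (item["id"]["S"], psycopg2.extras.Json(item)),
                    )
                elif record["eventName"] == "REMOVE":
                    cur.execute(
                        "DELETE FROM items WHERE id = %s",
                        (record["dynamodb"]["Keys"]["id"]["S"],),
                    )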

~~~
manigandham
What does that gain other than more expense and overhead? SQL can do key/value
just fine.

------
dchuk
It doesn't really seem like this guy was qualified to make a decision as
pivotal as choosing the database your entire system will run on, given the
breadth of experience he admits to in the opening paragraphs...

~~~
takeda
You could say I'm a Postgres fanboy, but even to me he did not sound
convincing.

The guy did not seem to know what he was talking about. It almost felt like
he picked Postgres by throwing darts at the possible options. Yes, he picked
the right choice :), but I feel like he will misuse it[1] and a few months
later write an article about how bad it was.

[1] Heh, I've seen Postgres tables that had 4 columns:

- id

- date added

- date updated

- thrift-encoded blob containing the data (that was before JSONB, but even
then the data was highly relational and that was silly)

------
pedalpete
I've just started using DynamoDB on a project. Before I made the choice, I
discussed some architecture issues with a friend who works at a big,
fast-growing unicorn start-up. They use MongoDB for the majority of the app
the user sees, Elasticsearch for search, and Postgres for reporting. Data
gets exported or updated in the data source that best suits the need of that
data, rather than trying to find a single data solution that fits every use
of the data.

~~~
jungturk
This (polyglot persistence in Fowler-speak) is typical in situations when you
have access patterns that aren't well-solved by a single data store.

It will likely introduce some serious consistency issues that need to be
attended to, but that pain might be better than the alternatives.

------
orf
I submitted this article because of a single paragraph:

> When it came to the different types of NoSQL databases, it was pretty clear
> that a document store best suited our needs. We needed to be able to query
> based on the data so key/value was out, there was no need to have the
> ability to make all of the complex connections between data of a graph db,
> and nobody actually knows what a column db is (jk but our data is not the
> right use case for it). So document it was. We would have to change our data
> layout a bit to really work well with most document databases, our current
> data structure was similar to what you would expect for a relational db with
> tables etc since dynamo also uses a table structure. I wanted something that
> was ACID, none of that eventual consistency nonsense. That took a lot of
> them out of the race right there.

I don't get it. He needed arbitrary querying, relations, ACID, tables. And
yet... his first thought was to go to a document-based NoSQL database, and he
is only later hit with the sudden realization that his relational data is
best suited to a relational database.

What has gone so wrong here that an otherwise smart and educated engineer
makes these kinds of fundamental mistakes? Is it the success of NoSQL
marketing departments, and the apparent un-sexiness of traditional relational
databases? Is it education? The job market? A fear of SQL (which you never
really have to write anyway...)? Is it the quick fix non-relational databases
give you?

As far as I can see there are very very very few nails that a PostgreSQL
hammer cannot hit. I guess that's part of the point of the article, but why
does it need an article to explain this?

How can we stop these kinds of articles from _needing_ to be written?

> The change drastically increased the speed of our API, some requests are up
> to 6x faster than they were with Dynamo. The account details request, our
> previously slowest call, took around 12 seconds with Dynamo and was reduced
> to 2 seconds with Postgres!

 _sigh_

Edit: Not sure why this has been moved to the bottom of the thread. I wasn't
trying to single this particular developer out, I was merely commenting on
this troubling trend. Apologies if this came across in the wrong way.

~~~
virmundi
Keep in mind that the Postgres hammer is a non-standard hammer. Sure, it is
pretty SQL-compliant (except there are no dirty reads, thanks to MVCC), but
if you use JSON or arrays or the many other wonderful features of Postgres,
you are now tied to Postgres. You are also slowly inching your way into
document stores.

~~~
orf
The same can be said about any database, though. No two are alike.

SQL as a standard is roughly followed, and PostgreSQL is very compliant.
That's more than can be said about any non-relational database - each has its
own query language.

I use JSON fields and arrays quite a bit, but that doesn't mean I need a
document store. It means I have small portions of data that are
non-relational, or better suited to arrays/JSON than strict tables, but I
still want everything else consistent and enforced, and to have all the
other bells and whistles Postgres gives you (FTS, partial indexes, rich
types everywhere, ACID, the list goes on and on).
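
A sketch of what that hybrid looks like in practice (schema is hypothetical):
one mostly-relational table with an array column and a JSONB escape hatch,
both still indexable:

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id    serial PRIMARY KEY,
                name  text NOT NULL,
                tags  text[] NOT NULL DEFAULT '{}',
                attrs jsonb NOT NULL DEFAULT '{}'
            )
        """)
        # A GIN index makes the non-relational bit queryable too.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS idx_products_attrs "
            "ON products USING gin (attrs)"
        )
        cur.execute(
            "SELECT name FROM products WHERE attrs @> %s",
            ('{"color": "red"}',),
        )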

Being tied to Postgres is often better than being tied to a document store.
At least you can dip your toes into non-rel as little or as much as you like
(where it makes sense), rather than diving in at the deep end and realizing
you can't swim.

~~~
virmundi
What about ArangoDB? You can perform joins, even across the cluster. It's a
document store. You don't have transactions across the cluster (yet; I'm
pretty sure I'm right). You get sharding and distribution automatically.
Again, at this point you're out of the SQL world -- why not go fully out?

------
tdurden
The author will hopefully look back on this post and cringe a little. All
the requirements point to a relational database, but that is not realized
until late because "if I need to use a database, I go straight to NoSQL".

------
williamstein
This closely mirrors my experience a few months ago:
[https://news.ycombinator.com/item?id=12649712](https://news.ycombinator.com/item?id=12649712)

