
From relational DB to a single DynamoDB table - forrestbrazeal
https://www.trek10.com/blog/dynamodb-single-table-relational-modeling/
======
gregwebs
This is a nice talk explaining what a document database is and how it is
supposed to be used. Many think it is about not having a schema, but you could
have the database enforce a schema in a document database.

Document storage freezes access patterns in a way that creates data locality
for preferred data access patterns. The major downside is that you might not
know all the access patterns and even if you do new access patterns can arise.
Either way, it can be extremely difficult to accommodate unexpected access
patterns.

I have seen document storage work well for aggregating data from external
APIs. The whole point is that you are creating preferred access patterns for
your data rather than using an external API. And you are probably prepared to
rebuild your database if your access patterns change.

I would also expect document storage to work well in an architecture of
microservices under the conditions that:

* The schema is small so there aren't many possible access patterns

* Data is shipped to a SQL (warehouse) database for more complex processing anyways.

~~~
013a
I tend to believe that one of the most powerful things about Dynamo is the
streaming capabilities. It means that you can cleanly use Dynamo as the source
of truth for your data, reap all of the performance benefits it has, and
automatically ship that data to another database for more advanced BI
querying. Moreover, you can do this offline; queue up every mutation that hits
Dynamo, then process them at a rate that your data warehouse can ingest
without worrying about overprovisioning for spikes.

I've worked on a team of 8 engineers that spent 6 months building a streaming
replication system from MySQL to Redshift. It's crazy how simple this would
have been had the originating database been Dynamo; it's very possible one
engineer could have done it in a fraction of the time. It also seems likely to
me that an open source solution might already exist.
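A sketch of what that pipeline could look like, assuming a Lambda function subscribed to the table's stream (the handler shape follows the DynamoDB Streams event format; the actual queueing step is left as a comment since the target differs per setup):

```python
# Hypothetical Lambda handler attached to a DynamoDB Stream: flatten
# each batch of mutations so they can be queued for the warehouse.

def extract_mutations(event):
    """Turn a DynamoDB Streams event into simple mutation dicts."""
    mutations = []
    for record in event.get("Records", []):
        ddb = record.get("dynamodb", {})
        mutations.append({
            "action": record.get("eventName"),  # INSERT | MODIFY | REMOVE
            "keys": ddb.get("Keys"),
            "new_image": ddb.get("NewImage"),   # absent for REMOVE events
        })
    return mutations

def handler(event, context):
    mutations = extract_mutations(event)
    # Here you would enqueue them (e.g. SQS send_message_batch) and let
    # a separate consumer drain into the warehouse at its own pace.
    return mutations
```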

------
shortj
Especially with the new(ish) DynamoDB features mentioned in this article (e.g.
adaptive scaling), the backup and restore capabilities including Point-in-Time
Recovery, and the on-demand pricing model, it's a really effective database
when going with the single-table design.

Wrapping your head around this takes some pretty serious time, along with
trial and error. Expect to work and rework your data model multiple times.
Turning to spreadsheets actually helps tremendously with defining and moving
data around to play that game of Tetris mentioned in the article.

In addition, you'll probably want to avoid any of the common ORM-type clients
that exist for DynamoDB; they tend to have made decisions that make using a
single-table design an uphill battle. Write your own little wrapper for the
native AWS DynamoDB client in your language of choice and you'll be much
happier long term.
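As an illustration (my own sketch, not from the article), such a wrapper can be only a few dozen lines; the `pk`/`sk` naming and the `entity#id` key format are assumptions:

```python
# A hypothetical thin wrapper over the native DynamoDB client that
# hides the single-table key conventions from application code.

class SingleTable:
    def __init__(self, table_name, client=None):
        if client is None:
            # Deferred import so the key helpers work without AWS set up.
            import boto3
            client = boto3.client("dynamodb")
        self.table_name = table_name
        self.client = client

    @staticmethod
    def make_key(entity, entity_id):
        # e.g. ("customer", "42") -> pk "customer#42", sk "customer"
        return {"pk": {"S": f"{entity}#{entity_id}"},
                "sk": {"S": entity}}

    def get(self, entity, entity_id):
        """Fetch one item by its logical identity."""
        resp = self.client.get_item(TableName=self.table_name,
                                    Key=self.make_key(entity, entity_id))
        return resp.get("Item")
```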

~~~
maxxxxx
" Write your own little wrapper for the native AWS DynamoDB client in your
language of choice and you'll be much happier long term."

That seems to be the secret with a lot of 3rd-party components. Write your own
wrapper that encapsulates the features you use, but don't let that component
dominate your codebase.

------
rodeoclown
When breaking up your records into multiple rows by attribute, you then
require multiple read units, due to how DynamoDB charges you for each row
read, with a minimum of one read unit per row. So each complete record would
incur a cost for each attribute, as if each row were sized at 4 KB (the
minimum billed per read unit) on read, with an analogous per-row minimum (1 KB
per write unit) on write.

You also very likely need to use DynamoDB transactions (not really mentioned
in the article) or else you're taking big integrity risks with partially
incoherent records and so you're paying more for their transaction bookkeeping
as well.

Doesn't this seem to map badly to DynamoDB's cost model?
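To make the arithmetic concrete (a sketch using the standard sizing, where one strongly consistent read unit covers up to 4 KB of a single item):

```python
import math

def rcus_for_read(item_size_bytes):
    """Strongly consistent GetItem: 1 RCU per 4 KB, rounded up per item."""
    return math.ceil(item_size_bytes / 4096)

# One 2 KB record stored as a single item: one read, one RCU.
whole_record = rcus_for_read(2048)                        # 1 RCU

# The same record split into 5 attribute rows of ~400 bytes each,
# fetched individually: five reads, five RCUs, same total data.
split_record = sum(rcus_for_read(400) for _ in range(5))  # 5 RCUs
```

One mitigating factor: a single Query that returns all the rows in one request is billed on their combined size, so the per-item minimum mainly bites on individual GetItem calls and on writes.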

~~~
mabbo
Not exactly. If you find yourself needing to do multiple reads for the same
record, then put the other data into whichever row you loaded first. It's all
about planning your access patterns ahead of time.

------
forrestbrazeal
In case you missed it, the code needed to load the Northwind database into DDB
with appropriate indexes is here [0].

One fun thing: this was my first time playing with the new DynamoDB on-demand
capacity features. If you do run the code, check out the CloudWatch metrics on
the table. It's pretty cool to see the WCUs shoot up to 150 and back down to 0
with no throttling whatsoever.

[0] [https://github.com/trek10inc/ddb-single-table-example](https://github.com/trek10inc/ddb-single-table-example)

~~~
013a
My biggest fear with On-Demand is that there's no cap to the cost, and an
unruly service or access pattern could blast up the RCUs/WCUs instead of
getting a simple throttle and timeout. Is this a valid fear, and is there any
way around it if so?

~~~
nzoschke
It’s a valid fear. Dynamo can get really expensive really fast with on-demand
or auto scaling.

I have a Dynamo addon for Heroku apps and need to keep costs under control. I
use AWS Budgets and put a fixed-price monthly budget on every table. If users
go over the budget, access keys get disabled.

[https://elements.heroku.com/addons/autodyne](https://elements.heroku.com/addons/autodyne)

[https://aws.amazon.com/aws-cost-management/aws-budgets/](https://aws.amazon.com/aws-cost-management/aws-budgets/)

------
tracker1
I would offer that, depending on your querying needs, with a little added
complexity you could replicate the data to an Elasticsearch cluster for more
complicated queries and use Dynamo as the source of record/control.

IIRC, you can set up events based on Dynamo records that trigger a Lambda
function to update the search DB.
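Something like this sketch (the index name, ES client wiring, and the handled type subset are my assumptions):

```python
# Hypothetical stream-triggered Lambda that mirrors DynamoDB writes
# into a search index. NewImage arrives in DynamoDB's typed JSON, so
# it has to be converted to plain documents first.

def from_ddb_json(attr):
    """Convert a DynamoDB-typed attribute (S, N, BOOL, M, L) to plain Python."""
    (kind, value), = attr.items()
    if kind == "S":
        return value
    if kind == "N":
        return float(value) if "." in value else int(value)
    if kind == "BOOL":
        return value
    if kind == "M":
        return {k: from_ddb_json(v) for k, v in value.items()}
    if kind == "L":
        return [from_ddb_json(v) for v in value]
    raise ValueError(f"unhandled DynamoDB type: {kind}")

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            doc = {k: from_ddb_json(v)
                   for k, v in record["dynamodb"]["NewImage"].items()}
            # Real ES call would go here, e.g.:
            # es.index(index="search-mirror", id=doc["pk"], document=doc)
```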

~~~
ilkhan4
This is what I do: all writes go to Dynamo and it replicates to ES for
querying/filtering. It's the best of both worlds for operational stuff, IMHO,
although still not perfect for ad-hoc querying like SQL.

~~~
ChicagoDave
No matter what, all data has to land in a data lake for BI and general
reporting.

And always define reporting requirements up front. The last thing you want to
find out is that you need a real-time dashboard and you don’t have the
services or storage to support it.

------
eptcyka
This kind of design seems to make certain rows be accessed very often, no
matter what data ends up actually being read. Doesn't this imply a hard
bottleneck? While DynamoDB IO operations can scale __infinitely__, they can
only do so if the operations are spread over a random distribution of keys
over which the backend can be partitioned, and thus scaled horizontally,
right?

~~~
shortj
That limit is going to be pretty high. Adaptive capacity (or on-demand) gives
you up to 3000 read capacity units / second and 1000 write capacity units /
second per partition.

The general solution, if you think you are going to get to a volume where that
is not sufficient, is to do something like adding an additional integer (or
other known symbol) to the end of the key, e.g. `key.1`, `key.2`, `key.3`.
When doing operations you would have to run N operations across the key +
symbol space. Not as elegant, and you'll pay the additional computation for
those queries, but since they can all be run in parallel it's not too bad in
terms of latency.
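That suffix scheme might look something like this (a sketch; the shard count and key format are arbitrary choices for illustration):

```python
import random

NUM_SHARDS = 3  # picked arbitrarily; tune to your write volume

def sharded_key(base_key):
    """Writes: pick a random suffix so load spreads across partitions."""
    return f"{base_key}.{random.randrange(1, NUM_SHARDS + 1)}"

def all_shard_keys(base_key):
    """Reads: enumerate every suffix so N queries (run in parallel) cover them all."""
    return [f"{base_key}.{n}" for n in range(1, NUM_SHARDS + 1)]
```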

Now, if all you are doing is hammering away at a few rows, you are stuck for
sure. DAX [1] or some other caching layer may be your only way out of that
problem.

[1]
[https://aws.amazon.com/dynamodb/dax/](https://aws.amazon.com/dynamodb/dax/)

------
dstaley
I can't wait for the day when we figure out how to "scale to zero" with
database engines and also still provide quick starts. The serverless model is
fantastic for smaller projects that can't justify the monthly cost of RDS. I
was super excited about Aurora Serverless until I found out cold starts take
about 30 seconds, and even longer if you use a VPC to connect to it from a
Lambda function. The single table approach with DynamoDB would work, but the
amount of cognitive effort necessary compared to just using a relational
database isn't worth it for small toy projects.

~~~
jchanimal
FaunaDB (my employer) scales from zero to worldwide, with no pre-provisioning
or capacity planning required. In addition to serverless database pricing and
no cold start, you get full ACID transactions as well as relational features.
None of the overloading of shard keys etc as featured in the DynamoDB video.

~~~
Drdrdrq
[https://fauna.com/](https://fauna.com/)

------
etaioinshrdlu
I smell a tool in the making! One where you enter your access patterns in a
SQL-ish format up front, and you get your document DB design as output.

Then it should also wrap your new table up in an ORM to protect you from
yourself.

Could also support generating migrations to support adding new access
patterns, at the cost of modifying the entire database.

~~~
adamfeldman
[https://www.prisma.io/](https://www.prisma.io/)

------
scarface74
My only NoSQL experience was with document databases like Mongo, and I'd
always thought that DynamoDB was too limiting. When I first watched it, this
talk single-handedly changed my opinion of the usefulness of DynamoDB and how
to architect a system around it properly.

------
Arzh
This seems like they just want to remove joins with string concats (composite
keys), and that doesn't sound like a good idea to me. Though all of this looks
pretty slick, it seems like it's all based on a shaky foundation.

~~~
mattlong
I was also very surprised by this. I'm having a hard time accepting that I'd
have to manually concat my columns to create sort keys and keep track of the
corresponding queries for them, e.g. "BEGINS WITH..."? Granted, I only watched
the video so maybe I'm missing something and that was just to simplify the
presentation.
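From what I can tell, it really is that manual. Something like this sketch (table, attribute names, and the key format are my guesses, not from the talk): the sort key packs several columns, e.g. `ORDER#2019-01-15#1047`, so one `begins_with` condition can answer "all orders for this customer in January 2019".

```python
# Hypothetical composite sort key helpers for a single-table design.

def order_sort_key(date, order_id=None):
    """Pack order fields into one sort key, e.g. ORDER#2019-01-15#1047."""
    parts = ["ORDER", date]
    if order_id is not None:
        parts.append(str(order_id))
    return "#".join(parts)

def orders_in_month_query(table, customer_id, month):
    """Parameters for client.query(): all of a customer's orders in a month."""
    return {
        "TableName": table,
        "KeyConditionExpression": "pk = :pk AND begins_with(sk, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"CUSTOMER#{customer_id}"},
            ":prefix": {"S": order_sort_key(month)},
        },
    }

# boto3.client("dynamodb").query(**orders_in_month_query("app-table", "42", "2019-01"))
```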

~~~
Arzh
Yeah, I'm with you. I've been reading up today about the GSIs in Dynamo, but
they really seem like a hack to get you a usable database. I need to come up
with a good reason to use it at work; probably migrating some older databases
so that we can multi-DC them.

------
jacob019
Anyone else have trouble scrolling that page? It's scrolling very slowly, and
page-up/down don't work. Firefox 64/Linux

~~~
Drdrdrq
Yes, FF/Linux. They must be doing some magic with scrolling.

------
jugg1es
I just finished a big, critical project at work that uses this exact approach
with an adjacency list in DynamoDB with 5 GSI's for different access patterns.
Not only is it super fast, it's also very extensible. As this article states,
however, the most important and hardest part is ensuring your design is
correct because you can't really change it once you implement it. You can add
features but it's basically impossible to modify an existing implementation.
As long as you do the right planning, DynamoDB works great!

------
chisleu
I love dynamodb in theory. It's a great C* replacement for simple users who
don't need some of C*'s features.

What I don't like about the auto scaling is that it seems to be limited to
either scaling reads or writes automatically. That seems short-sighted; I
don't understand the technical reasoning.

~~~
bboreham
There are two sets of settings for auto-scaling reads and writes, but you can
turn them both on.

Or you can turn on the very recent “on-demand pricing” and not worry about
capacity settings.

------
eknkc
Isn’t there a partitioning issue on the GSI here?

I mean, all customers have something like pk=customer1 sk=customer, so all
customers are in the same partition of the GSI. I know it's a big deal to
distribute the table partition key, but doesn't that also apply to the GSI?

~~~
forrestbrazeal
It does apply to the GSI. This point is briefly addressed in the sidebar about
sharding farther down in the post. Not sure the "customer" records would have
a problem, but the static GSI partition keys for orders, products, and
suppliers could potentially run into issues.

------
bazza451
Dynamo as a replacement for RDS is great until you come across these simple
things that are easy to implement in a RDS (like a paging model for row
results) and you need to jump to a page in the middle. it’s the biggest PITA.

------
gambler
_> Step 1: Define the access patterns you think you’ll need_

What happens when they change?

~~~
keketi
It depends. You might get away with just adding a Global Secondary Index or
you might be totally screwed. DynamoDB sacrifices flexibility in favor of
scalability.

~~~
danenania
You won't be _totally_ screwed since you can always export/import your data
into a new table with new indexes, but it could certainly take some work. That
said, re-indexing large amounts of production data will take some planning
with any database.

~~~
zwily
You can add GSI’s to live tables, but not LSI’s.

------
jjuel
I love the idea of this, but it is so hard for me to wrap my head around. I
can never get my mind out of the relational model. Of course, for the personal
project I am working on I am not sure NoSQL makes sense, but that could just
be me not knowing anything other than the relational model. It seems a lot
like some functional languages: just so foreign compared to what I have always
known.

~~~
grogenaut
The way I usually think about it is that I start with designing the queries.
SQL has you go backwards: you design the data per how it lays out normalized,
then write complexity to answer the questions you want. With NoSQL you think
about answering your queries and model to that. Think about basically most
APIs just hitting one table and at most one backing index. If you need to do
ad hoc querying, do what others have said and stream to a different data
store. E.g., design for scale first based on the most common user actions.

~~~
jjuel
That does make sense. I am just trying to think of it in terms of my current
personal project and how I could lay it out, but the problem is it is mostly
ad hoc queries since it is historical data, like storing game results and
statistics. I think relational works best for my current situation. However,
this is pretty cool for future projects where I would have just used
relational before. Watching the videos about it has been a great help; it
helps me figure out the use cases.

~~~
grogenaut
What's the TPS on any of your ad hoc queries? If it's low and very
specialized, then a relational or a Lucene or Redshift type system is likely
what you want, e.g. more like ERP.

If, say, you're doing dotabuff-type stuff, then just chunking out the records
you want by player, or even just an S3 blob that is updated, might be better.
Then the rest of your tables are literally just indexes to the large amounts
of data.

It totes depends on what you're doing.

The big issue with relational really is that you hit a point where
transactions just can't keep up with locking across the cluster. If your
system doesn't do this, then your next problem is just sharding on tons of
data. If you don't have enough data then relational is fine. If you do have
that much data then you're going to have to figure out a sane sharding and
joining strategy across a cluster, which is more work. At some point in data
size you either have each user on their own shard of boxes, or you are
keeping only indexes in the databases to retrieve other data. Both take a
bunch of work.

Much of the motivation behind that talk, and Amazon's drive away from SQL, is
really just the business risk of hitting that scaling limit while you have a
business that is doubling in revenue. Your only option when you hit a limit on
SQL (esp. with transactions) is to heavily re-architect. And that takes a
while, so you have to pause growth during that time, at which point a second
mover can come roaring past you.

If I was building something like dotabuff I'd go with a relational database
sharded by users, or some type of columnar database for the ad-hoc querying.
Though really, for dotabuff you could build almost the entire site to be
statically served off of CloudFront and just re-crunch data when a new match
comes in.

However I'd have dynamo as the front end data ingest record keeper. Things
like "did I see this game data already" to avoid using things like FIFO
queues.

Happy to talk about it more if you provide more examples of what you're trying
to map.

And again, if you don't have these problems, relational is just totally fine.
You can scale quite high on relational; it's just that you run the risk of
hitting that asymptotic tipping point where the system just goes down and you
can't get bigger hardware. If you're not going to hit it, the tool you know
will serve you well.

Another random thought: I find that people who think in SQL forget how damn
awkward it is. For instance, ask someone who uses relational regularly how to
create tables for users' shipping addresses. They'll spout a denormalized
form that works in many, many places. Ask a college hire the same question
and you might get more than one table. The thing that is obvious to the
regular user will be non-obvious and confusing to the college hire. And I
would argue that the table layout is something that evolved over 30+ years of
the industry doing it over and over.

------
dstroot
I just got enlightened. I've used both SQL and NoSQL for years now. In my
simplistic view of the world I was convinced relational data goes in SQL and
denormalized KV single-access-pattern data goes to NoSQL. The tools I was
using don't offer the secondary index option. It blew my mind that I could
support a full relational model in DynamoDB. Thanks for sharing this!

------
lliamander
This "index overloading" is an interesting term that I haven't heard before
(though it seems very similar to an entity-attribute-value model).

The only document store I've used "in anger" has been DynamoDB, does this
technique apply well to other document stores?

------
Drdrdrq
Curious: when would you choose DynamoDB over RDS / Aurora?

~~~
forrestbrazeal
Rick Houlihan has another talk from re:Invent 2018 where he goes into great
detail about this -- see the following Twitter thread for a jumping off point.

[https://twitter.com/alexbdebrie/status/1081237474930769920](https://twitter.com/alexbdebrie/status/1081237474930769920)

~~~
mabbo
I just finished watching that video 10 minutes ago and my mind is blown. I've
been using DynamoDB wrong for most of 7 years.

------
Thaxll
The video is very good, everyone should watch it.

------
rawoke083600
OMG! Why put yourself through this NoSQL silver bullet? NoSQL has a very
niche application (extreme high load and low-dimensional data with few
relationships). This automatically excludes most business cases! A relational
DB (plain old SQL) is what you're looking for 9 times out of 10! The amount
of stupid workarounds I've seen just to say 'we use DynamoDB' is epic...

