
Rules for Data Modeling with DynamoDB - shortj
https://www.trek10.com/blog/the-ten-rules-for-data-modeling-with-dynamodb
======
kpmah
> Normalization was built for a world with very different assumptions. In the
> data centers of the 1980s, storage was at a premium and compute was
> relatively cheap. But the times have changed. Storage is cheap as can be,
> while compute is at a premium.

Normalisation isn't primarily about saving storage; it's about avoiding
update anomalies, i.e. correctness.

~~~
abd12
Great point! It was originally about both of these things, but the storage
aspect isn't discussed much anymore because it's really not a concern.

The data integrity issue is still a concern, and I talk about that in the
book. You need to manage data integrity in your application and think about
how to handle updates properly. But it's completely doable, and many people
have done it.

~~~
AmericanChopper
> You need to manage data integrity in your application

This is just another way of saying you need to implement your own system for
managing consistency. Dynamo offers transactions now, but they don't offer
actual serialization: your transaction will simply fail if it runs into any
contention. You might think that's fine, just retry. But because you've chosen
to sacrifice modelling your data relationally, this will happen a lot. If you
want to use aggregates in Dynamo, you have to update an aggregate field every
time you update your data, which means you create single points of contention
for your failure-prone transactions to run into all over your app. There are
ways around this, but they all come at the cost of further sacrificing
consistency guarantees. The main issue with that is that once you've reached
an inconsistent state, it's incredibly difficult to even know it's happened,
let alone to fix it.
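
To make the contention point concrete, here's a rough sketch of the aggregate
pattern being described (boto3; the table and field names are made up). Every
write has to bump the same aggregate item, so that one item is where all the
contention lands:

```python
import boto3
from botocore.exceptions import ClientError

client = boto3.client("dynamodb")

def add_line_item(order_id, item_id, price_cents, retries=3):
    """Write a line item and bump the order's aggregate total in one
    transaction. Every writer touches the same aggregate item, so
    concurrent writes contend on it and the transaction can fail."""
    for _ in range(retries):
        try:
            client.transact_write_items(TransactItems=[
                {"Put": {
                    "TableName": "orders",
                    "Item": {
                        "pk": {"S": f"ORDER#{order_id}"},
                        "sk": {"S": f"ITEM#{item_id}"},
                        "price": {"N": str(price_cents)},
                    },
                }},
                {"Update": {  # the single point of contention
                    "TableName": "orders",
                    "Key": {"pk": {"S": f"ORDER#{order_id}"},
                            "sk": {"S": "AGGREGATE"}},
                    "UpdateExpression": "ADD order_total :p",
                    "ExpressionAttributeValues": {
                        ":p": {"N": str(price_cents)}},
                }},
            ])
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "TransactionCanceledException":
                raise  # some other failure, not contention
    raise RuntimeError("transaction kept hitting contention")
```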

Then you run into the issue of actually implementing your data access. With a
denormalized data set, implementing all your data access comes at the expense
of increasing interface complexity. Your schema ends up having objects, which
contain arrays of objects, which contain arrays... and you have to design all
of your interfaces around keys that belong to the top level parent object.

The relational model wasn’t designed to optimize one type of performance over
another. It was designed to optimize operating on a relational dataset,
regardless of the implementation details of the underlying management system.
Trying to cram a relational set into a NoSQL DB unavoidably comes at the
expense of some very serious compromises, with the primary benefit being that
the DB itself is easier to administer. It's not as simple as cost of storage
vs. cost of compute. NoSQL DBs like Dynamo (actually, especially Dynamo) are
great technology, with many perfectly valid use cases. But standing in for an
RDBMS is not one of them, and everybody I've seen attempt to use it that way
eventually regrets it.

------
abd12
Author here! If you want more, I just released a book on DynamoDB yesterday
--> [https://www.dynamodbbook.com/](https://www.dynamodbbook.com/) . There's a
launch discount for the next few days.

The book is highly recommended by folks at AWS, including Rick Houlihan, the
leader of the NoSQL Blackbelt Team at AWS[0].

Happy to answer any questions you have! Also available on Twitter and via
email (I'm easily findable).

[0] -
[https://twitter.com/houlihan_rick/status/1247522640278859777](https://twitter.com/houlihan_rick/status/1247522640278859777)

~~~
mabbo
After seeing Rick's re:Invent talk, the one where at about minute 40
everyone's heads exploded, I emailed him (I'm in a very far away other
department of Amazon) to ask him for more, because everything he was saying
was absolutely not the way my group was using DynamoDB (i.e. we were doing it
wrong).

He could have ignored my email entirely. He's a busy guy, right? I wouldn't
have held it against him at all. Instead, he was super nice, provided me with
more documentation, and honestly was just really helpful.

If he recommends your book, I'm buying it.

~~~
HatchedLake721
Link please?

~~~
abd12
Rick has had the most-watched re:Invent talk for each of the last three years.

Here are the links for each:

- 2019: [https://www.youtube.com/watch?v=6yqfmXiZTlM](https://www.youtube.com/watch?v=6yqfmXiZTlM)

- 2018: [https://www.youtube.com/watch?v=HaEPXoXVf2k](https://www.youtube.com/watch?v=HaEPXoXVf2k)

- 2017: [https://www.youtube.com/watch?v=jzeKPKpucS0](https://www.youtube.com/watch?v=jzeKPKpucS0)

Additional DynamoDB links here: [https://github.com/alexdebrie/awesome-dynamodb](https://github.com/alexdebrie/awesome-dynamodb)

------
ajxs
> "With denormalizing, data integrity is more of an application concern.
> You'll need to consider when this duplicated data can change and how to
> update it if needed. But this denormalization will give you a greater scale
> than is possible with other databases."

There's the big catch. As another poster pointed out, normalisation is not
about efficiency. It's about correctness. People have been quick to make the
comparison between storage and compute cost; the high cost of development and
bug-fixing time trumps both of them by an order of magnitude. The guarantee of
referential integrity alone that SQL offers eradicates an entire class of
bugs for your application with no added effort. This article glosses over this
critical caveat all too blithely. Whenever this discussion comes up I'm
quick to refer back to the yardstick of "Does your application have users? If
so, then its data is relational". I can't wait for the day when we look back
at NoSQL as the 'dancing sickness' of the IT world.

It's also worth asking: at what scale does this tradeoff become worthwhile?
Another poster here correctly pointed out that modern versions of Postgres
scale remarkably well. The tipping point where this kind of NoSQL
implementation becomes the most efficient option is likely to be far beyond
the scale of most products. It's true that completely denormalising your data
will make reads much faster; this is undeniable. That does not mean you need
to throw the baby out with the bathwater and store your master data in NoSQL.

~~~
AmericanChopper
> I can't wait for the day when we look back at NoSQL as the 'dancing
> sickness' of the IT world.

It does have some compelling use cases. It's just that relational data isn't
one of them. If you have a use case with a low potential for write contention,
a tolerance for eventual consistency, a very simple data structure, and a high
demand for read throughput, then it's great. One area where I've seen it used
with great success is content publishing. You have one author, perhaps an
additional editor/proofreader, the content is one document (with perhaps one
other related document, like an author bio), and hopefully you want thousands
or perhaps millions of people to be able to get decent read performance.
Another example could be pretty much anything you'd typically use a
materialized view for in a DB. You can compute the view in your RDBMS,
periodically publish it to a document database, and then offload all read
throughput to a better-suited system.
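
A minimal sketch of that publish step (Python; the connection string, query,
and table name are all hypothetical):

```python
import boto3
import psycopg2

def publish_top_sellers():
    """Recompute an aggregate view in the RDBMS, then publish it to
    DynamoDB, which serves all read traffic for this view."""
    conn = psycopg2.connect("dbname=shop")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT item_id, sum(quantity) AS sold
            FROM order_items
            WHERE ordered_at > now() - interval '30 days'
            GROUP BY item_id
        """)
        rows = cur.fetchall()

    table = boto3.resource("dynamodb").Table("top-sellers")
    with table.batch_writer() as batch:  # buffers into BatchWriteItem calls
        for item_id, sold in rows:
            batch.put_item(Item={"pk": f"ITEM#{item_id}", "sold": int(sold)})
```

Run it from a cron job or a scheduled Lambda, and the RDBMS stays the source
of truth while the document store absorbs the reads.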

NoSQL is usually used wrong, IMO, but that doesn't mean there aren't ways to
use it right. There are valid use cases for graph databases and stream
processing systems too. But they're not hip enough to produce the same volume
of highly questionable web apps.

~~~
ajxs
You're absolutely correct with the point about the materialised view use case;
I wasn't going to labor the point by going into extra detail in my post. The
most successful use I've seen for NoSQL databases is periodically aggregating
complex relational data structures into a single document record. You're
entirely correct: it's not that the technology is inherently wrong (in most
cases; MongoDB is another story), it's just the widespread misuse that's
giving this technology a bad name.

------
jayd16
At what point do these auto-sharding databases like DynamoDB become worth the
effort these days? You can squeeze a lot out of a single Postgres instance and
much more if you go with read replicas or Redis caches.

When you start with a relational model you don't need a priori knowledge of
your data access and you get solid performance and guarantees. If you need
this access knowledge beforehand, is DynamoDB best for scaling mature
products?

~~~
abd12
I just answered this on Twitter, but I think there are two instances where
it's a no-brainer to use DynamoDB:

- High-scale situations where you're worried about the performance of a
relational database, particularly joins, as it scales.

- If you're using serverless compute (e.g. AWS Lambda or AppSync), where
traditional databases don't fit well with the connection model.

That said, you can use DynamoDB for almost every OLTP application. It's just
more a matter of personal preference as to whether you want to use a
relational database or something like DynamoDB. I pick DynamoDB every time b/c
I understand how to use it and like the other benefits (billing model,
permissions model, performance characteristics), but I won't say you're
_wrong_ if you don't choose it in these other situations.

~~~
dfee
Does the connection model problem go away when using serverless RDS?
[https://aws.amazon.com/rds/aurora/serverless/](https://aws.amazon.com/rds/aurora/serverless/)

~~~
agacera
I don't know about Aurora Serverless, but AWS offers a proxy for exactly this
purpose.

[https://aws.amazon.com/rds/proxy/](https://aws.amazon.com/rds/proxy/)

~~~
dfee
Very interesting. Thank you for sharing.

------
debrice
There is a lot more you should learn about DynamoDB, but I appreciate the
author's effort. Please read the AWS documentation; it's not that big, and it
explains vital things that just aren't in this article. Very important things
like:

- An LSI can't be created after the table is created (see the sketch below)

- Creating a GSI late generates back pressure on the main table

- If you have an LSI, your table will not scale beyond 10GB

- How often a table can scale up and down per day/hour

- The cost of auto-scaling in CloudWatch (alarms aren't free)

...and so much more. I've been working with DynamoDB for over 2 years now and
I love it.
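
For instance, because an LSI can only be declared at CreateTable time, the
creation call has to include it up front. A boto3 sketch (table and index
names are illustrative):

```python
import boto3

client = boto3.client("dynamodb")

# The LSI must be declared here; it can never be added later. A GSI can
# be added later, at the cost of a backfill that puts pressure on the
# main table.
client.create_table(
    TableName="orders",
    AttributeDefinitions=[
        {"AttributeName": "pk", "AttributeType": "S"},
        {"AttributeName": "sk", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "pk", "KeyType": "HASH"},
        {"AttributeName": "sk", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[{
        "IndexName": "by-created-at",
        "KeySchema": [
            {"AttributeName": "pk", "KeyType": "HASH"},  # same hash key as the table
            {"AttributeName": "created_at", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)
```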

~~~
abd12
One note on this -- if you have an LSI, you can't have an _item collection_
larger than 10GB, where an item collection refers to all the items with the
same partition key in your main table and your LSI.

A DynamoDB table with an LSI can scale far beyond 10GB. That said, I would
avoid LSIs in almost all circumstances. Just go with a GSI.

~~~
debrice
Thank you for the clarification abd12, you are right. I only create an LSI if
I know for sure the data within a given shard will not go beyond 10GB... and
since we never really know, we always go with a GSI.

------
thoraway1010
I love postgresql.

I did business with a startup: signed up, started getting service. They play
an intermediary biller/payor role.

Because of an issue with the company name used in signup, their billing system
fell over and didn't set up billing.

But what was crazy was that I quickly realized this shop was a NoSQL shop.
NOTHING connected to anything, and since they hadn't built any reports to
cross-check any of this, they literally did not notice (I noticed other
consistency issues elsewhere).

In a SQL database this stuff, especially around accounting / money, is 101
stuff, but NoSQL seemed to really struggle here based on how they'd set it up.

I finally bugged them to charge us, but even that was a bit goofy (basically,
it looked like they exported some transactions to a credit card system, but I
doubt they had any logic to handle failed payments, etc.).

We have one other vendor where the amount actually charged is a few pennies
off the receipt totals: issues with doubles, rounding, and application logic
that doesn't simply use the same number (from the database) for order / item
detail, totals, and billing.

So at least in finance / accounting, a ledger which is basically a source of
truth, and is linked / summarized / etc. in various ways to other parts of the
system (sales tax, receipts, credit card charges by order, etc.), really
results in some free consistency wins that don't seem to be free in NoSQL
land.

~~~
ChicagoDave
Operational data stores are not reporting data stores. A healthy system would
fire change events to which a reporting system subscribes, placing the data
where it belongs in the reporting data model.

~~~
thoraway1010
Except what happens when change events are dropped?

In SQL, normalized, with referential integrity on, you simply never get out of
sync.

------
reilly3000
This is a great post, and DDB is a great database for the right use cases. I
want to give a shout-out to FaunaDB for anybody looking for alternatives: it's
also serverless and crazy scalable, with usage-based pricing. Its downside is
its proprietary FQL query language, not because it sucks (it doesn't!) but
because there is a learning curve. They provide a rich GraphQL interface as an
alternative to FQL. Its upside vs. DDB is a much richer set of functions,
including aggregations, and first-class support for user-defined functions.
Their attribute-based permissions system is phenomenal. It's definitely worth
a look if you're considering DynamoDB but want something that takes less
upfront planning about access patterns.

------
abhishektwr
Having spent a few years working with DynamoDB to build multi-region, multi-
tenant platforms, I must say that DynamoDB is a good fit as a supplemental
datastore, i.e. you should only store a subset of the information managed by
your serverless microservice. DynamoDB multi-region replication is just
amazing. Unfortunately, we had a few massive billing spikes with DynamoDB, and
we ended up adding pricing measurements and tests to track read/write units in
all our functions.
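
The tracking can be as simple as asking DynamoDB to report consumed capacity
on each call and asserting on it in tests. A rough boto3 sketch (the table and
key schema are made up):

```python
import boto3

table = boto3.resource("dynamodb").Table("tenants")  # hypothetical table

def get_tenant(tenant_id):
    resp = table.get_item(
        Key={"pk": f"TENANT#{tenant_id}"},
        ReturnConsumedCapacity="TOTAL",  # have DynamoDB report the cost
    )
    used = resp["ConsumedCapacity"]["CapacityUnits"]
    # Fail fast in tests when a call suddenly costs more than expected,
    # instead of discovering it on the monthly bill.
    assert used <= 1.0, f"read cost spiked: {used} RCUs"
    return resp.get("Item")
```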

I generally don't recommend DynamoDB as a primary data store, irrespective of
your use case. It takes too much time to model the data. With every new
requirement, you have to redo a lot of the modelling exercise. Choices you
made in the beginning start looking bad, and you will not remember why you
created that particular combination of composite key or local secondary index,
which now offers no benefit after incremental changes. Transaction support is
painful; the existing SDKs just don't cut it.

I often wish some of the GCP Firebase features were available in DynamoDB,
like namespaces, control over daily throughput to avoid billing spikes, and
better transaction support.

------
ithrow
Curious: besides being a truly serverless and scalable database, why else
would one choose to model relational data in DynamoDB? With the 'single table
design' scheme the author talks about, aren't you in a world of hurt if you
need new access patterns? Which is highly probable for most systems.

~~~
Thaxll
Because it's managed and works "forever".

~~~
ithrow
Already implied that in "truly serverless and scalable database". My point is
that it would be very dumb not to use something like Postgres for relational
data unless you really need those DynamoDB features.

------
fulafel
DynamoDB seems to be so low-level that it takes a lot of design and
programming effort to get right. Are there any higher-level solutions built on
DynamoDB that take care of these things automatically? For example,
denormalization sounds pretty error-prone if you implement it by hand.

~~~
abd12
It's different from a relational database in that you need to model your data
to your patterns, rather than model your data and then handle your patterns.

Once you learn the principles, it really is like clockwork. It changes your
process, but you implement the same process every time.
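
For a flavor of what modeling to your patterns looks like, here's a minimal
single-table sketch (the entities and key layout are hypothetical). The key
design is driven by one access pattern: fetch a customer and their orders
together.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app")  # one table for everything

# Both entity types share a partition key, so a single Query serves the
# access pattern "get a customer with all of their orders".
table.put_item(Item={"pk": "CUSTOMER#123", "sk": "PROFILE", "name": "Ada"})
table.put_item(Item={"pk": "CUSTOMER#123", "sk": "ORDER#2020-04-07#9001",
                     "total": 4200})

resp = table.query(KeyConditionExpression=Key("pk").eq("CUSTOMER#123"))
items = resp["Items"]  # the profile item plus all order items, in one call
```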

Honestly, I think part of the problem is that there's a lot of bad NoSQL
content out there. A little standardization of process in this space will go a
long way, IMO :)

------
femto113
> To handle [compound primary keys] in DynamoDB, you would need to create two
> items in a transaction where each operation asserts that there is not an
> existing item with the same primary key.

There are other approaches--in cases where I've needed compound keys I've had
success using version-5 UUIDs as the primary key constructed from a
concatenation of the compound key fields. The advantage is that Dynamo's
default optimistic locking works as expected with no transaction needed. A
potential disadvantage is if you frequently need to look records up by just
one component you'd need a secondary index instead of the primary doing double
duty.
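
Roughly like this (a Python sketch; the table and field names are made up):

```python
import uuid
import boto3

table = boto3.resource("dynamodb").Table("subscriptions")  # hypothetical

# Deterministically derive one key from the compound fields: the same
# (account_id, plan_id) pair always yields the same UUID, so a plain
# conditional put enforces uniqueness with no transaction needed.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "subscriptions.example.com")

def put_subscription(account_id, plan_id):
    key = str(uuid.uuid5(NAMESPACE, f"{account_id}#{plan_id}"))
    table.put_item(
        Item={"pk": key, "account_id": account_id, "plan_id": plan_id},
        ConditionExpression="attribute_not_exists(pk)",  # optimistic insert
    )
```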

------
polymath21
Thanks for this Alex! Will definitely check this out. DynamoDBGuide.com was a
huge help to me when I was learning serverless last year to build
wanderium.com. There's definitely a learning curve for DynamoDB but the
performance is really good.

Do you talk about the best way to do aggregations in your book? That's one of
the more annoying downsides of DynamoDB that I've kind of had to hack my way
around. (I combine DynamoDB streams with a Lambda function to
increment/decrement a count.)
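
For anyone curious, that hack is roughly this shape (a Python Lambda sketch
with made-up names; note that stream batches can be retried, so real code also
needs to guard against double counting):

```python
import boto3

table = boto3.resource("dynamodb").Table("counts")  # hypothetical table

def handler(event, context):
    """Triggered by a DynamoDB stream; maintains a running item count."""
    delta = 0
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            delta += 1
        elif record["eventName"] == "REMOVE":
            delta -= 1
    if delta:
        table.update_item(
            Key={"pk": "STATS", "sk": "ITEM_COUNT"},
            UpdateExpression="ADD n :d",  # atomic in-place counter update
            ExpressionAttributeValues={":d": delta},
        )
```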

~~~
abd12
Thank you! Glad you found DynamoDBGuide.com helpful :)

Yep, I do talk about aggregations in the book. One strategy that I've
discussed is available in a blog post here[0] and involves using DynamoDB
Transactions to handle aggregates.

If you're looking for large-scale aggregates for analytics (e.g. "What are my
top-selling items last month?"), I have an Analytics supplement in the Plus
package that includes notes on different patterns for analytics w/ DynamoDB.

Let me know if that helps!

[0] - [https://www.alexdebrie.com/posts/dynamodb-transactions/#handling-counts-and-preventing-duplicates](https://www.alexdebrie.com/posts/dynamodb-transactions/#handling-counts-and-preventing-duplicates)

------
jugg1es
Once you get comfortable modeling your data in Dynamo, it becomes hard to
justify using an RDBMS with all of the overhead that goes along with it.
Dynamo is not for every use case, of course, but as long as you design the
table generically enough to handle adjacency lists, and you don't mind
sometimes making more than one query, it works really well.

~~~
morcutt
I disagree. I spent a few months with it and it was a bit of a nightmare. It
fits certain use cases, but AWS and others push it for more than it should be
used for.

Have geo data? Need a count? Need to look something up for a team member when
you don't have time to set up BI tools? Want to query something differently
because of design changes? The answer to all of those is pretty much no, or
"you can, but it'll take some extra thing."

We ended up picking Aurora Serverless and it is working just fine. We can
quickly make changes as business cases change, our entire team can query it
for now (until we have fancier tools on the biz end), and we are able to move
much more quickly.

~~~
jugg1es
Yep, if you need aggregated data at all for operational purposes, Dynamo isn't
a good choice. A lot of the hardship people have with Dynamo comes from not
really understanding how to model their data correctly. If you don't design it
correctly from the start, you are f*cked, and I sympathize with that.

------
js8
IMS
([https://en.wikipedia.org/wiki/IBM_Information_Management_System](https://en.wikipedia.org/wiki/IBM_Information_Management_System))
called and wants its segmented hierarchies back. :-)

------
tkyjonathan
"Normalization was built for a world with very different assumptions. In the
data centers of the 1980s, storage was at a premium and compute was relatively
cheap."

But forget to do normalisation and you will be paying 5 figures a month on
your AWS RDS server.

"Storage is cheap as can be, while compute is at a premium."

This person fundamentally does not understand databases. Compute has almost
nothing to do with the data layer; or at least, if your DB is maxing out on
CPU, then something is wrong, like a missing index. As for storage, it's not
like you are just keeping old movies on an old hard disk: you are actively
accessing that data.

It would be more correct to say: disk storage is cheap, but SDRAM cache is
1000x more expensive.

The main issue with databases is IO, and the more data you have to read,
process, and keep in cache, the slower your database becomes. Relational or
non-relational, it still follows these rules of physics.

~~~
ranman
> This person fundamentally does not understand databases.

Oh boy, I do love Hacker News :).

It sounds like you've spent a lot of your career in a SQL world. Have you
worked much with DDB/MongoDB/Cassandra? If not, give it a whirl with more than
a toy application and share your thoughts. Already done that? Try the
brand-new "constructive criticism" framework.

Instead of “this person fundamentally does not understand databases” based on
13 words in a 1200+ word article, consider: “I disagree with this statement
and here’s why”.

You get all of the karma with none of the ad hominem! Win win!

------
ChicagoDave
There is an unhealthy attachment to relational data stores today. It’s a tool,
not an architecture or solution. We shouldn’t start with them and often should
exclude them from our operational implementations. Reporting and analysis
systems benefit tremendously from relational data stores. But we learned years
ago that separate operational and reporting systems provided optimal
performance.

I suggest those of you still unfamiliar with NoSQL operational data storage
patterns trust companies like Trek10 and Accenture (where I saw great
success).

------
dyeje
This is great advice; I wish I had it before I started working with DynamoDB.
Hot keys can be a particularly nasty landmine.

------
ChicagoDave
There's an aspect of software development relating to speed and agility. NoSQL
data stores offer a schemaless approach that removes a great deal of
unnecessary friction. The amount of time we've spent managing relational
schemas is ludicrously expensive. There are still great usage patterns for
relational databases, but operational use is not one of them. I'd argue it's
an anti-pattern.

~~~
aeyes
I would not call NoSQL data stores more agile. While it is true that you can
add new columns at any point in time, the same cannot be said about supporting
new access patterns you didn't design for.

~~~
arborix
People often confuse startup speed with agility. The talk "Agility ≠ Speed"
does a good job of critiquing that idea. Agility is about the ability to
change direction gracefully. NoSQL definitely gets you started quickly (as OP
noted, you avoid much of the overhead of data modeling and schema changes),
but at a certain point you can get into real trouble, and that will slam the
brakes on your velocity. I've experienced the same thing with weakly typed
languages. These sorts of tools can be very useful in the right use cases, but
trading away some of the guardrails can exact a large cost if you aren't
thoughtful and careful.

~~~
ChicagoDave
If I hadn't been on large corporate projects that were wildly successful
within extremely shortened periods of design and development, I wouldn't make
my arguments.

Adapting to the needs of a given domain may require a relational data store,
but in my experience those are rare occasions.

