An example: if you provision X capacity units and your table has N partitions, each partition only gets X/N of that capacity. AWS is mostly transparent (via documentation) about when a table splits into a new partition (I say "mostly" because I was told by AWS that the numbers in the docs aren't quite right), but you'll have no idea how your keys are hashed across partitions, and you won't know if you have a hot partition that's getting slammed because your keys aren't well-distributed. Well, no, I take that back -- you will know, because you'll get throttled even though you're nowhere near the capacity you've provisioned. You just won't know how to fix it without a lot of trial and error (which isn't great in a production system). If your read/write usage isn't consistent, you'll either have periods of throttling, or you'll have to over-provision and waste money. There's no auto-scaling like there is with EC2.
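For what it's worth, the usual workaround for the hot-partition problem is write sharding: tack a random suffix onto the hot partition key so its writes spread across several partitions, and fan reads out across the suffixes. A rough sketch, assuming boto3 and made-up table/key names:

    import random

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("events")  # hypothetical table
    NUM_SHARDS = 10  # tune to how hot the key actually is

    def put_event(customer_id, event):
        # Random suffix so one busy customer_id maps to several partitions
        shard = random.randint(0, NUM_SHARDS - 1)
        table.put_item(Item={"pk": f"{customer_id}#{shard}",
                             "sk": event["timestamp"],
                             **event})

    def query_events(customer_id):
        # The cost: reads now fan out across every shard and merge results
        items = []
        for shard in range(NUM_SHARDS):
            resp = table.query(
                KeyConditionExpression=Key("pk").eq(f"{customer_id}#{shard}"))
            items.extend(resp["Items"])
        return items

It trades read simplicity for write distribution, so it only pays off on keys you know are hot.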
Not trying to knock the product: still using DDB... but getting it to a point where I felt reasonably confident about it took way longer than managing my own data store would have... and then they had a 6-hour outage one early, early Sunday morning a couple of months ago. Possible solution: dual-writes to two AWS regions at once, plus the ability to auto-pivot reads to the backup region. Which of course doubles provisioning costs.
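In case it's useful to anyone, the dual-write/failover idea is roughly this (a minimal sketch, assuming boto3 and placeholder region/table names; real code would need retries and a story for the copies diverging):

    import boto3
    from botocore.exceptions import ClientError

    primary = boto3.resource("dynamodb", region_name="us-east-1").Table("mytable")
    backup = boto3.resource("dynamodb", region_name="us-west-2").Table("mytable")

    def put_item(item):
        # Write to both regions; in practice you'd queue and retry whichever
        # write fails instead of letting the copies silently drift apart.
        primary.put_item(Item=item)
        backup.put_item(Item=item)

    def get_item(key):
        # Read from the primary region, pivot to the backup if it errors out.
        try:
            return primary.get_item(Key=key).get("Item")
        except ClientError:
            return backup.get_item(Key=key).get("Item")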
Ok, maybe I'm knocking it a little. It's a good product, but there are definitely tradeoffs.
Agreed that the DDB documentation could be better on best practices. I found this deep-dive YouTube video very helpful: https://www.youtube.com/watch?v=VuKu23oZp9Q
That's actually another instance of the lack of transparency and the trial-and-error it forces: "oh hey, writes are getting throttled... no idea why... let's drop this index and see if it helps".
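For anyone hitting the same thing: the usual culprit is that each GSI has its own provisioned throughput, and an under-provisioned index will throttle writes to the base table. Bumping the index's write capacity is a cheaper experiment than dropping it -- something like this sketch (table/index names and numbers are just placeholders):

    import boto3

    boto3.client("dynamodb").update_table(
        TableName="mytable",
        GlobalSecondaryIndexUpdates=[{
            "Update": {
                "IndexName": "status-index",
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 50,
                    "WriteCapacityUnits": 200,  # roughly match the base table's write rate
                },
            },
        }],
    )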
And in the case of both the Redis and Memcached backends, if the maintenance requires restarting redis/memcached or rebooting the instance, you lose everything in the cache (or at least everything since your last backup). For this particular project, that kind of data loss and downtime would easily cause a real outage for customers, and was unacceptable.
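If you do end up on the Redis option, it's worth checking how much you'd actually lose on a hard restart -- RDB snapshots only persist up to the last save. A quick sketch (assuming redis-py and a local instance, purely illustrative):

    import time

    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    save_policy = r.config_get("save")  # e.g. {'save': '900 1 300 10 60 10000'}
    last_save = r.lastsave()            # time of the last RDB snapshot
    exposure = time.time() - last_save.timestamp()

    print("snapshot policy:", save_policy["save"])
    print(f"writes from the last {exposure:.0f}s would be lost on a hard restart")

Memcached, of course, has no persistence at all, so there the answer is simply "everything".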