
DynamoDB cannot store empty strings - idbehold
https://forums.aws.amazon.com/thread.jspa?threadID=90137
======
KirinDave
I'm surprised this is news in 2016. Anyone who uses DynamoDB hits this
immediately and suffers a bit from it.

Would you like more insider advice on using DynamoDB? My startup Level Money
used it as a primary data store and has kept it throughout our lifespan. We're
migrating away from it now, but I wouldn't necessarily discourage people from
using it.

 _NEVER SCAN ONE TABLE AND UPDATE ROWS AS YOU ENCOUNTER THEM, OR USE NATIVE
KEY ORDERING TO TRIGGER UPDATES ANYWHERE ELSE!_

Secretly DynamoDB is just a bunch of SQL databases with floating masters, or
so we surmise. If you iterate across things in native order without
randomization and at a very high speed then you will overload individual
shards. You can end up at 10X write provisioning and still get rate limit
responses. Randomizing the traversal of the keyspace fixes this.

Which is yeah, really really bad for sufficiently large datasets. You have to
get creative to randomize it sufficiently in some cases.
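
For concreteness, here's a minimal sketch (Python; the boto3 usage in the comments is a hypothetical example, with a made-up table name and handler) of randomizing the traversal instead of walking the keyspace in native order:

```python
import random

def randomized_segments(total_segments, seed=None):
    """Yield parallel-scan segment numbers in random order, so the scan
    doesn't hammer the underlying partitions in native key order."""
    segments = list(range(total_segments))
    random.Random(seed).shuffle(segments)
    return segments

# Hypothetical usage with boto3's segmented scan:
#
# table = boto3.resource("dynamodb").Table("my-table")
# for seg in randomized_segments(64):
#     page = table.scan(Segment=seg, TotalSegments=64)
#     for item in page["Items"]:
#         handle(item)  # and never write back in key order either
```

This is a coarse randomization; for very hot tables you may need to interleave work across segments too, as the comment about "sufficiently large datasets" suggests.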

Still, it's quite nice to have something like DynamoDB handling your scaling
early on. It's a surprisingly useful design for a database and keeps you from
over-relying on relational properties which eventually don't scale. It also
forces you to develop a story for cross-table transactional queries and their
failures quite early in your platform's life cycle. Forcing that discipline is
almost always healthy.

If you're not careful though, it becomes frightfully expensive. Before we
understood why we were rate limiting we panicked and ended up with >$12k/mo in
DDB costs. Not really a sustainable cost for a very small company.

~~~
throwaway2016a
> Secretly DynamoDB is just a bunch of SQL databases with floating masters, or
> so we surmise.

SQL is a querying mechanism, not an underlying storage protocol. Converting
from DynamoDB syntax to SQL would be a bizarre choice for them.

If you look at the limitations of Dynamo it becomes fairly clear what they do.
Nearest I can figure it is close to this:

Each hash key resolves to a number of possible servers the data can be on.
Data is replicated across several of these servers. For redundancy. The hash
key determines which shard to use.

On individual machines, each set of data is stored by a compound key of hash
key and sort key (if there is a sort key). The data is probably stored on disk
sequentially by sort key or close to it. They possibly use something like
LevelDB for this.[1]

If you have Global Secondary Indexes, each one is literally treated as a
separate table that is replicated to automatically. This is what users were
doing anyway, so Amazon just made it easy.

For those of you who do not know Dynamo well: there are three basic read
operations, GetItem, Query, and Scan. Scans CANNOT be sorted, which is an
important implementation detail.

A query can hit all the records for a single hash key very quickly because all
records with that hash key exist on the same hardware. And they can be sorted
because, as mentioned earlier, they are stored on disk in sort-key order. That
is why you cannot query on more than one hash key at once, and why you can
query for exact sort keys, greater than, less than, or between, but not for
non-sequential values: for performance, Dynamo will only return sequential
records from a query.
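
A toy model of that restriction (not Dynamo's actual code, just an illustration): within one hash key, records sit in sorted order, so eq / less-than / greater-than / between are all just contiguous slices.

```python
import bisect

# Toy partition: all records for one hash key, kept in sort-key order
# (the keys and values here are made up purely for illustration).
partition = sorted([("2016-01-03", "c"), ("2016-01-01", "a"), ("2016-01-02", "b")])

def query_between(items, lo, hi):
    """Return the contiguous run of records whose sort key falls in
    [lo, hi]. Only sequential ranges are possible, mirroring the
    eq / lt / gt / between restriction described above."""
    keys = [sk for sk, _ in items]
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return items[i:j]
```

Any non-sequential predicate (say, "keys A and C but not B") would force the engine to skip around the sorted run, which is exactly what the API refuses to do.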

Scans, nearest I can tell, are map reduce across all shards that your table
might exist on.

In conclusion: DynamoDB is a key-value store with compound keys where the
records are stored in order by key.

[1] In fact, I would not be surprised if Dynamo is mapped on top of LevelDB.

~~~
9864457d701a443
I used to work on the DynamoDB team. Throwaway account because my normal
account can be tied back to my real name.

> Each hash key resolves to a number of possible servers the data can be on.
> Data is replicated across several of these servers. For redundancy. The hash
> key determines which shard to use.

> On individual machines, each set of data is stored by a compound key of hash
> key and sort key (if there is a sort key). The data is probably stored on
> disk sequentially by sort key or close to it.

This is pretty much exactly correct. The hash key maps to a quorum group of 3
servers (the key is hashed, with each quorum group owning a range of the hash
space). One of those 3 is the master and coordinates writes as well as
answering strongly consistent queries; eventually consistent queries can be
answered by any of the 3 replicas.
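
A back-of-the-envelope sketch of that hash-to-group mapping (the group count and hash function are made up for illustration; DynamoDB's internals are not public):

```python
import hashlib

NUM_GROUPS = 8  # made-up figure; the real partition count is internal to AWS

def quorum_group(hash_key):
    """Map a hash key onto one of NUM_GROUPS equal-width ranges of a
    128-bit hash space (MD5 here is just a stand-in hash)."""
    h = int(hashlib.md5(hash_key.encode()).hexdigest(), 16)
    return (h * NUM_GROUPS) >> 128  # which range of the hash space owns h

# Every record with a given hash key lands in the same group, which is
# why a Query never has to touch more than one quorum group.
```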

> They possibly use something like LevelDB for this.[1]

Sigh...if only. I don't remember the exact timeline but LevelDB either didn't
exist when we started development or wasn't stable enough to be on our radar.

DynamoDB is this very elegant system of highly-available replication,
millisecond latencies, Paxos distributed state machines, etc. Then at the very
bottom of it all there's a big pile of MySQL instances. Plus some ugly JNI/C++
code allowing the Java parts to come in through a side door of the MySQL
interface, bypassing most of the query optimizer (since none of the queries
are complex) and hitting InnoDB almost directly.

There was a push to implement WiredTiger as an alternative storage engine, and
migrate data transparently over time as it proved to be more performant.
However, 10gen bought WiredTiger and their incentive to work with us vanished,
as MongoDB was and is one of Dynamo's biggest competitors.

~~~
KirinDave
> Then at the very bottom of it all there's a big pile of MySQL instances.
> Plus some ugly JNI/C++ code allowing the Java parts to come in through a
> side door of the MySQL interface, bypassing most of the query optimizer
> (since none of the queries are complex) and hitting InnoDB almost directly.

Ah! I knew it! It's an orchestration layer on top of a MySQL layer with
floating masters.

I remember back when DDB was first launched we all sat around at Powerset and
tried to figure out how Amazon did it given the two pieces of information we
were given: "it is not based off the dynamo paper" and "it is based off both
open source and proprietary code". We figured it had to be MySQL or Postgres
that they were referring to.

I didn't know you were completely bypassing the query layer though.

------
jknoepfler
What is exceptionally irritating about this is that there is no valid reason
for not being able to represent an empty string, and the lack of support makes
the service inelegant and harder to use.

1\. Empty strings are straightforward to represent.

2\. Not allowing them violates the principle of least surprise, which makes
the service harder to use.

3\. Not supporting them adds needless complexity to client applications.

4\. An empty string has obvious meaning and a place in everyday applications.

~~~
ScottBurson
It just blows my mind that someone building a _database_ in the 21st century
doesn't know that _zero is not a special case_. In my book, anyone who doesn't
get that is a beginning programmer, not even intermediate.

I know it took people a while to get this originally. In the old days, for
example, Fortran 'do' loops were always executed at least once, regardless of
the values of the bounds. Nowadays we know better: you check the values at the
top of the loop, not (just) the bottom, to handle the zero-iteration case
correctly.

But now it should be burned into every developer's brain: counts can be zero,
strings can be empty, lists can be empty, etc. etc. etc., and handling these
cases correctly is critical. Zero is not a special case!!!

------
brongondwana
Damn, do we have to teach you kids everything? Just prefix your values with an
X.

[http://stackoverflow.com/questions/6852612/bash-test-for-emp...](http://stackoverflow.com/questions/6852612/bash-test-for-empty-string-with-x)

------
rjdavis3
What is the point of this trending? To try to get AWS to fix it? I must admit
this was a gotcha for me at first, but it is documented right on their standard
Limits page. I'm guessing the fix is non-trivial or they probably would have
just done it.

[http://docs.aws.amazon.com/amazondynamodb/latest/developergu...](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html)

~~~
idbehold
This limitation should be in a giant alert at the top of all DynamoDB
documentation. They are being misleading in their documentation:

> A map is similar to a JSON object. There are no restrictions on the data
> types that can be stored in a map element, and the elements in a map do not
> have to be of the same type.

> Maps are ideal for storing JSON documents in DynamoDB. The following example
> shows a map that contains a string, a number, and a nested list that
> contains another map.

~~~
rjdavis3
I agree they should fix it but I think they have been transparent about the
restriction. It is already listed within most of their documentation.

[http://docs.aws.amazon.com/amazondynamodb/latest/APIReferenc...](http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_PutItem.html)

It is also listed as a Gotcha on the AWS Open Guide which trended on HN a few
weeks ago.

[https://github.com/open-guides/og-aws#dynamodb](https://github.com/open-guides/og-aws#dynamodb)

~~~
idbehold
Are you agreeing that they should fix the issue or their documentation
suggesting that "Maps are ideal for storing JSON documents in DynamoDB"? Or
both?

~~~
rjdavis3
I would love for them to fix storage of empty string but I don't think that is
going to happen given all the documentation they have around not allowing it.
Must have been easier for them to just document the limitation.

I do agree that they could do a better job in their documentation explaining
the difference between the JSON spec and the JSON equivalent that they store.
It is not made clear that the JSON you post is not stored in exactly the same
form. They could be clearer about the types they append to your data (L, S,
SS, N) and then reference the link I put above about the restrictions for each
type.

------
Just1689
I've bumped into this. It's a serious problem.

If you think you're going to migrate an existing system to DynamoDB, then this
is one of the things you need to plan for.

The workarounds for this are rather unappealing, perhaps heretical.

------
spullara
Oracle cannot store empty strings (they are treated as NULL).

~~~
BinaryIdiot
As far as I can tell Oracle _can_ store empty strings they just happen to be
treated exactly like NULL. That's at least marginally better. Still, it's hard
to imagine people being willing to use Oracle nowadays.

~~~
kpil
Why? Because you think SQL is not good, or because you think there are better
relational databases?

I'd say that Oracle delivers quite a lot of value for your money for many
applications.

I have worked with all other relevant relational databases and also with a few
nosql databases, in distributed systems and monoliths.

It's hard to make a recommendation that can be applied generically. The best
answer is "It depends, and it's complicated"

One thing is sure: NoSQL is not the future (quite the opposite), although it
is sometimes, in rather special circumstances, the right solution. Graph
databases are perhaps more interesting, and perhaps event sourcing, but the
promises are way larger than the deliveries.

Edit. I forgot the obvious reason. You try to stay clear of the Dark Side.

~~~
tokenizerrr
Know any graph databases that you like? I recently tried arangodb and it
consumes a crazy amount of memory for even a relatively small dataset (several
million ~20 character strings)

~~~
vidarh
Try Postgres. I'm serious. For a dataset that should easily fit in memory,
doing graph queries with recursive CTEs in Postgres can work very well.

------
larrik
I used DynamoDB for an inherited project around the time I had gotten really
good with MongoDB. I don't have any idea why anyone would ever use DynamoDB.
It's like Mongo, without any of the good parts, that you can't run locally or
on any non-Amazon server. I don't see the benefit.

Edit: Specifically, I was using some pretty advanced aggregation stuff. At the
time I used it, DynamoDB had zero aggregation whatsoever.

~~~
jonthepirate
I also have done 3 years of MongoDB before I (unwillingly) wrote an app with
it... worst development experience I've ever had. I wouldn't have used it were
it not for my corporate overlords at the time.

My new rule of thumb: Use plain databases for most things and only use
DynamoDB if there's a precise feature in your system that it is well suited
for.

DynamoDB is a horrible general use database.

~~~
wbkang
You had a horrible experience with MongoDB but you are drawing a conclusion
about DynamoDB? Am I missing something?

~~~
OJFord
I think your parent commenter meant "3y experience using Mongo before having
to use it [- DDB, subject of thread -] ..."

------
jaysoncena
In my previous team, we instead use the string "null" to represent NULL. It's
pretty dumb but we haven't thought of any other way.

What I like about DynamoDB is that you can scale the table easily. You don't
need a dedicated DBA to maintain your database.

For what I don't like

1\. You pay for the IOPS you allocated

2\. You can't share free IOPS with other tables

3\. Getting a snapshot of the whole DB is impossible; DB backups are not
transactional

4\. You have to use EMR or Data Pipeline for backups

5\. If you reach your IOPS limit, you need to retry your writes/updates
instead of getting delayed acks. Others use libraries that limit writes, but
that is per server and doesn't account for the free writes on other servers,
since the accounting is at the client level

~~~
jaclaz
>In my previous team, we instead use the string "null" to represent NULL. Its
pretty dumb but we haven't thought of any other way.

Yep, it is a classic, BUT (just in case):
[https://news.ycombinator.com/item?id=12426315](https://news.ycombinator.com/item?id=12426315)

------
pinkskip
Before they announce tons of new stuff to please the shareholders, there are
much-awaited enhancements on existing products which are long overdue. Case in
point: this thread. Keep the developers happy. Just saying.

------
coltonv
I dealt with this. It's absurd. So many smart coding conventions are
completely broken by this limitation.

Don't use Dynamo unless you for some reason have to. It's half-assed and not
getting many updates.

~~~
cookiecaper
_DON'T_ use something provided by AWS?! Why, that's heresy! ;)

------
ars
This is (was) very common in database systems. Many don't distinguish empty
string from NULL. Example:
[http://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com...](http://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.conn.oracon.usage.doc/topics/suppt_null_empty_whitespace.html)

As a workaround, if the database supports unicode, then store a "zero width
space" for a blank string.

------
IanDrake
"As a broader question, do you have a specific use case for having an empty
string in an AttributeValue?"

Can anyone explain to me why this _isn't_ a bizarre question? They can't
imagine a use-case for empty strings on their own?

~~~
tyingq
I would quote them back their own marketing: _"Using the AWS SDK you can
write applications that store JSON documents directly into Amazon DynamoDB
tables"_

------
optimuspaul
Why is this news?

It might be annoying but it's super easy to work around. It makes sense to not
store empty strings, they provide no value and it would only take up space.

~~~
mikeash
How do empty strings provide no value? It's a valid value for strings, one
which many programs encounter regularly, and it's reasonable to want to store
them. Refusing to store empty strings makes as much sense as refusing to store
zeroes in a numeric field, because zero is "nothing."

------
robbiet480
Just hit this problem again this morning. Quite annoying. I figured out which
string was empty eventually, but I've solved this on other projects by just
removing all keys with empty values to get around possible errors.

~~~
idbehold
The problem we have is that we're trying to store JSON provided by the user.
Meaning that we need to persist empty strings provided by them. The only way I
can see around this issue is to replace all empty strings with a special
string token on the way in, and replace all those special string tokens on the
way out.
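
A minimal sketch of that sentinel approach (the token itself is a made-up example; anything you don't expect in real user data works, and a long random string such as a UUID makes collisions unlikely):

```python
# Hypothetical sentinel; pick something unlikely to appear in user data.
EMPTY = "__EMPTY_STRING_9f1c2d__"

def encode(value):
    """Recursively swap empty strings for the sentinel before writing."""
    if value == "":
        return EMPTY
    if isinstance(value, dict):
        return {k: encode(v) for k, v in value.items()}
    if isinstance(value, list):
        return [encode(v) for v in value]
    return value

def decode(value):
    """Reverse the substitution on the way out."""
    if value == EMPTY:
        return ""
    if isinstance(value, dict):
        return {k: decode(v) for k, v in value.items()}
    if isinstance(value, list):
        return [decode(v) for v in value]
    return value
```

The obvious caveat is a user who actually sends the sentinel itself, which this sketch would silently turn into an empty string on the way back out.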

~~~
richardwhiuk
Append all strings with a zero width space, and remove it on the way out.

~~~
networked
What if the user has a string ending with a meaningful zero width space
already stored? For example, the string could be checksummed somewhere. It
would corrupt their data.

If you want a kludge for this, it's better to generate a longish random string
(e.g., a UUID) to indicate an empty value.

~~~
eriknstr
When you get a string from the client you prepend a single zero width space.
When you send a string back you strip the single leading zero width space you
added. The client will always get back the exact same data that they sent
originally.
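
Sketched out, the round trip is tiny (this is the escaping scheme described above, not anything DynamoDB itself provides):

```python
ZWSP = "\u200b"  # zero width space

def escape(s):
    # Prefix *every* string, so user data that already starts or ends
    # with a zero width space still round-trips unchanged.
    return ZWSP + s

def unescape(s):
    if not s.startswith(ZWSP):
        raise ValueError("value was stored without escaping")
    return s[1:]
```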

~~~
networked
You're right, of course. Sorry, I wasn't clear. I meant that a user might have
stored a string with a zero width space at the end by the time you introduce
this escaping mechanism. (I've already edited the comment to indicate this.)
The same goes double if you append a common printable character. You'd have to
rely on some additional indicator, such as the date and time the record with
the string was stored, to know whether to unescape a string and also be sure
nothing changed those date and time without escaping the data.

~~~
eriknstr
Oh, yeah, in that case I'd either if-case it by time stamp or I'd prepend a
zero width space to all historical data as well. I would prefer doing the
latter and would only do the former if there was some reason I couldn't do the
latter, for example if I had too much historical data to be able to process it
(though I have a hard time imagining that happening for something so trivial
as prepending a zero width space, unlike say converting thousands of hours of
video which might actually be too time-consuming or computationally
expensive).

One issue that might arise with altering historical data that I can imagine
would be if it was ever necessary to restore from backup and your backup was
made before you later added the zero width space, and then you forget to add
the zero width space again when you restore from backup a few months down the
road. But with proper documentation and procedures that shouldn't happen.

------
MBCook
Can someone give me some examples of cases where there is a strong semantic
difference that you'd want to represent with a blank string vs a null?

I'm interested in abstract concepts, not "I'm trying to port from
Oracle/Mongo/whatever and this is how it works". Assume you're designing
against some generic record store.

I'm having a ton of trouble thinking of a case where that's a big problem
(again, ignoring trying to save time in porting) and I was wondering if
someone could provide me a few examples.

~~~
techiferous
It's the difference between "this value is unknown" and "this value is known
and empty". It's similar to why JavaScript has both null and undefined.

Here's an example. Let's say a user is taking a survey and the last question
is: "Please share any additional thoughts if you have them." A null could mean
the user did not answer this question, probably because they haven't gotten to
it yet. An empty string could mean the user was asked this question and
actually didn't have anything to say.

~~~
MBCook
That's basically the only example I could think of: tracking if someone filled
out a form field.

But in that case how can you tell the difference? I mean if they never clicked
in the field or tabbed into it and submitted the form is that meaningfully
different from they clicked into it and didn't type anything?

I can think of ways to work around it (keep a boolean for 'was filled out', a
number that means 'they got up to field 17', etc.) but I'm still not sure you
can draw a conclusion of 'they skipped it' vs 'they didn't enter anything'.

~~~
techiferous
You're assuming all the questions are being presented to the user at once.
Imagine a UI that presents questions to a user one-by-one. If the user bails
out of the survey early, the rest of the survey answers can be null.

~~~
MBCook
In that case you can easily keep track of what step they got up to in the
wizard. It's a different way to store the data, but would work just as well.

I think the null vs empty argument works better when there are 40 boxes on
screen at once. But again I'm not sure what semantic difference you'd find
between 'they left it empty' and 'they never clicked into it'.

~~~
ceejayoz
Keeping track of the step works until you move a field onto a different step
in the wizard.

~~~
MBCook
First, it's a document store so you can store the way the form is constructed
with it (i.e. order of elements, etc).

Second, in Dynamo you could choose the absence of a key to represent "never
got here" and a key with a null value to represent "they left this blank". You
don't need an empty string.

~~~
Dylan16807
So using null to mean the opposite of what it normally means? I would
recommend against anyone using that workaround.

~~~
MBCook
That depends on whether you see null as a marker of not having data or a
marker of not knowing if there is data.

~~~
Dylan16807
Both of those are data unknown, and both would normally be null. You don't
distinguish between known unknowns and unknown unknowns at the database level.

When you say "The user submitted an empty form field as their answer", you _do
have_ data. It's weird to use null for that.

Since you can leave a key absent in DynamoDB, it basically has two ways to say
"data unknown" and no way to say "knowingly left blank".

You can repurpose null as you suggested, and it will work fine in isolation,
but it will massively violate the principle of least surprise and could lead
to painful bugs.

------
dynamothrowaway
It also doesn't store empty lists, sets, maps, etc., not just at the top
level, but nested inside Dynamo maps. The particular SDK I was using silently
removes keys with empty (whatever) values from the map. Dynamo maps in general
were full of gotchas, and we ended up just serializing as JSON, since the only
gotcha there is integer keys becoming strings.
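
That one gotcha is easy to demonstrate with the standard library: JSON object keys must be strings, so integer keys are silently stringified, while empty containers survive the trip just fine:

```python
import json

doc = {1: "one", "lists": [], "maps": {}}

# Serialize and deserialize, as you would before/after storing the blob.
roundtripped = json.loads(json.dumps(doc))

# Empty containers come back intact, but the integer key 1 comes back
# as the string "1".
```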

Also dynamo's capacity provisioning can be scaled up any time (although
scaling up is slow-ish), but only down once per day, which means that unlike
the promise of ec2 you end up having to essentially pay for your peak load at
all times.

If I had it to do over again I'd have stuck with postgres, the only reason we
used dynamo in the first place was to throw our boss a bone to soften our flat
refusal to use Amazon Simple Workflow, which is an even bigger quagmire.

~~~
zwily
You can scale down a table 4 times per day.

Totally agree on SWF. :)

------
taormina
I've known about this "problem" for ages, but has it really been 4.5 years?
Wow.

------
hoodoof
I tried using DynamoDB. It felt really unintuitive and like I had to bend my
thought process to fit it.

I misconfigured some parameters and ended up with a big bill for a database
that was doing pretty much nothing.

I don't see the upside of DynamoDB.

~~~
balls187
> I don't see the upside of DynamoDB.

A wide-column NoSQL data store where you pay for provisioned throughput and
for storage over 25 GB.

All maintenance is taken care of by Amazon: no sharding, scaling, patching, or
tuning required.

If you need a non-relational data store for use inside an AWS deployment, you
would be hard-pressed to find a better alternative.

~~~
eropple
_> If you need a non relational data store for use inside an AWS deployment,
you would be hardpressed to find a better alternative._

RDS Postgres supports HStore and JSON types.

~~~
balls187
To utilize it in a production environment, you would need to:

1\. have at minimum 2 instances

2\. schedule backups

3\. create and manage your db schema

4\. manage users/roles

Admittedly RDS makes some of those tasks easy, but DynamoDB makes it even
easier:

1\. Create the table, provision read/write throughput

2\. Create an IAM role.

~~~
hoodoof
You don't do backups?

~~~
balls187
Good point.

You need to back up data from DynamoDB, but as a measure to protect against
user-caused data loss, not data loss from AWS infrastructure failure.

------
idbehold
Same issue, different thread: [https://github.com/aws/aws-sdk-js/issues/833](https://github.com/aws/aws-sdk-js/issues/833)

------
user5994461
How about BigTable and Cassandra?

~~~
rrdharan
Google Cloud Bigtable will let you store empty strings as values. I'm not
quite as sure about whether you can successfully set a 0-byte value and
distinguish it from an empty string ['\0'] value; the answer may depend on the
client library/language/API that you use to interact with the service.

[Source/Disclaimer: I work on Google Cloud Bigtable.]

------
PerfectElement
I recommend people look into Azure's DocumentDB. I've used Dynamo before and
even though it was fast and reliable, the development experience was a pain.

I'm slowly migrating a self-hosted MongoDB database to DocumentDB and things
are going pretty well. The only thing missing for me is aggregation and
transparent encryption, but they are working on those.

------
jnordwick
Nor can it save empty sets, it appears.

------
amelius
Simple workaround: prefix every string with a fixed character. Advantage: it
will leave the lexicographical sort order unchanged.
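
A quick sketch of why a constant prefix preserves lexicographic order (the prefix character is arbitrary):

```python
PREFIX = "X"  # any fixed character works

def store(s):
    return PREFIX + s

def load(s):
    return s[len(PREFIX):]

values = ["", "apple", "banana"]
stored = [store(v) for v in values]

# Because every stored value shares the same first character, comparison
# falls through to the original strings, so relative order is unchanged.
assert sorted(stored) == [store(v) for v in sorted(values)]
```

The empty string becomes the bare prefix, which DynamoDB will happily store, and it still sorts before every other value.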

