Hacker News new | past | comments | ask | show | jobs | submit login
DynamoDB cannot store empty strings (amazon.com)
188 points by idbehold on Dec 13, 2016 | hide | past | favorite | 172 comments

I'm surprised this is news in 2016. Anyone who uses DynamoDB hits this immediately and suffers a bit from it.

Would you like more insider advice on using DynamoDB? My startup Level Money used it as a primary data store and has kept it throughout our lifespan. We're migrating away from it now, but I wouldn't necessarily discourage people from using it.


Secretly DynamoDB is just a bunch of SQL databases with floating masters, or so we surmise. If you iterate across things in native order without randomization and at a very high speed then you will overload individual shards. You can end up at 10X write provisioning and still get rate limit responses. Randomizing the traversal of the keyspace fixes this.

Which is yeah, really really bad for sufficiently large datasets. You have to get creative to randomize it sufficiently in some cases.

Still, it's quite nice to have something like DynamoDB handling your scaling early on. It's a surprisingly useful design for a database and keeps you rom over-relying on relational properties which eventually don't scale. It also forces you to develop a story for cross-table transactional queries and their failures quite early in your platform's life cycle. Forcing that discipline is almost always healthy.

If you're not careful though, it becomes frightfully expensive. Before we understood why we were rate limiting we panicked and ended up with >$12k/mo in DDB costs. Not really a sustainable cost for a very small company.

> Secretly DynamoDB is just a bunch of SQL databases with floating masters, or so we surmise. If you iterate across things in native order without randomization and at a very high speed then you will overload individual shards. You can end up at 10X write provisioning and still get rate limit responses. Randomizing the traversal of the keyspace fixes this.

It's not based on SQL, but the fact that the table is sharded and has the throughput characteristics you describe is well documented (i.e. not at all secret :p) http://docs.aws.amazon.com/amazondynamodb/latest/developergu...

Yes, it's documented that, to quote the page you linked, "a single partition can support a maximum of 3,000 read capacity units or 1,000 write capacity units."

The documentation also explains that scanning a table runs the risk of saturating the capacity of a partition (1):

> As a table or index grows, the Scan operation slows. The Scan operation examines every item for the requested values, and can use up the provisioned throughput for a large table or index in a single operation. (...)

> The larger the table or index being scanned, the more time the Scan will take to complete. In addition, a sequential Scan might not always be able to fully utilize the provisioned read throughput capacity: Even though DynamoDB distributes a large table's data across multiple physical partitions, a Scan operation can only read one partition at a time. For this reason, the throughput of a Scan is constrained by the maximum throughput of a single partition. (emphasis mine)

Whenever I take an important dependency on a product, I make it a habit to read or skim virtually all of the product's documentation from beginning to end. Documentation for complex technologies is something to study. It's served me very well and I'd recommend the practice to others. With this approach you'll find that you just "know" (or can quickly look up) things that tend to surprise other people. Even if not all of the knowledge is in your working memory, you'll have a vague recollection of reading "something about that" and will be able to come back quickly to what you read.

(1) http://docs.aws.amazon.com/amazondynamodb/latest/developergu...

That's fantastic, but let's just take a look at the documentation a few years ago when I discovered this:


Huh, worded slightly differently, isn't it? They DO allude to this perhaps being the case in https://web.archive.org/web/20121221003912/http://docs.aws.a... but it's not made clear at any point that scans will yield up keys neatly bundled by shard.

It is not the case even now that the scan operation is using your capacity to cause this, it's because of the way shards enumerate their keys and that is not done simultaneously and mixed for you before sending it. You can redirect write traffic to another table and still often exhibit a rate limit effect even though the scan isn't consuming writes for that table.

That, I think, is still quite surprising.

You can take for granted how great the docs are now, although I still submit that this aspect of the system is quite poorly documented. AWS in general is fantastic at conveying API endpoints and very poor at offering a new developer a narrative on how to use the product.

The reason that the docs are as good as they are now: people like me have been around yelling at Amazon for years to improve their documentation, and telling tech reps to better document things. I hope you in your capacity do the same. And I will continue to offer insights like this on forums like this precisely because there are lots of relatively new platform engineers here. It's one of the few things I _like_ doing on hacker news.

Pyxl101, I find it surprising that your comment was downvoted. Typically people downvote for one of 2 reasons, either they disagree with your tone or they disagree with your facts, but in your case your tone seems reasonable and your facts seem accurate. It does seem unreasonable that people downvoted you.

Is "rtfm" really such a valuable sentiment in reply to my post?

Personally, I find it to often be a complete non-answer wrapped up with a dismissive and insulting attitude.

I think you're reading dismissive tone into responses where none exists. I was merely pointing out that your inferences around sharding are actually (now) a publicly well documented fact. My intent was to elevate your statement from "educated guess by some random internet person" to "actually, this is definitely true". I would interpret Pyxl as chiming with a more broad "hey everybody, here's a good example of why it's helpful to read the docs, let's all learn from it", rather than "KirinDave, you are a bad person who didn't RTFM, and you should feel bad". It's hard to convey tone on the internet; life is much better when you interpret others' words charitably.

My experience with Amazon is that the documentation is always bad and unhelpful. It's not a bad idea to read them, but it's often misleading to believe they are the whole story.

This is in sharp contrast to the GCE documentation (which is exhaustive although often out of date) or the Azure documentation (imho best overall for technical use but the UI keeps fizzing around making that part of it useless).

> Secretly DynamoDB is just a bunch of SQL databases with floating masters, or so we surmise.

SQL is a querying mechanism not an underlying storage protocol. Converting from DynamoDB syntax to SQL would be a bazaar choice for them.

If you look at the limitations of Dynamo it becomes fairly clear what they do. Nearest I can figure it is close to this:

Each hash key resolves to a number of possible servers the data can be on. Data is replicated across several of these servers. For redundancy. The hash key determines which shard to use.

On individual machines, each set of data is stored by a compound key of hash key and sort key (if there is a sort key). The data is probably stored on disk sequentially by sort key or close to it. They possibly use something like LevelDB for this.[1]

If you have Global Secondary indexes it is literally treated as a separate table that is replicated to automatically. This is why users were doing anyway so Amazon just made it easy.

For those of you who do not know dynamo well. There are three basic read operations GetItem, Query, and Scan. Scans CANNOT be sorted, this is an important implementation detail.

A query can hit all the records for a single hash key very quickly because all records with the hash key exist on the same hardware. And they can be sorted because as mentioned earlier they are sorted by key on storage. Which is why you cannot query on more than one hash key at once. And why you can query for exact sort keys, greater than, less than, or between but not non-sequential. Because for performance Dynamo will only return sequential records from a query.

Scans, nearest I can tell, are map reduce across all shards that your table might exist on.

In conclusion. DynamoDB is a key value store with compound keys where the records are stored in order by key.

[1] In fact, I would not be surprised if Dynamo is mapped on top of LevelDB.

I used to work on the DynamoDB team. Throwaway account because my normal account can be tied back to my real name.

> Each hash key resolves to a number of possible servers the data can be on. Data is replicated across several of these servers. For redundancy. The hash key determines which shard to use.

> On individual machines, each set of data is stored by a compound key of hash key and sort key (if there is a sort key). The data is probably stored on disk sequentially by sort key or close to it.

This is pretty much exactly correct. The hash key maps to a quorum group of 3 servers (the key is hashed, with each quorum group owning a range of the hash space). One of those 3 is the master and coordinates writes as well as answering strongly consistent queries; eventually consistent queries can be answered by any of the 3 replicas.

> They possibly use something like LevelDB for this.[1]

Sigh...if only. I don't remember the exact timeline but LevelDB either didn't exist when we started development or wasn't stable enough to be on our radar.

DynamoDB is this very elegant system of highly-available replication, millisecond latencies, Paxos distributed state machines, etc. Then at the very bottom of it all there's a big pile of MySQL instances. Plus some ugly JNI/C++ code allowing the Java parts to come in through a side door of the MySQL interface, bypassing most of the query optimizer (since none of the queries are complex) and hitting InnoDB almost directly.

There was a push to implement WiredTiger as an alternative storage engine, and migrate data transparently over time as it proved to be more performant. However, 10gen bought WiredTiger and their incentive to work with us vanished, as MongoDB was and is one of Dynamo's biggest competitors.

> Then at the very bottom of it all there's a big pile of MySQL instances. Plus some ugly JNI/C++ code allowing the Java parts to come in through a side door of the MySQL interface, bypassing most of the query optimizer (since none of the queries are complex) and hitting InnoDB almost directly.

Ah! I knew it! It's an orchestration layer on top of a mysql layer with floating masters.

I remember back when DDB was first launched we all sat around at Powerset and tried to figure out how Amazon did it given the two pieces of information we were given: "it is not based off the dynamo paper" and "it is based off both open source and proprietary code". We figured it had to be MySQL or Postgres that they were referring to.

I didn't know you were completely bypassing the query layer though.

> SQL is a querying mechanism not an underlying storage protocol

My phone corrected "MySQL" to SQL and before I noticed the edit window expired. My apologies for the error.

Kudos for bring write on that one. I'm wouldnot have guessed it was mySQL.

I'm slightly less surprised because it apparently bypasses the SQL optimization engine and goes straight to InnoDB (per the other post). I have seen people write key-value stores based on MySQL+InnoDB before (sometimes bypassing SQL entirely).

Can I ask what you are migrating away to?

Google Datastore wouldn't be a bad bet.

Nope! We were considering RethinkDB, but then its corporate support fell apart.

I enjoyed RethinkDB a lot. Sad to see the business go belly-up.

Have you evaluated Couchbase? I've been looking for an excuse to try it, but I usually just fall back on Postgres.

I do not know what Couchbase is appropriate for. Certainly not what we're doing.

Will you comment on why? Surely, if the database-as-a-REST-service model of DynamoDB is considered appropriate, Google Cloud Datastore will work too?

Or is it precisely that which you've changed your minds about? The managed-REST-DB approach?

I'm curious because I'm currently writing an app, and I've chosen Google Datastore as the DB backend, primarily because it's managed (no DB VM maintenance), and makes scalability easy.

It's unlikely that someone would seriously entertain using a datastore provider that wasn't running in their cloud provider.

Also I work at a bank now and I can't talk about where we're going.

Oracle doesn't allow empty strings either (well, it does but they are the same as NULL). It's not a big deal if you understand what it's doing.

Strange. MySQL understands both empty strings and NULL, and they are not equivalent. Informix understands empty strings, and IBM mentions that Oracle does not. Microsoft SQL server understands empty strings, but apparently whitespace at the end is truncated in comparisons, so " " is equal to " ".

Most likely DynamoDB is BigTable like database, Amazon may even used forked version of Cassandra or HBase.

All of this databases, got hot partition issue. If you cause to many read/write to one server you run into issues. That's why key schema is very important in non-trivial use-cases.

Classic anti-pattern is to use date or other increasing number as prefix. It is also problem when you use S3.

did you consider using EMR and Hbase? its not very expensive to run a provisioned 3 node managed cluster.

I'm fairly sure that EMR and HBase would give us miserably bad performance characteristics for the kinds of workload we're describing here.

Why would I consider it for sub-millisecond latency tasks? Has something changed there I'm unaware of?

well i was just curious. im kinda wondering if anything among hbase, Cassandra, Mongodb etc are equivalent in performance at your scale. i see that you mentioned rethinkdb below.. but after their collapse, im wondering what are the options you are looking at.

No comment on the corporate aspect. On a personal level, I avoid using MongoDB on new projects.

What is exceptionally irritating about this is that there is no valid reason for not being able to represent an empty string, and the lack of support makes the service inelegant and harder to use.

1. Empty strings are straightforward to represent.

2. Not allowing them violates the principle of least surprise, which makes the service harder to use.

3. Not supporting them adds needless complexity to client applications.

4. An empty string has obvious meaning and a place in everyday applications.

It just blows my mind that someone building a database in the 21st century doesn't know that zero is not a special case. In my book, anyone who doesn't get that is a beginning programmer, not even intermediate.

I know it took people a while to get this originally. In the old days, for example, Fortran 'do' loops were always executed at least once, regardless of the values of the bounds. Nowadays we know better: you check the values at the top of the loop, not (just) the bottom, to handle the zero-iteration case correctly.

But now it should be burned into every developer's brain: counts can be zero, strings can be empty, lists can be empty, etc. etc. etc., and handling these cases correctly is critical. Zero is not a special case!!!

Damn, do we have to teach you kids everything? Just prefix your values with an X.


What is the point of this trending? To try and get AWS to fix it? I must admit this was a gotcha for me at first but it is documented right on their standard Limits page. I'm guessing the fix is non-trivial or they probably would have just done it.


This limitation should be in a giant alert at the top of all DynamoDB documentation. They are being misleading in their documentation:

> A map is similar to a JSON object. There are no restrictions on the data types that can be stored in a map element, and the elements in a map do not have to be of the same type.

> Maps are ideal for storing JSON documents in DynamoDB. The following example shows a map that contains a string, a number, and a nested list that contains another map.

I agree they should fix it but I think they have been transparent about the restriction. It is already listed within most of their documentation.


It is also listed as a Gotcha on the AWS Open Guide which trended on HN a few weeks ago.


Are you agreeing that they should fix the issue or their documentation suggesting that "Maps are ideal for storing JSON documents in DynamoDB"? Or both?

I would love for them to fix storage of empty string but I don't think that is going to happen given all the documentation they have around not allowing it. Must have been easier for them to just document the limitation.

I do agree that they could do a better job in their documentation explaining the difference between JSON spec and their JSON equivalent that they store. It is not clear that the JSON you post is not stored in the same exact way. They could be more clear about the types they append to your data (L, S, SS, N) and then reference the link I put above about the restrictions for each type.

Not being able to serialize an empty string is abysmally poor design.

There's a difference between poor design and performance optimizations. Lacking any information, I'm inclined to commit an appeal to authority and guess that it's probably a performance optimization.

I've bumped into this. It's a serious problem.

If you think you're going to migrate an existing system to DynamoDB, then this is one of the things you need to plan for.

The workarounds for this are rather unappealing, perhaps heretical.

Oracle cannot store empty strings (they are treated as NULL).

Wow, I've been working with Oracle for a million years and I didn't know that.

Fortunately, now that I'm over a million years old I'm probably not going to freak out about it no matter how wrong it seems. I might relax my pro-NULL politics a bit though.

This was one of the first things I recall learning about Oracle when I had my first internship dealing with it. It boggles my mind that you can work with Oracle for more than a week and not run into this.

Enjoy your boggle, because I've been working (on and off) with Oracle since 1994, and despite frequently being the "guy who explains NULL" I never ran across this, nor did anyone ever point it out to me.

I would even say I've worked with some top-notch DBA's, and I've at times been insufferable in my defense of NULL as a concept, and the databases in question have dealt with a whole lot of data and made a whole lot of money -- yet somehow I didn't know this.

Yay HN. :-)

In my experience it most often manifests in the form of columns that are constrained not NOT NULL into which you suddenly can't insert, say, a user who has no last name.

Stupid question: Have you been able to side-step the limitation because an empty string is stored as NULL?

I discovered this porting an application from Postgres to Oracle. Another thing that drives me nuts about Oracle is that there is no native boolean type.

As far as I can tell Oracle can store empty strings they just happen to be treated exactly like NULL. That's at least marginally better. Still, it's hard to imagine people being willing to use Oracle nowadays.

Why? Because you think sql is not good or because you think there are better relational databases?

I'd say that Oracle delivers quite much value for your money for many applications.

I have worked with all other relevant relational databases and also with a few nosql databases, in distributed systems and monoliths.

It's hard to make a recommendation that can be applied generically. The best answer is "It depends, and it's complicated"

One thing is sure. Nosql is not the future (quite the opposite) although sometimes -in rather special circumstances- the right solution. Graph databases are perhaps more interesting, and perhaps event sourcing, but the promises are way larger than the deliveries.

Edit. I forgot the obvious reason. You try to stay clear of the Dark Side.

Know any graph databases that you like? I recently tried arangodb and it consumes a crazy amount of memory for even a relatively small dataset (several million ~20 character strings)

Try Postgres. I'm serious. For a dataset that should easily fit in memory, doing graph queries CTE's in Postgres can work very well.

It's mostly because of the money. Relational is still a great type of database and I'm sure Oracle is great. But I haven't worked with many companies that could afford Oracle very easily if at all.

I also wouldn't make a generic database recommendation :)

Not necessarily a choice, but everywhere I've ever worked used Oracle. Including my last 2 <500-person Valley-based tech companies I've been employed at used it.

For all its drawbacks (and I'm not a DBA or developer) at least I don't have to participate in HN threads where people discuss the 10 buzzwordy-DB-but-not-a-DB trends-of-the-week to migrate your data to.. which will all be forgotten next week :)

> Still, it's hard to imagine people being willing to use Oracle nowadays.

I use Oracle every day, because that is one of the options for our customers rely on to run their critical Fortune 500 infrastructure. The other being SQL Server.

Does that mean that two empty strings won't compare equal to each other?

This rears its head when you want to make a string column nullable. If you do that, you can't store empty strings in it. Really broken IMHO.

Postgresql is pretty good, however Oracle still can do some crazy things, at the cost of a correspondingly insane amount (IMHO) of database admin / sysadmin overhead.

An empty string has a length. That length is zero. NULL does not have a length.

The length of a string shouldn't matter to anybody.

Really? I can think of at least a couple of good reasons it matters (i/o related).

What matters is the amount of space it takes up. What matters is the content of the string in the context of parsing it and displaying it. And in all of these scenarios -- the only relevant scenarios for using a string -- a null string and an empty string should almost always have identical behavior.

There is a big difference between not knowing a value and knowing the value is empty.

That's really up to the context of the problem, and the specific semantics of a null value. (Which is a value, and a present one at that. If you need to account for missing data, you can do that any number of ways, with sentinel values of any kind.)

The length of your comment is 51 characters.

> it's hard to imagine people being willing to use Oracle nowadays.

Only when it's the lesser of the super-evils you have to pick from. I'd pick Oracle over Progress(OpenEdge), but not much else :)

I used DynamoDB for an inherited project around the time I had gotten really good with MongoDB. I don't have any idea why anyone would ever use DynamoDB. It's like Mongo, without any of the good parts, that you can't run locally or any non-Amazon server. I don't see the benefit.

Edit: Specifically, I was using some pretty advanced aggregation stuff. At the time I used it, DynamoDB had zero aggregation whatsoever.

The reasons we use it * doesn't require any setup (no provision servers, maintining backups, etc) * automatic backups * practically free during development * scales when you need it too

That said, we've run into issues with the partitioning. If we were given access to the partition information and the ability to reduce partitions when needed, that would solve those problems.

Mind sharing what issues? A few keys are the target of a disproportionate amount of transactions?

Correct, our read/write access isn't spread out. So, before knowing how the partitions worked, we increased our read/write throughput, this caused the partitions to increase, so now our throughput that was spread out across the entire table, was split between 2 partitions. This caused even more throttling.

Your list formatting didn't quite work.

I also have done 3 years of MongoDB before I (unwillingly) wrote an app with it... worst development experience I've ever had. I wouldn't have used it were it not for my corporate overlords at the time.

My new rule of thumb: Use plain databases for most things and only use DynamoDB if there's a precise feature in your system that it is well suited for.

DynamoDB is a horrible general use database.

It's specifically not a general use database, so it's not surprising that it doesn't have general use database features. If you want that on AWS, use RDS or Aurora. If you need to scale in certain ways, then Dynamo can certainly be worth the restrictions it imposes, because it is very scalable.

You had a horrible experience with MongoDB but you are drawing a conclusion about DynamoDB? Am I missing something?

I think your parent commenter meant "3y experience using Mongo before having to use it [- DDB, subject of thread -] ..."

By "plain database" do you mean a relational database?

Care to elaborate? MongoDB for app development on paper seems like a great solution.

The NOSQL model fits great with a Restful API layer. The process of getting and posting data over HTTP should be a lot simpler with Restful API/MongoDB when compared to relational database solution.

I know Restful APIs are not perfect either, so care to fill in the blanks of why its was the worst dev experience for you.

If you don't understand why anyone would use DynamoDB then MongoDB is probably perfect for you.

I used MongoDB in the past too. Now, I'm using Elasticsearch as a document store. Aggregations are good, and you can do some very advanced stuff with it (geographic clustering, scan left, range and exact match, range buckets, histograms). Of course it comes with a set of limitations, but it might be worth your time to investigate (especially if you want to use it also for full text search).

I've actually become quite familiar with Elasticsearch since then, whereas I stopped using Mongo (that project died). ES is indeed very powerful.

To be fair, you can run DynamoDB locally: https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

DynamoDB-Local is a service that implements DynamoDB's API, including its query language, datatypes, and grammar, but stores everything in SQLite. It's great for playing with the API or substituting a mock implementation that doesn't drain your wallet, but it's very unlikely have the backend implementation and the performance characteristics -- or even all of the exact bugs -- of "real" DynamoDB.

last time I used this, it had many significant differences from the real DynamoDB. Different enough that I don't really understand what purpose DynamoDB Local is actually supposed to serve.

Ah, I wonder if this was true a few years ago (early 2014) when I was working with it.

DynamoDB being part of the AWS ecosystem is a massive benefit.

I like the way you spelled "lock-in". ;-)

I dislike your sarcastic winky face, but I agree with the sentiment.

Unless you plan on handing amazon your money forever, you should always steer for agnostic solutions you have the ability to host yourself.

As much as it may not 'scale' if you manage to get a passive income with decent understanding of load, then purchasing hardware is almost certainly cheaper for you.

Cheaper is good, it means more return so you can invest that income in your new project (and pay amazon for easy scaling again, if you want). But removing that ability not only locks you into their platform which may change prices at any time... it also means you can't move to self-hosting when the time is right and save quite a lot of money.

I dont believe you are locked in with AWS forever if you choose DynamoDB.

It doesnt use a unique database model that no other DB supports. Its based on NOSQL and it should be simple enough (with effort of course) to export a DynamoDB document store to another NOSQL solution.

your reply makes me believe you've never migrated database solutions for larger projects, especially ones where you're trying to actually make money and spend little time.

Migrating database solutions is really quite expensive in terms of time. I've done it twice and it's not something to take lightly. The usual recommendation is to never migrate database solutions unless there is a catastrophic problem with continuing with the current.

If you are talking relational database (I dont know, you just say databases), yes there can be issues exporting foreign keys, views you name it. But is it possible, yes it is.

NOSQL database on the other hand, completely different scenario. Exporting is a much more steamlined process. Again, its not magic, there is effort, but it is possible.

Just because you or your company are risk averse or strapped for cash, doesn't mean it's impossible to export a database. All the databases mention have export/import operations available, and as I said you are not locked into AWS if you do not wish to be.

Oh and don't assume my history thanks, you know nothing of it.

>Oh and don't assume my history thanks, you know nothing of it.

I'm going to attempt to be constructive here, but of course I'm biased so please take it with a grain of salt.

I cannot look past the words you speak, they are how I form an opinion of you on the internet. I cannot see your history, I cannot see your face/body language. Your words are your personality, they show your experience.

What you've said reminds me of similar people I know who have a distinct lack of experience, so, I lump you into that category.

In an attempt to understand you better I have peeked into your comment history. I can honestly say that I've not seen as much arrogance and bitterness for quite some time on hackernews.

Instead of trying to be snide and assuming you're better than people, or trying to passive-aggressively "win" a conversation.. Why don't we try to extract as much understanding or knowledge from every interaction on HN as possible.

most people on hackernews have some strong technical background, we have people here who are absolutely the forefront of their industry posting as regular users (Bryan Cantrill and Branden Gregg come to mind) which is absolutely humbling.

Now, with that out of the way, I invite you to read the topic.

"DynamoDB cannot store empty strings"

contrast that to your comment

>"NOSQL database on the other hand, completely different scenario. Exporting is a much more steamlined process."

NOSQL solutions aren't comparable if they handle types differently, they suffer most, if not all, of the conversion problems of relational databases.

The problem I faced when converting database solution was always types... Triggers, views and other relational-isms are a one-time investment fix.. converting types can easily lead to corruption of a few columns for a few rows.. how do you check that?

The answer is writing a lot of tests, and doing things especially carefully and incrementally.

and also, optimising for more revenue is not being cash-strapped, it's being wise enough to be able to reinvest capital in a new venture.. it's incredibly unwise to throw money up the wall for no reason other than it would cost a lot more to switch away due to vendor lock-in. (which is a reason not to move in of itself) which is what I'm warning about.

Please don't lecture on experience, it reeks of hubris. The fact the you suggest someone buys their own hardware to run a database to support their business is somehow cheaper shows a lack of understanding in regards to the difficulties companies face and the risks of managing a database inhouse.

You speak in absolute terms that are simply not true. Choosing DynamoDB as your datastore does not mean you have to pay Amazon forever. Databases can be exported successfully. Costs can vary dramatically so don't assume it will blow the budget.

Amazon has a large repository of documentation in regards to exporting data and there is huge online community for AWS support and third-party tools. If you are as experienced as you claim to be, you would know these are big factors when choosing a platform to serve your data.

These benefits along with an incredible array of services, reliability, security, ease of configuration, reduced licensing costs, scalability and support coupled with the wealth of talented people available who are well versed in AWS would be enough to persuade most companies towards this choice. Thus why it is the most popular choice to host data at the moment and why most companies are moving away from in house databases. Thats not an opinion, thats fact.

There is no such thing as a "nosql database" and import/export are all specific and subjective to each database implementation.

Of course there isnt a NOSQL database groan its simply a term used to describe a category of databases that store data in a non relational manner.

It's a fair point, but when you stick to the same ecosystem you save a lot of time in terms of setup for credentials, monitoring, backups, devops etc.

You can run it locally for development. Just look for dynamodb-local images on the docker hub.

Dynamodb has performance guarantees and you don't have to get into nightmare of maintaining MongoDB cluster.

In my previous team, we instead use the string "null" to represent NULL. Its pretty dumb but we haven't thought of any other way.

What I like about DynamoDB is that you can scale the table easily. You don't need a dedicated DBA to maintain your database.

For what I don't like

1. You pay for what IOPS you allocated 2. You can't share free IOPS to other tables 3. Getting a snapshot of the whole DB is impossible, DB backups are not transactional 4. Use EMR or DataPipeline for backups 5. If you reach your IOPS limit, you need to retry your writes/updates instead of delayed ack's. Others uses libraries that limits writes but its per server and doesn't account the free writes on other servers since the accounting is on the client level

>In my previous team, we instead use the string "null" to represent NULL. Its pretty dumb but we haven't thought of any other way.

Yep, it is a classic, BUT (just in case): https://news.ycombinator.com/item?id=12426315

You could use the empty-set character "∅" instead. At least it's less likely to occur.

Before they announce tons of new stuff to please the shareholders, there are much awaited enhancements on existing products which are much over due. Case in point this thread. Keep the developers happy. Just saying.

I dealt with this. It's absurd. So many smart coding conventions are completely broken by this absurd convention.

Don't use Dynamo unless you for some reason have to. It's half assed and not getting many updates.

DON'T use something provided by AWS?! Why, that's heresy! ;)

This is (was) very common in database systems. Many don't distinguish empty string from NULL. Example: http://www.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com...

As a workaround, if the database supports unicode, then store a "zero width space" for a blank string.

"As a broader question, do you have a specific use case for having an empty string in an AttributeValue?"

Can anyone explain to me why this isn't a bizarre question? They can't imagine a use-case for empty strings on their own?

I would quote them back their own marketing: "Using the AWS SDK you can write applications that store JSON documents directly into Amazon DynamoDB tables"

Why is this news?

It might be annoying but it's super easy to work around. It makes sense to not store empty strings, they provide no value and it would only take up space.

How do empty strings provide no value? It's a valid value for strings, one which many programs encounter regularly, and it's reasonable to want to store them. Refusing to store empty strings makes as much sense as refusing to store zeroes in a numeric field, because zero is "nothing."

> It makes sense to not store empty strings, they provide no value

`""`, just like `NULL`, is semantically difference from absence of record.

Just hit this problem again this morning. Quite annoying. I figured out what string was empty eventually, but i've solved this on other projects by just removing all keys with empty values to get around possible errors.

The problem we have is that we're trying to store JSON provided by the user. Meaning that we need to persist empty strings provided by them. The only way I can see around this issue is to replace all empty strings with a special string token on the way in, and replace all those special string tokens on the way out.

See that's the best use case I've seen in this entire discussion. If you don't get to control if blank vs null is significant then I can see how you'd have a real issue on your hand.

I'm guessing you have to have it in a form you can search it (so you can't just GZIP it or something like that)?

This is all correct.

Append all strings with a zero width space, and remove it on the way out.

Using a printable character (or sequence of characters) seems like it would be a lot better. It'll ugly up your database, but at least it'll be obvious that there's a difference. Having all your string possess two forms which are visually identical but are not actually the same sounds unpleasant.

I think it would be better to use something both printable and commonly used for placeholders, eg. an underscore, so it's obvious if you forget to remove it and unlikely to seriously confuse anybody.

What if the user has a string ending with a meaningful zero width space already stored? For example, the string could be checksummed somewhere. It would corrupt their data.

If you want a kludge for this, it's better to generate a longish random string (e.g., a UUID) to indicate an empty value.

When you get a string from the client you prepend a single zero width space. When you send a string back you strip the single leading space you added. The client will always have the exact same data back that they sent originally.

You're right, of course. Sorry, I wasn't clear. I meant that a user might have stored a string with a zero width space at the end by the time you introduce this escaping mechanism. (I've already edited the comment to indicate this.) The same goes double if you append a common printable character. You'd have to rely on some additional indicator, such as the date and time the record with the string was stored, to know whether to unescape a string and also be sure nothing changed those date and time without escaping the data.

Oh, yeah, in that case I'd either if-case it by time stamp or I'd prepend a zero width space to all historical data as well. I would prefer doing the latter and would only do the former if there was some reason I couldn't do the latter, for example if I had too much historical data to able to process it (though I have a hard time imagining that happening for something so trivial as prepending a zero width space, unlike say converting thousands of hours of video which might actually be too time-consuming or computationally expensive).

One issue that might arise with altering historical data that I can imagine would be if it was ever necessary to restore from backup and your backup was made before you later added the zero width space, and then you forget to add the zero width space again when you restore from backup a few months down the road. But with proper documentation and procedures that shouldn't happen.

I believe the parent is saying

    string_to_store = userstring + extra space
    dynamo.store(key, string_to_store)


    stored_string = dynamo.retrieve(key)
    user_string = stored_string - extra space
That way the user puts a string in and gets the same string out. No problem.

Is the zero width space destined to be the next maligned value, comparable to null?

It's too bad DynamoDB can't just do this conversion for me.

While not ideal, you can create a new AttributeTransform that sets a place holder when storing into DynamoDB and removes it when pulling out of DynamoDB as part of your DynamoDBMapper instantiation.

I did this to convert some String Sets (SS) in my database to String Lists (L). I almost did this same thing to fix the empty String issue but didn't have the time to implement it yet.


Unfortunately I'm working in a node environment. The Java SDK for working with DynamoDB seems much nicer to work with. For example I believe that you can do transactions with the DynamoDB interface for Java.

Sorry I saw the link was referencing the Java SDK so I thought you were using the same. The DynamoDBMapper Java SDK has been an easy to use and readable ORM for me. Adding annotations to define keys and attributes has been great. That said I'm not sure if JavaScript has an equivalent to the AttributeTransformer I mentioned.

Can someone give me some examples of cases where there is a strong semantic difference that you'd want to represent with a blank string vs a null?

I'm interested in abstract concepts, not "I'm trying to port from Oracle/Mongo/whatever and this is how it works". Assume your designing against some generic record store.

I'm having a ton of trouble thinking of a case where that's a big problem (again, ignoring trying to save time in porting) and I was wondering if someone could provide me a few examples.

It's the difference between "this value is unknown" and "this value is known and empty". It's similar to why JavaScript has both null and undefined.

Here's an example. Let's say a user is taking a survey and the last question is: "Please share any additional thoughts if you have them." A null could mean the user did not answer this question, probably because they haven't gotten to it yet. An empty string could mean the user was asked this question and actually didn't have anything to say.

That's basically the only example I could think of: tracking if someone filled out a form field.

But in that case how can you tell the difference? I mean if they never clicked in the field or tabbed into it and submitted the form is that meaningfully different from they clicked into it and didn't type anything?

I can think of ways to work around it (keep a boolean for 'was filled out', a number that means 'they got up to field 17', etc.) but I'm still not sure you can draw a conclusion of 'they skipped it' vs 'they didn't enter anything'.

You're assuming all the questions are being presented to the user at once. Imagine a UI that presents questions to a user one-by-one. If the user bails out of the survey early, the rest of the survey answers can be null.

In that case you can easily keep track of what step they got up to in the wizard. It's a different way to store the data, but would work just as well.

I think the null vs empty argument works better when there are 40 boxes on screen at once. But again I'm not sure what semantic difference you'd find between 'they left it empty' and 'they never clicked into it'.

Keeping track of the step works until you move a field onto a different step in the wizard.

First, it's a document store so you can store the way the form is constructed with it (i.e. order of elements, etc).

Second, in Dynamo you could chose the absence of a key to represent "never got here" and a key with a null value to represent "they left this blank". You don't need an empty string.

So using null to mean the opposite of what it normally means? I would recommend against anyone using that workaround.

That depends on whether you see null as a marker of not having data or a marker of not knowing if there is data.

Both of those are data unknown, and both would normally be null. You don't distinguish between known unknowns and unknown unknowns at the database level.

When you say "The user submitted an empty form field as their answer", you do have data. It's weird to use null for that.

Since you can leave a key absent in DynamoDB, it basically has two ways to say "data unknown" and no way to say "knowingly left blank".

You can repurpose null as you suggested, and it will work fine in isolation, but it will massively violate the principle of least surprise and could lead to painful bugs.

To be fair, this would often be modeled by omitting the field completely. The lack of the field suggests the user hasn't gotten that far. The presence of the field with a NULL value means they didn't provide anything.

In a relational database there's a clear difference between "I did a join and there was no value on the other table" and, "this cell was intentionally left blank".

I haven't used DynamoDB enough to tell if that applies here or not.

You basically don't join in Dynamo (as I remember), so that doesn't fit.

Either way, you'd still need to give a good reason for "this space intentionally left blank" because I'm not seeing much in the well of compelling reasons.

> Can someone give me some examples of cases where there is a strong semantic difference that you'd want to represent with a blank string vs a null?

A null middle name means "we don't know their middle name, we added it to the schema after that user signed up or didn't ask them". An empty string means they explicitly don't have a middle name.

I entirely agree with your point with regard to a fixed-schema, row-based, relational-like datastore. But in a KV-store like DynamoDB, there's a different set of possibilities:

[1] key present, value undefined or null or some other special type

[2] key present, value is empty object of its expected type

[3] key absent

In schemaless datastores where the value's content somehow determines its datatype, it can be difficult to enforce a distinction between scenarios [1] and [2]. Meanwhile, in an externally-schema'd datastore (like most RDBMS), you don't have option [3]. I am familiar with the practice of mapping "we don't have a known-good value for this" to omitting the key in an KV-store, while in an RDBMS that semantic meaning is mapped to a NULL instead.

Right. In Dynamo you could store JSON with "middleName: null" and know that means "We asked, they didn't have it" as opposed to a record without a middleName key which means "we never asked".

And now you've just discovered why relational databases have a "null" concept that is separate from the "empty string" concept.

In a relational table, all rows in the table have a "middleName" column. So you can't "omit" the column entirely for some rows like you can in a doc/KV store.

"NULL" in a relational DB is to store, in a table where every record has a "middleName" column, the concept of "missing key" from a key/value store type layout.

NULL is the relational DB's way of representing: "a record without a middleName key" exists here.

But my middle name IS null

In other words, you don't have a middle name. Not to be confused with:

> But my middle name IS "null"

Your answer is entirely correct. This should be the top comment.

Although I think the limitation is an incontrovertible issue in other ways, that specific example sounds like a domain modelling blunder to me, since it is overloading a scalar value to encode value-object metadata.

Overloading a scalar for metadata is my complaint with it.

In what ways do you think the limitation is an incontrovertible issue?

Well... just thinking out loud:

It's an impedance mismatch with every programming language I personally know, which all permit the empty string. Even Erlang or Haskell, with their oddball strings that are actually lists, permit an empty/null list for this purpose. What might they be? Maybe it's a blank line from an array of chomped input lines, maybe it's a base case of a recursive operation (see also: you can't have an empty set, possibly related, maybe DynamoDB stores strings as lists...).

The point is, sometimes we have the empty string as a legitimate scalar value, so rejecting them creates extra work for the developer and feels like a POLA* violation, even if there's a solid underlying technical reason why it is so (and maybe there is, maybe it's due to a Merkle tree for the values, as suggested in the predecessor Dynamo's original paper). It's the kind of thing that leads people to develop wrappers for things that really shouldn't need them, which is a gateway drug into all sorts of unnecessary abstractions and extra work.

At the other end of the spectrum I've seen some hilaribad JSON structures, ones where a value might be absent, or null, or the empty string, or a real string, and these four things all have different meanings /o\.

* Principle of Least Astonishment.

A few thoughts.

0 <> NULL <> ""

They are all values that matter because they can exist.

Blank can have business meaning. Also, I suspect that there is a reason blank is separate from NULL in the ASCII table. They are different signals, and changing one to the other is fundamentally changing the meaning of the signal.

Blank often has business meaning on forms, like "leave this field blank to indicate the user has acknowledged the field but chose to enter a blank value" vs the "user never saw this field and no value was entered, or the system did not store a value for the user's input".

The old saying "NULL is not nothing"?

The concept of known non-existence vs unknown existence.

You could represent a path through a grid as a string using U=up, R=right, D=down, L=left. Then a solution to a maze could be 'UUUURRRRLRUULLD'. An empty string indicates a solution to a maze in 0 zero steps. A null string indicates that there is no solution.

I'll give you that, but it's a very contrived example. I'm trying to think more of real world problems.

Many people have mentioned the idea of keeping track of if someone doesn't have a middle name versus you don't know their middle name. That makes more sense as an example (although I'm not sure how critical that is).

Treating the empty string as NULL, as ORACLE does, is the sane option on real world problems. Having "" different from NULL creates a lot of problems and the advantages are minimal. As you say, in the rare cases that you must tell WHY it is null, it is better to have a separated Boolean field, than to store "" vs NULL and give a meaning to each one.

One of the first problems that you'll encounter if you use a DB with "" != NULL is at data display, because the natural display for NULL and "" are "an empty cell". Then you cannot tell if it is NULL or ""... then you must display something else for NULL... then...

You have no such problem if "" is NULL, an empty cell always means NULL

They're different, and they mean different things. NULL means "I don't know this", whereas "" means "I know this is empty".

Data display with NULL != "" is easy, do what SSMS does and italicise/otherwise highlight the NULL.

What problems does "" being different to NULL cause?

> Having "" different from NULL creates a lot of problems and the advantages are minimal.

My experience is the exact opposite.

I can think of one way to display the data... JSON.

I think this is a bad use of `NULL`.

You should instead use a list of all possible solutions, or a sublist of that list (if you only need one solution) because then the empty list can indicate there are no solutions.

If you then want to do something you can iterate over the list: This will protect you from "null pointer" exceptions, promote more functional programming, and is usually less code as well.

    js: if(x !== null) { … }
    q: $[not null x;…;'`oops]

    js: x.map((r) => …)
    q: {…} each x

It also doesn't store empty lists, sets, maps, etc, not just top level, but nested inside dynamo maps. The particular SDK i was using silently removes keys with empty (whatever) values from the map. Dynamo maps in general were full of gotchas and we ended up just serializing as json since the only gotcha there is integer keys becoming strings.

Also dynamo's capacity provisioning can be scaled up any time (although scaling up is slow-ish), but only down once per day, which means that unlike the promise of ec2 you end up having to essentially pay for your peak load at all times.

If I had it to do over again I'd have stuck with postgres, the only reason we used dynamo in the first place was to throw our boss a bone to soften our flat refusal to use Amazon Simple Workflow, which is an even bigger quagmire.

You can scale down a table 4 times per day.

Totally agree on SWF. :)

I've known about this "problem" for ages, but has it really been 4.5 years? Wow.

I tried using DynamoDB. It felt really unintuitive and like I had to bend my thought process to fit it.

I misconfigured some parameters and ended up with a big bill for a database that was doing pretty much nothing.

I don't see the upside of DynamoDB.

> I don't see the upside of DynamoDB.

Wide column, no-sql data store where you pay for provisioned throughput, and storage over 25GB.

All maintence is taken care by Amazon, no sharding, scale, patching, or tuning required.

If you need a non relational data store for use inside an AWS deployment, you would be hardpressed to find a better alternative.

> inside an AWS deployment

That's the key, isn't it?

I've been trying to wrap my head around DynamoDB for something that wouldn't necessarily live inside AWS and it seems like an awful lot of trouble.

But I guess if you're in the same AWS Region and you're doing high-volume stuff then the cost savings probably get compelling real fast.

> If you need a non relational data store for use inside an AWS deployment, you would be hardpressed to find a better alternative.

RDS Postgres supports HStore and JSON types.

To utilize it in a production environment, you would need:

1. have at minimum 2 instances 2. schedule backups 3. create and manage your db schema 4. manage users/roles

Admittedly RDS makes some of those tasks easy, but Dynamodb makes it even easier.

1. Create the table, provision read/write throughput 2. Create IAM role.

You don't do backups?

Good point.

You need to back up data from Dynamodb, but as a measure to protect against user caused data loss, not AWS infrastructure failure dataloss.

Same issue, different thread: https://github.com/aws/aws-sdk-js/issues/833

How about BigTable and Cassandra?

Google Cloud Bigtable will let you store empty strings as values. I'm not quite as sure about whether you can successfully set a 0-byte value and distinguish it from an empty string ['\0'] value; the answer may depend on the client library/language/API that you use to interact with the service.

[Source/Disclaimer: I work on Google Cloud Bigtable.]

I recommend people look into Azure's DocumentDB. I've used Dynamo before and even though it was fast and reliable, development experience was a pain.

I'm slowly migrating a self-hosted MongoDB database to DocumentDB and things are going pretty well. The only thing missing for me is aggregation and transparent encryption, but they are working on those.

Nor can it save empty sets it appears.

Simple workaround: prefix every string with a fixed character. Advantage: it will leave the lexicographical sort order unchanged.

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact