Amazon DynamoDB – a Fast and Scalable NoSQL Database Service from AWS (allthingsdistributed.com)
296 points by werner on Jan 18, 2012 | hide | past | favorite | 126 comments

A feature request (I see that Werner is reading this message thread): for development it would be very nice to have the Java AWS SDK have a local emulation mode for SimpleDB and DynamoDB. This would allow development while flying or otherwise off the Internet. Similar functionality to AppEngine's local dev mode.

It is on the requested features list

One of our developers built a locally emulated SimpleDB clone for development/testing; it is open-sourced at:


Since HN doesn't show vote totals I wanted to chime in and say this would be really useful to me as well.

HN doesn't show vote totals, but it does sort posts based on the number of votes (descending), so the more popular posts rise to the top, just like the front page.

HN most likely uses a combination of votes and time in scoring, similar to the reddit ranking algorithm: http://amix.dk/blog/post/19588

That isn't how the frontpage works, and if it did, we would rarely have news turnover.

Well, yes and no. Upvotes push submissions toward the top of the page, but the ranking algorithm also includes a decay factor so that old articles eventually fall away.

This looks like a great service - there are some really interesting ideas here. The provisioned throughput is a great idea. It's interesting to see it runs on SSDs as well; I wonder if its storage is log-structured like LevelDB. The composite hashed indexes are really interesting as well - I guess they partition the index across nodes, and each node maintains a range of the sorted index. It'll be interesting to see how usable that is in practice.

I read with interest Steve Yegge's mistakenly-leaked email about the oppressive conditions for engineers at Amazon. It's hard to reconcile with the sort of innovation they consistently show.

> I read with interest Steve Yegge's mistakenly-leaked email about the oppressive conditions for engineers at Amazon. It's hard to reconcile with the sort of innovation they consistently show.

My take on Steve's rant (as an engineer at Amazon) is that a lot of the issues he pointed out are legitimate, but at an entirely different scale than he was pitching them at.

Day to day I work on a product with another couple dozen or so engineers. We build what makes sense for our product, and for the most part we build it in a way that makes sense for us. Sometimes we are under pressure to leverage other parts of the platform, and sometimes that does entail a lot more work. Most of the time, though, it ends up reducing our operational load (because the systems we depend on support products much larger than ours :) and giving us someone we can page when things go pear-shaped.

Amazon isn't the perfect place to work, but it's generally not bad (other than the frugality thing; that sucks as an employee no matter which way you slice it).

Interesting, thanks for the perspective - he definitely made it sound like a sweatshop, and that's not the sort of environment normally associated with this kind of innovation.

I simply do not see how competing cloud vendors can keep up with this. Most of them are still struggling to provide anything beyond a simple API to start/stop machines.

Well, Azure has been offering a similar NoSQL service for a few years now.


I've been using Azure Table Storage since the beginning and this doesn't seem to be the same. Like others have mentioned TS is more similar to SimpleDB. Now, I would love for someone to give me a tl;dr on the the feature set of DynamoDB so I can make an accurate comparison.

Table Storage does not allow any indexes other than the main primary ones (Row Key and Partition Key). You also cannot store complex objects within fields and use them in a query. You basically just serialize the data and stuff it into the field.

The dynamic schema is very nice if you can leverage it, but the actual query support is TERRIBLE. (Sorry Microsoft, I'm a fanboy but you blew it here.) There is no Order By or even Count support, which makes a lot of things very difficult. Want to know how many "color=green" records there are? Guess what: you're going to retrieve all those rows and then count them yourself. They're starting to listen to the community and have just recently introduced upserts and projection (select). I would love to see them adopt something like MongoDB instead :)

For more issues check out: http://www.mygreatwindowsazureidea.com/forums/34192-windows-...

Edit - For what it's worth, we've moved more things to SQL Azure now that it has Federation support. Scalability with the power of SQL. http://blogs.msdn.com/b/windowsazure/archive/2011/12/13/buil...

No. They offer the equivalent of Amazon S3, SimpleDB and SQS, but nothing comparable to this

Can you elaborate how this is different and not comparable? Azure's table service offers the same automatic partition management, unlimited per-table scalability, composite keys, range queries, and availability guarantees. The linked paper goes into more details.

Besides the points I already complained about before... how about 200ms response times even when performing a query using the Row & Partition Keys? I'm not sure if by composite keys you were referring to something other than the RK & PK, because those are the only indexes you get.

ATS response times within the Azure data center are pretty impressive in my experience.

Your partition keys can be composite, have a look here:


I agree with your other pain points - in terms of not being able to get counts, secondary indices etc. However, you can easily simulate some of those - maintain your own summary tables, indices and so on. These ought to emerge as platform features pretty soon though. It's not perfect, but its feature set is close to Dynamo.

As for Mongo DB, I guess this service has been built from the ground up to provide the availability guarantees and automatic partition management features. I don't know if Mongo provides those. You could run Mongo yourself on Azure if you wanted to; there's even a supported solution released recently.

Hmm, I guess when I think about composite keys I think of ways to indicate a specific field/column as being part of the key. Data duplication along with string concatenation aren't really an elegant way to do it. If I remember right you also can't update the key values once the record has been saved. This is coming from a big SQL guy though :)
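The string-concatenation pattern being discussed amounts to something like this (illustrative helper, not an Azure API):

```python
# Illustrative helper (not an Azure API) for building a "composite"
# partition key by concatenating its component values into one string.
def make_partition_key(*parts, sep="_"):
    return sep.join(str(p) for p in parts)

print(make_partition_key("tenant42", "orders"))  # tenant42_orders
```

Which also shows the downside being raised: the components are baked into one opaque string, so they can't be updated or queried individually.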

Balakk is correct. There are a lot of similarities between Windows Azure Tables and DynamoDB, and the release of DynamoDB validates the data model we have provided for a few years now with Azure Tables.

• They both are NoSQL schema-less table stores, where a table can store entities with completely different properties

• They have a two-attribute (property) composite primary key. One property is used for partitioning and the other is for optimizing range-based operations within a partition

• Both of them have just a single index based on their composite primary key

• Both are built for effectively unlimited table size, and seamlessly auto-scale out with hands-off management

• Similar CRUD operations

How Windows Azure Tables is implemented can be found in this SOSP paper and talk: http://blogs.msdn.com/b/windowsazurestorage/archive/2011/11/...

As mentioned by someone else, one difference is that DynamoDB stores its data completely on SSDs, whereas in Azure Storage our writes are committed via journaling (to either SSD or a dedicated journal drive) and reads are served from disk, or from memory if the data page is cached. Therefore, the latency for small single-entity writes is typically below 10ms due to our journaling approach (described in the above SOSP paper). Single-entity read times for small entities are typically under 40ms, as shown in the results here: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/...

Once in a while we see someone saying that they see 100+ms latencies for small single-entity reads, and that is usually because they need to turn Nagle off, as described here: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/06/...

This is running on SSDs and it makes a HUGE difference.

How is this vastly different to Azure Tables?

The cost per transaction, performance & ease.

Reads per $0.01 = 50 × 60 × 60 = 180,000

Writes per $0.01 = 10 × 60 × 60 = 36,000

Assuming that your usage is at 100% capacity, then from a read perspective DynamoDB is half the price. Writes are much more expensive, but many applications are heavily read-oriented.

DynamoDB claims single digit millisecond reads, azure tables does not (from my experience.)

Azure tables have a maximum throughput of 500 requests per second over a given partition, and 5,000 requests per second over the whole account. DynamoDB does not state such limits.


To put this into context:

Assume a system with 5,000 writes per second and 50,000 reads per second. Here are the costs:

AWS Reads: $240, AWS Writes: $120, AWS Total: $360

Azure Reads: $4,320, Azure Writes: $432, Azure Total: $4,752

Seems like quite a difference for a decent sized read heavy application.

Can you please explain your math? AFAIK Azure txns are not paid by the hour - they are a flat cost of $0.01 per 10,000 storage txns. If you do batched GETs and PUTs you make only 550 txns (55,000 / 100 entities per batch).


I agree that Dynamo's provisioned throughput capacity is a very useful feature though. Azure does not provide any such performance guarantee; the throughput limit is also a guideline as far as I know, not an absolute barrier.

I should have explained that my costs were calculated on a "per day" assumption. Thus the costs are for:

5,000 × 60 × 60 × 24 = 432,000,000 writes

50,000 × 60 × 60 × 24 = 4,320,000,000 reads

(432,000,000 / 10,000) × $0.01 = $432

(4,320,000,000 / 10,000) × $0.01 = $4,320

Azure Total Cost for One Day's Use: $4,752

((5,000 / 10) × $0.01) × 24 = $120

((50,000 / 50) × $0.01) × 24 = $240

AWS Total Cost for One Day's Use: $360

You are right that I don't take into account the bulk feature of Azure reads & writes, but this is down to bulk requests only being possible on a single partition at a time, which in my personal experience (not exhaustive) is non-trivial to take advantage of.
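The per-day arithmetic above can be checked with a short script, using the prices assumed in this thread (Azure at $0.01 per 10,000 storage transactions; DynamoDB at $0.01/hour per 10 write units or per 50 read units):

```python
# Sketch checking the day-cost comparison above, with the thread's
# assumed prices: Azure $0.01 per 10,000 storage transactions;
# DynamoDB $0.01/hour per 10 write units or per 50 read units.
SECONDS_PER_DAY = 60 * 60 * 24

def azure_daily_cost(writes_per_sec, reads_per_sec):
    # Azure bills per transaction actually performed.
    txns = (writes_per_sec + reads_per_sec) * SECONDS_PER_DAY
    return txns / 10_000 * 0.01

def dynamo_daily_cost(writes_per_sec, reads_per_sec):
    # DynamoDB bills per hour for provisioned capacity, used or not.
    write_cost = (writes_per_sec / 10) * 0.01 * 24
    read_cost = (reads_per_sec / 50) * 0.01 * 24
    return write_cost + read_cost

print(round(azure_daily_cost(5000, 50000), 2))   # 4752.0
print(round(dynamo_daily_cost(5000, 50000), 2))  # 360.0
```

Note that batching (Azure) and sub-1KB item sizes (DynamoDB) are not modeled here, per the caveats discussed in the thread.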

Your math is right, except you missed a factor for Dynamo - Unit size.


If your txns are all within 1KB, your math holds good; otherwise, you pay more. Interesting model, but I suspect it'll average out to similar costs.

The cost difference between Windows Azure Tables and DynamoDB really depends upon the size of the entities being operated over and the amount of data stored. If an application can benefit from batch transactions or query operations, the savings per entity can be significant using Windows Azure Tables.

As for the cost of storage: the base price for Windows Azure Tables is $0.14/GB/month, and the base price for DynamoDB is $1.00/GB/month.

For transactions, there is the following tradeoff:

• DynamoDB is cheaper if the application performs operations mainly on small items (a couple of KB in size), and the application can’t benefit from the batch or query operations that Windows Azure Tables provides

• Windows Azure Tables is cheaper for larger sized entities, when batch transactions are used, or when range queries are used

The following shows the cost per hour of writing or reading 1,000,000 entities per hour (277.78 per second) for different entity sizes (1KB vs. 64KB). It also includes the cost difference between strongly and eventually consistent reads for DynamoDB. Note that Windows Azure Tables allows batch operations and queries over many entities at once, at a discounted price.

• 1KB single entity writes -- Azure=$1 and DynamoDB=$0.28

• 64KB single entity writes -- Azure=$1 and DynamoDB=$17.78

• 1KB batch writes (with batch size of 100 entities) -- Azure=$0.01 and DynamoDB=$0.28

• 64KB batch writes (with batch size of 100 entities) -- Azure=$0.01 and DynamoDB=$17.78

• 1KB strong consistency reads -- Azure=$1 and DynamoDB=$0.05

• 64KB strong consistency reads -- Azure=$1 and DynamoDB=$3.54

• 1KB strong consistency reads via query/scan (assuming 50 entities returned on each request) -- Azure=$0.02 and DynamoDB=$0.05

• 64KB strong consistency reads via query/scan (assuming 50 entities returned on each request) -- Azure=$0.02 and DynamoDB=$3.54

• 1KB eventual consistency reads -- DynamoDB=$0.028

• 64KB eventual consistency reads -- DynamoDB=$1.77
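The DynamoDB side of these per-hour figures can be sketched as follows (assumed launch prices as above: $0.01/hour per 10 write units or per 50 read units, one unit per whole KB of item size, eventually consistent reads at half price; Azure's flat transaction price and batch discounts are not modeled):

```python
import math

# Sketch of the DynamoDB half of the per-hour cost figures above.
# Assumed launch prices: $0.01/hour per 10 write units or per 50 read
# units; one unit covers a 1KB item per second (a 64KB item needs 64
# units); eventually consistent reads cost half.
def dynamo_hourly_cost(items_per_sec, item_kb, write=False, eventual=False):
    units = items_per_sec * math.ceil(item_kb)
    if write:
        return units / 10 * 0.01
    if eventual:
        units /= 2
    return units / 50 * 0.01

print(round(dynamo_hourly_cost(277.78, 1, write=True), 2))     # 0.28
print(round(dynamo_hourly_cost(277.78, 64, write=True), 2))    # 17.78
print(round(dynamo_hourly_cost(277.78, 1, eventual=True), 3))  # 0.028
```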

Open source really helps here. Amazon are innovative, but they are not the only place innovation is happening. In fact, here's a pretty good writeup (if a wee bit biased) on how the new offering compares to the open source Cassandra project: http://www.datastax.com/dev/blog/amazon-dynamodb

Cassandra is a great project, as are Hadoop, MySQL, etc. The issue I am raising is not so much which project is better on a feature basis, but the fact that Amazon is able to offer it as a service, in a scalable way that no other vendor is able to do (with the exception of Google and, on a good day, Microsoft). Most other "traditional" cloud vendors, such as Rackspace, do not have anything remotely comparable to this, EBS, SQS, RDS, etc.

I also found it interesting that the storage media is specified and it's SSDs. Solid state will be hugely disruptive for hosted services, I've been hoping for an instance-by-the-hour service backed by SSDs and I'll surmise from this announcement that it won't be long before that shows up on the EC2 menu. Gimme :)

This still seems a bit expensive to me for an application that would require thousands of writes per second. E.g., 5k writes per second is ~$120/day. Using this for performance-based analytics, for example, would seem out of the realm of reason for the moment.

Can you explain your use case a bit more? I'm having a hard time imagining something that does ~430M DB writes/day but can't easily afford to pay $120 for those writes.

Remember that the throughput is per item, not per query. For instance we have an indexed query that returns ~1500 rows each time. Just doing that query a couple of times per second would create that kind of throughput requirement.

The number of read units consumed by a query is not necessarily proportional to the number of items. It is equal to the cumulative size of the processed items, rounded up to the next kilobyte increment. For example, if you have a query returning 1,500 items of 64 bytes each, you’ll consume 94 read units, not 1,500.

If that's the case then it's a completely different ball-game. I was about to abandon the whole idea of using DynamoDB due to the pricing of throughput. This makes it a whole lot more interesting!

The official documentation seems to clearly contradict you. The pricing calculator doesn't let you specify a value of less than 1KB. Who's right? Or maybe I'm just not understanding what either you or the official pricing doc is saying :)

From the pricing page (http://aws.amazon.com/dynamodb/pricing):


If your items are less than 1KB in size, then each unit of Read Capacity will give you 1 read/second of capacity and each unit of Write Capacity will give you 1 write/second of capacity. For example, if your items are 512 bytes and you need to read 100 items per second from your table, then you need to provision 100 units of Read Capacity.


Looks like 1KB is the minimum for calculations.
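The quoted provisioning rule can be sketched like this (`read_capacity_units` is an illustrative helper, not an SDK function; the rule here is the per-item one, before the query exception discussed below):

```python
import math

# Sketch of the provisioning rule quoted above: each read of an item
# costs one unit per whole KB of item size, with sub-1KB items rounded
# up to one unit. Illustrative helper, not an SDK function.
def read_capacity_units(reads_per_sec, item_bytes):
    return reads_per_sec * max(1, math.ceil(item_bytes / 1024))

print(read_capacity_units(100, 512))   # 100 -- the 512-byte example above
print(read_capacity_units(100, 3000))  # 300 -- a 3KB item costs 3 units per read
```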

Agree, but Amazon's CTO said something different, hence my question.

Werner is right. The query operation can be more efficient than GetItem and BatchGetItems. To calculate how many units of read capacity a query will consume, take the total size of all items combined and round up to the nearest whole KB. For example, if your query returns 10 items that were each 1KB, you will consume 10 units of read capacity. If your query returns 10 items that were each 0.1KB, you will consume only 1 unit of read capacity.

This is currently an undocumented benefit of the query operation, but we will be adding that to our documentation shortly.
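The query rule described above can be sketched as (illustrative helper, not an SDK call):

```python
import math

# Sketch of the query billing rule described above: a query consumes
# read units equal to the cumulative size of the returned items,
# rounded up to the next whole KB -- not one unit per item.
def query_read_units(num_items, item_bytes):
    return math.ceil(num_items * item_bytes / 1024)

print(query_read_units(10, 1024))  # 10 -- ten 1KB items
print(query_read_units(10, 102))   # 1  -- ten ~0.1KB items
print(query_read_units(1500, 64))  # 94 -- the 1,500-row query upthread
```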

A small mobile marketing company jumping into the wild wild west of real-time bidding. It would be used more for logging impression requests to be used later for further analysis. Our bidder would need to be able to handle upwards of 5,000 bid requests per second. Though these requests can be throttled down, naturally the more data we can collect the better. This also doesn't include the costs associated with querying the data, which would add up quickly.

Now I'm not sure this would be the ideal solution for such a thing (in fact it probably is not), but it's just the first thing that came to mind. In the grand scheme of things sure that may seem like a trivial amount due to the use case, but we're still more in the realm of a startup where dropping ~$3k/month on the data store alone makes me cringe a little when we have other expenses to account for also. :)

$3k/month = $36k/year

Consider this cost relative to the cost of a trustworthy ops person, plus the capex & opex of running your own reliable & scalable DB.

For about $15/month for minimal "reserved capacity" charges and $1/gigabyte per month replicated disk space this service looks like it will cure a lot of deployment headaches for a reasonable cost.

Interesting that when I signed up for the service that they verified my identity with a cellphone call, like Google sometimes does.

"Amazon DynamoDB stores data on Solid State Drives (SSDs)" This is big.

I'd imagine we'll soon see a SSD option for EBS volumes. Sounds like it'll be pricier (as you'd expect) - $1/GB for DynamoDB storage (but replicated to three volumes, sounds like).

When Amazon notes "SSD," from a client's perspective, it is only marketing. The storage media matter when you are managing your own hardware; they do not matter in SaaS. For example, the storage media could be floppies and you would still get satisfactory performance if there were a memory cache. Similarly, you could get poor performance with SSD media if the networking layers were slow. And if the storage media were failure-prone CDs, that wouldn't matter either, to us, because of the data replication performed in "the service." What matters in this case is the reported and actual latency.

I would love the option to add N indexes (at cost).

My guess is they will add options for additional indexes in the future... everyone needs to start somewhere. Even at Amazon's size and scale.

> Even at Amazon's size and scale.

At Amazon's size and scale, it's all the more important that you start with something simple with well-understood performance characteristics from day one. AWS doesn't really get a grace period during which they get to fix scalability problems.

Exactly - compare this to SimpleDB. SimpleDB started out with an advanced query language that let you query and filter your results in all sorts of ways. And guess what? SimpleDB is still limited to 10 GB per domain (aka database). Want to horizontally scale? The official suggestion is to shard your data across domains. This is a really messy solution because you have to pre-shard based on estimated database size, and resharding is nearly impossible (you'd have to rewrite your entire DB).

AppEngine went the other route and provided a very simple database API at first and all queries had to be range scans over an index. Any query you wanted to perform had to be precalculated by defining a composite index and some things (like inequalities on multiple fields) weren't supported. Over time they've built upon their basic database and added features such as a zigzag merge join algorithm which lets you perform queries that were otherwise impossible with a given set of indexes.[1]

I bet DynamoDB will be going the AppEngine route by starting with a simple, scalable base which can be used to build more advanced query engines and features.

1. http://code.google.com/appengine/articles/indexselection.htm...

While putting my data completely in the hands of another company makes me nervous, I have used S3 since its release and have had no problems. Amazon really seems to "get" developer needs. I do wish they would ease down a smidge on their pricing in general, but otherwise I really feel about them the way I feel about Google at this point.


You could buy a lot of Riak or Cassie for that.

5000 reads per sec of 64KB items would make you stream 2.5 Gbits/sec using consistent reads, and 1 Gbit/sec of writes, moving close to 1.5TB each hour. At the end of the month you have read well over 800 TB and updated 160 TB... That is a substantial application you have in mind... :-)
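Werner's back-of-the-envelope numbers check out (assuming 1KB = 1,024 bytes and a 30-day month):

```python
# Checking the back-of-the-envelope bandwidth math above
# (assuming 1KB = 1,024 bytes and a 30-day month).
item_bytes = 64 * 1024
reads_per_sec = 5000

gbits_per_sec = reads_per_sec * item_bytes * 8 / 1e9
tb_per_month = reads_per_sec * item_bytes * 86400 * 30 / 1e12

print(round(gbits_per_sec, 2))  # 2.62 -- "2.5 Gbits/sec"
print(round(tb_per_month))      # 849 -- "well over 800 TB"
```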

That may be true for an application with a constant load, but applications with a less balanced load have to provision for their peaks. My company (Malwarebytes) has very irregular traffic (at the hour mark we get very big spikes, but only for a couple of minutes) and it seems like we would have to provision (for this specific app) that peak for the entire hour. I might be misunderstanding the billing for this service though- if we ask for more units for 15 minutes, would the billing be prorated?

This actually hits on my only real issue with AWS in general, which is the hourly billing. We've used the mapreduce service a bit, and having a cluster roll up and fail on the first job is heartbreaking when it has a hundred machine hours associated with it. Obviously that is far, far cheaper than us building out our own cluster (especially with spot instances, which I can't even describe how much I love), but for some of the services smaller billing times would be useful.

Here's a brief overview comparison of DynamoDB vs. BigTable:


The site is also able to compare many other different databases.

It seems as if this site is a user-edited wiki, and there are a lot of things that need to be filled in (in case anybody is up to speed on DynamoDB and wants to help). For instance, the Map/Reduce entry was still '?' when I wrote this.

And Big Table (or at least GAE) does support transactions.

I work for a popular startup that has been privately testing dynamo for the last few months.

It's a fantastic product and even while in private beta has been stable and well supported.

$1/GB/month on SSD and replicated. So basically, $0.25/raw GB/month if they replicate 4 times.

They are making money on the read/write and are selling the capacity at current cost. Which — knowing the tendency for AWS to decrease the prices very slowly combined with the huge decrease of the prices of the SSD drives in the past months/years — is not a bad strategy to convince us to switch.

I posted a comparison to Cassandra here: http://news.ycombinator.com/item?id=3480480

There are some bugs in signing up for the service. In the console, if I go to the DynamoDB tab I am asked to sign up first. I click the sign-up button and I am told "You already have access to Amazon DynamoDB". Repeat.

Can you do me a favor and drop that in the AWS DynamoDB Forum so folks can look at it? https://forums.aws.amazon.com/forum.jspa?forumID=131

Question about Composite Hash Keys that someone might have the answer to (or be able to relate to other known implementations):

The composite key has two attributes, a “hash attribute” and a “range attribute.” You can do a range query within records that have the same hash attribute.

It would obviously be untenable if they spread records with the same hash attribute across many servers. You'd have a scatter-gather issue. Range queries would need to query all servers and pop the min from each until it's done, and that significantly taxes network IO.

This implies that they try to keep the number of servers that host records for the same hash attribute to a minimum. Consequently, if you store too many documents with the same hash attribute, wouldn't you overburden that subset of servers, and thus see performance degradation?

Azure has similar functionality for their table service, requiring a partition key, and they explicitly say that there is a throughput limit for records with the same partition key. I haven't seen similar language from Amazon.

Whether you scatter-gather or try to cluster values to a small set of servers, you'll eventually degrade in performance. Does anyone have insight into Amazon's implementation?

Awesome. Anybody want to sublease a 3 yr RDS reserved instance? :P

What's the concern for lock-in here - how difficult will DynamoDB be to migrate away from? Amazon seems to be increasingly catching App Engine Syndrome, though to be fair, it's been mostly network traffic-style functionality in the past.

In Simple Hash Key mode, it is just a key-value store. All the logic is in your app and/or supporting libraries, not in your DBMS.

In Composite Hash Key mode you also get limited range queries. These are supported by any DBMS that indexes its primary key via B-trees.

DynamoDB is cutting the features to the bone. This makes it easy to migrate out of DynamoDB and, at a first glance, hard to migrate into DynamoDB. Particularly hard to migrate from a RDBMS with complex schema/index/trigger support.

The beauty of it is that complex features can be built as libraries as the needs arise. The next five years look very exciting. Hats off to Amazon.
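A toy in-memory model of the two key modes described above (purely illustrative, not how DynamoDB is implemented): a simple hash key is a plain dict lookup, while a composite key keeps a sorted list of range keys per hash value, so range queries stay within one partition.

```python
from bisect import bisect_left, bisect_right

# Toy model of composite-key range queries -- illustrative only.
# Each hash key maps to a sorted list of (range key, value) pairs.
class CompositeTable:
    def __init__(self):
        self.partitions = {}  # hash key -> sorted list of (range key, value)

    def put(self, h, r, value):
        items = self.partitions.setdefault(h, [])
        items.append((r, value))
        items.sort()

    def query(self, h, lo, hi):
        # Inclusive range scan within a single hash partition.
        items = self.partitions.get(h, [])
        keys = [r for r, _ in items]
        return [v for _, v in items[bisect_left(keys, lo):bisect_right(keys, hi)]]

t = CompositeTable()
t.put("user1", 10, "a"); t.put("user1", 20, "b"); t.put("user1", 30, "c")
print(t.query("user1", 15, 30))  # ['b', 'c']
```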

Lock-in? You'll need to get all your data out and transform it into whatever alternative you pick. I'd say as far as AWS lock-in, this is on the low side of things.

The lock-in comes not from the data format, but rather from the quirks of the particular system du jour: API lock-in.

Getting your data out will also cost you a boat load of money. Same as putting it in.

From the limits page(1) I see you can only have 256 tables per account.

(1) http://docs.amazonwebservices.com/amazondynamodb/latest/deve...

As with all the other AWS services you can have your limits lifted upon request.

Can the 10GB per domain limit be raised on SimpleDB? Or do I have to promise to refer to it as DynamoDB to do that? :-)

Does that apply for the 50-per-account cap on S3 buckets as well?

I'd imagine S3 falls within the scope of "all the other AWS services"...

The fact that you can make composite keys should make that irrelevant.

Edit: to clarify, I think that composite keys would make your problem limited to 256 distinct applications, not, for example, 256 customers, users, etc.

I read through a number of the docs and can't quite find the answer to this question, hopefully someone here can help me out quick.

I already have a bunch of (large-ish, deeply nested) JSON objects defined for my application. I don't really want to go about redefining these since they work great between my various node processes and the front end. I am saving them in a nosql database already, I am curious about switching (to save on devops costs). I only request based on 1 Hash Key (int) and 1 Range Key (int) for all my current get operations.

Looking through the docs/examples I see a lot of this type of thing:

    "feeling":{"S":"not surprised"},
The JSON item has a kind of 'type syntax' on it. I really don't want to redefine my deep objects, but would be willing to redefine the Hash Key and Range Key, while leaving the rest of the nested types alone.

Ok, my question: Do my JSON objects need to conform to this 'type syntax' JSON notation in the examples? Or can I save just any JSON object into this database and only annotate the Hash Key and Range Key using this special notation?

Their usage of JSON is just incidental to your usage of JSON. They use JSON as a REST transfer format. You can pretty much ignore their JSON if you use one of the high level libraries in the SDK.

You can define a table with 3 fields: yourKey, yourRange, and yourJson. Put your entire JSON data as string in the yourJson field.

You will have to create an attribute for your JSON, where you'll store the JSON UTF-8 encoded. If you want to index on parts of that JSON blob you'll have to pull them out into their own separate attributes and then recombine them into a single JSON object on read.
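The suggested approach can be sketched like this (the attribute names `yourKey`/`yourRange`/`yourJson` are illustrative; the `{"N": ...}`/`{"S": ...}` wrappers are DynamoDB's wire-format type tags shown in the examples above):

```python
import json

# Sketch of the approach suggested above: keep the hash and range keys
# as typed top-level attributes and store the rest of the document as
# one serialized JSON string. Attribute names are illustrative.
def to_dynamo_item(hash_key, range_key, doc):
    return {
        "yourKey": {"N": str(hash_key)},
        "yourRange": {"N": str(range_key)},
        "yourJson": {"S": json.dumps(doc)},
    }

item = to_dynamo_item(42, 1326844800, {"feeling": "not surprised", "tags": ["a", "b"]})
print(item["yourJson"]["S"])  # the nested document round-trips intact
```

Only the two key attributes need the type syntax; the nested object is opaque to the database, which also means you can't query inside it.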

Pricing: $0.01 per hour for every 10 units of Write Capacity and $0.01 per hour for every 50 units of Read Capacity.

It is amusing that this was positioned to try to address SimpleDB's problem of "pricing complexity". These aren't those complicated "Machine Hours", they're "Capacity Units"!

As someone who has grappled in the past and is grappling again with the issues of pricing a database-as-a-service though, this is very much a non-trivial issue. If you have nice scaling characteristics and you want to charge the minimal price, your price isn't going to be predictable. Essentially, what Amazon have done here is to set a cap and then charge you the cap irrespective of actual usage, which seems to be the model that's winning on the cloud.

Capacity Units strike me as a big improvement (for the user) over Machine Hours, because it's very clear how a given usage pattern will translate into Capacity Units. I can predict how many Capacity Units I'll need. I've gotten badly burned over seemingly simple queries using unexpectedly high Machine Hours in SimpleDB.

My understanding is that DynamoDB's Capacity Units are just a query throttle, and you get charged based on the throttle you set, whether or not you use that capacity. It also looks like you can still have one query that consumes many, many Capacity Units (e.g. table scans).

SimpleDB's Machine hours are basically the same units, but without the throttle.

So, from a technical and value viewpoint, it's a huge step backwards (pay for capacity rather than for usage), but I'm learning that psychology is perhaps just as important here.

It seems like what you really want is a throttle with per-query charging, to cap your bill. Probably you'd much prefer not to be forced to pay your cap every month, but I don't think that's being offered.

(Edit: Downvotes? Am I wrong here? If so, please contribute to the discussion and tell me why!)

True, it's pay-for-capacity and that's worse for the user. On the flip side, the constant factor seems to be about 20x cheaper (caveat: this is based on my personal experience with SimpleDB; since Amazon doesn't seem to explain how "box usage" is computed, I don't know how broadly applicable my experience is).

The big plus for Capacity Units is that Amazon actually provides a deterministic model for figuring out what you'll be charged for a given query.

Ah - that is a fair point, transparency of the "unit". The "box usage" formula was reverse-engineered and shown to be fairly simple: http://www.daemonology.net/blog/2008-06-25-dissecting-simple...

This is a big step forward in transparency, although I would suggest that SimpleDB's pricing shouldn't have been obscured in the first place.

Presumably if your traffic is predictable you could reduce the cap during quieter periods.

True, but I then see this as a step backwards vs SimpleDB's pricing model.

I think this is huge. My first wish, before secondary indexes, is that they don't round up to the nearest 1KB for pricing.

noted. (both requests that is).

With regard to reliability, how safe would a service like this be to store your data? They mention "back up to S3" but this sounds more like archiving. I'm wondering about backups in the case of a problem - or is the data so spread across all of their physical locations that data loss is impossible?

SSDs are pretty safe in general. Amazon also replicates your data. It's as safe as, if not safer than, other NoSQL solutions - definitely safer than RAM-based NoSQL approaches.

The "back up to S3" part is for archiving data - periodic snapshot backups so that you can get back old data in case it is deleted by an app or a human.

Fast, Consistent, Secure & Replicated.

Game changed.

Not really, if you pay attention to the set of NoSQL DBs already available. They are just hosting their own now.

Would be interested in details about any with Strong Consistency, decent replication across fault areas, which get the same level of performance (single digit ms) at a cost equivalent to dynamodb.

Quite a lot can be done with smart use of hugely scalable simple key-value maps.

Quality hosting/management is a game changer in my book.

"They are just hosting their own now."

^^ There's your game changer.

Which game?

I'm too scared of the lock-in. Afaik this isn't like EC2, where you can move a VM off the EC2 stack. The data is stored in a proprietary format, so once you're in, you're in.

Game changer? I need to see some evidence first on how well it performs and integrates.

It's a database, so you can always get your data out again.

It's a non-standard (NoSQL) database. Even SQL databases, despite years of standardization and efforts at poaching each other's customers, still have rough edges that make moving between different SQL database products non-trivial. Just because you can get your data out doesn't mean you're not locked in to all of SimpleDB/DynamoDB's quirks, of which there will be a lot more because it's not following a standard approach. Your code will have to go through contortions to work around / adapt to the SimpleDB limitations; effort which might well be wasted or counter-productive on a different system. That's the lock-in with AWS in general: API lock-in, not data format lock-in.

DynamoDB and SimpleDB are also a lot simpler than SQL databases. As far as I can tell, DynamoDB is a key/value DB with support for ranges and MapReduce, and not dissimilar to other NoSQL databases like Riak.

There may be instances where large datasets are hard to migrate from DynamoDB, but overall it doesn't look to me like lock-in would be that much of a problem, assuming you have a decent abstraction layer.
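One way to build that abstraction layer: route all storage calls through a thin key-value interface, so swapping DynamoDB for Riak (or anything else) only means writing a new backend. The names here are illustrative, not from any real library:

```python
from abc import ABC, abstractmethod

class KeyValueStore(ABC):
    """Minimal storage interface the application codes against."""
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def put(self, key, value): ...

class InMemoryStore(KeyValueStore):
    """Stand-in backend; a DynamoDBStore or RiakStore would implement
    the same two methods against the respective client library."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value
```

The catch, as the parent notes, is that the interface tends to grow toward whatever the richest backend offers (range queries, conditional writes), and that is where the lock-in creeps back in.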

Very interesting that there's no mention of the CAP theorem here, despite Amazon's heavy reliance on it in the past when marketing their non-relational stuff.

I'm very curious about this too. I posted in another thread:

"Doesn't that just sweep the latency tradeoff under the rug, or is flash making up the difference? What about the availability tradeoff? (I like the formulation here, consistency vs availability and consistency vs latency, as opposed to CAP, which never made sense as a 3-way tradeoff: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-an...)"


The CAP tradeoff here boils down to you choosing whether or not you want consistent reads.
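Concretely, the choice surfaces as a per-request flag. A sketch of how a GetItem request might carry it, written as a plain request dict (the table and key names are placeholders; the `ConsistentRead` field matches the DynamoDB API, but this helper is mine):

```python
def get_item_request(table, key, consistent=False):
    """Build a GetItem-style request body.

    consistent=False -> eventually consistent read (cheaper, may lag)
    consistent=True  -> strongly consistent read
    """
    return {
        "TableName": table,
        "Key": key,
        "ConsistentRead": consistent,
    }

# e.g. get_item_request("users", {"id": {"S": "42"}}, consistent=True)
```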

I hope this means an SSD option for RDS is coming soon.

I wrote an Erlang API for DDB while it was still private. I'm hoping my client will let me opensource it in the near future.

Why is there a 1 MB response limit on BatchGets but apparently no limit on Gets? Did anybody find limits on a single item?

Items are limited to 64 KB, so it's implicit.

Any word on when boto will support this?

Is there a Ruby SDK?

Yes (which I find mildly surprising given how long the Ruby library took to support batch ops on SQS)


Trying to read through all the hype, this looks like Riak hosted on machines with SSDs, with fewer features, and a nice billing system in front of it.

Of course for people who want a hosted solution, the fact that it is hosted, is what gives it a lot of value. There haven't been a lot of hosted NoSQL databases, at least on this scale and availability, out there.

But technologically, what's new here? Is there anything here that is really innovative? (not a sarcastic question)

As you and others have pointed out, hosting + SSDs + synchronous replication across availability zones counts for a lot. If DynamoDB lives up to the hype, it could be a huge step forward in the world of "don't have to think about it" data storage.

DynamoDB does have at least one significant feature not provided by Riak -- range scans. This makes many common access patterns much easier to implement efficiently. Still, as you suggest, there don't appear to be any fundamental technical advances here. The advances are in the service model and operation.

And there are, of course, many limitations. (Just to name a few: items -- i.e. rows -- can't exceed 64K; queries don't use consistent reads; seemingly no atomic update of multiple items.) It's miles ahead of SimpleDB, but still not nearly as flexible as many of the existing NoSQL databases. If Amazon lives up to past performance, they'll make steady improvements, but slowly.

you can choose whether queries are consistent or eventually consistent.

>DynamoDB does have at least one significant feature not provided by Riak -- range scans.

Riak has the ability to select keys for processing via various queries, including range of the key.

Riak also has secondary indexes and full text search.

If there's something significant about the DynamoDB method of doing range scans I'm interested in hearing it. My purpose here isn't so much to bash DynamoDB (in fact, I don't want to do that at all) but to try and spread a little more awareness of Riak.

Riak really came into its own in 1.0.

> Riak has the ability to select keys for processing via various queries, including range of the key.

Based on the resources I can find online, any select-by-range operation in Riak requires broadcasting to all nodes (or at least enough nodes to hit at least one replica of each record), and then performing a scatter-gather operation to fetch the matching records. There also doesn't seem to be any way to specify a sorting order. This is not quite what I would call a range scan: while useful, it presumably doesn't have the same cost or scaling characteristics. It's the difference between scanning a block of data that is stored contiguously, and filtering through an entire table to identify records meeting a criterion which happens to take the form "a <= value <= b".

This is not to diss Riak, which is a nice piece of work and does many things that DynamoDB doesn't.
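The distinction can be made concrete with a toy model (illustrative only, not real DynamoDB or Riak code): when items in a partition are kept sorted by range key, a range query is a contiguous slice; without that ordering, you must touch every record.

```python
import bisect

def range_scan(items, lo, hi):
    """Contiguous slice of a partition pre-sorted by key -- O(log n + k)."""
    keys = [k for k, _ in items]
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return items[i:j]

def full_filter(items, lo, hi):
    """Scatter-gather style: examine every record -- O(n)."""
    return [(k, v) for k, v in items if lo <= k <= hi]

partition = [(1, "a"), (3, "b"), (7, "c"), (9, "d")]  # sorted by range key
# Both return [(3, "b"), (7, "c")], but at very different cost at scale.
```

Both approaches give the same answer; the difference is that the first reads only the matching block, while the second has the whole-table cost profile described above.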

For one, DynamoDB responds quicker than Riak: Riak has (cold) response times of about 300 ms, while this service claims single-digit-ms response times. Also, setting up Riak is not exactly trivial, and using this service outsources that hassle.

You can use Riak SmartMachines on Joyent's cloud and get similar performance for an order of magnitude less than what Amazon is charging. If you are seeing 300 ms response times, you are not using SSDs, and you are not using a number of nodes comparable to what Amazon is charging you for.

I'm not hating on Amazon, it is a good move for them and they are doing some things that Riak cannot do, but cost and response is not one of them.

I wouldn't bet on Joyent's SmartMachines being either an order of magnitude cheaper or as fast. On the speed: we run a Riak cluster, and the actual response times we get from Riak are, as described, slower than what Amazon promises in its docs.

On the price: three 16 GB machines with Joyent would cost you $1,400/month. You can get a lot of resources for that with this new AWS service.

I don't have any experience with either Joyent or this new DynamoDB, but I do have some experience with Riak, and from the docs this new service would be a very viable competitor.

Riak has many shortcomings, but I wouldn't describe latency or installation as primary concerns. Our cold response times have a 99% bound of 8ms, and median of 5ms on commodity SSDs. Installation is handled by apt and is trivial to automate.

What are some of Riak's shortcomings?

Could you clarify what you mean by "cold response time of 300ms"? Cold as in requesting data that hasn't yet been cached in RAM? How good does it get once the cache is warm?

Yes, that's what I meant by 'cold'. For recently requested data that is cached in RAM, the response can be as quick as 3 ms.

Thanks, that's useful to know.

If anything, from my experience with Riak, Basho guys should be having an emergency meeting.

Not mentioned here - DynamoDB also has built in monitoring and management.

What has been your experience with Riak?

Riak is based on Dynamo, and the original paper by Werner (if I recall correctly?). This is just offering up an easy to use cloud service version of what's backing S3 already?

Also it is spread across AZs which is handy.
