I read with interest Steve Yegge's mistakenly leaked email about the oppressive conditions for engineers at Amazon. It's hard to reconcile with the sort of innovation they consistently show.
My take on Steve's rant (as an engineer at Amazon) is that a lot of the issues he pointed out are legitimate, but at an entirely different scale than he was pitching them at.
Day to day I work on a product with another couple dozen or so engineers. We build what makes sense for our product, and for the most part we build it in a way that makes sense for us. Sometimes we are under pressure to leverage other parts of the platform, and sometimes that does entail a lot more work. Most of the time, though, it ends up reducing our operational load (because the systems we depend on support products much larger than ours :) and giving us someone we can page when things go pear-shaped.
Amazon isn't the perfect place to work, but it's generally not bad (other than the frugality thing; that sucks as an employee no matter which way you slice it).
Table Storage does not allow any indexes other than the primary one (Partition Key plus Row Key). You also cannot store complex objects within fields and use them in a query; you basically just serialize the data and stuff it into the field.
The dynamic schema is very nice if you can leverage it, but the actual query support is TERRIBLE. (Sorry Microsoft, I'm a fanboy, but you blew it here.) There is no Order By or even Count support, which makes a lot of things very difficult. Want to know how many "color=green" records there are? Guess what: you're going to retrieve all those rows and count them yourself. They're starting to listen to the community and have just recently introduced upserts and projection (select). I would love to see them adopt something like MongoDB instead :)
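For illustration, the client-side count ends up looking something like this (query_entities is a hypothetical helper standing in for whatever client library you use, not a real SDK call):

    # Sketch only: query_entities is a made-up helper, assumed to follow
    # the service's continuation tokens and yield one page of matching
    # entities at a time.
    def count_matching(table_name, filter_expr):
        total = 0
        for page in query_entities(table_name, filter=filter_expr):
            total += len(page)   # every matching row crosses the wire
        return total

    count_matching("Products", "color eq 'green'")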
For more issues check out: http://www.mygreatwindowsazureidea.com/forums/34192-windows-...
Edit - For what it's worth, we've moved more things to SQL Azure now that it has Federation support: scalability with the power of SQL. http://blogs.msdn.com/b/windowsazure/archive/2011/12/13/buil...
Your partition keys can be composite; have a look here:
I agree with your other pain points in terms of not being able to get counts, secondary indices, etc. However, you can easily simulate some of those by maintaining your own summary tables, indices, and so on (see the sketch below). These ought to emerge as platform features pretty soon, though. It's not perfect, but its feature set is close to Dynamo's.
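For example, simulating a count could look something like this sketch (all helper names are made up; a real version would also need the service's ETag-based optimistic concurrency to handle the read-modify-write race):

    # Made-up client API; the point is the write-time pattern, not the SDK.
    def insert_with_count(table, entity):
        table.insert("Items", entity)
        # One summary row per distinct color, bumped on every insert, so
        # "how many greens?" later becomes a single point read.
        key = {"PartitionKey": "counts", "RowKey": "color=" + entity["color"]}
        summary = table.get("Summary", key) or {"count": 0}
        summary["count"] += 1
        table.upsert("Summary", key, summary)  # upsert was recently added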
As for MongoDB, I guess this service has been built from the ground up to provide the availability guarantees and automatic partition management features; I don't know if Mongo provides those. You could run Mongo yourself on Azure if you wanted to; a supported solution was even released recently.
• Both are NoSQL schema-less table stores, where a table can store entities with completely different properties
• Both have a two-attribute (property) composite primary key: one property is used for partitioning, and the other optimizes range-based operations within a partition (see the sketch after this list)
• Both have just a single index, based on their composite primary key
• Both are built for effectively unlimited table size, seamlessly auto-scaling out with hands-off management
• Similar CRUD operations
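To make the composite key concrete, here is how such a table could be declared on the DynamoDB side with the boto3 SDK (table and attribute names are invented for illustration):

    import boto3

    dynamodb = boto3.client("dynamodb")
    dynamodb.create_table(
        TableName="Events",  # hypothetical table
        KeySchema=[
            {"AttributeName": "DeviceId", "KeyType": "HASH"},    # partitioning
            {"AttributeName": "Timestamp", "KeyType": "RANGE"},  # in-partition ordering
        ],
        AttributeDefinitions=[
            {"AttributeName": "DeviceId", "AttributeType": "S"},
            {"AttributeName": "Timestamp", "AttributeType": "N"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    )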
How Windows Azure Tables is implemented can be found in this SOSP paper and talk:
As mentioned by someone else, one difference is that DynamoDB stores its data entirely on SSDs, whereas in Azure Storage our writes are committed via journaling (to either an SSD or a dedicated journal drive) and reads are served from disk, or from memory if the data page is cached. Therefore, the latency for small single-entity writes is typically below 10ms due to our journaling approach (described in the above SOSP paper), and single-entity read times for small entities are typically under 40ms, as shown in the results here:
Once in a while we see someone reporting 100+ms latencies for small single-entity reads, and that is usually because they need to turn Nagle off, as described here:
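For reference, turning Nagle off at the socket level looks like this in Python; in .NET the equivalent switch is ServicePointManager.UseNagleAlgorithm = false:

    import socket

    # Nagle's algorithm batches small packets, which adds latency to the
    # small single-entity requests discussed above; TCP_NODELAY disables it.
    # "myaccount" is a placeholder host.
    sock = socket.create_connection(("myaccount.table.core.windows.net", 443))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)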
Reads per $0.01 = 50 × 60 × 60 = 180,000
Writes per $0.01 = 10 × 60 × 60 = 36,000
Assuming your usage is at 100% of capacity, then from a read perspective DynamoDB is half the price. Writes are much more expensive, but many applications are heavily read-oriented.
DynamoDB claims single-digit-millisecond reads; Azure Tables does not (in my experience).
Azure Tables has a maximum throughput of 500 requests per second over a given table partition and 5,000 requests per second over the whole account. DynamoDB does not state such a limit.
To put this into context:
Assume a system with 5,000 writes per second and 50,000 reads per second; here are the costs:
AWS Reads: $240
AWS Writes: $120
AWS Total: $360
Azure Reads: $4320
Azure Writes: $432
Azure Total: $4752
Seems like quite a difference for a decent sized read heavy application.
I agree that Dynamo's provisioned throughput capacity is a very useful feature, though. Azure does not provide any such performance guarantee; the throughput limit is also a guideline as far as I know, not an absolute barrier.
Azure:
5,000 x 60 x 60 x 24 = 432,000,000 writes
50,000 x 60 x 60 x 24 = 4,320,000,000 reads
(432,000,000 / 10,000) x $0.01 = $432
(4,320,000,000 / 10,000) x $0.01 = $4,320
Azure Total Cost For One Day's Use: $4,752
AWS:
((5,000 / 10) x $0.01) x 24 = $120
((50,000 / 50) x $0.01) x 24 = $240
AWS Total Cost For One Day's Use: $360
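The same arithmetic as a script, under the pricing assumptions used above ($0.01 per 10,000 Azure transactions; $0.01/hour per 10 DynamoDB write units or 50 read units):

    writes_per_sec, reads_per_sec = 5_000, 50_000
    day = 60 * 60 * 24

    # Azure bills per transaction; AWS bills per provisioned unit-hour.
    azure_writes = writes_per_sec * day / 10_000 * 0.01   # $432.00
    azure_reads  = reads_per_sec  * day / 10_000 * 0.01   # $4,320.00
    aws_writes   = writes_per_sec / 10 * 0.01 * 24        # $120.00
    aws_reads    = reads_per_sec  / 50 * 0.01 * 24        # $240.00

    print(f"Azure: ${azure_writes + azure_reads:,.0f}")   # Azure: $4,752
    print(f"AWS:   ${aws_writes + aws_reads:,.0f}")       # AWS:   $360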
You are right that I don't take into account Azure's batch feature for reads and writes, but that is because batch requests are only possible within a single partition at a time, which in my personal experience (not exhaustive) is non-trivial to take advantage of.
If your transactions are all within 1KB, your math holds; otherwise, you pay more. Interesting model, but I suspect it'll average out to similar costs.
For the cost of storage: the base price for Windows Azure Tables is $0.14/GB/month, and the base price for DynamoDB is $1.00/GB/month.
For transactions, there is the following tradeoff:
• DynamoDB is cheaper if the application performs operations mainly on small items (a couple of KB in size) and can't benefit from the batch or query operations that Windows Azure Tables provides
• Windows Azure Tables is cheaper for larger entities, when batch transactions are used, or when range queries are used
The following shows the cost per hour of writing or reading 1 million entities per hour (277.78 per second) for different entity sizes (1KB vs. 64KB). It also includes the cost difference between strong and eventually consistent reads for DynamoDB. Note that Windows Azure Tables allows batch operations and queries over many entities at once, at a discounted price.
• 1KB single entity writes -- Azure=$1 and DynamoDB=$0.28
• 64KB single entity writes -- Azure=$1 and DynamoDB=$17.78
• 1KB batch writes (with batch size of 100 entities) -- Azure=$0.01 and DynamoDB=$0.28
• 64KB batch writes (with batch size of 100 entities) -- Azure=$0.01 and DynamoDB=$17.78
• 1KB strong consistency reads -- Azure=$1 and DynamoDB=$0.05
• 64KB strong consistency reads -- Azure=$1 and DynamoDB=$3.54
• 1KB strong consistency reads via query/scan (assuming 50 entities returned on each request) -- Azure=$0.02 and DynamoDB=$0.05
• 64KB strong consistency reads via query/scan (assuming 50 entities returned on each request) -- Azure=$0.02 and DynamoDB=$3.54
• 1KB eventual consistency reads -- DynamoDB=$0.028
• 64KB eventual consistency reads -- DynamoDB=$1.77
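Here is a sketch of the model behind those numbers (my own reconstruction from the prices quoted in this thread; the small discrepancies are rounding):

    import math

    ENTITIES_PER_HOUR = 1_000_000

    def azure_cost(batch_size=1):
        # $0.01 per 10,000 transactions; a 100-entity batch or a 50-entity
        # query page counts as a single transaction.
        return ENTITIES_PER_HOUR / batch_size / 10_000 * 0.01

    def dynamo_cost(entity_kb, units_per_cent, eventual=False):
        # One capacity unit moves one <=1KB item per second; $0.01/hour buys
        # 10 write units or 50 read units; eventual consistency halves cost.
        units = ENTITIES_PER_HOUR / 3600 * math.ceil(entity_kb)
        cost = units / units_per_cent * 0.01
        return cost / 2 if eventual else cost

    azure_cost()                       # 1.00  -> "Azure=$1"
    azure_cost(batch_size=100)         # 0.01  -> "Azure=$0.01"
    dynamo_cost(64, 10)                # 17.78 -> "DynamoDB=$17.78"
    dynamo_cost(1, 50, eventual=True)  # 0.028 -> "DynamoDB=$0.028"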
If your items are less than 1KB in size, then each unit of Read Capacity will give you 1 read/second of capacity and each unit of Write Capacity will give you 1 write/second of capacity. For example, if your items are 512 bytes and you need to read 100 items per second from your table, then you need to provision 100 units of Read Capacity.
Looks like 1KB is the minimum for calculations.
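So the provisioning calculation is presumably something like:

    import math

    def read_capacity_units(item_bytes, reads_per_sec):
        # Each unit is one <=1KB read per second; larger items cost one unit
        # per rounded-up KB, with 1KB as the floor per the quote above.
        return reads_per_sec * max(1, math.ceil(item_bytes / 1024))

    read_capacity_units(512, 100)    # 100, matching the quoted example
    read_capacity_units(1500, 100)   # 200: 1.5KB rounds up to 2 units/read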
This is currently an undocumented benefit of the query operation, but we will be adding that to our documentation shortly.
Now, I'm not sure this would be the ideal solution for such a thing (in fact it probably is not), but it's just the first thing that came to mind. In the grand scheme of things, sure, that may seem like a trivial amount given the use case, but we're still more in the realm of a startup, where dropping ~$3k/month on the data store alone makes me cringe a little when we have other expenses to account for as well. :)
Consider this cost relative to the cost of a trustworthy ops person, plus the capex & opex of running your own reliable & scalable DB.
Interesting that when I signed up for the service, they verified my identity with a cellphone call, like Google sometimes does.
My guess is they will add options for additional indexes in the future... everyone needs to start somewhere, even at Amazon's size and scale.
At Amazon's size and scale, it's all the more important that you start with something simple with well-understood performance characteristics from day one. AWS doesn't really get a grace period during which they get to fix scalability problems.
AppEngine went the other route and provided a very simple database API at first and all queries had to be range scans over an index. Any query you wanted to perform had to be precalculated by defining a composite index and some things (like inequalities on multiple fields) weren't supported. Over time they've built upon their basic database and added features such as a zigzag merge join algorithm which lets you perform queries that were otherwise impossible with a given set of indexes.
I bet DynamoDB will be going the AppEngine route by starting with a simple, scalable base which can be used to build more advanced query engines and features.
You could buy a lot of Riak or Cassie for that.
This actually hits on my only real issue with AWS in general, which is the hourly billing. We've used the mapreduce service a bit, and having a cluster roll up and fail on the first job is heartbreaking when it has a hundred machine hours associated with it. Obviously that is far, far cheaper than us building out our own cluster (especially with spot instances, which I can't even describe how much I love), but for some of the services smaller billing times would be useful.
The site is also able to compare many other different databases.
It's a fantastic product and even while in private beta has been stable and well supported.
They are making money on the reads/writes and are selling the capacity at current cost, which, knowing AWS's tendency to decrease prices very slowly, combined with the huge decrease in SSD prices over the past months and years, is not a bad strategy to convince us to switch.
The composite key has two attributes, a “hash attribute” and a “range attribute.” You can do a range query within records that have the same hash attribute.
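For example, with the boto3 SDK a range query within one hash value looks roughly like this (table and attribute names are invented for illustration):

    import boto3
    from boto3.dynamodb.conditions import Key

    # All matching items share the hash attribute "DeviceId"; the range
    # attribute "Timestamp" supports the bounded scan within that partition.
    table = boto3.resource("dynamodb").Table("Events")
    resp = table.query(
        KeyConditionExpression=Key("DeviceId").eq("sensor-42")
                             & Key("Timestamp").between(1000, 2000)
    )
    items = resp["Items"]   # sorted by Timestamp within the hash value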
It would obviously be untenable if they spread records with the same hash attribute across many servers. You'd have a scatter-gather issue. Range queries would need to query all servers and pop the min from each until it's done, and that significantly taxes network IO.
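As a toy model of that coordinator-side merge (pure illustration, not Amazon's implementation):

    import heapq

    # Each node returns its locally sorted matches for the range; the
    # coordinator repeatedly pops the global minimum. Every node must be
    # consulted even if it contributes nothing, which is the IO tax above.
    node_results = [
        [3, 9, 27],   # node A
        [1, 4],       # node B
        [2, 8, 20],   # node C
    ]
    merged = list(heapq.merge(*node_results))   # [1, 2, 3, 4, 8, 9, 20, 27]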
This implies that they try to keep the number of servers that host records for the same hash attribute to a minimum. Consequently, if you store too many documents with the same hash attribute, wouldn't you overburden that subset of servers, and thus see performance degradation?
Azure has similar functionality for their table service, requiring a partition key, and they explicitly say that there is a throughput limit for records with the same partition key. I haven't seen similar language from Amazon.
Whether you scatter-gather or try to cluster values to a small set of servers, you'll eventually degrade in performance. Does anyone have insight into Amazon's implementation?
In Composite Hash Key mode you also get limited range queries. These are supported by any DBMS that indexes its primary key via B-trees.
DynamoDB is cutting the features to the bone. This makes it easy to migrate out of DynamoDB and, at a first glance, hard to migrate into DynamoDB. Particularly hard to migrate from a RDBMS with complex schema/index/trigger support.
The beauty of it is that complex features can be built as libraries as the needs arise. The next five years look very exciting. Hats off to Amazon.
Edit: to clarify, I think that composite keys would make your problem limited to 256 distinct applications, not, for example, 256 customers, users, etc.
I already have a bunch of (large-ish, deeply nested) JSON objects defined for my application. I don't really want to go about redefining these, since they work great between my various Node processes and the front end. I am already saving them in a NoSQL database; I am curious about switching (to save on devops costs). I only request based on one Hash Key (int) and one Range Key (int) for all my current get operations.
Looking through the docs/examples I see a lot of this type of thing:
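(For context, the typed attribute notation being asked about looks something like this; an illustrative item of my own, not copied from the docs:)

    {"Id": {"N": "1"}, "Title": {"S": "my title"}, "Tags": {"SS": ["a", "b"]}}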
Ok, my question:
Do my JSON objects need to conform to this 'type syntax' JSON notation in the examples? Or can I save just any JSON object into this database and only annotate the Hash Key and Range Key using this special notation?
You can define a table with 3 fields: yourKey, yourRange, and yourJson. Put your entire JSON data as a string in the yourJson field.
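Something like this with boto3 (field names from the suggestion above; the blob just has to fit within the 64KB item limit):

    import json
    import boto3

    table = boto3.resource("dynamodb").Table("MyObjects")  # hypothetical table
    obj = {"user": {"id": 7, "tags": ["a", "b"]}}          # nested JSON

    table.put_item(Item={
        "yourKey": 7,                  # hash key
        "yourRange": 1326870000,       # range key
        "yourJson": json.dumps(obj),   # opaque string; not queryable server-side
    })

    # Reading it back:
    item = table.get_item(Key={"yourKey": 7, "yourRange": 1326870000})["Item"]
    restored = json.loads(item["yourJson"])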
As someone who has grappled in the past and is grappling again with the issues of pricing a database-as-a-service though, this is very much a non-trivial issue. If you have nice scaling characteristics and you want to charge the minimal price, your price isn't going to be predictable. Essentially, what Amazon have done here is to set a cap and then charge you the cap irrespective of actual usage, which seems to be the model that's winning on the cloud.
SimpleDB's Machine hours are basically the same units, but without the throttle.
So, from a technical and value viewpoint, it's a huge step backwards (pay for capacity rather than for usage), but I'm learning that psychology is perhaps just as important here.
It seems like what you really want is a throttle with per-query charging, to cap your bill. Probably you'd much prefer not to be forced to pay your cap every month, but I don't think that's being offered.
(Edit: Downvotes? Am I wrong here? If so, please contribute to the discussion and tell me why!)
The big plus for Capacity Units is that Amazon actually provides a deterministic model for figuring out what you'll be charged for a given query.
This is a big step forward in transparency, although I would suggest that SimpleDB's pricing shouldn't have been obscured in the first place.
The "back up to s3" part is for archiving data, like periodic snapshot backup of your data so that you can get back the old data in case they are deleted by app/human.
Quite a lot can be done with smart use of hugely scalable simple key-value maps.
^^ There's your game changer.
Game changer? I need to see some evidence first on how well it performs and integrates.
There may be instances where large datasets are hard to migrate from DynamoDB, but overall it doesn't look to me like lock-in would be that much of a problem, assuming you have a decent abstraction layer.
"Doesn't that just sweep the latency tradeoff under the rug, or is flash making up the difference? What about the availability tradeoff? (I like the formulation here, consistency vs availability and consistency vs latency, as opposed to CAP which never made sense as a 3-way tradeoff: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-an...)
Of course, for people who want a hosted solution, the fact that it is hosted is what gives it a lot of value. There haven't been many hosted NoSQL databases out there, at least at this scale and availability.
But technologically, what's new here? Is there anything here that is really innovative? (not a sarcastic question)
DynamoDB does have at least one significant feature not provided by Riak -- range scans. This makes many common access patterns much easier to implement efficiently. Still, as you suggest, there don't appear to be any fundamental technical advances here. The advances are in the service model and operation.
And there are, of course, many limitations. (Just to name a few: items -- i.e. rows -- can't exceed 64K; queries don't use consistent reads; seemingly no atomic update of multiple items.) It's miles ahead of SimpleDB, but still not nearly as flexible as many of the existing NoSQL databases. If Amazon lives up to past performance, they'll make steady improvements, but slowly.
Riak has the ability to select keys for processing via various queries, including range of the key.
Riak also has secondary indexes and full text search.
If there's something significant about the DynamoDB method of doing range scans I'm interested in hearing it. My purpose here isn't so much to bash DynamoDB (in fact, I don't want to do that at all) but to try and spread a little more awareness of Riak.
Riak really came into its own in 1.0.
Based on the resources I can find online, any select-by-range operation in Riak requires broadcasting to all nodes (or at least enough nodes to hit at least one replica of each record), and then performing a scatter-gather operation to fetch the matching records. There also doesn't seem to be any way to specify a sorting order. This is not quite what I would call a range scan: while useful, it presumably doesn't have the same cost or scaling characteristics. It's the difference between scanning a block of data that is stored contiguously, and filtering through an entire table to identify records meeting a criterion which happens to take the form "a <= value <= b".
This is not to diss Riak, which is a nice piece of work and does many things that DynamoDB doesn't.
I'm not hating on Amazon; it is a good move for them, and they are doing some things that Riak cannot do, but cost and response time are not among them.
On the price: if you were to get 3x16GB machines with Joyent, that would cost you $1,400/month. You can get a lot of resources for that with this new AWS service.
I don't have any experience with either Joyent or this new DynamoDB, but I do have some experience with Riak, and from the docs this new service would be a very viable competitor.
Not mentioned here: DynamoDB also has built-in monitoring and management.