
DocumentDB - daigoba66
http://azure.microsoft.com/en-us/documentation/services/documentdb/
======
MichaelGG
I wonder what the search story is. One technology that really does deliver,
and has totally impressed me, is Lucene/ElasticSearch. I'm used to all sorts
of hyperbolic claims, but holy shit, ElasticSearch just delivers. We tossed in
about 40M documents from a SQL Server DB, and not only did it require less
resources (a 30%? reduction in size), the queries are beyond anything that'd
be approachable using SQL Server. And I've only touched the surface, using it
as a pure plug-n-play setup.

With DocumentDB, not having a local version severely limits what I'd consider
this for. Losing that flexibility is a big deal. Maybe this is just a limited
preview and they haven't built the management side for local installs.

~~~
nemothekid
I doubt MS will ever release a non-hosted version. I see this product as a
competitive offering to Amazon's Dynamo and Google's BigQuery on AppEngine.

~~~
spicyj
BigQuery isn't really in the same space; it's meant for analytics queries
where speed isn't as important (queries take seconds or minutes). Google Cloud
Datastore is a better parallel.

~~~
richardw
Search API. Datastore isn't really aimed at search.

[https://developers.google.com/appengine/docs/python/search/](https://developers.google.com/appengine/docs/python/search/)

~~~
Aardshaark
The Search API is pretty much for searching text, whereas this is a database.

Also the Search API is terrible.

~~~
richardw
While I agree the Search API isn't great, it searches text, numbers, dates and
geopoints. That's good enough for most uses.

------
ceejayoz
[http://azure.microsoft.com/en-
us/documentation/articles/docu...](http://azure.microsoft.com/en-
us/documentation/articles/documentdb-introduction/)

> Want to edit or suggest changes to this content? You can edit and submit
> changes to this article using GitHub.

Pretty remarkable given Microsoft's approach to open source in the 1990s that
they're now using a service built around Linus's bespoke open source version
control system to allow people to suggest changes to their documentation.

~~~
pjmlp
Microsoft is the new IBM. :)

~~~
scholia
Nothing remotely like it, in terms of power, range, degree of market control,
or willingness to be commercially evil. IBM used to be more than twice as big
as every other IT company put together, and in some years, made more than 100%
of the industry's profit.

Even today, after disposing of numerous divisions, it's still a $100bn
company, ie still larger than Microsoft. (So you pay IBM more than you pay MS
every year, and you always have done.)

However, it is true that Microsoft is the IBM of PC software, just as Google
is the IBM of the web, Intel is the IBM of processors, Facebook is the IBM of
social networking, Cisco is the IBM of routers, Oracle is trying to be the IBM
of server software, Apple is the IBM of hipster status symbols... Well, you
get the idea.

IBM used to be the IBM of _everything_ in IT, and started getting into other
areas (telephone switches, copiers, cash machines etc) before the wheels
started to come off.

Today, Google is a lot more powerful than Microsoft, much scarier, and much
more ambitious in IBM-like ways (self-driving cars, robots, superbrains etc).

~~~
pjmlp
Calm down. I am pretty aware of it.

I lived through it.

My only point was about how IBM shifted its focus to other areas, like
consulting, to stay relevant in the PC world.

------
jpalomaki
Ad hoc queries using SQL like syntax. No need to define indexes.

Javascript execution within database. Stored procedures, triggers and
functions can be written with Javascript. "All JavaScript logic is executed
within an ambient ACID transaction with snapshot isolation. During the course
of its execution, if the JavaScript throws an exception, then the entire
transaction is aborted."
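
To make the quoted abort-on-exception behaviour concrete, here is a toy model in Python. It is not the DocumentDB API (which runs JavaScript server-side); `run_in_transaction`, `transfer` and the in-memory `store` are hypothetical names, and the sketch only illustrates the all-or-nothing part, not snapshot isolation between concurrent callers.

```python
import copy

def run_in_transaction(store, procedure):
    """Run procedure against a private copy of store; commit only on success."""
    working = copy.deepcopy(store)
    procedure(working)  # any exception propagates here and store is untouched
    store.clear()
    store.update(working)

def transfer(docs):
    """Hypothetical multi-document update that may throw midway."""
    docs["a"]["balance"] -= 50
    if docs["a"]["balance"] < 0:
        raise ValueError("insufficient funds")
    docs["b"]["balance"] += 50

store = {"a": {"balance": 30}, "b": {"balance": 0}}
try:
    run_in_transaction(store, transfer)
except ValueError:
    pass  # the whole "transaction" was aborted

print(store)
```

After the failed call, `store` still holds the original balances, because the partial debit only ever touched the working copy.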

Pricing is based on "capacity units". Starts with $22.50 per month (this
includes 50% preview period discount). One capacity unit (CU) gives 10GB of
storage and can perform 2000 reads per second, 500 insert/replace/delete, 1000
simple queries returning one doc.

In order to see pricing details, change the region to "US West":
[http://azure.microsoft.com/en-
us/pricing/details/documentdb/](http://azure.microsoft.com/en-
us/pricing/details/documentdb/)

Very interesting addition to Microsoft's offering. I was actually just yesterday
wondering if they have any plans for this kind of service. Table Storage is
quite primitive and Azure SQL on the other hand gets expensive when you have
lots of data.

One potential "problem" with this is the bundling of storage capacity and
processing power. If I understand this correctly, I would need to buy 10 CUs
per month to store 100GB of data even if I'm not very actively using that
data.
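
A rough sketch of that arithmetic, using only the preview figures quoted above. It assumes CUs are bought in whole units and that one purchase has to cover both storage and read throughput; the function names are made up for illustration.

```python
import math

# Figures quoted in this comment (preview pricing, may well change).
GB_PER_CU = 10
PRICE_PER_CU_USD = 22.50       # per month, includes the 50% preview discount
READS_PER_SEC_PER_CU = 2000

def capacity_units_needed(storage_gb, reads_per_sec=0):
    """CUs needed to cover both storage and read throughput (assumed model)."""
    for_storage = math.ceil(storage_gb / GB_PER_CU)
    for_reads = math.ceil(reads_per_sec / READS_PER_SEC_PER_CU)
    return max(1, for_storage, for_reads)

def monthly_cost_usd(storage_gb, reads_per_sec=0):
    """Monthly bill for the CUs computed above."""
    return capacity_units_needed(storage_gb, reads_per_sec) * PRICE_PER_CU_USD

print(capacity_units_needed(100))  # 100GB of mostly idle data
print(monthly_cost_usd(100))
```

Under those assumptions, storage alone drives the bill for 100GB of cold data, even though the throughput that comes bundled with those CUs goes mostly unused.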

~~~
jnardiello
That is actually good to start with, yet scaling will cost A LOT.

------
streptomycin
_DocumentDB utilizes a highly concurrent, lock free, log structured indexing
technology to automatically index all document content. This enables rich
real-time queries without the need to specify schema hints, secondary indexes
or views._

How does that work? Isn't that going to incur a major performance hit? If not,
why don't other databases get rid of indexes?

Also, if anyone from MS is reading, [http://azure.microsoft.com/en-
us/documentation/articles/docu...](http://azure.microsoft.com/en-
us/documentation/articles/documentdb-introduction/) links to
[http://azure.microsoft.com/en-
us/documentation/articles/docu...](http://azure.microsoft.com/en-
us/documentation/articles/documentdb-concepts/) which is a 404 error.

~~~
andrea_s
One way to approach a behind the curtain automatic index generation would be
to gather data about collection usage, and use such data to enable targeted
indexes.

I don't think it would be feasible to have "indexes for everything", as the
numbers involved would scale geometrically.

~~~
arthurcolle
Forgive my ignorance but why would it scale geometrically? I tried to quickly
think of a complexity analysis and I don't see where you are coming from.

~~~
andrea_s
Well, ordering matters in indexes - and clearly building a dedicated index for
every field would not work, simplest queries aside.

So if I have three fields - a, b and c - and by hypothesis I want to map all
possible queries in a naive way, I will need to build, at the very least, the
following indexes:

[a, b, c] [a, c, b] [b, a, c] [b, c, a] [c, a, b] [c, b, a]

all of which represent a different and valid way of querying my data. Anything
less than this would leave some valid query uncovered by the indexes. These
are 6 indexes, which is 3! (factorial). Add a fourth field and we have 24
indexes, and so on. Hence, geometrical growth (to be fair, factorial growth is
even faster than geometrical).
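
A quick way to check the count is to enumerate the orderings. This is a minimal sketch of the naive scheme only (one composite index per full ordering of the fields); `full_orderings` is a made-up helper and says nothing about how DocumentDB actually indexes.

```python
from itertools import permutations
from math import factorial

def full_orderings(fields):
    """Every composite index covering all fields, one per ordering."""
    return list(permutations(fields))

for n in range(1, 6):
    fields = [chr(ord("a") + i) for i in range(n)]  # 'a', 'b', 'c', ...
    count = len(full_orderings(fields))
    assert count == factorial(n)  # the count is exactly n!
    print(n, count)
```

The count is n! for n fields, which outpaces any fixed-ratio (geometric) growth.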

Which is why I'm thinking that either (i) indexes get created automatically
based on the most frequent / heavy queries, or (ii) indexing works differently
for DocumentDB and they are actually able to map the document space in a more
efficient way (but I'd say that we lack the technical details to jump at this
conclusion, at the moment).

------
allegory
No local installation. No banana.

I wouldn't tie a product to a single cloud vendor.

~~~
mariusmg
Yes, because all products today are designed for multiple cloud vendors.
Come on.......

~~~
weego
As the only source of DocumentDB is as an instance within Azure you are
limited to that vendor. You can obviously take redis/mongo/couch and put them
on any 'cloud' provider's infrastructure as needs be.

Not sure if your interpretation of his point was wilfully incorrect for some
reason, but it was quite obvious what he meant.

------
bkeroack
...and MS goes after MongoDB. It would be nice to see an on-premises version,
if only to compare performance/consistency with Mongo.

~~~
reubenbond
DocDB is built atop a very well battle-tested distributed systems framework
with replication based on a multi-Paxos implementation. I haven't done any
benchmarks, but the replication model is far superior to MongoDB's "journal
journal" model for performance.
[http://daprlabs.com/blog/blog/2014/08/22/azure-
documentdb/](http://daprlabs.com/blog/blog/2014/08/22/azure-documentdb/)

~~~
dougws
Your objections to MongoDB's model seem reasonable, but I don't see any
evidence in either this comment or the linked blog post that DocumentDB is
better (especially in the absence of benchmarks). What is this "battle-tested
distributed systems framework"? Several of your complaints about MongoDB have
to do with the interaction between persistence to disk and replication to the
network; as the Multi-Paxos algorithm does not specify when data should be
written to disk (much less what the format should be), what reason is there to
believe that DocumentDB does this any better?

I'm totally willing to believe that DocumentDB beats the pants off MongoDB on
just about every axis (in fact, that seems pretty likely) but it's going to
take some actual numbers and a better description of the internals.

~~~
reubenbond
I agree with you - we need numbers before making that kind of conclusion and I
haven't run any benchmarks on the public version of DocDB. I'd like to see
someone measure MongoDB on Azure vs DocDB on Azure - even then it might not be
a fair measurement of db vs. db, since we don't know what machines DocDB is
hosted on.

All I can really say is that the replication model provides a significant
performance boost over MongoDB in the multiple replica (i.e., production)
scenario.

We were using MongoDB at Microsoft for a while (I left MS almost a year ago).
I was developing a real-time metrics system with it. It was very unstable at
our target load (500k increments per minute, high percentage of tomorrow's
documents preallocated the day before). We only managed maybe 10% of that with
MongoDB, IIRC. Sometimes it would choke and not come back until I restarted
the cluster (~30 machines total, I believe. 3 replicas * 10 shards).

We were so sure that MongoDB should be able to handle this scenario, since
they talk about it in their documentation. After talking with the MongoDB
devs, we came to the conclusion that even though we were issuing increment
operations on preallocated documents, MongoDB was:

a) using a global lock on the "local" db used for replication, and

b) "replicating via disk" instead of via the network. In other words,
replication requires writing to the journal journal before other members of
the replica set have a chance to apply the change and ack back. This results
in a loss of concurrency.

The lack of async query support in the C# driver didn't help either.

Eventually we used a replicated, write-back cache which sits atop the
framework DocDB uses. Not a fair comparison, but the goal was achieved easily
with 1/3rd the hardware. We just backed it onto Azure Table Storage. Our
queries were all range queries, which table storage supports.

I can't talk about the framework, unfortunately.

~~~
ddorian43
next time you need fast counters, try hypertable (non-reading increments)

~~~
reubenbond
It would still make sense to use the replicated write-back cache to avoid
trips to disk. We were considering replacing MongoDB with Cassandra, though.

I wanted to avoid having to deploy and maintain a database system, so using
table storage was a solid choice.

------
whalesalad
I liked everything about it until I saw the API for the Python client. What a
catastrophe.

I pray Microsoft is looking for Python developers:
[https://gist.github.com/whalesalad/2142f0075c6896f4547c](https://gist.github.com/whalesalad/2142f0075c6896f4547c)

~~~
ayrx
Wow, the function naming alone is terrible.

    
    
        for i in range(len(path_parts) - 1, -1, -1):
            if not path_parts[i] in resource_types and path_parts[i] in resource_tokens:
                return resource_tokens[path_parts[i]]
    

This bit of iteration is painful to look at as well.
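
For comparison, a sketch of how the same loop might read with `reversed()` and `not in`. The wrapper function name, its signature and the sample data are assumptions for illustration; only the three variable names come from the quoted snippet.

```python
def find_resource_token(path_parts, resource_types, resource_tokens):
    """Walk the path from the end and return the token for the first part
    that is not a resource type name and has an entry in resource_tokens.
    Returns None when no part qualifies."""
    for part in reversed(path_parts):
        if part not in resource_types and part in resource_tokens:
            return resource_tokens[part]
    return None

# Hypothetical sample data, not taken from the real client.
parts = ["dbs", "db1", "colls", "c1"]
types = {"dbs", "colls"}
tokens = {"db1": "token-db", "c1": "token-coll"}
print(find_resource_token(parts, types, tokens))
```

The behaviour matches the index-based loop: the last matching part wins.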

~~~
jkldotio
Some optimisations to get it to web-scale on TempleOS were obviously
necessary.

------
fineline
"All JavaScript logic is executed within an ambient ACID transaction with
snapshot isolation. During the course of its execution, if the JavaScript
throws an exception, then the entire transaction is aborted."

Have I missed something, or have MS delivered a novel and valuable feature?
I'm not aware of support for transactions across documents in other NoSQL
platforms. I'd be grateful if someone has any experience or better information
in that regard, thanks.

~~~
int64
You are missing AmisaDB. Supports Multistatement transactions.
[http://www.amisalabs.com/](http://www.amisalabs.com/)

~~~
tristanz
It appears that AmisaDB is in-memory only, only supports read committed
isolation, and has limited transactions semantics (no arbitrary JS). This is
pretty far from what DocumentDB seems to offer.

[http://www.amisalabs.com/AmisaDB_Docs.html](http://www.amisalabs.com/AmisaDB_Docs.html)

------
luuio
A quick comparison between DocumentDB vs MongoDB:
[http://daprlabs.com/blog/blog/2014/08/22/azure-
documentdb/](http://daprlabs.com/blog/blog/2014/08/22/azure-documentdb/)

~~~
spiderPig888
Wow, how'd you get one done so fast? Were you a private preview customer?

~~~
jeroen
OP did not write that. See here:
[https://news.ycombinator.com/item?id=8209525](https://news.ycombinator.com/item?id=8209525)

------
orand
If I understand correctly, their multi-document ACID transaction support is a
big deal. The only other NoSQL/NewSQL systems I'm aware of with that ability
are FoundationDB and Google Spanner/F1.

~~~
int64
So does AmisaDB. [http://www.amisalabs.com/](http://www.amisalabs.com/)

------
pokstad
Sounds very similar to CouchDB. Server side Javascript written by the user,
and an HTTP interface. The ability to adjust consistency is really neat.

~~~
ahoge
CouchDB doesn't support ad hoc queries. CouchDB is all about MapReduce and
heavy caching. It's very rigid.

------
lubos
What are the limits of DocumentDB? You know, like max size of database, max
size of document, max number of documents per database, max. number of
attributes per document, max. number of databases per DocumentDB account.

What's the max. duration of database query, max size of query result.

What kind of performance can be expected, does it decrease as the size of
database increases or it remains constant?

I'm going to wait a few days until hype settles.

~~~
majorsc2noob
Most of this has been documented since the start. Did you actually read the docs?

[http://azure.microsoft.com/en-
us/documentation/articles/docu...](http://azure.microsoft.com/en-
us/documentation/articles/documentdb-limits/)

~~~
lubos
I did read the documentation but couldn't find it. I'm still not sure how you
found it other than by going into their GitHub repository.

Great find though. Thanks.

------
jnardiello
And that is a creative product name. Well played MS.

~~~
programminggeek
It's really smart. It's a document database, so they call it DocumentDB. It's
better than calling it something like Dolphin and then explaining that it's a
document database. Simple and clear is a good thing.

~~~
tdicola
Wouldn't have surprised me to see it called something like Microsoft NoSQL
Database Cloud Edition for Agile Developers Powered By JavaScript 2014.

~~~
cledet
They haven't done this for a while.

------
mallipeddi
What are the size limits on a collection? Docs mention transaction support is
offered only within a collection. Is a collection essentially limited to a
single physical machine in the background or does it span across machines? It
looks like in Standard Preview, the max collection size is 10GB.

------
seanp2k2
Interesting: [https://github.com/Azure/azure-documentdb-
python](https://github.com/Azure/azure-documentdb-python) (it's empty for the
moment, but glad to see first-party support for Python)

~~~
luuio
The .NET & JS repos are empty, too. I think they just have not published the
code. It's still in Preview after all.

[https://github.com/Azure/azure-documentdb-
net](https://github.com/Azure/azure-documentdb-net)

------
reubenbond
The @DocumentDB twitter links to a tutorial on DocDB:
[http://www.documentdb.com/sql/tutorial](http://www.documentdb.com/sql/tutorial)

------
cvburgess
Does anyone know how this compares to AWS DynamoDB[1] ?

[1] [https://aws.amazon.com/dynamodb/](https://aws.amazon.com/dynamodb/)

~~~
ryanfitz
I've been using DynamoDB since it was released. I wrote a nodejs driver[1] and
a nodejs data mapper for it[2], so I have a decent bit of experience with it.
Browsing the DocumentDB docs the two services seem to be very different.
DynamoDB is really just a key value store with some very nice properties, but
also a lot of tradeoffs. One such tradeoff is querying data is very limited in
DynamoDB. In Dynamo you can only query data by its primary hash key and
optional range key. These keys you must specify upfront when you create your
table and cannot be changed afterwards.

DocumentDB seems much more similar to mongodb and appears to have a very
flexible query ability. In my opinion, one of the best features with DynamoDB
is you can tune the number of reads/writes each individual table requires.
This lets you scale up and down your database and greatly helps keep costs
down. This is a feature that only a hosted database service can offer. I
haven't yet read any pricing info on DocumentDB, but hopefully they offer a
similar feature as this is really where a hosted database service can shine.

[1][https://github.com/Wantworthy/dynode](https://github.com/Wantworthy/dynode)
[2][https://github.com/ryanfitz/vogels](https://github.com/ryanfitz/vogels)

~~~
cvburgess
Thanks, yeah I can see how this is more like hosted mongoDB - and I've seen
dynode on npm, nice work!

------
andrea_s
Am I alone in thinking that sql-like syntax is actually a step backwards from
building query documents programmatically (MongoDB style)?

~~~
reubenbond
You can still build documents programmatically. The SQL-esque syntax is
optional. The primary interface is HTTP.

~~~
ahoge
> _The SQL-esque syntax is optional. The primary interface is HTTP._

Then the introduction needs some work.

"DocumentDB enables complex ad hoc queries using the SQL dialect"

"Azure DocumentDB offers the following key capabilities and benefits: Ad hoc
queries with familiar SQL syntax"

The "Query DocumentDB" article also seems to be focused on that SQL dialect.

------
chippy
Spatial queries and indexing. Most data has some location component. I didn't
see anything with this. Is it in there, or planned?

------
petilon
So does it run on a cluster? If so which of Consistency, Availability and
Partition tolerance does it NOT offer? (See CAP theorem)?

~~~
falsestprophet
It supports tunable consistency

[http://azure.microsoft.com/en-
us/documentation/articles/docu...](http://azure.microsoft.com/en-
us/documentation/articles/documentdb-introduction/)

------
yxhuvud
It would have been nice to see some actual details of how it works so that it
can be compared to the competition.

~~~
orf
First link on the page: "introduction to DocumentDB".

[http://azure.microsoft.com/en-
us/documentation/articles/docu...](http://azure.microsoft.com/en-
us/documentation/articles/documentdb-introduction/)

------
poolpool
I wonder if this is built on JetDB

~~~
james2vegas
On Jet Blue (ESE)? It wouldn't be the first, RavenDB
[http://ravendb.net](http://ravendb.net) is based on ESE.

I remember a while back doing work on a system that (ab)used MS Exchange Server
5 as a database, mostly because of the Outlook integration.

------
utunga
Another case of Not Invented Here syndrome from Microsoft. One wonders why
they couldn't just take the open source and very well architected RavenDB
[http://ravendb.net](http://ravendb.net) .Net Document DB and provide first
class support for that within Azure.

~~~
Maarten88
RavenHQ was recently added as an Azure Add-On, you can use it now. But RavenHQ
seems much more expensive than DocumentDB (even when doubling the
"introduction pricing"), and it's not really integrated in Azure (I understand
that it runs on Amazon, in a US datacenter only)

But it will be very interesting to compare features, Raven really has a lot of
advanced features.

------
gamesbrainiac
I find it surprising that DocumentDB wasn't already a copyrighted name. ;)

------
talles
Can I use DocumentDB out of Azure (hook my own)?

~~~
twotwotwo
You know, it's not a simple question, but there's some case for Microsoft just
up and releasing some scale-out DB for free (as in beer) or at least cheap.

My thinking is, scale-out deployments aren't all that likely to pay at SQL-
Server-like rates per CPU anyway, and you're helping give the Windows world
better parity with the horde of Linux options for folks that need scale-out
(or, just as commonly, hope to need it someday).

On the flipside, by doing that you might forgo some sales of SQL Server
(though I suspect that's limited; most folks that need SQL Server really _need
SQL Server_ ) or sales on Azure that could theoretically help recover the dev
costs. But greatly improving the scale-out-on-Windows story seems like a big
deal, the kind of thing that might justify going to lots of effort to make a
DB product then giving it away.

~~~
aikah
> You know, it's not a simple question

Yes it is a simple question, and the answer is NO.

------
sarciszewski
Leave it to Microsoft to give it the most generic sounding name possible.

~~~
josteink
Yeah. Instead of giving it a reasonable, obvious name they should have gone
with something hipstery and vague instead, like UberDcmntor.io.

No thanks. I think this will do just fine.

~~~
rtkwe
No no no, it also has to be just one common word so that it's completely
ungoogleable by itself.

------
nandkishiee
Sick! Love it

------
cbsmith
'cause what the world needs is another proprietary NoSQL solution.

~~~
PerfectElement
As long as it offers something new, I don't see anything wrong with that.

I've been working with MongoDB for the last year and DocumentDB features look
very interesting to me.

~~~
cbsmith
Yeah, once you go down that road for a while, you'll understand how
proprietary NoSQL is a giant PITA.

------
hackerkushal
THIS THING IS A BEAST!! It is absolutely bad ass

------
Nux
A new "cool", locked-in service served on a silver platter by Microsoft to the
brainwashed.

Everybody else uses open source on premises or their cloud of choice.

