Hacker News
DocumentDB (microsoft.com)
300 points by daigoba66 on Aug 21, 2014 | 164 comments



I wonder what the search story is. One technology that really does deliver, and has totally impressed me, is Lucene/ElasticSearch. I'm used to all sorts of hyperbolic claims, but holy shit, ElasticSearch just delivers. We tossed in about 40M documents from a SQL Server DB, and not only did it require less resources (a 30%? reduction in size), the queries are beyond anything that'd be approachable using SQL Server. And I've only touched the surface, using it as a pure plug-n-play setup.

With DocumentDB, not having a local version severely limits what I'd consider this for. Losing that flexibility is a big deal. Maybe this is just a limited preview and they haven't built the management side for local installs.


Hi MichaelGG,

I am a Program Manager for Azure Search and as curiousDog mentioned, yes, along with DocumentDB we also announced Azure Search, which is a PaaS-based full text search service. We actually leverage ElasticSearch at the core of this service and, as chippy says about spatial search, we do have the ability to provide a pretty solid geo-spatial search capability thanks to ElasticSearch and Lucene. To nemothekids's point, it is very unlikely that we will offer this as a local (non-hosted) version because we found that although ElasticSearch is awesome, one of the common complaints many admins have is the complexity around managing systems such as ElasticSearch/SOLR/Lucene at high scale and how difficult it is to implement more advanced search capabilities such as tuning and relevancy. Those are areas where we think we can add a lot of value by being a fully managed service. Longer term, we think this will also allow us to bring even more value on top of search by adding in other Microsoft technologies. For example, we could tie in Bing Maps to allow you to easily bring reverse geo-coding right into your search. Or perhaps allow you to leverage Bing's synonym list so that people could search and find results that are synonyms of commonly searched words (e.g., a user types in shoes, but in your content it is referred to as footwear). Multi-language support is actually one of the big things we want to tackle in the short term and we believe that the NLP from Office will really help jump start us with this.

Liam


We don't need the actual infrastructure or even the same software locally.. what's wanted/desired is an API compatible system that can be used to develop against. It also doesn't need to run on a local cluster for development purposes, or smaller shops.

I know everyone is all about running huge clusters behind the scenes, but most people simply don't need that, and being able to start small, and buy bigger would be a nice option.

For example, a simple node application using leveldb(levelup) for the storage interface could be very effective as a backend for development/testing. From there, you provide an API compatible interface. Open-source that version, with the disclosure/understanding that the Azure hosted version is much more robust.

I think you'd see a lot more buy in from the open-source community... even more so if you accepted PRs to make the open version more robust.

If you are at a point where you are considering the likes of ES/SOLR/Lucene etc, you are likely ready to make the jump to self-host in the cloud, or use a SaaS provider. Where people get a bit concerned is in the lock in. I know why Azure would want to present that, but I think it's a bad idea without an open implementation that allows for self-hosting for development and on the small scale.

Right now, the company I work for is hosting in Azure. I recently switched from a couple of RabbitMQ queues to Azure Queues. It works fine, and was a really simple replacement for a flow of near real-time but temporary data keys. I would be much more open to using a hosted MongoDB from MS than DocumentDB, in much the same way I'm happy to see you guys embracing Redis for a cache system.


tracker1,

I hear you on this and to add another justification for the need for openness, let me give you another example. With ElasticSearch there are some amazing tools that are available. A few that come to mind are Kibana and LogStash. Given that we have our own API layer on top of ElasticSearch, we are not able to support these tools even though we are using ElasticSearch at our core. This is most unfortunate.

There are many reasons why we put an API layer on top of Azure Search. One key reason is that for our particular customers, we found that there were things we could do in our API that could simplify the interaction with Search. In fact, we have a system called Scoring Profiles that allows you to easily (I hope) set weights on important fields and attributes to quickly tune the results of your search based on what is important to you. No coding required. Another reason we don't just expose ElasticSearch is that ElasticSearch allows you to run arbitrary code. This is generally not a great idea in a PaaS service and can often lead to issues. There are a number of other reasons that I'll skip for now.

We still need to do some thinking in this area. I hope we can get Azure Search to a point that we can enable tools such as Kibana and LogStash to work with our service without compromising the goals of what we are trying to build. Not only would this allow us to really open up the types of things people can do with the service, but I suspect it would really help reduce concern around vendor lock in. We'll see if we can get there...

Liam


They released a full-text search service as well today: http://azure.microsoft.com/en-us/services/search/


I haven't used elastic search but I've used lucene/solr. It's similar to a technology our startup created which was acquired by IBM and is now part of the Watson portfolio. I think our indexing engine was faster. I know for sure we were better at getting data into the index than lucene appears to be. What are the common connector frameworks for lucene? Or do you generally just push the documents to the index?


I doubt MS will ever release a non-hosted version. I see this product as a competitive offering to Amazon's Dynamo and Google's BigQuery on AppEngine.


BigQuery isn't really in the same space; it's meant for analytics queries where speed isn't as important (queries take seconds or minutes). Google Cloud Datastore is a better parallel.


Search API. Datastore isn't really aimed at search.

https://developers.google.com/appengine/docs/python/search/


The Search API is pretty much for searching text, whereas this is a database.

Also the Search API is terrible.


While I agree the Search API isn't great, it searches text, numbers, dates and geopoints. That's good enough for most uses.


I think you mean a local version, or a non-hosted version.


Right, edited the parent post.


Similarly the Azure Search appears to be a competitor to Amazon CloudSearch. I'm not aware of a Lucene based search offering from Google, however.


ElasticSearch benefits from Lucene's spatial support too, which is a serious benefit.


http://azure.microsoft.com/en-us/documentation/articles/docu...

> Want to edit or suggest changes to this content? You can edit and submit changes to this article using GitHub.

Pretty remarkable given Microsoft's approach to open source in the 1990s that they're now using a service built around Linus's bespoke open source version control system to allow people to suggest changes to their documentation.


Beware a wolf in sheep's clothing...

Edit: the downvoters clearly weren't in the industry in the 80's and 90's and haven't dealt with them in the enterprise / volume licensing department recently. Comparing the two, they're even more ruthless, unfair and incompetent than ever and will screw you as hard as they can once you're locked in.

This stuff gets you through the door, as does BizSpark etc, then you're not a friend but a cow for the milking via VL, audits and licensing changes.

I speak from experience working for 4 paid up gold partners over the last decade and then dealing with them in a corporate capacity back to '95. Every game ends the same.

Edit 2: appears you can't tell the truth about Microsoft these days in the same fashion you couldn't tell the truth about Apple about 3-4 years ago...


Either that or the downvoters are willing to consider the possibility that 25 years -- not to mention a radically different market position -- makes a difference. The Microsoft of 2014 is not the Microsoft of 1989.

...in the same fashion you couldn't tell the truth about Apple about 3-4 years ago...

Yeah, if only there had been lots of people highly critical of Apple in 2011. Whatever happened to that Android thing, anyway?


Having been there, as I stated in '95 and '14 (19 years difference), the story is exactly the same.

Only the marketing and front end has changed. The cogs that drive the machine and the revenue mill have the same components and structure.

The market position is pretty much the same. Bar some new consumer markets, they have almost total domination of the business and enterprise sector. They even made a big dent in the entertainment sector with XBox with the piles of cash and losses they incurred and came out on top.


If you were around here back then, you'd know that it was pretty common for anti-Apple things to get down-voted quickly on HN. I think the pro-Apple / pro-Android voting has pretty much gone away since the comment scores have been hidden.


>If you were around here back then, you'd know that it was pretty common for anti-Apple things to get down-voted quickly on HN.

Because most were trite-BS? Like the same advice market pundits used to give Apple that in hindsight was always wrong, like that "Zune will crash them", or they "need to make a netbook NOW" (in 2010), "stuff is overpriced" etc etc. Heck, people were even championing the Dell Ditty in forums...

Now, if you have something serious to say about Apple, e.g. regarding their technology, or the consequences of having a walled garden approach (and say it without assuming that everybody in a discussion "ought" to be against a walled garden approach), then I don't think there would be a problem. We have had serious discussions criticizing Apple on HN for ages.


HN existed in '95?


That was referring to the second half of the GGP post (i.e. "you couldn't tell the truth about Apple about 3-4 years ago.")


I've been trying to tell everyone this in the last year. Things have really changed in the Microsoft world. For the better.

It makes me forget they killed my pappy!


Because they've released a vendor lock-in Azure-only, proprietary NoSQL database, 5+ years after everyone else?


You don't celebrate a major tech company catching up 5 years late. You kick in the CEO's office, light his desk on fire and make pointed enquiries into how much money they wasted farting around on Windows 8 and other failed ventures.


They did that. He was fired, and now there's a new CEO.


it's microsoft, how is that not an achievement?


Releasing proprietary software behind a walled garden is not exactly a celebratory-worthy achievement, even Oracle and Apple can obtain this kind of achievement.


So the only celebratory-worthy achievements are about releases of "libre" stuff?

Because, I'd say, if the software is good, and fits its users needs, then "releasing proprietary software behind a walled garden" is totally celebratory-worthy too.


Releasing a vendor lock in focused system is an act of self interest.

Releasing something for the public good, open source, open for all to use is an entirely different, more applaudable action.

You don't get brownie points for baiting your traps. Even if it's fancy cheese.


Sometimes I'd rather buy and use something costing money, from a vendor that only sells it himself, than some free stuff on offer that I think is crap.

A turd of a software, even if libre, is still a turd. I won't celebrate it just because someone offers it for free.

I'm not talking in general: there's excellent libre software, and excellent proprietary software too.

But there's specific libre software that's just plain crap for most use cases compared to its proprietary alternative (e.g consider a high-end DAW and the libre DAWs. Or a high-end NLE and the libre NLEs).

And some other libre software is mighty fine in itself, but lacks other characteristics that some proprietary software has (from 24/7 paid phone support, to quality documentation, to working with your preferred OS or your other infrastructure, etc). So some people can make good use of it, while for others it's not suitable.


MS can both act in self interest (it is business after all) and do something for the "public good".

It's weird that you think these two marks cannot both be hit in one stroke.


yah sure. If the free market was a zero sum game it would be pointless. However, I see absolutely NO reason I should applaud them for it, not when there are people making real sacrifices for open source software.


Well, that narrow view got us into this mess in the first place.


What mess?

I was there in past decades. We're better off than ever, both with regard to accessibility and pricing of proprietary software and with regard to abundance of libre software.



Yep, you could also say Apple's CloudKit is another proprietary NoSQL solution for the purpose of providing more value and lock-in into the iOS ecosystem. But at least CloudKit has generous free usage up to:

   - 1PB for assets
   - 10TB for database
https://developer.apple.com/icloud/documentation/cloudkit-st...


Looks like you'll need 10 million users in order to have that much space, though.


By this logic we ought to discredit just about every single small business ever.


Yes, but 5 years behind the curve? I think not.


So what about their lawyers campaigning for the copyrighting of APIs?


Dude, that's just the article, not the source. Eg: Github as CMS.


I'm fully aware of that, considering I quoted the bit talking about contributing to the article.


Microsoft is the new IBM. :)


Nothing remotely like it, in terms of power, range, degree of market control, or willingness to be commercially evil. IBM used to be more than twice as big as every other IT company put together, and in some years, made more than 100% of the industry's profit.

Even today, after disposing of numerous divisions, it's still a $100bn company, ie still larger than Microsoft. (So you pay IBM more than you pay MS every year, and you always have done.)

However, it is true that Microsoft is the IBM of PC software, just as Google is the IBM of the web, Intel is the IBM of processors, Facebook is the IBM of social networking, Cisco is the IBM of routers, Oracle is trying to be the IBM of server software, Apple is the IBM of hipster status symbols... Well, you get the idea.

IBM used to be the IBM of everything in IT, and started getting into other areas (telephone switches, copiers, cash machines etc) before the wheels started to come off.

Today, Google is a lot more powerful than Microsoft, much scarier, and much more ambitious in IBM-like ways (self-driving cars, robots, superbrains etc).


Calm down. I am pretty aware of it.

I lived through it.

My only intention was to point out how IBM shifted its focus to other areas, like consulting, to stay relevant in the PC world.


IBM still does cool things that are completely beyond Microsoft's grasp. The breadth of their research is still very impressive - from basic physics all the way to some of the most functional AI around. When was the last time Microsoft did a product that could beat a human on Jeopardy?

If they want to be the next IBM, they have a lot of work ahead of them.


Agree and I've been saying it for ten damn years, haven't I been saying it!


See my comment above.

Actually, there is one way that Microsoft did become the new IBM. Long ago, I went to a talk by a senior IBMer and he said "We used to be the Evil Empire. Somebody else has that job now."


Where does it say DocumentDB is Open Source? The lack of any mention is usually a good indicator that it's not.


A comment from someone on the team further up the page mentions that ElasticSearch is 'leveraged' inside the service, so I'd say there should be some mention of the Apache 2 license to represent the use of ElasticSearch.

https://github.com/elasticsearch/elasticsearch


Nowhere does it say DocumentDB is open source.

The Microsoft of the 1990s wouldn't be using Github.


That comment was about Microsoft's documentation...


This topic and the parent's linked article are about MS's new DocumentDB as some example of how "remarkable"

> Microsoft's approach to open source

has become... when they're only using GitHub as a free-hosting CMS provider for docs.

I can't see how releasing a proprietary "DocumentDB" on a Microsoft-only Azure cloud is a glowing endorsement or valued contribution to OSS. Despite what their marketing messaging says about how "Open and approachable" it is: http://blogs.msdn.com/b/documentdb/archive/2014/08/22/introd...


Ad hoc queries using SQL like syntax. No need to define indexes.

Javascript execution within database. Stored procedures, triggers and functions can be written with Javascript. "All JavaScript logic is executed within an ambient ACID transaction with snapshot isolation. During the course of its execution, if the JavaScript throws an exception, then the entire transaction is aborted."

Pricing is based on "capacity units". Starts with $22.50 per month (this includes 50% preview period discount). One capacity unit (CU) gives 10GB of storage and can perform 2000 reads per second, 500 insert/replace/delete, 1000 simple queries returning one doc.

In order to see pricing details, change the region to "US West": http://azure.microsoft.com/en-us/pricing/details/documentdb/

Very interesting addition to Microsoft offering. I was actually just yesterday wondering if they have any plans for this kind of service. Table Storage is quite primitive and Azure SQL on the other hand gets expensive when you have lots of data.

One potential "problem" with this is the bundling of storage capacity and processing power. If I understand this correctly, I would need to buy 10 CUs per month to store 100GB of data even if I'm not very actively using that data.
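To make the bundling concern concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (10GB of storage per CU, $22.50 per CU per month at the preview price). The helper names are made up for illustration, and the sketch assumes CUs are bought in whole units and priced linearly, which the pricing page may not guarantee.

```python
import math

GB_PER_CU = 10          # storage per capacity unit, as quoted above
PRICE_PER_CU = 22.50    # USD per month, preview price quoted above


def capacity_units_for_storage(gb):
    """Hypothetical helper: whole CUs needed just to hold `gb` of data."""
    return max(1, math.ceil(gb / GB_PER_CU))


def monthly_cost(gb):
    """Assumes linear per-CU pricing (an assumption, not a documented rule)."""
    return capacity_units_for_storage(gb) * PRICE_PER_CU


print(capacity_units_for_storage(100))  # 10
print(monthly_cost(100))                # 225.0
```

So under these assumptions, 100GB of mostly idle data would cost the same as 100GB of data served at the full throughput of 10 CUs.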


That is actually good to start with, yet scaling will cost A LOT.


DocumentDB utilizes a highly concurrent, lock free, log structured indexing technology to automatically index all document content. This enables rich real-time queries without the need to specify schema hints, secondary indexes or views.

How does that work? Isn't that going to incur a major performance hit? If not, why don't other databases get rid of indexes?

Also, if anyone from MS is reading, http://azure.microsoft.com/en-us/documentation/articles/docu... links to http://azure.microsoft.com/en-us/documentation/articles/docu... which is a 404 error.


One way to approach a behind the curtain automatic index generation would be to gather data about collection usage, and use such data to enable targeted indexes.

I don't think it would be feasible to have "indexes for everything", as the numbers involved would scale geometrically.
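A minimal sketch of the usage-driven idea described above, in Python. Nothing here reflects how DocumentDB actually works; the query log format, the `suggest_indexes` name, and the `top_k` cutoff are all assumptions made up for illustration.

```python
from collections import Counter


def suggest_indexes(query_log, top_k=2):
    """Pick the most frequently filtered fields as index candidates.

    `query_log` is a hypothetical list of tuples, each holding the field
    names that one query filtered on.
    """
    usage = Counter(field for query in query_log for field in query)
    return [field for field, _ in usage.most_common(top_k)]


# Made-up log: "city" is filtered on 3 times, "rating" twice, "name" once.
log = [("city",), ("city", "rating"), ("rating",), ("city",), ("name",)]
print(suggest_indexes(log))  # ['city', 'rating']
```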


Forgive my ignorance but why would it scale geometrically? I tried to quickly think of a complexity analysis and I don't see where you are coming from.


Well, ordering matters in indexes - and clearly building a dedicated index for every field would not work, simplest queries aside.

So if I have three fields - a, b and c - and by hypothesis I want to map all possible queries in a naive way, I will need to build, at the very least, the following indexes:

[a, b, c] [a, c, b] [b, a, c] [b, c, a] [c, a, b] [c, b, a]

all of which represent a different and valid way of querying my data. Anything less than this would leave some valid query uncovered by the indexes. These are 6 indexes, which is 3! (factorial). Add a fourth field and we have 24 indexes (4!), and so on. Hence, geometrical growth (to be fair, factorial growth is even greater than geometrical).

Which is why I'm thinking that either (i) indexes get created automatically based on the most frequent / heavy queries, or (ii) indexing works differently for DocumentDB and they are actually able to map the document space in a more efficient way (but I'd say that we lack the technical details to jump at this conclusion, at the moment).
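The naive count above can be checked with a few lines of Python. This is only a sketch of the counting argument (one compound index per ordering of all fields); `naive_compound_indexes` is a name invented here, not anything from a real database.

```python
from itertools import permutations
from math import factorial


def naive_compound_indexes(fields):
    """Every ordering of all fields, i.e. one compound index per permutation."""
    return list(permutations(fields))


for n in range(1, 6):
    fields = [chr(ord("a") + i) for i in range(n)]
    count = len(naive_compound_indexes(fields))
    assert count == factorial(n)   # n fields give n! orderings
    print(n, count)
```

Three fields give 6 orderings and four fields give 24, so the count grows factorially with the number of fields.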


No local installation. No banana.

I wouldn't tie a product to a single cloud vendor.


Absolutely.

However there is an overwhelming trend towards hosting key parts of your infrastructure including data storage. Managing a database is surprisingly difficult in particular at scale.


Yes, because all products today are designed for multiple cloud vendors. Come on.......


As the only source of DocumentDB is as an instance within Azure you are limited to that vendor. You can obviously take redis/mongo/couch and put them on any 'cloud' provider's infrastructure as needs be.

Not sure if your interpretation of his point was wilfully incorrect for some reason, but it was quite obvious what he meant.


So you're questioning AWS's success then?


nor one as expensive as azure


...and MS goes after MongoDB. It would be nice to see an on-premises version, if only to compare performance/consistency with Mongo.


DocDB is built atop a very well battle-tested distributed systems framework with replication based on a multi-Paxos implementation. I haven't done any benchmarks, but the replication model is far superior to MongoDB's "journal journal" model for performance. http://daprlabs.com/blog/blog/2014/08/22/azure-documentdb/


Another way to look at this is that it is based on Microsoft's internal distributed systems infrastructure and thus will never be open-sourced for the same reason Google will never open-source Spanner, Megastore or Colossus. Having battle-tested internal systems to build on is nice, but it means the DocumentDB code is probably nearly impossible to run outside of Microsoft.


Your objections to MongoDB's model seem reasonable, but I don't see any evidence in either this comment or the linked blog post that DocumentDB is better (especially in the absence of benchmarks). What is this "battle-tested distributed systems framework"? Several of your complaints about MongoDB have to do with the interaction between persistence to disk and replication to the network; as the Multi-Paxos algorithm does not specify when data should be written to disk (much less what the format should be), what reason is there to believe that DocumentDB does this any better?

I'm totally willing to believe that DocumentDB beats the pants off MongoDB on just about every axis (in fact, that seems pretty likely) but it's going to take some actual numbers and a better description of the internals.


I agree with you - we need numbers before making that kind of conclusion and I haven't run any benchmarks on the public version of DocDB. I'd like to see someone measure MongoDB on Azure vs DocDB on Azure - even then it might not be a fair measurement of db vs. db, since we don't know what machines DocDB is hosted on.

All I can really say is that the replication model provides a significant performance boost over MongoDB in the multiple replica (i.e., production) scenario.

We were using MongoDB at Microsoft for a while (I left MS almost a year ago). I was developing a real-time metrics system with it. It was very unstable at our target load (500k increments per minute, high percentage of tomorrow's documents preallocated the day before). We only managed maybe 10% of that with MongoDB, IIRC. Sometimes it would choke and not come back until I restarted the cluster (~30 machines total, I believe. 3 replicas * 10 shards).

We were so sure that MongoDB should be able to handle this scenario, since they talk about it in their documentation. After talking with the MongoDB devs, we came to the conclusion that even though we were issuing increment operations on preallocated documents, MongoDB was:

a) using a global lock on the "local" db used for replication, and

b) "replicating via disk" instead of via the network. In other words, replication requires writing to the journal journal before other members of the replica set have a chance to apply the change and ack back. This results in a loss of concurrency.

The lack of async query support in the C# driver didn't help either.

Eventually we used a replicated, write-back cache which sits atop the framework DocDB uses. Not a fair comparison, but the goal was achieved easily with 1/3rd the hardware. We just backed it onto Azure Table Storage. Our queries were all range queries, which table storage supports.

I can't talk about the framework, unfortunately.


next time you need fast counters, try hypertable (non-reading increments)


It would still make sense to use the replicated write-back cache to avoid trips to disk. We were considering replacing MongoDB with Cassandra, though.

I wanted to avoid having to deploy and maintain a database system, so using table storage was a solid choice.


I wouldn't be surprised if it happens. Microsoft has on-premise versions of several Azure offerings (e.g. WebSites, IaaS, Service Bus), which are released in Windows Azure Pack (http://www.microsoft.com/en-us/server-cloud/products/windows...). It might take some time though.


I have to believe it's coming, since a lot of MSFT's core business is still selling server licenses to businesses that host their own data centers. I for one would love to use something like this, and our company is one that's not comfortable with third-party open-source tools, and we have to host in our own data center...


Honestly, this seems a bit more like Cassandra or Couchbase than Mongo.


Not really so far, it's Azure only.


I liked everything about it until I saw the API for the Python client. What a catastrophe.

I pray Microsoft is looking for Python developers: https://gist.github.com/whalesalad/2142f0075c6896f4547c


Wow, the function naming alone is terrible.

    for i in range(len(path_parts) - 1, -1, -1):
        if not path_parts[i] in resource_types and path_parts[i] in resource_tokens:
            return resource_tokens[path_parts[i]]

This bit of iteration is painful to look at as well.
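For comparison, a sketch of how the same backwards scan might look in more idiomatic Python. The function name, the `None` fallback, and the sample data are assumptions added here to make the fragment self-contained; only the loop body mirrors the quoted code.

```python
def find_resource_token(path_parts, resource_types, resource_tokens):
    """Return the token for the last path part that has one, else None."""
    for part in reversed(path_parts):
        if part not in resource_types and part in resource_tokens:
            return resource_tokens[part]
    return None


# Made-up sample data, purely for illustration.
parts = ["dbs", "db1", "colls", "coll1"]
types = {"dbs", "colls"}
tokens = {"db1": "token-db", "coll1": "token-coll"}
print(find_resource_token(parts, types, tokens))  # token-coll
```

Iterating with `reversed()` avoids the index arithmetic, and `not in` reads more naturally than `not ... in`.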


Some optimisations to get it to web-scale on TempleOS were obviously necessary.



I didn't compare the code, but maybe it's autogenerated from a more C-like language, which would make for somewhat awkward syntax in a language like Python?


"All JavaScript logic is executed within an ambient ACID transaction with snapshot isolation. During the course of its execution, if the JavaScript throws an exception, then the entire transaction is aborted."

Have I missed something, or have MS delivered a novel and valuable feature? I'm not aware of support for transactions across documents in other NoSQL platforms. I'd be grateful if someone has any experience or better information in that regard, thanks.
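To illustrate the quoted behaviour (an exception aborts the whole transaction), here is a toy in-memory model in Python. It is not DocumentDB's API or implementation: `ToyCollection`, `run_in_transaction`, and the `transfer` script are all invented for this sketch, and the deep copy is only a crude stand-in for snapshot isolation.

```python
import copy


class ToyCollection:
    """In-memory stand-in for a collection; not DocumentDB's API."""

    def __init__(self):
        self.docs = {}

    def run_in_transaction(self, script):
        # Work on a private copy so partial writes are never visible.
        working = copy.deepcopy(self.docs)
        try:
            script(working)
        except Exception:
            return False          # abort: committed state is untouched
        self.docs = working       # commit: all writes appear at once
        return True


def transfer(docs):
    """Multi-document update that may fail halfway through."""
    docs["alice"]["balance"] -= 50
    if docs["alice"]["balance"] < 0:
        raise ValueError("insufficient funds")
    docs["bob"]["balance"] += 50


coll = ToyCollection()
coll.docs = {"alice": {"balance": 30}, "bob": {"balance": 0}}
print(coll.run_in_transaction(transfer))  # False
print(coll.docs["alice"]["balance"])      # 30
```

The failed script had already decremented one document before raising, but none of that is visible afterwards, which is the multi-document guarantee being asked about.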


FoundationDB https://foundationdb.com/

Quite an amazing piece of technology, check how they test cluster failures, there is a blog post about it :)



You are missing AmisaDB. Supports Multistatement transactions. http://www.amisalabs.com/


No, I didn't know about that, thanks. I guess I was more thinking about run-on-your-own-hardware/cloud - which I know is not equivalent to the MS offering. I see orand's post about FoundationDB, which also looks interesting. Seems like it would be a killer feature for a Mongo/Rethink/Couch type system.


It appears that AmisaDB is in-memory only, only supports read committed isolation, and has limited transactions semantics (no arbitrary JS). This is pretty far from what DocumentDB seems to offer.

http://www.amisalabs.com/AmisaDB_Docs.html


A quick comparison between DocumentDB vs MongoDB: http://daprlabs.com/blog/blog/2014/08/22/azure-documentdb/


Wow, how'd you get one done so fast? Were you a private preview customer?


OP did not write that. See here: https://news.ycombinator.com/item?id=8209525


If I understand correctly, their multi-document ACID transaction support is a big deal. The only other NoSQL/NewSQL systems I'm aware of with that ability are FoundationDB and Google Spanner/F1.


So does AmisaDB. http://www.amisalabs.com/


Yes, it is a big deal. MarkLogic does this too (I think they have been doing it since '03 or something like that).

http://www.marklogic.com/blog/acid-transactions-check/


Sounds very similar to CouchDB. Server side Javascript written by the user, and an HTTP interface. The ability to adjust consistency is really neat.


CouchDB doesn't support ad hoc queries. CouchDB is all about MapReduce and heavy caching. It's very rigid.


The ability to adjust consistency isn't really a new concept. The first I saw it was in Amazon's DynamoDB paper in 2007. Cassandra uses the model outlined there as well, and I'm sure there are others out there that I'm less familiar with.
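As a small illustration of the tunable model in the Dynamo paper: with N replicas, a read quorum of R and a write quorum of W, reads are guaranteed to overlap the latest write when R + W > N. The Python sketch below just encodes that inequality; the function name is made up, and it says nothing about how DocumentDB exposes its own consistency levels.

```python
def quorums_overlap(n, r, w):
    """True when every read quorum intersects every write quorum (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


print(quorums_overlap(3, 2, 2))  # True
print(quorums_overlap(3, 1, 1))  # False
```

Lowering R or W trades that overlap guarantee for lower latency, which is the knob being described.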


A little bit more similar to Cloudant, because of the searching and indexing.


What are the limits of DocumentDB? You know, like max size of database, max size of document, max number of documents per database, max. number of attributes per document, max. number of databases per DocumentDB account.

What's the max. duration of database query, max size of query result.

What kind of performance can be expected, does it decrease as the size of database increases or it remains constant?

I'm going to wait a few days until hype settles.


Most of this has been documented since the start. Did you actually read the docs?

http://azure.microsoft.com/en-us/documentation/articles/docu...


I did read the documentation but couldn't find it. I'm still not sure how you found it other than by going into their GitHub repository.

Great find though. Thanks.


Great suggestion, please vote here to make it a reality: http://feedback.azure.com/forums/263030-documentdb/suggestio...


And that is a creative product name. Well played MS.


It's really smart. It's a document database, so they call it DocumentDB. It's better than calling it something like Dolphin and then explaining that it's a document database. Simple and clear is a good thing.


Err, Apple:

Airport, Bootcamp, Safari, Time Machine, FaceTime, Grand Central Dispatch, QuickTime...

and I won't even start on the open source side of things...


Wouldn't have surprised me to see it called something like Microsoft NoSQL Database Cloud Edition for Agile Developers Powered By JavaScript 2014.


They haven't done this for a while.


"Microsoft has DocumentDB service now."

"Oh cool, which document database is it? MongoDB? CouchDB? Cassandra?"

"It's DocumentDB!"

"...."


Just like trying to get help with libraries like Prototype or Async in JavaScript. 90% of the results (I'm exaggerating) are jQuery.


But it's also kinda selfish don't you think?


No more than AWS's SimpleDB.


Just like SQL Server


What are the size limits on a collection? Docs mention transaction support is offered only within a collection. Is a collection essentially limited to a single physical machine in the background or does it span across machines? It looks like in Standard Preview, the max collection size is 10GB.


Interesting: https://github.com/Azure/azure-documentdb-python (it's empty for the moment, but glad to see first-party support for Python)


The .NET & JS repos are empty, too. I think they just have not published the code. It's still in Preview after all.

https://github.com/Azure/azure-documentdb-net


The @DocumentDB twitter links to a tutorial on DocDB: http://www.documentdb.com/sql/tutorial


Does anyone know how this compares to AWS DynamoDB[1] ?

[1] https://aws.amazon.com/dynamodb/


I've been using DynamoDB since it was released. I wrote a nodejs driver[1] and a nodejs data mapper for it[2], so I have a decent bit of experience with it. Browsing the DocumentDB docs the two services seem to be very different. DynamoDB is really just a key value store with some very nice properties, but also a lot of tradeoffs. One such tradeoff is querying data is very limited in DynamoDB. In Dynamo you can only query data by its primary hash key and optional range key. These keys you must specify upfront when you create your table and cannot be changed afterwards.
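A toy Python model of the access pattern described above: items are addressed by a hash key, optionally narrowed by a range key, and nothing else can be queried. This is not the DynamoDB API; `ToyTable` and its methods are invented for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyTable:
    """Toy hash-key + range-key table; not the DynamoDB API."""

    def __init__(self):
        # hash key -> list of (range_key, item), kept sorted by range key
        self._partitions = defaultdict(list)

    def put(self, hash_key, range_key, item):
        rows = self._partitions[hash_key]
        keys = [k for k, _ in rows]
        i = bisect_left(keys, range_key)
        if i < len(rows) and rows[i][0] == range_key:
            rows[i] = (range_key, item)        # overwrite existing key
        else:
            rows.insert(i, (range_key, item))

    def query(self, hash_key, low, high):
        """Items under one hash key whose range key is in [low, high]."""
        rows = self._partitions.get(hash_key, [])
        keys = [k for k, _ in rows]
        start, end = bisect_left(keys, low), bisect_right(keys, high)
        return [item for _, item in rows[start:end]]


table = ToyTable()
table.put("user1", 20140821, "post A")
table.put("user1", 20140822, "post B")
table.put("user2", 20140821, "post C")
print(table.query("user1", 20140801, 20140821))  # ['post A']
```

Note there is no way to ask "all items whose value contains X" without scanning every partition, which is the limitation being contrasted with DocumentDB's ad hoc queries.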

DocumentDB seems much more similar to mongodb and appears to have a very flexible query ability. In my opinion, one of the best features with DynamoDB is you can tune the number of reads/writes each individual table requires. This lets you scale up and down your database and greatly helps keep costs down. This is a feature that only a hosted database service can offer. I haven't yet read any pricing info on DocumentDB, but hopefully they offer a similar feature as this is really where a hosted database service can shine.

[1]https://github.com/Wantworthy/dynode [2]https://github.com/ryanfitz/vogels
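To make the key-only query model concrete, here is a minimal in-memory sketch. It does not use the AWS SDK or either library linked above; `KeyedTable` and its methods are hypothetical names, and range keys are assumed to be numbers. The point is only that lookups go through an exact hash key plus an optional range, and nothing else.

```javascript
// In-memory model of a table with a composite primary key.
// Hypothetical illustration only; this is not the DynamoDB API.
class KeyedTable {
  constructor() {
    // hashKey -> array of { rangeKey, item }, kept sorted by rangeKey
    this.partitions = new Map();
  }

  put(hashKey, rangeKey, item) {
    const rows = this.partitions.get(hashKey) || [];
    // Overwrite any existing row with the same composite key.
    const kept = rows.filter((row) => row.rangeKey !== rangeKey);
    kept.push({ rangeKey, item });
    kept.sort((a, b) => a.rangeKey - b.rangeKey);
    this.partitions.set(hashKey, kept);
  }

  // The only supported lookup: exact hash key, optional range bounds.
  query(hashKey, { from = -Infinity, to = Infinity } = {}) {
    const rows = this.partitions.get(hashKey) || [];
    return rows
      .filter((row) => row.rangeKey >= from && row.rangeKey <= to)
      .map((row) => row.item);
  }
}

const orders = new KeyedTable();
orders.put("alice", 1, { total: 10 });
orders.put("alice", 5, { total: 25 });
orders.put("bob", 2, { total: 7 });

// Fine: hash key "alice", range key >= 2.
const aliceRecent = orders.query("alice", { from: 2 });
console.log(aliceRecent);
```

Asking for "all orders with total > 20" has no equivalent here short of walking every partition, which is the tradeoff described above.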


Thanks, yeah I can see how this is more like hosted mongoDB - and I've seen dynode on npm, nice work!


DynamoDB really isn't document oriented storage.

Instead it's a key-value wide column store, with 1st and 2nd level index support.

I haven't delved into the MSFT pricing model, but DynamoDB is pay-for-throughput. You provision your table with a certain amount of read/write performance that is guaranteed, and you pay for that.


I'd assume pricing will be similar and incredibly competitive... and thanks, didn't realize how limited DynamoDB was


This allows you to use SQL queries, and it has built-in index support


Very cool features, thanks!


Am I alone in thinking that sql-like syntax is actually a step backwards from building query documents programmatically (MongoDB style)?


I'd say RethinkDB does it best. You just chain some regular function calls and pass some lambdas here and there. It looks and feels very natural. Plus, that "data explorer" thing lets you test those queries quickly.

Generating a list of all used tags:

  r.table('posts')('tags').reduce((a, b) => a.setUnion(b))
Grabbing all posts with the 'foo' tag:

  r.table('posts').filter((doc) => doc('tags').contains('foo'))
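For anyone who hasn't used RethinkDB, here is roughly what those two queries compute, written against a plain in-memory array instead of a table. This is a sketch, not ReQL: the sample `posts` data is made up, and the native `Set` is standing in for `setUnion`.

```javascript
// Made-up sample data standing in for the 'posts' table.
const posts = [
  { title: "a", tags: ["foo", "bar"] },
  { title: "b", tags: ["bar", "baz"] },
  { title: "c", tags: ["foo"] },
];

// List of all used tags: pairwise union of every post's tag array.
const allTags = posts
  .map((post) => post.tags)
  .reduce((a, b) => [...new Set([...a, ...b])]);

// All posts carrying the 'foo' tag.
const fooPosts = posts.filter((post) => post.tags.includes("foo"));

console.log(allTags);
console.log(fooPosts.map((post) => post.title));
```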


DocumentDB has a similar looking (maybe not as powerful) interface:

  context.collection.filter(function(doc) {
    return doc.city == "Melbourne" && doc.rating > 4.5;
  }).map(function(doc) {
    return doc.displayName;
  });


I've been a long-time elasticsearch user and can say that I personally HATE sql syntax when it comes to querying data in mysql, just because I usually can't get the joins etc. right to build a search that is reasonable. However, they might be taking this approach because sql syntax is familiar. JSON data structures for querying over HTTP were used with elasticsearch because they were familiar to most developers. I think overall it helps adoption if you have layers that are not completely foreign to the implementing developers. They could have written their query syntax in protocol buffers or CoffeeScript or whatever the hell they wanted really, but I think that would have hurt adoption.


You can still build documents programmatically. The SQL-esque syntax is optional. The primary interface is HTTP.


> The SQL-esque syntax is optional. The primary interface is HTTP.

Then the introduction needs some work.

"DocumentDB enables complex ad hoc queries using the SQL dialect"

"Azure DocumentDB offers the following key capabilities and benefits: Ad hoc queries with familiar SQL syntax"

The "Query DocumentDB" article also seems to be focused on that SQL dialect.


Oh, I see, I've missed this detail on the first read. Thanks!


Not necessarily.

I would say it allows flexibility in thinking about your storage and access patterns, which in turn allows flexibility in use. Hopefully without the expense of performance, and integrity.


how? you're just doing the same thing with MongoDB in their own json-like syntax instead of the T-SQL that everyone knows


For simple queries there is virtually no difference (and clearly one should favor what he knows best).

But there are cases where you need to build the query object programmatically (and dynamically), and I believe it's a bit awkward to do that to obtain an SQL statement. But possibly it's just my personal taste :)
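A rough sketch of the difference, assuming a hypothetical search form where any field may be missing. The function names, the `places` table, and the `?` placeholder style are assumptions for illustration; the SQL here is generic, not DocumentDB's dialect. `$gt` is a standard MongoDB query operator.

```javascript
// Query-document style: optional criteria are just optional keys.
function buildQueryDocument(criteria) {
  const query = {};
  if (criteria.city) query.city = criteria.city;
  if (criteria.minRating !== undefined) {
    query.rating = { $gt: criteria.minRating };
  }
  return query;
}

// SQL style: collect clauses and parameters separately, then join them.
function buildSqlStatement(criteria) {
  const clauses = [];
  const params = [];
  if (criteria.city) {
    clauses.push("city = ?");
    params.push(criteria.city);
  }
  if (criteria.minRating !== undefined) {
    clauses.push("rating > ?");
    params.push(criteria.minRating);
  }
  const where = clauses.length > 0 ? " WHERE " + clauses.join(" AND ") : "";
  return { text: "SELECT * FROM places" + where, params };
}

const criteria = { city: "Melbourne", minRating: 4.5 };
console.log(buildQueryDocument(criteria));
console.log(buildSqlStatement(criteria));
```

Both work; the SQL version just carries extra bookkeeping (clause list, parameter list, the WHERE/AND glue) that the object version gets for free.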


yes


Spatial queries and indexing. Most data has some location component. I didn't see anything with this. Is it in there, or planned?


So does it run on a cluster? If so, which of Consistency, Availability and Partition tolerance does it NOT offer? (See the CAP theorem.)



I don't think the CAP theorem is so black and white: http://en.wikipedia.org/wiki/CAP_theorem#2012


They can all be provided to varying degrees. But most distributed systems sacrifice some availability to guarantee consistency (i.e., avoid data loss at all costs).


It would have been nice to see some actual details of how it works so that it can be compared to the competition.


First link on the page: "introduction to DocumentDB".

http://azure.microsoft.com/en-us/documentation/articles/docu...


I wonder if this is built on JetDB


On Jet Blue (ESE)? It wouldn't be the first, RavenDB http://ravendb.net is based on ESE.

I remember a while back doing work on a system that (ab)used MS Exchange Server 5 as a database, mostly because of the Outlook integration.


Another case of Not Invented Here syndrome from Microsoft. One wonders why they couldn't just take the open source and very well architected RavenDB http://ravendb.net .Net Document DB and provide first class support for that within Azure.


RavenHQ was recently added as an Azure Add-On, you can use it now. But RavenHQ seems much more expensive than DocumentDB (even when doubling the "introduction pricing"), and it's not really integrated in Azure (I understand that it runs on Amazon, in a US datacenter only)

But it will be very interesting to compare features, Raven really has a lot of advanced features.


I'm going to posit licensing/cost issues as the easy downfall. However, something about the closed sourcing against the context of the open source movement from MS has me wondering about specific implementation details they feel are differentiating or expositions of brilliance/hacks?


Because that's just as much of a ghetto, Microsoft can't control it as they'd have to buy it and I doubt Oren would sell it to them.

Not only that, there are better products that are free.


I find it surprising that DocumentDB wasn't already a copyrighted name. ;)


Can I use DocumentDB out of Azure (hook my own)?


You know, it's not a simple question, but there's some case for Microsoft just up and releasing some scale-out DB for free (as in beer) or at least cheap.

My thinking is, scale-out deployments aren't all that likely to pay at SQL-Server-like rates per CPU anyway, and you're helping give the Windows world better parity with the horde of Linux options for folks that need scale-out (or, just as commonly, hope to need it someday).

On the flipside, by doing that you might forego some sales of SQL Server (though I suspect that's limited; most folks that need SQL Server really need SQL Server) or sales on Azure that could theoretically help recover the dev costs. But greatly improving the scale-out-on-Windows story seems like a big deal, the kind of thing that might justify going to lots of effort to make a DB product then giving it away.


> You know, it's not a simple question

Yes it is a simple question, and the answer is NO.


Leave it to Microsoft to give it the most generic sounding name possible.


Yeah. Instead of giving it a reasonable, obvious name they should have gone with something hipstery and vague instead, like UberDcmntor.io.

No thanks. I think this will do just fine.


No no no, it also has to be just one common word so that it's completely ungoogleable by itself.


Sick! Love it


'cause what the world needs is another proprietary NoSQL solution.


I don't think it is about the world. I would expect it is about http://en.wikipedia.org/wiki/Embrace,_extend_and_extinguish

And looking at their diagram[1], I can already see where they can make this strategy work. While their collection/document/attachment structure is quite straightforward to move to another document DB, I bet their UDFs, sprocs and triggers will be Microsoft-specific. Microsoft will do everything possible to "optimize data fetching by running processing code close to data" and lure developers into using these, in the end locking the software solution to the Microsoft platform.

[1] http://azure.microsoft.com/en-us/documentation/articles/docu...


As long as it offers something new, I don't see anything wrong with that.

I've been working with MongoDB for the last year and DocumentDB features look very interesting to me.


Yeah, once you go down that road for a while, you'll understand how proprietary NoSQL is a giant PITA.


'cause what the world needs is another pXrXoXpXrXiXeXtXaXrXyX NoSQL solution.


Are those X's supposed to be ^H's?


THIS THING IS A BEAST!! It is absolutely bad ass


A new "cool", locked-in service served on a silver platter by Microsoft to the brainwashed.

Everybody else uses open source on premises or their cloud of choice.



