Hacker News
Dear MongoDB users, we welcome you in Azure DocumentDB (microsoft.com)
126 points by jeremya on Feb 17, 2017 | 108 comments

Yes, except you can't do the most basic things with DocumentDB and it becomes very expensive very fast. Especially if you want multiple collections.

There's so much lacking in DocumentDB, as is evident from the feedback forum, that comparing it to Mongo is like comparing an infant to an adult. The infant might be cute, but it can't do a whole lot.


To build something similarly sized with the free version of Mongo would also be expensive. It's a trade off.

When users have evaluated DocumentDB against MongoDB, they see major shortcomings in Microsoft's offering:

"As we were developing our new financial benchmarking service last year, we evaluated Microsoft’s Azure DocumentDB, but MongoDB offered much richer query and indexing functionality"

KPMG France https://www.mongodb.com/blog/post/kpmg-france-enters-the-clo...

All of the following three share something in common: 1) Microsoft Azure DocumentDB 2) Google Cloud Spanner 3) Amazon Web Services DynamoDB

Total cloud-vendor lock-in. It's clear why the clouds want users investing in these difficult-to-migrate-from solutions...

Needs more hashtags. Reach all those Millennials graduating from $10k 3-month bootcamps. Sigh.

lol, yeah, this marketing piece doesn't really feel professional. It is littered with useless illustrations and bad poetry.

> Third, we do it with love…

With love for our money sure. What the hell does that even mean? I really rolled my eyes reading that blog post. This is childish and out of place for an article trying to sell security.

I think they know their target audience. (It's for MongoDB users after all.)

It's really satisfying when I read a post saying exactly what I wanted to write but didn't have the balls to. Thank you :)

What a sweeping generalisation to make.

I've been out of touch with Mongo for a while, but when did it stop being common practice to just hide :27017 behind a firewall with only your app's DB access layer (or, at most, a few nodes in the local subnet) talking to it?

Probably around the same time ops were deemed useless :)

[1] https://news.ycombinator.com/item?id=13644789

I saw that chain too. I guess I do devops differently. Since I'm bootstrapping I don't have money for an ops person. So I've set about learning ufw, let's encrypt, nginx/tls termination to a service only accepting local connections to the port, etc. I see devops as the developer learning the ops side to take responsibility for the whole stack.

As a dev, I've been going through the same process for several months now. Honestly, for each of the concepts I've come across (ufw, let's encrypt, etc), you might bang your head for a few days with each one but you will eventually get it. I don't know why people make it out to be so difficult.

Stigma around ops and QA work.

> hide :27017 behind a firewall with only your app's DB access layer talking to it?

Because if you can do without it, why bother? Developing an access layer costs time and money. If you can leverage the DB features to do what you need, you can make your stack simpler and more maintainable.

It does not take that much time or effort to set up useful subnet/vpc security in AWS. Put the database in your VPC, say only your application vpc can talk to it. Done.
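As a rough illustration of the rule described here (only the application's network may talk to the database), the sketch below models an allow-list check with Python's standard `ipaddress` module. The CIDR range and the `is_allowed` helper are hypothetical; in AWS the equivalent rule would normally live in a security group rather than in application code.

```python
import ipaddress

# Hypothetical address range for the application tier, for illustration only.
APP_SUBNETS = [ipaddress.ip_network("10.0.1.0/24")]
DB_PORT = 27017  # MongoDB's default port, as mentioned upthread


def is_allowed(source_ip: str, port: int) -> bool:
    """Return True only for traffic from the app subnets to the DB port."""
    if port != DB_PORT:
        return False
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in APP_SUBNETS)


if __name__ == "__main__":
    print(is_allowed("10.0.1.17", 27017))    # app server inside the subnet
    print(is_allowed("203.0.113.9", 27017))  # arbitrary internet host
```

Everything not explicitly allowed is rejected, which is the same default-deny stance the comment recommends.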

Reasonably good security practices are not that much effort, and really it's a case for respecting your users for the most part.

The security trust game is starting to blow up. Yahoo just lost $250 million to it.

I think y0ghur7_xxx was talking about a usecase where you want to talk to the backend datastore from an application running outside of your datacenter in an untrusted environment (like an iOS/Android app, or a JS web frontend).

In this case, one can make the argument that a custom proxy layer, running in your DC (that proxies between the database and your actual frontend app) should not be necessary if the database offers sufficient per-connection ACLs and is secure.

That's a big if though.

Sorry, when I said "if you can do without it", with "it" I meant the application access layer, not the vpc.

One of the best things about AWS is the "Jeff Barr style" posts describing every service they release. I find them much easier to consume than a blog post like this.

Well, this post was a marketing post, not a product release. Product release posts in Azure are much more informative for a dev than this post.

> Product release posts in Azure are much more informative for a dev [than] this post.

Who is this for then?

The dev managers that see a new buzzword and think, "ooh azure supports mongo now... we should target it for our new app, I hear that MEAN stack is really nice, it should fit right in."

AWS needs something like this.

The missing piece for the AWS serverless story is a database that is suitable for writing real world applications. DynamoDB is far from suitable for that task, which leaves AWS serverless with no good database.

AWS has RDS - that's most certainly a database suitable for writing real world applications, as it's MySQL.

Does serverless somehow mandate a non SQL solution?

RDS also supports PostgreSQL, SQL Server, Oracle, Aurora and MariaDB


RDS is server based - you need to pay to have an instance running per hour. That's not serverless. That's "serverful".

On the one hand, everything is server based at some level; it's just a question of how much is being hidden from you and managed by a third party.

On the other hand RDS hides a lot of the complexity from you. You don't have to pick an OS, apply updates, secure it, manage it, configure it, or patch it. There are some number of virtual servers out there that are nominally running your RDS cluster, but it's all pretty theoretical.

So I'm not entirely understanding your point.

> you need to pay to have an instance running per hour

You are paying to have instances running with every other DB service too; they may just break it out on your bill a bit differently. :)

The real issue with RDS for me isn't that they haven't removed the server part from the equation (they have), it's that they haven't removed the RDBMS from the equation. Schema changes, data migrations, replicas, sharding, scaling: All the hard parts of running a RDBMS are still there.

If Amazon could somehow make a magical service that accepted SQL queries and somehow returned my data, I'd be ecstatic - but the difference between that and RDS isn't the fact that they're letting me know how much ram the virtual server which is nominally running MySQL for me has.

I'm not sure how that differs from Azure Document DB? I have no inside info on this, but, I'm pretty sure it runs on a server too.. In the specific context of databases used for "serverless", clearly there are servers involved, it's simply that your application and ops team doesn't manage them.

What I'm getting at is, a hosted DB is a hosted DB.. What makes SQL unsuitable for serverless?

Replying to myself here, I missed a key point.. the issue you raise is that you're billed per hour, even when it's unused? That makes some amount of sense, but any data storage is going to come with a per hour bill - either for the instance of it, or the data within it.

Anyway, my bad, I now see your point :)

It's a cloud service just like Dynamo, the implementation specifics seem irrelevant here.

Touting "serverless" as some sort of mysticism that doesn't really mean anything useful doesn't really get anybody anywhere.

Yeah, I dabbled in DynamoDB for a recent project - couldn't really get my head around it - very strange sort of NoSQL database. The query language is incredibly arcane and wordy, and mostly inflexible.

Thinking of setting up an EC2 instance running RethinkDB or PouchDB for my project (and for future projects).

Cross datacenter replication is the missing piece from AWS. I wish they'd just roll out a hosted Cassandra or something identical

While probably not what you're looking for if you're mentioning Cassandra, RDS does let you have read replicas in any region.

You can use scylladb.com and set it up pretty easily. Stable, distributed and fast out of the box with a lot less maintenance.

> DynamoDB is far from suitable for that task


DynamoDB would be pretty close if it just allowed null values.

DynamoDB is effectively useless for querying, except perhaps for some sort of highly specialised application able to fit within the DynamoDB strange and arcane query model.

What sort of database is effectively useless for querying?

Also they need to ditch the really, really confusing and limiting scaling model. For a database that advertises scaling as one of its key strengths, DynamoDB sure has a bad scaling story.

> What sort of database is effectively useless for querying?

Cassandra, Riak, Voldemort, HBase, Bigtable, Azure Table Storage, and many other implementations of wide column stores have similarly limited querying.

I'm also not sure what you mean by the limiting scaling model. I can go from 0 to 160k reads/second by turning a knob, and 160k is only the default limit (you can request higher limits).

It is not a document store. It's a wide column store. Use it for the right job and it does very well. Treat it like postgres and you are gonna have a hard time.

The price for that 160k is horrifying though, esp. if the requirement is bursty rather than continuous.

Which is why you turn the knob back down when you stop being bursty.

But yes, it's pricy. It may not be the best fit for some. Hopefully by the time you're taking 160k writes per second you have a solid business model. I mean, Twitter peaked at around 8000 tweets per second. What are you doing that requires 160k, and do you really need to be storing it?

It's probably an indication that your use-case is not a good fit for dynamo, or that you didn't adapt your use-case to dynamo, you're doing something "wrong" like trying to use it as a relational database. I've experienced some of these pains as part of my dynamo learning curve.

For example by changing my query strategy I was able reduce the provisioned write units from 1900 to 150 (write units dominate the cost).

Ignoring reserved prices, it is $10.40/hr (these are eventually consistent reads, so half the cost of consistent ones). That puts it roughly on par with an RDS postgres r3.8xlarge instance with 10k provisioned IOPS.

Sure, you likely have more than one table on RDS, so that cost is amortized, but when you get to the scale where you need 160k reads/s, you aren't going to have much more than that one dataset in a single instance.
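The $10.40/hr figure above can be reproduced with a little arithmetic. This sketch assumes a provisioned rate of $0.0065 per hour per block of 50 read capacity units, and that one read capacity unit covers two eventually consistent reads per second (or one strongly consistent read) for items up to 4 KB. Neither number comes from the thread, so treat both as assumptions and check them against current pricing.

```python
import math

# Assumed pricing, not stated in the thread: $0.0065 per hour per block of
# 50 read capacity units (RCUs).
PRICE_PER_BLOCK_USD = 0.0065
RCUS_PER_BLOCK = 50


def hourly_read_cost(reads_per_sec: int, eventually_consistent: bool = True) -> float:
    """Estimate the hourly cost of provisioned reads for items up to 4 KB."""
    # Assumption: one RCU serves 2 eventually consistent reads/s, or 1 strongly consistent.
    reads_per_rcu = 2 if eventually_consistent else 1
    rcus = math.ceil(reads_per_sec / reads_per_rcu)
    blocks = math.ceil(rcus / RCUS_PER_BLOCK)
    return round(blocks * PRICE_PER_BLOCK_USD, 2)


if __name__ == "__main__":
    print(hourly_read_cost(160_000))                               # 10.4
    print(hourly_read_cost(160_000, eventually_consistent=False))  # 20.8
```

Under these assumptions, 160k eventually consistent reads per second need 80,000 RCUs, or 1,600 blocks, which gives $10.40 per hour; strongly consistent reads double that.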

It works well for a CQRS model, which helps with super high scale apps. But most devs want joins and don't want to take on the discipline of managing the data duplication.

I just rolled out a feature on DynamoDB and when monitoring it, I look at one thing: provisioned capacity vs consumed capacity. That's all I have to care about. No CPU, RAM, disk space metrics. Usage can increase 4x and performance is flat. It's great.

The application is less flexible and required making a lot of decisions up front, but operationally it's fantastic.

For my application I have found it is more complex than provisioned vs consumed capacity. I get throttling all the time when consumed capacity is a third of provisioned capacity.

You also need to care about how DDB does its underlying partitioning. It would be nice to turn the knobs and be able to trust you will get X reads/sec and Y writes/sec, but that is only true per node! Unfortunately, DDB gives you zero information about how many nodes your DDB table is running on! (Yes you can guess pretty well if you keep track of your usage rate and do some math).

So when provisioning, you need to be aware that if you have 100 provisioned read ops, but you have data on 5 nodes, you really only have 20 reads/sec if one key gets hot.

I agree it's pretty easy operationally, but you can get burned if you don't know how it works under the hood.
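The hot-key arithmetic above, as a small sketch. It assumes provisioned throughput is split evenly across partitions, which is how the comment describes it; the function names are made up for illustration and are not part of any AWS SDK.

```python
def per_partition_limit(provisioned_ops: int, partitions: int) -> float:
    """Throughput available to a single partition under an even split."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    return provisioned_ops / partitions


def is_throttled(ops_on_one_partition: float, provisioned_ops: int, partitions: int) -> bool:
    """True if the load hitting one partition exceeds that partition's share."""
    return ops_on_one_partition > per_partition_limit(provisioned_ops, partitions)


if __name__ == "__main__":
    # 100 provisioned reads/sec spread over 5 partitions.
    print(per_partition_limit(100, 5))  # 20.0
    # A hot key drawing 33 reads/sec is only a third of the table's
    # provisioned capacity, yet it exceeds the 20 reads/sec of its partition.
    print(is_throttled(33, 100, 5))     # True
```

This is one way a table can be throttled while its overall consumed capacity sits well below the provisioned number.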

I just ping support when I want to know partitions. They also told me a little trick. If you create a kinesis stream for your table, the number of shards in the stream is the number of partitions.

But you're right, part of designing for DDB is picking a proper partition key so you don't end up with hot shards.

Databases in this category are some of the most popular ones in the world with good reason. The only way you can scale is to adopt a query-free architecture.

It feels tedious at first but once you develop some good habits and frameworks around denormalization it becomes easy to do that from day one.

>> The only way you can scale is to adopt a query-free architecture

This is not really the case. There are database systems that can handle large scale and complex queries, although usually at the price of providing reduced consistency guarantees.

Actually I guess the query language and indexing is pretty limiting too.

I think Microsoft should implement basic aggregation functions first.

Implementing aggregation at query time is a temporary solution. For systems like this aggregation should be done on insert time - many hugely popular databases do not provide much more than a basic get operation for this reason
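A minimal sketch of what aggregating at insert time can look like, using in-memory dicts as a stand-in for the database. The `Store` class and its method names are hypothetical and are not part of DocumentDB or any other product; the point is only that the write path updates a precomputed total so that the read path is a plain get.

```python
from collections import defaultdict


class Store:
    """Toy store that keeps per-customer totals up to date on every write."""

    def __init__(self) -> None:
        self.orders = {}                  # order_id -> order document
        self.totals = defaultdict(float)  # customer_id -> running sum of amounts

    def insert_order(self, order_id: str, customer_id: str, amount: float) -> None:
        # Write the document and update the aggregate in the same operation.
        self.orders[order_id] = {"customer": customer_id, "amount": amount}
        self.totals[customer_id] += amount

    def total_for(self, customer_id: str) -> float:
        # The read path is a single lookup; no scan over orders is needed.
        return self.totals.get(customer_id, 0.0)


if __name__ == "__main__":
    store = Store()
    store.insert_order("o1", "alice", 10.0)
    store.insert_order("o2", "alice", 5.5)
    store.insert_order("o3", "bob", 3.0)
    print(store.total_for("alice"))  # 15.5
    print(store.total_for("bob"))    # 3.0
```

A real system would also have to handle updates, deletes and retried writes so the totals do not drift, which is part of the up-front discipline mentioned elsewhere in the thread.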

You might be right, but Mongo supports them, and Microsoft says that we can start using DocumentDB without changing a line of code, which is not true.

Yeah I don't get why they're marketing to Mongo users, weird choice. This is a DB for companies trying to move to a query-less architecture - something no one should be doing with Mongo.

I never thought about it that way but makes total sense. Clearly you 'get' it!

Interesting... So it's compatible with the MongoDB protocol but not using MongoDB internally?

What is your view of services that provide the functionality of some other software or SaaS and are API/protocol compatible?

Can API / protocols be copyrighted or patented? I believe not based on Google vs Oracle.

Hasn't tokuMX already put a 'proper' solid DB behind the mongo API?

TokuMX was discontinued last year

> First and foremost, security is our priority

In response to https://www.theregister.co.uk/2017/01/09/mongodb/ ?

"MongoDB databases are being decimated in soaring ransomware attacks that have seen the number of compromised systems more than double to 27,000 in a day."

As someone that almost got bitten by MongoDB's lax auth defaults, I was happy to read that DocumentDB has enabled access control out of the box and no default username/password.

Also, there's a query playground if you want to try it out quickly: https://www.documentdb.com/sql/demo

It's important to be aware of security implications of leaving an unauthenticated server listening on the open internet (listening on is not the default since some time now and if installing the rpm/deb package listening on is the default option). Also never leave an internet facing server without a firewall.

As a SaaS it's not surprising DocumentDB got security configured, and it also won't be surprising when people lose data because they'll put '123456' as their password or commit their password to a public repository

Pretty much everything Azure does is over TLS and requires authentication.. some of the authentication for services is more convoluted than others.

Personally, I'm pretty happy with how easy it is to use the Azure Storage services (blob, tables, queues) as well as their Azure SQL offering. Far less arcane configuration options than you get with AWS's competing options. If only their compute nodes weren't so pricey.

Of course, you really shouldn't be exposing databases if you don't know how to administer them. I agree that the defaults should be better, but...

We used to use DocumentDB, but switched to Azure Table Storage a while back. Did some benchmarking and DocumentDB was too slow for our needs (getting documents for a range between two epochs). Not sure if others experienced the same thing or if things have gotten better since then though.

Never looked at DocumentDB before. So if I get this straight, I can get a fully managed DB that can scale easily, but still have all the advantages and compatibility of a regular NoSQL like Mongo?

I think that's a first, right?

Except that the advantages and compatibility are not all there. Plus there are now fully managed options for the real thing.

The advantages are real. Compatibility might not be 100%, but honestly Mongo isn't magic. DocumentDB has strengths of its own.

I'd be happy to learn about your experience developing an application against DocumentDB and its strengths. Care to share?

I'm actually building a fairly massive app with DocumentDB and Service Bus. It's not a user-facing app, so I don't know if that counts for what you were talking about.

Sure, use case aside - expressiveness of the query language, learning curve, secondary indexing etc. - keen to hear how all this feels to a developer new to the platform.

Percona have their drop-in MongoDB replacement that uses TokuMX under the hood and they'll manage it for you.

TokuMX was discontinued last year

MongoDB Atlas will manage MongoDB for you

Isn't it the same general idea as Amazon's DynamoDB?

So long as you skip over the costs if you want any kind of performance (you need to configure a "number of accesses per time period" with costs scaling alarmingly the higher you go)

DynamoDB isn't the same thing at all; it's very limited and you need to use it in very specific ways. You certainly can't just use your current MongoDB data with it.

Fully managed easily scalable nosql DBs have been available for many years as CouchDB variants from multiple providers.

I hear it's got even better write performance than /dev/null ;)

Potter's paying 50 cents on the dollar for your shares in the Building & Loan...

From frying pan to fire probably.

Trading privacy for a false sense of security, are we?

Trading privacy? Can you elaborate?

The idea of storing sensitive data at one of the most data-hungry companies in the world "for security" doesn't sound like it came from a genius. I thought I was pretty clear earlier?

Moving FOSS into the cloud as SaaS sounds kinda regressive to me...

If you can explain your fears perhaps we would understand, because I still don't know why using Azure is a privacy trade? Does Microsoft "steal" the data?

It happens all the time. Also, when using proprietary software you never know, and that's the big issue (for me). I never use SaaS or MS products (I only use FOSS) so I don't fear for myself.

When has a SaaS operator stolen data hosted on their service?

Not agreeing with thread OP here, but I would certainly differentiate between "stealing" and "exploiting" (not in a security exploit sense). User data certainly gets exploited on _some_ SaaSs that would otherwise be unexploitable on your own stack.

I'm not saying this necessarily applies to compute engines or storage as a service or whatever, but something like gmail (SaaS) where your data is used to target ads at you could be considered exploiting your data. I would not put it beyond large companies to start considering doing the same on their storage-as-a-service offerings soon enough.

The difference is that in the case of Gmail, monetisation is through ads, and in the case of compute engine or storage the monetisation is through client payments.

If Google, Microsoft or other companies start to look at the data to exploit it, they will lose trust, the customers and the data.

The question was about Microsoft, and it depends on the definition of "stealing" and the definition of "your". It's in the terms and conditions when using their services. As a user you agree to a lot that you probably wouldn't agree to if you took your time to read the fine print. Many SaaS operators also have a history of selling data to others, and a breach of the service might affect all users of that service and not just one. Many companies have restrictions against usage of SaaS for exactly those reasons...

> It's in the terms and conditions when using their services.

Then it will be easy to point to those terms for the Azure service.

> SaaS operators has also a history of selling data to others,

Then it will be easy to link to news about those SaaS operators selling the data.

> Many companies have restrictions against usage of SaaS for exactly those reasons...

Then it will be easy to bring examples of those companies with restrictions.

If you haven't looked it up for yourself in 10 days (when I have access to my PC) and if I remember this, I will send you some references to read. But if you really want to find out (which I doubt) just look it up yourself. It's so easy. I get the feeling you have some affiliation with MS?

> If you haven't looked it up for yourself in 10 days

I have looked and I have not found a single case of Microsoft stealing Azure data, or of any big SaaS company like Amazon or Google doing that.

> But if you really want to find out (which I doubt)

I'm not the one making unsupported claims without trying to provide a single example. Perhaps the one who doesn't want to see that their claims are wrong is not me.

> I get the feeling you have some affiliation with MS?

No, I don't have any affiliation with Microsoft and it is not one of my most loved companies. And I'm not the one accusing others of being shills or of having secret ties with companies. I'm also not the one making unsupported claims.


Also, Azure has the best suite of compliance certifications, which demonstrates their commitment.

BTW, Have you read the terms and conditions before speculating on them?

> It happens all the time

Still waiting for any proof of your claim.

Next time when I have access to a PC I can do the searching and reading for you, but until then you will have to wait. I will have access to a PC in about 10 days. I'm not great at typing without a keyboard. But if you can't wait, just search it on the web. I'm sure HN is full of articles about Microsoft and privacy issues as well.

Watching Microsoft try hard to get developers to work on their platform and fail has become a lot of fun. Microsoft deserves it for being evil. Example: Microsoft does not save history in the cmd shell (it's so irritating for devs). The height of the cruelty is that they aliased curl and wget by default to their own program (I do not remember which).

Having a subpar shell application is evil?

Aliasing linux command tools to its own is definitively evil https://daniel.haxx.se/blog/2016/08/19/removing-the-powershe...

You cannot look at the code, and cannot monitor the infrastructure. The only thing left is trust.

Trust in the belief that Microsoft will act in your best interest regarding the privacy of your data.

But isn't it reasonable then to ask if Microsoft is actually trustworthy? PRISM, NSAKEY, Flame malware propagating via Windows Update, their 0day policy... I don't think Microsoft is trustworthy.

People choose managed services all the time.

If you worry about interception, then code inspection and monitoring isn't going to give you any assurances. You'd have to run open-source software locally, audit it, and not put it on a cloud like Azure in the first place?

(And the NSAKEY was something completely different if you dig into it.)

Using a service means trusting a service. Nothing wrong with either, it's a decision up to the consumer.

I am just saying, in this specific case, should you trust Microsoft with your data? That's all.


If you don't trust MS you can trust one of the organisations that certified them for the strictest compliance and regulations in the public cloud space.

Sounds good. But how do you exactly verify those claims? Again, you can't. You are back to square 1: trusting Microsoft acts in good faith and acts in your best interest.

How do you verify that a doctor is making the right call re how to treat your cancer? You either become a doctor yourself, or trust that they know what they're doing.

And what if a doctor has a controversial reputation?

Then you can make a decision to go with another doctor. Are you questioning the reputation of the auditors, or just writing off MS across the board?

You can have audit-proof software that is completely secure. If you tamper with the infrastructure, share keys, leave obscure backdoors, etc., it is not hard to come up with a NOBUS scheme.

Well, do you trust Amazon, Google, Oracle or Rackspace? How about your local mum pop hosting provider? How is Microsoft different?

A track record of backdooring everything that can be backdoored.
