
Azure Cosmos DB: Microsoft's Cloud-Born Globally Distributed Database - ingve
https://muratbuffalo.blogspot.com/2019/04/azure-cosmos-db-microsofts-cloud-born.html
======
kuzehanka
After trying CosmosDB I put it down pretty quickly. I can't say I'd recommend
it to anyone.

Shockingly poor perf. Is CosmosDB really that bad or did we have it
misconfigured? We're not sure, the docs didn't help us understand.

Random failures when connecting. Random errors (or worse, no errors but
unexpected results) when querying.

Largely undocumented.

Painful interop with non-azure proprietary offerings.

Almost zero developer community. Most of the coverage of CosmosDB isn't by
developers, it's by various MS affiliated blogs such as this one, which are
more advertisements than resources.

Unfortunately CosmosDB seems to fit the Azure MO of investing more into
sales/marketing than into engineering/support. I don't know about the general
atmosphere, but in the circles where I work, Azure and its various proprietary
components are losing developer goodwill at a staggering pace.

~~~
foobarbazetc
I can say this exact thing about most Azure services. Nothing really works as
documented and everything is more expensive than it should be. I always feel
like I’m missing something.

~~~
tgtweak
It's hard to get behind cloud services that require you to send an email to
support engineers to reboot your instance.

~~~
rad_gruchalski
Wow, is that really the case? The last “cloud provider” I remember this being
the case for was SoftLayer. Is Azure really that bad?

~~~
tgtweak
With AKS (at least in the early days, not beta or anything, mind you) the
Kubernetes coordinator node was managed by Microsoft. After some arbitrary
number of days or weeks it would become unresponsive, and the recommended
action was to open a ticket and wait for them to restart it.

~~~
verst
Azure Kubernetes Service provides a free managed Kubernetes control plane.
It is certainly possible that prior to general availability a support ticket
had to be opened to restart this. The control plane's underlying VMs aren't exposed
on purpose. That's similar to what other providers do with managed Kubernetes
services.

Hopefully the unresponsiveness issue was addressed quickly after it was
reported by you.

------
lkschubert8
The absolute worst bug I have ever had to debug was using CosmosDB as table
storage. Everything worked fine up until we hit about 1700 records in table
storage. Once we hit that point it just stopped returning any records. We
eventually found out that if you query using a field that isn't indexed, then
instead of throwing an error or something sane like that, it acts like
everything is fine and returns an empty set of records.

~~~
skrebbel
Woa, wtf.

Is that indicative of the general quality level of cosmos?

~~~
zzbzq
Sort of. It doesn't really sanity-check anything, ever. It's more of a blank
slate and you have to build whatever rules you want into your software layer.

The default mode is to index every field, so you can't get into the OP's
situation until you start trying to fine-tune it. He basically turned off the
index and then tried to search by it. This is not expected for people who just
bring their RDBMS assumptions in and try to wing it without reading any
documentation whatsoever.

~~~
lkschubert8
I don't know if something changed, but there were no default indexes back when
this happened (a year and a half ago). I had done no performance tuning, just
created the document collections and added items to them.

~~~
AaronFriel
It sounds like you inadvertently changed the index structure or deleted it.
There have always been default indexes, and Microsoft's tuning advice is
always to _add_ special case indexes to the wildcard index they provision. If
you know your data model well and want to improve write performance (RUs), you
can delete the wildcard index. That's how the indexing policy has worked since
the product launched.

[https://docs.microsoft.com/en-us/azure/cosmos-db/how-to-
mana...](https://docs.microsoft.com/en-us/azure/cosmos-db/how-to-manage-
indexing-policy)
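
A minimal sketch of the difference between the default wildcard policy and a tuned one, plus a client-side guard that fails loudly instead of returning an empty result. The `includedPaths` / `excludedPaths` shape follows the indexing-policy JSON from the linked docs; the helper names, the sample field names, and the matching logic are assumptions for illustration and ignore most of the real precedence rules.

```python
# Default-style policy: the wildcard path "/*" covers every field.
DEFAULT_POLICY = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [],
}

# Tuned policy: wildcard excluded, only one scalar field indexed.
# "customerId" is a made-up field name for this sketch.
TUNED_POLICY = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/customerId/?"}],
    "excludedPaths": [{"path": "/*"}],
}


def is_field_indexed(policy: dict, field: str) -> bool:
    """Simplified check: wildcard included, or the exact scalar path included."""
    included = {entry["path"] for entry in policy.get("includedPaths", [])}
    return "/*" in included or f"/{field}/?" in included


def require_indexed(policy: dict, field: str) -> None:
    """Raise instead of letting a filter on an unindexed field go through."""
    if not is_field_indexed(policy, field):
        raise ValueError(f"field {field!r} is not covered by the indexing policy")
```

Calling `require_indexed(TUNED_POLICY, "status")` raises before any query is sent, which is the "something sane" behaviour asked for upthread, implemented on the application side.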

------
ignoramous
This post is interesting but one or two MS employees did reveal some key
details about CosmosDB in the original announcement thread:
[https://news.ycombinator.com/item?id=14308814](https://news.ycombinator.com/item?id=14308814)

(edited)

> There are many significant differences in capabilities, and design
> approaches between other systems (CockroachDB and Spanner) and Cosmos DB. At
> a very high level differences are at two levels - the design of the database
> engine and the larger distributed system.

> The database engine design is inspired by

> 1\. LLAMA:
> [http://db.disi.unitn.eu/pages/VLDBProgram/pdf/research/p853-...](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/research/p853-levandoski.pdf)

> 2\. Bwtree:
> [https://pdfs.semanticscholar.org/7655/9c6cc259c6ab5baf7bd19d...](https://pdfs.semanticscholar.org/7655/9c6cc259c6ab5baf7bd19dd54bb6c773736a.pdf)

> 3\. Schema-agnostic indexing techniques:
> [http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf](http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf).

> Please note that these papers are significantly behind the current state of
> the implementation. The most crucial aspect that these papers don't cover is
> the integration of the database engine with the larger distributed system
> components of Cosmos DB including the resource governance, partition
> management, and the implementation of replication protocol / consistency
> models etc.

> Our goal is to publish all of the design specifications including TLA+ specs
> over time.

\---

[https://azure.microsoft.com/en-us/blog/a-technical-
overview-...](https://azure.microsoft.com/en-us/blog/a-technical-overview-of-
azure-cosmos-db/)

[https://azure.microsoft.com/en-us/blog/azure-cosmos-db-
pushi...](https://azure.microsoft.com/en-us/blog/azure-cosmos-db-pushing-the-
frontier-of-globally-distributed-databases/)

~~~
dtrailin
The TLA+ specs were eventually published [0], although I wonder how up to
date they are with the current state of the DB, since they haven't been
updated for a while.

[0] [https://github.com/Azure/azure-cosmos-
tla](https://github.com/Azure/azure-cosmos-tla)

~~~
rnehme
Hi, I am from Azure Cosmos DB Engineering Team.

The TLA+ specs are fairly up to date.

Thanks.

------
chupa-chups
We used it also. Ran into the issue that collection (?) names are limited to
64 chars(1) (which wasn't documented anywhere), triggering obscure errors
while everything was fine from the Azure Frontend.

Also, even simple queries using an index (and a uniform document structure)
cost a huge number of RUs compared to a competitor.

Icing on the cake was the non-support for paging at that time, and a total
fuck-up when we tried a restore from backup operation (which can only be done
by calling MS support, no automated way possible). If anyone wants to know
details I can provide those.

Needless to say we switched to another provider.

(1) the names were generated by our automated deployment step, something like
{environment}_{product}_{collection-name}

~~~
sthota
Hi, I am from Cosmos DB engineering team.

For SQL API, the collection names have a limit of 255 bytes (we will update
the documentation). For Mongo API, we honor the limit prescribed by the API
for the combination of <database>.<collection> to be 120 bytes as explained
here -
[https://docs.mongodb.com/manual/reference/limits/#Restrictio...](https://docs.mongodb.com/manual/reference/limits/#Restriction-
on-Collection-Names)
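
A minimal sketch of a pre-deployment check against the two limits quoted above (255 bytes for SQL API collection names, 120 bytes for the Mongo API's `<database>.<collection>`). The function names, the sample naming scheme, and counting bytes as UTF-8 are assumptions for illustration; none of this comes from an SDK.

```python
SQL_API_NAME_LIMIT = 255     # bytes, as stated in the reply above
MONGO_NAMESPACE_LIMIT = 120  # bytes for "<database>.<collection>", as stated above


def byte_len(text: str) -> int:
    """Length in bytes, assuming UTF-8 (an assumption of this sketch)."""
    return len(text.encode("utf-8"))


def fits_sql_api(collection: str) -> bool:
    """True if a non-empty collection name is within the SQL API limit."""
    return 0 < byte_len(collection) <= SQL_API_NAME_LIMIT


def fits_mongo_api(database: str, collection: str) -> bool:
    """True if the combined namespace is within the Mongo API limit."""
    return byte_len(f"{database}.{collection}") <= MONGO_NAMESPACE_LIMIT


def generated_name(environment: str, product: str, collection: str) -> str:
    """Hypothetical generator in the style {environment}_{product}_{collection-name}."""
    return f"{environment}_{product}_{collection}"
```

Running a check like this in the deployment step would surface an over-long generated name before the service rejects it.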

We continue to optimize our indexing layout – there are several key
improvements being rolled out. Please share the details of your expensive
query (askcosmosdb@microsoft.com) to help us investigate.

OFFSET/LIMIT (Skip/Take) is currently in private preview. It will be broadly
available by 5/15/2019.

We are working on customer controlled PITR, which will be available later this
year.

Thank you for the feedback!

------
manigandham
CosmosDB does have problems but the comments here seem to mostly be about not
understanding the data model. It's a JSON-based document-store that
distributes data across partitions.

It offers several interfaces (set at the database level) but that doesn't mean
all the functionality is supported. It works well with MongoDB, Cassandra, and
Table storage but you won't get relational SQL joins or fast graph search
queries in Gremlin.

I agree that marketing and documentation are poor, which leads to these
misunderstandings. Multi-model is never perfect and people forget that
emulation has a cost in performance, price and functionality.

~~~
outside1234
The problem is that their marketing and docs make it sound like it is the
solution for everything.

Legacy Cassandra workload? Use CosmosDB

Graph Database? Use CosmosDB

Legacy MongoDB database? Use CosmosDB

SQL Server database? Use CosmosDB

When the only really valid use case is:

Document Database? Use CosmosDB

~~~
manigandham
Yes, I said exactly that.

Multi-model is a hard problem and only really works with similar data models.
It's also confusing since people conflate SQL and relational semantics when
it's really just a query language.

It does work well enough if you have an OLTP document use-case like MongoDB,
Cassandra, and JSON/SQL.

~~~
dagss
Disagree about "conflate". SQL is very geared towards relational semantics.

They may be independent in theory (a stretch), but any non-trivial SQL query
(say a medium sized one with 50 lines and 10 tables) is meaningless without a
relational DB underneath.

What Cosmos calls "SQL" is not SQL at all, it lacks almost everything, it just
borrows a few keywords. If you can't even do inner joins and left joins and so
on between different tables, it is NOT anything close to SQL. Joins are sort
of the point of SQL.

------
shearnie
CosmosDB still doesn’t support skip and take. And aggregate queries are
incredibly inefficient. We’ve had to move a heap of data from CosmosDB to SQL
Server; our reports went from 3 hours to 5 minutes after the move.

~~~
cheschire
I may be taking your post at face value, but I suggest if you are using skip,
i.e. LINQ Skip(), that you stop. SQL's performance using OFFSET is pretty
terrible when you start getting into the tens of thousands of records. Use
the seek method instead.

Here's some random google result that talks about the concept. There are
plenty out there though.

[https://blog.jooq.org/2013/10/26/faster-sql-paging-with-
jooq...](https://blog.jooq.org/2013/10/26/faster-sql-paging-with-jooq-using-
the-seek-method/)
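
A minimal sketch of the two paging styles side by side, using an in-memory SQLite table as a stand-in (not Cosmos DB or SQL Server). The `items` table and the function names are made up for this example.

```python
import sqlite3


def make_db(n_rows: int) -> sqlite3.Connection:
    """Create an in-memory table with ids 1..n_rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO items (id, name) VALUES (?, ?)",
        [(i, f"item-{i}") for i in range(1, n_rows + 1)],
    )
    return conn


def page_by_offset(conn: sqlite3.Connection, page_size: int, page_index: int):
    """OFFSET paging: the engine typically still steps over every skipped row."""
    return conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, page_size * page_index),
    ).fetchall()


def page_by_seek(conn: sqlite3.Connection, page_size: int, last_seen_id: int = 0):
    """Seek paging: resume right after the last key the caller has seen."""
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()
```

The caller keeps the last `id` of each page and passes it back as `last_seen_id`, so the cost of fetching a page does not grow with how deep into the result set it is. The trade-off is that you can only move to the next page, not jump to an arbitrary page number.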

------
suff
FTA: "say datacenter automatic failover component, in one part of the
territory. Getting this component right may take months of your time. But it
is OK. You are building a new street in one of the suburbs, and this adds up
to the big picture."

Wait, what?! Driving metaphor aside... 'MONTHS of your time' to implement
failover? People, this is WHY the cloud was invented, so we didn't have to
spend months re-inventing. In AWS and even GCP, this takes a day or less if
you know the (well documented easy to use) storage offerings. Seriously
reconsider your selection criteria when you start saying things like this,
because what I just heard is that my team just told me implementing failover
is going to cost $60k. Guess how much easier that made my case to switch to
another cloud-native offering? TCO over everything. Even Ballmer would agree
with that - he made the same case for Windows, against Linux in the 90's.

~~~
silverlake
He's talking about building the underlying failover mechanism for CosmosDB. For
a customer it's easy and automatic, but GCP and AWS and Azure have to build it
first.

~~~
suff
Surely you're referring to Microsoft's team's time to implement failover to
the product offering, not time every customer spends on implementation???

FTA: "if _you_ are an engineer working on a small thing, say datacenter
automatic failover component" (emphasis added). Pretty sure he's talking about
EVERY customer spending months to turn on fail-over.

~~~
samdixon
The previous poster is correct, I think you are mistaken. It should read like
this:

"if you are an engineer at Microsoft working on a small azure component, say
datacenter automatic failover component"

At least this is how I read it since the OP works at MS it appears.

------
dagss
Beware of the REST API of Cosmos. We fell into the trap of using it..

1) it is not covered by the latency SLA. This is buried pretty deep...

2) suddenly you get magical responses that are supposed to trigger the client
to do something (e.g. SQL queries with ORDER BY require some algorithm on the
client side). These things are not documented anywhere; you have to read the
Node or Python client libs.

Using Cosmos mainly requires being on .NET and using the official driver that
communicates over a closed undocumented binary protocol. (even the Java driver
having the full SLAs and using the binary protocol was launched just weeks
ago).

IMO it would have been less misleading if MS had just removed the docs for
their REST API. Or at least put up a big warning about it being an
undocumented afterthought.
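
To illustrate the kind of client-side work point 2 refers to: if each partition returns its own results already sorted, the client has to merge them into one ordered stream. The sketch below is a generic k-way merge, assuming that is roughly what is needed; it is not taken from the official client libraries, and the function name and document shape are hypothetical.

```python
import heapq
from typing import Iterable, Iterator


def merge_ordered_partitions(
    partition_results: Iterable[Iterable[dict]], order_by: str
) -> Iterator[dict]:
    """Merge per-partition result streams, each already sorted by `order_by`,
    into a single stream with the same ordering."""
    return heapq.merge(*partition_results, key=lambda doc: doc[order_by])
```

Because `heapq.merge` is lazy, the client only needs to hold one pending document per partition at a time, which also works when each partition's results arrive page by page.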

~~~
kirillg_hn
Hi, I am from Cosmos DB engineering team,

Cosmos DB offers Java, Javascript, Python, and .NET drivers. Just like with
most databases, it is recommended for applications to use drivers to work with
Cosmos DB.

Cosmos DB offers REST API to work with data, documented at [1]. The primary
focus of the REST API is SDK developers for platforms where we do not provide
drivers yet. Just like with most databases, we recommend using the provided
drivers where available. This REST API is not intended for broad consumption
by apps. Your point about being more explicit about the scenarios the REST API
is intended and supported for is well taken. We will improve the
documentation.

[1] [https://docs.microsoft.com/en-us/rest/api/cosmos-
db/](https://docs.microsoft.com/en-us/rest/api/cosmos-db/)

thanks for feedback!

~~~
dagss
Even in this reply, you fail to mention that the JavaScript and Python (and
the old Java) drivers use another protocol for which the latency SLA does not
hold!

That information is vital. And it looks as though you are correcting me, when
you are not.

This kind of attitude (throughout documentation) is my exact issue -- MS is
always about selling, always hiding critical info, at the expense of helping
engineers take the right choices.

------
marcelftw
It costs an arm. It's not even Mongo 4; it has a subset of Mongo 3.6 features:
not all aggregation queries are available. Would not recommend.

------
jjirsa
Hey CosmosDB team in the comments:

Building distributed databases at scale is hard af, few people know what it’s
like to run hundreds of thousands of databases. Don’t be discouraged by random
negative posts, remember you’re doing something most of the world can’t do.

~~~
pratnala
Thanks for the kind words.

------
kerng
CosmosDB used to be called DocumentDB, which Amazon now uses as the name for
their service too.

------
he0001
Do they guarantee that a write done in one region is, once the call returns,
available when read in another region?

~~~
balakk
Yes, CosmosDB lets you choose your preferred consistency level.

[https://docs.microsoft.com/en-us/azure/cosmos-
db/consistency...](https://docs.microsoft.com/en-us/azure/cosmos-
db/consistency-levels)

The performance impact of the different consistency levels is described here.

[https://docs.microsoft.com/en-us/azure/cosmos-
db/consistency...](https://docs.microsoft.com/en-us/azure/cosmos-
db/consistency-levels-tradeoffs)

As you would expect, stronger consistency leads to lower throughput.
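
A small sketch to make the answer to the original question concrete. The five level names are the ones listed in the linked docs; the function and its rule (only the strongest level gives every reader in every region the completed write) reflect my reading of those docs and should be treated as an assumption, not an official API.

```python
from enum import Enum


class ConsistencyLevel(Enum):
    """The five levels from the linked docs, listed strongest to weakest."""
    STRONG = "Strong"
    BOUNDED_STALENESS = "Bounded staleness"
    SESSION = "Session"
    CONSISTENT_PREFIX = "Consistent prefix"
    EVENTUAL = "Eventual"


def any_reader_sees_completed_write(level: ConsistencyLevel) -> bool:
    """True if a read from any region, by any client, is expected to observe
    a write as soon as that write has been acknowledged.
    Modeled here as STRONG only (an assumption of this sketch)."""
    return level is ConsistencyLevel.STRONG


levels_with_guarantee = [
    level.value for level in ConsistencyLevel if any_reader_sees_completed_write(level)
]
```

So the "yes" above is conditional: the guarantee being asked about depends on picking the strongest level, with the throughput cost that comes with it.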

------
deniswsrosa
As far as I know, apart from the global distribution, virtually any major
database offers the same thing today.

------
manigandham
Here's a tech overview of Azure Cosmos DB from last year:
[https://www.youtube.com/watch?v=V_C7DlKVofc](https://www.youtube.com/watch?v=V_C7DlKVofc)

Should help with most of the questions here.

------
ZeroCool2u
Based on the feedback here, it's hard to imagine anyone choosing Cosmos DB
over Spanner or just CockroachDB. I'm not familiar with the AWS equivalent,
but it seems like Azure isn't exactly setting the bar high.

~~~
jjirsa
Really? You can’t figure out why someone might use a Microsoft hosted db over
one at google or one they’d have to run themselves?

Azure’s growth rate is amazing. There’s a ton of adoption there that doesn’t
have access to spanner, or maybe has a document based data model, where
spanner / cockroach doesn’t make any sense

------
niahmiah
Sounds similar to cockroachdb, by ex google devs
[https://www.cockroachlabs.com/](https://www.cockroachlabs.com/)

~~~
taffer
Cockroachdb is a proper SQL Database, Cosmos is just a Document Store that
also has an SQL interface.

~~~
jng
Can you explain in some precision what you mean by that?

~~~
manigandham
SQL = structured query language. It's just an interface to access data. All
relational databases offer it but so do many other non-relational systems.
This means using SQL to read/write data is completely separate from having
relational functionality like joins.

------
youdontknowtho
This was a really interesting write up. Doing a sabbatical with a large
distributed systems team sounds really cool.

