

Migrating from MongoDB to Cassandra - jeremiahjordan
http://www.fullcontact.com/blog/mongo-to-cassandra-migration/?

======
DigitalSea
When I read posts like this all but confirming that MongoDB isn't the great product
10Gen makes it out to be, I wonder how the heck MongoDB is still even relevant.
Then I remind myself that 10Gen has one of the best marketing and sales teams in
the game at the moment.

While MongoDB has improved greatly over previous versions, I can't help but
feel that if 10Gen put as much effort into improving their product as they do
into selling it, Mongo would be a force to be reckoned with!

MongoDB is good at some things, but I think most people who try it and fail
fall into one of two camps: either 10Gen sold them on it, or they bought into
the hype without assessing their project requirements and making sure MongoDB
was a sensible choice.

~~~
CptCodeMonkey
Knowing some of the FC people first-hand, MongoDB actually served them
fairly well for a substantial amount of time. Until they started hitting its
limits, it didn't really make sense to move to C*.

Going straight to C* or something like it would have been almost cargo-cultish
(e.g. "if we build industrial strength, we will get industrial levels of
traffic").

~~~
threeseed
Exactly. Can you imagine all of the Java/C developers telling Ruby developers
that they are idiots and don't know anything about programming? Simply
because they chose a technology that is designed for developer productivity
at the expense of scalability.

Because that's what seems to happen in every database discussion.

~~~
lucian1900
The better comparison would be to PHP, and indeed, no one should ever use
tools of quality as poor as PHP or MongoDB.

------
jchrisa
Viber, one of the largest over-the-top messaging apps, recently shared their
conversion from Mongo to Couchbase. They ended up needing less than half of
their original servers, and got better performance.

If you want to see a video of their engineer telling the story, it's available
here: [http://www.couchbase.com/presentations/couchbase-
tlv-2014-co...](http://www.couchbase.com/presentations/couchbase-
tlv-2014-couchbase-at-viber)

~~~
prottmann
As always: "use the right tool for the job". I don't think this was
MongoDB's (10gen's) fault; they (Viber) chose the wrong database type for their
needs.

~~~
pessimizer
That's really easy to say, so it's important to show your specific reasoning.

------
rco8786
_read first half of article_

 _get spammed_

 _leave immediately_

~~~
brightsize
Same here.

~~~
Xorlev
Author here, sorry about that.

It's not supposed to display on engineering posts but something must have
changed/been broken recently. We know you guys aren't interested in marketing
content.

Again, sorry about that and thanks for bringing it up.

~~~
MBCook
I read the article on my iPhone and found the hovering "top" button in the
lower right both unnecessary and very obtrusive.

iOS has a standard way to jump to the top (tap the top of the screen), so there
is no need to interfere with the content.

------
monkey26
This article caught my interest, as I've been reading up on Cassandra. But some
previous research had me thinking that Cassandra works best with under a
TB per node. Is SQL still better when you have really large nodes (16-32TB) and
only really want to scale out for more storage?

I'm currently humming along happily with Postgres, but Cassandra's distributed
features and availability story look really nice.

~~~
jbellis
Cassandra 2.0 can handle 5TB per node easily, 10TB with some care. Best to
scale out, not up.

That said, if someone else has already made the hardware choice for you, you
can always run multiple C* nodes on a single machine. I know several
production clusters that fit this description.

~~~
olavgg
PostgreSQL can handle petabytes easily. However if you need to query a
petabyte of data, then you need to rethink your solution. PrestoDB + Hive +
Hadoop may be what you need.

~~~
ddorian43
So can you put petabytes on one server? Or can PostgreSQL shard?

------
fiatmoney
Sounds like the intended use case for ElasticSearch.

"Given some input piece of data {email, phone, Twitter, Facebook}, find other
data related to that query and produce a merged document of that data"

~~~
Xorlev
I will say ElasticSearch features heavily in our infrastructure elsewhere, but
for the Person API product, it's purely a primary-key lookup.

My coworker wrote a bit about how the search functionality of our offering
works here:
[http://www.fullcontact.com/blog/sherlock_search_engine_that_...](http://www.fullcontact.com/blog/sherlock_search_engine_that_does/)

That should make it clearer why we do PK lookups.
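
Purely as illustration (this is not FullContact's actual schema; keyspace, table,
and column names below are invented), a primary-key lookup in Cassandra boils
down to a single-partition read, roughly like this with the DataStax Python
driver:

    # Hypothetical sketch: a single-partition, primary-key read in Cassandra.
    # Keyspace, table, and column names are made up for illustration.
    from cassandra.cluster import Cluster

    session = Cluster(["cassandra-node-1"]).connect("person_api")
    lookup = session.prepare("SELECT profile FROM profiles WHERE contact_id = ?")

    def fetch_profile(contact_id):
        # One partition, one row -- the only access pattern the product needs.
        row = session.execute(lookup, [contact_id]).one()
        return row.profile if row else None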

------
brown9-2
Is DynamoDB never a serious option for people in situations like this who are
already heavily on AWS?

~~~
Xorlev
One of the key deterrents for us was the 64KB item limit.

"The total size of an item, including attribute names and attribute values,
cannot exceed 64KB."

While we don't often have values over 64KB, it's possible, and we didn't want to
have to store profiles separately from their metadata.
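
For a rough sense of what that limit means in practice (the helper below is just
an illustration, not anything we actually ran; DynamoDB counts attribute names
plus values toward the per-item limit, so serializing to JSON only approximates
that accounting):

    # Hypothetical illustration of the DynamoDB item-size ceiling discussed above.
    import json

    DYNAMO_ITEM_LIMIT_BYTES = 64 * 1024  # the limit at the time of this discussion

    def fits_in_dynamo(item: dict) -> bool:
        # Approximate the item's size by measuring its JSON serialization.
        return len(json.dumps(item).encode("utf-8")) <= DYNAMO_ITEM_LIMIT_BYTES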

------
lynchdt
"To buy us time, we ‘sharded’ our MongoDB cluster. At the application layer.
We had two MongoDB clusters of hi1.4xlarges, sent all new writes to the new
cluster, and read from both..."
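
For reference, that "write to new, read from both" approach amounts to something
like the following sketch (my own illustration assuming pymongo; hostnames,
database, and collection names are made up, not FullContact's code):

    # Rough sketch of application-layer routing: all new writes go to the new
    # cluster, reads check the new cluster first and fall back to the old one.
    from pymongo import MongoClient

    old_profiles = MongoClient("mongodb://old-cluster:27017")["persons"]["profiles"]
    new_profiles = MongoClient("mongodb://new-cluster:27017")["persons"]["profiles"]

    def save_profile(profile):
        # New and updated documents only ever land on the new cluster.
        new_profiles.replace_one({"_id": profile["_id"]}, profile, upsert=True)

    def load_profile(profile_id):
        # Prefer the new cluster; fall back to the old one for untouched documents.
        return (new_profiles.find_one({"_id": profile_id})
                or old_profiles.find_one({"_id": profile_id}))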

I'm curious about this. Why were you doing the sharding manually in your
application layer? Picking a MongoDB shard key - something like the id of the
user record - would produce some fairly consistent write-load distribution
across clusters. Regardless - it seems like write-load was a problem for you,
yet you sent all the write load to the new cluster - why not split it?

~~~
Xorlev
As explained, it was a stop-gap solution for data storage only; we did not
have a problem with write load on SSDs.

We were at the point where MongoDB sharding was just as difficult to deploy as
moving to Cassandra, which better fit our availability goals. MongoDB
sharding isn't instant by any means for existing clusters.

------
talas9
Yet another shining example of throwing money and time away to work within AWS
constraints, when bare metal and OpenStack (1) would have solved it more cheaply
(2) and arguably faster.

(1) If you insist on cloud-provisioning instances, even though that makes little
sense when the resources are as strictly dedicated as they are in this case.

(2) VASTLY, over time -- these guys are pissing money away at AWS and I hope
their investors know it.

~~~
Xorlev
We understand that AWS comes at a premium. However, at this stage of our
organization, we find the opportunity cost of losing the agility we have on AWS
more expensive than the cost delta between staying on AWS and moving onto our
own hardware.

Our organization is acutely aware of our costs and still strives to minimize
them. Our move to Cassandra saved 79% over continuing to run our reserved SSD
nodes & backup replica.

------
mcot2
TokuMX would solve all of these issues. 2TB goes a long way in TokuMX with
LZMA compression.

