
What The Heck Are You Actually Using NoSQL For? - abhijitr
http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html
======
jeremyjarvis
Most popular use right now is installing on [usually only] two nodes, then
writing a blog post about how it's going to change the world, then going back
to using good ol' MySQL etc ;)

------
jforman
We're (Inkling) successfully using Cassandra in production to store most of
our data.

I wrote the first version of our data store modeled off of FriendFeed's
architecture (back when we were five people). There is an excellent blog post
about it here:

<http://bret.appspot.com/entry/how-friendfeed-uses-mysql>

When Cassandra came along, we had long debates internally about the risk of
hitching ourselves to their wagon. Eventually we decided that an open source
project with a similar (enough) architecture to what we'd been building
ourselves was preferable to maintaining a python library in-house.

So far we have no regrets. We have had no production issues and we have an
architecture that is built to handle a large amount of incoming data (every
note and highlight you take in an Inkling book is synched to our servers). We
had already stripped away most SQL functionality by using FF's architecture,
so we didn't lose much. And we gained a lot - the ability to tune the
durability of different writes depending on the use case, far better fault
tolerance, a conceptually simpler indexing strategy, etc.

It definitely has a learning curve - error and status reporting is not
wonderful, you generally need to be over-provisioned as a strategy, etc. But
I'm happy we learned those lessons sooner rather than later - I'd hate to
scramble to implement Cassandra in a strained production environment (my
sympathies to a certain website who suffered through this).

------
psadauskas
We use TokyoTyrant to store lots (TBs) of tiny (<100 bytes), constantly
updating bits of data. The record overhead on MySQL and PostgreSQL was huge,
and the continual updates were leaving lots of dead tuples that had to be
cleaned, and killed performance.

I use MongoDB for greenfield app development and experiments, so I can quickly
alter "tables" and "attributes", without having to bother with first thinking
about schema, or having to write migrations.
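A toy sketch of what "no schema, no migrations" buys you during greenfield work (plain dicts standing in for a real collection; this is illustrative, not the pymongo API):

```python
# Toy stand-in for a MongoDB collection: documents are plain dicts,
# so growing a new "attribute" needs no ALTER TABLE and no migration.
class Collection:
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(doc)

    def find(self, query):
        # Match documents containing every key/value pair in the query.
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

prototypes = Collection()
prototypes.insert({"name": "widget", "price": 10})
# A later document simply grows a new field:
prototypes.insert({"name": "gadget", "price": 12, "color": "red"})

print(prototypes.find({"color": "red"}))
```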

~~~
rbranson
Had the same experience with Tokyo Cabinet/Tyrant on a project. We had
hundreds of millions of small records to store, and TCH could pack it into
space small enough to fit into disk cache on an XL EC2 instance.

------
benologist
I use it mostly to store player-created levels and leaderboards which allows
developers to efficiently + effortlessly include and filter by custom fields
and data.

I also use it for some smaller tasks like storing filenames/hashes for
distributing content across my 5 servers (doesn't really _need_ to be mongo
for this), and some configuration information for games using me.

I do it via MongoHQ, I did have it running and replicated across 2 of my
servers but I'm just not familiar enough to keep it running and that bit me in
the ass early on. With MongoHQ all I have to worry about is making sure my
indexes are right.

Next up I'll be using it for some non-aggregated analytics stuff, which won't
be as big as Facebook's requirements but will still be in the billions-a-month
range.

~~~
vyrotek
Glad to hear that switching to Mongo worked out for you. I remember talking to
you a while ago when most stuff was running on SqlServer.

For our project we went the Windows Azure route and store our 'events' in
their NoSql Table Storage. I must admit I have been tempted to switch to
MongoHQ due to the lack of query support from Table Storage. Funny enough,
we're also working on a custom leaderboards feature for our customers and the
lack of secondary indexes really makes it difficult.

~~~
benologist
MongoHQ have a pretty awesome product. The only concern I have with them is I
noticed they recently capped databases to 20 gig but that's 3x more than my
leaderboards come to at the moment so that's future-me's problem.

The custom data on the leaderboards and levels is just ridiculously easy, it's
basically just copying object properties/values on the user end straight in to
document properties/values in MongoDB.
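The pass-through looks roughly like this (field names here are hypothetical, just to show developer-supplied custom fields landing straight in the document):

```python
# Sketch of copying developer-supplied custom fields straight through
# into a leaderboard document. All field names are hypothetical.
def make_score_doc(game_id, player, score, custom_fields):
    doc = {"game_id": game_id, "player": player, "score": score}
    doc.update(custom_fields)  # arbitrary fields land in the document as-is
    return doc

doc = make_score_doc("tetris-clone", "abhi", 9001,
                     {"level": "marathon", "handicap": 2})
# Any custom field is then filterable, e.g. in the mongo shell:
# db.scores.find({"level": "marathon"})
print(doc["level"])  # marathon
```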

~~~
benwyrosdick
Future-you should fear not. The 20gb limit is a soft limit and we have many
who have gone past it. It is really there for two reasons ...

1) To give people some reasonable expectations about how much data they should
be putting on the plan given it is a multi-tenant server and they are sharing
resources.

2) To make room for an expanded offering, both a larger high-availability
shared plan, and a dedicated server plan.

------
panarky
I use key-value stores so I can implement the right data model for the right
problem.

The article cites graph databases as a better fit for graph applications than
relational SQL databases. I'm using Redis data structures like sets and sorted
sets that are 20x to 100x faster than relational databases.

But I still use SQL for most CRUD operations where I care about consistency
and volume isn't high enough to require sharding. There's a place for 'NoSQL',
but it's not a panacea.
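To make the sorted-set point concrete, here is a minimal in-memory imitation of the access pattern (real code would use redis-py's zadd/zrevrange against a Redis server):

```python
# Minimal in-memory imitation of a Redis sorted set, showing why the
# data structure fits ranking problems: inserts keep scores, and a
# top-N query is a single range read instead of an ORDER BY scan.
class SortedSet:
    def __init__(self):
        self.scores = {}

    def zadd(self, member, score):
        self.scores[member] = score

    def zrevrange(self, start, stop):
        # Members ranked by score, highest first (stop is inclusive,
        # matching Redis semantics).
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        return ranked[start:stop + 1]

board = SortedSet()
board.zadd("alice", 300)
board.zadd("bob", 450)
board.zadd("carol", 120)
print(board.zrevrange(0, 1))  # top two: ['bob', 'alice']
```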

------
mkramlich
I'm using MongoDB as the default data store for a new geo LBS startup.
Scalability and performance features were the main attraction, but also I just
like the interfaces and ergonomics it presents to me as a programmer. Much
prefer representing stored data structures in JSON, doing queries in RESTful
HTTP-ish friendly form, and doing processing in Python, JavaScript, etc. Oh
and not having to care about data schemas. At least early on in the project
lifecycle, when there's no revenue and you may need to frequently evolve
and/or significantly pivot the codebase. I love the fact that it tries to keep
everything in memory if it can, and that it was designed from the ground up to
be distributed.

MongoDB is also rapidly becoming my "goto" data store when making prototypes
to explore some new software project space. Because it's so amenable to rapid
development and change. It's also replacing memcached in some situations where
all I needed was a dumb memory cache. It can act like a dumb memory cache,
except it has these extra features waiting in the wings I want, which is a
bonus.

I liked Redis in my evaluations and may use it more in the future but it lost
out to MongoDB for the LBS project because it didn't fit the requirements as
well or get enough little "taste wins" with me.

------
rb2k_
I use Redis as a Cache and Queueing System at the same time.

Workers connect and get new jobs, they also write the results to redis.
Everything touches redis and redis is just so damn fast that I don't have to
care about scaling for a while :)

~~~
jfb
We do this as well. Having sets, sorted sets and hash tables built in makes
grody hacks much less common, and Redis is _seriously fast_. We persist the
important info back to a more conventional store (Postgres), but we can keep a
large proportion of our current working set in memory, in Redis. That rocks.

~~~
dasil003
This is the very thing that attracted us to redis. We haven't come close to
outgrowing a single MySQL server yet, but there was a handful of areas where
we felt MySQL lacking. So far we have been able to make major performance
improvements and decrease our database size dramatically, without worrying
about redis persistence at all. That is, we use it purely in a caching
capacity where any data it contains can be reconstructed at any time.

The low-hanging fruit have been counter caches (lock contention with mysql
counters was a problem wayyy too early), transient data (sessions, ip bans,
etc), and picking random items from sets (other RDBMSes may have better ORDER
BY random() properties, but MySQL sucks).

In general I feel like 90% of our data is best served by a relational
database. It's possible to shoehorn the rest in, but redis primitives allow
highly targeted improvements to both performance and elegance.

~~~
jfb
Yeah, not having to implement (say) priority work queues YET AGAIN in a RDBMS
was like checking your regulator at 40m and finding that you have enough air
to get back to the surface, after all.

We do a bunch of calculation with cached data in Redis, and it enables the
naïve pattern of grabbing chunks of shit from the data store without having to
cope with the horrible SQL generated by the ORM.

------
jasonjei
I've started using NoSQL with an ACID-based system by storing the document
revisions (e.g, modified invoices, purchase orders) into a NoSQL database with
the SQL database holding the master data. We store document revisions so if we
were to ever audit modifications of the invoice, we could pull up the very
first version to the second, third, nth revision of the document. Meanwhile,
we still keep master data on SQL database.
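The split looks roughly like this (dicts stand in for the two stores; all names are hypothetical):

```python
# Hedged sketch of the split: the master record lives in SQL, every
# superseded version is appended to a document store for auditing.
sql_master = {}   # stands in for the SQL table: invoice_id -> current row
revisions = {}    # stands in for the NoSQL store: invoice_id -> [old docs]

def save_invoice(invoice_id, new_version):
    old = sql_master.get(invoice_id)
    if old is not None:
        # Push the superseded version into the revision log.
        revisions.setdefault(invoice_id, []).append(old)
    sql_master[invoice_id] = new_version

save_invoice(1021, {"total": 100, "rev": 1})
save_invoice(1021, {"total": 120, "rev": 2})
save_invoice(1021, {"total": 115, "rev": 3})

# An auditor can walk every prior revision, oldest first:
print([r["rev"] for r in revisions[1021]])  # [1, 2]
```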

~~~
blantonl
Sounds like you are doing what we are doing. Testing the waters, evaluating
how NoSQL technologies fit into your architecture, and testing your problems
against NoSQL solutions.

It is refreshing to see the approaches.

~~~
jasonjei
Thanks for the compliments!

I'm still a bit wary of using NoSQL, and we store document revisions (e.g, if
I modified invoice 1021, the latest version would be updated in SQL, and the
last version would be pushed into NoSQL) into SQL just in case (by using a
draft_id column).

We need to see if our software is still deemed auditable by accountants since
they have very strict guidelines on how a document trail should be stored.

------
metamemetics
I am investigating replacing MySQL with MongoDB in my model layer for my next
prototype.

My mind is still thinking in 3NF though. I understand denormalization and
avoiding joins will be useful from a performance standpoint. However, I'm
unsure when to include a foreign key and perform a second query from the
application layer, and when to duplicate/embed all the field data. I'm leaning
towards just making 2 simple sequential key lookups, the 2nd on the retrieved
foreign key, rather than duplicating fields everywhere and keeping track of
massively cascading changes, even though that replaces 1 MySQL join with 2
roundtrips and I usually think in terms of minimizing roundtrips to the
database server.

Wondering if anyone has a heuristic for this or suggested reading?
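One common rule of thumb: embed data that is read together and rarely updated; reference (and do the second lookup) for data that is shared or changes often. A sketch of the two options, with hypothetical posts/authors data:

```python
# Option 1 -- referencing: two sequential key lookups.
users = {"u1": {"name": "ada"}}
posts = {"p1": {"title": "hello", "author_id": "u1"}}

post = posts["p1"]                 # lookup 1
author = users[post["author_id"]]  # lookup 2, on the retrieved foreign key

# Option 2 -- embedding: one lookup, but the author's name is now
# duplicated and must be rewritten in every post if it ever changes.
posts_embedded = {"p1": {"title": "hello", "author": {"name": "ada"}}}

assert author["name"] == posts_embedded["p1"]["author"]["name"]
```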

~~~
rbranson
Please ask yourself why. In reality, MongoDB has almost all of the same
limitations of MySQL and PostgreSQL, but lacks the production proven record.
In addition, MongoDB has very weak durability guarantees on a single server,
poor performance for data that is not in memory, and carries over some common
SQL/ACID scalability pitfalls (use of arbitrary indexes and ad-hoc queries).

Outside of this, you need to switch the question you ask from "what data do I
need to capture?" to "what questions do I need to answer?"

~~~
metamemetics
> _Please ask yourself why._

Looking for a way to hands-free scale very cheaply without vendor lock-in. It
would be nice if I could simply add another machine to the cluster, and not
have to generate the IDs in the application layer, use a hashing algo to
select the correct machine, and have everything stop working when a single
database goes down. Seems like it should be a solved problem by now.
Investigating new tech won't set me back much time, and my MySQL queries
aren't going to disappear if I don't like it. Furthermore looking at large
sites such as Flickr that massively scaled MySQL it seems like they stopped
using its relational features anyway.

~~~
rbranson
It's not as "hands-free" as you'd like to believe. Check out the MongoDB
sharding introduction[1]. There are some pretty big caveats. Very few people
are using auto-sharding at scale in production (bit.ly and BoxedIce are all I
know of).

There are other operational issues with MongoDB. MongoDB can only do a repair
if there is twice the available disk space as the database uses, and the
server must be effectively brought offline to do this. To reclaim unused disk
space, you have to do a, you guessed it, compact/repair. Want to do a backup?
The accepted way to do this is to have a dedicated slave that can be write-
locked for however long it takes to do your backup. They suggest using LVM
snapshots to make this short, but disk performance on volumes with LVM
snapshots is terrible.

I would consider using MongoDB for a setup that would be either non-critical,
completely within memory with bounded growth (which itself sort of
begs the question...), or involve mostly write-once data, such as feeds,
analytics, and comment systems.

[1] <http://www.mongodb.org/display/DOCS/Sharding+Introduction>

~~~
metamemetics
Well my platform is n number of $20 linodes to start. I'm clustering the
python application across them using uwsgi+nginx (all I have to do is add an
IP address in the config to scale), it's going to be a given that I shard the
database across them as well. If you feel I should avoid Mongo would you
recommend Cassandra instead?

Regardless, I think my initial question regarding when to denormalize data
applies to any database including scaled MySQL, but perhaps was a better
question for stackoverflow.

~~~
rbranson
Cassandra has its own hurdles, but I think if we're talking about getting
your mind in the right place, it might be a better answer. Cassandra
definitely has a much more mature scalability implementation that isn't
caveat-ridden like MongoDB is. It's operating at scale at both Twitter and
Facebook.

Cassandra has online compaction, but still requires up to 2x space for
compaction. However, Cassandra does not have to do a full scan of the entire
database to do compaction, and almost never actually uses the 2x space. It's
also much easier to maintain a Cassandra cluster, because each instance shares
equal responsibility, and replication topology is handled for you.

Despite what their fans will say, these are both beta-quality products.

------
voxcogitatio
The confusion about the use case of NoSQL probably stems from the term "NoSQL"
being so vaguely defined. All you can say for sure about one is that it's not
relational. But other than that there's not much in common between (for
example) key-value stores, graph databases and object databases.

Result: The author has to qualify all statements with "only applies to some
NoSQL databases".

~~~
jfb
I would love a relational database with some less horrible language than SQL
to manage it. That'd be NoSQL, no lie, but considerably different than a KV
store.

------
StavrosK
I used MongoDB for my MSc thesis, and I really enjoyed the silent data
corruption feature [1]. That said, I would use MongoDB again if it stops being
horribly unstable.

These days, I use redis for caching and as a celery backend, and I love it to
bits.

[1] not.

~~~
mkramlich
please dish. esp what version of MongoDB you used when you had the corruption
issue.

~~~
StavrosK
Here's part 2, there's a link to part 1:

<http://www.korokithakis.net/node/119>

Also, because everyone is going to skim the post and say "you were using the
32-bit version":

1. Only for some of the corruptions.

2. IT SHOULDN'T CORRUPT DATA ANYWAY!

------
fish2000
SQL gives some people the howling fantods. I think a large part of the
programming-nerd population looks at it and sees it as a kludgy chimera like
JCL, or an inscrutably unfunny INTERCAL variant. Maybe they've been forced to
use a shitty ORM that leaked abstractions on a project of theirs which they
had to clean up; maybe someone slipped them the hot SQL injection, back in the
days of CGI; maybe their parents were tragically trampled to death by an
elephant while a nearby dolphin laughed. Regardless of motivation, you have to
concede, to those whom it riles, that SQL is not the most likeable
language/model/framework/paradigm out there.

Personally, I actually love SQL. You ask me, nothing satisfies like nested
right inner joins. But that's not who I am. I have needs. Sometimes, what I
need is a schemaless eventually-consistent document-oriented persistent data
store, because I am aggregating data from multiple web service APIs whose
field names and structures change around like they were samples on a Girl Talk
record. CouchDB ain't SlouchDB... I can dance to that.

I can tell that some of my buddies are embracing databases from the ranks of
the quote-unquote NoSQL movement because these databases' aesthetics are not
at all like SQL. That's the catch with NoSQL -- it's a really pointless thing
to talk about because it's not a thing; it is an un-thing, a classification of
everything in the contemporary database world that is not SQL. It's the kind
of classification that makes the most sense in the emotional context of how
people feel about SQL.

------
gyardley
Writing and processing a ton of analytics data - all the raw logs sent into
Flurry are stored in HDFS and then processed with map-reduce jobs.

------
rantfoil
At Posterous we use MySQL as a main data store but MongoDB for simple
analytics and Redis for set operations around contacts and subscriptions.

Redis is also great as a backing store for the Rails queuing system Resque.

------
dadro
I'm using Mongodb to store Real Estate MLS data. MLS vendors all have their
own schema for storing property data that can change on a regular basis.
Rather than attempt to map every field we store each property as a document.
We index the important fields (price, beds, baths, etc) but all other
"metadata" is stored as a Dirty Attribute (in an embedded document).

We are still beta testing and have 100K properties and ~1 million images
stored. Query time is faster than our current LAMP site. We will shard based
on MLS Vendor as we add more.

------
mindcrime
This blog entry (
<http://www.engineyard.com/blog/2009/ldap-directories-the-forgotten-nosql/> )
is a little old now, but makes an interesting point - that LDAP was
the original "NoSQL." If you buy that idea, lots of people have been using
NoSQL - and using it for all sorts of things - for quite some time now.

~~~
jfb
But LDAP is an Abomination unto the eyes of the Lord.

------
tmcw
PakistanSurvey.org is using MongoDB for general data storage of tons of survey
responses and aggregation of them.

Similar sites we're working on use CouchDB in the same capacity, because we
can push simple analysis back into the database and represent more complex
data in a natural form.

------
smhinsey
I use a combination of semi-reliable stores, including a NoSql table storage
provider, to implement durable messaging for distributed systems with nodes
located in any number of data centers or clouds.

------
kingnothing
Vitrue is currently using an autosharded mongodb cluster in production for
writing and reading analytics data for our apps. We're still adding it across
the board, but so far it's going great.

------
die_sekte
AFAIK Stylous.com uses vertexdb, a graph database.

------
alecthomas
We're using Redis because it's zero-configuration-required. This makes
deploying on end-user networks much much easier.

------
dgregd
Non-ACID DBs are most useful in free web apps, where users' data is
practically worthless, so who cares about an occasionally lost record?

When people are paying for a service then ACID db is a must.

~~~
rbranson
Electronic banking systems are far from ACID (think ATM operations, check and
credit card transfer processes). Stock exchanges aren't ACID. Amazon.com isn't
ACID. Logistics systems (FedEx, UPS, etc) aren't ACID. In fact, if you look at
the information systems of Fortune 100 corporations, you'll find that almost
every single one of them is non-ACID at the core.

~~~
jhugg
This is not really very accurate.

At their core, each of these systems use ACID databases (or very nearly if you
nitpick about isolation levels in Oracle). Between databases and between
companies, they've developed "eventually correct" schemes to synchronize
information. The lease patterns that many of these systems rely on require
atomic operations at their core and offer stronger guarantees than the basic
eventually consistent systems popularized by models like Amazon's Dynamo.

It's not just that the account values have to agree between systems at the end
of the day; they have to actually be correct at the end of the day.

I know that consistency models vary between NoSQL systems (and even within a
single NoSQL system). There's some great technology out there and plenty of
problems to solve. There are certainly plenty of use cases for NoSQL systems
within banks and stock exchanges.

But the "banks are eventually consistent" line of reasoning needs to die.

~~~
rbranson
EDIT: For full disclosure purposes, the parent post is from John Hugg, a
software engineer at VoltDB, which is a high-scale data store that competes
with many "NoSQL" databases. I am not claiming his point of view is invalid,
just that it comes from a certain perspective, and should be viewed in this
light.

Amazon's Dynamo itself is built on BerkeleyDB, which is ACID compliant. That
doesn't mean Dynamo is an ACID system. You have to view the system as a whole,
not just the component parts. The information systems I refer to in large
banks, stock exchanges, and logistics are often composed of thousands of
instances of ACID-compliant databases, but as a whole operate with eventual
consistency guarantees. EC is kind of a misnomer for Dynamo anyways, because
it's really TUNABLE consistency. Dynamo can operate in a fully consistent
mode, but you're going to sacrifice availability. CAP theorem doesn't care if
you're a bank or a stock exchange or you have a trillion dollars. It still
applies.

~~~
jhugg
I was referring to the Dynamo model for EC, not a particular software
implementation. That model has next-to-zero traction in systems that handle
non-trivial sums of money, and for good reason.

That's not to say it's not great for lots of things. Amazon uses this model
for all kinds of stuff, but as soon as you go to checkout your order, you get
kicked back into ACID-ville.

~~~
rbranson
Dynamo's consistency ranges from fully consistent to loosely consistent and is
tunable on a per-application and per-call basis. This means a write can be
reconciled against 100% of the replicas before it is considered finished. How
is this any different than the consistency guarantees an ACID compliant system
provides?

You make a claim that Amazon uses ACID semantics for the checkout process. The
Dynamo paper claims that in order to meet business goals, the complete
Amazon.com order process must be highly-available and partition tolerant. A
system that is ACID compliant must sacrifice availability or partition
tolerance, but this isn't the case for Amazon's purchase process. Amazon
simply strengthens consistency and durability guarantees in the case of
checkouts with a quorum write. It would be exceedingly rare for a partition or
disaster to knock out communication with more than one datacenter, so this
works very well in practice.
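The arithmetic behind that tunability is simple: with N replicas, writes acknowledged by W of them, and reads consulting R of them, every read overlaps at least one replica holding the latest write whenever R + W > N. A sketch:

```python
# The quorum condition behind "tunable" consistency: a read set of
# size R must intersect a write set of size W among N replicas, which
# holds exactly when R + W > N.
def overlap_guaranteed(n, r, w):
    return r + w > n

# N=3 with quorum reads and writes (Dynamo-style defaults):
print(overlap_guaranteed(3, 2, 2))  # True: reads see the latest write
# N=3 with single-replica reads and writes: stale reads are possible.
print(overlap_guaranteed(3, 1, 1))  # False
```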

~~~
jhugg
Quorum writes to Dynamo nodes allow you to make atomic and durable updates to
single keys and the data associated with them. ACID transactions allow you to
mutate data associated with multiple keys atomically. This is not the same
thing at all. Common operations like debit from one account and credit another
or sum a set of values become difficult.

Now atomic updates to single keys can actually be used as the building blocks
for transactional functionality (using the lease pattern and/or compensating
operations). If you need a transaction here or there, this might be a very
workable solution. If you need lots of transactions, then you end up using
Dynamo systems in unnatural ways; many of their performance and availability
advantages are wasted in this configuration.

So yes, you could build a bank given quorum writes, but the point is that it
would probably be a poor engineering choice.
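A toy illustration of why: building a two-key transfer out of single-key compare-and-set operations forces you to handle partial failure yourself, with compensating operations, where ACID gives it to you for free. This is a deliberately simplified sketch, not a real client API:

```python
# Toy two-key transfer built from a single-key compare-and-set
# primitive. Note the hand-rolled compensation on the failure path --
# exactly the machinery an ACID transaction would provide.
store = {"acct_a": 100, "acct_b": 50}

def compare_and_set(key, expected, new):
    # Stands in for a single-key atomic primitive.
    if store[key] != expected:
        return False
    store[key] = new
    return True

def transfer(src, dst, amount):
    a = store[src]
    if a < amount or not compare_and_set(src, a, a - amount):
        return False  # insufficient funds, or lost a race; caller retries
    b = store[dst]
    if not compare_and_set(dst, b, b + amount):
        # Credit failed: compensate by refunding the debit.
        store[src] += amount  # simplified; a real system logs and retries
        return False
    return True

assert transfer("acct_a", "acct_b", 30)
print(store)  # {'acct_a': 70, 'acct_b': 80}
```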

~~~
rbranson
I wasn't saying banks should use Dynamo, but that Dynamo-style consistency is
appropriate for conducting e-commerce transactions, and, as I said originally,
that most scaled industries don't depend on ACID semantics in their core
datastores.

Bank accounts are eventually consistent logs, which don't look like Dynamo,
but also look nothing like an RDBMS table. It's well known and understood that
almost all banking systems are based around mainframe-era batch-process
systems that are eventually consistent in nature. It's awfully ironic that the
"hello world" used to demonstrate transactions in the RDBMS world is a debit
and credit of bank accounts, a situation that is vanishingly rare in the real
world.

Also, please address my original rebuttal to your claim that Amazon uses an
ACID database for their checkout process.

