Ask HN: In what areas are NoSQL Databases beneficial over Relational Databases?
74 points by rochak on Sept 19, 2016 | 74 comments
I have been working with MySQL and SQL Server for quite a long time and thought of trying MongoDB. I realized that I won't have the ability to join a lot of tables and such. In what areas would you suggest that NoSQL databases are beneficial over traditional relational databases?



In my opinion, the major reason to leave behind ACID databases is if you have reached the point where they can't scale up any more. If you have reached this point, you're probably a company with 20+ developers. There might also be very specific use cases where specialised databases are useful, such as a search index.

ACID gives you a lot of nice guarantees and it's silly to give it up if you have the choice.

As for MongoDB in particular, I would never use it for anything. It doesn't seem to do anything particularly well. There are better specialist 'NoSQL' databases that have actual benefits to fit your problem.


Are you bigger than Wikipedia? Wikipedia's technology is quite conventional, yet they're the 6th busiest site by traffic. It's MariaDB with memcache, basically. If you're a mostly-read operation, you can have lots of slave databases and caches as you scale. By the time you need exotic database technology for a web site, you probably have hundreds or thousands of employees.

Most of the real use cases for NoSQL involve systems where writes dominate. These are data collection systems where, later, someone searches the data. Surveillance, geophysical data, that sort of thing.


Read-your-writes consistency is not preserved with read slaves, which can show up in lots of ways: create a document, get redirected to the URL for the document, and the document is missing.


That's solvable in numerous simpler ways than exotic DBs. For example, a project I worked on for a few years had a read-your-writes issue. Our solution was to keep track of the timestamp of the last write each user made in an encrypted cookie. Then we used that timestamp to control which slave (or possibly even the master) was used for queries. The result was that a user would always see their own writes, even if other users lagged a couple of seconds behind.
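The routing logic described above can be sketched in a few lines. This is a minimal, illustrative version (the names and the replica bookkeeping are assumptions, not the original project's code): keep the user's last write timestamp, and only use a replica that has already replicated past it.

```python
# Sketch of read-your-writes routing: send a user's queries to a replica
# only if that replica has already applied the user's last write; otherwise
# fall back to the master. All names here are illustrative.

def choose_backend(last_write_ts, replicas, primary):
    """Pick a backend that can serve the user's own writes.

    last_write_ts -- timestamp of the user's most recent write (e.g. read
                     from an encrypted cookie), or None if they haven't written
    replicas      -- list of (name, applied_ts) pairs: how far each replica
                     has replicated
    primary       -- name of the master backend
    """
    if last_write_ts is None:
        # No recent write: any replica will do.
        return replicas[0][0] if replicas else primary
    # Prefer a replica that has already applied the user's write.
    for name, applied_ts in replicas:
        if applied_ts >= last_write_ts:
            return name
    # Every replica still lags behind this user's write: use the master.
    return primary
```

Other users may see slightly stale data, but each user always sees their own writes, which is usually the consistency people actually notice.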


Hacker News has that problem. Post a comment, refresh the page, and the comment may not be up yet. If you have cache front ends, it's expensive to avoid. You could probably do something with Javascript where the page checks if the cache has changed after N seconds, where N is the cache expiration time, and reloads the page if necessary. Then you're eventually consistent.


An RDBMS might be the right database for the wiki data model. I read they also use Cassandra (for serving the API) and a graph DB (for Wikidata).


I thought about trying out MongoDB. Which NoSQL are better in your opinion?


https://www.rethinkdb.com/ looks very promising and also introduces some very interesting concepts and ideas.


RethinkDB got me interested some time ago, so I took a look at how it differs from SQL and found this page.

https://www.rethinkdb.com/docs/sql-to-reql/javascript/

Frankly, I cannot understand writing queries that verbose and complex when SQL looks clear and concise.

If I'm missing something about how good RethinkDB is, I would gladly like to know.

One thing SQL is notorious for is that you don't know where a query could fail until you send the whole thing and receive the error, unlike these NoSQL APIs where operations can be built up progressively.

While an ORM can sort of mimic that behavior, I don't like that it hides SQL, which makes crappy SQL a lot easier to write and harder to debug.


It depends on the use case you want it for. Unless you're at massive scale, I don't think non-ACID databases are ever really appropriate as a general purpose store.

There are niches though. Graph databases, search indexes, databases for logging, time series databases, mapreduce, etc.


What is a good database for massive-scale logging and task queuing? I am thinking of experimenting with RethinkDB or Redis, and would love to hear your thoughts.


Do yourself a favour and use RethinkDB; it's so much better than MongoDB. Just try it.


What are your thoughts on using RethinkDB for massive-scale logging and task queuing? Trying to select between that and Redis.


Massive-scale (searchable?) logging is done with Elasticsearch. Redis is OK for task queueing, but it's in-memory. The best option in that case (most features) would probably be RabbitMQ (the usual broker behind Python's Celery), though Redis should be enough for your use case.

Hell, you can even use your favorite RDBMS (PostgreSQL) to queue tasks.
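A task queue on a plain relational database really is just a table plus an atomic claim. Here is a minimal sketch using SQLite so it runs anywhere (table and payload names are made up); in PostgreSQL you would claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` so concurrent workers don't block each other.

```python
# Minimal "task queue in an RDBMS" sketch, SQLite standing in for Postgres.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, payload TEXT, claimed INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO tasks (payload) VALUES (?)",
                 [("send-email",), ("resize-image",)])

def claim_next_task(conn):
    """Atomically claim the oldest unclaimed task; return its payload or None."""
    with conn:  # one transaction per claim
        row = conn.execute(
            "SELECT id, payload FROM tasks WHERE claimed = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        task_id, payload = row
        # The claimed = 0 guard keeps the claim atomic even if another
        # worker grabbed this row between our SELECT and UPDATE.
        cur = conn.execute(
            "UPDATE tasks SET claimed = 1 WHERE id = ? AND claimed = 0", (task_id,)
        )
        return payload if cur.rowcount == 1 else None
```

For modest volumes this is perfectly serviceable, and you get transactions and durability for free.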


First of all, this question is similar to asking in which cases non-cars are beneficial over cars. Cars cover many use cases, but sometimes bicycles perform better in heavily trafficked cities, planes carry you faster to the other side of the world, but you need a cargo ship to move tons of containers there. That list goes on and on, up to less common activities like jumping fences (horses) or very rare ones like going to the moon.

Now, there are many use cases for non relational data stores. There are already many answers in the comments but first you have to ask yourself what you need the data store for. In the cars analogy it's where you need to go, how fast, carrying what, etc. Do you need a glorified hash table? Redis is good at that. Do you need very fast queries? You might try Cassandra but be very careful at defining in advance all the queries you're going to run: if you need sorted results you have to plan the columns for them. See http://www.planetcassandra.org/blog/we-shall-have-order/ and remember that NoSQL doesn't always mean schemaless.

Sometimes the choice is about the tradeoffs between the ease of use, installation and planning. MongoDB maybe doesn't particularly excel at anything, but you install it with an apt-get (check in advance which version you get) and adding replication is easy. However, you'd better start planning at least a partial schema for your documents quickly or you'll get a ton of litter in your database. On the cons side, its native JS-based query language is awfully verbose and complex compared to almost anything else, ORMs and especially SQL. Luckily we don't need the console much.

To recap: ask yourself which data you have to manage and why, then google for a database that optimizes your use case and check it against the posts at https://aphyr.com/tags/jepsen which does a super-excellent job at finding out how dbs fall apart in extreme but realistic cases. Often you'll find that a relational database is good enough. Other times you'll have to choose between relational and one or two categories of NoSQL. There are very few mainstream RDBMs but there are zillions of NoSQLs so the choice might not be obvious. Final car analogy: car + tent or camper van?


When clients ask this question, I always suggest starting with a good SQL solution and then figure out why it won't work. So if you're not hitting any major problems with MySQL or SQL Server, then NoSQL solutions will just be incredibly labor intensive for, from your perspective, a loss in functionality. Each NoSQL database is different and designed to approach problems from SQL databases differently. For Mongo, this is having a very flexible denormalized schema with very fast reads. However, companies such as Pinterest and Facebook have found they can accomplish this with MySQL or Postgres through different data modeling. I think the reasons for NoSQL become much clearer when trying to build applications that do not keep a normalized data model.


This is the WRONG QUESTION.

The real question is:

In what areas are NoSQL databases (built recently, in the internet era) beneficial over traditional, (barely-)relational RDBMSes (built decades ago)?

----

Scalability, easily mutated schemas and all that are artifacts of our times.

MySQL, SQL Server, etc. are barely relational databases, coupled with an almost-decent way to interface with them (SQL).

For example, in them you can't store a relation inside a field. Schema manipulation is hard. A lot of potential programming power is not possible without convoluted recent additions to the SQL language, or hacking together strings.

And most of them have been made for workloads and scenarios that are at odds with what internet-scale companies need.

Unfortunately, the solution was to do "NoSQL":

"Any sufficiently complicated database management system contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of a good RDBMS."

---

Check out the relational model. It's far simpler than people give it credit for, even without getting too crazy like some purists want.

In fact, I don't see any reason why the relational model can't be used as flexibly as JSON is applied in the backend.

---

Despite all the above and more, a traditional RDBMS is solid enough that most people don't need to get crazy and move to (full) NoSQL. However, it could be good to use NoSQL in specific cases (like caches, search engines, etc.).

But I have seen people using NoSQL stores and trying to do the kind of work an RDBMS has already solved, more easily and faster.


The benefit: typically near infinite scalability with little loss of performance. It's super fast even when there's a lot of it.

The cost: transactions, relations. These are actually incredibly valuable things to have.

If you're storing billions and trillions of records that don't need relations, NoSQL is great. You'll keep your fast lookups and saves no matter how big things get.

This can matter a lot more if you're a startup with the goal of having trillions+ records some day. Your SQL Server will eventually not be able to scale any higher, and then what? Moving from relational to non-relational is so painful.


I would say code for the scale you'll have in the near future, not for the scale you want to have in 5 years.

Transitioning databases is painful, but dealing with the extra complexity of non-ACID stores can inhibit your growth. If you reach the scale where you need to change, you'll probably be able to afford more developers to help you.


Complete agreement. Like all of software development, it's about tradeoffs.


> infinite scalability with little loss of performance

I call shenanigans on that - Google's experience was that they tried it and then got bogged down as every developer had to roll their own mechanism for waiting for syncs - hence they developed F1 instead...


Near infinite scalability with little loss of performance? Definitely not with MongoDB. Can you show that with some numbers? I have seen quite the contrary: insertion speed slightly decreases with database size. Aggregate query speed decreases up to the point the database becomes unusable when you are on the order of small TB. Definitely a problem relational databases don't suffer (at least this significantly and at least at this scale).


I recommend you check out CockroachDB.

Their approach is to use a key-value DB to build a linearly scaling, PostgreSQL-like database.

Personally, I'm just waiting for that to become a bit more stable, but it's probably the most promising in this regard and already very usable.


"The cost: transactions, relations. These are actually incredibly valuable things to have"

This is quite arguable. Before the Computer Era, all businesses, including the whole financial sector did just fine without ACID. Still, ACID is used only at a local scope. I can often see ACID requirement used rather as an excuse for poor data modeling than a real need.


> This is quite arguable. Before the Computer Era, all businesses, including the whole financial sector did just fine without ACID.

Can you elaborate a bit? I believe that in the pre-ACID world, business processes were much slower, not online processes like today. When you're only really changing data once a day, backups and manual intervention are acceptable options.

> Still, ACID is used only at a local scope.

What do you mean by "local scope"?


ACID is not a way to deal with faster data change. Actually, it is the total opposite. The more updates you have and the more online (distributed) you are, the less you want ACID. ACID is extremely latency-sensitive and failure-sensitive. And latency is a problem caused by just plain old physics; this isn't going to get better unless we find a way to make light travel faster than the speed of light ;) And the more distributed and larger your system is, the more frequent failures (like network partitions or servers going down) will be.

As for "local scope", I meant "at a single business entity". If you look at banking or online payment systems as a whole, they are not ACID; they are eventually consistent. They are based on two basic principles:

1. write everything down at least two times, so you never lose data

2. updates are incremental, so you never overwrite data
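The two principles above can be sketched as an append-only ledger: every transfer is written down twice (a debit and a credit), and balances are always derived, never overwritten. This is a pure-Python illustration with made-up names, not any real payment system's design.

```python
# Append-only ledger sketch: principle 1, write everything down twice;
# principle 2, updates are incremental deltas, so nothing is ever overwritten.

ledger = []  # append-only: entries are never updated or deleted

def transfer(src, dst, amount):
    # The same event recorded on both sides, so no data is lost.
    ledger.append({"account": src, "delta": -amount})
    ledger.append({"account": dst, "delta": +amount})

def balance(account):
    # Balances are derived by summing deltas, not stored and mutated.
    return sum(e["delta"] for e in ledger if e["account"] == account)

transfer("alice", "bob", 30)
transfer("bob", "carol", 10)
```

Because entries are only ever appended, replicas can converge by exchanging missing entries, which is exactly the eventual-consistency property described above.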


ACID sure is a way to deal with data change faster than once a day. Once you stop passing paper slips and order forms through your company, you can get accurate, up to date information at a human timescale, i.e. on the order of seconds.

Of course when you want stuff to happen at a faster-than-human timescale - or when you just want lots of stuff to happen - an ACID model might not be the right fit.

> As for "local scope" I meant "at a single business entity".

Understood.


My whole point is that ACID is not the only kind of consistency, and it is not required to get accurate, up-to-date information at a human timescale. It is just a nice and convenient programming model, but it comes at a cost in availability, scalability, latency and throughput. You can design systems with no ACID data store, and if you do it properly they can be just as accurate and up-to-date when everything works properly, but can also degrade nicely when some components fail.

http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on...

http://www.enterpriseintegrationpatterns.com/docs/IEEE_Softw...


> Before the Computer Era, all businesses, including the whole financial sector did just fine without ACID

The standard of the time looks quite ACID-y to me:

https://en.wikipedia.org/wiki/Double-entry_bookkeeping_syste...


Maybe it is similar, but you can implement it in probably any NoSQL store you wish. And it is not just a standard of the time; it is in wide use today, even if the underlying database system is ACID.


NoSQL is for heterogeneous data that is read in a homogeneous manner, whilst relational excels at homogeneous data read in a heterogeneous manner. If your schema is constantly changing, relational is painful, conversely writing reports against a NoSQL datastore hurts too.

If you come from an RDBMS, you're probably going to dislike MongoDB. I like it for prototyping/MVP but once the schema settles down it's time to move.

However, NoSQL does not start and end at MongoDB, no way. There are lots of different flavours, from huuuuuuuuge key/value stores, to timeseries, to write-one-read-many document stores, to remote syncing mobile app datastores and probably loads of other types that I haven't used yet.

So figure out why you want to use NoSQL (CV building can be a reason too) and play around with some toy installations.


I'm still a newbie, just a couple of years into my career, so I have a hard time articulating this. The rule of thumb I've worked out so far is that document stores are great for records that are useful in isolation but are possibly heterogeneous in nature (of course, there's nothing stopping you from storing foreign keys and doing "joins", or just storing data with a strict schema). Logs and configuration are both simple use cases.

One example from my work is ingesting usage information from a variety of products, like a system that takes in information about all your utilities: power, water, gas, and so on. Each product will have its own attributes that are important to describing it. In a relational world, you might: add columns to your table as new attributes need to be described; come up with some kind of composite attribute system where you try to encode multiple pieces of information in one field; go for a star schema so every product gets its own table; or pivot so that each "record" is represented by multiple rows and the attribute's identifier is a field. Each of those approaches has its ups and downs, but they're all present due to hard schemas. In a NoSQL world, a single table can have wildly varying records, each of which is useful on its own without needing to spread metadata into other tables.
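The utilities example above can be made concrete with a pure-Python stand-in for a document collection (the data and field names are invented): every record carries its own product-specific attributes, yet the shared fields still query uniformly.

```python
# One "collection" holding heterogeneous records; no NULL-padded columns,
# no per-product tables. Data is illustrative.

readings = [
    {"product": "power", "kwh": 42.5, "meter_id": "P-100"},
    {"product": "water", "gallons": 812, "backflow_tested": True},
    {"product": "gas",   "therms": 12.3, "pressure_psi": 0.25},
]

# Shared fields can still be queried uniformly across all records...
products = [r["product"] for r in readings]

# ...while each record keeps its product-specific attributes intact.
water = next(r for r in readings if r["product"] == "water")
```

Each record is useful on its own, which is exactly the "useful in isolation" criterion from the rule of thumb above.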


I find them useful in many scenarios:

- Bootstrap projects with open or changing requirements. A relational model is usually harder to maintain.

- Storing and querying objects with complex structures, such as those received from APIs.

- Having a single programming language across your entire application stack (ie MEAN). Sometimes comes up as a "should" or even a "must" requirement.

- Fast and simple object storage (ie sessions).


> - Storing and querying objects with complex structures, such as those received from APIs.

> - Fast and simple object storage (ie sessions).

Those would also be fulfilled by using a jsonb column in Postgres, right?


Shameless plug: my personal project, BedquiltDB (https://bedquiltdb.github.io) provides a mongo-alike nosql api on top of PostgreSQL jsonb functionality. Why choose when you can have both? ;)


For storage and retrieval I think it can probably be used pretty much the same way.

For querying, my guess is that frameworks such as MongoDB's Aggregation (https://docs.mongodb.com/manual/aggregation) are more versatile.


Well, Postgres can do everything that site showed, too.

If you have a table with a column data in format json or jsonb, you can do a

    SELECT data->>'cust_id' AS cust_id, SUM((data->>'amount')::numeric)
    FROM your_table
    WHERE data->>'status' = 'A'
    GROUP BY data->>'cust_id';
just as well in postgres. (The ->> operator takes a quoted key and returns text, hence the cast for SUM.)

Or, for the third example,

    SELECT DISTINCT data->>'cust_id' FROM your_table;
(See https://www.postgresql.org/docs/9.5/static/functions-json.ht... )


>Those would also be fulfilled by using a jsonb column in Postgres, right?

Storing JSON is now natively supported in MySQL 5.7


Right, the point being that there is no need to choose between key-value and relational DB - since Postgres et al do it all. Which makes the so-called "noSQL" approach completely obsolete.


It’s less about storing JSON, more about querying it.

Look at my answer to the sibling comment for an example.


The two things that relational databases typically don't do well: 1. High availability and no downtime, ever. 2. Unlimited scalability of all operations, including writes.

Honestly, many (most?) NoSQL databases also don't really do them well. If you need these two properties, the only valid choice is really a master-less, shared-nothing database system. E.g. Apache Cassandra.


The only time you should use NoSQL is when you have no other choice. For example, if your database is too big to fit on one server, or you need a massive number of writes per second. Otherwise SQL will give you better performance, much more security, and better data integrity.


And much, much, much easier querying (if you know SQL). For me, the biggest disadvantage of non-sql databases is the difficulty of doing even the simplest aggregates!


I would suggest trying both SQL and NoSQL for some hobby projects. From there you should be able to experience the difference between these DBs. In a fairly complex application, you will see SQL, NoSQL and Redis all existing for different purposes.


For a personal project I used MongoDB over MySQL for somewhat faster prototyping. I was collecting a lot of data from the X display server during runtime and storing it into a database for querying later. Since there were a lot of different types of messages and I wasn't sure about which parts I wanted to keep, rather than building tables for each of them I converted them to JSON and piped them into a MongoDB collection.

I know this is not a good example compared to a production issue; however, I think reducing the amount of time needed to get started with a project is very useful over something that needs a lot of configuring before it can be used.


NoSQL is all about missing useful features (such as integrity, transactions, query flexibility), that you unfortunately have to drop if you want to be able to scale in certain ways. Thus, NoSQL dbs are practically worthless until you get to a point where your SQL database won't work any more. At this point, you need to evaluate exactly what is it that your SQL db cannot handle, and switch to a different product accordingly. For example, you would switch to Aerospike if you need to scale writes.


Different NoSQL platforms solve different problems. Graph databases can build and query relationships between entities with significant speed. Mongo has been mentioned a few times because it's easy to iterate development with a schema-less database, but it's also fantastic when dealing with class based structures with sparse attribute populations (e.g. a CMDB). Cassandra's notion of an always available database is a key foundational element in high-scale devops environments. Scylla offers incredibly fast transactions (they advertise 1M transactions per second per server). There are XML databases that store and query documents in ways that are easy for developers to translate (i.e. xpath). There are databases like Axibase that are built for time series data. BayesDB is easy to query for statistical inference.

It's very easy to fall into the trap of "everything is relational" or "I could do that with Postgres/Oracle/etc". There are a lot of problems that have good RDBMS centric solutions, but you don't have to look too far to find end users or developers who are unhappy with the RDBMS solutions that they work with on a daily basis.


With an RDBMS you have to design the schema first; with NoSQL you don't need to. This is nice when you're dealing with very diverse data (or data you don't even know what it'll look like) or are prototyping. One thing I like about this: you can store your data as it is, you don't have to fit it into your model / world view. You can make sense of it later.

At some point you'll probably have to deal with schema anyway. If this comes at a later stage, you'll have to deal with a lot of diversity and inconsistency on potentially many points. This can become much more painful and time consuming than having dealt with it from the beginning.

If you're thinking about NoSQL as something like a simpler, modern SQL alternative, you're probably having a faster start and a lot of problems in the long run.

Scalability is definitely something NoSQL databases have going for them.


> With RDBMS you have to design the schema first, with NoSQL you don't need to.

Not necessarily, if you have a nice framework. Years ago, I was working on an older Perl application with a tiny self-built ORM (a single class of about 1000 LOC). I gave it pluggable storage backend support and added a prototyping backend where the object is stored into a simple SQL-backed key-value store (a table with columns "id", "type", "data"; where "data" contains a JSON document). So when building something with a new data type, you could just start with the prototyping storage, and once your feature was working, switch to table-backed storage without many code changes.
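The prototyping backend described above is small enough to sketch. This is a guess at the idea in Python with SQLite (the original was Perl, and all names here are invented): a single (id, type, data) table used as a key-value store for JSON documents.

```python
# SQL-backed key-value prototyping store: one table holds every object type,
# serialized as JSON, so no per-type schema is needed while prototyping.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (id TEXT, type TEXT, data TEXT, PRIMARY KEY (id, type))"
)

def save(obj_type, obj_id, obj):
    # Store the whole object as a JSON document.
    conn.execute(
        "INSERT OR REPLACE INTO objects (id, type, data) VALUES (?, ?, ?)",
        (obj_id, obj_type, json.dumps(obj)),
    )

def load(obj_type, obj_id):
    row = conn.execute(
        "SELECT data FROM objects WHERE id = ? AND type = ?", (obj_id, obj_type)
    ).fetchone()
    return json.loads(row[0]) if row else None

save("invoice", "42", {"total": 99.5, "lines": [{"sku": "A1", "qty": 2}]})
```

Once the object's shape settles down, migrating to a real table with columns is a mechanical change behind the same save/load interface.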


Not all NoSQL databases are schema-less document databases like Mongo. Some NoSQL systems have a static schema; e.g. Apache Cassandra has tables with static column metadata, just as an RDBMS does.


At work we use NoSQL for stats collection and aggregations. Our reason for choosing MongoDB is that we can have a rich document structure with multiple levels of nested documents. And with the rich update queries that MongoDB provides, we are able to update all of those nested documents in a single call, without having to update multiple documents.

That's one of the places where it made sense to use a NoSQL solution. Our documents had multiple levels of nesting, but no joins. We also use MySQL for our relational data, so in the end you have to pick and choose. All of the research we did pointed to NoSQL being a great fit for storing nested documents for stats collection. We haven't regretted following that advice yet.
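The "update nested documents in a single call" feature above refers to Mongo's dot-notation update operators; with pymongo it might look like `stats.update_one({"_id": day}, {"$inc": {"products.widget.views": 1}})` (collection and field names are illustrative). Below is a tiny pure-Python stand-in for that `$inc` semantics, so the idea is runnable without a Mongo server.

```python
# Simulate MongoDB's dot-notation $inc: increment a counter deep inside a
# nested document, creating intermediate levels as needed.

def inc(doc, dotted_path, amount):
    """Increment the counter at a dot-notation path like 'a.b.c'."""
    *parents, leaf = dotted_path.split(".")
    node = doc
    for key in parents:
        node = node.setdefault(key, {})  # create missing nesting levels
    node[leaf] = node.get(leaf, 0) + amount

stats = {}
inc(stats, "products.widget.views", 1)
inc(stats, "products.widget.views", 1)
inc(stats, "products.gadget.clicks", 5)
```

In Mongo the whole nested update is applied atomically on the server in one call, which is what makes it attractive for stats aggregation.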


I use MongoDB for periphery data to my applications, i.e. to aggregate data such as logs and events. It's also great to store things like reports in it due to the flexible structure. However, for my actual application database I always use SQL (MySQL).


Depends; there are many types of NoSQL databases that all have different focuses.

SQL started getting big because it was the one-size-fits-all approach that worked 90% of the time. In fact, it got so big that universities started to teach only relational databases when they talked about databases in general.

But there are many problems, where SQL is not suited.

If you've got hundreds of terabytes of data you need to analyse, you're probably better off with a Hadoop cluster.

If you need inter process communication, Memcached or Redis is worth considering.

If easy DB sync is your main goal, CouchDB/PouchDB is probably a good choice.

And so on.

But these three DBs alone are so different that they can't be compared the way MySQL and PostgreSQL can.


NoSQL is amazing when you have non-relational data. Think of it as a giant, distributed hash table. Hash tables are awesome and wicked fast. But toss in relations and now you've got some interesting problems. An RDBMS solves that for you, but scaling is now harder. At least in my opinion anyway. You can still use NoSQL in these situations, but you're typically duplicating data or indexing it yourself so you can refer back to it.

It's all about trade offs. If you're small just pick the best one for the job at hand. When you need scale you can usually hack either into scaling well enough.


There are some specific scenarios where NoSQL solutions simply work significantly better. Probably the premier example is time series data. You can use SQL, but it's a pain. There's a reason why most monitoring systems use some sort of NoSQL database. A specific business domain example is multi-level marketing: updates in SQL can take hours, as one has to go through the entire tree to make any changes. Graph databases are a much more elegant solution to that problem domain.


The relational model is useful for being well understood. Part of the reason it is well understood is that it is founded upon relational algebra; another part is that it became available in the wild at roughly the same time as the personal computer.

Search versus query [for some definitions of 'search' and 'query'] seems like a case where business logic suggests an advantage of one approach versus the other.

Good luck.


You may be interested in this, it's a Jun 2016 paper on the current state of "NewSQL" and describes most of the dbms out there including NoSQL. http://15721.courses.cs.cmu.edu/spring2016/papers/pavlo-news...


typically the advantages of nosql are along the lines of schema-less design and speed of writes; reads should typically be faster as well.

speed of writes: there are (typically) no constraints that need to be checked with nosql; it's like writing a value into a hash map. with sql there needs to be a check for multiple constraints: uniqueness, null, datatype, etc. if there's a relation then it needs to check those constraints as well.

the whole thing about rdb's is that they're trying to normalize data where possible: you deal with ids, and the fields of a table can change without you needing to propagate those changes to other places. in nosql, if you have relational data then you have to manually make sure all these changes are propagated, which is pretty annoying if you ask me and quite error prone; you end up writing logic an rdb already optimizes for you. take "cascaded deletes", too: this is manual in nosql.

the best usage of nosql is if you have stagnant data and you need to write it quickly, take a log file's contents, for example, or if your data consistency requirement is not that critical (say you implement likes for a web post system; if you're missing 300 here or there, it's not a huge deal). other cases are things like counters: if you have a game where you're tracking scores, things change rapidly, but there's really not a lot of relation to other data. redis is a great example of this type of nosql storage, with built-in incrementing counters.

most of the time when someone brings up nosql for an application, my initial reaction is that it's a premature optimization, for a lot of data, there are tight relationships and rdb is great for that. but i tend to see that you need both in many cases.


In the early days of companies, MongoDB has helped me tremendously with development speed because it is schemaless. Since Postgres 9.5 I've been trying to replicate that speed with Postgres binary JSON, but most access libraries are still SQL-oriented and I haven't achieved the speed of MongoDB with Scala/Rogue yet.


I must admit that the term 'schema-less database' sends shivers up my spine. How do you manage that, especially in a large development team? My biggest fear is that someone will make a change to the structure in an ad-hoc fashion which may break something else I am working on.

At the end of the day, can't most things in the real universe be broken down into a schema? Be it DNA mapping, geological data, customer invoices, product listings, etc., can't everything eventually be contained in a discrete, defined schema?

I've dabbled on and off for several years but have yet to come across one situation where NoSQL would be a better solution than SQL. As others have pointed out, as soon as you start relating tables together, NoSQL starts working against you.

I acknowledge that I have 30+ years with SQL, so obviously my viewpoint is biased by what I know. Could the hesitation of most new developers to adopt SQL point to them finding it too intimidating and structured?


> I've dabbled on and off for several years but have yet to come across one situation where NoSQL would be a better solution than SQL.

Full-text search. Just compare the speed of ElasticSearch against "SELECT * FROM logs WHERE message LIKE '%foo%';".

Also other niches: time-series databases, graph databases, etc.


Actually, I've done tests with an indexed Postgres database several gigabytes in size, with tsvector for stemming and co., vs. Elasticsearch.

At least until you reach several hundred gigabytes, postgres uses less resources for the same search speed.

This video shows searching through 3 GB of IRC logs, filtering, ranking them based on similarity to the query, creation time, and likelihood that it's the type of message the user wants, highlighting the part of the message with the matching words, and displaying all that:

https://dl.kuschku.de/videos/2016-09-16_04-03-36.mp4

Postgres is, for smaller amounts of data, more than fast enough.


There are much more powerful ways to do full text search in SQL databases than LIKE. A fair comparison would be something like a PostgreSQL tsvector with a GIN index. Probably not as good as ElasticSearch, but sufficient for simpler cases.
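To make the comparison concrete without a running Postgres, here is a runnable stand-in using SQLite's FTS5 module in the same role as a tsvector + GIN index (in Postgres itself the query would be roughly `WHERE to_tsvector('english', message) @@ to_tsquery('disk')` over a GIN index; the log lines below are made up).

```python
# Index-backed full-text search instead of a sequential LIKE '%disk%' scan.
# SQLite FTS5 stands in here for Postgres tsvector/GIN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE logs USING fts5(message)")
conn.executemany(
    "INSERT INTO logs (message) VALUES (?)",
    [("disk failure on node 3",), ("user login ok",), ("disk almost full",)],
)

# MATCH uses the full-text index; rank orders results by relevance.
rows = conn.execute(
    "SELECT message FROM logs WHERE logs MATCH 'disk' ORDER BY rank"
).fetchall()
```

Unlike LIKE with a leading wildcard, which forces a full table scan, both FTS5 and tsvector consult an inverted index, which is why they stay fast as the table grows.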


Nested data structures.

I'd recommend other NoSQL databases over MongoDB, like RethinkDB.

Also, SQL databases support JSON now. Postgres does, and MySQL 5.7.x I think supports it too.

The only thing I don't like about SQL is that I have to know what my data looks like in advance. Setup is a pain.


On the topic of joins: I'd suggest that if you're bumping into the joining limitation, you've chosen the wrong database for your needs. Having said that, a lot of it can be solved by rethinking your structure to take advantage of embedded documents.


For me it's all about not having to define a schema. I can just take my JavaScript object and insert it in a collection. Sure, I know you can do this in PostgreSQL and I might switch.


A good place for NoSQL is where you don't know what data you are going to receive. For example, I get weather data from a number of different sources in a number of different formats, and the sources and formats change fairly regularly. I can extract the data as key/value pairs but don't know which keys will be in any given set of data. With NoSQL, you can extract any key/value pair and save it to the database without needing to fully parse the data.


You can just shove JSON or XML you receive into a SQL database, too.

That can be a good strategy when dealing with changing data formats:

1) Import the data "as is" into the database.

2) Process the data, extracting the data you need for querying.

3) Whenever parsing fails, figure out why and fix it (update your source code, throw away bad data, whatever), then rerun the 'process' step for that data.

Reparsing can be made reliable if you use transactions in step 2. With a NoSQL solution, it may be hard to guarantee that you don't lose a few records ("may" depends on the specific NoSQL solution and how much manual work you are willing to do to restart your pipeline). You can also postpone discovering your data problems, but that's just delaying the inevitable.
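The three-step pipeline above can be sketched with sqlite3; the tables, payloads, and key/value layout are invented for illustration. The point is the transaction per payload: a failed parse leaves nothing half-written, so step 3 can simply rerun step 2 for the failures.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw (id INTEGER PRIMARY KEY, payload TEXT)")
con.execute("CREATE TABLE readings (raw_id INTEGER, key TEXT, value TEXT)")

# Step 1: import the payloads as-is, even ones we can't parse yet.
payloads = ['{"temp": "21.5", "wind": "3"}', 'not json at all']
con.executemany("INSERT INTO raw (payload) VALUES (?)",
                [(p,) for p in payloads])
con.commit()  # raw data is safely stored before processing begins

# Step 2: extract key/value pairs, one transaction per payload.
failed = []
for raw_id, payload in con.execute("SELECT id, payload FROM raw").fetchall():
    try:
        with con:  # commits on success, rolls back on exception
            for k, v in json.loads(payload).items():
                con.execute("INSERT INTO readings VALUES (?, ?, ?)",
                            (raw_id, k, str(v)))
    except ValueError:
        failed.append(raw_id)  # step 3: fix the parser, rerun for these ids

print(len(failed))  # 1 payload left for step 3
```

Because the raw payload is still in the database, nothing is lost: the failing ids can be reprocessed after the parser is fixed.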

What NoSQL is particularly good at is running on multiple machines. It achieves that by giving up some letters of ACID.

It also typically makes it easier to store and query unstructured data than SQL databases do, but SQL databases are catching up.


No mention of BigTable so far: ACID-compliant NoSQL with transactions, IIRC. But you have to host with Google (at least for the data store component).


I heard that they are better for the environment and smell nicer too.


If you need data integrity you will need a RDBMS.


A lot of the entities we deal with in web apps are complex, multi-table structures when stored in a relational database. Think about blog articles (tables: article, author, tag, comment), user profiles (tables: user, friends, follower, interests, etc.), photos, tweets, ads, and such. The beauty of something like MongoDB is that you can store all the data as one "record" in a collection. You can also retrieve it with a simple query. These "records" are structured as something like JSON, so you can just take the data structures you're using in your app's code (particularly Python or JavaScript), and store it as-is with no conversion.

Compare this to an RDBMS-based web app, where to display a blog article for example you'd need to make multiple queries of different tables: first the "article" table to get the text, then the "author" table, then all the comments, tags, images, or whatever. The NoSQL way is much easier to code for this sort of thing. The lack of ACID doesn't matter so much because these kinds of data are usually written just once, read many times, edited rarely, so there are few opportunities for inconsistencies to creep in.


>for example you'd need to make multiple queries of different tables: first the "article" table to get the text, then the "author" table, then all the comments, tags, images, or whatever

Haven't you heard of joins?


You're just outsourcing the impedance mismatch problem to the database. Yes, we can write complex SQL (and I've written some really complex SQL) to combine multitudes of entities together and generate a result. NoSQL databases eliminate this.

Where it becomes extremely impactful is when a database is distributed or sharded across a cluster. If the various tables (article, author, comment) are stored on different servers, it takes time for the database to reassemble them. Moreover, modern web apps often have content that's written once, read many times: blog posts, photo posts, status messages, news articles, advertisements, catalog listings, etc. Why re-do that work of assembling the content every time people want to look at it? When documents are rarely or never edited, the relational model's defenses against "anomalies" become less relevant. Far better to use a document store like MongoDB, which stores a piece of content as an integrated whole at a single place in the database.



