Hacker News
Ask HN: Learning NoSQL, papers and books
142 points by wareotie on Oct 8, 2017 | 43 comments
In your opinion, which papers and books are mandatory to really understand the NoSQL subject?



I don't know what your current knowledge of or experience with NoSQL databases is, but I would suggest starting with the well-known Bigtable paper [1]. After that, instead of reading more papers, have a look at the AOSA chapter on NoSQL [2]. You can then either go through the Bigtable paper again to improve your understanding, if you feel the need, or jump to the Dynamo paper [3]. To develop your understanding further, I think it would be good to go through the documentation and source code of some open source databases. This would help you connect the usage scenarios with the design choices you saw in the papers.

After this it is up to you. The papers reference a lot of distributed systems literature. If you are interested, you can go through the resources here [4]. If you want to go a more hands-on way, I would also recommend reading the AWS DynamoDB best practices documentation [5] (you can read up on Cassandra or CouchDB as well) to see the practical considerations when using these systems. Then try to use it or any other NoSQL database in a side project and see whether it is a good fit. The data modelling will involve thinking hard about use cases and will also help you compare this to relational systems.

[1] https://static.googleusercontent.com/media/research.google.c...

[2] http://www.aosabook.org/en/nosql.html

[3] http://www.allthingsdistributed.com/files/amazon-dynamo-sosp...

[4] https://github.com/aphyr/distsys-class

[5] http://docs.aws.amazon.com/amazondynamodb/latest/developergu...


Not ranting or trolling, but in the vast majority of cases I've come across, PostgreSQL or even MySQL or SQLite would have been a better choice.

(There must be something appealing to developers about using JSON-style syntax rather than a Structured Query Language.)

There should be a solid reason to pick NoSQL in general, and when such a reason appears, picking the right one among the available NoSQL platforms is another job.

https://en.wikipedia.org/wiki/NoSQL


> Not ranting or trolling, but in the vast majority of cases I've come across, PostgreSQL or even MySQL or SQLite would have been a better choice.

This is ranting.

I am a Postgres proponent, but saying that PostgreSQL/MySQL/SQLite is the better choice in the vast majority of cases the parent has come across is reckless. The words were well chosen, making the rant less obvious.

There aren't good or bad DBs. Every DB has its strengths and respective trade-offs. As much as I like Postgres, there are so many use cases for other DBs, including NoSQL ones. I am not going to feed the troll and start reasoning about why NoSQL can be terrific or SQL can be a big struggle. I am on both sides; both SQL and NoSQL have their place.

It's sad that a thread about learning NoSQL gets hijacked by an unrelated top comment opposing NoSQL.


There definitely are bad databases. You can easily make a system that is NOT consistent and NOT available and NOT partition tolerant, for example.


“Yeah, but Postgres” is the new “Is it webscale?” for any DB-related thread. They found a blue hammer that will work for every problem and want everyone to know.


Not only that, but Postgres has great JSON support. I think a good way to put it is: If you don't know the ins and outs of SQL, start there. It solves the vast majority of problems you'll encounter. Expand out to NoSQL as your needs (and knowledge) arise.


I know you mentioned Postgres, but I was just wondering if you or anyone else has experience with using JSON with MySQL? I am currently in a place at work where the decision was made to use MySQL and we can’t back out of that, and I find their documentation a little terse on this subject (I’m not a SQL expert), so I was wondering if anyone could speak to it at all. Does it automatically parse the keys in a JSON file as table names for where to put the values, or are you just calling a file every time? Is one per se more effective or efficient than the other?

Sorry to latch on; I’m very eager to learn. Our stacks of choice are Django and Flask respectively, if that helps.


Right on! This is the reason I've actually mentioned Postgres.


“I want to learn about planes”

“Trains are usually a better choice. Most people don’t need planes”


Can you explain why? I agree, but I’d like to be able to justify it as much as possible when arguing for SQL.


Not the author, but here is my explanation: SQL databases are similar to a Swiss army knife. You can apply them to pretty much every use case. However, for most use cases they won't be as good as a more specialized tool. NoSQL DBs usually make stronger trade-offs that limit them to fewer use cases, but make them incredibly well suited for those. If you know for sure what your problem is and you have to scale, go NoSQL. If you and your company are starting out, you are most likely better off with Postgres. Even if your current use case is a perfect fit for a specific NoSQL store, your business needs are likely to change, and then you have to migrate. For all but intense cases Postgres will scale well. Once you are super successful you can migrate the pieces of your system that need it to a better-scaling solution. You must make 100% sure, though, that you understand the trade-offs you are making. There is no system that is just in general better than any other reasonable system. If a new system claims otherwise, we just don't know the trade-offs yet, which is super dangerous.


>However, for most use cases they won't be as good as a more specialized tool

It's just a small set of problems that really requires a nosql database.

Most (if not all) nosql databases are perceived as less complicated since they hand-wave away all complicated things to the users of the database, while focusing on being fast and simple to use and run in a cloud or cluster.

Anyone running a database system in a fault-tolerant configuration immediately hits the CAP theorem, and SQL and NoSQL databases sacrifice or ignore different aspects of both CAP and ACID in order to scale.

As you write, you really have to know what you are sacrificing before making that choice. Perceived complexity is probably not a good selector.

One problem is that SQL databases are normally installed in "pet mode", where you have two or three servers that you really have to take care of. This feels less satisfactory when developing for the cloud, and typically also doesn't scale very well horizontally. Instead of running your own distributed database in the cloud (and failing), there are also PaaS databases, but SQL tends to come in vendor-specific flavours, making it hard to change the infrastructure.

Maybe another problem is the model mismatch: relational databases impose restrictions on how data is represented and how it's retrieved that make no sense from a "REST-interface-based" view, as there's a mismatch between the relation-entity view (objects and lists) and relational algebra.

There are graph databases, and I personally think that they might be the future. Building strong models within a bounded context is still probably the best way to model complex data and processes that operate on that data.

Unfortunately the future isn't here yet and most graph databases are still slower than my laptop.

The best compromise is probably to use CQRS (Command Query Responsibility Segregation), meaning that queries and commands (modifications) are handled by separate stacks: read-only data may be distributed and updated ("cached") for use, but actual processing is done against a single consistent database running on a few "pet" servers.

This only makes sense for systems that mostly read data and update it relatively seldom.
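
To make the CQRS split above concrete, here is a minimal in-memory sketch in Python. The class and method names (WriteStore, ReadCache, handle_rename) are made up for illustration, and plain dicts stand in for the "pet" database and the distributed read copies; a real system would propagate changes asynchronously.

```python
from dataclasses import dataclass, field


@dataclass
class WriteStore:
    """Single consistent store that handles all commands (modifications)."""
    items: dict = field(default_factory=dict)
    subscribers: list = field(default_factory=list)

    def handle_rename(self, item_id, new_name):
        # The command is applied to the authoritative store first...
        self.items[item_id] = new_name
        # ...and then pushed out to every read-only copy.
        for notify in self.subscribers:
            notify(item_id, new_name)


@dataclass
class ReadCache:
    """Read-only copy that serves queries and is refreshed after commands."""
    view: dict = field(default_factory=dict)

    def apply(self, item_id, name):
        self.view[item_id] = name

    def query(self, item_id):
        return self.view.get(item_id)


store = WriteStore()
cache_a, cache_b = ReadCache(), ReadCache()
store.subscribers.extend([cache_a.apply, cache_b.apply])

store.handle_rename(1, "widget")  # command goes to the write side
```

Queries such as cache_a.query(1) never touch the write store, which is why the approach pays off mainly when reads dominate.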


Q: How can I learn about noX?

A: Not trolling, but X is vastly usually better than noX.

IDK what trolling is.


The vast majority of cases I've come across, if not all of them, only suffer from any reliance on 90s-era RDBMSs.

And it's never about JSON, it's about latency and resilience, about being able to simply add and replace nodes, about just working in a modern distributed environment.


How many systems actually need a distributed database though? In my experience it's usually resume-driven development that makes the choice to go NoSQL.


Dear God, this. 90% of the people reading this (or more), myself included, are currently working on a system that averages fewer than 100 concurrent users. I’ve worked on big systems, and DBs like Cassandra are great, and absolutely have their place, and that place is likely not your system. Quit overcomplicating everything, please. Please.


I love this comment, because it's exactly how I feel when people talk about these systems, designed for big scale.


Anybody who needs more availability than an individual instance can provide.


That's all great until you need to perform a join.


RethinkDB handles joins just fine :)


As does Couchbase. :-) Personally I like map-reduces.


So does the multi-model database ArangoDB. https://docs.arangodb.com/3.2/AQL/Examples/Join.html

And some NoSQL databases speak SQL as well - without being relational.

I like the JSON support in PostgreSQL a lot. It is very easy to deal with unstructured JSON data while still using common attributes in a relational format. But there are more cases than one might think, as a relational guy, that benefit from graph databases, document stores or optimized time-series databases.


NoSQL is a great fit for OLAP-type systems where there are tons of high-volume writes, eventual consistency (or BASE in general) is good enough, a strict schema is not enforced, and the consumer (a data scientist, a customer service rep, etc.) is not affected too much if they have to wait a few extra seconds for the search results to come back.


I highly recommend Martin Kleppmann's Designing Data-Intensive Applications (http://dataintensive.net).

It will not only help you understand what "SQL" and "NoSQL" data stores are; it also covers the differences between them, what problems they are designed to solve, how they try to solve them, and whether they'll help with your problems as well.


I teach a course on database systems, including one class on distributed databases (like Dynamo and Spanner) and another on dataflow engines (like MapReduce/Hadoop and Spark).

Students seem to find the Dynamo paper to be the single most enlightening resource. It does a great job of explaining Amazon's use case and how the solution fits the problem. I also reference the relevant Red Book chapter and some students value that context.

It's worth noting that students are very comfortable with relational DBMSs by this point, both in theory and in practice. It quickly becomes clear to them that NoSQL is better called "no transactions", as they know the costs and benefits of various isolation levels in a traditional RDBMS. If you don't yet have an undergraduate-level background in database systems I'd encourage you to seek that out either first or at least along the way to understanding NoSQL systems. My recommendations for how to do this as a self-learner are up on https://teachyourselfcs.com.
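
As a small illustration of what "transactions" buy you in a traditional RDBMS, here is a sketch using Python's built-in sqlite3 module. SQLite merely stands in for a relational DBMS here, and the accounts table, the names, and the transfer helper are hypothetical; the point is that a failed multi-statement update leaves no partial changes behind.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This violates the CHECK constraint if src cannot cover `amount`.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balance(conn, name):
    row = conn.execute(
        "SELECT balance FROM accounts WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

With these rows, an overdrawing transfer credits the destination first and then fails on the debit, and the rollback undoes the credit as well. A store without multi-statement transactions leaves that cleanup to the application.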


Yet many non-relational systems do support ACID transactions across multiple resources. Just from Google there are Megastore, Cloud Datastore, Spanner, and Cloud Firestore.


Distributed systems. Consensus [0], CAP, PACELC theorems [1], CRDTs [2], maybe Chord DHT [3] for hash rings. Oh, and jepsen.io for actual database choices.

[0] https://en.wikipedia.org/wiki/Consensus_(computer_science)

[1] https://en.wikipedia.org/wiki/PACELC_theorem

[2] https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...

[3] https://en.wikipedia.org/wiki/Chord_(peer-to-peer)
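
For the hash rings mentioned above, a minimal consistent-hashing sketch in Python may help. It is a generic ring with virtual nodes, not Chord itself (there are no finger tables or routing), and the HashRing class, the node names, and the vnodes parameter are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to nodes; adding a node only moves keys onto that node."""

    def __init__(self, nodes, vnodes=64):
        ring = []
        for node in nodes:
            # Several virtual positions per node even out the key distribution.
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._positions = [pos for pos, _ in ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # one of the three node names
```

The design choice worth noticing is that a key's owner depends only on ring positions, so growing the cluster reassigns just the keys that fall in front of the new node's positions instead of reshuffling everything.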


The most important thing to understand about NoSQL is when you should use it. For many circumstances, NoSQL isn't the right tool for the job. The key is being able to recognize when it is.

I'm still learning how to determine when I should use NoSQL instead of SQL. My best advice is to carefully consider how you plan on querying your data. If you plan on making complex queries that link multiple relationships, NoSQL is not for you.


Or in a slightly different form, what I'd personally love to always have an answer for: is there a fast way to do this complex query in reasonable time in an RDBMS, or do we have to force it into a NoSQL-ish solution (say, Solr)?

After I've optimized my query/indexes to get from 60s to around 4s, going through the usual stuff and trying not to do anything too stupid, how do I get it to <200ms? Maybe the better question is how to structure the data so you don't need the complex query.


Seven Databases in Seven Weeks https://pragprog.com/book/rwdata/seven-databases-in-seven-we...

Designing Data-Intensive Applications http://dataintensive.net/


Same as for SQL databases: Readings in Database Systems, 5th Edition -- Peter Bailis, Joseph M. Hellerstein, Michael Stonebraker, editors

http://www.redbook.io/


As a starting point, if you have little background in NoSQL, I strongly recommend this 1 hour talk by Martin Fowler: https://www.youtube.com/watch?v=qI_g07C_Q5I

It's slightly dated, but it still gives a strong overview of the different paradigms. The truth is that what you want to learn probably differs greatly depending on the paradigm that fits your application. NoSQL databases can broadly be categorized into document-oriented, key-value, columnar, and graph stores. This video will help you understand what (at least three of) those are. Then you can focus on books and articles about the paradigm that makes the most sense for you.



Designing Data-Intensive Applications [1] is a good all-around book on building applications and managing the data they rely on, including NoSQL.

[1] See http://dataintensive.net


I don't know of any mandatory books or theory about NoSQL; I picked it up on the fly using Firebase for a web app. Not affiliated, but I'm a reasonably happy customer. It's super easy to learn, and they have lots of tips and pointers about how to use it well, as I'm sure others do.

Their tips are here, and I think this applies to most/all NoSQL (someone correct me if I'm wrong.) https://firebase.google.com/docs/database/web/structure-data

The tl;dr is:

- Avoid complex queries. Structure data so that you can make simple queries that execute fast.

- Avoid nesting & flatten data as much as is reasonable.

NoSQL is easier to learn and use than SQL, and there's a lower barrier to entry, but the trade-off is that it's less powerful than SQL, so you have to keep your data simple too.


>Avoid nesting & flatten data as much as is reasonable.

Isn't this contradictory?


Yes, it is a little bit - if you mean that one reason to use NoSQL is to store nested JSON.

This is referring more to schema than data. In part what that means is to avoid nested indexes... subtle but different than avoiding any nesting at all. In other words, if you can treat the nested data as a blob, it's probably okay, but if it's being used for a query, it's adding complexity that can cause trouble.

Some of the reasons for that are Firebase-specific; they have to do with security rules and how security can get too complicated if you're not careful with nesting.

But I'd guess it still applies to other NoSQL data... nesting data as part of the schema is like making another table, and all the complexity that comes with it. Except it's a new table you can only get to by going through the first table.

A common problem with nesting is thinking you got the order right for your use case and later finding out you sometimes want to index by the inner data rather than the outer data. If you only have A/B (B nested in A) and you need to query for As, then you're fine. When you find out you need to query for Bs, you have a problem.

Firebase even recommends duplicating data, if necessary, to have two indexes A/B and B/A, rather than trying to query for nested data.
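
A rough sketch of that A/B plus B/A duplication, in Python. A plain nested dict stands in for the JSON tree here (this is not the Firebase API), and the paths user_groups and group_users and the helper names are made up for illustration.

```python
def add_membership(db, user_id, group_id):
    """Write the relationship in both directions so neither lookup needs a scan."""
    db.setdefault("user_groups", {}).setdefault(user_id, {})[group_id] = True
    db.setdefault("group_users", {}).setdefault(group_id, {})[user_id] = True


def groups_of(db, user_id):
    # Direct path lookup: user_groups/<user_id>
    return sorted(db.get("user_groups", {}).get(user_id, {}))


def users_in(db, group_id):
    # Direct path lookup: group_users/<group_id>
    return sorted(db.get("group_users", {}).get(group_id, {}))


db = {}
add_membership(db, "alice", "admins")
add_membership(db, "bob", "admins")
add_membership(db, "alice", "editors")
```

The cost is that every write has to update both paths, which is the price of being able to query for As and for Bs without nesting one inside the other.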


It seems like this Stack Overflow question hits on the same issue you ran into [1].

It looks like that might be specific to Firebase's implementation, because this can be achieved with MongoDB [2].

1. https://stackoverflow.com/questions/27207059/firebase-query-...

2. https://stackoverflow.com/questions/15654228/sort-by-embedde...


No, that's not the issue I ran into. Their API has changed since this question was asked & answered to fix that particular issue.

The bigger issue remains that schema nesting causes a type of complexity that SQL dbs avoid by always being flat. Even that answer you linked to, the very last sentence is: the most important one for people new to NoSQL/hierarchical databases seems to be "avoid building nests".

Schema nesting in MongoDB is also best avoided, if you can, e.g.:

https://stackoverflow.com/questions/5108790/mongodb-best-pra...


Start with a general understanding of SQL/NoSQL/ACID/CAP and how they relate: https://www.quora.com/What-is-the-relation-between-SQL-NoSQL...

Then read this book for in-depth details - Designing Data-Intensive Applications : https://dataintensive.net/


I found the book CouchDB: The Definitive Guide to be a good introduction when I first read it some years ago. I bought the dead tree edition but they have an online version that I think may have been updated.

http://guide.couchdb.org/


NoSQL Distilled, NoSQL for Dummies, 7 Databases in 7 Weeks, NoSQL for Mere Mortals, Professional NoSQL

and of course the original papers from Amazon and Google.

If you have more questions - contact me at HN AT NoSql dot Com


[flagged]


Let's keep this "humor" on Reddit, please.



