Antirez: You Need to Think in Terms of Organizing Your Data for Fetching (highscalability.com)
66 points by aespinoza on Oct 10, 2012 | 19 comments

Also keep in mind that learning to "organize your data for fetching" is not necessarily something you can do before you start your project. Many (most?) times you can't predict which data access patterns will be most common and benefit from using Redis, etc.

Starting with a "slower, but flexible" datastore like a traditional relational database, monitoring which access patterns need a boost, and then optimizing or introducing a new datastore is almost always a solid plan of attack.

I still feel the canonical answer to "what should my default data management policy be in write-some-read-a-ton situations?" is to, at write time:

    1) store an appropriate write-whole/data-mining-friendly format
    2) ASAP, for each major view, write out a Redis-style O(1)-to-read data structure
    3) think carefully about backup and replay strategies

You trade a slightly stale read of the very hottest data for much improved performance on everything else, and more importantly, much simplified view code.
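
A minimal sketch of those three steps, assuming plain in-memory Python structures as stand-ins for the durable log and for Redis. The class name `WriteTimeViews`, the event fields `user` and `text`, and the two views are all hypothetical, chosen only for illustration:

```python
from collections import defaultdict, deque


class WriteTimeViews:
    """Hypothetical sketch: an append-only log plus per-view read structures."""

    def __init__(self):
        # 1) Whole-record, mining-friendly store (a list stands in for a durable log).
        self.log = []
        # 2) One precomputed structure per view (dicts stand in for Redis keys).
        self.posts_by_user = defaultdict(deque)  # newest post first
        self.post_count = defaultdict(int)

    def _apply(self, event):
        # Update every view that this event affects.
        user = event["user"]
        self.posts_by_user[user].appendleft(event["text"])
        self.post_count[user] += 1

    def write(self, event):
        # Store the raw event first, then update the views at write time.
        self.log.append(dict(event))
        self._apply(event)

    def rebuild(self):
        # 3) Replay: the views can always be regenerated from the log.
        self.posts_by_user.clear()
        self.post_count.clear()
        for event in self.log:
            self._apply(event)


views = WriteTimeViews()
views.write({"user": "alice", "text": "first"})
views.write({"user": "alice", "text": "second"})
latest = list(views.posts_by_user["alice"])  # a lookup at read time, no joins
```

The view code only does lookups. Adding a new view means adding a structure, updating `_apply`, and replaying the log, which is why the backup and replay story matters.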

The best reference I have found for this pattern, and it isn't great (too big-SQL-centric), is "command-query-responsibility segregation":


If the CEO/owner/founder of your company is non-technical, he/she will request the data in ways you wouldn't have thought about in advance. That's just reality. That makes Redis not appropriate for most companies. It's also too expensive for side projects. So that leaves technically-led startups. Which is a good chunk of companies (and probably the funnest to work for).

Lots of people use Redis as a cache, not a primary data store. You can have full querying in your SQL database and fast access to common requests using Redis.
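
A rough sketch of that cache-in-front-of-SQL setup, assuming an in-memory SQLite database as the SQL store and a plain dict standing in for Redis. The `users` table, the key format, and the function names are hypothetical:

```python
import sqlite3

# The SQL database stays the source of truth and keeps full query power.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")

# A dict stands in for Redis here; a real setup would use a Redis client.
cache = {}


def get_user_name(user_id):
    key = f"user:{user_id}:name"
    if key in cache:
        # Hot path: served from the cache, no SQL involved.
        return cache[key]
    row = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    if row is None:
        return None
    cache[key] = row[0]  # populate the cache on a miss
    return row[0]


def rename_user(user_id, new_name):
    db.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
    # Invalidate so the next read picks up the new value from SQL.
    cache.pop(f"user:{user_id}:name", None)
```

Ad hoc reports still go straight to SQL. Only the common requests take the cached path.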

In what way is it too expensive for side projects? It's the easiest data store to compile and run that I've used.

Here are some Redis hosting options, you tell me if this is affordable for a side project:



How are 3rd party hosting options part of a cost comparison?

Hosting any DB offsite comes at a cost, and it does not appear that any one database platform has an advantage over another in terms of a service provider.

My current little side project uses about 200-300 MB of database storage. Using Heroku, that would cost (per month) $9 on their shared Postgres database, $10-15 using MongoDB, $50 using Heroku's dedicated production Postgres service, or $125 using Redis.

That's quite a difference.

Wait. If you use 200-300MB of DB storage, openredis costs $69 using it as a Heroku add-on and $45 if used independently.

$125 ($90) is the price of the large instance, which offers 1.7GB of storage.

That's quite a difference.

I stand corrected. I was using the prices from redis-to-go, which was the only available option last time I checked, and the service with the big prominent link at the top of the heroku pricing page. Good to see that there is some competition in the area and that prices have come down a bit. Still not cheap enough to move my hobby project from a 1GB VPS, but reasonable enough if I ever turn my project into something that might make money.

The smallest plans listed are $7 and $8 a month. Sounds affordable for a side project.

FYI, http://redistogo.com offers a free plan: 5 MB, 1 database. Perfect for a side project.

Doesn't give you much though.

First, running Redis remotely, with WAN latency involved, is typically not the most performant setup. Second, if cost is the issue, run it on a VPS where you'll get more memory for the money (hopefully the same VPS provider you use for other things).

That's why it is common to replicate to some SQL server for non-real-time operations.

Doing transformations for one off reports is okay. Doing extensive data transformations on data that is loaded millions of times every day is not.

  So anyway if your data needs to be composed to be served, you are not in good waters. -- antirez

Perfect! This sentence puts an end to SQL vs NoSQL holy wars. There's no silver bullet. But we 'all' knew that already :)

Data should not be organised based around retrieval or insert / update patterns but organised according to the model that best captures the essence of what the data is. That may sound fluffy, but most of the time data captured is captured following something real happening that caused data to be generated. Your data model needs to make sense in the context of the thing that happened in the real world, not in the context of what is inserted or how it is read.

The issue is that your methods of collection and retrieval will change over time, and your data model needs to support that and still make sense for existing data.

While what you describe is certainly the ideal, and likely applicable in a variety of situations, there are a lot of real world situations where it can't cut it. Antirez says it best:

   remember all those stories about DB denormalisation, and hordes of memcached or Redis farms to cache stuff, and things like that? The reality is that fancy queries are an awesome SQL capability (so incredible that it was hard for all us to escape this warm and comfortable paradigm), but not at scale.

You'll be hard pressed to find any medium-sized project in the wild that doesn't require a layer of denormalization or caching to be reasonably responsive (I mean, some frameworks come with that built-in - their users might not even be aware this is happening). You might have some beautifully crafted model underlying that layer, but don't fool yourself into thinking that's all there is.

