
From 1.5 GB to 50 MB: The Story of my Redis Database - janerik
http://davidcel.is/blog/2013/03/20/the-story-of-my-redis-database/
======
antirez
One of the long-term things we want to address in Redis is the lack of tools
to analyze the dataset, available directly in the Redis distribution as
first-class citizens.

There are a few, like redis-cli --latency or redis-cli --bigkeys (I encourage
you to try both if you never have before), but more are needed in order to
profile memory, find big keys, pretty-print the Slow Log output, and so forth.

The availability of external tools is surely of great help, but I hope that
in the long run we'll have the most important things integrated directly into
redis-cli.

~~~
ksri
antirez - Author of redis-rdb-tools here. Would you be interested in a patch
that adds profiling support to redis-cli?

My approach is to parse the rdb file so that the running server isn't
affected. It approximates memory usage, but is fairly representative.

~~~
antirez
Thanks ksri, that would be very welcome.

I think a viable design would be to put into redis-cli only things that are
network-oriented, so we should actually merge the functionality of your
current tool into redis-check-rdb, which should be renamed to just redis-rdb,
with the following features:

1) Everything redis-rdb-tools is currently able to do.

2) Sanity checks of RDB files, like redis-check-rdb is doing already.

Does that make sense? Thanks again, this would be very much appreciated.

~~~
ksri
Definitely makes sense! I'll start looking into this.

------
AznHisoka
Great solution, but...

"Maybe I should revert to storing ratings in PostgreSQL and accept what would
certainly be a large performance hit during recommendation generation."

I wondered if you prematurely optimized here. Did you try Postgres in the
first place? What was the performance like? I can't help but wonder if you
dismissed Postgres simply because it wasn't as sexy as redis.

~~~
davidcelis
To be honest, when I first wrote my site (and, later, recommendable) it was
for my senior project in school. Up to that point I had only been programming
in school (which was fairly different from programming in the real world), and
this site
was really my first real-world application. Back then I was using MySQL (!) as
my main data store. I was storing Likes/Dislikes as polymorphic Rails models,
but then I read about Redis and it sounded like the perfect fit for all the
set math I would have to do. So when refreshing recommendations, I would pull
the ratings out of MySQL and feed them into Redis to do some fast set math.
Then I'd tear down the Redis sets. I only had friends on the site testing it
out for me, so I didn't notice much of a problem.
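
For a rough idea, the refresh step looked something like this (a simplified
Python sketch rather than the actual Ruby code; the key names and the
similarity formula here are just illustrative):

    import redis

    r = redis.Redis()

    # Load two users' liked-beer IDs into temporary sets (in the real app the
    # IDs came out of MySQL), do the set math, then tear the sets back down.
    r.sadd('tmp:likes:1', 10, 11, 12)
    r.sadd('tmp:likes:2', 11, 12, 13)

    common = r.sinter('tmp:likes:1', 'tmp:likes:2')  # beers both users liked
    either = r.sunion('tmp:likes:1', 'tmp:likes:2')  # beers either user liked
    similarity = len(common) / len(either)           # Jaccard-style overlap

    r.delete('tmp:likes:1', 'tmp:likes:2')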

Later, I switched my database to PostgreSQL but wasn't having Redis issues at
that point, so I let it be. Then when goodbre.ws found itself with a lot of
new faces, I found myself with the I/O bottleneck I mentioned in the article.
I decided to try to solve that by moving the ratings to permanently reside in
Redis. That way I'd only have to hit Redis a couple times to regenerate
someone's recommendations. Then I started storing too much in my ZSETS, but
here I am now.

Now that I've reduced that memory footprint again, I feel as though attempting
to move the ratings back into Postgres would be a premature optimization in
itself. I was, however, looking into some combination of Postgres' array types
(to replace the sets) and hstore (to replace the ZSETS, but sort in the
query). I think it's worth looking into, but I have some other things to focus
on first!

~~~
joevandyk
One thing that's commonly done is to keep the data in a normalized format, but
use a view or a function to return the data in whatever format you want
(arrays, json, etc).

This lets the application request the data in the ideal format for
consumption. The database then becomes a "black box" service that returns a
nested json graph -- the application no longer needs to know how the data is
stored.

See <https://gist.github.com/joevandyk/5215014> for a simple example.
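Here's a minimal sketch of the idea in Python/psycopg2 (the table, view, and
connection details are invented for illustration; the gist has the real
example):

    import psycopg2

    conn = psycopg2.connect("dbname=example")  # connection string is made up
    cur = conn.cursor()

    # Normalized storage: one row per (user, beer) rating.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS ratings (
            user_id integer NOT NULL,
            beer_id integer NOT NULL,
            liked   boolean NOT NULL,
            PRIMARY KEY (user_id, beer_id)
        )
    """)

    # A view that repackages each user's ratings as one JSON document, so the
    # application can treat the database as a black box that returns JSON.
    cur.execute("""
        CREATE OR REPLACE VIEW user_ratings_json AS
        SELECT user_id,
               array_to_json(array_agg(row_to_json(r))) AS ratings
        FROM (SELECT user_id, beer_id, liked FROM ratings) r
        GROUP BY user_id
    """)

    cur.execute("SELECT ratings FROM user_ratings_json WHERE user_id = %s", (1,))
    print(cur.fetchone())
    conn.commit()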

------
gtrubetskoy
If the values you are storing are integers, you should also look at your zset-
max-ziplist-entries config setting. I've been able to shrink a 30GB Redis
memory footprint to just under 3GB. The caveat is that all search operations
in ziplists are sequential (since it uses variable-length strings and
integers), but oftentimes scanning a short array is as fast or faster than
hashing or locating an entry in a skiplist. There are more details about this
here <http://redis.io/topics/memory-optimization>, and I've submitted a patch
to Redis that makes it even more efficient
(<https://github.com/antirez/redis/issues/469>).
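
For example (a sketch assuming a recent redis-py client; the threshold values
are only illustrative, and recent Redis versions report this encoding as
"listpack" rather than "ziplist"):

    import redis

    r = redis.Redis()

    # A zset stays in the compact ziplist encoding only while it is below both
    # thresholds; past either one it converts to a skiplist plus hash table.
    r.config_set('zset-max-ziplist-entries', 128)
    r.config_set('zset-max-ziplist-value', 64)

    r.zadd('scores:small', {'1001': 1.0, '1002': 2.0})
    print(r.execute_command('OBJECT', 'ENCODING', 'scores:small'))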

------
ocharles
For relational databases, we already have some pretty good rules to avoid
problems like this - the normal forms (specifically BCNF, which is "the"
normal form to reduce redundancy inside a single relation). I wonder what
rules apply to non-relational models? Has anyone done research into this?

~~~
parasubvert
Much of the research on network and graph databases from the '70s and '80s
applies to today's non-relational models, at least in the area of logical data
integrity.

The short answer is that the "relational" model is a generic logical theory
for managing data under a closed-world assumption. "Non-relational" is mostly
about saying "I don't need a logical model, I need a persistence framework",
wherein you are just looking to implement durable storage, but aren't really
managing data in a logical manner.

Some network/graph databases are arguably a special case of the relational
model, but they all vary in the extent to which they enforce logical
integrity vs. exposing the innards to muck with (same with all RDBMSs,
frankly).

William Kent's 1978 book _Data and Reality_ is a great read to think of data
modeling and integrity issues generically, irrespective of whether you use an
RDBMS or a non-relational database. He highlights certain logical concepts as
universal regardless of whether you use tables, K/V pairs, or graphs of binary
relationships.

------
macspoofing
>What could possibly be causing it to use that much memory? I considered the
length of my keys...Maybe Redis was eating memory by storing tens of thousands
of really long key names in RAM? I decided to try shortening them to a more
reasonable format: u:1234:lb for example.

Wow ... why would you reengineer your naming conventions without at least
doing some back-of-a-napkin arithmetic beforehand and comparing the potential
savings? That strikes me as ... insane.

~~~
davidcelis
> Optimization is a rabbithole I’ve not had to go down very often. I am hardly
> an expert.

But really, the answer to this is that it's easy to copy a database in
Redis and then just do a mass RENAME. It took all of two minutes with a Ruby
script and Redis' RENAME command. I could have done a theoretical back-of-a-
napkin attempt, but I wanted some real numbers. And if it's easy to get them,
why not?
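
Something along these lines, give or take (a Python sketch of the idea rather
than the actual Ruby script; the old key pattern is approximate):

    import redis

    r = redis.Redis()

    # Rename e.g. "users:1234:liked_beers" to "u:1234:lb". KEYS is a full
    # linear scan of the keyspace, which is fine for a one-off like this.
    for old_key in r.keys('users:*:liked_beers'):
        user_id = old_key.split(b':')[1].decode()
        r.rename(old_key, 'u:%s:lb' % user_id)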

~~~
macspoofing
Nevertheless, it was effort for dubious statistics (is knowing that the memory
savings are exactly 10MB any better than estimating that the savings would be
around 8MB to 15MB, when working with a data set that's 1.5GB?). Anyway, it
clearly doesn't really matter in this case. You're playing around with a side-
project that you can just blow away anytime you feel like. However, when the
calculations are this trivial, it's probably good to get into the habit.

~~~
jholman
Fermi estimates are a terrific technique, and indeed in this case we can see
that "tens of thousands" x "five for each user" x 20B saved = 10MB, so that
worked out nicely. So yeah, that would have been smart. BUT...

When Fermi estimates fail, they fail in one of two ways. Sometimes they fail
by a constant factor, as expected, and that's fine. But sometimes they fail
because you forgot one or more important factors entirely, and in that case,
they can be arbitrarily wrong. (In fact, it seems pretty likely that the so-
called Fermi paradox is an example of this). It's not just 8MB vs 15MB vs
10MB.... it's 10MB vs some chance of totally utterly wrong, like maybe 1GB or
something.

So when the test for real data is cheap (as OP says it is in this case, not
that I know squat about Redis), it's "probably good to get into the habit" of
doing it right, instead of getting a perfectly good napkin dirty.

~~~
macspoofing
I know what you're saying, but even if you're off by an order of magnitude,
at least it's some sort of starting point. Estimating that your key-space in
Redis is between 5MB and 500MB isn't particularly useful but it is a big
improvement over having no idea. I don't disagree with you that in most cases,
there is no substitute for real world data, especially if you can come across
it without much pain, but I maintain you should _always_ start with some sort
of estimate, however wrong it may be. It's good to take the time to try to
reason out a system or problem, and then see if your mental model matches
real-world reality.

In this particular example, the napkin analysis is so trivial and so
accurate, and the 'real world' data is so expensive (a linear lookup of the
entire key-space, followed by data destruction), that it immediately jumps out
at me.

~~~
jholman
When you forget even one factor, you can be off by many orders of magnitude.

And you're measuring the cost of the real data in totally useless terms. Who
gives a shit about linear scans and "data destruction", lolz. What matters is
how long it took the OP to figure out what command to write, and write it, and
write the cleanup commands. If he already knew how to do it, then it was
basically free. If he didn't already know how to do it, then he learned
something, which pays for most of the cost of doing it.

And the napkin analysis is so trivial, but you don't EVER know that it's
accurate, because there's no error bound on "oh I forgot the important part".

~~~
macspoofing
I don't think you know what you are actually arguing. I'm pretty sure I'm not
arguing against performing real world tests to diagnose a problem. I'm pretty
sure I'm not doing that because I'm not insane. I'm also fairly sure I didn't
argue that you have to do one or other but not both. Though I didn't say it
outright, I'm pretty sure I'm implied, and I stand by it, that it is good
practice to at least do a fast mental (or napkin) estimate when it is
warranted. Debugging is a hard process, but it shouldn't be a random one. You
should have a mental model of your system, and it is good practice to go
through the exercise of defining said mental model and yes, checking your
assumptions, when it is warranted.

I'm clearly not a fan of the direction the OP took to verify his assumption
that his key sizes have an impact on his Redis DB size. I think there's a
better, more accurate way of checking his hypothesis (napkin math). His
approach wasn't great and doesn't work for anything other than toy or test
deployments, and the results aren't as clear as you may think (see below). I
stand by that.

>And the napkin analysis is so trivial, but you don't EVER know that it's
accurate, because there's no error bound on "oh I forgot the important part".

We're still talking about the same situation, correct? I suppose if it's
pedantry you want, pedantry I can give. Tell me, how does the OP know that his
"real world analysis" is correct? Because at some point in time, for some sort
of input the system gave him one kind of result? Apparently a simple
multiplication (# of keys x avg. key size) is fraught with errors, but issuing
a command (RENAME) against a black-box datastore the OP probably doesn't fully
understand provides a clear, unambiguous result? What if Redis caches (in
memory or to disk) all original values before issuing a key rename and then
clears them out over a period of time, or better yet, doesn't clear them out
until it needs the space? So you run your script, check memory usage, and see
no difference.... so of course, because we're you, we trust "real-world
result" and we live happily ever after ... yes?

Obviously this is a contrived example and OP most likely got the right result
but I think I made my point. "Real world results" are full of gotchas and
ambiguities (and sometimes require a great deal of background knowledge to
properly interpret), and in such a case it would be nice to have a mental
model of what the expected result is so that it can be either verified or
proven wrong (and thereby provide a direction for further investigation).

------
ddlatham
Always good to have a reminder to measure and validate your ideas of what is
causing a problem before jumping to the solution.

------
ispivey
tl;dr version: his recommendation algorithm stored an amount of data that was
Θ(n^2) + Θ(n*m), where n and m are the numbers of users and beers. He
optimized that by putting a constant limit on the storage space used.

The clearest takeaway is: if you want to reduce the disk or memory footprint
of your DB, you need to figure out what tables/rows/columns are consuming the
lion's share of the space (there's almost always one column that's way worse
than the rest) and figure out how to change your app logic to store much less
data in that column.

The OP's lessons learned apply no matter which DB you're using!

------
GaveUp
This reminds me of a situation I've seen and found myself in far too many
times. When I encounter a performance problem, I immediately think about areas
where, during initial development, I stopped and said "this might cause a
problem with a large amount of data", instead of doing the proper thing:
hooking up a profiler or doing proper testing to determine what the actual
problem is. Rarely is it in an area I previously identified as a potential
source of future problems.

Ego (I know what the problem is!) and the drive to fix problems quickly
always get in the way of finding the solution. Tales like this are a great
reminder not to assume you know what the problem is.

------
ajtaylor
I was thinking you had switched from string keys to hashes, since I had read a
post from antirez about hashes being more efficient for this type of thing.
But as they say, measure twice, cut once. After all, profiling is the first
step of optimisation.

~~~
ihsw
Hashes can be efficient because Redis compresses hashes that match certain
restrictions (e.g., under 1000 items, each item being under 1000 characters
long), and the same can apply to lists, sets, and zsets.

This compression may have inadvertently been activated by the author when he
pruned his zsets, which is why the savings are so apparent.

~~~
ajtaylor
The primary reason for the reduction was going from ~7k items in the ZSET to
100 for several thousand ZSETS. That's not to say the built-in compression
didn't affect anything, but obviously less data == less memory.
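
For reference, the trimming itself is a one-liner per key, something like
this (a Python sketch; the key pattern is invented):

    import redis

    r = redis.Redis()

    # Keep only the 100 highest-scored members of each similarity zset; ranks
    # are ascending by score, so everything below rank -101 is removed.
    for key in r.keys('u:*:similarities'):
        r.zremrangebyrank(key, 0, -101)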

------
codegeek
Great read. Stuff like this is why HN is awesome. If I ever get into Redis
memory/performance issues, will remember this.

------
dangoldin
Thanks for posting this. The real-world optimization posts tend to be the most
useful for me.

------
gregjor
I see a few good lessons here:

1. You can learn a lot about your own code and your tools by digging in,
debugging, and experimenting.

2. As macspoofing argues, having a correct mental model of your application,
and your tools, is what separates professional programmers from amateurs and
beginners. You don't need to do much math or understand Fermi estimates to
figure out that removing 30 bytes of excess key name from, say, 100,000 rows
is only going to get back a few megs. However doing the experiment validated
your hypothesis and taught you something about how Redis (probably) stores
keys internally. For many real problems with real data, estimating is orders
of magnitude faster and simpler (and probably safer) than experimenting, so
it's a skill to cultivate.

3. This is exactly the kind of data management problem that relational
databases are _really good_ at. Relational theory is based on sets and RDBMSs
are good at working with sets. But you would need to normalize your schema and
know how to write efficient SQL -- perhaps an exercise for another day. RDBMSs
and SQL are not easy to learn and take real study and practice, but the
rewards are significant. I agree with AznHisoka that dropping PostgreSQL (or
MySQL) in favor of Redis was a premature optimization, but you would need to
spend a lot of time mastering relational database design and SQL to get the
benefits of an RDBMS, whereas Redis doesn't make the same demand. If you
posted more details of your data and how you need to query it on stackoverflow
you'd get some free lessons in schema design and normalization.

4. A database of a few tens or even hundreds of thousands of rows, taking up
only 50 megs of RAM, is trivial for any RDBMS. Not trivial as in not
important, but in the same sense that delivering a carton of milk with a semi
truck is a trivial use of the truck. Your data set would not begin to run into
any performance limits of a stock RDBMS. I'm not criticizing or making fun,
just stating a fact.

5. Don't assume you know what the problem is when debugging or optimizing --
it's too easy to go into the weeds and get frustrated. Over time you'll
probably get better at narrowing the range of possibilities but even the most
experienced programmers are frequently wrong about the source of a bug or
performance problem. Your methodology is correct, do the same thing next time.

You say that you are relatively new to programming and databases, and there's
nothing wrong with that. We've all made the same kinds of decisions you have,
and some of us are still making them without the kind of introspection you've
documented.

If your code seems too complex (something you develop a feel for over time)
you need to choose better data structures. With the right data structures the
code writes itself. When you speculate about using an array column type in
Postgres I know with 99.99% certainty that you are not normalizing -- nothing
you've described can't be accomplished with normalized relations. And you'd be
able to do all of your queries with simple SQL statements.
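
As a rough illustration of what "simple SQL" over a normalized relation looks
like here (a sketch; table and column names are invented):

    import psycopg2

    conn = psycopg2.connect("dbname=example")  # connection string is made up
    cur = conn.cursor()

    # One normalized relation: a row per (user, beer) rating.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS ratings (
            user_id integer NOT NULL,
            beer_id integer NOT NULL,
            liked   boolean NOT NULL,
            PRIMARY KEY (user_id, beer_id)
        )
    """)

    # The "set math" as a plain join: how many beers two users both liked.
    cur.execute("""
        SELECT count(*)
        FROM ratings a
        JOIN ratings b USING (beer_id)
        WHERE a.user_id = %s AND b.user_id = %s
          AND a.liked AND b.liked
    """, (1, 2))
    print(cur.fetchone()[0])
    conn.commit()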

I recommend Chris Date's book "Database in Depth: Relational Theory for
Practitioners," which is aimed at programmers rather than DBAs.

I've written more on this (and linked to some more good books) on my site at
<http://typicalprogrammer.com/?p=14>

------
DanTheAutomator
Hurray for real-world optimization!

