There are a few, like redis-cli --latency or redis-cli --bigkeys (I encourage you to try both if you never did before), but more are needed in order to profile memory, find big keys, pretty-print the Slow Log output, and so forth.
The availability of external tools is surely a great help, but I hope that in the long run we'll have the most important things integrated into redis-cli directly.
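For anyone who hasn't tried it: redis-cli --latency essentially loops PING against the server and reports round-trip times. A rough Python sketch of what's being measured (the sample count is an arbitrary choice of mine):

```python
# A rough sketch of what redis-cli --latency measures: round-trip PING
# times against the server. 100 samples is an arbitrary choice here.
import time
import redis

r = redis.Redis()  # assumes a local instance on the default port
samples = []
for _ in range(100):
    t0 = time.perf_counter()
    r.ping()
    samples.append((time.perf_counter() - t0) * 1000)  # milliseconds

print(f"min {min(samples):.2f} ms, max {max(samples):.2f} ms, "
      f"avg {sum(samples) / len(samples):.2f} ms")
```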
My approach is to parse the rdb file so that the running server isn't affected. It approximates memory usage, but is fairly representative.
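For reference, a minimal sketch of that offline approach using the redis-rdb-tools Python API (class names as I recall them from the project README, so verify against the source; the dump path is just an example):

```python
# Parse a dump.rdb offline, without touching the running server.
# JSONCallback pretty-prints every key; a custom callback subclass could
# instead accumulate approximate per-key memory usage.
import sys
from rdbtools import RdbParser, JSONCallback

parser = RdbParser(JSONCallback(sys.stdout))
parser.parse('/var/redis/6379/dump.rdb')
```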
I think a viable design would be to put into redis-cli only the things that are network oriented, so we should actually merge the functionality of your current tool into redis-check-rdb, which should be renamed to just redis-rdb, with the following features:
1) Everything redis-rdb-tools is currently able to do.
2) Sanity checks of RDB files, like redis-check-rdb is doing already.
Does that make sense? Thanks again, this would be very, very much appreciated.
"Maybe I should revert to storing ratings in PostgreSQL and accept what would certainly be a large performance hit during recommendation generation."
I wondered if you prematurely optimized here. Did you try Postgres in the first place? What was the performance like? I can't help but wonder if you dismissed Postgres simply because it wasn't as sexy as Redis.
Later, I switched my database to PostgreSQL but wasn't having Redis issues at that point, so I let it be. Then when goodbre.ws found itself with a lot of new faces, I found myself with the I/O bottleneck I mentioned in the article. I decided to try to solve that by moving the ratings to permanently reside in Redis. That way I'd only have to hit Redis a couple times to regenerate someone's recommendations. Then I started storing too much in my ZSETS, but here I am now.
Now that I've reduced that memory footprint again, I feel as though attempting to move the ratings back into Postgres would be a premature optimization in itself. I was, however, looking into some combination of Postgres' array types (to replace the sets) and hstore (to replace the ZSETs, sorting in the query). I think it's worth looking into, but I have some other things to focus on first!
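A rough sketch of that combination, with hypothetical table and column names (each() in the FROM clause needs Postgres 9.3+ for its implicit LATERAL):

```python
# Integer arrays stand in for the Redis sets; hstore stands in for the
# ZSETs, with the sort pushed into the query. All names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=goodbrews")  # hypothetical DSN
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS hstore")
cur.execute("""
    CREATE TABLE IF NOT EXISTS user_recs (
        user_id     integer PRIMARY KEY,
        liked_beers integer[],  -- stands in for a Redis SET
        scores      hstore      -- stands in for a ZSET: beer_id => score
    )
""")
conn.commit()

# "Sort in the query": unpack the hstore into (key, value) pairs and
# order by the numeric value.
cur.execute("""
    SELECT key AS beer_id, value::real AS score
      FROM user_recs, each(scores)
     WHERE user_id = %s
     ORDER BY score DESC
""", (42,))
print(cur.fetchall())
```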
This lets the application request the data in the ideal format for consumption. The database then becomes a "black box" service that returns a nested JSON graph -- the application no longer needs to know how the data is stored.
See https://gist.github.com/joevandyk/5215014 for a simple example.
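Not the gist itself, but a hedged sketch of the same pattern (hypothetical users/orders tables; row_to_json and json_agg need Postgres 9.2/9.3+):

```python
# The database assembles the nested JSON; the app receives it ready to
# consume. Table and column names here are made up for illustration.
import psycopg2

conn = psycopg2.connect("dbname=shop")  # hypothetical DSN
cur = conn.cursor()
cur.execute("""
    SELECT row_to_json(u)
      FROM (
        SELECT users.id,
               users.name,
               (SELECT json_agg(o)
                  FROM (SELECT id, total
                          FROM orders
                         WHERE orders.user_id = users.id) o) AS orders
          FROM users
         WHERE users.id = %s
      ) u
""", (1,))
print(cur.fetchone()[0])  # {"id": 1, "name": ..., "orders": [...]}
```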
The short answer is that the "relational" model is a generic logical theory for managing data under a closed-world assumption. "Non-relational" is mostly about saying "I don't need a logical model, I need a persistence framework", wherein you are just looking to implement durable storage but aren't really managing data in a logical manner.
Some network/graph databases are arguably a special case of the relational model, but they all vary in the extent to which they enforce logical integrity versus exposing the innards to muck with (the same is true of all RDBMSs, frankly).
William Kent's 1978 book _Data and Reality_ is a great read to think of data modeling and integrity issues generically, irrespective of whether you use an RDBMS or a non-relational database. He highlights certain logical concepts as universal regardless of whether you use tables, K/V pairs, or graphs of binary relationships.
But this is the story of an engineer using a profiling tool to solve a performance problem, and the solution was to drop resolution in his probabilistic data model. I don't see how BCNF would have avoided this if the system were implemented in an RDBMS. The actual solution reminds me of the quantization step in JPEG processing: it lossily discards the high-entropy but imperceptible data.
To your question: cost accounting - technology neutral, and brutal in its power to decide.
Wow ... why would you reengineer your naming conventions without at least doing some back-of-a-napkin arithmetic beforehand and comparing the potential savings? That strikes me as ... insane.
But really, the answer to this is because it's easy to copy a database in Redis and then just do a mass RENAME. It took all of two minutes with a Ruby script and Redis' RENAME command. I could have done a theoretical back-of-a-napkin attempt, but I wanted some real numbers. And if it's easy to get them, why not?
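That kind of throwaway script is short in any language; here's a hedged redis-py equivalent (the parent's was Ruby), with hypothetical old and new prefixes:

```python
# Mass-rename keys to a shorter prefix. Prefixes are made up here --
# adjust to your own naming scheme.
import redis

r = redis.Redis()  # an idle copy of the database you can afford to scan

for key in r.keys(b'user_ratings:*'):  # KEYS is fine on a copy; prefer
    new_key = key.replace(b'user_ratings:', b'ur:')  # SCAN on a live box
    r.rename(key, new_key)
```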
When Fermi estimates fail, they fail in one of two ways. Sometimes they fail by a constant factor, as expected, and that's fine. But sometimes they fail because you forgot one or more important factors entirely, and in that case they can be arbitrarily wrong. (In fact, it seems pretty likely that the so-called Fermi paradox is an example of this.) It's not just 8MB vs. 15MB vs. 10MB... it's 10MB vs. some chance of being totally, utterly wrong, like maybe 1GB or something.
So when the test for real data is cheap (as the OP says it is in this case, not that I know squat about Redis), it's "probably good to get into the habit" of doing it right, instead of getting a perfectly good napkin dirty.
In this particular example the napkin analysis is so trivial and so accurate, and the 'real world' data is so expensive (a linear lookup of the entire key-space, followed by data destruction), that it immediately jumps out at me.
And you're measuring the cost of the real data in totally useless terms. Who gives a shit about linear scans and "data destruction", lolz. What matters is how long it took the OP to figure out what command to write, and write it, and write the cleanup commands. If he already knew how to do it, then it was basically free. If he didn't already know how to do it, then he learned something, which pays for most of the cost of doing it.
And the napkin analysis is so trivial, but you don't EVER know that it's accurate, because there's no error bound on "oh I forgot the important part".
I'm clearly not a fan of the direction the OP took to verify his assumption that his key sizes have an impact on his Redis DB size. I think there's a better, more accurate way of checking his hypothesis (napkin math). His approach wasn't great and doesn't work for anything other than toy or test deployments, and the results aren't as clear as you may think (see below). I stand by that.
>And the napkin analysis is so trivial, but you don't EVER know that it's accurate, because there's no error bound on "oh I forgot the important part".
We're still talking about the same situation, correct? I suppose if it's pedantry you want, pedantry I can give. Tell me, how does the OP know that his "real world analysis" is correct? Because at some point in time, for some sort of input, the system gave him one kind of result? Apparently a simple multiplication (# of keys x avg. key size) is fraught with errors, but issuing a command (RENAME) against a black-box datastore the OP probably doesn't fully understand provides a clear, unambiguous result? What if Redis cached (in memory or on disk) all the original values before renaming a key and then cleared them out over a period of time, or better yet, didn't clear them out until it needed the space? You'd run your script, check memory usage, and see no difference... and of course, trusting the "real-world result", you'd live happily ever after ... yes?
Obviously this is a contrived example and the OP most likely got the right result, but I think I've made my point. "Real world results" are full of gotchas and ambiguities (and sometimes require a great deal of background knowledge to properly interpret), and in such cases it would be nice to have a mental model of the expected result so that it can be either verified or proven wrong (and thereby provide a direction for further investigation).
How many keys do you have? Not something you should guess.
How many bytes per key are you saving? Not something you should guess.
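Both numbers are cheap to measure rather than guess. A redis-py sketch (the 1,000-key sample is an arbitrary choice, and it assumes the key names are where the savings would come from):

```python
import redis

r = redis.Redis()

n_keys = r.dbsize()  # exact key count -- no guessing needed
sample = [k for k in (r.randomkey() for _ in range(1000)) if k]
avg_name = sum(len(k) for k in sample) / len(sample)

print(f"{n_keys} keys, ~{avg_name:.0f} bytes per key name")
print(f"upper bound on rename savings: ~{n_keys * avg_name / 1e6:.1f} MB")
```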
The clearest takeaway is: if you want to reduce the disk or memory footprint of your DB, you need to figure out what tables/rows/columns are consuming the lion's share of the space (there's almost always one column that's way worse than the rest) and figure out how to change your app logic to store much less data in that column.
The OP's lessons learned apply no matter which DB you're using!
Ego (I know what the problem is!) and the drive to fix problems quickly always get in the way of finding the solution. Tales like this are a great reminder not to assume you know what the problem is.
If your sets and zsets are bounded within a certain limit, they will be stored compressed and take up quite a bit less space.
This compression may have inadvertently been activated by the author when he pruned his zsets, which is why the savings are so apparent.
Updating your configuration obviously requires a restart, so that will need to be kept in mind (i.e., if your data fragmentation is high, restarting will eliminate the fragmentation, so space savings will be apparent even though they're not actually indicative of any changes).
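If you want to verify whether a given zset actually got the compact encoding (the key name below is hypothetical), the server will tell you directly:

```python
# In Redis 2.x, OBJECT ENCODING reports 'ziplist' for compact zsets and
# 'skiplist' for the full structure.
import redis

r = redis.Redis()
print(r.config_get('zset-max-ziplist-*'))          # the relevant thresholds
print(r.object('encoding', 'ratings:some_user'))   # b'ziplist' or b'skiplist'
```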
1. You can learn a lot about your own code and your tools by digging in and debugging and experimenting.
2. As macspoofing argues, having a correct mental model of your application, and of your tools, is what separates professional programmers from amateurs and beginners. You don't need to do much math or understand Fermi estimates to figure out that removing 30 bytes of excess key name from, say, 100,000 rows is only going to get back a few megs (there's a worked version of this figure after the list). However, doing the experiment validated your hypothesis and taught you something about how Redis (probably) stores keys internally. For many real problems with real data, estimating is orders of magnitude faster and simpler (and probably safer) than experimenting, so it's a skill to cultivate.
3. This is exactly the kind of data management problem that relational databases are really good at. Relational theory is based on sets, and RDBMSs are good at working with sets. But you would need to normalize your schema and know how to write efficient SQL -- perhaps an exercise for another day. RDBMSs and SQL are not easy to learn and take real study and practice, but the rewards are significant. I agree with AznHisoka that dropping PostgreSQL (or MySQL) in favor of Redis was a premature optimization, but you would need to spend a lot of time mastering relational database design and SQL to get the benefits of an RDBMS, whereas Redis doesn't make the same demand. If you posted more details of your data and how you need to query it on Stack Overflow you'd get some free lessons in schema design and normalization.
4. A database of a few tens or even hundreds of thousands of rows and taking up only 50 megs of RAM is trivial for any RDBMS. Not trivial as in not important, but in the same sense that delivering a carton of milk with a semi truck is a trivial use of the truck. Your data set would not begin to run into any performance limits of a stock RDBMS. I'm not criticizing or making fun, just stating a fact.
5. Don't assume you know what the problem is when debugging or optimizing -- it's too easy to go off into the weeds and get frustrated. Over time you'll probably get better at narrowing the range of possibilities, but even the most experienced programmers are frequently wrong about the source of a bug or performance problem. Your methodology is correct; do the same thing next time.
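To put a number on the estimate in point 2, using that point's own example figures:

```python
rows = 100_000   # example row count from point 2
excess = 30      # excess bytes per key name
print(f"~{rows * excess / 1e6:.1f} MB")  # ~3.0 MB: "a few megs"
```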
You say that you are relatively new to programming and databases, and there's nothing wrong with that. We've all made the same kinds of decisions you have, and some of us are still making them without the kind of introspection you've documented.
If your code seems too complex (something you develop a feel for over time) you need to choose better data structures. With the right data structures the code writes itself. When you speculate about using an array column type in Postgres, I know with 99.99% certainty that you are not normalizing -- everything you've described can be accomplished with normalized relations. And you'd be able to do all of your queries with simple SQL statements.
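For illustration, a minimal sketch of what normalized ratings could look like; database, table, and column names are all hypothetical, and the SQL rides inside psycopg2 just to stay runnable:

```python
# One row per (user, beer) rating -- no arrays, no hstore.
import psycopg2

conn = psycopg2.connect("dbname=goodbrews")  # hypothetical DSN
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS ratings (
        user_id integer NOT NULL,
        beer_id integer NOT NULL,
        rating  real    NOT NULL,
        PRIMARY KEY (user_id, beer_id)
    )
""")
conn.commit()

# The ZSET-style query -- one user's beers, best-rated first -- becomes
# a plain ORDER BY over the normalized table:
cur.execute("""
    SELECT beer_id, rating
      FROM ratings
     WHERE user_id = %s
     ORDER BY rating DESC
""", (42,))
print(cur.fetchall())
```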
I recommend Chris Date's book "Database in Depth: Relational Theory for Practitioners," which is aimed at programmers rather than DBAs.
I've written more on this (and linked to some more good books) on my site at http://typicalprogrammer.com/?p=14