So it's no surprise that the Redis solution is faster. Besides parsing and indexing the data, PostgreSQL will also prioritize reliability and won't confirm your transaction until it's actually persisted to disk. Whereas Redis is prioritizing speed, and working straight to RAM--the disk serialization is an asynchronous process. If anything, I'm surprised that the Redis solution is _only_ 16-20x faster than PostgreSQL given what appears to be happening.
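For anyone who wants to check this on their own setup, here's a minimal sketch (assuming a local Postgres and Redis, plus the psycopg2 and redis-py packages) that surfaces the durability defaults in question:

    import psycopg2
    import redis

    pg = psycopg2.connect("dbname=test")  # hypothetical local database
    cur = pg.cursor()
    # Postgres default: COMMIT doesn't return until the WAL record is flushed to disk.
    cur.execute("SHOW synchronous_commit;")
    print(cur.fetchone())  # ('on',)

    r = redis.Redis()
    # Redis acknowledges writes from RAM; if the append-only file is enabled at all,
    # it's typically fsync'ed in the background about once per second.
    print(r.config_get("appendfsync"))  # e.g. {'appendfsync': 'everysec'}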
So, for this use case--storing a cache of Amazon API results for mostly-static data--the Redis solution is a no-brainer: durability of the data is not a serious concern; there's no real risk of data corruption because there's only one schema and no interrelations; you don't need the data store to know anything about the data; you don't need indexing beyond straight key-value. For other use cases, Redis might be entirely unsuitable because of durability or data integrity reasons, in which case the speed benefits wouldn't even matter.
Now, all that said, in the real world there are lots of other considerations: Are you using either of these databases already for some other purpose in the same app? Which data store is easier to set up? Which is easier to code against? Do you even care how fast it is if the slower speed is 11ms on a single-user system?
The title asks a reasonably specific question, but the article then goes on to talk about things in ill-defined ways. It mixes in other layers which aren't directly part of the titular question but nonetheless influence the answer. In an update, the author goes so far as to state that a new test done differently produced a different answer. As you point out... very little relevant detail is included for PostgreSQL (which I know well), and I'm sure insufficient detail for Redis (which I don't know well)... certainly not enough detail and well-reasoned testing to merit a public posting on the subject.
That's not to say that the gist of the article is incorrect or unexpected... I would expect a well-configured Redis to perform better than PostgreSQL for reading and writing JSON data. But there's nothing I can really learn here: the testing doesn't isolate clear differences from well-understood external influences.
The article is just anecdote, not really good testing.
The article should be titled something along the lines of, "How much faster is my way of using Redis at storing JSON than my way of using PostgreSQL?". Perhaps "Medium" would have been a more appropriate venue as well.... ;-)
Mayhaps. Or, just a snapshot from the trenches of real application development.
We all have to make choices, each with different pros and cons. Before jumping in, perhaps run some tests with real data and change direction based on that.
In application development it usually doesn't matter what the core difference is between two databases, because there are drivers and mappers that you more or less have to use. E.g. the Django ORM makes PostgreSQL convenient.
I run everything in Hyper-V Docker, which should actually get pretty close to bare-metal performance (as Hyper-V sort of runs an instance alongside Windows, rather than virtualized on top of it).
However, due to some quirkiness, maybe something with Docker networking, I rarely get very good performance, and I assume in this case they are either using ported versions of the software or also running virtualized.
So I'm not sure these kinds of benchmarks add data points to anything but "performance on a dev machine".
In the spirit of constructive criticism, I second the "not really good testing" feedback, mostly due to the lack of tuning effort and the omission of performance-affecting details like configuration, the wire protocols used, how data is being serialized/deserialized, etc. (though I am no expert with these technologies, so some of this may have been implied). All of that may have been overkill for what you were trying to do, but it's what people look for in benchmarking / "good testing". I hope this helps.
Point being they serve very different purposes, it's not about benchmarks.
People generally reach for Redis too quickly, IMO. Postgres is about as fast, has better tooling, and will slow down gracefully as the dataset increases, rather than evicting records as the limit is hit.
That said, I guess if you're relatively junior, it's hard to grossly misconfigure Redis, so that's a point in its favor. Somewhat like Java vs Node.js--you can write higher performance network services in Java, but it's a lot easier to make mistakes.
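To make the eviction point above concrete, a hedged sketch (redis-py, default settings assumed) of the knobs that decide what Redis does when it hits its memory limit:

    import redis

    r = redis.Redis()
    # With the default maxmemory-policy of 'noeviction', writes fail once maxmemory
    # is reached; policies like 'allkeys-lru' silently evict records instead.
    print(r.config_get("maxmemory"))
    print(r.config_get("maxmemory-policy"))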
Redis persistence (or the possible lack thereof) is probably quite complicated for a junior developer to understand. I'd actually expect a better outcome with PSQL.
Redis is one of the simplest data stores around to use and its persistence is solid.
I've lost data countless times in production systems running Redis. I haven't ever lost anything in Postgres. (I still run both.)
class AmazonAffiliateLookup(models.Model, TotalCountMixin):
There also is:
song = models.ForeignKey(Song, on_delete=models.CASCADE)
Then, we have:
created = models.DateTimeField(auto_now_add=True, db_index=True)
Again, that may be suboptimal if there are few records in the database. DateTimeField also may be ‘too structured’ for the task at hand (it might needlessly parse year/month/day out of a nanosecond timestamp, for example, or even do time zone calculations to convert a local time to UTC), but I don’t know. It would be helpful if the text showed the actual SQL DDL.
In any case, Redis is probably a more natural and better fit (remote API result caching). That said, the comparison is awkward at best. If it were to cache localized results, where the dbms (PostgreSQL) is the source of truth, then adding a caching layer above the db may or may not be worth it.
I think the performance comparisons come down to Granny Smith Apples vs Rome Apples... you can bake a pie with either, but they don't taste the same.
You can go for more complex caches such as Hazelcast which offers hot restarts by persisting data to disk.
a) PostgreSQL has sub-millisecond reads/writes for storing JSON blobs, no slower than Redis.
If you're going to make bold claims the onus is on you to provide evidence to back it up.
I too would expect the performance difference to be significantly larger. This guy probably doesn't know how to benchmark his code, either.
I tend to track versioned changes in scripts, with a config table holding a lock and a record of the current version, locking and upgrading the db on API service startup. It tends to be very solid and doesn't rely on potentially unexpected behaviors. At work, a project I'm on uses SQL Server projects, which have had a great many issues tracking versions. I've also used several ORM tools across differing platforms and they all have issues. More often than not, I'd rather do things the "hard" way. It's a little more initial effort, but it saves in the long run.
Beyond this, without understanding how the data is persisted, it's hard to compare equally. If you used pgsql the way you used Redis, I'd be surprised if it was more than 5x as slow. There's also less risk of data loss with PostgreSQL. In this case, caching foreign API results, Redis is probably the more appropriate tool anyway, but the comparisons in TFA are muddy at best.
It's like saying that memcache is faster than postgres.
They do different things for different purposes.
I still wouldn't trust redis for anything other than ephemeral storage. Most of the use cases where we use redis assume that the data will at some point go away. The places where we use postgres assume that data is permanent.
In fact, we use both together, and it works really well: the best of both worlds.
The Redis AOF (commands flush to disk) and replication story is rock solid.
You can replicate writes to an offline secondary that even has scheduled RDB memory dumps or scheduled AOF rewrites.
We've never encountered data loss issues with our read and write heavy Redis services.
Curious... how does one confirm this? A lost acknowledged write is a very obscure thing and I don't think I would know if we lost writes.
Operationally, we sequence Redis downtime events. They're very rare. This is when most would be concerned about losing data.
We shift traffic to be hitless. Our model is eventually consistent across multiple regions. We won't accept writes in a cluster that is going down.
The Redis replication chain was chosen so that the primary Redis doesn't have to fork (aofrewrite, bgsave) or even do disk IO (AOF ops in general). We let our offline replicas do that work. The primary Redis can focus on serving reads and writes.
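As a hedged sketch of that division of labor (redis-py; the host names are made up), the primary skips persistence entirely while a replica carries it:

    import redis

    primary = redis.Redis(host="redis-primary")  # hypothetical hosts
    replica = redis.Redis(host="redis-replica")

    # Primary: no RDB snapshots, no AOF -- it never forks or touches disk.
    primary.config_set("save", "")
    primary.config_set("appendonly", "no")

    # Offline replica: it alone pays for AOF writes and background rewrites.
    replica.config_set("appendonly", "yes")
    replica.config_set("appendfsync", "everysec")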
We alert on replication chain breakdown. We sequence traffic draining and primary / secondary swap operations for things like OS or Redis upgrades.
It's pretty sophisticated and largely automated.
This model tolerates losses of Redis instances. Only writes to the primary that occur in the short window before replication, while the instance hasn't yet been failed out, might be lost. But that number would be incredibly, incredibly small. We would tolerate such losses.
We've got a lot of nines of reliability, and this pattern has scaled and served us well.
> We would tolerate such losses
In the context of the original concern, these are conflicting statements.
The underlying argument here is that you shouldn't use redis for anything you can't tolerate losing. Your use case and architecture is great and I'm glad it works well for you. But at the end of the day, there are workloads that can't tolerate such losses. And for those cases, redis is not a good fit.
If the design condition allows asynchronous operation and I don't need to wait for background tasks to finish before I reply, it's not faster unless I somehow care more about background CPU time per request than I do about things like storage guarantees.
If the design condition requires durable storage to have happened before I reply, putting it in Redis doesn't even count as getting started.
For a small project you might not want the added complexity of another database and decide to store JSON blobs in PostgreSQL. In this case it's important to understand the performance trade-off you're making.
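If you do go that route, a minimal sketch (psycopg2; table and key names are made up) of treating a Postgres JSONB table as a key-value cache:

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=myapp")  # hypothetical database
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS api_cache (
            key   text PRIMARY KEY,
            value jsonb NOT NULL
        )
    """)
    # Upsert a cached API result under a key, much like SET in Redis.
    cur.execute(
        "INSERT INTO api_cache (key, value) VALUES (%s, %s) "
        "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
        ("asin:B000123", json.dumps({"title": "example", "price": 9.99})),
    )
    conn.commit()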
A good description of redis is a "shared heap". HashSets, HashMaps, Heaps, Priority Queues, etc. are all great and fast in an application, but once you need to start sharing that data things get complicated quickly. So you designate a single server to implement those data structures and expose them to your application. And what you end up with is basically redis.
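A rough illustration of that framing (redis-py; the key names are made up): each in-process data structure has a server-side counterpart.

    import redis

    r = redis.Redis()

    # dict -> Redis hash
    r.hset("shared_map", "a", 1)

    # heap / priority queue -> Redis sorted set
    r.zadd("shared_pq", {"task-1": 5, "task-2": 1})
    print(r.zrange("shared_pq", 0, 0))  # lowest score first, like heap[0]

    # set -> Redis set
    r.sadd("shared_set", "member")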
That's the difference between thinking about it as a solution to a problem and thinking about it as part of your infrastructure.
These days most databases are available as a managed service and Redis in particular is a standard feature on AWS, Azure and Google Cloud.
What about "redis-cli flushall"? (Don't type this on your production cluster).
ACLs in postgres are powerful and valuable and something that redis lacks.
I've certainly heard about MongoDB many, many times. I've heard of Redis some. I've heard of CouchDB. But this is the very first time I've ever heard of "Couchbase".
Compare a closer example like RethinkDB, which was really cool but had nowhere near the money and marketing, and the team ultimately failed. Rethink started with stable data first and performance tuning from there... mongo took the opposite approach and caught up on stability later. Of course, scaling and administering mongo isn't as easy as a single node or multiple read replicas.
Because MongoDB is a single binary to download and install and has one of the simplest config files around.
It's arguably the easiest database to use and operate.
If you're looking for a super low-latency (everything sub-millisecond) document/json store with strong resiliency options, cross-data centre replication, options for optimistic/pessimistic locking, json sub document access and modification, etc etc, I don't know why you'd use anything else.
It really depends on your specific environment, in terms of how much information you are or are not willing to lose, and it's all relative. I'm a fan of high-performing databases (Couchbase), or those that are more tunable (ScyllaDB/Cassandra), or built for indexing (Elasticsearch), or built for consistency (SQL).
Each has its use cases, and there's definitely overlap. Understanding the difference and using the right tool for the job is important.
You can configure postgres to behave more like Redis, and it will be faster if you do (at the cost of transactional guarantees).
A more interesting test might have been Postgres configured-like-redis vs Redis.
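As a hedged sketch of what "configured like Redis" might mean in practice (session-level, via psycopg2): relax the commit guarantee so Postgres, like Redis, acknowledges writes before they reach disk.

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical database
    cur = conn.cursor()
    # COMMIT now returns without waiting for the WAL flush; a crash can lose the
    # last few hundred milliseconds of acknowledged transactions -- roughly the
    # trade Redis makes by default.
    cur.execute("SET synchronous_commit TO off;")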
Aside from lack of ACID you better hope your dataset never gets too large to fit in memory, tho.
1. If you are claiming to compare PostgreSQL to Redis, do NOT test with Django or SQLAlchemy or whatever; use the raw DBAPI driver, most likely psycopg2, though you might get much faster results with a more optimized driver like asyncpg (see the sketch after this list).
2. Use the Python profiler to generate profiles and actually see how much time is spent waiting on the network to get the value.
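A minimal sketch along those lines (raw psycopg2 plus cProfile; the table and query are made up):

    import cProfile
    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical database
    cur = conn.cursor()

    def fetch_blob():
        # Hit the driver directly -- no ORM layers in the measurement.
        cur.execute("SELECT value FROM api_cache WHERE key = %s", ("asin:B000123",))
        return cur.fetchone()

    # The profile shows where time actually goes (network wait vs. Python work).
    cProfile.run("for _ in range(1000): fetch_blob()")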
The article doesn’t give enough details to really understand what is happening here, but I would recommend everyone take this 14/16x report and not expect it to be the general case.
To be fair, RAM is cheap these days, and most real world projects fit into RAM without a crazy budget.
Very few real world projects actually do need the cloud features. It's nice to have, sure, but you're paying a lot for it.
Disk is for writes.
In Postgres this is called the “write-ahead log” or WAL. On MySQL/MariaDB it is called the binlog.
For a long time the only persistence Redis even had was point-in-time binary snapshots (RDB); the append-only file came later.
Additionally, you could tune Redis to a maximum persistence threshold (fsync on every write), which emulated the absolute minimum persistence threshold of a system like Postgres or MySQL.
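That tuning, as a hedged sketch (redis-py): appendfsync always aims to fsync each write before the client gets its reply, which is the durability floor Postgres operates at by default.

    import redis

    r = redis.Redis()
    # Maximum durability: every write is appended to the AOF and fsync'ed before
    # the reply -- and throughput drops accordingly.
    r.config_set("appendonly", "yes")
    r.config_set("appendfsync", "always")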
There is no comparison between these systems. One is designed to treat your data very safely by default and the other is designed for speed.
They both achieve those goals.
I don’t even want to go into the fact that VFS access is still slower than direct memory access (for, I hope, many obvious reasons).
The topic of this thread is the equivalent of judging a fish by its ability to climb a tree.
- A comparison of Redis and Postgres for in-memory storage without use of TEMP or UNLOGGED tables?
- SQL for table creation and queries not shown
- Using an ORM on one side and a more direct client on the other
While some data is better than none, the quality of this analysis is in question.
I think the comparison, in the context of a regular Django developer, is apt.
If you're implying that the redis cache client in Django is comparably stunted, I don't agree. Postgres is a lot more tunable than Redis, and the more complex the underlying machinery the leakier/less powerful the abstraction ORMs can provide without taking on the same complexity. An abstraction layer on top of Redis != an abstraction layer on top of Postgres.
Let me put this in context: you could literally write a custom Postgres access method (there's a talk on it) which did nothing but take data and write it to /dev/null, or put it in a bounded space in memory. Of course, you wouldn't want to do that, but there's no way Redis could compete with that speed, and from there you can imagine implementing a subset of Redis that is faster than Redis can be (while throwing away a lot, of course). This is the kind of complexity an ORM has no business replicating most of the time (I don't know which ORMs support USING in CREATE TABLE yet), but it's part of what could be out there.
> The cache client is also doing compression and uses msgpack for serialization.
Great for performance but bad for testing -- this skews the results of the analysis, they're not doing the same work.
> I think the comparison, in the context of a regular Django developer, is apt.
I don't disagree, because this is what I'd expect a regular Django/Rails developer to do and call it a day -- but for those who want more serious analysis, this isn't it. It's a great start (good analysis is hard), but people shouldn't be thinking this is a good analysis with reasonable options explored -- UNLOGGED & TEMP tables are pretty low hanging fruit that isn't even mentioned. Initial settings of the Postgres cluster aren't mentioned/tweaked.
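For reference, a minimal sketch of that low-hanging fruit (psycopg2; the table name is made up): an UNLOGGED table skips WAL entirely, which is much closer to what Redis is doing.

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical database
    cur = conn.cursor()
    # No WAL writes for this table: faster, but its contents are truncated after
    # a crash -- an acceptable trade for a cache.
    cur.execute("""
        CREATE UNLOGGED TABLE IF NOT EXISTS api_cache_unlogged (
            key   text PRIMARY KEY,
            value jsonb NOT NULL
        )
    """)
    conn.commit()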
This is a pretty naive case and Postgres is within an order of magnitude, while being capable of doing a lot of other things. That's pretty amazing.
I'm pretty sure the gap can be brought down to at most 3x for the db.
I don't really know what the hell is going on with this, but something isn't correct.
If the indexes used are sensitive to low key diversity, fix that, and maybe shard.
You might prefer a VM type with NVMe.
If you can divide the data into smaller pieces, more of the indexes and cache fit in RAM.
Also, cluster the data by whatever works: time, region, user class, etc.
Only decompress the blobs at point of use.
Design your app so that the only data it needs which can't be cached is small and fast.
My bet: connection pooling is not enabled for PostgreSQL (the Django default for CONN_MAX_AGE is 0). A new connection is established for each request, which probably explains most of the overhead. Moreover, the connections are probably TLS-secured by default, which increases the handshake time a lot.
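If that's the culprit, the fix is a one-liner in settings.py (values here are illustrative):

    # settings.py
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "myapp",     # hypothetical database name
            "CONN_MAX_AGE": 60,  # reuse connections for 60s instead of opening
                                 # a fresh one per request (the default, 0)
        }
    }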
We know for a fact that redis stores shit fast. And a blob of json is perfect.
The problem is multi-fold. Now you have schema-on-read: you may spend a lot more processing time querying and massaging the data rather than getting exactly what you want from a SQL query. You also give up the concurrency, locking, and constraints databases offer.
So it is important to evaluate on more than JUST read/write speed.
I worked on a team that used NoSQL for their db needs. Performance was terribly slow, not because of read/write speeds, but because of everything I listed above. They basically had to hand-write a query optimizer to do what SQL gives you. They quickly swapped to Postgres and saw somewhere along the lines of 10-100x performance improvements (depending on the part) with only a week of work.
That said, Redis is hugely expensive for any meaningfully sized data set. Thankfully our use case allowed us to keep our data size bounded over time, or else I would’ve traded latency for drastically reduced cost (eg use Postgres).
What a lot of people don't understand is that you want both: the resilience of Postgres and the speed of fetching from Redis.
You have data stored. You want to retrieve it as painlessly (shorter time, less CPU usage, less complexity, etc.) as possible. What are the options? How much faster is Redis? How much RAM is that going to eat up? Different tools have different properties. Run some tests to answer some of those questions. Don't just guess. Sure Redis is going to save some time, but how much? Do you know by heart?
Benchmarking is surprisingly hard. After some digging, you might discover that you're actually measuring how long it takes to bring up a TCP connection, or how long it takes PostgreSQL to serialise the JSON from its stored form. It could be differences in the driver layer Django is using. Coming up with the correct experiment can be the toughest part of doing the science.
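One hedged way to untangle that (stdlib timing; connection parameters are made up): time the connection setup separately from the query itself.

    import time
    import psycopg2

    # Cost of establishing a fresh connection (TCP + auth + TLS, if enabled)
    t0 = time.perf_counter()
    conn = psycopg2.connect("dbname=test")  # hypothetical database
    print("connect:", time.perf_counter() - t0)

    # Cost of the query alone, over the already-open connection
    cur = conn.cursor()
    t0 = time.perf_counter()
    cur.execute("SELECT 1")
    cur.fetchone()
    print("query:", time.perf_counter() - t0)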
Of course I want both. I also want a car that requires zero fuel, can go super fast, has great safety features, and is less than $100. Life is about trade offs and you have to make them for specific reasons.
People trade off using Redis vs Postgres for different workloads intentionally. They aren't meant to be used for the same workloads and are _not_ interchangeable.
Your comparison (and article) is worthless and, in fact, hurts more than it helps.
This article is obvious in its conclusion, inaccurate in measurement, and useless for any discussion.
I don't know if Django is doing the same, but using an ORM might just do that to you.
Yeah they're laughably naïve.
Access patterns in Redis are very simple, and it is a well-optimized single-threaded event loop. It is very fast but cannot scale up across multiple cores, etc. So in this case, yes, storing and retrieving data will be fast, but the uses will be narrower, and if you need a high degree of concurrency, Redis will pose some issues for you.
At least in my experience, Redis's key problem is that while it is well optimized it is still a single threaded event loop and therefore with complex use comes a lot of headache. For simple things however, it is really hard to beat.
Bruh, it does! That's what the buffer pool is all about. Not to mention the query cache...
TL;DR: this update is not about Redis vs PostgreSQL but about caching!
Postgres is an RDBMS, with transactional support and everything. While Redis is powerful, it is much simpler and more focused, not nearly as comprehensive as Postgres.
TL;DR Postgres is doing a lot more heavy-lifting than Redis, it feels slow because that much of work is not necessary for simple kv lookup
I don't have any direct link to the docs, but here's a bug report from when such behavior was NOT happening.
If possible, this benchmark should be run again with a largish TOAST_TUPLE_TARGET.
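For the curious, a hedged sketch of what that could look like (psycopg2; the table is made up). toast_tuple_target is a per-table storage parameter in Postgres 11+:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # hypothetical database
    cur = conn.cursor()
    # Raise the threshold at which row data gets pushed out to TOAST storage,
    # so mid-sized JSON blobs stay inline in the main table.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS api_cache_toast (
            key   text PRIMARY KEY,
            value jsonb NOT NULL
        ) WITH (toast_tuple_target = 8160)
    """)
    conn.commit()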
The author lost all credibility with me at "ORM". I mean, you went from cloud to local cache as a performance improvement, and then you chose to do it via an ORM? Why not make your own custom solution if you went down this road anyway? It's like replacing the alarms in your house from a security company with ones you build yourself because you're an electronics engineer, then building them from off-the-shelf electronics kits and complaining that one kit is better than another. It's your house, dude: make them all custom for maximum efficiency if you're going this way.