How much faster is Redis at storing a blob of JSON compared to Postgres? (peterbe.com)
224 points by Manozco 13 days ago | 114 comments





Alas, there's not enough context or code here to even really know what's going on (or maybe I just don't know Django well enough to guess at what isn't said). It sounds like the author is comparing storing and retrieving a record with a JSON field in a PostgreSQL table versus storing and retrieving a raw string (the raw JSON) in Redis?

So it's no surprise that the Redis solution is faster. Besides parsing and indexing the data, PostgreSQL will also prioritize reliability and won't confirm your transaction until it's actually persisted to disk. Whereas Redis is prioritizing speed, and working straight to RAM--the disk serialization is an asynchronous process. If anything, I'm surprised that the Redis solution is _only_ 16-20x faster than PostgreSQL given what appears to be happening.

So, for this use case--storing a cache of Amazon API results for mostly-static data--the Redis solution is a no-brainer: durability of the data is not a serious concern; there's no real risk of data corruption because there's only one schema and no interrelations; you don't need the data store to know anything about the data; you don't need indexing beyond straight key-value. For other use cases, Redis might be entirely unsuitable because of durability or data integrity reasons, in which case the speed benefits wouldn't even matter.

Now, all that said, in the real world there are lots of other considerations: Are you using either of these databases already for some other purpose in the same app? Which data store is easier to set up? Which is easier to code against? Do you even care how fast it is if the slower speed is 11ms on a single-user system?


I found the thinking in the article to be mushy.

The title asks a reasonably specific question, but the article then goes on to talk about things in ill-defined ways. It mixes in other layers which aren't directly part of the titular question, but nonetheless influence the answer. In an update, the author goes so far as to state that a new test done differently resulted in a different answer. As you point out... very little relevant detail is included for PostgreSQL (which I know well) and I'm sure insufficient detail for Redis (which I don't know well)... certainly not enough detail and well reasoned testing to merit a public posting on the subject.

That's not to say that the gist of the article is incorrect or unexpected... I would expect a well configured Redis to perform better than PostgreSQL for reading and writing JSON data. But there's nothing I can really learn here that makes the differences clear or controls for well-understood external influences.

The article is just anecdote, not really good testing.

The article should be titled something along the lines of, "How much faster is my way of using Redis at storing JSON than my way of using PostgreSQL?". Perhaps "Medium" would have been a more appropriate venue as well.... ;-)


> The article is just anecdote, not really good testing.

Mayhaps. Or, just a snapshot from the trenches of real application development.

We all have to make choices. Different pros and cons. Before jumping in, perhaps run some tests with real stuff and change direction based on that.

In application development it usually doesn't matter what the core difference is between two databases, because there are drivers and mappers that you more or less have to use. E.g. the Django ORM makes PostgreSQL convenient.


Normally, I would agree that it's a snapshot or data point. But one of the things that I don't see mentioned anywhere else is that this is running on a MacBook. I don't know how it works with his setup, but I get significantly different performance results on my local dev system (Windows) than on a Linux server.

I run everything in Hyper-V Docker, which should actually get pretty close to bare-metal performance (as Hyper-V sort of runs an instance alongside Windows, rather than virtualized on top).

However, due to some quirkiness, maybe something with Docker networking, I rarely get very good performance, and I assume in this case they are either using ported versions of the software or also running virtualized.

So I'm not sure if these kinds of benchmarks add data points to anything but "performance on a dev machine".


I'm glad you wrote this article because I think people should put themselves out there more (myself included). By doing so, you open yourself up to feedback, both positive and negative, that can help you improve. You have a good example of this with your 9/29 update.

In the spirit of constructive criticism, I second the "not really good testing" feedback, mostly due to the lack of tuning effort and the omission of performance-affecting details like configuration, the wire protocols used, how data is being serialized/deserialized, etc. (though I am no expert with these technologies so some of this may have been implied). All of that may have been overkill for what you were trying to do, but it's what people look for in benchmarking / "good testing". I hope this helps.


Then cat your data to /dev/null because it will blow both Redis and PostgreSQL out of the water when it comes to benchmarks.

Point being they serve very different purposes, it's not about benchmarks.


A reasonably configured Postgres is easily sub-millisecond for reads and writes on a small dataset like this. I suspect that the author is doing something quite wrong.

People generally reach for Redis too quickly, IMO. Postgres is about as fast, has better tooling, and will slow down gracefully as the dataset increases, rather than evicting records as the limit is hit.

That said, I guess if you're relatively junior, it's hard to grossly misconfigure Redis, so that's a point in its favor. Somewhat like Java vs Node.js--you can write higher performance network services in Java, but it's a lot easier to make mistakes.


> That said, I guess if you're relatively junior, it's hard to grossly misconfigure Redis, so that's a point in its favor.

Redis persistence (or possibly the lack thereof) is probably quite complicated for a junior developer to understand. I'd actually expect a better outcome with PSQL.


I disagree. But anytime I've been exposed to redis, it's always been as an ephemeral data store.

Not sure what you are getting at here.

Redis is one of the simplest data stores around to use and its persistence is solid.


> Redis is one of the simplest data stores around to use and its persistence is solid.

I've lost countless amounts of data in production systems running Redis. Haven't ever lost anything in Postgres. (I still run both.)


There’s

  class AmazonAffiliateLookup(models.Model, TotalCountMixin)
I guess that TotalCountMixin adds some triggers that maintain a count of the records in the table in a separate table (which may or may not be a good idea, depending on the table size, the frequency of 'give me the number of items' calls relative to inserts and deletes, etc., but it might introduce a bottleneck on updating that count)

There also is:

  song = models.ForeignKey(Song, on_delete=models.CASCADE)
I don’t understand why that foreign key is there, but it can’t speed up inserts and deletes.

Then, we have:

  created = models.DateTimeField(auto_now_add=True, db_index=True)
I think creating an index is done to implement the equivalent of Redis’ auto-expiry.

Again, that may be suboptimal if there are few records in the database. DateTimeField also may be 'too structured' for the task at hand (it might needlessly parse year/month/day out of a nanosecond timestamp, for example, or even do time zone calculations to convert a local time to UTC), but I don't know. It would be helpful if the text showed the actual SQL DDL.
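For comparison, a minimal sketch of a stripped-down cache model (hypothetical names, not from the article) that drops the FK and the timestamp index, which is roughly what the points above are getting at:

  from django.db import models

  class AmazonLookupCache(models.Model):
      # plain text key instead of a ForeignKey to Song, so inserts and
      # deletes skip the referential-integrity check
      key = models.CharField(max_length=255, primary_key=True)
      # store the blob as-is; no index on a timestamp column, so expiry
      # would have to be handled some other way
      payload = models.TextField()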


Probably (hard to tell) the biggest difference is that Redis is likely storing the JSON as a string, and Postgres is storing it as binary JSON (separated out into distinct values/collections). I'd guess that if Postgres was using UTF-8 text for the JSON it would be closer. Also, Postgres is probably a bit less efficient on persistence requirements in favor of higher stability and preventing data loss.

In any case, Redis is probably a more natural and better fit (remote API result caching). That said, the comparison is awkward at best. If it were to cache localized results, where the dbms (PostgreSQL) is the source of truth, then adding a caching layer above the db may or may not be worth it.

I think the performance comparisons come down to Granny Smith Apples vs Rome Apples... you can bake a pie with either, but they don't taste the same.


Isn’t the fundamental difference also use case? You would have to use Postgres if you need actual durability. AFAIK, one outage and your redis cache is gone... unless you’re running an HA cluster (which I dunno if that’s offered by redis). But even that doesn’t mean you have durability.

Redis can be configured as `appendonly yes; appendfsync always`, which the Redis docs say is “very very slow”, but really would just make Redis work the way Postgres does with its WAL, with probably equivalent performance (or still slightly better, since there’s no checkpointing going on.)
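For reference, a rough sketch of flipping those settings at runtime with redis-py (connection details are assumptions):

  import redis

  r = redis.Redis(host="localhost", port=6379)
  # fsync the append-only file on every write: maximum durability,
  # at the cost of the throughput Redis is usually chosen for
  r.config_set("appendonly", "yes")
  r.config_set("appendfsync", "always")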

In general, caching systems are volatile and not meant to store data that you cannot afford to lose. Of course that doesn't stop clients from storing critical data on there in return for the simple GET/PUT APIs and low latencies.

You can go for more complex caches such as Hazelcast which offers hot restarts by persisting data to disk.


You have no evidence that:

a) PostgreSQL has sub-millisecond reads/writes for storing JSON blobs, or is about as fast as Redis.

b) Java results in buggier code than Javascript.

If you're going to make bold claims the onus is on you to provide evidence to back it up.


I guess the answer to "does an apple smell more like an apple than an orange does?" is yes

As he chose to show us Django ORM code, instead of the postgres table definition, I'm guessing he's not exactly a postgres wizard. It reads to me like he's writing to a normalized table (complete with foreign key constraints and likely indexes) and comparing against writing a single serialized blob to a Redis key.

I too would expect Redis to be significantly faster here. This guy probably doesn't know how to benchmark his code, either.


Another way to look at this is that he's using Postgres the way most people use Postgres + Django.

I likewise suspect the FK and the index on the created timestamp account for most of the difference.

I've used Postgres in production since 2002 and have spoken at PgOpen. But I'm not an expert. I never will be, because I make things with tools. What do you do?

Try not to take it personally. With relevant details not made clear, people have to try to fill in the blanks. Wondering about the table definition is definitely understandable.

Personally, I try to understand at least conceptually how the underlying data store connects and serializes the data in question. I'm not a fan of ORM tooling, quite frankly, and find it often gets in the way more than it helps.

I tend to track versioned changes in scripts, with a config table with a lock and a record of what the current version is, locking and upgrading the db on API service startup. It tends to be very solid and doesn't rely on potentially unexpected behaviors. At work, a project I'm on uses SQL Server projects, which have had a great many issues tracking versions. I've also used several ORM tools across differing platforms and they all have issues. More often than not, I'd rather do things the "hard" way. It's a little more initial effort, but it saves in the long run.

Beyond this, without understanding how the data is persisted, it's hard to compare equally. If you used PostgreSQL the way you used Redis, I'd be surprised if it was more than 5x as slow. There's also less risk of data loss with PostgreSQL. In this case, caching foreign API results, Redis is probably the more appropriate tool anyway, but the comparisons in TFA are muddy at best.


"As he chose to show us C++ code, instead of the assembly code, I'm guess he's not exactly a compiler expert."

Its faster, but then, thats the point of it.

Its like saying that memcache is faster than postgres.

they do different things for different purposes.

I still wouldn't trust redis for anything other than ephemeral storage. Most of the use cases where we use redis assume that the data will at some point go away. The places where we use postgres assume that data is permanent.

Infact, we use both together, and it works really well, best of both worlds.


> I still wouldn't trust redis for anything other than ephemeral storage

The Redis AOF (commands flush to disk) and replication story is rock solid.

You can replicate writes to an offline secondary that even has scheduled RDB memory dumps or scheduled AOF rewrites.

We've never encountered data loss issues with our read and write heavy Redis services.


> We've never encountered data loss issues with our read and write heavy Redis services

Curious... how does one confirm this? A lost acknowledged write is a very obscure thing and I don't think I would know if we lost writes.


Under normal circumstances Redis doesn't lose writes. If it did, we'd be able to detect it in metrics for cache misses or stale data. Our vector clock state machines wouldn't work. We have very fine monitoring for all of these classes of failure.

Operationally, we sequence Redis downtime events. They're very rare. This is when most would be concerned about losing data.

We shift traffic to be hitless. Our model is eventually consistent across multiple regions. We won't accept writes in a cluster that is going down.

The Redis replication chain was chosen so that the primary Redis doesn't have to fork (aofrewrite, bgsave) or even do disk IO (AOF ops in general). We let our offline replicas do that work. The primary Redis can focus on serving reads and writes.

We alert on replication chain breakdown. We sequence traffic draining and primary / secondary swap operations for things like OS or Redis upgrades.

It's pretty sophisticated and largely automated.

This model tolerates losses of Redis instances. Only writes to the primary that occur in the short window before replication, while it has not yet been failed out, might be lost. But that number would be incredibly, incredibly small. We would tolerate such losses.

We've got a lot of nines of reliability, and this pattern has scaled and served us well.


> We've never encountered data loss issues with our read and write heavy Redis services

> We would tolerate such losses

In the context of the original concern, these are conflicting statements.

The underlying argument here is that you shouldn't use redis for anything you can't tolerate losing. Your use case and architecture is great and I'm glad it works well for you. But at the end of the day, there are workloads that can't tolerate such losses. And for those cases, redis is not a good fit.



It's faster, but only under the specific design condition of both having to wait for storage to finish and not caring about differences between the presented storage options.

If the design condition allows asynchronous operation and I don't need to wait for background tasks to finish before I reply, it's not faster unless I somehow care more about background CPU time per request than I do about things like storage guarantees.

If the design condition requires durable storage to have happened before I reply, putting it in Redis doesn't even count as getting started.


The article talks about _how much_ it is faster which is the interesting question.

For a small project you might not want the added complexity of another database and decide to store JSON blobs in PostgreSQL. In this case it's important to understand the performance trade-off you're making.


Just use a hashmap. Hashmaps in some cases can be like 8 orders of magnitude faster than Redis.

Hashmaps are faster, except when you need the data to be available to multiple machines.

A good description of redis is a "shared heap". HashSets, HashMaps, Heaps, Priority Queues, etc. are all great and fast in an application, but once you need to start sharing that data things get complicated quickly. So you designate a single server to implement those data structures and expose them to your application. And what you end up with is basically redis.


In Django it's literally a matter of adding 5 lines of code to enable the redis cache. The complexity is in actually setting up the Redis server, but this can also be trivial in many cases.
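Roughly those five lines, for reference; a sketch assuming the django-redis package (backend path and URL are assumptions):

  # settings.py
  CACHES = {
      "default": {
          "BACKEND": "django_redis.cache.RedisCache",
          "LOCATION": "redis://127.0.0.1:6379/1",
      }
  }

After that, cache.get()/cache.set() from django.core.cache go to Redis.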

Sorry, no. The complexity is that you are now relying on another server which can break in new ways, requires startup and shutdown and configuration and possibly load-balancing; management of all the above... updates, security reviews, and documentation. Will you need new server types with an emphasis on RAM? How many and when? How are you monitoring it for errors and performance?

That's the difference between thinking about it as a solution to a problem and thinking about it as part of your infrastructure.


True, but a lot of that can be solved with PaaS/DBaaS products like compose or something

Shame you got downvoted because it's a valid point.

These days most databases are available as a managed service and Redis in particular is a standard feature on AWS, Azure and Google Cloud.


One of the shops I worked in used Redis as the primary datastore. While there was a lot of extra complexity around relational data (especially referential integrity), the snapshotting systems worked very well and we didn't have to worry about data loss. I don't necessarily think this was an ideal setup (I can't tell you how many headaches Postgres would have solved for us), but I think there are plenty of cases where you could use it without fear that your data will evaporate.

> but I think there are plenty of cases where you could use it without fear that your data will evaporate.

What about "redis-cli flushall"? (Don't type this on your production cluster).


What about 'drop database' for Postgres?

> access denied

ACLs in postgres are powerful and valuable and something that redis lacks.


what about rm -rf?

Protip: don't use flushall in production, use flushall async or you will have a bad time

Couchbase really is the best of both worlds, in one product. Store in memory first (memcached) and disk later, on 0 or more cluster nodes. From the calling API, you can choose whether to block or not for the disk commit.

Edit: typos


This is what I found in my experience as well. I founded a startup in 2013 (and ran it for 3 years) and used Couchbase as our primary data store (a SaaS that integrates with customers' web sites and received tens of millions of DAU). With no dev ops person, running it was easy, and we never had performance or outage issues. I know you can't compare a document store/KV store with a relational DB like PG or MySQL, but I have wondered why Couchbase is not as popular as, say, Mongo.

I can give you one hint. I don't really work with databases, so I'm not well-versed on them, but I can easily name all the major RDBMSes (PG, MySQL, Oracle, etc.), and some of the NoSQL ones. I've done a little work at home playing with Postgres, but that's about it.

I've certainly heard about MongoDB many, many times. I've heard of Redis some. I've heard of CouchDB. But this is the very first time I've ever heard of "Couchbase".


It was popular briefly on HN back in 2015, so it had a late start compared to Mongo which was maybe 2012? And the commercialization of memcached was perhaps noticed most by PHP devs, as memcached is common enough with Wordpress and Drupal, though Redis might have overtaken it some. I always thought couchbase was a very interesting idea but never had the chance to seriously suggest it at work. These endorsements actually help me consider it, but at the time it seemed like they didn’t do a great job explaining who was using it and why/how it made a difference. I’ve heard more, recently, about using DocumentDB or CockroachDB than it, actually. And most places I work at are still riding the Postgres/Oracle/SQL traditional database shop, where it’s easier to introduce Redis or Memcached than to switch where all the data is stored. Using it for a new project would require buy in, but might be viable.

Mongo's marketing team. The near 1:1 logical mapping for Node.js and other languages with object structures. It's effective, and very easy to use.

Compared to a closer example like RethinkDB, which was really cool but had nowhere near the money and marketing, and ultimately failed. Rethink started with stable data first and performance tuning from there... Mongo took the opposite approach and caught up on stability later. Of course, scaling and administering Mongo isn't as easy as a single node or multiple read replicas.


> but I have wondered why couchbase is not as popular as say Mongo

Because MongoDB is a single binary to download and install and has one of the simplest config files around.

It's arguably the easiest database to use and operate.


Ditto. I loved using Couchbase at my previous job (Blizzard Entertainment) and I'm using at my current gig (FinTech) and I've never understood how such an excellent product has such terrible marketing. It blows Mongo out of the water.

If you're looking for a super low-latency (everything sub-millisecond) document/json store with strong resiliency options, cross-data centre replication, options for optimistic/pessimistic locking, json sub document access and modification, etc etc, I don't know why you'd use anything else.


Bit of a silly question, but how would you say Couchbase compares to something like CouchDB? I've never used Couchbase directly, but I've always really enjoyed the simplicity of CouchDB.

Would you be willing to rely on this for something like tracking election results?

It really depends on your specific environment in terms of how much information you are, or are not, willing to lose, and it's all relative. I'm a fan of high-performing databases (Couchbase), those that are more tunable (ScyllaDB/Cassandra), those built for indexing (Elasticsearch), and those built for consistency (SQL).

Each has its use cases, and there's definitely overlap. Understanding the differences and using the right tool for the job is important.


It's actually worse than that. It's apples to oranges because it's asking one service to do X and one service to do Y.

You can configure postgres to behave more like Redis, and it will be faster if you do (at the cost of transactional guarantees).

A more interesting test might have been Postgres configured-like-redis vs Redis.
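A minimal sketch of one "configured-like-Redis" knob, set per session via psycopg2 (DSN is hypothetical); this trades durability for speed, so don't use it for data you can't lose:

  import psycopg2

  conn = psycopg2.connect("dbname=cache_test")  # hypothetical DSN
  with conn.cursor() as cur:
      # acknowledge commits without waiting for the WAL fsync; a crash
      # can lose the most recent transactions, much like Redis with a
      # relaxed appendfsync setting
      cur.execute("SET synchronous_commit TO off")
  conn.commit()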


Redis can persist and replicate just fine.

Aside from lack of ACID you better hope your dataset never gets too large to fit in memory, tho.


First off:

1. if you are claiming to compare PostgreSQL to Redis, do NOT test with django or SQLAlchemy or whatever, use the raw DBAPI driver, most likely psycopg2, but you might get much faster results with a more optimized driver like asyncpg.

2. Use the Python profiler to generate profiles and to actually see how much time is spent waiting on the network to get the value.
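For example, a bare-bones sketch along those lines (table name and DSN are made up), profiled with cProfile so you can see where the time actually goes:

  import cProfile
  import json
  import psycopg2

  conn = psycopg2.connect("dbname=benchmarks")  # hypothetical DSN

  def read_blob(key):
      # raw DBAPI round trip: no ORM, no model hydration
      with conn.cursor() as cur:
          cur.execute("SELECT value::text FROM blob_cache WHERE key = %s", (key,))
          row = cur.fetchone()
      return json.loads(row[0]) if row else None

  # run this file as a script so read_blob is visible to cProfile
  cProfile.run("for _ in range(1000): read_blob('some-key')")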


Thanks for this! I did not know about asyncpg, which is purportedly an average of 3x faster than psycopg2.

Something is very wrong for the median read of what should be a moderate sized blob with an indexed lookup to be 8ms.

The article doesn’t give enough details to really understand what is that is happening here, but I would recommend everyone to take this 14/16x report and not expect that to be the general case.


I think the author suggested that the cause of some of the latency might be the processing in Python as data goes through the ORM. I like the Django ORM but it does not prioritize speed. It's possible that Python time plus the ORM doing something stupid (perhaps multiple round trips to the DB, reconnects, or weird instrumentation in the DB router or manager) could account for those extra milliseconds.

It's also in a json type in Postgres rather than a blob or varchar, while it's just raw data in redis.

Looking up a row of JSONB data by a primary key in a 3GB PostgreSQL table on my laptop takes 0.4-0.5ms, since it easily fits in memory and Postgres caches the data.

If it fits in RAM, you might as well put it in memcached. You will likely get a 10X speedup.

EDIT:

To be fair, RAM is cheap these days, and most real world projects fit into RAM without a crazy budget.


Why even memcached? It's not going to be 10x faster than your typical LRU cache (e.g. Postgres' shared buffers). It might scale a little better, but you should only consider that when it's worth it.

Ram is cheap everywhere but "cloud" instances. Was just noticing the other day, I can get about 128GB of ram on a dedicated instance for the same price I am paying in the cloud for 16GB.

Yeah, cloud prices are insane. I only rent individual servers.

Very few real world projects actually do need the cloud features. It's nice to have, sure, but you're paying a lot for it.


By default Django closes PostgreSQL connections after handling each web request. I’m not certain but what we’re probably seeing here in latency is mostly the substantial overhead in setting up a PG connection. Redis will still be faster but I’d guess the gap would be significantly tighter with persistent connections enabled.

Really? It doesn't use a connection pool?

https://docs.djangoproject.com/en/2.2/ref/settings/#std:sett... pooling is available, but off by default. I suspect most new projects turn it on, but I have no data to support that.
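For reference, enabling it is a one-liner in settings; a sketch with illustrative values (database name is made up):

  # settings.py
  DATABASES = {
      "default": {
          "ENGINE": "django.db.backends.postgresql",
          "NAME": "songsearch",   # hypothetical database name
          "CONN_MAX_AGE": 600,    # reuse each connection for up to 10 minutes
      }
  }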


“Memory is faster than disk, more at 11”

Also, treating json like a string is faster than treating it like an object. I'd be curious to see the Postgres numbers with the json in a varchar or similar.

Yeah, and in normal operation the OS/PG will have all your data in memory (or at least all your indexes, if you have very, very much data or an under-specced db server).

Disk is for writes.


There is one disk operation that is guaranteed to happen when pushing data into a database like PostgreSQL. It will write data to an append-only binary log.

In Postgres this is called “write-ahead-log” or WAL. On MySQL/MariaDB it is called Binlog.

For a long time the only persistence redis even had was an append only binary log-style system.

Additionally you could tune redis to a maximum persistence threshold which emulated the absolute minimum persistence threshold of a system like Postgres or MySQL.

There is no comparison between these systems. One is designed to treat your data very safely by default and the other is designed for speed.

They both achieve those goals.

I don’t even want to go into the fact that VFS access is still slower than direct memory access (for, I hope, many obvious reasons).

The topic of this thread is the equivalent of judging a fish by its ability to climb a tree.


A few problems:

- A comparison of Redis and Postgres for in-memory storage without use of TEMP or UNLOGGED[0] tables? (See the sketch below.)

- SQL for table creation and queries not shown

- Using an ORM on one side and a more direct client on the other

While some data is better than none, the quality of this analysis is in question.

[0]: https://www.postgresql.org/docs/current/sql-createtable.html
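On the UNLOGGED point, a minimal sketch of what that could look like (psycopg2, made-up table name and DSN); unlogged tables skip the WAL, so they are fast but not crash-safe:

  import json
  import psycopg2

  conn = psycopg2.connect("dbname=benchmarks")  # hypothetical DSN
  with conn.cursor() as cur:
      # no WAL for this table: closer to Redis's default durability story
      cur.execute(
          "CREATE UNLOGGED TABLE IF NOT EXISTS json_cache "
          "(key text PRIMARY KEY, value jsonb)"
      )
      cur.execute(
          "INSERT INTO json_cache (key, value) VALUES (%s, %s::jsonb) "
          "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
          ("some-key", json.dumps({"hello": "world"})),
      )
  conn.commit()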


The redis cache client in Django is effectively an ORM - but a much lighter one. The cache client is also doing compression and uses msgpack for serialization.

I think the comparison, in the context of a regular Django developer, is apt.


> The redis cache client in Django is effectively an ORM

If you're implying that the redis cache client in Django is comparably stunted, I don't agree. Postgres is a lot more tunable than Redis, and the more complex the underlying machinery the leakier/less powerful the abstraction ORMs can provide without taking on the same complexity. An abstraction layer on top of Redis != an abstraction layer on top of Postgres.

Let me put this in context -- you could literally write a custom Postgres access method[0] (see talk on it[1][2][3]) which did nothing but take data and write it to /dev/null, or put it in a bounded space in memory. Of course, you wouldn't want to do that, but there's no way redis is going to be able to compete with that speed -- and from there you can imagine implementing a subset of Redis which is faster than redis can be (while throwing away a lot of course). This is the kind of complexity an ORM has no business replicating most of the time (I don't know which ORMs support the USING in CREATE TABLE yet), but it's part of what could be out there.

> The cache client is also doing compression and uses msgpack for serialization.

Great for performance but bad for testing -- this skews the results of the analysis, they're not doing the same work.

> I think the comparison, in the context of a regular Django developer, is apt.

I don't disagree, because this is what I'd expect a regular Django/Rails developer to do and call it a day -- but for those who want more serious analysis, this isn't it. It's a great start (good analysis is hard), but people shouldn't be thinking this is a good analysis with reasonable options explored -- UNLOGGED & TEMP tables are pretty low hanging fruit that isn't even mentioned. Initial settings of the Postgres cluster aren't mentioned/tweaked.

This is a pretty naive case and Postgres is within about an order of magnitude, while being capable of doing a lot of other things. That's pretty amazing.

[0]: https://www.postgresql.org/docs/12/tableam.html

[1]: https://www.pgcon.org/2019/schedule/events/1374.en.html

[2]: https://anarazel.de/talks/2019-05-30-pgcon-pluggable-table-s...

[3]: https://www.youtube.com/watch?v=5RCkZl5HNiQ&list=PLuJmmKtsV1...


No index, not the same size of datastore, querying the same entry in a loop, no cache config on the db side, only one side hydrates an ORM... Don't get confused: it's not because data comes from a database that it comes from the disk when you query it.

I'm pretty sure we could get to a 3x max for the db.


I use PostgreSQL regularly and pull (and even search) JSON blobs. We have something like 150 million JSON blobs (dataset is 500Gb) that are searched almost every query, response time is maybe 150-200ms... That's on an AWS t3.medium RDS instance.

I don't really know what the hell is going on with this, but something isn't correct.


If its mostly on disk, you're paying ms for block io; get more iops.

If the index used are sensitive to low key diversity, fix that, and maybe shard.

You might prefer a vm type with nvme.

If you can divide the data into more small ones, more indexes and cache fit in RAM.

Also, cluster data by what works, time, region, userclass, etc.

Only decompress the blobs at point of use.

Design your app to need only small fast data which can't be cached.

Etc


Bad benchmark with not enough information to draw any relevant conclusion.

My bet: Connection pooling is not enabled for PostgreSQL (Django's default value for CONN_MAX_AGE is 0). A new connection is established for each request, which probably explains most of the overhead. Moreover, the connections are probably TLS secured by default, which increases the handshake time a lot.


The problem of redis isn't speed of single object storage and retrieval.

We know for a fact that redis stores shit fast. And a blob of json is perfect.

The problem is multi-fold. Now you have a schema-on-read. You may spend a lot more processing time querying and accessing the data rather than getting exactly what you want from a sql query. Concurrency and locking. Constraints dbs offer.

So it is important to evaluate on more than JUST read/write speed.

I worked on a team that used NoSQL for their db needs. The performance was so terribly slow, not because of read/write speeds, but because of everything I listed above. They had to basically manually write a query optimizer vs SQL. They quickly swapped to Postgres and saw something along the lines of 10-100x performance improvements (depending on which parts) with only a week of work.


Everyone should read https://redis.io/topics/persistence where it explains postgresql vs redis depending on fsyncness.

My team built a heavy duty load testing simulator in preparation for an upcoming launch. We kept dialing the simulated clients up until we finally saw timeouts talking to Redis (we used RedisLabs). The cause? We had saturated the 1 gig Ethernet link our AWS C4 instances had. Upgrading to the next tier of NICs solved the problem and Redis kept up the pace.

That said, Redis is hugely expensive for any meaningfully sized data set. Thankfully our use case allowed us to keep our data size bounded over time, or else I would’ve traded latency for drastically reduced cost (eg use Postgres).


Hey, Original blog post author here.

What a lot of people don't understand is that you want both: the resilience of Postgres and the speed of fetching from Redis.

You have data stored. You want to retrieve it as painlessly (shorter time, less CPU usage, less complexity, etc.) as possible. What are the options? How much faster is Redis? How much RAM is that going to eat up? Different tools have different properties. Run some tests to answer some of those questions. Don't just guess. Sure Redis is going to save some time, but how much? Do you know by heart?


"Do you know by heart?" is very much the wrong question though. :-)

Benchmarking is surprisingly hard. After some digging, you might discover that you're actually measuring how long it takes to bring up a TCP connection, or how long it takes PostgreSQL to serialise the JSON from its stored form. It could be differences in the driver layer Django is using. Coming up with the correct experiment can be the toughest part of doing the science.


> What a lot of people don't understand is that you want it both.

Of course I want both. I also want a car that requires zero fuel, can go super fast, has great safety features, and is less than $100. Life is about trade offs and you have to make them for specific reasons.

People trade off using Redis vs Postgres for different workloads intentionally. They aren't meant to be used for the same workloads and are _not_ interchangeable.

Your comparison (and article) is worthless and, in fact, hurts more than it helps.


I'm not sure you are measuring what you think you are measuring. Have you set CONN_MAX_AGE in Django?

So reading from RAM in a single-threaded key/value store with a lightweight text protocol, is faster than reading from disk in a multi-threaded relational database with SQL parsing and ACID semantics. As measured by a heavy web framework running in a slow interpreted language that overshadows the speed differential so much that it's only 16x at the end.

This article is obvious in its conclusion, inaccurate in measurement, and useless for any discussion.


Thou shall edit thy postgresql.conf file. Better yet, use PGTune https://pgtune.leopard.in.ua

I saw something similar recently when reading JSON data from Postgres via Rails, which was unbelievably slow. I found the reason to be that Rails would automatically parse the JSON value into an in-memory Ruby object. Since this was for an API-like backend, the solution was simple: just read the column as a string ("SELECT payload::varchar..")

I don't know if Django is doing the same, but using an ORM might just do that to you.
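If someone wants to check the same thing on the Django side, a rough sketch (table and column names are made up) that skips the automatic JSON parsing by casting to text:

  from django.db import connection

  def read_raw(key):
      # fetch the jsonb column as plain text so neither psycopg2 nor the
      # ORM turns it into Python objects on the way out
      with connection.cursor() as cur:
          cur.execute("SELECT payload::text FROM myapp_lookup WHERE key = %s", [key])
          row = cur.fetchone()
      return row[0] if row else None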


Seems a really bad study, since it doesn't mention the exact methodology (caching, exact queries, size of the JSON data, etc.), nor does it attempt to figure out why PostgreSQL is taking 8 ms to do something that should take much less and why reads and writes take the same time, or why Redis is taking 0.5 ms (i.e. millions of CPU cycles) for something that should essentially just do a memcpy.

A good read, but wouldn't count these results as anywhere near valid - if you're using an ORM, you're just perf testing the ORM.

> anywhere near valid

Yeah they're laughably naïve.


These sort of benchmarks are both useful and useless, depending on what you want. The architectures of the two are very different and so speed depends very much on what you are trying to do.

Access patterns in Redis are very simple and it is a well optimized single threaded event loop. It is very fast but cannot scale up to multiple cores etc. So in this case, yes, storing and retrieving data will be fast, but uses will be narrower and if you need a high degree of concurrency, Redis will pose some issues for you.

At least in my experience, Redis's key problem is that while it is well optimized it is still a single threaded event loop and therefore with complex use comes a lot of headache. For simple things however, it is really hard to beat.


Curiosity strikes: I have to wonder how much these benchmarks would change on an actual Linux kernel? I don't know if the PostgreSQL team is necessarily tuning their project for macOS/Darwin. I'm sure Redis would still be faster (because memory vs disk), but I wonder by how much.


For a couple similar use-cases (write rarely, read many times, JSON data) I ended up using actual JSON files, on disk. Time to read and parse them is very small compared to asking Redis or PG for it, and this benefits from the Linux disk cache, which makes reading from those most-unchanged files almost the same as reading it from memory.
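That pattern is about as small as it gets; a sketch (cache directory is hypothetical):

  import json
  from pathlib import Path

  CACHE_DIR = Path("/var/cache/myapp")

  def read_cached(key):
      # the OS page cache keeps hot, unchanged files in RAM, so repeated
      # reads rarely touch the disk at all
      return json.loads((CACHE_DIR / f"{key}.json").read_text())

  def write_cached(key, data):
      (CACHE_DIR / f"{key}.json").write_text(json.dumps(data))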

This is meaningless without knowing the schema.

Why can't Postgres implement a transparent caching layer similar to Redis?

There is a query cache. But I sorta doubt this benchmark took appropriate advantage of it. We don't know enough about how the author is running postgres to be sure. But the reported postgres numbers are pretty slow for machine-local SELECTs. Maybe Django's ORM is getting in the way?

> Why can't Postgres implement a transparent caching layer similar to Redis?

Bruh it does! That's what page pools are all about. Not to mention the query cache...


How would you cache a write?

You can't cache it, but if you don't care about durability as in this case you can acknowledge it before it is fsync'd.

A write is really fast if you don't do it.

FWIW there is an update to this article at - https://www.peterbe.com/plog/update-to-speed-comparison-for-...

TLDR; this update is not about redis vs postgresql but about caching!


If it were about caching, he wouldn't have used an ORM and a json Postgres type. And there would have been more detail about the query cache, backing store, and wal+fsync settings for Postgres.

Is this really that surprising?

Postgres is an RDBMS, with transactional support and everything. While Redis is powerful, it is much simpler and more focused, not nearly as comprehensive as Postgres.

TL;DR: Postgres is doing a lot more heavy lifting than Redis; it feels slow because that much work is not necessary for a simple KV lookup.


postgresql does TOAST compression of large jsonb blocks. You might be incurring the cost of compression/decompression by postgresql

Uh...TOAST is a system for storing data off page. It doesn't have anything to do with compression? Reading a TOASTed value actually requires reading two pages, so you're correct that there is read and write overhead. But it isn't CPU bound a la compressing the data.

postgres does TOAST compression of larger jsonb.

I don't have any direct link to the docs, but here's a bug report when such behavior was NOT happening.

https://www.postgresql.org/message-id/20140801182102.2696.87...

if possible, this benchmark should be run again with a largish TOAST_TUPLE_TARGET
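For anyone trying that, a sketch of the two knobs involved (table/column names are made up; toast_tuple_target needs PostgreSQL 11+):

  import psycopg2

  conn = psycopg2.connect("dbname=benchmarks")  # hypothetical DSN
  with conn.cursor() as cur:
      # keep larger tuples in the main heap before TOASTing kicks in
      cur.execute("ALTER TABLE json_cache SET (toast_tuple_target = 8160)")
      # store the jsonb out of line but uncompressed, skipping the
      # compression/decompression cost entirely
      cur.execute("ALTER TABLE json_cache ALTER COLUMN value SET STORAGE EXTERNAL")
  conn.commit()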



This is cute.

Quote:"My current implementation uses PostgreSQL (via the Django ORM) to store this stuff."

At "ORM" the author lost all credibility with me. I mean, you went from cloud to local cache as a performance improvement, and then you chose to do it via an ORM? Why not make your own custom solution if you went down this road anyway? It's like replacing the alarms in your house from the ones provided by a security company with ones you make yourself because you're an electronics engineer, then building them from off-the-shelf electronics kits and complaining that a different kit is better. It's your house, dude; make them all custom for maximum efficiency if you went this way.



