Hacker News
Moving persistent data out of Redis (githubengineering.com)
220 points by samlambert on Jan 10, 2017 | 75 comments

Wow, I'm learning today that GitHub used Redis for persistent data, now that they've moved away :-) Anyway, very happy that Redis helped to run such an important site. From the blog post it looks like for certain things moving away from Redis was hard even though they are very skilled with MySQL; this is a good thing from the POV of Redis, since it means Redis lets you model certain things easily. However, moving away was an important priority for them, so I wish to know why they wanted to move away so badly and how Redis could be improved in order to serve its users better. If Redis were better for their use case, they could have avoided moving to MySQL, I guess. Unfortunately the blog post is short on details in that regard, perhaps because the blog post author(s) are too gentle to bash Redis after using it for a long time.

FWIW we went through a very similar process to the one documented here by GitHub (~3 months ago). It was entirely due to operational reasons and nothing to do with shortcomings in Redis itself. MySQL was the master record for 99% of our data while Redis was the master record for the other 1% (as it happens, it was also a kind of activity stream). Having a single 'master' reference for our data reduced complexity to such a degree that it was worth running a less computationally efficient setup. We also have nowhere near GitHub's volume, so we did not have to do such significant re-architecting to make unification possible.

Now we still use Redis for reading the activity streams and as an LRU cache for all sorts of data, but it is populated, like all of our specialised slave-read systems (Elasticsearch, etc.), by replicating from the MySQL log.

Hope that helps!

Yes, this helps and totally makes sense to me, thanks. I would do the same... In this case, however, it looks like there were certain high-volume writes that could be handled in a simpler manner with Redis. It is totally possible, though, that while this looks like an important use case, it accounted for a small percentage of all the data, so we are back to the consolidation idea: moving everything to a single system is in general a good thing.

What method are you using to replicate from MySQL binlog to various other systems?

FWIW, I've used github.com/siddontang/go-mysql to successfully replicate from MySQL to DynamoDB. Currently not using GTIDs and looking into that next.
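The core of any binlog-based replication pipeline, whatever the tool, is the same loop: consume an ordered stream of row-change events and apply them idempotently downstream, tracking the last applied position so you can resume after a crash. A minimal Python simulation of that loop (the event tuples and `apply_events` helper are illustrative, not go-mysql's actual API):

```python
# Simulated change-data-capture: apply an ordered stream of binlog-style
# row events to a downstream key-value store. The position counter lets a
# restarted consumer skip events it has already applied.

def apply_events(events, downstream, position=0):
    """Apply events newer than `position`; return the new position."""
    for pos, op, key, value in events:
        if pos <= position:          # already applied -- no-op on replay
            continue
        if op == "upsert":
            downstream[key] = value
        elif op == "delete":
            downstream.pop(key, None)
        position = pos
    return position

events = [
    (1, "upsert", "user:1", {"name": "ada"}),
    (2, "upsert", "user:2", {"name": "lin"}),
    (3, "delete", "user:1", None),
]
store = {}
pos = apply_events(events, store)        # initial catch-up
pos = apply_events(events, store, pos)   # replaying the stream is a no-op
```

GTIDs (mentioned above) solve the same resume-after-crash problem as the position counter here, but in a way that survives master failover.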

Just asking for some info: how do you make sure that multiple of your DB systems are in sync? (Specifically interested in MySQL and Elasticsearch.)

Hope it's alright to ask you that.

In the case of ES the short answer is: we don't. We have fault tolerance in our replication system to guarantee eventual consistency instead. I would say using ES as a consistent source of data isn't really playing to its strengths, so we don't use it that way. The consistency you want is determined at read time: if you need consistency, hit MySQL; for our use case that almost never happens, as eventual consistency is usually close enough to instantaneous.

Our other tool is to decouple lookup (which objects to fetch) and population (what data to return for each object). You can mix and match, e.g. do a lookup against an inconsistent ES but still get consistent objects by populating from MySQL (or vice versa). As others have alluded to it depends entirely on the requirements for the result set.
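The lookup/population split described above can be sketched in a few lines. This is an illustrative toy (the store names and `search` helper are mine, not the commenter's system): the search index answers "which ids match", the primary store answers "what data to return", and mixing the two lets a slightly stale index still serve fresh objects.

```python
# Decoupled lookup vs. population: an eventually consistent index
# resolves ids, while the consistent primary store supplies the data.

search_index = {"widgets": ["a", "c"]}   # may lag behind the primary
primary = {"a": {"stock": 5}, "b": {"stock": 0}, "c": {"stock": 9}}

def search(term):
    ids = search_index.get(term, [])                     # inconsistent lookup
    return {i: primary[i] for i in ids if i in primary}  # consistent populate
```

The guard `if i in primary` handles the window where the index still references an object the primary has deleted.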

Where I work we use several different MySQL replicas in production, and we don't expect them to be in sync.

So long as the source of truth (Master MySQL node) is up to date, it's okay.

For example, if we show a user how much money is in their account on every page, we can run that query on a replica, since it's fine if this is a few seconds delayed. However, immediately after an action changed their balance, on a confirmation screen, we'd want to show the value from the master.

It's entirely possible that any place Elasticsearch is being used just doesn't need consistency.

There are actually a few strong solutions out there for MySQL, most starting with change data capture, like https://github.com/shyiko/mysql-binlog-connector-java (I link that one in particular because the author links to alternatives right in the readme!)

Pgsql is a bit harder, but if I needed to start somewhere it would be with:


or https://github.com/confluentinc/bottledwater-pg

These are the starting points for pretty sophisticated solutions, for when you need truly real-time Elasticsearch indexes and can bring up infrastructure like Kafka.

For many applications, queueing an index update whenever something changes through your ORM, combined with an hourly/daily full refresh, is pretty satisfactory.
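That "good enough" approach can be sketched in a few lines. This is a toy illustration (the `on_save` hook and store names are assumptions, not any particular ORM's API): changed ids go on a queue that is drained frequently, and a periodic full refresh catches anything that bypassed the hook.

```python
from collections import deque

# "Good enough" index sync: ORM save hook -> queue -> drain loop,
# plus a periodic full refresh to repair any drift.

queue = deque()
index = {}   # the search index (e.g. what Elasticsearch would hold)
db = {}      # the source of truth

def on_save(obj_id, data):
    """Called from the ORM's save hook."""
    db[obj_id] = data
    queue.append(obj_id)

def drain():
    """Run every few seconds: copy queued changes into the index."""
    while queue:
        obj_id = queue.popleft()
        index[obj_id] = db[obj_id]

def full_refresh():
    """Run hourly/daily: rebuild the index from the source of truth."""
    index.clear()
    index.update(db)
```

The weakness, of course, is anything that writes to the database without going through the ORM, which is exactly what the full refresh (or binlog tailing, per the sibling comments) is for.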

If you need any kind of consistency guarantee, I think you would need some kind of distributed transaction.

If it's not needed, you could tail the MySQL log and have a process making the same changes to Elasticsearch. Elasticsearch may lag behind if there are problems.

I'm facing a similar challenge, although at a much (MUCH!) smaller scale.

We have nearly everything in Postgres, and Redis serves both as a caching layer (non-persistent) and as the backing store for Rails sessions and Sidekiq (persistent).

Having one source of truth can make things like failover much easier. I can handle PG failover, and also Redis, but I'd rather not have to deal with both. Especially if you consider the potential for things going slightly out of sync (think of a job in Sidekiq that relies on an id in PG, where one of them loses a few microseconds of data during replication; just speculating a scenario here).

Did anybody face similar challenges and care to share their thoughts?

Reading more into this point in the post than I maybe should, but "Take advantage of our expertise operating MySQL" sounds like they have more engineers familiar and comfortable with MySQL than with Redis.

That's a valid reason indeed. Also technological consolidation: if I can do everything with a single DB / language / ..., I always tend to use a single thing.

Watch how far you generalize that view. I realize that "I can do everything with a single... language" can make "everything" mean anything. I had a coworker who helped a Middle Eastern country with security software. He went full JavaScript on it: Node, Mongo, etc. It worked; JS did everything. While he didn't go into details, afterwards he thought it was a bad idea. A great learning experience on when to define "everything" properly, as well as what is outside of it.

Yes, I totally agree that, like most "rules of software", you need to be judicious enough to know when following the rule is not a good idea... However, I more often see the contrary: a multitude of systems added together without very strong reasons.

Just wondering why Mongo is considered part of "JS". Is it because of the MEAN stack?

Part of it is JSON as the storage format. Another part is its Node driver. The whole API fit it well. It understood async programming. It felt JS-like. The input and output were JSON instances. Finally, yea, from what I know of him "web scale" did play part in the decision. Oh those heady days.

That's mostly to do with the API provided by the particular MongoDB driver, rather than MongoDB itself. Mongo stores and transmits BSON, not JSON. Most Mongo drivers expose an API that serialises your data to BSON for writes and wraps the BSON data with a JSON-like interface for reads.

PostgreSQL has the jsonb column type, which is just as powerful as Mongo.

Possibly because of JSON as the storage format.

Because of JSON and "web scale".

That's a bit surprising and, I think, sad. One would expect GitHub to have at least reached out, informed you, and thanked you, if not tried to support your project in some active way.

Whenever this comes up on HN the perspective is quickly shifted to the developer's choice of license, under which there are no expectations. But let's shift the perspective to the other side: surely startups and others using open source projects for commercial reasons, even if not legally obligated or expected to by the developers, have some ecosystem responsibility to try to contribute back in some meaningful way when they can.

Acquiring open source projects or hiring developers are 'influence plays' to gain control and should not be the only way for commercial projects to contribute.

I understand your POV, and I thank you for your comment, but mine is actually the opposite, and I want to explain why. I consider Redis, even if the license is different, kind of like the old "Public Domain": you grab it and do whatever you want, without expecting much beyond what you see of the project's direction and activity. However, GitHub was, I think, the very first big site using Redis and clearly stating it, back when Redis was in beta, so they did a very bold thing and helped Redis a lot to grow up. GitHub's current CEO even wrote the first Redis-based queue system, which gave Redis a huge popularity boost. And they are still using Redis, even if no longer for durable data, so it's a 7-year symbiosis going forward. Even if we never exchanged much info, I think it's fine; I actually think it's the hackers' way :-)

I think this is key:

"We needed something that would work for both github.com and GitHub Enterprise, so we decided to lean on our operational experience with MySQL."

Pretty cool that redis was helping to host its own source code & development.

A lot of software helps host its own source code and development. Just considering GitHub: 1. MySQL is hosted on GitHub. 2. I think they use Elasticsearch, and it is hosted on GitHub. And lots of other software is pretty cool like that!

I'm pretty sure that git's source is also version-controlled in git :)

You might appreciate http://fossil-scm.org/

From the post it sounds like it still is.

From the projects I've been involved in, the reason is simply that we don't want to have 2 persistent storage systems. There's a need for a fast cache system, and there's a need for a reliable – as in certainty above speed – database. The former is usually Redis, and the latter most often needs to be a full-blown SQL database to handle the required complexity of larger applications.

It's just easier to have one single source of truth. Please don't change Redis into a large SQL database. :)

Thanks! No plans to change it into an SQL database :-) Actually the idea is to focus more on the caching/streaming area.

We've recently had to move away from Redis for persistent data storage at work too, opting instead to write a service layer on top of Cassandra for storing data.

Redis was tremendous on our journey up there, but one of its shortcomings is that it isn't as easy to scale up as Cassandra if you haven't designed your system to scale on Redis from when it was built (which we didn't). Instead of re-architecting for a Redis Cluster setup, we decided to move the component to a clustered microservice written in Go that sits as a memory cache and write buffer in front of Cassandra for hot, highly mutated data.

Would anyone be interested in a blog post about our struggles & journey?

> instead of re-architecting for a redis-cluster setup, we decided to move the component to a clustered microservice written in go, that sits as a memory-cache & write buffer in front of cassandra for hot, highly mutated data.

Somehow, setting up a Redis cluster and doing whatever you have to do to distribute/shard your keys effectively (which AFAIK is not much) does sound a little more efficient than writing a clustered microservice in Go with a Cassandra backend. Redis clustering is actually quite easy.

Forgive me if I seem grumpy. My recent experiences have caused the "We had a minor issue, so we redid everything in a Totally Cool Super-Neato New Stack That Integrates All The Hiring Manager's Favorite Buzzwords!" perspective to become a bit grating.

Redis is one of the few new pieces of infrastructure over the last 10 years that's truly deserving of its position.

My post above describes the main reason for moving from Redis: the fact that data for inactive users doesn't need to be in memory perpetually. :P

Cool. I look forward to the post that reveals the unique properties of Cassandra that ended up making it the most practical data store for your use case.

I understand that Cassandra et al. exist to solve real problems that someone out there has experienced, and I seek to throw no shade on the great engineers who make these fine products. I am, however, somewhat dubious that these niche products are applicable in the vast majority of cases where they're deployed. I strongly believe, and I think the data would bear this out, that when it gets down to brass tacks, most people are integrating such specialized tools into generic products to either a) make life at the office more exciting; b) beef up resume points for their next job application cycle; or c) both.

Someone in our company wrote a blog post pretending to justify the move to a niche datastore. He's very proud of it and makes several spurious, nonsensical justifications in it. The truth is that MySQL would've been many times more practical along all axes, except the one this guy cares most about, which involves his personal career ambitions.

This move was partially under the radar so objections couldn't be raised and full backups were not properly arranged. It cost the company a lot of money not only in time and infrastructure, but also in the recovery process that had to be undertaken by real data experts (or nearest we had at the time, at least) when the cluster was destroyed by one of his careless scripts. :)

Second nightmare, currently ongoing: shifting everything to docker/k8s, which, for just one example among a very long laundry list of complaints, only got support for directly addressing app servers behind a load balancer last month, as a beta feature (in k8s nomenclature, that's "Version 1.5 has a beta StatefulSets feature to make Pods in a ReplicaSet uniquely addressable from inside the cluster! Don't forget to make a Headless Service and Persistent Volume." Exhausted yet? Just wait.).

Why are we switching to something that lacks such basic functionality (we're like 3 versions behind, so we can't use it)? If I told you, I'd have to kill you, but it sure makes our resumes pretty.

I'm all for learning, experimentation, and doing things for fun. We are on Hacker News after all. I guess I've just developed a taste for a stable production ethos that, to co-opt a scriptural term, is not "blown about by every wind of [tech fad]". I crave a company that makes its decisions based on a significant and real cost-benefit analysis that shows substantial unique benefits and sufficient maturity to a tech before jumping on the bandwagon. I guess I just want some sanity.

As it stands, people just pretend that these justifications exist by making up some mumbo-jumbo about "dude JavaScript on the backend is like really event-driven, brah!"

I'd be interested in seeing that, also exactly what bottlenecks were you running into? CPU? I/O? Memory?

Memory & CPU. We knew we'd be running into both eventually without adding additional Redis instances (memory growing way faster than CPU usage in this specific scenario).

When it was initially built, it was basically a bunch of Redis Lua scripts to handle updating the data, running on Redis configured as master/slave managed by Sentinels.

Given the nature of the data, too, only the data for active users would be hot, but data for inactive users would have stuck around in memory needlessly. Our new system keeps only the hot set of users in memory. We also built it to transparently migrate users from Redis to Cassandra when they were accessed.
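Transparent migration-on-access like this is a read-through pattern: check the new store first, and on a miss pull from the old store, write into the new one, and drop the old copy. A minimal Python sketch (the dict-backed stores stand in for Redis and Cassandra; this is not the commenter's actual Go service):

```python
# Lazy read-through migration: hot users migrate themselves on first
# access; cold users can be swept into the new store by a batch job later.

old_store = {"u1": "stream-1", "u2": "stream-2"}   # stands in for Redis
new_store = {}                                      # stands in for Cassandra

def get_user(user_id):
    if user_id in new_store:
        return new_store[user_id]
    data = old_store.pop(user_id, None)  # remove from old store...
    if data is not None:
        new_store[user_id] = data        # ...and persist in the new one
    return data
```

A nice property of this pattern is that the old store's memory footprint shrinks in proportion to actual traffic, which is exactly the "inactive users needn't be in memory" goal.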


Can anybody here help me understand why many teams are using MySQL as a KV store? (Uber did it recently, so I assume many others probably did too; network effect.)

I personally love MySQL. Just want to understand what makes MySQL a great KV store as opposed to more seemingly specialized systems like Redis?

> Just want to understand what makes MySQL a great KV store

It's not; it just happens to be good enough, which matters a lot for operational expertise/costs/etc.

For example, you can store hundreds of millions of KV rows in an InnoDB table and still have <1-3ms response times on queries, while having persistence built in. Perfect is the enemy of good enough.
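The "SQL database as a KV store" pattern amounts to one table with a primary key plus upsert/get helpers. A runnable sketch using Python's stdlib sqlite3 purely for portability (the thread is about MySQL/InnoDB, where the upsert would typically be `INSERT ... ON DUPLICATE KEY UPDATE` instead):

```python
import sqlite3

# One table, one primary key: the whole "SQL as KV store" pattern.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")

def kv_set(k, v):
    # Upsert: replace the row if the key already exists.
    db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (k, v))

def kv_get(k):
    row = db.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
    return row[0] if row else None
```

Because lookups hit a primary-key (B-tree) index, point reads stay fast even at hundreds of millions of rows, which is the "good enough" the parent describes.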


Less complexity, I imagine. If you've already got a beefy MySQL setup, then maintaining a separate Redis setup will introduce quite a bit of redundancy and chances for failure. Given that the big SQL databases have key value engines these days, you're going to get decent performance out of them as well.

All the tooling around SQL databases allows you to easily set up fixtures, backups/restores, monitoring, viewing data, etc. Also you can scale wide through read slaves easily.

For me a primary reason would be to avoid platform spread.

Operational complexity at scale, and prior experience with it. It is one thing to run a one-instance system and a completely different story to run a clustered setup for large amounts of data with performance considerations.

MySQL also proved very good at scale, at Facebook, YouTube, Uber, etc. and there are a lot of people with experience running it.

Uber wrote a very good article about their switch, for them it was more about performance, though a little bit controversial.

Cool stuff. I found it especially interesting how they removed 30% of writes with new logic to compose some timelines of events in other timelines. It's a thought provoking optimization that calls to mind graph partitioning.

For example, you have 10 people in your organization with various permissions on repos. Some people (the CTO, let's say) can see every repo, while others might only be able to see some repos. Or you might have consultants, or open source projects which non-employees contribute to. Then you construct a graph where each node is a contributor, connected to other contributors by the permissions they have on repos (or are the repos the nodes and the contributor permissions the edges?). Finally, you run a graph partitioning algorithm where the number of partitions is the number of unique timelines you have to write for an organization. Thinking about an organization with closer to 500 contributors, I can see how this could reduce the number of timelines by 30%.
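A simpler special case of the partitioning idea above is easy to demonstrate: contributors whose sets of visible repos are identical can share one timeline, so the number of timelines to write is the number of distinct permission sets, not the number of users. A hypothetical sketch (the org data is made up; this is my reading of the optimization, not GitHub's actual algorithm):

```python
# Group contributors by their visible-repo set; each group shares
# one timeline, so writes scale with distinct permission sets.

def shared_timelines(permissions):
    """permissions: {user: set of visible repos} -> {repo set: [users]}."""
    groups = {}
    for user, repos in permissions.items():
        groups.setdefault(frozenset(repos), []).append(user)
    return groups

perms = {
    "cto":   {"api", "web", "infra"},
    "alice": {"api", "web"},
    "bob":   {"api", "web"},
    "carol": {"infra"},
}
# 4 users, but only 3 distinct permission sets -> 3 timelines to write.
```

Full graph partitioning generalizes this by also merging groups whose permission sets merely overlap heavily, at the cost of some filtering at read time.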

Activity streams are such a common use case. It is very interesting that Soundcloud chose to do something different: https://developers.soundcloud.com/blog/roshi-a-crdt-system-f...

Assembling the inbox on demand is quite interesting. I don't quite understand the querying and operations involved with Roshi for doing that.

Working with large amounts of persistent data is hard. Limiting the architecture to a single database system (MySQL in this case) generally makes managing and scaling much easier, versus having to know/learn how to scale multiple systems independently.

Even if Redis was a better fit for some of their use cases, it just makes it much easier to not have the additional persistent database system to manage.

Has GitHub stopped allowing free searching of code? All I get is "Must include at least one user, organization, or repository". I think that's a greater problem than the speed of its streams.

I believe this restriction was put in place to limit searching for accidentally committed AWS keys.

AWS keys are already being detected, and the account owners get notified when they are found. Private key material was a bigger issue at one point in time (i.e. dotfiles/.ssh/id_rsa).

You have to be logged in.

I wonder if this will allow better scaling of GitHub Enterprise. We are pegging our usage; if we could, we would migrate everything to GitLab Enterprise (which we also have), which seems to have better scalability.

How can we help you move to GitLab EE? (As you indicated, it scales to 100k users, so that shouldn't be the problem.)

Does GitLab really handle 100k users? Where is this indicated?


Thanks for asking. The issue you referred to talks about users on a single machine. Unlike GitHub you can run a cluster of application servers with GitLab https://about.gitlab.com/high-availability/

Some of our users have 25k+ users on their cluster. We know GitLab can scale to 100k users because we run GitLab Enterprise Edition without modifications on GitLab.com

GitLab.com currently has much more than 100k users and the performance leaves much to be desired https://gitlab.com/gitlab-com/infrastructure/issues/947

But we're comfortable that you can run 100k users on a cluster of machines without much tuning.

> Unlike GitHub

GitHub seems to have a clustering product that came out a year ago, and it looks like IBM has reported they're running over 13,000 users on it (back in August; I imagine it's closer to 20,000 by now): https://www.ibm.com/blogs/bluemix/2016/08/ibm-internal-githu...

> We know GitLab can scale to 100k users because we run GitLab Enterprise Edition without modifications on GitLab.com

What does "modification" mean in this context? Beyond the recommended specifications listed at https://docs.gitlab.com/ee/install/requirements.html#cpu and https://docs.gitlab.com/ee/install/requirements.html#memory? It doesn't list anything above 40,000 users. Beyond that, the HA documentation (https://docs.gitlab.com/ce/administration/high_availability/...) isn't _really_ active/active HA; it says it is, but it's not true. True active/active would mean that you wouldn't rely on a shared NFS server, Postgres, or Redis server.

However, with https://help.github.com/enterprise/2.8/admin/guides/clusteri... it sounds like my organization can scale to well over 100,000 by adding more clustering nodes instead of trying to figure out independently how to scale services like redis or postgres or a shared NFS server.

> without much tuning.

I'd love to hear a comparison between the two products, and also what kind of tuning you've done on gitlab.com to support those user numbers. Would love to see that on the documentation to support the open source way!

can you share how many machines handles gitlab.com currently?

It is around 100 machines. Commonly it is a couple of thousand active users per application server, but your mileage may vary based on many things.

> which seems to have better scalability

Out of curiosity: how many users do you have, and what measurements are you using to determine scalability?

Disclaimer: I work at GitHub.

> We changed up how writing to and reading from Redis keys worked for [the organization] timeline before even thinking about MySQL ... This resulted in a dramatic 65% reduction of the write operations for this feature.

Interesting. Is there a comparison of overall performance between the intermediate design (w/ Redis) and what they ended up with?

Wonder why they didn't use Cassandra for this use case.

They told you their motivations. Reduce complexity, and take advantage of their MySQL knowledge. Cassandra would fail both of those if they didn't have a lot of MySQL knowledge, since setting it up and maintaining it would increase complexity by introducing a tool they did not know.

Last year I was setting up a trial of Cassandra for something, going through the usual swearing at a new tool not quite working as expected (e.g. by default picking a random port for inter-node communication)... and at the next desk over, a non-tech colleague called Cassandra kept hearing me mutter angrily about 'cassandra' and wondered what she'd done. Whoops :)

Do you sit close to Ezekiel as well?

Though having someone sitting near me who doesn't know what I am/we are working on would surprise me. But it does happen, especially if there are hot desks nearby for people from other offices to work at temporarily. I do swear loudly often, so probably not a good choice to have those too near me...

It's fairly unfortunate that they are tied to MySQL, because this is pretty much the use case that PostgreSQL jsonb was built for.

It has first-class support in most ORMs, and works quite well.

I don't know why people use Redis as an LRU cache. It's a terrible LRU cache. Its eviction algorithm isn't true LRU; it does sampling, which may cause new keys to get incorrectly evicted. It's also really slow, being single threaded.

Why does single threaded make it slow?

Also, 3.0 has a better LRU algorithm. https://redis.io/topics/lru-cache

A lot of uses of Redis are suboptimal, but easy to set up and good enough. For example, "true" LRU is not always needed; an approximation of LRU is sometimes sufficient. It's often not worth the time and money to do better.
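To make the trade-off concrete: Redis's approximated LRU evicts the least recently used key from a small random sample rather than maintaining a full LRU list. A simplified Python sketch of that idea (this illustrates the sampling approach, not Redis's actual implementation, which also evolved a key pool in later versions):

```python
import random

# Sampled LRU eviction: when full, pick a few random keys and evict the
# least recently used of the sample. Cheaper than true LRU bookkeeping,
# and usually close enough in hit rate.

class SampledLRU:
    def __init__(self, capacity, sample_size=5, seed=0):
        self.capacity, self.sample_size = capacity, sample_size
        self.data = {}      # key -> value
        self.touched = {}   # key -> logical clock of last access
        self.clock = 0
        self.rng = random.Random(seed)  # seeded for reproducibility

    def set(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            sample = self.rng.sample(list(self.data),
                                     min(self.sample_size, len(self.data)))
            victim = min(sample, key=self.touched.__getitem__)
            del self.data[victim]
            del self.touched[victim]
        self.clock += 1
        self.data[key] = value
        self.touched[key] = self.clock

    def get(self, key):
        if key in self.data:
            self.clock += 1
            self.touched[key] = self.clock  # refresh recency on read
            return self.data[key]
        return None
```

With a larger `sample_size` the behavior approaches true LRU at higher CPU cost, which mirrors Redis's tunable `maxmemory-samples` knob.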

When you need to scale past what Redis can provide, you can move on to a different solution, as Github has done.

What should people use?

Use Redis when you need advanced data structures built in. Memcached is just a KV store.

Memcached is an LRU cache.

I went and searched for "memcached redis" because up until now I hadn't gotten around to checking out the differences. Here are some things I found.




It would be nice if they could share some info about GitHub::KV.

I love Redis!
