Now we still use Redis for reading the activity streams and as an LRU cache for all sorts of data, but it is populated, like all of our specialised slave-read systems (Elasticsearch, etc.), by replicating from the MySQL log.
Hope that helps!
But how do you make sure that multiple of your DB systems are in sync (specifically interested in MySQL and Elasticsearch)?
Hope it's alright to ask you that.
Our other tool is to decouple lookup (which objects to fetch) and population (what data to return for each object). You can mix and match, e.g. do a lookup against an inconsistent ES but still get consistent objects by populating from MySQL (or vice versa). As others have alluded to, it depends entirely on the requirements for the result set.
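A rough sketch of that mix-and-match, assuming a hypothetical "items" index/table and the usual Python clients (names here are illustrative, not what we actually run): lookup against a possibly stale ES, populate from MySQL.

    import pymysql
    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # possibly-stale index: used only for lookup
    db = pymysql.connect(host="mysql-primary", user="app", database="app")

    def search_items(query, size=20):
        # Lookup: ask ES *which* objects match; tolerate index lag.
        hits = es.search(index="items",
                         body={"query": {"match": {"text": query}}, "size": size})
        ids = [int(h["_id"]) for h in hits["hits"]["hits"]]
        if not ids:
            return []
        # Population: fetch the authoritative rows from MySQL.
        sql = "SELECT id, text FROM items WHERE id IN (%s)" % ",".join(["%s"] * len(ids))
        with db.cursor() as cur:
            cur.execute(sql, ids)
            rows = {r[0]: r for r in cur.fetchall()}
        # Preserve ES ranking; drop ids ES knew about but MySQL has since deleted.
        return [rows[i] for i in ids if i in rows]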
So long as the source of truth (Master MySQL node) is up to date, it's okay.
For example, if we show a user how much money is in their account on every page, we can run that query on a replica, since it's fine if this is a few seconds delayed. However, immediately after an action that changed their balance, on a confirmation screen, we'd want to show the value from the Master.
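A minimal sketch of that routing rule, with hypothetical connection and table names: reads default to a replica, but a confirmation screen that just performed a write reads from the Master.

    import pymysql

    primary = pymysql.connect(host="mysql-primary", user="app", database="bank")
    replica = pymysql.connect(host="mysql-replica", user="app", database="bank")

    def get_balance(user_id, just_wrote=False):
        # Confirmation screens pass just_wrote=True to get read-your-writes;
        # everything else tolerates a few seconds of replication lag.
        conn = primary if just_wrote else replica
        with conn.cursor() as cur:
            cur.execute("SELECT balance FROM accounts WHERE user_id = %s", (user_id,))
            return cur.fetchone()[0]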
It's entirely possible that the places Elasticsearch is being used just don't need consistency.
Pgsql is a bit harder, but if I needed to start somewhere it would be with:
These are the start of pretty sophisticated solutions where you need super real-time Elasticsearch indexes and can bring up infra like Kafka.
For many applications, queueing an update whenever something changes through your ORM, combined with an hourly/daily full refresh, is pretty satisfactory.
If it's not, you could tail the MySQL log and have a process make the same changes to Elasticsearch. Elasticsearch may lag behind if there are problems.
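A bare-bones version of that tailing process, using the python-mysql-replication and elasticsearch packages (table/index names are made up, and a real deployment also needs to checkpoint the binlog position so the process can resume where it left off):

    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.row_event import (
        WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent)
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    stream = BinLogStreamReader(
        connection_settings={"host": "mysql-primary", "port": 3306,
                             "user": "repl", "passwd": ""},
        server_id=100,          # must be unique among replicas
        blocking=True,          # keep tailing as new events arrive
        only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent])

    for event in stream:
        if event.table != "items":
            continue
        for row in event.rows:
            if isinstance(event, DeleteRowsEvent):
                es.delete(index="items", id=row["values"]["id"], ignore=[404])
            else:
                values = (row["after_values"]
                          if isinstance(event, UpdateRowsEvent) else row["values"])
                es.index(index="items", id=values["id"], body=values)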
We have nearly everything in Postgres, and Redis serves both as a caching layer (non-persistent) and as the backing store for Rails session storage and Sidekiq (persistent).
Having one source of truth can make things like failover much easier. I can handle PG failover, and also Redis, but I'd rather not have to deal with both, especially if you consider the potential of things going slightly out of sync (think a Sidekiq job that relies on an id in PG, where one of the two loses a few microseconds of data during replication; just speculating a scenario here).
Did anybody face similar challenges and care to share their thoughts?
Whenever this comes up on HN, the perspective quickly shifts to the developer's choice of license, but that places no expectations on anyone. So let's shift the perspective to the other side. Surely startups and others using open source projects for commercial reasons, even if not legally obligated or expected to by the developers, have some ecosystem responsibility to try to contribute back in some meaningful way when they can.
Acquiring open source projects or hiring their developers are 'influence plays' to gain control, and should not be the only way for commercial projects to contribute.
"We needed something that would work for both github.com and GitHub Enterprise, so we decided to lean on our operational experience with MySQL."
It's just easier to have one single source of truth. Please don't change Redis into a large SQL database. :)
Redis was tremendous in our journey up there, but one of its shortcomings is that it isn't as easy to scale up as Cassandra if you haven't designed your system to scale up on Redis from when it was built (which we didn't). Instead of re-architecting for a redis-cluster setup, we decided to move the component to a clustered microservice written in Go that sits as a memory cache & write buffer in front of Cassandra for hot, highly mutated data.
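The service itself is Go, but the core idea (serve hot keys from memory, absorb mutations there, flush to Cassandra in the background) fits in a short Python sketch; all names here are hypothetical:

    import time, threading

    class WriteBufferCache:
        """In-memory cache + write buffer in front of a slower store (e.g. Cassandra)."""

        def __init__(self, backing_store, flush_interval=1.0):
            self.store = backing_store   # must expose get(key) / put(key, value)
            self.cache = {}              # hot set, mutated in place
            self.dirty = set()           # keys changed since last flush
            self.lock = threading.Lock()
            threading.Thread(target=self._flusher, args=(flush_interval,),
                             daemon=True).start()

        def get(self, key):
            with self.lock:
                if key not in self.cache:
                    self.cache[key] = self.store.get(key)  # fault in from backing store
                return self.cache[key]

        def put(self, key, value):
            with self.lock:
                self.cache[key] = value
                self.dirty.add(key)      # cheap: the write is only buffered here

        def _flusher(self, interval):
            while True:
                time.sleep(interval)
                with self.lock:
                    batch = {k: self.cache[k] for k in self.dirty}
                    self.dirty.clear()
                for k, v in batch.items():   # batched, off the hot path
                    self.store.put(k, v)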
Would anyone be interested in a blog post about our struggles & journey?
Somehow setting up a Redis cluster and doing whatever you have to do to distribute/shard your keys effectively (which afaik is not much) does sound a little more efficient than rewriting the component as a clustered microservice in Go with a Cassandra backend. Redis clustering is actually quite easy.
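For what it's worth, the client side of a Redis Cluster setup really is small. With redis-py 4+, the cluster client discovers the slot-to-node mapping on its own (host name hypothetical):

    from redis.cluster import RedisCluster

    # Connect to any node; the client learns the full slot map from it.
    rc = RedisCluster(host="redis-node-1", port=6379)
    rc.set("user:42:last_seen", "2017-01-18")  # routed to the right shard by key hash
    print(rc.get("user:42:last_seen"))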
Forgive me if I seem grumpy. My recent experiences have caused the "We had a minor issue, so we redid everything in a Totally Cool Super-Neato New Stack That Integrates All The Hiring Manager's Favorite Buzzwords!" perspective to become a bit grating.
Redis is one of the few new pieces of infrastructure over the last 10 years that's truly deserving of its position.
I understand that Cassandra et al exist to solve real problems that someone out there has experienced, and I seek to throw no shade on the great engineers who make these fine products. I am, however, somewhat dubious that these niche products are applicable in the vast majority of cases where they're deployed. I strongly believe, and I think the data would bear this out, that when it gets down to brass tacks, most people are integrating such specialized tools into generic products to either a) make life at the office more exciting; b) beef up resume points for their next job application cycle; or c) both.
Someone in our company wrote a blog post pretending to justify the move to a niche datastore. He's very proud of it and makes several spurious, nonsensical justifications in it. The truth is that MySQL would've been many times more practical along all axes, except the one this guy cares most about, which involves his personal career ambitions.
This move was partially under the radar so objections couldn't be raised and full backups were not properly arranged. It cost the company a lot of money not only in time and infrastructure, but also in the recovery process that had to be undertaken by real data experts (or nearest we had at the time, at least) when the cluster was destroyed by one of his careless scripts. :)
Second nightmare, currently ongoing: shifting everything to docker/k8s, which, for just one example among a very long laundry list of complaints, only got support for directly addressing app servers behind a load balancer last month, as a beta feature (in k8s nomenclature, that's "Version 1.5 has a beta StatefulSets feature to make Pods uniquely addressable from inside the cluster! Don't forget to make a Headless Service and Persistent Volume." Exhausted yet? Just wait.).
Why are we switching to something that lacks such basic functionality (we're like 3 versions behind, so we can't use it)? If I told you, I'd have to kill you, but it sure makes our resumes pretty.
I'm all for learning, experimentation, and doing things for fun. We are on Hacker News after all. I guess I've just developed a taste for a stable production ethos that, to co-opt a scriptural term, is not "blown about by every wind of [tech fad]". I crave a company that makes its decisions based on a significant and real cost-benefit analysis that shows substantial unique benefits and sufficient maturity to a tech before jumping on the bandwagon. I guess I just want some sanity.
When it was initially built, it was basically a bunch of Redis Lua scripts to handle updating the data, running on Redis configured in master/slave mode managed by Sentinels.
Given the nature of the data too, only the data for active users would be hot, but users that were inactive would have stuck around in memory needlessly. Our new system keeps only the hot set of users in memory. We also built it to transparently migrate users from Redis to Cassandra when they were accessed.
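A hedged sketch of that access-time migration, with made-up client names and schema: on a miss in Cassandra, pull the user out of Redis, write it to Cassandra, and delete the Redis copy so each user moves exactly once.

    import json
    import redis
    from cassandra.cluster import Cluster

    r = redis.Redis(host="redis-master")
    session = Cluster(["cassandra-1"]).connect("activity")

    def load_user(user_id):
        row = session.execute(
            "SELECT data FROM users WHERE id = %s", (user_id,)).one()
        if row is not None:
            return json.loads(row.data)
        # Not migrated yet: fall back to Redis and move the record over.
        legacy = r.get("user:%d" % user_id)
        if legacy is None:
            return None
        session.execute(
            "INSERT INTO users (id, data) VALUES (%s, %s)",
            (user_id, legacy.decode()))
        r.delete("user:%d" % user_id)  # migrate each user exactly once
        return json.loads(legacy)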
I personally love MySQL. I just want to understand what makes MySQL a great KV store as opposed to seemingly more specialized systems like Redis.
It's not; it just happens to be good enough, which matters a lot for operational expertise/costs/etc.
For example, you can store hundreds of millions of KV rows in an InnoDB table and still have <1-3ms response times on queries, while having persistence built in. Perfect is the enemy of good enough.
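The KV pattern on InnoDB is just a two-column table with a primary key; here's a sketch with PyMySQL (schema is mine, purely illustrative). Lookups are point reads against the clustered index, which is where response times like that come from.

    import pymysql

    db = pymysql.connect(host="mysql-primary", user="app", database="kv",
                         autocommit=True)

    with db.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS kv (
                k VARBINARY(255) PRIMARY KEY,  -- clustered index: point reads are cheap
                v MEDIUMBLOB NOT NULL
            ) ENGINE=InnoDB""")

    def kv_set(key, value):
        with db.cursor() as cur:
            cur.execute("INSERT INTO kv (k, v) VALUES (%s, %s) "
                        "ON DUPLICATE KEY UPDATE v = VALUES(v)", (key, value))

    def kv_get(key):
        with db.cursor() as cur:
            cur.execute("SELECT v FROM kv WHERE k = %s", (key,))
            row = cur.fetchone()
            return row[0] if row else None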
MySQL has also proven very good at scale (at Facebook, YouTube, Uber, etc.), and there are a lot of people with experience running it.
Uber wrote a very good article about their switch; for them it was more about performance, though the article was a little controversial.
For example, you have 10 people in your organization with various permissions on repos. Some people (the CTO, let's say) can see every repo, while others might only be able to see some repos. Or you might have consultants, or open source projects which non-employees contribute to. Then you construct a graph where each node is a contributor that is connected to other contributors by the permissions they have on repos (or are the repos the nodes and the contributor permissions the connections?). Finally, you run a graph partitioning algorithm where the number of partitions is the number of unique timelines you have to write for an organization. Thinking about an organization with closer to 500 contributors, I can see how this could reduce the number of timelines by 30%.
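To make the dedup concrete: if two contributors can see exactly the same set of repos, they can share a timeline, so the number of timelines to write is the number of distinct visibility sets. A toy sketch of that simplification (data invented):

    from collections import defaultdict

    # contributor -> set of repos they can see
    perms = {
        "cto":        {"api", "web", "infra", "billing"},
        "alice":      {"api", "web"},
        "bob":        {"api", "web"},    # same visibility as alice
        "consultant": {"web"},
    }

    def timeline_groups(perms):
        groups = defaultdict(list)
        for user, repos in perms.items():
            groups[frozenset(repos)].append(user)  # identical sets share one timeline
        return groups

    groups = timeline_groups(perms)
    print(len(groups), "timelines instead of", len(perms))
    # 3 timelines instead of 4: alice and bob share one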
Assembling the inbox on demand is quite interesting. I don't quite understand the querying and operations involved with Roshi for doing that.
Even if Redis was a better fit for some of their use cases, it just makes it much easier to not have the additional persistent database system to manage.
Some of our customers have 25k+ users on their cluster. We know GitLab can scale to 100k users because we run GitLab Enterprise Edition without modifications on GitLab.com.
GitLab.com currently has far more than 100k users, and the performance leaves much to be desired: https://gitlab.com/gitlab-com/infrastructure/issues/947
But we're comfortable that you can run 100k users on a cluster of machines without much tuning.
GitHub seems to have a clustering product that came out a year ago, and it looks like IBM has reported they're running over 13,000 users on it (back in August; I imagine it's closer to 20,000 by now): https://www.ibm.com/blogs/bluemix/2016/08/ibm-internal-githu...
> We know GitLab can scale to 100k users because we run GitLab Enterprise Edition without modifications on GitLab.com
What does "modification" mean in this context? Beyond recommended specifications listed at https://docs.gitlab.com/ee/install/requirements.html#cpu and https://docs.gitlab.com/ee/install/requirements.html#memory? it doesnt list anything above 40,000 users. Beyond that, the HA documentation (https://docs.gitlab.com/ce/administration/high_availability/...) isn't _really_ active/active HA, it says it is but it's not true. True active/active would mean that you wouldn't rely on a shared NFS server, postgres, or redis server.
However, with https://help.github.com/enterprise/2.8/admin/guides/clusteri... it sounds like my organization can scale to well over 100,000 users by adding more clustering nodes instead of trying to figure out independently how to scale services like Redis or Postgres or a shared NFS server.
> without much tuning.
I'd love to hear a comparison between the two products, and also what kind of tuning you've done on gitlab.com to support those user numbers. Would love to see that in the documentation to support the open source way!
Out of curiosity: how many users do you have, and what measurements are you using to determine scalability?
Disclaimer: I work at GitHub.
Interesting. Is there a comparison of overall performance between the intermediate design (w/ Redis) and what they ended up with?
Though having someone sit near me not knowing what I am/we are working on would surprise me. But it does happen especially if there are hotdesks nearby for people from other offices to work on temporarily. I do swear loudly often so probably not a good choice to have those too near me...
It has first-class support in most ORMs, and works quite well.
Also, 3.0 has a better LRU algorithm. https://redis.io/topics/lru-cache
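Turning Redis into that LRU cache is basically two settings; via redis-py, for example (values illustrative):

    import redis

    r = redis.Redis()
    r.config_set("maxmemory", "2gb")
    r.config_set("maxmemory-policy", "allkeys-lru")  # evict least-recently-used keys
    # allkeys-lfu is another option on Redis >= 4.0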
When you need to scale past what Redis can provide, you can move on to a different solution, as GitHub has done.