I dunno how long you've been on HN, but around 2007-2008 there were a bunch of HighScalability articles about Twitter's architecture. Back then it was a pretty standard Rails app where when a Tweet came in, it would do an insert into a (replicated) MySQL database, then at read time it would look up your followers (which I think was cached in memcached) and issue a SELECT for each of their recent tweets (possibly also with some caching). Twitter was down about half the time with the Fail Whale, and there was continuous armchair architects about "Why can't they just do this simple solution and fix it?" The simple solution most often proposed was write-time fanout, basically what this article describes.
Do the math on what a single-server Twitter would require. 150M active users * 800 tweets saved/user * 300 bytes for a tweet = 36T of tweet data. Then you have 300K QPS for timelines, and let's estimate the average user follows 100 people. Say that you represent a user as a pointer to their tweet queue. So when a pageview comes in, you do 100 random-access reads. It's 100 ns per read, you're doing 300K * 100 = 30M reads, and so already you're falling behind by a factor of 3:1. And that's without any computation spent on business logic, generating HTML, sending SMSes, pushing to the firehose, archiving tweets, preventing DOSses, logging, mixing in sponsored tweets, or any of the other activities that Twitter does.
(BTW, estimation interview questions like "How many gas stations are in the U.S?" are routinely mocked on HN, but this comment is a great example why they're important. I just spent 15 minutes taking some numbers from an article and then making reasonable-but-generous estimates of numbers I don't know, to show that a proposed architectural solution won't work. That's opposed to maybe 15 man-months building it. That sort of problem shows up all the time in actual software engineering.)
No, they fell over at "shit we're hitting the limits of our hardware, lets re-architect everything instead of buying bigger hardware". Rather than buy 1000 shitty $2000 servers, buy 2 good $1,000,000 servers. I know it is not fad-compliant, but it does in fact work.
Look, if you're going to make the case that one should buy bigger servers instead of more servers, then this becomes an economic argument. The reason large web-scale companies don't do this is because it outsources one of their core competencies. When they scale horizontally across thousands of commodity machines, then knowledge of their problem domain becomes encoded in the scaling decisions they make and stays internal to the company. When they scale vertically by buying bigger hardware, then they are trading profits in exchange for having someone else worry about the difficulties of building really big, fast supercomputers. It makes life a lot easier for the engineers, but it destroys the company's bargaining position in the marketplace. Instead of having a proprietary competitive advantage, they are now the commodity application provider on top of somebody else's proprietary competitive advantage. If someone wants to compete with them, they buy the same big iron and write a Twitter clone, while if their server vendor wants to raise prices, it has them by the balls since the whole business is built on their architecture.
(I have a strong suspicion that Twitter would not be economically viable on big iron, anyway. They would end up in a situation similar to Pandora, where their existence is predicated on paying large rents to the people whose IP they use to build their business, and yet the advertising revenue coming in is not adequate to satisfy either the business or their suppliers.)
>When they scale horizontally across thousands of commodity machines, then knowledge of their problem domain becomes encoded in the scaling decisions they make and stays internal to the company.
Or to put it another way: "they create a massive maintenance nightmare for themselves like the one described in the article".
>When they scale vertically by buying bigger hardware, then they are trading profits in exchange for having someone else worry about the difficulties of building really big, fast supercomputers.
You are overestimating the cost of high end servers, or underestimating the cost of low end ones. Again, their existing redis cluster is less RAM, CPU power, and IO throughput than a single, relatively cheap server right now.
>Instead of having a proprietary competitive advantage, they are now the commodity application provider
Twitter is a commodity application provider. People don't use twitter because of how twitter made a mess of their back end. People don't care at all about the back end, it doesn't matter at all how they architect things from the users perspective.
>while if their server vendor wants to raise prices, it has them by the balls since the whole business is built on their architecture.
What do you think servers are? They aren't some magical dungeon that traps people who buy them. If oracle wants to fuck you, go talk to IBM. If IBM wants to fuck you, go talk to fujitsu, etc, etc.
When two of your two servers die, you ... um, well, you lose money and reputation. Quickly.
- Servers that phone home; sometimes the first you know of a potential problem is engineers at your door come to service your server.
- Hot swappable RAID'ed RAM.
- Hot swappable CPU's, with spares, and OS support for moving threads of CPU's that are showing risk factors for failure.
- Hot swappable storage where not just the disks are hot swappable, but whole disk bays, and even trays of hot swappable RAID controllers etc.
- Redundant fibre channel connections to those raid controllers from the rest of the system.
- Redundant network interfaces and power supplies (of course, even relatively entry level servers offers that these days).
In reality, once you go truly high end, you're talking about multiple racks full of kit that effectively does a lot of the redundancy we tend to try to engineer into software solutions either at the hardware level, or abstracted from you in software layers your application won't normally see (e.g. a typical high end IBM system will set aside a substantial percentage of CPU's as spares and/or for various offload and management purposes; IBM's "classic" "Shark" storage system used two highly redundant AIX servers as "just" storage controllers hidden behind SCSI or Fibre Channel interfaces, for example).
You don't get some server where a single component failure somewhere takes it down. Some of these vendors have decades of designing out single points of failure in their high end equipment.
Some of these systems have enough redundancy that you could probably fire a shotgun into a rack and still have decent odds that the server survives with "just" reduced capacity until your manufacturers engineers show up and asks you awkward questions about what you were up to.
In general you're better off looking at many of those systems as highly integrated clusters rather than individual servers, though some fairly high end systems actually offer "single system image" clustering (that is, your monster of a machine will still look like a single server from the application point of view even in the cases where the hardware looks more like a cluster, though it may have some unusual characteristics such as different access speeds to different parts of memory).
We now know how to solve the C10k problem in orders of magnitude - use a single server!
For example, if a $30k car can go 150mph, it doesn't mean a $300k car can go 1,500mph it just doesn't happen. A Bugatti Veyron goes, what? 254mph that's not even double (and it costs a lot more than $300k)
32TB RAM 1024 Cores (64 x 16 core), 928 x PCI Express I/O slots:
16TB RAM 256 cores (probably multiple threads per core), 640 x PCIe I/O adapters:
4TB RAM 256 cores (512 threads), 288 x PCIe adapters:
It's true that for some use cases, you'd be better off carving it up using some form of virtualisation, but it isn't a requirement to reap the benefits of a massive system.
Both the Solaris scheduler and virtual memory system are designed for the kind of scalability needed when working with thousands of cores and terabytes of memory.
You also don't run into the same distributed system issues when you use the system that way.
You also do actually have a fair amount of flexibility in dealing with failures as they arise. Solaris has extensive support for DR (Dynamic Reconfiguration). In short, CPUs can be hot-swapped if needed, and memory can also be removed or added dynamically.
Fujitsu's SPARC Enterprise M9000 mentioned in another reply is $5 to $10 million depending on configuration (assuming you want a high-end config).
If you go big iron with any supplier worth buying from, they will absolutely murder you on scaling from their base model up the chain to 4tb+ of memory. The price increases exponentially as others have noted.
The parent arguing in favor of big iron is completely wrong about the economics (by a factor of 5 to 10 fold). The only way to ever do big iron as referenced, would be to build the machines yourself....
He gave the argument that spreading the CPU's and memory out does not make them better.
So part of the point is that if the starting point is 8TB of RAM and 4096 cores distributed over a bunch of machines, then a "large machine" approach will require substantially less.
I've not done the maths for whether or not a "Twitter scale" app would fit on a single current generation mainframe or similar large scale "machine" (what constitutes a "machine" becomes nebulous since many of the high end setups are cabinets of drawers of cards and on persons "machine" is another persons highly integrated cluster), but it would need to include a discussion of how much a tighter integrated hardware system would reduce their actual resource requirements.
What happens when you need multiple datacenters? How about if you need to plan maintenance in a single datacenter? Therefore you need at least two servers in each datacenter, etc.
Let's say you decide to use smaller machines to serve the frontend but your backend machines are big iron. Are you going to perform all your computation and then push the data out to edge servers?
There's much more to this then loading up on memory and cores.