Hacker News new | comments | show | ask | jobs | submit login

I really question the current trend of creating big, complex, fragile architectures to "be able to scale". These numbers are a great example of why, the entire thing could run on a single server, in a very straight forward setup. When you are creating a cluster for scalability, and it has less CPU, RAM and IO than a single server, what are you gaining? They are only doing 6k writes a second for crying out loud.

They create big, complex, fragile architectures because they started with simple, off-the-shelf architectures that completely fell over at scale.

I dunno how long you've been on HN, but around 2007-2008 there were a bunch of HighScalability articles about Twitter's architecture. Back then it was a pretty standard Rails app where when a Tweet came in, it would do an insert into a (replicated) MySQL database, then at read time it would look up your followers (which I think was cached in memcached) and issue a SELECT for each of their recent tweets (possibly also with some caching). Twitter was down about half the time with the Fail Whale, and there was continuous armchair architects about "Why can't they just do this simple solution and fix it?" The simple solution most often proposed was write-time fanout, basically what this article describes.

Do the math on what a single-server Twitter would require. 150M active users * 800 tweets saved/user * 300 bytes for a tweet = 36T of tweet data. Then you have 300K QPS for timelines, and let's estimate the average user follows 100 people. Say that you represent a user as a pointer to their tweet queue. So when a pageview comes in, you do 100 random-access reads. It's 100 ns per read, you're doing 300K * 100 = 30M reads, and so already you're falling behind by a factor of 3:1. And that's without any computation spent on business logic, generating HTML, sending SMSes, pushing to the firehose, archiving tweets, preventing DOSses, logging, mixing in sponsored tweets, or any of the other activities that Twitter does.

(BTW, estimation interview questions like "How many gas stations are in the U.S?" are routinely mocked on HN, but this comment is a great example why they're important. I just spent 15 minutes taking some numbers from an article and then making reasonable-but-generous estimates of numbers I don't know, to show that a proposed architectural solution won't work. That's opposed to maybe 15 man-months building it. That sort of problem shows up all the time in actual software engineering.)

>They create big, complex, fragile architectures because they started with simple, off-the-shelf architectures that completely fell over at scale.

No, they fell over at "shit we're hitting the limits of our hardware, lets re-architect everything instead of buying bigger hardware". Rather than buy 1000 shitty $2000 servers, buy 2 good $1,000,000 servers. I know it is not fad-compliant, but it does in fact work.

And then you grow by another 50%. If you go the commodity hardware route, you buy another 500 shitty $2000 servers. If you go the big-iron route, you buy another 2 $5,000,000 servers, because server price does not increase linearly with performance. If you're a big site, server vendors know they can charge you through the nose for it, because there are comparatively few hardware vendors that know what they're doing once you get up to that level of performance.

Look, if you're going to make the case that one should buy bigger servers instead of more servers, then this becomes an economic argument. The reason large web-scale companies don't do this is because it outsources one of their core competencies. When they scale horizontally across thousands of commodity machines, then knowledge of their problem domain becomes encoded in the scaling decisions they make and stays internal to the company. When they scale vertically by buying bigger hardware, then they are trading profits in exchange for having someone else worry about the difficulties of building really big, fast supercomputers. It makes life a lot easier for the engineers, but it destroys the company's bargaining position in the marketplace. Instead of having a proprietary competitive advantage, they are now the commodity application provider on top of somebody else's proprietary competitive advantage. If someone wants to compete with them, they buy the same big iron and write a Twitter clone, while if their server vendor wants to raise prices, it has them by the balls since the whole business is built on their architecture.

(I have a strong suspicion that Twitter would not be economically viable on big iron, anyway. They would end up in a situation similar to Pandora, where their existence is predicated on paying large rents to the people whose IP they use to build their business, and yet the advertising revenue coming in is not adequate to satisfy either the business or their suppliers.)

No, you buy another 2 servers at the same price, because performance continues to increase incredibly quickly, and what you got $200,000 2 years ago is now half the speed of what $200,000 gets you.

>When they scale horizontally across thousands of commodity machines, then knowledge of their problem domain becomes encoded in the scaling decisions they make and stays internal to the company.

Or to put it another way: "they create a massive maintenance nightmare for themselves like the one described in the article".

>When they scale vertically by buying bigger hardware, then they are trading profits in exchange for having someone else worry about the difficulties of building really big, fast supercomputers.

You are overestimating the cost of high end servers, or underestimating the cost of low end ones. Again, their existing redis cluster is less RAM, CPU power, and IO throughput than a single, relatively cheap server right now.

>Instead of having a proprietary competitive advantage, they are now the commodity application provider

Twitter is a commodity application provider. People don't use twitter because of how twitter made a mess of their back end. People don't care at all about the back end, it doesn't matter at all how they architect things from the users perspective.

>while if their server vendor wants to raise prices, it has them by the balls since the whole business is built on their architecture.

What do you think servers are? They aren't some magical dungeon that traps people who buy them. If oracle wants to fuck you, go talk to IBM. If IBM wants to fuck you, go talk to fujitsu, etc, etc.

When two of your 2,000 servers die, your load balancers etc kick in and route around the problem.

When two of your two servers die, you ... um, well, you lose money and reputation. Quickly.

If you buy $1 million servers, a whole lot of things needs to go bad in whole lots of ways that would likely take own large numbers of those 2,000 servers too. I'm not so sure I agree with the notion of going for those big servers myself, but having had mid range servers from a couple of the big-iron vendors in house, here's a few of the things you can expect once you tack a couple of extra digits on the server bill:

- Servers that phone home; sometimes the first you know of a potential problem is engineers at your door come to service your server.

- Hot swappable RAID'ed RAM.

- Hot swappable CPU's, with spares, and OS support for moving threads of CPU's that are showing risk factors for failure.

- Hot swappable storage where not just the disks are hot swappable, but whole disk bays, and even trays of hot swappable RAID controllers etc.

- Redundant fibre channel connections to those raid controllers from the rest of the system.

- Redundant network interfaces and power supplies (of course, even relatively entry level servers offers that these days).

In reality, once you go truly high end, you're talking about multiple racks full of kit that effectively does a lot of the redundancy we tend to try to engineer into software solutions either at the hardware level, or abstracted from you in software layers your application won't normally see (e.g. a typical high end IBM system will set aside a substantial percentage of CPU's as spares and/or for various offload and management purposes; IBM's "classic" "Shark" storage system used two highly redundant AIX servers as "just" storage controllers hidden behind SCSI or Fibre Channel interfaces, for example).

You don't get some server where a single component failure somewhere takes it down. Some of these vendors have decades of designing out single points of failure in their high end equipment.

Some of these systems have enough redundancy that you could probably fire a shotgun into a rack and still have decent odds that the server survives with "just" reduced capacity until your manufacturers engineers show up and asks you awkward questions about what you were up to.

In general you're better off looking at many of those systems as highly integrated clusters rather than individual servers, though some fairly high end systems actually offer "single system image" clustering (that is, your monster of a machine will still look like a single server from the application point of view even in the cases where the hardware looks more like a cluster, though it may have some unusual characteristics such as different access speeds to different parts of memory).

I'm pretty sure the insides of your 2 $1,000,000 servers are architecturally essentially identical to a large cluster of off-the-shelf computers, with somewhat different bandwidth characteristics, but not enough to make a difference. They'll have some advanced features for making sure the $1,000,000 machine doesn't fall over. But it's not like they're magically equipped with disks that are a hundred times faster or anything, you're still essentially dealing with lots of computing units hooked together by things that have finite bandwidth. You can't just buy your way to 25GHz CPUs with .025ns RAM access time, etc. etc.

I'm not sure what you are trying to say. My entire point was that you can buy a single server that is more powerful than their entire cluster. You appear to agree, but think that is a problem?

Massively parallel single-machine supercomputers are still, essentially, distributed systems on the inside. You still have to use many of the same techniques to avoid communicating all-to-all. If you treat such a system as a flat memory hierarchy, your application will fall down.

True that they're still essentially distributed systems on the inside, but the typical bandwidth can often be orders of magnitudes higher, and the latencies drastically lower when you need to cross those boundaries, and for quite a few types of apps that makes all the difference in the world to how you'd architect your apps.

"you can buy a single server that is more powerful than their entire cluster." is pure genius.

We now know how to solve the C10k problem in orders of magnitude - use a single server!

$1,000,000 machines are not that much faster, you won't get 1000x the performance or anywhere near. The memory and IO performance is going to be within a factor 2 or 4 of that high end machine. It might have 50x the cores, but most likely that's not the limiting factor anyway.

The whole point is you can get equal performance from a single server instead of a ton of little ones. The ton of little ones forced them to totally re-architect to work around the massive latency between servers. A single server would have allowed them to stick with a sane architecture, and saved them millions in development time and maintenance nightmares.

It would be nice if were true, but you simply can't, there's no magic that makes an expensive server that much faster - it's just a bit faster for a lot more money. It can make sense if you only want 10x the performance and the server is cheaper than the rewrite.

For example, if a $30k car can go 150mph, it doesn't mean a $300k car can go 1,500mph it just doesn't happen. A Bugatti Veyron goes, what? 254mph that's not even double (and it costs a lot more than $300k)

We're not talking about cars. We're talking about computers. 8TB of RAM is 8TB of RAM, it doesn't get better by spreading it across a thousand servers. 4096 CPU cores are 4096 CPU cores, they don't get better by spreading them across 1000 servers. Those things get worse spreading them across servers, because you massively increase the latency to access them, and for them to access shared data.

Please give an example of this "monster server" you keep talking about with 8 TB of RAM, 4096 cores, N network interface cards, and 100 TB of SSD, with a cost estimate. Otherwise we can't have a real discussion. People have a pretty good idea of how to build / what it costs to build something with 128 GB of RAM, 32 cores, a couple of NICs, and 2 TB of SSD, but what you're talking about is 50-100x beyond that.

Not posting this to support his argument, but for the record some of the high end unix hardware available (for a price, no idea what these cost):

32TB RAM 1024 Cores (64 x 16 core), 928 x PCI Express I/O slots: http://www.oracle.com/us/products/servers-storage/servers/sp...

16TB RAM 256 cores (probably multiple threads per core), 640 x PCIe I/O adapters: http://www-03.ibm.com/systems/power/hardware/795/specs.html

4TB RAM 256 cores (512 threads), 288 x PCIe adapters: http://www.fujitsu.com/global/services/computing/server/spar...

And exactly to his point - you still can't treat these as one uniform huge memory / computational space for your application (these machines seem designed for virtualization rather than one huge application). You run into the same distributed computing issues you would with your own hardware, just with a 5/10x larger initial investment and without a huge amount of pricing control / flexibility in terms of adding capacity / dealing with failures as they arise.

Actually you can treat these as one uniform huge memory / computational space for your application. They're not meant only for virtualisation. In particular, the Oracle Database is a perfect fit for a system with thousands of cores and terabytes of memory.

It's true that for some use cases, you'd be better off carving it up using some form of virtualisation, but it isn't a requirement to reap the benefits of a massive system.

Both the Solaris scheduler and virtual memory system are designed for the kind of scalability needed when working with thousands of cores and terabytes of memory.

You also don't run into the same distributed system issues when you use the system that way.

You also do actually have a fair amount of flexibility in dealing with failures as they arise. Solaris has extensive support for DR (Dynamic Reconfiguration). In short, CPUs can be hot-swapped if needed, and memory can also be removed or added dynamically.

Just for a reference point, an appropriately spec'd big iron machine from a major supplier with 8tb of RAM will run you $10 to $20 million. There's no mainstream commercial configuration that is going to get you to 4096 cores though.

Fujitsu's SPARC Enterprise M9000 mentioned in another reply is $5 to $10 million depending on configuration (assuming you want a high-end config).

If you go big iron with any supplier worth buying from, they will absolutely murder you on scaling from their base model up the chain to 4tb+ of memory. The price increases exponentially as others have noted.

The parent arguing in favor of big iron is completely wrong about the economics (by a factor of 5 to 10 fold). The only way to ever do big iron as referenced, would be to build the machines yourself....

just fyi, 2.8 million for 4096 core 16TB http://news.cnet.com/8301-30685_3-20019153-264.html?part=rss...

SGI won't build you that computer for commercial use for $2.8 million. It would cost you several fold more.

5.6 million. You need a hot standby. Or 8.4 if you're feeling cautious :)

Why should he give an example of a "monster server" with specs like that?

He gave the argument that spreading the CPU's and memory out does not make them better.

So part of the point is that if the starting point is 8TB of RAM and 4096 cores distributed over a bunch of machines, then a "large machine" approach will require substantially less.

I've not done the maths for whether or not a "Twitter scale" app would fit on a single current generation mainframe or similar large scale "machine" (what constitutes a "machine" becomes nebulous since many of the high end setups are cabinets of drawers of cards and on persons "machine" is another persons highly integrated cluster), but it would need to include a discussion of how much a tighter integrated hardware system would reduce their actual resource requirements.

I'm sure you could get vastly better performance. But this is not about performance.

What happens when you need multiple datacenters? How about if you need to plan maintenance in a single datacenter? Therefore you need at least two servers in each datacenter, etc.

Let's say you decide to use smaller machines to serve the frontend but your backend machines are big iron. Are you going to perform all your computation and then push the data out to edge servers?

There's much more to this then loading up on memory and cores.

Think about fault tolerance with 2 servers.

You are right to be sceptical, and I think you're right about throughput: they mentioned that all tweets are ~22MB/sec. My hard disk writes 345MB/sec. A modern multicore processor should probably tear through that datastream.

However, you should realize that in a decent sized organization there are usually different teams working on different things that are evolving at different rates. You have to manage these people, systems and their changes over time. After awhile, efficiency drops in priority versus issues such as manageability and security... often due to separation of concerns requirements.

Beyond the build and maintenance issues, a serious challenge for real time services is high availability (HA). Tolerance for hardware/software/network/human failings must be built in. That also throws efficiency out the window. It's cheaper to treat cheap, easily replaceable machines as a service-oriented cluster than to build two high performance machines and properly test the full range of potential failover scenarios.

I hope this comment is a troll and you don't actually believe that you could run Twitter on a single server. If you are interested in why this is true I suggest you get a job at a company that has to serve consumer scale web traffic.

The fact that you think "consumer scale web traffic" is some magical thing is exactly what I am talking about. Have you ever heard of TPC? They have done benchmarks of database driven systems for a long time. TPC-C measures performance of a particular write query, while maintaining the set ratio of other active queries. The top non-clustered result right now does 142,000 new orders per second. Yes, a single server can handle 300k reads and 6k writes per second.


I appreciate that you believe what you are saying, but TPC-C doesn't measure anything at all relevant. Having worked both on the enterprise side building WebLogic and on the consumer side at Yahoo and Twitter I can tell you definitively that you are wrong about the applicability of that benchmark to serving large scale web applications. The database and server you are talking about could not do 300k timeline joins per second, or 300k graph intersections per second or virtually any of the actual queries that Twitter needs to operate. All reads are not created equal. Checking the latency profile, it is horrendous even for these simple transactions — hundreds of milliseconds! Worse, there is no where to go as the service grows.

It would be great if you were right and all the big web companies are wrong. I can assure you that it isn't the case.

sam don't waste your time this is a joke

sup nick

One of the people I work with was at eBay from 2000 to 2009 and they did what you talk about - have one giant database server. Some Sun monstrosity.

Then they ran out of horsepower on that one server. Let me tell you, the stories are pretty horrible.

So you know who you are replying to: http://www.crunchbase.com/person/sam-pullara

Does TPC require durability? It seems like you could get even better performance (or perhaps that's what they do) if you just got something like a Fujitsu M10-4S Server and stuffed 32TB of ram in it.

>Does TPC require durability?


I think that in a lot of ways it's because of the roots of the organizations in question. Instead of saying "yeah, we have money, we can go buy this solution and trust in the vendor to support us" (an objectively viable solution in many situations) we instead say "we've got $100,000 in seed money and one engineer plus some friends she's convinced to help out pro bono, that won't buy us a sideways glance from Oracle".

Once you've survived the first year or two and gained some traction, suddenly the decision to investigate other solutions becomes a break from tradition -- and no engineer wants to throw away their work. It's incredibly hard to do a full stop, look around, and reevaluate your needs to see if you should just go in a different direction. I'm not sure I've ever seen it happen, really; we always just bow to the organic growth and inertia that we've established.

There is also a strong tradition of Not Invented Here (NIH) at work in this ecosystem. If it wasn't written by one of your peers (other startup people) then it's just not very cool. Use a Microsoft or Oracle product? Hahaha, no. Open source or bust!


To be fair, though, I've never worked for a startup that could afford that kind of software, so I guess the point is moot. I'm not spending 20% of my startup's cash reserves on software licenses to Oracle, when I can instead use that $2m to hire a few people to build it for me and know it inside and out. Plus, then they're invested in my company.

Also, please don't think I'm denigrating open source software with this commentary. I just think that the kind of zealotry that precludes even considering all of the options is, in general, a bad business decision, but one we seem to make all of the time.

Your expensive vendors sit on security patches for weeks and my free ones shoot them out almost immediately. Why is that not a business concern?

I don't think it's zelotry, there are a lot of advantages to open source software you aren't going to get from Microsoft. If you are an edge case and you happen to stumble on that race condition bug are you going to have your engineers black box test and reverse engineer someone elses product illegally while they wait around for Microsoft support or have them look under the hood, patch the bug and move on?

What the big licenses fees get you is accountability, which is of course a huge thing, but what open source gives you is control.

I think if your business is software, then open source makes perfect business sense, especially if one of your assets is a team of competent engineers.

You're arguing for scaling up (vertical) instead of scaling out (horizontal). Both are valid approaches. Scaling out is preferred because your architecture is mode modular and you do not have to constantly buy bigger machines as your usage grows; you just add additional machines with similar capacity. The main problem with scaling up is that price and performance are not linearly related and eventually you will be limited by the performance available to one system. But it's a perfectly valid approach for certain scenarios.

It's not as simple as 6k writes - it's 6k write requests. Most of these 'write requests' are tweets which must be written to the timelines of hundreds or thousands (or in some cases millions!) of followers.

If you try to run it all on a single server, and Obama and Justin Bieber tweet at the same time, you suddenly have a backlog of 75 million writes to catch up on. Now imagine what happens if Bieber starts a conversation thread with any of the other 50 users with 10m+ followers.

Agreed that this is an issue, but it's a smaller one than you make out: Consider that there's likely a marked hockey stick going on here. Now keep the very most recent tweets of the very "worst" in terms of followers in a hot cache, and mix their most recent tweets into the timelines of their followers as needed until they're committed everywhere. Then collapse writes, and you reduce the "conversation" problem.

It's still not cheap to handle, but consider that fan-out on write is essentially an optimization. You don't need that optimization to be "pure" in that you don't need to depend on the writes being completed in a timely fashion - there are any numbers of tradeoffs you can make between read and write performance.

They may average 6K w/s, but history has shown higher peaks, and those peaks often include tweets from those with large 30e6+ followers -- large events naturally have celebrities chiming in. Having more CPU [cores], RAM, and IO is not going to solve for the need to shard your Redis, which is bound by a single core. For the fanout, that is anywhere from 90 to 90e3 cores at 2e6 Redis IOPs -- 1-6e3 Lady Gagas per second. Before pipelining, which you'd probably consider complex or fragile. All you've done is moved some networking parts around. Your code and topology hasn't really changed.

>Having more CPU [cores], RAM, and IO is not going to solve for the need to shard your Redis, which is bound by a single core

But they chose redis in the first place as part of their messy, crummy pile of barely working junkware solution. You wouldn't use redis in the first place if you were building a solution that scales up.

So, now we also have to assume per-core software and support licenses to accompany our purchase of several $10M racks? What platform are we talking about now? Presumably I have to also purchase the IDE and toolchain licenses, per developer, too?

Twitter may have and will still make missteps, but you're being egregiously vague and unhelpful. Your comments are as vaporous as you make Twitter's architecture to be.

You're funny. I like you.

I do consider it a pathology when tiny services, or tiny apps in a corporate structure, act like they have the problems of Google.

You are not Google.

You do not have Google's problems.

You do not have scaling issues.

For you, N is small and will stay small.

Stop giving me this delusional resume-padding garbage to implement. For you, here, it is delusion and lies.

The point here is that Twitter really is one of the cases where vertical scaling is not, on the balance of non-functional requirements to engineering overhead, the right decision. They really do need to pay the complexity piper.

Oh, yeah. I'm speaking to everyone else :-)

I remember asking a question about scaling on SO and getting a response about how my data was small, because it could fit on an SSD (for that one small component)

I miss not having real scaling issues sometimes. They make my head hurt.

That's probably 6k writes to a "normal" DB. The fanout is handled by Redis which I doubt is included in the 6k writes. Not if you want to fan out to 30 million followers in under 5 seconds.

The "fanout" or "wasteful duplication of a single message 30 million times" is only required because they are using tiny underpowered hardware to begin with. The approach that they claim can't possibly work actually does work. You just can't do it on a $2000 "server".

The messages are not repeated in fanout, just the ids. I guess you would know that if you actually the links.

That still means 30 million writes instead of one. That the size of each write is smaller is not going to help you all that much if each write still worst case ends up forcing you to rewrite at least one disk sector.

While I'm inclined to favour a fanout approach, a "full" fanout can easily be incredibly costly on the write side.

Let me guess, you work for Oracle?

The property of distributed message passing systems that everyone likes is the arbitrary scalability. Subsystem performing slowly? Add more nodes! The cost of this arbitrary scalability, as you have noticed, is brittleness, single points of failure, and complexity that will drive even the most seasoned of engineers nuts.

In the end though, the users want their tweets and they don't care how it works. Complexity is why we make so much money, after all.

Something is not easy to say simple or complex by just looking at the shape or counting the number of them.

One big powerful server's architect and circuit design is not that simple as what it looks like, in the other hand, 2000 standard servers are not that complex as what they are.

The way of how you think is the key. You can think that the 2000 servers is a big computer cluster, but I prefer to think each one of them is a simple replaceable black box.

I can show you some scenarios at the following to explain why 2000 servers make more sense than one big machine,

Firstly, When I want to upgrade the system to deal with a suddently increased load I don't need to call vendor to arrange an onsite upgrade service, I can do it by connecting more servers right away, and when the load decreased, then I can disconnect some servers. This makes the maintenance job is more flexible and more convenient.

The second one is when I estimate the performance of the whole system, I can focus on estimating the performance for one server first, and then add them on, even there are some variables need to be involved into the calculation, it is still more clear and simpler than estimating the performance by looking at the vendor provided system specification.

The last scenario is that you cannot separate one big machine into different geolocations to handle the access all over the world, but seperate 2000 servers is much easier and doable. Moreover, different geolocation deployment can provide a genric 24*7 online service as some nature disasters happens.

When we engineer things, we should split one big(complex) thing into multiple small parts and keep it with a simplest structure or function, and later, we connect these small parts together.

English is not my first language, so if somewhere sounds wired, please excuse me.

Maybe Twitter would indeed be better off using a smaller number of more beefy servers. But at the software level, I wonder if the architecture would be very different.

Suppose you implemented something like Twitter as a single large application, running in a single process. Maybe run the RDBMS in its own process. The app server stores a lot of stuff in its own in-process memory, instead of using Redis or the like. Now, what happens when you have to deploy a new version of the code? If I'm not mistaken, the app would immediately lose everything it had stored in memory for super-fast access. Whereas if you use Redis, all that data is still in RAM, where it can be accessed just as fast as before the app was restarted.

You'd use failover to do a rolling upgrade.

Because if you want to have reasonable availability and low latency you can't just buy two "big" machines.

Going with the incrementally sized approach gives you more flexibility to economically distribute the risk, in addition to the load.

They are only doing 6k writes a second for crying out loud.

And 600k reads per second (as of when this article was written), which seems like an important thing to leave out.

300k reads, which is approaching trivial. You can do that with an array of desktop SSDs. But in reality, it would be almost all in RAM anyways making those reads incredibly cheap.

Right, because no matter how much RAM you put in a machine, access is always the same speed.

Two thoughts: scaling down can be just as important as scaling up, and cluster nodes that need to talk to each other are probably less common than stateless cluster nodes that need to talk to the client.

How do you expect to serve 150 million concurrent users on a single server?

They are not serving 150 million concurrent users. They have 150 million active users. As in, people who do in fact use twitter at all, as opposed to the millions of dead accounts nobody touches. They are not all being served at once.

Do you hit the disk when somebody, say, checks my Twitter profile that I haven't updated since 2008? What will that do to your performance?

I'm not sure why you are asking me about this.

Because he/she hasn't read the article.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact