Hacker News new | comments | show | ask | jobs | submit login
The Architecture Twitter Uses to Deal with 150M Active Users (highscalability.com)
316 points by aespinoza on July 8, 2013 | hide | past | web | favorite | 162 comments

Since tweets are very much like financial ticks (fixed size, tiny), we can put the numbers in perspective. Let's compare Twitter vs say, OPRA (North American options), a single "exchange" feed.

Twitter: 300K QPS, firehose 22 MB/sec, 400 million tweets/day

OPRA: 12.73M MPS, firehose 3.36 GB/sec, 26.4 billion messages/day


edit: Also worth noting, the diff between OPRA 1/1/13 & 7/1/13 is ~256 MB/sec, 2.0 billion messages/day. So in just 6 months the existing firehose increased roughly 10x Twitter's entire output and roughly 5x'd the number of messages/day.

Tweets are fanned out to more than a single feed and, in the "most important" cases, millions of feeds -- 31 million feeds for Lady Gaga. Your comparing 300K reads, which might not even include the sub-queries, to 12.7M writes. A single tweet from a single user would trump the writes; it's unclear whether it could be done in the same one-second window.

Twitter already does 30B Timeline deliveries a day, compared to the 26.4B of OPRA. Again, there is no telling what can be implied by a "Timeline delivery;" does it include pulling user and other secondary and tertiary objects? An HTML renderer? The doesn't say much for the capabilities and focuses on what they do.

There's also little comparison to be made on what powers OPRA. If Twitter can simply add nodes to their architecture, are they doing it wrong?

Securities work a similar same way -- When you're looking at your portfolio you're only directly monitoring the individual fields for the securities in your portfolio, not the entire firehose. The OPRA feed is an equivalent to the raw Twitter firehose, only much fatter. Once you integrate it into the last-mile display to the users, you're doing all the same things. This is typically done with multicast topic-style subscriptions by ticker / type of data. It makes sense that if Lady Gaga writes a tweet you only want it to be "1 timeline delivery" (multicast) as opposed to "31 million timeline deliveries", which is making your infrastructure do a lot more work. Granted, you're kind of limited by what the browser can do in this regard, so you're kind of stuck with the socket model.

I think we're conflating topics. While the Firehose can be done with some combination of Pub-Sub/Topic/Sub-Topic Fanout, it should have little to do with the QPS for Timeline fanouts. I'd imagine the Firehose footprint is a really small part of their architecture or throughput pains. Timeline is a per-user join on multiple graphs and with paging.

Securities feeds are, I suspect, more write-dominant and have a much lower fanout rate. They also don't have the outlier-retweeting-outlier or outlier-tweeting-@-outlier problems.

Yes, much more write-dominant. Apps which are built on top of the feeds can create issues, though. E.g., An obscure ticker pasted into a chat room with 500 people instantly starts monitoring in all their windows -- not just the static price (equiv to a RT), but the live feed. That would be as if A RT'd B and suddenly all of A's followers added B to their timeline. The degree to which that happens depends on how many features like that are integrated into the app.

I have no idea why you're getting so much pushback. Products like those offered by Bloomberg ingest and distribute massive amounts of data in real time, as well. The comparison is completely suitable.

While their products for investment managers are mostly run off of persistent databases, the trader terminals rely on a high-volume, nebulous fan out. For many traders, a five second latency is unacceptable.

Incredibly interesting talk and a good write up. Twitter continues to impress! I was surprised to see Redis playing such a critical role, too.

Sometimes the pushback is right, sometimes it's wrong. He's been very good at explaining the different and heavy requirements of financial data feeds.

It's not the data coming in that makes things hard. It's the fanout of messages into their subscribers' timelines.

I've read before that Twitter uses NoSQL (or non-relational databases) and that each tweet requires a copy to be pushed to each follower every time someone tweets (same for updates or deletes). I am of a very much traditional RDBMS mindset though I am willing to bend a bit :) ... but this idea of fanning out tweets (copying tweets) to each and every follower just sounds grossly inefficient. The amount of writes must be exponential.

The way I see it, here are the WRITES they'll ever need:

1:User -> n:Users (Followers) (Temporal links with effective dates for follow history)

1:User -> n:Tweets (Temporal with tweet datetime) - this includes Retweets.

The rest are just READS (pulls instead of fanning out the same data).

What am I missing here folks? :-) Educate me please.

The entire tweet is not copied; it is a per-user list [0] of 16B to 28B records [1] -- IDs to the tweet and tweet author and sometime adhoc bits and retweet ID.

[0] http://www.infoq.com/resource/presentations/Twitter-Timeline... [1] http://www.infoq.com/resource/presentations/Twitter-Timeline...

As for what you might be missing, a user will need the latest, for example, 50 tweets from only the 200 people they follow....

Do we pull in 200 * 50 latest-per-user [cached] tweets lists? Now we might have network IO issues; it's also slower and on every read; could possibly ignite some hot spots for certain read sites. We can solve for these things, but they are just different problems and not necessarily easier nor simpler.

Do we walk a [timestamp] range index with a composite of user IDs? We'll have to shard this index with 200M tweets/day, and there's not an easy way to keep the necessary joins or unions local or making sense of how many shards I should query asynchronously before discarding extras. I'll probably have to make several roundtrips with a supervising process, between synchronous fetches.

Then there is paging...

There is probably a sweet spot for not doing the fanout for those with large followings, which they touch on at the bottom of The Future. Ultimately, you have to pay the price somewhere on read or write; they've simply picked write, as it apparently is more predictable and manageable on their working set.

I think what you're missing is simply that they have explicitly decided not to compute the home timeline on request; they've decided that they need it to be precomputed and constantly updated.

300K QPS is pretty impressive. If you'd want to sustain that rate on a single modern machine you'll have to fit all your processing into about thirty LLC misses per request.

I admit the QPS/MPS thing doesn't really fit. I mainly wanted to illustrate the data rates of the raw feeds. Twitter has one of the largest raw feeds of the web world, but it's useful to know how that compares to other non-web feeds.

> Twitter no longer wants to be a web app. Twitter wants to be a set of APIs that power mobile clients worldwide, acting as one of the largest real-time event busses on the planet.

Wait, then why are they actively destroying their third-party app ecosystem...?

The piece did not say "third party". I'm assuming these clients are the official Twitter mobile apps.

Probably their mobile clients. After I read the article perhaps Twitter felt that having control of the clients reduces variability in stress to their system. Like bad actors accessing the firehose suboptimally.

The two are not the same. It sounds like they are becoming API driven to support multiple interfaces and to put clear boundaries between systems.

Sounds like html. Or qt. Not 140 characters.

Yeah, somebody hasn't gotten the memo it seems.

This article cites its source as a talk by Twitter VP Raffi Krikorian. 38 minutes of video, audio, and slides are at http://www.infoq.com/presentations/Twitter-Timeline-Scalabil...

Thank you for sharing it.... I was wondering about the video.

I also enjoyed this presentation: http://www.infoq.com/presentations/Timelines-Twitter

It goes into more depth about how they handle timelines.

> Your home timeline sits in a Redis cluster and has a maximum of 800 entries.

Wow, that's pretty cool. Congrats to antirez - it must be a nice feeling knowing that your software powers such a big system!

Thank you David, I'm very happy indeed that many companies are using Redis to get work done, this is basically one of the biggest reasons why after 4 years I'm not giving up...

As you know I'm usually not the kind of guy focused to the task at hand for more than a limited timeframe, and then I've the temptation to switch to something else, but this time I'm finding the right motivations in the big user base.

Redis is probably the most useful tool powering the internet after nginx. It really is an amazing piece of engineering.

I have not used nginx, and have never heard of it spoken about in such glowing terms. Can you maybe compare/contrast it with apache?

The biggest difference is that nginx uses an evented I/O, and handles many connections per thread/process, Apache can only handle one at a time. This allows it to to use far less memory per connection, and lets it perform reasonably under very high loads. It also has very low latency, even for small loads, and is relatively easy to install and configure on most platforms.

nginx does an excellent job as a reverse proxy for applications, there are many configurations where nginx acts as a load balancer, and serves static content, and everything else is passed off to a real application. It's also useful if you're running on a tiny VPS with very little RAM.

However, Apache has a few features that nginx lacks, like embedding languages into it like mod_php does, and per directory configuration, in that directory.

I knew nginx was good for high loads but wasn't sure why and for what else. Makes sense, thanks a lot for the explanation.

>I have not used nginx, and have never heard of it spoken about in such glowing terms.

Seriously? There are lots of glowing blog posts, articles etc about Nginx all the time, including on HN. Just last week or so, it was reported that it powers the majority of the top-1000 biggest sites.

Yeah I guess what I mean was the post I was replying to called it the most useful tool powering the internet. I have not seen that before. I also like to ask people on HN who seem extremely passionate about something to explain why they are passionate because you often learn something new, that you often would not from articles etc.

nginx is fast and lightweight. Apache is a bloated mess in comparison. That's about it.

Late reply, but just saw this. So would you go as far as saying it never makes sense to use apache anymore? Is there any negative tradeoff to nginx?

You forgot the JVM.

Yup that, Eclipse, vi, Apache, MySQL and Oracle are my bread and butter. Never used nginx or Redis.

He said useful not ubiquitous.


Redis is memcache only better.

What's better about Redis? Many caching strategies require the storage to auto expire, memcache does this automatically. Redis does not do this (I believe). Is it significantly faster?

A simple Google search defeats your belief


> Options

> EX seconds -- Set the specified expire time, in seconds.


> EXPIREAT has the same effect and semantic as EXPIRE, but instead of specifying the number of seconds representing the TTL (time to live), it takes an absolute Unix timestamp (seconds since January 1, 1970)

In addition, the Redis config file allows you to specify an eviction policy when the maximum memory limit is reached:

  # volatile-lru -> remove the key with an expire set using an LRU algorithm
  # allkeys-lru -> remove any key accordingly to the LRU algorithm
  # volatile-random -> remove a random key with an expire set
  # allkeys-random -> remove a random key, any key
  # volatile-ttl -> remove the key with the nearest expire time (minor TTL)
  # noeviction -> don't expire at all, just return an error on write operations
Source: https://raw.github.com/antirez/redis/2.6/redis.conf

I apologize for not being specific. Setting timed expiration is not what I meant. Memcache will fill the entire cache and auto select items to be destroyed, based on current resource availability and order of entry into the cache (FIFO). I want that behaviour. I do not want to set and manage the time things expire.

It's all the functionality of memcache, plus persistence and data types like lists or sorted sets. You have much more control of your data.

It's the swiss army knife of in-memory data stores.

This[0] StackOverflow answer nicely lays this fact out.

[0]: http://stackoverflow.com/a/11257333/191438

Memcache is probably faster. But not as fast as writing data to /dev/null.

Figuratively, Twitter have switched from doing a design rooted in Databases 101 (OLTP) to a design more rooted in Databases 102 (OLAP).

That is, they moved processing from query time to write time. And that's a perfectly legitimate strategy; it's the basis of data warehousing.

OLTP is about write-time speed. It's great for stuff like credit card transactions, where you only really care about the tally at the end of the banking day. The dominant mode of operation is writing.

OLAP is about read-time speed. You do a bunch of upfront processing to turn your data into something that can be queried quickly and flexibly.

One thing that's problematic about the teaching of database technology is that the read/write balance isn't explicitly taught. Your head is filled with chapter and verse of relational algebra, which is essential for good OLTP design. But the problems of querying large datasets is usually left for a different course, if it's treated at all.

Maybe at a high level, but OLAP involves precomputing aggregations and bit indexes. It's a pretty different beast.

OLAP is very rarely under real-time constraints and, when it is, it tends to push the heavy lifting out to OLTP.

Oh, I'm handwaving, not talking about the underlying details of star schemata, clever column representations and whatnot.

But I think the analogy is still correct.

Every such system has two functional requirements:

1. Store data.

2. Query data.

And every system has the same non-functional requirement:

1. Storage (write) should be as fast as possible.

2. Queries (read) should be as fast as possible.

However, per an observation I made a while back, complexity in the problem domain is a conserved value.

Insofar as your data requires processing to be useful, that complexity cannot be made to go away. You can only decide where the complexity cost is paid.

You can pay it at write time and amortise that across reads. You can pay it at read time and excuse writes. Or you can pay it in the middle with some sort of ETL pipeline or processing queue.

But you must always pay it. The experiences of data warehousing made that bitterly clear.

So really, the job of a software architect is to take the business requirements as a non-functional requirement (an -ility) and then pick the architecture that fits that NFR. That includes dropping other nice-to-have non-functionals.

Twitter's non-functional requirement is that they want end-to-end latency to be 5 seconds or less, under conditions of very low write:read ratio. This suggests paying the complexity cost up front and amortising it over reads. And that's what they've done.

That's an interesting thought, but rather than focusing on write vs read balance, I would claim that OLAP and OLTP are most distinguished by the nature of queries they need to support.

OLAP is characterized by fairly low-volume aggregation queries that touch very large volumes of data ("what fraction of tweets in English come from non-english speaking countries?"). Sure, the ingest is large, but it tends to be batch and has fairly loose real-time (in the EE sense) requirements. What makes OLAP hard is the sheer volume of data from which an aggregation must be calculated.

OLTP is characterized by very selective queries with much tighter real-time bounds ("what are the last 20 things that the 100 people I follow said?"). The overall size of the dataset might even be the same, but each individual query needs fast access to a tiny fraction of the dataset. In many applications, this is accompanied by very high QPS and in Twitter's case, extremely high write volume.

Thanks, I hadn't mentally partitioned OLAP/OLTP that way. I still think that the read/write distinction more correctly forms the dividing line.

That's because the classical OLTP approach is to use relational databases. Relational databases do well in writes because in a properly normalised DB there is one, and only one, place for each datum to go. There is no fan out and potentially no coordination. A place for everything and everything in its place. That reduces the amount of write traffic required to support complex models of the problem domain by a lot.

But of course the relational view of a problem domain doesn't really look like the humanistic view of a problem domain. And building the humanistic view of the problem domain usually means denormalising and breaking the things that made OLTP useful. Enter ETL pipes and OLAP systems.

From there it's very humanistic and it requires no understanding of the relational model. Tools can easily turn the dimensional tables into dropdown filter lists that look like a spreadsheet -- or even run inside a spreadsheet.

This near hour long video is a deep look into Twitter's backend. Specially into the Firehose feature, Flock, etc. They go into detail on how they use Redis and even show one of the actual data structures they store in Redis. A must see video for anyone into high scalability.


I am playing the armchair architect and my question will be probably wrong in infinite ways,but I might learn something, what is the reason why the service has to write a tweet on two million timelines, wouldn't it be cheaper if they let the client build the page on its own via restful apis?

Requesting the timelines of each of the people you follow is slow process for the client, meaning that the client has to make 100s or even 1000s or requests.

Also, much of the data is thrown away because it's replies to people you don't follow or too old. It's hard for clients to reconstruct the timeline that way. Also, it would vastly increase the number of HTTP requests and data Twitter has to ship out.

Also, it's basically impossible to make a system like that real time because you have to check 1000 feeds to see if anything is new.

I used to write a multi-network client, basically the combined home timeline request is the only feasible method for a client to use.

Mostly because they chose the wrong backend technology (which they have been doing repeatedly since their early days).

The right way to solve twitter would be to have 140-byte tweets sorted by a <userid,time> 64-bit key, with a few more attributes (all falls into 256-bytes neatly), shard them across servers and keep everything recent in memory.

Logging into a server would fetch the list of following to the front end server, broadcast the request to all tweet servers, wait 50ms or so for responses, merge, sort and HTML format them.

The front end servers would not need any memory or disk (could be an army of $500 servers behind a load balance, or a few beefy ones). The backend servers would have to have some beefy CPU and memory, but still ultra commodity (256 bytes/tween means 1GB=4M tweets, so one 64GB server=256M tweets). Shard for latency, redundancy, etc. Also, special case the Gagas/Kutchers of this world by giving them their own server, and/or have them broadcast to and cache their tweets in the front end servers (Spend 256MB memory on tweet cache in the front end servers, and you get 1M cached tweets - which would cover all of the popular people and then some).

Network broadcast was invented for a reason.

My understanding is that Facebook does it that way. I think for Twitter it creates two problems though:

- It's hard to make that realtime, because you'd need to some sort of broadcast every second or at least every few seconds.

- It's hard to follow a lot of people (you could be getting a large number of replies), so there would need be a follow limit.

Facebook has a follow limit and people don't really expect Facebook to be realtime - the central bit is not really.

Also, there is no one right way to do these things, in my view.

> It's hard to make that realtime, because you'd need to some sort of broadcast every second or at least every few seconds.

Twitter isn't realtime either - they say they don't succeed to stay within the 5 seconds all the time, and when Gaga tweets it takes up to 5 minutes.

Furthermore, I was talking about broadcasting a request for updates on demand when needed. PGM/UDP can blast through hundreds of megabytes per second on gigabit connection. That's quite easy. 22MB/sec is nothing, even 100MB/sec is not much these days (though you might have to bond/team to make that reliable)

> It's hard to follow a lot of people (you could be getting a large number of replies), so there would need be a follow limit.

Not at all. Make the replies (at most) 5 tweets from each person you follow, with a "and there's more ..." flag in the reply, have the front end ask for more if it makes sense once the 50ms is done.

It's ok if the 300 people who follow one million people take 200ms instead of 50ms to get a reply. And if you want to make it quicker for them, have these ones (and only these ones) on a "push" rather than "pull" model. The vast majority of people follow less than 50 people, perhaps less than 20.

> Facebook has a follow limit and people don't really expect Facebook to be realtime - the central bit is not really.

People do expect facebook to be realtime, it mostly delivers (better than twitter), their limits are not hard (I know people who asked and got them lifted within a few minutes).

> Also, there is no one right way to do these things, in my view.

No, but there's a lot of wrong ways, and twitter keeps choosing among them.

That is actually a very interesting question and it turns out that whether it's better to fanout on write or on read depends on a few different things. There's a very widely read paper on the subject you might enjoy:


The link doesn't seem to work (error: "Unable to connect to database server"); do you have an alternate link?

Not a link to the full paper, but a summary is presented here: http://highscalability.com/blog/2012/1/17/paper-feeding-fren...

Title? Link doesn't work.

Huh, weird. Works fine for me. Anyways:

Feeding Frenzy: Selectively Materializing Users’ Event Feeds Authors: Silberstein, A.; Terrace, J.; Cooper, B.F.; Ramakrishnan, R.

A quick Google should turn it up for you somewhere.

They are making a tradeoff between cost of reading your timeline and cost of a write at tweet-time to optimize the former.

In addition to what others have mentioned, consider passive endpoints, such as SMS and push notifications.

But they don't all have to fit within the same framework - indeed, don't fit into the same framework.

e.g., if you can't send the SMS or push right now, you have to do a retry 5 minutes later. If you can't update a followers queue (if that's what you do), you have to display a fail whale....

Here is a similar talk that Jeremy Cloud gave at QCon NY a few weeks ago: http://www.infoq.com/presentations/twitter-soa

Jeremy Cloud discusses SOA at Twitter, approaches taken for maintaining high levels of concurrency, and briefly touches on some functional design patterns used to manage code complexity.

I would absolutely trust someone named Jeremy Cloud on this subject.

The surprise for me is that the core component is Redis.

My first guess would have been custom C code. Yeah, you have to do everything yourself. Yeah, it would be hard to write. But you'd control every little bit of it.

Obviously, I must not fully understand the problem and what Redis buys.

Sam Puralla (if you are reading) -- do you know why didn't Twitter go with a full custom system at its heart?


I'm probably a better person to answer this than Sam - I'm a former lead on this project - so I'll take a swing:

We chose Redis because it gave us the specific, incremental improvements over our existing memcached-based system that we required, without requiring us to write (yet another) component. There was enough to do, and this choice has turned out to be good enough, I think.

As the project progressed though, we treated Redis much like we did own it. We altered the wire protocol, changed the eviction strategy, and reduced protocol parsing overhead, for example. Much of that work has long since made it upstream.

[edit] grammar

Sounds like good engineering choices. I think of Twitter as having unlimited resources but, of course, that can't be true.

Two follow-on questions:

1. Did your changes make it back into open source or were they only relevant to Twitter? When you say upstream to you mean on Redis or earlier in the Twitter pipeline?

2. How much is Redis on the critical path? Is it 90% of the processing work in the large fanout cases?

1. Yes, most everything we changed is in the open source Redis code base. That's what I'm referring to as upstream, above.

2. Redis is in the critical path for a majority of API requests. I can't provide a specific percentage.

Think of Redis as a nicer interface to data structures that otherwise would have been written in custom C code.

If you read the redis manifesto, it describes itself in point 1 as a DSL for abstract data types:


> The surprise for me is that the core component is Redis.

> My first guess would have been custom C code.

I'm pretty sure they use a heavily tailored Redis, so both is true.

Well... It seems that hisghscalability doesn't scale... Error from squarespace on that page.

No one else seems to have noticed that these details are out of date. Twitter has publicly stated that they currently have "well over 200M active users". The stats are also misleading in that -- I'm pretty sure -- 300k reads and 6k writes per second are only referencing tweets. Flock, Twitter's graph database, handles more than 20k writes and 100k reads per second on its own (peak numbers available from the two year old Readme on Github).


> Twitter knows a lot about you from who you follow and what links you click on.

No kidding. But we don't care as we live in the glass house of a celebrity culture.

PS: downvote the quoted text if you must. My point is not obvious.

If you expect others to misunderstand you, perhaps try to better represent your thoughts?

You realize it's public data, right? When I tweet something, I don't expect it to be private. HN knows a lot about me too.

Yup, I mentioned glass houses.

The meaning of your metaphor certainly seems obvious; explicit even. Are you referring to a layer of more profound non-obviousness?

>> it can take up to 5 minutes for a tweet to flow from Lady Gaga’s fingers to her 31 million followers

Why not break the load up among a farm of servers? 5 minutes to deliver a single message? It's too bad multicast can't be made to work for this use case.

At least analyze to see if there's a pattern of geographic concentration of her followers and optimize for where their datacenters are.

Use peer to peer, let the clients help distribute the messages.

Their setup is very similar to what we use at Fashiolista. (though of course we only have millions and not hundreds of millions of users). We've open sourced our approach and you can see an early example here: https://github.com/tschellenbach/Feedly/

Does anybody know how that compares to Facebook? I believe they do not use write fanout but instead rely on the search federation model (the model claimed not to have worked for Twitter).

Facebook uses a fan-out-on-read approach


Scala is awesome language to implement scalable things, like Twitter services


They have a lot of RAM. Dang.

Not as much as I would have thought.

2TB can be one server: http://www.supermicro.co.uk/products/system/5U/5086/SYS-5086...

An expensive server for sure... but it's just one server.

In case people have a hard time finding prices for this, my vendor gives ~57k GBP ($85k) as the lower end price for configurations with that SuperMicro cabinet, 64 cores (8x 8-core CPU's) and 2TB of RAM. Dropping to 1TB lets you pick from a lot of cheaper cabinets and so the price drops by quite a bit more than half for the basic configurations.

As others have said they don't actually have a lot of RAM dedicated to Redis. You can put 1.5TB of memory in many 'inexpensive' Dell servers (single machines, not clusters). So basically a cabinet of machines you could have over 30TB of memory available to you. Basically some of the design choices Twitter has made seem tailored to their (previous? Not sure if they have their own hardware now) choices to run on 'cloud' services in the past.

With good hardware and a bit of a budget you can easily scale to crazy numbers of processor cores and memory. That isn't to say the software side of the solution is going to be any easier to solve.

That's really not a lot, especially when you consider that it's distributed among a cluster.

A single SPARC M5-32 can have up to 32TB of memory:


Even the low end of single servers goes up to a couple TB of RAM. "Enterprise" hardware has been there for ages.

I really question the current trend of creating big, complex, fragile architectures to "be able to scale". These numbers are a great example of why, the entire thing could run on a single server, in a very straight forward setup. When you are creating a cluster for scalability, and it has less CPU, RAM and IO than a single server, what are you gaining? They are only doing 6k writes a second for crying out loud.

They create big, complex, fragile architectures because they started with simple, off-the-shelf architectures that completely fell over at scale.

I dunno how long you've been on HN, but around 2007-2008 there were a bunch of HighScalability articles about Twitter's architecture. Back then it was a pretty standard Rails app where when a Tweet came in, it would do an insert into a (replicated) MySQL database, then at read time it would look up your followers (which I think was cached in memcached) and issue a SELECT for each of their recent tweets (possibly also with some caching). Twitter was down about half the time with the Fail Whale, and there was continuous armchair architects about "Why can't they just do this simple solution and fix it?" The simple solution most often proposed was write-time fanout, basically what this article describes.

Do the math on what a single-server Twitter would require. 150M active users * 800 tweets saved/user * 300 bytes for a tweet = 36T of tweet data. Then you have 300K QPS for timelines, and let's estimate the average user follows 100 people. Say that you represent a user as a pointer to their tweet queue. So when a pageview comes in, you do 100 random-access reads. It's 100 ns per read, you're doing 300K * 100 = 30M reads, and so already you're falling behind by a factor of 3:1. And that's without any computation spent on business logic, generating HTML, sending SMSes, pushing to the firehose, archiving tweets, preventing DOSses, logging, mixing in sponsored tweets, or any of the other activities that Twitter does.

(BTW, estimation interview questions like "How many gas stations are in the U.S?" are routinely mocked on HN, but this comment is a great example why they're important. I just spent 15 minutes taking some numbers from an article and then making reasonable-but-generous estimates of numbers I don't know, to show that a proposed architectural solution won't work. That's opposed to maybe 15 man-months building it. That sort of problem shows up all the time in actual software engineering.)

>They create big, complex, fragile architectures because they started with simple, off-the-shelf architectures that completely fell over at scale.

No, they fell over at "shit we're hitting the limits of our hardware, lets re-architect everything instead of buying bigger hardware". Rather than buy 1000 shitty $2000 servers, buy 2 good $1,000,000 servers. I know it is not fad-compliant, but it does in fact work.

And then you grow by another 50%. If you go the commodity hardware route, you buy another 500 shitty $2000 servers. If you go the big-iron route, you buy another 2 $5,000,000 servers, because server price does not increase linearly with performance. If you're a big site, server vendors know they can charge you through the nose for it, because there are comparatively few hardware vendors that know what they're doing once you get up to that level of performance.

Look, if you're going to make the case that one should buy bigger servers instead of more servers, then this becomes an economic argument. The reason large web-scale companies don't do this is because it outsources one of their core competencies. When they scale horizontally across thousands of commodity machines, then knowledge of their problem domain becomes encoded in the scaling decisions they make and stays internal to the company. When they scale vertically by buying bigger hardware, then they are trading profits in exchange for having someone else worry about the difficulties of building really big, fast supercomputers. It makes life a lot easier for the engineers, but it destroys the company's bargaining position in the marketplace. Instead of having a proprietary competitive advantage, they are now the commodity application provider on top of somebody else's proprietary competitive advantage. If someone wants to compete with them, they buy the same big iron and write a Twitter clone, while if their server vendor wants to raise prices, it has them by the balls since the whole business is built on their architecture.

(I have a strong suspicion that Twitter would not be economically viable on big iron, anyway. They would end up in a situation similar to Pandora, where their existence is predicated on paying large rents to the people whose IP they use to build their business, and yet the advertising revenue coming in is not adequate to satisfy either the business or their suppliers.)

No, you buy another 2 servers at the same price, because performance continues to increase incredibly quickly, and what you got $200,000 2 years ago is now half the speed of what $200,000 gets you.

>When they scale horizontally across thousands of commodity machines, then knowledge of their problem domain becomes encoded in the scaling decisions they make and stays internal to the company.

Or to put it another way: "they create a massive maintenance nightmare for themselves like the one described in the article".

>When they scale vertically by buying bigger hardware, then they are trading profits in exchange for having someone else worry about the difficulties of building really big, fast supercomputers.

You are overestimating the cost of high end servers, or underestimating the cost of low end ones. Again, their existing redis cluster is less RAM, CPU power, and IO throughput than a single, relatively cheap server right now.

>Instead of having a proprietary competitive advantage, they are now the commodity application provider

Twitter is a commodity application provider. People don't use twitter because of how twitter made a mess of their back end. People don't care at all about the back end, it doesn't matter at all how they architect things from the users perspective.

>while if their server vendor wants to raise prices, it has them by the balls since the whole business is built on their architecture.

What do you think servers are? They aren't some magical dungeon that traps people who buy them. If oracle wants to fuck you, go talk to IBM. If IBM wants to fuck you, go talk to fujitsu, etc, etc.

When two of your 2,000 servers die, your load balancers etc kick in and route around the problem.

When two of your two servers die, you ... um, well, you lose money and reputation. Quickly.

If you buy $1 million servers, a whole lot of things needs to go bad in whole lots of ways that would likely take own large numbers of those 2,000 servers too. I'm not so sure I agree with the notion of going for those big servers myself, but having had mid range servers from a couple of the big-iron vendors in house, here's a few of the things you can expect once you tack a couple of extra digits on the server bill:

- Servers that phone home; sometimes the first you know of a potential problem is engineers at your door come to service your server.

- Hot swappable RAID'ed RAM.

- Hot swappable CPU's, with spares, and OS support for moving threads of CPU's that are showing risk factors for failure.

- Hot swappable storage where not just the disks are hot swappable, but whole disk bays, and even trays of hot swappable RAID controllers etc.

- Redundant fibre channel connections to those raid controllers from the rest of the system.

- Redundant network interfaces and power supplies (of course, even relatively entry level servers offers that these days).

In reality, once you go truly high end, you're talking about multiple racks full of kit that effectively does a lot of the redundancy we tend to try to engineer into software solutions either at the hardware level, or abstracted from you in software layers your application won't normally see (e.g. a typical high end IBM system will set aside a substantial percentage of CPU's as spares and/or for various offload and management purposes; IBM's "classic" "Shark" storage system used two highly redundant AIX servers as "just" storage controllers hidden behind SCSI or Fibre Channel interfaces, for example).

You don't get some server where a single component failure somewhere takes it down. Some of these vendors have decades of designing out single points of failure in their high end equipment.

Some of these systems have enough redundancy that you could probably fire a shotgun into a rack and still have decent odds that the server survives with "just" reduced capacity until your manufacturers engineers show up and asks you awkward questions about what you were up to.

In general you're better off looking at many of those systems as highly integrated clusters rather than individual servers, though some fairly high end systems actually offer "single system image" clustering (that is, your monster of a machine will still look like a single server from the application point of view even in the cases where the hardware looks more like a cluster, though it may have some unusual characteristics such as different access speeds to different parts of memory).

I'm pretty sure the insides of your 2 $1,000,000 servers are architecturally essentially identical to a large cluster of off-the-shelf computers, with somewhat different bandwidth characteristics, but not enough to make a difference. They'll have some advanced features for making sure the $1,000,000 machine doesn't fall over. But it's not like they're magically equipped with disks that are a hundred times faster or anything, you're still essentially dealing with lots of computing units hooked together by things that have finite bandwidth. You can't just buy your way to 25GHz CPUs with .025ns RAM access time, etc. etc.

I'm not sure what you are trying to say. My entire point was that you can buy a single server that is more powerful than their entire cluster. You appear to agree, but think that is a problem?

Massively parallel single-machine supercomputers are still, essentially, distributed systems on the inside. You still have to use many of the same techniques to avoid communicating all-to-all. If you treat such a system as a flat memory hierarchy, your application will fall down.

True that they're still essentially distributed systems on the inside, but the typical bandwidth can often be orders of magnitudes higher, and the latencies drastically lower when you need to cross those boundaries, and for quite a few types of apps that makes all the difference in the world to how you'd architect your apps.

"you can buy a single server that is more powerful than their entire cluster." is pure genius.

We now know how to solve the C10k problem in orders of magnitude - use a single server!

$1,000,000 machines are not that much faster, you won't get 1000x the performance or anywhere near. The memory and IO performance is going to be within a factor 2 or 4 of that high end machine. It might have 50x the cores, but most likely that's not the limiting factor anyway.

The whole point is you can get equal performance from a single server instead of a ton of little ones. The ton of little ones forced them to totally re-architect to work around the massive latency between servers. A single server would have allowed them to stick with a sane architecture, and saved them millions in development time and maintenance nightmares.

It would be nice if were true, but you simply can't, there's no magic that makes an expensive server that much faster - it's just a bit faster for a lot more money. It can make sense if you only want 10x the performance and the server is cheaper than the rewrite.

For example, if a $30k car can go 150mph, it doesn't mean a $300k car can go 1,500mph it just doesn't happen. A Bugatti Veyron goes, what? 254mph that's not even double (and it costs a lot more than $300k)

We're not talking about cars. We're talking about computers. 8TB of RAM is 8TB of RAM, it doesn't get better by spreading it across a thousand servers. 4096 CPU cores are 4096 CPU cores, they don't get better by spreading them across 1000 servers. Those things get worse spreading them across servers, because you massively increase the latency to access them, and for them to access shared data.

Please give an example of this "monster server" you keep talking about with 8 TB of RAM, 4096 cores, N network interface cards, and 100 TB of SSD, with a cost estimate. Otherwise we can't have a real discussion. People have a pretty good idea of how to build / what it costs to build something with 128 GB of RAM, 32 cores, a couple of NICs, and 2 TB of SSD, but what you're talking about is 50-100x beyond that.

Not posting this to support his argument, but for the record some of the high end unix hardware available (for a price, no idea what these cost):

32TB RAM 1024 Cores (64 x 16 core), 928 x PCI Express I/O slots: http://www.oracle.com/us/products/servers-storage/servers/sp...

16TB RAM 256 cores (probably multiple threads per core), 640 x PCIe I/O adapters: http://www-03.ibm.com/systems/power/hardware/795/specs.html

4TB RAM 256 cores (512 threads), 288 x PCIe adapters: http://www.fujitsu.com/global/services/computing/server/spar...

And exactly to his point - you still can't treat these as one uniform huge memory / computational space for your application (these machines seem designed for virtualization rather than one huge application). You run into the same distributed computing issues you would with your own hardware, just with a 5/10x larger initial investment and without a huge amount of pricing control / flexibility in terms of adding capacity / dealing with failures as they arise.

Actually you can treat these as one uniform huge memory / computational space for your application. They're not meant only for virtualisation. In particular, the Oracle Database is a perfect fit for a system with thousands of cores and terabytes of memory.

It's true that for some use cases, you'd be better off carving it up using some form of virtualisation, but it isn't a requirement to reap the benefits of a massive system.

Both the Solaris scheduler and virtual memory system are designed for the kind of scalability needed when working with thousands of cores and terabytes of memory.

You also don't run into the same distributed system issues when you use the system that way.

You also do actually have a fair amount of flexibility in dealing with failures as they arise. Solaris has extensive support for DR (Dynamic Reconfiguration). In short, CPUs can be hot-swapped if needed, and memory can also be removed or added dynamically.

Just for a reference point, an appropriately spec'd big iron machine from a major supplier with 8tb of RAM will run you $10 to $20 million. There's no mainstream commercial configuration that is going to get you to 4096 cores though.

Fujitsu's SPARC Enterprise M9000 mentioned in another reply is $5 to $10 million depending on configuration (assuming you want a high-end config).

If you go big iron with any supplier worth buying from, they will absolutely murder you on scaling from their base model up the chain to 4tb+ of memory. The price increases exponentially as others have noted.

The parent arguing in favor of big iron is completely wrong about the economics (by a factor of 5 to 10 fold). The only way to ever do big iron as referenced, would be to build the machines yourself....

just fyi, 2.8 million for 4096 core 16TB http://news.cnet.com/8301-30685_3-20019153-264.html?part=rss...

SGI won't build you that computer for commercial use for $2.8 million. It would cost you several fold more.

5.6 million. You need a hot standby. Or 8.4 if you're feeling cautious :)

Why should he give an example of a "monster server" with specs like that?

He gave the argument that spreading the CPU's and memory out does not make them better.

So part of the point is that if the starting point is 8TB of RAM and 4096 cores distributed over a bunch of machines, then a "large machine" approach will require substantially less.

I've not done the maths for whether or not a "Twitter scale" app would fit on a single current generation mainframe or similar large scale "machine" (what constitutes a "machine" becomes nebulous since many of the high end setups are cabinets of drawers of cards and on persons "machine" is another persons highly integrated cluster), but it would need to include a discussion of how much a tighter integrated hardware system would reduce their actual resource requirements.

I'm sure you could get vastly better performance. But this is not about performance.

What happens when you need multiple datacenters? How about if you need to plan maintenance in a single datacenter? Therefore you need at least two servers in each datacenter, etc.

Let's say you decide to use smaller machines to serve the frontend but your backend machines are big iron. Are you going to perform all your computation and then push the data out to edge servers?

There's much more to this then loading up on memory and cores.

Think about fault tolerance with 2 servers.

You are right to be sceptical, and I think you're right about throughput: they mentioned that all tweets are ~22MB/sec. My hard disk writes 345MB/sec. A modern multicore processor should probably tear through that datastream.

However, you should realize that in a decent sized organization there are usually different teams working on different things that are evolving at different rates. You have to manage these people, systems and their changes over time. After awhile, efficiency drops in priority versus issues such as manageability and security... often due to separation of concerns requirements.

Beyond the build and maintenance issues, a serious challenge for real time services is high availability (HA). Tolerance for hardware/software/network/human failings must be built in. That also throws efficiency out the window. It's cheaper to treat cheap, easily replaceable machines as a service-oriented cluster than to build two high performance machines and properly test the full range of potential failover scenarios.

I hope this comment is a troll and you don't actually believe that you could run Twitter on a single server. If you are interested in why this is true I suggest you get a job at a company that has to serve consumer scale web traffic.

The fact that you think "consumer scale web traffic" is some magical thing is exactly what I am talking about. Have you ever heard of TPC? They have done benchmarks of database driven systems for a long time. TPC-C measures performance of a particular write query, while maintaining the set ratio of other active queries. The top non-clustered result right now does 142,000 new orders per second. Yes, a single server can handle 300k reads and 6k writes per second.


I appreciate that you believe what you are saying, but TPC-C doesn't measure anything at all relevant. Having worked both on the enterprise side building WebLogic and on the consumer side at Yahoo and Twitter I can tell you definitively that you are wrong about the applicability of that benchmark to serving large scale web applications. The database and server you are talking about could not do 300k timeline joins per second, or 300k graph intersections per second or virtually any of the actual queries that Twitter needs to operate. All reads are not created equal. Checking the latency profile, it is horrendous even for these simple transactions — hundreds of milliseconds! Worse, there is no where to go as the service grows.

It would be great if you were right and all the big web companies are wrong. I can assure you that it isn't the case.

sam don't waste your time this is a joke

sup nick

One of the people I work with was at eBay from 2000 to 2009 and they did what you talk about - have one giant database server. Some Sun monstrosity.

Then they ran out of horsepower on that one server. Let me tell you, the stories are pretty horrible.

So you know who you are replying to: http://www.crunchbase.com/person/sam-pullara

Does TPC require durability? It seems like you could get even better performance (or perhaps that's what they do) if you just got something like a Fujitsu M10-4S Server and stuffed 32TB of ram in it.

>Does TPC require durability?


I think that in a lot of ways it's because of the roots of the organizations in question. Instead of saying "yeah, we have money, we can go buy this solution and trust in the vendor to support us" (an objectively viable solution in many situations) we instead say "we've got $100,000 in seed money and one engineer plus some friends she's convinced to help out pro bono, that won't buy us a sideways glance from Oracle".

Once you've survived the first year or two and gained some traction, suddenly the decision to investigate other solutions becomes a break from tradition -- and no engineer wants to throw away their work. It's incredibly hard to do a full stop, look around, and reevaluate your needs to see if you should just go in a different direction. I'm not sure I've ever seen it happen, really; we always just bow to the organic growth and inertia that we've established.

There is also a strong tradition of Not Invented Here (NIH) at work in this ecosystem. If it wasn't written by one of your peers (other startup people) then it's just not very cool. Use a Microsoft or Oracle product? Hahaha, no. Open source or bust!


To be fair, though, I've never worked for a startup that could afford that kind of software, so I guess the point is moot. I'm not spending 20% of my startup's cash reserves on software licenses to Oracle, when I can instead use that $2m to hire a few people to build it for me and know it inside and out. Plus, then they're invested in my company.

Also, please don't think I'm denigrating open source software with this commentary. I just think that the kind of zealotry that precludes even considering all of the options is, in general, a bad business decision, but one we seem to make all of the time.

Your expensive vendors sit on security patches for weeks and my free ones shoot them out almost immediately. Why is that not a business concern?

I don't think it's zelotry, there are a lot of advantages to open source software you aren't going to get from Microsoft. If you are an edge case and you happen to stumble on that race condition bug are you going to have your engineers black box test and reverse engineer someone elses product illegally while they wait around for Microsoft support or have them look under the hood, patch the bug and move on?

What the big licenses fees get you is accountability, which is of course a huge thing, but what open source gives you is control.

I think if your business is software, then open source makes perfect business sense, especially if one of your assets is a team of competent engineers.

You're arguing for scaling up (vertical) instead of scaling out (horizontal). Both are valid approaches. Scaling out is preferred because your architecture is mode modular and you do not have to constantly buy bigger machines as your usage grows; you just add additional machines with similar capacity. The main problem with scaling up is that price and performance are not linearly related and eventually you will be limited by the performance available to one system. But it's a perfectly valid approach for certain scenarios.

It's not as simple as 6k writes - it's 6k write requests. Most of these 'write requests' are tweets which must be written to the timelines of hundreds or thousands (or in some cases millions!) of followers.

If you try to run it all on a single server, and Obama and Justin Bieber tweet at the same time, you suddenly have a backlog of 75 million writes to catch up on. Now imagine what happens if Bieber starts a conversation thread with any of the other 50 users with 10m+ followers.

Agreed that this is an issue, but it's a smaller one than you make out: Consider that there's likely a marked hockey stick going on here. Now keep the very most recent tweets of the very "worst" in terms of followers in a hot cache, and mix their most recent tweets into the timelines of their followers as needed until they're committed everywhere. Then collapse writes, and you reduce the "conversation" problem.

It's still not cheap to handle, but consider that fan-out on write is essentially an optimization. You don't need that optimization to be "pure" in that you don't need to depend on the writes being completed in a timely fashion - there are any numbers of tradeoffs you can make between read and write performance.

They may average 6K w/s, but history has shown higher peaks, and those peaks often include tweets from those with large 30e6+ followers -- large events naturally have celebrities chiming in. Having more CPU [cores], RAM, and IO is not going to solve for the need to shard your Redis, which is bound by a single core. For the fanout, that is anywhere from 90 to 90e3 cores at 2e6 Redis IOPs -- 1-6e3 Lady Gagas per second. Before pipelining, which you'd probably consider complex or fragile. All you've done is moved some networking parts around. Your code and topology hasn't really changed.

>Having more CPU [cores], RAM, and IO is not going to solve for the need to shard your Redis, which is bound by a single core

But they chose redis in the first place as part of their messy, crummy pile of barely working junkware solution. You wouldn't use redis in the first place if you were building a solution that scales up.

So, now we also have to assume per-core software and support licenses to accompany our purchase of several $10M racks? What platform are we talking about now? Presumably I have to also purchase the IDE and toolchain licenses, per developer, too?

Twitter may have and will still make missteps, but you're being egregiously vague and unhelpful. Your comments are as vaporous as you make Twitter's architecture to be.

You're funny. I like you.

I do consider it a pathology when tiny services, or tiny apps in a corporate structure, act like they have the problems of Google.

You are not Google.

You do not have Google's problems.

You do not have scaling issues.

For you, N is small and will stay small.

Stop giving me this delusional resume-padding garbage to implement. For you, here, it is delusion and lies.

The point here is that Twitter really is one of the cases where vertical scaling is not, on the balance of non-functional requirements to engineering overhead, the right decision. They really do need to pay the complexity piper.

Oh, yeah. I'm speaking to everyone else :-)

I remember asking a question about scaling on SO and getting a response about how my data was small, because it could fit on an SSD (for that one small component)

I miss not having real scaling issues sometimes. They make my head hurt.

That's probably 6k writes to a "normal" DB. The fanout is handled by Redis which I doubt is included in the 6k writes. Not if you want to fan out to 30 million followers in under 5 seconds.

The "fanout" or "wasteful duplication of a single message 30 million times" is only required because they are using tiny underpowered hardware to begin with. The approach that they claim can't possibly work actually does work. You just can't do it on a $2000 "server".

The messages are not repeated in fanout, just the ids. I guess you would know that if you actually the links.

That still means 30 million writes instead of one. That the size of each write is smaller is not going to help you all that much if each write still worst case ends up forcing you to rewrite at least one disk sector.

While I'm inclined to favour a fanout approach, a "full" fanout can easily be incredibly costly on the write side.

Let me guess, you work for Oracle?

The property of distributed message passing systems that everyone likes is the arbitrary scalability. Subsystem performing slowly? Add more nodes! The cost of this arbitrary scalability, as you have noticed, is brittleness, single points of failure, and complexity that will drive even the most seasoned of engineers nuts.

In the end though, the users want their tweets and they don't care how it works. Complexity is why we make so much money, after all.

Something is not easy to say simple or complex by just looking at the shape or counting the number of them.

One big powerful server's architect and circuit design is not that simple as what it looks like, in the other hand, 2000 standard servers are not that complex as what they are.

The way of how you think is the key. You can think that the 2000 servers is a big computer cluster, but I prefer to think each one of them is a simple replaceable black box.

I can show you some scenarios at the following to explain why 2000 servers make more sense than one big machine,

Firstly, When I want to upgrade the system to deal with a suddently increased load I don't need to call vendor to arrange an onsite upgrade service, I can do it by connecting more servers right away, and when the load decreased, then I can disconnect some servers. This makes the maintenance job is more flexible and more convenient.

The second one is when I estimate the performance of the whole system, I can focus on estimating the performance for one server first, and then add them on, even there are some variables need to be involved into the calculation, it is still more clear and simpler than estimating the performance by looking at the vendor provided system specification.

The last scenario is that you cannot separate one big machine into different geolocations to handle the access all over the world, but seperate 2000 servers is much easier and doable. Moreover, different geolocation deployment can provide a genric 24*7 online service as some nature disasters happens.

When we engineer things, we should split one big(complex) thing into multiple small parts and keep it with a simplest structure or function, and later, we connect these small parts together.

English is not my first language, so if somewhere sounds wired, please excuse me.

Maybe Twitter would indeed be better off using a smaller number of more beefy servers. But at the software level, I wonder if the architecture would be very different.

Suppose you implemented something like Twitter as a single large application, running in a single process. Maybe run the RDBMS in its own process. The app server stores a lot of stuff in its own in-process memory, instead of using Redis or the like. Now, what happens when you have to deploy a new version of the code? If I'm not mistaken, the app would immediately lose everything it had stored in memory for super-fast access. Whereas if you use Redis, all that data is still in RAM, where it can be accessed just as fast as before the app was restarted.

You'd use failover to do a rolling upgrade.

Because if you want to have reasonable availability and low latency you can't just buy two "big" machines.

Going with the incrementally sized approach gives you more flexibility to economically distribute the risk, in addition to the load.

They are only doing 6k writes a second for crying out loud.

And 600k reads per second (as of when this article was written), which seems like an important thing to leave out.

300k reads, which is approaching trivial. You can do that with an array of desktop SSDs. But in reality, it would be almost all in RAM anyways making those reads incredibly cheap.

Right, because no matter how much RAM you put in a machine, access is always the same speed.

Two thoughts: scaling down can be just as important as scaling up, and cluster nodes that need to talk to each other are probably less common than stateless cluster nodes that need to talk to the client.

How do you expect to serve 150 million concurrent users on a single server?

They are not serving 150 million concurrent users. They have 150 million active users. As in, people who do in fact use twitter at all, as opposed to the millions of dead accounts nobody touches. They are not all being served at once.

Do you hit the disk when somebody, say, checks my Twitter profile that I haven't updated since 2008? What will that do to your performance?

I'm not sure why you are asking me about this.

Because he/she hasn't read the article.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact