Hi, early Twitter employee here. The reliability team called the infrastructure "Fisher-Price" internally, so this wasn't just one random executive coming up with a term on his/her own.

The problem wasn't Ruby. The problem was the way Twitter used Ruby. We had one big monorepo with every single function and every form of business logic baked into a single place. That logic relied on monkey patching and all sorts of crazy, horrible glue to keep it working together. Every time we had to scale up we would glue infra in place to keep things working while we came up with a real solution (which never really materialized).
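
To make the "monkey patching" point concrete, here is a minimal, hypothetical Ruby sketch of the pattern (illustrative only, not actual Twitter code): any file in a monorepo can reopen a class someone else owns and silently change its behavior for every caller.

    # Ruby lets any file reopen any class, so "glue" code scattered around
    # a monorepo can redefine behavior globally -- whoever loads last wins.
    class Tweet
      attr_reader :text

      def initialize(text)
        @text = text
      end

      def render_text
        text.strip
      end
    end

    # Elsewhere in the repo, a patch reopens the same class and replaces
    # the method. Every caller now depends on this, whether they know it or not.
    class Tweet
      def render_text
        text.strip.gsub(%r{https://t\.co/\S+}, "[link]")
      end
    end

    puts Tweet.new("  hello https://t.co/abc  ").render_text  # => "hello [link]"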

In my time there we had memcache instances which held timelines. Populating them took hours or days, and while they were unpopulated the site was offline, so rebooting or restarting the caches was simply not an option. We had a data sharding strategy that was temporal: we would spin up a new database cluster every few weeks to handle all of the incoming tweets, and failing to spin up a new cluster in time meant a global site outage. Don't even get me started on the "load-bearing Mac mini".
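
For readers unfamiliar with the idea, here is a rough, hypothetical sketch of temporal sharding (the cluster names and windows are mine, not Twitter's): each database cluster owns a window of time, so writes for "now" only succeed if someone has already provisioned a cluster covering the current window.

    # Hypothetical shard map: each cluster owns a time window for new tweets.
    SHARDS = [
      { from: Time.utc(2009, 1, 1),  to: Time.utc(2009, 1, 15), host: "db-cluster-01" },
      { from: Time.utc(2009, 1, 15), to: Time.utc(2009, 2, 1),  host: "db-cluster-02" },
      # The next cluster has to be spun up before this window closes...
    ]

    def shard_for(timestamp)
      shard = SHARDS.find { |s| s[:from] <= timestamp && timestamp < s[:to] }
      # ...because if it isn't, there is nowhere to write incoming tweets:
      # the global-outage failure mode described above.
      raise "no shard covers #{timestamp}" unless shard
      shard[:host]
    end

    puts shard_for(Time.utc(2009, 1, 20))  # => "db-cluster-02"
    shard_for(Time.utc(2009, 2, 10))       # raises: no shard covers ...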

In reality, the only problem Rails contributed on its own was that it could only process a single request per process at a time. Each machine would spin up 16 or 32 processes to handle requests in parallel, but each process needed its own connection to the database, to memcache, etc. At one point we had something like 100k processes all trying to talk to a single MySQL master. Much of this could have been mitigated by better design, of course, but Rails encourages models that don't scale up to crazy dimensions.
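
A minimal sketch of what that process-per-request model looks like in practice, assuming a Unicorn-style preforking setup (illustrative numbers and config, not Twitter's actual deployment):

    # unicorn.rb -- each worker is a separate OS process handling one
    # request at a time, so parallelism means more processes, and every
    # process holds its own database and cache connections.
    worker_processes 32   # 16 or 32 per machine, as described above

    after_fork do |server, worker|
      # Connections can't be shared across fork(), so each worker opens
      # its own. A few thousand machines x 32 workers each is how you end
      # up with ~100k connections pointed at a single MySQL master.
      ActiveRecord::Base.establish_connection
    end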

In reality, moderation was virtually impossible because we were in a 24/7 fight with ourselves over how to keep the system alive for the next couple of days. Constant infighting, managerial changes (I had 9 different managers in 3 years), focus changes (we didn't finish the last major site redesign before starting the next one), and a general unwillingness to pause features long enough to stabilize the system meant we were always on the losing end of an infra battle.




I don't know nearly enough about large-scale projects or the startup world, so maybe this is a silly question, but when it became apparent that it was blowing up far beyond what you were expecting, was it not an option to have another team (hired or split off from the existing one) start writing, in parallel, a re-implementation in another language or system that would work better for your new, quickly expanding needs? Was that just not feasible? No doubt it was a crazy hectic time for you folks back then; I can't even really imagine what that was like.


That's what kept happening, and it was an abysmal failure.

Every so often a person would have a brilliant idea on how to solve our scaling issues. They would then disappear into a corner to invent yet-another-bird-themed-datastore. After a few weeks/months they would appear with a magical new thing that would fix all our problems and would make everybody happy. Every single time it would fail.

Having a team that is not the main team design something means they likely didn't understand the state of the thing they were replacing. The thing they were replacing was a bucket of edge cases none of which they knew about. The scale never looked like what they expected because, in the meantime, the load had changed. This was compounded by the constant desire to hire somebody external who could solve the problem for us. They would come in with ego and a feeling that they had a mandate to replace it all. Eventually they would learn just how fragile and complicated the system was, only to then be considered old guard enough to be replaced by the next wave of experts. =/

But the number one killer was that every single thing was baked into the monorepo, so it wasn't like they could easily shim in something to replace the old thing. All the while they were building a change to the data store, another dev had added 15 new features that they now had to port over. In the time it took to port those over, another 20 had been added... etc.

Just getting the okay to pause feature development was like pulling teeth and it only bought you a few weeks at best.


> The thing they were replacing was a bucket of edge cases none of which they knew about.

Can I get this on a T-shirt?


Maybe I'm being dumb here, but from the outside Twitter doesn't look like a product that has many features. Are these focused on advertisers, analytics, or what?


At the time Twitter had a ton of features under the hood that kept being supported and maintained, all of which just added complexity to the system.

We had an API service, a web interface, a legacy web interface that was still used for select devices because the new UI didn't quite work right on them, an even older legacy interface that was necessary because a bunch of badly behaved early-day clients still relied on its functionality and were popular enough that turning them off would cause outrage, the "zero" interface used in countries with low-bandwidth connections, and the mobile interface.

Each interface had to implement all the different variations on functionality: timelines with inline tweet rendering (automatic expansion of images, etc.), lists (alternate-view timelines), the whole following graph (duplicated for lists as well), verified users and all the infra around that, search, public/private designations, direct messages, notifications via email, text message, and mobile app, favorites, retweets, replies, plus a slew of statistics and information-tracking data integrated directly into the site. That's only the user-visible stuff. There are a TON of experiments and projects that run behind that interface in a way the user will never completely see.

We heard over and over that Twitter was so simple it could run on a laptop, and every time it reminded me just how clueless most developers are when it comes to seeing the body of work needed to make something like Twitter work, even more so at the scale we are talking about.


So this is really just an issue of the reporter/editor not bothering to fact-check the statements of a non-technical former executive with, one assumes, an axe (or two) to grind.



