[flagged] Back of the Envelope Calculations at Twitter 2.0 (cohesive.so)
25 points by ankit1841 on Dec 1, 2022 | 35 comments



Personally I would skip, or at least smell-test, such calculations by a trip to the publicly available hard data:

https://d18rn0p25nwr6d.cloudfront.net/CIK-0001418091/947c0c3...

Twitter’s cost of revenue was $1.7B in FY2021. That includes their data centers, operations salaries, depreciation of assets, and AWS costs (plus other categories; it’s defined in the financials). There’s also a delta of $400M from FY2020, and over half of it is salaries.

So from these pieces of coarse information and the suggestion in this piece that storage costs alone are around 1B, I can surmise that these envelope calculations are off by a lot.


I have a feeling some of the numbers are extremely inflated from their actual values.

0.54 petabytes of “tweet” data a day, no multimedia, seems extremely high. This also assumes no compression or LUTs, which would make it multiple orders of magnitude off.

The multimedia numbers leave me with questions as well: how much of it is original content? What’s the rate of deduplication (because many people post the same non-original content)?

I’m sure they work with data orders of magnitude larger than most of us work with, but I highly doubt that Twitter alone contributes 5-10% of Amazon AWS total revenue.


It’s a nonsensical, clickbait article that uses trivial techniques to create ridiculous conclusions. Using AWS street prices to estimate TWTR’s on-prem costs, leaving aside their initial assumptions, is not only incompetent, it suggests they don’t have the most basic understanding of their domain.


Seems like the author just completely made up the numbers here, as well as all underlying technical assumptions. In reality, based on previous public information, Twitter runs most of their infra from their own datacenters, and the number of tweets per day hit 500 million back in 2013.

Not to mention, the author also completely ignores read traffic (which is the majority of traffic for a social network), replication/redundancy/HA, and so many other things. It's just a garbage article. It appears the company hosting this blog pays randos to submit posts on any topic? https://www.cohesive.so/write-for-cohesive


Yeah there's a screwup here. The earlier calculation is 0.54 PB of data per five years, not per day, including a tripling of the cost to achieve 3-way replication.

The same mistake is made for video. But one paragraph later the author uses the per five years value as if it's per day.


The estimation for a video's average size seems nuts too

> 1% of tweets contain videos of about 100MB each

I can't imagine the average video on twitter is taking up 100MB of storage.


Do you have to multiply by 3 at all? I thought AWS/Azure already handle that storage redundancy for you.


Even if you assume the rest of the numbers are right, nobody in their right mind at that scale pays list prices.


Twitter ran ALL its infra for $1.7B in 2021. It made $5B in revenue that year, and was profitable with a gross profit margin of 20+%. Your calculations are full of shit. And trying to fire 50% of the workforce and have the rest leave because the ship is sinking still doesn't make sense. To top it off, if your major problem is infra cost you don't fire engineers, you task engineers with reducing infra costs.


That's ~$4.7M per day or, since cloud products are priced per hour, ~$190k per hour.

W/o bulk discounts, ~$0.40 per hour gets you 32GB of memory and 16 vCPU on AWS.

So this is only 400k blades. ~20 blades per rack. ~20k racks.

That's like one large data center each in the US, EU, Asia, & South America. Sure, Twitter should be getting a lot more for that because they have a larger scale and AWS has like >50% margins or whatever.

But before anyone says this is out of control for a company the size of Twitter, it's really not.
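
The arithmetic in this comment can be reproduced with a quick sketch. It uses only the figures quoted in the thread ($1.7B per year, roughly $0.40 per instance-hour, about 20 blades per rack). The function name is made up for illustration, and the hourly price is the commenter's rough figure, not a verified AWS rate.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

def fleet_from_budget(annual_cost_usd, price_per_blade_hour, blades_per_rack):
    """Convert an annual spend into an equivalent always-on fleet size."""
    per_day = annual_cost_usd / DAYS_PER_YEAR
    per_hour = per_day / HOURS_PER_DAY
    blades = per_hour / price_per_blade_hour   # instances running around the clock
    racks = blades / blades_per_rack
    return per_day, per_hour, blades, racks

# Figures quoted in the thread; the $0.40/hour price is an assumption.
per_day, per_hour, blades, racks = fleet_from_budget(1.7e9, 0.40, 20)

print(f"${per_day / 1e6:.1f}M per day, ${per_hour / 1e3:.0f}k per hour")
print(f"~{blades / 1e3:.0f}k blades, ~{racks / 1e3:.0f}k racks")
```

Without intermediate rounding the result comes out a little above the comment's 400k blades and 20k racks, but in the same ballpark.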


That would absolutely be out of control. Assuming 24 core machines (which I would think is an underestimate), that'd be 10M cores, or ~1 core per 15-30 DAU. That seems like it's 3-4 orders of magnitude more than necessary.
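
As a rough check of the cores-per-user figure, here is a minimal sketch. The 400k blades and 24 cores per machine are the numbers from this exchange. The two DAU values are assumptions chosen to bracket the quoted "15-30" range: 150M is the article's figure and 300M is that figure doubled.

```python
def daus_per_core(blades, cores_per_blade, dau):
    """Return total core count and how many daily active users share one core."""
    total_cores = blades * cores_per_blade
    return total_cores, dau / total_cores

# 150M DAU is the article's number; 300M is a doubled, deliberately generous value.
for dau in (150e6, 300e6):
    cores, ratio = daus_per_core(400_000, 24, dau)
    print(f"{cores / 1e6:.1f}M cores, one core per ~{ratio:.1f} DAU")
```

That is about 9.6M cores, which the comment rounds to 10M.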


One of your orders of magnitude is that you're forgetting that 50-75% of traffic is signed-out, and that 30-50% of resources go to logging & analysis vs the core product.

Another order of magnitude is that you're forgetting the cost of 3x storage & buffer for 99.99% uptime.

Another order of magnitude is that data transfer is REALLY expensive. They're not paying just for compute & storage.


3x2x3 = 18x, which is still only ~1 order of magnitude (and I had actually already doubled the number of DAU I thought they had for that 30 number, but apparently it's more like 250M now, not 150M. I don't know whether that includes logged out users). I'm not sure how the data transfer or storage are relevant to how many blade servers are reasonable. My point was just that 400k blade servers would be insane. 400 seems like it'd be excessive, including triple redundancy.
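
The "18x is only about one order of magnitude" point can be checked in a few lines. The three multipliers are the ones in the comment. The labels attached to them are a guess at which factor answers which objection and are not stated in the thread.

```python
import math

# Multipliers from the comment's "3x2x3"; the labels are an assumed mapping.
factors = {
    "signed-out traffic": 3,
    "logging and analysis": 2,
    "replication and uptime buffer": 3,
}

combined = math.prod(factors.values())
orders_of_magnitude = math.log10(combined)

print(f"combined multiplier: {combined}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

An order of magnitude is a factor of 10, so taking log10 of the combined multiplier shows how much of a claimed 3-4 order gap these factors could explain.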


If you're Netflix - it costs a non-trivial amount of money to send video compared to the server that runs it.

If you're Twitter, this is also true to a lesser extent.

Netflix & Twitter have more server expenses than just querying databases.


Right, they have other expenses. I was commenting only on the "only 400k blades" part. That's a staggering amount of computing power. I'm sure their analytics and advertising stuff requires a lot of compute, but the core functionality from the user perspective should be doable on something closer to 4 dozen.


> Twitter ran ALL its infra for $1.7B in 2021. It made $5B in revenue that years, and was profitable with gross profit margin of 20+%.

I don’t know that it makes sense to say a company is profitable when they make a net loss, regardless of gross profit. In 2021 Twitter made a net loss of over $220 million.


In 2021, Twitter settled a shareholder lawsuit for $800 million. That's a one-time expense, not an ongoing cost each year.


Twitter also made a net loss in 2020 to the tune of over $1 billion.

They only managed to report net income in 2018 and 2019.

So looking at the overall picture, I'd still argue that calling Twitter a profitable business is very, very optimistic and not really true.


Seems less cut and dried to me:

* Profitable in 2018

* Profitable in 2019

* Would have been profitable in 2021, if not for the shareholder lawsuit

* Profitable in H1 2022, despite having an operating loss, due to making a $655m profit on the sale of MoPub

I'm not saying this is a great business. But some folks are making it sound like Twitter was about to collapse and go broke on its own, which is ridiculous. They still had $2.7B cash + $3.4B short-term investments on hand at the end of Q2!


> In 2021 Twitter made a net loss of over $220 million.

Due to a legal settlement payout of $800M. Without that one-off, they'd have been $580M in profit.


> 150 million DAU. Say 10% of these tweets contain pictures of average size 300kB each & 1% of tweets contain videos of about 100MB each.

My assumption would be that one new image is uploaded per 1000 tweets. Retweeting an image is more popular than posting a new one, and a retweet doesn't occupy extra space.

And 100 MB videos are pretty rare. Most are short, just a few seconds.
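
To see how much each assumption matters, here is a small sketch of the average media bytes per tweet. It uses the article's quoted shares and sizes, and reads this comment's alternative as one new image per 1000 tweets, which is an interpretation. The helper name is made up for illustration.

```python
KB = 1_000
MB = 1_000_000

def bytes_per_tweet(items_per_1000_tweets, bytes_per_item):
    """Average bytes contributed per tweet by one media type."""
    return items_per_1000_tweets * bytes_per_item / 1000

article_images = bytes_per_tweet(100, 300 * KB)  # 10% of tweets, 300 kB each
article_videos = bytes_per_tweet(10, 100 * MB)   # 1% of tweets, 100 MB each
comment_images = bytes_per_tweet(1, 300 * KB)    # assumed: 1 new image per 1000 tweets

video_share = article_videos / (article_images + article_videos)

print(f"article: {article_images / KB:.0f} kB images + {article_videos / MB:.0f} MB video per tweet")
print(f"comment: {comment_images / KB:.1f} kB images per tweet")
print(f"video share of the article's media estimate: {video_share:.0%}")
```

Under the article's own numbers, video accounts for nearly all of the media estimate, so the 100 MB assumption is the one that moves the total.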


https://blog.youtube/press/ says that they ingest 500 hours of video every minute.

It seems pretty unlikely that Twitter is ingesting 15M videos a day. That would mean that they are ingesting (15 million * max video length of 2.3 minutes) / (1440 minutes per day) / (60 minutes per hour) = ~400 hours of video every minute.

Given Twitter's video length limit, that would suggest more people are uploading videos to Twitter than to YouTube, which seems unlikely.
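
The comparison can be written out as a short sketch. The 15M videos per day and the 2.3 minute cap are the figures used in this comment, and 500 hours per minute is the YouTube number it cites. Treating every video as maximum length makes this an upper bound for that video count.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_HOUR = 60

def ingest_hours_per_minute(videos_per_day, minutes_per_video):
    """Hours of video uploaded per wall-clock minute."""
    video_minutes_per_day = videos_per_day * minutes_per_video
    video_minutes_per_minute = video_minutes_per_day / MINUTES_PER_DAY
    return video_minutes_per_minute / MINUTES_PER_HOUR

# Assumes every one of the 15M videos is at the 2.3 minute cap.
twitter_upper_bound = ingest_hours_per_minute(15_000_000, 2.3)
youtube_reported = 500  # hours per minute, as cited in the comment

print(f"Twitter (implied by the article): ~{twitter_upper_bound:.0f} hours/minute")
print(f"YouTube (reported): {youtube_reported} hours/minute")
```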


These calculations are off by several orders of magnitude. The author switches between a British billion and a US billion and back without knowing there is a difference, does not know what a relational database is, makes the ridiculous assumption that Twitter uses an AWS database to store tweets, and incorrectly believes DynamoDB is a relational database. This is not a correct or reasonable analysis.


The tweets/second calculation seems to ignore the existence of bots (the permitted kind).

The number is derived solely from the MAU/DAU number, but bots are not counted there. One active user may be running several bots that automatically post content on a schedule or as replies to keywords.

I'm not sure how to estimate the number of bots, but by their nature they can produce a lot more tweets than a person.


There's 0 chance that Twitter pays anything close to AWS's public pricing, even their published volume discount rates are likely far higher. There's a point where back-of-the-envelope calculations cease being useful... estimating volume here is one thing, but cost is just too abstract at this scale without inside info.


The problem with this article is that it starts out with great examples, but then fails to show their utility because the assumptions are off compared to reality.

Fagpacket maths is great for specific problems: can I make a mobile bridge? What price should I sell product X at? But all these examples have hard and agreed-upon constraints.

The example fails because the author doesn't appear to have worked with systems of a big enough scale to make good assumptions. Once you start hitting petabytes it's way, way, way cheaper to do on-prem. The issue becomes one of backup and retrieval speed (which they touch on).

The reason this is important is that for this fagpacket calc to work, you need to know the rough price per meg for storage of each asset class (and its purpose, i.e. for machine learning, serving the CDN, for logs, etc.).

Tweets will be stored differently from pictures, and pictures will be different from video.

Archiving is impractical for an always-on system (i.e. all old tweets and media need to load within 150ms).

The other big issue is that caching is never really mentioned. CDNs will be one of the bigger costs for Twitter. The more efficient caching is, the less your raw storage costs are.

TLDR: Fagpacket maths is great, but you need to know your constraints and parameters first. Otherwise you'll end up with wildly incorrect numbers.


At that cost, seems like the only way to win is not to play? Unless you build your service from the ground up as a paying service, you are doomed to fail at some point.


Assuming Twitter pays list prices for S3 storage costs is laughable. I've seen the kind of discounts companies a tiny fraction of their size can get.


Interesting to see how the article breaks down back of the envelope calculations, specifically in relation to data access and latency.


Back of the envelope or Fermi estimations are decent at giving you a ballpark figure, but in this case the numbers are way off: the prices listed don't take into account the AWS discounts which everyone gets, storage costs are overestimated, compression etc. is not taken into account, and Twitter media traffic is overestimated.


Seems like there's a missing variable in this calculation: it estimates 7000 QPS for _posting_ tweets but doesn't include any information about _reading_ tweets. Those 150M DAU are presumably reading a lot more than two tweets per day...

OTOH, perhaps that omission makes this a more accurate representation of how Elmo makes decisions. :P


This looks like a PR piece disguised as an engineering post.


Let's understand how we can do back of the envelope calculations, and view Twitter's engineering challenges from the eyes of its decision makers.


So in summary, Twitter was already running itself into the ground quicker without the 'Deep Cuts', and has always been run in an extremely inefficient manner with rising operational costs, which justifies the radical shakeup and cutting of employees needed to save it.

Not much of a surprise there, but necessary to significantly reduce these operational costs despite all the deranged and exaggerated scare stories of the immediate and complete collapse of the blue bird site.


That’s not evident from their financial filings, which show profits for recent years; they would have been profitable last year too, except for a one-time lawsuit settlement. They spent a lot on stock-based compensation but that’s a different category.

Any time the numbers are that far off, I’d question the conclusion being advanced by the person who isn’t legally liable for errors.



