How Facebook Keeps Messenger from Crashing on New Year's Eve (ieee.org)
275 points by victorvation 6 months ago | 113 comments

I found the concept of affinity used for bunching messages to be a new concept. The other concepts are not surprising, they are the typical engineering solutions used.

One thing I am curious about: how does it compare to how the old postal systems used to handle Christmas and new year loads?

On a lighter note: perhaps you can predictively generate and cache messages at the receiver's end based on their contacts and their style of communication. When a sender actually sends a message, just send one bit across, and the local cache gets flushed and displayed :)

> I found the concept of affinity used for bunching messages to be a new concept.

I'm reasonably sure that UUCP + things like INN were "batching messages per destination" for efficiency a long time ago. Nowhere near the same scale, obviously, but the same kind of concept, no?
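For anyone who hasn't seen it, a minimal sketch of "batching messages per destination". The names (`Message`, `batch_by_destination`) and the routing key are made up for illustration; this is not how UUCP, INN, or Messenger actually implement it, just the general shape.

```python
from collections import defaultdict
from typing import NamedTuple


class Message(NamedTuple):
    destination: str  # hypothetical routing key, e.g. a host or shard name
    body: str


def batch_by_destination(pending: list[Message]) -> dict[str, list[str]]:
    """Group pending messages so each destination gets one combined send."""
    batches: dict[str, list[str]] = defaultdict(list)
    for msg in pending:
        batches[msg.destination].append(msg.body)
    return dict(batches)


pending = [
    Message("host-a", "hello"),
    Message("host-b", "happy new year"),
    Message("host-a", "lol"),
]
batches = batch_by_destination(pending)
# Three messages collapse into two sends, one per destination.
```

The saving comes from paying the per-connection overhead once per destination instead of once per message.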

Actually, I'd expect the scale would be similar.

The raw numbers (users, bandwidth, latency) would be quite different.

Take the 16 most common messages, normalize them a bit, and encode them in 4 bits. I would guess that would cover over 50% of the messages sent, maybe even close to 80%.
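A rough sketch of what that could look like. The codebook contents, the normalization rule, and the function names are all assumptions for illustration, not measured data or anything Messenger does.

```python
from typing import Optional

# Hypothetical codebook: up to 16 entries, so each index fits in 4 bits.
CODEBOOK = ["lol", "ok", "yes", "no", "thanks", "hi", "happy new year", "haha"]
INDEX = {text: i for i, text in enumerate(CODEBOOK)}


def normalize(text: str) -> str:
    """Lowercase, strip whitespace, and drop trailing punctuation."""
    return text.strip().lower().rstrip("!.?")


def encode(text: str) -> Optional[int]:
    """Return the 4-bit code for a common message, or None if it is not in the codebook."""
    return INDEX.get(normalize(text))


def pack_pair(first: int, second: int) -> bytes:
    """Pack two 4-bit codes into a single byte."""
    return bytes([(first << 4) | second])


code_a = encode("LOL!")
code_b = encode("Happy New Year!!")
packed = pack_pair(code_a, code_b)
```

Anything that misses the codebook (`encode` returns `None`) would still have to go out as ordinary text.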

While it's a nice compression technique, I doubt you'd gain anything from this at the message level.

Even by reducing a message's content to close to 0 bits, I doubt you would gain that much. The most common messages would probably be things like "lol" or an emoji, which in UTF-8 is 24 to 32 bits.

But there'll always be metadata associated with each message: message id, sender, receiver, timestamps, etc., which can't be removed. If those first 3 are typical GUIDs, this is already 3*128 bits. This puts a minimum on the entire message size, and more importantly, the processing time to simply route the messages is what is costly in terms of CPU time. Then there are the different receipts mentioned in the article, adding a lot more traffic.

> If those first 3 are typical GUIDs, this is already 3*128 bits

Facebook user IDs are 64-bit and I doubt that the message ID needs more than 64 bits either. A timestamp is typically 8 bytes, so 64 bits as well. So, you're looking at 256 bits of message header (sender, receiver, message ID, timestamp).
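To make the arithmetic concrete, here is a small sketch that packs such a header. The layout is an assumption (four unsigned 64-bit fields), not Facebook's actual wire format, and the field values are made up.

```python
import struct

# Assumed layout (not Facebook's actual wire format): four unsigned 64-bit
# fields for sender ID, receiver ID, message ID, and a timestamp.
HEADER = struct.Struct("<QQQQ")


def pack_header(sender: int, receiver: int, message_id: int, timestamp_ms: int) -> bytes:
    """Serialize the four header fields into a fixed-size binary header."""
    return HEADER.pack(sender, receiver, message_id, timestamp_ms)


header = pack_header(1001, 1002, 42, 1_546_300_800_000)
header_bits = len(header) * 8                    # 4 fields * 64 bits
payload_bits = len("lol".encode("utf-8")) * 8    # a short ASCII body
```

With these assumptions the header is 256 bits against a 24-bit "lol" body, so shrinking the body alone changes the total very little.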

It may be a good idea to provide a range of fixed templates for standard messages and send just the selected template ID along with the recipient name over the network.

Interesting how they decide to remove features like 'seen' and 'online' in an effort to prioritize the actual message.

Offtopic- I hate those features to begin with, but I know I'm the product not the customer, and those features are to keep people on the app.

I haven't used FB Messenger, but I also hate features where they show if the other person is typing a response. I intentionally disable it in Slack, so the other person simply sees a response when I'm done with it.

> I intentionally disable it in Slack, so the other person simply sees a response when I'm done with it.

FYI, I don't think that is possible in Slack. You can choose to hide typing indicators in your view, but AFAICT you can't prevent other people from seeing when YOU are typing: https://twitter.com/SlackHQ/status/1057288342671368194

Hmm, I suspected that was the case. Anyway, what I do now is type to myself, then copy and paste the response.

If you use the browser client, you should be able to make a simple Greasemonkey script to replace the default text input. Might save some time in the long run.

I guess someone like me could make a browser add-on, too. Though I don't know how many people use the browser client vs. the app.

If you run a proxy you could also block those requests.

The ability to roll your own Slack client would be a great thing imho. I really don't like the default UI (which is also the same layout as all the web based messaging apps, like it's somehow a good thing). I'd be 100% happy with a small window with an IRC style layout - it's a shame I can't use Slack (+WhatsApp, +Messenger) like this.

Back in the days we used to solve this by developing protocols not services.

I did something similar. Hook "WebSocket.prototype.send", check if arg is type "typing", if so return.

Likely against their ToS (https://old.reddit.com/r/programming/comments/9bc6gi/bye_bye...)

Heh, MSN messenger (back in the day... I don't know how they managed to kill it off) did this notification.

I had a 3rd party client (aMSN) that would pop up a chat window as soon as someone started typing an initial message to me. I would say "Hi" to them first and they would think it was a big coincidence that they were trying to talk to me at the same time.

AIM had a similar add-on as well; I recall spooking lots of friends like this and telling them I was a Jedi :P

Pidgin had this too for clients which supported it. It even warned you by displaying: "You feel a disturbance in the force." Good times.

Came here to post exactly that, I really enjoyed that feature back in the days!

Ha! I remember Trillian having the same exact phrasing

It was even funnier when I saw a friend’s sister kept opening up a chat box to see my profile picture without ever saying a word ;)

> I don't know how they managed to kill it off

They kept incrementing the server protocol version without updating the client. Eventually all the clients just stopped being able to connect. Strangely, for a long time, only the version bundled with XP stayed working, until eventually it too could no longer connect.

I heard they kept it alive in China longer. Maybe they updated their client. Maybe.

Diff'rent strokes: I really like that feature in iMessage. I find it kind of like being in a cartoon, with the little thought bubble about to produce a response.

The worst is someone contacting you

> Hi

(waiting) (waiting) (waiting)

< OK Hi

(typing) (pause) (typing) (pause) (typing) (pause) (typing) (pause)

FFS hurry up already

I believe that started off as an accessibility feature back in the AIM days.

I also despise those features, but I agree wholeheartedly that these decisions push the money forward

Out of curiosity, why do you not like those features?

To me, "seen" is a feedback mechanism. When I send a message to you, I don't always need a response, but I do usually want to know that you saw the message -- sort of an ACK.

This is a little difficult to explain, but as someone who (extremely ashamedly, in hindsight) became very obsessive about romantic interests - to the level of stalking behavior - at least twice over the course of my life - this is precisely the kind of technology design decision that has negative externalities which maybe aren't apparent to people who are of sound and balanced mind.

If someone is obsessed about another person, they're going to get the same kind of dopamine rush simply from seeing that the other person is online and active that most of us get from receiving new messages in our inbox or messaging apps. It's tricky to explain, but as you draw closer (or feel like you draw closer) to someone, the urge to be connected and know that they're present becomes overpowering. Moreso if that person is in any way encouraging this kind of dependence / attachment.

It's particularly bad when apps allow this kind of 'presence monitoring' without trying to deal with the obsessive behavior that it causes. Facebook, for example, clearly has the data and the ability to inform both parties of what is happening. Ironically in my case it suggested one of these people as a 'Close Friend'. It seems so much worse if the service provider is (at any level) encouraging these malign use cases - for the purposes of 'engagement', for example.

Fortunately I believe I now have the distance and levelheadedness to see how this all works (in terms of my own psychology, the relationships involved, and also the service providers who intermediated the communications), but it's something I wouldn't wish anyone else to suffer, and we could be a lot more careful about the way we design apps and services to avoid these kind of psychologically damaging situations.

The talk entitled 'What is Good Technology' from the 35c3 congress makes some very good points along these lines, in terms of what we can do to improve our technology product decision-making process to avoid causing pain and damage to cultures or people we might not otherwise understand: http://streaming.media.ccc.de/35c3/relive/9965

Congrats for becoming a more ethical person! It sounds like you're in a better place in terms of your relationships and I can't imagine that it's been easy. Keep on truckin'

Thank you very much, that means a lot to me; and yep, it's been tough, but the future's looking brighter.

Not afraid to admit I like to pretend I haven't seen a message until I can make time to actually respond to it, if it's a message asking something of me.

Maybe I should type a response the moment I receive the text across my phone alerts, but I want the freedom to defer it without having you think I'm ignoring you (even tho I am)

You can accomplish that today just by reading the notification instead of clicking into the conversation. Everyone does it.

I think the "seen" feature is a net positive though. Almost everyone likes it from the PoV of the sender.

I believe that the iOS app started preventing these recently; I noticed that all of my Messenger notifications stopped including the text of the message, instead just stating "<name> sent a message." (or something of the like), forcing you to actually open the app.

Let's just say it was not my favorite change, and not just for the privacy reasons under discussion.

> I noticed that all of my Messenger notifications stopped including the text of the message

Isn't that a preference in the notifications setting? "Show Previews" gives the options of "Always", "When Unlocked (Default)", "Never". Perhaps it's just defaulted to "When Unlocked" after an update?

I did just check, and it is set as "When Unlocked (default)". The unlocked screen does just show a no-preview Messenger notification, the actual preview text seems to be being provided to iOS as "<name> sent a message".

I do the same. I do feel ok not responding right away after I see a message, though. I wonder if it’s a cultural expectation that is/will change over time.

I'm not sure if that's still the case, but a few years ago you could block the ACK so the other parties couldn't know that you read the message [1]

[1] https://easylist-downloads.adblockplus.org/message_seen_remo...

Because a lot of people are not like you. If someone saw that I read their message they might immediately expect a response, and be left wondering why it took hours to get back to them (and react negatively to the delay). Not everyone treats messaging as the async thing it is.

>Not everyone treats messaging as the async thing it is.

But, but, but, it says 'instant messaging' right in the name.

You should reply the _instant_ you get MY message!

Unfortunately some people have the same attitude even towards email...

Hmmm... I thought the I in IM refers to the fact that the message is SENT instantly.

I think at its very base it's a privacy issue, so it's not surprising that FB has chosen to ignore it. Just because you sent me a message, why should you then be privy to whether I've read it or not without me explicitly doing something, or at least explicitly having the option to enable or disable this?

"Hey do you want to go on a date?"

read two days ago


Reading a message and not replying is considered rude but I don't want to reply to everything right away especially if it takes a lot of time to reply. I would rather there just be a thumbs up button on the screen so I can say I have read the message and do not intend to reply.

Because I like spending a long time coming up with the optimal message to send, and it's awkward if the other person knows how long it took me

Back in the day when SMS text message was the way to send messages to each other over the mobile phone network (in the UK), people would jump the gun by sending 'Happy NY!' messages 5 minutes before midnight, because the moment 12am hit, any messages sent then could be queued for hours as the mobile networks struggled to cope with the massive uptick in messages being sent at the same time.

I used to use bulk sending tools (yeah, they existed as J2ME apps) to send 10-50 SMS (it's been a while, not sure how many it was) and some would only arrive on January 1st quite a while into the day. At some point, it changed and everything would arrive just a few minutes after sending. I think that was when we had iPhone and Android already and I'm not sure if it was because of messengers (were there any back then?) or because the German telcos finally upgraded their infrastructure enough.

As someone who typically works on front-end projects, this was a very interesting read. I particularly loved the discussion of “graceful degradation.” That’s the kind of collaboration across the stack that makes a service like Messenger very pleasant to use.

Interesting that the messenger team is ~40 people, as compared to WhatsApp having 32 engineers at the time of their sale to FB.

The photo caption says that's just the infrastructure team. I'd imagine the product team is much larger given how many features are crammed into Messenger.

I wish it was the other way around. We'd have fewer, but faster features.

Seems like Messenger is getting WeChat-ized - in China, WeChat [0] is your one-stop shop for getting anything done - chatting, following celebrities, booking housekeeping, payments, even "mini-programs". And now many Western messaging apps are starting to take inspiration from this [1].

[0]: https://en.wikipedia.org/wiki/WeChat

[1]: https://blog.ycombinator.com/lessons-from-wechat/

Is it working though? Just because Facebook keeps adding features doesn’t necessarily mean that people are using those features at the same scale as WeChat. I avoid Facebook messenger entirely, but are others using it for all these ancillary things like sending friends money?

My (non-technical) friends & family, as anecdotal evidence, seem almost universally happy about switching to Facebook's Lite Messenger. (An app that only provides messaging functionality.)

They perceive it as less confusing.

Afaik the real team name is Messenger Foundation. FB has the concept of foundation teams, with the only goal of keeping a part of the service working no matter what.

/edit: been informed by some of the people in the picture that technically there are 2 teams there: Messenger Infra and Messenger Foundation

I can see some MySQL Infrastructure people in it :-)

I didn't realize message queues were used for this type of task. I'm assuming you would then also use autoscaling pods that respond to the number of messages in the queue. How do you scale pods fast enough for a messaging application or anything else trying for 100ms or less per operation?

I think over-provisioning is a way more common and sane approach that can address the bulk of spikes versus auto-scaling. Especially if you have these big known events (New Year's Day, Black Friday, ...) where you can over-provision (or controlled auto-scale if you will) for a short window.

My guess is they're doing both.

Anecdote time. I worked at a company where one project was over-provisioned on dedicated hardware and another auto-scaled in the cloud. The over-provisioned project was much cheaper, had significantly better response times and was easier to manage. It was load tested to handle over an order of magnitude more traffic than the all-time-peak and even though fully over-provisioned, it was cheaper than the baseline usage (and slower, and harder to manage) cloud solution.

The back of the napkin numbers I've seen say metal (onprem or colo) has a 100x price/performance advantage: 10x faster at 1/10th the cost.

Do you have any interesting reads about this? Most of the material I find is biased one way or the other by cloud hype or data centers trying to reel back customers.

AWS has gotten much better and closed the gap some. But, your average system will probably see a 4x difference (2x price and 2x performance).

There was a very long period of time where SSDs were commonly available from everyone but cloud vendors. For some workloads (like databases), that resulted in a massive difference.

This is still true for bandwidth. You can look it up yourself. Off the top of my head, in the US, you can find a server with 100mbps dedicated port for < $300. But that AWS bandwidth is over 3K. So that's less than 1/10th AND you get a relatively [relative to what you can get on AWS] powerful server vs just the bandwidth.

Back when the C4 instances were announced, I ran unix bench on them as well as a dedicated i7-4770. You can see the i7 was quite a bit faster, and, if I recall, was less than half the price

https://gist.github.com/karlseguin/5a6a45ace2048545b6c3 vs https://gist.github.com/karlseguin/a659ef87b3a4a5d590e9

I think database workload is still where the average app would see the biggest difference. A properly configured server with a battery backed raid adapter and proper dual network NICs will blow RDS out of the water for raw latency and cost and, most noticeable to me, consistent performance.

Unfortunately, fewer and fewer companies seem to be offering servers with BBU and dual NICs. And those that do are charging more...so that's also helped close the gap. IBM really screwed up. They wanted to compete with AWS so tried to turn Softlayer into an AWS clone, rather than focus on what Softlayer did better and fix the issues (automation and ddos mitigation come to mind).

For one, I thought Bitnami had done the comparison a few years back, but for the life of me I can't find it in the GOOG.

I don't remember that we did any benchmarks, but I don't doubt that if you look at the raw performance / price it can easily be an order of magnitude more expensive to go to the cloud. But once you take into account all the automation and services that the major cloud providers have, it can get much more cost effective as it saves significant time / expertise for most companies

They may have only mentioned Bitnami images, it's been awhile.

However, I think your rationale begs a question: how much of "the automation and services that major cloud providers have" (scare quotes only for readability) is needed by runners of metal? For instance, autoscaling can be mooted by overprovisioning, which would still incur only a minor cost increase. Multi-region is similarly cheap. A lot of the remainder seem like productized functionality that is fairly implementable locally if desired.

I should corroborate your anecdote with my own.

We did two games, one over-provisioned on metal, the other on auto-scaled cloud based infra.

The cost is significantly higher. We ended up with a hybrid to control for cost. But our needs are long-lived sessions which does not fit the elasticity of the cloud model well.

Messaging queues are a core part of a lot of high scale distributed systems (source - Twitter) You want enough queue space to handle the expected volume and then some. Assuming you have that, you don't need to instantly scale instances out to match the amount of messages, you just need to catch up before the queue space runs out.
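A toy simulation of that "catch up before the queue space runs out" reasoning. All the rates and the capacity are made-up numbers, and `peak_backlog` is a hypothetical helper, not anything from Twitter or Facebook.

```python
def peak_backlog(arrivals_per_sec: list[int], drain_per_sec: int) -> int:
    """Simulate a queue second by second and return the largest backlog seen."""
    backlog = 0
    peak = 0
    for arrived in arrivals_per_sec:
        # Backlog grows by what arrived and shrinks by what consumers drained.
        backlog = max(0, backlog + arrived - drain_per_sec)
        peak = max(peak, backlog)
    return peak


# Made-up numbers: a 3-second spike at 5x the steady rate of 100 msg/s,
# with consumers that can drain 200 msg/s.
arrivals = [100, 100, 500, 500, 500, 100, 100, 100, 100, 100, 100, 100, 100, 100]
peak = peak_backlog(arrivals, drain_per_sec=200)
queue_capacity = 1000
survives_spike = peak <= queue_capacity
```

Here the consumers never match the spike rate, yet the backlog tops out below capacity and drains afterwards, which is the whole point of buffering.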

Message queues (or similar things like Kafka, which isn't quite a proper "message queue") are used for basically everything at this scale. Messages are being passed indirectly. An event happens, it gets popped on a queue, and then the recipients do something with it.

Maybe you'll consider this pedantic, or maybe even wrong, but I think you meant to say "asynchronous" as opposed to "indirectly". I think anyone googling to learn more will get better results.

True, but there is also the store-and-forward subscription model where the message is placed on the queue without an intended recipient. Message passing can also be asynchronous but direct, e.g., Mach messages.

Facebook runs its own hardware. How would "autoscaling" help them?

Dynamically allocating hardware to different services makes sense for self managed DCs too. Breaking down cost per service or team is useful, as is seeing how much hardware is necessary for peak and trough I imagine.

I wouldn’t be surprised if on NYE services like chat take priority and get extra resources from background jobs that can wait a few hours.

One way is to over-allocate in the first place. When your spare pool is draining below a watermark, you scale in. Hopefully there is enough time for that scale event to complete before the pool drains completely.
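A minimal sketch of the watermark check, with hypothetical names and thresholds (`spares_to_request`, a low watermark of 20, a refill target of 50). A real scheduler would also account for how long new capacity takes to come up.

```python
def spares_to_request(spare: int, low_watermark: int, target_spare: int) -> int:
    """Return how many instances to provision; zero while at or above the watermark."""
    if spare >= low_watermark:
        return 0
    return target_spare - spare


# Hypothetical pool: start requesting capacity once fewer than 20 spares
# remain, and refill back up to 50.
requested_when_healthy = spares_to_request(spare=35, low_watermark=20, target_spare=50)
requested_when_low = spares_to_request(spare=12, low_watermark=20, target_spare=50)
```

The gap between the watermark and zero is the time budget for the scale event to finish.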

One thing I find very manipulative about Facebook is how, when someone sends you a message, the email notification has a link to open messenger and it says that messenger is the only way you can read that message, even if you don’t have messenger installed. They are trying everything they can do to have everyone install that app. Yet, of course you can just read and respond directly on their website without any app, but they don’t link to that or mention it.

mbasic.facebook.com for anyone reading this and looking for a way to work around Facebook's dark UI pattern on mobile, where a click on the "messages" button wants you to install the app.

Or use "Request desktop site" in your browser, it gives you the same thing.

Or you could just use some other service than facebook. Like signal or xmpp.

Yes, please absolutely do that, too.

I read the title and instantly thought "Erlang"

Fun fact: FB Chat was originally implemented in Erlang


Essentially what everyone else does - distributed systems with load balancing, load balancing and more load balancing. And if that goes awry, triage - where they prioritize messages and simply timeout and drop the lower priority messages. Of course the Messenger team is lucky in that they can drop messages since your family and friends missing a "Happy New Years" message isn't the end of the world. Other systems ( such as finance ), aren't so lucky. Drop a few transactions or apply them out of order and it is the end of the world. Was an interesting read, though it would have been nice if there were more specifics but I guess Facebook wouldn't approve that.

I don’t know, a missing “Happy New Years” isn’t missing dollars, but it’s definitely not cool to drop such a greeting —or any message—in my opinion. It should definitely be possible to at least store and then deliver these messages late. The baseline should be 100% deliverability and anything less than that should be subject to intense scrutiny. I mean, how big of a Kafka cluster do you need to make this happen?

Actual messages with content are never dropped. Only ‘meta-messages’ like read receipts - it’s not critical if on a group chat with many participants the state of ‘who’s seen the last message’ is not 100% correct on New Year’s Eve.
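As a sketch of that kind of triage: keep everything normally, and under overload shed only the metadata. The event kinds, the `MUST_DELIVER` policy, and `shed_load` are assumptions for illustration, not Messenger's actual implementation.

```python
from typing import NamedTuple


class Event(NamedTuple):
    kind: str     # e.g. "message", "read_receipt", "typing", "presence"
    payload: str


# Assumed policy: only real messages are must-deliver; everything else is
# best-effort metadata that may be shed under load.
MUST_DELIVER = {"message"}


def shed_load(events: list[Event], overloaded: bool) -> list[Event]:
    """Keep every event normally; under overload keep only must-deliver kinds."""
    if not overloaded:
        return list(events)
    return [e for e in events if e.kind in MUST_DELIVER]


events = [
    Event("message", "Happy New Year!"),
    Event("read_receipt", "seen by alice"),
    Event("typing", "bob is typing"),
]
kept = shed_load(events, overloaded=True)
```

Under overload only the actual greeting survives; the receipt and typing indicator are dropped.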

> Of course the Messenger team is lucky in that they can drop messages since your family and friends missing a "Happy New Years" message isn't the end of the world. Other systems ( such as finance ),

Messenger is (also) a finance system. You can send money via Messenger. You can purchase products directly in Messenger. It's had all that for more than two years now.

That’s not the proper comparison. The difference is that payments sent through Messenger aren’t critical (think paying your friend for lunch), whereas for most financial transactions (eg trades of stocks, derivatives, swaps etc) it’s absolutely critical that they execute immediately.

Then don't drop the transactions pertaining to sending money. They are not necessarily treated the same way as messages in the backend.

The OP mentioned a prioritization system for a reason...

From one of my projects (an MMORPG) I've learned that the required accuracy in non-financial transactions is often underestimated, while, on the other hand, financial transactions are often less critical than initially assumed. After all, compensation in financial transactions is often straightforward to calculate and apply. But the damage done through dropped/failed non-financial transactions is often hard to assess, and it's also more involved to find appropriate compensation.

Do you have any references/resources for plussing up on scaling transactions w.r.t. finance? It sounds like an intriguing set of challenges...

Mine crashes stably on launch right now on two different machines. Therefore not so good :-D

New Years Day hasn't happened yet. Your problems indicate that you should turn it off and on again.

Well, the problem exists for quite a while now. But what's the point of the backend working, if the app does not launch. Says something about quality, when a 300MB IM app won't start.

Turn two different machines off and on again? Doesn't sound like a coincidence to me, unless some other shared hardware (network, etc) is causing the problem.

How do you know if your problem is a client app problem vs a backend issue? :P

on mobile? are you running the most up to date version?

are you on a web browser? are any of your extensions mucking things up?

Both machines are Windows, the app installed from the store. When it does not crash on launch, it takes forever to start.

This certainly feels like a paid article to improve Facebook's image.


Facebook gets its fair share of bad PR (some of it well deserved), but we shouldn’t dismiss amazing engineering work because of that. This is a technical piece that highlights solutions to problems not many out there get to solve.

Uber's technical blog posts some amazing stuff - their work on using satellite rssi data and GPS almanac info to detect multipathing in gps signals reflected from highrise buildings to more accurately determine which side of the road a phone is on is _magnificent_.

But they, like Facebook, deserve every bit of bad PR they get for being truly awful corporations.

In case anyone is looking for the Uber blog post: https://news.ycombinator.com/item?id=16887276

The thing is, this is clearly for marketing purposes in the end. The approach is complex and sophisticated, but certainly not novel. A good architect in any decent interview should get to a similar solution. This must be read between the lines: this is an attempt to try and change their bad PR cycle.

I'm not saying the parent comment is absolutely essential, but I'd rather see that instead of people using this to change the conversation from central flaws in Facebook's data practices, business model, and ethics in general. This may sound a bit harsh, but at a certain point you have to really consider what giving publicity to a group gives them versus the inherent value of the paper.

Change your public image by publishing to IEEE?

To developers, yes.


Fulfilling Godwin's law in under an hour, nice.

GP's post was begging for it.

I thought they'd just let the NetFlix and Spotify servers handle any overflow.

Can't wait for awkward new yr wishes from ppl I haven't heard from since last new year.

Deleted my FB 2 years ago... no regrets. You should try

This is a 2018 internet-connected app, not a 1985 GSM network.

1 billion 100 Byte messages sums up to an almost trivial 100GB. This might be a technical challenge for the neighborhood's web admin but not for any real company.
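Spelling out that arithmetic, under the assumption (not stated above) that the billion messages are spread evenly over one day:

```python
MESSAGES = 1_000_000_000
BYTES_PER_MESSAGE = 100
SECONDS_PER_DAY = 24 * 60 * 60

total_gb = MESSAGES * BYTES_PER_MESSAGE / 1e9        # decimal gigabytes
avg_messages_per_sec = MESSAGES / SECONDS_PER_DAY     # if spread evenly over a day
avg_mb_per_sec = avg_messages_per_sec * BYTES_PER_MESSAGE / 1e6
```

That works out to 100 GB total, roughly 11,574 messages per second and about 1.16 MB/s on average. An even spread is the optimistic case; a spike concentrates the same volume into a much shorter window.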

What kind of tyro thinks GB/s instead of ops/s is the right measure here?

This is a text message. Unless you're doing it horribly wrong, ops/s should be very close to the number of messages.

That wasn't the point at all. You were sneering at others because you had converted the numbers into GB/s and those numbers looked small, but the only foolishness was doing the conversion in the first place and the only person doing it was you. A billion messages of any size is nothing to sneeze at. Have you ever worked on a system that handled that many messages in its entire lifetime, let alone a single day?

P.S. There's no such thing as a 1985 GSM network. That phrase just makes your comment look even sillier.
