

Facebook Chat Architecture (Erlang) - alifaziz
http://docs.google.com/viewer?a=v&q=cache:o0g6ONeGemcJ:www.erlang-factory.com/upload/presentations/31/EugeneLetuchy-ErlangatFacebook.pdf+architecture+of+facebook&hl=en&gl=sg&pid=bl&srcid=ADGEESiw63ZXSvf3kF4c-tTuKgPrHYucnpVY2FNqzV3vwY12fTxebQwEQMhf9PLmeEVVWioJiFDnY2aiaAnQaqL4E65nkHUleUIuZj2Ythq34IDRx5gVhajUSKjOUmLUVxkV_3GFC7M2&sig=AHIEtbThewthIAMefblxc9U9NWkD6yYY6g&pli=1

======
cageface
The nature of contemporary web applications seems to be getting closer and
closer to Erlang's original problem domain: lots of clients, continuously
pulling down small updates, with a need for consistent, low latency. Maybe
it's finally Erlang's turn to shine?

~~~
timdorr
This was on here a week or two ago:
[http://blog.mysyncpad.com/post/2073441622/node-js-vs-erlang-...](http://blog.mysyncpad.com/post/2073441622/node-js-vs-erlang-syncpads-experience)

Erlang's definitely got a speed and concurrency advantage. More people should
take the time to learn about it, as it fits a lot of use cases.

------
admp
For those who are perfectly comfortable with viewing PDFs directly, here's a
link: www.erlang-factory.com/upload/presentations/31/EugeneLetuchy-ErlangatFacebook.pdf

~~~
mikedmiked
Always appreciated. Thanks.

------
metabrew
Interesting to see that they mention loving the hot code reloading/upgrades,
but also that they don't use the OTP release/upgrade functionality.

I wonder what their process is. Just copying over new .beams and loading them
is fine for functional changes only, but you need a system to run code-upgrade
hooks if you want to change the state being passed around.

~~~
mononcqc
As far as I know, you don't need releases and upgrade functionality to make
use of the code-upgrade hooks. It simply works without them.

Releases and upgrades do make it simpler to synchronize what needs to be taken
down in what order, restarted, left alone, etc. But simple updates are still
doable without that.

~~~
metabrew
Yes you can trigger updates just by doing this in most cases:

    sys:suspend(Pid),
    {module, my_module} = code:load_file(my_module),
    sys:change_code(Pid, my_module, undefined, [], 10000),
    sys:resume(Pid).

But you still need a way to do that for running processes, something that
release_handler does for you with real OTP releases by finding which pids are
running for each module, by looking at supervisors and so on.

------
JonnieCache
On page 13 they quote peak inbound traffic as 1gb/second.

Would this be a gigabit or gigabyte? Usually I would just assume gigabit but
in the case of Facebook I don't feel so hasty...

~~~
nivertech
I guess it's Gb (gigabit), since the only traffic Facebook Chat has is
presence, which is handled using separate C++ servers. Out of 500M users (350M
active users) not many of them use chat, maybe because by design it feels
disconnected from the rest of the site.

~~~
ichverstehe
Do you have any sources to back up that claim? Among my friends Facebook Chat
has entirely replaced MSN Messenger.

~~~
nivertech
Anything can replace MSN Messenger ;)

~~~
JonnieCache
Exactly. And everyone in Europe used to use MSN Messenger. Now I expect they
use FB chat. My non-tech friends either use FB chat, or they don't use IM.

------
DrJosiah
I wonder if FB would be willing to post some of their load numbers...

Like... how many sessions are created per day, persisted from day to day, how
many messages sent per chat session, how long windows stay open, how many
restores, etc.

------
koski
I wonder how much it has changed since April 2009.

------
Aloisius
I'm confused as to why one would build a service in a language that is so
difficult to hire engineers for.

~~~
SkyMarshal
You don't have to hire engineers with N years of experience in Erlang, rather
hire great engineers who are programming language polyglots, and even if they
don't already know Erlang they'll get up to speed on it fast, as well as any
other new and useful language that comes along.

------
iregdtoreply
Interesting. I didn't know they used Erlang. Still, the quality of Facebook
chat is pretty poor.

~~~
JonnieCache
The XMPP gateway works just fine for me, as reliable as any other centralized
IM service I've used heavily.

Info on how to connect to it here: <http://www.facebook.com/sitetour/chat.php>

~~~
criticurious
Nice feature, but when I clicked the link and then clicked on 'Other'
(bottom center), it says 'Use SSL/TLS: no'. I wonder if the chat messages are
sent unencrypted?

~~~
JonnieCache
Personally I trust the endpoint less than the route in this case.

------
mcs
I wonder if they have been experimenting with Node.JS.

~~~
axod
You could write facebook chat in pretty much any half decent language. The
question is how competent the developers are rather than what tool they decide
to use.

It's not rocket science, it's mainly plumbing - moving data around.

~~~
nivertech
If you have unlimited resources and huge datacenters, maybe you can. Try to
calculate how many nodes you will need to implement a COMET-like chat system
for 500M users, especially when you're using Python or Ruby with some async IO
evented framework.
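
As a back-of-envelope illustration of the node count asked about above, here is a minimal Python sketch. Only the 500M user figure comes from the thread; the 10% concurrency share and both per-node connection capacities are hypothetical placeholders, not measurements of any framework.

```python
def nodes_needed(concurrent_connections: int, connections_per_node: int) -> int:
    """Ceiling division: smallest node count that can hold every open connection."""
    if connections_per_node <= 0:
        raise ValueError("connections_per_node must be positive")
    return -(-concurrent_connections // connections_per_node)


TOTAL_USERS = 500_000_000      # figure quoted in the thread
ONLINE_SHARE_PERCENT = 10      # assumption: share of users with chat open at once
concurrent = TOTAL_USERS * ONLINE_SHARE_PERCENT // 100

# Assumption: illustrative per-node capacities for held-open COMET connections.
for label, capacity in [("10k connections/node", 10_000),
                        ("100k connections/node", 100_000)]:
    print(label, "->", nodes_needed(concurrent, capacity), "nodes")
```

Under these made-up inputs, a tenfold difference in connections held per node is a tenfold difference in fleet size, which is the crux of the disagreement in this subthread.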

~~~
axod
I don't think that's the case. Facebook are the ones with 10,000 servers
(probably a lot more now idk).

Some people are often too quick to try and emulate people/products by picking
out some magical quality, using it, and expecting identical results.

It's just like women who latch onto fashion tips/diet tips from celebs - eg
"Oooo Angelina Jolie used the X diet. If I use that, I can be like her".

Similarly, some techies think "Ooo facebook used Erlang. If I use Erlang, I
won't have any problems scaling".

It's certainly more in how you approach things, and how you architect things
than which particular tool you choose to use. There is no magic solution.

FWIW, I run Mibbit which handles a good number of users and messages per
second on a few servers. Not facebook numbers yet, but not small either. I
think we do a few billion messages a month.
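
To put "a few billion messages a month" next to per-second figures, a rough conversion can be sketched as follows. The 3 billion value and the 30-day month are assumptions chosen for illustration; the thread gives no exact number.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def messages_per_second(messages_per_month: float, days_in_month: int = 30) -> float:
    """Average rate implied by a monthly total, assuming traffic is spread evenly."""
    return messages_per_month / (days_in_month * SECONDS_PER_DAY)


# Assumption: "a few billion" is read as 3 billion.
print(round(messages_per_second(3_000_000_000)))
```

This is an average only; peak traffic would be higher by some factor that is not stated here.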

~~~
nivertech
I don't give a shit about Facebook! I use Erlang because Ericsson has been
using it since the '80s. It's already 12 years since Ericsson open-sourced it.

All those evented frameworks are single-threaded. All my servers have at least
16 cores and in a few years they'll have 256 cores and more.

Good luck handling 256 instances of node.js per server.

Good luck connecting all your nodes into a distributed cluster.

The only real competition for Erlang in this case is Jetty/Netty with Java or
Scala.

Erlang/OTP is just a tool. You still need to work very hard to build a system
like Facebook Chat!

~~~
axod
I'm not sure what your point here is.

Facebook chat doesn't use CPU. Why would it need 16 cores? It's more than
likely IO bound. It's just moving boring data around.

If you're using serious CPU power to write a chat backend, you're doing
something really badly or using a crappy tool.

Your "But I need more CPU cores!!!" argument is moot.

~~~
cgbystrom
An ordinary Comet server with long-poll can actually end up eating quite some
CPU.

I run beaconpush.com which is a cloud-based service for browser push messages.
We're based on Netty and Java, which works out well for us. We've looked at
Erlang and wrote some initial prototypes in it but Java became our language of
choice, mainly because we knew it and had some experience with Netty, an NIO
library for Java. Netty also outperformed Erlang and mochiweb (both in their
vanilla configuration).

Anyway, the problem with long-poll tends to be that users re-establish their
connection back to the servers each time they receive a message or navigate to
another page (or reload the page). This ends up taking a big toll on the
system. We're to some extent limited by how fast we can perform accept() on
our sockets.

If you use long-polling with multi-part you can get away with sending more
messages on a single established connection (it becomes
request/response/response/response or similar). That can reduce the system
load and the use of WebSockets can eliminate the use of reconnections
altogether (disregarding any page reloads).

Facebook's use of AJAX navigation (i.e. not reloading the entire page when a
user clicks on links etc.) also reduces this load. This is due to not having
to re-establish a connection each time a user reloads a page.

So yes, we're actually CPU bound by the accept() behavior (at least by how the
JVM does accept()). But if our connections were more permanent in nature, then
no; we would of course be more I/O bound, if not completely.

~~~
axod
I'd do some more in-depth checks. I find it _very_ hard to believe you're CPU
bound within NIO itself.

I do around 1k req/s per CPU and never see load avg above 0.5.

>> "re-establish their connection back to the servers each time they receive a
message or navigate to another page (or reload the page). This end up taking a
big toll on the system. We're to some extent limited to how fast we can
perform accept() on our sockets."

I'd expect what you're seeing is the overhead of your system creating new
objects, initializing requests, etc., rather than anything low level.

(Mibbit uses keep-alive long-polling XHR / WebSocket, using a custom app
server in Java + NIO)

~~~
cgbystrom
It turns out that Java synchronizes on accept(), capping the number of socket
accepts that can be made each second on a single port.

But even if that were solved, or if you used something more low-level like,
say, nginx, you'd still be limited by how fast you can accept().

Try using ApacheBench and hit one server with keep-alive on and off
respectively. Even nginx, which I would say is a performant web server, shows
quite some discrepancies between having to reconnect each time and having a
permanent connection.

I file this under the penalty of having to set up and tear down a socket each
time, something that can only be avoided by doing less reconnecting.
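
The keep-alive on/off comparison described above can be approximated in-process with a minimal Python sketch. It is not ApacheBench, nginx, or Netty: it uses the standard library's `http.server` on loopback, and the names `CountingServer`, `PingHandler`, and `run_requests` are made up for this illustration. It counts how many times the server accepts a connection and times each run.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingServer(ThreadingHTTPServer):
    """HTTP server that counts how many TCP connections it accepts."""
    daemon_threads = True

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.accepted = 0

    def get_request(self):
        # socketserver calls this once per accept() on the listening socket.
        request = super().get_request()
        self.accepted += 1
        return request


class PingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the client reuse the connection

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the output quiet


def run_requests(count, keep_alive):
    """Send `count` GETs; return (connections accepted, elapsed seconds)."""
    server = CountingServer(("127.0.0.1", 0), PingHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = None
        start = time.perf_counter()
        for _ in range(count):
            if conn is None:
                conn = http.client.HTTPConnection("127.0.0.1", port)
            conn.request("GET", "/")
            conn.getresponse().read()
            if not keep_alive:
                conn.close()  # force a fresh connection for the next request
                conn = None
        elapsed = time.perf_counter() - start
        if conn is not None:
            conn.close()
        return server.accepted, elapsed
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for keep_alive in (False, True):
        accepted, elapsed = run_requests(200, keep_alive)
        print(f"keep_alive={keep_alive}: {accepted} accepts, {elapsed:.3f}s")
```

On loopback the timing gap understates what a real network would show, so the accept count is the more reliable signal here: one accept per request without keep-alive, a single accept with it.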

I could of course be wrong but these are the findings from doing performance
testing and discussing it on the Netty mailing list. But I'd gladly accept
suggestions on how to improve this situation for our service.

~~~
axod
I'd personally start by not using Netty if you have the time to start from
scratch :/

~~~
cgbystrom
Netty is not the problem. Java NIO has that limitation. And performance isn't
in any way bad, just that there's more to extract (I believe).

~~~
axod
> "Netty is not the problem. Java NIO has that limitation."

As I say, I'd be interested in numbers to back that up. It's certainly not a
limitation I've noticed.

