
An analysis of Facebook photo caching - nbm
https://code.facebook.com/posts/220956754772273/an-analysis-of-facebook-photo-caching
======
Jemaclus
This is pretty interesting. I mean, it seems pretty intuitive and obvious once
it's put out there like that. I currently deal with a few hundred million
items that need to be delivered in real time, and we use a somewhat similar
structure, although our algorithms for cache invalidation are much less
sophisticated. I wonder how much effort it would take (and how much I can put
forth) to improve my own system to be more efficient.

In other words, I wonder how much of this efficiency boost is due to FB's
abilities (both in people and technology) to scale. The paper seems to imply
that it's relatively simple, now that the data has been gathered, but for a
one-person team like mine, I wonder what benefits I can take away from this.

~~~
nbm
I work with the Facebook CDN team and (amongst other things) maintain the data
pipelines that log requests, the tailers that fetch/annotate the requests, and
populate the Hive tables that we used for this research and other improvements
to serving content.

There is certainly a lot of infrastructure this is built on that many small
teams at other companies don't immediately have access to: self-service
hardware provisioning, Scribe logging infrastructure, tailer frameworks with
checkpointing and retries (and job systems to schedule them), and large
amounts of available space on Hive for experimentation. But most of the
software parts are available as open source, so it doesn't need to remain
unavailable.

This is the first team at Facebook where I've been heavily involved in this
scale of data capture and analysis, but it only took a few days to get up to
speed through a combination of great tools and good documentation. Being able
to drop a Python file in a code repo to ensure that some complex data
warehousing task takes place every day after that is pretty powerful.

------
leapinglemur55
Why are they still using the FIFO policy when their own data shows that S4LRU
beats it in every area? Is it just easier to implement?

~~~
alexgartrell
The short, uninteresting answer is that it's a work in progress. Initially we
built FIFO into our caches because it was easy to build and didn't interact
badly with flash disk craziness (write amplification specifically). You can
read more about our old caching system mcdipper at
[https://www.facebook.com/notes/facebook-engineering/mcdipper...](https://www.facebook.com/notes/facebook-engineering/mcdipper-a-key-value-cache-for-flash-storage/10151347090423920)
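
For reference, the S4LRU policy the question mentions works roughly like this
(a minimal Python sketch based on the paper's description; the class name,
segment capacities, and structure are illustrative, not production code):

```python
from collections import OrderedDict

class S4LRU:
    """Segmented LRU with 4 segments, per the photo-caching paper.

    Segment 0 is the lowest (eviction end); segment 3 is the highest.
    New items enter segment 0; a hit promotes an item one segment up;
    overflow from a segment demotes its LRU item one segment down.
    """

    def __init__(self, per_segment_capacity, segments=4):
        self.cap = per_segment_capacity            # max items per segment
        self.segs = [OrderedDict() for _ in range(segments)]

    def _find(self, key):
        for level, seg in enumerate(self.segs):
            if key in seg:
                return level
        return None

    def _insert(self, level, key, value):
        # Insert at the head of `level`; overflow cascades down a level.
        while level >= 0:
            seg = self.segs[level]
            seg[key] = value
            seg.move_to_end(key, last=False)       # head = most recent
            if len(seg) <= self.cap:
                return
            key, value = seg.popitem(last=True)    # demote the LRU item
            level -= 1
        # Fell off segment 0: the item is evicted from the cache entirely.

    def get(self, key):
        level = self._find(key)
        if level is None:
            return None                            # miss
        value = self.segs[level].pop(key)
        # On a hit, promote to the next-higher segment (capped at the top).
        self._insert(min(level + 1, len(self.segs) - 1), key, value)
        return value

    def put(self, key, value):
        level = self._find(key)
        if level is not None:
            self.segs[level].pop(key)
            self._insert(min(level + 1, len(self.segs) - 1), key, value)
        else:
            self._insert(0, key, value)            # new items start at the bottom
```

Compared to FIFO, each hit moves an item further from eviction, which is why
S4LRU wins on hit ratio but generates a less sequential write pattern on
flash.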

------
chime
If cache size is infinite, why isn't the hit ratio 100% minus new photos? Due
to auto expiration / TTL?

~~~
nbm
Yes, mostly due to new content. You have to miss on the first request for the
new content, and each of the 350 million or so photos a day will contribute a
miss. Also, this only ran over a particular time period, so one-off requests
for older content (someone revisiting their old photo albums) would be misses
too.

I'm not sure whether a "refresh" (ie, 304 not modified due to expiration) is
counted as a miss in this data.
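
The compulsory-miss effect is easy to see with a toy trace: even an unbounded
cache must miss the first request for every object, so its hit ratio is
bounded by the share of repeat requests (a hypothetical sketch, not our
logging pipeline):

```python
def infinite_cache_hit_ratio(requests):
    """Hit ratio of an unbounded cache over a request trace.

    Every request after the first for a given object is a hit, so the
    number of misses equals the number of unique objects in the trace.
    """
    unique = len(set(requests))
    return 1 - unique / len(requests)

# Toy trace: photo 'p1' requested three times, 'p2' once, 'p3' twice.
trace = ['p1', 'p2', 'p1', 'p3', 'p1', 'p3']
print(infinite_cache_hit_ratio(trace))  # 3 unique objects / 6 requests -> 0.5
```

With ~350 million new photos a day, those first-request misses alone keep the
ceiling well under 100%.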

~~~
js2
Did they do any experimentation with pre-caching new content? I wonder what
percentage of those new photos each day end up being requested.

~~~
nbm
I'm not sure if Wyatt, Qi, et al. looked at pre-caching at all. I've not
looked at the numbers yet, but it's on my list of things to look into once the
next few months of more obvious wins are worked through.

------
felixthehat
It'd be interesting to see the 1,000 most popular photos each day; _some_ of
them must be really great.

~~~
samplonius
Inside Facebook, the list of the top 1,000 daily photos is known as the
"Buddha List", because viewing them is a transcendent experience, after which
your waking life is little more than a crude construct of colorless shapes.

And then you get fired for looking at private user photos.

------
umsm
Here's the video presentation of this page:

[https://www.youtube.com/watch?v=ENaQScyvOzY&list=PLn0nrSd4xj...](https://www.youtube.com/watch?v=ENaQScyvOzY&list=PLn0nrSd4xjjZsNjpfWvNEtuBMtKkTd3gW&index=12)

