
Building Facebook's Service Encryption Infrastructure - sudoyear123
https://code.fb.com/security/service-encryption/
======
Shish2k
> After several days, we finally narrowed down the issue to a bad Advanced
> Vector Extensions (AVX) instruction on a single CPU in our fleet

This isn't even the first time I've heard of an issue at FB being caused by a
single bad CPU instruction. Working at a scale where "Problem X is a one-in-a-
million edge case" and "Problem X happens several times per day" are
synonymous is weird...
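
To put rough numbers on that, here is a back-of-the-envelope sketch. The event
rates below are assumptions picked for illustration, not figures from the
article or from any real fleet.

```python
# Expected daily count of a rare event: trials per day times per-trial probability.
SECONDS_PER_DAY = 24 * 60 * 60


def expected_events_per_day(events_per_second: float, probability: float) -> float:
    """Expected occurrences per day of an event with the given per-trial probability."""
    return events_per_second * SECONDS_PER_DAY * probability


# Hypothetical workloads (assumed values, for illustration only).
small = expected_events_per_day(50, 1e-6)    # about 4.3 per day
large = expected_events_per_day(1e9, 1e-6)   # about 86.4 million per day
print(f"{small:.2f} per day at 50 events/s")
print(f"{large:,.0f} per day at 1e9 events/s")
```

At an assumed 50 events per second, a one-in-a-million case already shows up a
few times per day; at a billion events per second it is routine.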

~~~
Diederich
At a previous organization I worked at that operated at this kind of scale, I
came up with a couple of maxims:

1\. If you're only 99% automated, you're dead.

2\. Anything you can easily imagine going wrong is probably going wrong right
now.

3\. Everything else that can possibly go wrong will go wrong at some point in
the not distant future.

Some of the problems we ran into were pretty fun and challenging. The longest
substantial one I'm aware of took a couple of years to figure out.

~~~
seanyesmunt
Can you share the problem that took a couple of years to figure out?

~~~
Diederich
Sure. I didn't figure this out, but was relatively close to the investigation.
My team provided a lot of data that ended up being used to come to root cause.

My team and extended teams managed almost 200,000 network devices, spread all
over the world, most of which were Cisco, and most of which were installed in
stores. And most of the switch ports were connected to customer facing Point
Of Sale devices. Among these are employee facing registers and customer facing
card scanners. That is, the devices you interact with whenever you scan your
card to pay for something.

With that many devices in that many locations in a largely unmanaged
environment (the switches would be installed all over the store, often in the
ceiling, and many of them experienced extreme temperatures), there was a
constant stream of failures. The process to manage these failures was
optimized, streamlined and largely automated.

However, it was discovered that switches were failing far more frequently in
the northern Midwestern US than elsewhere, and then only in the winter.

So this wasn't a really big operational issue, but it had a substantial cost
impact, and the rate was high enough that a lot of the affected stores did
notice and were complaining.

Right. Very strange, very mysterious.

So, briefly, the root cause:

Apparently, people in the upper Midwest wear wool to stay warm far more
frequently than people in other cold places, specifically the US northeast.
And much of
the time, the humidity is quite low. So, you have a lot of people wearing a
lot of wool in low humidity air. These people generated a lot of static, which
they would all too often discharge while interacting with the customer facing
point of sale device. And, all too frequently, that pulse of static would end
up flowing all the way back to the switch, often killing it.

I didn't follow the subsequent remediation efforts, so I don't know what if
anything was done about that.

~~~
laingc
That’s a fantastic anecdote. I would have loved to have been a fly on the wall
when the results were reported to management.

~~~
lioeters
Indeed, great story. I'd love to have seen the face of the person who finally
figured it out.

------
ihm
Does anyone know if they use encryption and access control to granularly
regulate access to data? For example, the part of the system that feeds data
to advertisements shouldn’t have access to my private messages (in my view
that would be a huge breach of trust with users.)

~~~
sudoyear123
There are several access control mechanisms. One, as mentioned in the post, is
identity certificates, which are used to perform access control. Other
mechanisms for identity are CATs, which have been talked about in the past:
[https://rwc.iacr.org/2018/Slides/Lewi.pdf](https://rwc.iacr.org/2018/Slides/Lewi.pdf)
and
[https://www.youtube.com/watch?v=kY-Bkv3qxMc](https://www.youtube.com/watch?v=kY-Bkv3qxMc)

~~~
Boulth
CATs at first sight look like Macaroons or JWTs. Thanks for the links!

~~~
sudoyear123
Ya they're similar in that they are all signed blobs of data, but different in
the sense that they are specifically designed to send authentication
information via several layers of proxies

~~~
Boulth
I'm actually interested in this subject so I'll check out your links when I'm
able to. At first sight this sounds like wrapping tokens or third party
caveats in Macaroons.

~~~
bdd
We presented on CATs again at DEF CON 26. It's a 21-minute talk, but if
you're interested in how CATs differ from Macaroons, you can skip to the 16:15
mark, where Yueting explains:
[https://cryptovillage.org/cats-a-tale-of-scalable-authentica...](https://cryptovillage.org/cats-a-tale-of-scalable-authentication/)

~~~
Boulth
I've seen both videos, nice explanation.

If you don't mind I wouldn't necessarily agree with the comment about JWT by
Yueting. JWT is just a format, querying backend to get a new token is not
necessary (this is only how people often use them). I actually built a small
PoC that mints new JWTs on client side (in the browser) signing them with a
non-exportable key (through Webcrypto).
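
To illustrate the "just a format" point, here is a minimal sketch that mints
and checks a JWT locally with only the Python standard library, with no
backend round trip. It is not the PoC described above: that one signs in the
browser with a non-exportable WebCrypto key, while this sketch assumes a
shared-secret HS256 key as a stand-in, and the claim names are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, the encoding JWT uses for each segment."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_jwt(claims: dict, key: bytes) -> str:
    """Build header.payload.signature locally; HS256 is HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    segments = [
        b64url(json.dumps(header, separators=(",", ":")).encode("utf-8")),
        b64url(json.dumps(claims, separators=(",", ":")).encode("utf-8")),
    ]
    signing_input = ".".join(segments)
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def verify_jwt(token: str, key: bytes) -> dict:
    """Check the signature and return the claims; raises ValueError on failure.

    A real verifier would also check expiry and audience claims (not shown).
    """
    header_b64, payload_b64, signature_b64 = token.split(".")
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    signing_input = header_b64 + "." + payload_b64
    expected = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url_decode(signature_b64), expected):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))


# Hypothetical key and claims, for illustration only.
demo_key = b"0" * 32
demo_token = mint_jwt({"sub": "service-a", "aud": "service-b"}, demo_key)
print(verify_jwt(demo_token, demo_key))
```

With an asymmetric algorithm the minting side would hold only a private key
and verifiers only the public key, which is closer to the WebCrypto setup.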

As for Macaroons I believe they could also be adjusted to resemble CATs as I
understood them (with layers for different services). I do have other issues
with Macaroons though
([https://news.ycombinator.com/item?id=17878845](https://news.ycombinator.com/item?id=17878845))...

------
dirkg
I wonder what the performance differences are with their approach vs using K8s
+ service mesh of your choice which includes all this.

I imagine Google runs a bigger scale operation on top of Borg and their
internal service mesh.

~~~
Diederich
Borg != Kubernetes. k8s is based on Borg, but they are quite different.

Companies at this scale have integrations between all different levels and
layers of the 'stack' that make the use of off the shelf software difficult or
impossible.

~~~
dirkg
yes I know they are different, but the lessons learnt/dev in both products
probably end up influencing each other.

My point was that encryption of services is built into K8s/service mesh, and
I'm wondering how it fares compared to FB's approach.

~~~
sudoyear123
This would make a great comparison. I'm not certain whether or not K8s' mutual
auth supports session ticket resumption and distribution of short-lived
ticket keys. The ticket rotation design would probably make a great addition
to K8s. There are a lot of intricate details in design which can make a major
difference in not only performance but also whether or not the system wakes
you up at night.
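
For readers unfamiliar with ticket key rotation, here is a rough sketch of the
general idea, not of the design in the post: tickets are issued under the
newest key, and a few recent keys are kept so that tickets issued just before
a rotation still resume. The class name and the `keep` parameter are
hypothetical, the sketch only authenticates the ticket with HMAC where a real
TLS stack would also encrypt it, and distributing keys across a fleet is not
shown.

```python
import hashlib
import hmac
import os
from collections import deque
from typing import Optional

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


class TicketKeyRing:
    """Issue tickets with the newest key; accept tickets made with recent keys."""

    def __init__(self, keep: int = 3) -> None:
        # Newest key sits at index 0; the oldest falls off the right end.
        self._keys = deque(maxlen=keep)
        self.rotate()

    def rotate(self) -> None:
        """Add a fresh random key, dropping the oldest once the ring is full."""
        self._keys.appendleft(os.urandom(32))

    def issue(self, session_state: bytes) -> bytes:
        """Return tag || state, where the tag is computed with the newest key."""
        tag = hmac.new(self._keys[0], session_state, hashlib.sha256).digest()
        return tag + session_state

    def accept(self, ticket: bytes) -> Optional[bytes]:
        """Return the session state if any retained key validates the ticket."""
        tag, state = ticket[:TAG_LEN], ticket[TAG_LEN:]
        for key in self._keys:
            expected = hmac.new(key, state, hashlib.sha256).digest()
            if hmac.compare_digest(tag, expected):
                return state
        return None  # key already rotated out: fall back to a full handshake


ring = TicketKeyRing(keep=2)
ticket = ring.issue(b"session-state")
ring.rotate()
print(ring.accept(ticket))  # still accepted: the old key is retained
ring.rotate()
print(ring.accept(ticket))  # rejected: the issuing key has been dropped
```

The shorter the key lifetime, the smaller the window in which a leaked ticket
key is useful, at the cost of more full handshakes after each rotation.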

~~~
brown9-2
Kubernetes does not have a built-in mutual auth solution

------
ropman76
"After several years of trying to manage these issues with Kerberos, we
decided to redesign the system from the ground up.." I am curious, is there
anyone who went the opposite direction and implemented a new
Kerberos library or setup?

~~~
cryptonector
So, several places, such as Stanford or Morgan Stanley, have sophisticated
Kerberos setups using something like Russ Allbery's Wallet[0] (Stanford) or
Roland Dowdeswell's OSKT[1] (Morgan Stanley, Two Sigma) stack.

For example, OSKT is a self-service toolkit that lets users build up access
controls for clusters, "role accounts" (user accounts for running application
automation), and what not. Users use "krb5_prestash" to indicate what hosts
should have what role accounts' credentials (the user must own the hosts and
role accounts) and krb5_keytab to get keys for services on hosts they are
allowed to run. A nifty trick is to have wildcard DNS A RRs for hosts so that
one can have HTTP/${USER}.$(uname -n) principals (and keys for them) on any
host the user can login to.

All of this is high-performance and self-service. Users don't need to file
JIRA tickets or whatever to get their keys for their services.

Self-service credential provisioning is absolutely _essential_ to successful
deployment at scale of any authentication system one uses, whether that be
Kerberos or PKIX or DANE or anything else one might find or invent.

    
    
      [0] https://www.eyrie.org/~eagle/software/wallet/readme.html
      [1] https://oskt.secure-endpoints.com/
          https://github.com/elric1/

------
plainOldText
> We run one of the largest microservices deployments in the world, with
> thousands of services that perform billions of requests per second.

Thousands? Does one truly need that many microservices, even if you are the
size of Facebook? That is a lot.

~~~
throwawaymath
That actually sounds about right from my perspective (and experience). Out of
curiosity, why do you think that is a lot?

I would go a bit further and say most companies should only proceed with a
microservice architecture if they have sufficient scale and automation such
that decomposing their architecture will result in at least a high double
digit number of discrete services.

~~~
plainOldText
I think I was reading _microservices_ but I was imagining _services_ and
that's why it seemed like a lot. I guess it depends on the granularity of the
decomposition.

I agree, scale and automation are two important factors when decomposing
architectures. I think it would be really valuable if systems could decompose
themselves to some degree, based on scale and other factors, without much of
an operator's intervention.

------
Marako
It seems that they recreated a Service Mesh. Running Istio or Consul Connect
takes care of the vast majority of issues listed in this post: Encryption,
Identity, access control. And even transparently for the developers (no
modification of the code...)

~~~
StreamBright
Only the scale that these solutions can support is different.

~~~
bogomipz
Isn't Istio implemented mostly as a sidecar container (Envoy Proxy) though? The
article mentions they are running containers via their Tupperware
orchestrator. If they are largely running containerized where is the scaling
issue with adding sidecars to implement the service mesh? I don't have any
experience with Istio but I'm genuinely curious along which axis it (or
Connect, Linkerd etc) doesn't scale.

~~~
shereadsthenews
Envoy is so slow that deployment at this scale would be too costly, or if you
could afford it would immediately present itself as a huge opportunity for
cost reduction. People who are measuring their tail latency in microseconds
aren't going to tolerate Envoy's marginal latency, which will be milliseconds
even at the median.

~~~
bogomipz
Interesting, I wasn't aware that Istio has such performance issues. Isn't
Google using this though as well or at least an internal version of it? Surely
they are on the same scale as FB.

I'm curious as to what the cause of the latency is. TLS handshakes?

~~~
Terretta
That report resulted in an update to Istio docs: _“Warning not to use demo
profile for performance evaluation”_

[https://github.com/istio/istio.io/pull/4220](https://github.com/istio/istio.io/pull/4220)

More here, which basically suggests, don’t stop Istio from scaling out before
500 rps, it doesn’t like that at all:

[https://kinvolk.io/blog/2019/05/performance-benchmark-analys...](https://kinvolk.io/blog/2019/05/performance-benchmark-analysis-of-istio-and-linkerd/)

------
dpflan
Is this an engineering article to help prime the pump for discussing FB's
approach to taking privacy more seriously (e.g. Zuckerberg's "the future is
private")? The article does not explicitly state any connection to such larger
FB company and product developments, but it made me think it's connected in
some way.

~~~
jedberg
Looks like a pretty typical "we solve hard problems and you should come join
us" recruiting blog post.

