
Announcing Envoy: C++ L7 proxy and communication bus - ryan_lane
https://eng.lyft.com/announcing-envoy-c-l7-proxy-and-communication-bus-92520b6c8191#.fk8c6rbku
======
seasonedschemer
The info in
[https://lyft.github.io/envoy/docs/intro/comparison.html#prox...](https://lyft.github.io/envoy/docs/intro/comparison.html#proxygen-and-wangle)
about Proxygen not supporting HTTP/2 is not correct. Proxygen has
had HTTP/2 support for a while
([https://github.com/facebook/proxygen/blob/master/proxygen/li...](https://github.com/facebook/proxygen/blob/master/proxygen/lib/http/codec/HTTP2Codec.h)).

Disclaimer: I work on Proxygen at FB.

~~~
mattklein123
We will fix the docs, apologies. FYI, your README still says that "HTTP/2
support is in progress".

~~~
seasonedschemer
Thanks for pointing that out, will fix it on our side!

------
doublerebel
Wow, with L7 routing on path (not just host) this does almost everything I'm
using bud+fabio+consul to do. It's like Hystrix+sidecar-HAProxy in one.

The one thing I must have is SNI. The docs only have a short blurb [1]; does
anyone know the full status of SNI support?

[1]:
[https://lyft.github.io/envoy/docs/intro/arch_overview/ssl.ht...](https://lyft.github.io/envoy/docs/intro/arch_overview/ssl.html?highlight=sni)

EDIT: Also, is there any kind of visualization for the resulting network
topology? It looks like Envoy should know everything about who is talking to
whom.
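For context, routing on path (not just host) lives in Envoy's route
configuration. A minimal sketch of the idea — the cluster names and paths
here are made up, and the exact JSON schema may differ between versions:

```json
{
  "virtual_hosts": [
    {
      "name": "backend",
      "domains": ["*"],
      "routes": [
        {"prefix": "/api/rides", "cluster": "rides_service"},
        {"prefix": "/", "cluster": "default_service"}
      ]
    }
  ]
}
```

Requests whose path starts with `/api/rides` go to one upstream cluster,
everything else to a default, all within a single listener.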

~~~
mattklein123
Visualization is going to be a big area of future investment for us. We
already have some pretty cool tools internally and obviously lots of
dashboards, etc. but we would love to have a dedicated UI for Envoy. If you
know any good UI devs who would want to work on this please send them our way.
:)

~~~
doublerebel
Thanks for all the info. Any insight into the service discovery issues
described in the docs [1]?

    
    
      Many existing RPC systems treat service discovery as a
      fully consistent process. To this end, they use fully
      consistent leader election backing stores such as
      Zookeeper, etcd, Consul, etc. Our experience has been
      that operating these backing stores at scale is painful.
    

[1]:
[https://lyft.github.io/envoy/docs/intro/arch_overview/servic...](https://lyft.github.io/envoy/docs/intro/arch_overview/service_discovery.html?highlight=consul)

~~~
mattklein123
Mainly just years of experience at different companies watching ZK, etcd, etc.
fall over at scale and require teams of people to maintain them.

We have had zero outages caused by our eventually consistent discovery system
with active health checking (knock on wood), and haven't really touched the
discovery service code in months. It just runs.

I'm not saying that a system using ZK, etc. can't be made to work. It
certainly can since many companies do it. It's mostly that I think those
solutions are actually making the overall problem a lot more complicated and
prone to failure than it has to be.
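The eventually consistent approach described above can be sketched in a few
lines. The class and method names below are hypothetical (not Envoy's actual
code); the point is that active health checking, not the discovery data, has
the final say on whether a host receives traffic, so stale discovery
snapshots are tolerated:

```python
class HostTracker:
    """Sketch: combine eventually consistent discovery with active health checks."""

    def __init__(self):
        self.discovered = set()  # last snapshot from discovery (may be stale)
        self.healthy = set()     # hosts currently passing active health checks

    def on_discovery_update(self, hosts):
        # Discovery data is advisory and may lag reality.
        self.discovered = set(hosts)

    def on_health_check(self, host, passed):
        # Health checks run continuously against known hosts.
        if passed:
            self.healthy.add(host)
        else:
            self.healthy.discard(host)

    def routable_hosts(self):
        # A host gets traffic only if it is both discovered and passing
        # health checks; a discovered host that fails checks is excluded
        # immediately, without waiting for a consistent discovery update.
        return self.discovered & self.healthy
```

Because routing decisions never require a fully consistent view, the
discovery backend can be a simple eventually consistent service rather than
a leader-elected store like ZooKeeper or etcd.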

------
Usu
This looks really neat! We are deploying a few new microservices every month
now and I'm afraid that things will get out of hand networking-wise (we too
are using ELBs for both load balancing and service discovery), so I'm looking
forward to trying Envoy. Thank you for open sourcing it, guys.

------
manglav
Pardon my ignorance, but would someone mind explaining, in a little more
detail, when this software would be necessary, and perhaps other tools that do
the same thing? Envoy seems like it does a lot, I'm just trying to wrap my
head around it.

~~~
wccrawford
Their docs actually compare it to a lot of other stuff like haproxy, nginx,
Amazon ELB, and more.

[https://lyft.github.io/envoy/docs/intro/comparison.html](https://lyft.github.io/envoy/docs/intro/comparison.html)

~~~
manglav
I didn't see these, thank you!

~~~
wccrawford
Yeah, I read a fair bit of their docs before I happened across it. I feel like
they should make that comparison more prominent. Even knowing it was in there,
I had a little trouble finding it again.

------
epberry
This does seem incredibly useful for service-oriented architectures. As I
understand it, it's basically a per-application, per-machine monitoring layer
for quickly detecting problems up and down the network stack.

However it also does load balancing. But doesn't that defeat the purpose a
little bit? If your monitoring tool is the same as your load balancing tool,
then who's monitoring the load balancer? :) I might be misunderstanding the
architecture here.

~~~
chairmanwow
So correct me if I'm wrong, but it seems to maintain a 'mesh' of persistent
connections that it proxies all inter-service communication through. So
whenever you need to speak to another service, you don't need to worry about
the extra cost of connection creation/teardown. It also says that it handles
automatic retries and global rate limiting
([https://lyft.github.io/envoy/docs/intro/what_is_envoy.html](https://lyft.github.io/envoy/docs/intro/what_is_envoy.html)).

Because all inter-service requests are going through Envoy, it is really easy
to keep incredibly detailed stats about network health, request success rate &
more.

Envoy performing the task of load balancing does not defeat the purpose,
because it provides extremely detailed stats for ALL THE THINGS. They reported
it helped them find problems much quicker than checking service code, EC2
networking, or the ELB separately. Essentially, by creating a supersolution
with better stats reporting across the board, troubleshooting seems like it
would be easier.
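Those stats are exposed by each Envoy process over its local admin listener.
A sketch of pulling them — the port here is an assumption (it's whatever the
admin interface is configured to listen on):

```shell
# Dump all counters/gauges from the local Envoy and filter for
# upstream request stats (success, retries, timeouts, etc.)
curl -s http://localhost:9901/stats | grep upstream_rq
```

Since every sidecar reports the same stat names, aggregating them gives a
uniform view of request health across the whole mesh.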

------
honkhonkpants
If there's anyone from Lyft here answering questions, I have a few.

#1: cost. Doesn't it basically cost double to move the request from the client
application to the proxy, and then from the proxy to the backend?

#2: upgrades. What happens to the clients when the proxy is being rolled out?

#3: head-of-line blocking. If an application has two streams to the same
backend, one stream which is low priority and one which is higher, how does
the proxy handle that?

~~~
mattklein123
Hi,

I work at Lyft. To answer your questions:

1) There is added cost, though it varies depending on how many things Envoy is
configured to do (e.g., logging, tracing, stats, rate limiting, health
checking, etc.). Even in complex scenarios (Envoy being used to proxy both
inbound connections into a service, as well as proxy outbound connections to
Mongo or Dynamo), we measure Envoy overhead to be < 1ms, which for almost all
applications is negligible. There are definitely certain cases where this
might be prohibitive, but in general we find the common functionality we get
(again stats, tracing, etc.) to be invaluable in a production setting.

2) Envoy supports hot restart
([https://lyft.github.io/envoy/docs/intro/arch_overview/hot_re...](https://lyft.github.io/envoy/docs/intro/arch_overview/hot_restart.html)),
as well as graceful drain of existing connections, so there is limited/no
impact to existing clients. There is one enhancement that we would like to
make to our HTTP/2 graceful draining to make it even more seamless, but that
is more complicated than I can type here. :)
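For reference, the hot restart linked above is driven by an epoch counter on
the command line. A sketch, with a made-up config path:

```shell
# Initial process
envoy -c /etc/envoy/envoy.json --restart-epoch 0

# Replacement process: takes over the listen sockets from epoch 0,
# which then drains its existing connections and exits
envoy -c /etc/envoy/envoy.json --restart-epoch 1
```

Clients keep talking to the same listen sockets throughout, which is why the
rollout is mostly invisible to them.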

3) Right now Envoy does not support HTTP/2 priority so all streams are treated
equally. We are currently working on priority support at the routing layer,
with different connection pools available for high and low priority traffic,
as well as circuit breaking settings. In the future we will likely merge this
back into a single HTTP/2 connection with proper priority support. In practice
though, within the DC, head of line blocking at the TCP layer isn't too much
of an issue.

------
xissy
Nicely done! Quite impressed, since my company has also had similar problems
with a fine-grained service-oriented architecture. Envoy seems to cover almost
all of them, kudos.

However, I couldn't find a performance benchmark or anything comparing it to
alternatives such as haproxy, nginx, etc. So I'm going to get my hands dirty
now. ;)

------
_RPM
> Envoy works with any application language. A single Envoy deployment can
> form a mesh between Java, C++, Go, PHP, Python, etc.

I find it odd that they did not include Rust in their list of preferred
languages. Rust is safer than C++ and Go.

~~~
blub
Rust barely registers outside HN and a few other web gathering places.

All those other languages have their niches and ecosystems and are safe
enough.

------
jguegant
Any explanation for not using Asio instead of libevent (or libuv)?
Performance?

------
halayli
Wouldn't this require private keys to be sprinkled on all machines running it
to inspect the traffic?

~~~
ryan_lane
I work for Lyft.

For this we have a secret management system, called confidant
([https://lyft.github.io/confidant/](https://lyft.github.io/confidant/)), that
we use to distribute any necessary secrets. So, yes, you may need to have keys
on every node (depending on your monitoring system), but assuming you securely
distribute them, it's not a big deal.

This is, of course, a general problem that's not necessarily related to envoy.

~~~
halayli
This increases your attack surface. Any breach of one of those machines and
the attacker can start doing MITM attacks. It also limits auto-scalability,
assuming newly provisioned machines require manual approval of private key
distribution (with the key staying in memory) via an HSM, and the same goes if
the process dies. One way to limit key distribution is to embed the routing
information you require in the SNI at a second LB layer that's shielded from
public traffic. This way your public machines don't hold any keys, and if they
get compromised, the damage is limited.

I agree it's a general problem. But sometimes certain architectures would
require more vulnerable approaches vs others.

------
mgrennan
Just seems like another piece of code in search of a problem and ways for
things to go wrong, because someone didn't take the time to research how things
work now.

~~~
Retra
Do you have any relevant criticisms, or are we all supposed to just pretend we
already know what your problem is?

