
Nebula, Slack's Open Source Global Overlay Network - tptacek
https://slack.engineering/introducing-nebula-the-open-source-global-overlay-network-from-slack-884110a5579
======
viraptor
> We tried a number of approaches to this problem, but each came with trade-
> offs in performance, security, features, or ease of use.

I wonder if they tried ZeroTier. It sounds really like what they wanted.

~~~
zrail
They may have decided that ZT's encryption isn't proven well enough for their
needs. It's also possible they rejected ZT because they didn't want to use
ZT's centralized infrastructure. Two years ago was long before ZT started
working on making that optional.

~~~
tiernano
ZT allows you to run your own "Moons", meaning you don't need their
infrastructure... a bit more config is required on the client end, but there
is less reliance on ZeroTier.

~~~
zrail
Moons are going to be deprecated soon and, as far as I understand, never
actually worked how people wanted, i.e. they still needed ZT's root servers,
even if you were running your own controller.

To contrast, with Nebula you run your own root(s) (lighthouses) and you don't
need a controller because important config (ip, group, hostname) is signed by
the same CA.

~~~
api
"Moons" are being deprecated in favor of true federation:

[https://www.zerotier.com/zerotier-2-0-status/](https://www.zerotier.com/zerotier-2-0-status/)

[https://www.zerotier.com/lf-announcement/](https://www.zerotier.com/lf-announcement/)

The moon terminology will also go away since there will no longer be a
difference between these and our core roots. They'll all just be roots and
will be interchangeable. The use of a common underlying key/value store will
allow ZeroTier to keep its unified namespace and easy ability to join anyone's
network or communicate with anyone regardless of what roots they're using (as
long as their roots are on the same global network as you... obviously you
can't hop air gaps).

------
harikb
This is really interesting news!! Ryan was at GopherCon 2018 and was
talking casually about his pet '20%' project with some of us. Happy to see it
finally released as open source. Great work, Ryan! His off-the-record
remarks really made me change my mind about the Slack engineering team in
general. Otherwise, I was always cursing them about their Electron client.

------
geofft
I feel like I don't totally follow how you would set this up for, say, a
company that has infra in two cloud providers (but no office network or
datacenter or anything)... I think the answer is you set up one or more
lighthouses with stable IPs on the public internet, and you make sure all your
ephemeral cloud machines have IPs on the public internet? And all your
ephemeral cloud machines get RFC-1918 addresses that are effectively in a
giant flat subnet with no broadcast / no L2 domain and no implied structure?

It feels a little different from Wireguard, in that with Wireguard your
engineers would be able to connect from behind a NAT, but my reading of how it
works is that machines route directly at each other. Which is good for a
production network where you care deeply about routing (bandwidth, latency,
costs, debugging, etc.), but it seems that here your engineers would still
need to connect to a bastion host or something, i.e., it isn't a VPN in the
sense of being able to join the corporate network directly.

I guess if you've also got the lighthouse node _internally_ routable by all
your machines (e.g. you have an internal datacenter network and something like
AWS Direct Connect) it would work too?

It'd be nice to see a sample network design.

~~~
zrail
(from a single read of the README and the OP)

I think the answer is that your lighthouse(s) are the only machines that need
publicly routable IPs. Your ephemeral cloud machines get any RFC-1918 address
you want, with any subnet you want.

Engineers would have Nebula set up on their laptop with a configuration that
knows about your lighthouse(s) static IP(s). They use the lighthouses for
meeting other nodes, UDP hole punching, etc, but otherwise every connection is
peer to peer.
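
A minimal sketch of the lighthouse role described above, assuming it behaves like a simple directory from overlay IP to last-seen public address. The `Lighthouse` class, its method names, and all addresses are hypothetical illustrations, not Nebula's actual code or wire protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Addr = Tuple[str, int]  # (public IP, UDP port) as seen from the internet


@dataclass
class Lighthouse:
    """Toy directory (hypothetical): overlay IP -> last known public address."""
    known: Dict[str, Addr] = field(default_factory=dict)

    def register(self, overlay_ip: str, public_addr: Addr) -> None:
        # A node reports in; the lighthouse remembers where it was last seen.
        self.known[overlay_ip] = public_addr

    def lookup(self, overlay_ip: str) -> Optional[Addr]:
        # A peer asks where another overlay IP can currently be reached.
        return self.known.get(overlay_ip)


lighthouse = Lighthouse()
lighthouse.register("10.42.0.5", ("203.0.113.7", 4242))     # cloud VM
lighthouse.register("10.42.0.9", ("198.51.100.23", 51820))  # laptop behind NAT

# The laptop asks the lighthouse once, then sends traffic straight to the
# returned address; the lighthouse is not in the data path.
vm_addr = lighthouse.lookup("10.42.0.5")
print(vm_addr)
```

The point of the sketch is only that the lighthouse handles discovery; the traffic itself is meant to flow peer to peer.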

~~~
geofft
NAT traversal sounds like a thing I very much don't want to deal with for a
production network, instinctively. It's fine for video games with friends but
I've seen enough stuff go wrong with even normal networks that I wouldn't want
to trust it. If this is what Slack is actually doing, I'd be very curious to
hear how it's working out for them and how they debug network outages.

(Which is why I suspect it's not and the readme isn't clear)

~~~
viraptor
Yeah, after doing a few years of VoIP I pretty much learned the same. Yes,
there are multiple methods of traversal, yes they are sound in theory. Yet
stuff breaks all the time on consumer routers.

------
e12e
This does look really great! Is there ipv6 support?

Both for overlay network[1], and/or for nodes?

[1]
[https://github.com/slackhq/nebula/issues/6](https://github.com/slackhq/nebula/issues/6)

~~~
sanxiyn
The answer is no.

------
AcerbicZero
Network virtualization seems to have been extremely slow to be adopted. Even
companies pushing "cloud first" seem to be running their physical networks
like it's 2003.

~~~
geofft
When your Kubernetes falls over, you can ssh to it and run commands like it's
2003 (or 1993). When your network falls over, not so much. We deprecated our
network virtualization at $work and have been way happier for it.

Also, the end-to-end principle argues for putting complicated logic in the
endpoints and making the network boring. See also, TCP is implemented at the
endpoints and just requires network infrastructure to drop packets sometimes.
You could imagine a congestion control protocol implemented on each router on
the Internet, but it would be much more fragile and also much harder to deploy
changes to.
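
To make the TCP example concrete, here is a rough sketch of endpoint-side congestion control in the additive-increase/multiplicative-decrease (AIMD) style. It is a simplification under stated assumptions (no slow start, no timeouts, one window update per round trip), and the `aimd` function and its parameters are made up for illustration. The network's only job in this model is to drop a packet now and then.

```python
def aimd(loss_events, rounds, cwnd=1.0, increase=1.0, decrease=0.5):
    """Return the sender's congestion window after each round trip.

    loss_events: set of round indices in which the network dropped a packet.
    All of the logic lives at the sending endpoint.
    """
    history = []
    for r in range(rounds):
        if r in loss_events:
            cwnd = max(1.0, cwnd * decrease)  # multiplicative decrease on loss
        else:
            cwnd += increase                  # additive increase otherwise
        history.append(cwnd)
    return history


windows = aimd(loss_events={4, 8}, rounds=10)
print(windows)
# [2.0, 3.0, 4.0, 5.0, 2.5, 3.5, 4.5, 5.5, 2.75, 3.75]
```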

~~~
tptacek
Part of the reason for the end-to-end argument is to enable more clever (or,
at least, more purpose-designed) functionality to ride on top of the dumb
network. So e2e would suggest (to my reading at least) that you keep the
"real" IP layer dumb and flexible, and do the fun stuff in overlays, which is
what this is.

~~~
pvg
It's also mostly about two rather than N endpoints. The later, follow-on
Blumenthal & Clark paper is a kind of long list of end-to-end-principle
analysis 'it's complicated's.

------
sandstrom
If anyone has detailed knowledge, it would be interesting to learn how Nebula
is similar, and different, from e.g. Consul Connect and ZeroTier.

~~~
shantly
The BSD license is a pretty big difference.

[EDIT] MIT, that is, of course.

~~~
jen20
Consul Connect is under MPLv2, which is a perfectly reasonable license unless
you want to do shady things. There may be other differentiators, but this is
not one.

~~~
shantly
Is that one Apple App Store (iOS) compatible?

------
sansnomme
This looks like an open source Zerotier, my dream has come true.

~~~
e12e
Zerotier is open source?

~~~
tptacek
It is, but under a noncommercial license.

~~~
api
We have a blog post about our transition to the BSL from the GPL:

[https://www.zerotier.com/on-the-gpl-to-bsl-
transition/](https://www.zerotier.com/on-the-gpl-to-bsl-transition/)

I don't think the BSL is perfect. We're thinking and discussing with a number
of people about potentially better licenses that would be closer to
traditional FOSS while preventing "SaaSification" and similar. I think we're
in the early stages of a renegotiation of the open source social contract and
I don't think we've figured out the best model yet.

The AGPL is close but suffers from two problems: (1) it isn't perfect either
and has numerous loopholes, and (2) there are a ton of companies out there
with an irrational but nevertheless very entrenched phobia of anything
associated with the GPL (as we have discovered). Maybe something a bit like
the AGPL but not GPL branded would work.

~~~
tptacek
I'm not judging, I'm simply relating a fact: Nebula is MIT licensed, and
ZeroTier is BSL'd; a paid license is required to use ZeroTier in a closed-
source application.

~~~
api
Yes, that's intentional. It used to be GPL which imposed the same requirement,
but we shifted to BSL because it's a bit more explicit and because of (again,
irrational) GPL-phobia on the part of some non-trivial subset of corporate
users.

BTW the closed source restriction in the BSL is effectively the same as the
GPL and the only other meaningful restriction is on SaaS direct monetization.
Companies can still run ZT for free and run it behind the scenes for free.
It's a lot like the AGPL.

~~~
sanxiyn
This is not true. AGPL allows SaaS monetization (you just need to publish the
source). BSL does not allow it.

~~~
api
That's why the BSL exists: to stop SaaS companies from monetizing the software
without giving anything back to its developers.

A SaaS company can get a commercial license.

~~~
tptacek
Or just use Nebula, which is superior in some ways (though presumably not
every way) to ZeroTier, and MIT-licensed.

------
veeralpatel979
It would be nice if the team could publish a high-level architecture of the
system!

I'd like to understand how Nebula works -- any other suggestions besides just
diving into the code?

------
lorenzo95
The guys on Linux Unplugged interviewed the developer in their last podcast
here: [https://linuxunplugged.com/329](https://linuxunplugged.com/329). It
starts at about 28:20. He explains more of the why and how.

------
gfodor
I wonder if this could be used in a trustless context to create a mesh of
contributed internet nodes.

------
macawfish
Other similar projects that deserve a mention: tinc and yggdrasil.

I used to use tinc and have recently switched to yggdrasil, which was much
easier to set up. So far it works great!

~~~
bjeanes
Tinc is mentioned in the post.

------
fulafel
IPsec does not require tunnel hosts.

Nor per-connection configuration; you can, e.g., let any nodes that hold certs
from your CA communicate.

------
virtuallynathan
How does this differ from WireGuard?

~~~
tptacek
WireGuard is a VPN, and Nebula is an overlay network (also known as a service
mesh). They are closely related concepts.

VPNs are primarily used for remote access, to get random machines access to
closed IP networks. Service meshes synthesize a new network (sometimes IP,
sometimes something else) to connect a bunch of related machines, almost
always with policy controls for who can talk to what, usually cryptographic.

It would be weird (but not "wrong") to use a service mesh to get developer
laptops access to staging Postgres.

It would be weird (but not "wrong") to use WireGuard to connect an application
server to its Postgres instance.

WireGuard is a much tighter and more limited design, intended for integration
directly into operating system kernels, with a strong emphasis on performance.
Nebula is a much more ambitious design; it includes direct DNS support,
certificates, and server infrastructure. WireGuard is a few thousand lines of
very carefully written C code; Nebula is a typical Go project.

They're both very cool.

~~~
rgun
Why do you think it is "weird" to use WireGuard for connecting application
server with a DB instance?

(Backdrop: I have recently moved our various prod servers into a WireGuard
based VPN to encrypt the traffic between them. I found it was easier/pragmatic
to do this than:

* to set up SSL for my DB

* to figure out how to encrypt traffic between my application server and Redis or my application server and Nginx )

~~~
tptacek
I like WireGuard and wouldn't blink at a client proposing to use it to create
a secure network fabric for their deployment environment, but it is not the
norm for people to do stuff like this; in K8s land, this is what service
meshes like Istio do, and more generally this is what people use overlay
networks for. WireGuard could form the basis of an overlay network, if you
added the same bells and whistles Nebula has. But I don't think Jason has in
his plans to add those bells and whistles himself, because that's not really
WireGuard's charter.

------
dastx
It feels like their issues would have been solved by a service mesh using e.g.
consul or istio. If so, I'd wonder whether writing a tool from scratch was the right
use of engineering time. Anyway, as an engineer, I'd certainly have found this
a fun project. Kudos to slack for trying something new and open sourcing it.

~~~
KaiserPro
Not entirely, as it only really allows stuff that's running in that service
mesh's world to connect to the network.

But they want a global VPN for _everything_ including laptops. This means some
level of access control.

What I like here is the use of lighthouses, to allow external nodes to punch in
and discover the rest of the network. Something which is very difficult to do
if you are relying on a service mesh in an unknown and unconnectable network.

~~~
thu2111
How does this differ from cjdns?

~~~
KaiserPro
I suspect that the goals are slightly different.

The thing that immediately stands out is the routing. It looks like cjdns is a
traditional-ish multi-hop network. The DHT routing table allows you to map out
a route to peer A via peer B, R, & D.

What wireguard and nebula allow is for the underlying network to figure out
most of the routing, and effectively create a massive point-to-point network.
Whilst you can have concentrators/gateways, the idea is that most of the
traffic goes direct from peer to peer. This can reduce load considerably.

~~~
thu2111
I think cjdns allows arbitrary peering, so you can certainly set up a full
mesh if you want point-to-point traffic, with multiple hops only for cases
where the underlying network topology requires it.

~~~
neilalexander
Right, cjdns and Yggdrasil will both forward on behalf of other nodes where no
direct paths are available.
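
The "direct where possible, forwarded where not" behaviour can be illustrated with a plain breadth-first search over a peering graph. This is only a sketch of the idea: cjdns and Yggdrasil use their own routing schemes, not BFS, and the `route` function and node names here are hypothetical.

```python
from collections import deque


def route(peers, src, dst):
    """Shortest path by hop count over a peering graph; None if unreachable."""
    if src == dst:
        return [src]
    seen = {src}
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        for nxt in peers.get(path[-1], ()):
            if nxt in seen:
                continue
            if nxt == dst:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None


# A, B and C peer with each other directly; D can only reach B.
peers = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}
print(route(peers, "A", "C"))  # ['A', 'C']: direct peering, one hop
print(route(peers, "A", "D"))  # ['A', 'B', 'D']: B forwards on A's behalf
```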

------
rahimnathwani
Can nodes communicate with each other directly even if they're behind NAT,
without port mappings or UPnP?

I know there are ways to make this happen (e.g. using the techniques from Samy
Kamkar's pwnat/chownat), but am not sure whether Nebula is designed to work
within this constraint.

~~~
TheDong
Having two nodes communicate to each other when you have a cooperating third-
party server (a lighthouse or discovery node) that isn't behind a nat isn't
hard. That's what STUN servers and other forms of UDP hole punching
accomplish.

pwnat is notable because it doesn't require having a public stun-like server,
but nebula already assumes there's public servers, so traversing nat is a non-
issue.

The readme says "Discovery nodes allow individual peers to find each other and
optionally use UDP hole punching to establish connections from behind most
firewalls or NATs".

In practice, I didn't see any code that implements it, but I didn't look too
hard.
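
For readers unfamiliar with the technique, the hole-punching idea described above can be simulated without any real sockets. This toy model assumes NATs that keep the same public address for every destination and admit inbound packets only from addresses the host has already sent to; symmetric NATs behave differently and can defeat it. The `Nat` class and all addresses are hypothetical, and none of this is taken from Nebula's code.

```python
class Nat:
    """Toy NAT: inbound packets pass only from addresses we already sent to."""

    def __init__(self, public_addr):
        self.public_addr = public_addr
        self.allowed = set()  # remote addresses with an open outbound mapping

    def send(self, remote_addr):
        # Outbound traffic opens a "hole" for replies from remote_addr.
        self.allowed.add(remote_addr)

    def accepts(self, remote_addr):
        return remote_addr in self.allowed


LIGHTHOUSE = ("192.0.2.1", 4242)  # publicly reachable, not behind NAT
alice = Nat(("203.0.113.7", 40001))
bob = Nat(("198.51.100.23", 40002))

# 1. Both peers contact the lighthouse, which learns their public addresses.
directory = {}
for name, nat in (("alice", alice), ("bob", bob)):
    nat.send(LIGHTHOUSE)
    directory[name] = nat.public_addr

# Before punching, neither NAT lets the other peer in.
blocked_before = (not alice.accepts(bob.public_addr)
                  and not bob.accepts(alice.public_addr))

# 2. The lighthouse tells each peer the other's address; both send to it.
alice.send(directory["bob"])
bob.send(directory["alice"])

# 3. Each NAT now has a mapping for the other side, so packets get through.
open_after = alice.accepts(bob.public_addr) and bob.accepts(alice.public_addr)
print(blocked_before, open_after)  # True True
```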

~~~
sanxiyn
It's there in lighthouse.go.

------
gyrgtyn
Can this make taps, or just tuns?

------
roberson87
Sounds like a service mesh. How is this any different from Istio/linkerd? This
library may be useful, but the stated problem it seeks to solve is hardly a
unique one.

~~~
vsupalov
To me it reads more like Nebula is a VPN solution, with end-to-end encryption
and security groups baked in.

To my understanding, a service mesh does not establish a common VPN-like
network, but assumes it's there already. Nebula and service meshes both
provide authentication, end-to-end encryption and role-based access control. A
service mesh can do more than Nebula: it makes it possible to shift traffic
between services for example apart from a "security group"-like filtering.

However, I might be mistaken. Any corrections are more than welcome.
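
As a rough illustration of what "security group"-like filtering could look like, here is a default-deny check that matches an inbound connection against rules keyed on port and sender group. The `Rule` type, the `allowed` function, and the group names are hypothetical; this is not Nebula's configuration format or any service mesh's policy language.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    port: int
    group: str  # the sender must carry this group in its identity


def allowed(sender_groups: FrozenSet[str], port: int, rules: List[Rule]) -> bool:
    """Default-deny: permit only if a rule matches both port and a sender group."""
    return any(r.port == port and r.group in sender_groups for r in rules)


inbound = [Rule(port=5432, group="app-servers"), Rule(port=22, group="ops")]

print(allowed(frozenset({"app-servers"}), 5432, inbound))  # True
print(allowed(frozenset({"laptops"}), 5432, inbound))      # False
```

Traffic shifting between services, which the comment attributes to service meshes, would need more than a yes/no check like this one.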

------
dopylitty
This is a really cool project.

That being said the code is full of TODO and other comments indicating that
shortcuts were taken which should be fixed later. I would be worried about
running such a thing in prod given the criticality of its function. At best
you could risk performance issues under load and at worst you could have
significant security issues allowing unintended traffic in/out.

~~~
inetknght
Do you think it's unusual for businesses to deploy to production code which
has TODO statements?

~~~
pferde
Depends on the TODO statements themselves. "TODO: document this section
better" is very different from "TODO: add error handling to this section".

