SONiC – An Open Source Network Operating System (azure.github.io)
116 points by jamieweb 43 days ago | 43 comments



FWIW, I had an absolutely terrible time setting up sonic on an edgecore switch which was "supported".

Everything was a battle. The configuration is very finicky and not well documented, and every time I went to edgecore or sonic for support, they would blame the other party.

We did eventually get everything working, but it took hours of tinkering on the configuration. There's all sorts of undocumented magic about how ports, interfaces, lanes, etc all have to line up with each other. If it doesn't like how you've configured BGP, it will silently fail and refuse to link. And God help you if you have breakout cables. And for some reason, all the services are run in docker containers, which makes it even more painful to debug. Reapplying configurations is slow and buggy in itself, so you have to reboot the switch constantly.
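To give a flavor of the alignment problem described above, here's a hypothetical config_db.json PORT excerpt with a toy sanity check. The exact keys, lane numbering, and lanes-per-speed mapping vary by platform and SONiC version, so treat this as a sketch, not a reference:

```python
import json

# Hypothetical excerpt of a SONiC config_db.json PORT table; exact keys and
# lane numbers are platform-specific.
config_db = json.loads("""
{
  "PORT": {
    "Ethernet0": {"alias": "Eth1/1", "lanes": "1,2,3,4", "speed": "100000"},
    "Ethernet4": {"alias": "Eth1/2", "lanes": "5,6,7,8", "speed": "100000"},
    "Ethernet8": {"alias": "Eth2/1", "lanes": "9",       "speed": "25000"}
  }
}
""")

# Rough rule of thumb: a 100G/40G port typically rides on 4 serdes lanes,
# 25G/10G on 1. Assumed mapping for illustration only.
EXPECTED_LANES = {"100000": 4, "40000": 4, "25000": 1, "10000": 1}

for name, port in config_db["PORT"].items():
    lanes = port["lanes"].split(",")
    expected = EXPECTED_LANES.get(port["speed"])
    if expected is not None and len(lanes) != expected:
        print(f"{name}: {len(lanes)} lanes but speed {port['speed']} expects {expected}")
    else:
        print(f"{name}: OK ({port['speed']} over {len(lanes)} lane(s))")
```

Get any of those fields out of step with each other (or with the breakout mode) and, as described above, the port simply refuses to link with no useful error.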

Ultimately I find sonic hard to recommend. Yes it's free, but I'm really not sure that's a benefit. Any home enthusiast will buy a consumer grade switch with an OS preinstalled, or use OpenWRT. Any business using this for production would be wise to go with a stable product with good support, like Cumulus. As much as I hate how Cisco does business, at least their products will mostly work how you need out of the box.


So my problem with SONiC, Cumulus, and all these other "network operating system" Linux platforms is that they all seem to be designed to be fiefdoms and tend to be really stale. In my view, they bring almost nothing of value to the table.

Let's talk a bit more about SONiC...

For one, SONiC explicitly states that pull requests that aren't already planned and approved will not be accepted[1]. This defeats a good chunk of the value of having a community project. People will want to contribute and extend your platform in ways you never thought of, and they'll do it in a completely decentralized fashion.

Another issue I see is that SONiC holds everything back on an old Linux kernel and ships random, unvetted BSP blobs. This is a nasty combination for anyone who wants to consider their NOS trusted or secure. They're on a 4.9.x kernel, and while that is still maintained, it is far from the best option if you want to take advantage of innovation in Linux networking.

I'm also generally confused on why this whole project isn't just "let's get the networking tools and hardware support stuff into standard Linux distributions and leverage their tooling and communities". This was also a problem I had with Cumulus. When I tore apart Cumulus, I figured out that it was less than a dozen unique tools and a distribution rebuilt for 32-bit MIPS and PowerPC. It was pretty trivial to rebase to standard Fedora or Debian and get a better platform out of it.

And finally, I don't really think this provides any real innovation. It's not really different from Cumulus, Open Network Linux, and others. And ONL actually is using more up to date kernels (5.4.x as of right now!) and offers better networking tools!

What I would love to see is all these people who keep doing this crap working in the actual Linux distribution communities to build and integrate with upstream projects so that everyone downstream gets all kinds of flexibility.

Imagine if you had a flavor of Fedora CoreOS for your network gear! The immutable OS, updated with RPM-OSTree, fresh software stack, and broad hardware support, all in one neat package.

If we treated the network gear like weaker servers, instead of specialty equipment, there's so many more interesting things you can do!

[1]: https://github.com/Azure/SONiC/wiki/Sonic-Roadmap-Planning


>This was also a problem I had with Cumulus. When I tore apart Cumulus, I figured out that it was less than a dozen unique tools and a distribution rebuilt for 32-bit MIPS and PowerPC.

Almost all of which are open source (or at least "source available"), with the exception of switchd, which cannot be open sourced because it links with proprietary ASIC SDKs. I don't see how having very few custom tools on top of a vanilla Linux distribution is a bad thing.

>It was pretty trivial to rebase to standard Fedora or Debian and get a better platform out of it.

If you enable upstream Debian apt sources in your sources.list then it effectively is standard Debian - plus switchd.

Of course it is entirely possible to take all of the components of Cumulus Linux and use them on a separate operating system - enter sonic, vyos, etc - so if you build out such a system which can also drive ASICs and that you prefer over Cumulus, you can take full advantage of all of Cumulus's open source contributions.

>What I would love to see is all these people who keep doing this crap working in the actual Linux distribution communities to build and integrate with upstream projects so that everyone downstream gets all kinds of flexibility

If I read you correctly, Cumulus works upstream as much as it can:

   ~/linux$ git log --author "cumulusnetworks.com" --oneline | wc -l
   773

   ~/ifupdown2$ git log --author "cumulusnetworks.com" --oneline | wc -l
   1265

   ~/frr$ git log --author "cumulusnetworks.com" --oneline | wc -l
   8107
I like to believe Cumulus is quite active in the communities of projects it uses. I feel I may have misunderstood your point, though.

>If we treated the network gear like weaker servers, instead of specialty equipment, there's so many more interesting things you can do!

I completely agree, that's the dream!

Disclaimer: I work at Cumulus


> Almost all of which are open source (or at least "source available"), with the exception of switchd, which cannot be open sourced because it links with proprietary ASIC SDKs. I don't see how having very few custom tools on top of a vanilla Linux distribution is a bad thing.

The problem historically with Cumulus on this was that it was heavily obfuscated. In the past, when I talked to Cumulus sales folks, it was not quite as honest as what you've said.

I don't have a problem with the "shipping a Linux distribution you can support" thing. I have a problem with "not making it so the stuff you have is available everywhere (i.e. push into Fedora _and_ Debian to feed into all distros and ecosystems)".

> If I read you correctly, Cumulus works upstream as much as it can. I like to believe Cumulus is quite active in the communities of projects it uses. I feel I may have misunderstood your point, though.

Cumulus is actually a nice exception to this rule. Most Linux-based network operating systems do not bother (including SONiC, VyOS, EOS, etc), but Cumulus does good work here. My only complaint is the focus on ifupdown2 instead of helping make cross-distro tools like NetworkManager support these things. It's been a long time since NetworkManager was only for desktop-only use-cases and only did Wi-Fi. It's the standard tool on a wide range of distributions and supports server use-cases very well. I personally use it over ifupdown and netconfig on my systems.


>I have a problem with "not making it so the stuff you have is available everywhere (i.e. push into Fedora _and_ Debian to feed into all distros and ecosystems)".

Almost all of our kernel patches are in mainline Linux, and ifupdown2 and FRR are packaged on Fedora and others.

>Cumulus is actually a nice exception to this rule. Most Linux-based network operating systems do not bother (including SONiC, VyOS, EOS, etc)

In defense of VyOS, they contribute to FRR and generously offer free licenses for people who work on the projects they use (https://www.vyos.io/open-source-contributors/). I think in general there's a lot of goodwill between the people working in the open NOS space.

> My only complaint is the focus on ifupdown2 instead of helping make cross-distro tools like NetworkManager support these things

Gotcha, I understand now. I can't provide any direct insight into why ifupdown2 was chosen instead of nm. I also use nm on my personal devices - though I can't say I've ever missed the ability to e.g. configure vxlan tunnels on my personal infra ;). I guess if we'd chosen nm 10 years ago then there would be similar feelings from people who prefer /etc/network/interfaces. Of course, at the end of the day Cumulus engineering time is spent primarily on things that ship in Cumulus Linux.

Btw, appreciate the feedback :)


> Almost all of our kernel patches are in mainline Linux, and ifupdown2 and FRR are packaged on Fedora and others.

I did see FRR recently make its way into Fedora, but I haven't seen anyone package up ifupdown2 there. Is someone working on that at Cumulus? I'd be happy to do the package review if someone hasn't already grabbed it before me once it's submitted. :)

> In defense of VyOS, they contribute to FRR and generously offer free licenses for people who work on the projects they use (https://www.vyos.io/open-source-contributors/). I think in general there's a lot of goodwill between the people working in the open NOS space.

Oh, I don't doubt it. But it's weird how many of them built on Linux are still not FOSS or collaborating with their upstreams...

> Gotcha, I understand now. I can't provide any direct insight into why ifupdown2 was chosen instead of nm. I also use nm on my personal devices - though I can't say I've ever missed the ability to e.g. configure vxlan tunnels on my personal infra ;). I guess if we'd chosen nm 10 years ago then there would be similar feelings from people who prefer /etc/network/interfaces. Of course, at the end of the day Cumulus engineering time is spent primarily on things that ship in Cumulus Linux.

Well, for what it's worth, /etc/network/interfaces is supported by NetworkManager. :)

As for VXLAN configuration in my personal network, I do it for homelab stuff. Setting up layered networking is kind of necessary if I am going to be messing around with things like OpenStack and Kubernetes.

I totally get that the engineering is primarily spent on things that ship in Cumulus Linux. I just want to see more work from Cumulus that benefits everyone, especially given that networking is so hard to get right! :)

> Btw, appreciate the feedback :)

You're welcome. I'm happy to see such an engaged person from Cumulus like yourself responding well to feedback! :)


What are you trying to achieve here? What tools are you going to get, and how do you want to use them? You do understand that the switching ASIC is just a PCI device, and that you cannot just pump all its bandwidth into the CPU for inspection? The path between the data plane (ASIC) and control plane (CPU) is limited, generally only a few gigs even in today's high-end switches. Anything you want to do in the data plane has to be programmed on the ASIC.

The only packets punted to the CPU are control-plane packets that need software processing and are low bandwidth, such as LLDP, STP, BGP control, etc. This is done by programming a switch ASIC table called "my station" or "l2 user". On some kit you can tcpdump a front panel port to the CPU, but it is rate limited, since otherwise you could kill the CPU or stall processing of vital control-plane packets (let's DDoS the STP process, fun). Looking at the full traffic flow on the CPU of a 32x100G switch is not going to happen. You need to sample, so sFlow, NetFlow, etc.

So given the limited bandwidth, and that any tools need to know how to translate your Linux configuration into Ethernet ASIC pipeline programming, what is it you want to do that you cannot do today?
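To put rough numbers on that mismatch (the punt-path figure below is an assumption, per the "few gigs" above; nothing here is measured):

```python
# Back-of-the-envelope: why you cannot tcpdump a whole switch to the CPU.
ports = 32
port_speed_gbps = 100
punt_path_gbps = 4          # assumed ASIC-to-CPU path ("a few gigs")

aggregate_gbps = ports * port_speed_gbps          # front-panel capacity
oversubscription = aggregate_gbps / punt_path_gbps

print(f"Front-panel capacity: {aggregate_gbps} Gbps")
print(f"Punt path:            {punt_path_gbps} Gbps")
print(f"Oversubscription:     {oversubscription:.0f}x")
# At hundreds-to-one you have to sample (sFlow/NetFlow) rather than mirror.
```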

Random note. I worked at a few switching startups. At one we always ran our own latest code. After an update to a core switch everything looked good, but then people started to complain that things were very slow. Went looking. The switch looked fine but was dropping traffic towards the CPU, which should not happen. Checking the cacti graphs for that switch (10-second polling), all the graphs for the ports between the different networks showed exactly the same flat line, maxed at 134 MB/s on 10G ports. Hmm, strange. Hold on, that sounds like the max bandwidth between the ASIC and the CPU port! Let's check some bits in the ASIC configuration. Yup. The new build forgot to turn HW routing on in the pipeline, so every packet was being punted to the CPU for route processing. Luckily the control-plane policy had the STP etc. packets at a higher queue. Tweak the bit, blam, graphs go to 11 :) File bug.
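For what it's worth, that 134 MB/s flat line converts to almost exactly a gigabit, which fits the theory of a 1G-class CPU port into the ASIC:

```python
# Why 134 MB/s smells like the ASIC-to-CPU port: it's roughly 1 Gbps.
mb_per_s = 134                       # the flat line from the cacti graphs
gbps = mb_per_s * 1e6 * 8 / 1e9      # decimal megabytes -> gigabits
print(f"{mb_per_s} MB/s ~= {gbps:.2f} Gbps")
# ~1.07 Gbps: about what a 1G CPU/management path would cap out at.
```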


I think it is a shame and a mistake that the PC industry chose PCI Express over the battle-tested InfiniBand technology as the upgrade for PCI [1]. InfiniBand offers a native channel-based peer-to-peer connection fabric for disparate nodes, and most of the important CPU-bottleneck tasks (e.g. memory protection and address translation) can be offloaded to the InfiniBand controller instead of a proprietary ASIC networking controller.

The bottleneck is not only affecting networking but the GPU industry as well. That's probably the main reason why Nvidia bit the bullet and bought major InfiniBand player Mellanox in a deal worth close to USD 7 billion. The bottleneck is only just bearable for video and games, but not when you have to scale the processing of big-data AI and machine learning applications.

[1] https://www.mellanox.com/pdf/whitepapers/PCI_3GIO_IB_WP_120....


This is already done today: https://github.com/Mellanox/mlxsw/wiki


Shameless plug: the Cisco 8000 series [1] recently got support for SONiC [2].

Right now, I believe it’s limited to fixed systems, but work is ongoing to get SONiC running on the distributed chassis. Exciting stuff!

[1]: https://www.cisco.com/c/en/us/products/routers/8000-series-r...

[2]: https://blogs.cisco.com/sp/cisco-goes-sonic-on-cisco-8000


If you want an overview of the state of the art in open source networking stacks, you can check this FOSDEM 2019 presentation by one of the developers at BISDN [1]. Somehow I don't think they mentioned SONiC in their presentation, though.

I teach a computer networking lab, but for now we need to resort to using a dual-boot setup on a multi-port Ethernet embedded PC: the Linux Switch Appliance (LISA), which uses a custom kernel, and a vanilla kernel running Quagga. The good thing is that both LISA and Quagga use a CLI environment similar to the familiar Cisco switch/router IOS. I really wish there were an open source alternative that could seamlessly support layer 2 and layer 3 without dual booting, perhaps using Software Defined Networking (SDN) concepts with an intuitive CLI.

Shortest Path Bridging (SPB) has recently been integrated into 802.1Q, as of 2018 (bye-bye TRILL). I reckon any reasonably good Linux-based open source layer 2 and 3 network OS will become extremely popular overnight for enterprise, consumer, and education use. Together with eBPF, this thing should fly on the new off-the-shelf whiteboxes supporting multi-port 40Gbps Ethernet. Imagine a LAN or metro LAN party with this beast :-)

[1] https://archive.fosdem.org/2019/schedule/event/from_closed_t...


Not really a "network" operating system since it doesn't seem to manage a network consisting of multiple devices the way something like ONOS, OpenDaylight or FAUCET would.

More like a "switch" operating system - it looks like SONiC is Microsoft's answer to ONL, Stratum, OpenSwitch, Cumulus et al. Basically open source software to run on your cheap whitebox (or expensive greybox) switch.


Obligatory link to tutorial on how to run Sonic on an Arista switch:

https://www.servethehome.com/get-started-with-40gbe-sdn-with...


Linux-based. How is SONiC positioned vs. Cumulus?


Cumulus requires a license to run on whitebox hardware. Sonic doesn't.


Does the Nvidia acquisition of Cumulus make that even worse going forward?


Cumulus was a pure software company; now that they are part of Mellanox there are more business models available such as subsidizing software with hardware.


It's also interesting that Nvidia (via Mellanox) announced on their blog their support for SONiC [0] a week after they acquired Cumulus.

[0] https://blogs.nvidia.com/blog/2020/05/12/mellanox-integrated...


SONIC has a very similar vibe to OpenStack and k8s. It's too big to fail, it has what plants crave, and you're dead if you're not on the bandwagon. Whether the code works doesn't matter.


Besides being cheaper and supporting more hardware, SONIC is more like a traditional NOS that runs on Linux but doesn't integrate with it. I would expect SONIC to be more familiar to Cisco CLI jockeys and Cumulus to be more familiar to Debian/Ubuntu sysadmins.


Incorrect. On Cumulus you have NCLU, the Cumulus CLI, which takes commands right from the bash prompt. If you know traditional Cisco you can sort it out pretty quickly. Commands start with “net”, so to show something: “net show interfaces”.

Here is a cheatsheet:

https://cumulusnetworks.com/learn/resources/cheatsheets/nclu

Also, Cumulus runs FRR (as does SONiC), but on Cumulus you can do “sudo vtysh” and pretty much be at a routing CLI as if you were at an IOS prompt.

SONiC sticks all configuration into JSON files spread across different docker containers, which can be a real pain. Also, not all commands are hitless, i.e. some will restart forwarding. That is being worked on. SONiC is pretty much “what MS wanted for large-scale ops” and is still rough around the edges for enterprise IT.

There are a number of companies that support Sonic in production enterprise such as Dell and Apstra.


You are correct that sonic does a poor job of integrating switch management features, but it's nothing like other CLI platforms.

All configuration goes in a single JSON file, which is used to configure the docker containers that manage the switching hardware.
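A simplified sketch of that model follows. The table names resemble SONiC's config_db conventions, but the entries and the table-to-container mapping here are illustrative only:

```python
import json

# Sketch of SONiC's model: one JSON config, consumed table-by-table by the
# service containers. Entries and the service mapping are illustrative.
config_db = json.loads("""
{
  "PORT":         {"Ethernet0": {"speed": "100000", "lanes": "1,2,3,4"}},
  "BGP_NEIGHBOR": {"10.0.0.1":  {"asn": "65100", "name": "spine1"}},
  "VLAN":         {"Vlan100":   {"vlanid": "100"}}
}
""")

# Which container (conceptually) consumes which table.
table_owner = {
    "PORT": "swss",          # switch state service programs the ASIC
    "BGP_NEIGHBOR": "bgp",   # FRR container
    "VLAN": "swss",
}

for table, entries in config_db.items():
    print(f"{table} -> {table_owner.get(table, '?')} container: {list(entries)}")
```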


How does this compare with existing switch OSs from Cisco and Juniper? Linux seems like it would be a bit heavy for a switch.




If you want to enjoy even more pain (ah, trivia, I mean): NXOS and NXOS are not the same thing. Read it again. There is an NXOS train for the original 2/5/7K and current kit in that line, and there is another version for the new 3/9K kit from the ACI spin-in. They are just different enough that it makes things difficult when you are trying to develop tools for them. Cisco has promised to merge the two for a long time now.


IOS-XE is essentially IOS Classic running as an application on top of Linux, communicating with the actual forwarding hardware. NX-OS is similar, but without the legacy of having been a standalone OS. Linux generally stays away from all the networking work and just provides a nicer deployment target with nicer APIs than bare-bones MIPS/PPC/x86.


IOS-XR is also Linux-based.


IOS-XR at least used to be QNX based, with somewhat distributed architecture (line cards running separate OS instances etc.)


Yes, the QNX version (cXR) is approaching end-of-life afaik. The distributed architecture remains with the Linux version (eXR).


TIL. Thanks a lot :) is eXR used for the XRv?


A switch operating system doesn't do the actual networking stuff, all that happens in the switching ASIC. All the OS does is apply configurations and basic management tasks.

If it runs on a $10 raspberry pi it will run fine on a $20,000 switch.


Well, that's not really the case with higher-scale router operating systems like IOS-XR. You have a ton of protocols running in software, so performance and memory requirements increase quite a bit.

And then there are the high-availability (HA) requirements which typically lead to redundancy in software and hardware.


Quite right. Control plane traffic is punted to the CPU, and a Raspi CPU cannot really handle that volume of traffic at enterprise scale.


Arista is Fedora. The routing stack started from NextHop (gated), which Arista bought in 2008 IIRC. It has been majorly rewritten since.

Cisco NXOS is Yocto. They planned to move to Fedora at some point. Might have by now.

Cumulus is Debian based. switchd is closed sourced ASIC driver.

Junos is now a FreeBSD VM running on a Linux host; not sure what version.

Sonic is Debian based IIRC.

OS10 (Dell / Force10) was NetBSD-based, but I think with OPX (the open source OS10 that SONiC will replace - personal opinion) it moved to Linux.

FoundryOS was custom (VxWorks?). The current version is Broadcom Strata.

Extreme's original was VxWorks. Current is Linux-based.

Cisco IOS XE/XR is Linux (Debian IIRC).

SwitchLight is Linux as well, as is BSNOS with their OpenFlow stuff on top.

Ubiquiti is Vyatta running on Linux.

That is a quick dump from the meat cache.


> Linux seems like it would be a bit heavy for a switch

Linux (and other Unix or Unix-like) kernels (and indeed full OS distributions) run fine on many low-end and embedded CPUs and hardware, and network switches are no exception.

OpenWRT is Linux-based and runs on extremely low-end switches such as home routers and access points.

Arista EOS is based on Fedora. (Of course Arista switches have real server CPUs and lots of memory. People do crazy things like running KVM on them.)

Juniper's Junos is based on BSD.

Remember that on a high-speed switch packets usually pass through the switching hardware without touching the switch CPU. Programmable switching chips like Tofino typically run pre-compiled pipelines that execute on-chip at line rate. The switch OS is primarily used for running management software that programs the hardware, runs the CLI, and/or provides other services. The OS can also run software to provide higher-level protocols and services such as BGP, DNS, or DHCP.


Eh, it's probably not that bad. On the low end, Linux isn't a slouch with XDP, etc. On the high end, it's game over if your kernel is in the data plane at all anyway, and Linux is a great option for the control plane of a switch.


Adding to what wmf said, Junos is based on BSD, and Junos OS Evolved will be based on Linux.

Most network equipment runs its control plane on standard Intel CPUs, so running Linux isn't much of a stretch.


SONIC is less mature but it's free and open source with the benefits and drawbacks that entails. For example, SONIC runs on a variety of different hardware so you could take advantage of that diversity without having to learn different NOSes.

Ten years ago switches were using 800 MHz single-core PowerPCs which was adequate to run Linux (although many were using VxWorks or whatever). Now the $400 switches are still wimpy but more expensive "disaggregated" switches are using Atoms or low-end Xeons.


Arista's EOS is open source and up to the task, it would be awesome if there was a fork of it supporting whitebox gear.


No, EOS is not even close to open source. You might be able to run cEOS on a whitebox.


Maybe something changed, but I remember downloading the source and also running it in a VM.


The EOS core code has never been open source. It is built on Fedora Linux.

They do have vEOS and cEOS offerings but there is closed source Arista code in all of them.



