
Has Linux reached parity with BSD in terms of the TCP stack? My understanding was that it still wasn't as efficient, but that info may be outdated.

Linux has been beating BSD for at least 8-10 years when it comes to TCP. When it comes to new features in TCP-land, Linux easily beats it. Google added Receive Side Scaling / Receive Flow Steering to Linux years ago, for example, and it is still a work in progress in FreeBSD. Also take a look at how much of the recent bufferbloat research has been merged into Linux, etc.

The RSS in Linux is not particularly useful (for Netflix-scale workloads) because it does not integrate the RSS hashing across the entire stack, so all you get is connection sharding. With real RSS, as done in Windows and FreeBSD, where the kernel has intimate knowledge of the hash key and algorithm, you can use RSS to split the TCP hash table up and make it per-CPU. By using multiple accept sockets for per-CPU workers, you can effectively keep everything for a single connection on one CPU and run almost everything with no cross-CPU contention. You can't move the connection around at will between CPUs, but you don't care to, because no connection is special (in a Netflix workload); it's just one of tens of thousands.

Adrian Chadd did most of the FreeBSD RSS work, and gave a good talk about it at BAFUG: https://www.youtube.com/watch?v=7CvIztTz-RQ

The RSS in Linux was just used for load spreading (the last I checked; I haven't used Linux much since I left Google 1.5 years ago). If this has improved, I'd love to hear about it.

Linux RFS depends on the packets being dispatched to the correct CPU for the connection by the interrupt handler running wherever the packet happened to land. This has cache & memory locality implications, especially on NUMA.

Linux aRFS lets the NIC do the steering. Unfortunately, each connection requires an interaction with the NIC to poke it into the steering table, and most NICs can't steer 100,000 connections.

So, to sum up, Linux has a lot of cool tech for steering individual connections and support for that varies greatly by NIC. Windows and FreeBSD use standard RSS to predictably steer an unlimited number of connections. For a large CDN server, the latter is more useful. However, for low-latency / high bandwidth applications, I can see the advantage to aRFS.

I wouldn't say 8 to 10 years. A major bug in Linux's default congestion control algorithm was only fixed last year:


Linux is the platform of choice for bufferbloat research, although FreeBSD isn't far behind in adopting the results of it:


I guess my information was not just outdated, but clearly wrong.

Don't get me wrong, BSD is still absolutely solid, but for anything cutting edge, Linux is spanking the pants off of it.

As is often the case it depends on the specifics of the application and on those building a solution. As far as raw performance is concerned FreeBSD performs very well.

Netflix gets nearly 100Gbps from storage out the network on their FreeBSD+NGINX OCA appliances. Some details in the "Mellanox CDN Reference Architecture" whitepaper at http://www.mellanox.com/related-docs/solutions/cdn_ref_arch..... The closest equivalent I've found on Linux was a blog post on BBC streaming getting about 1/4 of the performance.

Chelsio has a demo video (with terrible music) using TCP zero copy of 100Gbps on a single TCP session, with <1% CPU usage https://www.youtube.com/watch?v=NKTApBf8Oko.

At SC16 NASA had a "Building Cost-Effective 100-Gbps Firewalls for HPC" demo, using FreeBSD and netmap: https://www.nas.nasa.gov/SC16/demos/demo9.html

Thanks for the reference to the whitepaper. FWIW, I'm the kernel guy at Netflix who has been doing the performance work to get us to 100Gb/s from a single socket.

Another interesting optimization we've done (and which needs to be upstreamed) is TLS sendfile. There is a tech blog about this at http://techblog.netflix.com/2016/08/protecting-netflix-viewi.... We don't have a paper yet about the latest work, but we're doing more than 80Gb/s of 100% TLS encrypted traffic from a single socket Xeon with no hardware encryption offloads.

I just wanted to thank you publicly for all your hard work on this. The community will benefit greatly from this. If I recall, correct me if I am wrong, didn't you also port FreeBSD to the Alpha many moons ago? I loved the Alpha and it broke my heart when it died. Sad panda :(

Doug Rabson did most of the early work on alpha. I sent him enough patches that he sponsored me for a commit bit. My primary desktop for several years was running FreeBSD/alpha. First was an AlphaStation 600, then an API UP1000.

I was very sad when alpha got axed, but I agreed with killing it. FreeBSD is about current hardware.

You're spot on regarding the app and FreeBSD performing very well. Don't disagree with you one bit. Also, great link on the Netflix CDN work, they're doing some really fascinating stuff. It is nice to see the openness.

I work directly with both of the gents who gave this talk about 100G networking[1] (on Linux) and still find that much of the actual cutting-edge research is done on Linux. Perhaps I'm biased! I've also been to one of Mellanox's engineering offices (Tel Aviv) to speak with their engineers at my previous employer 7-8 years ago. They told me they do almost all of their prototyping and initial development on Linux, RHEL specifically, and then port to other platforms.

Maybe I was wrong on some of this, but my use case (due to my employer's industry being finance) is lower latency, where Linux absolutely and positively crushes anything else.

    [1] http://events.linuxfoundation.org/sites/events/files/slides/100G%20Networking%20Toronto_0.pdf

Mellanox is now one of the role model vendors in the FreeBSD ecosystem. They have a handful of BSD developers as well as sales and support staff that are in tune with the needs of high scalability FreeBSD users.

> Maybe I was wrong on some of this, but my use case (due to my employer's industry being finance) is lower latency, where Linux absolutely and positively crushes anything else.

Actually, while we're on the subject, SmartOS with CPU bursting from illumos is the leader in low latency trading:


That is a slick platform they've built, but I still don't see how it is competitive with Linux for very low latencies. He mentions trading at microseconds, but we're building microwave radio networks to trade at nanoseconds. Unless this has changed extremely recently, Solaris/Illumos and hence SmartOS still don't have tickless kernels. I recall Solaris having a 100 Hz tick by default, which you could change to 1000 Hz with a special boot flag. Linux has had dynticks since fairly early 2.6 kernels, and with the modern 3.x kernels (RHEL7+) you've got full-on tickless via the nohz_full option. Without this, the kernel interrupts the application to use CPU time.

Additionally, I don't believe (experts, please correct me if this is wrong) SmartOS has an equivalent to Linux's isolcpus boot command-line flag (or cpu_exclusive=1 if you're in a cpuset) to remove a CPU core entirely from the global scheduler domain. This prevents any tasks from running on that CPU, including kernel threads. Kernel threads will still occasionally interrupt applications if you simply set the affinity on pid 1, so that doesn't count.

These two features, along with hardware that is configured to not throw SMIs, allow Linux to get out of the way of applications for truly low latency. As far as I'm aware, this is impossible to do in Solaris/SmartOS. I'm not even getting into the SLUB memory allocator being better or the lazy TLB in Linux massively lowering TLB shootdowns, etc, etc. There is a reason why virtually every single major financial exchange in the world runs Linux (CME in Chicago, NYSE/NYMEX in New York, LSE in London, and Xetra in Frankfurt), it is better for the low latency use case.

You asked for an expert to correct you if you're wrong, so here it is: this is just completely wrong and entirely ignorant of both the capacity of the system and its history.

On timers: we (I) added arbitrary resolution interval timers to the operating system in 1999[1] -- predating Linux by years. (We have had CPU binding and processor sets for even longer.) The operating system was and is being used in many real-time capacities (in both the financial and defense sectors in particular) -- and before "every single major financial exchange" was running Linux, many of them were running Solaris.

[1] https://github.com/joyent/illumos-joyent/blob/master/usr/src...

Thank you Bryan for the correction, I did after all ask for it :)

One final question while I've got you that your response didn't seemingly address. Does the cyclic subsystem allow turning off the cpu timer entirely ala Linux's nohz_full? If so, I stand corrected.

Yes, it does -- the cyclic subsystem will only fire on CPUs that have a cyclic scheduled, which won't be any CPU that is engaged in interrupt sheltering via psradm.[1] This is how it is able to achieve hard real-time latency (and indeed, it was used for some hardware-in-the-loop flight simulator systems in the defense sector that had very tight latency tolerances).

[1] https://illumos.org/man/1m/psradm

You should really chat to some HFT folk in NYC before making that conclusion.

This is an adolescent evaluation. FreeBSD will have a new TCP stack with BBR made public in a couple of months. It will be easier to deploy correctly and more cohesive than Linux's. The entire packet path is more cohesive and easier to debug and tune using DTrace, although Linux might have caught up here recently. By volume, FreeBSD is doing at least 30% of Internet-facing traffic between a well-known company and some quieter giants. BTW, it only took a half dozen people collaborating between 3 companies about 2 years to catch up to the state of the art.

I've done a great deal of reading and research on OS ethos; IMO a thriving and production-worthy operating system can be maintained with as few as 40 people in total. The superiority of Linux feels exaggerated, and systems innovation has chilled because of it.

What's WIP about FreeBSD's RSS?

I'd have to recheck. Linux has done some interesting new systems engineering work with BPF kernel bypass to improve network performance (the eXtreme Data Path project, used by Facebook).

"""Has Linux reached parity with BSD in terms of the TCP stack?"""

I'm not sure what you mean. Linux has led TCP implementations for a decade now.

Two years ago, Facebook was trying to hire someone to make Linux's network stack as good as FreeBSD's:


I work for Google and previously worked at LBL on a team that developed optimized network stacks. I have a fair amount of experience running large scale data transfers as well as low-latency network in LANs.

The Linux network stack is great. It's the system of choice for nearly every researcher in the networking field. I don't know what Facebook meant in their case.

You have more experience than I do in this area, but I am going to reply anyway in the belief that I can contribute some useful information to the discussion. First, there is a discussion of this here:


The main remark seems to be:

> The predominant difference is that the FreeBSD network stack was much more carefully designed. The Linux stack was less careful and thus is much more haphazard. Also, more work has been put into optimizing the FreeBSD stack.

It is not my area of expertise, but Linux's sk_buff seems to fit the description of haphazard, while FreeBSD's mbuf seems to fit the description of more carefully designed. The same could be said about epoll versus kqueue.

The remark about more work in optimizing the FreeBSD stack also seems to be true. While I cannot speak for everything in FreeBSD's network stack, I do know that FreeBSD's netmap far exceeded anything Linux could do at the time and while it is available on Linux, I never hear of it being used anywhere but on FreeBSD:


Development of FreeBSD's network stack had plenty of innovative things in development at the time Facebook's post was made:


That included additional contributions from a major network equipment vendor that had made many contributions throughout the years. If I checked the commit history, I imagine I would find performance work done by said vendor. From what I can tell, FreeBSD's network stack is improving regardless of whether the rest of us hear about it.

Lastly, there have been multiple things discovered to be wrong in the Linux network stack since that Facebook job listing. Two prominent ones that I recall offhand are:



They both could fall into the category of stability problems to which Facebook had alluded. The second one more so, though:

> The end result is that applications that oscillate between transmitting lots of data and then laying quiescent for a bit before returning to high rates of sending will transmit way too fast when returning to the sending state. This consequence of this is self induced packet loss along with retransmissions, wasted bandwidth, out of order packet delivery, and application level stalls.

For the Cloudflare example, they didn't actually identify any problems with Linux. What they learned, correctly, is that TCP autotuning needs to be enabled if you want high performance out of your network stack.

This is covered by my previous team's page: https://fasterdata.es.net/host-tuning/linux/ Note: "On newer Linux OSes this is no longer needed." (i.e., it's already set properly).

For the second one, they fixed a bug in Linux's TCP CUBIC implementation. FreeBSD didn't get CUBIC until 8.2, which was around 2009. So, you're criticizing Linux for having a bug in a feature that FreeBSD didn't even add until 7 years ago.

Again, I will repeat: I worked on a team that did multi-OS TCP/IP optimization. What you're describing in terms of oscillation is a well-known problem in many implementations. All of the people doing research on this are now using Linux as their platform for research and development.

If you "worked on a team that did multi-OS TCP/IP optimization", then you should know that there is no one size fits all solution that is quantitatively better for everyone.

Not implementing cubic in FreeBSD when there was a bug in the only implementation of it in the world could have been an advantage in certain situations, including Facebook's.

There seems to be a hubris by many Linux users that Linux is the best solution in the world for everything and it is not. There is always someone who does something better. Maybe not in everything, but the same applies to Linux. No matter how good it becomes, it is not the best in everything. Networking is a broad topic. I don't think Linux is the best in every area of networking. I am not even sure if it is the best in many of them, given that many platforms do things very well and at some point, it is hard to be better.

My opinion on this, as it always has been, is that Linux gets some new features before FreeBSD, but they are always done sloppily and half-assed, and they start getting refined and fixed about the time FreeBSD implements the same feature after thinking about it, designing it well, and then implementing it.

With regard to https://wiki.freebsd.org/201405DevSummit/NetworkStack saying they had plenty of innovative things is really misleading. Most of the comments are "linux has this nice thing that works well, let's copy it".

In the BSD community this awareness of other operating systems is seen as a strength and pride. It's telling that Linux fans see it as weakness.

Subsystems are now done with up front design and some degree of consensus in the BSDs, closer to the cathedral and commercial development than the bazaar of Linux. This necessarily means we are not usually at the forefront of cutting edge features. It doesn't necessarily mean we don't have features before Linux; if the idea exists in academia or other OSes enough to reason about it's reasonable to propose, design, and build. Netmap is a good example. The new FreeBSD selectable TCP stacks are another, where we avoid incremental growing pains and baggage. When these designed features hit, they tend to be coherent, usable, obvious, and lasting.

My opinion of Linux features is that little due diligence was done, especially public acknowledgement of inspiration and why one route was taken over another. For instance, the Linux KPIs are littered with questionable decisions made in isolation. epoll and the various file notification calls are examples. That attitude manifested strangely up to userland through IPC/DBus with the continued systemd drama.

A little bit of logical inference... there are financial drivers behind vendors fleeing the Linux kernel in favor of userspace (i.e. Intel's DPDK and SPDK). One is licensing, which is not an issue with BSD nor in userland. The other is the rate and quality of KPI churn. Linux KPIs break all the time, switch licenses all the time, and it is a general nuisance to maintain a vendor tree, whether it is open or closed source. The good side is that hopefully drivers and products end up open source. The bad side is that, in many modern usages, that does not happen, because the GPL is not relevant to hosted services, as well as low motivation/quality/incentive and license violations for IoT-type things. The BSDs start with no pretense of GPL nor flippant APIs, so it is a lot more comfortable to consume them and build great products.

That is only 4 items, and Linux only appears in a few comments on each of them.

This remark seems more to me like a statement of belief that no one else can do good things other than Linux. That is far from true.

"In linux these are managed with device-independent tools, which is much better than the custom methods we have now, and avoids polluting the ifnet with extra information."

"In linux, buffers in the tx queue hold a reference to the socket so completions can be used to notify sockets. Implementing the same mechanism in FreeBSD should be relatively straightforward. "

"We don’t have software TCP segmentation, we have to carry information in the mbufs. Performance was doubled, without hardware support, by doing segmentation very low in the stack, right before input into driver. (Student project.) Linux calls this approach GSO, pushing large segments through the stack; the hardware can do segmentation if supported, otherwise we do it at the bottom layer. Simplifies TCP code since you can send arbitrarily large segments. "

"Linux has their standard ifnet interface, with a single pointer to the extensions; if the interface does not support them, the system still runs. If it does, have interfaces to configure numbers of queues, numbers of buffers, etc. All of this is slow-path (configuration) code. Think we should go for a similar route — ease configuration of 10gig interfaces"

The rest of the stuff in there is just low-level optimizations to update the design that was written out in the original FreeBSD book.

I never said that people can't do good things in OSes other than Linux. I said that Linux's networking stack has been better than BSD's for ten years. I can cite numerous factual arguments and research papers to support this, along with my extensive experience with Linux (my experience with BSD is less, but enough to know its stack isn't magically better).

If you want to say one is better, then you ought to at least define what being better means. "better" clearly does not mean the same thing to both of us.

Linux does have plenty of nice things and plenty of nice work, but I am not going to dismiss everything being done elsewhere by declaring Linux to be "better". At best, I would say that it is ahead in some areas, behind in other areas and the same in many areas. As for what some of those "other areas" are, I recall Adrian Chadd implementing time-division-multiplexed Atheros wifi support in FreeBSD that Linux does not have. Netflix also contributed a rather nice thing to FreeBSD that Linux did not have:


There are plenty of nice things in both platforms. Labelling one as "better" just doesn't do justice to either of them. It ignores opportunities for the "better" one to improve by denying that opportunities for improvement have been demonstrated to exist. It also denies the "lesser" one the acknowledgement of having done something worthwhile.

The BSD feature you mention was added to Linux 6 months later.

When I say something is "better", I mean "I've looked at the data, and integrated over a wide range of parameters".

I'm still waiting to hear about a magical BSD feature that is better. That hasn't happened in about 10 years, hence my statement.

I just linked one feature that by your own definition was better and you replied that Linux got it in 6 months, rather than "I did not realize FreeBSD does some great things in networking first".

If you are as experienced in networking as you claim, you should stop waiting to hear about magical features that are better. Nothing will ever impress you as being magical. That is a downside of having experience.

Maybe you would find talking to an actual expert on FreeBSD's network stack more interesting. I am not one, and while I could list several other things I know, I am clearly not doing it justice.

I definitely wish dtrace and zfs had come to linux earlier, and without the stupid license restrictions of ZFS which Canonical currently flouts. Both of those were far better than the alternatives that Linux came up with.

That's a great find. And look at what Facebook have done: invested in BPF and eXtreme Data Path.

That job posting provides no specific details about how BSD is "better" than Linux, except for making unspecific allegations about stability.

I imagine that it worked better in their testing and rather than jump ship, they decided to invest in making Linux better. If they knew how FreeBSD was better at that time, they likely would have patched Linux rather than look for talent that could figure it out.

Maybe ten years ago when the first generation of PCI-X Intel chipset 10GbE NICs were available for servers, the driver support and tcp offload engine support were much better on FreeBSD. But the situation is reversed now with all of the drivers and updates to the v4.x series Linux kernel.
