On our workloads (~100K connections on a 16-core / 32-thread FreeBSD-based 100GbE CDN server) LRO was rather ineffective, because there were roughly 3K connections per rx queue. Even with large interrupt coalescing parameters and large ring sizes, the odds of encountering 2 packets from the same connection within a few packets of each other, even in a group of 1000 or more, are rather small.
The first idea we had was to use a hash table to aggregate flows. This helped, but had the drawback of a much higher cache footprint.
Hps had the idea that we could sort packets by RSS hash ID before passing them to LRO. This would put packets from the same connection adjacent to each other, thereby allowing LRO without a hash table to work. Our LRO aggregation rate went from ~1.1:1 to well over 2:1, and we reduced CPU use by roughly 10%.
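A minimal Python sketch of the idea (hypothetical names, not the actual tcp_lro_queue_mbuf() code): sort one rx batch by RSS hash so same-flow packets become neighbors, then merge adjacent packets with the same hash.

```python
# Hypothetical sketch of sort-then-aggregate LRO. The real kernel code also
# checks TCP sequence numbers, flags, timestamps, etc. before merging; here a
# packet is just (rss_hash, payload_len) and we only merge equal hashes.

def lro_flush_batch(packets):
    """packets: list of (rss_hash, payload_len) tuples for one rx batch.
    Returns a list of (rss_hash, total_len, packet_count) aggregates."""
    # Sorting by hash makes packets from the same connection adjacent,
    # so no hash table is needed during aggregation.
    packets.sort(key=lambda p: p[0])
    aggregated = []
    for rss_hash, length in packets:
        if aggregated and aggregated[-1][0] == rss_hash:
            # Same flow as the previous packet in the sorted batch: merge.
            prev_hash, prev_len, count = aggregated[-1]
            aggregated[-1] = (prev_hash, prev_len + length, count + 1)
        else:
            aggregated.append((rss_hash, length, 1))
    return aggregated
```

With an interleaved batch from three flows, e.g. `[(0xA, 100), (0xB, 100), (0xA, 200), (0xC, 50), (0xB, 300), (0xA, 100)]`, sorting lets all three 0xA packets merge into one aggregate even though they were never adjacent on the wire.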
This code is in FreeBSD-current right now (see tcp_lro_queue_mbuf())
I'm in the humanities, and certain scholars working with culture and technology love to make a huge deal about data leakage and how intertwined we all are, precisely because you can put a NIC in promiscuous mode and capture packets that weren't meant for you. The whole point is that because your NIC is constantly receiving data meant for others (and merely filtering on MAC addresses), something like privacy on networks is always problematic. I've always found the whole point somewhat overstated.
So, could anyone explain real quick the process of how a NIC decides whether a packet/frame is actually bound for it or link some good resources? For example, does the NIC automatically store the frame/packet in a buffer, then read the header, and then decide to discard? Or can it read the header before storing the rest of the frame? How much has been read at the point the NIC decides to drop it or move it up the stack? Reading all of every packet seems improbable to me because if it were the case, laptop 1 (awake but not downloading anything) would experience significant battery drain due to constantly filtering network traffic that was meant for laptop 2. I'm not sure that really maps to my experience. Also, I assume there are also differences for LAN vs WiFi?
Any help on the matter would be greatly appreciated! I've tried google diving on this question many times before and it's really hard to find much on it.
For wired networks, a lot of those concerns about machines receiving other machines' traffic are somewhat outdated: they were very valid in the 1980s and 1990s, but now in the 2010s they are far less pressing (although not completely gone). Back when we used coax Ethernet or Ethernet hubs, the norm was that every machine got every other machine's traffic, and the machine's NIC was responsible for filtering out the traffic destined for other machines, so spying on other people's traffic was easy, and could be done without being detected. Now, with Ethernet switches, the norm is that each machine only gets its own traffic (plus broadcast traffic destined for all machines.) It is possible to degrade a switch into behaving like a hub by MAC flooding, but in a well-maintained corporate network you can't get away with that for long without being caught. (On a home or small-business network you could probably do it for a long time without being detected, since those networks are usually poorly monitored.)
So, in a contemporary well-maintained Ethernet network, it is unlikely your traffic is being sent to other people's machines. Of course, you shouldn't rely on that if you care about your security and privacy. But, encryption is far more common (and far stronger) nowadays, so even if you see someone else's traffic, you are much less likely to understand it. That is the best answer to the concern – who cares if someone else gets your traffic if they can't read it? (Well, if they save it for a few decades, computers might become fast enough to be able to break it – but, it is very unlikely anyone could be bothered.)
For wireless networks, these concerns are still very valid. The best advice with wireless networks, even secure ones, is always use a VPN.
A 1Gb/s NIC is detecting a billion wiggles in voltage per second. Structure is imposed on these wiggles in stages: first, the A/D conversion happens, making the voltage wiggles 1's and 0's, then ethernet framing, then IP packet parsing, then TCP packet/ordering, then the application handles IO (and can, and often does, define even more structure, such as HTTP). You might look up the OSI network layering model (or the OSI 7-layer burrito, as I call it).
My understanding is that MAC filtering happens after ethernet framing, and before putting the packet into the ring via DMA, and packets failing that test do not generate interrupts. Your NIC hardware is choosing to ignore packets not addressed to it because, generally, it's pretty useless to listen in on other people's packets. Especially these days, when you're most likely to capture HTTPS-encrypted data.
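As a rough conceptual model (hypothetical function, not real driver code), the hardware address filter only needs the 6-byte destination field near the start of the frame to make its accept/drop decision:

```python
# Rough model of a NIC's hardware address filter. Real NICs implement this
# in silicon with comparators against the station address, a broadcast check,
# and a multicast hash filter; promiscuous mode simply bypasses all of it.

BROADCAST = "ff:ff:ff:ff:ff:ff"

def nic_accepts(frame_dst, nic_mac, promiscuous=False, multicast_filter=()):
    """Decide whether to DMA the frame up the stack or silently drop it."""
    if promiscuous:
        return True                       # capture everything (e.g. tcpdump)
    if frame_dst == nic_mac:
        return True                       # unicast addressed to us
    if frame_dst == BROADCAST:
        return True                       # broadcast (e.g. ARP requests)
    return frame_dst in multicast_filter  # subscribed multicast groups only
```

The key point for the battery question: a dropped frame costs only this comparison in hardware; the host CPU never wakes up for it.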
Well technically :) ... 1000BASE-T - regular gigabit ethernet - uses a five-level modulation and four pairs in parallel (which reduces high frequency components, making gigabit over old cables possible).
Gigabit Ethernet is really cool and involved tech :)
> how do you avoid generating microwaves if you're sending a GHz signal down a wire?
The main spectral energy of GbE is around 125 MHz and its harmonics (250 MHz, ...), since that's the symbol rate on the wire, but a cable is still an excellent antenna at these frequencies.
Emission is mainly avoided by using differential transmission over a twisted pair; the small loop area between the conductors minimizes emissions, and also improves rejection of outside electromagnetic noise (EMI) — an antenna always works both ways; a well-shielded mechanism will emit less and will also be less susceptible. Cables are supposed to have an outer shield (for Gigabit anyway), though it works without.
Meanwhile Ethernet avoids creating ground loops by isolating the cable on both ends with small pulse transformers (=high pass filter for the signal). The shields of the cables are also only grounded through small capacitors (=high pass filter for shield currents).
> (Or, perhaps you don't avoid it and that's the nature of the cross-talk?)
Almost! Crosstalk is mainly generated by more "intimate" coupling. Emissions mean electromagnetic waves (=long range), while crosstalk in cabling and connectors comes from inductive (magnetic) and capacitive (electric field) coupling. This happens because the conductors and contacts are all very close to each other.
I found this video (10BASE-T): https://www.youtube.com/watch?v=i8CmibhvZ0c
Does each stage of the model need to complete for the whole packet before moving on to the next stage? For example, does A/D conversion take place until all of the packet information is converted, then the whole binary blob is enframed as a header and a packet... then we filter for MAC address and move on up the stack in discrete and consecutive stages? Or are the voltages read off the line and once there is enough information to construct the header, compare it, then choose to continue reading the rest from the line or stop the A/D conversion because it is just a waste of energy? The latter makes a lot more sense to me.
Usually not. Massive core switches could not work if they had to wait for every frame to be fully in the buffer before beginning to transmit it out the correct port. All a core switch needs to look at is the destination MAC address.
Simple math explains why: the MTU (max payload size) is usually 1500 bytes (since most packets originate on Ethernet systems). The destination MAC arrives as bytes 9 through 14 on the wire (the first 6 bytes of the frame proper, after the 7-byte preamble and 1-byte start-of-frame delimiter), which means it would be an absolute waste of time to hold off forwarding until the remaining ~1500 bytes are in the buffer.
Let's calculate this with your ordinary 100 Mbit/s home connection (to keep the numbers in reasonable magnitudes). 100 Mbit/s means 0.00000001 s/bit (or 0.01 us/bit, or 0.08 us/byte). With forwarding starting after the first 16 bytes, the delay introduced by the equipment is 1.28 us (and it needs, basically, only 16 bytes of buffering capacity from start of packet to end of packet). Waiting for the full 1500 bytes would introduce 120 us (or 0.12 ms) of latency, as well as require 1500 bytes of buffer during the transmission time.
By the way, one thing I have forgotten: instant-forwarding has much less latency (obviously), but it cannot retract packets that were corrupted while being received on the ingress port, simply because the checksum can only be verified once the whole packet is in the buffer.
So basically you choose between safety (corrupted packets do not travel as far, because they don't even reach the final station) and latency (e.g. 10 hops × 0.12 ms = 1.2 ms delay at 100 Mbit/s), and also weigh the cost in buffer memory.
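The arithmetic above can be wrapped in a one-line helper (a sketch, using the same 16-byte and 1500-byte figures from the example):

```python
# Store-and-forward vs cut-through forwarding delay. A link at R Mbit/s
# carries R bits per microsecond, so the wait before the switch can start
# transmitting is simply (bytes buffered * 8) / R microseconds.

def forwarding_delay_us(bytes_buffered, link_mbit_per_s):
    """Microseconds the switch waits, given how many bytes of the frame it
    must buffer before it starts sending on the egress port."""
    return bytes_buffered * 8 / link_mbit_per_s

# Cut-through: forward after the first 16 bytes -> 1.28 us at 100 Mbit/s.
cut_through = forwarding_delay_us(16, 100)
# Store-and-forward: wait for the full 1500 bytes -> 120 us (0.12 ms).
store_and_forward = forwarding_delay_us(1500, 100)
```

At 10 hops the difference compounds: 10 × 120 us = 1.2 ms for store-and-forward versus about 13 us total for cut-through.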
The "A/D stage" only gives off a stream of bytes, so something similar to a state machine will be used to separate it into packets that are then further processed.
That doesn't mean you have to wait for a whole packet to come through, though. E.g. a switch (or router) could start to forward the packet as soon as it sees the destination address.
Different versions of ethernet encode the logical signals differently, so the adapter may decode multiple bits at the same time (100BaseTX transmits in 4-bit groups, 1000BaseT in 8-bit groups), so preamble detection and address comparison would likely be a little different, but the same idea applies.
The Wikipedia page on the Ethernet frame might be a helpful place to look for more info.
Also, as mschuster91 pointed out, an ethernet switch that can do cut-through switching will be able to avoid buffering the whole packet, to decrease latency; although it will need to have some capacity for buffering whole packets anyway: if it's already sending a packet on the destination port, cut-through switching isn't appropriate. An IP router can also do cut-through switching, once the IP header with the destination comes in (or as much addressing as required for hashing among multiple links, if the destination is link-aggregated).
One of the fundamental considerations is that if things are very sensitive, they need to be on their own air-gapped network. Or at least not on the same layer 2 fabric as a ton of other things that they can ARP. Network engineers who understand all of the myriad possible ways that topology can be set up (both at OSI layer 1 and logically) are key.
Properly set up with a secure gateway/VLAN delivery for a critical workstation that has a special route outbound to the internet through a firewall, there will be only two MACs showing up on the fabric: The workstation itself and the device that is serving as its default route/gateway.
Or maybe consider not using an information-dispersal machine for "very sensitive things".
So, it's probably not going from what we have back to paper. They could reduce a lot of risk using high-assurance products (esp. compartmentalizing ones) that are on the market right now. Plus port them to those secure CPU architectures NSF and DARPA funded. Hell, given CheriBSD, NSA would get really far just paying for it to be put on an ASIC as-is, with ATI doing custom, MLS firmware. Boom. Immune to most attacks, plus POLA for security-critical components. They just don't care enough to do it across DOD.
Although whether even that requires AND gates depends on the details of the hardware. Maybe the appropriate bits in the buffer are just directly wired to a comparison unit.
At this point we have 200Gbit/s NICs being provided by Mellanox. CPUs aren't getting any faster, and the scale-out approach is extremely difficult to get right without going across NUMA domains. Based on the progression of CPUs lately, there just isn't going to be enough time to process all these packets AND have time left over to actually run your application. There's a lot of work focusing on data locality at the moment, but at this point it's still not foolproof, and the work that has been done is woefully under-documented.
As the article mentioned, we've already added a bunch of hardware offloads. RDMA is just a continuation of these offloads, but unfortunately it requires some minor changes on the application side to take advantage of, which is probably why it's been slow to be adopted.
RDMA has so many great applications for data transfer between backend services. Whether it's queries between a web server and a DB, replication/clustering of DBs, or a micro-service fabric with microsecond latencies. Overall there's a lot of low-hanging fruit that could be optimized with RDMA.
Dovetailing a bit, back in the commercial UNIX and mainframe markets it is pretty common to have 80%-90% system utilization. In the Linux world, it's usually single digits. For some reason (I guess it's more inviting to think holistically due to the base OS model), we are getting those 90% figures in the BSD community. See drewg's comment, WhatsApp, Isilon, LLNW:
I led the creation of an OS team to look, from first principles, at where we were spending CPU/bus/IO bandwidth, focusing on north-south delivery instead of horizontal scale. With a team of 5, we were able to deliver significant shareholder value: http://investors.limelight.com/file/Index?KeyFile=38751761.
Can you explain what these "embarrassing architectures" are?
Also two of the largest tech companies on the planet using millions of servers is a "WTF" to you? Care to explain? Also I don't think either Amazon or Google has ever released actual numbers for server counts. Please provide a citation for this.
>"as being proud of millions of lines of code when someone else is doing it with magnitudes less"
This is also a pretty vague statement. Lines of code for what? And who is doing that same "what" with less code?
>"They are basically asleep at the wheel because their architectures are predicated on poor assumptions, shirking understanding of computer architecture for 10s of thousands of SREs and "full stack developers".
Do you really believe that Google and Amazon don't understand computer architecture?
Please provide citations for how you arrived at the number of SREs and "Full Stack Developers" at Google and Amazon. I don't think Google/Amazon even have "Full Stack Developer" as a job title.
>"I led the creation of an OS team, to look from first principles, where we were spending CPU/bus/IO bandwidth, and focusing on north-south delivery instead of horizontal scale"
You did this at Limelight Networks then I presume according to the link you provided? You realize that a CDN is predominantly North-South traffic right? CDNs are a very different business than being general purpose cloud providers which in addition to north-south traffic have substantial east-west traffic.
You fault Google and Amazon for "congratulating" themselves, yet the link you provide, presumably to substantiate your own grand achievements, is a fluffy press release: one with zero detail around the claims being made and no link to a white paper or any technical specifics.
Limelight is not exactly what I would call a market leader or a name associated with being a great technical innovator. Limelight lost 73 million dollars last year and has a 270 million market cap. They were also sued by CDN provider Akamai and agreed to pay Akamai licensing fees as part of the settlement for that infringement.
>"A team of 5, we are able to deliver significant shareholder value"
Your company's stock trades at $2.56 a share while Amazon and Alphabet trade at well over $800 a share. I will gladly take the shareholder value that Google's and Amazon's "10s of thousands of SREs and full stack developers" provide over your team of 5.
Google does pretty good in ecosystem investment. They do have people that know how computers work, and more importantly decision makers that let them pay huge dividends (i.e. containers vs VMs, CPU cache partitioning, TCP performance). I've seen little evidence of the same at Amazon.
Sorry, but unless you can provide statistics that the majority of the 25k committers at one of these companies (see the ACM Queue monorepo article) are not toiling in a morass of overgrowth, I'll stand by it. People have told me things; YMMV, do your own research and draw your own conclusions as I have mine.
That lawsuit was during a critical time. I can't comment any more there.
On GOOG, if you have enough money to make it worthwhile and want to ride a fractional increase above market growth rate, it's a safe stock, but it's not likely to multiply over the next few years.
In the FreeBSD community it is somewhat common knowledge that Chelsio is the best Ethernet vendor, mainly because their driver isn't a trainwreck like the dominant Intel drivers (which share BSD HAL code across BSD/Linux/Illumos and are quite similar).
If you are alright with the trade-offs of a firmware stack, Chelsio do have some pretty slick TCP full offloads that require no substantial application changes:
Last time I checked, Chelsio backed iWARP, which has pretty much been defeated in the RDMA wars by RoCE, since iWARP had terrible performance because it has to run over TCP. Intel has even removed it from their new cards and pushed DPDK as a replacement. I'm also not surprised Chelsio has added additional offloads to compensate for their loss. That being said, RoCE does have that Converged Ethernet part (CE), which is still not mainstream in many datacenters.
I get people will be disappointed that their previously Open Source drivers will be replaced with proprietary firmware but as we see FPGAs take off I think that you may see VHDL code going open source on new fully programmable cards.
I for one welcome our new hardware-based overlords, but perhaps I'm being far too optimistic.
We are fairly confident we can make BSD pump several hundred Gbps doing real world long haul TCP for content serving in the next couple years on something like Naples or POWER9.
At the other end, Isilon converted from Infiniband to OS TCP for the latest product: https://www.nextplatform.com/2016/05/06/emc-shoots-explosive.... That is pretty amazing because of low latency timing and incast.
To your point, Intel's Altera acquisition may eventually bear fruit but I'm not holding my breath and don't really know how to reason about it until an offload/accelerator ecosystem is built up.
I'm out of my league on extreme low latency stuff but take a look at http://www.chelsio.com/wp-content/uploads/resources/T6-WDLat.... For comparison sake, do you know the end to end latency of Mellanox RoCE?
ib_send_lat is 0.77, 0.77, 0.80, 0.86, 1.17, 1.31, 1.58
ib_write_lat is 0.73, 0.75, 0.76, 0.85, 1.12, 1.29, 1.57
ib_read_lat is 1.38, 1.38, 1.39, 1.42, 1.50, 1.65, 1.91
However it's really difficult to judge these things without a detailed description of the test setup. This is over physical loopback on a single machine and even things like cable length can skew things at this level.
I will get to 100G with Skylake, mainly because we have to rework our storage BoM rather than CPU improvements. Intel's focus has been off, they've somewhat misread where the market was going, but even today you have 40 PCIe 3.0 lanes (39GB/s), 67GB/s DDR4 memory bandwidth, and typically more than enough cores and threads to do whatever you want in a single E5 Xeon socket. Computers are _really_ fast, software is slow :)
So right out the gate, you have plenty of speeds and feeds to get stuff off disk and out the wire. That's exactly my workload, which entails pulling data off storage, into the VFS while kqueue is managing a pool of connected sockets, and when they are ready for more data, it goes out right from the page cache with sendfile. Netflix contributed some amazing work that makes FreeBSD particularly optimized for this workload.
In general the trick is to do less, batch more, and try not to copy things around. For example, in DPDK or netmap packet forwarding, clear an entire soft ring at a time instead of one packet at a time. Using netmap, you can change ownership of the data by pointer swapping to move it from the rx ring to tx. The ACM Queue article on netmap is particularly good reading. Basically, pass by reference, but by understanding the memory layout of the system.
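A toy Python sketch of that netmap-style pointer swap (hypothetical structure; the real API uses C ring/slot structs): forwarding moves buffer ownership between rings instead of copying payload bytes.

```python
# Netmap-style zero-copy forwarding sketch. Each ring slot names a buffer by
# index rather than holding the bytes, so forwarding a packet is just an
# exchange of buffer indices between an rx slot and a tx slot.

def forward_ring(rx_slots, tx_slots):
    """Each slot is a dict {'buf_idx': int, 'len': int}. Forward a whole
    batch by swapping buffer ownership; returns packets forwarded."""
    n = min(len(rx_slots), len(tx_slots))
    for i in range(n):
        # Swap buffer indices: tx now owns the received buffer, and rx gets
        # the spare tx buffer back for reuse. No payload bytes are copied.
        rx_slots[i]['buf_idx'], tx_slots[i]['buf_idx'] = (
            tx_slots[i]['buf_idx'], rx_slots[i]['buf_idx'])
        tx_slots[i]['len'] = rx_slots[i]['len']
    return n
```

The batch loop is the other half of the trick: one call amortizes the syscall/doorbell cost over the whole ring instead of paying it per packet.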
aio and sendfile kind of suck on Linux. epoll kind of sucks. Linux hugepages really suck, and the Linux VM seems to be biased toward massive concurrency or something that I don't really understand. None of these are monumental technical problems, but there is a malignant culture at these hyperscalers, because just being a bandwagon fan doesn't make you a winning team. Linux users tend to trust manufacturers to do all the device driver development correctly, and vendors like Red Hat to drive general forward progress. How many patches does Amazon have in Xen? Linux?
At the BSD companies I've mentioned, a few dozen people across all of them have pulled this stuff off, and it's all there in the base system. We rip apart vendor drivers or entire subsystems when that's the prescription. We're pretty happy to share details and help others succeed at conferences or even by partnering up team to team across companies.
Heck, I took our corp. net down a decade back by causing a test machine to drop into a kernel debugger with flow control enabled on the switches and the test machine. All of a sudden, my remote access to the serial console kernel debugger froze, and I was cursing the flaky VPN. Then a second later, I had the 'Oh Crap, that was me!' realization.
Loss is inevitable in most cloud network fabrics due to the use of shallow buffer commodity ASICs, the amount of fan-in, and the reliance on tenants to use the network in a typical manner (don't fragment, don't encap, preserve entropy).
RDMA is going to be tough - microseconds count, you can't wait for TCP to retrans, and UDP doesn't provide the reliability needed. A new transport protocol is likely the answer.
In terms of RDMA and bypass messaging, I agree there can be substantial advantages to offloading transport layer message processing from VMs. Regrettably, I can't say anything more on the subject today.
(I lead the team that owns the virtual NICs exposed to guests in GCE)
Cool. When should people start poking around GCE's network blog for Interesting Updates™ then?
(Keeping in mind that I acknowledge/note that the "6 months from now" or "next year" or whatever that you say is the date you think we should start keeping an eye out, not necessarily the point in time anything will happen.)
Speaking from (decently informed) personal opinion here--
I see SR-IOV as a means to achieving certain networking characteristics, but one that historically didn't play particularly nicely with live migration (among other things), and one that locks you into particular silicon vendors, etc. It's a bit unfortunate that SR-IOV is often assumed as a prerequisite for high-throughput, low-latency, low-jitter networking. On its own it doesn't guarantee any of those things (you're still beholden to the VCPU scheduler, if nothing else, and there's a lot of 'else'), and it's also not the only way to achieve those things. In particular, there are a lot of ways to dedicate hardware to I/O service that aren't SR-IOV.
Here's what I mean:
SR-IOV can do almost the equivalent of the PMD, I believe.
When they say "virtio-net" there they mean virtio-net inside qemu with vhost servicing the queues on the host side (note, we don't use vhost in GCE -- our device models live in a proprietary hypervisor that's designed to play nicely with Google's production infrastructure). One could just as easily expose what looked like an Intel VF to the guest and service it in the same manner (although there are good reasons not to).
One could also build a physical NIC that exposed virtual functions offering register layouts equivalent to VIRTIO's PCI BARs and used the VIRTIO queue format. If you assigned those into a guest, you'd be doing SR-IOV, but with a virtio-net NIC as seen by the guest. It also likely wouldn't perform as well as a software implementation (in its current form VIRTIO has a lot of serializing dependent loads, which make it inefficient to implement over PCIe). There's some ongoing work upstream aimed at a more efficient queue format.
So, yeah, "it depends" is about the best you can do. SR-IOV really just says you're taking advantage of some features of PCI that allow differentiated address spaces in the IOMMU for a single device and (on modern CPUs) interrupt routing to actively running VCPUs without requiring a hop through the host kernel. The former is handy if you want the NIC to be able to cheaply use guest physical addresses (although the IOMMU isn't free either); the latter doesn't matter if the guest is running a poll-mode driver that masks interrupts, nor does it matter if the target VCPU isn't actively running.
I say this abstractly as I'm curious what your answer is and I think it might be very helpful to others considering your experience. (I just had to google "sr-iov" as I was typing this :) )
That said, SR-IOV isn't a BAD option if you don't care about live migration of VMs over a lifespan potentially longer than the underlying hardware platform. It's a good option if you can live with host-hardware-specific drivers in your guests, you're willing to standardize on a specific NIC as host hardware for the fleet where you'll be running VMs, and you're willing to deal with the resource commitment it requires to deliver on its performance promises.
So you can assign a virtual function from any SR-IOV-capable NIC into a VM using common code, but what the guest sees inside the VM will still be an Intel, Mellanox, Chelsio, etc. device. Naturally they'll need a driver for that device.
If your fleet includes NICs from a variety of vendors, you'd never be able to live migrate VMs between hosts with different NIC models. Until recently, no vendor I knew of included the relevant serialization and deserialization functionality to allow for live migration at all.
Another related article I found interesting: https://www.coverfire.com/articles/queueing-in-the-linux-net... Discusses some of the queues in the Linux network stack.