On our workloads (~100K connections on a 16-core / 32-thread FreeBSD-based 100GbE CDN server) LRO was rather ineffective, because there were roughly 3K connections per rx queue. Even with large interrupt coalescing parameters and large ring sizes, the odds of encountering 2 packets from the same connection within a few packets of each other, even in a group of 1000 or more, are rather small.
The first idea we had was to use a hash table to aggregate flows. This helped, but had the drawback of a much higher cache footprint.
Hps had the idea that we could sort packets by RSS hash ID before passing them to LRO. This would put packets from the same connection adjacent to each other, thereby allowing LRO without a hash table to work. Our LRO aggregation rate went from ~1.1:1 to well over 2:1, and we reduced CPU use by roughly 10%.
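A minimal Python sketch of the idea (hypothetical names, not the actual tcp_lro_queue_mbuf() code): sort one rx batch by RSS hash so same-flow packets become neighbors, then merge adjacent packets with the same hash.

```python
# Hypothetical sketch of sort-then-aggregate LRO. The real kernel code also
# checks TCP sequence numbers, flags, timestamps, etc. before merging; here a
# packet is just (rss_hash, payload_len) and we only merge equal hashes.

def lro_flush_batch(packets):
    """packets: list of (rss_hash, payload_len) tuples for one rx batch.
    Returns a list of (rss_hash, total_len, packet_count) aggregates."""
    # Sorting by hash makes packets from the same connection adjacent,
    # so no hash table is needed during aggregation.
    packets.sort(key=lambda p: p[0])
    aggregated = []
    for rss_hash, length in packets:
        if aggregated and aggregated[-1][0] == rss_hash:
            # Same flow as the previous packet in the sorted batch: merge.
            prev_hash, prev_len, count = aggregated[-1]
            aggregated[-1] = (prev_hash, prev_len + length, count + 1)
        else:
            aggregated.append((rss_hash, length, 1))
    return aggregated
```

With an interleaved batch from three flows, e.g. `[(0xA, 100), (0xB, 100), (0xA, 200), (0xC, 50), (0xB, 300), (0xA, 100)]`, sorting lets all three 0xA packets merge into one aggregate even though they were never adjacent on the wire.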
This code is in FreeBSD-current right now (see tcp_lro_queue_mbuf())
I'm in the humanities, and certain scholars working with culture and technology love to make a huge deal about data leakage and how intertwined we all are, precisely because you can put a NIC in promiscuous mode and capture packets that weren't meant for you. The whole point is that because your NIC is constantly receiving data meant for others (and merely filtering on MAC addresses), something like privacy on networks is always problematic. I've always found the whole point somewhat overstated.
So, could anyone explain real quick the process of how a NIC decides whether a packet/frame is actually bound for it or link some good resources? For example, does the NIC automatically store the frame/packet in a buffer, then read the header, and then decide to discard? Or can it read the header before storing the rest of the frame? How much has been read at the point the NIC decides to drop it or move it up the stack? Reading all of every packet seems improbable to me because if it were the case, laptop 1 (awake but not downloading anything) would experience significant battery drain due to constantly filtering network traffic that was meant for laptop 2. I'm not sure that really maps to my experience. Also, I assume there are also differences for LAN vs WiFi?
Any help on the matter would be greatly appreciated! I've tried google diving on this question many times before and it's really hard to find much on it.
For wired networks, a lot of those concerns about machines receiving other machines' traffic are somewhat outdated: they were very valid in the 1980s and 1990s, but now in the 2010s they are far less pressing (although not completely gone). Back when we used coax Ethernet or Ethernet hubs, the norm was that every machine got every other machine's traffic, and the machine's NIC was responsible for filtering out the traffic destined for other machines, so spying on other people's traffic was easy, and could be done without being detected. Now, with Ethernet switches, the norm is that each machine only gets its own traffic (plus broadcast traffic destined for all machines.) It is possible to degrade a switch into behaving like a hub by MAC flooding, but in a well-maintained corporate network you can't get away with that for long without being caught. (On a home or small-business network you could probably do it for a long time without being detected, since those networks are usually poorly monitored.)
So, in a contemporary well-maintained Ethernet network, it is unlikely your traffic is being sent to other people's machines. Of course, you shouldn't rely on that if you care about your security and privacy. But, encryption is far more common (and far stronger) nowadays, so even if you see someone else's traffic, you are much less likely to understand it. That is the best answer to the concern – who cares if someone else gets your traffic if they can't read it? (Well, if they save it for a few decades, computers might become fast enough to be able to break it – but, it is very unlikely anyone could be bothered.)
For wireless networks, these concerns are still very valid. The best advice with wireless networks, even secure ones, is always use a VPN.
A 1Gb/s NIC is detecting a billion wiggles in voltage per second. Structure is imposed on these wiggles in stages: first, the A/D conversion happens, making the voltage wiggles 1's and 0's, then ethernet framing, then IP packet parsing, then TCP packet/ordering, then the application handles IO (and can, and often does, define even more structure, such as HTTP). You might look up the OSI network layering model (or the OSI 7-layer burrito, as I call it).
My understanding is that MAC filtering happens after ethernet framing, and before putting the packet into the ring via DMA, and packets failing that test do not generate interrupts. Your NIC hardware is choosing to ignore packets not addressed to it because, generally, it's pretty useless to listen in on other people's packets. Especially these days, when you're most likely to capture HTTPS-encrypted data.
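As a rough conceptual model (hypothetical function, not real driver code), the hardware address filter only needs the 6-byte destination field near the start of the frame to make its accept/drop decision:

```python
# Rough model of a NIC's hardware address filter. Real NICs implement this
# in silicon with comparators against the station address, a broadcast check,
# and a multicast hash filter; promiscuous mode simply bypasses all of it.

BROADCAST = "ff:ff:ff:ff:ff:ff"

def nic_accepts(frame_dst, nic_mac, promiscuous=False, multicast_filter=()):
    """Decide whether to DMA the frame up the stack or silently drop it."""
    if promiscuous:
        return True                       # capture everything (e.g. tcpdump)
    if frame_dst == nic_mac:
        return True                       # unicast addressed to us
    if frame_dst == BROADCAST:
        return True                       # broadcast (e.g. ARP requests)
    return frame_dst in multicast_filter  # subscribed multicast groups only
```

The key point for the battery question: a dropped frame costs only this comparison in hardware; the host CPU never wakes up for it.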
Well technically :) ... 1000BASE-T - regular gigabit ethernet - uses a five-level modulation and four pairs in parallel (which reduces high frequency components, making gigabit over old cables possible).
Gigabit Ethernet is really cool and involved tech :)
> how do you avoid generating microwaves if you're sending a GHz signal down a wire?
The main spectral energy of GbE is around 125 MHz and its harmonics (250 MHz, ...), since that's the symbol rate on the wire, but a cable is still an excellent antenna at these frequencies.
Emission is mainly avoided by using differential transmission over a twisted pair; the small loop area between the conductors minimizes emissions, and also improves rejection of outside electromagnetic noise (EMI) — an antenna always works both ways; a well-shielded mechanism will emit less and will also be less susceptible. Cables are supposed to have an outer shield (for Gigabit anyway), though it works without.
Meanwhile Ethernet avoids creating ground loops by isolating the cable on both ends with small pulse transformers (=high pass filter for the signal). The shields of the cables are also only grounded through small capacitors (=high pass filter for shield currents).
> (Or, perhaps you don't avoid it and that's the nature of the cross-talk?)
Almost! Crosstalk is mainly generated by more "intimate" coupling. Emissions mean electromagnetic waves (=long range), while crosstalk in cabling and connectors comes from inductive (magnetic) and capacitive (electric field) coupling. This happens because the conductors and contacts are all very close to each other.
I found this video (10BASE-T): https://www.youtube.com/watch?v=i8CmibhvZ0c
Does each stage of the model need to complete for the whole packet before moving on to the next stage? For example, does A/D conversion take place until all of the packet information is converted, then the whole binary blob is enframed as a header and a packet... then we filter for MAC address and move on up the stack in discrete and consecutive stages? Or are the voltages read off the line and once there is enough information to construct the header, compare it, then choose to continue reading the rest from the line or stop the A/D conversion because it is just a waste of energy? The latter makes a lot more sense to me.
Usually not. Massive core switches could not work if they had to wait for every frame to be fully in the buffer before beginning to transmit it out the correct port. All a core switch needs to look at is the destination MAC address.
Simple math explains why: the MTU (max payload size) is usually 1500 bytes (since most packets originate on Ethernet systems). The destination MAC arrives as bytes 9 through 14 on the wire (the first 6 bytes of the frame proper, after the 7-byte preamble and 1-byte start-of-frame delimiter), which means it would be an absolute waste of time to hold off forwarding until the remaining ~1500 bytes are in the buffer.
Let's calculate this with your ordinary 100 Mbit/s home connection (to keep the numbers in reasonable magnitudes). 100 Mbit/s means 0.00000001 s/bit (or 0.01 us/bit, or 0.08 us/byte). With forwarding starting after the first 16 bytes, the delay introduced by the equipment is 1.28 us (and it needs, basically, only 16 bytes of buffering capacity from start of packet to end of packet). Waiting for the full 1500 bytes would introduce 120 us (or 0.12 ms) of latency, as well as require 1500 bytes of buffer during the transmission time.
By the way, one thing I have forgotten: instant-forwarding has much less latency (obviously), but it cannot retract packets that were corrupted while being received on the ingress port, simply because the checksum can only be verified once the whole packet is in the buffer.
So basically you choose between safety (corrupted packets do not travel as far, because they don't even reach the final station) and latency (e.g. 10 hops × 0.12 ms = 1.2 ms delay at 100 Mbit/s), and also weigh the cost in buffer memory.
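The arithmetic above can be wrapped in a one-line helper (a sketch, using the same 16-byte and 1500-byte figures from the example):

```python
# Store-and-forward vs cut-through forwarding delay. A link at R Mbit/s
# carries R bits per microsecond, so the wait before the switch can start
# transmitting is simply (bytes buffered * 8) / R microseconds.

def forwarding_delay_us(bytes_buffered, link_mbit_per_s):
    """Microseconds the switch waits, given how many bytes of the frame it
    must buffer before it starts sending on the egress port."""
    return bytes_buffered * 8 / link_mbit_per_s

# Cut-through: forward after the first 16 bytes -> 1.28 us at 100 Mbit/s.
cut_through = forwarding_delay_us(16, 100)
# Store-and-forward: wait for the full 1500 bytes -> 120 us (0.12 ms).
store_and_forward = forwarding_delay_us(1500, 100)
```

At 10 hops the difference compounds: 10 × 120 us = 1.2 ms for store-and-forward versus about 13 us total for cut-through.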
The "A/D stage" only gives off a stream of bytes, so something similar to a state machine will be used to separate it into packets that are then further processed.
That doesn't mean you have to wait for a whole packet to come through, though. E.g. a switch (or router) could start to forward the packet as soon as it sees the destination address.
Different versions of ethernet encode the logical signals differently, so the adapter may decode multiple bits at the same time (100BaseTX transmits in 4-bit groups, 1000BaseT in 8-bit groups), so preamble detection and address comparison would likely be a little different, but the same idea applies.
The Wikipedia page on the Ethernet frame might be a helpful place to look for more info.
Also, as mschuster91 pointed out, an ethernet switch that can do cut-through switching will be able to avoid buffering the whole packet, to decrease latency; although it will need to have some capacity for buffering whole packets anyway: if it's already sending a packet on the destination port, cut-through switching isn't appropriate. An IP router can also do cut-through switching, once the IP header with the destination comes in (or as much addressing as required for hashing among multiple links, if the destination is link-aggregated).
One of the fundamental considerations is that if things are very sensitive, they need to be on their own air-gapped network. Or at least not on the same layer 2 fabric as a ton of other things that they can ARP. Network engineers who understand all of the myriad possible ways that topology can be set up (both at OSI layer 1 and logically) are key.
Properly set up with a secure gateway/VLAN delivery for a critical workstation that has a special route outbound to the internet through a firewall, there will be only two MACs showing up on the fabric: The workstation itself and the device that is serving as its default route/gateway.
Or maybe consider not using an information-dispersal machine for "very sensitive things".
So, it's probably not going from what we have back to paper. They could reduce a lot of risk using high-assurance products (esp. compartmentalizing ones) that are on the market right now. Plus port them to those secure CPU architectures NSF and DARPA funded. Hell, given CheriBSD, NSA would get really far just paying for it to be put on an ASIC as-is, with ATI doing custom, MLS firmware. Boom. Immune to most attacks, plus POLA for security-critical components. They just don't care enough to do it across DOD.
Although whether even that requires AND gates depends on the details of the hardware. Maybe the appropriate bits in the buffer are just directly wired to a comparison unit.
At this point we have 200Gbit/s NICs being provided by Mellanox. CPUs aren't getting any faster, and the scale-out approach is extremely difficult to get right without going across NUMA domains. Based on the progression of CPUs lately, there just isn't going to be enough time to process all these packets AND have time left over to actually run your application. There's a lot of work focusing on data locality at the moment, but at this point it's still not foolproof, and the work that has been done is woefully under-documented.
As the article mentioned, we've already added a bunch of hardware offloads. RDMA is just a continuation of these offloads, but unfortunately it requires some minor changes on the application side to take advantage of, which is probably why it's been slow to be adopted.
RDMA has so many great applications for data transfer between backend services. Whether it's queries between a web server and a DB, replication/clustering of DBs, or a micro-service fabric with microsecond latencies. Overall there's a lot of low-hanging fruit that could be optimized with RDMA.
Dovetailing a bit, back in the commercial UNIX and mainframe markets it is pretty common to have 80%-90% system utilization. In the Linux world, it's usually single digits. For some reason (I guess it's more inviting to think holistically due to the base OS model), we are getting those 90% figures in the BSD community. See drewg's comment, WhatsApp, Isilon, LLNW:
I led the creation of an OS team to look, from first principles, at where we were spending CPU/bus/IO bandwidth, focusing on north-south delivery instead of horizontal scale. With a team of 5, we were able to deliver significant shareholder value: http://investors.limelight.com/file/Index?KeyFile=38751761.
Can you explain what these "embarrassing architectures" are?
Also two of the largest tech companies on the planet using millions of servers is a "WTF" to you? Care to explain? Also I don't think either Amazon or Google has ever released actual numbers for server counts. Please provide a citation for this.
>"as being proud of millions of lines of code when someone else is doing it with magnitudes less"
This is also a pretty vague statement. Lines of code for what? And who is doing that same "what" with less code?
>"They are basically asleep at the wheel because their architectures are predicated on poor assumptions, shirking understanding of computer architecture for 10s of thousands of SREs and "full stack developers".
Do you really believe that Google and Amazon don't understand computer architecture?
Please provide citations for how you arrived at the number of SREs and "Full Stack Developers" at Google and Amazon. I don't think Google/Amazon even have "Full Stack Developer" as a job title.
>"I led the creation of an OS team, to look from first principles, where we were spending CPU/bus/IO bandwidth, and focusing on north-south delivery instead of horizontal scale"
You did this at Limelight Networks then I presume according to the link you provided? You realize that a CDN is predominantly North-South traffic right? CDNs are a very different business than being general purpose cloud providers which in addition to north-south traffic have substantial east-west traffic.
You fault Google and Amazon for "congratulating" themselves, yet the link you provide, presumably to substantiate your own grand achievements, is a fluffy press release: one with zero detail around the claims being made and no link to a white paper or any technical specifics.
Limelight is not exactly what I would call a market leader or a name associated with being a great technical innovator. Limelight lost 73 million dollars last year and has a 270 million market cap. They were also sued by CDN provider Akamai and agreed to pay Akamai licensing fees as part of the settlement for that infringement.
>"A team of 5, we are able to deliver significant shareholder value"
Your company's stock trades at $2.56 a share while Amazon and Alphabet trade at well over $800 a share. I will gladly take the shareholder value that Google's and Amazon's "10s of thousands of SREs and full stack developers" provide over your team of 5.
Google does pretty good in ecosystem investment. They do have people that know how computers work, and more importantly decision makers that let them pay huge dividends (i.e. containers vs VMs, CPU cache partitioning, TCP performance). I've seen little evidence of the same at Amazon.
Sorry, but unless you can provide statistics that the majority of the 25k committers at one of these companies (see the ACM Queue monorepo article) are not toiling in a morass of overgrowth, I'll stand by it. People have told me things; YMMV, do your own research and draw your own conclusions as I have mine.
That lawsuit was during a critical time. I can't comment any more there.
On GOOG, if you have enough money to make it worthwhile and want to ride a fractional increase above market growth rate, it's a safe stock, but it's not likely to multiply over the next few years.
In the FreeBSD community it is somewhat common knowledge that Chelsio is the best Ethernet vendor, mainly because their driver isn't a trainwreck like the dominant Intel drivers (which share BSD HAL code across BSD/Linux/Illumos and are quite similar).
If you are alright with the trade-offs of a firmware stack, Chelsio do have some pretty slick TCP full offloads that require no substantial application changes:
Last time I checked, Chelsio backed iWARP, which has pretty much been defeated in the RDMA wars by RoCE, since iWARP had terrible performance because it has to run over TCP. Intel has even removed it from their new cards and pushed DPDK as a replacement. I'm also not surprised Chelsio has added additional offloads to compensate for their loss. That being said, RoCE does have that Converged Ethernet part (CE), which is still not mainstream in many datacenters.
I get people will be disappointed that their previously Open Source drivers will be replaced with proprietary firmware but as we see FPGAs take off I think that you may see VHDL code going open source on new fully programmable cards.
I for one welcome our new hardware-based overlords, but perhaps I'm being far too optimistic.
We are fairly confident we can make BSD pump several hundred Gbps doing real world long haul TCP for content serving in the next couple years on something like Naples or POWER9.
At the other end, Isilon converted from Infiniband to OS TCP for the latest product: https://www.nextplatform.com/2016/05/06/emc-shoots-explosive.... That is pretty amazing because of low latency timing and incast.
To your point, Intel's Altera acquisition may eventually bear fruit but I'm not holding my breath and don't really know how to reason about it until an offload/accelerator ecosystem is built up.
I'm out of my league on extreme low latency stuff but take a look at http://www.chelsio.com/wp-content/uploads/resources/T6-WDLat.... For comparison sake, do you know the end to end latency of Mellanox RoCE?
ib_send_lat is 0.77, 0.77, 0.80, 0.86, 1.17, 1.31, 1.58
ib_write_lat is 0.73, 0.75, 0.76, 0.85, 1.12, 1.29, 1.57
ib_read_lat is 1.38, 1.38, 1.39, 1.42, 1.50, 1.65, 1.91
However it's really difficult to judge these things without a detailed description of the test setup. This is over physical loopback on a single machine and even things like cable length can skew things at this level.
I will get to 100G with Skylake, mainly because we have to rework our storage BoM rather than CPU improvements. Intel's focus has been off, they've somewhat misread where the market was going, but even today you have 40 PCIe 3.0 lanes (39GB/s), 67GB/s DDR4 memory bandwidth, and typically more than enough cores and threads to do whatever you want in a single E5 Xeon socket. Computers are _really_ fast, software is slow :)
So right out the gate, you have plenty of speeds and feeds to get stuff off disk and out the wire. That's exactly my workload, which entails pulling data off storage, into the VFS while kqueue is managing a pool of connected sockets, and when they are ready for more data, it goes out right from the page cache with sendfile. Netflix contributed some amazing work that makes FreeBSD particularly optimized for this workload.
In general the trick is to do less, batch more, and try not to copy things around. For example, in DPDK or netmap packet forwarding, clear an entire soft ring at a time instead of one packet at a time. Using netmap, you can change ownership of the data by pointer swapping to move it from the rx ring to tx. The ACM Queue article on netmap is particularly good reading. Basically, pass by reference, but by understanding the memory layout of the system.
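A toy Python sketch of that netmap-style pointer swap (hypothetical structure; the real API uses C ring/slot structs): forwarding moves buffer ownership between rings instead of copying payload bytes.

```python
# Netmap-style zero-copy forwarding sketch. Each ring slot names a buffer by
# index rather than holding the bytes, so forwarding a packet is just an
# exchange of buffer indices between an rx slot and a tx slot.

def forward_ring(rx_slots, tx_slots):
    """Each slot is a dict {'buf_idx': int, 'len': int}. Forward a whole
    batch by swapping buffer ownership; returns packets forwarded."""
    n = min(len(rx_slots), len(tx_slots))
    for i in range(n):
        # Swap buffer indices: tx now owns the received buffer, and rx gets
        # the spare tx buffer back for reuse. No payload bytes are copied.
        rx_slots[i]['buf_idx'], tx_slots[i]['buf_idx'] = (
            tx_slots[i]['buf_idx'], rx_slots[i]['buf_idx'])
        tx_slots[i]['len'] = rx_slots[i]['len']
    return n
```

The batch loop is the other half of the trick: one call amortizes the syscall/doorbell cost over the whole ring instead of paying it per packet.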
aio and sendfile kind of suck on Linux. epoll kind of sucks. Linux hugepages really suck, and the Linux VM seems to be biased toward massive concurrency or something that I don't really understand. None of these are monumental technical problems, but there is a malignant culture at these hyperscalers, because just being a bandwagon fan doesn't make you a winning team. Linux users tend to trust manufacturers to do all the device driver development correctly, and vendors like Red Hat to drive general forward progress. How many patches does Amazon have in Xen? Linux?
At the BSD companies I've mentioned, a few dozen people across all of them have pulled this stuff off, and it's all there in the base system. We rip apart vendor drivers or entire subsystems when that's the prescription. We're pretty happy to share details and help others succeed at conferences or even by partnering up team to team across companies.
Heck, I took our corp. net down a decade back by causing a test machine to drop into a kernel debugger with flow control enabled on the switches and the test machine. All of a sudden, my remote access to the serial console kernel debugger froze, and I was cursing the flaky VPN. Then a second later, I had the 'Oh Crap, that was me!' realization.
Loss is inevitable in most cloud network fabrics due to the use of shallow buffer commodity ASICs, the amount of fan-in, and the reliance on tenants to use the network in a typical manner (don't fragment, don't encap, preserve entropy).
RDMA is going to be tough - microseconds count, you can't wait for TCP to retrans, and UDP doesn't provide the reliability needed. A new transport protocol is likely the answer.
In terms of RDMA and bypass messaging, I agree there can be substantial advantages to offloading transport layer message processing from VMs. Regrettably, I can't say anything more on the subject today.
(I lead the team that owns the virtual NICs exposed to guests in GCE)
Cool. When should people start poking around GCE's network blog for Interesting Updates™ then?
(Keeping in mind that I acknowledge/note that the "6 months from now" or "next year" or whatever that you say is the date you think we should start keeping an eye out, not necessarily the point in time anything will happen.)
Speaking from (decently informed) personal opinion here--
I see SR-IOV as a means to achieving certain networking characteristics, but one that historically didn't play particularly nicely with live migration (among other things), and one that locks you into particular silicon vendors, etc. It's a bit unfortunate that SR-IOV is often assumed as a prerequisite for high-throughput, low-latency, low-jitter networking. On its own it doesn't guarantee any of those things (you're still beholden to the VCPU scheduler, if nothing else, and there's a lot of 'else'), and it's also not the only way to achieve those things. In particular, there are a lot of ways to dedicate hardware to I/O service that aren't SR-IOV.
Here's what I mean:
SR-IOV can do almost the equivalent of the PMD, I believe.
When they say "virtio-net" there they mean virtio-net inside qemu with vhost servicing the queues on the host side (note, we don't use vhost in GCE -- our device models live in a proprietary hypervisor that's designed to play nicely with Google's production infrastructure). One could just as easily expose what looked like an Intel VF to the guest and service it in the same manner (although there are good reasons not to).
One could also build a physical NIC that exposed virtual functions offering register layouts equivalent to VIRTIO's PCI BARs and used the VIRTIO queue format. If you assigned those into a guest, you'd be doing SR-IOV, but with a virtio-net NIC as seen by the guest. It also likely wouldn't perform as well as a software implementation (in its current form VIRTIO has a lot of serializing dependent loads, which make it inefficient to implement over PCIe). There's some ongoing work upstream aimed at a more efficient queue format.
So, yeah, "it depends" is about the best you can do. SR-IOV really just says you're taking advantage of some features of PCI that allow differentiated address spaces in the IOMMU for a single device and (on modern CPUs) interrupt routing to actively running VCPUs without requiring a hop through the host kernel. The former is handy if you want the NIC to be able to cheaply use guest physical addresses (although the IOMMU isn't free either); the latter doesn't matter if the guest is running a poll-mode driver that masks interrupts, nor does it matter if the target VCPU isn't actively running.
I say this abstractly as I'm curious what your answer is and I think it might be very helpful to others considering your experience. (I just had to google "sr-iov" as I was typing this :) )
That said, SR-IOV isn't a BAD option if you don't care about live migration of VMs over a lifespan potentially longer than the underlying hardware platform. It's a good option if you can live with host-hardware-specific drivers in your guests, you're willing to standardize on a specific NIC as host hardware for the fleet where you'll be running VMs, and you're willing to deal with the resource commitment it requires to deliver on its performance promises.
So you can assign a virtual function from any SR-IOV-capable NIC into a VM using common code, but what the guest sees inside the VM will still be an Intel, Mellanox, Chelsio, etc. device. Naturally they'll need a driver for that device.
If your fleet includes NICs from a variety of vendors, you'd never be able to live migrate VMs between hosts with different NIC models. Until recently, no vendor I knew of included the relevant serialization and deserialization functionality to allow for live migration at all.
Another related article I found interesting: https://www.coverfire.com/articles/queueing-in-the-linux-net... Discusses some of the queues in the Linux network stack.