Actually, at its root it is based on SIMD and prefetching.
In short, each part of the packet-processing graph is a node. It receives a vector of packets (represented as a vector of packet indexes), and its output is one or more vectors, each of which goes as input to the next node in the processing graph.
This architecture maximizes cache hits and warms up the branch predictor (since we run the same small piece of code over many packets instead of the whole graph for each packet).
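To make that concrete, here is a minimal sketch of one such node (illustrative only, not VPP's actual API; every name in it is invented):

    /* One graph node: run one small loop body over a whole vector of
     * packet indices, so the node's code stays hot in the I-cache and
     * its branches stay warm in the predictor. */
    #include <stddef.h>
    #include <stdint.h>

    extern uint8_t *packet_data(uint32_t pkt_index); /* stand-in lookup into the packet pool */

    /* Example node: decrement the IPv4 TTL of every packet in the
     * batch (checksum update omitted for brevity). */
    void ip4_ttl_node(const uint32_t *pkt_index, size_t n,
                      uint32_t *out, size_t *out_n)
    {
        for (size_t i = 0; i < n; i++) {
            /* Prefetch a few packets ahead so their headers are in
             * cache by the time the loop reaches them. */
            if (i + 4 < n)
                __builtin_prefetch(packet_data(pkt_index[i + 4]));

            uint8_t *ip = packet_data(pkt_index[i]) + 14; /* skip Ethernet header */
            ip[8]--;                  /* TTL is byte 8 of the IPv4 header */
            out[i] = pkt_index[i];    /* whole batch feeds the next node  */
        }
        *out_n = n;
    }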
I can certainly imagine some SIMD concepts in that. Particularly stream compaction (or, in the AVX512 case, the VPCOMPRESSD and VPEXPANDD instructions).
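A minimal sketch of what that could look like (assumes AVX-512F; the keep() predicate is hypothetical, e.g. "lane i is an IPv4 packet"):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Compact a batch of packet indices: write only the indices whose
     * mask bit is set, contiguously, via VPCOMPRESSD. Returns how many
     * survived. Tail handling for n % 16 omitted. */
    size_t compact_indices(const uint32_t *idx, size_t n,
                           __mmask16 (*keep)(__m512i), uint32_t *out)
    {
        size_t w = 0;
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m512i v = _mm512_loadu_si512(idx + i);
            __mmask16 k = keep(v);
            _mm512_mask_compressstoreu_epi32(out + w, k, v); /* VPCOMPRESSD */
            w += (size_t)_mm_popcnt_u32(k);
        }
        return w;
    }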
EDIT: I guess from a SIMD perspective, I'd have expected an interleaved set of packets, à la struct-of-arrays rather than array-of-structs. But maybe that doesn't make sense for packet formats.
The NIC gives you an array (ring buffer) of pointers to structs (packets). Interleaving them into SOA format would probably cost more than any speedup from SIMD.
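To spell out the two layouts being discussed (hypothetical types, not any real driver's descriptor format):

    #include <stdint.h>

    /* Array-of-structs: what the NIC hands you, one pointer per
     * received packet in the rx ring. */
    struct packet  { uint8_t data[2048]; uint16_t len; };
    struct rx_ring { struct packet *slot[512]; };

    /* Struct-of-arrays: each header field of every packet copied into
     * its own contiguous array. Building this means touching every
     * packet up front, which is the cost referred to above. */
    struct soa_batch {
        uint32_t dst_addr[256]; /* IPv4 destination of packet i */
        uint8_t  ttl[256];      /* TTL of packet i */
    };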
Yeah, but it's difficult to write a SIMD / AVX512 routine if things aren't in SOA format.
I can see how the approach described is "vector-like", even if the vector is this... imaginary unit that's parallelizing over the branch predictor instead of explicit SIMD code.
This "vector" organization probably has 99.999%+ branch prediction or something, effectively parallelizing the concept. But not in the SIMD-way. So still useful, but not what I thought originally based on the title.
A ring buffer of pointers to structs is friendly to gather instructions.
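For example, with AVX-512 you can load eight packet pointers and fetch one 32-bit header field from each packet in a single VPGATHERQD. A sketch (the byte offset 30 is just an example: the IPv4 destination address in a plain Ethernet frame, 14 + 16):

    #include <immintrin.h>
    #include <stdint.h>

    /* Gather the IPv4 destination address of 8 packets given their
     * pointers: absolute addresses in the index vector, base address
     * NULL, scale 1. */
    __m256i gather_dst_addrs(uint8_t *const pkts[8])
    {
        __m512i ptrs  = _mm512_loadu_si512(pkts);                      /* 8 x 64-bit pointers */
        __m512i addrs = _mm512_add_epi64(ptrs, _mm512_set1_epi64(30)); /* point at the field  */
        return _mm512_i64gather_epi32(addrs, NULL, 1);                 /* VPGATHERQD          */
    }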
That said, the documentation shows a graph of operations applied to each packet. I'd expect that to lead to a lot of "divergence", and therefore to be non-SIMD-friendly.
(Also, x86-64 CPUs with good gather instructions are rare, and sibling comments show that this is aimed at lower-end CPUs. That makes SIMD even less relevant.)
Most packets follow the same nodes in the graph. You have some divergence (e.g. ARP packets vs IP packets to forward), but the bulk of the traffic does not. So typically the initial batch of packets might be split in two: a small "control plane traffic" batch (e.g. ARP) and a big "dataplane traffic" batch (IP packets to forward). You won't do much SIMD on the small control plane batch, which is branchy anyway, but you do on the big dataplane batch, which is the bulk of the traffic.
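Something like this, as a sketch (invented names, not VPP code):

    #include <stddef.h>
    #include <stdint.h>

    extern uint16_t ethertype(uint32_t pkt_index); /* stand-in header read */
    #define ETHERTYPE_IPV4 0x0800

    /* Peel the rare control plane packets out of the input vector,
     * leaving a large contiguous dataplane batch for the vectorized
     * forwarding path. */
    void split_batch(const uint32_t *in, size_t n,
                     uint32_t *data_plane, size_t *n_data,
                     uint32_t *ctrl_plane, size_t *n_ctrl)
    {
        size_t d = 0, c = 0;
        for (size_t i = 0; i < n; i++) {
            /* Highly predictable branch: almost every packet is IPv4. */
            if (ethertype(in[i]) == ETHERTYPE_IPV4)
                data_plane[d++] = in[i];
            else
                ctrl_plane[c++] = in[i];
        }
        *n_data = d;  /* big batch: SIMD-friendly forwarding   */
        *n_ctrl = c;  /* small batch: scalar, branchy handling */
    }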
And VPP is targeting high-end systems and uses plenty of AVX512 (we demonstrated 1 Tbps of IPsec traffic on Intel Ice Lake, for example). It's just very scalable to both small and big systems.
You can read more about it here: https://s3-docs.fd.io/vpp/24.02/aboutvpp/scalar-vs-vector-pa...