Actually, at its root it is based on SIMD and prefetching.
In short, each part of the packet-processing graph is a node. It receives a vector of packets (represented as a vector of packet indexes), and its output is one or more vectors, each of which goes as input to the next node in the processing graph.
This architecture maximizes cache hits and warms up the branch predictor (since we run the same small piece of code over many packets instead of the whole graph for each packet).
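To make that concrete, here is a minimal sketch of one such node (illustrative only, not VPP's actual API; every name in it is invented):

    /* One graph node: run one small loop body over a whole vector of
     * packet indices, so the node's code stays hot in the I-cache and
     * its branches stay warm in the predictor. */
    #include <stddef.h>
    #include <stdint.h>

    extern uint8_t *packet_data(uint32_t pkt_index); /* stand-in lookup into the packet pool */

    /* Example node: decrement the IPv4 TTL of every packet in the
     * batch (checksum update omitted for brevity). */
    void ip4_ttl_node(const uint32_t *pkt_index, size_t n,
                      uint32_t *out, size_t *out_n)
    {
        for (size_t i = 0; i < n; i++) {
            /* Prefetch a few packets ahead so their headers are in
             * cache by the time the loop reaches them. */
            if (i + 4 < n)
                __builtin_prefetch(packet_data(pkt_index[i + 4]));

            uint8_t *ip = packet_data(pkt_index[i]) + 14; /* skip Ethernet header */
            ip[8]--;                  /* TTL is byte 8 of the IPv4 header */
            out[i] = pkt_index[i];    /* whole batch feeds the next node  */
        }
        *out_n = n;
    }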
I can certainly imagine some SIMD concepts in that. Particularly stream compaction (or, in the AVX512 case, the VPCOMPRESSD and VPEXPANDD instructions).
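A minimal sketch of what that could look like (assumes AVX-512F; the keep() predicate is hypothetical, e.g. "lane i is an IPv4 packet"):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Compact a batch of packet indices: write only the indices whose
     * mask bit is set, contiguously, via VPCOMPRESSD. Returns how many
     * survived. Tail handling for n % 16 omitted. */
    size_t compact_indices(const uint32_t *idx, size_t n,
                           __mmask16 (*keep)(__m512i), uint32_t *out)
    {
        size_t w = 0;
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m512i v = _mm512_loadu_si512(idx + i);
            __mmask16 k = keep(v);
            _mm512_mask_compressstoreu_epi32(out + w, k, v); /* VPCOMPRESSD */
            w += (size_t)_mm_popcnt_u32(k);
        }
        return w;
    }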
EDIT: I guess from a SIMD perspective, I'd have expected an interleaved set of packets, à la struct-of-arrays rather than array-of-structs. But maybe that doesn't make sense for packet formats.
The NIC gives you an array (ring buffer) of pointers to structs (packets). Interleaving them into SOA format would probably cost more than any speedup from SIMD.
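To spell out the two layouts being discussed (hypothetical types, not any real driver's descriptor format):

    #include <stdint.h>

    /* Array-of-structs: what the NIC hands you, one pointer per
     * received packet in the rx ring. */
    struct packet  { uint8_t data[2048]; uint16_t len; };
    struct rx_ring { struct packet *slot[512]; };

    /* Struct-of-arrays: each header field of every packet copied into
     * its own contiguous array. Building this means touching every
     * packet up front, which is the cost referred to above. */
    struct soa_batch {
        uint32_t dst_addr[256]; /* IPv4 destination of packet i */
        uint8_t  ttl[256];      /* TTL of packet i */
    };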
Yeah, but it's difficult to write a SIMD / AVX512 routine if things aren't in SOA format.
I can see how the approach described is "vector-like", even if the vector is this... imaginary unit that's parallelizing over the branch predictor instead of explicit SIMD code.
This "vector" organization probably has 99.999%+ branch prediction or something, effectively parallelizing the concept. But not in the SIMD-way. So still useful, but not what I thought originally based on the title.
A ring buffer of pointers to structs is friendly to gather instructions.
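For example, with AVX-512 you can load eight packet pointers and fetch one 32-bit header field from each packet in a single VPGATHERQD. A sketch (the byte offset 30 is just an example: the IPv4 destination address in a plain Ethernet frame, 14 + 16):

    #include <immintrin.h>
    #include <stdint.h>

    /* Gather the IPv4 destination address of 8 packets given their
     * pointers: absolute addresses in the index vector, base address
     * NULL, scale 1. */
    __m256i gather_dst_addrs(uint8_t *const pkts[8])
    {
        __m512i ptrs  = _mm512_loadu_si512(pkts);                      /* 8 x 64-bit pointers */
        __m512i addrs = _mm512_add_epi64(ptrs, _mm512_set1_epi64(30)); /* point at the field  */
        return _mm512_i64gather_epi32(addrs, NULL, 1);                 /* VPGATHERQD          */
    }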
That said, the documentation shows a graph of operations applied to each packet. I'd expect that to lead to a lot of "divergence", and therefore to be non-SIMD-friendly.
(Also, x86-64 CPUs with good gather instructions are rare, and sibling comments show that this is aimed at lower-end CPUs. That makes SIMD even less relevant.)
Most packets follow the same nodes in the graph. You have some divergence (e.g. ARP packets vs IP packets to forward), but the bulk of the traffic does not. So typically the initial batch of packets might be split in two: a small "control plane traffic" batch (e.g. ARP) and a big "dataplane traffic" batch (IP packets to forward). You won't do much SIMD on the small control plane batch, which is branchy anyway, but you do on the big dataplane batch, which is the bulk of the traffic.
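Something like this, as a sketch (invented names, not VPP code):

    #include <stddef.h>
    #include <stdint.h>

    extern uint16_t ethertype(uint32_t pkt_index); /* stand-in header read */
    #define ETHERTYPE_IPV4 0x0800

    /* Peel the rare control plane packets out of the input vector,
     * leaving a large contiguous dataplane batch for the vectorized
     * forwarding path. */
    void split_batch(const uint32_t *in, size_t n,
                     uint32_t *data_plane, size_t *n_data,
                     uint32_t *ctrl_plane, size_t *n_ctrl)
    {
        size_t d = 0, c = 0;
        for (size_t i = 0; i < n; i++) {
            /* Highly predictable branch: almost every packet is IPv4. */
            if (ethertype(in[i]) == ETHERTYPE_IPV4)
                data_plane[d++] = in[i];
            else
                ctrl_plane[c++] = in[i];
        }
        *n_data = d;  /* big batch: SIMD-friendly forwarding   */
        *n_ctrl = c;  /* small batch: scalar, branchy handling */
    }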
And VPP is targeting high-end systems and uses plenty of AVX512 (we demonstrated 1 Tbps of IPsec traffic on Intel Ice Lake, for example). It's just very scalable to both small and big systems.
You can read more about it here: https://s3-docs.fd.io/vpp/24.02/aboutvpp/scalar-vs-vector-pa...