> And oops, now your vectors are 1 million entries.
As soon as you traverse that linked list of 1 million entries, oops.
In fact, here, I put together a quick benchmark. It inserts in the middle of the list and then traverses it (since, after all, ordering is irrelevant if there's no traversal).
https://paste.ofcode.org/7Buz2Mua9xna55TC3uFaQp
Test system is a Ryzen 3700x. I tried to keep the compiler from optimizing any of the work away and believe I succeeded, but by all means, happy to take a review of the above. Would love to see you do a similar test in whatever runtime you want, especially one with that juicy super-amazing compacting GC that makes linked lists so fast.
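For anyone who doesn't want to click through, the shape of it is roughly this (a minimal sketch of the approach, not the exact code from the paste):

    // Minimal sketch of the benchmark approach (not the exact paste code):
    // insert one element in the middle, then traverse and sum.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <iterator>
    #include <list>
    #include <vector>

    template <typename Container>
    void bench(const char* name, Container& c) {
        auto start = std::chrono::steady_clock::now();

        // Insert in the middle: O(N) element shuffling for the vector,
        // O(N) pointer chasing just to find the spot for the list.
        auto mid = c.begin();
        std::advance(mid, c.size() / 2);
        c.insert(mid, 42);

        // Traverse and sum so the compiler can't throw the work away.
        long long sum = 0;
        for (int v : c) sum += v;

        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s size %zu took %lldns (sum=%lld)\n",
                    name, c.size(), (long long)ns, sum);
    }

    int main() {
        const std::size_t N = 1000000;
        std::vector<int> vec(N, 1);
        std::list<int> lst(N, 1);
        bench("Vector", vec);
        bench("Linked-list", lst);
    }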
At 100 elements, vector wins (40ns vs. 160ns). At 1,000 elements, vector wins (~300ns vs. ~1,000ns). At 10,000 elements, vector wins (~2,700ns vs. ~12,000ns). At 100,000 elements, guess what? Vector still wins (29us vs. 130us).
You are either vastly underestimating how slow linked-list traversal is or vastly overestimating how expensive memmove is.
But you did say 1 million entries so let's give that a shot:
Vector size 1000000 took 286,123ns
(Vector at 1 million is only twice as slow as a linked list at 100,000, linked lists aren't off to a strong start here...)
Linked-list size 1000000 took 2,670,542ns
Oh my god it's a bloodbath. Vector wins by 10x. Still. At 1 million entries the O(N) insert structure is still faster than the O(1) insert one. And the gap is getting bigger!
And I'll point out that this is basically the best case for linked lists, where inserting in the middle happens exactly as often as traversal. If traversal is rare, obviously a lazy sort would crush these vector results, and if traversal is common, the linked list results get that much worse.
Now really these results shouldn't be that surprising if you dig into it. After all, the vector version of this is storing (and therefore traversing) 4MB of data. That's all 1 million ints really is; it's not very big. Now sure, you can store bigger things, but much bigger and you're probably storing a vector of pointers instead, which only bumps that to 8MB of data. The linked list version, on the other hand, is storing 2 pointers + an int for each node. That's 20MB of data total.

So inserting into the vector involves a read of 2MB of data, a write of 2MB of data, and a traversal of 4MB of data - 8MB total. Simply traversing the linked list is hitting 20MB of data - over double the memory bandwidth utilized. That's a big difference. In fact, on my system at 1 million entries that happened to be the difference between the vector version comfortably fitting in L3 with loads of room to spare (especially since the insert made part of it nice & hot) vs. the linked-list version not coming close to fitting (specs on this CPU claim 32MB of L3, but it's really 16MB x 2 - hitting the far L3 isn't much better than hitting main memory). So more data to hit, and it's pointer chasing on top? Modern CPUs just really hate that. Like a lot.
Since I don't intend to cheat by happening to have sufficient L3 for the use case you laid out (combined with the benchmark naturally keeping everything hot), I also tried adjusting it. I changed the containers to hold intptr_t's instead and hit it with 10 million entries. Neither one comes close to fitting in L3 now. It's still 80MB of data for the vector vs. 240MB for the linked list, but maybe, not having that copy to do and with both of them thrashing L3, the linked list will finally show something to redeem it.
Vector size 10,000,000 took 3,680,111ns
Linked-list size 10,000,000 took 26,460,333ns
That's a big fucking oof right there.
> I appreciate the discussion. It's a bit of a rathole for something that you shouldn't be optimizing if you can completely avoid it by using the right data structure for your needs.
And a linked list is rarely the right data structure, so yes you should avoid trying to optimize for that without measuring.
Heh. I guess I have an interview question to update.
One of my quick warm-up questions is asking about what data structure to choose if inserting an element after an element you already have a reference to is a common operation that you want to optimize for.
Of course, I don't consider there to be a right or wrong answer, but just make sure that people can intelligently discuss it. Some people start off with vec or the like, and then we discuss how fast that would be compared to a linked list. I generally stop after we get to the O(n) vs. O(1) comparison.
But maybe if we have some extra time I should also talk about the tradeoff with actually traversing the list.
After all, I suppose this is why things like gap buffers are popular for text editors, rather than linked lists.
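(For anyone unfamiliar: a gap buffer keeps the free space at the cursor, so an insert at the cursor is O(1) and the data stays contiguous for scanning. A rough, illustrative sketch, details made up:)

    // Rough gap buffer sketch (illustrative only): the "gap" of unused
    // capacity sits at the cursor, so inserting there is just a store.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct GapBuffer {
        std::vector<char> buf;
        std::size_t gap_start = 0, gap_end = 0;  // [gap_start, gap_end) is the hole

        explicit GapBuffer(std::size_t cap) : buf(cap), gap_end(cap) {}

        // Slide the gap to 'pos': a copy, but only of the distance the cursor
        // moved, not of the whole document.
        void move_gap_to(std::size_t pos) {
            if (pos < gap_start) {
                std::size_t n = gap_start - pos;
                std::copy_backward(buf.begin() + pos, buf.begin() + gap_start,
                                   buf.begin() + gap_end);
                gap_start -= n; gap_end -= n;
            } else if (pos > gap_start) {
                std::size_t n = pos - gap_start;
                std::copy(buf.begin() + gap_end, buf.begin() + gap_end + n,
                          buf.begin() + gap_start);
                gap_start += n; gap_end += n;
            }
        }

        void insert(std::size_t pos, char c) {
            // (A real implementation grows the buffer when the gap runs out.)
            move_gap_to(pos);
            buf[gap_start++] = c;
        }
    };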
I think one important variable to tune would be the element size (since supposedly linked lists shine more with bigger data). Curious to see what the numbers are at 100b/1kb.
My guess is that the linked list impl remains almost the same while vec sees a linear-ish slowdown.
Oh, absolutely! Linked lists do have a quite useful strength in that they are agnostic to data size and, potentially more importantly, that pointers to the data are not invalidated when a linked list changes size, as they are in a vector. Both of those properties combined make linked lists a good structure for holding large data, used more as an allocator than as a list. Where it's not the O(1) insert-anywhere that matters so much as the O(data-never-moves) that does.
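(Concretely, the pointer-stability difference looks like this; hypothetical snippet, not from the benchmark:)

    // Pointer stability: a vector may relocate its elements when it grows,
    // a linked list never moves its nodes.
    #include <list>
    #include <vector>

    int main() {
        std::vector<int> vec{1, 2, 3};
        std::list<int> lst{1, 2, 3};

        int* pv = &vec[0];       // pointer into the vector's storage
        int* pl = &lst.front();  // pointer into a list node

        for (int i = 0; i < 1000; ++i) {
            vec.push_back(i);    // may reallocate: pv is now dangling
            lst.push_back(i);    // nodes never move: pl is still valid
        }
        // Dereferencing pv here is undefined behaviour; *pl is still 1.
    }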
But since you asked, I adjusted the benchmark to instead contain a 100-byte & 1kb payload (just an array of ints). In the loop I just use the first number of the array in the summation - I assumed an iteration isn't accessing the entire payload but rather, say, an ID or whatever.
I also added a vector that holds pointers to the data instead of the data itself (always a fresh allocation per element, so none of the pointers are the same).
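Roughly, the adjusted containers look like this (a sketch of the setup, not the full benchmark):

    // Sketch of the payload variant: a fixed-size blob, with only the first
    // int read during traversal (standing in for an ID field).
    #include <cstddef>
    #include <list>
    #include <memory>
    #include <vector>

    template <std::size_t Bytes>
    struct Payload {
        int data[Bytes / sizeof(int)];
    };

    // Works for std::vector<Payload<N>> and std::list<Payload<N>> alike.
    template <typename Container>
    long long traverse(const Container& c) {
        long long sum = 0;
        for (const auto& p : c) sum += p.data[0];
        return sum;
    }

    // Vector-of-pointers: each payload is a separate heap allocation.
    template <std::size_t Bytes>
    long long traverse_ptrs(const std::vector<std::unique_ptr<Payload<Bytes>>>& c) {
        long long sum = 0;
        for (const auto& p : c) sum += p->data[0];  // one extra indirection per element
        return sum;
    }

    int main() {
        std::vector<Payload<100>> vec(10000);
        std::list<Payload<100>> lst(10000);
        std::vector<std::unique_ptr<Payload<100>>> ptrs;
        for (int i = 0; i < 10000; ++i)
            ptrs.push_back(std::make_unique<Payload<100>>());

        // (Timing wrappers omitted; same insert-then-traverse loop as before.)
        traverse(vec);
        traverse(lst);
        traverse_ptrs(ptrs);
    }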
For 100 bytes:
Vector; payload sizeof=100 of count 100 took 142ns
Vector-of-pointers; payload sizeof=8 of count 100 took 150ns
Linked-list; payload sizeof=100 of count 100 took 180ns
Vector; payload sizeof=100 of count 10,000 took 18,796ns
Vector-of-pointers; payload sizeof=8 of count 10,000 took 8,670ns
Linked-list; payload sizeof=100 of count 10,000 took 32,853ns
Gap shrunk, but possibly not by as much as you would have expected.
For 1kb:
Vector; payload sizeof=1000 of count 100 took 1137ns
Vector-of-pointers; payload sizeof=8 of count 100 took 549ns
Linked-list; payload sizeof=1000 of count 100 took 387ns
Vector; payload sizeof=1000 of count 10,000 took 120,709ns
Vector-of-pointers; payload sizeof=8 of count 10,000 took 10,681ns
Linked-list; payload sizeof=1000 of count 10,000 took 105,927ns
Linked-list finally manages to pass up vector, but then vector-of-pointers runs away from the linked list in the 1kb test.
Think about it like this: Traversal speeds of linked lists are bottlenecked by memory latency. Traversal speeds of vectors are bottlenecked by memory bandwidth. Only one of those metrics has been improving over the years.
That's true. And there is seriously optimized software out there using linked lists (e.g. in the Linux kernel).
However, with bigger data in my experience a very good solution is often to have a vector of pointers to the actual data. This loses some locality of course, but during traversal the fact that you don't have dependencies between iterations can still be a win.
That's interesting data. I guess I nerd-sniped you.
One use case where I think linked lists would be far superior would be merging two sorted linked lists. That can be done completely in place with a linked list, consuming no intermediate memory. With vectors, you end up with an intermediate copy, either of one of the lists or of both. And if you take the result of merging two lists, then merge it with a third, then take that and merge it with a fourth, etc., you can't really do that without repeated copying with the vector-of-lists representation that might come to mind.
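(With std::list this is literally what list::merge does - it splices nodes across, with no allocation and no element copies. For a hand-rolled singly-linked list it's just pointer rewiring; a rough sketch:)

    // In-place merge of two sorted singly-linked lists: no allocation,
    // no element copies, only next pointers are rewritten. (Sketch.)
    struct Node {
        int value;
        Node* next;
    };

    Node* merge_sorted(Node* a, Node* b) {
        Node dummy{0, nullptr};
        Node* tail = &dummy;
        while (a && b) {
            Node*& smaller = (a->value <= b->value) ? a : b;
            tail->next = smaller;      // splice the smaller head onto the result
            tail = smaller;
            smaller = smaller->next;   // advance whichever list we took from
        }
        tail->next = a ? a : b;        // append the leftovers
        return dummy.next;
    }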
> One use case where I think linked lists would be far superior would be merging two sorted linked lists.
Benchmark it & prove it. I expect you're still vastly underestimating how slow it is to traverse a linked list.
To visit a node in a linked list you must first load the node before it. Every iteration is a dependent load. You get no pipelining here, so every node lookup is ~100ns. Meanwhile for vectors each index is independent, so they are nice & pipelined. Yeah you're copying, but you've got vastly more memory bandwidth than memory latency.
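Concretely, the two traversal loops look something like this (sketch):

    // Why list traversal is latency-bound: the address of the next load comes
    // out of the current load, so the cache misses form a serial chain.
    #include <cstddef>
    #include <vector>

    struct Node { int value; Node* next; };

    long long sum_list(const Node* n) {
        long long sum = 0;
        while (n) {
            sum += n->value;
            n = n->next;   // can't start fetching node i+1 until node i arrives
        }
        return sum;
    }

    long long sum_vec(const std::vector<int>& v) {
        long long sum = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            sum += v[i];   // addresses known up front: loads overlap, prefetcher streams
        return sum;
    }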
Also to zipper merge 2 linked lists means updating 2 * N pointers. That's essentially a copy of the entire list right there, depending on the size of the contained object.
There are situations where you can have a linked list perform faster than a vector. It's good to consider linked lists if you have the time to think about your solution and profile multiple options... but the point is that memory latency is a huge hurdle to overcome.
For context: A cold memory fetch is ~200 cycles. That's around 10 incorrectly-predicted branches. That's around 100 correctly-predicted branches. You need to be skipping a lot of work to justify injecting 100s of cycles of lag inside a loop of some sort.
(And, of course, those situations would need to be the common, relevant case. Don't follow a 10x performance speed-up in a one-off situation at a time no-one cares about to justify a 5x slowdown in a critical, hot loop.)
If you have big objects, then this might also start to become a concern. However, in those cases, a vector of pointers (or structs of pointers + metadata) would appear as a new challenger. In that case, you would only take the latency hit if you actually need to go to the object (and those objects could be held in contiguous storage, etc. etc. etc.).
Chances are that, unless you have data for your specific case showing otherwise, vector will probably win. It should be the default unless you have a reason not to, and evidence to back it up.