I almost needed XOR linked lists somewhere once, but I ended up doing the same thing with tagged pointers. I wish I remembered the occasion.
But imagine this situation: you're writing some high-perf code on x86 and building a data structure (with next/prev links somewhere), and your node size is 136 bytes for arbitrary reason X. You're running a few percent behind your worst competitor in performance figures. Now if you shave 8 bytes, each node fits exactly in two 64-byte cache lines instead of spilling into a third, and you'll probably squeeze out a little perf just from the cache-friendly alignment. Should you do it? Hell yeah.
Okay, I know many people don't write code with tight performance budgets, but some of us do.
Aside from that though, I think if you're clever enough to come up with this and implement it well, I hope you're also clever enough to understand whether or not it's the RightWay(TM).
edit: In fact, when do you ever really need to store particularly large objects in something like a linked-list? It seems like you can always get some relatively small reference value to the data (eg, a pointer) and store that in your list instead of storing your large objects themselves.
Also, I think on a 64-bit machine with common DDR SDRAM, the memory bus is 64 bits wide, so each transfer moves at least 64 bits no matter the size of the integer stored there. Not sure about this, however.
Right, of course...
> I really don't know all that much about architecture (yet), so that's totally possible, and I might be wrong about the 66% thing I guess.
edit: Actually, about the whole more objects thing, there might be some strange use case where you only have a few objects but are storing multiple references to them in a list. But yeah, in general, the benefit goes down with larger objects.
One fix for this issue would be to extract the node where you split and "freeze" it. i.e. If you split at element 10, that element gets shoved into a new node which holds only one element and pointers are adjusted as necessary. The node it's extracted from may also be split into two pieces if necessary. From this point on, the single-element node is never merged into another node. It always exists independently.
Merging in this world works basically like it does in a normal linked list. You rearrange pointers but don't combine nodes. Insertion works like it does for a typical unrolled linked list, but with the frozen nodes special cased.
The tricky bit is if you're saving a lot of pointers that aren't part of a split operation. Then you might need to explicitly expose the freeze operation. And if you use it on every node, you're back to a standard linked list (with a bigger constant factor). You could also expose unfreeze, but that could get buggy really quickly. You could do a ref-counted freeze/unfreeze, but you should probably quit before you get to that point.
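A rough C sketch of the freeze-on-split idea described above (the struct layout, the `CAP` capacity, and all function names are mine, just to make the shape concrete):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define CAP 8  /* elements per unrolled node; arbitrary for the sketch */

struct unode {
    struct unode *next, *prev;
    int           count;     /* elements in use; frozen nodes stay at 1 */
    bool          frozen;    /* once set, this node is never merged */
    int           elems[CAP];
};

/* Split at index i: the element at i is shoved into its own
 * single-element frozen node, spliced between the two halves.
 * Pointers into the frozen node stay valid forever after. */
struct unode *freeze_split(struct unode *n, int i) {
    struct unode *f    = calloc(1, sizeof *f);
    struct unode *rest = calloc(1, sizeof *rest);

    f->frozen = true;
    f->count  = 1;
    f->elems[0] = n->elems[i];

    rest->count = n->count - i - 1;
    memcpy(rest->elems, n->elems + i + 1, rest->count * sizeof(int));
    n->count = i;

    /* splice: n -> f -> rest */
    rest->next = n->next;
    if (rest->next) rest->next->prev = rest;
    n->next = f;    f->prev = n;
    f->next = rest; rest->prev = f;
    return f;
}
```

Merging then only ever rearranges the `next`/`prev` pointers; frozen nodes are skipped when combining element arrays.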
Essentially, it says that it isn't as good an idea as you might imagine, because lists aren't often created all at once, and they get chopped up and inserted into often enough that the advantages of the scheme disappear.
Unrolled linked lists don't exhibit this behavior (though they have other drawbacks). Their behavior is closer to that of B-trees, where size-bounded multi-element nodes can be split and joined cheaply, which allows for cheap element insertion.
How about you?
If I get something that might possibly be useful for someone else, sure, I'll put it on github or something. (If I get a good design, the real network will be opensource too.)
edit: reference: http://c2.com/cgi/wiki?SuperCompiler
Of course there are some cases in which the binary layout matters (for example, serialization, or buffer handling in network code), but those can be regarded as glorified (un)parsers instead of fixed data layout. Would also make the langsec people happy.
That's some pretty advanced magic.
Even the compiler has very limited insight into what your code is actually doing without simulating it. The prefetcher might be able to look a bit ahead in the execution stream and do branch prediction, but there's absolutely no way that extends to knowing anything about your data structures.
Unless I have just been transported by a time warp I really think this is fiction.
If you're thinking of cache prefetching, that actually has a really hard time dealing with stuff like linked lists, because it has absolutely no idea at all about the data structure it is looking at. The 'next' and 'previous' pointers in the linked list might actually simply be values without any significance at all. And if they are dereferenced as pointers, that memory could be just about anywhere within the valid address range.
For arrays on the other hand such pre-fetching can be useful.
Nope, the Intel guys and gals do this kinda magic day in, day out. Or at least the chips they manufacture do. I'm not familiar with the internals of the prefetcher of any CPU at this level, but let me wave hands here. This is what the prefetcher could do:
All it takes is for the prefetcher to get a cache line when requested, and then observe what is inside: look at pointer-sized, aligned values that look like pointers and are pretty close to the original cache line's (virtual) address. Now if these values happen to be sane virtual addresses in the current process, the prefetcher might as well fetch them one cache level closer to the CPU. If it hits, it might yield a big performance boost in real-world apps. If it misses, it's just a little wasted electricity.
All modern CPUs do dirty little tricks like this if it helps them outshine their competitors.
Btw, you can add prefetch instructions to your code manually if you do linked-list traversals or similar. In GCC you can use the __builtin_prefetch() compiler intrinsic.
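A minimal sketch of what that looks like in a list traversal (the node layout is illustrative; __builtin_prefetch is a GCC/Clang intrinsic and would need to be stubbed out on other compilers):

```c
#include <stddef.h>

struct node { struct node *next; long payload; };

/* Sum a singly linked list, hinting the prefetcher to start pulling
 * in the next node's cache line while we process the current one. */
long sum_list(struct node *n) {
    long total = 0;
    while (n) {
        /* prefetch the memory the next pointer refers to; a NULL
         * argument is harmless, prefetches never fault */
        __builtin_prefetch(n->next);
        total += n->payload;
        n = n->next;
    }
    return total;
}
```

Whether this actually helps depends on how much work you do per node; if the loop body is tiny, the prefetch may not have time to complete anyway.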
Caches have limited size. If it misses, it also evicts something else from the cache. If that is what is actually needed, this costs performance.
"and are pretty close to the original cache line's (virtual) address"
Why does it have to be 'pretty close'?
"Now if these values happen to be sane virtual addresses in the current process"
That sanity check would involve visiting the paging tables, so it would require at least two indirections (http://lwn.net/Articles/253361/). If a cache line is 64 bytes, you would have up to 16 positions where a 32-bit pointer could be present. So, up to 16 times two memory lookups would be needed. I think all of them would go through the same cache, but even assuming that the CPU has ways of signalling that it should not recurse, I do not think it is practical to do what you describe (disclaimer: I am not an expert on CPU design).
What is possible is to guess at where data is to be found. That allows CPUs to read and speculatively execute instructions from the physical memory that they think backs the virtual address of the PC while they, in parallel, do the lookup to verify that. See http://dl.acm.org/citation.cfm?id=2000101&dl=ACM&col.... I do not know whether this has made it into actual CPUs, though.
I know a bit about prefetch and its variants, but haven't directly used it in 10 years (I was on a software project that had one of the first Katmais, back in 1999, doing some bilinear/bicubic texture filtering for a media/drawing application, and had to code extensively in MMX assembly back then).
edit: try to look for performance tuning guides for your CPU architecture, there might be some details about systems like this.
Nope. Some might exist but this stuff is generally considered trade secrets. A patent search might yield something. You can find some material about speculative execution, branch prediction, prefetching, etc but the real beef is hidden somewhere in Intel's (and their competitors') vaults.
It's all supposed to be transparent to the programmer so there's no need to write and release detailed public documentation about it.
As I said earlier, I have no idea how they work (or whether they even exist), but I'll give a handwaving example of how they _could_ work.
So better not to speculate: what you think could be done when you are not familiar with the details typically turns out to be a lot harder when you are familiar with the details and are tasked with the implementation. The devil is in the details, and there is a very appreciable gap between theory and practice.
Further reading: http://www.futurechips.org/chip-design-for-all/prefetching.h...
Note the caution against using lists and trees and the advice to stick to arrays, which are typically accessed sequentially.
> Btw. you can add prefetch instructions in your code manually if you do linked list traversals or similar. In GCC you can use __builtin_prefetch() compiler intrinsic.
And that is exactly the point: if you don't supply the knowledge, neither the processor nor the compiler can do this for you. If the processor or the compiler could do this, those __builtin_prefetch calls would not be required.
That seems unlikely to me. Which processors do this?
There has been some research on this subject with respect to compiler optimizations but as far as I know this is not at all done at the hardware level.
The prefetcher detects patterns. If you start accessing memory at N byte intervals, it will start prefetching N, 2N, 3N etc. ahead. So if you allocate your linked list elements from contiguous (virtual) memory, it will be able to prefetch elements. But once you start jumbling the list up, that will fall apart.
Also, if you make an incorrect prefetch decision, it's not a big deal. Just one wasted cache line.
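One concrete way to get the "contiguous (virtual) memory" layout mentioned above is to carve list nodes out of a single array, so a forward traversal touches memory at a constant stride the hardware prefetcher can detect. A sketch (pool size and names are illustrative; there's no free/grow logic):

```c
#include <stddef.h>

struct node { struct node *next; int value; };

#define POOL 1024
static struct node pool[POOL];  /* one contiguous slab of nodes */
static size_t used;

/* Hand out nodes in address order; consecutive allocations are
 * sizeof(struct node) apart, i.e. a fixed stride. */
struct node *alloc_node(int value) {
    if (used == POOL) return NULL;
    struct node *n = &pool[used++];
    n->value = value;
    n->next  = NULL;
    return n;
}
```

As the parent comment notes, this only helps while the list order matches allocation order; heavy insertion and deletion will jumble it up again.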
In no circumstance will the CPU magically understand the semantics of some data structure.
Hence, they have no capability to understand a structure like a linked list and will not cache it intelligently. All you can rely on is that they will cache arrays and traverse them in a way that efficiently uses the cache.
Suppose you need to multiplex between inputs A,B,C to make output D (e.g. you have a 3d block, a display engine, and a CPU all trying to access DRAM). The idea is to xor all valid inputs together.
When there is only one input this results in output=input.
If all 3 are active then the output is A^B^C.
The trick is that on the next cycle only the granted input (say A) is removed, so the output is B^C.
Receivers are required to XOR two sequential values to extract the actual data. In this case (A^B^C)^(B^C)=A.
See the "NoX Router" for details and explanation of how this compares to alternative arbitration approaches:
http://www.cl.cam.ac.uk/teaching/exams/pastpapers/y2010p3q6.... With classic mildly amusing exam humour introduction!
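The arbitration trick above can be sketched in a few lines of C (the function and variable names are mine, not from the NoX paper):

```c
#include <stdint.h>

/* The mux output is simply the XOR of every currently valid input.
 * After an input is granted, it drops out of the valid set, and a
 * receiver XORs two consecutive outputs to recover the granted word. */
uint32_t mux_output(const uint32_t *inputs, const int *valid, int n) {
    uint32_t out = 0;
    for (int i = 0; i < n; i++)
        if (valid[i])
            out ^= inputs[i];
    return out;
}
```

With inputs A, B, C all valid, cycle 0 yields A^B^C; once A is granted, cycle 1 yields B^C, and (A^B^C)^(B^C) = A, just as described above.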
class BasicLinkedList implements LinkedList
class XORLinkedList implements LinkedList
typedef BasicLinkedList LinkedList
typedef XORLinkedList LinkedList
LinkedList<Element> list = new XORLinkedList(Collections.fromArray(arrayList));
Element ele = list.get(4); //Ok, when we construct the list we can store the first and second pointers and traverse to the fourth
Element next = ele.next(); //???
You need an Iterator type that has the address of the previous and current nodes, in addition to your existing node / Element type.
LinkedList<Element> list = new XORLinkedList(Collections.fromArray(arrayList));
Iterator<Element> it = list.get(4); //now you can access it.element for data from index 4
it.next(); // now access it.element for data from index 5
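A minimal C sketch of the same iterator idea: the (prev, cur) pair carries the extra state, and each step XORs the stored link with prev to recover next. Casting pointers through uintptr_t is implementation-defined but works on mainstream targets; all names here are mine:

```c
#include <stdint.h>
#include <stddef.h>

/* Each node stores prev ^ next in a single field. */
struct xnode { uintptr_t link; int value; };

/* Knowing where we came from, recover where we're going. */
static struct xnode *xstep(struct xnode *prev, struct xnode *cur) {
    return (struct xnode *)(cur->link ^ (uintptr_t)prev);
}

/* The two-pointer iterator the parent comment describes. */
struct xiter { struct xnode *prev, *cur; };

void xiter_next(struct xiter *it) {
    struct xnode *n = xstep(it->prev, it->cur);
    it->prev = it->cur;
    it->cur  = n;
}

/* Link three nodes a <-> b <-> c, NULL-terminated at both ends. */
void xlink3(struct xnode *a, struct xnode *b, struct xnode *c) {
    a->link = (uintptr_t)NULL ^ (uintptr_t)b;
    b->link = (uintptr_t)a    ^ (uintptr_t)c;
    c->link = (uintptr_t)b    ^ (uintptr_t)NULL;
}
```

Note that an xiter is exactly as big as the next/prev pair a conventional node would hold; the saving comes from having one iterator per traversal instead of two pointers per node.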
But isn't it fascinating? Some information moved from the link node into the state of the program processing it. This halves the link redundancy of a doubly linked list. Suddenly a doubly linked list has the footprint of a singly linked list. How did that happen? :-)
It taught me. Tricks like this make me a better programmer. Even if I won't use the principle involved in this exact way.
This XOR linked list hack makes sense for the smallest chunk sizes. On a 32-bit system, two pointers take 8 bytes, so the minimum size of an allocated memory block is 8 bytes without this hack and 4 bytes with it. On a 64-bit system, this hack makes it possible to allocate an 8-byte block from the heap without rounding up to 16 bytes.
For other practical purposes, I think that trading speed and code maintainability for memory size isn't a good idea.
Seems to me like (as other commenters have pointed out) this wouldn't be a great idea for something that is stored as a straightforward linked list.
Where it seems really useful to me is as a cheap way to store an alternate ordering on top of an existing data structure, without having to keep a separate list around. I've done this before, using next pointers to preserve an alternate ordering, but this is a nice way to allow it to be bidirectional. Neat.
Point is, it's strictly non-portable according to the standard, but when done carefully is unlikely to break on a general-purpose architecture.
This is from C99, I don't have a copy of C11:
5) An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.
6) Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.
And footnote : The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.
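The quoted wording is what makes the XOR trick merely implementation-defined rather than flatly illegal: on implementations that provide uintptr_t, C99 additionally guarantees (7.18.1.4) that a pointer converted to uintptr_t and back compares equal to the original. A tiny demonstration of that round trip:

```c
#include <stdint.h>

/* Returns 1 if p survives a pointer -> uintptr_t -> pointer round trip.
 * The first conversion is allowed by 6.3.2.3p6; the second is
 * implementation-defined in general, but uintptr_t guarantees q == p. */
int roundtrips(int *p) {
    uintptr_t u = (uintptr_t)p;
    int *q = (int *)u;
    return q == p;
}
```

The XOR list relies on exactly this: `a ^ b` is meaningless as a pointer, but XORing again restores a bit pattern that converts back to the original pointer on such implementations.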
A xor A = 0.
A xor 0 = A.
That should work like a regular circularly linked list or am I missing something?
Doh, now I understand. If this was allowed it wouldn't be possible to have a one-node non-circular list.
IMHO one of the (few) primary benefits of linked lists is O(1) insert/delete at a node you already hold a pointer to, without having to traverse the entire list. Something that breaks with XOR lists.
Secondly, if you have so many nodes that the memory saving of one pointer is significant, it's likely traversing that list will take forever and force you to use a different algorithm anyway.
... an improvement on basic linked lists would be 16B/32B/128B/512B etc. alignment of the nodes' memory addresses and bit-packing the left/right offsets in units of the alignment unit. Hell, you could probably beat XOR lists for space on a range of workloads.
Point being, you don't have to iterate from the start of the list.
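A sketch of the offset-packing idea (the arena scheme, alignment unit, and all names are mine, purely to make the layout concrete): if every node lives inside one arena at a 16-byte-aligned address, prev/next can be stored as 32-bit offsets in 16-byte units instead of two full 64-bit pointers.

```c
#include <stdint.h>
#include <stddef.h>

#define ALIGN 16  /* illustrative alignment unit */

/* One arena; every node sits at a multiple of ALIGN inside it. */
static _Alignas(ALIGN) unsigned char arena[1 << 20];

/* 8 bytes of links instead of 16 on a 64-bit machine. */
struct onode { uint32_t prev_off, next_off; int value; };

static struct onode *at(uint32_t off) {
    return (struct onode *)(arena + (size_t)off * ALIGN);
}

static uint32_t off_of(const struct onode *n) {
    return (uint32_t)(((const unsigned char *)n - arena) / ALIGN);
}
```

With 32-bit offsets in 16-byte units this addresses up to 64 GiB of arena, and unlike the XOR trick you can still follow a link from any node you hold, without knowing where you came from.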
Unless you're storing external pointers to every node (in which case you'll actually break even with the traditional implementation in terms of size), you'll still have significant savings. Passing an extra pointer to a function is pretty trivial in terms of size.