Never say never. There's a time and place for dirty pointer tricks like the XOR linked list or tagged pointers (the low 2-3 bits used for "tags"). Emphasis on the words "dirty tricks", so you know it's the rabbit you pull out of a hat, not something you do every day.
I almost needed XOR linked lists somewhere once, but I ended up doing the same thing with tagged pointers. I wish I remembered the occasion.
But imagine this situation: you're writing some high-perf code on x86, with a data structure (next/prev links somewhere) whose node size is 136 bytes for some arbitrary reason X. You're running a few percent behind your worst competitor in performance figures. Now, if you shave off 8 bytes your nodes fit cache lines exactly, and you'll probably squeeze out a little perf just by having a cache-friendly alignment. Should you do it? Hell yeah.
Okay, I know many people don't write code with tight performance budgets, but some of us do.
What I've got isn't general-purpose at all. I'm working on distributed learning algorithms, and right now I'm more concerned with having lots of nodes in the learning algorithms than I am in simulating real-world p2p...so I'm not modeling things like network delays, nodes dropping out and returning with different IPs, etc.
How about you?
If I get something that might possibly be useful for someone else, sure, I'll put it on github or something. (If I get a good design, the real network will be opensource too.)
If you're on a 64-bit machine storing 32-bit ints, this will decrease the node size from 32 + 64 + 64 = 160 bits to 32 + 64 = 96 bits, so you should be able to store about 66% more nodes.
edit: In fact, when do you ever really need to store particularly large objects in something like a linked-list? It seems like you can always get some relatively small reference value to the data (eg, a pointer) and store that in your list instead of storing your large objects themselves.
If you're storing references to larger objects, your larger objects are what's taking up the space, which I think was the point (66% more pointers are useless if you can't also store 66% more of the objects they point to).
Also, I think on a 64-bit machine with common DDR SDRAM, a memory location is 64 bits wide, no matter the size of the integer contained there. Not sure about this, however.
> If you're storing references to larger objects, your larger objects are what's taking up the space, which I think was the point (66% more pointers are useless if you can't also store 66% more of the objects they point to).
Right, of course...
> I really don't know all that much about architecture (yet), so that's totally possible, and I might be wrong about the 66% thing I guess.
edit: Actually, about the whole more objects thing, there might be some strange use case where you only have a few objects but are storing multiple references to them in a list. But yeah, in general, the benefit goes down with larger objects.
You could cut your memory usage by arbitrarily close to 1/2 by using an unrolled linked list instead. Plus then you get the benefits of having pointers look like pointers, and getting your nodes to fit neatly in cache lines.
I've found unrolled linked lists less useful than I'd hoped in the past. I tend to use linked lists when I want to do a lot of splicing and to be able to keep pointers to particular nodes. Needing to keep pointers to individual nodes stops unrolled linked lists from being useful, as you can't merge/split unrolled nodes as required (or at least, I couldn't come up with a clever way to do it).
This is similar to the problem with deletion in a standard linked list. Destructive actions can invalidate existing pointers.
One fix for this issue would be to extract the node where you split and "freeze" it. i.e. If you split at element 10, that element gets shoved into a new node which holds only one element and pointers are adjusted as necessary. The node it's extracted from may also be split into two pieces if necessary. From this point on, the single-element node is never merged into another node. It always exists independently.
Merging in this world works basically like it does in a normal linked list. You rearrange pointers but don't combine nodes. Insertion works like it does for a typical unrolled linked list, but with the frozen nodes special cased.
The tricky bit is if you're saving a lot of pointers that aren't part of a split operation. Then you might need to explicitly expose the freeze operation. And if you use it on every node, you're back to a standard linked list (with a bigger constant factor). You could also expose unfreeze, but that could get buggy really quickly. You could do a ref-counted freeze/unfreeze, but you should probably quit before you get to that point.
Essentially, it says that it isn't as good an idea as you might imagine, because lists aren't often created all at once, and they get chopped up and inserted into often enough that the advantages of the scheme disappear.
These seem like related but distinct ideas. From what I understand, CDR coding is basically an optimization that replaces the entire list with an array. This has the effect that inserting elements involves an ugly and expensive indirection hack.
Unrolled linked list don't exhibit this behavior (though they have other drawbacks). Their behavior is closer to that of B-trees, where size-bounded multi-element nodes can be split and joined cheaply, which allows for cheap element insertion.
Isn't that what we're all waiting for, the mythical "smart enough" compiler? Imagine constructing programs from abstract and composable high-level concepts, and the compiler does all the dirty hacks behind the screen, producing fully parallelized, secure, efficient machine code on the fly. Could it? One can dream :)
There have been some invisible barriers for compilers to start reorganising data and not just code. It would seem to me that that side has even more potential than semantics-preserving code optimization that takes the data layout as a given.
Agreed. It would already make a lot of difference if the compiler could make decisions between (for example) structure of arrays or array of structures representations. The programmer could always organize the data in a human-sensible way, and the compiler would transparently convert it to an efficient internal representation, with only the fields that are necessary for certain computations.
Of course there are some cases in which the binary layout matters (for example, serialization, or buffer handling in network code), but those can be regarded as glorified (un)parsers instead of fixed data layout. Would also make the langsec people happy.
Please profile tricks like this, as they may actually be significantly slower than their naive counterparts on modern hardware. The prefetcher knows what a linked list looks like, and it knows how to get it somewhere closer than main memory before the nodes are needed.
> The prefetcher knows what a linked list looks like
That's some pretty advanced magic.
Even the compiler has very limited insight into what your code is actually doing without simulating it. The prefetcher might be able to look a bit ahead in the execution stream and do branch prediction, but there's absolutely no way that extends to knowing things about your data structures.
Unless I have just been transported by a time warp I really think this is fiction.
If you're thinking of cache pre-fetching, that actually has a really hard time dealing with stuff like linked lists because it has absolutely no idea about the data structure it is looking at. The 'next' and 'previous' pointers in the linked list might actually simply be values without any significance at all. And if they are dereferenced as pointers, then that memory could be just about anywhere within the valid address range.
For arrays on the other hand such pre-fetching can be useful.
> Unless I have just been transported by a time warp I really think this is fiction.
Nope, the Intel guys and gals do this kinda magic day in, day out. Or at least the chips they manufacture do. I'm not familiar with the internals of the prefetcher of any CPU at this level, but let me wave hands here. This is what the prefetcher could do:
All it takes is for the prefetcher to get a cache line when requested, then observe what is inside, looking at pointer-sized and aligned values that look like pointers and are pretty close to the original cache line's (virtual) address. Now if these values happen to be sane virtual addresses in the current process, the prefetcher might as well fetch them one cache level closer to the CPU. If it hits, it might yield big performance boost in real world apps. If it misses, it's just a little wasted electricity.
All modern CPUs do dirty little tricks like this if it helps them outshine their competitors.
Btw. you can add prefetch instructions in your code manually if you do linked list traversals or similar. In GCC you can use __builtin_prefetch() compiler intrinsic.
"If it hits, it might yield big performance boost in real world apps. If it misses, it's just a little wasted electricity."
Caches have limited size. If it misses, it also evicts something else from the cache. If that is what is actually needed, this costs performance.
"and are pretty close to the original cache line's (virtual) address"
Why does it have to be 'pretty close'?
"Now if these values happen to be sane virtual addresses in the current process"
That sanity check would involve visiting the paging tables, so it would require at least two indirections (http://lwn.net/Articles/253361/). If a cache line is 16 bytes, you would have at least 4 positions where a 32-bit pointer could be present. So, at least four times two memory lookups would be needed. I think all of them would go through the same cache, but even assuming that the CPU has ways of signalling that it should not recurse, I do not think it is practical to do what you describe (disclaimer: I am not an expert on CPU design)
What is possible is to guess at where data is to be found. That allows CPUs to read and speculatively execute instructions from the physical memory that they think backs the virtual address of the PC while they, in parallel, do the lookup to verify that. See http://dl.acm.org/citation.cfm?id=2000101&dl=ACM&col.... I do not know whether this has made it into actual CPUs, though.
Any refs, docs about this behavior? First time hearing it. How does it work with MMU, protected memory, or external memory mapped as normal one?
I know a bit about prefetch and its variants, but haven't directly used them in 10 years (I was on a software project that had one of the first Katmais back in 1999, doing some bilinear/bicubic texture filtering for a media/drawing application, and had to code extensively in MMX assembly back then).
> Any refs, docs about this behavior? First time hearing it. How does it work with MMU, protected memory, or external memory mapped as normal one?
edit: try to look for performance tuning guides for your CPU architecture, there might be some details about systems like this.
Nope. Some might exist but this stuff is generally considered trade secrets. A patent search might yield something. You can find some material about speculative execution, branch prediction, prefetching, etc but the real beef is hidden somewhere in Intel's (and their competitors') vaults.
It's all supposed to be transparent to the programmer so there's no need to write and release detailed public documentation about it.
As I said earlier, I have no idea how they work (or do they even exist), but I'll give a handwaving example of how they _could_ work.
> I'm not familiar with the internals of the prefetcher of any CPU at this level
So better not to speculate: what you think could be done when you're not familiar with the details typically turns out to be a lot harder once you are familiar with the details and are tasked with the implementation. The devil is in the details, and there is a very appreciable gap between theory and practice.
Note the caution against using lists and trees, and the advice to stick to arrays, which are typically accessed sequentially.
> Btw. you can add prefetch instructions in your code manually if you do linked list traversals or similar. In GCC you can use __builtin_prefetch() compiler intrinsic.
And that is exactly the point, if you don't supply the knowledge neither the processor nor the compiler can do this for you. If the processor or the compiler could do this then those __builtin_prefetch calls would not be required.
A hardware prefetching mechanism is in the Core i7, that much I know for sure. It automatically identifies memory referencing patterns and attempts to fetch those blocks into cache before they are accessed; the detailed prefetching algorithms are not documented...
You can't tell whether a register contains a memory address right after it has been fetched, not until the moment it is used. If the instruction stream contains instructions referencing that register within the look-ahead window of the instruction pipeline, then maybe this might work. But if there is any conditional code in there, or anything else that causes the value to be used in a way that is not 'as a pointer' to access memory (for instance, pointer arithmetic prior to use), then such a pre-fetch would likely cause more harm than good by blowing good data out of the cache and replacing it with useless data (at a significant penalty).
There has been some research on this subject with respect to compiler optimizations but as far as I know this is not at all done at the hardware level.
Pretty much everything looks like a memory address. There is currently a memory address "hole" on x86_64 which doesn't, but any number between -140737488355328 and 140737488355327 can represent a virtual memory address. Prefetching every occurrence of any of those would be insane.
The prefetcher detects patterns. If you start accessing memory at N byte intervals, it will start prefetching N, 2N, 3N etc. ahead. So if you allocate your linked list elements from contiguous (virtual) memory, it will be able to prefetch elements. But once you start jumbling the list up, that will fall apart.
Yes, but imagine the bus traffic if you tried that: every committed instruction would require a TLB read to see if it "looked like" an address. The poor TLB is overcommitted as it is (Intel CPUs can do three memory operations in a clock). All prefetch implementations I'm aware of simply do sequential access (positive or negative) prediction, sometimes with stride detection, and that's it.
Note: the CPU knows your virtual address spaces and a whole lot more. So "looking like a pointer" is not just a 64 bit hexadecimal value, but the CPU has the possibility of checking whether it's a sane virtual address and other heuristics.
Also, if you make an incorrect prefetch decision, it's not a big deal. Just one wasted cache line.
As best I understand it, while instruction prefetch is a complex affair, data prefetch is limited to detecting incrementing sequential address access, unless directed through a PREFETCH op. I can't find a good reference, but I believe at best the CPU can only optimize by detecting stride.
In no circumstance will the CPU magically understand the semantics of some data structure.
Pre-fetchers rely on two metrics to determine what to cache: temporal locality, meaning they will cache things judged to be frequently accessed, and spatial locality, meaning that when you access something at memory address X, the cache will fill one cache line with the contents of consecutive memory locations, i.e. X, X+1, X+2, ..., X+s.
Hence, they have no capability to understand a structure like a linked list and will not cache it intelligently. All you can rely on is that they will cache arrays and traverse them in a way that efficiently uses the cache.
LinkedList<Element> list = new XORLinkedList(Collections.fromArray(arrayList));
Element ele = list.get(4); //Ok, when we construct the list we can store the first and second pointers and traverse to the fourth
Element next = ele.next(); //???
I don't understand your comment - are you saying "ele" doesn't have enough info to find the next node?
You need an Iterator type that has the address of the previous and current nodes, in addition to your existing node / Element type.
LinkedList<Element> list = new XORLinkedList(Collections.fromArray(arrayList));
Iterator<Element> it = list.get(4); //now you can access it.element for data from index 4
it.next(); // now access it.element for data from index 5
An XORLinkedList is most assuredly harder to debug than a BasicLinkedList. Don't do it unless you hafta. In fact, don't implement your own linked list at all, unless you have a specialized scenario that needs it (as proven by profiling).
I haven't used an XOR linked list since my Amstrad 6128. I think I probably independently discovered it, but I'm not sure; it's a pretty obvious thing to do on a memory-limited machine. Nice to know people still know about this stuff after the 8-bit micro era, but outside of embedded systems, why would anyone use this today?
Most people don't need more than 4G per process, so I wish stuff like the Linux x32 ABI became standard. All the extra registers without the huge pointers. Would be handy for squeezing more into a cheap VPS, not to mention the CPU cache.
But isn't it fascinating? Some information moved from the link node into the state of the program processing it. This halves the link redundancy of a doubly linked list. Suddenly a doubly linked list has the footprint of a singly linked list. How did that happen? :-)
It taught me. Tricks like this make me a better programmer. Even if I won't use the principle involved in this exact way.
Some implementations of heap memory allocators manage free memory chunks as doubly-linked lists. There are multiple lists, one per chunk size class, i.e. 4, 8, 16, 32, ..., 2^n bytes.
This XOR linked list hack makes sense for the smallest chunk size. On a 32-bit system, two pointers take 8 bytes, so the minimum size of an allocated memory block is 8 bytes without this hack and 4 bytes with it. On a 64-bit system, this hack makes it possible to allocate an 8-byte block from the heap without rounding up to 16 bytes.
For other practical purposes, I think that trading speed and code maintainability for memory size isn't a good idea.
Seems to me like (as other commenters have pointed out) this wouldn't be a great idea for something that is stored as a straightforward linked list.
Where it seems really useful to me is a cheap way to store an alternate ordering on top of an existing data structure, without having to keep a separate list around. I've done this before, using next pointers to preserve an alternate ordering, but this is a nice way to allow it to be bidirectional. neat.
It's technically undefined behavior, but in practice it will work perfectly well on almost all C implementations, and in the infinitesimal number of cases where this structure is actually useful, portability isn't a major concern.
Actually, the behavior is a combination of undefined and implementation defined. To the extent that it's implementation defined, it may still result in nasal demons, but they have to be documented nasal demons.
Point is, it's strictly non-portable according to the standard, but when done carefully is unlikely to break on a general-purpose architecture.
This is from C99, I don't have a copy of C11:
5) An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.
6) Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.
And the footnote: The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.
Generally, it doesn't allow you to walk the list given the pointer to an entry. You always need to know the addresses of two adjacent list entries. And yes, this limits its usefulness in practice. I tend to only need doubly linked lists where I want to be able to remove or insert an arbitrary element without walking the list, and you can't do that with an xor list.
it only works if you are linearly iterating from the start/end of a list.
IMHO one of the (few) primary benefits of linked lists is random access to insert/delete/iterate an element without having to traverse the entire list, something that breaks with XOR lists.
Secondly, if you have so many nodes that the memory saving of one pointer is significant, it's likely that traversing that list will take forever and force you to use a different algorithm anyway.
... an improvement on basic linked lists would be 16B/32B/128B/512B etc. alignment of the nodes' memory addresses and bit-packing the left/right offsets in units of the alignment unit. Hell, you could probably beat XOR lists for space for a range of workloads.
No you're not. Your function needs to accept two pointers. Your data structure doesn't need to store two pointers.
Unless you're storing external pointers to every node (in which case you'll actually break even with the traditional implementation in terms of size), you'll still have significant savings. Passing an extra pointer to a function is pretty trivial in terms of size.
It's how I use them: there's the list plus some other structure/object that points to/references the node, usually multiple times, and the node is contained in multiple lists. Big data + performance optimization.