XOR Linked List (wikipedia.org)
153 points by no_more_death 1784 days ago | 84 comments

It's a cute trick, but please don't use it. It will confound debuggers, garbage collectors, memory leak detectors, and future readers. :)

Never say never. There's a time and place for dirty pointer tricks like the XOR linked list or tagged pointers (low 2-3 bits used for "tags"). Emphasis on words "dirty tricks" so you know that it's the rabbit you pull out of a hat, not something you do every day.

I almost needed XOR linked lists somewhere once, but I ended up doing the same thing with tagged pointers. I wish I remembered the occasion.
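
For comparison, a tagged-pointer sketch of the kind mentioned above, assuming at least 4-byte alignment so the low 2 bits are free (function names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Word-aligned addresses have their low 2-3 bits clear, so those
   bits can smuggle a small tag.  A mask of 0x3 assumes >= 4-byte
   alignment (true for malloc and for int-aligned objects). */
#define TAG_MASK ((uintptr_t)0x3)

static uintptr_t tag_ptr(void *p, unsigned tag) {
    assert(((uintptr_t)p & TAG_MASK) == 0 && tag <= TAG_MASK);
    return (uintptr_t)p | tag;
}

static void *untag_ptr(uintptr_t t) { return (void *)(t & ~TAG_MASK); }

static unsigned ptr_tag(uintptr_t t) { return (unsigned)(t & TAG_MASK); }
```

Unlike the XOR trick, a tagged pointer is still a decodable address on its own, which is why debuggers and GC tolerate it somewhat better.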

But imagine this situation: you're writing high-performance code on x86, building a data structure (with next/prev links somewhere), and your node size is 136 bytes because of arbitrary reason X. You're running a few percent behind your worst competitor in performance figures. Now if you shave off 8 bytes you'll fit nodes to cache-line boundaries and probably squeeze out a little perf just from the cache-friendly alignment. Should you do it? Hell yeah.

Okay, I know many people don't write code with tight performance budgets, but some of us do.

Yeah, there are going to be these trade-offs (complexity/efficiency), but I've seen this end up in fellow students' personal projects after they learned it in class.

Aside from that though, I think if you're clever enough to come to this and implement it well, I hope you're clever enough to understand whether or not it's the RightWay(TM).

Oh, I dunno. Every now and then I write a massive simulation of some p2p idea or other and max out my RAM. This might just come in handy.

So that you can get 30 more nodes? Surely the L/R pointers of the linked list aren't taking up all the memory in your program.

If you're on a 64-bit machine storing 32-bit ints, this will decrease the node size from 32 + 64 + 64 = 160 bits to 32 + 64 = 96 bits, so you should be able to store 66% more data.

edit: In fact, when do you ever really need to store particularly large objects in something like a linked-list? It seems like you can always get some relatively small reference value to the data (eg, a pointer) and store that in your list instead of storing your large objects themselves.

If you're storing references to larger objects, your larger objects are what's taking up the space, which I think was the point (66% more pointers are useless if you can't also store 66% more of the objects they point to).

Also, I think on a 64-bit machine with common DDR SDRAM, a memory location is 64 bits wide, no matter the size of the integer contained there. Not sure about this, however.

Most modern architectures like AMD64 use 40-bit memory references in the CPU, but write them as 64 bits to memory. That is why they have much lower addressable memory than the logical limit.

I think Bulldozer and Ivy Bridge are both 48, but I'm not going to dive into 500 page PDFs to find out.

Maybe, I just remember that number from powerPC / ARM back in my assembly class in 2010.

> If you're storing references to larger objects, your larger objects are what's taking up the space, which I think was the point (66% more pointers are useless if you can't also store 66% more of the objects they point to).

Right, of course...

> I really don't know all that much about architecture (yet), so that's totally possible, and I might be wrong about the 66% thing I guess.

edit: Actually, about the whole more objects thing, there might be some strange use case where you only have a few objects but are storing multiple references to them in a list. But yeah, in general, the benefit goes down with larger objects.

Not if you're storing lots of references to the same object, which is what I'm often doing. And my objects are really small anyway.

Some compilers/runtimes support pointer compression (eg. https://wikis.oracle.com/display/HotSpotInternals/Compressed...)

It would depend on the data you're storing. If you have a really long list of a really small data type (one machine word? I guess it's possible), this could cut your memory usage by up to a third.

You could cut your memory usage by arbitrarily close to 1/2 by using an unrolled linked list instead. Plus then you get the benefits of having pointers look like pointers, and getting your nodes to fit neatly in cache lines.
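
A rough sketch of what an unrolled linked list looks like, with an arbitrary 15 elements per node; the per-element pointer overhead shrinks toward zero as nodes fill up:

```c
#include <stddef.h>
#include <stdlib.h>

enum { UNROLL = 15 };  /* elements per node; one next pointer + count
                          amortized over all of them */

struct unode {
    struct unode *next;
    size_t count;        /* slots in use */
    int elems[UNROLL];
};

/* Append v, filling the tail before allocating a fresh node.
   Returns the (possibly new) tail. */
static struct unode *upush(struct unode *tail, int v) {
    if (tail && tail->count < UNROLL) {
        tail->elems[tail->count++] = v;
        return tail;
    }
    struct unode *n = malloc(sizeof *n);
    n->next = NULL;
    n->count = 1;
    n->elems[0] = v;
    if (tail)
        tail->next = n;
    return n;
}

static long usum(const struct unode *head) {
    long s = 0;
    for (; head; head = head->next)
        for (size_t i = 0; i < head->count; i++)
            s += head->elems[i];
    return s;
}
```

With 15 ints plus a pointer and a count, each node also lands close to a cache line on a typical 64-bit target, which is the other benefit mentioned above.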


In the past I've found unrolled linked lists less useful than I'd hoped. I tend to use linked lists when I want to do a lot of splicing and to be able to keep pointers to particular nodes. Needing to keep pointers to nodes stops unrolled linked lists from being useful, as you can't merge/split unrolled nodes as required (or at least, I couldn't come up with a clever way to do it).

This is similar to the problem with deletion in a standard linked list. Destructive actions can invalidate existing pointers.

One fix for this issue would be to extract the node where you split and "freeze" it. i.e. If you split at element 10, that element gets shoved into a new node which holds only one element and pointers are adjusted as necessary. The node it's extracted from may also be split into two pieces if necessary. From this point on, the single-element node is never merged into another node. It always exists independently.

Merging in this world works basically like it does in a normal linked list. You rearrange pointers but don't combine nodes. Insertion works like it does for a typical unrolled linked list, but with the frozen nodes special cased.

The tricky bit is if you're saving a lot of pointers that aren't part of a split operation. Then you might need to explicitly expose the freeze operation. And if you use it on every node, you're back to a standard linked list (with a bigger constant factor). You could also expose unfreeze, but that could get buggy really quickly. You could do a ref-counted freeze/unfreeze, but you should probably quit before you get to that point.

This is essentially CDR coding. Here's what a Lisp FAQ has to say about that:


Essentially, it says that it isn't as good an idea as you might imagine, because lists aren't often created all at once and they get chopped up and inserted into often enough the advantages of the scheme disappear.

These seem like related but distinct ideas. From what I understand, CDR coding is basically an optimization that replaces the entire list with an array. This has the effect that inserting elements involves an ugly and expensive indirection hack.

Unrolled linked list don't exhibit this behavior (though they have other drawbacks). Their behavior is closer to that of B-trees, where size-bounded multi-element nodes can be split and joined cheaply, which allows for cheap element insertion.

P2P simulation, just what I am getting into right now. Interested to know if you... have made/intend to make, any code available.

What I've got isn't general-purpose at all. I'm working on distributed learning algorithms, and right now I'm more concerned with having lots of nodes in the learning algorithms than I am in simulating real-world p2p...so I'm not modeling things like network delays, nodes dropping out and returning with different IPs, etc.

How about you?

If I get something that might possibly be useful for someone else, sure, I'll put it on github or something. (If I get a good design, the real network will be opensource too.)

Would that be for a university assignment, to design your own protocol or for something else?

A supercompiler could do this transparently.

edit: reference: http://c2.com/cgi/wiki?SuperCompiler

Isn't that what we're all waiting for, the mythical "smart enough" compiler. Imagine constructing programs from abstract and composable high-level concepts, and the compiler does all the dirty hacks behind the screen. Producing fully parallelized, secure, efficient machine code on-the-fly. Could it? One can dream :)

There have been some invisible barriers to compilers starting to reorganise data and not just code. It would seem to me that side has even more potential than semantics-preserving code optimization that takes data layout as a given.

Agreed. It would already make a lot of difference if the compiler could make decisions between (for example) structure of arrays or array of structures representations. The programmer could always organize the data in a human-sensible way, and the compiler would transparently convert it to an efficient internal representation, with only the fields that are necessary for certain computations.

Of course there are some cases in which the binary layout matters (for example, serialization, or buffer handling in network code), but those can be regarded as glorified (un)parsers instead of fixed data layout. Would also make the langsec people happy.

Please profile tricks like this, as they may actually be significantly slower than their naive counterparts on modern hardware. The prefetcher knows what a linked list looks like, and it knows how to get it somewhere closer than main memory before the nodes are needed.

> The prefetcher knows what a linked list looks like

That's some pretty advanced magic.

Even the compiler has very limited insight into what your code is actually doing without simulating it. The prefetcher might be able to look a bit ahead in the execution stream and do branch prediction, but absolutely no way does that extend to knowing stuff about your data structures.

Unless I have just been transported by a time warp I really think this is fiction.

If you're thinking of cache pre-fetching that actually has a really hard time dealing with stuff like linked lists because it has absolutely no idea at all about the data structure it is looking at. The 'next' and 'previous' pointers in the linked list might actually simply be values without any significance at all. And if they are dereferenced as pointers then that memory could be just about anywhere within the valid address range.

For arrays on the other hand such pre-fetching can be useful.

> Unless I have just been transported by a time warp I really think this is fiction.

Nope, the Intel guys and gals do this kinda magic day in, day out. Or at least the chips they manufacture do. I'm not familiar with the internals of the prefetcher of any CPU at this level, but let me wave hands here. This is what the prefetcher could do:

All it takes is for the prefetcher to get a cache line when requested, then observe what is inside and look at pointer-sized, aligned values that look like pointers and are pretty close to the original cache line's (virtual) address. Now if these values happen to be sane virtual addresses in the current process, the prefetcher might as well fetch them one cache level closer to the CPU. If it hits, it might yield a big performance boost in real-world apps. If it misses, it's just a little wasted electricity.

All modern CPUs do dirty little tricks like this if it helps them outshine their competitors.

Btw. you can add prefetch instructions in your code manually if you do linked list traversals or similar. In GCC you can use __builtin_prefetch() compiler intrinsic.
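
For example, a sketch of manual prefetching during a list traversal. The intrinsic is purely a hint; whether it helps depends on whether the prefetch distance covers the memory latency, so measure before keeping it:

```c
#include <stddef.h>

struct pnode {
    struct pnode *next;
    long payload;
};

/* Traverse while hinting the hardware one node ahead.
   __builtin_prefetch(addr, rw, locality) is a GCC/Clang intrinsic;
   it's only a hint and is dropped on targets without a prefetch
   instruction. */
static long sum_with_prefetch(const struct pnode *n) {
    long s = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next, /* rw = read */ 0, /* locality */ 1);
        s += n->payload;
        n = n->next;
    }
    return s;
}
```

Note that prefetching only one node ahead rarely hides a full cache miss; real implementations often skip several nodes ahead, which requires keeping extra pointers around.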

"If it hits, it might yield big performance boost in real world apps. If it misses, it's just a little wasted electricity."

Caches have limited size. If it misses, it also evicts something else from the cache. If that is what is actually needed, this costs performance.

"and are pretty close to the original cache line's (virtual) address"

Why does it have to be 'pretty close'?

"Now if these values happen to be sane virtual addresses in the current process"

That sanity check would involve visiting the paging tables, so it would require at least two indirections (http://lwn.net/Articles/253361/). If a cache line is 16 bytes, you would have at least 4 positions where a 32-bit pointer could be present. So, at least four times two memory lookups would be needed. I think all of them would go through the same cache, but even assuming that the CPU has ways of signalling that it should not recurse, I do not think it is practical to do what you describe (disclaimer: I am not an expert on CPU design)

What is possible is to guess at where data is to be found. That allows CPUs to read and speculatively execute instructions from the physical memory that they think backs the virtual address of the PC while they, in parallel, do the lookup to verify that. See http://dl.acm.org/citation.cfm?id=2000101&dl=ACM&col.... I do not know whether this has made it into actual CPUs, though.

Any refs, docs about this behavior? First time hearing it. How does it work with MMU, protected memory, or external memory mapped as normal one?

I know a bit about prefetch and its variants, but haven't used them directly in 10 years (I was on a software project that had one of the first Katmais, back in 1999, doing some bilinear/bicubic texture filtering for a media/drawing application, and had to code extensively in MMX assembly back then).

> Any refs, docs about this behavior? First time hearing it. How does it work with MMU, protected memory, or external memory mapped as normal one?

edit: try to look for performance tuning guides for your CPU architecture, there might be some details about systems like this.

Nope. Some might exist but this stuff is generally considered trade secrets. A patent search might yield something. You can find some material about speculative execution, branch prediction, prefetching, etc but the real beef is hidden somewhere in Intel's (and their competitors') vaults.

It's all supposed to be transparent to the programmer so there's no need to write and release detailed public documentation about it.

As I said earlier, I have no idea how they work (or do they even exist), but I'll give a handwaving example of how they _could_ work.

> I'm not familiar with the internals of the prefetcher of any CPU at this level

So better not to speculate; what you think could be done when you are not familiar with the details typically tends to be a lot harder when you are familiar with the details and are tasked with the implementation. The devil is in the details and there is a very appreciable gap between theory and practice.

Further reading: http://www.futurechips.org/chip-design-for-all/prefetching.h...

Note the caution against using lists and trees and the advice to stick to arrays, which are typically accessed sequentially.

> Btw. you can add prefetch instructions in your code manually if you do linked list traversals or similar. In GCC you can use __builtin_prefetch() compiler intrinsic.

And that is exactly the point, if you don't supply the knowledge neither the processor nor the compiler can do this for you. If the processor or the compiler could do this then those __builtin_prefetch calls would not be required.

A hardware prefetching mechanism is in the Core i7, that I know for sure. It automatically identifies memory referencing patterns and attempts to fetch those blocks into cache before they are accessed; the detailed prefetching algorithms are not documented...

The prefetcher knows what a linked list looks like

That seems unlikely to me. Which processors do this?

Prefetching when a register contains what looks like a memory address at least seem possible

You can't tell whether a register contains a memory address or not after it has just been fetched, right up until the moment it is used. If the instruction stream contains instructions that reference that register within the look-ahead window of the instruction pipeline then maybe this might work. But if there is any conditional code in there, or something else that causes the value to be used in a way that is not 'as a pointer' to access memory (for instance, pointer arithmetic prior to use), then such a pre-fetch would likely cause more harm than good by blowing good data out of the cache and replacing it with useless data (at a significant penalty).

There has been some research on this subject with respect to compiler optimizations but as far as I know this is not at all done at the hardware level.

Pretty much everything looks like a memory address. There is currently a memory address "hole" on x86_64 which doesn't, but any number between -140737488355328 and 140737488355327 can represent a virtual memory address. Prefetching on any occurrence of any of those would be insane.

The prefetcher detects patterns. If you start accessing memory at N byte intervals, it will start prefetching N, 2N, 3N etc. ahead. So if you allocate your linked list elements from contiguous (virtual) memory, it will be able to prefetch elements. But once you start jumbling the list up, that will fall apart.

Yes, but imagine the bus traffic if you tried that: every committed instruction would require a TLB read to see if it "looked like" an address. The poor TLB is overcommitted as it is (Intel CPUs can do three memory operations in a clock). All prefetch implementations I'm aware of simply do sequential access (positive or negative) prediction, sometimes with stride detection, and that's it.

Note: the CPU knows your virtual address spaces and a whole lot more. So "looking like a pointer" is not just a 64 bit hexadecimal value, but the CPU has the possibility of checking whether it's a sane virtual address and other heuristics.

Also, if you make an incorrect prefetch decision, it's not a big deal. Just one wasted cache line.

As best I understand it, while instruction prefetch is a complex affair, data prefetch is limited to detecting incrementing sequential address access, unless directed through a PREFETCH op. I can't find a good reference, but I believe at best the CPU can only optimize by detecting stride.

In no circumstance will the CPU magically understand the semantics of some data structure.

Pre-fetchers rely on two metrics to determine what to cache: temporal locality, meaning they will cache things judged to be frequently accessed, and spatial locality, meaning that when you access something at memory address X, the cache will fill one cache line with the contents of consecutive memory locations, i.e. X, X+1, X+2, ..., X+s.

Hence, they have no capability to understand a structure like a linked list and will not cache it intelligently. All you can rely on is that they will cache arrays and traverse them in a way that efficiently uses the cache.

Note though that speed isn't all that matter, memory usage does too, especially in constrained environments.

The nicest use of a similar XOR trick I have seen is to eliminate dead cycles in hardware bus arbiters.

Suppose you need to multiplex between inputs A,B,C to make output D (e.g. you have a 3d block, a display engine, and a CPU all trying to access DRAM). The idea is to xor all valid inputs together.

When there is only one input this results in output=input. If all 3 are active then the output is A^B^C. The trick is that on the next cycle only the granted input (say A) is removed, so the output is B^C.

Receivers are required to XOR two sequential values to extract the actual data. In this case (A^B^C)^(B^C)=A.

See the "NoX Router" for details and explanation of how this compares to alternative arbitration approaches: pharm.ece.wisc.edu/papers/micro2011_nox.pdf
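
A toy software model of that arbitration scheme (just the XOR algebra as described above, not the actual NoX hardware): each cycle the bus carries the XOR of all still-pending inputs, and a receiver XORs two consecutive bus values to recover each granted word:

```c
#include <stdint.h>

/* One arbitration round over n inputs.  Inputs are granted one per
   cycle in order, and a granted input drops out of the next cycle's
   XOR.  bus[] gets n+1 entries; bus[n] == 0 means the bus is idle. */
static void xor_arbiter(const uint32_t in[], int n, uint32_t bus[]) {
    for (int cycle = 0; cycle <= n; cycle++) {
        uint32_t v = 0;
        for (int i = cycle; i < n; i++)  /* inputs 0..cycle-1 already granted */
            v ^= in[i];
        bus[cycle] = v;
    }
}
```

The receiver's job is then just `bus[k] ^ bus[k+1]`, which yields input k; no cycle is wasted carrying a bare grant signal.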

oooh very cool.

Believe it or not, we actually had a C exam question where we were asked to sketch out an implementation of this.

http://www.cl.cam.ac.uk/teaching/exams/pastpapers/y2010p3q6.... With classic mildly amusing exam humour introduction!

This question still haunts me.

Anyone actually using this is almost certainly committing a heinous premature optimization.

Why would you not just do the following?

    interface LinkedList
    class BasicLinkedList implements LinkedList
    class XORLinkedList implements LinkedList

Hm, I'm super concerned about performance! I know, I'll turn every function call into a virtual function call!

Don't be silly, the principle is sound.

    #ifdef _DEBUG
    typedef BasicLinkedList LinkedList;
    #else
    typedef XorLinkedList LinkedList;
    #endif

You'd implement it either with a policy-based class[0] or a CRTP compile-time virtual dispatch[1]. There's zero need to involve vtables in this.

  [0] http://en.wikipedia.org/wiki/Policy-based_design
  [1] http://en.wikipedia.org/wiki/Curiously_recurring_template_pattern#Static_polymorphism

Except this is not about performance, but memory.

    LinkedList<Element> list = new XORLinkedList(Collections.fromArray(arrayList));
    Element ele = list.get(4); //Ok, when we construct the list we can store the first and second pointers and traverse to the fourth
    Element next = ele.next(); //???

I don't understand your comment - are you saying "ele" doesn't have enough info to find the next node?

You need an Iterator type that has the address of the previous and current nodes, in addition to your existing node / Element type.

    LinkedList<Element> list = new XORLinkedList(Collections.fromArray(arrayList));
    Iterator<Element> it = list.get(4); //now you can access it.element for data from index 4
    it.next(); // now access it.element for data from index 5
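
In C the same iterator idea might look like this sketch (hypothetical names): the iterator is a (prev, cur) pair, because cur's link field alone cannot be decoded:

```c
#include <stddef.h>
#include <stdint.h>

struct xnode {
    int value;
    uintptr_t link;  /* (uintptr_t)prev ^ (uintptr_t)next */
};

/* An XOR-list iterator must carry two addresses: without the node
   it came from, there is no way to extract "next" from the link. */
struct xiter {
    struct xnode *prev;
    struct xnode *cur;
};

static void xnext(struct xiter *it) {
    struct xnode *next =
        (struct xnode *)(it->cur->link ^ (uintptr_t)it->prev);
    it->prev = it->cur;
    it->cur = next;
}
```

This is also why a bare node pointer saved from outside a traversal is useless here, as several comments above point out.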

XORLinkedList is most assuredly harder to debug than BasicLinkedList. Don't do it unless you hafta. In fact don't implement your own linked list at all, unless you have a specialized scenario that needs it (as proven by profiling).

I haven't used an XOR linked list since my Amstrad 6128. I think I probably discovered it independently, but I'm not sure. It is a pretty obvious thing to do on a memory-limited machine. Nice to know people still know about this stuff post the 8-bit micro era, but outside of embedded systems why would anyone use this today?

Saving memory still matters if you are trying to fit something into cache, although the items you are storing in the list would have to be pretty small if the extra pointer mattered.

Most people don't need more than 4G per process, so I wish stuff like the Linux x32 ABI became standard. All the extra registers without the huge pointers. Would be handy for squeezing more into a cheap VPS. Not to mention the CPU cache.

I know, maybe it's not very useful.

But isn't it fascinating? Some information moved from the link node into the state of the program processing it. This halves the link redundancy of a doubly linked list. Suddenly a doubly linked list has the footprint of a singly linked list. How did that happen? :-)

It taught me. Tricks like this make me a better programmer. Even if I won't use the principle involved in this exact way.

As I remember, this was used in the operating system of the XDS Sigma 7. Decoding core dumps when things went awry was a challenge, but it did save memory.

This is an exercise in Knuth, TAOCP v.1.

Some implementations of heap memory allocators manage free memory chunks as doubly-linked lists. There are multiple lists for chunks of equal size, e.g. 4, 8, 16, 32 ... 2^n.

This XOR linked list hack makes sense for the smallest chunk size. On a 32-bit system, two pointers take 8 bytes, so the minimum size of an allocated memory block is 8 bytes without this hack and 4 bytes with it. On a 64-bit system, this hack makes it possible to allocate an 8-byte block from the heap without rounding up to 16 bytes.

For other practical purposes, I think that trading speed and code maintainability for memory size isn't a good idea.

Cool! Hadn't seen that before, neat trick.

Seems to me like (as other commenters have pointed out) this wouldn't be a great idea for something that is stored as a straightforward linked list.

Where it seems really useful to me is a cheap way to store an alternate ordering on top of an existing data structure, without having to keep a separate list around. I've done this before, using next pointers to preserve an alternate ordering, but this is a nice way to allow it to be bidirectional. neat.

It looks fun and all, but I really wonder about the implementation and practicality of it in C. But once again, it does look really cool.

It's technically undefined behavior, but in practice it will work perfectly well on almost all C implementations, and in the infinitesimal number of cases where this structure is actually useful, portability isn't a major concern.

Just curious, what part of an XOR-linked list is undefined? AFAIK, casting to and from sufficiently wide integers is ok.

I remember hearing claims that a C++ implementation would be allowed to use conservative GC (and nop out free/delete). I wonder what the language lawyers make of this.

What about destructors? I guess only using the GC on destructorless objects could work.

C++ is a lot stricter about a lot of things, and C++11 added some explicit GC-related language, but I have little desire to parse that particular 10MB document. Bjarne covers it in his FAQ, at least: http://www2.research.att.com/~bs/C++0xFAQ.html#gc-abi

Actually, the behavior is a combination of undefined and implementation defined. To the extent that it's implementation defined, it may still result in nasal demons, but they have to be documented nasal demons.

Point is, it's strictly non-portable according to the standard, but when done carefully is unlikely to break on a general-purpose architecture.

This is from C99 (I don't have a copy of C11), section Pointers:


5) An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.[56])

6) Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type.


And footnote [56]: The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.
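
In practice that means routing the XOR through uintptr_t, whose pointer-to-integer round trip is the part C99 actually pins down (7.18.1.4); the XOR of the integers in between is only "consistent with the addressing structure" per the footnote. A small sketch:

```c
#include <stdint.h>

/* uintptr_t is optional in C99 but near-universal; it's the one
   integer type guaranteed to round-trip a pointer unchanged.
   XORing with an arbitrary key models what an XOR list stores. */
static int *roundtrip_via_xor(int *p, uintptr_t key) {
    uintptr_t enc = (uintptr_t)p ^ key;  /* the stored link value */
    return (int *)(enc ^ key);           /* decode with the same key */
}
```

On any implementation where this breaks, uintptr_t itself would be doing something exotic, which is why the trick works "everywhere" despite not being strictly portable.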

Just curious, this wouldn't work for a circularly linked list in the case where there is only one element, right?

Generally, it doesn't allow you to walk the list given the pointer to an entry. You always need to know the addresses of two adjacent list entries. And yes, this limits its usefulness in practice. I tend to only need doubly linked lists where I want to be able to remove or insert an arbitrary element without walking the list, and you can't do that with an xor list.

Why wouldn't it?

A xor A = 0.

A xor 0 = A.

That should work like a regular circularly linked list or am I missing something?


Doh, now I understand. If this was allowed it wouldn't be possible to have a one-node non-circular list.
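
The collapse is easy to see in code: a one-node circular list (prev and next are the node itself) and a one-node non-circular list (prev and next are NULL) both store a link of 0, so the encoding can't distinguish them. A small illustrative check:

```c
#include <stdint.h>

struct xnode { int value; uintptr_t link; };

/* Compute the stored link for a single-node list, circular or not. */
static uintptr_t one_node_link(struct xnode *self, int circular) {
    uintptr_t prev = circular ? (uintptr_t)self : 0;
    uintptr_t next = circular ? (uintptr_t)self : 0;
    return prev ^ next;  /* 0 either way: A^A == 0 and 0^0 == 0 */
}
```

Hence the sentinel suggestion below: some out-of-band value has to carry the distinction the link field lost.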

It would be the reverse case if it were non-circular, returning 0.

You can encode "end of list" with another value like -1 which is usually safe since you can assume that addresses are at word aligned (so odd numbers cannot be addresses)

It only works if you are linearly iterating from the start/end of a list.

IMHO one of the (few) primary benefits of linked lists is random access to insert/delete/iterate an element without having to traverse the entire list. Something that breaks with xor lists.

Secondly, if you have so many nodes that the memory saving of one pointer is significant, it's likely traversing that list will take forever and force you to use a different algo anyway.

... an improvement on basic linked lists would be 16B/32B/128B/512B etc. alignment of the nodes' memory addresses and bit-packing the left/right offsets in units of the alignment unit. Hell, you could probably beat XOR lists for space for a range of workloads.

A regular linked list doesn't have random access to elements; typically a head node is passed to a function that iterates the list.

... or you pass any node to a function that operates on the list: iterate up/down, insert/delete, etc.

Point being, you don't have to iterate from the start of the list.

So your function needs to be passed current and next (or prev, depending on which direction you want to iterate). XOR-lists have lots of issues. This isn't one of them.

And you're back to storing 2 pointers again.

No you're not. Your function needs to accept two pointers. Your data structure doesn't need to store two pointers.

Unless you're storing external pointers to every node (in which case you'll actually break even with the traditional implementation in terms of size), you'll still have significant savings. Passing an extra pointer to a function is pretty trivial in terms of size.

In theory yes, but how many references do you need before you can claim random access?

It's how I use them. There's the list, plus some other structure/object that points/references the node (usually multiple times), and the node is contained in multiple lists: big data + performance optimization.
