
Zip Trees - federicoponzi
https://arxiv.org/abs/1806.06726
======
bfirsh
If you’re on a phone, here’s an HTML version of the paper: [https://www.arxiv-vanity.com/papers/1806.06726/](https://www.arxiv-vanity.com/papers/1806.06726/)

~~~
0-_-0
Thanks for letting me know about Arxiv Vanity!

------
kilotaras
A Cartesian tree (treap) with a zip/merge-style update routine is a well-known
data structure in competitive programming. The only difference from the paper
would be the rank selection algorithm - uniform instead of geometric.

Unclear if authors were aware of it.

~~~
blacksmythe

> One can view a zip tree as a treap (Seidel and Aragon 1996) in which priority ties are allowed and in which insertions and deletions are done by unmerging and merging paths ("unzipping" and "zipping")

~~~
kilotaras
> ...rather than by doing rotations.

The DS I've mentioned uses unzip/zip (there called split/merge); see [1]

[1] [https://cp-algorithms.com/data_structures/treap.html](https://cp-algorithms.com/data_structures/treap.html)
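
For readers who haven't seen the structure being discussed, a minimal treap with split/merge might look like the following. This is an illustrative sketch with uniform priorities (the classic treap scheme), not code from cp-algorithms or the paper:

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.prio = random.random()  # uniform priority, per classic treaps
        self.left = self.right = None

def split(t, key):
    """Split treap t into (keys < key, keys >= key)."""
    if t is None:
        return None, None
    if t.key < key:
        left, right = split(t.right, key)
        t.right = left
        return t, right
    left, right = split(t.left, key)
    t.left = right
    return left, t

def merge(a, b):
    """Merge treaps a and b, where every key in a precedes every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        return a
    b.left = merge(a, b.left)
    return b

def insert(t, key):
    """Insert by splitting around the key and merging the pieces back."""
    left, right = split(t, key)
    return merge(merge(left, Node(key)), right)
```

Swapping the uniform `prio` for a geometric rank (plus tie-breaking) is, roughly, the change the paper makes.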

~~~
blacksmythe
Thanks for clarifying :)

------
rafael859
Correct me if I'm wrong, but isn't this the "standard" [1] treap with
split/merge (which is somewhat acknowledged in the footnote of page 8), simply
with a different distribution of priorities? The result is certainly
impressive, but it seems like a small modification to an otherwise known
structure.

[1]: See here [http://e-maxx.ru/algo/treap](http://e-maxx.ru/algo/treap) for
an old reference of this structure. Google translate does a decent job with
it, and the code is readable without translation.

~~~
teraflop
I had a similar confusion when skimming the paper. The algorithm only depends
on the relative ordering of node "ranks", rather than their absolute values.
If you were to treat the distributions as continuous (ignoring ties) then it
would make absolutely no difference what distribution you used. (Any two
continuous distributions are related by a monotonic transformation, which
preserves the ordering relationships between different points of the
distribution.)

(A more concrete way of looking at it: a simple way of sampling from a
geometric distribution is to choose a uniform random integer and count the
number of leading 1 bits. The authors suggest concatenating a few random bits
to the end for tie-breaking, so that the rank is a pair of a geometric random
variable and a uniform random variable, sorted lexicographically. But such a
pair is equivalent to just choosing a uniform random integer, and sorting by
that!)
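
The sampling described above can be sketched in a few lines (an illustration of the idea, not the paper's code; the `tie_bits` parameter is my own choice):

```python
import random

def geometric_rank():
    """Count leading 1 bits in a stream of fair coin flips:
    P(rank = k) = 2**-(k + 1), a geometric distribution."""
    rank = 0
    while random.getrandbits(1):
        rank += 1
    return rank

def rank_with_tiebreak(tie_bits=8):
    """Pair the geometric rank with a few uniform bits, compared
    lexicographically, to break ties as the paper suggests."""
    return (geometric_rank(), random.getrandbits(tie_bits))
```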

Upon a closer reading, it makes a bit more sense. In section 4, the authors
point out that by choosing ranks from a geometric distribution, they only need
to store O(log log n) bits per node, instead of O(log n). The geometric
distribution of ranks, and the handling of ties, allow them to prove the same
time complexity bounds as a normal treap while using slightly less space.

However, I don't see this as a major advantage because either way, it still
takes O(log n) bits to store each node's child pointers.

------
mabbo
> insertion and deletion can be done purely top-down, with O(1) expected
> restructuring time and exponentially infrequent occurrences of expensive
> restructuring. Certain kinds of deterministic balanced search trees, in
> particular weak AVL trees and red-black trees achieve these bounds in the
> amortized sense [3], but at the cost of somewhat complicated update
> algorithms.

It's a neat, novel data structure with O(1) typical insert and delete times.
Very cool!

~~~
budabudimir
I don't think an ordered data structure with O(1) insert time is even
possible, as that would mean you could sort in O(n).

~~~
mabbo
O(1) expected is a different thing than standard O(1).

~~~
budabudimir
Indeed, could you expand on that a bit?

Even if what you meant is that the expected time complexity of a single
insert operation is O(1), that surely cannot be the case. What would the
expected time complexity of N inserts be then?

~~~
mabbo
Consider the humble array-list - a simpler example, to be sure, but it's
great for explaining.

Build an array of an initial basic size (16, let's say), and keep a counter of
the current size. When you insert items 0 through 15, the time to insert is
O(1): find the current size, insert there. When you insert item 16, however,
you need to make a new array of size (currentSize * 2), move the existing items
over to it, then add your new item - which is O(n).

Let's say we insert n items (be it 64, 1024 or 2^32, it doesn't matter). What
was the mean, median and mode for an insert?

Well, for any non-doubling case ((k-1)/k of the time) it took O(1). For any
doubling case (1/k of the time) it took O(k). This all adds up to a mode of
O(1), a median of O(1) and a mean of (O(1)*(k-1) + O(k)*1)/k, or roughly O(2),
which is O(1).

Your worst case scenario of an insert is indeed O(n). But that's not the most
common outcome.
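
The doubling scheme just described can be sketched like this (the `ArrayList` name and starting capacity of 16 are illustrative, not from any particular library):

```python
class ArrayList:
    """Growable array with capacity doubling: O(1) amortized append."""

    def __init__(self, initial_capacity=16):
        self.capacity = initial_capacity
        self.size = 0
        self.data = [None] * self.capacity

    def append(self, item):
        if self.size == self.capacity:       # rare doubling case: O(n) copy
            self.capacity *= 2
            grown = [None] * self.capacity
            grown[:self.size] = self.data[:self.size]
            self.data = grown
        self.data[self.size] = item          # common case: O(1)
        self.size += 1
```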

Similarly, from how I'm reading this paper, very rarely you'll have an O(ugly)
restructuring during inserts or deletes, but the rest of the time you can
expect O(1) while maintaining sorted order. I'm going to have to take a try at
implementing it to see how it goes.

~~~
kbenson
So, I think that's what is meant when they say:

 _with O(1) expected restructuring time and exponentially infrequent
occurrences of expensive restructuring._

Occasionally there's an expensive O(n) restructuring, but given the nature of
how the structure grows - when it happens, why, and when it will need to
happen again - it's twice (or whatever) the prior amount of time before it
needs to happen again. I'm not up on my asymptotic notations[1] besides big-O,
but perhaps this is expressed more succinctly (if more confusingly, to a
general audience) by one of the additional notations?

1:
[https://en.wikipedia.org/wiki/Big_O_notation#Related_asympto...](https://en.wikipedia.org/wiki/Big_O_notation#Related_asymptotic_notations)

~~~
jibal
> given the nature of how it's increasing space, when and why, and when it
> will need to happen again

This is really rather straightforward. The whole point of the algorithm is to
randomly distribute the height of the inserted node such that half the nodes
are leaves (height 0), 1/4 have height 1, 1/8 have height 2, etc.
Restructuring is more expensive the larger the height, but the frequency of
restructuring decreases exponentially with respect to the cost of doing it.
Net result: O(1) for restructuring, + O(lg n) to find the insertion point (as
is necessarily true for any binary search tree, else we would have a way to do
an O(n) sort).
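
The "frequency falls exponentially while cost grows linearly" argument above reduces to one line of arithmetic (a sanity check of the reasoning, not code from the paper):

```python
# A node has height k with probability 2**-(k + 1), and restructuring a node
# of height k costs on the order of k. The expected restructuring work is
#   sum over k of  k * 2**-(k + 1)
# which converges to a constant, independent of n.
expected_work = sum(k * 2 ** -(k + 1) for k in range(64))
print(expected_work)  # converges to 1.0
```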

------
noahdesu
I'm curious about how expensive the restructuring is and, importantly, whether
it can be "scheduled": can I violate some conditions and then do the expensive
restructuring at a time that is convenient?

------
gexaha
The data structure from competitive programming is also known as a "treap with
implicit keys"; related question on Stack Overflow:
[https://stackoverflow.com/questions/3497875/treap-with-implicit-keys](https://stackoverflow.com/questions/3497875/treap-with-implicit-keys)

------
benbenolson
Any links to the implementation of this? The paper says that there's a
concurrent implementation in the author's thesis, but Princeton charges $5 per
copy of theses, so I can't view the actual implementation:

[https://dataspace.princeton.edu/jspui/handle/88435/dsp01gh93...](https://dataspace.princeton.edu/jspui/handle/88435/dsp01gh93h214f)

~~~
jibal
An implementation of zip tree insertion and deletion is in the paper.
_concurrent_ zip trees is another matter. I don't know why you can't view
something that Princeton charges $5 for, but in any case I think you would
find the thesis disappointing ... the arxiv paper says "The third author
developed a preliminary, lock-based implementation of concurrent zip trees in
his senior thesis [12]. We are currently developing a non-blocking
implementation." \-- I would wait for the latter.
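
For what it's worth, the insertion described in the paper can be rendered compactly. The sketch below is a recursive paraphrase of my own (the paper gives an iterative, pointer-rewiring version), so treat it as illustrative rather than the authors' code:

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.rank = 0                  # geometric rank: count coin flips
        while random.getrandbits(1):
            self.rank += 1
        self.left = self.right = None

def unzip(t, key):
    """Split subtree t by key into (keys < key, keys > key)."""
    if t is None:
        return None, None
    if t.key < key:
        left, right = unzip(t.right, key)
        t.right = left
        return t, right
    left, right = unzip(t.left, key)
    t.left = right
    return left, t

def insert(root, x):
    """Descend while existing ranks dominate x's (ties broken by key),
    then unzip the rest of the search path into x's two subtrees."""
    if (root is None or x.rank > root.rank
            or (x.rank == root.rank and x.key < root.key)):
        x.left, x.right = unzip(root, x.key)
        return x
    if x.key < root.key:
        root.left = insert(root.left, x)
    else:
        root.right = insert(root.right, x)
    return root
```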

~~~
jpap
I'd also love to read the thesis, especially to see how they structured the
lock.

I recently wrote a readers-writers (subtree) lock for a K-ary tree and it was
quite a challenge with a nontrivial implementation. It was much easier to
start with a single mutex-like (subtree) lock (one reader or writer), and then
extend it to the general multi-reader/writer case.

I can only imagine that a lock-free version (of even a single reader/writer)
must be even more of a challenge. I hope it's not a terribly long wait, but
there could be much to learn from the simpler lock-based scheme in the
meantime. :)

On the topic of access to the research, I found a talk by the first author [1]
a helpful companion to the paper, with slides for what appears to be the same
talk elsewhere [2]. I haven't seen any implementations outside of pseudocode
in the paper -- it would be nice to see if there are any tricks to generating
the random ranks cheaply, or the alternative that uses a pseudo-random
function of the key.
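
On generating ranks cheaply: one common trick (my assumption, not confirmed as the authors' approach) is to draw a single random word and count its trailing zero bits, which is geometrically distributed:

```python
import random

def rank_from_word(bits=64):
    """Draw one random word; its trailing-zero count k occurs with
    probability 2**-(k + 1) (capped at `bits` for the all-zero word)."""
    w = random.getrandbits(bits)
    if w == 0:
        return bits
    return (w & -w).bit_length() - 1  # isolate lowest set bit, take its index
```

One RNG call per rank, versus one call per coin flip in the naive loop.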

Thanks to the OP for sharing this. I'm keen to try it on another project where
I was hesitant to use red-black trees due to their complexity.

[1]
[https://www.youtube.com/watch?v=NxRXhBur6Xs](https://www.youtube.com/watch?v=NxRXhBur6Xs)

[2] [http://knuth80.elfbrink.se/wp-content/uploads/2018/01/Tarjan...](http://knuth80.elfbrink.se/wp-content/uploads/2018/01/Tarjan_Zip_Trees_Knuth80.pdf)

------
voidmain
I wonder if this would be a good candidate for the basis of a persistent
("functional") data structure?

O(1) pointer changes for insert/remove is a good sign, since pointer changes
tend to become _space_ overhead in a persistent data structure.

------
everybodyknows
Readers already familiar with similar data structures may save time by
proceeding directly to section 4, "Previous Related Work".

------
m3kw9
It's cool to see new innovation in this area.

