
Understanding Clojure's Persistent Vectors, Part 1 - llambda
http://hypirion.com/musings/understanding-persistent-vector-pt-1
======
StefanKarpinski
Does anyone have benchmarks comparing the performance of Clojure's
PersistentVectors to standard arrays in Java and/or C? O(1) is pretty useless
if the constant factor is huge. Anecdotally, I've heard that they are "fast"
but it would be interesting to know what that really means.

~~~
PeterisP
W/o benchmarks, but from theory - it's a rather different data structure.

1) for pure lookup-by-random-int-index, the standard array is much faster - in
C you just calculate a memory location and there your data is; this structure
requires, say, 4-6 such lookups depending on the array size, and is that many
times slower.

2) On the other hand, iteration through large consecutive parts of the array
is almost the same as pure arrays; there is some overhead but that's tiny.

3) Insertion is much faster than standard C/Java arrays - you can't really
insert in the middle of an array w/o copying almost all of it, but you can do
it here.

4) If you need "modification while keeping the old version as well" - then
again, arrays need to make a full copy, but this beast can do it cheaply,
faster than a raw C array.

As with any data structure, there is no "faster data structure" in general,
since the choice depends greatly on what you want to do with it: some data
structures are faster for X and others are faster for Y. The efficiency of
this structure depends greatly on whether your array/vector is mostly used for
random-access lookup or as a list where you need to process all or many
sequential items.
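
To make the lookup cost in point 1 concrete, here is a minimal Python sketch of index lookup in a 32-way trie of the kind the article describes. It is an illustration under simplifying assumptions, not Clojure's actual code: the names `build` and `lookup` are hypothetical, nodes are plain lists, and details such as the tail are ignored.

```python
BITS = 5                 # 5 bits of the index per level -> 32-way branching
WIDTH = 1 << BITS        # 32 slots per node
MASK = WIDTH - 1

def build(items):
    """Build the trie bottom-up from a list; returns (root, shift)."""
    level = [items[i:i + WIDTH] for i in range(0, len(items), WIDTH)] or [[]]
    shift = 0
    while len(level) > 1:
        # Group every 32 nodes under a new parent, one level higher.
        level = [level[i:i + WIDTH] for i in range(0, len(level), WIDTH)]
        shift += BITS
    return level[0], shift

def lookup(root, shift, index):
    """Return (value, nodes_visited) for the element at `index`."""
    node, visited = root, 1
    while shift > 0:
        node = node[(index >> shift) & MASK]   # pick the child for this level
        shift -= BITS
        visited += 1
    return node[index & MASK], visited

root, shift = build(list(range(100_000)))
print(lookup(root, shift, 99_999))   # (99999, 4): four nodes visited
```

With 100,000 elements the walk touches four nodes, where a flat array needs a single address calculation.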

~~~
deliminator
Clojure's standard persistent vectors don't allow insertion in the middle
without copying the entire vector; they only allow fast insertion at the end.
There is an alternative implementation here [1] which does allow insertion
(anywhere) without copying the entire vector.

[1] [https://github.com/clojure/core.rrb-vector](https://github.com/clojure/core.rrb-vector)

------
postfuturist
> They are a data structure invented by Rich Hickey for Clojure

It is an implementation of a data structure invented by Phil Bagwell:
[http://lampwww.epfl.ch/papers/idealhashtrees.pdf](http://lampwww.epfl.ch/papers/idealhashtrees.pdf)

~~~
JeanPierre
Author here: I think you're talking about the persistent hashmaps, not the
persistent vectors. I've been looking for papers explaining Clojure's
persistent collections, and Bagwell seems to cover the hash maps and hash sets
quite well. However, I've not seen a paper on the persistent vectors, which
was quite a bummer, and that was the reason I started explaining them in the
first place.

If you have a reference to a paper explaining something similar (or the actual
implementation), I'd love to put it in the post for others.

~~~
jasonwatkinspdx
[https://github.com/clojure/clojure/blob/c6756a8bab137128c811...](https://github.com/clojure/clojure/blob/c6756a8bab137128c8119add29a25b0a88509900/src/jvm/clojure/lang/PersistentVector.java)

Looking at the source, the persistent vectors are virtually identical to
Bagwell's paper. Rich did add a couple of tweaks, namely moving the bitvector
that indicates which slots of a node are occupied from being a word in the node
object to being embedded in the 64-bit integers stored in each node slot. When
a node is filled enough to span 2 cache lines, around 9 slots on typical
hardware with 64-byte lines, and the next desired index fragment is the 9th
slot or higher, this avoids touching the first cache line, potentially saving
a cache miss. This is why the nodes are 32-way: 32 bits for the bitvector and
32 bits for the offset in the underlying storage array fit in one 64-bit word
which can be written atomically (inside a transient, obviously). Rich goes
through this in one of his talks, but I don't recall which.

The modification to go from mutable to immutable isn't an invention either.
Anyone who's read any of the functional data structure literature will be
familiar with path copying being one of the two general ways of making any
data structure persistent.
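
For readers who haven't seen path copying, a minimal Python sketch of it on a 32-way trie follows. This is an illustration only, with a hypothetical `assoc` helper and plain lists standing in for nodes; it is not taken from the Clojure source.

```python
BITS = 5
WIDTH = 1 << BITS
MASK = WIDTH - 1

def assoc(node, shift, index, value):
    """Return a new root with `index` set to `value`; `node` is not mutated."""
    copy = list(node)                    # copy only this node (at most 32 slots)
    slot = (index >> shift) & MASK
    if shift == 0:
        copy[slot] = value               # leaf level: store the new value
    else:
        # Interior level: replace just the child on the path to `index`.
        copy[slot] = assoc(node[slot], shift - BITS, index, value)
    return copy

# Two-level trie holding 0..1023: a root of 32 leaves with 32 values each.
old_root = [list(range(i, i + WIDTH)) for i in range(0, WIDTH * WIDTH, WIDTH)]
new_root = assoc(old_root, BITS, 70, "changed")

print(old_root[2][6], new_root[2][6])    # 70 changed
print(new_root[3] is old_root[3])        # True: untouched leaves are shared
```

Only the nodes on the path from the root to the changed leaf are copied; every other subtree is shared between the old and new versions.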

From the perspective of these data structures there's little difference
between a vector with integer indexes and a hashmap. The hashmap just requires
a preliminary step of hashing the key to an integer.

~~~
swannodette
From the paper I cited below:

 _The immutable vector data structure as pioneered by the programming language
Clojure [4] strikes a good balance between read and write performance and
supports many commonly used programming patterns in an efficient manner. In
Clojure, immutable vectors are an essential part of the language
implementation design. Ideal Hash Tries (HAMTs) [1] were used as a basis for
immutable hash maps and the same structure, 32-way branching trees, was used
for immutable vectors._

I'm pretty sure they picked the word _pioneered_ for a reason. If Rich Hickey
didn't invent them, then Tiark & Bagwell didn't invent RRB-Trees.

~~~
jasonwatkinspdx
Well, it's arguable either way IMHO. I'd give priority to Bagwell because he
first published his work academically in 2000. At the time he worked for
Odersky, the author of the Scala language. So these structures were in Scala's
implementation first, then adapted and improved for Clojure.

~~~
modersky
Phil Bagwell was loosely associated with my group in 2000 but did not work for
me then. His work at the time was theoretical; the first practical
implementation is Clojure's. Scala's implementations only appeared in version
2.8, in 2010.

~~~
jasonwatkinspdx
Awesome, thanks for the correction.

------
fyolnish
"practically O(1)" is an interesting statement

~~~
sethev
I like to point out that hash tables are O(N) in the worst case when people bring this up.
Constant factors, average cases, and practical considerations matter a lot -
it's not just nitpicking or semantics.
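
For a sense of scale behind "practically O(1)": the depth of a 32-way tree grows as log32(n), which stays very small for any realistic n. The short Python sketch below works out the numbers; the helper name `depth` is hypothetical and it simply counts levels.

```python
def depth(n, width=32):
    """Number of levels a `width`-way trie needs to hold n elements."""
    levels, capacity = 1, width
    while capacity < n:
        capacity *= width
        levels += 1
    return levels

for n in (32, 1_000, 1_000_000, 1_000_000_000, 2**31 - 1):
    print(n, depth(n))
# 32 -> 1, 1000 -> 2, 1000000 -> 4, 1000000000 -> 6, 2147483647 -> 7
```

Even at the largest index a Java `int` can hold, the tree is only 7 levels deep, which is why the logarithm is often treated as a constant here.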

------
film42
Awesome post! The diagrams are really great additions. I look forward to the
rest of the series. :)

------
oskarkv
Branching factor 32 is great for lookups, but isn't it slower for
modification? At least, one has to create more array cells in total (31 copies
in each node), no?
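
A rough back-of-the-envelope sketch of that trade-off, in Python. It assumes a path-copying update copies one full node per level, and it ignores tails, transients, allocation cost, and cache effects; `slots_copied` is a hypothetical helper, not a measurement.

```python
def slots_copied(n, width):
    """Slots copied by one path-copying update in a `width`-way trie of n elements."""
    levels, capacity = 1, width
    while capacity < n:
        capacity *= width
        levels += 1
    return levels * width            # one full node copied per level

for width in (2, 4, 32):
    print(width, slots_copied(1_000_000, width))
# 2 40
# 4 40
# 32 128
```

Under these assumptions a 32-way node does copy more slots per update for a million elements (128 versus 40), but it allocates 4 nodes instead of 20. Whether that is slower in practice depends on factors this sketch does not model.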

