
Columnarization in Rust - steveklabnik
http://www.frankmcsherry.org/columnarization/serialization/rust/2014/12/15/Columnarization-in-Rust.html
======
kbenson
_So like everyone without a job, I’ve started to learn Rust. And like everyone
who has started to learn Rust, I now feel it is very important to tell you
about my experience with it._

Well, that sums that up pretty succinctly. :)

~~~
rubiquity
That's the best intro to a blog post that I've ever read.

~~~
frou_dh
It gets two points deducted for the plague of the unnecessary introductory So.

~~~
evincarofautumn
Discourse markers may be unnecessary, but they do carry some meaning and serve
a purpose.

------
killercup
There is a part 2:
[http://www.frankmcsherry.org/columnarization/serialization/r...](http://www.frankmcsherry.org/columnarization/serialization/rust/2014/12/16/Columnarization-2.html)

------
mkaufmann
While I like the construction of the column store and the corresponding API,
the claims of the author don't really make sense:

> "... columnarization, a technique from the database community for laying out
> structured records in a format that is more convenient for serialization
> than the records themselves."

Column stores, in comparison to row stores, don't offer any serialization
benefit per se. The main benefits are the following; I will use a record
(A,B,C,D,E) as an example, with all fields of type u32 (4 bytes):

* If you only use some fields, you have to load less data from memory/disk into the CPU cache, and your working set is more likely to fit into cache. For example, when filtering only the records where A=22 and B=45, you only have to load x * (sizeof(A)+sizeof(B)) = x * 8 bytes instead of x * record_size = x * 20. This can make a very significant difference (see the sketch below).

* When using compression to reduce the size of data, columns can often be compressed better, because they only contain data of the same type and nature and thus probably share similarities. With such a small record consisting only of integers it probably won't make a difference, but if e.g. some fields are country abbreviations, textual descriptions, or ids, one could easily imagine that there are gains.
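
To make the first bullet concrete, here is a minimal sketch of the two layouts (hypothetical types and names, in modern Rust syntax rather than the pre-1.0 syntax the post uses):

    // Row layout (array of structs): filtering on a and b still drags
    // all 20 bytes of every record through the cache.
    struct Record { a: u32, b: u32, c: u32, d: u32, e: u32 }

    fn filter_rows(rows: &[Record]) -> usize {
        rows.iter().filter(|r| r.a == 22 && r.b == 45).count()
    }

    // Column layout (struct of arrays): the same filter touches only
    // the 8 bytes per record that it actually needs.
    struct Columns { a: Vec<u32>, b: Vec<u32>, c: Vec<u32>, d: Vec<u32>, e: Vec<u32> }

    fn filter_cols(c: &Columns) -> usize {
        c.a.iter().zip(&c.b).filter(|&(&a, &b)| a == 22 && b == 45).count()
    }

    fn main() {
        let rows = vec![Record { a: 22, b: 45, c: 0, d: 0, e: 0 }];
        let cols = Columns { a: vec![22], b: vec![45], c: vec![0], d: vec![0], e: vec![0] };
        assert_eq!(filter_rows(&rows), filter_cols(&cols)); // both find 1 match
    }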

Coming back to the point about serialization: using the same technique as
described in the blog post, there won't[1] be a performance difference between
column storage and row storage (e.g. using a struct). The method described in
the blog post just lets the data array of the original vector be wrapped by a
Vec<u8> without even moving the memory, so the method is independent of the
data type that is stored in the vectors. Of course it will only work for data
types that do not contain references, otherwise we could get illegal memory
access after deserialization (which should be guaranteed by the Rust type
system because only Copy types are allowed).
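
A minimal sketch of the kind of zero-copy byte view being described (a hypothetical helper, not the blog post's actual code; sound only because u64 contains no references, no padding, and no invalid byte patterns):

    use std::{mem, slice};

    // View a Vec<u64>'s element buffer as raw bytes without moving or
    // copying any element data.
    fn as_bytes(v: &[u64]) -> &[u8] {
        unsafe {
            slice::from_raw_parts(v.as_ptr() as *const u8,
                                  v.len() * mem::size_of::<u64>())
        }
    }

    fn main() {
        let v = vec![1u64, 2, 3];
        assert_eq!(as_bytes(&v).len(), 24); // 3 elements * 8 bytes, zero copies
    }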

The only thing this benchmark is testing is how fast a vector can be
initialized.

[1] There can be a space improvement from keeping the data in a column layout
rather than a row layout of normal structs. A struct's total size is normally
rounded up to the alignment of its largest field, so a struct containing an
i64 and an i8 carries 7 bytes of padding. In a column layout this overhead is
avoided. Still, there would be no improvement in this serialization scheme, as
it does not actually copy any data.
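
The padding claim is easy to check (hypothetical types; sizes as on a typical 64-bit target):

    use std::mem::size_of;

    // 9 bytes of data, padded to 16 so the i64 stays 8-byte aligned:
    // 7 wasted bytes per record.
    struct Row { big: i64, small: i8 }

    fn main() {
        assert_eq!(size_of::<Row>(), 16);
        // Split into two columns, the same data needs only 8 + 1 = 9
        // bytes per record, with no per-record padding at all.
        assert_eq!(size_of::<i64>() + size_of::<i8>(), 9);
    }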

~~~
frankmcsherry
Hi, author here.

Unless I misunderstand your post, I think you may have missed important parts!

More is going on than "just wrapping the data of the original vector as a
Vec<u8>". That is what happens when we are handed a Vec<uint>, but for other
inner types (pairs, vectors, etc.) there is more to do (and actual code is
presented showing what that work is).

1. A Vec<(u8, u64)> definitely ends up as a (Vec<u8>, Vec<u64>), thereby
avoiding the padding you'd have if you just wrote out the elements as structs.
You absolutely end up saving space, and part of doing this is definitely
copying data (responding to "as it does not actually copy any data").
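
A minimal sketch of that transposition (illustrative only; the actual ColumnarVec does this generically and recycles its buffers, as described in the posts):

    // Split a vector of pairs into one vector per field. Each (u8, u64)
    // occupies 16 bytes in memory (7 of them padding); the two columns
    // store the same data in 9 bytes per pair.
    fn columnarize(pairs: Vec<(u8, u64)>) -> (Vec<u8>, Vec<u64>) {
        pairs.into_iter().unzip()
    }

    fn main() {
        let (bytes, words) = columnarize(vec![(1u8, 10u64), (2, 20)]);
        assert_eq!(bytes, vec![1, 2]);
        assert_eq!(words, vec![10, 20]);
    }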

2. The types themselves can be vectors, corresponding to a struct with owned
pointers (whose elements can also own pointers, etc.). This is not something
that just casting the source array will deal with. It's important here that
Rust hands you ownership, as otherwise it would be totally inappropriate for
us to claim the underlying memory (which the code does). That part, recycling
owned memory, is one of the big performance wins (about 2.5x faster for me
than invoking the allocator each time I need to mint a new array).
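
The recycling idea, sketched with a hypothetical stash type (not the post's API): callers hand back owned Vecs, and later requests reuse those allocations instead of hitting the allocator.

    // Keep spent buffers and hand their allocations back out on demand.
    struct Stash {
        spare: Vec<Vec<u64>>,
    }

    impl Stash {
        // Reuse a returned buffer if one is available: `clear` drops the
        // old contents but keeps the capacity, so no allocation happens.
        fn get(&mut self) -> Vec<u64> {
            match self.spare.pop() {
                Some(mut v) => { v.clear(); v }
                None => Vec::new(),
            }
        }
        // Take ownership back so the allocation can be reused next time.
        fn give(&mut self, v: Vec<u64>) {
            self.spare.push(v);
        }
    }

    fn main() {
        let mut stash = Stash { spare: Vec::new() };
        let mut buf = stash.get();
        buf.extend(0..1024u64);   // first use allocates
        stash.give(buf);
        let buf2 = stash.get();   // second use recycles the same allocation
        assert!(buf2.capacity() >= 1024);
    }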

3\. "Copy" doesn't appear in the first post. The types don't have to be Copy,
and indeed Vec<T> is not Copy, even if T is. Not sure where you got that from.
But, no such requirement; yay!

Read part 2 for more about Copy, and how when your type is Copy you get a free
implementation that does what you suggest (keeping each struct intact), except
you can mix and match with Vecs and Options and stuff like that.

Hope this clears up some of the not-sense-making. Feel free to holler with
other questions.

Cheers, Frank

~~~
mkaufmann
Addressing your points:

1. In my comment I acknowledged that there are space savings with types that
include padding; I don't understand why you imply in your answer that I didn't
understand this point. Regarding my comment about not copying data: it was
based on the benchmark[1] as linked in the blog post. The parts you measure do
not copy any of the user data! They only convert your internal vectors of user
types to a list of vectors of u8 type, so what you are doing is essentially
moving the data. But when using move semantics the size of the user data no
longer matters, so there won't be a difference between your column layout and
using the more complex types directly inside the vectors, if measured the same
way as in this benchmark. In my opinion your benchmark is flawed and does not
support the argument you make in your blog post.

2. OK, yes, that was bad wording on my side. If you provide adaptors for
complex types with pointers, like Vec, then you can of course also serialize
those.

3. I guess this is just about the same argument as point 2, and thus
redundant.

To demonstrate, I wrote a serialization framework that does not do
"columnarization" but instead simulates row layout, added an adaptor for Vec,
ran it with the same benchmark, and also reran the original columnar benchmark
on my machine. Both benchmarks were compiled exactly the same way. You can
find my code here[2]. Here are the results:

    ==============================
    columnarization <uint>                   12.1 GB/s, 742k values/s
    columnarization <(uint, (uint, uint))>    5.3 GB/s, 107k values/s
    columnarization <vec<uint>>               2.1 GB/s,  54k values/s
    columnarization <Option<uint>>           1.58 GB/s, 163k values/s
    ==============================
    row (simple_serialize) <uint>                  12.1 GB/s, 730k values/s
    row (simple_serialize) <(uint, (uint, uint))>  8.83 GB/s, 164 values/s
    row (simple_serialize) <vec<uint>>             2.13 GB/s,  56 values/s
    row (simple_serialize) <Option<uint>>          8.82 GB/s, 263k values/s
    ==============================

You can see that columnarization does not have a performance benefit in your
benchmark, and it is even significantly slower for the Option<uint> type and
for pairs.

In your blog post you never mentioned that columnarization only has the
potential to bring performance benefits to de-/serialization when using types
with large padding overheads. I think this discussion would probably have
helped the blog post. It would be much better if it either omitted the
currently wrong performance argument and just focused on the nicely typed API,
or used a proper benchmark that supported your argument, which from my
understanding would only be possible in a very limited set of use cases.

Here is a list of other valid arguments you could have made instead:

* The format saves space at the cost of performance.

* It is better than repr(packed), as it will also work on platforms that don't support unaligned access (see the sketch below).
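
To illustrate that second point (hypothetical types; sizes as on a typical 64-bit target): repr(packed) also removes the padding, but it leaves fields misaligned, which some platforms cannot load directly, whereas a column layout is dense while keeping every field naturally aligned.

    use std::mem::{align_of, size_of};

    #[repr(packed)]
    struct Packed { big: i64, small: i8 }  // 9 bytes, but `big` may be unaligned

    struct Padded { big: i64, small: i8 }  // 16 bytes, fields always aligned

    fn main() {
        assert_eq!(size_of::<Packed>(), 9);
        assert_eq!(align_of::<Packed>(), 1); // reads of `big` may be unaligned
        assert_eq!(size_of::<Padded>(), 16);
    }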

[1]
[https://github.com/frankmcsherry/columnar/blob/master/exampl...](https://github.com/frankmcsherry/columnar/blob/master/examples/example.rs)
[2]
[https://github.com/mkaufmann/simple_serialize](https://github.com/mkaufmann/simple_serialize)

~~~
frankmcsherry
With all due respect, your code seems to be a copy/paste of my code, where
you've removed the Pair and Option implementations (allowing the alternate
default I discussed in my second blog post) and replaced my string
`ColumnarVec` with your string `SimpleSerialize`. So, thank you for measuring
my code for me? :D

The reason the new numbers look better for Option with the Copy implementation
is that you are now writing 16 bytes for each Option. They are a 50-50 mix of
Some(0u) and None, which I wrote out in 5 bytes on average (always a 1-byte
"present/not" flag, then 4 bytes of data on average). The Copy implementation
is just writing 3x as much data and padding the throughput numbers, rather
than reporting something more like goodput.
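
A hypothetical re-creation of the two write paths (using u64 where the post's pre-1.0 code used uint) shows the gap:

    use std::mem::size_of;

    // Columnar-style write: 1-byte presence flag, payload only when present.
    fn write_compact(opts: &[Option<u64>], out: &mut Vec<u8>) {
        for opt in opts {
            match opt {
                Some(x) => { out.push(1); out.extend_from_slice(&x.to_le_bytes()); }
                None    => out.push(0),
            }
        }
    }

    fn main() {
        let data = [Some(0u64), None, Some(0), None]; // the 50-50 mix
        let mut bytes = Vec::new();
        write_compact(&data, &mut bytes);
        assert_eq!(bytes.len(), 20); // (1 + 8) + 1, twice: 5 bytes per value
        // The Copy route memcpys whole in-memory Options instead:
        assert_eq!(size_of::<Option<u64>>() * data.len(), 64); // 16 bytes each
    }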

I sense that this isn't the right place to shake this out, so I'll stop
posting. If there is a better way to follow up, let me know.

[edit: I've got you through github, thanks! I'll buy beers when I swing by TU
next]

~~~
mkaufmann
The changes make all the difference. Your argument was that columnarization,
i.e. storing each native type in a data record in separate columns, brings a
performance benefit. By removing the specialisations for Pair and Option,
which save the data in separate columns, I switched back to a classical
storage which basically stores the data of one record in one place, like a row
store. So your code simulates a column store and mine a row store.

Using your original benchmark, I then show that the row layout brings a large
performance win in the benchmark. My numbers show this performance win not
just in throughput (bytes per sec.) but also in "goodput" (values per sec.);
check my previous comment. I just noticed, though, that I sometimes forgot the
k in the reported numbers; where it is missing, you have to multiply the
number by 1000. I can't edit it anymore.

I guess we can agree to disagree and should continue the discussion in another
forum ;)

PS: I just updated my github information; I am now at TU Munich

------
skybrian
I'm not familiar with Rust, but I wonder how much copying this does for deeply
nested types?

~~~
steveklabnik
Rust has move semantics by default. A move does a bytewise copy, and then
disallows access from the old location.

This code largely just copies pointers around.
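
A minimal illustration of that point (sizes as on a typical 64-bit target):

    use std::mem::size_of;

    fn main() {
        // Moving a Vec copies only its (pointer, length, capacity) triple;
        // the million heap-allocated elements do not move.
        let v: Vec<u64> = (0..1_000_000).collect();
        let elems = v.as_ptr();
        let moved = v;                     // `v` is inaccessible from here on
        assert_eq!(moved.as_ptr(), elems); // same heap buffer, nothing copied
        assert_eq!(size_of::<Vec<u64>>(), 24); // all that a move transfers
    }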

------
thomasahle
Seeing how the Rust community seems to have converged on using `uint` as the
default type for array indexing and so on, wouldn't it make sense to reflect
that in the complexity of the type names, giving the more common type the
simpler name?

Instead of (uint, int) we could have (int, sint) or (uint, sint).

~~~
dilap
You are going to confuse a hell of a lot of people if you make int mean
"unsigned integer"!

~~~
long
So maybe "nat" for "uint", then?

~~~
ufo
nat makes me think of arbitrary-precision numbers. I think uint is fine.

~~~
Arnavion
Also, not all of the world agrees on whether 0 is a natural number.

------
ww520
Pretty cool. Columnarization is used in other contexts too, like
high-performance computing, or the Entity Component System pattern in gaming.
It's data-centric and very cache-friendly.

------
arthursilva
Make sure you compile your release/bench builds with --release (if using
cargo) or -O (if using rustc).

~~~
masklinn
IIRC release is the default when testing or benching.

~~~
dbaupp
Only when using `cargo bench` to compile/run them.

