

Sparkey – Key/value storage by Spotify - patricjansson
https://github.com/spotify/sparkey
Sparkey is a simple constant key/value storage library. It is mostly suited for read-heavy systems with infrequent large bulk inserts.
======
levosmetalo
I remember my interview at Spotify where we discussed how to implement
thumbnail display service in the most effective way. What we actually came to
is something along the lines of this library.

I always like it when a company focuses on its real problems in job interviews
and manages to avoid the brain-teaser trap. That way you get a feel for the
job you would actually be doing there, and can see whether you really like
both the work and the people.

~~~
rohansingh
Thanks :-) That is a go-to interview question for us, and we all do it
slightly differently, taking it in different directions depending on the
candidate's expertise or specialization.

It is closely aligned with what our core service does (distributing and
streaming files) and is a great chance to talk with the interviewee and figure
out where their strengths are.

I totally encourage other people to interview this way. It's what I've done at
the past few companies I've been at, and it has worked excellently: just take
a problem you're working on or have worked on, and distill it into an
interview problem.

~~~
staunch
I also like this style, but you have to be very mindful that you've thought
about the problem 1000x more than the candidate has. The Curse of
Knowledge[1] haunts the interview process. I haven't tried it much, but it
might be better to use a fresh problem in each interview, one that not even
you have seen before. Maybe selected from Stack Overflow.

1. [http://en.m.wikipedia.org/wiki/Curse_of_knowledge](http://en.m.wikipedia.org/wiki/Curse_of_knowledge)

------
jdp
Another cool project is bam[1], a constant key/value server built on similar
principles. It uses a single input file (in this case a TSV file instead of
the SPL log file) plus an index file. The cool thing about bam is that it uses
the CMPH[2] library to generate a minimal perfect hash function over the keys
in the input file before putting them in the index file.

[1]:
[https://github.com/StefanKarpinski/bam](https://github.com/StefanKarpinski/bam)
[2]: [http://cmph.sourceforge.net/](http://cmph.sourceforge.net/)
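For readers unfamiliar with minimal perfect hashing: given a fixed set of n keys, it produces a collision-free map onto exactly n slots, which is why such an index needs no chaining or probing. CMPH implements sophisticated constructions (BDZ, CHD, and others); the sketch below is not CMPH's algorithm, just a brute-force seed search that illustrates the property for a tiny, hypothetical key set:

```python
# Toy illustration of a perfect hash (NOT how CMPH works internally):
# brute-force a seed so that a salted hash maps n keys to n distinct
# slots. Real MPH constructions scale to millions of keys; a seed
# search like this only works for very small sets.
import hashlib

def slot(key: str, seed: int, n: int) -> int:
    # Salted 64-bit hash reduced to one of n slots.
    digest = hashlib.blake2b(key.encode(), salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % n

def find_perfect_seed(keys) -> int:
    n = len(keys)
    for seed in range(1_000_000):
        if len({slot(k, seed, n) for k in keys}) == n:
            return seed  # no two keys share a slot
    raise RuntimeError("no perfect seed found")

keys = ["thumb_1.jpg", "thumb_2.jpg", "thumb_3.jpg", "thumb_4.jpg"]
seed = find_perfect_seed(keys)
# The index needs exactly len(keys) slots and no collision handling.
table = [None] * len(keys)
for k in keys:
    table[slot(k, seed, len(keys))] = k
```

With the seed stored alongside the index, a lookup is one hash plus one slot read, which is what makes a single-disk-access lookup possible.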

~~~
StefanKarpinski
Wow, didn't expect this to make a mention on the front page of HN today. I
never did convince Etsy to let me deploy bam in production, but it's _so_
simple that it should be doable without much fuss. I mainly built it as a
proof-of-concept to show that serving static data does not have to be
difficult – and that loading large static data sets into a relational database
is a truly wasteful, terrible approach. Are you actually using bam "in anger"?

~~~
krka
Bam looks really interesting, definitely a lot simpler than Sparkey, and the
basic principle is the same. I have been hesitant to use perfect hashing for
Sparkey since I wasn't sure how well it holds up for really large data sets
(close to a billion keys). Impressive to write it in less than 300 lines of
clean code!

------
krka
There are a bunch of comments related to performance data - the code is
available, so nothing is stopping anyone from making an unbiased comparison. :)

That said, I intend to publish some sort of performance comparison code and
results. The downsides of me doing it are that 1) I know the Sparkey code
much better than I know LevelDB or any other solution, so the tuning
parameters will probably be suboptimal for the other solutions, and 2) I will
only focus on our specific use case (write large bulks, do lots of random
reads), which may seem a bit unfair to the more general solutions.

------
krka
Here are some preliminary performance benchmarks from my regular workstation
(Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz, 8 GB RAM):
[http://pastebin.com/7buZVgdu](http://pastebin.com/7buZVgdu)

The Sparkey usage is fairly optimized, but I just threw something together
for LevelDB, so consider the results extremely biased.

~~~
hyc_symas
Where's the source code for your bench? How large are the records you're
loading? How large are the keys? What's the insert order?

How does your test compare to
[http://symas.com/mdb/microbench/](http://symas.com/mdb/microbench/) ? If
you're going to try to talk about numbers, talk about them in a meaningful
context. Right now you're just handwaving.

~~~
krka
Yes, this post was handwaving, and I tried to make that clear ("preliminary",
"extremely biased").

On Monday I added some slightly more proper benchmark code; you can find it at
[https://github.com/spotify/sparkey/blob/master/src/bench.c](https://github.com/spotify/sparkey/blob/master/src/bench.c)

I didn't add the LevelDB code to this benchmark, however, since I 1) didn't
want to manage that dependency and 2) didn't know how to write optimized code
for it.

I'm using very small records: a couple of bytes each for key and value. The
insert order is strictly increasing (key_0, key_1, ...), though that doesn't
really matter for Sparkey since it uses a hash for lookups instead of ordered
lists or trees.

As for the Symas MDB microbench, I only looked at it briefly, but it seems
it's not actually reading the value it's fetching, only looking up where the
value is. Is that correct?

"MDB's zero-memcpy reads mean its read rate is essentially independent of the
size of the data items being fetched; it is only affected by the total number
of keys in the database."

Doing a lookup and not using the value seems like a very unrealistic use case.

Here's the part of the benchmark I'm referring to:

    for (int i = 0; i < reads_; i++) {
      const int k = rand_.Next() % reads_;
      key.mv_size = snprintf(ckey, sizeof(ckey), "%016d", k);
      mdb_cursor_get(cursor, &key, &data, MDB_SET);
      FinishedSingleOp();
    }

~~~
hyc_symas
It's a valid measurement of how long the DB takes to access a record. You can
assume that the time required for an app to use the values will be the same
across all DBs, so it's omitted. All of the microbench code behaves the same
on this point.

[http://symas.com/mdb/memcache/](http://symas.com/mdb/memcache/) and
[http://symas.com/mdb/hyperdex/](http://symas.com/mdb/hyperdex/) give results
where the records are transmitted across the network.

~~~
krka
Well, some DBs "load" the value in some way before handing it to the user, so
that time is implicitly measured for those types of DBs but not for others,
which I don't think makes for a particularly fair comparison. I think Tokyo
Cabinet gives you a pointer to newly allocated memory, at least for compressed
data (but I am not completely sure about this). Like LMDB, Sparkey also does
no processing of the value for uncompressed data, but for compressed data some
decompression needs to take place in the iterator buffer (I guess that's
equivalent to your cursor object). Even worse, if this is done lazily upon
value retrieval, the cost is completely hidden from the benchmark.

In any case, I think the easiest way to get a fair benchmark is to at least
iterate over the value, possibly also compare it. If that time turns out to be
significant (perhaps even dominant) compared to the actual lookup time, then
further optimization of the actual storage layer is pretty meaningless.
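The point about lazily hidden decompression can be made concrete with a small sketch (a hypothetical in-memory store, not Sparkey's or LMDB's actual code): a lookup that returns the stored blob untouched costs far less than one that decompresses and scans the value, and only the latter reflects what an application actually pays.

```python
# Sketch of the fairness argument: if values are stored compressed and
# decompressed lazily, a benchmark that never touches the value hides
# the decompression cost entirely. Hypothetical dict-backed "store".
import zlib

store = {f"key_{i}": zlib.compress(b"x" * 1000) for i in range(1000)}

def lookup_only(key):
    # What a lookup-only benchmark measures: the blob is never used.
    return store[key]

def lookup_and_scan(key):
    # What an application actually pays: decompress, then touch every byte.
    value = zlib.decompress(store[key])
    return sum(value)

blob = lookup_only("key_42")          # compressed bytes, cost hidden
checksum = lookup_and_scan("key_42")  # includes the real decompression cost
```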

~~~
hyc_symas
Have a look at
[https://github.com/hyc/sparkey/tree/master/src](https://github.com/hyc/sparkey/tree/master/src):
bench.c, bench_mdb.c, bench.out, bench_mdb.out.

This was run on my Dell M4400 laptop (Intel Q9300 2.53GHz quad-core CPU, 8GB
RAM). The maximum DB size is around 4GB, so this is a purely in-memory test.
Your hash lookup is faster than the B+tree, but with compression you lose the
advantage.

~~~
krka
Thanks, that's interesting data!

I am not sure why you changed the key format to "key_%09d" - is that an
optimization for LMDB, to make sure the insertion order is the same as the
internal tree ordering? If so, why is that needed for the benchmark?

I noticed that the wall time and CPU time for the Sparkey 100M benchmarks were
a bit disjoint; it would seem that your OS was evicting many pages or stalling
on disk writes. The Sparkey files were slightly larger than 4 GB while LMDB's
were slightly smaller, but I am not sure that really explains it on an 8 GB
machine.

I am not sure I agree about the non-linear creation time difference; the
benchmarks indicate that both Sparkey and LMDB are non-linear. The Sparkey
creation throughput went from 1206357.25 to 1109604.25 (-8.0%) while LMDB's
went from 2137678.50 to 2033329.88 (-4.8%).

Regarding the lookup performance "dropping off a cliff", I think that is
related to the large difference between wall time and CPU time, which
indicates a lot of page cache misses.

LMDB seems really interesting for large data sets, but I think it's optimized
for different use cases. I'd be curious to see how it behaves with more
randomized keys and insertion order. I didn't think of doing that in the
benchmark since Sparkey isn't really affected by it, but it makes sense when
benchmarking a B-tree implementation.

Sparkey is optimized for our use case, where we mlock the entire index file to
guarantee cache hits, and possibly also mlock the log file, depending on how
large it is.

The way you append to Sparkey (first fill up a log, then build a hash table
as a finalization step) is really useful when you need lots of memory while
building and can't afford random-seek file operations; at the end, when most
of the work is done and your memory is free again, you finalize the database.
Of course, you could do the same thing with LMDB: first write a log, then
convert it into an LMDB file.
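The two-phase build described above (append everything to a flat log, then finalize by building an index over it) can be sketched in a few lines. This is only an illustration of the pattern, not Sparkey's actual .spl/.spi format:

```python
# Sketch of the two-phase pattern (NOT the real Sparkey file format):
# phase 1 appends length-prefixed records to a flat log; phase 2 scans
# the log once and builds a key -> offset index, the analogue of
# writing the separate index file.
import io
import struct

def append_record(log: io.BytesIO, key: bytes, value: bytes) -> None:
    log.write(struct.pack("<II", len(key), len(value)))
    log.write(key)
    log.write(value)

def build_index(log_bytes: bytes) -> dict:
    index, pos = {}, 0
    while pos < len(log_bytes):
        klen, vlen = struct.unpack_from("<II", log_bytes, pos)
        key = log_bytes[pos + 8 : pos + 8 + klen]
        index[key] = pos  # later records shadow earlier ones
        pos += 8 + klen + vlen
    return index

def get(log_bytes: bytes, index: dict, key: bytes) -> bytes:
    pos = index[key]
    klen, vlen = struct.unpack_from("<II", log_bytes, pos)
    return log_bytes[pos + 8 + klen : pos + 8 + klen + vlen]

log = io.BytesIO()
append_record(log, b"track:1", b"abba")
append_record(log, b"track:2", b"ace of base")
append_record(log, b"track:1", b"roxette")  # overwrite: last write wins
data = log.getvalue()
index = build_index(data)
```

Because later records simply shadow earlier ones during the index scan, the append phase never needs a random seek; all the random access is deferred to the finalization step.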

Thanks for taking the time to adapt the benchmark code to lmdb, it's been very
interesting.

~~~
hyc_symas
Yes, I changed the key format to allow using the MDB_APPEND option for bulk
loading. (That's only usable in LMDB for sequential inserts.) Otherwise, for
random inserts, things will be much slower. (Again, refer to the microbench to
see the huge difference this makes.) If you don't have your data ordered in
advance then this comparison is invalid, and we'd have to just refer to the
much slower random insert results.

Still don't understand what happened to Sparkey at 100M. The same thing
happens using snappy, and the compressed file size is much smaller than
LMDB's, so it can't be page cache exhaustion.

Also suspicious of the actual time measurements. Both of these programs are
single-threaded, so there's no way the CPU time measurement should be greater
than the wall-clock time. I may take a run at using getrusage and gettimeofday
instead; these clock_gettime results look flaky.

~~~
krka
Could be due to a bug related to reading uninitialized data on the stack. That
could lead to using the wrong number of bits for the hash, causing an
unnecessarily high number of hash collisions, which makes lookups more
expensive due to false positives that need to be verified. I think it's fixed
in the latest master, and the benchmark code now prints the number of
collisions per test case, which could be useful debug data.
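The effect of such a bug is easy to illustrate: truncating the hash to too few bits crowds the keys into few slots, and every extra key landing in an occupied slot is a potential false positive that must be verified against the full key in the log. A small sketch (hypothetical hash truncation, not Sparkey's actual addressing scheme):

```python
# Sketch of why using too few hash bits hurts: collisions force extra
# verification reads. Counts how many keys share a slot when the hash
# is masked down to a given number of bits.
import hashlib
from collections import Counter

def truncated_hash(key: str, bits: int) -> int:
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "little")
    return h & ((1 << bits) - 1)

def collisions(keys, bits: int) -> int:
    counts = Counter(truncated_hash(k, bits) for k in keys)
    return sum(c - 1 for c in counts.values())  # extra keys per occupied slot

keys = [f"key_{i}" for i in range(10_000)]
few_bits = collisions(keys, 10)   # only 1024 possible slots: heavy collisions
many_bits = collisions(keys, 24)  # ~16M possible slots: almost none
```

With 10,000 keys and only 10 hash bits there can be at most 1,024 distinct slots, so almost every lookup needs collision resolution; with 24 bits, collisions all but vanish.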

Also, I think it would be more interesting to see a comparison with lmdb using
random writes instead of sequential.

As for the CPU time measurement, the wall clock is very imprecise, so it could
be some small quantum larger than the CPU time, but it should never be more
than the system-specific wall clock quantum.

~~~
hyc_symas
re: random insert order - if we just revert to the original key format you'll
get this:
[http://www.openldap.org/lists/openldap-devel/200711/msg00002.html](http://www.openldap.org/lists/openldap-devel/200711/msg00002.html)
It becomes a worst-case insert order. If you want to do an actual random
order, with a shuffled list so there are no repeats, you'll get something like
the September 2012 LMDB microbench results. If you just use rand() and don't
account for duplicates you'll get something like the July 2012 LMDB
microbench results.

------
atdt
I was surprised to see LevelDB
([https://code.google.com/p/leveldb/](https://code.google.com/p/leveldb/))
missing from the list of storage solutions you tried, because it seems optimal
for your use case. Were you aware of it?

~~~
blippie
I'm not sure it's an optimal use-case match. Sparkey is for "mostly static"
datasets where on-disk structures are generated by a batch process and pushed
to servers that provide read-only access to the data for consumers.

LevelDB, on the other hand, supports concurrent writes and provides features
for data consistency and cheap gradual reindexing.

~~~
illumen
Also, Sparkey seems to work well with BitTorrent/rsync distribution. I recall
Spotify uses BitTorrent to distribute files to their servers.

------
rschmitty
I wish more projects would follow this kind of README format, at least
somewhat. So many new things pop up on HN with very little information about
the whats and whys I should care about.

What problem are you solving?

If existing solutions existed, what hurdles did you face with them, and how
did you overcome them with your custom solution?

How do you compare from a performance view? (Granted, they still need to do
this, but at least put in a section about it.)

------
huhtenberg
Looks like a cdb[0] variation that moves the index to a separate file, and
therefore allows changing the database (to a degree) without requiring a
rebuild.

[0]
[http://en.wikipedia.org/wiki/Cdb_%28software%29](http://en.wikipedia.org/wiki/Cdb_%28software%29)

~~~
js2
Indeed, from the README: _We used to rely a lot on CDB (which is a really
great piece of software). It performed blazingly quick and produces compact
files. We only stopped using it when our data started growing close to the 4
GB limit_

~~~
ptramo
A 64-bit port of cdb isn't too hard:
[https://github.com/pcarrier/cdb64](https://github.com/pcarrier/cdb64)

------
jflatow
Similar also to DiscoDB, which does support compression and uses perfect
hashing for constant-time lookup with minimal disk access.

Not only that, but it provides lightning-fast conjunctive normal form queries,
a.k.a. logical combinations of primitive keys. Plus it has Python and Erlang
bindings.

[http://discodb.rtfd.org](http://discodb.rtfd.org)

[https://github.com/jflatow/discodb](https://github.com/jflatow/discodb)
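A CNF query is an AND of OR-clauses over keys, e.g. (rock OR pop) AND swedish. The semantics can be sketched with plain Python sets over a hypothetical inverted index (DiscoDB's actual implementation operates on its compressed on-disk structures, not sets like these):

```python
# Sketch of CNF query semantics over an inverted index: OR within a
# clause (union), AND across clauses (intersection). Example data is
# made up for illustration.
from functools import reduce

index = {
    "rock":    {1, 2, 3},
    "pop":     {3, 4},
    "swedish": {2, 3, 4},
}

def cnf_query(index, clauses):
    # Each clause is a list of keys: union the posting sets within a
    # clause, then intersect the per-clause results.
    unions = [set().union(*(index.get(k, set()) for k in clause))
              for clause in clauses]
    return reduce(set.intersection, unions) if unions else set()

hits = cnf_query(index, [["rock", "pop"], ["swedish"]])
# (rock OR pop) AND swedish
```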

~~~
seiji
Yeah, my first two thoughts were DiscoDB and Bitcask
[http://basho.com/hello-bitcask/](http://basho.com/hello-bitcask/) too.

------
wyuenho
I'm baffled by the choice of using the GNU autotools chain just to include a
Doxygen target in the Makefile. The whole thing is essentially straight-up C
with just one library dependency.

The command-line argument processing is also quite haphazardly done; it's not
as if using getopt or whatever poses compatibility issues. Is writing and
packaging with a plain Makefile that difficult?

~~~
tinco
They wrote a database to solve an operational need. From experience I can tell
you that's an endeavour you should strive to spend as little time on as
possible.

I think it's a miracle they produced something they feel comfortable sharing
with the world. If you write a database in-house and the toolchain and the
argument processing are the only things done haphazardly, then hats off to you
:)

~~~
wyuenho
Whoa, I sense passive-aggressiveness :) I'm still waiting for those
benchmarks. The code is very clean and simple; I was just nitpicking. The use
case seems overly specific, though. Are there any other example use cases
where this library could be useful?

------
jetz
There is also LMDB ([http://symas.com/mdb/](http://symas.com/mdb/)), a storage
solution for read-heavy workloads and an alternative to LevelDB or CDB.

------
slynux
This looks interesting. It could be used for deduplication by keeping the hash
table on disk for large amounts of data.

I wrote something similar to heavily optimize disk seeks, by returning an
8-byte reference and keeping a hashtable in memory: a mostly append-only
record store that allows mutations of the same key and rounds blob sizes up to
powers of 2. It was written to optimize the storage layer for Membase.

[https://github.com/t3rm1n4l/lightkv](https://github.com/t3rm1n4l/lightkv)
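The power-of-two rounding mentioned above is a classic size-class trick: a freed slot can later be reused by any blob in the same size class, at the cost of some internal fragmentation. A minimal sketch of the rounding (assumed semantics, not lightkv's actual code):

```python
# Round a blob size up to the next power of two, i.e. the size class
# its slot belongs to. Worst case wastes just under half the slot.
def size_class(n: int) -> int:
    if n <= 1:
        return 1
    return 1 << (n - 1).bit_length()

classes = [size_class(n) for n in (1, 8, 9, 1000)]
```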

------
krka
I have now created a very simple benchmark suite to give you some rough
performance numbers, and updated the README to include some sample numbers for
one specific machine.

------
rythie
I'm struggling to find something it does that a webserver pointed at the
filesystem (with hash IDs for file names) doesn't do. I'm wondering if that's
all it is, with a bit of logic to write the files in the correct structure.

~~~
sp332
From the description: _Sparkey is an extremely simple persistent key-value
store. You could think of it as a read-only hashtable on disk and you
wouldn't be far off._

~~~
rythie
Good point, I missed that. I read the feature list, which lists things the
filesystem does itself.

------
kirbyk
This goes to prove that simple is fast.

