
Gorilla: A Fast, Scalable, In-Memory Time Series Database [pdf] - orrsella
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
======
scurvy
A few things here:

1) What/where exactly are they using GlusterFS for? Has Gluster fixed their
scaling problems yet? Specifically the issue where new storage spaces/nodes
were only available to new directories and files, but not existing
directories? Granted, the last time I looked at this was 2009 or so, but it
was a flaw due to their "no master node" topology.

2) FB has an entire team to manage Hadoop/HBase. This shows just how much of a
beast that stack is. Anyone who has run Hadoop on "Internet time" knows what
I'm talking about. It's great at running time-insensitive, deferred compute
jobs in an academic or scientific setting. It's really hard to keep it all
100% running in an on-demand setting. As an aside, I couldn't imagine working
on just one product in an operations setting as my full-time job.
Boredom/fatigue must be a problem on that team.

3) I'd like to see more information on the networking side. What transport
protocol? How large are the average updates in frame size? Etc etc.

We've built something similar to Gorilla in-house, so I'm happy to see that
we've come to some of the same conclusions.

~~~
linuxhansl
For #2, show me any system that holds over 2PB of data on a large set of
machines that does not need a team to be managed.

~~~
scurvy
I meant dedicated team. Most places run shared ops teams. Every place I've
seen that runs a "big Hadoop" deployment also has a dedicated "Hadoop team" to
go with it.

It's pretty easy to build a 2PB storage system on Ceph that the average group
of sysadmins can run.

------
saosebastiao
I really wish this included a comparison with KDB. It's not cheap to get a
license, and they certainly wouldn't give a testing license in order to
publish benchmarks against it, but in finance it is the standard for TSDBs.
There hasn't ever been anything open source that has come close.

~~~
beagle3
They can download a 32-bit version of kdb+ for evaluation purposes. I'm not
sure whether they could publish any benchmarks, and it obviously wouldn't
properly represent the speed (and capacity) of the 64-bit version.

But I suspect that even the 32-bit kdb+ is going to be significantly faster
than Gorilla.

~~~
iskander
I briefly worked with K and Q while doing research on high-level numerical
computing. I found their claimed efficiency to be severely exaggerated. K is a
very naively implemented array-oriented language and I found it to be slower
than Matlab or NumPy for many tasks.

~~~
tlack
You should consider writing up something with examples. Given how it's
implemented (unboxed, directly typed, mmap'd arrays), it's hard to see how
much slack there could be in it. I and many others are beginning to experiment
with Q and kdb+, and your learnings might spark valuable debate - and perhaps
save time. :)

~~~
iskander
I wouldn't voluntarily touch K/Q ever again, but if someone else is interested
in doing a blog post: try implementing any iterative machine learning
algorithm. The separate evaluation of primitive operators requires creating
many array temporaries; Matlab's JIT can fuse those away, while NumPy provides
a richer set of compiled functions to work with. K/Q lets you (tersely)
express complex array computations using just its core operators, but all
those array allocations add up to comparatively bad performance.
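
For what it's worth, here's a small NumPy illustration (my own sketch, not
K/Q) of the temporaries issue - each primitive operator materializes a fresh
array unless you opt into in-place variants:

```python
import numpy as np

a = np.ones(1_000_000)
b = np.full(1_000_000, 3.0)

# Naive expression: each operator allocates a temporary array
# (t1 = a * b, t2 = t1 + a, result = t2 / 2) before the next runs.
result = (a * b + a) / 2.0

# In-place variants reuse one buffer, trading clarity for fewer
# allocations; a JIT (like Matlab's) can fuse the whole chain instead.
out = a * b
out += a
out /= 2.0
```

Either way the math is the same; the difference is three allocations versus
one.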

~~~
beagle3
Do you remember which version and word length you used of each piece of
software, in case I get a chance to do comparisons?

~~~
iskander
It was ~4 years ago on a quad-core Xeon with ~8GB of memory, using whatever
were the most recent versions of Q, Python/NumPy, and Matlab at the time.

------
nwmcsween
Why pointers? Why not just do a mirror mmap if you have constant offsets? And
if time points change and queries by time point need to be constant-time,
maybe a table that holds an offset with the difference? Also, why not atomics
instead of spinning?

~~~
scurvy
I also saw the spinlock and immediately thought: why?

------
rodionos
> Further, many data sources only store integers into ODS

If the underlying data type is 64 bit double, aren't they losing precision for
integers greater than 2^53?
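
A quick Python check of that boundary (my own illustration): IEEE-754 doubles
have a 53-bit significand, so exact integer representation ends at 2^53.

```python
big = 2 ** 53  # 9007199254740992

# Above 2**53, not every integer has an exact double representation:
assert float(big) == float(big + 1)   # big + 1 rounds back down to big

# At or below 2**53, integers are still exact:
assert float(big - 1) != float(big)
```

So yes: any integer source emitting values above 2^53 would silently lose
precision if stored as a 64-bit double.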

------
thrusong
So this isn't managing news feed data or anything like that; it's helping them
aggregate server performance and error data for quick lookup?

------
rodionos
Has anyone attended a VLDB conference recently? How is it different from
Strata, for example?

P.S. Their choice of venues is nice.

------
pdarshan
A few folks from FB started this company called Interana, and they seem to be
doing the same thing.

------
simpsond
The compression is very neat.
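
Agreed. As a rough illustration (my own simplified Python sketch, not the
paper's actual encoder), the core of the value compression is XORing each
float's bit pattern against its predecessor; the real encoder then writes
leading/trailing-zero counts as variable-length control bits:

```python
import struct

def float_bits(x):
    # Reinterpret a float64 as a 64-bit unsigned integer.
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_deltas(values):
    # XOR each value's bit pattern with its predecessor; repeated or
    # slowly varying values yield deltas that are mostly zero bits,
    # which a bit-level encoder can store very compactly.
    bits = [float_bits(v) for v in values]
    return [prev ^ cur for prev, cur in zip(bits, bits[1:])]

deltas = xor_deltas([12.0, 12.0, 12.0, 24.0, 24.0])
# Identical neighbors XOR to exactly 0; only the 12.0 -> 24.0
# transition produces a nonzero delta.
```

Since monitoring series are mostly flat or slowly drifting, most deltas
compress to a bit or two, which is where the paper's big savings come from.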

