

Ask HN: How can I use memcached as in-memory index - zaplata

I have 5TB of data and 20 machines with 16GB of RAM each. I want to create an in-memory index of this data (a financial time series that looks like <timestamp: tick values>) so that I can query it by time range very quickly (within hundreds of milliseconds).

I thought of using some distributed key-value store for the in-memory index, but I'm guessing it is impossible to construct something like a traditional B-tree index out of a hash table, so the indexing must be done differently than in typical databases. What are some simple ways to use memcached, or another distributed in-memory key-value store, for in-memory indexing of sequentially ordered data at this scale?
======
willvarfar
Different approaches take different levels of time investment on your part;
here are some options:

Your focus is on the data structure for the index, whereas I'm thinking that
unless the data itself that the index points to is also in RAM, it's going to
be hard to get millisecond results anyway.

1) if you are repeatedly accessing the same subset of your data, rather than
making continuous random reads across the whole data set, then caching that
set can help:

1a) if you have some file or DB on each of your machines, then the kernel will
automatically cache regularly read bits and it'll all end up quite fast. You
could use any database of your choice for this, with 'sharding'.

1b) you could run memcached and the DBs on each of your machines; to be
honest, letting the OS do the caching is likely a better idea.
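
For example, here's a minimal sketch of 1a in Python (assuming one sqlite file
per machine, a table ticks(ts, value), and the time range split evenly across
20 shards; all the names, dates and file paths are made up):

    # sketch: route a time-range query to the shard(s) that own it,
    # assuming the 5TB is split into 20 contiguous time ranges.
    import sqlite3

    EPOCH_START = 1230768000      # assumed start of the series
    EPOCH_END   = 1262304000      # assumed end of the series
    N_SHARDS    = 20
    SPAN        = (EPOCH_END - EPOCH_START) // N_SHARDS

    def shards_for(t0, t1):
        """Which machines hold data for [t0, t1)?"""
        first = (t0 - EPOCH_START) // SPAN
        last  = (t1 - 1 - EPOCH_START) // SPAN
        return range(max(first, 0), min(last, N_SHARDS - 1) + 1)

    def query_range(t0, t1):
        rows = []
        for shard in shards_for(t0, t1):
            # in reality this would be a network call to machine `shard`;
            # here each shard is just a local sqlite file. The OS page
            # cache keeps the hot parts of these files in RAM for free.
            db = sqlite3.connect('ticks-%02d.db' % shard)
            rows += db.execute(
                'SELECT ts, value FROM ticks WHERE ts >= ? AND ts < ?',
                (t0, t1)).fetchall()
            db.close()
        return rows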

2) if you want to continuously scan the whole data set, then you are going to
be I/O bound.

2a) therefore, if you can _compress_ your data set - find a space-efficient
representation so it fits into 250GB of RAM instead of 5TB of disk - then you
can avoid the disk and it'll fly. You could consider storing deltas and then
arithmetic or Huffman coding behind some low-order predictor; ask on the
compression forum at <http://encode.ru/> - you'll have to tell them much more
about your data, and perhaps post a few sample lines. Compression can be done
on small blocks of data, so you only have to decompress a small amount to get
to the random value you want.
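
To give a feel for what I mean, a toy sketch (stdlib only: delta-encode each
fixed-size block of ticks and zlib it, so a random read only pays for
decompressing one block; a real predictor + arithmetic coder would do much
better):

    # sketch: block-wise delta encoding + zlib, so random access only
    # decompresses one small block, not the whole data set.
    import struct, zlib

    BLOCK = 4096  # ticks per block; tune so one block decompresses fast

    def compress_block(ticks):
        """ticks: a sorted list of (timestamp, value) pairs."""
        out, prev_ts = [], 0
        for ts, val in ticks:
            out.append(struct.pack('<qd', ts - prev_ts, val))  # delta ts
            prev_ts = ts
        return zlib.compress(b''.join(out))

    def decompress_block(blob):
        raw, ticks, prev_ts = zlib.decompress(blob), [], 0
        for i in range(0, len(raw), 16):   # 16 bytes per packed tick
            dts, val = struct.unpack('<qd', raw[i:i+16])
            prev_ts += dts
            ticks.append((prev_ts, val))
        return ticks

    # keep an in-RAM index of (first timestamp of block -> blob) per
    # machine; a range query binary-searches the block starts, then
    # decompresses just the blocks that overlap the range.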

2b) or you might find that you only need to generate some statistics from the
data once, and then work with the derived data - presumably small enough to
fit in your RAM - from then on.
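
A sketch of 2b: one pass over the raw ticks to build, say, per-minute OHLC
bars, which you then keep in RAM (assuming ticks arrive as (timestamp, price)
pairs, sorted by timestamp):

    # sketch: one pass to reduce ticks to per-minute OHLC bars;
    # 5TB of ticks becomes a derived data set that fits in RAM.
    def ohlc_bars(ticks, bucket=60):
        bars = {}  # minute -> [open, high, low, close]
        for ts, price in ticks:
            minute = ts - ts % bucket
            bar = bars.get(minute)
            if bar is None:
                bars[minute] = [price, price, price, price]
            else:
                bar[1] = max(bar[1], price)   # high
                bar[2] = min(bar[2], price)   # low
                bar[3] = price                # close
        return bars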

3) you could buy bigger machines, or more RAM

------
bl4k
I would use redis sorted sets and shard across servers in your application.

1. Get the server ID based on the timestamp. You store a starting time and a
server ID, then look up which server your data is on based on the start of
your range.

2. Use sorted sets to store your data, using the timestamp as the score.
These are really fast in redis:
<http://code.google.com/p/redis/wiki/ZaddCommand>
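
A rough sketch of both steps with redis-py (the hosts and the time-based
shard map are made up, ticks are assumed to be serialized strings, and note
that zadd's signature has varied between redis-py versions):

    # sketch: shard by time range, store ticks in a sorted set with the
    # timestamp as the score, query with ZRANGEBYSCORE.
    import redis

    # assumed: each server owns a contiguous slice of the time range,
    # listed as (host, first timestamp it owns), in ascending order
    SHARDS = [('10.0.0.%d' % i, 1230768000 + i * 86400 * 18)
              for i in range(20)]

    def server_for(ts):
        host = SHARDS[0][0]
        for h, start in SHARDS:
            if start <= ts:
                host = h
        return redis.Redis(host=host)

    def add_tick(ts, tick):
        # recent redis-py: zadd(name, {member: score})
        server_for(ts).zadd('ticks', {tick: ts})

    def range_query(t0, t1):
        # NB: if [t0, t1] spans a shard boundary you'd have to query
        # every server in that range and merge the results
        return server_for(t0).zrangebyscore('ticks', t0, t1)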

The other option is to check out Riak, which will do the distributed part for
you (e.g. a distributed hash table with no master), but it doesn't have the
data structures that redis has (it is just key=>value).

~~~
willvarfar
does redis support storing more data than it has RAM?

~~~
bl4k
Yes - disk swapping was added in a recent release, see:

<http://antirez.com/post/redis-virtual-memory-story.html>

