
What database would you use to store (min) 10 billion objects? - imajes
I'm building something that's going to store 2 billion media objects in the short term, with each object having 10+ related objects on top. Think crawler with some interesting extrapolations/filters/etc applied.

Would want to be able to power multiple views on the data, mostly in aggregate form.

This is all greenfield: the data exists, but I'm starting from scratch to crawl and reformat it. I'm interested to know how you might consider storing the info?

Thanks :)
======
ars
Are there updates to the data, or is it append/delete only?

Does each object have a fixed size? (Or can be a fixed size with padding?)

What kind of index do you need? Do you need advanced SQL commands, like
aggregate data, and complex joins?

MySQL can handle billions of rows
(<http://dev.mysql.com/doc/refman/5.1/en/features.html>) - but it may be
overkill for you. Pretty much any database can.

But that's the wrong question. The right question is what do you need the
database to do. Mainly what kind of index, and what kind of joins.

PS. If all you need is the aggregate data, why store the details? Just store
the aggregate and update it on the fly. (Presumably store the details offline,
so you can regenerate the aggregate if needed.)

Read this: <http://news.ycombinator.com/item?id=594889> for updating the
aggregates on the fly. And this: <http://news.ycombinator.com/item?id=587723>
for one type of system for storing large numbers of objects.
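
The "just store the aggregate and update it on the fly" idea fits in a few lines. This is a hypothetical sketch - the object shape, tag and score fields are made up for illustration:

```python
from collections import defaultdict

class RunningAggregate:
    """Keep only running totals per tag; drop (or archive offline) the raw object."""
    def __init__(self):
        self.count = defaultdict(int)    # objects seen per tag
        self.total = defaultdict(float)  # e.g. summed score per tag

    def add(self, obj):
        # Fold one crawled object into the aggregates, then let it go.
        for tag in obj["tags"]:
            self.count[tag] += 1
            self.total[tag] += obj.get("score", 0.0)

agg = RunningAggregate()
agg.add({"tags": ["buttons"], "score": 2.5})
agg.add({"tags": ["buttons", "red"], "score": 1.0})
print(agg.count["buttons"])  # 2
```

As long as the raw details are archived somewhere cheap, the aggregates can always be regenerated if the schema of "what to count" changes.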

~~~
imajes
96% of change is append. 0.0001% would be delete.

Sorts of views might be: "show me objects that have lat xx and long yy but
tagged "buttons" and created on mm/dd/yyyy" in the basic instance, but also
"show me objects that my friends like" as a more complex example.

Makes me think that what I really need is a graph db - object relationships
seem to be the key factor.

~~~
ars
So you do have updates? Because updates make a database more complicated.

If you need lat/lng queries then you need an R-tree, and very few databases
come with that.

Sounds to me like you're going to have to write your own database - or, more
accurately, your own index.

Basically store data as append only, and handle change as delete/append. Store
just the offset to the data as the key.
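
A minimal sketch of that append-only scheme, assuming records live in a single log file and the index stores only (offset, length) pairs; an update is a fresh append plus a pointer swap:

```python
import os

class AppendLog:
    """Append-only record log; the 'key' kept in the index is (offset, length)."""
    def __init__(self, path):
        self.f = open(path, "ab+")

    def append(self, data: bytes) -> tuple:
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(data)
        self.f.flush()
        return (offset, len(data))   # store this as the index value

    def read(self, offset: int, length: int) -> bytes:
        self.f.seek(offset)
        return self.f.read(length)

log = AppendLog("/tmp/objects.log")
key = log.append(b"media-object-v1")
print(log.read(*key))  # b'media-object-v1'
```

A delete is then just dropping the index entry; reclaiming the dead space in the log is a separate compaction pass.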

I really hope your index can fit on one machine.

Do it like the facebook link I gave you earlier.

You're going to have a hard time with index merges (lat/lng merged with tags).

Maybe a better idea is don't store the object in the database, but just the
offset to it (plus machine ID).

Then use any commercial database that has an R-Tree.

You have 200GB to 1PB of data to store, which needs at least 25 machines.

But if you can store just the index, and not the data, maybe you can store the
index on just one machine - that will make things easier. (Obviously you'll
have multiple machines with the same index for load balancing.)

This might help also: <http://www.ddj.com/184410998> (instead of an R-tree,
you might be able to use a regular index).
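
One common way to get spatial queries out of a regular index (possibly what that article describes) is to quantize lat/lng and interleave their bits into a single Morton/Z-order key, so nearby points tend to get numerically close keys and a plain B-tree range scan can serve coarse bounding-box queries. A sketch, with bit widths chosen arbitrarily:

```python
def morton_key(lat: float, lng: float, bits: int = 20) -> int:
    """Quantize lat/lng to `bits` bits each, then interleave them."""
    y = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    x = int((lng + 180.0) / 360.0 * ((1 << bits) - 1))
    key = 0
    for i in range(bits):                  # interleave bits of x and y
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Nearby points produce close keys; distant points don't.
a = morton_key(40.7128, -74.0060)   # New York
b = morton_key(40.7130, -74.0058)   # a block away
c = morton_key(-33.8688, 151.2093)  # Sydney
assert abs(a - b) < abs(a - c)
```

The caveat is that a Z-order range scan can include points outside the query box, so you still filter the candidates against the exact lat/lng afterward.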

~~~
fhars
PostgreSQL has spatial indexing capability, too. So you can do something like
(the exact syntax for spatial ops escapes me right now):

    
    SELECT DISTINCT obj
    FROM media NATURAL JOIN mediatags
    WHERE (location && somebox)
      AND tag IN
          (SELECT tag FROM usertags WHERE user = 137)
    
You will probably have to spend some quality time with the query planner
figuring out how to set up your indexes so the common queries will be fast,
but it should be possible.

(EDIT: fighting with the formatting. Why is there no preview?)

------
iigs
This type of question is really too vague to answer without knowing what the
data is and exactly what you want to do with it.

How much of the 20k of data items and 100k related objects do you actually
have to scan/process over? If the 20k item is an opaque block like a
photograph, for example, you probably aren't going to do much analysis of the
block often -- you might pull the header or citation data and other related
information down into 1k or less of useful data per item. On the other hand,
if either the 20k or 100k are long form plain text, you might be able to
distill it down to keywords to focus the data for efficiency.

When you say you want to do "multiple views on data, mostly in aggregate
form", do you mean like OLAP cubes and other common reporting/analysis
metrics, or more like searchy stuff where you might be interested in a needle
in a haystack?

How do your users expect to interact with the data? If you were to
sequentially scan the entire uncompressed data set using StorageReview's
current fastest disk, the Seagate 15k.5 300GB, it would take over 21 disk-DAYS
(at the (min+max)/2 transfer rate) - and that's contrived, and unlikely to be
attainable in a real-world scenario. It would take about 700 disks to hold it
all, which would bring the wall time down to about 45 minutes. So naively
walking the data for search purposes is out. It would have to be indexed.
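
For reference, the back-of-envelope math behind those figures, assuming a sustained transfer rate of roughly 73-125 MB/s for that disk (my assumed numbers, not a quoted spec):

```python
# Sequential-scan time for ~200 TB of data on 15k-RPM spindles.
DATA_BYTES = 200e12                  # ~200 TB total
DISK_BYTES = 300e9                   # 300 GB per disk
RATE = (73e6 + 125e6) / 2            # (min+max)/2 transfer rate, bytes/s

one_disk_days = DATA_BYTES / RATE / 86400
disks_needed = DATA_BYTES / DISK_BYTES
parallel_minutes = one_disk_days * 24 * 60 / 700

print(round(one_disk_days))      # ~23 disk-days on a single spindle
print(round(disks_needed))       # ~667 disks just to hold the data
print(round(parallel_minutes))   # ~48 minutes across 700 spindles
```

Either way the conclusion holds: full scans are a batch operation measured in tens of minutes even with hundreds of disks, so interactive queries need an index.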

Indexing is a tricky problem. In small databases, indexes are usually so
compact and help so much that nobody pays any attention to the resources
required to maintain them. With this dataset that's unlikely to be the case:
you may effectively need several times the size of the data in indexing alone,
which is going to have drastic implications for the computing resources
required and the data storage format needed.

700 of those Seagate disks at roughly $200 street price is $140k. That's
before simple redundancy, systems, cases, racks, power, logical redundancy for
performance or extra space for derived data. Making a system to manipulate
this much data is easily $500k, and paying a commercial vendor to provide you
systems, hardware, and possibly software to do the job could be an order of
magnitude more. For that kind of money I'd spend a few thousand hiring some
people you trust to give you a reasonable assessment of the task at hand. It
sounds like fun, though. Good luck. :)

------
scumola
That's approximately 200TB of data. I suggest sharding the data across
multiple machines, because a single SAN doing BDB lookups on a single file of
that size is still going to be a tad slow. Most large databases nowadays are
divided by sharding, and since it's key/value data, BDB is probably one of
the top solutions. If you modify a lot of the same data frequently, memcached
(also an in-memory key/value solution) can speed things up, but the size of
the problem in total bytes may be more of a problem than the actual lookup and
indexing of the data.

~~~
jaydub
_but the size of the problem in total bytes may be more of a problem than the
actual looking up and indexing of the data._

One cool property of such a large data-set (~200TB) is that you are pretty
sure to see a lot of the same data repeated.

It'd be neat to try to reduce the overall data footprint by assigning some
sort of signature to repeats. I.e., if you are storing the sequences (let's
imagine for a sec that these are non-trivial sizes) ABC, ABCD, ABCDE and
ABCDEF, replace ABC with, say, #1, etc., and perhaps save a whole lot of
space.
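
A toy sketch of the signature idea: content-address each chunk by its hash and store repeated chunks only once. Fixed-size chunking here is naive (real dedup stores pick smarter chunk boundaries), and the 4-byte chunk size is just for the demo:

```python
import hashlib

class DedupStore:
    """Store each unique chunk once, keyed by its hash signature."""
    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}   # signature -> chunk bytes

    def put(self, data: bytes) -> list:
        """Return the list of signatures; repeated chunks cost no extra space."""
        sigs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            sig = hashlib.sha1(chunk).hexdigest()
            self.chunks.setdefault(sig, chunk)
            sigs.append(sig)
        return sigs

    def get(self, sigs: list) -> bytes:
        return b"".join(self.chunks[s] for s in sigs)

store = DedupStore()
r1 = store.put(b"ABCDABCD")         # "ABCD" is stored only once
r2 = store.put(b"ABCDEFGH")         # only "EFGH" is new
assert store.get(r1) == b"ABCDABCD"
print(len(store.chunks))            # 2 unique chunks: ABCD, EFGH
```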

~~~
planck
This is essentially the LZ dictionary-substitution step of the ZIP (DEFLATE)
compression algorithm.

~~~
webignition
But then you might end up not being able to see the forest of data for all the
Huffman trees.

------
kvs
Forget the database. Figure out:

- What data do you have? (I think you know this.)

- How will it be accessed?

Then, figure out what's the best way to store the data so that access
(includes, read/writes) is efficient. Essentially, at this point you're trying
to figure out what's the most efficient way to organize the data.

Then, think of where to store this organization. You may end up using more
than one type of database, for that matter, but at least at this point you can
ask more pointed questions like: "Hey, anyone know the best way to store
long/lats?" or "Hey, anyone know the best way to store tags for 10 billion
objects?" etc.

------
antirez
Any KV store can do the trick; just make sure to use multiple machines via
some form of partitioning (for example, consistent hashing).

Redis is reported to hold 300 million keys on a single 64-bit machine with 64
GB of RAM, so you'd need around 7-10 Redis servers in this configuration: one
entry for every top-level object, with the value including the serialized
versions of the other 10 objects. All this assuming that the values will not
be too big.
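
A minimal consistent-hashing sketch of that partitioning: hash each node onto a ring (with virtual nodes to smooth the distribution), then route each key to the first node clockwise from its hash. Node names are hypothetical:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{n}:{v}"), n)
            for n in nodes for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[i][1]

ring = HashRing([f"redis{i}" for i in range(10)])
print(ring.node_for("object:12345"))  # a given key always maps to the same node
```

The payoff over plain modulo is that adding or removing a server only remaps the keys that fell on its ring segments, rather than reshuffling nearly everything.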

------
nfwatson
I don't know what your definition of a "media object" is, but I'll assume:

  * you have audio, video, and/or image files

  * each media file has a name/media_UID

  * your "related objects" are more or less fixed-format small records with elements of type 'string', 'uint', 'int', etc.

  * your "related objects" might be metadata attached to the media, info resulting from processing the media, related info from the context in which the media was found, or dynamic info about how the media was used or referenced

If money were available I would do the following:

    
    
  * buy a 1-big-table log aggregator like SenSage
    (http://www.sensage.com) (distributed Linux-based
    redundant large data storage/query engine)
  * define a single DB table with the media_UID as
    key and with all columns defined for all related
    objects (note: I'm assuming a fixed column set
    for each "related object") ... with the understanding
    that any given row may or may not contain a
    "related object" of a given type
  * I'd take the (relatively) static data for each
    media file (e.g., media_UID, file size, file name,
    media type, ..., # of unique faces recognized in
    the media, make-up-your-own-field-here) and
    insert it once, with NULLs for the other dynamic
    "related object" fields
  * for dynamic info I'd insert the media_UID and relevant
    fields plus a timestamp for the dynamic event
  * ... and after this you'd have a queryable data set
    that's constantly evolving
  * you could dynamically update the schema as you need
    more "related objects" or need to extend their fields
    

I would buy an EMC Centera array (integrated with SenSage for archival) and
use it also to store the actual media, keyed by media_UID.

After you've done this you can periodically run full-table SQL/Perl scans to
aggregate the info you need -- that's what the SenSage tool is built for, and
it does it blindingly fast. You could expose the aggregations as full data
sets in a Postgres DB if they're needed multiple times ... or as throwaway
dynamic results if required just once.

If there's less money I'd try to replicate the data store and aggregation in
Hadoop or something similar.

As for SenSage speed ... European/US/international telcos/ISPs use it to store
call-data-records and IP-records, scanning billions of records in minutes when
law enforcement demands the info.
<http://news.prnewswire.com/DisplayReleaseContent.aspx?ACCT=104&STORY=/www/story/05-04-2009/0005018286&EDATE=>

~~~
nfwatson
I forgot to mention that SenSage is usually treated as a "write-once" data
store: you never delete anything except when it reaches its "age-out" date (on
the order of 2 to 10 years, usually).

This is very important for its native purpose, log data storage. For legal
reasons companies that store logs related to HIPAA (health care records), SOX
(public company financial records), CDR/IP-R (call-data/IP), PCI
(Payment/Credit/Debit card data) need to keep records about what went on with
their routers/switches, computers, servers, apps, so they can reconstruct a
cross-device slice of any activities of interest for legal purposes. (It's
also useful for maintaining operational integrity and proving SLAs).

In this context of media-related storage, not throwing data away has some
nice advantages, in that you can always track the evolution of all interaction
with the media over time (e.g., capture media download trends ...). The
engine can store it all; many users load 200GB+ of log data daily into the
data engine.

------
joecode
Either you're going to need tons of cash to pay for tons of servers, or you're
going to need to rethink how the data gets used. Is it really completely
random access that you need? Or is there some way to pre-process or pre-order
it so you can at least get the indexes into memory on one machine?

------
mcav
How big is each media item? That might make a difference.

~~~
imajes
20KB - key/value data.

Associated objects might be in the 100s of kb

~~~
mahmud
Start with Tokyo Cabinet and only switch when you hit a ceiling.

My system can do 300 transactions per second over HTTP, and I use a lowly
Common Lisp library called "rucksack". I also keep a plain-text "log" in a
memory queue that gets paged to disk routinely, from which I can recover the
database in case it gets corrupted.

You can't believe how much time you can save using the easiest solutions. You
will get plenty of mileage from BDB or TokyoCabinet, upgrade to something
beefier as the need arises, if and when you have that many users.

~~~
smokinn
Tokyo Cabinet (TC) is what I was going to recommend as well. It's crazy fast
and quite reliable. It also does zlib compression so your dataset will be
smaller than with most other key/value stores. Be careful picking an open
source key/value store because a lot of them are kinda half baked right now.
(But TC is very solid.)

My first attempt would be as suggested above, to use a single instance of TC.
If it ever started to get slow or you simply ran out of space you could even
put a consistent hash in front of a cluster of machines. Though I wouldn't be
surprised if a single machine would be enough for your current needs (20
billion).

The partitioning doesn't have to be very complicated, depending on your key.
If your key is just a numeric auto-incremented id, for example, you could map
fixed ranges to machines (0 to x billion on machine 1, x to 2x billion on
machine 2, etc.), or use a simple modulo "hash" (id mod number-of-machines).
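
Both trivial schemes for numeric keys fit in a few lines; machine names and range size here are made up for illustration:

```python
MACHINES = ["db1", "db2", "db3", "db4"]
RANGE_SIZE = 5_000_000_000  # 5 billion ids per machine

def shard_by_range(obj_id: int) -> str:
    """Fixed ranges: ids 0..5B on db1, 5B..10B on db2, and so on."""
    return MACHINES[obj_id // RANGE_SIZE]

def shard_by_modulo(obj_id: int) -> str:
    """Modulo: the remainder of the id picks the machine."""
    return MACHINES[obj_id % len(MACHINES)]

print(shard_by_range(7_500_000_000))   # db2
print(shard_by_modulo(7_500_000_000))  # 7.5e9 % 4 == 0, so db1
```

Range sharding keeps sequential ids together (good for range scans, bad for write hotspots on the newest shard); modulo spreads writes evenly but scatters ranges, and adding a machine remaps almost every key.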

If you do this though you'll lose the ability to generate keys to search
through your data. If you need to generate keys and your data is distributed I
highly recommend sphinx search ( <http://www.sphinxsearch.com/> ).
Technically, it's a full-text search engine but I use it for normal searching.
You can build an index distributed across multiple machines and even multiple
cores of multiple machines. For example, if you have 2 boxes with 4 cores
each, you can split your data into 8 chunks, 4 of the chunks running on one
machine, 4 on the other with a single "distributed index" that ties them all
together. When you do a search on your distributed index for, say, the 100
items sorted by date, 8 threads will search the 8 chunks and each return its
100 best results; the distributed index then merges them, filters out
duplicates, and returns the overall 100 best. And it'll do it incredibly
quickly.

We're using Sphinx with MySQL. I'm not sure how you'd get the data from TC to
Sphinx (there may or may not be adapters for that, I'm not sure), but if you
have to pick between Sphinx and TC you'd have to decide what's most important
to you: low storage space with very quick lookup, or very low-latency search.
If you need low-latency search, I suggest you simply use MySQL for storage and
index your data with Sphinx.

~~~
gfodor
Seconding this. TC is solid.

------
quellhorst
I would look into hosting the system at someplace like SoftLayer; pay special
attention to the cloud offering they have coming out next week.

For file storage you may be able to save some money using
<http://www.danga.com/mogilefs/> instead of S3.

For your datastore, check out
<http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/>

Also worth taking a look at is <http://hadoop.apache.org/core/>

------
vicaya
If you need range-scan queries (selecting objects and their associations that
match certain prefixes), the only DBs that can readily handle this are
Bigtable (which was originally designed to store crawl DBs) and its clones
like HBase or Hypertable; the latter can do >1M inserts/scans per second
sustained on cheap Dell nodes with 16GB of RAM and 7.2K RPM SATA disks. Most
KV stores based on a DHT _cannot_ do efficient scanning. They typically cannot
do fast inserts when the RAM-to-DB-size ratio is less than 1, either.

------
suhail
Hadoop + HBase + Pig might be what you want.

~~~
miracle
HBase is not reliable in production (it wasn't when I tested it 3 months ago).

------
silas
Check out GlusterFS: <http://www.gluster.org/>

------
moonpolysoft
When you talk about that amount of data, you'll have to make some decisions
about handling disk failures. At that scale they will happen as a matter of
course. There are a number of different distributed databases out there and
some of them have very different ways of handling consistency and
availability.

I work on a database called Dynomite, which currently only supports a strict
key/value interface to the data. It follows the Dynamo model of eventual
consistency: it trades the consistency of the stored data, in the case of
failures and race conditions, for greater availability. What that means is you
might get more than one version of the data back for any particular query.
Dynomite is primarily designed around serving live data to power a site. You
can see it in use at powerset.com for the images in the republished articles.

From the sound of your requirements, the strict key/value lookup might not
work so well. If you'll need to occasionally crunch the numbers, something
like HBase might be a better fit. HBase is optimized for scanning over an
entire table for things like Hadoop jobs. It's built on top of the Hadoop
Distributed File System (HDFS), a distributed file system built after Google's
GFS paper. One of the drawbacks of HBase, however, is that it is not built for
high availability.

Something to think about is perhaps a hybrid approach. It sounds like you need
to acquire data, do some amount of calculation over it, and then serve it to
users. One way to do this, which search engines are very successful at, is to
acquire data into a map-reduce-ready storage system like HBase, crunch over it
using Hadoop, then batch the data into a highly available datastore like
Dynomite for live serving.

If you try to satisfy all of your requirements with one system, it will likely
solve all of your problems poorly, if at all.

------
erlanger
CouchDB.

------
MoeDrippins
Isn't this sort of question how Nedry got "found out" as to what his super
secret assignment was on Jurassic Park?

