What database would you use to store (min) 10 billion objects?
63 points by imajes on May 8, 2009 | 46 comments
I'm building something that's going to store 2 billion media objects in the short term, with each object having 10+ related objects on top. Think crawler with some interesting extrapolations/filters/etc applied.

Would want to be able to power multiple views on data, mostly in aggregate form.

This is all greenfield: the data exists but i'm starting from scratch to crawl and reformat it. I'm interested to know how you might consider storing the info?

Thanks :)

Are there updates to the data, or is it append/delete only?

Does each object have a fixed size? (Or can be a fixed size with padding?)

What kind of index do you need? Do you need advanced SQL commands, like aggregate data, and complex joins?

MySQL can handle billions http://dev.mysql.com/doc/refman/5.1/en/features.html of rows - but it may be overkill for you. Pretty much any database can.

But that's the wrong question. The right question is what do you need the database to do. Mainly what kind of index, and what kind of joins.

PS. If all you need is the aggregate data, why store the details? Just store the aggregate and update it on the fly. (Presumably store the details offline, so you can regenerate the aggregate if needed.)

Read this: http://news.ycombinator.com/item?id=594889 for updating the aggregates on the fly. And this: http://news.ycombinator.com/item?id=587723 for one type of system for storing large numbers of objects.
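The "update the aggregate on the fly" idea can be sketched in a few lines (names here are made up for illustration): each incoming object updates a running counter in O(1), so you never need to re-scan the detail rows.

```python
from collections import defaultdict

# Hypothetical sketch: maintain per-tag counts incrementally instead
# of storing every detail row and re-aggregating later.
class RunningAggregate:
    def __init__(self):
        self.count = defaultdict(int)

    def observe(self, tag):
        # O(1) update per incoming object; no full-table scan needed.
        self.count[tag] += 1

agg = RunningAggregate()
for tag in ["buttons", "buttons", "maps"]:
    agg.observe(tag)
print(agg.count["buttons"])  # 2
```

The raw details can still be archived offline so the aggregate can be regenerated if the logic ever changes.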

"but it may be overkill for you"

I am curious about this line. MySQL is not very complicated to set up, so how could it be overkill?

Edit: seriously, "overkill" to me would be to use a sledge hammer to drive in a nail, that is, something that is too fat/heavy/complicated for the problem at hand.

To store a billion records in MySQL when you don't need its abilities (indexes, joins, and updates) is overkill.

If you never do updates, and you write your own index, then a much simpler storage method with far less overhead would be what you need.

Writing one's own index already seems to be a lot more work than simply setting up MySQL. I get your point, I just have a hard time imagining what could be simpler. Perhaps writing to a sequential file without an index. Although even then I would worry about the capabilities of the file system, which I suppose MySQL would abstract away from me.

96% of change is append. 0.0001% would be delete.

Sorts of views might be: "show me objects that have lat xx and long yy but tagged "buttons" and created on mm/dd/yyyy" in the basic instance, but also "show me objects that my friends like" as a more complex example.

Makes me think that what I really need is a graph db - object relationships seem to be the key factor.

So you do have updates? Because updates make a database more complicated.

If you need lat/lng then you need an R-tree and very few databases come with that.

Sounds to me like you're going to have to write your own database - or more accurately your own index.

Basically store data as append only, and handle change as delete/append. Store just the offset to the data as the key.
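The append-only scheme described above can be sketched like this (the class and file names are made up): records are appended to a log file, the "index" maps a key to the byte offset of the latest version, and an update is just a fresh append that repoints the index entry.

```python
import os, struct

# Minimal sketch of an append-only store with offsets as keys.
class AppendLog:
    def __init__(self, path):
        self.f = open(path, "ab+")
        self.index = {}  # key -> byte offset of latest record

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        data = value.encode()
        # length-prefixed record: 4-byte big-endian length, then payload
        self.f.write(struct.pack(">I", len(data)) + data)
        self.f.flush()
        self.index[key] = offset

    def get(self, key):
        self.f.seek(self.index[key])
        (length,) = struct.unpack(">I", self.f.read(4))
        return self.f.read(length).decode()

log = AppendLog("objects.log")
log.put("obj1", "v1")
log.put("obj1", "v2")   # "update" = append + repoint the index entry
print(log.get("obj1"))  # v2
```

A periodic compaction pass would reclaim the space held by superseded records; deletes (the 0.0001% case) can be tombstone appends handled the same way.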

I really hope your index can fit on one machine.

Do it like the facebook link I gave you earlier.

You're going to have a hard time with index merges (lat/lng merge with tags).

Maybe a better idea is don't store the object in the database, but just the offset to it (plus machine ID).

Then use any commercial database that has an R-Tree.

You have 200GB to 1PB of data to store, which needs at least 25 machines.

But if you can store just the index, and not the data, maybe you can store the index on just one machine - that will make things easier. (Obviously you'll have multiple machines with the same index for load balancing.)

This might help also: http://www.ddj.com/184410998 (instead of an R-Tree, you might be able to use a regular index).

PostgreSQL has spatial indexing capability, too. So you can do something like (the exact syntax for spatial ops escapes me right now):

  SELECT *
  FROM media NATURAL JOIN mediatags
  WHERE (location INTERSECTS somebox)
    AND tag IN
      (SELECT tag FROM usertags WHERE user = 137)
You will probably have to spend some quality time with the query planner figuring out how to set up your indexes so the common queries will be fast, but it should be possible.

(EDIT: fighting with the formatting. Why is there no preview?)

Regarding space filling curves some implementations would be:

1) http://code.google.com/p/uzaygezen/ (Hilbert)

2) http://www.h2database.com/html/features.html#multi_dimension... (Z-order)

Other space transformations could be employed, like the Pyramid technique or iDistance (http://obsearch.net). Nevertheless, if the "intrinsic dimensionality" of your objects is too high, then the curse of dimensionality comes into play and a sequential scan might be cheaper. In that case the VA-file, IQ-Tree, or even "sketches" might be the only way to go.

Yeah, I think there's a lot of validity in this comment... I fully expect to do stuff like abstract storage of lat/long to separate stores - so most appends are really just creating a new related object - but how to store it is the fun problem :)

good tips. thanks. :)

Apparently, you can encode lat/long into a geohash http://en.wikipedia.org/wiki/Geohash. Seems like a good idea, although I haven't actually tried it.
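For reference, the geohash encoding itself is simple enough to sketch: interleave longitude and latitude bits by repeated interval halving, then base32-encode. The payoff is that nearby points usually share a common string prefix, so a plain B-tree prefix scan approximates a spatial query.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash(lat, lon, precision=9):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    even = True  # geohash starts with a longitude bit
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    # pack each run of 5 bits into one base32 character
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, len(bits), 5))

print(geohash(57.64911, 10.40744, 5))  # u4pru
```

Shorter hashes are prefixes of longer ones for the same point, which is what makes prefix queries work.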

This type of question is really too vague to answer without knowing what the data is and exactly what you want to do with it.

How much of the 20k of data items and 100k related objects do you actually have to scan/process over? If the 20k item is an opaque block like a photograph, for example, you probably aren't going to do much analysis of the block often -- you might pull the header or citation data and other related information down into 1k or less of useful data per item. On the other hand, if either the 20k or 100k are long form plain text, you might be able to distill it down to keywords to focus the data for efficiency.

When you say you want to do "multiple views on data, mostly in aggregate form", do you mean like OLAP cubes and other common reporting/analysis metrics, or more like searchy stuff where you might be interested in a needle in a haystack?

How do your users expect to interact with the data? If you were to sequentially scan the entire uncompressed data set using Storagereview's current fastest disk, the Seagate 15k.5 300G, it would take over 21 disk DAYS (at the (min+max)/2 transfer rate), and that's a contrived best case, unlikely to be attainable in a real-world scenario. It would take about 700 disks to hold it all, which would bring the wall time down to about 45 minutes. So naively walking the data for search purposes is out. It would have to be indexed.
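A quick back-of-the-envelope check of those figures (the sustained transfer rate is an assumed average for that drive; the exact number varies):

```python
# Back-of-the-envelope check of the disk-scan numbers above.
objects       = 10_000_000_000
object_bytes  = 20 * 1024                # ~20KB per item
dataset_bytes = objects * object_bytes   # ~200TB total
rate          = 110 * 10**6              # ~110 MB/s assumed (min+max)/2

scan_days = dataset_bytes / rate / 86_400
disks     = dataset_bytes / (300 * 10**9)  # 300GB drives

print(round(scan_days, 1))  # ~21.5 single-disk days
print(round(disks))         # ~683 disks to hold it all
```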

Indexing is a tricky problem. In small databases they're usually so small and help so much that nobody pays any attention to the resources required to maintain them. With this dataset that's unlikely to be the case: you may effectively need several times the size of the data in indexing alone, which is going to have drastic implications on the computing resources required and the data storage format needed.

700 of those Seagate disks at roughly $200 street price is $140k. That's before simple redundancy, systems, cases, racks, power, logical redundancy for performance or extra space for derived data. Making a system to manipulate this much data is easily $500k, and paying a commercial vendor to provide you systems, hardware, and possibly software to do the job could be an order of magnitude more. For that kind of money I'd spend a few thousand hiring some people you trust to give you a reasonable assessment of the task at hand. It sounds like fun, though. Good luck. :)

That's approximately 200TB of data. I suggest sharding the data across multiple machines because a single SAN doing BDB lookups of a single file of that size is still going to be a tad slow. Most large databases nowadays are divided by sharding, but since it's key/value data, a BDB is probably one of the top solutions. If you modify a lot of the same data frequently, memcache can speed things up (also a key/value in-memory solution), but the size of the problem in total bytes may be more of a problem than the actual looking up and indexing of the data.

but the size of the problem in total bytes may be more of a problem than the actual looking up and indexing of the data.

One cool property of such a large data-set (~200TB) is that you are pretty sure to see a lot of the same data repeated.

It'd be neat to try reduce the overall data footprint by assigning some sort of signature to repeats. I.e. if you are storing the sequences (let's imagine for a sec that these are non-trivial sizes) ABC, ABCD, ABCDE and ABCDEF , replace ABC with say #1 etc. and perhaps save a whole lot of space.
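The "signature for repeats" idea can be sketched with content hashing (names are made up): store each unique chunk once, keyed by its hash, and keep only the hash wherever the data repeats.

```python
import hashlib

# Hypothetical dedup sketch: a content-addressed chunk store.
store = {}

def put(chunk: bytes) -> str:
    sig = hashlib.sha256(chunk).hexdigest()
    store.setdefault(sig, chunk)  # each unique chunk stored only once
    return sig                    # callers keep just the signature

refs = [put(b"ABC"), put(b"ABC"), put(b"ABCD")]
print(len(store))  # 2 unique chunks despite 3 inserts
```

Real deduplication systems chunk at content-defined boundaries so that ABCD and ABCDE share the ABC chunk, but the principle is the same.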

This is essentially what dictionary-based (LZ/ZIP) compression does.

But then you might end up not being able to see the forest of data for all the Huffman trees.

It's called "deduplication" in databaseland, it's common strategy for tape backups.

Most large databases nowadays are divided by sharding

Err, this isn't even close to true. MySQL isn't the only database in the world, you know.

Sharding has nothing to do with MySQL, and everything to do with scaling.

I think the point was that the big DBs have better ways to deal with scaling than sharding.

Yeah, but it's text, so I'm thinking about S3 and interesting storage optimizations (a lot of it would be tag data, for example, so there are plenty of fkeys).

I'd shard it based on some kind of distributed scheme... though I'm not sure what that might look like. Good tips though. :)

Here's a good comparison of key/value database alternatives: http://www.metabrew.com/article/anti-rdbms-a-list-of-distrib...

Forget the database. Figure out:

- What data do you have? (I think you know this.)
- How will it be accessed?

Then, figure out what's the best way to store the data so that access (includes, read/writes) is efficient. Essentially, at this point you're trying to figure out what's the most efficient way to organize the data.

Then, think of where to store this organization. You may end up using more than one type of database for that matter, but at least at this point you can ask more pointed questions like: "Hey, anyone know the best way to store long/lats?" Or "Hey, anyone know the best way to store tags for 10 billion objects?" etc.

Any KV store can do the trick, just make sure to use multiple machines via some form of partitioning (for example consistent hashing).

Redis is reported to hold 300 million keys on a single 64bit machine with 64 GB of RAM, so you need around 7-10 Redis servers in this configuration: an object for every top-object with value including the serialized version of the other 10 objects. All this assuming that values will not be too big.
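A minimal consistent-hash ring for spreading keys across those 7-10 servers might look like this (names and virtual-node count are made up for illustration); the point of the ring over plain modulo is that adding or removing a node only remaps the keys between it and its neighbor, not everything.

```python
import bisect, hashlib

# Sketch of a consistent-hash ring with virtual nodes.
class Ring:
    def __init__(self, nodes, vnodes=64):
        # each physical node gets `vnodes` points on the ring
        self.ring = sorted(
            (self._h(f"{n}:{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # first ring point clockwise from the key's hash, wrapping around
        i = bisect.bisect(self.points, self._h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["redis1", "redis2", "redis3"])
print(ring.node_for("object:12345"))  # deterministically one of the three
```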

I don't know what your definition of a "media object" is, but I'll assume:

* you have audio, video, and/or image files
* each media file has a name/media_UID
* your "related objects" are more or less fixed-format small records with elements of type 'string', 'uint', 'int', etc.
* your "related objects" might be metadata attached to the media, info resulting from processing the media, related info from the context in which the media was found, or dynamic info about how the media was used or referenced

If money were available I would do the following:

  * buy a 1-big-table log aggregator like SenSage
    (http://www.sensage.com) (distributed Linux-based
    redundant large data storage/query engine)
  * define a single DB table with the media_UID as
    key and with all columns defined for all related
    objects (note:  I'm assuming fixed-column-set
    for each "related object") ... with the understanding
    that any given row may or may not contain a
    "related object" of a given type
  * I'd take the (relatively) static data for each
    media file (e.g., media_UID, file size, file name,
    media type, ..., # of unique faces recognized in
    the media, make-up-your-own-field-here) and
    insert it once, with NULLs for other dynamic
    "related object" fields
  * For dynamic info I'd insert media_UID and relevant
    fields and a timestamp for the dynamic event
  * ... and after this you'd have a queryable data set
    that's constantly evolving
  * you could dynamically update the schema as you need
    more "related objects" or need to extend their fields
I would buy an EMC Centera array (integrated with SenSage for archival) and use it also to store the actual media, keyed by media_UID.

After you've done this you can periodically run full-table SQL/Perl scans to aggregate the info you need -- that's what the SenSage tool is built for and does it blindingly fast. You could expose the aggregations as full data sets in a Postgres DB if they're needed multiple times ... or as throwaway dynamic results if required just once.

If there's less money I'd try to replicate the data store and aggregation in Hadoop or something similar.

As for SenSage speed ... European/US/international telcos/ISPs use it to store call-data-records and IP-records, scanning billions of records in minutes when law enforcement demands the info. http://news.prnewswire.com/DisplayReleaseContent.aspx?ACCT=1...

I forgot to mention that SenSage usually is treated as a "write-only" data store, you never delete anything except when it reaches its "age-out" date (on order of 2 to 10 years usually).

This is very important for its native purpose, log data storage. For legal reasons companies that store logs related to HIPAA (health care records), SOX (public company financial records), CDR/IP-R (call-data/IP), PCI (Payment/Credit/Debit card data) need to keep records about what went on with their routers/switches, computers, servers, apps, so they can reconstruct a cross-device slice of any activities of interest for legal purposes. (It's also useful for maintaining operational integrity and proving SLAs).

In this context of media-related storage, not throwing data away has some nice advantages in that you can always track the evolution of all interaction with the media over time (e.g., capture media download trends). The engine can store it all. Many users load 200GB+ of log data into the data engine daily.

Either you're going to need tons of cash to pay for tons of servers, or you're going to need to rethink how the data gets used. Is it really completely random access that you need? Or is there some way to pre-process or pre-order it so you can at least get the indexes into memory on one machine?

How big is each media item? That might make a difference.

20KB - key/value data.

Associated objects might be in the hundreds of KB.

Start with Tokyo Cabinet and only switch when you hit a ceiling.

My system can do 300 transactions per second over http and I use a lowly Common Lisp library called "rucksack". I also keep a plain text "log" in a memory queue that gets paged to disk routinely, from which I can recover the database in case it gets corrupted.

You can't believe how much time you can save using the easiest solutions. You will get plenty of mileage from BDB or TokyoCabinet, upgrade to something beefier as the need arises, if and when you have that many users.

Tokyo Cabinet (TC) is what I was going to recommend as well. It's crazy fast and quite reliable. It also does zlib compression so your dataset will be smaller than with most other key/value stores. Be careful picking an open source key/value store because a lot of them are kinda half baked right now. (But TC is very solid.)

My first attempt would be as suggested above, to use a single instance of TC. If it ever started to get slow or you simply ran out of space you could even put a consistent hash in front of a cluster of machines. Though I wouldn't be surprised if a single machine would be enough for your current needs (20 billion).

The hash doesn't have to be very complicated, depending on your key. If your key is just a numeric auto incremented id for example you could implement a "hash" with modulo which just maps 0-x billion to machine 1, x-2x billion to machine 2, etc.
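Strictly, mapping contiguous id ranges to machines is range partitioning rather than modulo hashing; both are shown for contrast below (the machine count and range size are assumed for illustration):

```python
# Two simple sharding functions for numeric auto-increment ids.
MACHINES = 4
IDS_PER_MACHINE = 1_000_000_000  # "x billion" ids per box

def by_range(obj_id):
    # consecutive ids land on the same machine (easy to add machines
    # as the id space grows)
    return obj_id // IDS_PER_MACHINE

def by_modulo(obj_id):
    # consecutive ids scatter evenly across all machines
    return obj_id % MACHINES

print(by_range(2_500_000_000))   # 2
print(by_modulo(2_500_000_000))  # 0
```

Range partitioning keeps id-ordered scans local to one box; modulo spreads write load but makes range queries touch every machine.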

If you do this though you'll lose the ability to generate keys to search through your data. If you need to generate keys and your data is distributed I highly recommend sphinx search ( http://www.sphinxsearch.com/ ). Technically, it's a full-text search engine but I use it for normal searching. You can build an index distributed across multiple machines and even multiple cores of multiple machines. For example, if you have 2 boxes with 4 cores each, you can split your data into 8 chunks, 4 of the chunks running on one machine, 4 on the other with a single "distributed index" that ties them all together. When you do a search on your distributed index for say 100 items sorted by date, 8 threads will search the 8 chunks, return the 100 best results and the distributed index will filter out duplicates and return the 100 best results. And it'll do it incredibly quickly.

We're using sphinx with mysql. I'm not sure how you'd get the data from TC to sphinx (there may or may not be adapters for that I'm not sure) but if you have to pick between sphinx and TC you'd have to decide what's most important for you, low storage space with very quick lookup or very low latency search. If you need low latency search I suggest you simply use mysql for storage and index your data with sphinx.

Seconding this. TC is solid.

Sphinx has an XML input format, IIRC.


Why Tokyo Cabinet?

How quickly can you populate your database locally? Adding 10 billion objects at 300/second will take too long - over 33 million seconds, i.e. more than a year of sustained inserts.

1 million records in 0.4 seconds. 2.5M queries per second (page 4).

You realize these numbers are not for transactional (ACID) updates, right? They are with write cache enabled, meaning a power failure would cause data reported as written not to be available. I hope it doesn't mean data corruption, if someone has tested this I'd like to hear from them.

For ACID storage, one transaction per disk rotation is pretty much a theoretical limit for sustained performance.

> For ACID storage, one transaction per disk rotation is pretty much a theoretical limit for sustained performance.

Why? There's nothing about "all or nothing" that forbids performing independent application-level commits with a single implementation-level commit. And, it's even possible to do application-level-dependent commits at the same time. (Yes, it's tricky.)

And, you can have commits on different spindles. And, you can do multiple writes during a single rotation.
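The batching trick being described is usually called group commit: many logical commits share one physical flush, so sustained throughput isn't capped at one transaction per disk rotation. A toy sketch (the "flush" here is simulated, not a real fsync):

```python
import threading

# Toy group-commit sketch: queue commits, flush them in one batch.
class GroupCommitLog:
    def __init__(self):
        self.pending = []
        self.lock = threading.Lock()
        self.flushes = 0  # count of physical writes

    def commit(self, record):
        # a logical commit just queues the record
        with self.lock:
            self.pending.append(record)

    def flush(self):
        # one physical write covers every commit queued since last flush
        with self.lock:
            batch, self.pending = self.pending, []
            if batch:
                self.flushes += 1  # pretend: write batch + fsync once
            return batch

log = GroupCommitLog()
for i in range(100):
    log.commit(f"txn{i}")
print(len(log.flush()), log.flushes)  # 100 commits, 1 physical flush
```

In a real engine the committing threads block until the shared flush durably lands, so durability is preserved while amortizing the rotation cost.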

Those numbers are probably from non-real-world benchmark tests of sequential reads/writes. You'll probably never reach those levels, and when you put Tokyo Tyrant (TT) into the equation the numbers will decrease even more.

If I read it correctly, 300 records per second was his production rate. I've benched TokyoCabinet out to be much, much faster.

I would look into hosting the system at someplace like softlayer, pay special attention to the cloud offering they have coming out next week.

For the file store you may be able to save some money using http://www.danga.com/mogilefs/ instead of S3.

For your datastore checkout http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-va...

Also worth taking a look at is http://hadoop.apache.org/core/

If you need range scan queries (selecting objects and their associations that match certain prefixes), the only DBs that can readily handle this are Bigtable (which is originally designed to store crawldbs) clones like HBase or Hypertable, the latter can do >1M inserts/scans/s sustained on cheap Dell nodes with 16GB ram and 7.2KRPM SATAs. Most KVs based on DHT _cannot_ do efficient scanning. They typically cannot do fast inserts when RAM to DB size ratio is less than 1, either.
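The reason sorted stores handle this while DHT-based KVs don't: with keys kept in order, a prefix scan is a binary search plus one sequential read, whereas a hash-partitioned store would have to touch every key. A tiny illustration of the principle (example keys are made up):

```python
import bisect

# Prefix scan over a sorted key space, Bigtable/HBase-style.
keys = sorted(["row:a1", "row:a2", "row:b1", "row:b2", "row:c1"])

def prefix_scan(prefix):
    lo = bisect.bisect_left(keys, prefix)
    # "\xff" sorts after any printable suffix, closing the range
    hi = bisect.bisect_left(keys, prefix + "\xff")
    return keys[lo:hi]

print(prefix_scan("row:b"))  # ['row:b1', 'row:b2']
```

In HBase/Hypertable the same idea runs against an on-disk sorted table per region rather than an in-memory list.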

Hadoop + HBase + Pig might be what you want.

HBase is not reliable in production (or at least it wasn't when I tested it 3 months ago).

Check out GlusterFS: http://www.gluster.org/

When you talk about that amount of data, you'll have to make some decisions about handling disk failures. At that scale they will happen as a matter of course. There are a number of different distributed databases out there and some of them have very different ways of handling consistency and availability.

I work on a database called dynomite which currently only supports a strict key/value interface to the data. It follows a dynamo model of eventual consistency. It trades away consistency of the stored data in the case of failures and race conditions for greater availability. What that means is you might get more than one version of the data back for any particular query. Dynomite is primarily designed around serving live data to power a site. You can see it in use at powerset.com for the images in the republished articles.

From the sound of your requirements the strict key/value lookup might not work so well. If you'll need to occasionally crunch over the numbers, something like HBase might be a better fit. HBase is optimized for scanning over the entire table for things like hadoop jobs. It's built on top of the Hadoop File System (HDFS), which is a distributed file system modeled after Google's GFS paper. One of the drawbacks of HBase, however, is that it is not built for high availability.

Something to think about is perhaps a hybrid approach. It sounds like you need to acquire data, do some amount of calculation over it, and then serve it to users. One way to do this, which search engines are very successful at, is to acquire data into a map-reduce-ready storage system like HBase, crunch over it using Hadoop, then batch the data into a highly available datastore like Dynomite for live serving.

If you try to satisfy all of your requirements with one system, it will likely solve all of your problems poorly, if at all.


Isn't this sort of question how Nedry got "found out" as to what his super secret assignment was on Jurassic Park?
