

Ask HN: A good open-source implementation of multi-key-value db? - liuliu

I have used redis, memcachedb & tokyo cabinet. They are all key-value database implementations. Is there an open-source multi-key-value database implemented in C? The problem is maintaining the correspondence between tags and images. In my case, I am building a larger framework in C/C++, so it has to be either low latency or just a naive algorithm implementation.
======
joshu
what does "multi-key-value" mean?

This problem description is nearly incoherent.

~~~
liuliu
A key can have multiple values, and multiple keys can point to the same value.

You can regard it as an efficient implementation of an undirected graph.

Relational DB can do it very well with something like:

create table image2tag ( image varchar(255), tag varchar(255), primary key
(image, tag), index (image), index (tag) );
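
As a rough illustration of the image2tag table above, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The sample rows and the helper names `tags_of` and `images_with` are made up for the example, and SQLite is an assumption here; the schema in the comment is written in MySQL style.

```python
import sqlite3

# In-memory SQLite database; the table mirrors the image2tag schema above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image2tag ("
    " image TEXT NOT NULL,"
    " tag TEXT NOT NULL,"
    " PRIMARY KEY (image, tag))"
)
# The primary key already serves lookups by image; add an index for lookups by tag.
conn.execute("CREATE INDEX image2tag_tag ON image2tag (tag)")

# Hypothetical sample data.
rows = [
    ("a.jpg", "cat"), ("a.jpg", "outdoor"),
    ("b.jpg", "cat"), ("c.jpg", "dog"),
]
conn.executemany("INSERT INTO image2tag (image, tag) VALUES (?, ?)", rows)


def tags_of(image):
    """All tags attached to one image (one key, many values)."""
    cur = conn.execute(
        "SELECT tag FROM image2tag WHERE image = ? ORDER BY tag", (image,)
    )
    return [row[0] for row in cur]


def images_with(tag):
    """All images carrying one tag (many keys, one value)."""
    cur = conn.execute(
        "SELECT image FROM image2tag WHERE tag = ? ORDER BY image", (tag,)
    )
    return [row[0] for row in cur]
```

The same two-column table answers both directions of the relation, which is the "undirected graph" view described above.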

------
Maro
I'm not sure what your problem is. Do you mean multiple keys pointing to the
same value, like key1=>value key2=>value?

You could use indirection:

key1=>valueID key2=>valueID valueID=>value
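
The indirection above can be sketched with two plain dictionaries standing in for two key-value tables. The names `key_to_id`, `id_to_value`, `put` and `get`, and the ID "v42", are hypothetical and only serve the illustration.

```python
# Hypothetical in-memory stand-ins for two key-value tables.
key_to_id = {}    # key -> valueID
id_to_value = {}  # valueID -> value


def put(keys, value_id, value):
    """Store the value once and point every key at its ID."""
    id_to_value[value_id] = value
    for key in keys:
        key_to_id[key] = value_id


def get(key):
    """Follow key -> valueID -> value; None if either hop is missing."""
    value_id = key_to_id.get(key)
    if value_id is None:
        return None
    return id_to_value.get(value_id)


put(["key1", "key2"], "v42", "value")
```

Because the value is stored once, updating `id_to_value["v42"]` changes what both keys resolve to, at the cost of a second lookup per read.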

------
JimmyL
SQL and a JOIN?

Not what you asked for (and certainly not what the cool kids use), but why
not?

~~~
joshu
tags + sql = bad

~~~
silentbicycle
Could you expand on that?

Searching for the intersection of a set of tags' references seems like the
epitome of a set/relational operation to me.
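
For reference, one way to write that intersection relationally is a self-join per additional tag. This is only a sketch with Python's sqlite3 module, made-up rows, and a hypothetical helper `images_with_all`; it assumes an (image, tag) table like the image2tag one mentioned earlier in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image2tag (image TEXT, tag TEXT, PRIMARY KEY (image, tag))"
)
conn.executemany(
    "INSERT INTO image2tag VALUES (?, ?)",
    [("a.jpg", "cat"), ("a.jpg", "outdoor"),
     ("b.jpg", "cat"), ("c.jpg", "outdoor")],
)


def images_with_all(tags):
    """Images carrying every tag in `tags`, using one self-join per extra tag."""
    tags = list(tags)
    if not tags:
        return []
    sql = "SELECT t0.image FROM image2tag AS t0"
    for i in range(1, len(tags)):
        sql += (
            f" JOIN image2tag AS t{i}"
            f" ON t{i}.image = t0.image AND t{i}.tag = ?"
        )
    sql += " WHERE t0.tag = ? ORDER BY t0.image"
    # Parameters follow placeholder order: joined tags first, then t0's tag.
    params = tags[1:] + tags[:1]
    return [row[0] for row in conn.execute(sql, params)]
```

The query grows by one join for each tag in the search, so an N-tag search is an N-way join over the same table.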

~~~
joshu
You want an inverted index.

Finding the intersection of several tags is a multi-way join, which is painful
as all hell. At least on MySQL. Perhaps other databases do better.

Note that I built probably the biggest platform of this type around.
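
A minimal in-memory sketch of the inverted-index idea, assuming each tag maps to the set of images that carry it. The class `TagIndex` and its methods are hypothetical names, not taken from any particular library or from the platform mentioned above.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: tag -> set of image ids (the posting set)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, image, tags):
        for tag in tags:
            self.postings[tag].add(image)

    def with_all(self, tags):
        """Images carrying every given tag (set intersection)."""
        sets = [self.postings.get(tag, set()) for tag in tags]
        if not sets:
            return set()
        # Start from the smallest posting set so the working set stays small.
        sets.sort(key=len)
        result = set(sets[0])
        for other in sets[1:]:
            result &= other
            if not result:
                break
        return result


index = TagIndex()
index.add("a.jpg", ["cat", "outdoor"])
index.add("b.jpg", ["cat"])
index.add("c.jpg", ["outdoor", "dog"])
```

Here a multi-tag search is one lookup per tag followed by set intersections, rather than a join per tag.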

~~~
gojomo
So the tags are terms, and each 'document' is...

- a target (such as a URL in delicious)?

- a [target, user] pairing?

- a [target, user, tagging-event] tuple?

Would you recommend using any of the off-the-shelf open-source inverted-index
implementations? (Lucene, ferret, sphinx, [hyper]estraier, etc.?)

~~~
joshu
It depends on what you need, obviously.

I think Lucene or Sphinx will take you a lot of the way. I don't know about
the others.

------
dantheman
I don't know if this will fit your needs, but CouchDB is pretty fast. It's
under active development and still alpha, so things do change. As for
multi-key, it's a yes and no sort of situation.

If you want to do an intersection you can, but unfortunately the key is
order-sensitive. For example, [key1, key2, key3] won't match [key2, key1, key3].
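
One common workaround for order-sensitive composite keys, sketched here in Python rather than in CouchDB itself, is to canonicalize the key by sorting before both writing and querying. The function `composite_key` and the sample data are hypothetical, and this only handles exact matches on the full tag set, not subsets.

```python
def composite_key(tags):
    """Order-independent key: drop duplicates, then sort."""
    return tuple(sorted(set(tags)))


# Hypothetical index from canonical tag tuples to image names.
index = {}
index.setdefault(composite_key(["key2", "key1", "key3"]), []).append("a.jpg")

# A lookup with the tags in a different order now hits the same entry.
found = index.get(composite_key(["key1", "key2", "key3"]), [])
```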

If you're just trying to get all images matching a key, and your documents
look like:

    
    
      {img (url, binary, etc), tags: [tag1, tag2, tag3, tag4]}
    

you can run a simple map function over the data

    
    
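      function (doc) {
        // Emit one row per tag, keyed by the tag; each row carries the doc id.
        if (doc.tags) {
          for (var i = 0; i < doc.tags.length; i++) {
            emit(doc.tags[i], null);
          }
        }
      }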
    

which will allow you to then get all images by tag:

    
    
      key=tag
    

It doesn't have a C++ API, but it uses HTTP as its interface, so you can use
curl.

~~~
mahmud
_It's under active development and still alpha, so things do change._

CouchDB is great, but that's just one description you don't want your
_database_ server to fit. Go for boring, stable and tried and true.

~~~
dantheman
I agree, but it depends on the nature of the project. That's why I included it
in the brief description.

------
silentbicycle
Framework _for what_? You'll probably need to give us more details. For
starters, will it be a mostly read-only or write-heavy database? What relative
priorities are you placing on speed, data consistency, ease of ad-hoc queries,
etc.?

Based on the few details you've given, I'd consider SQLite
(<http://sqlite.org/>). It's relational, but very lightweight and runs in-
process. It's also open-source, quite mature, simple to use, and interfaces
very easily with C.

~~~
russell
SQLite is suitable only for single user cases. "redis, memcachedb & tokyo
cabinet" imply his needs are way beyond the capabilities of SQLite.

~~~
silentbicycle
Not necessarily. If he's doing a lot of concurrent writing, that's true, but
that's unspecified. Either way, I would say "only up to a couple dozen
simultaneous users", which is _very_ different from "only a single user". He's
also asking specifically for something with low latency and (implicitly) a
simple implementation.

For all we know, he's using the above because he doesn't like (or understand?)
SQL/relational theory, or because of vague stuff he keeps hearing about MySQL
not scaling. I'm not assuming he's ignorant, but we don't know _why_ he
started with memcachedb, et al, and those reasons are worth ruling out first.

~~~
liuliu
The reason for not using SQL is that what I am doing is more about using a
"library" (like what the tokyo cabinet C API provides). SQLite may be good for
this purpose, but I still guess there is a simpler implementation aimed
specifically at this kind of problem. MySQL/PgSQL are not suitable because of
the authorization mechanism.

~~~
silentbicycle
SQLite is a C library, and is called in-process. It's probably the best
fitting relational database for your problem, as I understand it. You've
painted a pretty incomplete picture, though.

The constraints on the problem (how much simultaneous r/w access, size of the
data set, etc.) will have the most impact on what to choose, but it doesn't
seem like you've fleshed that out anywhere else in the thread.

~~~
liuliu
In my scenario, it is a read-heavy database with occasional writes. With an
average of about 3 tags per image and 500,000 images, it should support 1.5
million in-memory records on one computer. For performance, I expect ~100ms to
retrieve 100,000 records.
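
Spelled out as arithmetic, using only the figures quoted above (the variable names are just for illustration):

```python
# Figures taken from the comment above.
images = 500_000
avg_tags_per_image = 3
records = images * avg_tags_per_image      # 1,500,000 (image, tag) records

budget_seconds = 0.100                     # ~100 ms target
records_per_query = 100_000

# Time available per retrieved record, in microseconds (about 1).
per_record_us = budget_seconds / records_per_query * 1e6
# Implied sustained retrieval rate, in records per second (about 1,000,000).
records_per_second = records_per_query / budget_seconds
```

So the target works out to roughly one microsecond per retrieved record.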

------
catch23
Couldn't you just emulate multi-key with tokyo cabinet? You'd just have to
write the code to union sets of data... it's what databases do anyway.
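
A rough sketch of that emulation, with a plain Python dict standing in for the key-value store (this is not the Tokyo Cabinet API). The `tag:` key prefix, the newline-separated value encoding, and the function names are all assumptions made for the example.

```python
# Stand-in for a byte-oriented key-value store: bytes key -> bytes value.
store = {}

SEP = b"\n"  # assumed separator; image ids must not contain it


def add_tag(image, tag):
    """Append an image id to the serialized list stored under the tag."""
    key = b"tag:" + tag.encode()
    ids = set(store.get(key, b"").split(SEP)) - {b""}
    ids.add(image.encode())
    store[key] = SEP.join(sorted(ids))


def images_for(tag):
    """Decode the list stored under one tag into a set of image ids."""
    raw = store.get(b"tag:" + tag.encode(), b"")
    return {item.decode() for item in raw.split(SEP) if item}


def images_for_any(tags):
    """Union of the image sets of several tags, computed in application code."""
    result = set()
    for tag in tags:
        result |= images_for(tag)
    return result


add_tag("a.jpg", "cat")
add_tag("b.jpg", "cat")
add_tag("c.jpg", "dog")
```

Each write here is a read-modify-write of the whole list, which fits a read-heavy workload better than a write-heavy one.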

------
richcollins
<http://www.dekorte.com/projects/opensource/tagdb/>

------
vicaya
What you need is a full text search engine with faceting features. Solr or
Sphinx should fit the bill.

------
liuliu
I finally settled on tokyo dystopia as the tag db, which satisfies my needs
very well. Thank you all for the help.

------
stevedekorte
Make the key the tag name and the value the list of image ids.

