
Unicorn: a simple and flexible abstraction of BigTable-like databases - haifeng
https://github.com/haifengl/unicorn
======
jdf
Not to be a stickler about name collisions, but Facebook wrote a research
paper about a graph database called Unicorn back in 2013:

[https://people.csail.mit.edu/matei/courses/2015/6.S897/readi...](https://people.csail.mit.edu/matei/courses/2015/6.S897/readings/unicorn.pdf)

This appears to be unrelated, which is somewhat unfortunate.

~~~
droopyEyelids
I wish there were some sort of social agreement to name projects with new
words or combined words.

We're overloading the English language so much. In 100 years it's going to be
impossible to search for anything, as every word and phrase will have a
million products and projects attached.

Didn't the original MIT hackers take pride in coming up with clever and unique
names? What happened to that? I'd even settle for names like "elinks".

~~~
Johnny555
Search engines will just need to take context into account. I remember a lot
of confusion between searches for Cisco IOS and Apple iOS back when iOS
first took on the name, but with a few keywords to provide context, it's now
pretty easy to get relevant results.

------
rspeer
How would I use this if I have graph data that's described in terms of its
edges, not its nodes?

The N-Triples and DOT formats are examples of graph data that's
structured like this: you just list the edges as the pairs of nodes that they
connect. The nodes don't necessarily have any properties; they're just
implicitly created by edges. I could describe

    
    
        a -- b
        b -- c
        b -- d
    

and nodes "a", "b", "c", and "d" would implicitly exist.

I ask this because the documentation involves programmatically creating nodes,
storing them in local variables, and referring to them when building edges:

    
    
        gods.addEdge(jupiter, "father", saturn)
        gods.addEdge(jupiter, "lives", sky, json"""{"reason": "loves fresh breezes"}""")
    

If "jupiter", "saturn", and "sky" weren't previously declared and stored in
local variables, how would you do this?

The documentation on the GitHub page is reasonably extensive, but it doesn't
even say how to get an existing node without creating it, and certainly
doesn't say how to create an edge in an efficient way that is independent of
whether its nodes have already been created.
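
To make the question concrete, here is a minimal sketch of the get-or-create behavior being asked about. It is plain Scala over in-memory collections and is not Unicorn's API: `EdgeListGraph`, `vertex`, `addEdge`, and `parse` are hypothetical names, and the `"connects"` label is made up because the `a -- b` edge list above carries no labels.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a graph store; NOT Unicorn's API.
final class EdgeListGraph {
  // String key -> numeric id, kept in insertion order.
  private val ids = mutable.LinkedHashMap.empty[String, Long]
  private val edgeBuf = mutable.ArrayBuffer.empty[(Long, String, Long)]

  // Get-or-create: return the existing id for `key`, or assign the next one.
  def vertex(key: String): Long = ids.getOrElseUpdate(key, ids.size.toLong)

  // Add an edge by string keys; each endpoint is created on first mention.
  def addEdge(from: String, label: String, to: String): Unit = {
    val src = vertex(from)
    val dst = vertex(to)
    edgeBuf += ((src, label, dst))
  }

  def vertices: Seq[String] = ids.keys.toSeq
  def edges: Seq[(Long, String, Long)] = edgeBuf.toSeq
}

object EdgeListGraph {
  // Parse lines such as "a -- b" into (from, to) pairs, skipping blank lines.
  def parse(lines: Seq[String]): Seq[(String, String)] =
    lines.map(_.trim).filter(_.nonEmpty).map { line =>
      line.split("--").map(_.trim) match {
        case Array(from, to) => (from, to)
        case _ => throw new IllegalArgumentException(s"not an edge: $line")
      }
    }
}

object EdgeListDemo {
  def main(args: Array[String]): Unit = {
    val g = new EdgeListGraph
    EdgeListGraph.parse(Seq("a -- b", "b -- c", "b -- d")).foreach {
      case (from, to) => g.addEdge(from, "connects", to)
    }
    println(g.vertices.mkString(", "))
    println(g.edges.mkString(", "))
  }
}
```

The point of the sketch is that the caller never declares nodes or holds them in local variables: an edge is added by key, and the store decides whether each endpoint already exists.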

I've also run into a similar problem trying out the new version of OrientDB.
They have a fast importer called ETL, but all the documentation for it assumes
that you're mostly concerned with importing nodes and you're only using edges
to represent SQL-esque relational data. I'm not trying to shove relational
data into NoSQL for the sake of NoSQL; I actually have a large graph.
Importing serialized graphs into a graph database seems to be a pretty
neglected use case.

~~~
haifeng
In most graph databases, you find a vertex by filtering on its properties,
e.g. with the Gremlin graph query language. In Unicorn, you can do something
similar with document vertices (that is, vertices corresponding to documents
in another table/collection). This is probably very natural in a business
application. However, it is not very useful in your case, as your vertices
are abstract and have no properties.

I guess what you want is large-scale graph analytics, for which I suggest
Spark GraphX or another distributed graph computing engine.

Unicorn is designed for property directed multi-graphs.

~~~
rspeer
I would say that what I _have_ is a property-directed multi-graph, as I
understand it. It's just that the properties are on the edges, and the nodes
have no properties except for their ID.

The graph in question is ConceptNet, which in the version I'm working on has
about 10 million edges and 3 million nodes. Let's be clear that, in computing,
"million" is not a large number. I only said "large graph" to clarify that
it's not a small toy graph. The data needs to be imported with some degree of
efficiency. But I have a 3TB hard drive and 16 GB of RAM, and both of them can
spare a few gigabytes for this task.

Before you throw me into the tarpit of distributed computing, like every other
graph-DB provider does as an excuse for their terrible inefficiency, I would
like to know if your graph database is appropriate to use with reasonable-
sized graphs that fit easily on a single computer.

~~~
haifeng
Check out this script
[https://github.com/haifengl/unicorn/blob/master/shell/src/un...](https://github.com/haifengl/unicorn/blob/master/shell/src/universal/examples/dbpedia.sh),
which loads the DBpedia graph into Unicorn. You should be able to load
ConceptNet with minor modifications. Later, you can refer to a vertex by its
string ID.

------
janprill
While people seem to mostly tinker with the name in the comments, I'd like to
say that this looks like a really interesting project!

Would you mind giving us a little more background on how this was started and
what your motivation was for writing something new?

Given that you have an interesting vita
([https://www.linkedin.com/in/haifengli](https://www.linkedin.com/in/haifengli))
and a lot of people are interested in the graph database space, I'd assume that
people would be interested in your take on the graph landscape. Why, for
example, haven't you joined the efforts of Neo4j, ArangoDB, Titan and the like?
Is Unicorn already older than these systems? Why have you decided to open
source now? Why is this linking to a fork originating at ADP while you are
obviously a member of ADP, and what is ADP about? Questions upon questions,
which IMHO should be answered so that people like myself, who are impressed by
your work, get a better sense of where this massive effort comes from and can
better estimate how long this is going to stay around.

However: Thanks for open sourcing, posting and giving us a chance to play
around with this...

~~~
janprill
OK, doing a little research about ADP, I realize that this is quite a large
company. Sorry, I didn't know it (I'm from Germany). But this makes it even
more interesting to know how Unicorn is used at ADP, and whether that already
serves as a reference for the scale of larger Unicorn installations.

------
hans
Given the context of today, I find the name rather unfortunate.

~~~
haifeng
It is true. Unfortunately, the project was started several years ago and had
nothing to do with the startup world. I would like to complain that VCs have
destroyed another nice name with their hype :(

------
jc423
does it web-scale?

~~~
haifeng
It runs on top of HBase, Cassandra, or Accumulo. So yes, it is web scale.

