
On Building A Stupidly Fast Graph Database - wheels
http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/
======
wheels
I wrote this article pretty much for Hacker News, since when previous articles
have made it to the home page, there have been questions about what exactly
our graph database is.

~~~
moe
My first question whenever I read about a new database system: Does it scale
horizontally, by throwing more machines at it?

That's the one basic requirement for use in a website backend these days.

~~~
patio11
_That's the one basic requirement for use in a website backend these days._

There are plenty of quite profitable websites which do not have this
requirement. It is almost peculiar to sites which are trying to show display
advertising to groups of users larger than many nation-states.

You can make an _awful_ lot of money with one commodity server if your
business model supports it. I used to have an effective CPM of $80 and I know
one which has in excess of $500. No, that is not a typo. (That is on six
digits of pageviews per month.)

You know how much scaling you need when you essentially get 50 cents a pageview?
Not much at all.

FogCreek has, if I recall correctly, one database server. I haven't read how
many total machines they're using recently, but it's a "count on your fingers
and toes" number rather than a "zomg we need a colo facility to ourselves"
number.

~~~
moe
Well, a business model that pays 50 cents a page view sounds nice for sure - I
fear most of us don't share that luxury.

I figured that most sites that would be interested in such a database fall
into either the retail (recommendation) or the community (social graph)
category. Both operate mostly on volume, and the last thing you want is a hard
bottleneck just when you're on the verge of becoming successful.

But well, if your business model doesn't require scalability on the web tier,
then yes, these concerns of course don't apply.

~~~
aaronblohowiak
I don't believe you'd put all of your information in this kind of custom db.
The author remarks that they are running with a memory-mapped disk back-end,
which means a single machine should be able to take you pretty darn far.

------
tptacek
This is a great article, and I don't doubt you have a stupidly fast graph
database, and I am jealous that you get to spend all day working on graph-
theoretic problems. That said:

I'm not so sure of your policework on mmap() vs. read():

* The "extra copies" you make with read happen in the L1/L2 cache, making them comparable to register spills. Buffer copying just isn't expensive.

* (and here I start paraphrasing Matt Dillon) On the other hand, you are absolutely going to take extra page faults and blow your TLB if you address a huge file using mmap, which is not only going to slow down I/O but also hurt performance elsewhere.

It seems to me like you did mmap() so you could easily externalize vectors and
maps. Which actually leads me to a second question:

Isn't the reason people use things like B-Trees that they are optimized to
touch the disk a minimal number of times? Isn't that kind of not the case with
a C++ container "ported" to the disk?
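
(For concreteness, a back-of-the-envelope sketch of the point behind that
question; the page, entry, and data-set sizes below are illustrative, not
anyone's real numbers:)

    // Pages touched per cold lookup: a pointer-based binary tree keeps one
    // key per node, so every level of the search can land on a different
    // disk page, while a B-tree node is sized to a page and holds many keys.
    #include <cmath>
    #include <cstdio>
    
    int main()
    {
        const double keys      = 60e6;  // e.g. 60M edges
        const double pageSize  = 4096;  // bytes
        const double entrySize = 16;    // key + offset, illustrative
        const double fanout    = pageSize / entrySize; // keys per B-tree node
    
        std::printf("binary tree: ~%.0f page touches per lookup\n",
                    std::log2(keys));
        std::printf("B-tree:      ~%.0f page touches per lookup\n",
                    std::log(keys) / std::log(fanout));
        return 0;
    }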

~~~
nkurz
> I'm not so sure of your policework on mmap() vs. read()

Scott's conclusions agree with my experiences very well: if you design around
mmap(), and let the system handle the caching, you can end up with something
several times faster than the traditional alternatives. This isn't to say that
your criticisms are completely wrong, just that they don't match up with the
actual testing.

* "extra copies" [are cheap]

True, but the real cost is the greater memory footprint. Less application
buffering means more room for cached pages. And this cache is transparent
across multiple processes.

* extra page faults

I think the opposite turns out to be true. Letting the system handle the
buffering results in more cache hits, since the memory is used more
efficiently.

* blow your TLB

Theoretically a problem, but in practice one doesn't linearly access the
entire file. The beauty of mmap() is that it allows for brilliantly efficient
non-sequential access.

* B-trees vs C++ containers

While it's true that you have to think carefully about the memory layout of
your containers, if you do so the access patterns can be even better than a
B-Tree's. If the container has been designed for efficient memory access with
regard to cache lines and cache sizes, it tends to have great disk access as
well.

What's really beautiful about the mmap() approach is the simplicity it offers.
In this model, RAM can be viewed as a 16 Gig L4 cache, and disk as a multi-
Terabyte L5. Just as one currently writes code that doesn't distinguish
between a fetch from L1 and a fetch from main memory, mmap() allows extending
this syntax all the way to a fetch from disk.

Now, this doesn't mean that one can just substitute mmap() for fread() and get
any significant improvement. One needs to re-optimize the data structures as
well. But the nice part is that these techniques are the same techniques used
to optimize existing cache accesses, and certain 'cache-oblivious' algorithms
already work out of the box.
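
A minimal sketch of what that ends up looking like in code (the record layout
and file format below are made up for illustration, not Directed Edge's actual
format):

    // Map a file of fixed-size, sorted edge records and search it with
    // ordinary in-memory code; the kernel's page cache is the only buffer.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    
    struct Edge            // one fixed-size record in the file
    {
        uint32_t from;
        uint32_t to;
    };
    
    int main(int argc, char **argv)
    {
        if(argc < 3)
        {
            std::fprintf(stderr, "usage: %s <edge-file> <node-id>\n", argv[0]);
            return 1;
        }
    
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if(fd < 0 || fstat(fd, &st) < 0)
        {
            return 1;
        }
    
        void *base = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if(base == MAP_FAILED)
        {
            return 1;
        }
    
        const Edge *edges = static_cast<const Edge *>(base);
        size_t count = st.st_size / sizeof(Edge);
        uint32_t node = uint32_t(std::strtoul(argv[2], 0, 10));
    
        // Binary search directly over the mapping: pages are faulted in on
        // demand, and pages we never touch never take up memory.
        const Edge *it = std::lower_bound(edges, edges + count, node,
            [](const Edge &e, uint32_t id) { return e.from < id; });
    
        for(; it != edges + count && it->from == node; ++it)
        {
            std::printf("%u -> %u\n", it->from, it->to);
        }
    
        munmap(base, st.st_size);
        close(fd);
        return 0;
    }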

Anyway, thanks to Scott for the writeup!

~~~
tptacek
When you say "fread()", I wonder whether you're considering that fread() does
stdio buffering in userland above and beyond the small window of memory you
reuse on every read (and that is going to stay in-cache) when you use the
read(2) syscall directly.
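
(Roughly, the two paths side by side; "data.bin" is just a placeholder file
name:)

    // fread() normally stages data in libc's stdio buffer before copying it
    // into yours; read(2) copies straight from the page cache into the one
    // small buffer you reuse, which is the window that stays hot in cache.
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>
    
    int main()
    {
        char buf[4096];
    
        // stdio path: FILE* keeps its own userland buffer behind the scenes.
        FILE *fp = std::fopen("data.bin", "rb");
        while(fp && std::fread(buf, 1, sizeof(buf), fp) > 0)
        {
            // process buf
        }
        if(fp)
            std::fclose(fp);
    
        // raw syscall path: each read() fills buf directly from the page cache.
        int fd = open("data.bin", O_RDONLY);
        while(fd >= 0 && read(fd, buf, sizeof(buf)) > 0)
        {
            // process buf
        }
        if(fd >= 0)
            close(fd);
    
        return 0;
    }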

~~~
tptacek
Two simple test cases:

<http://pastie.org/402608> (read)

<http://pastie.org/402607> (mmap)

Each opens a 10M file and accesses aligned pages. Depending on how many bytes
in the page you ask the mmap() case to touch, mmap ranges from 10x faster to
10x slower for me. Reading straight through without seeking, it's no contest
for me; read() wins. But you knew that.
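
(The pasties themselves aren't reproduced here, so the following is only a
guess at the shape of the test: walk a ~10M file page by page and touch the
first TOUCH bytes of each page, once through pread() and once through a
mapping; varying TOUCH is what swings the result.)

    #include <cstdio>
    #include <ctime>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    
    static const size_t PAGE  = 4096;
    static const size_t TOUCH = 64;     // bytes inspected per page
    
    static volatile unsigned long sink; // keep loads from being optimized out
    
    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile.bin";
        int fd = open(path, O_RDONLY);
        struct stat st;
        if(fd < 0 || fstat(fd, &st) < 0)
        {
            return 1;
        }
        size_t pages = st.st_size / PAGE;
    
        // read() version: one syscall per page, data copied into a reused buffer.
        char buf[PAGE];
        clock_t start = clock();
        for(size_t i = 0; i < pages; i++)
        {
            pread(fd, buf, TOUCH, off_t(i * PAGE));
            for(size_t j = 0; j < TOUCH; j++)
            {
                sink += buf[j];
            }
        }
        std::printf("read: %.3fs\n", double(clock() - start) / CLOCKS_PER_SEC);
    
        // mmap() version: no copies, but each newly touched page costs a fault.
        void *base = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if(base == MAP_FAILED)
        {
            return 1;
        }
        const char *map = static_cast<const char *>(base);
        start = clock();
        for(size_t i = 0; i < pages; i++)
        {
            for(size_t j = 0; j < TOUCH; j++)
            {
                sink += map[i * PAGE + j];
            }
        }
        std::printf("mmap: %.3fs\n", double(clock() - start) / CLOCKS_PER_SEC);
    
        munmap(base, st.st_size);
        close(fd);
        return 0;
    }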

~~~
nkurz
Thanks for encouraging me to look at this closer. I was testing with this:
<http://pastie.org/402890>

I was having trouble comparing results, so I combined your two into one, tried
to make the cases more parallel, took out the alarm() stuff, and just ran it
under oprofile.

My conclusions were that for cases like this, where the file is small enough
to remain in cache, there really isn't any difference between the performance
of read() and mmap(). I didn't find any of the 10x differences you found;
instead, the mmap() version ranged from twice as fast for small chunks to
about equal for full pages.

You might argue that I'm cheating a little bit, as I'm using memcpy() to
extract from the mmap(). When I don't do this, the read() version often comes
out up to 10% faster. But I'm doing it so that the code in the loop can be
more similar --- I presume the compiler can optimize accesses to a local
buf[] better.

I'd be interested to know how you constructed the case where read() was 10x
faster than mmap(). This doesn't fit my mental model, and if it's straight up,
I'd be interested in understanding what causes this. For example, even when I
go to linear access, I only see read() being 5% faster.

------
kmavm
You don't really get into why mmap is an unpopular choice. It's not as if
other programmers just forgot to read the man page. Traditional RDBMSs dislike
the OS's buffer cache because the dbms has information that could better drive
those algorithms; e.g., streaming data should not be cached, and should not
compete with useful items in the cache. The page replacement algorithm is
similarly blind; yeah, madvise exists, but it rarely has teeth. mmap is
convenient, and performant enough. But if you found yourself driving hard to
get the last 1% of performance out of this system, I would argue that you'd
end up doing explicit file I/O and manual management of memory; e.g., the only
way to use large pages to reduce TLB misses on popular OS'es is to use funky
APIs like hugetlbfs on Linux.
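
(For reference, the hints in question look like this; whether a given kernel
actually acts on them is exactly the "rarely has teeth" problem:)

    // madvise() is the knob mmap() users get for cache policy: hints, not
    // commands, and kernels are free to ignore them.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    
    int main(int argc, char **argv)
    {
        if(argc < 2)
        {
            return 1;
        }
    
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if(fd < 0 || fstat(fd, &st) < 0)
        {
            return 1;
        }
    
        void *map = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if(map == MAP_FAILED)
        {
            return 1;
        }
    
        // Streaming scan: read ahead aggressively, let pages behind the scan go.
        madvise(map, st.st_size, MADV_SEQUENTIAL);
        // ... scan the mapping ...
    
        // Index-style access: disable read-ahead.
        madvise(map, st.st_size, MADV_RANDOM);
    
        // Done with this data for now: its pages can be reclaimed first and
        // will simply fault back in if touched again.
        madvise(map, st.st_size, MADV_DONTNEED);
    
        munmap(map, st.st_size);
        close(fd);
        return 0;
    }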

Also, a pet peeve: mmap != "memory-mapped I/O." The latter refers to a style
of hardware/software interface where device registers are accessed via loads
and stores, rather than magical instructions. If you're not writing a device
driver, you don't know or care whether you're using "memory-mapped I/O". mmap
is ... just mmap.

~~~
nkurz
I'd be interested in knowing more about why it's unpopular. I'm a fan of
mmap() because I like the way it can simplify my code, and so far I've been
pleased with the speed as well. But if there are subtle downsides I'd love to
be aware of them. My instinct was that mmap() isn't used much because it's
relatively new, and because it's traditionally had poor support on Windows.

I'm primarily a Linux user, but the best discussion I was able to find with a
quick search was this exchange on freebsd-questions from several years ago:
<http://lists.freebsd.org/pipermail/freebsd-questions/2004-June/050133.html>

Do you know of any updated articles about its performance tradeoffs?

------
jonmc12
Excellent article. I was wondering if you could help me understand why
Franz's AllegroGraph or Aduna's Sesame were not sufficient for your needs.
Have you had the opportunity to perform any benchmarks against these graph
DBs?

~~~
wheels
Tried Sesame; it was one of the graph DBs that I mentioned not being up to
snuff. Also looked at Franz's DB, but based on the benchmarks they publish on
their site (they've also imported a smaller Wikipedia dump), it looked like
it was about 5x slower than ours.

------
finnw
Did you investigate any other network DBMSes? If so why did you find them
inadequate?

<http://en.wikipedia.org/wiki/Network_model>

~~~
wheels
Nope, or at least none that called themselves such. We tried neo4j, which
exploded trying to import data on the order of what we're working with, and a
couple of RDF databases, which survived the import but were a couple of
orders of magnitude off from the performance we were hoping for.

After writing some 8 different backends for our store class, none of which
came within an order of magnitude of our own prototype for the sorts of
applications we're doing, it seemed more fruitful to round out our own
application rather than continue the seemingly endless recursion through
possible data backends, which ranged from mildly to amazingly disappointing.

If you've got something specific that you've worked with in the past that you
think would be worth our while to evaluate, I'd consider investing the time to
try it out. But the mere existence of more options we could evaluate at the
moment doesn't necessarily mean it's reasonable to keep writing new backends,
which sometimes take a non-trivial amount of effort.

~~~
emileifrem
I'm part of the Neo4j team and I'm puzzled about the import problem. I don't
know about the size requirements you have but you mention 2.5M nodes and 60M
edges and we run systems in production with a LOT more data (billions range).
So it definitely shouldn't blow up. Maybe you ran into some bug in an older
release or something else was wrong.

It's also important to note that Neo4j through the normal API is optimized for
the most common use cases: reading data and transactional updates. Those
operations are executed all the time during normal operation, whereas an
import is typically done once at system bootstrap and then never again.

To ease migration, as part of our 1.0 release (June time frame) we will expose
a new "batch injection" API that is faster for one-time imports of data sets.
This is currently being developed. If you have feedback on how an API like
that should behave, feel free to join the discussions on the list:

<http://neo4j.org>

Cheers,

-EE

