
Achieving 100M database inserts per second using Apache Accumulo and D4M [pdf] - espeed
http://www.ieee-hpec.org/2014/CD/index_htm_files/FinalPapers/31.pdf
======
dijit
I hate to be a hater.

But the big issue with the databases I've worked with is not how many inserts
you can do per second. Even spinning rust, if properly managed, can do
_serious_ inserts per second in append-only data structures like MyISAM,
Redis, even Lucene. However, the issue comes when you want to read that data
or, more horribly, _update_ that data. Updates are, by definition, a read plus
a write to committed data, and this can cause fragmentation and other huge
headaches.

I'd love to see someone do 1,000,000 updates/s.
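
To make that concrete, here's a toy Java sketch (purely illustrative, not any
particular engine's code) of why an append is a single sequential write while
an update forces a read before the write:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy store illustrating the asymmetry: appends never read, updates must.
    class ToyStore {
        private final List<String> log = new ArrayList<>();         // append-only segment
        private final Map<String, Integer> index = new HashMap<>(); // key -> log position

        // Insert: one sequential write; no prior read required.
        void insert(String key, String value) {
            log.add(value);
            index.put(key, log.size() - 1);
        }

        // Update: locate the existing record first (the read), then rewrite
        // it in place (the write). On disk, this read-modify-write is where
        // fragmentation and page rewrites come from.
        void update(String key, String value) {
            Integer pos = index.get(key);   // the read
            if (pos == null) throw new IllegalArgumentException("no such key: " + key);
            log.set(pos, value);            // the in-place write
        }
    }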

~~~
jamesblonde
MySQL Cluster (NDB - not InnoDB, not MyISAM. NDB is the network database
engine - built by Ericsson, taken on by MySQL. It is a distributed, in-memory,
shared-nothing DB). >50% of mobile calls use NDB as a Home Location Register,
and it can handle >1m transactional writes per second. We have verified the
benchmarks: you can get millions of transactions/sec (updates are about
50% slower than reads) on commodity hardware with just gigabit ethernet.
Oracle claims up to 200m transactions/sec on Infiniband and good hardware:
[http://www.slideshare.net/frazerClement/200-million-qps-on-c...](http://www.slideshare.net/frazerClement/200-million-qps-on-commodity-hardware-getting-started-with-mysql-cluster-74)

Two amazing facts about MySQL Cluster: (1) it's open source, and (2) it's
never penetrated the Silicon Valley echo chamber, but it is still the world's
best DB for write-intensive transactional workloads.
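
If you want to kick the tires: an NDB table is just a MySQL table created with
the NDB engine, so a plain JDBC client is enough to get started. A minimal
sketch (host, credentials, and table name are placeholders; it assumes the
MySQL JDBC driver is on the classpath, and as I understand it the headline
benchmark numbers come from the native NDB API rather than SQL):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class NdbSmokeTest {
        public static void main(String[] args) throws Exception {
            // Placeholder host/credentials; point this at any SQL node of the cluster.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://mysqld-host:3306/test", "user", "password")) {
                try (PreparedStatement ddl = conn.prepareStatement(
                        "CREATE TABLE IF NOT EXISTS kv (" +
                        "  k INT PRIMARY KEY, v VARCHAR(255)" +
                        ") ENGINE=NDBCLUSTER")) {
                    ddl.execute();
                }
                // Writes go through ordinary transactions; NDB distributes
                // the rows across data nodes by primary key.
                try (PreparedStatement upsert = conn.prepareStatement(
                        "REPLACE INTO kv (k, v) VALUES (?, ?)")) {
                    upsert.setInt(1, 42);
                    upsert.setString(2, "hello");
                    upsert.executeUpdate();
                }
            }
        }
    }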

~~~
viraptor
Another reason may be that you seem to need to be an expert in the internal
workings of NDB to deal with it. I tried to use it and, over a month, ended up
with a few cases of "can't start the database", "can't update", etc., with no
real solutions. There's not much information about NDB on the internet, so
googling doesn't help. The best help I got was from the mysql-ndb IRC channel,
but when things went bad, people said things like "if you send me the
database, I'll help you figure it out". That doesn't work in practice.

I feel like it would be more popular if people actually wrote about how
they're using it. But how do you start when even basic configuration bugs are
still open without progress:
[https://bugs.mysql.com/bug.php?id=28292](https://bugs.mysql.com/bug.php?id=28292)

(this was a few years ago, so maybe things have changed)

~~~
jamesblonde
Yes, that's a good few years ago. Stability issues were mostly fixed around 8
years ago; it's now very stable. There are products like www.severalnines.com
and MySQL Manager to set up and manage instances for you with a UI. If you
want to roll your own, there are Chef cookbooks for installing NDB:
[https://github.com/hopshadoop/ndb-chef](https://github.com/hopshadoop/ndb-chef).

But if you really want to squeeze performance out of it, read Mikael's blog:
[http://mikaelronstrom.blogspot.se/](http://mikaelronstrom.blogspot.se/) and
Frazer's blog:
[http://messagepassing.blogspot.se/](http://messagepassing.blogspot.se/)

------
espeed
This work is related to the MIT D4M Course, GraphBLAS, and Graphulo:

Standards for Graph Algorithm Primitives
[http://www.netlib.org/utk/people/JackDongarra/PAPERS/GraphPr...](http://www.netlib.org/utk/people/JackDongarra/PAPERS/GraphPrimitives-HPEC.pdf)

GraphBLAS: A Programming Specification for Graph Analysis [video]
[https://www.youtube.com/watch?v=6tnzSiq8QBo](https://www.youtube.com/watch?v=6tnzSiq8QBo)

[http://graphblas.org](http://graphblas.org)

Graphulo: Graph Analytics in Apache Accumulo [video]
[https://www.youtube.com/watch?v=nsmFjZNl60s](https://www.youtube.com/watch?v=nsmFjZNl60s)

[https://github.com/Accla/graphulo](https://github.com/Accla/graphulo)

MIT D4M: Signal Processing on Databases
[http://www.mit.edu/~kepner/D4M/](http://www.mit.edu/~kepner/D4M/)

[https://ocw.mit.edu/resources/res-ll-005-d4m-signal-processi...](https://ocw.mit.edu/resources/res-ll-005-d4m-signal-processing-on-databases-fall-2012/index.htm)

Video Lectures:
[https://www.youtube.com/watch?v=zNGKX-4PRsk&list=PLUl4u3cNGP...](https://www.youtube.com/watch?v=zNGKX-4PRsk&list=PLUl4u3cNGP62DPmPLrVyYfk3-Try_ftJJ)

Book: Graph Algorithms in the Language of Linear Algebra
[http://epubs.siam.org/doi/book/10.1137/1.9780898719918](http://epubs.siam.org/doi/book/10.1137/1.9780898719918)

D4M: Bringing Associative Arrays to Database Engines (2015) [pdf]
[https://arxiv.org/pdf/1508.07371.pdf](https://arxiv.org/pdf/1508.07371.pdf)

------
privacyfornow
I think this is one piece of the pie. The thing I'd argue is just as important
to recognize is that several common use cases (even synchronous microservices)
can be stated as collections of append-only immutable logs for the system of
record, plus an in-memory read/view state for readers and mutating functions.

I am using this pattern in risk, fraud, and commerce, and once new members of
my teams get over the mental barrier of decoupling the append-only log from
state, it all just clicks for them.
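
Roughly, the shape of it in a minimal Java sketch (the event and view types
here are invented for illustration, not from any of my systems):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Append-only log as the system of record; readers see a derived
    // in-memory view that can always be rebuilt by replaying the log.
    class AccountLedger {
        record Event(String accountId, long amountCents) {}

        private final List<Event> log = new ArrayList<>();          // system of record
        private final Map<String, Long> balances = new HashMap<>(); // derived view

        // The only mutation: append an event, then fold it into the view.
        void append(Event e) {
            log.add(e);
            balances.merge(e.accountId(), e.amountCents(), Long::sum);
        }

        // Readers never touch the log; they query the materialized view.
        long balance(String accountId) {
            return balances.getOrDefault(accountId, 0L);
        }

        // The view is disposable: rebuild it from the log at any time.
        void rebuildView() {
            balances.clear();
            for (Event e : log) balances.merge(e.accountId(), e.amountCents(), Long::sum);
        }
    }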

~~~
freeman478
Looks interesting! Isn't this some kind of event sourcing, or am I missing
something?

------
nrjdhsbsid
So... when do you reach the point where it's better to just use a hashtable in
RAM? Super-high-speed "in memory" databases are still beaten by manipulating
the data structures yourself.

I feel like there are very few applications where all-out speed is important
and where it's still better to use a database than to just do the operation
yourself in RAM and save the network overhead.

You can insert billions of items per second into a hashtable, and when you're
working in your own app's memory, transactions aren't needed.
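
For a ballpark, a crude single-threaded Java micro-benchmark (no JMH warmup,
so treat the output as an order of magnitude at best; primitive-specialized
and multi-threaded maps go much higher):

    import java.util.HashMap;
    import java.util.Map;

    public class MapInsertBench {
        public static void main(String[] args) {
            final int n = 10_000_000;
            Map<Integer, Integer> map = new HashMap<>(n * 2); // pre-size to avoid rehashing
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                map.put(i, i);
            }
            double secs = (System.nanoTime() - start) / 1e9;
            // Results vary wildly with key types, hashing, GC pressure, and
            // JIT warmup -- run it a few times and take the best.
            System.out.printf("%,d inserts in %.2fs (%,.0f/s)%n", n, secs, n / secs);
        }
    }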

------
jandrewrogers
The key point is that they managed to do it with Accumulo; the insertion rate
through storage is otherwise unremarkable. On 10 GbE clusters, line-rate
insertion has been relatively easy to achieve for several years now.

An important caveat is that they disabled all of the durability, replication,
and safety features. Graph500 records are quite small, so the insertion rate,
given the size of their cluster, implies an average per-node throughput
significantly below line-rate.
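
A back-of-envelope check, where every figure below is an assumption for
illustration rather than a number from the paper:

    public class LineRateEstimate {
        public static void main(String[] args) {
            // Assumed figures for illustration only -- not taken from the paper.
            long insertsPerSec = 100_000_000L; // headline aggregate insert rate
            int bytesPerEntry  = 50;           // assumed small Graph500-style entry
            int nodes          = 200;          // assumed cluster size
            double gbitPerNode =
                    insertsPerSec * (double) bytesPerEntry * 8 / nodes / 1e9;
            // Prints roughly 0.20 Gbit/s per node, far below 10 GbE line rate.
            System.out.printf("~%.2f Gbit/s per node vs 10 Gbit/s line rate%n",
                    gbitPerNode);
        }
    }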

------
hmottestad
I was wondering. Does anyone here use Accumulo internally or for a client?

I had not heard of it before and the line "widely used for government
applications" made me wonder why I hadn't. I'm a consultant working with
graphs in Norway and this database is completely new to me.

~~~
Scaevolus
It was developed by the NSA based on Google's Bigtable paper, so it has a lot
of US government users. I _think_ Apache HBase or Cassandra are far more
popular NoSQL solutions for most users.

~~~
mikecb
If you don't need cell level access control, you probably wouldn't be looking
for it. I guess hbase now offers this too.
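
For anyone unfamiliar with the feature, a minimal sketch using the Accumulo
1.x-style Java client (instance name, hosts, table, and visibility labels are
all placeholders):

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.security.ColumnVisibility;

    public class CellVisibilityExample {
        public static void main(String[] args) throws Exception {
            // Placeholder instance/credentials.
            Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
                    .getConnector("user", new PasswordToken("secret"));
            BatchWriter bw = conn.createBatchWriter("records", new BatchWriterConfig());

            Mutation m = new Mutation("row1");
            // Each cell carries its own visibility expression, evaluated
            // against the scanning user's authorizations at read time.
            m.put("meta", "summary", new ColumnVisibility("public"), "visible to all");
            m.put("meta", "detail", new ColumnVisibility("secret&ops"), "restricted cell");
            bw.addMutation(m);
            bw.close();
        }
    }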

------
lngnmn
The word "database" implies transactions committed to persistent, durable
storage (such that the data survives a reboot).

~~~
castis
IIRC, "database" means an organized collection of data. Persistence is just a
nice feature.

~~~
threeseed
It literally means: "a structured set of data held in a computer, especially
one that is accessible in various ways."

You are right that restart-survivable persistence has absolutely nothing to do
with it.

~~~
sbov
This comment branch is a weird topic, but the implication of these definitions
is that modern languages come packaged with multiple databases, some of which
can scale to billions of writes per second (arrays, lists, sets, etc.).

