

Russ’ 10 Ingredient Recipe for Making 1 Million TPS on $5K Hardware - antman
http://highscalability.com/blog/2012/9/10/russ-10-ingredient-recipe-for-making-1-million-tps-on-5k-har.html

======
colanderman
> _context switches passing tcp-packets back and forth from the operating
> system were taking up 35%_

You could get rid of this (and in doing so, double your TPS) by switching to a
memory-polling-based network driver like PF_RING [1] (and obviously, keeping
the kernel on its own core like you are doing).

> _Lookup data in memory (this is fast enough to happen in-thread)_

> _I knew I had it right when I watched the output of the “top” command,_

Does all your data live in cache? If not, have you tried using perf [2] to
measure load stalls? They are typically the bottleneck once you get past
context switches. Hyperthreading should help at least somewhat here (do you
have it enabled?).

[1]
[http://www.ntop.org/products/pf_ring/](http://www.ntop.org/products/pf_ring/)

[2] [https://perf.wiki.kernel.org/](https://perf.wiki.kernel.org/)

~~~
wmf
Of course, if you use Netmap/DPDK/PF_RING you have to bring your own TCP
stack, which is more than many app developers are comfortable with.

~~~
colanderman
Excellent point which I missed :) Can you tell I live in an L3 bubble?

I am surprised (from a quick Google) there is no open-source user-space
PF_RING-aware TCP stack. Am I missing something?

~~~
wmf
When you're doing packet capture that normally costs one system call per
packet, Netmap/DPDK/PF_RING are a clear win. But with TCP a single system call
can send or receive many packets and TSO/LRO offloads help even more. I
haven't seen any numbers about the benefits of kernel bypass for TCP; maybe
SolarFlare has some.

~~~
itchyouch
I was about to point out solarflare's openonload. Its only compatible with an
sfc nic though. With onload, i've seen about a 20% throughput increase when
turning onload on.

------
wildmXranat
>grafting AlchemyDB’s flexible data-modeling capabilities onto Aerospike’s
high-velocity horizontally-scalable key-value data-fabric.

It reached an awesome Kps - keywords per sentence count as well.

------
dschiptsov
T in TPS means Transactions, right? What kind of a transaction is that one
which uses no persistent storage?)

1 million network-to-memory writes, well, that is quite possible, but, please,
do not call this a transaction in the way it meant in TPS.)

What was meant in the old days by transaction, was an atomic operation which
completes after storing the data in a persistent (usually direct-access, which
means no buffering by an OS kernel) storage, so it could be read without any
corruption if a power loss will occur the very next second.

~~~
leif
You can achieve durability with very high performance using write-ahead
logging, lots of concurrent writers doing group commit for the log fsyncs, and
a write-optimized data structure like an LSM tree or a fractal tree. Maybe not
1 million on a $5k server, but you can get a lot closer than I imagine you're
picturing right now.

In any case what they seem to be measuring is a read-only, in memory workload,
so this is not that impressive. IIRC, InnoDB has no trouble pulling off
something like this.

~~~
dschiptsov
So, disk writes are necessary, after all?

Yes, there are lots of tricks, like placing that append-only physical
transaction log on a different controller with a distinct storage device, etc.
Data partitioning is the another big idea. Having indexes in memory to avoid
unnecessary reads, using collected statistics in a query optimizer, etc. But
nothing could beat the partitioning based on actual workloads and separation
of tablespaces on distinct hardware, including decoupling indexes from the
tables - this is what DBAs were for.

I used to be Informix DBA in old good days, so I can't help but smile when I
look at MySQL (well, they added lots of partitioning options in recent InnoDB
- the things Informix could do out of box 12 years ago) leave alone modern
NoFsync "databases".)

Btw, not all NoSQL guys are insane.) Riak with LevelDB storage backend is very
sane approach, which cares about and counts writes.

~~~
tracker1
There's also RethinkDB which seems to be focused on the D in acid while being
a non-relational database. When you really need performance, in general
relationships/joins need to go out the window as much as possible, and often
one or more of the letters in ACID are compromised.

It should get very interesting in the next couple of years.. of course _MOST_
environments don't need the kind of performance or scale that these systems
are really offering.

IIRC StackOverflow ran for a very long time on a single server, under some
pretty serious demand. In some cases SQL with a caching system for mostly-read
data can be better... other scenarios tend to fit a document (non-relational)
data store better.. just depends.

~~~
leif
I see rethinkdb as being focused more on the data model and language, and the
cluster administration experience. The performance doesn't seem compelling
yet, though they are admirably durable by default.

You're right, with a reliably performant engine you can get a lot more out of
a single machine than a lot of people these days seem to think. That's part of
our vision for TokuMX, to bring back a little bit of "scale up" potential to
the NoSQL space.

------
jpallas
Ingredient 0: redefine “transaction” to mean nothing remotely like what it
means to everyone else.

~~~
aurelianito
Which part of the article makes you think that the operations being made are
not ACID?

~~~
diek
> 100% in-memory workload. Don’t even think about hitting disk for 0.0001% of
> these requests.

They also discuss how these are 1M read requests.

------
jahaja
Regarding point 9, "IRQ affinity from the NIC".

If one would like to delve a bit deeper into this one can find some info
regarding the techniques introduced for linux here:
[http://lxr.linux.no/linux/Documentation/networking/scaling.t...](http://lxr.linux.no/linux/Documentation/networking/scaling.txt)

------
free652
I'd have to guess that they're limiting themselves to a specific platform
(hardware & software). Probably selling an appliance would be best way for
them, instead of dealing with sysadmins that don't understand how to properly
configure the database on their machines.

------
bradhe
Somewhat off topic, but this article perfectly shows how difficult taking full
advantage of a multi-core environment truly is, and how lacking a lot of our
tools are. Granted, this is an extreme example.

~~~
jamesaguilar
Our tools for something like this aren't great because almost nobody needs to
do 1M reads per second, and of those who do, only a tiny fraction have a
working set that can fit economically in memory _and_ is read-only, like the
system in this article.

------
nasalgoat
I was one of the people who commented on the original post. Since then I've
had little success in getting any of these recommendations to work.

Does anyone have some actual configuration examples to provide for things like
setting IRQ affinity?

------
antman
The author has moved to Aerospike. Here are some benchmarks a few months back
[http://www.infoq.com/news/2013/04/NoSQL-
Benchmark](http://www.infoq.com/news/2013/04/NoSQL-Benchmark)

------
sillysaurus
Is there a domain where this is applicable outside of high performance
trading? I'm trying to think of another use case that would legitimately
generate 1M queries/sec.

~~~
derefr
Some of the larger combats (5000+ ships) in EVE Online might do that.

(I single EVE out because other MMOs generally shard by user-cohort, so having
that number of people on one shard is impossible. EVE, meanwhile, shards by
location within the virtual world (each star system is a shard), so the entire
player-base can "gather" on a single shard for a confrontation.)

~~~
recuter
> In February 2013, EVE Online reached over 500,000 subscribers.

So what, the entire game userbase needs to be awake and in the same spot? :0)

~~~
ricardobeat
Perfect timing: [http://www.theverge.com/2013/7/28/4565558/eve-online-
biggest...](http://www.theverge.com/2013/7/28/4565558/eve-online-biggest-
space-battle-in-history)

------
scotty79
TPS as in "Torsonic Polarity Syndrome"?

~~~
blhack
Transactions per second, I think.

------
rorrr2
How often OS/network/context overhead is the bottleneck? In my experience most
of the time it's the DB. Even if it fits in RAM, complex queries always take
most of the time. (Web dev here).

~~~
tracker1
That's kind of a key argument in NoSQL data stores... In many scenarios having
a key/value store is better, a single key to lookup a record and all the
necessary related data.

Some hybrid stores like MongoDB, RethinkDB and others offer more
characteristics similar a traditional SQL RDBMS, while offering horizontal
scaling. It's when you have several join operations that performance really
takes a hit under significant load. You can't really scale a relational
database in the same way.

That said, as much as I like non-relational databases, they aren't the best
fit for every use case. Beyond this, most situations don't _need_ that type of
performance scaling.

------
ignoranimus
Gee, that's spectacular. Now...

If only they would do the reader the simple, and most gracious service of
defining precisely what they mean by this obscure "TPS" acronym.

...and before you downvote this comment (because I can smell your itchy little
fingers all the way from the otherside of the internet), yes, I can assure you
that I did actually Google for the answer. And yes, I did discern what is
meant by TPS.

But that isn't the point. The point isn't that I'm a lazy slacker, and/or an
ignorant yokle because I didn't already know the meaning of the abreviation
innately, and feel inconvenienced by having to open another browser window,
and search for some clue.

The point is that the author is assuming everyone will immediately know and
understand that acronym, but meanwhile, when I conduct my search, I am forced
to assume that my chosen definition is correct, wihout actually knowing for
sure.

And for that reason, I'm going to leave out the meaning I've chosen as the
author's intended definition for TPS. I have no way of knowing whether my
assumption was accurate. So the mystery persists. What does TPS actually mean?
Go search for it, you lazy, ignorant slacker.

~~~
kintamanimatt
There's little excuse for laziness when pursuing new knowledge. A two second
Google search and a new tab is not a burden. If you didn't get the answer
first time, you should qualify the search with one of the other keywords from
the article.

A certain level of reader sophistication is assumed, and not everything is
explained with a high level of hand-holding. It's a technical tutorial, not a
beginner's tutorial.

TPS = transactions per second.

~~~
regularfry
In general you'd be right, but in this specific case there's a fair bit of
ambiguity as to what "transaction" is taken to mean. A definition or two
wouldn't have gone amiss here.

