
Principles of high-performance programs (2012) - arunc
http://blog.libtorrent.org/2012/12/principles-of-high-performance-programs/
======
arielweisberg
Reading claims like "is the eviction of your cache by the kernel while
executing the system call" makes me suspicious. See
http://mechanical-sympathy.blogspot.com/2013/02/cpu-cache-flushing-fallacy.html

Some of this seems focused on working around broken primitives and frameworks.

Concurrency primitives in Java will not go to the kernel unless a thread
actually has to park, and if you use an executor service or something
similar, threads will not go to sleep while the task queue is non-empty. I
suspect the pthread primitives are similarly careful to avoid kernel
coordination when none is needed.

I don't actually see how a task queue would unschedule a consumer for each
element if the queue is not empty. What is going to wake up the thread that is
blocking on a non-empty queue? I guess I am spoiled by having queues solved
well in a standard library.
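
For illustration, the standard solution looks roughly like this in pthreads
(hypothetical task type); the consumer only parks when the queue is actually
empty, and on Linux an uncontended mutex is a futex that never leaves user
space:

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical intrusive task queue. */
    typedef struct task { struct task *next; void (*run)(void); } task;

    static task *head;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

    static void *consumer(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);   /* futex: no syscall if uncontended */
            while (head == NULL)         /* park ONLY when there is no work */
                pthread_cond_wait(&nonempty, &lock);
            task *t = head;
            head = t->next;
            pthread_mutex_unlock(&lock);
            t->run();                    /* run the task outside the lock */
        }
        return NULL;
    }

A producer only needs to pthread_cond_signal when it transitions the queue
from empty to non-empty; while work keeps arriving, the consumer never
blocks.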

Context switching also comes up a lot as the sort of thing that is
expensive. It is only expensive if the task size is small relative to the
cost of a context switch, which isn't that large. My experience is that for
small non-blocking tasks you can run a thread per core to expose the
available parallelism, and everything else will tolerate context switching.

I am also perpetually hearing about the importance of hot caches and not
switching threads. Caches are just tracking state for tasks, and unless you
have actually done something to create locality between tasks there is nothing
to make them stay hotter anyways.

If the state the cache is tracking is multiple thread stacks, well... the CPU
doesn't know the difference between data on a stack and data that it is
chasing through some pointer.

The real problem is having a task migrate to a different CPU instead of
waiting its turn in the right spot, and that can be solved in other ways.
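
On Linux the usual way is pinning a thread per core; a sketch using the
glibc-specific pthread_setaffinity_np (error handling omitted):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to one CPU so the scheduler cannot
       migrate it and its working set stays in that core's cache. */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }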

Access pattern matters as well. If you are going to sequentially process
buffers then prefetching will work and there is no benefit to a hot cache.
That is where the emphasis on zero copy tends to show holes. Think about the
difference in speed between RAM and a network or disk interface, and how much
processing you are going to do beyond just the copying.
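
The access-pattern point is easy to demonstrate: stream over a buffer once
in order and once through shuffled indices (a rough sketch; absolute timings
are machine-dependent, but the in-order pass lets the hardware prefetcher
hide memory latency):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)

    static double elapsed(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        int *data = malloc(N * sizeof *data);
        size_t *idx = malloc(N * sizeof *idx);
        for (size_t i = 0; i < N; i++) { data[i] = 1; idx[i] = i; }
        for (size_t i = N - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
            size_t j = rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }

        struct timespec t0, t1;
        long sum = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i++) sum += data[i];      /* sequential */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("sequential: %.3fs\n", elapsed(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i++) sum += data[idx[i]]; /* random */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("shuffled:   %.3fs (sum=%ld)\n", elapsed(t0, t1), sum);
        return 0;
    }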

My main beef is that pushing this kind of performance thinking without
providing measurements that show where it does and doesn't matter encourages
the kind of premature optimization that isn't productive.

~~~
dllthomas
_" Caches are just tracking state for tasks, and unless you have actually done
something to create locality between tasks there is nothing to make them stay
hotter anyways."_

This is only true if your data is small enough that it never gets evicted.
Otherwise, there are certainly things that can make them stay hotter without
involving multiple cores: You can shrink your data, reorder your data, or
reorder your traversal of your data.
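
For example, if a hot loop touches only one field, splitting an array of
structs into parallel arrays keeps the useful bytes dense in cache; a sketch
with made-up field names:

    #include <stddef.h>

    /* Array of structs: summing `hits` drags `name` and `payload`
       through the cache too -- most of each line fetched is wasted. */
    struct record { long hits; char name[24]; char payload[32]; };

    long sum_aos(const struct record *r, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += r[i].hits;
        return s;
    }

    /* Struct of arrays: the same traversal now reads 8 useful bytes
       per 8 bytes fetched, so far more entries fit per cache line. */
    long sum_soa(const long *hits, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += hits[i];
        return s;
    }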

------
hyc_symas
One problem with "read everything on the socket until it's empty" is an issue
of fairness. A particularly greedy client could keep its connection filled
with requests, and this approach would monopolize a server thread, potentially
starving other clients/sockets.

~~~
hyc_symas
Btw, that's not just theoretical. The connection manager in OpenLDAP slapd
used to do exactly that - and then we found exactly that problem while
benchmarking/soak testing. Now we always process only one request from a
connection before moving on to the next connection. Not as efficient, but more
fair.
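
In event-loop terms the fair version looks roughly like this (read_request
and handle are hypothetical stand-ins; the point is that there is no
per-socket drain loop):

    #include <sys/epoll.h>

    struct request { char buf[512]; };             /* hypothetical */
    int  read_request(int fd, struct request *r);  /* 1 = got a request */
    void handle(struct request *r);

    void event_loop(int epfd)
    {
        struct epoll_event ev[64];
        for (;;) {
            int n = epoll_wait(epfd, ev, 64, -1);
            for (int i = 0; i < n; i++) {
                struct request req;
                /* One request per ready connection, then move on.
                   With level-triggered epoll, anything still buffered
                   makes the fd show up again on the next pass, so a
                   greedy client cannot monopolize the thread. */
                if (read_request(ev[i].data.fd, &req) == 1)
                    handle(&req);
            }
        }
    }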

There are always tradeoffs. Time sharing, multi-processing, and time-slicing
are all inherently inefficient. Batch mode processing, where 100% of system
resources are dedicated to a single job until it completes, is most efficient
and gets jobs done quicker. We accept this efficiency tradeoff because we
believe that interactive processing is more user-friendly.

So take the quest for efficiency with a grain of salt. Perfect efficiency may
actually give a worse user experience.

On the flip side, we have LMDB, a read-optimized database engine that makes
no system calls or blocking library calls at all in the read codepath. It
is about as close to perfectly efficient as one can get without resorting
to hand-optimized machine language. Fairness inside LMDB isn't a concern,
because calling apps are always going to trigger context switches on their
own anyway.
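
From the calling app's side that read path looks something like the sketch
below; once the environment is mapped, mdb_get just chases pointers through
B-tree pages in the shared mmap (environment setup omitted):

    #include <string.h>
    #include <lmdb.h>

    int lookup(MDB_env *env, MDB_dbi dbi, const char *k,
               void *buf, size_t buflen)
    {
        MDB_txn *txn;
        MDB_val key = { strlen(k), (void *)k }, val;

        int rc = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        if (rc) return rc;
        /* Walks B-tree pages already mapped into the process:
           memory reads only, no system calls. */
        rc = mdb_get(txn, dbi, &key, &val);
        /* val.mv_data points into the shared map and is only valid
           inside the txn, so copy before releasing it. */
        if (rc == 0 && val.mv_size <= buflen)
            memcpy(buf, val.mv_data, val.mv_size);
        mdb_txn_abort(txn);       /* read txns are simply released */
        return rc;
    }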

The quest for efficiency has different constraints depending on what level of
a system you're programming in.

~~~
hvidgaard
One request per connection at a time, while fair, seems a bit harsh. From a
performance perspective it's not a bad idea to process up to n requests in
at most m time from a connection before moving on to the next, and thus
amortize the cost of switching connections.

~~~
robotresearcher
The issue of magic numbers like this n is discussed in the article. n=1 has
the benefit of perfect fairness. All other values have to be tweaked for the
platform.

~~~
hvidgaard
The article also explicitly says you should amortize the cost of things
like context switches, and switching connections is one of those.

I never said that n and m have to be magic numbers. They can (and should)
be adaptive. n=1 does provide perfect fairness, but what if switching
connections costs as much as processing one request? Setting n = 2 then
increases throughput by 33% (two requests cost three time units instead of
four), at the cost of a 50% longer wait until the last connection is
handled. And because throughput is higher, any subsequent requests will be
handled faster than with n = 1.

In reality you want a sane m value, the time allowed to cycle through all
connections. I'm not sure exactly how to make this adaptive, but it's likely
very dependent on the nature of the usage (http connections from a GUI, p2p
network, or something else). As long as you cycle through all connections
within m time and aren't saturating the connection, the algorithm can increase
n to increase throughput.
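
A sketch of such a bounded batch, reusing the hypothetical handlers from
the event-loop sketch above; max_reqs and budget_ns play the roles of n and
m, left here as plain parameters that a feedback loop could adjust:

    #include <time.h>

    struct request { char buf[512]; };             /* hypothetical */
    int  read_request(int fd, struct request *r);  /* 1 = got a request */
    void handle(struct request *r);

    static long ns_since(const struct timespec *t0)
    {
        struct timespec t1;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0->tv_sec) * 1000000000L
             + (t1.tv_nsec - t0->tv_nsec);
    }

    /* Serve at most max_reqs requests or budget_ns nanoseconds from
       one connection, whichever comes first, then yield to the next. */
    void serve_connection(int fd, int max_reqs, long budget_ns)
    {
        struct timespec t0;
        struct request req;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int served = 0; served < max_reqs; served++) {
            if (read_request(fd, &req) != 1)
                break;                    /* connection drained */
            handle(&req);
            if (ns_since(&t0) >= budget_ns)
                break;                    /* time budget spent */
        }
    }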

------
jcrites
> Imagine a world where all system calls are asynchronous, all events and
> system call return values are posted onto a message queue, and you could
> drain the message queue with a single interaction with the kernel.

Interesting idea. Have existing OSes explored this? What are the major
technical challenges and proposed solutions? Questions and ideas:

One option is to design every such API so that it can accept a batch of
work: an array of structures representing the parameters to individual
logical API calls, handed off with a single call. Perhaps that would
provide most of the achievable benefit, since I imagine these performance-
intensive applications are bottlenecked on high volumes of calls to a small
number of APIs.

It would be a pain to modify every API interface in this fashion, and I
imagine it will be difficult to implement those APIs.

Could the kernel provide a single batch facility accepting as input an array
of structures representing API calls? The goal would be to make it possible to
implement those APIs mostly like normal, such that the kernel takes
responsibility for fanning out the batch into individual API invocations, in
the simplest case by executing them sequentially; though the implementation of
the API could also take advantage of batching as well.
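
A hypothetical plain-array version of that interface could look like the
sketch below, using operation numbers rather than function pointers
(pointers can't safely cross the user/kernel boundary); for what it's
worth, Linux's io_uring has since taken roughly this shape:

    #include <stddef.h>

    /* Hypothetical batched-syscall interface. */
    enum batch_op { BATCH_READ, BATCH_WRITE, BATCH_CLOSE };

    struct batch_entry {
        enum batch_op op;       /* which call to make */
        int           fd;
        void         *buf;
        size_t        len;
        long          result;   /* filled in per entry by the kernel */
    };

    /* One kernel crossing submits the whole array; the kernel fans
       the entries out, sequentially in the simplest implementation. */
    long sys_batch(struct batch_entry *entries, size_t count);

Giving each entry its own result slot, rather than sharing errno-style
state, also sidesteps the clobbering problem discussed next.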

Success of the calls would need to be determined through an API like
select, or alternatively through IO events on the individual file
descriptors being worked on. There would need to be a canonical format for
specifying the API to invoke (a function pointer?) and its arguments (a
calling convention). I imagine this could be quite challenging to implement
in general; all state would need to be managed explicitly. For example, you
could not use a calling convention where the caller is expected to use
another API to get details about why the last call failed (like Windows
`GetLastError`), since the state would be clobbered by subsequent calls in
a batch.

In terms of a mechanism, could the OS and applications communicate through
concurrent data structures like the disruptor pattern rather than by direct
context switching? It seems like this would potentially require passing
input/output through main memory (if crossing CPUs). Unsure if this would
provide a net benefit.
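
Stripped down, that channel is a single-producer/single-consumer ring where
each side only ever writes its own index; a minimal user-space sketch with
C11 atomics (a kernel-facing version would live in shared memory):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024                /* power of two */

    struct ring {
        /* Real disruptor-style rings pad these onto separate cache
           lines to avoid false sharing between the two sides. */
        _Atomic size_t head;              /* written by consumer */
        _Atomic size_t tail;              /* written by producer */
        void *slots[RING_SIZE];
    };

    /* Producer: publish one item; fails if full. No syscall, no lock. */
    bool ring_push(struct ring *r, void *item)
    {
        size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t h = atomic_load_explicit(&r->head, memory_order_acquire);
        if (t - h == RING_SIZE)
            return false;                 /* full */
        r->slots[t & (RING_SIZE - 1)] = item;
        atomic_store_explicit(&r->tail, t + 1, memory_order_release);
        return true;
    }

    /* Consumer: take one item; fails if empty. */
    bool ring_pop(struct ring *r, void **item)
    {
        size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (t == h)
            return false;                 /* empty */
        *item = r->slots[h & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, h + 1, memory_order_release);
        return true;
    }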

From a naive perspective this all seems feasible, but also sounds like a huge
amount of work and the primary benefit would be for the highest of high-
performance applications. Though I wonder how much overall performance would
improve for typical applications if we can minimize context switching.

Potentially relevant:
https://news.ycombinator.com/item?id=7679822 - Linus Torvalds on the high
cost of page faults, which are more likely to happen under frequent context
switching.

~~~
nostrademons
Wasn't win32 done like that? I have vague memories of
PostMessage/GetMessage/TranslateMessage/DispatchMessage loops. I never did
enough Windows development to understand where the actual syscall boundaries
were, but since everything was done by message passing to the main loop of
your application, the API seems amenable to pumping out multiple messages in
one syscall.
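
For reference, the loop in question (standard Win32; GetMessage is the
blocking kernel crossing, while TranslateMessage/DispatchMessage mostly
stay in user mode, as noted below):

    #include <windows.h>

    /* Classic Win32 message pump. */
    int run_message_loop(void)
    {
        MSG msg;
        while (GetMessage(&msg, NULL, 0, 0) > 0) {
            TranslateMessage(&msg);  /* keyboard translation, user mode */
            DispatchMessage(&msg);   /* calls the window procedure */
        }
        return (int)msg.wParam;
    }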

~~~
yuhong
Currently PostMessage/GetMessage are syscalls, I think;
TranslateMessage/DispatchMessage are not, most of the time.

~~~
quotemstr
DefWindowProc is a system call. Really, switching to kernel mode is not as
expensive as people suppose.

~~~
yuhong
DefWindowProc is not always a system call. It depends on the message being
handled, for example.

------
dschiptsov
Are they saying that threads and GC are crap, so to write a predictable code
one should think in terms of pre-allocared data [structures] and partitioned
[block] I/O requests?)

This is a big news..) As far as I remember Informix Dynamic Servet 7.30 has
been released 15 yeas ago..

