

Speculative Lock Elision in Varnish Cache - perbu
https://www.varnish-software.com/blog/speculative-lock-elision-varnish-cache

======
colanderman
As I understand it, HLE is the wrong solution for high-contention critical
sections. The locks themselves are added to the transaction's read set [1],
which means that if _any_ of the concurrently executing transactions of this
critical section are re-executed non-speculatively, then _all_ these
transactions are re-executed non-speculatively, and importantly: _any future
transactions which overlap these will be executed non-speculatively,
potentially indefinitely_.

You really want full RTM for such high-contention situations.

I would love to be wrong about this, but I haven't a Haswell to test on.

Also, classy shout-out (beg-out?) to Intel's free samples division at the end
there.

[1]
[http://software.intel.com/sites/products/documentation/docli...](http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-win/GUID-A462FBC8-37F2-490F-A68B-2FFA8010DEBC.htm)

~~~
sp332
FTA: _one of the updates might be lost, something we see from time to time._

You're right, but it doesn't sound like these are high-contention operations
since they currently run without any locks and rarely have problems.

~~~
colanderman
Sorry, I meant high contention on the lock – if they plan on using a
coarse-grained lock, I presume it _will_ see high contention (assuming they
have many statistics).

Obviously if they plan on using one lock per statistic, this isn't a problem,
but then they should just be using atomic updates instead.

------
jedbrown
I highly recommend Paul McKenney's blog series on this topic:

[http://paulmck.livejournal.com/tag/transactional%20memory%20...](http://paulmck.livejournal.com/tag/transactional%20memory%20everywhere)

------
cespare
> To mitigate this we can use locks around the statements that update the
> counters. However, these locks are expensive and we’ve been hesitant to
> place locks around all the counters just to make sure they are accurate.
> Performance over accurate statistics, right?

Does this mean the Varnish authors have intentional data races all over the
codebase? Can't the compiler do whatever it wants in these cases (because it's
undefined behavior)? Can't this completely blow up?

[http://software.intel.com/en-us/blogs/2013/01/06/benign-data...](http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong)

Edit: I'm not an expert on this stuff so I'm probably wrong -- maybe someone
can tell me what I'm missing.

~~~
damienkatz
I believe it's only for stats, which means the stats might not be accurate,
but shouldn't affect the proper functioning of code (assuming nothing relies
on stats being accurate).

------
shin_lao
Before using HTM, I encourage the OP to use lockfree structures and atomics.

Transactional memory is absolutely not free and certainly not a silver bullet.

Additionally, without proper compiler support it's going to be very hard to
use it properly.

~~~
colanderman
> _Transactional memory is absolutely not free_

Can you compare, in your experience, the performance of HTM with lock-free
structures? My experience is that lock-free structures are rather slow, and
my understanding of TSX is that it _is_ almost free.

> _without proper compiler support it's going to be very hard to use it
> properly._

They're using Intel's HLE, which needs only very simple library support,
already included in recent versions of glibc.

~~~
shin_lao
Lock-free structures are as slow as your memory allocation scheme (nota bene:
I've designed several lock-free structures).

If you're clever about how you use them, it can be a big win.

HLE isn't really HTM, but it's a first step.

~~~
colanderman
_Lock-free structures are as slow as your memory allocation scheme_

I've found they're an order of magnitude slower than your architecture's
atomics, mostly because of the memory fencing required. Though, the
structures I've used (queue & stack) didn't require any memory allocation on
the fly.

 _HLE isn't really HTM, but it's a first step._

I know; my post should be read in the broader context of Intel's TSX (which
includes their RTM instructions).

~~~
shin_lao
Which lock-free structure have you been using and for what purpose?

~~~
colanderman
Michael & Scott, "Simple, Fast, and Practical Non-Blocking and Blocking
Concurrent Queue Algorithms", which I understand to be fairly standard, as
well as the "textbook" lock-free stack algorithm (dunno who invented it, I
re-"invented" it myself and then confirmed it was identical to that given in
The Art of Multiprocessor Programming).

Use case for the queue is message-passing. Use case for the stack is as the
allocator for the queue.

~~~
shin_lao
Which language? C? You did your own implementation? Sounds like you used the
wrong stuff.

~~~
colanderman
Yes, C, my own implementation. What is the "right" stuff for low-latency,
high-contention message passing?

~~~
shin_lao
Well, if you were coding in C++ I would say "have a look at TBB".

The thing is, textbook implementations generally lack all the platform-
dependent details needed to make them efficient.

------
ars
For the confused, you can translate the title as:

"Removing Speculative Locks in Varnish Cache"

------
nkurz
_Branch prediction across memory barriers is also something I don’t think is
doable._

My understanding is weak, but this doesn't sound right. Presuming he means
that "speculative execution across a write memory barrier is prohibited", is
he correct?

~~~
nn3
He's not correct. In fact the post is full of gobbledygook he only partly
understands.

~~~
Whitespace
Care to elaborate?

~~~
nn3
Just two examples, there is more wrong, beginning with the title.

>It creates a “memory barrier”. CPUs are usually free to shuffle around the
stream of instructions that flow through them and it is done all the time as
the CPU constantly optimizes the instruction stream to keep all its resources
busy. A lock creates a barrier and the CPU is not allowed to move stuff past
this barrier. Branch prediction across memory barriers is also something I
don’t think is doable.

He's confusing instruction retiring with memory ordering. On modern fast CPUs
they are only vaguely related.

>Cache coherency. The address line that stores the lock gets invalidated and
the content in that cache gets thrown out.

That's not how a modern CPU with the MESI protocol operates. As long as the
cache line stays EXCLUSIVE to that core it does not get invalidated.

