Speculative Lock Elision in Varnish Cache (varnish-software.com)
43 points by perbu 1491 days ago | 30 comments

As I understand it, HLE is the wrong solution for high-contention critical sections. The lock itself is added to each transaction's read set [1], which means that if any one of the transactions speculating through this critical section aborts and re-executes non-speculatively (acquiring the lock for real), then all the concurrent transactions on it abort and re-execute non-speculatively, and, importantly, any future transactions that overlap them will also execute non-speculatively, potentially indefinitely.

You really want full RTM for such high-contention situations.

I would love to be wrong about this, but I haven't a Haswell to test on.

Also, classy shout-out (beg-out?) to Intel's free samples division at the end there.

[1] http://software.intel.com/sites/products/documentation/docli...

FTA: one of the updates might be lost, something we see from time to time.

You're right, but it doesn't sound like these are high-contention operations since they currently run without any locks and rarely have problems.

Sorry, I meant high contention on the lock – if they plan on using a coarse-grained lock, I presume it will see high contention (assuming they have many statistics).

Obviously if they plan on using one lock per statistic, this isn't a problem, but then they should just be using atomic updates instead.

I highly recommend Paul McKenney's blog series on this topic:


> To mitigate this we can use locks around the statements that update the counters. However, these locks are expensive and we’ve been hesitant to place locks around all the counters just to make sure they are accurate. Performance over accurate statistics, right?

Does this mean the Varnish authors have intentional data races all over the codebase? Can't the compiler do whatever it wants in these cases (because it's undefined behavior)? Can't this completely blow up?


Edit: I'm not an expert on this stuff so I'm probably wrong -- maybe someone can tell me what I'm missing.

I believe it's only for stats, which means the stats might not be accurate, but shouldn't affect the proper functioning of code (assuming nothing relies on stats being accurate).

Not just the compiler, the processor can do whatever it wants.

Before reaching for HTM, I'd encourage the OP to try lock-free structures and atomics.

Transactional memory is absolutely not free and certainly not a silver bullet.

Additionally, without proper compiler support it's going to be very hard to use it properly.

> Transactional memory is absolutely not free

Can you compare, in your experience, the performance of HTM with that of lock-free structures? My experience is that lock-free structures are rather slow, while my understanding of TSX is that it is almost free.

> without proper compiler support it's going to be very hard to use it properly.

They're using Intel's HLE, which requires only very simple library support, which is already included in recent versions of glibc.

Lock-free structures are as slow as your memory allocation scheme (nota bene: I've designed several lock-free structures).

If you're clever about how you use them, it can be a big win.

HLE isn't really HTM, but it's a first step.

> Lock free structures are as slow as your memory allocation scheme

I've found they're an order of magnitude slower than your architecture's raw atomics, mostly due to the memory fencing required. That said, the structures I've used (queue and stack) didn't require any memory allocation on the fly.

> HLE isn't really HTM, but it's a first step.

I know; my post should be read in the broader context of Intel's TSX (which includes their RTM instructions).

Which lock-free structure have you been using and for what purpose?

Michael & Scott, "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms", which I understand to be fairly standard, as well as the "textbook" lock-free stack algorithm (dunno who invented it, I re-"invented" it myself and then confirmed it was identical to that given in The Art of Multiprocessor Programming).

Use case for the queue is message-passing. Use case for the stack is as the allocator for the queue.

Which language? C? You did your own implementation? Sounds like you used the wrong stuff.

Yes, C, my own implementation. What is the "right" stuff for low-latency high-contention message passing?

Well, if you were coding in C++ I would say "have a look at TBB".

The thing is, textbook implementations generally lack all the platform-dependent details needed to make them efficient.

To be fair, contended atomics aren't free either, especially if you're on the wrong NUMA node. Sharing is rarely free, so try not to share.

Exactly. Sharing is free for reading only... The minute you write, the laws of physics are there to annoy you.

Here is a relevant ACM article from quite some time back on the topic: http://queue.acm.org/detail.cfm?id=1454466

Can you go into more detail on this?

> Branch prediction across memory barriers is also something I don’t think is doable.

My understanding is weak, but this doesn't sound right. Presuming he means that "speculative execution across a write memory barrier is prohibited", is he correct?

He's not correct. In fact the post is full of gobbledygook he only partly understands.

Care to elaborate?

Just two examples; there is more wrong, beginning with the title.

> It creates a “memory barrier”. CPUs are usually free to shuffle around the stream of instructions that flow through them and it is done all the time as the CPU constantly optimizes the instruction stream to keep all its resources busy. A lock creates a barrier and the CPU is not allowed to move stuff past this barrier. Branch prediction across memory barriers is also something I don’t think is doable.

He's confusing instruction retiring with memory ordering. On modern fast CPUs they are only vaguely related.

> Cache coherency. The address line that stores the lock gets invalidated and the content in that cache gets thrown out.

That's not how a modern CPU with the MESI protocol operates. As long as the cache line stays EXCLUSIVE to that core, it does not get invalidated.

Counters aren't a good use case for transactional memory because the contention is over very specific sets of memory.

Transactional memory is useful when a coarse lock guards a wide set of data, most of which is not accessed concurrently. You have the hardware do the work of lock elision/splitting for you.

This saves you from having the unnecessary mutual exclusion of a regular coarse lock. Or alternatively it saves you from the unnatural acts necessary to get lock splitting or eliminate shared state.

For write-mostly counters where performance is an issue, you want thread-local counters or striped counters.

See http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr1... for an example of how that is done. This counter demonstrates lock splitting as well as an attempt at mechanical sympathy by trying to pick the right stripe based on CAS failures.

Well, it doesn't quite appear that he's correct, given how Linux is implemented:


So nothing that occurs between a LOCK and an UNLOCK operation will be re-ordered out of the critical section, but things outside these LOCK and UNLOCK operations may be re-ordered in such a way that something before the LOCK statement appears to have occurred after it.

I think.

Let me be clear: obviously this is discussing a logical CPU, not the actual ISA.

My point is just that the Linux doc illustrates the minimum guarantees for how Linux approaches these issues, which, given ACQUIRE/RELEASE semantics and load/store ordering on x86, is probably going to be pretty accurate for successful lock execution behavior in the more general sense.

> obviously this is discussing a logical CPU not the actual ISA.

Being a tad pedantic here, but shouldn't this be part of the machine's micro-arch rather than the ISA?

Well, the LOCK discussion in the Linux docs I linked could be a bit vague, because there is also the x86 "lock" instruction prefix.

I was talking about how locking works in Linux, which could be implemented using lock+addl, or xchg, etc. The TSX HLE stuff (xacquire/xrelease) is precisely what that document doesn't cover.

So it's a bit convoluted. I was talking about micro-arch + linux implementation, and how it invalidates TFA's assumption of speculative execution. The documentation talks about micro-arch, but could be mistaken for discussing the ISA.

So you're right: the article is discussing the micro-arch (and, inherently, the ISA, due to TSX), and the document I posted is only talking about the micro-arch, but could be mistaken for ISA material (as per a private email I had with someone about the word LOCK).

For the confused, you can translate the title as:

"Speculatively Removing Locks in Varnish Cache"
