You really want full RTM for such high-contention situations.
I would love to be wrong about this, but I haven't a Haswell to test on.
Also, classy shout-out (beg-out?) to Intel's free samples division at the end there.
You're right, but it doesn't sound like these are high-contention operations since they currently run without any locks and rarely have problems.
Obviously if they plan on using one lock per statistic, this isn't a problem, but then they should just be using atomic updates instead.
Does this mean the Varnish authors have intentional data races all over the codebase? Can't the compiler do whatever it wants in these cases (because it's undefined behavior)? Can't this completely blow up?
Edit: I'm not an expert on this stuff so I'm probably wrong -- maybe someone can tell me what I'm missing.
Transactional memory is absolutely not free and certainly not a silver bullet.
Additionally, without proper compiler support it's going to be very hard to use correctly.
Can you compare, in your experience, the performance of HTM with lock-free structures? My experience is that lock-free structures are rather slow, while my understanding of TSX is that it's almost free.
> without proper compiler support it's going to be very hard to use it properly.
They're using Intel's HLE, which requires only very simple library support, already included in recent versions of glibc.
If you're clever about how you use them, it can be a big win.
HLE isn't really HTM, but it's a first step.
I've found they're an order of magnitude slower than your architecture's atomics, mostly because of the memory fencing required. Though the structures I've used (queue & stack) didn't require any memory allocation on the fly.
> HLE isn't really HTM, but it's a first step.
I know; my post should be read in the broader context of Intel's TSX (which includes their RTM instructions).
Use case for the queue is message-passing. Use case for the stack is as the allocator for the queue.
The thing is, textbook implementations generally lack all the platform-dependent details needed to make them efficient.
My understanding is weak, but this doesn't sound right. Presuming he means that "speculative execution across a write memory barrier is prohibited", is he correct?
> It creates a “memory barrier” CPU are usually free to shuffle around the stream of instructions that flow through them and it is done all the time as the CPU constantly optimizes the instruction stream to keep all its resources busy. A lock creates a barrier and the CPU is not allowed to move stuff past this barrier. Branch prediction across memory barriers is also something I don’t think is doable.
He's confusing instruction retiring with memory ordering. On modern fast CPUs they are only vaguely related.
> Cache coherency. The address line that stores the lock gets invalidated and the content in that cache gets thrown out.
That's not how a modern CPU with the MESI protocol operates.
As long as the cache line stays EXCLUSIVE to that core it does not get invalidated.
Transactional memory is useful when a coarse lock guards a wide set of data, most of which is not accessed concurrently. You have the hardware do the work of lock elision/splitting for you.
This saves you from having the unnecessary mutual exclusion of a regular coarse lock. Or alternatively it saves you from the unnatural acts necessary to get lock splitting or eliminate shared state.
For write-mostly counters where performance is an issue, you want thread-local or striped counters.
See http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr1... for an example of how that is done. This counter demonstrates lock splitting as well as an attempt at mechanical sympathy by trying to pick the right stripe based on CAS failures.
So nothing inside a LOCK/UNLOCK pair will be reordered out of it, but things outside the pair may be reordered in such a way that something before the LOCK statement appears to have occurred after it.
My point is just that the Linux doc spells out the minimum guarantees for how Linux approaches these issues, which, given x86's ACQUIRE/RELEASE semantics and its load/store ordering, is probably a pretty accurate picture of successful lock execution behavior in the more general sense.
Being a tad pedantic here, but shouldn't this be part of the machine's micro-arch rather than the ISA?
I was talking about how locking works in Linux, which could be implemented using lock+addl, xchg, etc. The TSX HLE instructions (xacquire/xrelease) are precisely what that document doesn't cover.
So it's a bit convoluted. I was talking about the micro-arch plus the Linux implementation, and how that invalidates TFA's assumption about speculative execution. The documentation talks about the micro-arch but could be mistaken for discussing the ISA.
So you're right: the article is discussing the micro-arch (and, because of TSX, inherently the ISA as well), while the document I posted only talks about the micro-arch but could be mistaken for ISA material (as per a private email I had with someone about the word LOCK).
"Removing Speculative Locks in Varnish Cache"