
Lock-free programming for the masses - antouank
http://kcsrk.info/ocaml/multicore/2016/06/11/lock-free/
======
RossBencina
The library is called "reagents", and it's for OCaml. From what I gather from
skimming the post, the core operations are "swap" (a synchronous rendezvous
exchange, similar to CSP channels) and "upd" (similar to CAS), together with
composition operators (sequential, conjunction, disjunction). The
implementation aggregates composed operations into a transaction. It looks as
though the operators would let you implement something along the lines of
Go's select statement, for example.
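
A hedged sketch of what that select-like composition might look like in
OCaml. The combinator spellings ((<+>) for disjunction, run to execute) and
the channel endpoints a and b are assumptions for illustration, not the
library's confirmed API:

    (* Choice: receive from whichever of two channels can rendezvous. *)
    let recv_either c1 c2 =
      swap c1 <+> swap c2

    (* Running the composed reagent blocks until exactly one branch's
       collected CASes commit atomically (the two-phase protocol quoted
       below). *)
    let v = run (recv_either a b) ()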

From the post:

> The key idea behind the implementation is that the reagent transaction
> executes in two phases. The first phase involves collecting all the compare-
> and-swap (CAS) operations necessary for the transaction, and the second
> phase is invoking a k-CAS operation (emulated in software)

> Reagents are less expressive than STM, which provides serializability. But
> in return, Reagents provide stronger progress guarantee (lock-freedom) over
> STM (obstruction-freedom)

Interesting.

~~~
vardump
> second phase is invoking a k-CAS operation (emulated in software)

Sounds like a normal CAS performed on a pointer.

~~~
naasking
It synchronizes multiple CAS operations and ensures they all appear to
execute atomically when the k-CAS completes.

------
novaleaf
As an 'expert' multithreading programmer, I can confidently say that most
developers should stay away from fine-grained multithreading (generally
speaking, of course). You will see much greater productivity gains and fewer
debugging headaches (race conditions!) if you architect your program to be
coarse-grained (modules running asynchronously, but each module not itself
multithreaded) or, even better, make the entire app single-threaded and just
fork it to scale horizontally.
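
A minimal sketch of that fork-to-scale shape (requires the unix library;
handle_requests is a hypothetical single-threaded event loop):

    let () =
      let nworkers = 4 in
      for _ = 1 to nworkers do
        begin match Unix.fork () with
        | 0 -> handle_requests (); exit 0  (* child: single-threaded copy *)
        | _pid -> ()                       (* parent: keep forking *)
        end
      done;
      (* Processes share nothing, so there is nothing to lock. *)
      for _ = 1 to nworkers do ignore (Unix.wait ()) done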

~~~
MaulingMonkey
Even if you have a legitimate need to get fine-grained, for me the most
common use of lock-free programming is to implement the underlying
infrastructure for task-based systems, where the individual tasks desperately
try to stay away from multithreading hazards - e.g. by keeping shared data
immutable (no need for 'lock-free programming') and keeping mutated data
task-local (not shared between threads, and thus again no need for 'lock-free
programming').
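
A sketch of that discipline with OCaml 5 domains (my example, not the
commenter's): each task mutates only task-local state and hands back an
immutable result:

    (* acc never escapes the task, so it needs no synchronization. *)
    let sum_chunk chunk =
      let acc = ref 0 in
      Array.iter (fun x -> acc := !acc + x) chunk;
      !acc  (* only this immutable int crosses threads *)

    (* Shared input is read-only; results are joined, never shared. *)
    let parallel_sum chunks =
      chunks
      |> Array.map (fun c -> Domain.spawn (fun () -> sum_chunk c))
      |> Array.fold_left (fun total d -> total + Domain.join d) 0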

Forking / multi-process architectures are of course the logical extension of
this - where you're making _the entire program state_ task local, just to be
sure. Rule 37: There is no 'overkill.' There is only 'open fire' and 'I need
to reload.'

------
programLyrique
This library is built upon multicore OCaml. We've been talking about
multicore OCaml for a long time, and apparently the latest attempt at it is
under active development. When will multicore OCaml land in the stable
version of the compiler?

~~~
kcsrk
(I work on multicore OCaml) Multicore OCaml has been steadily under
development, with the latest updates here:
[https://ocaml.io/w/Multicore](https://ocaml.io/w/Multicore). While I can't
promise when multicore lands in trunk, a stable-alpha version installable
through opam should be ready in the next few weeks.

------
Animats
It's time for better CPU support for atomic operations. "Backoff, retry,
blocking and signalling" is a hack. What we really need is machine level
support for short memory transactions. This would be an extension of fence
hardware.

What's needed is a CPU instruction to fence a critical section and lock up two
cache lines while you manipulate them from one CPU, with other CPUs locked
out. Transactions should have a hardware instruction cycle limit, so you can't
lock up the other CPUs for very long. Then you could do swaps, sets, appends
to lists, and such with straightforward code.

This locking would be local to the two cache lines - unless two CPUs are
trying to hit the same two cache lines, there's no delay. Why two cache lines?
That lets you do moves and swaps, and maybe some list and queue manipulation.
Those can all be completed in a few instruction cycles.

~~~
pizlonator
How is this any better than using a high-speed adaptive lock?

I would claim that it's strictly worse, since it limits you to only two cache
lines, doesn't have adequate interaction with the OS scheduler, and doesn't
scale gracefully for longer critical sections.
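
For reference, the fast path being defended here, sketched with OCaml's
Atomic; park_and_retry is a hypothetical stand-in for blocking in the OS
scheduler:

    (* Uncontended acquire is one CAS, likely on a line you already own. *)
    let try_lock held = Atomic.compare_and_set held false true

    let lock held =
      let rec spin budget =
        if try_lock held then ()
        else if budget > 0 then (Domain.cpu_relax (); spin (budget - 1))
        else park_and_retry held  (* hypothetical: sleep until woken *)
      in
      spin 64  (* adaptive: spin briefly, then stop burning CPU *)

    let unlock held = Atomic.set held false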

~~~
davidtgoldblatt
I assume interrupts would be disabled or deferred during this section, so you
get stronger guarantees on worst-case behavior; hence the bound on the number
of cycles a thread is allowed to execute for.

On one of the less common unixes (Sun's maybe? I don't totally remember), you
could set a bit in some thread-specific part of memory to indicate that you
were uninterruptible. If the OS saw that bit was set, it would set a "you need
to yield" bit in the same part of memory and then allow the thread to continue
(and penalize threads that abused the feature).

As a practical matter, I suspect that the scheme proposed wouldn't have
advantages over a careful use of HTM.

I also suspect that the "lock two cachelines" thing is harder to implement
than it sounds. The way Intel currently implements atomic operations that
span cachelines involves waiting for an interconnect quiescence period
(taking O(microseconds)). What's proposed here is strictly more general.

~~~
pizlonator
Disabling interrupts prevents you from using interrupt-unsafe logic inside
the critical section, which is impractical in most cases. It would mean, for
example, that you wouldn't be able to touch memory that had been swapped out.
It also doesn't actually improve performance. When people do this, it's
because they want to edit process-local data structures. Unless you're in the
kernel, you probably don't want this.

As a practical matter, HTM is dead. It's 2x slower than locking in the common
cases: uncontended critical section or contended critical section that has a
race. It also prevents you from doing effects outside of memory (it's just
guaranteed to revert to locks in that case, so you pay all of the overhead of
HTM _and_ all of the overhead of locks).

Interesting about the difficulty of locking two cache lines. To me the
important thing to keep in mind when talking about concurrency alternatives
to the status quo is: nobody has proved that contended-but-_not_-racy
critical sections are common. According to my data, they are unicorns. This
means that, from a performance standpoint, any proposed concurrency protocol
that is different from locking must justify how it plans to beat locks on the
two things they are super good at: uncontended critical sections (with locks
you pay one or two CASes on cache lines you probably already own) and
contended-and-racy critical sections (with locks you pay for some CASes, but
the lock will make threads contend as quietly as possible to allow the lock
owner to proceed quickly). If a proposal is worse than locks on the two most
common kinds of critical sections, then that's silly. AFAICT, all of these
transactions-or-better-CAS approaches would only help in the case of
contended-but-not-racy critical sections, which are too rare to matter. The
two-cache-line thing definitely sounds like it will be more expensive than
locking in the uncontended case, and I don't see how it will help in the
contended-and-racy case.

EDIT: I said "contended-but-racy" when I meant "contended-but-_not_-racy" in
one place, above. Fixed.

~~~
davidtgoldblatt
> Disabling interrupts prevents you from using interrupt-unsafe logic inside
> the critical section, which is impractical in most cases. It would mean, for
> example, that you wouldn't be able to touch memory that had been swapped out.

Presumably the "lock these two cachelines" instruction would trap when one of
the cachelines you're trying to lock isn't present, rather than on a later
operation against a cacheline you've already locked.

> It also doesn't actually improve performance. When people do this, it's
> because they want to edit process-local data structures. Unless you're in
> kernel, you probably don't want this.

I don't totally follow; off the top of my head, lock-free reference counting
and doubly-linked list removal both get significantly simpler. There are lots
of algorithms that get much less complicated and dodge more atomic ops if you
have access to CAS-2, and this is strictly more powerful.
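
A sketch of the doubly-linked-list case, assuming a hypothetical dcas
primitive (no mainstream ISA provides one; the two-cache-line proposal above
would effectively supply it) and a circular list with sentinels so that
neighbours always exist:

    type 'a node = {
      value : 'a;
      prev : 'a node Atomic.t;
      next : 'a node Atomic.t;
    }

    (* Hypothetical primitive: atomically CAS two independent words.
       val dcas : 'a Atomic.t -> 'a -> 'a ->
                  'b Atomic.t -> 'b -> 'b -> bool *)

    (* Unlink in one atomic step: both neighbours' links change together
       or not at all. With single-word CAS this instead needs marked
       pointers and multi-step helping (Harris-style lists). *)
    let rec remove node =
      let p = Atomic.get node.prev and n = Atomic.get node.next in
      if dcas p.next node n  n.prev node p then () else remove node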

To be clear, I don't think this instruction is a good idea; I just don't think
it's _obviously_ bad.

> As a practical matter, HTM is dead. It's 2x slower than locking in the
> common cases: uncontended critical section or contended critical section
> that has a race. It also prevents you from doing effects outside of memory
> (it's just guaranteed to revert to locks in that case, so you pay all of the
> overhead of HTM and all of the overhead of locks).

HTM is already showing wins in some domains, and it's only going to get faster
relative to other concurrency primitives.

I'm not totally sure what you mean by "contended and racy" and "contended but
not racy"; by "racy" do you mean cacheline ping-ponging? Certainly that will
never be cheap. I think most of the desire for lock-freedom comes less from
fast-path cycle reduction than from protection against the scheduler or some
very slow process. There are also plenty of situations in which data
structures are touched mostly by one thread, but periodically need contention
management. The overhead of locking in the single-threaded case can be
substantial even if the lock acquisition always succeeds; on my machine a
non-atomic CAS is about 6x faster than an atomic one even if the CAS always
succeeds (and this was with no stores in between the operations; a more
realistic example would have a deeper write-buffer).
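
A rough sketch of how one might measure that gap (my construction, not the
commenter's benchmark; an optimizing compiler may simplify the plain-ref
loop, so treat any numbers as illustrative):

    let time name f =
      let t0 = Unix.gettimeofday () in
      f ();
      Printf.printf "%s: %.3fs\n" name (Unix.gettimeofday () -. t0)

    let () =
      let n = 10_000_000 in
      (* An atomic CAS that always succeeds. *)
      let a = Atomic.make 0 in
      time "atomic CAS" (fun () ->
        for i = 0 to n - 1 do
          ignore (Atomic.compare_and_set a i (i + 1))
        done);
      (* The same read-compare-write on a plain ref: no hardware fence. *)
      let r = ref 0 in
      time "plain ref" (fun () ->
        for i = 0 to n - 1 do
          if !r = i then r := i + 1
        done)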

~~~
pizlonator
> Presumably the semantics of the "lock these two cachelines" instruction trap
> when one of them is paged out, rather than deferring the trap until you
> write to one of the cachelines you've locked.

Right, I didn't think of that.

> I don't totally follow; off the top of my head, lock-free reference counting
> and doubly-linked list removal both get significantly simpler. There's lots
> of algorithms that get much less complicated and dodge more atomic ops if
> you have access to CAS-2, and this is strictly more powerful.

I was talking about disabling interrupts and not lock-freedom in general or
CAS-2. I agree that CAS-2 would make those things simpler _if it was also fast
enough_. Sorry about the confusing context switch.

> HTM is already showing wins in some domains, and it's only going to get
> faster relative to other concurrency primitives.

Can you define what "some domains" is? I think it's important to be specific.

> I'm not totally sure what you mean by "contended and racy" and "contended
> but not racy"; by "racy" do you mean cacheline ping-ponging?

"Contended-and-racy" means that if you took the lock away, you'd have a data
race. It means that if you used a transaction then the race detection that
aborts a transaction would conclude that there had been a race and abort.
Contended-and-racy is exactly the case where TM is not a speed-up, pretty much
by definition.

If contended-but-not-racy critical sections were common, then it would make
sense to pay some base cost for executing in a TM critical section since it
would increase concurrency. But most contended critical sections are also
racy. TM won't give you a speed-up in a racy section, so you end up paying all
of the cost without getting any of the benefit.

> I think most of the desire for lock-freedom comes less from fast-path cycle
> reduction than from protection against the scheduler or some very slow
> process.

Yeah, and also deadlock avoidance. And that lock-free algorithms are often
faster in the uncontended and contended-and-racy cases. I'm a big fan of lock-
free algorithms based on conventional word CAS.

But that's different than TM or 2-cache-line-CAS. Lock-free algorithms based
on conventional word CAS tend to be faster than locks in the uncontended case.
TM, and probably 2-CAS, is slower than locks. It's easy to justify handling
uncommon scheduling pathologies, or using exotic approaches for avoiding
deadlock, if it also gives you a speed-up. Not so much if it's slow like TM.

> The overhead of locking in the single-threaded case can be substantial even
> if the lock acquisition always succeeds; on my machine a non-atomic CAS is
> about 6x faster than an atomic one even if the CAS always succeeds (and this
> was with no stores in between the operations; a more realistic example would
> have a deeper write-buffer).

It's true that the uncontended cost of locks is dominated by the uncontended
cost of CAS. But this cost is much smaller than the uncontended cost of HTM,
and I suspect that the uncontended cost of CAS is also lower than the
uncontended cost of 2-cache-line-CAS. If the uncontended cost of
2-cache-line-CAS is more than 2x the uncontended cost of CAS, then a lock is
probably better.

------
naasking
See the LtU post [1] for the paper backing this project.

[1] [http://lambda-the-ultimate.org/node/5237](http://lambda-the-ultimate.org/node/5237)

------
dwarman
I always thought "lock-free" meant no hidden pauses in execution - a write is
a write and a read is a read. The primitives described here will block. How
is that different in effect from a lock, albeit one implemented down in the
library rather than explicitly by the user? My lock-free mechanisms never
blocked, and they're used in thousands of embedded devices without failures.

------
asb
There was a little bit of discussion when this was previously on HN:
[https://news.ycombinator.com/item?id=11893911](https://news.ycombinator.com/item?id=11893911)
(linking for completeness - I'm really glad this has seen more attention after
resubmission!)

------
deadgrey19
This sounded so promising until the OCaml part. The problem is, the masses
don't program in OCaml. They program in C#, Java, or C++. So this is really
lock-free programming for OCaml, which is a lot less relevant.

~~~
wolfgke
In the original paper (see
[https://news.ycombinator.com/item?id=11908821](https://news.ycombinator.com/item?id=11908821))
it is implemented in Scala.

~~~
deadgrey19
How is that different? Neither Scala nor OCaml ranks in the top 10
programming languages according to several metrics/organisations [1-5].
Keeping in mind that programming-language use is a heavily skewed
distribution, OCaml and Scala have almost zero usage relative to the big
ones. So, again I must ask: how is this relevant to the masses?

[1] [http://www.tiobe.com/tiobe_index](http://www.tiobe.com/tiobe_index)

[2] [http://pypl.github.io/PYPL.html](http://pypl.github.io/PYPL.html)

[3] [http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages](http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages)

[4] [http://redmonk.com/sogrady/2016/02/19/language-rankings-1-16/](http://redmonk.com/sogrady/2016/02/19/language-rankings-1-16/)

[5] [http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016/](http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016/)

