
Lthread is a multicore/multithread coroutine library written in C - nkurz
https://github.com/halayli/lthread#readme
======
loeg
So, last time this was posted I commented that I was interested, but couldn't
use this at work because of the GPL license.

Since then I implemented my own M:N userspace:kernel threading library (which
is what lthread boils down to) based on Russ Cox's BSD-licensed libtask. You
know what I found?

1) Many libc routines require a surprisingly large amount of stack. 64kiB was
the smallest power of 2 I could find where I wouldn't overflow the stack
somewhere in a libc call. (This isn't a problem with stack-copying, but it is
a problem if you have dedicated per-lthread stacks).

2) It was slow! Just dropping the M:N library and using blocking network I/O
with 200+ kernel pthreads was vastly faster.

3) Many pthread-functions get very, very confused if they are used from
different pthreads (same logical "lthread"). (This happens when a userspace
scheduler swap happens between matching operations.) For example, pthread
mutexes don't like to be locked in one pthread and unlocked in another pthread
(same "lthread"). I ran into other things that match this category, but can't
remember them off of the top of my head.
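
A minimal sketch of the failure mode in point 3, assuming a POSIX system. It uses two plain pthreads to stand in for an "lthread" that migrated between the lock and the unlock, and an error-checking mutex so the problem is reported as EPERM (with the default mutex type the same sequence is undefined behavior rather than a clean error). The function names are made up for illustration and are not from lthread or taskmn.

```c
#define _XOPEN_SOURCE 700   /* exposes PTHREAD_MUTEX_ERRORCHECK under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: runs on a second pthread and tries to unlock a
 * mutex that was locked by the pthread that created it. */
static void *unlock_from_other_pthread(void *arg)
{
    pthread_mutex_t *m = arg;
    int rc = pthread_mutex_unlock(m);
    /* Hand the error code back through the thread's exit value. */
    return (void *)(intptr_t)rc;
}

/* Locks an error-checking mutex on the calling pthread, attempts the
 * unlock from a different pthread, and returns what that unlock reported. */
int cross_pthread_unlock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    pthread_t other;
    void *res = NULL;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&m);            /* owner is this pthread */
    assert(rc == 0);
    rc = pthread_create(&other, NULL, unlock_from_other_pthread, &m);
    assert(rc == 0);
    rc = pthread_join(other, &res);
    assert(rc == 0);

    pthread_mutex_unlock(&m);               /* the real owner cleans up */
    pthread_mutex_destroy(&m);
    return (int)(intptr_t)res;
}
```

Build with `cc -pthread`. On a system that follows POSIX here, `cross_pthread_unlock_result()` should return EPERM, which is the kind of "confusion" an M:N scheduler has to design around.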

I ended up scrapping it and just using a pool of kernel pthreads. It works,
the code is pretty clear (blocking IO! whatever, man), and it's fast enough.

Edit: By "fast enough", I mean "can fill 2x 1Gbit pipes from disk" (without
any perf-focused work thus far).

~~~
loeg
Since I can't edit this post anymore, here's the relevant code:

<https://github.com/cemeyer/taskmn>

(Buyer beware: this is basically a snapshot of it at the point we decided to
drop it entirely; I poked it enough to get it to compile under GCC / linux,
but haven't verified functionality at all.)

More edits:

4) Re: Stack-copying; I found that without any attempt at optimizing memory
use, my "light" threads were consuming on the order of ~600 bytes of stack
(when de-scheduled); with a stack copying approach and a run-time stack of
128kiB, calling large-stack-using libc functions was fine.

5) Re: per-task dedicated stack memory; the default amount of memory allocated
per-pthread on our platform is 128kiB, so 64kiB isn't great savings. I
detected stack corruption at 32kiB and below by setting a canary value (at the
top of the allocated stack) in the scheduler before running a ready "lthread."
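
For anyone curious what the canary check in point 5 might look like, here is a rough sketch. It is not the taskmn code; the struct and function names are hypothetical, and I'm assuming a downward-growing stack, so the canary sits at the lowest address of the allocation (the end an overflow reaches first).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STACK_CANARY UINT64_C(0xDEADBEEFCAFEF00D)  /* arbitrary marker value */

/* Hypothetical per-task stack descriptor. */
struct task_stack {
    unsigned char *base;  /* lowest address of the allocation */
    size_t size;          /* total size in bytes */
};

/* Allocates a dedicated stack; returns 0 on success, -1 on failure. */
int task_stack_init(struct task_stack *s, size_t size)
{
    if (size < sizeof(uint64_t))
        return -1;
    s->base = malloc(size);
    if (s->base == NULL)
        return -1;
    s->size = size;
    return 0;
}

/* The scheduler would call this right before running a ready task. */
void task_stack_arm(struct task_stack *s)
{
    uint64_t canary = STACK_CANARY;
    assert(s->base != NULL);
    memcpy(s->base, &canary, sizeof canary);
}

/* The scheduler would call this when the task yields back.
 * Returns 1 if the canary is intact, 0 if something overwrote it. */
int task_stack_check(const struct task_stack *s)
{
    uint64_t canary;
    assert(s->base != NULL);
    memcpy(&canary, s->base, sizeof canary);
    return canary == STACK_CANARY;
}

void task_stack_free(struct task_stack *s)
{
    free(s->base);
    s->base = NULL;
    s->size = 0;
}
```

A check like this only catches overflows that actually write over the canary bytes; a guard page (`mprotect`) is the stricter alternative, at the cost of page-granular stacks.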

~~~
beagle3
Thanks for sharing your code and experience.

600 bytes per de-scheduled thread sounds excellent; it makes a million
mostly-dormant threads feasible in less than a gig of memory -- I think that's
the use case where lthread-style threads shine. You wouldn't be able to do that
with blocking kernel threads (at 64kiB of stack each you'd need about 64 gig of
RAM just for stack, and that assumes the kernel scales well enough to handle
that).
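
The back-of-the-envelope numbers can be checked with a few lines of C. This is only the arithmetic, assuming one million threads, ~600 bytes per parked lightweight thread, and a 64kiB dedicated stack per kernel thread (the figures from upthread); the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Total stack memory in bytes for `threads` threads that each hold
 * `bytes_each` bytes of stack. */
uint64_t total_stack_bytes(uint64_t threads, uint64_t bytes_each)
{
    /* Guard against 64-bit overflow in the multiplication. */
    assert(bytes_each == 0 || threads <= UINT64_MAX / bytes_each);
    return threads * bytes_each;
}
```

With these inputs, `total_stack_bytes(1000000, 600)` is 600,000,000 bytes (about 0.6 GB), while `total_stack_bytes(1000000, 64 * 1024)` is 65,536,000,000 bytes (about 65.5 GB, or roughly 61 GiB).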

I think overall lower efficiency per thread, even if it's 50% slower, is
acceptable in such a use case, given the memory requirements reduction.

~~~
loeg
No problem :-).

Re: "lower efficiency… is acceptable in such a use case": Maybe — it really
depends on your use case. In the case you describe, sure. In my particular
(specialized) case:

1) My clients are connected by 1Gb or 10Gb switched ethernet, and they are
typically on firewalled local networks. I don't worry about trickle-style DoS
attacks. (In other words: my threads are mostly-active, not mostly-dormant.)

2) My clients really only care about sustained streaming throughput. So, if
they can max out the underlying disk with 20-30 simultaneous connections, they
won't bother throwing 1000s of connections at my server. pthread-per-
connection works great for this circumstance (with a pre-allocated thread
pool).

Use the right tool for the job ;-).

------
pm215
The trouble with this sort of C coroutine library in my experience is
portability. Lthread seems to have taken the "write a bit of inline asm"
approach, which means it's x86/x86-64 only. If you avoid the assembly and try
to do things only with C library functions, you end up with something like
QEMU's coroutines, which have four different backends: makecontext/setcontext
based, win32 fibers, the nasty sigaltstack trick used by GNU Pth[1], and a
last-ditch fallback using a separate GThread per coroutine. Multiple backends
means more code, and the less-used backends are more liable to bitrot and
undetected bugs, which is the last thing you want in a key bit of
infrastructure.
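
To make the makecontext/setcontext option concrete, here is a minimal sketch of a coroutine that yields once and is then resumed. It assumes a platform that still ships `<ucontext.h>` (glibc does; these functions were removed from POSIX.1-2008). It is not QEMU's or lthread's code, and all names in it are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

#define CO_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

static ucontext_t main_ctx, co_ctx;
static unsigned char co_stack[CO_STACK_SIZE];
static int trace[4];                /* records the order of execution */
static int trace_len;

static void record(int step)
{
    assert(trace_len < 4);
    trace[trace_len++] = step;
}

/* Body of the coroutine: runs, yields to the caller, then finishes. */
static void coroutine_body(void)
{
    record(1);
    swapcontext(&co_ctx, &main_ctx);    /* yield back to run_demo() */
    record(3);
    /* Returning here continues at uc_link, i.e. main_ctx. */
}

/* Runs the coroutine to completion, interleaving with the caller.
 * Copies the recorded step order into `order` and returns the number
 * of steps recorded, or -1 if the context could not be set up. */
int run_demo(int order[4])
{
    int i;

    trace_len = 0;
    if (getcontext(&co_ctx) != 0)
        return -1;
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof co_stack;
    co_ctx.uc_link = &main_ctx;         /* where to go when the body returns */
    makecontext(&co_ctx, coroutine_body, 0);

    swapcontext(&main_ctx, &co_ctx);    /* run the coroutine until it yields */
    record(2);
    swapcontext(&main_ctx, &co_ctx);    /* resume it until it returns */
    record(4);

    for (i = 0; i < trace_len; i++)
        order[i] = trace[i];
    return trace_len;
}
```

Calling `run_demo(order)` fills `order` with the sequence 1, 2, 3, 4: the coroutine and its caller take turns on two separate stacks without any kernel thread being created.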

Also, some libc implementations don't take kindly to programs messing with the
stack pointer behind their backs. Early Linux NPTL implementations put thread-
local-storage just above the stack and used "round ESP up to 2MB boundary" to
access it, which meant you had to switch back to the libc-created thread stack
before calling just about any libc function. I hear at least one of the BSDs
still does something similar.

All things considered I'd really rather just use threads...

[1] <http://www.gnu.org/software/pth/rse-pmt.ps>

------
ComputerGuru
Previously posted on HN 50 days ago. There are some interesting critiques in
the comments, especially regarding the GPL license; the consensus is that it is
ridiculous for a library like this and probably a good reason it's not as
popular as it could be:

<http://news.ycombinator.com/item?id=3661038>

~~~
nkurz
Thanks, I hadn't seen the earlier discussion. I'm more interested in the
approach than in the specific code, so GPL doesn't concern me terribly. It
seems to be a significant block for others, though.

    
    
-----

halayli 50 days ago | <http://news.ycombinator.com/item?id=3661584>

I am not married to the license. I noticed several complaints, so I am going
to reconsider it. :)

    
    
-----
    

Halayli: Any further thoughts on licensing?

~~~
halayli
and done!

~~~
loeg
Cool! Thanks.

Edit: looks like you missed COPYING; it's still GPLv2 (whereas LICENSE has
BSD-looking text).

~~~
halayli
fixed. Thanks for catching it!

------
wsc981
I wonder how this library compares against Apple's Grand Central Dispatch.

I guess they don't contain 1:1 functionality, but both libraries seem to be
meant to make threading easier. I guess GCD is a more complete solution due to
the addition of queues.

~~~
loeg
They're different beasts entirely. One key distinction is that GCD requires
kernel support, while lthread and friends simply sit on top of any pthread-
compatible system. As I understand it, GCD is based on an old systems idea
(1991):

<https://en.wikipedia.org/wiki/Scheduler_activations>

It also has some auxiliary stuff (compiler-level support for C closures (Clang
blocks), etc…) to "make threading easier," like you say.

------
Hoff
Any thoughts on this HP research paper

<http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html>

"Threads Cannot be Implemented as a Library"?

That paper states that library-based threading implementations that don't also
involve the compiler can't guarantee correctness of the resulting threading.

~~~
qznc
Not relevant here.

The paper talks about correctness for parallel kernel-level threads. In
contrast, lthreads are about concurrency. These user-level threads are about
software architecture, because they are more elegant than an event-loop.
Parallelism (exploiting multicore hardware for a speedup) is a secondary goal,
but it is much easier to parallelize a user-level thread program compared to
an event loop program.

------
mninja
When I saw this repost, I thought the license had changed. Too bad it hasn't
yet.

~~~
ewillbefull
What's wrong with the BSD license?

~~~
halayli
I changed the license to BSD one hour ago after I was reminded by this post
:). The comment was made before the change.

------
alpb
This was previously submitted to HN several times:
<http://news.ycombinator.com/item?id=3331474>

------
otterley
See also State Threads:

<http://state-threads.sourceforge.net/>

~~~
loeg
According to SF, it was last updated in 2009 (is it actively developed?). I
don't see much in the way of examples on the SF page, which is a bit off-
putting. I guess I'm curious about the model it uses and how it performs, but
my best guess is that it's not radically different from lthread or any other
M:N userspace threading library. Do you know more?

------
sown
Neat.

I wonder what happens if you use swapcontext/getcontext?

