
Show HN: uThreads – Concurrent User Threads in C and C++ - saman_b
http://samanbarghi.com/uThreads/v0.3.0/
======
krat0sprakhar
This looks super interesting! I've been working on a raytracer in C++ and was
recently looking for a threading library I could use to parallelize the
rendering. Definitely going to try this out over the coming weekend.

Unsolicited suggestion - while benchmarks and the motivation are important for
a threading library, a code snippet of a simple parallel program on the home
page would be something that I'd love to see.

Great job, though!

~~~
zokier
Isn't raytracing almost completely CPU-bound? Seems like an odd use case for
green threads, which AFAIK are more commonly used for IO-bound tasks.

~~~
krat0sprakhar
You're absolutely right - raytracing is CPU-bound. However, one can still
parallelize per-pixel rendering computations on separate cores. Here's an
example of a path-tracer written in Go, spawning a new go-routine for this
very use-case
[https://github.com/fogleman/pt/blob/master/pt/renderer.go#L6...](https://github.com/fogleman/pt/blob/master/pt/renderer.go#L64-L65)

~~~
comex
If you just want to start num-CPUs threads like that code does, there is no
advantage to green threads over regular ones (std::thread in C++). If you want
to start a thread per pixel or something, green threads could work, but it's
probably more efficient to keep track of the state manually.
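
For the num-CPUs case it's a few lines of standard C++. A minimal sketch,
assuming a hypothetical renderRow() holding your per-pixel work:

    #include <thread>
    #include <vector>

    // Split the image into horizontal bands, one per hardware thread.
    void renderImage(int height) {
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 1;                  // may report 0 if unknown
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i) {
            int begin = static_cast<int>(i * height / n);
            int end   = static_cast<int>((i + 1) * height / n);
            workers.emplace_back([begin, end] {
                for (int row = begin; row < end; ++row) {
                    // renderRow(row);      // hypothetical per-row kernel
                }
            });
        }
        for (auto& t : workers) t.join();   // wait for every band
    }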

------
jcbeard
How would something like this differ from something like Sandia National Labs'
Qthreads
([http://www.cs.sandia.gov/qthreads/](http://www.cs.sandia.gov/qthreads/))?
It seems to be a tried-and-true solution in C that also works with C++11
(I committed a test case for C++11 myself)... It is also an optional
underpinning for some relatively big-name frameworks like Kokkos, Chapel,
RaftLib, etc.

~~~
saman_b
Thanks for mentioning this; it is indeed very related and very interesting. I
am not sure why neither I nor the people around me were aware of Qthreads; it
has very good support for various architectures and provides many interesting
features. It has many similarities with what I have in mind for a concurrency
library, and even some of its research goals seem to be very close to mine.
Especially the notion of affinity and locality is what I am focusing on in
uThreads. I am going through the papers and the source code at the moment to
see what the similarities and differences are.

uThreads is still a work in progress, and I have specific plans for its future
that might differ from what Qthreads is trying to accomplish. For now my focus
is more on providing auto-tuning of Clusters based on the workload. I will
also try to explore the pros and cons of uThread migration based on Cluster
placement (NUMA and cache locality); from the 2008 paper it seems that is what
you are trying to study as well. I am open to collaboration if there is an
ongoing study around this topic.

~~~
jcbeard
Ooh, OK. You might also check out openshmem
([http://openshmem.org/site/](http://openshmem.org/site/)) and openucx
([http://www.openucx.org](http://www.openucx.org)). I'd been planning on
integrating both into RaftLib, but there just haven't been enough cycles to
get it done yet. Combined, they would make it much easier to maintain a
relatively portable yet performant back end. My thesis research was all about
locality, memory placement, and throughput for big-data systems. Current work
is similar, but I'm not neck deep in the hardware dev world. There are ongoing
research efforts on my part outside of work; most center around the raftlib.io
platform.

Before I forget, you might also want to check out graph-partitioning
frameworks like METIS and Scotch; both are used in some MPI frameworks for
more optimal partitioning. To get topology data you might want to look at the
hwloc framework: it's cross-platform and provides things like NUMA/cache/PCIe
topology as input for optimization. I haven't had a chance to integrate this
hook into RaftLib, but it's just a few lines away once I find the time.

If you're wondering... I started out writing my own fiber library for RaftLib.
I had ports for both IBM Power and Intel, but it gets a bit tiring maintaining
and optimizing for every new architecture. Qthreads and the like have been
used in HPC circles for quite a while, so it made sense: there was no way I
was beating them on dev time, so I might as well join them.

Based on your auto-tuning discussion... RaftLib aims to do something similar,
but for big-data stream-processing workloads. Here's my 2016 IJHPCA paper:
[http://hpc.sagepub.com/content/early/2016/10/18/1094342016672542](http://hpc.sagepub.com/content/early/2016/10/18/1094342016672542)

It looks like it's behind a paywall, so if you don't have access I'll update
my website with the "author archive" copy sometime today; it will be at
([http://jonathanbeard.io/media](http://jonathanbeard.io/media)) once I update
it. Bottom line: if there's overlapping interest, I'm definitely open to
collaboration :).

~~~
saman_b
Thanks, that's a lot of information to absorb; let me go through all of this
and get back to you. I'll shoot you an email once I've processed it all :)

------
jtsylve
Is there any reason you chose GPL3 or would you consider a less restrictive
license like Apache or BSD?

~~~
jimktrains2
Is there a reason you'd choose Apache or BSD instead of a freedom preserving
license like GPL3?

~~~
aseipp
They probably want to use it, make money off it, and give nothing back to
anyone. The same as most people who complain about it, IMO. Or, just as
likely, they'll integrate it into their own BSD software, which will then be
used by someone else to make money, and give nothing back, which might as well
be the same thing. (Countdown until 1 person shows up and retorts "Well, _my_
company supports the OSS we use..." as if it's meaningful at all, vs the
massive trend of corporations ripping off FOSS and returning nothing)

I don't GPL-license a lot of my software (I use BSD/MIT/Apache most of the
time, and my company revolves around and writes BSD-licensed code), but IMO,
having seen this tired argument over and over again, I'm pretty sure 90% of
all GPL complaints come down to this, even if people don't come out and say
it: "I can't make money off of your free work as easily, and that isn't fair
to me. Please reconsider." Given this is Hacker News, where half of everyone
is in a rat race to make money, I speculate this is a strong part of it.

People just like to wrap it in words like "restrictive" and "viral" to make
themselves sound more palatable and reasonable.

And of course, there are also many reasonable alternatives to this library,
many of which are BSD or permissively licensed, which these people could use
instead -- but that won't stop them from complaining that the GPL is unfair or
"bad", of course, even though they could pick from a dozen alternatives...

~~~
striking
It's harder for me to agree with a library being licensed under GPL than if it
were an entire application being licensed under GPL.

If someone's licensing an entire application that's usable in its own right,
like an SQL server or emulator or game or kernel, then it totally makes sense
that anything you build around it should have its code be accessible.

But if it's some smaller part of your entire codebase, I feel it's a little
less reasonable for people who aren't working on something that's already *GPL
licensed. Especially when you have middlewares or other proprietary pieces of
code linked with yours. That, plainly and simply, will prevent you from using
the library. (Except if it's the LGPL, which I've never had problems with. I
personally like it the most, and I'd use it if I really cared about people
upstreaming their changes.)

I'm not going to say "this library is bad because it's GPL'd" but it does mean
I would avoid baking it into a programming language runtime, for example.

And it really bothers me that you have such a low opinion of people who
comment here, and of people who disagree with you. Maybe it's just an
exaggeration for argument's sake, but I feel like there are plenty of
reasonable arguments against use of the GPL.

------
michaelsbradley
For what kinds of applications would one favor this over, say, the coroutines
approach of libdill?

[http://libdill.org/tutorial.html](http://libdill.org/tutorial.html)

See, in particular, Step 6 of the tutorial.

~~~
saman_b
In libdill approach, you are probably limited to only multiplex connections
over multiple kernel threads. And when a connection is accepted over a kernel
thread it has to perform all further instructions over that kernel thread. So
it gives you a bit of control over which cores to be utilized but after that
you do not have control over what part of the code should be executed on each
core.

Using uThreads you can decide what part of the code should be executed over
each kernel thread and, if taskset is used, which core to execute your code.
You can do this by creating Clusters and using migration or uThread creating
in runtime. Thus you can decide which thread is used to multiplex connections
and which one is used to run CPU bound code in addition to having a thread
pool for example to do the disk IO asynchronously. Ultimately, one can create
a SEDA[1] like architecture using uTrheads. Also you can always use uThread as
a single threaded application.
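
A rough sketch of what I mean (the names Cluster and uThread::migrate follow
the concepts above; treat the exact signatures as illustrative, not as the
documented API):

    // Hypothetical sketch: one Cluster per stage, migration between them.
    Cluster ioCluster;        // kThreads that multiplex connections
    Cluster diskCluster;      // kThread pool for asynchronous disk IO
    Cluster computeCluster;   // kThreads for CPU-bound work

    void handleRequest(Connection& conn) {
        Request req = readRequest(conn);     // stage 1: on ioCluster
        uThread::migrate(&computeCluster);   // hop to the compute kThreads
        Response resp = process(req);        // stage 2: CPU-bound work
        uThread::migrate(&ioCluster);        // hop back for the reply
        writeResponse(conn, resp);           // stage 3: on ioCluster again
    }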

------------------------------------------

[1] [https://en.wikipedia.org/wiki/Staged_event-driven_architecture](https://en.wikipedia.org/wiki/Staged_event-driven_architecture)

~~~
michaelsbradley
First time I've heard of SEDA, though I've been aware of the concerns/concepts
it addresses, in various forms, for some time.

Any thoughts on Welsh's _Retrospective on SEDA_?

[http://matt-welsh.blogspot.com/2010/07/retrospective-on-seda.html](http://matt-welsh.blogspot.com/2010/07/retrospective-on-seda.html)

~~~
saman_b
I have seen this before. Those are very good points, and I am trying to move
this library towards supporting a SEDA-type architecture with dynamic control
and auto-tuning of runtime parameters.

------
anonymousDan
Interesting! I've recently been looking for a user-level threading package in
C/C++. I ended up settling on lthread:
[http://lthread.readthedocs.io/en/latest/intro.html](http://lthread.readthedocs.io/en/latest/intro.html)

Does anyone know how it compares?

~~~
saman_b
Hi, I developed uThreads. I looked at lthread quickly, and it seems lthread
only maps multiple coroutines onto a single pthread (N:1). It does allow
running multiple pthreads, but each pthread can only run its own local
lthreads (M instances of an N:1 mapping). In uThreads, by contrast, uThreads
can be multiplexed over multiple pthreads (an M:N mapping). Also, the lthread
scheduler is based on an epoll/kqueue instance per pthread, whereas uThreads
uses run queues to manage uThreads, which has less overhead. A per-pthread
epoll/kqueue can mean better scalability for a large number of threads
compared with uThreads, which relies on a single poller thread. But since the
poller thread and synchronization have very low overhead in uThreads,
scalability is not an issue (experiments with up to 16 threads show that
uThreads scales very well). lthread does provide compute boundaries and async
IO to move lthreads to other pthreads, but this process seems to be very
expensive. uThreads does not provide these features, but it gives the
developer more flexibility and control through migration: developers can use
migration at any point to move a uThread to another set of kThreads to execute
tasks asynchronously (by defining Clusters of kThreads, e.g., an IO Cluster or
a Compute Cluster).

~~~
scott_s
I recommend trying to get access to larger machines with more hardware
parallelism. I have seen techniques that scale just fine to 16 threads, but
hit serious limitations when you get to over 100 threads.

~~~
saman_b
You are right, I have access to machines with higher number of cores, but they
have multiple sockets and at some point I need to address the cross NUMA cost
which adds a whole new level of complexity and design decisions.

For sure at some point the poller thread will be saturated and the program
will not scale past a certain number of threads. I used to have a poller
thread per cluster for better scalability, but that would add overhead for
migrations between clusters, thus I had to remove it for now until I can
somehow find a low overhead solution. uThreads is a work in progress and all
these need to be carefully considered in the future :) Thanks for your
feedback

~~~
scott_s
Sometimes, the techniques you use to scale to 100s of threads solve some NUMA
issues by virtue of the fact that, in order to scale that high, you need to
avoid touching as much non-local data as possible. I think it's better to just
deal with the pain now and start running your experiments on as large a
machine as you can get access to. You can still put off explicitly designing
for NUMA, but you want to avoid spending too much time and effort designing
for the lower end of the scalability spectrum.

------
purple-dragon
I've had success with Intel's Threading Building Blocks. Is there a reason I
might prefer something like this instead?

~~~
saman_b
It seems like a very mature library; I can't answer your question before going
through their documentation and code. I also find the question a bit abstract:
since I am not aware of the details of the problem you are trying to solve, I
cannot say why any one library or tool would be better than another. In your
case, TBB might fit your requirements, and it might be hard to give a reason
to switch to another library.

------
medi11
How does it compare to boost.fiber
([http://www.boost.org/doc/libs/1_62_0/libs/fiber/doc/html/ind...](http://www.boost.org/doc/libs/1_62_0/libs/fiber/doc/html/index.html))?

------
valarauca1
Impressive.

8 KiB stacks are a bit on the small side for production usage, though. Go gets
away with this because its stacks act more like vectors than flat arrays.

Why did you decide to roll your own stack-swapping code instead of, say, using
`boost::context`?

~~~
saman_b
Right, however segmented stacks are have high overhead and stack copying is
not very easy in C/C++. Thus, for now uThreads only support fixed size stacks,
I know it makes it harder to be used in production, and in the future I might
provide optional segmented stacks. As for why not using `boost::context`, I am
implementing uThreads as part of my research in uwaterloo, I wanted to have
full control over the code and be able to optimize for performance as much as
I can. Thus, tried to avoid relying on any third party code when I started :)
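
For anyone curious about the mechanics being re-implemented, here is a minimal
sketch using POSIX ucontext, which does the same fixed-size-stack switch
portably (uThreads rolls its own context switch instead, partly because
swapcontext saves and restores the signal mask with a system call):

    #include <ucontext.h>
    #include <cstdio>
    #include <cstdlib>

    static ucontext_t mainCtx, fiberCtx;

    static void fiberBody() {
        puts("running on the fiber's own fixed-size stack");
        // falls through to uc_link (mainCtx) on return
    }

    int main() {
        const size_t kStackSize = 8 * 1024;   // the 8 KiB mentioned above
        void* stack = malloc(kStackSize);

        getcontext(&fiberCtx);
        fiberCtx.uc_stack.ss_sp = stack;
        fiberCtx.uc_stack.ss_size = kStackSize;
        fiberCtx.uc_link = &mainCtx;          // resume here when fiberBody returns
        makecontext(&fiberCtx, fiberBody, 0);

        swapcontext(&mainCtx, &fiberCtx);     // switch onto the fiber stack
        puts("back on the original stack");
        free(stack);
    }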

------
naasking
Why not just use StateThreads? A comparison against other options would be
illuminating.

~~~
saman_b
Good idea, I'll probably write a blog post on this later. If you take a look
at [1], I explain the difference between N:1 and M:N mappings. StateThreads
uses an N:1 mapping, which means you can multiplex many fibers over a single
thread, and to take advantage of multiple processors you can have M processes
that each do an N:1 mapping, which is common practice (libmill, libdill, ...).
But with uThreads you can multiplex many fibers over many kernel threads.

---------------------------------

[1] [https://github.com/samanbarghi/uThreads#what-are-uthreads](https://github.com/samanbarghi/uThreads#what-are-uthreads)

~~~
naasking
Sure, I get the M:N argument, but that wasn't something you could extend
existing solutions to support?

~~~
saman_b
Do you have anything specific in mind? The only code I found similar to this
is uC++ [1], which has way more features and more sophisticated scheduler. I
am using this as part of my research and wanted to have sth very simple.

For all N:1 mappings, since there is only a single process, there is no need
for synchronization, also there is no need for any scheduler as a simple queue
suffice. But as soon as multiple threads are introduced, there is a need for
orchestration among threads and it also changes all other aspects of the code.
Of course, I could develop on top of an existing codebase, but I suspect I had
to change so much that it is better to start from scratch anyway.

------------------------------------

[1] [https://plg.uwaterloo.ca/usystem/uC++.html](https://plg.uwaterloo.ca/usystem/uC++.html)

~~~
naasking
> For any N:1 mapping, since there is only a single kernel thread, there is no
> need for synchronization, and there is no need for a real scheduler since a
> simple queue suffices.

Wouldn't an M:N model simply amount to work stealing among N kernel-thread-
local queues? This seems like it should be a pretty straightforward extension
to one of the user-level C thread packages.

Or are you doing something more elaborate for your research?

~~~
saman_b
Well, on the surface, yes! When I started, I expected the same. For now there
is no work stealing among kThreads. But let me dig into the problem a bit
deeper so you get an idea of why it might not be very straightforward (it also
depends very much on the use case).

Usually these libraries either provide their own queue or rely on the
underlying event system, e.g., epoll/kqueue, to manage the state of the
lightweight threads. The other main part is IO multiplexing, which can be done
using select/poll/epoll/kqueue, etc.

Let's say the library uses some sort of queue. Since there is no other thread
in the system, no synchronization is required, and it's as simple as pushing
to the tail of the queue and pulling from the head. Now, if I add more
threads, I have to think about how to synchronize among them. The
straightforward way is to use the same queue, multiplex it among kernel
threads, and protect it with a mutex and condition variable. However, a mutex
has high overhead and is not very scalable (pthread_mutex does not scale well
under contention on Linux). What about N kernel-thread-local queues, each with
its own mutex? As the documentation shows, this approach does better but still
has high overhead. What's next? Replace it with some sort of lock-free queue:
MPMC, MPSC, or SPSC. It takes some work to determine which one has lower
overhead; in my case I have not tested SPSC and for now have settled on MPSC.
So the queue part of the original library is gone entirely, since its authors
probably did not have to care about any of this and used a simple queue.
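
To give a concrete flavour of the queue side, here is the classic Vyukov-style
MPSC queue (illustrative; the actual uThreads implementation differs in its
details):

    #include <atomic>
    #include <utility>

    // Unbounded MPSC queue: many producers push, one consumer pops.
    template <typename T>
    class MpscQueue {
        struct Node {
            std::atomic<Node*> next{nullptr};
            T value;
        };
        std::atomic<Node*> head_;   // push side, shared by producers
        Node* tail_;                // pop side, consumer-only
    public:
        MpscQueue() : tail_(new Node) {       // dummy node decouples the two ends
            head_.store(tail_, std::memory_order_relaxed);
        }
        void push(T v) {
            Node* n = new Node;
            n->value = std::move(v);
            // Claim the head slot atomically, then publish the link.
            Node* prev = head_.exchange(n, std::memory_order_acq_rel);
            prev->next.store(n, std::memory_order_release);
        }
        bool pop(T& out) {                    // single consumer only
            Node* next = tail_->next.load(std::memory_order_acquire);
            if (!next) return false;          // empty, or a push is mid-flight
            out = std::move(next->value);
            delete tail_;                     // retire the old dummy
            tail_ = next;                     // popped node becomes the new dummy
            return true;
        }
    };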

Next comes the IO multiplexing. Relying on an epoll instance per kernel thread
is absolutely fine, especially for cases where uThreads stick to the same
kernel thread. But as soon as I introduce the notion of migration, moving from
one kernel thread to another means issuing at least two system calls (in the
case of epoll: deregister from the current epoll instance and register with
the new one). This has high overhead, so I need to provide some sort of common
poller that multiplexes over multiple kernel threads. But with more than one
kernel thread, connections must be properly synchronized, since more than one
thread might try to access a connection. It also has to be done with low
overhead, as migrations in my research need to be very cheap. So the
IO-multiplexing part has to be replaced as well.

So the main parts would all need to change, and I believe the effort required
to make fundamental changes to an existing system might exceed the effort
required to write it from scratch. Also, for each part I implemented, I did
performance optimizations and built on top of that, which helped keep
performance at an acceptable level. It would be hard to do the same with an
existing system, since isolating and optimizing each part would take
additional effort.

I hope that makes it clearer :)

~~~
naasking
I agree there are additional complexities with M:N. I was assuming a lock-free
queue, since you'll need one for scalable work stealing.

I was also assuming a single kernel thread performs I/O via epoll/kqueue/etc.
and either has its own queue from which other threads steal, or simply pushes
results onto a random queue when the requested I/O completes. This
accomplishes the migration I believe you were describing.

When I/O needs to be done, you enqueue the uthread on the I/O queue and invoke
a reserved file descriptor to notify the I/O thread, which then reshuffles its
file descriptors and again calls epoll/kqueue.
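
Concretely, the reserved descriptor can just be an eventfd that lives in the
I/O thread's epoll set (a Linux-specific sketch of the trick, not any
particular library's code):

    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>
    #include <cstdint>

    // Any worker writes to the eventfd to pop the I/O thread out of
    // epoll_wait so it can re-arm newly enqueued descriptors.
    int make_wakeup_fd(int epfd) {
        int efd = eventfd(0, EFD_NONBLOCK);
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.ptr = nullptr;          // sentinel: "wakeup, not a connection"
        epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
        return efd;
    }

    void notify_io_thread(int efd) {    // called from any worker thread
        uint64_t one = 1;
        write(efd, &one, sizeof one);   // unblocks the I/O thread's epoll_wait
    }

    void drain_wakeup(int efd) {        // I/O thread: reset before re-polling
        uint64_t count;
        read(efd, &count, sizeof count);
    }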

I'm not sure whether this would scale as well as what you're doing since you
mentioned an epoll-per-kernel thread, but I wouldn't be surprised if it got
close since it's so I/O-bound.

~~~
saman_b
> I was also assuming a single kernel thread performs I/O via
> epoll/kqueue/etc. and either has its own queue from which other threads
> steal, or simply pushes results onto a random queue when requested I/O are
> complete. This accomplishes the migration I believe you were describing.

> When I/O needs to be done, you enqueue the uthread on the I/O queue and
> invoke a reserved file descriptor to notify the I/O thread, which then
> reshuffles its file descriptors and again calls epoll/kqueue.

If I remember correctly, what you describe is similar to how golang performs
I/O polling, since they use work stealing, except that there is no dedicated
poller thread; threads take turns doing the polling whenever they become idle.

Also, with epoll/kqueue there is no need to enqueue uThreads: you can simply
store a pointer (to a struct that has reader/writer pointers to uThreads) with
the epoll event you create, and recover it later from the notification. This
way you set the pointer whenever a uThread needs to do I/O.
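
With epoll that pointer trick looks roughly like this (PollData is an
illustrative name for the struct described above, not the uThreads API):

    #include <sys/epoll.h>

    struct PollData {
        void* reader;   // uThread blocked on read readiness, if any
        void* writer;   // uThread blocked on write readiness, if any
    };

    // Register once; the kernel hands the same pointer back with each
    // event, so no lookup table is needed on notification.
    void register_fd(int epfd, int fd, PollData* pd) {
        epoll_event ev{};
        ev.events = EPOLLIN | EPOLLOUT | EPOLLET;   // edge-triggered
        ev.data.ptr = pd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    void poll_once(int epfd) {
        epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            auto* pd = static_cast<PollData*>(events[i].data.ptr);
            (void)pd;   // wake pd->reader / pd->writer via the scheduler
        }
    }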

I agree that a single poller thread does not scale as well as a per-kThread
epoll instance; it requires some careful thought to find a low-overhead
solution that does not sacrifice performance for scalability.

------
avdicius
I did something similar in plain C:

[https://github.com/ademakov/MainMemory](https://github.com/ademakov/MainMemory)

