If anyone is interested in a real-world example that is still meant for educational purposes (a la MINIX): I wrote a generic TCP proxy in Rust/tokio some years ago, specifically for use as a teaching example to demonstrate how to correctly handle what would, at first blush, seem to be an incredibly straightforward task: https://github.com/mqudsi/tcpproxy
I've updated it over the years (moving from chaining futures to using async, using the latest tokio, etc) to keep it relevant.
In particular, correctly splitting a duplex connection into send/receive streams and then aborting one side of the connection when the other closes without missing events is trickier than it should be.
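The naive shape of it looks something like this (a deliberately simplified sketch, not the repo's actual code; the subtle part is doing this without dropping in-flight data or events when one direction finishes first):

    use tokio::io::copy;
    use tokio::net::TcpStream;

    async fn proxy(client: TcpStream, upstream: TcpStream) -> std::io::Result<()> {
        let (mut client_read, mut client_write) = client.into_split();
        let (mut upstream_read, mut upstream_write) = upstream.into_split();
        tokio::select! {
            // whichever direction finishes (or errors) first wins...
            res = copy(&mut client_read, &mut upstream_write) => { res?; }
            res = copy(&mut upstream_read, &mut client_write) => { res?; }
        }
        // ...and dropping all four halves here tears down the other direction too
        Ok(())
    }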
I am relieved that the source is just shy of 200 lines, because my initial thought was "there's some bookkeeping, sure, but that doesn't sound too bad".
(I'm actually surprised it's that small, and it's clearly written for clarity not size. nicely done!)
For anyone not used to more than simple concurrency stuff though, yeah, loads of subtle footguns. And they're often the sort of thing you might not notice in normal manual tests. Thanks for the link and maintaining it!
Thanks. Perhaps I did go overboard with that disclaimer... probably because I myself made the mistake of initially using [0] the oh-so-convenient tokio::io::copy() instead of writing my own copy method that would drop the other half of the connection when one side was closed.
The copy_with_abort() routine is still taking the easy way out in this not-optimized-for-heavy-production-use sample because it uses a broadcast channel per connection to reactively signal that the other half of the connection should be closed (rather than timing out every x ms to see if an abort flag has been set). In the real world, I'd probably replace the join! macro with a manual event loop to be able to do the same but without creating a broadcast channel per-connection.
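Roughly, the shape of that is (illustrative names, not the exact code from the repo):

    use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};
    use tokio::sync::broadcast;

    async fn copy_with_abort<R, W>(
        read: &mut R,
        write: &mut W,
        mut abort: broadcast::Receiver<()>,
    ) -> std::io::Result<u64>
    where
        R: AsyncRead + Unpin,
        W: AsyncWrite + Unpin,
    {
        let mut copied = 0;
        let mut buf = [0u8; 4096];
        loop {
            let n = tokio::select! {
                r = read.read(&mut buf) => r?,
                _ = abort.recv() => break, // other half closed: stop copying
            };
            if n == 0 {
                break; // this side reached EOF
            }
            write.write_all(&buf[..n]).await?;
            copied += n as u64;
        }
        Ok(copied)
    }

The task copying the other direction holds the matching broadcast::Sender and fires it when its own copy finishes, which is what wakes the recv() here.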
(I maintain an extremely lightweight "awaitable bools" library for rust [1] that is perfect for this kind of thing (roughly equivalent to a bounded broadcast_channel<()> of queue length 1, but each "channel" is only a single (optionally stack-allocated) byte) — but it's for event loops in synchronous code and not async executor compatible.)
Nah, I think it's a fair assessment - I do a fair bit of concurrency work, and do a bunch of education stuff at work. I've built and helped fix multiple things close to this or (regrettably) more complicated. It's not surprising that I guessed the general shape correctly.
But there are quite a lot of people stuck somewhere between "can reliably use a single blocking queue correctly / coarsely protect things with one mutex" and "can handle multiple competing interactions with error handling". More material to help cross that threshold is definitely a good thing.
These runtimes seem heavily skewed towards being used by IO-bound applications.
Rayon, on the other hand, can help with CPU-bound tasks. I wonder if the two mix at all; it may not even be desirable, but I think it's fairly interesting nonetheless.
I know that tokio doesn't like being blocked by anything: you will get a hard error if you do a blocking operation outside of spawn_blocking. Rayon probably doesn't trigger this warning, as it is technically not doing any WouldBlock operations.
You would probably do a spawn_blocking for rayon, as you would for IO-bound operations.
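Something like this, perhaps (just a sketch; the function is made up for illustration):

    use rayon::prelude::*;

    // Hypothetical helper: run a rayon-parallelized computation without
    // stalling the tokio worker threads.
    async fn sum_of_squares(data: Vec<u64>) -> u64 {
        tokio::task::spawn_blocking(move || data.par_iter().map(|x| x * x).sum::<u64>())
            .await
            .expect("blocking task panicked")
    }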
Anyway, mixing CPU- and IO-bound operations on the same CPU/machine is probably like mixing oil and water, and would probably be a pain in the neck to maintain and to get good utilization out of.
It is also a bit interesting how these paradigms compare with Go's green threads, which also seem to yield without a lot of syntactic sugar. The same goes for Elixir. These languages also seem to solve the IO-bound problem quite well using this paradigm (you obviously don't get as much control as you do in Rust), but it's powerful nonetheless.
The section just above the part you have linked also explains that tokio lets you pick the runtime scheduler - https://docs.rs/tokio/latest/tokio/runtime/index.html#runtim... - so you can run on single or multiple threads (the default is the latter with core and blocking threads as mentioned in the section you linked).
I think you can mix them (or you can, as mentioned in the link, have a separate runtime for CPU-bound tasks) but it is probably easier to use Rayon.
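Picking a scheduler or standing up a second runtime for CPU-heavy work looks roughly like this (a sketch based on the linked docs; thread counts and names are only illustrative):

    use tokio::runtime::Builder;

    fn main() -> std::io::Result<()> {
        // One runtime for IO, a second one reserved for CPU-heavy tasks.
        let io_rt = Builder::new_multi_thread().enable_all().build()?;
        let cpu_rt = Builder::new_multi_thread().worker_threads(2).build()?;

        io_rt.block_on(async {
            // Hand a CPU-heavy job to the other runtime and await its result.
            let handle = cpu_rt.spawn(async { (0u64..1_000_000).sum::<u64>() });
            let total = handle.await.expect("cpu task panicked");
            println!("sum = {}", total);
        });
        Ok(())
    }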
The article looks great! I like that it explains classic IO for a webserver (which helps, since a lot of developers don't even know that read() and write() might return without all data being processed) and then proceeds from non-blocking IO to async/await.
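For anyone who hasn't run into the partial-write case before, the classic loop looks roughly like this (a sketch; std's own write_all does essentially the same thing):

    use std::io::{self, Write};

    fn write_fully<W: Write>(w: &mut W, mut buf: &[u8]) -> io::Result<()> {
        while !buf.is_empty() {
            match w.write(buf) {
                Ok(0) => return Err(io::Error::new(io::ErrorKind::WriteZero, "write returned 0")),
                Ok(n) => buf = &buf[n..], // only part was written: advance and keep going
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue, // retry on EINTR
                Err(e) => return Err(e),
            }
        }
        Ok(())
    }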
However, there seems to be one thing worth pointing out:
> All to implement graceful shutdown.
It isn't true that async IO is required for that. Blocking IO calls can also be interrupted via signals. For the Ctrl+C use-case mentioned in the blog post, even a simple setup that unblocks the syscall and lets it return EINTR is enough. But even custom in-application cancellation from different threads works via signals, e.g. building a Thread::interrupt() that unblocks IO on a different thread is possible.
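A minimal sketch of that idea (it assumes a signal handler was installed without SA_RESTART, e.g. via sigaction through the libc crate, so that a blocked read really returns EINTR, and that the handler sets the flag):

    use std::io::{ErrorKind, Read};
    use std::net::TcpStream;
    use std::sync::atomic::{AtomicBool, Ordering};

    static SHUTDOWN: AtomicBool = AtomicBool::new(false);

    fn serve(mut stream: TcpStream) -> std::io::Result<()> {
        let mut buf = [0u8; 4096];
        loop {
            match stream.read(&mut buf) {
                Ok(0) => return Ok(()),      // peer closed the connection
                Ok(_n) => { /* handle &buf[.._n] */ }
                Err(e) if e.kind() == ErrorKind::Interrupted => {
                    if SHUTDOWN.load(Ordering::Relaxed) {
                        return Ok(());       // interrupted on purpose: shut down
                    }
                    // stray signal: just retry the read
                }
                Err(e) => return Err(e),
            }
        }
    }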
Most classical applications (e.g. proxy servers) that went for the nonblocking route did it indeed for bigger scalability and a lower memory footprint than a thread-per-connection model. In the last couple of years, and especially in Rust, I however feel that most applications choose async either because the library ecosystem forces them to or because there's just an assumption of missing out on something if not using it. There are rarely measurements performed with both setups that actually quantify any win or loss from using async IO.
Yeah, you're right regarding signal handlers (though I'm not sure about the Windows equivalent), and I was planning on adding a note about the alternative. The main thing I wanted to point out was that async/await as a model is built such that expressing complex control flow with arbitrary inputs becomes very simple. Nowadays thread-per-request will work well performance-wise for most, but async/await became the default in various ecosystems because it's a more expressive model, especially useful in server-level code, which more or less forces it to be used by everyone. The advantage becomes apparent when you need to compose more complex logic within a system, which async makes seamless (though in Rust specifically things are harder because of the interaction between async and the borrow checker).
Not a Rust dev, but I thought this tutorial was great the way it stepped from the obvious, through the nitty-gritty performance enhancements, before ending up using an elegant feature that you probably wouldn't appreciate very much without knowing what it's doing for you under the hood.
Those comparisons to nginx are not fair. The latter also does things like loading configuration per connection and request, processing timeouts, preparing logs, and potentially other things.
I enjoyed the async part where the author built up Future and a runtime from first principles. If the topic interests you, these two videos are good follow-ups into Rust async [1] [2]. Rust's model is a little different and more powerful than other languages.
That chain of errors when dealing with async code and closures matches my experience when I tried to use the simplest web framework I could find (Tide, IIRC). That experience is what made me give up on Rust whenever I see async is involved.
I like this post because it explains the debugging process, and makes me want to give another try to async Rust.
I will look it up later, but maybe someone here knows off the top of their head: does the Rust TCP listener's accept not fail with an EINTR equivalent upon receipt of SIGINT?
The problem with async programming, or cooperative multitasking (as it was popularized in the Windows 3.1 era) is that the CPU is a resource too. Threads are a better way of slicing up CPU time.
Async is a bit trendy right now and in my experience has a way of infecting a codebase. That said, I like the way it reads for server code. In a project I'm working on, we use both: explicit threads and channels for the CPU-bound work, and an async REST server on a single thread for handling requests and reading the database. It works nicely. But we do have a clear separation boundary between the two. Also, here's a related post that surprised me. Poster says spawning a thread per connection works fine for their server at 50K+ connections. Context switching and spawns not a problem? Maybe not always. https://stackoverflow.com/questions/17593699/tcp-ip-solving-...
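That kind of split looks roughly like this (a sketch with invented names and types, not the project's actual code):

    use tokio::sync::mpsc;

    // Dedicated worker thread for the CPU-bound side; results flow back to
    // the async side over a tokio channel.
    fn spawn_cpu_worker(jobs: std::sync::mpsc::Receiver<Vec<u8>>, results: mpsc::Sender<u64>) {
        std::thread::spawn(move || {
            while let Ok(job) = jobs.recv() {
                // stand-in for the real CPU-bound work
                let digest: u64 = job.iter().map(|&b| b as u64).sum();
                // blocking_send is fine here because we're on a plain OS thread
                if results.blocking_send(digest).is_err() {
                    break; // async side went away; shut the worker down
                }
            }
        });
    }

The async side then holds the matching mpsc::Receiver and just awaits recv() in its handler loop.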
It is clearly not an ideal abstraction for anything CPU bound. But for anything IO bound, it is a much more natural way to tame the complexities of concurrency.
The problem is that systems evolve. What's IO-bound today may be CPU-bound tomorrow. And then you'll have to rethink, and depending on what language you use you may even run into the "what color is your function" problem.
Data storage is one example: for a long time, loading data from an HDD was the number one performance issue, as it was thousands of times slower than the CPU; nowadays the fastest SSDs have sequential reads of 12 GB/s (DDR3-1600 speeds!), which can easily be much faster than whatever non-trivial processing you'll apply to the data, so the bottleneck moved from IO to CPU.
I'm not sure that I'd want to avoid using a good abstraction because 20 years from now, it may perform marginally worse than it could had it been modeled with much more complexity. Clearly it depends on the use case, but you get my point.
There are two ways to understand that. What I think is the likely intention is that the requirements may change, such that pieces that currently don't do much computation start to do a lot. The easiest one to think about is adding encryption: your http server may have been bottlenecked on disk or network I/O, but your https server may be bottlenecked on CPU. Not a great example, because the CPU work in TLS comes in bits and pieces and will probably share nicely.
The other way is that sometimes a little more load perturbs the system so that things take a lot longer. I've had systems that work fine until you hit X% CPU, then it jumps to 90%+, because of, say, lock contention; sometimes X is 80%, so whatevs, but sometimes it's 25%, and then you need to find and fix the bottleneck.
The section under “a non blocking server” is probably the best description of the biggest issue with the “why not spin out blocking calls on threads” model.
I wonder how much that is obsoleted by io_uring by now?
The article mentions how non-blocking IO is the way to gain back control otherwise conceded to the kernel in the case of blocking IO. However, non-blocking IO has to be laboriously implemented in user code. Is that true still? What if we used the threading model but relied on io_uring?
You still need async (or an event loop model) because otherwise your threads will block waiting for the io_uring result when they could be put to better use (namely submitting new io_uring jobs or conveying the results of completed ones).
Note that the io_uring approach isn't too different (just more universalized) from how Windows has always implemented async operations (with overlapped I/O and I/O completion ports). Code written for that environment originally just used the event loop model, but it's easier/better ergonomics to wrap it in async and use that instead (e.g. how C#'s async/await works), because you can avoid callback hell, state passing, etc. and just write code "linearly" with cooperative yield points invisibly stashing and retrieving the "stack" for you.
What? The total opposite is true! 'async/await' is callback hell. It's the barest syntactic sugar over it, but it's definitely callback hell. You write code sequentially if you use threads (whether they're userspace M:N or N:1 threads, or 1:1 kernel threads).
Sorry, not callback hell (that's what JS had before async/await) but certainly not code that linearly handles a connection from start to end, either. (Threaded code is certainly "linear" but I meant from the perspective of walking through the lifetime of a connection, one connection at a time linearly.)
You can use a "threading model" with readiness-based or completion-based non-blocking IO. There are (userspace-)threaded runtimes based on epoll/etc. and also those based on io_uring/IOCP/etc.
As a commenter in the other sub-thread pointed out, you still need async capabilities to make the most of the io_uring APIs (I'm a huge fan of io_uring).
Whilst you can do this, I still think the best approach is “actual-async”. The idea of “spin up this worker and shunt the data across to it just so it can block” strikes me as somewhat wasteful. If you know you're going to have to wait, you're better off not expending resources doing so, and instead using the time to do useful work.
"Spin up this worker" is trivial: one memory allocation, probably from a free list, to allocate a thread stack, setting a couple of pointers, and a function call. Doing any kind of I/O is much heavier-weight. Thread stacks don't have to even be very large. It doesn't expend any additional resources to have a userspace thread blocked on a socket. You can just have a pointer to the thread as the userdata of the SQE.
It doesn't make any sense to say 'use the time to do useful work instead'. That's what you do when you block: in a threaded model, when you block you do 'io_uring_get_sqe', 'io_uring_prep_read' (or whatever), 'io_uring_sqe_set_data', 'io_uring_submit', and then you switch contexts (with <ucontext.h> or some equivalent) to another thread that is ready. When you have no ready threads you just call 'io_uring_wait_cqe', and for each completion you do 'io_uring_cqe_get_data', get a pointer to the thread, record the result of the operation, and append the thread to a runqueue to be resumed.
From the perspective of the thread, you write
    int e = read(fd, buf, sizeof buf);            /* coroutine wrapper returns -errno on failure */
    if (e < 0) errx(1, "read: %s", strerror(-e)); /* errx: strerror(-e) already carries the reason */
and it all feels completely sequential. Underneath it is stackful coroutines.