
Asynchronously Opening and Closing Files in Asyncio - signa11
https://nullprogram.com/blog/2020/09/04/
======
theelous3
Just want to note that curio and trio, two non-stdlib Python async libraries,
have file IO as batteries included.

I would also like to point out that asyncio made a huge mistake here. IO is
not limited to interactions with the network buffer. I don't know how
something called async _io_ can be released into the stdlib and not include
disk IO.

The fact that there is no actual async file IO at the OS level doesn't
matter. Dealing with files is a core concept in almost every application, so
shipping this in the stdlib without disk IO and hoping the Python ecosystem
takes up the slack, or that people hand-bake threaded access themselves, is a
joke.
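
(To make the hand-baked approach concrete: a minimal sketch using asyncio's
default thread pool; the function name and path are illustrative.)

    import asyncio

    async def read_file(path):
        # Hand-baked threaded file IO: push each blocking call onto the
        # default thread pool so the event loop keeps running meanwhile.
        loop = asyncio.get_running_loop()
        f = await loop.run_in_executor(None, open, path, "rb")
        try:
            return await loop.run_in_executor(None, f.read)
        finally:
            await loop.run_in_executor(None, f.close)

    async def main():
        data = await read_file("/etc/hostname")
        print(len(data))

    asyncio.run(main())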

Some of this is a little dated, but a friend of mine wrote this and included
my complaints about file IO in there somewhere too:

[https://web.archive.org/web/20171206105117/https://veriny.tf...](https://web.archive.org/web/20171206105117/https://veriny.tf/asyncio-a-dumpster-fire-of-bad-design/)

~~~
orf
I disagree that dealing with files is a core concept in almost every asyncio
application. I would say that the most common instance of file IO is at the
start of an asyncio process (parsing config files) rather than in the “hot”
asynchronous part.

Some problems might require extensive file IO, sure. But most don’t, and
something like reading from a database or making an HTTP request is orders of
magnitude more common.

~~~
earthboundkid
You’re just writing off being able to write an HTTP file server in Python?
That’s crazy.

~~~
falcor84
The parent never said that. You can of course write a file server; they just
said it's not as common, and I agree.

From personal experience, even for serving files, I find myself relying a lot
more on web services than on a local disk.

------
Hello71
> The other threads need to continue running while the thread waiting on
> open(2) is paused, but ptrace pauses the whole process.

This definitely doesn't sound right. When running strace, you need to specify
-f in order to trace threads; otherwise only the main thread is traced. gdb
does stop all threads when it interrupts, but only by default, for
convenience, and I don't think it's atomic.

Because of this, LD_PRELOAD is not the easiest way to delay system calls;
strace -e inject=...:delay_enter (shown below) is.

    
    
      $ cat > a.c << EOF
      #include <stdio.h>
      #include <sys/types.h>
      #include <fcntl.h>
      #include <pthread.h>
      #include <unistd.h>
      
      /* Keep printing so we can see the rest of the process is not paused. */
      void *start_thread(void *arg) {
          while (1) {
              puts("thread");
              sleep(1);
          }
      }
      
      int main() {
          pthread_t thr;
          pthread_create(&thr, NULL, start_thread, NULL);
          while (1) {
              puts("main");
              /* The openat(2) behind this open() is what strace's inject filter delays. */
              close(open("/dev/null", O_RDONLY));
              sleep(1);
          }
      }
      EOF
      $ gcc a.c
      $ strace -o /dev/null -P /dev/null -e inject=openat:delay_enter=3s ./a.out
      main
      thread
      thread
      thread
      thread
      thread
      main
      thread
      thread
      thread
      thread
      thread
      [ and so on. ]

------
halayli
Using threads for file IO is a hammer approach, imho.

Oftentimes the price you pay to send the job to the queue, yield to another
coroutine, wait for the IO task to finish, receive a notification back from
the thread pool, and resume is higher than just doing the blocking call. Under
high load it might have some small benefits, but my hunch is that it's
unlikely to buy you much.
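
(A rough way to see the round-trip cost being described, as a quick sketch;
the path and iteration count are arbitrary, and results will vary with the
executor and the file system.)

    import asyncio, time

    def blocking_touch(path):
        # The plain blocking call we compare against.
        with open(path, "rb"):
            pass

    async def main():
        loop = asyncio.get_running_loop()
        path = "/etc/hostname"  # illustrative

        t0 = time.perf_counter()
        for _ in range(1000):
            blocking_touch(path)  # call it directly, blocking the loop
        t1 = time.perf_counter()
        for _ in range(1000):
            # queue the job, yield, wait for the thread pool, resume
            await loop.run_in_executor(None, blocking_touch, path)
        t2 = time.perf_counter()

        print(f"direct: {t1 - t0:.3f}s  via thread pool: {t2 - t1:.3f}s")

    asyncio.run(main())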

Luckily, io_uring officially made it into the Linux kernel a few months ago
and can solve these problems in a much more efficient manner. *BSD solved this
problem a long time ago, in 2002, when kqueue was introduced.

~~~
ridiculous_fish
Note this is specifically about opening and closing files, which may be on
network mounts. kqueue has no support for this; io_uring only very recently
got it. So there's no alternative to threads today.

~~~
halayli
kqueue does support this via EVFILT_VNODE. io_uring is in Linux as of kernel
5.1 (May 2019). Alternatives are available but might not be viable for
everyone.

~~~
ridiculous_fish
My understanding is that kqueue can notify for file events like "readable" or
"opened," but cannot actually initiate reading or opening.

~~~
halayli
kqueue is not responsible for initiation; it's responsible for notifying you
about events. When you open(2) a file with the O_NONBLOCK flag, you can
register it with kqueue to let you know when it's readable/writable.
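
(For concreteness, the registration step looks roughly like this from Python's
select module, which exposes kqueue on BSD/macOS; the path is illustrative,
and whether readiness notifications help at all for a regular disk file is
exactly the disagreement above.)

    import os, select

    # Open without blocking and register the descriptor with kqueue.
    fd = os.open("some_fifo_or_socket", os.O_RDONLY | os.O_NONBLOCK)

    kq = select.kqueue()
    ev = select.kevent(fd, filter=select.KQ_FILTER_READ,
                       flags=select.KQ_EV_ADD | select.KQ_EV_ENABLE)
    kq.control([ev], 0)           # register interest, don't wait for events
    ready = kq.control(None, 1)   # block until the fd reports readable
    print(ready)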

------
mana72
People seem to complain, and I do agree that we are not yet approaching an
elegant solution.

I have used Asyncio for an API I built for
[https://www.abstractapi.com/](https://www.abstractapi.com/) in order to be
able to manage high load and avoid being locked by network latency.

It has been painful, and I almost thought about migrating to Amazon Lambda to
distribute the load, but it feels like a bad use case for Lambda when you
actually only want to perform async stuff and not lock up a full thread.

Can't believe we still don't have a proper async lib for Python in 2020. I
still have to test Curio, but the whole thing makes me seriously reconsider
which language I'll use for my next project. Async is something I need to
handle more and more.

------
robertlagrant
> This is likely in part because operating systems themselves also lack these
> facilities.

This is a surprising statement. My impression was that OS file operations are
generally async, with sync built on top.

~~~
jeffbee
Depends on the operating system. This is one reason why it is folly to try to
abstract away the differences between operating systems.

Ordinary Linux file I/O, for example, cannot be made non-blocking and ignores
the O_NONBLOCK flag. See open(2).
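
(A tiny sketch of that behaviour from Python; the path is illustrative. The
flag is accepted but has no effect for a regular file, so the read can still
sleep on disk IO.)

    import os

    # O_NONBLOCK is silently ignored for regular files on Linux.
    fd = os.open("/var/log/syslog", os.O_RDONLY | os.O_NONBLOCK)
    data = os.read(fd, 4096)  # may still block in the kernel on disk IO
    os.close(fd)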

~~~
iameli
Hm. Confused by Node.js async fs operations in this context. Do they just not
actually work asynchronously behind the scenes?

~~~
satori99
Node.js uses libuv for async file operations, and libuv has its own internal
thread pool to make file ops appear fully async to a Node program.

From the libuv docs [0]:

    
    
        Unlike network I/O, there are no platform-specific file I/O primitives libuv could rely on,
        so the current approach is to run blocking file I/O operations in a thread pool.
        For a thorough explanation of the cross-platform file I/O landscape, checkout this post [1].
    

[0] [http://docs.libuv.org/en/v1.x/design.html#file-i-o](http://docs.libuv.org/en/v1.x/design.html#file-i-o)

[1] [https://blog.libtorrent.org/2012/10/asynchronous-disk-io/](https://blog.libtorrent.org/2012/10/asynchronous-disk-io/)

------
Snawoot
I've been solving a similar problem: decoupling a Python daemon's logging IO
from the event loop in order to make logging completely non-blocking. I'm
pretty sure this particular case must be the most popular file IO case for
async applications.

In my case it wasn't important to open the log asynchronously, but it was
critical to have async writes.

It turns out the depths of the Python stdlib hide logging.handlers.QueueHandler
and logging.handlers.QueueListener, which let you split the log message handler
into a separate thread from the log message emitters, making them communicate
via a synchronized queue.
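
(A minimal sketch of that wiring; the handler and file name are illustrative.)

    import logging, logging.handlers, queue

    log_queue = queue.SimpleQueue()

    # Emitters only enqueue records, so logging calls never block on disk.
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(log_queue))
    root.setLevel(logging.INFO)

    # The listener drains the queue in its own thread and does the real IO.
    file_handler = logging.FileHandler("app.log")
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logging.info("this call returns immediately")
    listener.stop()  # flush and join the worker thread on shutdown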

Final solution is quite compact: [https://github.com/Snawoot/postfix-mta-sts-resolver/blob/589...](https://github.com/Snawoot/postfix-mta-sts-resolver/blob/589a59a90cf3f53148f9f3814ccd48cd92a9538f/postfix_mta_sts_resolver/utils.py#L26-L57)

Usage example: [https://github.com/Snawoot/postfix-mta-sts-resolver/blob/589...](https://github.com/Snawoot/postfix-mta-sts-resolver/blob/589a59a90cf3f53148f9f3814ccd48cd92a9538f/postfix_mta_sts_resolver/daemon.py#L101-L106)

It is convenient that you can still use logging as usual, without any async
obligations. You just have a logging function which works in a fire-and-forget
fashion.

> Asynchronous reads and writes would require all new APIs with different
> coloring. You’d need an aprint() to complement print(), and so on, each
> returning an awaitable to be awaited.

> This is one of the unfortunate downsides of async/await. I strongly prefer
> conventional, preemptive concurrency, but we don’t always have that luxury.
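
(To make the quoted "coloring" point concrete: a hypothetical aprint() must be
awaited, so it can only be used from other async functions.)

    import asyncio

    # Hypothetical async counterpart to print(); a different "color" of function.
    async def aprint(*args):
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, print, *args)  # hand the blocking call to a thread

    async def main():
        await aprint("hello")  # fine inside async code
        # Calling aprint("hello") from ordinary sync code would just create
        # a coroutine object and never actually print anything.

    asyncio.run(main())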

I was sceptical about Golang for a long time, but recently I've started
practicing it. It's designed with concurrency in mind from the very beginning,
and goroutines have the advantages of both cooperative and preemptive
multitasking. It also doesn't have a GIL, so your goroutines will use all
available CPUs without any additional effort on your side.

From a programmer's point of view, coding with goroutines looks exactly like
coding with pthreads (even simpler, in fact). So there is no function
"coloring", no need for sync/async library flavors, and so on. But these
"threads" also have a cost close to that of Python coroutines. It sounds too
good to be true, but I've rewritten a bunch of my Python/asyncio code in Go
and I can say it's true. For this reason (and many others I discovered later)
I quit using Python and asyncio for network applications in favor of Go.

~~~
nly
For anyone looking for async logging in C++ I can highly recommend spdlog:

[https://github.com/gabime/spdlog](https://github.com/gabime/spdlog)

