We are scrambling right now to fix a problem due to a change in behavior exposed to user-space from the io_uring kernel module in later kernels.
Turns out that in earlier kernels (Ubuntu 19.04 5.3.0-51-generic #44-Ubuntu SMP) io_uring will not return short reads/writes (that's where you ask for e.g. 8k, but there's only 4k in the buffer cache, so the call doesn't signal as complete and blocks until all 8k has been transferred). In later kernels (not sure when the behavior changed, but the one shipped with Fedora 32 has the new behavior) io_uring returns partial (short) reads to user space: you ask for 8k but there's only 4k in the buffer cache, so the call signals complete with a return of only 4k read, not the 8k you asked for.
Userspace code now has to cope with this where it didn't before. You could argue (and kernel developers did :-) that this was always possible, so user code needs to be aware of this. But it didn't used to do that :-). Change for user space is bad, mkay :-).
I know nothing about io_uring but looking at the man page[1] of readv I see it returns the number of bytes read. For me as a developer that's an unmistakable flag that partial reads are possible.
Was readv changed? The man page also states that partial reads are possible, but I guess that might have been added later?
If it always returned bytes read, it would hardly be the first case where the current behavior is mistaken for the specification. My fondest memory of that is all the OpenGL 1.x programs that broke when OpenGL 2.x was released.
Also, note the preadv2 man page which has a flags field with one flag defined as:
-------------------------------
RWF_NOWAIT (since Linux 4.14)
Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read.
-------------------------------
This implies that "standard" pread/preadv/preadv2 without that flag (which is only available for preadv2) will block waiting for all bytes (or short return on EOF) and you need to set a flag to get the non-blocking behavior you're describing here. Otherwise the flag would be the inverse - RWF_WAIT, implying the standard behavior is the non-blocking one, not the blocking one.
The blocking behavior is what we were expecting (and previously got) out of io_uring, so it was an unpleasant surprise to see the behavior change visible to user-space in later kernels.
> If this flag is specified, the preadv2() system call will return instantly if it would... wait for a lock.
Doesn't this sound a bit different from ordinary short reads?
Receiving EAGAIN usually happens under fairly specific conditions (signal interruption), but I'd imagine that filesystem code has a great deal of locks.
For example, FUSE filesystems can support signal interruption via EAGAIN, but they are not guaranteed to. You can end up in a situation where a FUSE filesystem hangs and you cannot interrupt the thread that reads from it. I suspect that RWF_NOWAIT is a "fix" for situations like that, not the opposite of the default behavior.
Well, the man page does say that "The readv() system call works just like read(2) except that multiple buffers are filled".
If we go to read(2) we find "It is not an error if [the return value] is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now [...], or because read() was interrupted by a signal."
As an outsider, I'd never rely on this returning the requested number of bytes. If I required N bytes, I'd use a read loop.
But I do agree that the RWF_NOWAIT flag mentioned in your other comment doesn't help, as it suggests the default is to block.
A user started describing file corruption when copying to/from Windows with the io_uring VFS module loaded.
Tests using the Linux kernel cifsfs client and the Samba libsmbclient libraries/smbclient user-space transfer utility couldn't reproduce the problem, neither could running Windows against Samba on Ubuntu 19.04.
What turned out to be happening was a combination of things. Firstly, the kernel changed so an SMB2_READ request against Samba with io_uring loaded was sometimes hitting a short read, where some of the file data was already in the buffer cache, so io_uring now returned a short read to smbd.
We returned this to the client, as in the SMB2 protocol it isn't an error to return a short read, the client is supposed to check read returns and then re-issue another read request for any missing bytes. The Linux kernel cifsfs client and Samba libsmbclient/smbclient did this correctly.
But it turned out that Windows 10 and macOS Catalina clients (maybe earlier client versions too, I don't have access to those) have a horrible bug, where they're not checking read returns when doing pipelined reads.
When trying to read a 10GB file for example, they'll issue a series of 1MB reads at 1MB boundaries, up to their SMB2 credit limit, without waiting for replies. This is an excellent way to improve network file copy performance as you fill the read pipe without waiting for reply latency - indeed both Linux cifsfs and smbclient do exactly the same.
But if one of those reads returns a short value, Windows 10 and macOS Catalina DON'T GO BACK AND RE-READ THE MISSING BYTES FROM THE SHORT READ REPLY !!!! This is catastrophic, and will corrupt any file read from the server (I'm assuming the local client buffer cache fills the missing contents with zeros - I haven't checked, but the files are corrupt as verified by SHA256 hashing anyway).
That's how we discovered the behavior and ended up leading back to the io_uring behavior change. And that's why I hate it when kernel interfaces expose changes to user-space :-).
> in the SMB2 protocol it isn't an error to return a short read, the client is supposed to check read returns and then re-issue another read request for any missing bytes
This is interesting and somewhat surprising, since Windows IO is internally asynchronous and completion based, and AFAIK file system drivers are not allowed to return a short read except for EOF.
And actually, even on Linux file systems are not supposed to return short reads, right? Even on signal? Since user apps don't expect it? (And thus it's not surprising that io_uring's change broke user apps.)
So it wouldn't be surprising to learn that the Windows SMB server never returns short reads, and thus it's interesting that the protocol would allow it. Do you know what the purpose of this is?
Obviously the Windows SMB server never returns short reads, otherwise this bug would never have made it out of Redmond or Cupertino.
On Linux, pread also never returns short reads against disk files if the bytes are available, which is why no one noticed this client bug, as our default io backend is a pthread-pool that does pread/pwrite calls. It only happens when someone tries our (flagged as experimental, thank god) vfs_io_uring backend.
Yeah, the protocol even has a field in the SMB2_READ request called MinimumBytes, for which the server should fail the read if fewer than that many bytes are available on return. The Windows 10 clients set this to zero :-). The macOS Catalina client sets it to 1. So yes, the clients are supposed to be able to handle short reads.
Out of curiosity, I took a look at how the MinimumBytes (actually MinimumCount) field is used by the Windows SMB server. Interestingly, it fails with STATUS_END_OF_FILE if the actual bytes read is less than MinimumCount, which suggests to me that this is supposed to be a minimum on the (remaining) file length, not on the number of bytes that the server is able to return at the moment.
I can't find any history of MinimumCount being used in the RTM version of any Windows SMB client, so without deeper archeology the reason this field was introduced remains a mystery to me.
Regardless, I agree that the client should validate the returned byte count. But (only having thought about this briefly), I do not think a client should retry in this case - it seems to me that if the client sees a short read, it can assume that the read was short because the read reached EOF (which may have changed since the file's length was queried).
Sorry to keep laboring the point :-) but the other reason I'm pretty sure this is a client bug is that the client doesn't truncate the returned file at the end of the short read, which you'd expect if it actually was treating short read as EOF.
If you copy a 100MB file and the server returns a short read somewhere in the middle of the read stream, the file size on the client is still reported as 100MB, which means file corruption as the data in the client copy isn't the same as what was on the server.
That's how this ended up getting reported to us in the first place.
Yes, that's a good point. I agree that there appears to be a client bug here. From a quick glance, it appears that nothing is checking that the non-final blocks in a pipelined read are returned from the server in full.
I don't necessarily agree that retry is the right behavior though. Wouldn't that result in an extra round trip in the actual EOF case? Again, not having thought about this much, it seems a more efficient interpretation of the spec is that truncated reads indicate EOF. In that case, a truncated read in the middle of a pipelined operation either indicates that the file's EOF moved concurrently with the operation (in which case stopping at the initial truncation would be valid) or that the lease has been violated.
Regardless, I work on SMB-related things only peripherally, so I do not represent the SMB team's point of view on this. Please do follow up with them.
It's only an extra round trip in the case of an unexpected EOF. File size is returned from SMB2_CREATE, and so given the default of an RWH lease then (a) the lease can't be violated - if it is, then all bets are off as the server let someone modify your leased file outside the terms of the lease. Or (b) you know the file size, so a short read if you overlap the actual EOF is expected and you can plan for it.
A short read in the middle of what you expect to be a continuous stream of bytes should be treated as some sort of server IO exception (which it is), so an extra round trip to fetch the missing bytes isn't so onerous: that read returning 0 means EOF and something got truncated, and an error such as EIO means you got a hardware error.
After all this is a very exceptional case. Both Steve's Linux cifsfs client and libsmbclient have been coded up around these semantics (re-fetching missing bytes to detect unexpected EOF or server error) and I'd argue this is correct client behaviour.
As I said, given the number of clients out there that have this bug we're going to have to fix it server-side anyway, but I'm surprised that this expected behavior wasn't specified and tested as part of a regression suite. It certainly is getting added to smbtorture.
Whenever a client gets a short read it needs to issue a request at the missing offset if the caller wanted more bytes. Only if the server returns zero on that read can it assume EOF and concurrent truncation.
We're going to have to fix the Samba server to never return short reads when using io_uring because the clients with this bug are already out there. But if what you're saying is how Microsoft expects the protocol to operate then it needs to be documented in MS-SMB2 because I don't think it's specified this way at the moment.
No, the client can't assume that. Consider pipelining reads. You can asynchronously send 10 1MB reads. The server can return the data in any order. So the read sent at offset 0 could return last, after the server has already returned the 9MB starting at offset 1MB onwards in the file, and this first read then returns a short read of 800k instead of 1MB.
You can't then assume that the read at offset 0 returning short means the file is now truncated to 800k and the other 9MB is no longer of use.
Also remember you might have a complete RWH lease on the file, so you are guaranteed that there was no other writer truncating the file whilst the read is ongoing.