> Though it has Python bindings, OpenSlide is implemented in C and reads files through standard OS file handles, but our data sits on cloud storage that is accessible via HTTP. This means that, to open a WSI file, one must first download the entire file to disk, and only then load it with OpenSlide. But what if we need to read tens of thousands of WSIs, a few gigabytes each? The total can exceed what a single disk can hold, and even if we mounted multiple disks, the cost and time of transferring all this data to every new machine would be prohibitive. On top of that, usually only a fraction of the WSI is of interest, so downloading the entire file is wasteful. A solution is to read the bytes we need, when we need them, directly from Blob Storage. fsspec is a Python package that lets us define “abstract” filesystems with custom implementations for listing, reading, and writing files. One such implementation, adlfs, works with Azure Blob Storage.
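For illustration, a minimal sketch of the fsspec/adlfs pattern the quoted post describes (the account, container, and blob names are hypothetical; credentials are assumed to come from the environment):

```python
import fsspec

# "abfs" is the fsspec protocol implemented by adlfs. The account name,
# container, and blob path below are hypothetical placeholders.
fs = fsspec.filesystem("abfs", account_name="myaccount")

with fs.open("wsi-container/slide_0001.tiff", "rb") as f:
    f.seek(10 * 1024 * 1024)        # jump to an offset of interest
    chunk = f.read(1024 * 1024)     # only this byte range goes over HTTP
```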
AWS S3 has byte-range fetches specifically for this use case [1]. This is quite handy for data lakes and OLAP databases. Apparently, 8 MB and 16 MB are good sizes for typical workloads [2].

[1] https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing...

[2] https://d1.awsstatic.com/whitepapers/AmazonS3BestPractices.p...
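As a sketch, a ranged GET with boto3 (the bucket and key are made up; the 8 MiB size follows the whitepaper's suggestion):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key. The Range header is inclusive, so this asks
# for exactly the first 8 MiB of the object.
resp = s3.get_object(
    Bucket="my-data-lake",
    Key="tables/events/part-0000.parquet",
    Range="bytes=0-8388607",
)
first_8_mib = resp["Body"].read()
```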
The OP here is using Azure Blob Storage, which is essentially Azure's S3 competitor. Like S3, it accepts the Range header.[1] (I'm presuming Blob Storage was modeled after S3, frankly. The capabilities and APIs are very similar.)
Similarly, GCP Cloud Storage appears to also support the Range header[2].
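A quick way to check either service is to issue a plain HTTP GET with a Range header and look for a 206 response (the URL below is a placeholder; a SAS token on Azure, or an OAuth token on GCS, would normally be attached):

```python
import requests

# Hypothetical blob URL. Both Azure Blob Storage and GCS honor the
# standard HTTP Range header for partial reads.
url = "https://myaccount.blob.core.windows.net/wsi/slide_0001.tiff"
resp = requests.get(url, headers={"Range": "bytes=0-1048575"})

# 206 Partial Content means the server returned only the requested bytes.
assert resp.status_code == 206
first_mib = resp.content
```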
Even with byte-ranges in S3, I know that some projects like OME report very poor performance compared to local filesystems (their use case requires very low latency to show various subrectangles of multidimensional data, converted to image form, for pathologists sitting in a browser). They have been exploring moving from monolithic TIFF image files to zarr, which preshards the data and compresses it in blocks.
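A minimal sketch of what that chunked layout looks like with zarr (the shape and chunk size are illustrative, not OME's actual settings):

```python
import numpy as np
import zarr

# Each 512x512 chunk is stored (and compressed) as its own object, so a
# viewer can fetch just the subrectangle it needs instead of the whole
# monolithic image. Dimensions here are made up.
z = zarr.open(
    "slide.zarr",
    mode="w",
    shape=(65536, 65536),
    chunks=(512, 512),
    dtype="uint8",
)
z[:512, :512] = np.random.randint(0, 256, (512, 512), dtype="uint8")
```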
"Local filesystem" is not a thing... When you mount NFS on your laptop, is it local to your laptop or not? What if you have a caching client?
In other words, "local" or "remote" is not a property of a filesystem.
Various storage products exist that try to solve the problem of data mobility, that is, moving data quickly to a desired destination (usually by "pretending to move" it, but in such a way that the receiving end can start working as soon as possible).
For an open-source example, see DRBD. There are also proprietary products designed to do the same.
Local or remote is a property of a filesystem. It has conventionally been understood as whether the block device or file service is attached directly to the host bus, versus more indirectly through a NIC or other networking technology. Of course, this idea breaks down pretty quickly; many servers used SANs ("storage area networks"), which gave local-like performance from physically separated storage devices over a fiber-optic network. And as you point out, you can "remote" a block device, since block devices are really just an abstraction.
I don't see what your point is; many applications support multiple storage backends, which is what I was referring to. The performance issues I was discussing compared applications that use the host system's VFS layer with those that use the S3 API layer.
You compared "local filesystem" to the performance of S3. But you have no idea what you are comparing to what. Both are undefined, because neither you nor anyone reading what you wrote can know what you are measuring.
Like I said, there's no such thing as a "local filesystem". You invented this term, or repeated it after someone who invented it on the spot. Neither you nor they had a coherent explanation of what it means, so now nobody can understand what it is you are trying to say.
In essence, you are counting angels on the head of a pin.
Also, this is not about block devices. Filesystems are programs, and a lot of them are distributed programs: they run on many computers at once. I'll repeat the example I gave earlier with NFS: it's a distributed system with a server and a client, and both of them are the filesystem. You cannot say that it runs "locally" or "remotely" because it's both, or neither, or whichever one you choose; i.e., it's a worthless definition.
I don't know where you are coming from; everything I'm saying is entirely consistent with how the industry talks about storage (I work in IT, and "local filesystem" and "remote filesystem" are terms we all use, including with our storage vendors).
Here's an example paper comparing filesystems and their performance for precisely this kind of problem:
https://www.nature.com/articles/s41592-021-01326-w figure 1A, B
I work in this field and the figure text makes perfect sense to me and all my coworkers.