Processing medical images at scale on the cloud (tweag.io)
71 points by harporoeder on June 20, 2023 | 20 comments



This is a bad and needlessly complicated solution because the authors don't know anything about ML.

It's both incredibly slow and it will perform poorly (in the sense of the final accuracy and the number of training steps) because you can't just software engineer your way through ML.

The important insight is that you do not need all patches from every image! Actually, showing patches like this in sequence is extremely detrimental to training. The network sees far too much data that is too similar in too many big chunks. You want random patches from random images, the more mixed up the patches the better.

So knowing this, when you look at their latency equation, there's a different and obvious solution: split the loading process into two steps that run in parallel.

The first step constantly downloads new images from the web and swaps old images out. The second step picks an image at random from the ones that are available and generates a random patch from it.

The first step is network bound. The second step is CPU bound. The second step always has images available, it never waits for the first, just picks another random image and random patch. You get great resource utilization out of this.
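
Roughly, in Python (this is just a sketch; IMAGE_URLS, the pool size and Pillow as the loader are all placeholders, use whatever fast loader you like):

    import io, random, threading, time, urllib.request
    from PIL import Image

    IMAGE_URLS = [...]           # fill in: URLs of the training images
    POOL_SIZE, PATCH = 64, 256   # images kept in RAM, patch side in pixels

    pool, lock = {}, threading.Lock()

    def downloader():
        # Step 1, network bound: keep downloading images, swap old ones out.
        while True:
            url = random.choice(IMAGE_URLS)
            data = urllib.request.urlopen(url).read()
            img = Image.open(io.BytesIO(data))
            with lock:
                if len(pool) >= POOL_SIZE:
                    pool.pop(random.choice(list(pool)))
                pool[url] = img

    def random_patch():
        # Step 2, CPU bound: pick a random image from whatever is available
        # and cut a random patch; it only waits during the initial warm-up.
        while True:
            with lock:
                if pool:
                    img = random.choice(list(pool.values()))
                    break
            time.sleep(0.01)
        x = random.randrange(img.width - PATCH)
        y = random.randrange(img.height - PATCH)
        return img.crop((x, y, x + PATCH, y + PATCH))

    threading.Thread(target=downloader, daemon=True).start()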

That's it. No other changes needed. Just use an off-the-shelf fast image loader. No need for a cluster.

This is a huge waste of engineering time and ongoing computing resources for what is a simple ML problem, had anyone with any knowledge been around.

Hey, tweag! If you want to do ML, reach out. :) You can do far better than this!


Can't evaluate the details, but in general I love reading takes like this. Basically a variant of the "when all you have is a hammer" mindset and its inevitable consequences, as we see replicated over and over again in this industry. If only it were the top comment.


> Though it has Python bindings, OpenSlide is implemented in C and reads files using standard OS file handlers, however our data sits on cloud storage that is accessible via HTTP. This means that, to open a WSI file, one needs to first download the entire file to disk, and only then can they load it with OpenSlide. But then, what if we need to read tens of thousands of WSIs, a few gigabytes each? This can total more than what a single disk can contain. Besides, even if we mounted multiple disks, the cost and time it would take to transfer all this data on every new machine would be too much. In addition to that, most of the time only a fraction of the entire WSI is of interest, so downloading the entire data is inefficient. A solution is to read the bytes that we need when we need them directly from Blob Storage. fsspec is a Python package that allows us to define “abstract” filesystems, with a custom implementation to list, read and write files. One such implementation, adlfs, works for Azure Blob Storage.

AWS S3 has byte-range fetches specifically for this use case [1]. This is quite handy for data lakes and OLAP databases. Apparently, 8 MB and 16 MB are good sizes for typical workloads [2].

[1] https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing...

[2] https://d1.awsstatic.com/whitepapers/AmazonS3BestPractices.p...
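
To make that concrete, a byte-range fetch with boto3 looks roughly like this (bucket and key names are made up):

    import boto3

    s3 = boto3.client("s3")
    resp = s3.get_object(
        Bucket="my-bucket",
        Key="slides/scan.tiff",
        Range="bytes=0-8388607",   # first 8 MB only
    )
    first_chunk = resp["Body"].read()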


The OP here is using Azure Blob Storage, which is essentially Azure's S3 competitor. Like S3, it accepts the Range header.[1] (I'm presuming Blob Storage was modeled after S3, frankly. The capabilities and APIs are very similar.)

Similarly, GCP Cloud Storage appears to also support the Range header[2].

[1]: https://learn.microsoft.com/en-us/rest/api/storageservices/g...

[2]: https://cloud.google.com/storage/docs/xml-api/get-object-dow...
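
If I remember the Python SDK right, that Range support surfaces as offset/length parameters on the download call; roughly like this (account, container and blob names are made up):

    from azure.storage.blob import BlobClient

    blob = BlobClient(
        account_url="https://myaccount.blob.core.windows.net",
        container_name="slides",
        blob_name="scan.tiff",
        credential="<key or SAS token>",
    )
    # Under the hood this becomes a GET with a Range header.
    first_8mb = blob.download_blob(offset=0, length=8 * 1024 * 1024).readall()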


Even with byte ranges in S3, I know that some standards groups like OME report very poor performance compared to local filesystems (their use case requires very low latency to show arbitrary subrectangles of multidimensional data, converted to image form, for pathologists sitting in a browser). They have been exploring a move from monolithic TIFF image files to zarr, which pre-shards the data and compresses it in blocks.
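
For anyone unfamiliar, the zarr idea is roughly this (shape and chunk sizes below are arbitrary): the array is stored as independently compressed chunks, so reading a subrectangle only touches the chunks it overlaps.

    import numpy as np
    import zarr

    z = zarr.open("slide.zarr", mode="w",
                  shape=(100_000, 100_000, 3),
                  chunks=(1024, 1024, 3),
                  dtype="uint8")
    z[0:1024, 0:1024, :] = np.zeros((1024, 1024, 3), dtype="uint8")
    tile = z[2048:3072, 2048:3072, :]   # reads only the chunks this region overlaps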


"Local filesystem" is not a thing... When you mount NFS on your laptop, is it local to your laptop or not? What if you have a caching client?

In other words, "local" or "remote" is not a property of filesystem.

Various storage products exist that try to solve the problem of data mobility, that is, moving data quickly to a desired destination (which usually means "pretending to move", but in such a way that the receiving end can start working as soon as possible).

For an open-source example, see DRBD. There are also some proprietary products designed to do the same.


Local or remote is a property of a filesystem. It has conventionally been understood as whether the block device or file service is attached directly to the host bus, versus more indirectly through a NIC or other networking technology. Of course, this idea breaks down pretty quickly; many servers used a SAN ("storage area network"), which gave local-like performance from physically separated storage devices over a fiber-optic network. And as you point out, you can "remote" a block device, since block devices are really just an abstraction.

I don't see what your point is; many applications support multiple storage backends, which is what I was referring to. The performance issues I was discussing compare applications that go through the host system's VFS layer against ones that go through the S3 API layer.


> I don't see what your point is;

You compared "local filesystem" to performance of S3. But you have no idea what are you comparing to what. Both are undefined because neither you nor anyone reading what you wrote can know what you are measuring.

Like I said, there's no such thing as a "local filesystem". You invented / repeated this term after someone who invented it on the spot. Neither you nor they had a coherent explanation of what it means. Now nobody can understand what it is that you are trying to say.

In essence, you are counting angels on the tip of a needle.

Also, this is not about block devices. Filesystems are programs. A lot of them are distributed programs. They run on many computers at once. I'll repeat the example I gave earlier with NFS: it's a distributed system, it has a server and a client. Both of them are the filesystem. But you cannot say that it runs "locally" or "remotely" because it's both or neither, or whichever one you choose... i.e. it's a worthless definition.


I don't know where you are coming from; everything I'm saying is entirely consistent with how the industry (I work in IT and "local filesystem" and "remote filesystem" are terms we all use, including with our storage vendors) talks about storage.

Here's an example paper comparing filesystems and their performance for precisely this kind of problem: https://www.nature.com/articles/s41592-021-01326-w (figure 1A, B). I work in this field and the figure text makes perfect sense to me and all my coworkers.


Because of course there's so little to worry about with storing vast reams of medical data from real people in cloud systems (that surely never get breached) to be accessed by AIs that surely will never create data privacy problems from all the ML vacuuming they rely on....


I'm a regulatory consultant and I am currently submitting at least 5-10 510(k)s/De Novos per week to the FDA for AI/ML devices for a variety of companies. I can't imagine the actual throughput across companies, as I am just one consultant out of many out there. 95% of the software devices I edit and submit are hosting their databases on AWS. Essentially they transfer the DICOM images to AWS and then run their algorithms against the data and then present the indications to the physician. These run the range of CT/MRI/Ultrasound/pathology slides/genomic sequencing. Like I said, most of the databases are on AWS. A few are on Azure and a few European companies are on Orange.


>Essentially they transfer the DICOM images to AWS and then run their algorithms against the data and then present the indications to the physician. These run the range of CT/MRI/Ultrasound/pathology slides/genomic sequencing.

Okay, so aside from the physician in question presumably knowing to whom a given set of indications belongs among his patients, how is the data kept anonymized along its path from these software devices, to AWS, to algorithmic processing, and then to presenting findings to the physician?

And those De Novos you're submitting, what do most of them relate to? Honestly curious, since you describe working in this field.


> But as it turns out, we can’t use it.

> Although it has Python bindings, OpenSlide is implemented in C and reads files using standard OS file handlers, however our data sits on cloud storage that is accessible via HTTP.

This is a self-inflicted problem. It's very typical: people who don't know how storage works or what functionality is available often push themselves into an imaginary corner.

Why of all things use HTTP for this?

No, of course you don't need to download the whole file to read it.

"standard OS file handlers" -- this is a strong indicator that the person writing this doesn't understand how their OS works. What standard are we talking about? Even if "standard" here is used to mean "some common way" -- then which one? How the files are opened? And so on. The author didn't research the subject at all, and came up with an awful solution (vendor lock-in) as a result.


The standard OS file handlers they mean are the UNIX and Windows APIs to open and read/write file content: the open(), read(), seek(), and write() library calls, which wrap the OS's system calls that go through the VFS.
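
In Python terms (the file name here is made up), that's all a "file handler" amounts to: a small integer descriptor plus the seek/read calls on it.

    import os

    fd = os.open("scan.tiff", os.O_RDONLY)   # open(2): returns an int file descriptor
    os.lseek(fd, 4096, os.SEEK_SET)          # seek(2): move the file offset
    header = os.read(fd, 8192)               # read(2): read bytes at that offset
    os.close(fd)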

HTTP is now a de-facto transport standard. S3 uses, and many other data storage systems do as well. It's highly tuned for performance and there is an entire ecosystem around it. You could implement a system like NFS over XDR-over-TCP-IP (https://en.wikipedia.org/wiki/Sun_RPC) or HTTP (https://en.wikipedia.org/wiki/WebDAV).


Those "file handlers" you speak of are just numbers. There's nothing special about them. There was no reason for OP to complain about them. It's just an indication that they don't understand the problem they are trying to solve.

> HTTP is now a de-facto transport standard. S3 uses,

Was the period here intentional?

Because if not, then HTTP is the standard for Web applications. It's virtually unheard of in the storage world. The standards used in the storage world are typically called X-over-Y, e.g. iSCSI (which is SCSI over IP), NVMe over IP, NFS (which is Sun RPC over IP) and so on. In other words, it's usually a protocol intended for storage needs implemented on top of some network protocol (since we are talking about storage on the network).

S3 is almost unique in its use of an application-level protocol for storage. It is a bad decision from a storage perspective, as it imposes too many unnecessary restrictions and would generally be slow as a basis for network-attached storage.

> It's highly tuned for performance and there is an entire ecosystem around it.

Not in the storage world. It's quite worthless as a conduit for storage. In storage, you operate in blocks, you don't need streaming, you can handle out-of-order reads and writes, and you may or may not benefit from sessions, but you have absolutely no use for any of the HTTP headers or content types. While you could probably interpret GET as "read" and POST as "write", PATCH, TRACE and whatever else there is would be completely worthless. On the other hand, whenever you read or write in storage you need to transmit a lot of important details which are not reflected in these HTTP methods. None of the HTTP status codes are useful for storage...

In other words, all the features HTTP offers on top of TCP are worthless for storage. It's designed for Web applications, and that's why S3 chose HTTP (to be usable directly from Web applications). If the Web supported straight-up TCP or UDP, S3 wouldn't be using HTTP.

> XDR-over-TCP-IP

What does this have to do with HTTP? (PS. I wrote an NFS3 testing client in Go some 5-7 years ago, no HTTP involved).


> No, of course you don't need to download the whole file to read it.

HTTP doesn't require this, and neither do AWS S3, Azure Blob Storage, or GCP Cloud Storage.

> "standard OS file handlers" -- this is a strong indicator that the person writing this doesn't understand how their OS works. What standard are we talking about? Even if "standard" here is used to mean "some common way" -- then which one? How the files are opened?

The OS's file reading mechanisms, whatever those are. For a C library, probably that means "fopen()" et al. which of course cannot read from, e.g., S3. (Excepting something like a FUSE driver, which the article even mentions, in passing¹.) Even if they mean something more native to the OS, like read(2)/write(2), or whatever Windows uses … it doesn't change the point?

They then describe an FS abstraction layer; those also aren't uncommon (e.g., gvfs) and often exist to bridge exactly the gap above: I have things like S3 that sort of look like files, and I have something that wants a file, and I want to connect them directly. FS abstractions like what the article describes bridge that gap.
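
As a rough sketch of what that kind of bridge looks like with fsspec/adlfs (the account name, key and path here are invented), you get back a seekable file-like object and only the byte ranges you actually touch get fetched over HTTP:

    import fsspec

    with fsspec.open("abfs://slides/scan.tiff", mode="rb",
                     account_name="myaccount", account_key="<key>") as f:
        f.seek(10 * 1024 * 1024)     # jump into the middle of the blob
        tile_bytes = f.read(4096)    # fetches roughly this range, not the whole file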

> The author didn't research the subject at all, and came up with an awful solution (vendor lock-in) as a result.

That the data is in Blob Storage seems to me to be a given in the article. I.e., we're starting with that assumption, so how can we make use of the data without incurring huge network transfers?

The data could be in some FOSS storage system running on your own servers, serving up its data using the S3 API … and you'd have the same article.

¹It's confusing that they say the FUSE driver needs to download the entire file. It shouldn't require that; certainly nothing in the available APIs requires it. But … macOS; yeah. That's a rougher problem, and does push one in the direction of something like where the author went.


> The OS's file reading mechanisms

Did you mean "mechanism" as in singular, not plural? Because there are indeed many different ones, and that's why I asked this question.

> For a C library, probably that means "fopen()" et al. which of course cannot read from, e.g., S3.

Why is it so hard to understand... No, they should not have used S3 to begin with. S3 is for Web, it's not for storage. These people used a wrong tool because they don't even know what tools they have...

> if they mean something more native to the OS, like read(2)/write(2),

Again, they don't know what they are talking about. read()/write() are just gateways that expose the different ways the OS can read or write.

> For a C library, probably that means "fopen()" et al. which of course cannot read from, e.g., S3

Of course it can. HTTP goes through a socket, which is a file. C can perfectly well read from a socket file using the file interface. In case you ever had the "engineer" title, things like this should be the reason to lose it.

> That the data is in Blob Storage seems to me to be a given

It's not a given. It's never a given. It's data stored in the proprietary data-center "somehow". Even if it was placed in the blob storage by someone else, nothing prevents the authors from moving it away from it. Not to mention that moving things inside that proprietary data-center is going to be much faster, so, even if they are required to acquire data as blob storage, they can almost instantly convert it into something more usable.


> In case you ever had the "engineer" title, things like this should be the reason to lose it.

Even if you weren't technically incorrect, this would be uncalled for. I would suggest toning down the heat a lot in your replies and asking what substance you're contributing to the conversation.

For example, the POSIX filesystem APIs have no concept of TLS or any of the HTTP protocol-level handling (transfer compression, chunked encoding, etc.), so it's only true in the trivial sense that C code can read bytes from the socket. There's a reason why projects like s3fs exist to do all of that work, in addition to things like ranged requests and caching, which are essential for performance so you don't have to wait to stream a 2GB file just to read the header.
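
For instance, with the Python s3fs package (bucket and key are made up), reading just a file header issues a small ranged request instead of streaming the whole object, and repeated reads hit its cache:

    import s3fs

    fs = s3fs.S3FileSystem(anon=True)
    with fs.open("my-bucket/slides/scan.tiff", "rb",
                 block_size=8 * 1024 * 1024) as f:
        header = f.read(64 * 1024)   # roughly one ranged GET for the first block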


> Why is it so hard to understand... No, they should not have used S3 to begin with. S3 is for Web, it's not for storage. These people used a wrong tool because they don't even know what tools they have...

S3 is absolutely for storage. So much so that it is quite literally in the name.

I'm morbidly curious what purpose you think S3 fills, if not storage?

> [capabilities of read/write]

Yes, while read/write can operate on a socket, that doesn't mean the socket is usable by a library that expects image data. Let's say our library, which expects to read image data, does accept just any old FD (and not a FILE pointer); even then, I can't simply hand it a socket: the socket is going to be speaking HTTP/TLS, and the library expects image data. In the best case, you'd get a parse error.

I mention FUSE drivers in the comment, and so does the article. The article itself notes they're on macOS, where FUSE is a second-class citizen. There are pragmatic reasons to avoid it. (They also note that it reads the entire file; while that shouldn't be true, if they genuinely believed it to be so, it would reasonably have driven their judgement in the direction they went.)

> In case you ever had the "engineer" title, things like this should be the reason to lose it.

I don't know why you feel the need to include personal attacks like this. Guess I'm not an engineer anymore!

> It's not a given. It's never a given. It's data stored in the proprietary data-center "somehow".

But it may be out of this engineer's scope, or out of their control; they may have been given something resembling "hey, we have this data, already stored in Blob Storage, can you do this task with it?", and within that scope, it's a given. Yes, in the grand design of the universe, it need not be, but for the engineer right there in the OP, changing that might be boiling an ocean.

But also … the data likely has to be stored somewhere, and there are good reasons to choose such a service. Given the file sizes mentioned in the article, they need a fair bit of storage (something S3 et al. are good at) and they need access to individual files, without a lot of the special needs that might be better served by a true FS like ext4. Within reason, it seems safe to assume that Blob Storage isn't an unreasonable choice here.

> Not to mention that moving things inside that proprietary data-center is going to be much faster, so, even if they are required to acquire data as blob storage, they can almost instantly convert it into something more usable.

And a format change is something you can do with S3, too. The only thing you really get by moving into a proprietary data center is that the data might be slightly more local, and thus have better latency. Format conversions are sometimes hard: that format may be required for other use cases. But again, this all seems like questioning things in the OP that … don't really need to be questioned from our armchairs.


What does 'at scale' mean here, and why would anyone need 'the cloud'? Medical images aren't like cell phone videos, where everyone is creating data all the time. There is only so much medical data being created, because the machines that create it are limited in number.




