Production AI systems are hard (methexis.substack.com)
266 points by headalgorithm 11 months ago | 148 comments



I'm just finishing interventional radiology training and I moonlight as a diagnostic radiologist (not to mention having an undergrad/master's in computer science).

Almost 90% of the diagnostic studies I read could be pre-drafted by AI. That's where the money is and where AI-in-radiology companies should focus. The money is not in detecting hemorrhage or pulmonary embolism. It's a classic fallacy to think that life-saving means money-saving. Rather, the money in radiology is in reads per day.

Here's a user story: A private practice radiologist reads 20 abdomen and pelvis CT scans with contrast per day. In each of these studies, he must write a short description of each organ. For example, "Gallbladder: Unremarkable" or "Gallbladder: Cholelithiasis without evidence of cholecystitis" or "Gallbladder: Dependent sludge." There are around 15 such organs (liver, gallbladder, pancreas, spleen, adrenal glands, kidneys, etc.). The AI should auto-populate the radiologist's report with an appropriate description of each organ system.

The job of the radiologist is to confirm what the AI says in each section, and to go into further detail as needed. It's essentially just customizing the existing template to each patient. This type of pre-drafting is exactly what radiology residents do and what companies like vRad do.
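
As a rough sketch of what that pre-drafting could look like (the organ list, phrases, and the upstream findings dict here are purely illustrative placeholders, not any vendor's actual output):

  # Hypothetical sketch: merge AI-suggested findings into a normal template.
  # Organ names and phrases are illustrative only.
  NORMAL_TEMPLATE = {
      "Liver": "Unremarkable.",
      "Gallbladder": "Unremarkable.",
      "Pancreas": "Unremarkable.",
      "Spleen": "Unremarkable.",
      "Kidneys": "Unremarkable.",
  }

  def draft_report(ai_findings: dict) -> str:
      """ai_findings maps organ -> suggested phrase; anything missing stays normal."""
      lines = []
      for organ, default in NORMAL_TEMPLATE.items():
          lines.append(f"{organ}: {ai_findings.get(organ, default)}")
      return "\n".join(lines)

  print(draft_report({"Gallbladder": "Dependent sludge."}))

The radiologist then only edits the lines that need more detail.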


What percentage of the diagnostic radiologists won't bother to even review the AI output? 3%, 10%, 30%? It'll increase as the AI improves. Unless the image is sent to multiple radiologists or their metrics (eg spotting AI errors) are updated, we'll get crap results.

I know one top tier diagnostic radiologist and he could tell which piece of equipment an image was taken on (or if something is out of calibration) with a fairly quick glance. Regularly on the phone with techs and technologists to get things fixed or reimaged. He retired recently.

From his telling, most of the work has been outsourced to India for the last 10-15 years. He only got the really challenging stuff.

edit: Interestingly, it seems like breast imaging is still done more in house and there are regular reviews, but that's from a different source.


We already have an issue in the US at least, of doctors waxing about how their favorite part of their job is the relationship with their patients, and how they take an “evidence-based approach” and take time to listen to their patients… only for them to do the polar opposite in reality.

This is just going to be another case of patients needing to take ownership of their own care. It’s no longer “tell the doc your symptoms and they’ll figure out what’s wrong”, it’s now “tell the doc what you think you have and if they interrupt you and brush it off, go to another doc” until someone agrees with your diagnosis and gets you the referral or test or prescription that you need. For those who know what I’m talking about, I’m sorry. For those who don’t- you’ll learn this lesson someday.


You really have to double-check everything yourself, which sucks if you are weakened.

I was at the hospital last week for a pretty bad condition and my IV was badly set. I told the nurses for 4 days to change it; they kept refusing and just kept trying to adjust it.

Eventually I just told the doctor what was happening, and he said "oh, that's why your readings were messed up, I was wondering why I had to change your drug dose every day".

They changed my IV within the next few hours, and my condition improved overnight.

I discussed that with some doctor and nurse friends, and they all told me to never, ever trust a health practitioner you don't know very well. You need to be very proactive in your treatment, because they are exhausted, have many people to treat in a row, and are only human.


To be fair, this is not how doctors want to practice medicine.


He's retiring, which means his knowledge is mostly lost unless he teaches. What if he spent some of his time labeling images so an AI could learn, and the machine could say "recommend reimage" or "possible fault" for the millions who don't have your doctor?


> What percentage of the diagnostic radiologists won't bother to even review the AI output?

If someone’s not going to do their job, why does AI matter at all? Why wouldn’t they just use the same boilerplate text for everyone today?


Because it's a lot easier to rationalize not doing your job when it is (or at least seems to be) being done OK. There's a lot of rationalizing of decisions, when you've spent your life saying you're the expert helping people. Also, a lot easier for management to hand out a LOT more work since it looks like it's being done right.

It's like the lawyer using ChatGPT to file legal briefs... he'd have felt bad and been obviously incompetent if he hadn't filed anything at all, but he didn't feel bad (or feel the need to check anything) once it "looked right".


I don't mind if lawyerGPT is used to help with my one-off case; there's time enough to get that right for human eyes.

However, since I'm the final set of eyes judging my care I'll be asking to see a doctor who has never been assisted by AI and is in fact not a mind-atrophied cyborg. Who knows when this opinion will morph into harmspreading and wrongthink that gets you escorted to the schizo wing instead. AI might yet be paraded as safe and effective.


>What percentage of the diagnostic radiologists won't bother to even review the AI output? 3%, 10%, 30%? It'll increase as the AI improves.

In my opinion, an AI system like this (diagnostics) shouldn't even be implemented until it is statistically the case that "won't bother to even review the AI output"-level poor performance on the part of radiologists produces better outcomes than a traditional non-AI system. Then we don't have to worry about what happens when the radiologist isn't good enough to review their work product: the answer is that the medical care is at least better than it would have otherwise been. Deploying a system like that would improve medical outcomes with minimal disruptions.

That said, I think we'll reach that point (statistically provable that an AI system is making better diagnoses than the average radiologist) sooner than most people probably think, maybe a couple of years and then [regulatory approval] years until it's actually used. That doesn't mean radiologists shouldn't exist, though. A radiologist fits into a much larger organisational framework - they file paperwork, they hold responsibility, they are ostensibly available to call and ask about a specific quirk, etc. Even with a 100% perfect AI at the task of "what disease, if any, is indicated by this image?", that still doesn't automate a radiologist's job. A lot of other stuff needs to happen first.


A system like this will lead to an overall increase in misreadings. Humans are incredibly forgiving and lack focus when confirming someone else's opinion, versus generating their own.

Imagine a radiologist that already has to read 20 scans per day. With AI helping, and radiologists just confirming, they will probably be expected to analyze 40 scans per day. Eventually the radiologist will quickly get into a habit of confirm, confirm, confirm... 500 times in a day. They will stop treating each case as a real reading; instead they will look at the AI's output, think "looks about right", and become complacent about actually providing valuable feedback. Furthermore, they will be expected to produce more work, meaning they are giving every case less attention than before. They will begin to just trust the AI to keep up with their schedule.

The better use of AI is to flip it on its head. Let the radiologists continue to work like normal. They will produce reports like they do now, only 20 per day. But the AI will analyze each scan after them, and if it flags the scan as remarkable or disagrees with the radiologist, then it bubbles the scan back to the top for further analysis. This way radiologists are still expected to provide their expertise as they do currently and we generally rely on their opinion, but we will have an added layer provided by AI to catch anything that might be missed. This overall improves the accuracy of medical imaging instead of compromising it (which the original proposed AI solution certainly would do). Productivity would not be increased in this scenario, but accuracy would improve (and, indirectly, malpractice instances would decrease).
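
A minimal sketch of that "AI as second reader" flow might look like this (the model call and the threshold are placeholders for whatever system is actually used, not a real product's API):

  # Hypothetical second-reader workflow: the radiologist reads first,
  # the model only flags cases for re-review when it disagrees.
  def second_read(cases, model, disagreement_threshold=0.8):
      flagged = []
      for case in cases:
          rad_normal = case["radiologist_report_is_normal"]
          p_abnormal = model.predict_abnormal_probability(case["images"])  # assumed interface
          if rad_normal and p_abnormal > disagreement_threshold:
              flagged.append(case)  # bubble back to the top of the worklist
      return flagged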

I, for one, would not want to get scanned somewhere that relies on AI to analyze my scan. But I would love a place that used AI as a second pair of eyes to confirm a human's findings.


There is a trade-off between quality and quantity. Most healthcare systems reimburse based on quantity, so which way do you think AI will be implemented?


So essentially the job of the radiologist is to carefully compare what the AI generated with the report to ensure that the generated text is correct. So in fact the radiologist now has to read both pieces of text (AI generated and the report itself) instead of just reading the report and typing her conclusions. Does it seem like less work or more work? I am not a radiologist but as a programmer trying to work with AI generated BS most of the time I find the work to be increased. We need AI that can reliably produce results (maybe through testing and internal feedback mechanisms).


You are partially correct, but I think the value is not in generating a draft for the radiologist to review, but in taking a radiologist's "raw" dictation, which pertains only to the relevant findings in a case, and transforming that into a complete, well-formatted report. This is what scribes do in lots of different medical specialties, and it saves a ton of time. In my experience, reviewing pre-drafted reports, as from residents, can often make my job slower rather than faster, as you have to read the whole thing to know what needs editing.

Also, at least in my practice, I would double the number of CTs read in an average day :)


MD-PhD student here, and this is 100% on the money and what many people I work with try to do. The diagnostic classifiers seem to be about pushing papers and getting funding.


this! I asked a neurologist and she said at least 40% of her time is spent on formulaic discharge letters and rounds documentation.


What you're describing is what a template does; I just dictate "stone" or "sludge" from a pick list and never dictate the normal statements. There is also existing technology that creates a structured report from free dictations.

Your math also doesn’t add up because 90%ish of studies are normal so you can basically sign a template without looking at the pictures and be more or less entirely accurate, before adding AI.

Also what practice only reads 20 abdo CTs a day? Seriously, let me know I could use a more relaxing job.

Lastly, an AI that functions with the competency of a PGY3 resident does not exist at the moment or on the horizon. Especially for cross sectional studies.

Source: am a diagnostic radiologist and AI researcher (on the NLP side).


Would it be possible to instead have a checkbox for the status of each organ, and have it default to whatever the average case is?


Drop-down lists (called pick-list in PowerScribe speak) are very common. The issue is that they are completely separate from the image viewer and have no knowledge of the current patient's findings. Admittedly, even a simple classifier could likely choose the "most likely" completion phrase...but again there's a disconnect between the viewer software and the dictation software.
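
The ranking part in isolation is almost trivial to sketch; the integration is the hard bit. Purely illustrative (not PowerScribe's API, and the scorer is a stand-in for whatever image model would sit upstream):

  # Illustrative only: rank canned pick-list phrases for one organ by a score
  # from an upstream image model (the scorer interface is assumed, not real).
  PHRASES = [
      "Unremarkable.",
      "Cholelithiasis without evidence of cholecystitis.",
      "Dependent sludge.",
  ]

  def best_phrase(image_features, scorer):
      # scorer(image_features, phrase) -> higher means more likely
      return max(PHRASES, key=lambda p: scorer(image_features, p))

The hard part isn't this loop; it's getting the viewer and the dictation software to talk to each other at all.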


I've seen some awesome demos of exactly this, ingesting priors and all of that.


Thank you for sharing your insight.


"Radiologists, because they have a grounded brain model, only need to see a single example of a rare and obscure condition to both remember it and identify it in the future."

This would actually be a long term reason to go for AI / database diagnosis.

I had a personal case where a close family member, a young child, almost died. The doctors didn't understand the condition, and at the last minute it suddenly calmed down.

I'm sure a few doctors in the world have seen it before, but they weren't working in my hospital that week. If we can start sharing in-depth diagnoses of obscure cases worldwide, using AI to make them easy to query, that would probably be of great benefit.


That's one of the biggest roadblocks: hospital & healthcare entities will not permit mass sharing of data. When switching hospitals, it's really hard to get a full copy of the record for a single person. Try that for a whole population. I may be wrong, but hospitals also could have incentives to tightly hold on to their proprietary data, where they spent a large sum of money. I worked in ML+Dermatology, and getting data involved a lot of regulatory checks.

Also, DICOM is the format for maximum interoperability across different radiology modalities & manufacturers. But annotation & markup methods vary vastly between institutions. There is no commonality or agreed-upon standard for information interchange.


Federated learning (with implementations such as FEDn [1]) is supposed to solve this problem: only model weights are shared, not the data.

Sure, it requires some coordination, but the legal parts should at least be solvable this way.

[1] https://github.com/scaleoutsystems/fedn
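
For anyone unfamiliar, the core of the idea (federated averaging) is small enough to sketch; this is a toy NumPy version on a linear model, not FEDn's actual API:

  # Toy federated averaging: each site trains locally, only weights
  # leave the site, and the server averages them (weighted by local data size).
  import numpy as np

  def local_update(weights, X, y, lr=0.01, epochs=5):
      w = weights.copy()
      for _ in range(epochs):
          grad = X.T @ (X @ w - y) / len(y)   # plain linear-regression gradient
          w -= lr * grad
      return w

  def federated_round(global_w, sites):
      updates, sizes = [], []
      for X, y in sites:                      # each (X, y) stays at its hospital
          updates.append(local_update(global_w, X, y))
          sizes.append(len(y))
      sizes = np.array(sizes, dtype=float)
      return np.average(np.stack(updates), axis=0, weights=sizes / sizes.sum())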


I've been involved in a federated learning radiology project; the biggest issue is imaging technique and labelling.

Different centers practice very differently, with different imaging protocols, disease prevalences, and labelling/reporting.

This project was looking at renal masses, and the only parts that worked well with federated learning were image segmentation and estimating the probability that the mass is a cancer; this is a competency I expect out of a first- or second-year trainee.

It was horrible at predicting the subtype of cancer, which was the actual goal and is more of an experienced generalist/subspecialty radiologist skill, as we couldn't get a good training set (few of these lesions are biopsied, and the specific MRI sequences that may help are not done the same way in every center).

Practically, the subtype doesn't make too much of a difference for the patient, since if it's "probably a cancer" it'll just get cut out anyway, but it highlights a challenge with federated learning.


BERT-like encoder models actually do help here, although unfortunately the dense representation is not as nice and clean as a class, it will at least map to something similar. If the hospital provides even a completely bizarre entity id, the requirement would be to provide the dictionary of descriptions and train the BERT-like model to associate the entity ids and descriptions.
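
As a rough illustration of that mapping idea (the model name is just a common public encoder, and the IDs/descriptions are made-up placeholders; assumes the sentence-transformers package):

  # Sketch: map free-text findings to a site's arbitrary entity IDs by
  # embedding the ID descriptions and doing nearest-neighbour lookup.
  from sentence_transformers import SentenceTransformer, util

  encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any BERT-like encoder would do

  site_dictionary = {                      # hospital-provided IDs + descriptions
      "X-123": "simple renal cyst, Bosniak I",
      "X-456": "solid enhancing renal mass",
  }
  ids = list(site_dictionary)
  id_embeddings = encoder.encode([site_dictionary[i] for i in ids], convert_to_tensor=True)

  query = encoder.encode("enhancing solid lesion in the left kidney", convert_to_tensor=True)
  scores = util.cos_sim(query, id_embeddings)[0]
  print(ids[int(scores.argmax())])         # -> closest entity ID, here likely "X-456"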


> BERT-like encoder models actually do help here, although unfortunately the dense representation is not as nice and clean as a class, it will at least map to something similar.

Completely agree with this.

> If the hospital provides even a completely bizarre entity id

This is where it begins to fall apart: the hospital doesn't always provide an entity. In this specific project, some will argue "it's a big enough renal mass, just cut it out; based on current evidence it's safest to remove", which, while true at the moment, doesn't help us in developing a model to prospectively predict the subtypes that do/do not need surgical excision.

Consequently, there is no biopsy, and the specific MRI sequences that are proving helpful at subtyping are missing. They're irrelevant for clinical care per current standard practice, but hugely relevant if you're trying to change that practice, since we need labelled data and we don't have the resources to have a radiologist/pathologist go over every case again and try to fill in the blanks en masse.


Is that really possible? I don't understand the field exactly, but at least if you use deep learning, you'll be able to recover inputs from gradients, and if you want to program it such that no other party is able to recover your inputs, that's called cryptography, and homomorphic encryption is FAR from practical yet (like a million times slower than practical). Without rigorous mathematical foundations, I would suspect it's just a fancy way of gathering all the training data, doing the training, and making it seem as if everyone retained their own data.


I'm not sure it even has to be complicated or full-on machine learning.

You could also just make it easy to share diagnoses to a database stripped of personal information about the patient, properly index it, share it worldwide, and automatically let doctors with matching cases contact each other.


If it is a rare condition, reidentifying people from even small amounts of relevant data (age, gender, hospital visited) is trivial. If it is a common condition, usually there is plenty of data already there.


I think it will happen much faster in China, India, or maybe Japan than in the U.S. for that exact reason. It's quite possible the U.S. will lag a lot in healthcare soon.


Yeah, you have to make it non-personal, as well as deal with the commercial side of it. I've seen a few initiatives, but they all seem to lack the pockets, drive, and ambition of an arrogant Silicon Valley startup. Even though it's annoying at times, that might be what does the trick in this case.


> That's one of the biggest roadblocks: hospital & healthcare entities will not permit mass sharing of data.

Maybe in some other country?


I am not certain how it works in all of Europe, but it doesn't seem straightforward in Sweden or UK.

In the US and Japan, they would rather do the full roster of tests than transfer in radiology results from another hospital. It's not something to do with costs, but with physician opinions. Every specialist has his own personalized view of the patient in hand, so the corpus of information he needs is different.


As far as I know, in Sweden data about a patient is shared between every hospital... you need to sign up on a web portal, of course, but from then on, you can authorize a doctor or nurse anywhere to get your data with a click... you can go to any clinic and they have access to your full history. It's impossible to do anything properly if you don't have that. I just don't understand why, in the US, it's acceptable for hospitals to be siloed like that to the detriment of everyone's health.

Here's a news report from 2017, when the system was being set up: https://www.bmj.com/content/357/bmj.j2069

Today, this has been working for years.


The US has multiple individual medical systems larger than Sweden with essentially the same abilities. Sweden has roughly the same number of hospitals as South Carolina.


What makes you think the Swedish system wouldn't scale to countries with 100's of millions of people??


Serious question - could it? How would that be accomplished?


What's nice is considering a global audience.


this may be a sufficiently powerful macro force to disrupt that


Exactly: what if my radiologist doesn’t know my particular rare condition?

I would feel comfortable removing the human diagnostician. Let’s have actual human doctors acting as researchers working to improve the AI diagnostics.


My perspective as both a radiologist and a CS/AI researcher (so, exactly what you're proposing):

1. We don't practice in imaginary vacuums; it's easy* to identify that something looks abnormal and then refer to clinical resources/other physicians for rare diseases with specific questions in mind (i.e. recognizing the nuanced imaging finding and referring to a resource to assist).

As a rare disease example, tumors of the eyeball/orbit are very rare, but detecting them is not. If I open the case and see one I can refer to StatDx to help me narrow my differential, knowing what imaging features I'm looking for. This reduces my misdiagnosis rate (which, as an aside, is ~2-5% for radiologists, with "major" clinically significant errors that impact morbidity or mortality making up ~2-10% of those, depending on the study).

2. Rare diseases are hard to diagnose, and would likely also be hard for an AI. Imaging appearances are not unique for the vast majority of diseases, especially what we call the "weird and wonderful".

Pelvic tuberculosis, endometriosis, advanced cervical cancer and advanced rectal cancer can look identical/nearly identical on MRI and the clinical portion as well as additional testing helps us get to the diagnosis.

We don't have to diagnose everything based on a single imaging test, nor should we given:

3. Diagnosis is a tradeoff between sensitivity and specificity. You can't maximize both.

Let's consider adrenal gland tumors. Statistically these are going to be benign; there is no specific imaging feature to tell a small adrenocortical carcinoma (ACC, ~1 in 1 million incidence) from an adrenal adenoma (99% of adrenal lesions).

We also can't tell them apart with a biopsy under a microscope.

If you're unlucky enough to get an ACC you're basically shit out of luck, as the only options we have are to recommend adrenal surgery (and its complications, which can include death) to optimize sensitivity, or to assume that it's benign and optimize for specificity, considering disease prevalence and the risk of overdiagnosis.

In practice, we just use a cutoff of 4cm. I'm not sure how an AI would solve this, especially as there isn't a large enough training set. MD Anderson has the most experience of any center and they've had ~600 cases in 40+ years which as you can imagine encompasses a very heterogeneous imaging set (we didn't have multidetector CT or 3T abdominal MRI 20+ years ago).

Overall, AI can and should help radiologists and as someone involved in this field I can't envision a world where we can safely remove the human diagnostician element from the mix, given that it's a spectrum of grey not black/white labelling as it is for object detection.

We've had attempts with mammography and stroke AI and it's still horrendously inaccurate compared to what I expect out of a resident radiologist, let alone an experienced staff physician.


A great post. I can attest to the same about information extraction from semi-structured documents. The situation is far from full autonomy. You can't do anything without a human in the loop, not even with the latest AI.

I am seeing this trend - everyone can explain at length why AI is "not quite up there" in their own field, but believes it's "near AGI" in other fields. We find it hard to imagine the future difficulties AI will have to face in general; we can only do that in our own field, where we have learned from direct experience.


Exactly this! I chose radiology given my background thinking I could "easily build radiology AI systems" and help our struggling system.

Then I became a radiologist and quickly discovered how hard this is. Something "as simple" as NER and entity-linking on radiology reports is damn near impossible at the moment (even with SOTA LLMs which have made it easier but still not accurate enough for production use).


NER, entity linking, and relation extraction definitely seem to be 'low hanging fruit' due to LLM improvements, but one of the big problems is that they really need a completely different architecture to limit the decoder vocab if using a decoder transformer for producing the set of sequences in relation extraction with specific entity ids. A Longformer with full attention to the input sequence, and sliding window attention to a large dictionary could be a decent way to fine-tune a system like this, but there are few that try it. Unfortunately there's a lot of stupidity going around right now in thinking the answer is just to 'pRoOoMpT tHe LLm RiGhT', but that will always be exceedingly wasteful such that processing terabytes of files will be prohibitively expensive, and there's no guarantee the system will always restrict itself to the specific vocab and structure desired.
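
For the Longformer point, the attention pattern itself is easy to set up with the Hugging Face implementation; the hard part is the training setup around it. A bare-bones sketch (untrained, just showing the global-vs-sliding-window attention masks, not the full dictionary-constrained architecture described above; the report and dictionary text are placeholders):

  # Sketch: full (global) attention on the report tokens, sliding-window
  # attention everywhere else. Model/tokenizer names are the public defaults.
  import torch
  from transformers import LongformerTokenizer, LongformerModel

  tok = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
  model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

  report = "Enhancing mass in the upper pole of the left kidney."
  dictionary_entries = "simple renal cyst. solid enhancing renal mass. angiomyolipoma."
  inputs = tok(report, dictionary_entries, return_tensors="pt")

  global_attention_mask = torch.zeros_like(inputs["input_ids"])
  n_report = len(tok(report)["input_ids"])
  global_attention_mask[:, :n_report] = 1   # report tokens attend globally

  out = model(**inputs, global_attention_mask=global_attention_mask)
  print(out.last_hidden_state.shape)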

The images in radiology definitely make these types of things harder, and the sparsity is an enormous issue. However, having worked on some projects in this area, I don't think it's as impenetrable as a lot of radiologists in AI suggest. The main thing needed in the field is adoption of better techniques and architectures to deal with these problems.


I agree it's not impenetrable; that's why I'm working on this problem. What I disagree with are the "this is trivial" statements.

> Unfortunately there's a lot of stupidity going around right now in thinking the answer is just to 'pRoOoMpT tHe LLm RiGhT'

I agree that this is not the right approach despite all the media hype; my research has been (more or less) attempting what you've proposed.

> A Longformer with full attention to the input sequence, and sliding window attention to a large dictionary could be a decent way to fine-tune a system like this, but there are few that try it.

Good idea, although I'm biased as we tried this ourselves! The problem is that the dictionary (ontology) doesn't exist. RadLex and UMLS are far too inadequate in coverage. We're actively working to address the gaps and hope to have something to open-source within the next couple of months.


I previously worked for a radiology PACS vendor, and it's hard to get funding/interest to even tackle the problem. With how lucrative it could be, I would think that a corporation would be very interested in putting resources into it, but this has not been my experience.

No PACS that I know of even wants to tackle digital pathology in a significant way, which last I heard had about 5% adoption versus glass slides.


There are several corporations, some of which have raised > $100M, tackling this problem, AIdoc being one of the more notable ones.


> In practice, we just use a cutoff of 4cm.

In other words, we extract a single feature, and apply a single nonlinear activation function to that feature to decide whether or not to activate the 'treat' signal. We've replaced all that vaunted human judgement and mental modeling of the body with a heuristic that has equivalent power to a single-neuron neural network.
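
Literally (a toy rendering of the point, using the 4 cm cutoff from the parent comment):

  # The clinical rule "act if the lesion is over 4 cm" written as a
  # one-feature, one-threshold "network" with a step activation.
  def recommend_intervention(lesion_size_cm: float, threshold_cm: float = 4.0) -> bool:
      return lesion_size_cm > threshold_cm   # step activation on a single feature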

I appreciate that there's a dearth of training data, and it varies in quality. But this is precisely the kind of thing where sufficiently powerful ML could do better than the simple heuristics we can come up with.

The heterogeneity of imaging types doesn't have to be a problem. Train the model on all the data, and all the different kinds of scan, all the anatomical knowledge.

Look at how LLMs are able to do stuff like write code that has comments written in pirate speak. Do you think they learned how to do that by studying a large body of code with pirate-speak comments in? No. They picked up examples of pirate speak in one context, and code in another context, and they're able to combine them together in ways that make sense.

ML models looking at diagnoses with small training sets that are largely in obsolete scan formats could still, in theory, learn how to spot those diagnoses in more modern scan images, because they have learnt from other, much larger datasets how things in the newer format correlate with features in the old form.


You're missing the point: it takes me < 5 seconds to clear the adrenals. The example is intended to illustrate that there is no feature to extract that would make a model BETTER than a human for the things that we care about (rare and challenging diagnoses).

No one is arguing that ML can't segment and measure a structure, this is the lowest hanging fruit. ML can't diagnose an adrenocortical carcinoma (an example of a rare disease) because medicine doesn't know how to.

> In other words, we extract a single feature, and apply a single nonlinear activation function to that feature to decide whether or not to activate the 'treat' signal.

Now do this for the > 1000 other possible diagnoses on a CT abdomen, and have it be as fast as a human with an equal or better ROC curve, in under 5 seconds, and cheaper than the $70 a radiologist bills for this. Unless you can eliminate having someone like me read this scan, an ML model to measure the adrenal glands is worth $0.

I'm aware of the literature in this space. Your proposal is not novel and has been attempted. As soon as you try doing this on more than a handful of (typically easy) diagnoses it stops working. Currently the only useful models flag normal/abnormal to triage interpretation priority.

> Look at how LLMs are able to do stuff like write code that has comments written in pirate speak.

This anecdote doesn't prove anything but we can instead look at OpenAI's own white paper for their more rigorous data on hallucination and accuracy. LLMs aren't ready for a production CRUD app let alone human life.

> ML models looking at diagnoses with small training sets that are largely in obsolete scan formats could still

It's not obsolete. It's a completely different image type. This is akin to saying an ML model trained on black-and-white sketches can paint the Mona Lisa in color.

> all the anatomical knowledge.

A misunderstanding of the problem. The anatomy is easy. The pathology is updated every 1-5 years so there is no historical dataset.


You need to pair a system like this with regulation and changes in the medical field to upskill nurses so they have a larger extended skill set and essentially have a path to becoming doctors via practical experience.

There's a lot of information on medical conditions. I remember hearing about iPhone apps that can get a 90% hit rate on correct diagnosis for general practitioner style visits. Taking that into account, and a large-scale medical database, you'd think we'd be pushing that technology down so we can get better productivity out of primary care, so most of the time doctors were using their big brains for big complex problems.


> I remember hearing about iPhone apps that can get a 90% hit rate on correct diagnosis for general practitioner style visits.

Was that in marketing materials?

> so most of the time doctors were using their big brains for big complex problems.

I have been misdiagnosed multiple times to my significant detriment. The issue each time was doctors failing to pay attention and exercise their faculties and knowledge, because they didn't care. These people didn't seem to have "big brains" and my subsequent experience taught me they weren't interested in responsibility or accountability either. I've heard similar stories from (for example) women who struggle with conditions like PMDD.

I'm not on board with giving these people tools that they can use to justify paying even less attention.


I honestly don't think GPs need big brains, just slightly-above-average ones and a good memory.

Skills required for being a good GP are probably empathy, communication, a decent memory, reasonable analytical skills, and not too big of an ego. Very different from a researcher.


You could make the opposite argument that doctors don't pay attention so we should replace them with AI.


A 90% hit rate is pretty easy to accomplish for a standard doctor's visit; a big percentage of doctor's visits are standard. You can probably train most humans to deal with that quickly. They often use students here if the pressure is high. It's about the other 10%, plus unforeseen issues.


Speaking to radiology, ~90% of imaging studies are normal so I can sign my default template without looking at the images and be correct the vast majority of the time.


Also a rad (and heavily in the AI field). I think you mean 90% have inconsequential findings and not that they are "normal". Can you really apply untouched templates to 90% of your cases? I find that very hard to believe. How many templates do you have and how can they possibly describe the variety of possible findings seen on each type of study?


90% hit rate could probably be achieved with "take some panadol and come back in a couple of days if you still feel bad."


> have a path to becoming doctors via practical experience.

Lol no


> I interviewed and hired 25 radiologists, whose primary and chief complaint was that they had to reboot their computers several times a day..

Yes. There is still so much low hanging fruit for software everywhere.


Was that due to bad software crashing the OS?


Smells like a memory leak.


My startup idea: rewriting medical apps in Rust.


DCMTK (C++) doesn't have any significant memory leaks that I am aware of; you can take a look there to get a grasp on the amount of work you would be signing up for.

If you (or anyone) are serious about this, please for the love of God allow for multithreading (even if at the expense of efficiency), put some SIMD in there, or better yet utilize GPUs.

So much of DICOM image handling is done single-threaded, and it's so silly.
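
Even without touching DCMTK, here's a sketch of the kind of parallelism being asked for, in Python with pydicom and a thread pool (the path is a placeholder; a real viewer would do this in C++ or on the GPU):

  # Decode a stack of DICOM slices in parallel instead of one by one.
  from concurrent.futures import ThreadPoolExecutor
  from pathlib import Path
  import pydicom

  def decode(path):
      ds = pydicom.dcmread(str(path))
      return path.name, ds.pixel_array        # pixel decode is the slow part

  slice_paths = sorted(Path("/data/ct_series").glob("*.dcm"))  # placeholder path
  with ThreadPoolExecutor(max_workers=8) as pool:
      slices = dict(pool.map(decode, slice_paths))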


Another startup idea: use AI to assist programmers in rewriting medical apps in Rust.

I'd guess it's probably safer than using AI directly.


Smells like Windows


That's more an IT management problem than a software issue.


IT management usually is the problem.


It's often a software problem too.

It's pretty easy to ship software with memory leaks - especially when there are few users, limited money for testing and technically challenging tasks - such as large images that need to be manipulated.


The testing overhead to do DICOM properly is immense.

There is a very high chance it is a software problem, especially if dealing with tomos/mammos (or even x-ray) which are incredibly resource hungry.


Wait a minute.... Because the role of radiologists is not as simple as trained classification, we can conclude that "No, AGI isn't going to take over every social system when GPT5 comes out"?

Author starts off by saying that you can't generalize conclusions based on observations in a limited domain:

> Geoff made a classic error that technologists often make, which is to observe a particular behavior (identifying some subset of radiology scans correctly) against some task (identifying hemorrhage on CT head scans correctly), and then to extrapolate based on that task alone.

Author then goes on to suggest that there are a number of issues with radiology which AI will not solve.

Then, based on this lack of ML's complete coverage of the aspects of radiology that he finds interesting, he then extrapolates:

> Should you be worried GPT5 is going to interact with social systems and destroy our society single-handedly? No absolutely not.

I don't see how the conclusion remotely follows from the argument. And more to the point, the refuting argument is embodied in the article itself.


Yes, it goes a bit far, but the basic problem is treating this as all-or-nothing: Either the article is true or false. Either AI is going to take over, or it will have little effect.

I believe the article raises relevant issues about why deploying AI-based systems is harder and will take longer than it might appear. Another relevant example might be driverless cars. It's taking many years and deployment is still quite limited. Some companies gave up. But I'm not going to count Waymo out yet.

Similarly, it's possible that Hinton got the timeline wrong for radiology, but it might still have significant effects on radiologist employment in the end.

(Note that this sort of reasoning by analogy is useful for imagining plausible scenarios, but not for ruling things out.)


Driving is another place where people wrongfully think that solving the technical problem means the problem has been solved. Driving is a social problem. We are a long long way from solving it.


Correct. Social systems are far more complex than technical problems. So many technology optimists make the mistake of mistaking a social problem for a technical problem, and think solving one means solving the other. Wrong!


This is more about production being hard as in "a real-life example, in a hospital" than production as in "running llama and serving yourself vs. running llama and serving 10,000 people at the same time", although I think serving lots of people something that uses a lot of compute, and that people expect to be real-time, is going to be hard too!


Brain MRIs (in all their forms) are a whole different species from all other MRIs, due to the variability and softness of structures, differences in condition presentation/time-frame, and the high incidence of brain MRIs in the elderly (the brain can shrink precipitously after 60).

Also, neurologists are quite specialized (e.g., stroke vs. MS vs. acute encephalopathy vs. optic neuritis). In the majority of cases, the specialist neurologist is better at reading the MRI of their target conditions than the general radiologist. This is handled by the neurologist phoning the radiologist and guiding them to re-work their findings.

The vast majority of AI in medical software is in radiology. Decades ago radiology was among the first practice groups to fall under private equity patterns because of the cost of the machines. In some cases, specific radiology groups develop protectable expertise around some protocols (i.e., how to get the machine to give better data), but long machine lifecycles create opportunities for technical one-upmanship.

Knowing that the labeling quality is variable and the images are not entirely comparable, you realize the bulk of AI-radiology may be built on shifting sands, and amounts to workflow optimization.

Worse, the effort could go into addressing what really plagues all providers: the unusability of their EHRs. But that's not particularly monetizable relative to critical diagnostics, and it's harder than production AI for MRI.


When you ask interview questions like "How would you design the ML stack that can recommend a restaurant given a user's past restaurant-going history?", most candidates don't get into the craziness of things like bias in datasets or calibration of the models' probability outputs.

They dive straight into embeddings (they have the freedom in the interview to design something they are comfortable with), which is important to the quality but not necessarily the most (or even the only) challenging part of what the team works on.


Caveated [1] or not, the "concluding" paragraphs [2] are not a summary of the article. Neither are they well-supported nor convincing; they are sweeping, unrelated generalizations.

Notes:

1. The caveat at the top states "NOTE: This post is not up to my normal writing standard, just felt compelled to get this down in some form. This is more like a blog post entry than a real newsletter addition."

2. The article's final paragraphs are: "Long story short / Thinkers have a pattern where they are so divorced from implementation details that applications seem trivial, when in reality, the small details are exactly where value accrues. / Should you be worried about GPT5 being used to automate vulnerability detection on websites before they’re patched? Maybe. / Should you be worried GPT5 is going to interact with social systems and destroy our society single-handedly? No absolutely not."

P.S. Pedantic timestamp: the quotes above are taken on 12:24 pm eastern time, May 29. Well, this may seem pedantic until the author updates the blog post and then this comment seems erroneous. It is notable that we still don't have a well-accepted standard for snapshotted content.


>> Geoffrey Hinton was one of the loudest voices decrying the decline of radiology 5 years, and now he’s crying fear for new AI systems.

It's worth remembering exactly what Geoff Hinton said, and how he said it:

“I think that if you work as a radiologist you are like Wile E. Coyote in the cartoon,” Hinton told me. “You’re already over the edge of the cliff, but you haven’t yet looked down. There’s no ground underneath.” Deep-learning systems for breast and heart imaging have already been developed commercially. “It’s just completely obvious that in five years deep learning is going to do better than radiologists,” he went on. “It might be ten years. I said this at a hospital. It did not go down too well.”

https://www.newyorker.com/magazine/2017/04/03/ai-versus-md

So Hinton didn't "decry the decline of radiology". His comment brims with exuberant sarcasm about how his chosen approach to AI is about to not only leave radiologists without a job, but also leave them looking like the short-sighted fools he thought they surely were.

Hinton was cock-sure and arrogant, like an undergraduate student who just trained his first neural net. That is quite unbecoming of the man whom the New York Times, The Guardian, and many other publications have been calling "the godfather of AI", and a Turing award winner, to boot.

That is something he should be called out on, and not allowed to forget; nor should anyone else who thinks that leaving highly-trained workers without a job is a laughing matter. Nobody should be allowed to make fun of the harmful consequences of the technology they create.


Hinton has been crucified many times in radiology circles for his statements. It honestly feels like we've beaten a dead horse on this. Clearly he was wrong and did not understand the complexity of the field. Although he was wrong, his statements today (with the advent of LLMs) are less insane than they were even 7 months ago, and directionally he saw the progress of the technology. His magnitude/timeline was off, and time will tell if he is one day proven right.


I don't think speaking to Radiologists as "advisors" to an AI system is going to be productive in the grand scheme of things. Yes, radiologists (and all doctors for that matter) will give their thoughts on making their job incrementally easier. But that doesn't necessarily equate to better healthcare.

Is that going to move the needle on better healthcare for patients? Will it allow our hospital systems to bring in more throughput of patients at higher quality of care and lower costs?

That's not what doctors (and hospitals) want. They won't be an ally in "AI for healthcare" unless it means their paychecks/revenues are protected. The author is being misled by seemingly smart people who have skin in the game and a lot to lose.

Personally, I only favor AI solutions that will drive down the cost of all healthcare, raise the quality of service, and overall expand high-quality healthcare to underserved populations. Anything less than that is only serving the needs of the industry and not patients.


Healthcare is fascinating. The claim that nothing will be implemented that could possibly harm patients is false, because harm is complicated.

Sharing patient data harms privacy, but in the long run more data sharing is probably the best route for reducing physical harm. The body is just too complicated for individuals to fully grok.


Do you believe in a middle ground of informed consent such that data can be shared while mitigating the privacy harms? It allows each patient to make their own decisions, and just like volunteering to be an organ donor, there can be societal benefit without forcing harm on people who do not want to participate.


I believe that if I cared more about privacy than my life, I wouldn't go to a public institution full of people I objectively cannot trust because I don't personally know them and then put my life in their hands.

Privacy schizos need to stop making up nonsense. All of your diagnosed illnesses are permanently stored in a database that will never ever be deleted until decades after your death. This data is already shared between all hospitals. The moment you step through the hospital door you aren't at home, and your privacy fantasies end. Literally any malicious worker could leak your data, and then what? Why does this even matter? If I see you on the street, I can tell you're sickly just from the way you're walking.


I don't. I think sharing data should be a mandatory part of receiving services. In order to benefit from the system you must commit to continue to improve it.


> Sharing patient data harms privacy

Yes, that's a very abstract statement that seems plausible and is easy to make, but how do we know there's significant harm? What are some real-world examples? Can we quantify how much harm is done to patients from sharing too much, versus sharing too little?


The "feel" aspect in any domain is largely embeddings of a multi-modal information space. The challenge though with AI systems today is solving for multi-modality and providing access to datasets in closed domains. We are making strides in both.

As AIs get exposed via APIs and ever-easier access, we will see more proliferation of this, where jobs we felt could never be AI-ified are more rapidly becoming so.

Reminds me of the Lee Sedol vs. AlphaGo match, where it was famously said at the end: "All but the very best Go players craft their style by imitating top players. AlphaGo seems to have totally original moves it creates itself."

I do echo, though, the point about production-grade rollout of AI. When we want to jinx the game, we can notoriously play spoilsport: regulation, data protection, et al.


On point! I've been working on putting AI/ML systems into production for various large companies over the last 5 years, and every time the ML part of the system is just the tip of the iceberg. System integration and user adoption are two big components that had to be tackled before the AI system was in production.

I think the natural excitement we have with any new AI model (or technology) leads us to assume that it will magically get integrated in any existing system and adopted by any user.


"Radiologists are not performing 2d pattern recognition - they have a 3d world model of the brain and its physical dynamics in their head. The motion and behavior of their brain to various traumas informs their prediction of hemorrhage determination."

Radiologists are certainly performing 2d pattern recognition, as the input they are processing is only 2d, even if the details of how that recognition is performed rely on some deeper understanding. Likewise, an AI system performs recognition of 2d patterns based on some deeper "understanding" - in the case of a neural net that "understanding" lies in the complex configuration and weightings of its neural connections built up from vast training datasets.

Even if this dataset is only a subset of the dataset that humans are trained on, we still can't a priori claim that this subset lacks patterns and correlations that escape humans and that allow an AI to make certain determinations better than a human might.


>Should you be worried GPT5 is going to interact with social systems and destroy our society single-handedly? No absolutely not.

I don't think most people are scared of GPT5. It's AGI that they're scared of.

And GPT5 can destabilize a society because of how fast it could replace workers.


I'm not scared of AGI beyond how it's likely to lead to societal collapse because of human greed. I am worried about the operationalization of current chatbot technologies and other generative AI modelling techniques for harm. You don't have to have a smart system to hurt hundreds of millions of people.


GPT is a tool in the toolbox. It can increase productivity, but won't replace a worker.

FYI, GPT-5 won't be coming soon, and GPT-4 vs. 3.5 is not that different in terms of quality.


The main worry was never about it "replacing a worker". It's about one worker + GPT replacing two or three, because of the increased productivity gains you mentioned.

I've seen it happening in some fields already. First with translation, now with copywriting and GPT-4. Hiring was frozen for those areas for a while in some companies I'm familiar with.


Also administrative assistants and secretarial support. 30 years ago maybe 1/2 of the white collar workforce and now maybe 5% or less?

Not AI - but IT.


You miss the point: reality is far, far, immensely far more complex than you, I, or anyone else realizes. Attempts to use AI to replace things, people, groups, and fields are fool's errands. They will be colossal failures. Just watch, you may even be a participant: reality is more complex than we realize, and our attempts to replace it with encompassing automation will fail.


Yawn. Computers were supposed to put us out of work. Then it was the internet. Now it's AI. And despite all that, the number of people employed keeps increasing, and people keep finding new ways to thrive with technology.


The industrial revolution replaced muscle power with machinery. AI is finally at the place where it can replace brain-power for many tasks, and will likely keep improving. I think you are underestimating the significance of this. I don't see too many horses with jobs these days.


Sam Altman is saying we are going to see a shift from labor to capital and that is going to cause more economic inequality along with possibly breaking the social contract.

It is probably nothing though...As if this gopher server is going to replace my TV someday. Give me a break. It only does text!


Sam Altman is also a doomsday prepper because he anticipates societal collapse when the shift from labor to capital pushes economic inequality past France-1789 levels.


Currently, AI is not even able to improve my work... How is it supposed to replace me? I think the statement "AI is finally at the place where it can replace brain-power for many tasks" is ridiculous.


I don't think your statement is true - AI can definitely improve my work and has done for a while now. It lets me be more productive, but it certainly isn't capable of being left unattended or even lightly attended to.

As long as you know the limits of what you're doing, tools like ChatGPT are very helpful. In about a minute, I was able to generate a skeleton app using a Go framework, including tests, the infrastructure-as-code to deploy it as a Lambda, a Makefile to build and run it, and a Buildkite pipeline that will build and deploy it all. It's also correct.

Now, I can't rely on it to do everything, but it can give me an app scaffold for a tech stack I'm familiar with, quicker than I can Google for it.


That's good for you, and I could give you some anecdotes with ChatGPT where it failed horribly, e.g. where it failed to generate unit tests for a simple API call (about 10 LOC)...

I am sure the tools will only improve, but I don't understand all the hype/fear about it. ("We are fucked", "this is soo over", ... I think you will find more of those comments in older news-threads)


I agree - it's disastrous when it goes wrong. If you ask it to generate the code for a lambda with 32 CPUs, it will happily spout out nonsense rather than tell you it's not a valid thing to request.

That said, I've found it remarkably good at spitting out slightly modified boilerplate - like IDE templates on steroids. It's been a great tool, like a hammer. But not everything is a nail.


It's the difference between "some human labor involved" and "zero". Getting to zero is very hard, and has very different ramifications than the "some" quantity.

As long as "some" labor is needed, then that's going to be the entire economy - it'll simply expand along whatever constraint that "some" represents.

The bizarre thing lately is all these arguments people make implying the economy is maximized. That all the things that will ever be needed are currently being produced and no new growth is possible or will ever happen (despite this literally never being the case in the entire history of the human race).


I remember seeing a paper that explained that while a business needed 7 employees on average to reach something like $1MM in revenue in the 90s, it now needs 3.

So I think it still has the potential to replace some people, and even more with AI and eventually AGI, perhaps.

This is a progression. Besides, employment is a bad measure in general if one does not compare it to others such as cost of living, poverty etc...

Non-farm payroll, for example, doesn't say anything if everyone who is employed is living on a minimum paycheck.


How much of that is because of inflation though? 1 million in 1995 corresponds to 2 million in 2023 dollars.


> I remember seeing a paper that explained that while a business needed 7 employees on average to reach something like $1MM in revenue in the 90s, it now needs 3.

An alternative interpretation: while in the 90’s, 7 people could only support one $1MM business, today they can support at least two.

(US unemployment is lower now than it was then.)


As a thought experiment:

- computers are able to churn through any problem more quickly and cheaply than people

- machines are able to perform any tasks that humans can, more quickly and cheaply

I’m curious what you think is left for people to _do_.

We may not be there today, but at some point, both of those things can and will become true.


Fantasies never come true.


If GPT5 is going to put people out of work, I hope it will make the largest percentage of people unemployable in the shortest amount of time. Only then might there be a chance that we will change our way of measuring an individual's value by his or her economic output.


White collar workers, the ones who went to college and studied, will be the first to be unemployed. I'm guessing that these people are more employable than the rest of the population since they dedicated themselves to studying in college versus the ones who mess around in life.


> And GPT5 can destabilize a society because of how fast it could replace workers.

How do you know this?


Aren't these as complex as any other piece of software that is looking to provide value to users?

I've been developing software solutions for several years, and the fact is that a computer that can do cool stuff, automate processes, and be more accurate... still needs users who know how to use it and want to use it. So my takeaway is that with AI solutions the human part also needs to be considered.


A customer I am working with can’t even set up CI/CD with GitHub.

I have no doubt that AI is not going to kill us because 95% of companies won’t even be able to build it.


> A customer I am working with can’t even set up CI/CD with GitHub.

Can you please characterize your customer without calling them out by name?

Also, can you clarify what you mean by "can't"? Under what time frame? With what background? With what else on their plate?

In other words, why do you expect your customer to be able to setup CI/CD with GitHub? Maybe this seems reasonable to you, but maybe you are overlooking the hundreds of little things that make it seem easy to you, not least of which is the patience and ability to wade through tasks with many steps.

I'm genuinely curious. Are you suggesting they lack capability (such as intelligence or skills), motivation, and/or something else?


> I have no doubt that AI is not going to kill us because 95% of companies won’t even be able to build it.

I take your joke, but the argument doesn't hold water for many reasons. Here are only two:

1. Even if only 5% of companies can build something dangerous, that's more than enough.

2. The "AI" doesn't have to directly kill us; we're capable of doing that ourselves with only a little bit of informational, societal, and/or economic degradation.


It is because, especially in tech, most people can't even vocalize what they want in enough detail for an AI to generate it. Even in my day job (which is probably similar to yours), most people give me the wrong information when asking for a solution, and if I blindly built what they think they need, they wouldn't get where they wanted.


The only general-distribution production AI system (post-ChatGPT) I'm aware of that "works" decently is Grammarly Go.

What others are out there?


What do you mean by "work decently"? What kinds of behavior are you interested in? (i.e. What are you hoping to learn from this conversation?)

If I were to guess, I'd probably think you are interested in production-level issues, rather than limitations of these "AI" technologies that have nothing to do with typical software scaling issues?


Google search


This conversation is unfolding nicely. Sorry for the sarcasm... I'll say what I mean: I think we can strive to treat even poorly phrased questions as an opportunity.

Narrowly, yes, Google search seems to be behaving roughly as well (or poorly, depending on your point of view) nowadays. But I get the feeling saying "Google search" isn't a helpful response to the person asking the question.


Helpful response: there are thousands of productionized AI systems that have existed long before ChatGPT. For example, google search.


The context of the question suggests the questioner is talking about AI technology at the level of ChatGPT (large language models) and the associated technological complexities around them.

Subtopics might include differences between managing state and sessions of a text search engine versus a large language model


DeepL?


To start, even small things like getting the JSON output format right 100% of the time are impossible.
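
As a rough illustration of what that looks like in practice, here's a minimal validate-and-retry sketch (my own, not anything official; `call_llm` is a hypothetical stand-in for whatever client you actually use):

    import json

    def get_json(prompt, call_llm, max_retries=3):
        """Ask the model for JSON and re-prompt until it parses (or give up)."""
        message = prompt + "\nRespond with valid JSON only."
        for _ in range(max_retries):
            raw = call_llm(message)      # hypothetical LLM call
            try:
                return json.loads(raw)   # parsed successfully
            except json.JSONDecodeError:
                # feed the bad output back and ask for a correction
                message = "Fix this so it parses as JSON; output only JSON:\n" + raw
        raise ValueError("model never returned parseable JSON")

Even with a loop like this there's no hard guarantee; it just turns "fails sometimes" into "fails less often."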


Me: "Are you speaking of JSON in the context of radiology systems? medical systems? enterprise systems? software in general?"

You: Yes.


Even harder is using them when the vendor keeps pushing updates that may or may not work.


This sounds like the kind of cope that gets written just before production AI starts going down the list replacing jobs.


tbf your handle is on point


At the root of this is something I'm constantly saying to other members of my research lab, arguing about with reviewers, and passionately teaching my students:

datasets are proxies, measurements are fuzzy.

This is something that is drilled into students learning statistics, and I can't for the life of me figure out why it changed with respect to ML (my best guess is that statistics courses aren't required).

Datasets are proxies: they represent the world, but they aren't the world. They should generally be seen as narrow subsets, too. Your dataset's quality and type matter a lot! Things like medical image datasets also have tons of correlating factors that can easily invalidate all results without you being aware. There are simple datasets we use to prove a concept (toys, MNIST, CIFAR, etc.). There are large-scale datasets that have internal inconsistencies (ImageNet, Flowers). There are huge datasets that haven't been properly filtered/deduplicated (LAION). (There are also just shitty datasets (HumanEval).) Thinking of datasets as proxies helps internalize the frustration that literally every production engineer faces (even outside ML and software): real-world results are inconsistent with lab results. Dataset engineering is an underappreciated art that is extremely difficult. But everyone needs to internalize that a dataset is just a map, not the territory, and your navigation will only be as good as the map (many are poorly drawn maps, often on purpose).

Measurements are fuzzy: benchmarkism is running rampant in the ML world, and it baffles me that a field whose bottom-line objective deals with alignment isn't able to align on how we evaluate ourselves. No measurement is perfect, and many are far from it. You can train two language models to the same NLL and one might sample well while the other outputs garbage. You can train two image models to identical FIDs and one samples clearly while the other doesn't. Likelihood doesn't guarantee sharpness either, and I could go on. You must think about the limitations of your measurements and know them in depth. This also seems to have gotten away from us; people just use the measurement libraries and call it a day. We've reached a point where ImageNet classification accuracy has decoupled from downstream performance (object detection and segmentation), and things like this confuse production people, because taking the model with the highest score doesn't always give the best-performing model for their work (even before we consider things like throughput and memory usage). It is a Goodhart problem through and through.
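
To make the FID point concrete, here's a toy sketch (mine, not from any paper; it uses raw 2-D points rather than Inception features) where a degenerate sampler scores essentially the same FID as a healthy one, because the metric only sees the first two moments:

    import numpy as np
    from scipy.linalg import sqrtm

    def fid(x, y):
        """Frechet distance between Gaussians fitted to two sample sets."""
        mu_x, mu_y = x.mean(0), y.mean(0)
        cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
        covmean = sqrtm(cov_x @ cov_y).real
        return float(((mu_x - mu_y) ** 2).sum() + np.trace(cov_x + cov_y - 2 * covmean))

    rng = np.random.default_rng(0)
    good = rng.normal(size=(10_000, 2))                      # "healthy" samples, N(0, I)
    degenerate = rng.choice([-1.0, 1.0], size=(10_000, 2))   # four corner points, same mean/cov
    print(fid(good, rng.normal(size=(10_000, 2))))           # ~0
    print(fid(good, degenerate))                             # also ~0, despite obviously broken samples

Real FID is computed on Inception features, which helps, but the same moment-matching blindness is still there.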

ML is at a serious point where we've gotten away from our basic stats training. That's going to pose a real danger to society, not AGI. It's like handing power tools to chimps (who don't know how power tools work); it won't end well. But that is happening because we've shifted focus to meeting targets, not to measuring our work. Targets are easy; science is hard. Unless we bring these nuances back into how we evaluate work, we are just handing power tools to chimps without any quality assurance.


I remember working with several ML teams specifically on inference performance (latency, memory usage, etc.), and it's no surprise to see object detection performance vary depending on the scene.

Sometimes even capping the model architecture so we stay within performance thresholds is non-trivial in itself, but convincing "some" researchers why p99 inference latency, for example, matters more than the p50 case in a safety-critical system... that's surprisingly orders of magnitude harder.
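
For anyone who hasn't had that argument yet, a quick sketch (illustrative numbers only, nothing from a real system) of why the median hides exactly the frames that matter:

    import numpy as np

    rng = np.random.default_rng(0)
    # simulated per-frame inference latencies in ms: mostly fast, ~2% slow stalls
    latencies = np.concatenate([
        rng.normal(20, 2, size=9_800),    # typical frames
        rng.normal(120, 10, size=200),    # rare stalls (big scenes, contention, etc.)
    ])
    p50, p99 = np.percentile(latencies, [50, 99])
    print(f"p50 = {p50:.1f} ms, p99 = {p99:.1f} ms")
    # A 30 ms deadline looks comfortable by the median but is blown on ~2% of
    # frames, and in a safety-critical pipeline those are the frames that matter.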


Most, if not all (I can't think of one), research datasets have enough problems that once you reach a certain threshold your model will start to overfit despite no divergence between training and test performance. But this is hard to explain to many people, since they only understand overfitting as divergence. Maybe worse, it is the status quo to tune hyper-parameters on test-data results[0] (instead of on a validation set), which causes information leakage and encourages overfitting. But damned if you do, damned if you don't. And don't get me started on generative models, where there aren't even test datasets[1]. (A minimal sketch of the validation-vs-test protocol follows the footnotes.)

[0] I blame the status quo of lazy reviewers rejecting works based on benchmarks, as well as reviewers simply being uninformed.

[1] I'll admit this is a sore spot right now as a reviewer took my paper's discussions about the limitations of FID and asked why I didn't present a new metric.
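
To make the test-set-tuning point concrete, here's a minimal sketch (mine, with made-up data and sklearn's Ridge purely as a stand-in model) of the protocol that avoids the leakage: tune on a validation split, touch the test set once.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1_000, 20)), rng.normal(size=1_000)

    # train / validation / test: 60 / 20 / 20
    X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

    best_alpha, best_val = None, np.inf
    for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        val_err = mean_squared_error(y_val, model.predict(X_val))
        if val_err < best_val:
            best_alpha, best_val = alpha, val_err

    # the test set is used exactly once, after all choices are frozen
    final = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)
    print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))

Swap the validation split for the test set in that loop and the reported test error stops being an honest estimate: every hyper-parameter choice has already peeked at the "unseen" data.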


>> Maybe worse, it is the status quo to tune hyper-parameters on test-data results[0] (instead of on a validation set), which causes information leakage and encourages overfitting.

This is extremely hard to convince people about. I'm starting to doubt that even veteran researchers realise how unreliable their error estimates become if they're tuning their models on the test set, which, like you say, is standard practice. And yet, people will stand on that rotten practice and talk about the amazing generalisation ability of neural nets, and how over-parameterising and over-training defies statistical learning theory. That's why we have such gems as this paper:

The unreasonable effectiveness of deep learning in artificial intelligence

https://www.pnas.org/doi/10.1073/pnas.1907373117

Or "the grokking paper" and so on. Machine learning is starting to look more and more like the social sciences, where people simply pick and choose results from the (mostly non-peer reviewed) literature, just because they like the claim in a paper (or because it has a catchy title, or it went viral on twitter), and not because they make any serious attempt to check the results themselves.

P.S. Sorry about your review. It's a good idea to avoid any discussion that goes beyond the central claim of a paper. It can only confuse reviewers and distract them from the meat and potatoes of the work. Unfortunately, sticking to that advice makes for dry, boring-to-read papers, which also reduces the chances of acceptance.


Oh yeah, those are great papers, which I wish more people read. (I'll never not laugh at the journal name.) I'll add this too[0], since it seems to be highly overlooked. But as far as validation sets go, I have a hard time convincing anyone I talk to that tuning on test data is information leakage. I've also had issues discussing with people (even outside my lab, even at big schools/labs) what "uncurated samples" means. I say "you sample a batch and show that"; they say you can sample multiple batches and pick the best one... It's the same kind of thinking, and it shows why evaluating generative papers is so difficult.

Fwiw, I blame the conferences for this. We have too many papers to review, too few people to review them, and a system that isn't even good at selecting people from the pool of reviewers (I got 0 reviews to do, my coworker got 6 ¯\_(ツ)_/¯). Quality control on reviewers is non-existent, and ACs don't check that they follow reviewing rules. This results in a situation where I can't think of a case where it isn't to your advantage to be an evil reviewer: reject everything, be lazy about it. So we promote benchmarkism and abstract anything and everything so that nothing is novel (before we even talk about collusion and ethics violations). Without a radical change I don't think the system will work anymore. There's too much incentive to cheat and play dirty now.

For my paper, all that stuff was in the appendix, fwiw. Since it is a generative paper, I took the chance to do a deep sample analysis (even inventing a new technique) to analyze the biases in different models and to note key indicators that were visible to the eye. So of course I had a small discussion about how FID is limited (see [1,2]; [2] isn't deep enough though) and will not capture these differences. These differences matter when you're pushing against the best achievable FID on the dataset, which is not 0 as many people think[3]. I do feel it is my duty as a researcher to stick my flag in the ground and point out how we need to do things better. (I do think it also made the paper much clearer, and it got good feedback from my colleagues, fwiw.) That is what research is, after all. Research requires nuance, and if we're being honest, that shouldn't be something I have to say. People should deeply understand their metrics (not just for evaluating models, but for evaluating works). The system is just too noisy right now to be meaningful, imo.

[0] A note on the evaluation of generative models: http://arxiv.org/abs/1511.01844

[1] Improved Precision and Recall Metric for Assessing Generative Models: https://arxiv.org/abs/1904.06991 (original is good too: https://arxiv.org/abs/1806.00035)

[2] The Role of ImageNet Classes in Fréchet Inception Distance: https://arxiv.org/abs/2203.06026

[3] Fwiw, the CIFAR-10 train vs. test set has an FID of 3.15, and FFHQ-256 top 10k vs. bottom 60k is 2.25 (which the current top paper beats, but that's 50k generated samples vs. 50k random dataset samples). These of course have biases since the sizes are unequal, but since FID is distributional it does give us some strong clues about the variance in the datasets. My paper didn't go this far though.


The funny thing is that all of this is clearly spelled out in PAC learning and statistical learning theory, but the trend in most machine learning courses is to avoid all that hairy stuff and concentrate on the practical tasks of training classifiers, using popular libraries, and so on.

Btw, in the above comment:

FID = Fréchet inception distance

NLL = Negative Log Likelihood


Yeah, this is what bugs me. If these people took a single stats class it would be drilled into them. It doesn't matter if it's ISLR, PAC-LSLT, SR, or ROS; it will be there. But it isn't. I push for as much of this as I can when I teach my ML course, but my advisor pushes back on it being "too mathy." I swear, ML people are averse to math (though they like adding meaningless math equations to papers to make them look technical, which is probably aligned with the former comment). As the current environment stands, I do not think many ML people understand concepts like data leakage, Bayes, or even likelihood. A problem is that the popular way to teach ML is how you teach software: coding. That doesn't work when you need statistical theory to understand your results.


> Geoffrey Miller was one of the loudest voices decrying the decline of radiology 5 years, and now he’s crying fear for new AI systems.

I don't know who Geoffrey Miller is, but I'm pretty sure that if there's a Geoffrey who notably predicted the decline of radiology, it was Geoffrey Hinton a few years ago...


I did a few minutes of research. I did not find any such person. So I'm inclined to agree; the author may have meant Geoffrey Hinton.

See also: https://statmodeling.stat.columbia.edu/2021/06/07/ai-promise...

> Gary Smith points us to [this news article][1]:

> Geoffrey Hinton is a legendary computer scientist . . . Naturally, people paid attention when Hinton declared in 2016, “We should stop training radiologists now, it’s just completely obvious within five years deep learning is going to do better than radiologists.” The US Food and Drug Administration (FDA) approved the first AI algorithm for medical imaging that year and there are now more than 80 approved algorithms in the US and a similar number in Europe.

[1] https://qz.com/2016153/ai-promised-to-revolutionize-radiolog...

> Geoffrey Hinton is a legendary computer scientist. When Hinton, Yann LeCun, and Yoshua Bengio were given the 2018 Turing Award, considered the Nobel prize of computing, they were described as the “Godfathers of artificial intelligence” and the “Godfathers of Deep Learning.” Naturally, people paid attention when Hinton declared in 2016, “We should stop training radiologists now, it’s just completely obvious within five years deep learning is going to do better than radiologists.” The US Food and Drug Administration (FDA) approved the first AI algorithm for medical imaging that year and there are now more than 80 approved algorithms in the US and a similar number in Europe.

> Yet, the number of radiologists working in the US has gone up, not down, increasing by about 7% between 2015 and 2019. Indeed, there is now a shortage of radiologists that is predicted to increase over the next decade.


Looks like the OP has fixed that now.


Who tf started calling Hinton the godfather of AI? It is not only inaccurate but unscientific. What about Minsky, Weizenbaum, Rosenblatt, Hopfield, Lenz, Hebb, Werbos, Turing, McCarthy, and who else and whatnot? What differentiates Hinton from all the rest? The truth is that all of science is an amalgamation of past achievements, many so small and unheard of that it's just very, very wrong to pick one person and call them the godfather. Why Hinton? Because DL had good marketing 10 years back?


>Minsky, Weizenbaum, Rosenblatt, Hopfield, Lenz, Hebb, Werbos, Turing, McCarthy and whoelse and whatnot

Sounds like that Queens of the Stone Age song.. https://open.spotify.com/track/3DaXIGJm0BCEB9X7zHTRfI?si=98b...


> What about Minsky

Minsky slowed down AI by using his powerful rizz to hypnotize otherwise reasonable researchers and academics into believing his self-serving assertion that multilayer perceptron systems are bad because single-layer perceptron systems are bad.


Sorry, but you have no idea what you're talking about. Minsky, together with Seymour Papert, wrote a book in which they pointed out very real practical and theoretical limitations of single-layer perceptrons, which at the time (and for many years before) were the only neural networks that anyone had ever successfully managed to train.

You see, Minsky and Papert's book, titled "Perceptrons", was published in 1969. Note well that this was a whole seventeen years before Rumelhart proposed backpropagation as an efficient algorithm for training neural nets, in 1986.

Indeed, without backpropagation, even if multi-layer perceptrons were theoretically capable of learning functions like XOR and parity (the two examples of the limitations of single-layer perceptrons in the Minsky and Papert book), nobody knew how to train multi-layer perceptrons in practice until Rumelhart. That's the whole point of backpropagation, and that's why every single graduate and undergraduate course in neural nets teaches backpropagation: because without it, you're stuck with single-layer perceptrons.
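
To be concrete, here's a tiny sketch (mine, written from scratch rather than taken from the 1986 paper) of backpropagation training a two-layer perceptron on XOR, the very function a single-layer perceptron cannot learn:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)        # XOR truth table

    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)          # hidden layer (4 units)
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)          # output layer
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    for _ in range(10_000):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backward pass: chain rule on the squared error
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= 0.5 * (h.T @ d_out); b2 -= 0.5 * d_out.sum(0)
        W1 -= 0.5 * (X.T @ d_h);   b1 -= 0.5 * d_h.sum(0)

    print(out.ravel().round(2))  # with most random inits this ends up close to [0, 1, 1, 0]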

Minsky and Papert weren't the ones who "slowed down AI", as per your comment; it was neural net researchers who failed to get their shit together and figure out how to train their systems efficiently, who couldn't progress beyond their weaker, lesser single-layer perceptrons, and who accused Minsky and Papert of foul play in order to cover their own research's inadequacies.

Minsky and Papert did a big favour to all those early perceptron researchers by rubbing their noses in their mess and forcing them to recognise the limitations of their systems. If it wasn't for Minsky and Papert, it would probably have taken another couple of decades before neural net researchers got off their arses and improved their work.

And, btw: Minsky himself was a staunch connectionist and built neural network systems. So quit spreading the ahistorical nonsense that he was somehow the big bad of early neural net research. Early neural net research was a shambles, is the truth; even more so than it is today.


Multilayer perceptrons literally are bad. The machine learning industry is a complete joke because trivial stuff like this is all it has.

It would honestly take less time for biologists to trap a brain in a jar and force it to do things than for anyone in this joke of an industry to create a truly intelligent algorithm that can think for itself.

Oh wait, I also forgot to mention that in our politically correct clown society free thinking is illegal, so AI is by definition illegal in the first place, because it may just happen that it starts to be a little bit racist despite otherwise being amazingly intelligent and better than humans at solving certain tasks.


Oh, it's the 'LLMs are simultaneously ineffective and dangerously woke' argument.


I was just listing random names that came to mind. I don't think any of these deserve the title. If anyone, Turing imho.


Hinton, LeCun, and Bengio used to be called the "godfathers of deep learning", which makes much more sense. I think this was after their Turing award, btw.

At some point, it seems someone switched to calling them the "godfathers of AI", possibly because that someone was writing an article for a lay audience who couldn't be expected to know what "deep learning" is, but would more certainly recognise "AI". Or maybe the someone (a journalist) was confused herself about the difference between "deep learning" and "AI" (and probably also "machine learning"), something that's very common. That's my best guess anyway.

Who started this? It _may_ have been Forbes, or perhaps Forbes was one of the earliest users of the term:

The trio — who have been nicknamed the "Godfathers of AI" (...)

https://www.forbes.com/sites/samshead/2019/03/27/the-3-godfa...

This is very tenuous, but the reason I finger Forbes is that I found the article above linked from this Guardian interview with Hinton, presumably to justify calling Hinton "one of the godfathers of AI":

Known as one of three “godfathers of AI”, (...) Hinton (...)

https://www.theguardian.com/technology/2023/may/05/geoffrey-...

The Guardian article was one in a series of pieces about Hinton leaving Google and warning of AI doom that appeared in the Guardian, the New York Times, and other outlets, and that called Hinton "the godfather of AI". The first such article I read was this one, again by The Guardian:

The man often touted as the godfather of AI (...)

https://www.theguardian.com/technology/2023/may/02/geoffrey-...

I, too, was annoyed at the inaccuracy of this moniker and so I went on Twitter to ask Hinton to do something about it. He of course ignored my tweet, but the article with the Forbes link appeared soon after, so maybe Hinton took the hint, after all, or perhaps a Guardian columnist saw my tweet and decided to find some sort of reference for the moniker. Or maybe some other annoyed reader wrote to The Guardian directly and protested. Note the switch from "the godfather of AI" to "one of the three godfathers of AI" - clearly someone tried to do at least a bit of fact-checking, after the first Guardian article. But while it would be edifying if my tweet was at the root of this, I have no way to know it really was; it's just wishful thinking.

Anyway the Forbes article seems to be an early user of the term, if not the originator.


I agree, that was my thinking as well. And I've seen these articles and concluded the same. There's definitely a dynamic there.

While they are very public figures of DL, I don't think it's fair to call them the godfathers of DL anyway. Restricted Boltzmann Machines are nothing without Boltzmann machines. There were so many people involved at the time as well; it's unfair, especially to their research teams, who most probably did the heavy lifting anyway.



