I was recently "playing" with some radiology data. With untrained eyes I had no chance of identifying diagnoses myself; that probably takes a decent radiologist years to master. Just by using DenseNet-BC-100-12 I ended up with 83% ROC AUC after a few hours of training. In 4 out of 12 categories this classifier beat the best-performing human radiologists. Now the very same model, with no change other than adjusting the number of categories, could be used in any image classification task, likely with state-of-the-art results. I was surprised when I applied it to another, completely unrelated dataset and got >92% accuracy right away.

If you think this is a symptom of AI winter, then you are probably wasting time on outdated/dysfunctional models or models that aren't suited for what you want to accomplish. Looking e.g. at Google Duplex (better voice synchronization than the Vocaloid I use for making music), this pushed the state of the art to unbelievable levels in hard-to-address domains. I believe the whole SW industry will spend the next 10 years living off the gradual addition of these concepts into production.

If you think Deep (Reinforcement) Learning is going to solve AGI, you are out of luck. If you however think it's useless and won't bring us anywhere, you are guaranteed to be wrong. Frankly, if you are working with Deep Learning daily, you are probably not seeing the big picture (i.e. how horrible the methods used in real life are and how easily you can get a very economical 5% benefit from just plugging Deep Learning in somewhere in the pipeline; this might seem little, but managers would kill for 5% of extra profit).

AI winters are a result of a massive disparity between the expectations of the general public and the reality of where the technology currently sits. Just like an asset bubble, the value of the industry as a whole pops as people collectively realize that AI, while not being worthless, is worth significantly less than they thought.

Understand that in pop-sci circles over the past several years the general public is being exposed to stories warning about the singularity by well respected people like Stephen Hawking and Elon Musk (http://time.com/3614349/artificial-intelligence-singularity-...). Autonomous vehicles are on the roads and Boston Dynamics is showing very real robot demonstrations. Deep learning is breaking records in what we thought was possible with machine learning. All of this progress has excited an irrational exuberance in the general public.

But people don't have a good concept of what these technologies can't do, mainly because researchers, business people, and journalists don't want to tell them--they want the money and attention. But eventually the general public wises up to the unfulfillment of expectations, and drives their attention elsewhere. Here we have the AI winter.

I'd clarify that there is a specific delusion that any data scientist straight out of some sort of online degree program can go toe to toe with the likes of Andrej Karpathy or David Silver with the power of "teh durp lurnins'." And the predictable disappointment arising from the craptastic shovelware they create is what's finally creating the long overdue disappointment.

Further, I have repeatedly heard people who should know better, with very fancy advanced degrees, chant variants of "Deep Learning gets better with more data" and/or "Deep Learning makes feature engineering obsolete" as if they are trying to convince everyone around them as well as themselves that these two fallacious assumptions are the revealed truth handed down to mere mortals by the 4 horsemen of the field.

That said, if you put your ~10,000 hours into this, and keep up with the field, it's pretty impressive what high-dimensional classification and regression can do. Judea Pearl concurs: https://www.theatlantic.com/technology/archive/2018/05/machi...

My personal (and admittedly biased) belief is that if you combine DL with GOFAI and/or simulation, you can indeed work magic. AlphaZero is strong evidence of that, no? And the author of the article in this thread is apparently attempting to do the same sort of thing for self-driving cars. I wouldn't call this part of the field irrational exuberance, I'd call it amazing.

> Deep Learning makes feature engineering obsolete

I think even if you avoid constructing features, you are basically doing a similar process where a single change in a hyper-parameter can have significant effects:

- internal structure of a model (what types of blocks are you using and how do you connect them, what are they capable of together, how do gradients propagate?)

- loss function (great results come only if you use a fitting loss function)

- category weights (i.e. improving under-represented classes)

- image/data augmentation (self-driving car won't work without significant augmentation at all)

- properly set-up optimizer

The good thing here is that you can automate optimization of these to a large extent if you have a cluster of machines and a way to orchestrate meta-optimization of slightly changed models. With feature engineering you just have to do all the work upfront, thinking what might be important, and often you just miss important parts of features :-(
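To make the "orchestrate meta-optimization of slightly changed models" part concrete, here is a minimal random-search sketch. Everything in it is an assumption for illustration: the search space, the parameter names, and especially `train_and_score`, which is a stand-in that returns a made-up score instead of training anything. On a real cluster each trial would be dispatched to a worker.

```python
import random

# Hypothetical search space; names and ranges are illustrative only.
SEARCH_SPACE = {
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4],
    "growth_rate": [12, 24, 40],
    "minority_class_weight": [1.0, 2.0, 5.0],
    "augmentation": ["none", "flip", "flip+crop"],
}


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def train_and_score(config):
    """Stand-in for a real training run (an assumption, not a real model).

    Returns a made-up validation score so the sketch runs end to end;
    in practice a worker would train and evaluate the model here.
    """
    score = 0.5
    score += 0.2 if config["learning_rate"] == 1e-3 else 0.0
    score += 0.1 if config["augmentation"] != "none" else 0.0
    score += 0.05 if config["minority_class_weight"] > 1.0 else 0.0
    return score


def random_search(n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = train_and_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    config, score = random_search(n_trials=50)
    print(config, round(score, 3))
```

Random search is only the simplest possible orchestration; it is used here because every trial is independent, so it parallelizes trivially across machines.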

Yep, and in doing so, you just traded "feature engineering" for graph design and data prep, no? And that's my response to these sorts. And their usual response to me is to grumble that I don't know what I'm doing. I've started tuning them out of my existence because they seem to have nothing to contribute to it.

It's a huge difference in terms of the time invested to create something that performs well. Hand crafted feature engineering is better for some tasks but for quite a few of them automated methods perform very well indeed (at least, better than I expected).

> if you have a cluster of machines and a way to orchestrate meta-optimization of slightly changed models

curious, if there is any good quality open source project for this..

I'm not aware of actual code/projects that do this, but try looking into neural architecture search; it should be useful: https://github.com/markdtw/awesome-architecture-search

> But eventually the general public wises up to the unfulfillment of expectations, and drives their attention elsewhere. Here we have the AI winter.

And more importantly, business and government leaders wise up and turn off the money tap.

This is why it's wise for researchers and business leaders to temper expectations. Better a constant modest flow of money into the field than a boom-bust cycle with huge upfront investment followed by very bearish actions.

I think the problem is, that's absolutely against the specific interests of university departments, individual researchers, and newspapers - even if it's in the interest of the field as a whole.

Prisoner's Dilemma

That requires super rational agents in the game theory sense...

> AI winters are a result of a massive disparity between the expectations of the general public and the reality of where the technology currently sits.

I think they also happen when the best ideas in the field run into the brick wall of insufficiently developed computer technology. I remember writing code for a perceptron in the '90s on an 8 bit system, 64 k RAM - it's laughable.

But right now compute power and data storage seem plentiful, so rumors of the current wave's demise appear exaggerated.

I wonder, though, what will happen with the demise of Moore's law... can we simply go with increased parallelism? How much can that scale?

That part will be harder than we can imagine.

Most of the software world will have to move to stuff like Haskell or other functional languages. As of now the bulk (almost all) of our people are trained to program in C-based languages.

It won't be easy. There will be a renewal for high demand software jobs.

I don't think Haskell/FP is a solution either... Even if it allows some beautiful straightforward parallelization in Spark for typical cases, more advanced cases become convoluted, require explicit caching, and decrease performance significantly, unless some nasty hacks are involved (resembling cut operator in Prolog). I guess bleeding edge will be always difficult and one should not restrict their choices to a single paradigm.

I wish GPUs were 1000x faster... Then I could do some crazy magic with Deep Learning instead of waiting weeks for training to be finished...

That's more a matter of budget than anything else. If your problem is valuable enough, spending the money in a short time-frame rather than waiting for weeks can be well worth the investment.

I cannot fit a cluster of GPUs into a phone where I could make magic happen real-time though :(

Hm. Offload the job to a remote cluster? Or is comms then the limiting factor?

It won't give us that snappy feeling; imagine learning things in milliseconds and immediately displaying them on your phone.

Jeez. That would be faster than protein-and-water-based systems, which up until now are still the faster learners.

somebody is working on photonics-based ML http://www.lighton.io/our-technology

> AI winters are a result of a massive disparity between the expectations of the general public and the reality of where the technology currently sits.

A symptom of capitalism and marketing trying to push shit they don't understand

I don't think the claim is that AI isn't useful. It's that it's oversold. In any case, I don't think you can tell much about how well your classifier is working for something like cancer diagnoses unless you know how many false negatives you have (and how that compares to how many false negatives a radiologist makes).

There are two sides to this:

- how good humans are at detecting cancer (hint: not very good), and whether an automated system might be useful even as a "second opinion" next to an expert

- there are metrics for capturing true/false positives/negatives one can focus on during learning optimization

From studies you might have noticed that expert radiologists have e.g. F1-score at 0.45 and on average they score 0.39, which sounds really bad. Your system manages to push average to 0.44, which might be worse than the best radiologist out there, but better than an average radiologist [1]. Is this really being oversold? (I am not addressing possible problems with overly optimistic datasets etc. which are real concerns)

[1] https://stanfordmlgroup.github.io/projects/chexnet/
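For anyone not used to reading F1 numbers, a small sketch of how the score falls out of the error counts. The counts in the example are made up for illustration and are not taken from the CheXNet study; the function name is my own.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives
    and false negatives. Note that true negatives do not appear at all."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up reader: finds 40 of 80 real cases and raises 60 false alarms.
    precision, recall, f1 = precision_recall_f1(tp=40, fp=60, fn=40)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With those made-up counts precision is 0.40, recall is 0.50 and F1 comes out around 0.44, which shows how a score in that range can still mean half of the real cases are missed.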

Alright. What is the cost of a false positive in that case?

The problem AI runs into is that with too much faith in the machine, people STOP thinking and believe the machine. Where you might get a .44 detection rate on radiology data alone, that radiologist with a .39 or a doctor can consult alternate streams of information. The AI may still be helpful in reinforcing a decision to continue scrutinizing a set of problems.

AIs as we call them today are better referred to as expert systems. AI carries too much baggage to be thrown around willy-nilly. An expert system may beat out a human at interpreting large unintuitive datasets, but they aren't generally testable, and like it or not, it will remain a tough sell in any situation where lives are on the line.

I'm not saying it isn't worth researching, but AI will continue to fight an uphill battle in terms of public acceptance outside of research or analytics spaces, and overselling or being anything but straightforward about what is going on under the hood will NOT help.

> The problem AI runs into is that with too much faith in the machine, people STOP thinking and believe the machine.

See https://youtu.be/R_rF4kcqLkI?t=2m51s

In medicine, I want everyone to apply appropriate skepticism to important results, and I don't want to enable lazy radiologists to zone out and press 'Y' all day. I want all the doctors to be maximally mentally engaged. Skepticism of an incorrect radiologist report recently saved my dad from some dangerous, and in his case unnecessary, treatment.

Or for a more mundane example, I tried to identify a particular plant by doing an image-based ID with Google. It was identified as a Broomrape because the pictures only had non-green portions of the plant in question. It was ACTUALLY a member of the thistle family.

The problem could be fixed by asking doctors to put their diagnosis into the machine before the machine reveals what it thinks. Then, a simple Bayesian calculation could be performed based on the historical performance of that algorithm, all doctors, and that specific doctor, leading to a final number that would be far more accurate. All of the thinking would happen before the device polluted the doctor's cognitive biases.
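A rough sketch of what such a calculation could look like, under a strong assumption that is mine and not established anywhere in this thread: the doctor's call and the algorithm's call are treated as independent given the true state (naive Bayes). The prevalence, sensitivities and specificities below are invented placeholders for the "historical performance" numbers.

```python
def likelihood(call, sensitivity, specificity, diseased):
    """P(reader makes this call | true state of the patient)."""
    if diseased:
        return sensitivity if call else 1.0 - sensitivity
    return 1.0 - specificity if call else specificity


def posterior_disease(prior, readers):
    """Posterior probability of disease after combining several readers.

    prior   -- prevalence before anyone looks at the image
    readers -- list of (call, sensitivity, specificity) tuples, one per
               reader; calls are assumed conditionally independent
    """
    p_pos = prior
    p_neg = 1.0 - prior
    for call, sensitivity, specificity in readers:
        p_pos *= likelihood(call, sensitivity, specificity, diseased=True)
        p_neg *= likelihood(call, sensitivity, specificity, diseased=False)
    return p_pos / (p_pos + p_neg)


if __name__ == "__main__":
    doctor_says_yes = (True, 0.80, 0.90)      # invented track record
    algorithm_says_yes = (True, 0.85, 0.90)   # invented track record
    algorithm_says_no = (False, 0.85, 0.90)
    print(posterior_disease(0.10, [doctor_says_yes, algorithm_says_yes]))
    print(posterior_disease(0.10, [doctor_says_yes, algorithm_says_no]))
```

In practice the independence assumption is the weak spot: a doctor and a model trained on radiologist labels will tend to make correlated mistakes, which would make the combined number overconfident.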

There is a problem with that approach that at some point hospital management starts rating doctors by how well their diagnoses match those automated ones, and punish those who deviate too much, removing any incentives to be better/different. I wouldn't underestimate this, dysfunctional management exhibits these traits in almost any mature business.

No, it's a "second opinion", and the human doctors are graded on how well their own take differs from the computer's advice when the computer's advice is different from the ground truth.

And there's probably not even a boolean "ground truth" in complicated bio-medicine problems. Sometimes the right call is neither yes nor no, but: this is not like anything I've seen before, I can't give a decision either way, I need further tests.

Is there a prevailing approach to thinking about (accounting for?) false negatives in ground truth data? I'm new to this area, and the question is relevant for my current work. By definition, you simply don't know anything about false negatives unless you have some estimate of specificity in addition to your labeled data, but can anything be done?

I don't get the sentiment of the article either. I can't speak for researchers but software engineers are living through very exciting times.

  State of the art in numbers:
  Image Classification - ~$55, 9hrs (ImageNet)
  Object Detection - ~$40, 6hrs (COCO)
  Machine Translation - ~$40, 6hrs (WMT '14 EN-DE)
  Question Answering - ~$5, 0.8hrs (SQuAD)
  Speech recognition - ~$90, 13hrs (LibriSpeech)
  Language Modeling - ~$490, 74hrs (LM1B)

"If you think Deep (Reinforcement) Learning is going to solve AGI, you are out of luck" --

I don't know. Duplex equipped with a way to minimize its own uncertainties sounds quite scary.

Duplex was impressive but cheap street magic: https://medium.com/@Michael_Spencer/google-duplex-demo-witch...

Microsoft OTOH quietly shipped the equivalent in China last month: https://www.theverge.com/2018/5/22/17379508/microsoft-xiaoic...

Google has lost a lot of steam lately IMO. Facebook is releasing better tools and Microsoft, the company they nearly vanquished a decade ago, is releasing better products. Google does remain the master of its own hype though.

> Microsoft, the company they nearly vanquished a decade ago, is releasing better products.

Google nearly vanquished Microsoft a decade ago? Where can I read more about this bit of history :) ?

IMO, Axios [0] seem to do a better job of criticizing Google's Duplex AI claims, as they repeatedly reached out to their contacts at Google for answers.

0: https://www.axios.com/google-ai-demo-questions-9a57afad-9854...

I think they are overselling Google's contributions a bit. It was more "Web 2.0" that shook Microsoft's dominance in tech. Google was a big curator and pushed the state of the art. Google was built on a large network of commodity hardware; they were able to do that because of Open Source Software. Microsoft licensing would have been prohibitive to such innovation. There was some reinforcement that helped Linux gain momentum in other domains like Mobile and Desktop. Google helped curate "Web 2.0" with developments / acquisitions like Maps and Gmail. When more of your life was spent on the web, the operating system meant less, and that's also why Apple was able to make strides with their platforms. People weren't giving up as much when they switched to Mac as they would have previously.

Microsoft was previously the gatekeeper to almost every interaction with software (roughly 1992 - 2002). I don't know of good books on it but Tim O'Reilly wrote quite a bit about Web 2.0.

My question was actually tongue-in-cheek, which I tried to communicate with the smiley face.

I'm quite familiar with Google's history and would not characterize them as having vanquished Microsoft.

For the most part, Microsoft doesn't need to lose for Google to win (except of course in the realm of web search and office productivity).

You're right, it was Steve Ballmer who nearly vanquished Microsoft at a time when Google was the company to work for in tech and kept doing amazing things. At least IMO.

Unfortunately, by the time of my brief stint at Google, the place was a professional dead-end where most of the hirees got smoke blown up their patooties at orientation about how amazing they were to be accepted into Google, only to be blind allocated into me-too MVPs of stuff they'd read about on TechCrunch. All IMO of course.

That said, I met the early Google Brain team there and I apparently made a sufficiently negative first impression for one of their leaders to hold a grudge against me 6 years later, explaining at last who it was that had blacklisted me there. So at least that mystery is solved.

PS It was pretty obvious these were voice actors in a studio conversing with the AI. That is impressive, but speaking as a former DJ myself, when one has any degree of voice training, one pronounces words without much accent and without slurring them together. Google will likely never admit anything here: they don't have to.

But I will give Alphabet a point for Waymo being the most professionally-responsible self-driving car effort so far. Compare and contrast with Tesla and Uber.

My thoughts on AGI (at least in the sense of being indistinguishable from interaction with a human) are the same as my thoughts on extraterrestrial life: I'll believe it only when I see it (or at least when provided with proof that the mechanism is understood). This extrapolation on a sample size of one is something I don't understand. How is the fact that machine learning can do specific stuff better than humans different in principle than the fact that a hand calculator can do some specific stuff better than humans? On what evidence can we extrapolate from this to AGI?

We haven't found life outside this planet, and we haven't created life in a lab, therefore n=1 for assessing probability of life outside earth (which means we can't calculate a probability for this yet). Likewise, we haven't created anything remotely like animal intelligence (let alone human) and we have no good theory regarding how it works, so n=1 for existing forms of general intelligence.

Note that I'm not saying there can be no extraterrestrial life or that we will never develop AGI, just that I haven't seen any evidence at this point in time that any opinions for or against their possibility are anything more than baseless speculation.

This is what we know from Google about Duplex:

"To train the system in a new domain, we use real-time supervised training. This is comparable to the training practices of many disciplines, where an instructor supervises a student as they are doing their job, providing guidance as needed, and making sure that the task is performed at the instructor’s level of quality. In the Duplex system, experienced operators act as the instructors. By monitoring the system as it makes phone calls in a new domain, they can affect the behavior of the system in real time as needed. This continues until the system performs at the desired quality level, at which point the supervision stops and the system can make calls autonomously." --

If the dollar amounts refer to the training cost for the cheapest DL model, do you have references for them? A group of people at fast.ai trained an ImageNet model for $26, presumably after spending a couple hundred on getting everything just right: http://www.fast.ai/2018/04/30/dawnbench-fastai/

That's what you get with Google TPUs on reference models. The ImageNet numbers are from RiseML, the rest is from here - https://youtu.be/zEOtG-ChmZE?t=1079

"Just by using DenseNet-BC-100-12 I ended up with 83% ROC AUC after a few hours of training."

OK, but 83% ROC/AUC is nothing to be bragging about. ROC/AUC routinely overstates the performance of a classifier anyway, and even so, ~80% values aren't that great in any domain. I wouldn't trust my life to that level of performance, unless I had no other choice.
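For readers wondering what an "83% ROC AUC" actually measures: it equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one (ties counting half). Below is a brute-force sketch of that definition; the labels and scores are toy values I made up, and the quadratic loop is only meant for tiny inputs.

```python
def roc_auc(labels, scores):
    """ROC AUC via its ranking definition: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data: one of the two positives is ranked below a negative.
    labels = [1, 1, 0, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.3, 0.1]
    print(round(roc_auc(labels, scores), 3))
```

On the toy data 5 of the 6 positive/negative pairs are ordered correctly, so the result is about 0.833. The number describes the ranking only; it says nothing about which threshold you operate at or what each kind of error costs.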

You're basically making the author's case: deep learning clearly outperforms on certain classes of problems, and easily "generalizes" to modest performance on lots of others. But leaping from that to "radiology robots are almost here!" is folly.

Yeah, but the point here was that radiologists on average fared even worse. 83% is not impressive, but better than what we have right now in the real world with real people, as sad as it is. Obviously, the best radiologists would outperform it right now, but average ones, likely stressed under heavy workload, might not be able to beat it. And of course, this classifier probably works better than humans on certain visual structures, while others that are more easily detected by humans would slip through.

There is also a higher chance that the next state-of-the-art model will push it significantly over 83%, or past the best human radiologist, at some point in the future, so it might not be very economical to train humans to become even better (i.e. dedicate your life to focus on radiology diagnostics only).

I think you're missing a very important part here that maybe you've considered: Domain knowledge. I'm assuming your radiologic images were hand labeled by other radiologists. How did they come to that diagnosis? By only looking at the image? This was a severe limitation of the Andrew Ng paper on CheXNet for detection of pneumonia from chest X-rays. CheXNet was able to outperform radiologists on detection of pneumonia from the chest X-rays, but the diagnosis of pneumonia is considered a clinical diagnosis that requires more information about the patient. My point is that while your results are impressive and indicative of where deep learning could help in medicine, these same results might be skewed since you're testing the model on hand labeled data. What happens if you apply this in the real world at a hospital where the radiologist gets the whole patient chart and your model only gets the X-ray?

There is a paper discussing higher-order interdependencies between diagnoses [1] on X-ray images (they seem to apply LSTM to derive those dependencies). This could be probably extended to include data outside X-ray images. My take is that it's pretty impressive what we can derive from a single image; now if we have multiple low-level Deep Learning-based diagnostic subsystems and combine them together via some glue (either Deep (Reinforcement) Learning, classical ML, logic-based expert system, PGM etc.), we might be able to represent/identify diagnoses with much more certainty than any single individual M.D. possibly could (also creating some blindspots that humans won't leave unaddressed). It could be difficult to estimate statistical properties of the whole system though, but that's a problem with any complex system, including a group of expert humans.

The main critique for CheXNet I've read was focused on the NIH dataset itself, not the model. The model generalizes quite well across multiple visual domains, given proper augmentation.

[1] https://arxiv.org/abs/1710.10501

"Yeah, but the point here was that radiologists on average fared even worse."

Except they don't. See the table in the original post. Also, comparing the "average" radiologist by F1 scores from a single experiment (as you've done in other comments here) is meaningless.

Unless my doctor is exactly average (and isn't incorporating additional information, or smart enough to be optimizing for false positive/negative rates relative to cost), comparison to average statistics is academic. But I don't really need to tell you this -- your comment has so many caveats that you're clearly already aware of the limits of the method.

This thread is a microcosm of this whole issue of overhyping.

On one hand, we have one commenter saying he can train a model to do a specific thing with a specific quantitative metric, to demonstrate how deep learning can be incredibly powerful/useful.

On the other hand, we have another commenter saying "But this won't replace my doctor!" and therefore deep learning is overhyped.

The two sides aren't even talking about the same thing.

Agree that the thread is a microcosm of the debate, but ironically, I'm not trying to say anything like "this won't replace my doctor".

That kind of hyperventilating stuff is easy to brush off. The problem with deep-learning hype is that comments like "my classifier gets a ROC/AUC score of 0.8 with barely any work!" are presented as meaningful. The difference between a 0.8 AUC and a usable medical technology means that most of the work is ahead of you.

Agreed. I think it comes down to the presentation/interpretation of results. The response to "My classifier gets score of X" can be either "wow, that's a good score for a classifier, this method has merit" or "but X is not a good measure of [actual objective]".

So I think it's come down to a conflict between

1. What the author is trying to present

2. What an astute reader might interpret it as

3. What an astute reader might worry an uninformed reader might interpret it as

And my feeling is that, given all the talk about hype in pop-sci, we're actually on point 3 now, even when the author and reader are actually talking about something reasonable. Whereas personally I'm more interested in the research and interpretations from experts, which I find tend to be not so problematic.

> Unless my doctor is exactly average

Just to get back to this point: what if the vision system of your doctor is below average and you augment her by giving her a statistically better vision system, while allowing her to use the additional sources as she sees fit? Wouldn't that be an improvement? We are talking about the vision subsystem here, not the whole "reasoning package" human doctors possess.

Again, check that table. It says a lot:


On just about every test set, the model is beaten by radiologists. Even the mean performance is underwhelming.

I was referring mainly to this one (from the same group and it actually surpassed humans on average):


In their paper they even used "weaker" DenseNet-121 instead of DenseNet-169 for Mura/bones. DenseNet-BC I tried is another refinement of the same approach.

Those are some sketchy statistics. The evaluation procedure is questionable (F1 against the other 4 as ground truth? Mean of means?), and the 95% CIs overlap pretty substantially. Even if their bootstrap sampling said the difference is significant, I don't believe them.

Basically, I see this as "everyone sucks, but the AI maybe sucks a little less than the worst of our radiologists, on average"

What would be the good metrics then? Of course metrics are just indicators that can be interpreted incorrectly. Still, we have to measure something tangible. What would you propose? I am aware of limitations and would gladly use something better...

Some people mention Matthews correlation coefficients, Youden's J statistic, Cohen's kappa etc. but I haven't seen them in any Deep Learning paper so far and I bet they have large blindspots as well.
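For reference, all three of those can be computed from the same 2x2 confusion matrix, and unlike F1 each of them uses the true-negative count. A minimal sketch using the standard textbook formulas; the function names are my own and the counts in the example are made up.

```python
import math


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def youden_j(tp, fp, fn, tn):
    """Youden's J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1.0 - expected) if expected != 1.0 else 0.0


if __name__ == "__main__":
    # Made-up counts: 80 real cases among 1000 images.
    counts = dict(tp=40, fp=60, fn=40, tn=860)
    print("MCC     ", round(matthews_corrcoef(**counts), 3))
    print("Youden J", round(youden_j(**counts), 3))
    print("kappa   ", round(cohen_kappa(**counts), 3))
```

None of this settles which metric is "right"; each one still compresses four counts into a single number, so something is always lost.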

> Just by using DenseNet-BC-100-12 I ended up with 83% ROC AUC after a few hours of training

Of course! Using DenseNet-BC-100-12 to increase ROC AUC, it was so obvious!

Would you mind sharing which other, unrelated dataset you have used the model on?

I can't unfortunately, proprietary stuff being plugged into existing business right now.

ROC AUC is fairly useless when you have disparate costs in the errors. Try precision-recall.
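A small sketch of why the costs matter: the same scores, and therefore the same ROC AUC, lead to very different operating points once you attach a price to each kind of error. The data and the cost values are made up, and the helper names are hypothetical.

```python
def sweep_thresholds(labels, scores, cost_fp, cost_fn):
    """For every candidate threshold return a tuple of
    (threshold, precision, recall, total_cost)."""
    rows = []
    for threshold in sorted(set(scores)):
        pairs = list(zip(labels, scores))
        tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
        fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
        fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((threshold, precision, recall, cost_fp * fp + cost_fn * fn))
    return rows


def cheapest_threshold(rows):
    """Pick the row with the lowest total cost."""
    return min(rows, key=lambda row: row[3])


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.3, 0.1]
    # A missed case is 10x worse than a false alarm, then the reverse.
    print(cheapest_threshold(sweep_thresholds(labels, scores, cost_fp=1, cost_fn=10)))
    print(cheapest_threshold(sweep_thresholds(labels, scores, cost_fp=10, cost_fn=1)))
```

When missed cases are expensive the cheapest threshold is the low one (0.4 on this toy data); when false alarms are expensive it moves to the high one (0.9). The ranking quality never changed, only the costs did.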

I mentioned F1 in some later comment.

Yea this sounds extremely unlikely unless the other dataset has a fairly easy decision boundary. The kind of cross-domain transfer learning you seem to think deep neural networks have is nothing I've observed before in my formal studies of neural networks.

The next winter will probably be about getting past that 92% across all domains.

Possibly, but will it be called AI winter, if e.g. average human has 88% accuracy and best human 97%?

How much of this can we pin on IBM's overhype of Watson?
