Hacker News
How to recognize AI snake oil [pdf] (princeton.edu)
907 points by longdefeat on Nov 19, 2019 | 344 comments

I don't have time to read the entire paper, but I would like to share an anecdote. I worked at a company with a well-staffed and well-funded machine learning team. They were in charge of recommendation systems - think along the lines of YouTube's up-next videos. My team wanted better recommendations (really, less editorially intensive ones), so the ML team spent weeks crafting 12 or more variants of their recommendation system for our content. We then ran those variants for months in an A/B testing system that judged user behaviour by several KPIs. The result was that all variants performed equally well, within statistically insignificant bounds. The best-performing variant happened to be random.

Talking to other groups that had gone through the exact same process, our results were pretty typical. These guys were all very intelligent, and the code and systems they had implemented were pretty impressive. I'm guessing the system they built would have cost a few million dollars if built from scratch. We did use this "AI/ML" in our marketing, so maybe it was paid for by increased sales through the use of buzzwords. But my experience was that in most limited use cases the technology was ineffective.
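A "statistically insignificant" result like the one described above can be checked with a simple two-proportion z-test on the A/B counts. A minimal sketch - all conversion numbers here are invented for illustration:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is the difference in conversion rates
    between variants A and B statistically significant?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical numbers: the "winning" variant converts 5.1% vs 5.0%.
z = two_proportion_z(510, 10000, 500, 10000)
print(abs(z) > 1.96)  # False: below the ~95% threshold, i.e. noise
```

With samples this size, a 0.1-point lift is indistinguishable from chance, which matches the "best variant happened to be random" outcome.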

This reminds me of a job interview I was on. I was asked about how I would use AI/machine learning for their problem space. Since they seemed to be smart and level-headed, I answered honestly, "Pick something unimportant, use a machine learning algorithm just to get familiar with the tools, ignore the result unless it happens to work, then put machine learning in your marketing materials. But keep track of it, and if it is useful to you in 5 years, be sure to use it for real."

They said, "That's about what we concluded except that we didn't get around to actually doing that pilot project yet."

I got the job. :-)

Speaking of jobs and interviews: I have yet to find a job board that does not show JavaScript jobs when searching for Java jobs. Some of them claim to use AI. :)

My wife once watched me as I did a job search, carefully setting all the available parameters to match my requirements. She said "don't forget, engineering jobs are posted by admin staff", typed "software engineer location" into Google, and found way more than I ever did.

Let's not forget the job application systems that make you chronologically list every previous employer, their location, your title and dates of employment, your education history including dates and degrees received, skillsets/technologies you have experience with, etc. All of this information is on your CV and/or LinkedIn profile, yet they want you to manually re-type it into their late-1990s-era job application system rather than use some basic NLP to extract it. Personally, the moment the system asks me for more than my email/mobile and a CV upload, I bail on the application.

I guess that is intentional, to throttle the number of people applying. It works on me: unless I desperately need a job, I don't apply. I primarily know only "this" style of applying - it's all big-corp companies, though, I must admit.

In fairness, I have this issue with a number of (allegedly) human-intelligence-based recruiters.

Of course, that's where the machines get it from as well. Your training set is usually which candidates were shortlisted by recruiters - at least that is what I have seen from people who have built one of these AI-based recruitment systems. When I asked them whether they were worried about such bad-quality data, their answer was: humans know best. Ah well, then why are you building this again?

The exact same applies to blockchains.

Blockchain is mostly a marketing tool, not something you would want to use in production for anything.

I think generally that is how many products and features work.

That sounds odd, like they don't really need machine learning (unless it is to snare investors?).

Snaring customers too. I swear, people are obsessed with "machine learning" even when the domain really isn't suited for it.

I sometimes wonder if management and engineers don’t have more in common than acknowledged. Publications such as Harvard Business Review have huge coverage of things like managing AI, and being able to say you managed an “AI project” might mean something.

I wonder if people are becoming more savvy and anti-ML after all the issues people have with Facebook and Google collecting a lot more data than people are comfortable with.

I suspect what people dislike is big data. Anecdotally, I'd love to hear about a success where some lone genius codes a better tool using publicly available datasets. I don't like the idea of BigCorp using their larger servers to dominate the space.

Since we are sharing anecdotes, I can report it's been 20 years of buying stuff on the internet and the combined billions of ad tracking research dollars spent by Amazon and Google have not yet come up with a better algorithm than to bombard me with ads for the exact same thing I just bought.

I just spent 15m on Amazon trying to prod the recommendation algorithm into finding something I actually wanted to buy so I could get above the "free delivery threshold".

Think about that. I wanted to spend money. I wasn't too fussy what it was. Amazon has a decade of my purchasing and browsing history.

And they still failed.

Amazon infuriates me. I regularly buy gadget-y bits like electronic components and peripherals. Probably at least once a month. Never see adverts for similar.

That ONE TIME I buy a unicorn dress for my 2 year old daughter? That's a lifetime of unicorn related merchandise adverts and recommendations for you!

Actually that's really interesting because it exposes a bias in their recommendation system. They must be heavily biased towards things with mass appeal instead of specifically targeting user preferences, which is funny because it goes against the grain of the whole "targeted advertising" promise of ML. You'd think if anyone could get that right, it would be Amazon, yet..

It's not impossible for that to be a clever decision by Amazon (although I'm not saying it's likely, I have no idea about the numbers).

The ultimate goal of the advertising is return on investment, not making you feel interested in the adverts. If, to exaggerate the possibility, 100% of "people who look at tech" are 0% influenced by adverts, but 10% of "people who bought a unicorn thing" will go on to buy another if they're constantly reminded that whoever they bought it for likes unicorns, all of a sudden it would make sense despite being counterintuitive to viewers.

A more commonly discussed example of a similar thing is that it's easy to think "I just bought a (dishwasher, keyboard, etc), I obviously already have one so why am I seeing adverts for them?" Sure, it might be that the company responsible has an incomplete profile and doesn't know you bought one already. But it's also possible that the % of people who just bought the item and then decide they don't like it, return it and buy a different type is high enough to be worth advertising to them.

This is basically what comes from a mindset increasingly common among ML practitioners: abdicating thinking and assuming "the machine will find what features are important". They throw a junkyard full of features at the algorithm (or, even worse, automated feature generation). These days, at least once a fortnight I get the opportunity to show folks how much better they could have done if only they had thought for 10-15 minutes, or simply charted their data in a few different cuts, before modeling. :-)

I suspect Amazon has learned that some feature labels are easy to recognize and correlate (dress color, size, style, etc) and others are hard and lead to useless results (an electronic device's: computer, format, port type, protocol, etc).

So they gave up trying to match CPU with GPU and went back to connecting beer to diapers.

Ha. Last time I wanted to buy something on Amazon, their search page kept freezing every time I loaded it: 100% CPU load. Because I really wanted to buy that thing, I spent 30 minutes debugging their silly scripts and found one for-loop that tries to find a non-existent element. Unfortunately, I couldn't figure out a way to enable my fix in a minified script, as reloading the page kept loading the original.

Put a breakpoint on the offending line, manually add the element when it breaks, hit resume and hope for the best?

With uMatrix, you could have created a rule to block that script.

Turn off Javascript - Amazon still works OK.

Amazon still suffers from those "This guy really loves washing machines!" type recommendations.

I hope 'm' here is minutes, not millions )

Don't gift cards count anymore?

Right? Recommenders are almost counter productive for me in most cases. I want a recommender to remain broad, not give me increasingly niche recommendations a la youtube. About the only halfway decent recommender system I've interacted with is the various forms of curated lists on Spotify. They seem to actually take a decent stab at it with related but sufficiently different and interesting content.

For all Facebook knows about me they've always been exceptionally bad at advertising to me, which is remarkable considering what they've got. Google is only very marginally better. Actually, now that I think about it, Amazon's 'customer's also bought' is also pretty bad at the recommendation itself since it not uncommonly recommends incompatible things! ...but it does often succeed at getting me to think more about what else I might need and sometimes leads me to buying other things. At least it's not always recommending the same thing, but rather related things, which is probably a much better way to advertise.

What I've always assumed - though this thread has me doubting myself - is that these systems, even though they appear to suck at specifically targeting me, must somehow be pretty good on average, still netting big profits overall even though they don't seem to live up to the promise of getting me to buy stuff. But if everyone has this impression, maybe it doesn't work? I mean, I assume companies like Amazon and Facebook would be pretty good at optimizing the total reward from using recommendation systems, but maybe you can't tell from your own anecdotal use. I have no idea. I'd love to see an analysis that includes aggregate numbers.

I think you've hit the nail on the head. It's sort of a tragedy of the commons, which in terms of recommenders is a really easy thing to accidentally optimize for. For example, pop music is popular. In the early days, recommenders would just recommend pop music because on average that was a decent recommendation. We've come a long way since then. Well, everyone but Facebook has.

> For all Facebook knows about me they've always been exceptionally bad at advertising to me

Because it's not Facebook really, it's the advertisers who choose targeting criteria. You as an advertiser have a myriad of options. For example if you've built a competitor to X, you can target users who've visited X recently, aged N-M, residing in countries A, B and C, and so on. There are options with broader interests too. Poorly targeted ad means poorly selected criteria by the advertiser (or sometimes just advertiser experimenting) and consequently money wasted. Facebook doesn't care though.

Then there is retargeting/remarketing (targeting bounced traffic), already mentioned here, which is probably the stupidest-looking invention that actually works.

But they are also really, really bad at curating content I like to see - even worse than the advertising, actually. My feed is just garbage and has been ever since they switched over from being chronological. But sometimes I get a wild hair about some particular person, look at their page directly, and see actually interesting content there that was never shown to me. Facebook's only real guess is then to say: "Oh! You must like this person, let me show you them all the time." The reality is, I'm interested in stuff like when a biased person I don't like shares a more neutral or inclusive opinion, or someone does something interesting. Facebook is just unusably bad at selecting for that. Their algorithm pushes really shallow crap and buries anything with substance or depth. I keep it around just to stay in touch with hard-to-contact people.

Likewise Pandora does a decent job of picking songs for me based on what I previously liked. It's not perfect but far better than random.

It's easier to recommend things you probably like than things they want you to like or may like.

Hey, personal experience here, and to be fair to them, Google does occasionally give me ads for a Haas CNC machine. I really want one. But I don't have the disposable $100k for one and I don't have the space... nor do I have 3-phase power. But I do want one and haven't bought one. So, good on them, right?

This can happen when a retargeting campaign doesn't have a 'burn pixel' or conversion event trigger. It's a common oversight, which can cause a re-targeting program to kick-off unnecessarily (or cause ads that have followed you around to become obvious)

You'd think with all the "AI" out there, they could match a sales DB entry with the CRM DB entry, but, in fact, they basically can't.

The AI innovation must be that they can figure out which marketers are likely to forget to have a burn pixel because those marketers drive more revenue.

And yet I haven't seen one ad that has a 'not interested' feature similar to YouTube.

I mean, if I could stop seeing washing machines or whatever, I'd probably click it.

do you clear your cookies and html5 storage? That should wipe any personalization that's happening, but the ads will become very generic then.

You can also block ads (Ublock/Umatrix)

Oh yeah I can do that, I'm just wondering why they're not too interested in direct user feedback from my end. Surely it would be useful to adjust their algos.

That's because:

* They don't have suppression set up

* They're using a conversion tracking platform that is slow

* They're testing the returns conversion hypothesis: you have expressed concrete intent, you have bought the product. If it has 5% return rate, you probably still want it, and there's a 5% chance they need to be in the mix.

Not to mention these systems are also completely useless at recognizing one product someone bought already includes the product they're recommending. When you buy Dark Souls 2 with all DLCs you can be sure Steam will suggest Dark Souls 2 without any DLCs to you for at least a full year.

Sometimes it also recommends other brands of the thing I just bought. Like a cell phone: I didn't buy Samsung, but now I see their ads. I guess that is a very, very small improvement - maybe I'm the type of consumer who gets a new phone every week?

No, but you might be unhappy with your phone, return it and buy a new one.

You are luckier than me: I've spent 10 years bombarded with ads for things I never bought, never wanted to buy and that are directly insulting at this point. (Buying something would maybe give me a two day break though.)

I have been looking for a shelf that's as close to 50" wide and 10" deep as possible. As far as I can tell, no site has a search that allows you to look for such a thing.

You can do this at wayfair.com. It's kind of hidden though, when you get to the shelves, click "Sort & Filter" and then scroll down until you see the dimension sliders.

I think amazon works pretty well.

They show me things similar to the thing I just put in my cart. Sometimes better choices than I made.

They also show related things that other people have bought, that many times I end up purchasing.

Dirty secret: "The customers who bought also bought" algorithm doesn't require ML/AI.

You can accomplish that with relational algebra in a precomputed data warehouse job, and only for products with strong correlation. The customers' own intelligence is enough to instil a semblance of intelligence in the data.
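A minimal sketch of that idea - plain pair co-occurrence counting over past orders, no ML anywhere (the products and orders are invented for illustration):

```python
from collections import Counter
from itertools import combinations

# Hypothetical order history: each order is a set of product ids.
orders = [
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"beer", "chips"},
    {"diapers", "wipes"},
]

# Precompute pair co-occurrence counts, as a batch warehouse job would.
pair_counts = Counter()
for order in orders:
    for a, b in combinations(sorted(order), 2):
        pair_counts[(a, b)] += 1

def also_bought(product, top_n=3):
    """'Customers who bought X also bought' - just counting, no model."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == product:
            scores[b] += n
        elif b == product:
            scores[a] += n
    return [p for p, _ in scores.most_common(top_n)]

print(also_bought("beer"))  # → ['diapers', 'chips']
```

In production you would threshold on a correlation measure (e.g. lift) rather than raw counts, but the shape of the computation is the same.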

Yes, this is the straightforward ‘collaborative filtering’ algorithm. I suppose the line between ‘algorithm’ and AI/ML is not well defined though. At what point does a technique become ‘AI’? I don’t know a good answer.

As an utterly cynical layperson, algorithm means directly querying data. AI means feeding systems with training data and sprinkling them with magic obfuscation dust.

I wonder if the dirty secret for lots of stuff is: "They aren't using AI anywhere"

Well, for problems that simply look for a ranked relationship, if you have human input it could be used to train an ML model that attempts to find similar correlations... or you could just use the human mechanical turks that are already informing you. The ML problems that are good are the ones not trying to approximate reality, like CWBAB.

I'm doubtful we'll see an AI that makes a serious jump without directly interacting with the world we live in. By that measure, cars might be the closest, since learning the bounds of where a car can and can't go is similar to a toddler learning to crawl.

The marketing budgets are there to be spent :)

Yes - just in case you want two, or stopped the purchase along the way.

I have had a similar experience, but I really do think it points more to the team than it does to the efficacy of machine learning. I was on a team of extremely intelligent people, but they were very academic, with minimal practical coding skills, refusals to use version control etc. By academic, I mean graduates from the number one university in their respective fields, top publications, etc. They produced very little actual value. Great theoretical ideas, in depth understanding of different methods of optimization, etc.

The team I was on before that one was a bunch of scrappy engineers from Poland, India and the USA with no graduate degrees, but 20 years coding and distributed systems experience each. The difference in problem solving ability, the speed at which they moved, broke down problems, tried out different methods, was staggering.

I think ML is suffering from a prestige problem, and many companies are suffering for it. The wrong people are being hired and promoted, with business leaders calling the shots on who runs machine learning projects without fully understanding who can actually deliver.

The San Jose Mercury News had a weather-forecasting contest. It was won one year by a 12-year-old whose algorithm was "the weather tomorrow will be the same as the weather today". A kind of AI, I guess.

Brilliant. I think YouTube has arrived at the same algorithm - it picks the videos I watched yesterday to recommend today.

Well to be fair that's how all employers also hire.

If you did a good job at the last company you'll probably do a good job here.

If you did a good job yesterday, you'll probably do a good job today.

For the most part they are usually correct.

I hope you are joking, since the industry collectively knows how much employers/interviewers value algorithm-based coding interviews, which don't correlate strongly with performance. Even if you are talking about senior positions where those don't matter, you should know that people hire someone they know and like who did decently well, rather than the truly best candidate on the market.

That's SF/Big Tech. The rest of the world basically works like the stated algorithm.

The coding interview usually comes after they vet your resume and any profiles, have you talk to HR, and potentially check your references.

Even coding interviews are just a signal of overall performance given a short amount of time. It's only a sample of data, but if you run your interview right, you should be able to protect somewhat against bad people getting very lucky. Just like driving a car: bad drivers tend to stay pretty bad, and good drivers tend to stay safe. Even though there are a lot of ways to define a good driver, there are clearer ways to define a bad driver, and if someone was a bad driver yesterday, they are probably still a bad driver.

Well, that's the thing though - the coding interview is pretty much a yes/no thing. What rank you get is typically based on "what rank do you have now?" and "how much do we respect your current employer?"

I remember reading Steve Jobs used a "different" technique to figure out if someone was good.

He would go around to people and say "I heard Joe sucks". If the people strongly defended Joe, he was probably pretty good. If nobody stuck up for him, Joe might indeed suck.

Probably everyone would be silent as well if someone said "Steve Jobs sucks". This anecdote is meaningless TBH.

Completely ignoring the depth of your subscriptions as well. Amazon Music was much better for me. Even Google Music just repeats.

Seriously. Why would I want to watch a video that I've already watched (unless it's music maybe)?

They have a lot of videos. You're statistically unlikely to care about any randomly selected one. By watching a video, you establish that it's interesting enough for you to watch. It's much more likely that you'll want to rewatch it than that you'll want to view a random video.

I'm only half-joking here. To me, YT algorithm seems to be a mix of "show random videos you've already watched" + "show random videos from channels and users you watched" + "show the most popular videos in last few hours/days/weeks". It's pretty much worthless, but what are we expecting? Like all things ad economy, the primary metric here isn't whether you like the recommended content, or whether that content challenges you to grow - it's maximizing the amount of videos you watch, because videos are the vehicles for delivering ads to you.

I'd prefer a user-driven random walk - something like a multi-armed bandit over a hierarchical graph, instead of a stuck-in-local-minimum-plus-noise recommender. But no one does it.

Years ago, there was an app called StumbleUpon. I always found it engaging - hard to get bored.

I'm guessing things like 5-year-olds randomly watching the same video 300 times on their parents' account have irreversibly tainted the Googletron into thinking people like to watch the same video over and over again.

Hundreds of times I have told YouTube I'm not interested in a recommended video that I have already watched; it seems to be completely ignored.

Why wouldn't you? Do you never re-watch a film or re-read a book or re-order the same food at a restaurant, or re-drive the same route to work?

I often watch the same video that I've already watched, many times music, many times comedy, many times something I want to link to another person but end up watching some/all again as I find it, often if I remember it being interesting (e.g. a VSauce video or a Dan Gilbert TED talk), sometimes if it was a guide or howto that I want to follow - e.g. a cooking instruction.

Music is probably the main driver, but I've definitely clicked some recommended videos from a channel I'm subscribed to with infrequent long uploads.

I have a friend who lived in San Jose who just wrote the forecast on his whiteboard and left it there, because it never changed. It was funny, because he came from Minnesota where the weather is never the same two days in a row.

Ha! I came here to say something similar. Here in San Jose, the weather tomorrow will be the same as today, most of the time. My joke when I lived in Minnesota was: "If you don't like the weather, just wait 10 minutes. It will be different."

Gotta say though, I don't miss my snow blower even just a wee little bit.

To be fair San Jose weather is heavily biased towards sameness.

Yes, but why did all the 'smart' people using models and training networks get a worse result? A condemnation of modelling, to be sure.

Using a Markov chain with your stochastic matrix set to I...

Using “a I”

AR(1) models are commonly employed to model time series, and the "same-as-yesterday" model is the case where the AR(1) coefficient equals 1. There is some mean reversion in seasonally adjusted temperature, so an AR(1) coefficient less than 1 should work better.
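A quick sketch of that comparison on simulated data: persistence is the AR(1) forecast with coefficient 1, and shrinking toward the mean beats it whenever the true coefficient is below 1 (phi = 0.8 here is an arbitrary choice, not a fitted value):

```python
import random

random.seed(0)

# Simulate seasonally adjusted temperature anomalies as an AR(1) process:
# x[t] = phi * x[t-1] + noise, with some mean reversion (phi < 1).
phi = 0.8
x = [0.0]
for _ in range(5000):
    x.append(phi * x[-1] + random.gauss(0, 1))

def mse(forecasts, actuals):
    return sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(actuals)

# Persistence ("same as yesterday") is AR(1) with coefficient 1.
persistence = mse(x[:-1], x[1:])
# The AR(1) forecast shrinks yesterday's value toward the mean.
ar1 = mse([phi * v for v in x[:-1]], x[1:])

print(ar1 < persistence)  # True: mean reversion beats persistence
```

For phi = 0.8 the theoretical one-step MSEs are about 1.11 for persistence versus 1.0 for the AR(1) forecast, so the gap is real but modest - which is roughly why the kid's rule is so hard to beat.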

Just showing a customer's previously viewed items accounted for 85% of the "lift" credited to the recommendations team I worked on.

What about Benjamin Franklin's moving weather?

>>The best performing variant happened to be random.

Some years ago I heard an anecdote from a developer who had worked on a video game about American football. The gist of it was that they had tested various sophisticated systems for an AI opponent to choose a possible offensive/defensive play, but the one that the players often considered the most "intelligent" was the one that simply made random decisions.

In certain domains, I think, it's quite difficult to beat the perceived performance of an AI system that merely makes committed random decisions (i.e. carried out over time) within a set of reasonable choices. If we don’t understand what an agent is doing, we often assume that there is some devious and subtle purpose behind its actions.

It's quite common in games for AI to pick a random decision. Simply put, a good AI is a character/NPC that appears to have a mind of its own, and its own life. Nothing beats random at explaining someone's behaviour based on a personal history you don't know.

If the AI responded/acted based on a predefined set of patterns that could be recognized, the player would automatically feel it (pattern matching), and that makes the NPC far less interesting.

Right - but often an NPC is merely scenery or traffic, in the sense that its behavior does not compete with your interests. What I found interesting about the football example is that the random strategy of the NPC coach both suggested a deeper intelligence AND proved to be an apparently effective opponent.

Beyond reinforcing our tendency to project, as you say, a personal history on random behavior, it also highlights what a few other people have commented: that in many non-cooperative situations a committed random strategy is extremely effective, and perhaps more effective than a biased, seemingly "rational" strategy. (For another example, I believe Henrich's "The Secret Of Our Success" discusses the possible adaptive benefits of divination as a generator for random strategies among early societies.)

It's how in Rock Paper Scissors you can probably do better against a smart opponent by playing randomly than by trying to trick them. At least they can't get in your head, because there's nothing there. You won't do better than chance, but at least you can't do much worse.

So, use a random strategy and measure the entropy of your opponent's behavior. And modify accordingly, as needed.
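A minimal sketch of that play-random-and-watch-the-entropy idea (the all-rock opponent is a contrived stand-in for a predictable player):

```python
import math
import random
from collections import Counter

MOVES = ("rock", "paper", "scissors")

def entropy(counts):
    """Shannon entropy (bits) of an empirical move distribution."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)

random.seed(1)
opponent_moves = Counter()
for _ in range(300):
    my_move = random.choice(MOVES)   # unexploitable: play uniformly at random
    their_move = "rock"              # a maximally predictable opponent
    opponent_moves[their_move] += 1

# Maximum entropy over 3 moves is log2(3) ≈ 1.585 bits; an opponent far
# below that is predictable, and worth deviating from random to exploit.
print(entropy(opponent_moves) < 1.0)  # True: all-rock has entropy 0
```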

A lot of the best ml right now is effectively about making better conditional probability distributions. You always get random output, but skewed according to the circumstances, and sharp according to confidence in the result.

I'm not sure what the term for it is, but humans have an uncanny ability to ascribe meaning to pure randomness. I'm not surprised a random AI can appear smart.

When your audio player picks these random tracks https://i.imgur.com/QRoGQRy.png (example from earlier in the day) you start believing in a Higher Power..

Unless you consider that randomness, whatever source, is some form of intelligence. (I am only half-joking here).

Humans are great at finding "patterns" in random noise.

We are very good at finding patterns where there aren't any, but it's important to remember that random is actually the best answer a lot of the time. Maybe it's as simple as random plays being the hardest to predict, and that's the best you can do until you can get an AI trained on the meta of how to plan for what the current player is thinking you'll do.

Sometimes being unpredictable makes for a good strategy

It is well known that tit-for-tat performed best in Axelrod's iterated prisoner's dilemma tournaments. See https://en.wikipedia.org/wiki/Tit_for_tat
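For reference, tit-for-tat is only a few lines. A minimal iterated prisoner's dilemma sketch with the standard payoffs:

```python
# Standard prisoner's dilemma payoffs for the row player:
# temptation 5 > reward 3 > punishment 1 > sucker 0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    return "D"

def play(strat_a, strat_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

# Tit-for-tat loses only the first round to a pure defector...
print(play(tit_for_tat, always_defect))  # (199, 204)
# ...while two tit-for-tat players cooperate for the full payoff.
print(play(tit_for_tat, tit_for_tat))    # (600, 600)
```

Note the contrast with the rock-paper-scissors point above: in a zero-sum game random is safe, but in this cooperative setting a responsive deterministic strategy does far better.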

It may sound similar but offensive vs. defensive behavior in (video) game strategy is a much different concept than cooperation in game theory.

I had a similar experience at one point. A team put energy into building a recommendation system and were able to demonstrate that the "Recommended for you" content performed better than all other editorial content. After getting challenged a bit, though, it turns out anything performs better when put under the header "Recommended for you."

Well, that’s a good lesson to learn, just put everything under “recommended for you” header! :)

You aren’t talking about dcom are you?

> put energy put energy

And they say English doesn't have reduplication!

I worked at a larger services marketplace, helping data scientists get their models into production as A/B experiments. We had an interesting and related challenge in our search ranking algorithms: we wanted to rank order results by the predicted lifetime value of establishing a relationship between searcher and each potential service provider. In our case, a 1% increase in LTV from one of these experiments would be...big. Really big.

Improving performance of these ranking models was notoriously difficult. 50% of the experiments we'd run would show no statistically significant change, or would even decrease performance. Another 40% or so would improve one funnel KPI, but decrease another, leading to no net improvement in $$. Only 10% or so of experiments would actually show a marginal improvement to cohort LTV.

I'm not sure how much of this is actually "there's very little marginal value to be gained here" versus lack of rigor and a cohesive approach to modeling. The data scientists were very good at what they do, but ownership of models frequently changed hands, and documentation and reporting about what experiments had previously been tried was almost non-existent.

All that to say, productizing ML/AI is very time- and resource-intensive, and it's not always clear why something did/didn't work. It also requires a lot of supporting infrastructure and a data platform that most startups would balk at the cost of.

If you have historical data to validate against, you can set up a leaderboard for models run against older data, and always leave part of the data out, unavailable for testing.

This encourages a simple first version and incremental complexity, rather than starting very complex six months in and never having an easy baseline to compare to. A simple baseline can spawn off several creative methods of improvement to research.

The other point is that the models should also be run against simple cases that are easy to understand and easy to confirm. That way there's always a human QA component available to make sure results are sensible.

IMH (and biased) O, a lot of great coders are implementors - let's say applied computer scientists.

That is great for building incredible open source software and a lot of other things that I would not be able to do given 1000 years. However (again IMHBO), any specific application of ML, or of statistics or mathematics in general, becomes really tricky once your use case is explicitly defined.

You then need intimate and deep knowledge of the tools you are using (e.g., should I even use a NN? Should I even use genetic algorithms? Should I even use X?), but for most people ML is shorthand for NNs and their variants, or for some other specific technique, rather than for the principle.

A well-aimed shot at PCA [1] can often solve your problem, or at least tell you what the problem looks like. This is just an example, but IMHBO people waste their time learning ML and not learning mathematics and statistics.
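As an illustration of how far basic statistics gets you, here is PCA on toy 2-D data using nothing but a covariance matrix and the closed-form eigen-decomposition of a 2x2 symmetric matrix (the data is invented; real uses would call a library):

```python
import math

# Toy 2-D data: points scattered along the line y ≈ 2x.
points = [(x, 2 * x + noise) for x, noise in
          [(0, 0.1), (1, -0.2), (2, 0.1), (3, 0.0), (4, -0.1)]]

n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n

# Covariance matrix entries.
cxx = sum((p[0] - mx) ** 2 for p in points) / n
cyy = sum((p[1] - my) ** 2 for p in points) / n
cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n

# Largest eigenvalue of the 2x2 symmetric covariance matrix, and its
# eigenvector: the first principal component.
lam = (cxx + cyy) / 2 + math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
vx, vy = cxy, lam - cxx          # (unnormalised) eigenvector
slope = vy / vx

print(round(slope, 1))  # ≈ 2.0: PCA recovers the dominant direction
```

A single chart of the projected data often answers "what does the problem look like?" before any model is trained.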

IMHBO I still think that self-driving cars can be solved by defining a list of 1000 or so rules, by hand, by humans, and by consensus. The computer vision part is the ML part.

[1] https://en.wikipedia.org/wiki/Principal_component_analysis
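The PCA pass suggested above really is only a few lines. A hedged sketch on invented toy data (numpy only; dimensions and noise levels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 5 dimensions, but the real signal lives
# along a single direction; the rest is small isotropic noise.
direction = rng.normal(size=5)
X = np.outer(rng.normal(size=200), direction) + 0.1 * rng.normal(size=(200, 5))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # variance ratio per component
```

If `explained[0]` dominates, the problem is essentially one-dimensional - which is exactly the kind of "what does the problem look like" answer worth having before reaching for anything heavier.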

I could not agree more about self driving cars. They will disrupt and cause us to actually look at our terrible transportation infrastructure, not learn to survive in it.

I wonder if this could be a case of mismatch between what the recommendations system was designed to do and what the business actually needed it to do. Your team evaluated the models based on live KPIs in an A/B testing environment, but did the recommendations team develop the system specifically with those KPIs in mind? Did they ever have access to adequate information to truly solve the problem your team needed solved? And was the same result observed for other uses of their recommendation systems?

> did the recommendations team develop the system specifically with those KPIs in mind?

Yes they did - in fact they had input on defining them and helped in tracking them.

> Did they ever have access to adequate information to truly solve the problem your team needed solved?

They believed so. Their team was also responsible for our company data warehousing so they knew even better than me what data was available. Basically any piece of data that could be available they had access to.

> And was the same result observed for other uses of their recommendation systems?

I did not have first-hand access to the results of their use in other recommendation contexts. As I mentioned in my original post I only had second-hand accounts from other teams that went the same route. They reported similar results to me.

Some ideas seem to attract smart people like moths to a flame.

It seems like everyone who joins my company to shake things up follows the same path of wanting personalized content to acquire new customers.

But in reality we just don't have enough data points on people before they become customers to segment them that way. Even if we could, being able to accurately act on those segments would be a separate problem.

Every time, I watch people go through the motions of attempting to implement this until they eventually give up.

This idea looks like an obvious win, and big companies have pulled it off before, but it is extremely hard, if not impossible, for our small company.

That's surprising to hear. Comparing model performance to a randomized baseline model is a "must-have" on my team before we feel comfortable presenting to management.

An old team I advised for a while also compared model performance to a randomised baseline model.

What they didn't seem to get, however, was that a randomised baseline model would beat another randomised baseline model in a naive comparison 50% of the time, so their understanding of randomness/statistical significance/performance metrics was way off. So while they believed they were also testing their models before presenting to management, none of them were implementing their comparisons/measurements properly, and huge parts of their work were just p-hacking and pulling random high-performing results out of the tails of the many models they built and compared.

So while it's good your team makes comparison to baselines (it's alarming how many don't even do that), my experience also suggests a huge number who think they're comparing to reasonable baselines and using metrics to measure their performance aren't actually doing so properly.

I am confused: if a new model beats a randomly selected randomised model 100% of the time in each experiment, why does it matter that randomised models beat other randomised models? Were they only comparing against a subset of the worst randomised models?

I think he's saying something like the following:

1/ the team implemented a naive baseline

2/ they implemented a more sophisticated model that depended on some parameter p

3/ for 100 different values of p, they examined its performance, and picked the model with the best performance

Now they're not quite subject to the multiple comparisons problem there, since the models with different values of p aren't independent from one another. But they're not not suffering from it either. It mostly depends on the model. But it's a very easy mistake to make. I'd say many many academic papers make the same mistake.

Short answer: if you do it right, it doesn't matter.

Long answer: I have a saying in statistics: "nature abhors two numbers: 0 and 100". In the real world, there is no 100%; you have a number of models and a (finite) number of trials/comparisons against whatever metric, and then you have to make a decision.

My point was that their "non randomised" models may in fact have the equivalent performance of a random model, and that if this was the case, you would expect them to beat a randomised comparison roughly half the time. If you have repeated trials of multiple models, the odds of one consistently beating others (even if its properties were essentially equivalent to a random model) in a small finite number of trials are much higher than most people realise. Essentially, they're flipping a large number of coins to determine their performance, and choosing the coins that consistently come up heads.
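This is easy to see in a simulation: score a pile of purely random "models" on the same finite set of trials, and the best one still looks like a winner. All numbers below are invented for illustration:

```python
import random

random.seed(42)

N_MODELS = 50    # candidate "models", every one of them a coin flip
N_TRIALS = 100   # size of the finite evaluation set

def random_model_accuracy():
    # Each "prediction" is correct with probability exactly 0.5.
    return sum(random.random() < 0.5 for _ in range(N_TRIALS)) / N_TRIALS

scores = [random_model_accuracy() for _ in range(N_MODELS)]
best = max(scores)

# Any single random model averages 0.5 accuracy, but the *maximum*
# over 50 of them usually sits well above that - and the maximum is
# the number that gets shown to management.
```

Nothing here learned anything, yet the top scorer will typically "beat random" by several points. That's the tail-picking failure mode in miniature.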

Another observation I'd make is that in the real world, random or average baselines are just about the weakest things to compare performance against. We aren't generally in a state of ignorance or randomness, but you see this kind of metric all the time, even from "respected" sources. Two if/then/else statements will generally outperform randomness in a huge number of fields and subject-matter areas.

What's not interesting is that one can build a robot that beats/meets the average human at tennis (the average human is probably incapable of serving out a single game); what's interesting is that one can build one that performs better than a relatively cheap implementation of our current state of knowledge of the game.

Moving from 2 if/then/else statements to an n-parameter complicated model that requires training data, that no one understands, and that requires huge amounts of power and time to train is not only not progress, it's actually a regression on the current state of affairs. In almost all fields, random or average is the last thing you want to compare against.

Getting a team to publish their results (after patenting) is also a good way to get them to do these sorts of things. Significance, baselines, and other things are asked for by reviewers for the better journals and conferences.

Recommender systems are notoriously hard because it's difficult to do better than just recommending the most popular content. You can recommend more personalized content at the expense of KPIs like click-through rate.

If the team can’t even beat random, then I think that says more about your team (or perhaps your features) than about ML as a whole.

Today ML can solve some problems. In the future it might solve some problems with advances in the field. Yet other problems will likely remain unsolved, such as the stock market, or the weather, or predicting /dev/rand

"Up Next" problem can easily fall into any of the three buckets.

YouTube's "Up Next" recommendations do (significantly) better than random, therefore "Today ML can solve some problems".

>YouTube's "Up Next" recommendations do (significantly) better than random, therefore "Today ML can solve some problems".

IMO the YT AI is the opposite of intelligent: it still recommends things I disliked. For some reason, this basic rule of not showing something that I explicitly disliked was too hard for it to learn. I wonder whether there is truly AI behind it or just statistics.

If the goal is simply to maximize engagement, there is no hard requirement that the algorithm should never show you things that you dislike. Essentially, what I am saying is that your belief of what their objective function is may be different from their actual objective function, and that is in no way an indication that their model is a failure.

Isn't AI statistics?

Modern AI/ML is more like: we throw a lot of data at it and generate a model; we know that it works, but we have no idea how or why.

It's intentional. Controversy is a strong signal for the youtube algorithm.

I don't think so; the videos I was referring to were music videos. I did engage with it, in a way, by hitting dislike on a music genre I don't want to listen to (I normally don't dislike things just because they're not my genre) in the hope the algorithm would learn. But it was even worse: it did not learn that I disliked artist X and genre Y, and it continued to play the exact video that I disliked.

A bad algorithm will force unhappy users back to manually created playlists, leaving fewer people engaging with the algorithm, and probably making the algorithm worse over time as more users avoid it.

And even more interesting: trying to google "how to make youtube not show X" is a complete fail; it will just show you YouTube video results.

I'm pretty sure it does better than 'next video = random(from all of YouTube)' would, but would it be much better than 'next video = random(videos with the same subject or tags as the one playing now)'?

Yes. They've published quite a number of papers on their recommendation algorithms. You can take a look and decide for yourself whether it's snake oil or not.

Or not enough communication with the team, discussing what the objectives are, providing them with good, enough, relevant data to work with etc. I guess in many cases a data science team is expected to just "do their magic", build the AI and then come back and meanwhile not bother anyone else. In other cases, nobody really cares anyway, they just want the buzzword label to be includable in the brochures.

A relevant Twitter thread: https://twitter.com/NeuroStats/status/1192679554306887681

At the risk of projecting, this has the hallmark of bad experimental design. The best experiments are designed to determine which of many theories better account for what we observe.

(When I write "you" or "your" below, I don't mean YOU specifically, but anyone designing the kind of experiment you describe.)

One model of gravity says the position/time curve of a ball dropped from a height should look like X. Another model of gravity says it should look like Y.

You drop many balls, plot their position/time, and see which of the two models' curves match what you observe. The goal isn't to get the curve; the goal is to decide which model is a better picture of our universe. If the plotted curve looks kinda-sorta like X but NOTHING like Y, you've at least learned that Y is not a good model.

What models/theories of customer behavior were your experiments designed to distinguish between? My guess is "none" because someone thinking about the problem scientifically would start with a single experiment whose results are maximally dispositive and go from there. They wouldn't spend a bunch of time up-front designing 12 distinct experiments.

So it wasn't really an experiment in the scientific sense, but rather a kind of random optimization exercise: do 12 somewhat-less-than-random things and see which, if any, improve the metrics we care about.

Random observations aren't bad, but you'd do them when you're trying to build a model, not when you're trying to determine to what extent a model corresponds with reality.

For example, are there any dimensions along which the 12 variants ARE distinguishable from one another? That might point the way to learning something interesting and actionable about your customers.

Did the team treat the random algorithm as the control? Well, if you believe some of your customers are engaged by novelty then maybe random is maximally novel (or at least equivalently novel), and so it's not really a control.

What about negative experiments, i.e., recommendations your current model would predict have a NEGATIVE impact on your KPIs? If those experiments DON'T produce a negative impact then you've learned that some combination of the following is the case:

   1. The current customer model is inaccurate
   2. The model is accurate but the KPIs don't measure what you believe they do (test validity)
   3. The KPIs measure what you believe they do but the instrumentation is broken
Some examples of NEGATIVE experiments:

What if you always recommend a video that consists of nothing but 90 minutes of static?

What if you always recommend the video a user just watched?

What if you recommend the Nth prior video a user watched, creating a recommendation cycle?

Imagine if THOSE experiments didn't impact the KPIs, either. In that universe, you'd expect the outcome you observed with your 12 ML experiments.

In fact, after observing 12 distinct ML models give indistinguishable results, I'd be seriously wondering whether my analytics infrastructure was broken and/or whether the KPIs measured what we thought they did.

This is a very good comment. Is this line of reasoning fleshed out and written up somewhere so I could point people to it? (Also, I would like to think more deeply about its implications)

> What models/theories of customer behavior were your experiments designed to distinguish between? My guess is "none" because someone thinking about the problem scientifically would start with a single experiment whose results are maximally dispositive and go from there.

This is how science is (at least, ought to be) done. This way, the goal is to always be improving your understanding of objective reality.

> They wouldn't spend a bunch of time up-front designing 12 distinct experiments. [...] So it wasn't really an experiment in the scientific sense, but rather a kind of random optimization exercise: do 12 somewhat-less-than-random things and see which, if any, improve the metrics we care about.

The problem is that a lot of AI salesmen tend to hype the "model-free" nature of "predictive" AI towards optimizing outcomes/goals, and people who don't know better get carried away with the bandwagon. Overly business-oriented people are susceptible to the ostrich mentality of not wanting to understand problems with bad tools -- they are too focused on the possibility of optimizing money-making. I find the movie "The big short" to be a fantastic illustration of this psychology.

It's probably going to lead to a very bad hangover, but for the moment the party's still going on and nobody likes the punch bowl being yanked away.

Nope, I just typed the above off-the-cuff. I could tweet storm it. Would that be useful?

Personal recommendation systems all have tradeoffs. It's just the nature of curation as an intangible endeavor. You can love "Scarface", "Heat" and "LA Confidential" but still find "Casino" boring ;)

More on such tradeoffs in a recent case study from DeepMind on Google Play Store app recommendations. Even they acknowledge that the same techniques that surface 30% cost efficiencies in data center cooling may not be completely applicable to "taste".


Isn't sparse recommendation for videos kind of solved by the Netflix Prize, where the winner used SVD to extract signature characteristics and recommend videos based on them?

There are a lot of ways of formalizing the problem of recommendation. Perhaps the variant of the problem used by Netflix is "solved", but it's kind of an odd one. Basically, they built a system to answer questions of the following form: "Given that user X watched media Y, what rating would they give it?" They trained and tested on media that users have already rated. Some of the ratings are masked and thus need to be "predicted" for the test.

The issue is that the Netflix dataset has a baked-in assumption that a recommender system should show media that a user is likely to have ranked highly. It may be more important to show the user media they wouldn't have found (and thus ranked) at all. Or perhaps a user will be more engaged with something controversial rather than generically acceptable. Who knows?
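For intuition, the rating-prediction formulation can be sketched with a truncated SVD on a dense toy ratings matrix. The real problem is sparse, and the prize-winning systems were far more elaborate than a vanilla SVD, so treat this purely as illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy low-rank ratings: 30 users x 20 items with rank-2 "taste"
# structure plus a little noise.
users = rng.normal(size=(30, 2))
items = rng.normal(size=(20, 2))
ratings = users @ items.T + 0.05 * rng.normal(size=(30, 20))

# Rank-2 truncated SVD recovers the underlying structure.
U, S, Vt = np.linalg.svd(ratings, full_matrices=False)
rank2 = (U[:, :2] * S[:2]) @ Vt[:2, :]

# Residual should be near the injected noise level.
rmse = np.sqrt(np.mean((rank2 - ratings) ** 2))
```

The low-rank reconstruction fills in every user-item cell, which is exactly the "what rating would they give it" framing - and exactly why it dodges the harder question of what a user would never have found at all.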

Probably not. I say this because for many months I would visit Netflix and not want to watch anything. Eventually I cancelled my subscription after many years.

I think I'd rather have a random collection of titles than a recommended list for me.

Just because the medium is the same doesn't mean the customer wants the same types of recommendations in two different contexts.

I feel like if the Netflix Prize results had truly solved the problem, then they’d still be using them. It seems like the video recommendations aren’t as good as they were ten years ago during the prize competition, and they’re no longer based on what I might like but rather what Netflix wants me to watch.

Interestingly, Steam's recommendation algorithm to show "similar to a game" by their 'learning machines' works very well. I have found really good games via that, and none of them did show up in the regular recommendations/carousel on top.

Pretty sure it just uses the tags players give it: strategy, sci-fi, open world, etc.

Excuse me for butting into a dead conversation, but your team applied ML to an inapplicable case (at least they acquired experience).

Up-next recommendations won't work without advanced image recognition and topic gathering - basically, titles/tags for most videos are garbage and clickbait, and most of YouTube works by watching buzzed videos: some well-known (by a big amount of watchers) "influencers" push a video on some topic (thing/brand), then it gets traction from other content creators - they produce videos about it, and watchers tend to stick to buzzed topics. It's like news about news.

If your team used ML to recommend up-next videos on your own video hosting, your result simply means your videos are equally off-topic or uninteresting to your service's audience; or they are garbage.

Doesn’t it depend on the size and variety of your dataset?

I can see “random” performing well in a set of <1000 videos, all on similar subject spaces (eg “memes”, or “python”), but recommending relevant stuff gets much harder as the amount of content grows...

I feel like A/B testing isn’t a great way to determine correctness either.

IMO even in interface designing you should be arguing from first principles rather than relying on telemetry and other empirical data.

Over the years my heuristic has turned into: "Did the team formulate their problem as a supervised learning problem?" - If not it's probably BS.

In longform if anyone is interested https://medium.com/@marksaroufim/can-deep-learning-solve-my-...

EDIT: I would consider autoencoders, word2vec, Reinforcement Learning examples of turning a different problem into a supervised learning problem

EDIT 2: Social functions like happiness, emotion and fairness are difficult to state - you can't have a supervised learning problem without a loss function

You miss the point of the slides. His point isn't about supervised vs unsupervised, it's about the general areas where AI seems to excel and fail. It's being used to predict social outcomes where it does very poorly, may be inscrutable, and is unaccountable to the public.

Your examples (deep learning applied to perception) are what he argues AI is generally good for.

Auto-encoders have been more successful in fraud and anomaly detection than supervised methods. For the uninitiated: the basic concept is to reduce the feature space (i.e. the things you know) to a lower-dimensional space, then decode back into the original space. When enough differences arise between the original and reconstructed variables, the event may be flagged for a human to review (or some triage process).
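As a sketch of that encode/decode/flag loop, here's a linear "autoencoder" (a PCA projection standing in for a neural encoder/decoder) flagging off-manifold events on made-up data:

```python
import numpy as np

rng = np.random.default_rng(7)

# "Normal" events lie near a 2-D plane inside a 10-D feature space.
W = rng.normal(size=(10, 2))
normal = rng.normal(size=(500, 2)) @ W.T + 0.1 * rng.normal(size=(500, 10))
anomalies = 3.0 * rng.normal(size=(5, 10))   # off-manifold events

# "Encoder/decoder": project onto the top-2 principal components.
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
basis = Vt[:2]

def reconstruction_error(x):
    z = (x - mean) @ basis.T        # encode to 2-D
    xhat = z @ basis + mean         # decode back to 10-D
    return np.linalg.norm(x - xhat, axis=-1)

# Flag anything reconstructing worse than 99% of normal traffic.
threshold = np.percentile(reconstruction_error(normal), 99)
flags = reconstruction_error(anomalies) > threshold
```

A real fraud system would use a nonlinear neural autoencoder and far messier features, but the triage logic - reconstruct, measure the difference, flag outliers for a human - is the same.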

I wonder if a similar approach can be used for a classification task where one or more classes have only few training examples (those would be similar to "anomalies", I suppose).

It's hard to verbalize this, most of it is "intuition" but I think it boils down to "supervised learning is BS."

Humans are smarter than computers. How can a human teach a computer how to do something when the human itself can't teach another human that something?

We haven't solved that problem. The snake is eating its tail.

You can't teach a human how to do something when the methodology to do that is the student trying something and the teacher saying "Yes" or "No".

Well.... why? Why is it yes or why is it no? What is the difference between what the human or the computer, or in general, the student, did and what is good or correct? And then you still have to define "good" and many times that means waiting, in the case of the PDF linked to above, perhaps many years to determine if the employee the AI picked, turned out to be a good employee or not.

And how do you determine that? How do you know if an employee is good or not? We haven't even figured that out yet.

How can we create an AI to pick good employees if human beings don't know how to do that?

Supervised learning isn't going to solve any problem, if that problem isn't solved or perhaps even solvable at all.

In other words, over the years, my heuristic has turned into, "Has a human being solved this problem?" If not, then AI software that claims to is BS.

Supervised learning in machine learning is nothing remotely like a human teaching anyone anything. It's a very clear mathematical formulation of what the objective is and how the algorithm can improve itself against that objective.

The closest analogy for humans would be to define a metric and ask a human to figure out how to maximize that metric. That's something we're often pretty good at doing, often in ways that the person defining the metric didn't actually want us to use.

> Supervised learning in machine learning is nothing remotely like a human teaching anyone anything.

I disagree, I think it's exactly the same. As an example, a human teaching a human how to use an orbital sander to smooth out the rough grain of a piece of wood.

The teacher sees the student bearing down really hard with the sander and hears the RPM's of the sander declining as measured by the frequency of the sound.

The teacher would help the student improve by saying, "Decrease pressure such that you maximize the RPM's of the sander. Let the velocity of the sander do the work, not the pressure from your hand."

That's a good application of supervised learning. Hiring the right candidate for your company is not.

But that's not at all how "supervised learning" works. You would do something like take a thousand sanded pieces of wood, with columns of attributes for the sanding parameters that were used, and have a human label the wood pieces that meet the spec. Then you solve for the parameters that were likely to generate those acceptable results. ML is brute force compared with the heuristics that human learning can apply. And ML never* gives you results that can be generalized with simple rules.

* excepting some classes of expert systems

One of the columns of sanding parameters is the sound of the sander.

Machine learning really has almost nothing in common with most types of human learning. The only type of learning that has similarities is associative learning (think Pavlov's dog studies).

The human learning situation you describe works quite differently, though: The student sees either the device alone or the teacher using the device to demonstrate its functionality. This is the moment most of the actual learning happens: The student creates internal concepts of the device and its interactions with the surroundings. As a result the student can immediately use the device more or less correctly. What's left is just some finetuning of parameters like movement vectors, movement speed, applied pressure etc.

If the student worked like ML, she would hold the device in random ways: by the cord, the disc, the actual grip. After a bunch of right/wrong responses she would settle on using the grip mostly. Then (or in parallel) the student would try out random surfaces to use the device on: her own hand (wrong), the face of the teacher (wrong), the wall (wrong), the wood (right), the table (wrong) etc. After a bunch of retries she would settle on using the device on the wood mostly.

It's easy to overlook the actual cognitive accomplishments of us humans in menial tasks like this one because most of it happens unconsciously. It's not the "I" that is creating the cognitive concepts.

That is such a horrible metaphor

> You can't teach a human how to do something when the methodology to do that is the student trying something and the teacher saying "Yes" or "No".

Strangely, I recently had to complete a cognitive test that was essentially that process. I was given a series of pages, each of which had a number of shapes and a multiple choice answer. I was told whether I chose the correct answer, then the page was flipped to the next problem. The heuristic for the correct answer was changed at intervals during the test, without any warning from the tester. I'm told I did OK.

You're touching on the "difficulty" in verbalizing it. I see what you mean, because you did learn that the heuristic was changing from just a yes or no. I said you can't teach that way, but you clearly learned that way, so I wasn't exactly correct - though I still don't think I'm practically wrong.

I wonder how an AI would perform on the same test.

What is the mathematical minimum number of questions on such a test, subsequent to the heuristic change, that could guarantee that new heuristic has been learned?

I'm curious about the test. Did it have a name? What were they testing you for?

> I wonder, how would an AI perform on the same test.

This situation is called a multi-armed bandit. In this setup you have a number of actions at your disposal and need to maximise rewards by selecting the most efficient actions. But the results are stochastic and the player doesn't know which action is best. They need to 'spend' some time trying out various actions but then focus on those that work better. In a variant of this problem, the rewards associated with actions also change over time. It's a very well-studied problem, a form of simple reinforcement learning.
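A minimal epsilon-greedy player for that setup (arm probabilities, epsilon, and the pull count are all invented for illustration):

```python
import random

random.seed(0)

# Three arms with hidden success probabilities; the player must
# discover which is best purely from observed rewards.
TRUE_P = [0.2, 0.5, 0.8]
EPSILON = 0.1   # fraction of pulls spent exploring at random

counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]   # running mean reward per arm

for _ in range(5000):
    if random.random() < EPSILON:
        arm = random.randrange(3)          # explore
    else:
        arm = values.index(max(values))    # exploit current best
    reward = 1.0 if random.random() < TRUE_P[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

best_arm = values.index(max(values))
```

After enough pulls, the estimates concentrate on the true probabilities and the exploit branch homes in on the best arm - the same explore/exploit tension the test subject faces when the heuristic silently changes.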

If the rewards are changing, then isn't it a moving-target problem?

Doesn’t it depend on what you mean by guarantee? The test can’t get 100% certainty, since theoretically you could be flipping a coin each time and miraculously getting it right, for 1000 times in a row. The chance of that is minuscule (1/2^1000), but it’s nonzero. So we’d have to define a cutoff point for guaranteed. The one used generally in many sciences is 1/20 chance (p = 0.05), so that seems like a plausible one, and with that cutoff, I think you’d need five questions passed in a row (1/2^5 = 1/32). Generally, if you want a chance of p, you need log2(1/p) questions in a row passed correctly. However, that only works if your only options are random guessing and having learned the heuristic. If you sorta know the heuristic (eg. right 2/3 of the time), then you’d get the 5 questions right ~13% ((2/3)^5) of the time, which isn’t inside the p = 0.05 range. So you also need to define a range around your heuristic, like knowing it X of the time. Then you’d need log(1/p)/log(1/X) questions. For example, if you wanted to be the same as the heuristic 19/20 times and you wanted to pass the p = 0.05 threshold, you’d need log(1/0.05)/log(1/(19/20)) ~= 59 questions.
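The arithmetic above can be wrapped in a tiny helper (the 0.05 cutoff and the 19/20 rate are just the worked numbers from the comment):

```python
import math

def questions_needed(p_cutoff, guess_rate):
    """Consecutive correct answers needed before the chance that a
    guesser (right with probability guess_rate per question) passes
    them all drops below p_cutoff."""
    return math.ceil(math.log(1 / p_cutoff) / math.log(1 / guess_rate))

fair_coin = questions_needed(0.05, 0.5)        # pure 50/50 guessing
near_expert = questions_needed(0.05, 19 / 20)  # right 19 times in 20
```

The fair-coin case needs 5 questions and the 19/20 case needs 59, matching the figures worked out above.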

There were more than two possible answers to choose from on each page, so the odds of being right were considerably lower.

I'm sure the test was a standard with a name, but I was never told. It was a small part of a 3 hour ordeal, evaluating my healing progress since suffering a brain injury in March.

I would agree that it's a very inefficient way of teaching something. It gave me an unexpected insight into machine learning though.

I'm sure the test was designed so that picking the same answer each time or picking one at random would result in a fail.

Sounds like Raven's progressive matrices.

Similar but not the same.

Well... why is it necessary that we can teach a human to do something in order to teach a machine to do it?

Teaching a human is a heuristic for understanding the problem well enough to teach a machine.

I agree and rather than post a sibling response, I'll add that I think it's necessary today, simply because we don't have AGI, yet. And also point out that we are talking about determining if AI is snake oil or not. There may be some scenarios where we can teach a computer to do something we can't teach a human to do, I can't think of any off the top of my head, but if we can't, then I'm going to be super doubtful that an AI software can do it better than a human, if at all.

AGI, in the singularity sense, will be solving problems before we even identify them as problems. Experts in a field can do this for the layman already and I think it's possible. Some don't. I do.

It'll be super interesting when it flips! When the student becomes the master and we, as a species, start learning from the computer. You can kind of get a sense of this from the DeepMind founder's presentation on their AI learning how to play the old Atari game Breakout. He says that when their engineers watched the computer play the game, it had developed techniques the engineers who wrote the program hadn't even thought of.

Even still, the engineers could teach another human how to play Breakout, so yes, I do believe they did in fact create a software to play Breakout better than they could.

Same for AlphaGo, but it only works when you have access to cheap simulation (breakout being a game easy to run, Go being just a board). It doesn't work as well in situations where you don't see the full state of the system, or where there is randomness.

AlphaStar, from the same lab, did pretty well at StarCraft 2, even though it is still pretty far from the best players in the world.

This simply isn't true. We know your intuition here is mistaken, as we have plenty of counter-examples.

The best chess AIs can beat any human chess player. They use techniques that were never taught to them by a human.

Another example: a machine-learning-driven computer-vision system predicting the sex of a person based on an image of their iris. No human can do this. [0]

[0] Learning to predict gender from iris images (PDF) https://www3.nd.edu/~nchawla/papers/BTAS07.pdf

I don't follow that. The recidivism predictor was supervised. Conversely, AlphaZero is unsupervised and certainly not BS.

AlphaZero is not unsupervised. It is a reinforcement learning algorithm, it knows exactly what the outcome of the game is.

The terms "supervised machine learning" and "unsupervised machine learning", by their ordinary English meaning, make it sound like all machine learning is partitioned into one or the other. But a lot of the literature in machine learning considers reinforcement learning to be neither 'supervised learning' nor 'unsupervised learning'. See, e.g., section 1.1 of [1].

[1] Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, second edition. MIT press, 2018.

I basically agree with this rule. I find that my colleagues who overly hype unsupervised approaches typically don't have much experience working on ML problems without labeled data. My suspicion of this comes from the fact that whenever I give a talk on ML I always have a wealth of personal experience to draw on for examples. My colleagues almost always reuse slides from projects they never worked on.

I'm a little surprised to see this sentiment. Some of the most important advances in the field have been unsupervised tasks:

- OpenAI: Dota 2 (PPO), GPT-2...

- NVidia: StyleGAN, BigGAN, ProGAN...

Those are certainly important advances, but they don't really apply to most business needs for AI or ML.

I work in the industry on NLP tasks. Unsupervised learning has been behind the largest developments in the last decade in the field.

I don't disagree with your point, but the unsupervised aspect of NLP typically isn't useful on its own. Usually it's a form of pre-training to help supervised models perform better with less data.

From Google in 2018:

"One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch."

As I said, I'm an NLP researcher and practitioner, so you don't need to quote this at me.

The unsupervised aspect is the engine driving all modern NLP advancements. Your comment suggests that it is incidental, which is far from the case. Yes, it is often ultimately then used for a downstream supervised task, but it wouldn't work at all without unsupervised training.

Indeed, one of the biggest applications of deep NLP in recent times, machine translation, is (somewhat arguably) entirely unsupervised.
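To make the pre-train-then-fine-tune pattern concrete, here is a deliberately tiny sketch: phase one builds word representations from raw unlabeled text (co-occurrence counts standing in for real learned embeddings), phase two needs only a couple of labels on top. All data and names are invented for illustration:

```python
from collections import Counter

# Phase 1 (unsupervised): learn word representations from raw, unlabeled
# text. Real systems learn dense embeddings; co-occurrence counts stand in.
def pretrain(tokens):
    vecs = {}
    for a, b in zip(tokens, tokens[1:]):
        vecs.setdefault(a, Counter())[b] += 1
        vecs.setdefault(b, Counter())[a] += 1
    return vecs

def similarity(vecs, w1, w2):
    v1, v2 = vecs.get(w1, Counter()), vecs.get(w2, Counter())
    return sum(v1[k] * v2[k] for k in v1)   # dot product of count vectors

# Phase 2 (supervised): only a handful of labeled words are needed, because
# the unlabeled corpus already did most of the work.
def classify(vecs, word, labeled):
    best = max(labeled, key=lambda w: similarity(vecs, word, w))
    return labeled[best]

corpus = ("good great wonderful great good wonderful "
          "bad awful terrible awful bad terrible").split()
vecs = pretrain(corpus)
```

With just two labeled words ({"good": "pos", "bad": "neg"}), unseen words like "wonderful" inherit a sensible label purely because the unsupervised phase placed them near their neighbours.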

I didn't mean to make it sound incidental although I do see your point. Just wanted to chime in with how important having a labeled dataset is for a successful ML project.

I think the point is labeling itself is very difficult except for special and limited domains. Manually constructed labels, like feature engineering, are not robust and do not advance the field in general.

That makes sense. I'm coming from the angle of applied ML where solutions need to solve a business problem rather than advance the field of ML. In consulting many problems can't be solved well without a labeled dataset and in lieu of one, less credible data scientists will claim they can solve it in an unsupervised manner.

For sure. There are counter-examples however - fully unsupervised machine translation for resource poor languages comes to mind and is increasingly getting business applications.

I think that in the future, more and more clever unsupervised approaches will be the path forward in huge AI advances. We've essentially run out of labeled data for a large variety of tasks.

Echo the other commentator. Unsupervised techniques are the only reason NLP works as well as it does.

I would argue that GANs by definition aren't unsupervised; they just aren't supervised by humans. Additionally, OpenAI's game stuff has similar arguments against it.

> I would argue that GAN's by definition aren't unsupervised

You can define the terms how you want - but in terms of how they're understood in both industry and academia, you are incorrect.

The discriminator definitely is supervised but the generator is unsupervised. I.e., it has no labels on its targets.

I'm not sure that's correct. The discriminator and the generator both learn to match a training set. You don't need to label the training set at all. You can just throw 70,000 aligned photos at it.

I think I see what you're saying, but that might be a different definition of "supervised". It seems impossible for one half of the same algorithm to be supervised and the other to be unsupervised. But I like your definition (if it was renamed to something else) because you're right that the discriminator is the only thing that pays attention to the training data, whereas the generator does not.
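One way to square the two views: the discriminator is trained with labels, but those labels are manufactured by the training loop itself (real = 1, fake = 0), never by an annotator, which is why the whole system is conventionally filed under unsupervised. A sketch of just that labeling logic (names illustrative):

```python
# Build one discriminator training batch for a GAN-style loop.
# No human labels anything: the loop manufactures its own targets.
def make_discriminator_batch(real_samples, generator, n_fake):
    fake_samples = [generator() for _ in range(n_fake)]
    inputs = real_samples + fake_samples
    targets = [1] * len(real_samples) + [0] * len(fake_samples)  # real=1, fake=0
    return inputs, targets
```

The 70,000 aligned photos mentioned above would be `real_samples`; their only "label" is the implicit fact of being in the training set at all.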

There's a lot of gray area between unsupervised and supervised learning. For example self-supervised learning: https://www.facebook.com/722677142/posts/10155934004262143/

Ironically, the algorithm you pose in that comment is itself a BS algorithm.

"Formulate the problem as X" - what is your input for how a problem is formulated? That you personally like how it was formulated?

"Probably," - OK, so you assign probability scores? Or do you mean, "likelihood based upon my guess?"

Finally, how do you measure performance? Your own assessment of how good you were at it?

The author says "AI is already at or beyond human accuracy in all the tasks on this slide and is continuing to get better rapidly" and one of his examples is "Medical diagnosis from scans". That is an example of precisely the sort of snake oil hype he's berating in the social prediction category.

In an extremely narrow sense of pattern recognition of some "image features", i.e. 5% of what a radiologist actually does, he's probably right. But context is the other 95%, and AI is nowhere close to being able to approach expert accuracy in that. It's a goal as far away from reality as AGI.

"AI" tools will probably improve the productivity of radiologists, and there are statistical learning tools that already kind of do that (usually not actually widely used in medical practice, you can say yet, I can say who knows but nice prototype). But actual diagnosis, like the part where an MD makes a judgement call and the part which malpractice insurance is for? Not in any of our lifetimes.

A radiologist friend complains that it's been 10+ years since they've been using speech recognition instead of a human transcriptionist, and all the systems out there are still really bad. Recognizing medical lingo is something you can probably achieve with more training data, but the software that sometimes drops "not" from a scan report is a cost-cutting measure, not a productivity tool. It makes the radiologist worse off because he's got to waste his time proofreading the hell out of it, but the hospital saves money.

Author here. I appreciate your criticism. What I had in mind was more along the lines of Google's claims around diabetic retinopathy. I received feedback very similar to yours, i.e. that those claims are based on an extremely narrow problem formulation: https://twitter.com/MaxALittle/status/1196957870853627904

I will correct this in future versions of the talk and paper.

Then I shall write to you directly. I don't know how you can make the claim that automated essay grading is anything but a shockingly mendacious academic abuse of students' time and brainpower. To me, this seems far worse than job applicant filtering, firstly because hiring is fundamentally predictive, and secondly because many jobs have a component of legitimately rigid qualifications. An essay is a tool to affect the thoughts of a human. It is not predictive of some hidden factor; it stands alone. It must be original to have value; a learned pattern of ideas is the anti-pattern for novelty. If the grading of an essay can be, in any way, assisted by an algorithm, it is probably not worth human effort to produce. If you personally use essay grading software, or know of anybody at Princeton that does, you have an absolute obligation to disclose this to all of your students and prospective applicants. They are paying for humans to help them become better humans.

Thanks for the .pdf and the research in general, great stuff!

One thing I'd love is a look at 'noise' in these systems, specifically injecting noise into them. Addons like Noiszy [0] and trackmenot [1] claim to help, but I'd imagine that doing so with your GPS location is a bit tougher. I'd love to know more about such tactics, as it seems that opting out of tracking isn't super feasible anymore (despite the effectiveness of the tracking).

Again, great work, please keep it up!

[0] https://noiszy.com/

[1] https://trackmenot.io/

FYI, I too looked at his categorisation of what was snake oil, what was advancing, and what was not, and didn't exactly agree with all his decisions either.

Medical imaging diagnosis was one of them.

Speech recognition/transcription was another. I don't know if it's my accent or my speech patterns (though foreigners regularly compliment my wife and myself on our pronunciation), but the tech hasn't gotten noticeably better for me since the days of Dragon NaturallySpeaking, and that was, what...10 years ago?

Sure, I can "hey Google/siri/alexa" a handful of predefined commands, but I still have to talk in a sort of staccato "I am talking to a computer" voice, it still only gets it right 90% of the time, and God help you if you try anything new/natural not in the form of "writing Boolean logic programs with my voice".

I feel like it’s gotten loads better. You can watch as the voice recognition on your phone changes the words it recognizes to match the context in the sentence. Sometimes it gets a word wrong and fixes it after a half second. Google translate does magical things recognizing common phrasing constructions and bad accents, stuff that Dragon could never do. I built a lipsync pipeline for video games based on Dragon a decade ago, and it most definitely was not as good as what I have on my phone today.

Again, just anecdotally, I don't know if it's just me, but most of my experience is of google/apple translate and auto-corrects changing the word I want into a wrong one.

It is one of my most frustrating everyday software experiences.

Not only is it not getting better, it's actually getting worse, because before I at least had the correct sentence. Now my correct sentence is mangled as it tries to force corrections/substitutions, and I have to continually go back and manually correct the auto-correct.

It seems to work for me on short pre-formed sentences and toy examples (if you communicate in pre-formed phrases and use well-worn cliches in your writing, it seems to pick up on and predict them). I wonder whether the "increased accuracy" of modern solutions isn't just, functionally, a larger library of lookup rules: stored common/popular phrases and direct translations effectively mined from the training data (a huge part of practical 'AI' advancement has been in scaling infrastructure and collecting new scales of data rather than in the AI techniques themselves, IMO). The moment I try to write or dictate anything new, original, or lengthy, it goes absolutely pear-shaped.

That does sound frustrating. I didn’t mean to discount your experience, it certainly is possible that the additional AI has made it worse for some people, especially if there’s an accent involved. Not to mention that the meme of autocorrect mistakes when texting somewhat backs up your experience on a larger scale. I wonder if the scale and complexity of what they’re doing now compared to last decade is the cause of the regressions, like is the problem being solved much harder by trying to factor in and autocorrect based on context, and causing worse results than pure phoneme detection?

My brush with AI snake oil:

I interviewed at a startup that seemed fishy. They offer a fully AI powered customer service chat as an off the shelf black box to banks. I highly suspect that they were a pseudo AI setup. LinkedIn shows that they are light on developers but very heavy on “trainers”, probably the people who actually handle the customers, mostly young graduates in unrelated fields, who may believe that their interactions will be the necessary data to build a real AI.

I doubt that AI will ever be built, it's just a glorified Mechanical Turk help-desk. I guess the banks will keep it going as long as they see near human level outputs.

Very common AI startup. They use some pre-made AI tools (e.g. a Watson bot) and resell them to specific industries where they already have the "intent trees" made (common questions/actions the user wants). The trainers are nothing more than analysts who identify an intent not listed on the tree and configure it there. The devs are probably API and frontend devs; not much AI stuff going on.

I don't think that is in principle problematic (unlike the social problem statements pointed out in the talk). A system which amplifies human resources by filling in for their common activities over time could use sophisticated tech drawing on the latest in NLP. The metric would be a ratio of the number of service requests they handle per day / the number of "trainers" (or whatever name given) compared to the median for a purely human endeavour where every service request is handled by a customer-visible human.

In the Mechanical Turk analogy there is no such capability amplification happening.

My experience with automated "help" desks is that I have to let the automatons fail one after the other until I finally get connected to a real human. Then I can start to state my problem. All that those automated substitutes do is discourage customers from calling at all.

I have a feeling I know _exactly_ which company you're talking about...so it's either just that obvious, or there's more than one of these, or both!

It seems to be the latter. In fact, the trick (should I say, fraud?) is so common that there were even several articles about it in the press over the past two or three years. Even the famous x.ai had (and I guess still has) humans doing the work.

I was going to say the same thing!

A couple of weeks ago such a startup based in London contacted me on LinkedIn - the product really hyped AI, but it all seemed very dubious. My guess was it was really a mix of a simple chatbot with a Mechanical Turk-style second line.

I guess the idea would be to get a few contracts, pull down some money from those and then go bust as the costs of the Mechanical Turk become evident?

> go bust as the costs of the Mechanical Turk become evident

I'm afraid you have misspelled "raise a humongous round from SoftBank". It's an easy typo to make, don't feel bad.

Hugh mongous what?

My company is sourcing AI from MTurk. It's actually cheaper than running fat GPU model training instances. The network learns fast and adapts well to changes in inputs.

I envision the sticker "human inside" strapped on our algorithms.

You should emphasize that this is Organic AI. It's low carbon and overall greener.

Or keep calling it AI, and concede that AI stands for "actual intelligence" if someone asks you directly.

AI now is like Cyber was in the 1990s: it seems to be nothing but a buzzword for many organizations to throw around.

The term AI is used as if humanity has now figured out general AI, or artificial general intelligence (AGI). It's quite obvious organizations and people use the term AI to fool the less tech-inclined into thinking it's AGI, a real thinking machine.

Remember 5-ish years ago when IBM's marketing department was hawking their machine learning and information retrieval products as AI, and everyone in the world rolled their eyes so hard we had to add another leap second to the calendar to account for the resulting change in the earth's rotation?

I suppose their only real sin was business suits. Everything seems more credible if you say it while wearing a hoodie.

Cyber still happens to be a buzzword; it just shifted meaning to the defense sector.

It has been like that for quite a while; the field has ups and downs, with the 90s marking a winter season for AI. Now the hype machine is at full steam again, until people find out once more that a lot of it is marketing BS to get funding. Then the research that is worthwhile gets mixed up with the rest, falls into a lack of funding, and in 30 years or so it's back to full hype again.

Call it Al, and only hire Alans, Alberts, Alphonses, Alis, etc

Each unit uses about 100W continuously and emits about 1kg of CO2 per day before adding impact of supporting infrastructure.

These things better be smart, because they are not low-footprint.
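For what it's worth, those figures are roughly self-consistent if the "unit" is a human worker; a quick back-of-envelope check (assuming a ~2,000 kcal/day diet, all of it metabolized):

```python
# Metabolic power of a human "compute unit".
KCAL_TO_JOULES = 4184
SECONDS_PER_DAY = 24 * 60 * 60

daily_intake_kcal = 2000  # rough assumption for an adult
power_watts = daily_intake_kcal * KCAL_TO_JOULES / SECONDS_PER_DAY  # ~97 W

# Exhaled CO2 is commonly cited at roughly 1 kg/day for an adult,
# matching the figure above: food carbon in, CO2 out.
```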

We just have to hope our robot overlords will not be overly environmentally conscious

20+ years ago I used to refer to this as artificial artificial intelligence (AAI), specifically as part of a pitch to MGM for an MMORPG to run their non-player characters. Not surprisingly, it didn't catch on...

That was MTurk's slogan on launch.

Totally should have trademarked it along with "IGlasses" in the very same pitch. The pitch was apparently rejected because our level design ideas were better than the actual episodes of the show upon which it was based: "Poltergeist: The Legacy."

You mean high carbon, low silicon? Because humans usually have a higher carbon footprint than computers, it takes a lot of computers to match one human. Plus we're made of carbon.

I'm not sure, if you factor in the CO2 footprint of computer manufacture, and the fact that AI needs powerful computers and networking to be delivered. Our body carbon is almost 100% recyclable.

If only the carbon footprint of a human were just the body carbon.

Modern humans have a very heavy carbon footprint, especially in the US. Think of all the things you do and consume and all the carbon involved all thorough the chain. It's a big number. Computers are extremely efficient compared to that.

People you hire on MTurk don't generally work in the US, and have a small fraction of the carbon footprint of an average American.

Aren't most of those arguments here moot, since they also need a computer (+ networking) to be available on MTurk?

computer < human + computer

Of course not every computer has the same specs and footprint, but they should be in roughly the same ballpark.

A typical computer used by a typical person has only a fraction of the energy use of the GPU farms utilized to train DNN models. We're not at the point where you can pick an existing DNN off the shelf and just use it; you have to train (at least partially) a new one for each new task, often many times.

I think simply OI (Organic Intelligence) would be the more appropriate term since it's no longer artificial ;)

Eh, every existing instance was created by some people working together.

I like calling them MeatBots

Even companies like Facebook, Apple and Google employ humans to do work that people believe is done by "computers", and none of the companies seem keen on informing the public that they do in fact have humans scanning through massive amounts of data. So perhaps it is in fact cheaper, or the problems they face remain too hard for current types of AI.

Given the number of people Facebook employs to censor content, and the mistakes they make, I would label most of Facebook's AI claims as snake oil.

Well, just about any ML task needs people to prune and correct the training data.

I'm aware of moderation. What else?

Recaptcha is probably one you've actually interacted with, but even then you're mostly reinforcing existing predictions. Other applications within Google Maps include street number recognition, logo recognition, etc. Waymo contracts out object detection from vehicle cameras and LIDAR point clouds. Google even sells data labeling as a service.

I believe Google Maps has a lot of humans who tidy up the automated mapping algorithms (such as adjusting roads).

Annotation is time consuming and therefore extremely expensive if you have a $100k engineer doing it.

Yes, you can report problems with the road network and people update GMaps manually. And up until a couple of years ago users could do it themselves, but they took that down for some reason.

Changes to other types of places can still be made manually by GMaps users themselves, and other users can evaluate them; I guess if a change is "controversial" (a low-rep user made it, or people voted against it), a Google employee evaluates it. And once you're beyond a certain level as a GMaps user, most of your changes are published immediately.

What about data privacy? Do your customers know that random Turkers look at their data?

I worked at a place that was selling ML-powered science instrument output analysis. It did not work at all ("fake it till you make it" is normal, I was told). So there was a person in the loop (machine output -> internet -> person doing it manually, pretending to be the machine -> internet -> report app). The joke was "organic neural net." Theranos of the North! ML is a great and powerful pattern matcher right now (talking about NNs, not first-order logic systems), but I fear we are going into another AI winter with all the over-promising.

We won’t ever have an AI winter like in the 70s again. A lot of ML is already very useful across many domains (computer vision, NLP, advertising, etc). Back then, there was almost no personal computing, almost no internet, smol data, and so on. Stuff you need for ML to be useful and used.

So what if some corporate hack calls linear regression “AI”? The results speak for themselves. The ML genie is too profitable to go back in the bottle.

Didn't linear regression use to be called "AI" as recently as a decade ago?

It's still better in many cases than modern ML (especially if you incorporate explainability and efficiency as metrics of "better" next to the predictive power), so I wouldn't object much if a company called it "AI". In fact, if I learned that an "AI" behind some product was just linear regression, I'd trust them more.
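For anyone who hasn't seen how little machinery is involved: one-variable linear regression is a closed-form fit whose two coefficients you can read off directly, which is exactly the explainability advantage. A from-scratch sketch:

```python
# Ordinary least squares for a single feature: fit y = a*x + b.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var                  # slope: effect of one unit of x on y
    b = mean_y - a * mean_x        # intercept
    return a, b                    # the entire "model", fully inspectable
```

For instance, `fit_line([1, 2, 3, 4], [3, 5, 7, 9])` recovers a = 2, b = 1; try explaining a deep net's prediction that tersely.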

I personally don’t see a problem with this. Where do you draw the line at model simplicity? Are decision trees too simple to be AI? What about random forest? Are deep neural nets the only model sophisticated enough to be “AI”? It’s not the model, it’s how you use it.

It still is, but the people who mean "regression" when they say "AI" will generally not admit this.

All fair points. I have used it to do amazing things. It's not going away. Just that AGI seems very far away. I think CS is like biology before the discovery of the microscope (why are we getting sick? Microorganisms etc). Or DNA. Once that big breakthrough happens we will quickly transition to a new local maximum.

There's a recent xkcd about your company: https://xkcd.com/2173/

"We trained a neural network to oversee the machine output"

That sounds at least achievable, unlike examples in OP.

What I dislike far more than the idea of using such systems to predict social outcome is that the usage of such systems is done behind closed doors. I would be much more willing to accept such systems if the law required any system to be fully accessible online, including the current neural network, how it was trained, and training data used to train it (if the training data cannot be shared online, then the neural network trained from it cannot be used by the government).

Independent companies using AI is far less a concern for me. If they are snake oil, people will learn how to overcome them. Government (especially parts related to enforcement) is what I find scary.

In my country, a relatively recent law added an obligation for the government to give on request a detailed and joe-six-pack-understandable explanation for how an "algorithm" has reached a decision pertaining to that person.

I've therefore started stockpiling popcorn since this law was announced for the inevitable clusterfuck that was going to happen when this law would have to apply to a decision taken using machine learning.

(Which is pretty much impossible to explain the way the law requires, because even those who made the neural network would be quite at a loss to understand themselves how exactly it came up with that decision, let alone explain it to your average person!)

Maybe they can use something like this?


They optimise a simple set of decision rules which has reasonable accuracy in their application, quite cool really
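A "simple set of decision rules" in that sense can be as plain as an additive point score, which is what makes it compatible with explain-the-decision laws: every factor's contribution is legible. A hypothetical sketch (rules and weights invented for illustration, not taken from any real system):

```python
# An additive scoring rule that reports its own reasoning.
RULES = [
    ("prior_offenses >= 3", lambda r: r["prior_offenses"] >= 3, +2),
    ("age < 25",            lambda r: r["age"] < 25,            +1),
    ("employed",            lambda r: r["employed"],            -1),
]

def score_with_explanation(record):
    total, reasons = 0, []
    for name, applies, points in RULES:
        if applies(record):
            total += points
            reasons.append(f"{name}: {points:+d}")
    return total, reasons  # the reasons list is itself the explanation
```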

Using an association between features to make a prediction about something, rather than measuring the thing itself, is exactly what’s meant by “prejudice.” Even when the associations are real and the model is built with perfect mathematical rigor. ML is categorically unsuitable for government decisions affecting lives.

You seem to think this level of prejudice for prediction is wrong. Why?

If someone has killed 12 people, being prejudiced about their chance of killing another, and using that to determine the length of a sentence, seems reasonable.

Even with something like a health inspection. Measuring how they store and cook raw chicken is about predicting the health risks to the public eating it, not about measuring the actual number of outbreaks of salmonella. And even if they were to measure previous outbreaks of salmonella and use them to predict future outbreaks, those are still two different things.

I understand and agree with the outcry around using ML in areas like criminal justice, but there are some really compelling examples of ML being used by governments to help citizens[1].

[1] https://www.kdd.org/kdd2018/accepted-papers/view/activeremed...

The problem with leaving this to independent companies is that some of the most natural application areas are dominated by independent companies operating as a cartel -- think about credit scoring. The fact that the data and models are a natural "moat" (in the sense of Warren Buffett) is all the more worrying.

The difference is the level of threat between a private company denying you a loan and the government deciding you should spend 10 years in prison. Please don't misunderstand, I'm not saying the former is good, only that it isn't nearly as bad as the latter.

The "independent company" part doesn't work though. If Facebook comes up with anything useful, the US government walks in, grabs the data, then issues a gag order so nobody knows. It simply wouldn't matter that the government was officially restricted to "open AI".

A judge using his experience and judgement to subjectively set a jail sentence is as opaque as a proprietary algorithm. He or she may cite reasons for the sentence, but nobody is verifying that judges' sentences are consistent with the criteria they cite.

You are right and I see that as a flaw in the current system that should be fixed.
