Are We Making Much Progress? Analysis of Recent Neural Recommendation Approaches (arxiv.org)
86 points by sndean 10 days ago | 17 comments





Eerily reminiscent of the replication crisis in psychology and the social sciences. My key takeaways:

1) Half the papers couldn't be reproduced on a technical level. Publish your code and your data, people!

2) Most of these papers use "weak baselines" so they can show some kind of improvement and get their paper published. I'm conflicted about this because if we require every paper to beat state-of-the-art, we'd (collectively, as the entire discipline) be lucky to publish one paper a year. From one point of view, these papers actually represent a form of publishing negative results - we tried this and it didn't work - which isn't a bad thing. But the biased way it's presented makes it harder to separate the wheat from the chaff.

3) It's not obvious that we're going to squeeze any more business value from this particular stone. Sometimes all the useful information in a dataset can be found with a fairly simple algorithm. Not everything benefits from a more complex representation, and sometimes you can't fix that with regularization or more data. Sometimes you just have to use the simple model and accept that it captures all the signal that's available and everything else is noise.
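As a concrete picture of what "a fairly simple algorithm" can mean in this setting, here is a minimal sketch of a popularity baseline that just ranks items by raw interaction count and serves the same list to everyone. The function name and toy data are mine, not from the paper:

    import numpy as np

    def popularity_baseline(train_interactions, n=10):
        """Rank items by raw interaction count; every user gets the same list."""
        items, counts = np.unique(
            [item for _, item in train_interactions], return_counts=True
        )
        return items[np.argsort(-counts)][:n]

    # Toy example: item "a" is the most popular, then "b".
    interactions = [(0, "a"), (1, "a"), (1, "b"), (2, "c"), (3, "a"), (3, "b")]
    print(popularity_baseline(interactions, n=2))  # -> ['a' 'b']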


> It's not obvious that we're going to squeeze any more business value from this particular stone. (i.e., get more value from recommendations)

I could even argue "relying primarily on recommendations is a dark pattern". On many platforms, the approach is replacing good search with good recommendation. Essentially, you don't get to, say, explicitly specify your results; instead you get an opaque mix of "things you want and things we want you to want" and you're supposed to be happy with this. You can see this in action in Amazon recommendations, YouTube and Google - actual control is replaced by "we know what you're thinking" (which indeed works some of the time but is infuriating when it doesn't, and leaves lots of room for nefarious effects, e.g. YouTube's fascist/extremist propaganda effects, etc).

And someone always pipes in here with "users are dumb and can only deal with recommendations since they can barely figure out toasters". Well, sites kind of need to educate their users; users actually have learned a bit given the dominance of the Internet over the last 30 years, and a general awareness of the problems of recommendation-reliance can contribute to change, just as right now a lot of supposedly not-dumb developers and managers think of recommendations as a benevolent or merely neutral approach.


That's a rather cynical take. The objective function is clearly defined as "giving recommendations the user agrees with". As such, a well-working algorithm is clearly superior to search, almost by definition.

Here's an optimistic cultural take: good recommendations take you outside your bubble. It's not following Return of the Jedi with Episode I, but changing to something different, that you enjoy even though you would not have expected it.


> That's a rather cynical take. The objective function is clearly defined as "giving recommendations the user agrees with".

I wouldn't call it cynical at all. I view it as idealistic. The user should be in charge; the user should have tools to formulate a query describing what they want. Doubting the value of anything other than the user being in charge goes with this. You can say we're each optimistic about something different, but then it comes down to objective comparisons.

> Here's an optimistic cultural take: good recommendations take you outside your bubble. It's not following Return of the Jedi with Episode I, but changing to something different, that you enjoy even though you would not have expected it.

The thing to consider with your comparison is that most YouTube recommendations don't give you anything new at all. They're usually some other thing from whatever is considered the top ten. Google gives ten mainstream movies, YouTube gives the top ten of whatever song-era you are looking at. The engines never, ever find "hidden gems". The rise of recommendation engines resulted in a homogenization of the web - we have all experienced this. And I'd claim better algorithms can't change this, since the algorithms just don't have enough information about why a user likes song/movie/product X; for songs alone there are multiple qualities that different people look for (etc., back to the GP and OP's fine arguments).

And further, if a user wanted something new, you could have a "choose at random" button that let them know what they were getting into. Also, I don't think a certain amount of recommendation within tool-type processes is bad, but that can devolve. Google rose by being more likely to indeed get people what they were thinking of, but that vein has been mined to the point that all that's left is ... low quality trash. But I'm optimistic there are ways to do better.
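To make the "choose at random" idea concrete, here is a minimal sketch under my own assumptions (the function and parameter names are hypothetical, not anything an existing platform exposes): the normal path serves a ranked list, and the button swaps in a uniform sample of unseen items so the user knows the surprise is deliberate.

    import random

    def recommend(user_seen, ranked_items, catalog, n=10, explore=False):
        """Serve the usual ranked list, or, when the user hits the hypothetical
        'choose at random' button (explore=True), a uniform sample of items
        the user has not interacted with yet."""
        unseen = [item for item in catalog if item not in user_seen]
        if explore:
            return random.sample(unseen, min(n, len(unseen)))
        return [item for item in ranked_items if item not in user_seen][:n]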


Even though recommendation engines can sometimes surface things you didn't know you actually wanted, there should be a way for users to be in control and to search knowing how the search will be conducted on their behalf.

It's not just reminiscent. It's the same incentive structure: people need papers, as many and as soon as possible. The results need to sound good (so they land in good venues), but they won't be scrutinized too much once they do.

Regarding 1: even papers with high-quality published code fall prey to bit rot very quickly -- go dig through some git repos from ML papers from 2015 and try to run them.

Distributing Docker containers works better, but even that won't necessarily help, because, especially if you have to use GPUs, the container software has been shifting underneath you. Plus it's still cumbersome to make a fair comparison between containerized code and a new approach implemented somewhere else, especially if it's on different data.

Would it be worth making everyone expend a huge amount of unpaid additional effort maintaining the code for every ML paper ever published, for years to come? I'm not sure -- many are not very good and should probably be left to die. On the other hand it's common and immensely frustrating to be unable to reproduce results.

Regarding 2: it seems like if the technique itself works in a substantially different way from the state of the art, even if it can't consistently beat it, that's still a positive result -- something new.


What if, instead of publishing just the code, authors also published intermediate results (e.g. tensors) from running the code?

So basically sprinkle the code with print statements. Then if a future researcher has difficulty reproducing the result, they know exactly at which step the problem occurs.
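A minimal sketch of what that could look like, assuming a PyTorch model (the helper name and file layout are mine): register forward hooks that dump every module's output to disk, so a later reproduction attempt can diff against the published tensors layer by layer.

    import os
    import torch
    import torch.nn as nn

    def save_intermediates(model, sample_input, out_dir="intermediates"):
        """Dump each submodule's output tensor so results can be compared step by step."""
        os.makedirs(out_dir, exist_ok=True)
        handles = []
        for name, module in model.named_modules():
            if name == "":  # skip the root module itself
                continue
            def hook(mod, inp, out, name=name):
                torch.save(out.detach().cpu(), os.path.join(out_dir, f"{name}.pt"))
            handles.append(module.register_forward_hook(hook))
        with torch.no_grad():
            model(sample_input)
        for h in handles:
            h.remove()

    # Toy example: the saved .pt files are the "intermediate results" to publish.
    model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
    save_intermediates(model, torch.randn(1, 8))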


Haha, that's not going to work for typical models that have millions of parameters. Besides, it's not the code that's important (except for verifying that it's a good method), it's the actual method: its description, why it works, how it works, how it can be improved, adapted, etc.

> ...if we require every paper to beat state-of-the-art, we'd (collectively, as the entire discipline) be lucky to publish one paper a year.

That's the case if there's just one baseline to beat. It would be a bad thing if everyone were working on beating today's single state of the art model. It would be great to incentivize more work along varied lines.

Either way, we absolutely don't want papers that claim their innovation is an improvement on a metric they don't really improve.


Not all papers are about achieving state-of-the-art performance; some are about exploring the general viability of a new approach.

To answer if we are making much progress, here’s a quote: “Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method.”

The method that outperformed the others was MultVAE [1]

[1]: https://arxiv.org/abs/1802.05814
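For a sense of the "comparably simple heuristic methods, e.g., based on nearest-neighbor" mentioned in the quote, here is a hedged sketch of a generic item-based kNN scorer over a binary user-item matrix. It illustrates that family of baselines, not the exact configuration evaluated in the paper.

    import numpy as np

    def itemknn_scores(interactions, k=20):
        """Item-based kNN: score items for each user by summing cosine
        similarities between the items they interacted with and all others.
        `interactions` is a binary user x item numpy array; already-seen
        items are not masked out here."""
        norms = np.linalg.norm(interactions, axis=0, keepdims=True) + 1e-8
        normalized = interactions / norms
        sim = normalized.T @ normalized          # item x item cosine similarity
        np.fill_diagonal(sim, 0.0)               # ignore self-similarity
        if k < sim.shape[0]:                     # keep only the k nearest neighbors per item
            kth_largest = -np.sort(-sim, axis=1)[:, k - 1:k]
            sim = np.where(sim >= kth_largest, sim, 0.0)
        return interactions @ sim                # user x item scores

    # Toy example: 3 users, 4 items.
    R = np.array([[1, 1, 0, 0],
                  [0, 1, 1, 0],
                  [1, 0, 0, 1]], dtype=float)
    print(itemknn_scores(R, k=2))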


There are fields (recommender systems in particular, in my opinion) where you will not get published if you report a strong baseline and your approach does not outperform it, but you may get published if you use weak baselines and your approach outperforms them. This doesn't make sense to me, as surely it's possible for promising new approaches to not yet outperform current state-of-the-art methods.

These peer review aberrations incentivise bad research practices, I think.



Most papers in most disciplines end up not being worth much, and the incentives regarding publishing hold across all of computer science. Is there really something particularly bad going on in deep learning? Or is it just the usual process?

Most papers are slight improvements, or barely an improvement after selectively comparing against other methods or using particular datasets that work well with that method, or after mangling the results so much with vagaries and statistics that the results look kind of good if you squint a lot. That's just how it is because 99% of academics are not geniuses but they still need to graduate or get their tenure. But the papers of non-geniuses provide ideas and an environment for the geniuses to publish their work and flourish, so it works out in the end.


