
A Worrying Analysis of Recent Neural Recommendation Approaches - nkurz
https://arxiv.org/abs/1907.06902
======
nkurz
Summarizing, they tried replicating the results of 18 recent papers, and
determined that because of lack of available code or data, they were only able
to attempt replications of 7 of them. Of these, they found that the
"improvements" in 6 of them could also be achieved with a simpler non-machine
learning approach (such as nearest neighbor). For the 7th (Mult-VAE), they
were able to almost match the improvement with better tuning of one of the
baselines used in that paper.
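
For concreteness, here's a minimal sketch of the kind of simple
nearest-neighbor baseline the paper has in mind: item-based k-nearest
neighbors over a user-item interaction matrix, scored with cosine similarity.
The toy matrix below is made up for illustration and is not from the paper.

    import numpy as np

    def topn_item_knn(R, user, n=3):
        # Item-item cosine similarities from an interaction matrix R (users x items).
        norms = np.linalg.norm(R, axis=0) + 1e-9
        sim = (R.T @ R) / np.outer(norms, norms)
        np.fill_diagonal(sim, 0.0)            # ignore self-similarity
        scores = sim @ R[user]                # aggregate over the user's seen items
        scores[R[user] > 0] = -np.inf         # don't recommend already-seen items
        return np.argsort(scores)[::-1][:n]

    # Toy 3-user x 5-item interaction matrix (hypothetical data).
    R = np.array([[1, 1, 0, 0, 1],
                  [0, 1, 1, 0, 0],
                  [1, 0, 1, 1, 0]], dtype=float)
    print(topn_item_knn(R, user=0, n=2))

Nothing is learned and there is no training loop --- and per the paper,
properly tuned baselines of roughly this shape matched or beat 6 of the 7
reproducible neural methods.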

Does this mean that the published results are garbage? No! But what does it
mean?

First, don't miss the main claim: less than half the published papers have
sufficient code and data available for replication. While one might argue
exactly what the standards for availability should be, if journals and
reviewers were to demand it, this number could definitely be improved. Greater
availability of data and code makes for better science.

Second, my guess is that it means that paper authors are using excess
"puffery" in an attempt to improve the chances that their papers are accepted
for publication. They've decided that using a weak baseline and claiming a 50%
improvement is more likely to result in publication than using a realistic
baseline and claiming a 0-5% improvement. Unfortunately, they are probably
right.

The useful change (in my opinion) would be to change the standards by which
papers are judged to be publishable. Reviewers should push back against weak
baselines, but be more accepting of results that make modest claims. You can
still have a useful paper even if the degree of improvement is small --- or
even nonexistent. Maybe it will inspire someone else, maybe it's useful in an
ensemble, or maybe it applies to a case where the baseline wouldn't. It's the
puffery that's the problem, not the publication. Creating a system that
rewards authors who more honestly evaluate their work would be an overall win.

~~~
goostavos
Is there some background reason that academics generally don't provide code in
their papers? Hell, in a lot of papers I've tried looking at recently[0],
there will be paragraphs and paragraphs of English text _describing_ an
algorithm rather than even just plunking down some pseudocode.

It makes pulling out the actual content of the papers such a slog.

[0] exhibit A:
[https://ieeexplore.ieee.org/document/1017616](https://ieeexplore.ieee.org/document/1017616)

~~~
mayank
Research code is generally an abomination, scraped together by MS/PhD students
who may never have had any industry exposure to best practices, norms,
testing, etc. There may also be the fear of releasing buggy code, which
if/when discovered, will discredit the paper. Far safer to just not release
the code, since it isn't mandated (which should absolutely 100% change).

Source: am former producer/publisher of abomination research code.

~~~
dv_dt
I wonder if a journal could be established that emphasizes code-integrated
papers, something that might accept submissions as Jupyter notebooks, for
example.

Edit: Perhaps one way to encourage this is via topic "reuse" papers, which
present further detail on a previously published paper's topic, but this time
with code and more detailed discussion. It would at least help in that the
author gets some reuse of effort, but in a way that still shows new
information and advances the field.

~~~
codebje
Jupyter notebooks are risky business for reproducible work. No dependency data
means they're highly prone to bit rot. Stored results and out-of-order
execution mean they're prone to subtle errors. Environmental leakage is
relatively high.

Literate programming suffers from this in general because you don't want to
clutter up your document with noise about versions and so forth - even as
appendices, including package information and build instructions is a lot of
noise.
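
To make the version point concrete, here's a minimal sketch of the kind of
environment-recording boilerplate a reproducible notebook would need (the
package names are placeholders, not a recommendation):

    import sys
    import importlib.metadata as md  # stdlib since Python 3.8

    print("python", sys.version.split()[0])
    for pkg in ("numpy", "scipy", "pandas"):  # whatever the notebook imports
        try:
            print(pkg, md.version(pkg))
        except md.PackageNotFoundError:
            print(pkg, "not installed")

And even this only records versions rather than pinning them, which is why a
lockfile or container image alongside the notebook is the more robust option.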

Requiring that papers making claims about some code base publish that code
base simultaneously with the paper via the peer review process should be
sufficient to improve the overall state of affairs.

Don't underestimate the value of demos and competitions with companion papers
either, those tend to get a lot more notice.

~~~
disgruntledphd2
I dunno, I get where you are coming from with respect to literate programming,
but I find that it's often better to show all the versions in an org file (or
whatever tool you use) and write up a report separately including the final
results.

In general, you'll have a lot of approaches that don't work out, which are
nice to have a record of but definitely don't merit being in the final paper.

~~~
codebje
A published code base doesn't necessarily mean a full revision history, just
something others can reproducibly build and run.

If your claims don't depend on specific behaviours of some body of code, you
wouldn't need it - e.g., an article claiming some asymptotic performance of an
algorithm should describe the algorithm in the abstract such that the performance
bound can be proven, not some specific language's implementation of the
algorithm.

------
codingslave
Similar to this post is this:

[https://arxiv.org/abs/1907.07355](https://arxiv.org/abs/1907.07355)

"Probing Neural Network Comprehension of Natural Language Arguments

Timothy Niven, Hung-Yu Kao (Submitted on 17 Jul 2019)

We are surprised to find that BERT's peak performance of 77% on the Argument
Reasoning Comprehension Task reaches just three points below the average
untrained human baseline. However, we show that this result is entirely
accounted for by exploitation of spurious statistical cues in the dataset. We
analyze the nature of these cues and demonstrate that a range of models all
exploit them. This analysis informs the construction of an adversarial dataset
on which all models achieve random accuracy. Our adversarial dataset provides
a more robust assessment of argument comprehension and should be adopted as
the standard in future work."

In the field of computer vision, there are similar suspicions that state of
the art computer vision models are overfit to ImageNet datasets and the like.
The issue being that even average research labs do not have hundreds of
thousands of dollars to reproduce extremely expensive and highly tuned models.
Reproducing and advancing the field of deep learning is quickly becoming
inaccessible to almost everyone except for a few of the highest funded
industrial labs (Google, FB, OpenAI, Microsoft Research among a few others).
This is not all negative; it's just that not everything can be taken as
gospel.

~~~
Eridrus
This paper says more about that specific dataset/task than about neural models
for NLP.

~~~
codingslave
The paper is relevant to the original post because it points out that NLP
models show state-of-the-art results that are overfit to their respective
datasets. This means that we may not be advancing as fast as we think. The
issue here is also that few people have the monetary resources to provide a
secondary analysis on these models. It's too expensive.

~~~
Eridrus
I think this is way too broad an interpretation to draw from a paper that
shows that a single dataset was bad. NLP datasets have had a good chunk of
these issues, but people have been aware of them for a few years, and newer
datasets, e.g. SQuAD 2.0, explicitly try to address these sorts of problems.

~~~
disgruntledphd2
Any chance you can provide some more recent results that justify your lack of
concern?

I'm an outsider to NLP, but have been very surprised by the recent
improvements, so would be interested in anything that covers their limitations
and newer datasets which help us to get over them.

~~~
Eridrus
The main breakthrough in the last 2 years is increasing success in
unsupervised/self-supervised pre-training on very large datasets transferring
to other tasks, e.g. ELMo, BERT, XLNet.

Have a look at the BERT & XLNet papers; they have results on a wide range of
datasets, including SQuAD 2.0.

I think it's worth keeping in mind that all of the headlines about "surpasses
human performance" are 100% bullshit regardless of whether they are about
vision, ASR or NLP. Performance on a dataset is not the same thing as real
world usefulness.

But this takes nothing away from the fact that ELMo/BERT/XLNet make it far
easier to pull a model off the shelf and fine-tune it for your task and get
good performance.
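
For anyone who hasn't tried it, here's a minimal sketch of what "pull a model
off the shelf and fine-tune it" looks like, assuming the Hugging Face
transformers library and a toy two-example binary task (the texts and labels
are hypothetical):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "bert-base-uncased"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    texts = ["great product", "terrible product"]  # stand-in task data
    labels = torch.tensor([1, 0])

    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for _ in range(3):  # a few gradient steps, just to show the loop shape
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optim.step()
        optim.zero_grad()

A real setup would batch a proper dataset and evaluate on held-out data, but
the shape of the code doesn't get much more complicated than this.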

And the gains here are so broad and task-agnostic that issues with individual
datasets are not really the point; it is the massive gains from pretraining
that are the real story. Even if all that pretraining does is help you overfit
a bad test set, it shows these models now have built-in linguistic knowledge
that we previously didn't have.

This is obviously bad news for you if you thought that these tasks were
basically solved, but it's completely irrelevant if you were building systems
since these methods all result in relative gains on whatever language task
they were pointed at.

The bigger issue IMO with many recent NLP papers is not that they are
expensive to train from scratch, which is a one-time cost, but that they are
expensive to even do inference with, which makes it hard to actually deploy
them in real systems.

But if your question is "is recent progress in NLP illusory?", I think the
answer is clearly no, since the gains were across a broad range of tasks.
Frankly, I've used BERT and it makes my application better; the main issue I
have is the runtime issue.

------
xamuel
The abstract is damning, but not surprising to anyone outside the machine
learning bubble. It's a symptom of a major problem in meta-science:
breakthroughs are hard, and when you shovel billions of dollars into trying to
incentivize breakthroughs, you end up with a lot of envelope-pushing dressed
up like breakthroughs, and most of them don't even successfully push the
envelope.

~~~
hyperbovine
> not surprising to anyone outside the machine learning bubble.

Or in it, either. ML researchers are painfully aware of how the sausage gets
made. (About 12 hours before the NIPS deadline, typically.)

~~~
fock
12 hours, that's early, according to a friend of mine doing his PhD in ML/NN
(or as he put it: everyone uses L²-loss and SGD - it's just not original)...
He also thinks that the quality of papers from a formal standpoint is very low
(not speaking of content yet). Riddled with spelling errors and illogical
sentences. But as published around NIPS gives all the credits and fixing those
things does not (moreso, it probably taints results), this won't change
anytime soon...

------
TallGuyShort
There are plenty of valid reasons to be critical / skeptical of the data
science / machine learning bubble, but the last few articles I've read about
the research actually make it sound not-too-bad. 7/18 releasing enough code
to reproduce sure beats the rate at which other fields release code, even when
the code is still critical to confirming the results. In fact it's almost more
understandable here, since the code is a larger portion of the intellectual
property, which might be valuable to whoever is funding the research if it's a
private entity, or to university staff who might still want to monetize it. I
saw another one recently about reproducibility, something like 85%, which
sounds terrible - until you remember that something like 90% [1] of science
papers weren't successfully reproduced when someone attempted to do so. I'm
gonna go dig up citations if I can find them quickly...

[1] 1 quick citation before I get back to work:
[https://www.bewellbuzz.com/technology/many-published-studies-complete-bs/](https://www.bewellbuzz.com/technology/many-published-studies-complete-bs/)

It's terrible, but no more terrible than many other comparable fields.

~~~
disgruntledphd2
Yeah, but we'd expect papers that predominantly run code to be much, much
better than this. Like, there is definitely a set of code that produced the
results, and presumably a repository somewhere with some of the scripts that
did this. Supplying this to the world is not too much to ask (and should 100%
be a requirement of funding agencies (which would work as long as users of
data and code were also required to cite the original article (isn't nesting
parentheses fun?))).

~~~
TallGuyShort
I agree, but all of that is also true of a lot of papers in a lot of other
fields. I've seen papers about better complex simulations of natural processes
- I don't think I've ever seen one with complete code. I've seen papers that
described the experiment and the results of their analysis with hand-wavy
descriptions of which algorithms and statistical methods they used - but no
code.

My point is that the findings of this article are not necessarily indicative
of machine learning being junk science; more likely they indicate a systemic
problem with all research, where there's an incentive to publish the paper,
but not the data, code, models, or even generally enough information of any
kind to replicate the experiment successfully more often than 10-15% of the
time.

I remember when LIGO first detected a black-hole collision, they released the
raw data and the Jupyter notebooks that were actually used to go from the raw
data to the published results. Everyone was floored. It shouldn't be that
amazing - it should be the standard. For every field.

------
debt
"Only 7 of them could be reproduced with reasonable effort. For these methods,
it however turned out that 6 of them can often be outperformed with comparably
simple heuristic methods, e.g., based on nearest-neighbor or graph-based
techniques. "

Ha. I get a feeling there's alotta overpaid data scientists out there.

~~~
baron_harkonnen
Basically when you see an RNN in industry, chances are somebody has made a huge
and expensive mistake. And quite likely the person behind it thinks that
they've applied cutting edge research to a real world problem (when they have
done neither).

~~~
quelltext
Please elaborate. What's wrong with RNNs and what should be used instead?

~~~
sdenton4
I met a dude once who was using LSTMs to predict HVAC failures for some sort
of HVAC support firm. On some interrogation, it turned out that the input data
was garbage, and the only way the model "worked" was if it was repeatedly
overfit on only the last week's worth of data... comprising about 5k rows,
iirc.

This struck me as completely terrifying... it's not like the set of HVAC units
was changing out every week. And one would hope that the historical data would
help arrive at a general solution. My guess is they had some sort of
church-meets-state problem in the test/train split, so that the short-term
data allowed 'predicting' the test set.
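
Here's a minimal sketch of the leakage I suspect, with hypothetical column
names and synthetic data: for time-ordered records, a random train/test split
lets the model 'see the future', while a chronological split does not.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "timestamp": pd.date_range("2019-01-01", periods=1000, freq="H"),
        "sensor_reading": rng.normal(size=1000),  # made-up feature
        "failed": rng.integers(0, 2, size=1000),  # made-up label
    })

    # Leaky: a random split mixes past and future rows, so short-term
    # autocorrelation can 'predict' the test set.
    leaky_train, leaky_test = train_test_split(df, test_size=0.2, random_state=0)

    # Safer: train strictly on the past, evaluate strictly on the future.
    cutoff = int(len(df) * 0.8)
    train, test = df.iloc[:cutoff], df.iloc[cutoff:]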

Dude was quite convinced that they were making HVAC history, though.

------
derefr
> For these methods, it however turned out that 6 of them can often be
> outperformed with comparably simple heuristic methods, e.g., based on
> nearest-neighbor or graph-based techniques.

Even if these models are outperformed, would they be useful anyway, to serve
as insights into the functioning of an agent that can deduce solutions to
these tasks "from scratch"?

After all, the human brain is a wondrous AI, but it is often _also_
outperformed by simple heuristics.

~~~
desc
It's outperformed at basic arithmetic by fairly simple electronic circuits
too.

Which is probably your point :P

Right tool for the right job. The problem is that when you have a hammer
called machine learning, every problem looks like a machine-learning-sized
nail.

------
firasd
Seems like the desired outcome here is an unfixed target. What does Facebook
want to show you next in the Newsfeed, and why? It's gonna be different from
the factors Amazon evaluates to determine your home page suggested items. So
while it's worth trying to replicate academic research on these datasets like
MovieLens, the real-world outputs from collaborative filtering are inherently
subjective.

~~~
nsuser3
Isn't it all about 'engagement' (at least for companies)?

Facebook -> make the user see more ads

Amazon -> make the user buy more things

~~~
disgruntledphd2
In general, if you provide a social media service, optimising for ads will
probably not lead to you becoming a world-dominating destroyer of democracy,
like Facebook.

You probably want to optimise for keeping people on your service, first (and
last).

------
jlukecarlson
In my opinion, the most provocative point that this paper makes isn't just
about general reproducibility issues or problems with comparing to a weak
baseline — it’s that a number of these papers used improper methods to obtain
their results in the first place.

For instance, the NCF and MCRec papers tuned model parameters on the test set,
and the SpectralCF paper used a non-randomly sampled test set for evaluation.
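
For contrast, here's a minimal sketch of the protocol those papers should have
followed: choose hyperparameters on a validation split and touch the test set
exactly once (scikit-learn on synthetic data, purely for illustration):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1000, 10)), rng.integers(0, 2, size=1000)

    # Hold the test set out entirely; tune only on the validation split.
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

    best_k, best_acc = None, -1.0
    for k in (1, 5, 15, 50):
        acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc

    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_dev, y_dev)
    print("test accuracy (reported once):", final.score(X_test, y_test))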

That to me is even more surprising than their revelations that a well-tuned
statistical baseline outperforms these models.

------
sdinsn
Eh, this isn't really surprising. The purpose of research is to explore - there
are no guarantees that newer, complex methods will be better than simpler
methods.

I think the greater issue is that researchers are afraid to admit that their
results aren't great. And I don't blame them - it could impact funding, access
to journals, etc. But mediocre results are still important for research, and
it's a shame the scientific community doesn't recognize that.

~~~
ska
> And I don't blame them - it could impact funding, access to journals, etc.

No, being "understanding" about this makes you part of the problem.

Part of research is trying new things, and exploring, but you should always be
checking this against simpler techniques and currently accepted best
performance. If it isn't an improvement, you don't publish at all unless you
have something interesting to say about _why_ it isn't improving things. Or
perhaps in a methods survey.

One thing that happens when a technique like deep learning gets popular, and
things like Python toolkits pop up making it easy to try, is that you get
researchers in adjacent areas who don't really understand what they are doing
trying to apply it to their favorite problem domain, and then comparing almost
exclusively to other naive groups doing the same thing. It can be hard to tell
if there is anything interesting there, even when there are dozens of
conference papers, etc.

Basically the same thing happened with kernel methods when SVMs were hot, on a
smaller scale.

Compare this to, say, AlexNet. The reason it was immediately obvious that
something interesting was happening there was that lots of people who _did_
know what they were doing with models had tried lots of other approaches, and
you could make direct comparisons.

So yes, blame them. I do think negative results should be valued higher, but
the fact you did some work doesn't make it publishable.

Frame it another way: if you give me a paper proposing a complex model, and I
grab your data and play with it a bit, and it turns out linear regression with
appropriate pre-treatment works better ... well then I'm forced to believe you
are either incompetent or being dishonest. Or, if students, your supervisors
are.
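
To be concrete, here's a minimal sketch of the kind of baseline I mean: linear
regression with simple pre-treatment (here just scaling), cross-validated on
the same data the complex model used. The data below is synthetic; substitute
the paper's.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(500, 20)), rng.normal(size=500)

    baseline = make_pipeline(StandardScaler(), LinearRegression())
    print("baseline R^2:", cross_val_score(baseline, X, y, cv=5).mean())

If the complex model can't beat that number under the same split, the paper
needs to explain why it is still interesting.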

This generalizes well. You should always be comparing against a known (or
considered) good solution on your data under the same conditions, not just
comparing against last year's conference paper and variants you are trying to
"improve". The right choice of baseline comparison will depend a bit on the
domain, but not including it at all is shockingly poor scholarship.

I've even seen paper submissions with no direct comparisons at all, because
the "researchers" didn't have access to the comparators data, and were too
lazy to implement the other methods. Which leads to another sloppiness -
methods that get pulled into comparison not because they are the right choice,
but because there is a publicly available implementation. In the best case
this forms a useful baseline. In the worst case, well, I guess it's good for
your citation count if you implemented it :)

~~~
SubiculumCode
Whether an approach is currently better than other well-understood approaches
shouldn't be the focus of researchers. It might be the focus of private-sector
engineers, but researchers should be interested in whether a new approach has
the potential to advance machine learning.

~~~
ska
Apologies; I think I was unclear.

I don't mean to suggest that papers should boil down to performance
comparisons with baseline results, not at all.

What I'm saying is that if you don't do this comparison somewhere, it can be
very hard to tell what your numeric results do mean. In the worst case you see
people offering numerical comparisons to other approaches that are similarly
unpinned, and you can't tell if they are interesting even if they are apples
to apples.

As a researcher, you are being lax if you never do that work, if for nothing
else than as a sanity check on your implementation. If you've done it, it's good
to include in the paper as a point of reference. If you currently aren't
performing better than baseline, that's fine - but you should understand why
and discuss that with insight too.

So it isn't the focus. But it is table stakes that you understand this stuff.

------
agent008t
I did not see any author names there that I recognized.

I would imagine most people in academia know that most papers published are
published "just to publish" to help one's academic career, mostly by MSc and
PhD students. They are essentially "write-only" papers. Everyone has to start
somewhere.

Now, if we saw something like this out of DeepMind or OpenAI or in Nature,
that would be worrying.

------
sytelus
Even though this is only 18 papers, the correlation with the h5-index of the
conferences they were published in is inescapable!

    
    
      Conference | reproducibility | h5-index
      ----------------------------------------
      KDD        | 43%             | 77
      WWW        | 33%             | 70
      SIGIR      | 25%             | 55
      RecSys     | 13%             | 40
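
For what it's worth, a quick sanity check on that claim using just the four
points in the table above (far too few for any strong conclusion):

    import numpy as np

    repro = np.array([43, 33, 25, 13])  # % reproducible, from the table
    h5 = np.array([77, 70, 55, 40])     # conference h5-index, from the table
    print("Pearson r = %.3f" % np.corrcoef(repro, h5)[0, 1])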

------
sjg007
I guess this shows that the data really are what matter. Most ML methods can
extract signal that is reasonably comparable across methods. So are we
hyper-optimizing the methods? My guess is that the data is the advantage:
having a lot of it, and having exclusive access to it. Is it also time to move
beyond derivative, me-too type publications?

~~~
bluejay2387
There is plenty of work out now showing that we may have gone as far as we
can with the 'throw more data at it' approach: work out of the adversarial AI
area showing that models are learning shallow representations regardless of
the amount of data used, work showing that even with deep learning we are
hitting diminishing marginal returns with larger and larger corpus sizes, an
almost complete lack of progress in dealing with 'context'... I think that is
a large part of the sudden resurgence of interest in ontologies, knowledge
graphs, etc.

~~~
sjg007
I agree with the context argument. But essentially, with an ontology we are
adding more "data" - in this case, relationships between things. The hope is
that implicit connections between things not directly connected might be
revealed by the AI.

------
SubiculumCode
To me this seems less understandable than the reproducibility crisis in
psychology research, for in psychology research there is at least the
possibility that different samples can be legitimately different, and thus
legitimately provide different results. In machine learning, there are
standardized training and testing samples, and the methods can be entirely
encapsulated in immutable computer code. In this respect, the lack of
reproducibility in machine learning papers suggests to me either distorting
incentives, rampant unreported tweaking of parameters and/or falsification,
or, as I suspect, intentional obfuscation of methods to retain their value as
IP in the private market. I'm not sure which, or which mixture, but publishing
code, and datasets if funded with public money, should be made mandatory.

~~~
nkurz
Interestingly --- and in my impression unlike psychology research --- I don't
think this paper ever attempted to replicate and got contradictory results.
Instead, I think the complaint is that for the majority of papers they were
unable to begin the replication for lack of working code or available data
sets. When they had these, they appear to have obtained the same results.

------
akerr
Isn't the main advantage that the companies don't need to be transparent (to
users, regulators etc) about their algorithms because they "can't"?

------
luminati
Summary:

18 papers/algorithms were tested

11 weren't reproducible

7 were reproducible with considerable effort, but only 1 clearly outperformed
the baseline, and even that one didn't outperform a well-tuned linear ranking
method

--

Travis Ebesu et al. 2018. Collaborative Memory Network for Recommendation
Systems. In Proceedings SIGIR ’18. 515–524. was the only paper out of 18 that
was both reproducible and outperformed the baseline, but it did not
consistently outperform a well-tuned non-neural linear ranking method.

 _Reproducible with considerable effort but outperformed by comparably simple
heuristic methods, e.g., based on nearest-neighbor or graph-based techniques._

====================

They are:

KDD:

[17] Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S Yu. 2018. Leveraging
meta-path based context for top-n recommendation with a neural co-attention
model. In Proceedings KDD ’18. 1531–1540.

[23] Xiaopeng Li and James She. 2017. Collaborative variational autoencoder
for recommender systems. In Proceedings KDD ’17. 305–314.

[48] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep
learning for recommender systems. In Proceedings KDD ’15. 1235–1244.

RecSys:

[53] Lei Zheng, Chun-Ta Lu, Fei Jiang, Jiawei Zhang, and Philip S. Yu. 2018.
Spectral Collaborative Filtering. In Proceedings RecSys ’18. 311–319.

WWW:

[14] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng
Chua. 2017. Neural collaborative filtering. In Proceedings WWW ’17. 173–182.

[24] Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018.
Variational Autoencoders for Collaborative Filtering. In Proceedings WWW ’18.
689–698.

 _Non-reproducible:_

====================

KDD:

[43] Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Multi-Pointer
Co-Attention Networks for Recommendation. In Proceedings SIGKDD ’18. 2309–2318.

RecSys:

[41] Zhu Sun, Jie Yang, Jie Zhang, Alessandro Bozzon, Long-Kai Huang, and Chi
Xu. 2018. Recurrent Knowledge Graph Embedding for Effective Recommendation.
In Proceedings RecSys ’18. 297–305.

[6] Homanga Bharadhwaj, Homin Park, and Brian Y. Lim. 2018. RecGAN: Recurrent
Generative Adversarial Networks for Recommendation Systems. In Proceedings
RecSys ’18. 372–376.

[38] Noveen Sachdeva, Kartik Gupta, and Vikram Pudi. 2018. Attentive Neural
Architecture Incorporating Song Features for Music Recommendation. In
Proceedings RecSys ’18. 417–421.

[44] Trinh Xuan Tuan and Tu Minh Phuong. 2017. 3D Convolutional Networks for
Session-based Recommendation with Content Features. In Proceedings RecSys ’17.
138–146.

[21] Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu.
2016. Convolutional Matrix Factorization for Document Context-Aware
Recommendation. In Proceedings RecSys ’16. 233–240.

[45] Flavian Vasile, Elena Smirnova, and Alexis Conneau. 2016. Meta-Prod2Vec:
Product Embeddings Using Side-Information for Recommendation. In Proceedings
RecSys ’16. 225–232.

SIGIR:

[32] Jarana Manotumruksa, Craig Macdonald, and Iadh Ounis. 2018. A Contextual
Attention Recurrent Architecture for Context-Aware Venue Recommendation. In
Proceedings SIGIR ’18. 555–564.

[7] Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and
Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia
recommendation with item- and component-level attention. In Proceedings SIGIR
’17. 335–344.

WWW:

[42] Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Latent relational metric
learning via memory-based attention for collaborative ranking. In Proceedings
WWW ’18. 729–739.

[11] Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A multi-view deep
learning approach for cross domain user modeling in recommendation systems. In
Proceedings WWW ’15. 278–288.

