1) Half the papers couldn't be reproduced on a technical level. Publish your code and your data, people!
2) Most of these papers use "weak baselines" so they can show some kind of improvement and get published. I'm conflicted about this, because if we required every paper to beat the state of the art, we'd (collectively, as an entire discipline) be lucky to publish one paper a year. From one point of view, these papers actually represent a form of publishing negative results - we tried this and it didn't work - which isn't a bad thing. But the biased way it's presented makes it harder to separate the wheat from the chaff.
3) It's not obvious that we're going to squeeze any more business value from this particular stone. Sometimes all the useful information in a dataset can be found with a fairly simple algorithm. Not everything benefits from a more complex representation, and sometimes you can't fix that with regularization or more data. Sometimes you just have to use the simple model and accept that it captures all the signal that's available and everything else is noise.
I could even argue that "relying primarily on recommendations is a dark pattern". On many platforms, the approach is replacing good search with good recommendation. Essentially, you don't get to, say, explicitly specify your results; instead you get an opaque mix of "things you want and things we want you to want", and you're supposed to be happy with this. You can see this in action in Amazon recommendations, YouTube, and Google - actual control is replaced by "we know what you're thinking" (which indeed works some of the time, but is infuriating when it doesn't, and leaves lots of room for nefarious effects - see YouTube's fascist/extremist propaganda problems, etc).
And someone always pipes in here with "users are dumb and can only deal with recommendations since they can barely figure out toasters". Well, sites kind of need to educate their users. Users actually have learned a bit given the dominance of the Internet over the last 30 years, and a general awareness of the problems of recommendation-reliance could contribute to change - just as, right now, a lot of supposedly not-dumb developers and managers think of recommendations as a benevolent or at least neutral approach.
Here's an optimistic cultural take: good recommendations take you outside your bubble. It's not following Return of the Jedi with Episode I, but changing to something different, that you enjoy even though you would not have expected it.
I wouldn't call it cynical at all. I view it as idealistic. The user should be in charge; the user should have tools to formulate a query describing what they want. Doubting the value of anything other than the user being in charge goes along with this. You could say we're each optimistic about something different, but then we come down to objective comparisons.
The thing to consider with your comparison is that most YouTube recommendations don't give you anything new at all. They're usually just some other item from whatever is considered the top ten. Google gives ten mainstream movies; YouTube gives the top ten of whatever song-era you are looking at. The engines never, ever find "hidden gems". The rise of recommendation engines resulted in a homogenization of the web - we have all experienced this. And I'd claim better algorithms can't change this, since the algorithms just don't have enough information about why a user likes song/movie/product X; even for just songs, there are multiple qualities that different people look for (cf. the GP's and OP's fine arguments).
And further, if a user wanted something new, you could have a "choose at random" button that let them know what they were getting into. Also, I don't think a certain amount of recommendation within tool-type workflows is bad, but it can devolve. Google rose by being more likely to actually get people what they were thinking of, but that vein has been mined out to the point that all that's left is ... low-quality trash. But I'm optimistic there are ways to do better.
distributing Docker containers works better, but even that won't necessarily help: especially if you have to use GPUs, the container software has been shifting underneath you. Plus it's still cumbersome to make a fair comparison between containerized code and a new approach implemented somewhere else, especially if it's on different data.
Would it be worth making everyone expend a huge amount of unpaid additional effort maintaining the code for every ML paper ever published, for years to come? I'm not sure -- many are not very good and should probably be left to die. On the other hand it's common and immensely frustrating to be unable to reproduce results.
regarding 2, it seems like if the technique itself works in a substantially different way from the state of the art, even if it can't consistently beat it, that's still a positive result -- something new.
So basically sprinkle the code with print statements. Then if a future researcher has difficulty reproducing the result, they know exactly at which step the problem occurs.
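To make the "sprinkle print statements" idea slightly more systematic, you can log a fingerprint of each intermediate artifact, so a reproduction attempt can localize the first stage where results diverge instead of just seeing a different final number. Here's a minimal sketch in Python; the stage names and the toy pipeline are made up for illustration:

```python
import hashlib
import json

def fingerprint(obj):
    """Deterministic short hash of any JSON-serializable intermediate result."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

def checkpoint(stage, obj, log):
    """Record the stage name and fingerprint, and print it so it ends up in the run logs."""
    fp = fingerprint(obj)
    log.append((stage, fp))
    print(f"[repro] {stage}: {fp}")
    return obj

# Hypothetical three-stage pipeline: load, featurize, score.
log = []
data = checkpoint("load_data", [1, 2, 3, 4], log)
features = checkpoint("featurize", [x * 2 for x in data], log)
scores = checkpoint("score", [x + 1 for x in features], log)
```

If the published logs include these fingerprints, someone re-running the code can diff their own `[repro]` lines against the paper's and see exactly which stage first produced something different (data loading vs. preprocessing vs. the model itself).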
That's the case if there's just one baseline to beat. It would be a bad thing if everyone were working on beating today's single state of the art model. It would be great to incentivize more work along varied lines.
Either way, we absolutely don't want papers that claim their innovation is an improvement on a metric they don't really improve.
These peer-review aberrations incentivise these bad research practices, I think.