
Bestpetting Cheated - bgraves
https://www.kaggle.com/bminixhofer/how-bestpetting-cheated#715681
======
75dvtwin
Main cheater identified as Pavel Pleskov.

According to this presentation

https://www.slideshare.net/DataFestTbilisi/how-to-win-a-machine-learning-competition-pavel-pleskov

He worked at H2O.ai (though I think he has since been fired). Prior to that (again
according to the above):

    
        - Master of Science from Moscow State University
        - New Economic School (Moscow)
        - Financial Consultant
        - Quantitative Researcher
        - HFT Fund partner
    

Overall this seems to be an impressive track record; it's the type of background
that's often mentioned on HN as what top firms hire from...

It's completely unclear why he needed to cheat. Are there other sophisticated
cheaters out there in these types of competitions?

Maybe there need to be prizes for 'checking' other people's work?

------
g82918
I don't see much wrong here, or how this would be cheating. They produced a
winning entry; it should be on the organizers to ensure that their test data set
isn't trivially findable. It would be like testing a digit recognizer on the
MNIST data set and being surprised when someone just hashes it. The real
solution isn't to force open-sourcing; it's to get better metrics. Maybe add a
random component, such as a GAN that generates potential test data, and see if
anything classifies it correctly. In the real world, when the metric becomes the
target, it ceases to be a good metric. So test what you want to test, not just
some existing data set.
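To make the MNIST analogy concrete, here is a minimal sketch (all names and data invented) of how a "model" that is really just a lookup table could score perfectly once the test set is known: hash each input and return the memorized label, with no learning at all.

```python
import hashlib

def fingerprint(image_bytes: bytes) -> str:
    """Stable fingerprint of a raw test input."""
    return hashlib.sha256(image_bytes).hexdigest()

# Precomputed offline from the publicly findable test set.
# The image bytes and labels here are placeholders, not real MNIST data.
known_labels = {
    fingerprint(b"fake-mnist-image-0"): 7,
    fingerprint(b"fake-mnist-image-1"): 3,
}

def predict(image_bytes: bytes) -> int:
    # No model at all: a pure lookup scores 100% on the memorized test set,
    # and fails completely on any input it has not seen before.
    return known_labels[fingerprint(image_bytes)]
```

This is why a hidden or freshly generated test set is the only metric such a "solution" cannot game.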

Edit: I didn't see that the test data was given. See the first reply to this
comment.

~~~
w1
The issue isn't that they found a copy of the test data online (the test input
data was provided to them as part of the problem).

The issue is that they manually labeled the test data and then pretended they
didn't.

The competition objective is to provide an ML solution that produces labels
for the test data, showing your work with code (to prove you didn't just
hand-label the data).

Instead, they did manually label the data, and they hid their manual labels in
the id column of an external data source.
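A rough sketch of the trick described above, with all identifiers and values invented for illustration: hand labels are smuggled into an innocuous-looking id column of an "external" dataset, then recovered at prediction time, so the submitted code appears to compute labels when it is really just decoding them.

```python
def encode_id(real_id: int, label: int) -> str:
    """Hide a hand label as the last digit of a fake-looking identifier."""
    return f"pet{real_id:06d}{label}"

def decode_label(fake_id: str) -> int:
    """'Predict' by reading the smuggled label back out of the id."""
    return int(fake_id[-1])

# Offline: the cheater hand-labels test row 4242 as class 1 and bakes it
# into the external file's id column.
fake_id = encode_id(4242, 1)

# At "inference" time, the submitted code recovers the label with no model.
recovered = decode_label(fake_id)
```

The real scheme was more obfuscated than a trailing digit, but the principle is the same: the external file is a covert channel for hand labels, not genuine side information.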

~~~
g82918
Ah, thank you, that part wasn't clear when I was reading it!

