
Data mining Reddit posts reveals how to ask for a favor and get it - Libertatea
http://www.technologyreview.com/view/527496/dating-mining-reddit-posts-reveals-how-to-ask-for-a-favor-and-get-it/
======
philh
> It turns out that their algorithm makes a successful prediction about 70 per
> cent of the time. That’s far from perfect but much better than random
> guessing which is right only half the time.

From the graph, it looks like only about 27% of requests are fulfilled in the
best case (jobs). In which case I can do better than 70% just by constantly
predicting "no".

(I assume that this is just bad reporting. I haven't read the reference.)

Edit: skimmed the referenced article. Average success rate is 24.6%. The 70%
they give (well, 67.2%) is "the probability that a classifier will rank a
randomly chosen positive instance over a randomly chosen negative one".
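
The arithmetic behind the "always predict no" baseline can be sketched in a couple of lines. This only assumes the 24.6% average success rate quoted above; the function name is made up for illustration.

```python
def always_no_accuracy(success_rate):
    """Accuracy of a classifier that predicts "not fulfilled" for every request."""
    # Every unfulfilled request is classified correctly, every fulfilled one is not.
    return 1.0 - success_rate

# 24.6% is the average success rate quoted from the paper above.
baseline = always_no_accuracy(0.246)
print(f"always-'no' accuracy: {baseline:.1%}")  # prints: always-'no' accuracy: 75.4%
```

So plain accuracy rewards the trivial classifier here, which is presumably why the paper reports a ranking measure instead.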

~~~
im3w1l
They got a ROC AUC score of 0.67. This means that for a randomly chosen denied
request and a randomly chosen accepted request, they will give a higher score
to the accepted request 67% of the time.
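
A minimal sketch of that pairwise reading of AUC, using made-up scores rather than anything from the paper. Counting ties as half a win is the usual convention; `pairwise_auc` is a name chosen here, not a library function.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive one scores higher.

    Ties count as half a win.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Made-up classifier scores, for illustration only.
accepted = [0.9, 0.6, 0.4]
denied = [0.7, 0.5, 0.3, 0.2]
auc = pairwise_auc(accepted, denied)
print(auc)  # 9 of the 12 pairs are ranked correctly, so this prints 0.75
```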

~~~
Irishsteve
Below 0.8 is pretty much random

~~~
colanderman
No, reread the parent's statement. The test involves distinguishing exactly
one _known_ accepted request and exactly one _known_ rejected request.
Probability of an acceptance does not come into play here; hence a random
algorithm would choose the accepted request correctly 50% of the time.

What the algorithm performs poorly at is determining whether any _single
arbitrary_ request is accepted or rejected; that's the test that would require
around 80% success rate.
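
A rough simulation of that point: a scorer that assigns random numbers lands near 0.5 on the pairwise test whether or not the classes are balanced. The sample sizes and the seed are arbitrary choices for this sketch.

```python
import random

def pairwise_auc(pos, neg):
    """Share of (accepted, denied) pairs ranked correctly; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def random_scorer_auc(n_accepted, n_denied, seed=0):
    """Pairwise AUC of a scorer that ignores the request and outputs noise."""
    rng = random.Random(seed)
    accepted = [rng.random() for _ in range(n_accepted)]
    denied = [rng.random() for _ in range(n_denied)]
    return pairwise_auc(accepted, denied)

balanced = random_scorer_auc(500, 500)
skewed = random_scorer_auc(250, 750)  # about a quarter accepted
print(f"balanced: {balanced:.3f}  skewed: {skewed:.3f}")
```

Both numbers should come out close to 0.5, since the acceptance rate never enters the pairwise comparison.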

~~~
Irishsteve
Eh - ok - let me rephrase. Anything below 0.8 with ROC curves is usually
considered to be very poor.

Can't quite recall the paper that gives the details. Think it might be this
one
[http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf](http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf)

~~~
andreasvc
From the PDF you cite:

"Since the AUC is a portion of the area of the unit square, its value will
always be between 0 and 1.0. However, because random guessing produces the
diagonal line between (0, 0) and (1, 1), which has an area of 0.5, no
realistic classifier should have an AUC less than 0.5."

So it appears that not 0.8 but 0.5 is the randomness threshold; therefore 0.7
is not so bad.

~~~
taralx
Unfortunately not. IIRC, any classifier with AUC < 0.5 can be improved by just
inverting its output. Real classification quality requires being significantly
above 0.5 -- 0.8 is a common threshold.
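
The inversion trick can be checked on a toy example. The scores below are invented so that the classifier ranks most pairs the wrong way round; negating them flips every pairwise comparison.

```python
def pairwise_auc(pos, neg):
    """Share of (accepted, denied) pairs ranked correctly; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented scores from a classifier that is worse than chance.
accepted = [0.2, 0.4, 0.3]
denied = [0.5, 0.6, 0.1, 0.7]

auc = pairwise_auc(accepted, denied)
# Invert the output: a high score now means "likely denied".
flipped = pairwise_auc([-s for s in accepted], [-s for s in denied])
print(auc, flipped)  # prints: 0.25 0.75
```

In general the flipped AUC is 1 minus the original, which is why 0.5, not 0, is the floor for a usable classifier.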

------
wingerlang
Off topic now, but is the title confusing? The word "Dating", I can't make
sense of why it is there and/or what it means.

~~~
dewey
Typo; it should be "Data Mining". Not really that confusing in this context,
though.

~~~
wingerlang
Aha. In hindsight, that does seem logical.

~~~
UVB-76
Please tell me this is a typo, and "dating" has not become an acceptable
shortening of "data mining".

~~~
Varcht
I propose "data'n"...

~~~
maxerickson
damning.

------
scottfr
Random pet-peeve. I dislike how the graph in the article has markers at each
decile.

This is model-generated data, so they could put an infinite number of markers
at arbitrary locations. The use of markers implies to the viewer that it is a
data-generated figure, which it is not.

~~~
ronaldx
I was burned by this, thanks.

------
pervycreeper
> We find that Reddit users with higher status overall (higher karma) [...]
> are significantly more likely to receive help

There could be a common cause to these two factors (a certain way of writing,
for instance), that could explain the correlation. I can't imagine someone
basing a judgement on a stat reported on someone's user page.

~~~
rjvir
Do you think someone with an account of say 10 karma points and 2 comments on
a month old account is equally likely to receive help, ceteris paribus?

~~~
pervycreeper
I'm sure it could be a factor (nothing's preventing someone from doing that,
after all) but my original point was essentially as characterized in this post
below
[https://news.ycombinator.com/item?id=7793607](https://news.ycombinator.com/item?id=7793607)

The tone of the article, I think, was suggestive of a causal relation, which
surely doesn't (necessarily or plausibly) hold.

------
colanderman
I'd expect better from a publication titled "MIT Technology Review" than to
start a Y-axis oh-so-close to – but not quite at – zero.

Makes "craving" seem to perform much worse than it actually does at a status
of 0.

~~~
pvdm
"MIT Technology Review" has more similarity to BusinessWeek than the actual
university.

------
CurtMonash
It's been a long time since a date of mine was satisfied with pizza.

------
nwenzel
> Althoff and co used a standard machine learning algorithm to comb through
> all the possible correlations.

What exactly is a "standard machine learning algorithm"?

I'm sure that probably means that they used something from scikit-learn but
"comb[ing] through correlations" isn't as simple as clicking the Go button.

The rest of the article does start to get into labeling, holding out a test
set, and some of the data cleanup (the real combing work).

I guess I was just hoping for more detail of _how_ it worked and not _that_ it
worked. I get that this wasn't meant to be a PhD thesis on supervised machine
learning, but the mechanics of data analysis are really interesting as a
process of discovery.

Curious to know what others think. How did the balance of _how_ vs _that_ work
for you?

~~~
jebus989
It's a logistic regression model, a basic statistical technique which wouldn't
have even come under "machine learning" a few years ago. Later they use some
kind of LASSO regression to penalise the inclusion of redundant features.

"Combing through the correlations", it seems, literally means calculating the
(Pearson) correlation between two variables (success vs. an input feature) and
adding some interpretation, as they do on p6 of the arXiv paper. For
test/training data, it looks like they just used a 30/70% split rather than
k-fold cross-validation and holdout, but I'm sure it makes no difference
either way and in this case (as often) is trivial to design and implement.
Presumably their AUROC could be increased just by dropping in an SVM or a
Random Forest in place of the logistic regression.

From what I've skimmed of the paper you're over-estimating the complexity of
the study.
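
To give a feel for how little machinery is involved, here is a minimal sketch of the two steps described above: a Pearson correlation between success and one input feature, and a shuffled 70/30 split. The toy data, the feature (request length) and the function names are all assumptions for illustration; none of this is taken from the paper's code.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and cut it into 70% train and 30% test."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy data: request length in words, and 1 if it was fulfilled.
lengths = [20, 35, 50, 80, 120, 150]
fulfilled = [0, 0, 1, 0, 1, 1]

r = pearson(lengths, fulfilled)
train, test = split_70_30(list(zip(lengths, fulfilled)))
print(f"r = {r:.2f}, train = {len(train)} rows, test = {len(test)} rows")
```

A real analysis would fit the regression on the training rows only and report AUC on the held-out rows.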

~~~
md2be
Great comment. Having studied graduate statistics within the stats department
and data mining within the CS dept, it's amazing how well the CS crowd has
rebranded statistics into something you can talk about at a bar without
people's eyes glazing over.

~~~
wodenokoto
Is a train/test split part of normal statistics?

------
jlrubin
Here's a project I put out in January, 65% accuracy, that does the same thing.
These guys actually contacted me back then, surprised to see no mention.

And it has a live demo!

[http://randomacts.media.mit.edu](http://randomacts.media.mit.edu)

------
lotsofmangos
This is why the coming AI overlords will not need to kill anyone. They will
just be really convincing instead.

------
bwooceli
A more interesting analysis would be to identify the correlative features that
describe the giver of pizza vs the asker.

------
danbruc
Internet access but no food - strange world.

~~~
pbhjpbhj
Are you suggesting that these people aren't being truthful about their need,
or, are you saying it's amazing that internet access is so [relatively] cheap?
Or... ?

Internet access costs less per month than a takeaway pizza in my country.
Libraries and other community centres provide free access to computers with
broadband internet.

One of the surprises for me when, some years ago, I found myself in a
developing nation (with unstable power supplies and non-potable tap water) was
that every street corner in the town I was staying in had a (hand painted)
advert for an internet cafe. It wasn't cheap compared to food prices - but
since then globally food prices seem to have gone up and internet prices gone
down considerably.

~~~
danbruc
I just wanted to express how bizarre this situation is, without implying
anything about (possible) reasons. My first thought was: just terminate the
internet contract, sell the phone or computer or whatever you are using, and
buy yourself 50 kg of rice. That would be even healthier than pizza!

But I decided not to phrase it this way because this is obviously a very
simplistic view and the people on the internet will tell you that in full
detail even if you are aware of it. _They might have free access to internet.
It might be a temporary situation and not be economical to sell and later buy
back the computer. Corn is much cheaper than rice. You have to cook rice and
that requires energy. Rice alone is not healthy. Maybe they just do this to
get in contact with others not because of having no money for food. They might
need the computer for work. They just got robbed late at night, no money left
but they still have a phone._

