
How we analysed 70M comments on the Guardian website - EwanToo
https://www.theguardian.com/technology/2016/apr/12/how-we-analysed-70m-comments-guardian-website
======
takno
The technology choices seem okay. I suspect it could have been done fairly
quickly and slightly less expensively just by putting the metadata into the
postgres image and running plain SQL queries there. On the other hand gaining
some experience of Spark is a reasonable thing to want to do as part of the
analysis.
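The "plain SQL queries over the metadata" approach the parent describes could be sketched roughly as below. This is illustrative only: SQLite stands in for Postgres, and the schema (an `author_gender` column and a `blocked` flag per comment) is invented for the example, not the Guardian's actual data model.

```python
import sqlite3

# SQLite stands in for Postgres; the schema is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        author_gender TEXT,   -- gender of the article's author (assumed column)
        blocked INTEGER       -- 1 if moderators blocked the comment
    )
""")
conn.executemany(
    "INSERT INTO comments (author_gender, blocked) VALUES (?, ?)",
    [("female", 1), ("female", 0), ("female", 1),
     ("male", 0), ("male", 0), ("male", 1)],
)

# The core aggregate of the study: blocked-comment rate per author group.
rows = conn.execute("""
    SELECT author_gender,
           AVG(blocked) AS blocked_rate,
           COUNT(*)     AS n_comments
    FROM comments
    GROUP BY author_gender
    ORDER BY author_gender
""").fetchall()
print(rows)
```

At 70M rows this is a single `GROUP BY` over one indexed table, which is well within what a single Postgres instance handles comfortably, which is the parent's point about Spark being optional here.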

I'd be concerned, for the analysis itself, that using the blocking of a comment
as a proxy for the comment being abusive carries a significant risk of
confirmation bias - moderators may be more inclined to moderate articles by
authors with traditionally female names, or those authors may be more inclined
to report comments, or indeed they may be more inclined to write articles on
more controversial topics. I can't think of a better automated approach to the
analysis, and I find the outputs pretty believable, but I'd be wary of
regarding the study as evidence.

~~~
Uhhrrr
They respond to that and other criticism here:

>It is also important to say that it is true that the research is only valid
if you accept a) that the Guardian's community standards are fit for purpose,
and b) that the moderators are reasonably skilled at applying them. Some
readers may not accept either of these things, and our research is not
designed to justify these assumptions.

[https://www.theguardian.com/technology/2016/apr/12/how-we-an...](https://www.theguardian.com/technology/2016/apr/12/how-we-analysed-70m-comments-guardian-website#comment-72218686)

~~~
takno
I made no comment on the fitness for purpose of the community standards or the
skill of application. I'm mostly commenting that the standards are largely
reactively enforced, and this can lead to bias.

~~~
Uhhrrr
Oh, I agree - I was just quoting to show they acknowledge that point of view
and don't dispute it.

------
aerovistae
This seems weirdly not insightful to me. Did anyone else learn anything they
didn't already know, or which they wouldn't consider obvious?

Given 70 million comments I might hope for something more: perhaps a
proposition for how to better catch the bad comments automatically, without
needing moderators at all, via some sort of lexical analysis of the good ones
versus the bad ones in this manually sorted set.

Would the Guardian be willing to release the data set for others to analyse?

~~~
dominotw
>Did anyone else learn anything they didn't already know, or which they
wouldn't consider obvious?

"Our list of authors contains the approximately 12,000 individuals"

So the Guardian is more like a blog aggregator?

~~~
takno
This covers articles published in a large daily newspaper over 20 years. The
CIF section in later years does provide an online-only platform for
blog-standard articles from enthusiastic amateurs, but I would have thought
that most of the 12,000 made it into the print version as well.

~~~
corin_
There are about 7.3k days in 20 years, and each day's paper has plenty of
space for guest columnists etc., so it could be easy enough to hit 12k authors
even without CIF content.

------
YeGoblynQueenne
I am not convinced by The Guardian's methodology. Their criterion for an
abusive comment is that it was moderated. That's fine and dandy, but then they
go and make the assumption that if:

a) a comment was abusive

b) AND it was posted under a given article

c) THEN it was abusive _towards the author of the article_

Based on this assumption they find a correlation with articles being written
by women and minorities, and high rates of abuse (a.k.a. moderated comments).
The assumption is a very big one to make. People in comments may just be
hurling abuse at each other. They might be trolling each other rather than the
author of the article, so their abuse may have nothing to do with the sex or
minority status of the author.

Did they check that this is not the case? They report that they did check that
the correlation was, in their opinion, statistically significant [1], but they
don't give any details of the test they used.
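The Guardian doesn't say which test it ran. One conventional choice for comparing blocked-comment rates between two groups of authors is a two-proportion z-test; the sketch below uses entirely made-up counts and is only an example of the kind of test one might apply, not a reconstruction of theirs.

```python
import math

def two_proportion_z_test(blocked_a, total_a, blocked_b, total_b):
    """Two-sided two-proportion z-test on blocked-comment rates.

    A standard test for whether two groups' rates differ; the Guardian
    did not publish which test they actually used.
    """
    p_a = blocked_a / total_a
    p_b = blocked_b / total_b
    # Pooled proportion under the null hypothesis of equal rates.
    p_pool = (blocked_a + blocked_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal tail.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts purely for illustration: 6% vs 4.5% blocked rates.
z, p = two_proportion_z_test(1200, 20000, 900, 20000)
print(f"z = {z:.2f}, p = {p:.4g}")
```

Note this kind of test only tells you the rates differ; it says nothing about the causal question raised above of whether the abuse was directed at the author.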

In general, it's true that The Guardian has a big problem with comments.
Eyeballing it, most of their comments are dross that doesn't contribute
anything to any conversation, even if they're not abusive. They need to
attract a better class of commenter. They'd do well to take a page out of HN's
book and implement a self-moderation system.

Also, allowing editing of comments for a brief period might help improve their
quality. Sometimes people might regret pressing "post" but have no way to
retract their own comment.

[1] [https://discussion.theguardian.com/comment-permalink/7221868...](https://discussion.theguardian.com/comment-permalink/72218686)

------
Uhhrrr
They have a little quiz with 8 blocked comments, to see which ones you would
allow. Here is one of the ones the Guardian blocked as abusive:

"The Guardian, once a standard bearer of quality journalism now contains
football journalists so in love with Mourinho it makes me sad. This is just
the latest in an incredible long campaign for the despicable one to join the
club of Matt Busby and Jimmy Murphy. I am astonished that the editor of the
paper allows this dross to be published. You are a disgrace to the
profession."

~~~
petecox
I tried the quiz and scored about 4/8. I'm probably a little more permissive
in letting people express their opinions if it means they dig their own holes.

I know nothing of this soccer caper but aside from the last sentence, it seems
like a fairly typical rant. But why do they publish Mourinho-loving dross? :)

If they genuinely want to clean up their comments section, they should
eliminate the professional trolls. E.g. on any politics story the same faces
appear, seemingly plucked from university debating teams. It doesn't matter
what ridiculous position is being argued, as long as they cheer for their side
and smear the other mob. Most voters have leanings but don't have a "team" per
se and simply want good policy.

