
A Large Self-Annotated Corpus for Sarcasm - blopeur
https://arxiv.org/abs/1704.05579
======
nebabyte
To provide you refuge from the inevitable deluge of sarcastic comments in this
comment section, here is a genuine/sincere comment: I like cats.

> sarcasm is labelled by the author

They literally just searched out "/s". Clever. Though I'm guessing the
"independently verified" entailed reading a lot of those comments.

Did they also read through the unlabelled comments to catch any unlabelled
sarcasm? (Guessing not, since the pitch is "self-labelled sarcasm".) I wonder
if that'll trip up any usage.
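
For illustration, a search for a trailing "/s" could look roughly like the sketch below. This is only a guess at the idea, not the authors' actual filter; the pattern and the names `SARCASM_TAG`, `is_self_labelled` and `strip_tag` are hypothetical.

```python
import re

# Hypothetical pattern: "/s" as a standalone token at the very end of a comment.
# Requiring whitespace (or start of text) before it avoids matching URLs or paths.
SARCASM_TAG = re.compile(r"(?:^|\s)/s\s*$")


def is_self_labelled(comment: str) -> bool:
    """Return True if the comment ends with a standalone "/s" marker."""
    return SARCASM_TAG.search(comment) is not None


def strip_tag(comment: str) -> str:
    """Remove the trailing "/s" marker so a model cannot simply learn the tag."""
    return SARCASM_TAG.sub("", comment).rstrip()


if __name__ == "__main__":
    samples = [
        "Yeah, that'll be really useful. /s",
        "I like cats.",
        "see http://example.com/s",   # "/s" is part of a URL, not a marker
        "On Windows, use the switch /s",  # matches, but is not sarcasm
    ]
    for text in samples:
        print(is_self_labelled(text), repr(strip_tag(text)))
```

The last sample shows why some manual checking is still needed: a "/s" that is part of the sentence gets picked up by a purely textual search.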

~~~
TwoFx
From the paper: "To investigate the noisiness of using Reddit as a source of
self-annotated sarcasm we estimate the proportion of false positives and false
negatives induced by our filtering. This is done by having three human
evaluators manually check a random subset of 500 comments from SARC-main
tagged as sarcastic and 500 tagged as non-sarcastic, with full access to the
comment’s context. A comment was labeled a false positive if a majority
determined that the “/s” tag was not an annotation but part of the sentence
and a false negative if a majority determined that the comment author was
clearly being sarcastic. After evaluation, the false positive rate was
determined to be 2.0% and the false negative rate 3.0%. Although the false
positive rate is reasonable, the false negative rate is significant compared
to the sarcasm proportion, indicating large variation in the working
definition of sarcasm and the need for methods that can handle noisy data in
the unbalanced setting."
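
A minimal sketch of the majority-vote check described in that passage, assuming each sampled comment gets one yes/no judgement per evaluator. This is not the authors' code; the function names and the vote patterns are made up for illustration, and the count of 10 flagged comments is simply what 2.0% of 500 works out to.

```python
from typing import Sequence


def majority(votes: Sequence[bool]) -> bool:
    """True if more than half of the evaluators flagged the comment."""
    return sum(votes) > len(votes) / 2


def error_rate(judgements: Sequence[Sequence[bool]]) -> float:
    """Fraction of sampled comments that a majority judged to be mislabelled."""
    flagged = sum(1 for votes in judgements if majority(votes))
    return flagged / len(judgements)


if __name__ == "__main__":
    # Illustrative sample of 500 comments tagged sarcastic, three evaluators each.
    # 10 comments are flagged by a majority (2.0% of 500); the rest are not.
    sample = [[True, True, False]] * 10 + [[False, False, True]] * 490
    print(f"false positive rate: {error_rate(sample):.1%}")
```

The same function would apply to the 500 comments tagged non-sarcastic, with a vote meaning "the author was clearly being sarcastic".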

~~~
eriknstr
I've seen people add '/s' to their comment when what they wrote wasn't
actually what I'd call sarcastic. Probably quite a few people have seen others
use '/s' but they've inferred the wrong meaning of the label and then they use
it incorrectly.

Seems from what was said above that this is something that has not been taken
into account.

~~~
tgjsrkghruksd
If it's indistinguishable to fellow humans as sarcasm then it doesn't matter
for the purpose of this corpus.

~~~
eriknstr
What I mean is, it looks like the researchers only looked at whether or not
the "/s" was intentionally placed at the end of the comment, not whether the
comment was actually sarcastic or whether the person that wrote it understood
that "/s" is meant to convey sarcasm.

------
3131s
A professor of mine named John Haiman had many interesting thoughts on
sarcasm. His book "Talk is Cheap", which I unfortunately can't find a PDF of
online, is definitely recommended:

[https://www.amazon.com/Talk-Cheap-Alienation-Evolution-
Langu...](https://www.amazon.com/Talk-Cheap-Alienation-Evolution-
Language/dp/0195115252)

I haven't read it in a few years, and my copy is at my parents' house in
another country, but his writing always avoided the obtuse, impenetrable style
that a lot of linguists are unfortunately guilty of. It is also approachable
for anyone without a linguistics background.

~~~
gtirloni

      Select Format
      
      Kindle – $65.39
      Hardcover – $129.74
      Paperback – $18.70
    

_head spins_

~~~
3131s
Yeah, I know. No idea why the ebook is so expensive :(

------
dlkf
A Kaggle user had the same idea two years ago:
[https://www.kaggle.com/smerity/d/reddit/reddit-comments-
may-...](https://www.kaggle.com/smerity/d/reddit/reddit-comments-
may-2015/finding-sarcasm/code)

I had some fun exploring the data so I wrote a short blog post about it:
[https://davefernig.com/2015/10/19/the-lowest-form-of-wit-
mod...](https://davefernig.com/2015/10/19/the-lowest-form-of-wit-modelling-
sarcasm-on-reddit/)

------
sverige
Funnily, though I have a naturally sarcastic personality and frequently (and
unintentionally) confuse people with my tone, I also have trouble sometimes
persuading people that I _was not_ being sarcastic when I say something
plainly. I think it has to do with some statement I've made being so outside
the norms of what they find acceptable that for them it is only understandable
as sarcasm.

And this sort of thing happens both with written and oral communication,
unless I really focus on providing facial and other body language clues as to
my intent, which I find to be somewhat annoying. I am, after all, of
Scandinavian extraction, and excessive emotional expression is not only
frowned upon culturally, it has also been systematically bred out of my
genetic code for dozens of generations.

~~~
Broken_Hippo
Funny you should say that. My spouse is Norwegian, and I still occasionally
have to ask him if he's being serious or sarcastic.

And to be fair, I find the lack of emotional outbursts to be a rather
enjoyable part of society. It lets me relax and not have to keep track of so
many "approved" emotions.

------
gavinpc
> We collect a very large corpus, SARC-raw, with around 500-600 million total
> comments, of which 1.3 million are sarcastic.

So Reddit is 0.2% sarcastic. That sounds accurate.
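
A quick back-of-the-envelope check of that figure, using only the numbers in the quote (the function name `sarcasm_share` is just for illustration):

```python
def sarcasm_share(sarcastic: float, total: float) -> float:
    """Fraction of all comments that carry the sarcasm label."""
    return sarcastic / total


if __name__ == "__main__":
    sarcastic = 1.3e6  # "1.3 million are sarcastic"
    for total in (500e6, 600e6):  # "around 500-600 million total comments"
        share = sarcasm_share(sarcastic, total)
        print(f"{total / 1e6:.0f}M total: {share:.2%}")
```

Depending on which end of the range is used, the share lands between roughly 0.2% and 0.3%.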

------
theprop
Has anyone used this to build a Sarcasm bot? I desperately need this to handle
all my Twitter & Facebook replies.

~~~
js2
There was this time my daughter and I were having a conversation that
descended into sarcasm to the point where we no longer could tell if the other
was still being sarcastic.

My daughter ended it with: "I'm afraid we're caught in a sarcasm trap."

I'd be worried two sarcasm bots would end up similarly entangled.

------
sparkzilla
Yeah, that'll be really useful.

------
mrcactu5
Am I walking into something here? I am concerned about regional bias. Which
dialect of English is being spoken?

------
basicplus2
Is this Corpus for Sarcasm real or is this report of a corpus of sarcasm
sarcasm?

~~~
blopeur
Real, you can download it here:
[http://nlp.cs.princeton.edu/SARC/](http://nlp.cs.princeton.edu/SARC/)

~~~
tfm
Is that the actual URL? Seems to be missing /s

Marginally related: on cs.CL the other day was "Punny Captions: Witty Wordplay
in Image Descriptions"[0]. A mashup of these two projects would bring us that
much closer to the dream of Social Media In A Box.

[0] [https://arxiv.org/abs/1704.08224](https://arxiv.org/abs/1704.08224)

~~~
blopeur
It's the URL provided in the paper (first-page bottom left)

------
psyc
Neat. HNers could train themselves on it.

~~~
dsacco
Now we just have to train machines on subtle meta!

~~~
nebabyte
> subtle

search '[forum name]' or 'recently trending phrase'. subtle indeed

~~~
dsacco
It's a good start, but that sort of meta seems pretty on the nose. I'm not
sure that would catch the comment I replied to though (it was meta because it
was sarcastic and included a comment about machine learning, not because it
referenced HN).

------
pavlov
If you combine this corpus with a compilation of Donald Trump's tweets, will
it result in a matter-antimatter explosion of intentional sarcasm and
unintentional irony?

