
How to parse a sentence and decide if to answer "that's what she said"? - fogus
http://www.quora.com/How-would-you-programmatically-parse-a-sentence-and-decide-whether-to-answer-thats-what-she-said
======
nyellin
This submission reminded me why I love HN: on Reddit, following the link would
have led to a flowchart (at best). Here, there's an engineering solution, with
practical tips for machine learning.

Now I want to build a "that's what she said" bot for Twitter, which will parse
new tweets and reply accordingly. (Maybe I will, as a learning exercise.)

~~~
tcard
[http://www.reddit.com/r/programming/comments/gyook/how_would...](http://www.reddit.com/r/programming/comments/gyook/how_would_you_programmatically_parse_a_sentence/)

Actually, a fair number of the programming links are the same both here and on
Reddit.

~~~
nyellin
I don't think so, considering that the first comment thread on Reddit is a
series of "that's what she said" jokes.

Edit: All the same, I'll give /r/programming a chance. As a whole, Reddit
looks less civil than HN, but maybe I haven't been fair. People tend to
criticize subjects they don't know about or have never experienced firsthand,
and I'm guilty of that here.

~~~
wzdd
On a programming topic, on HN, maybe 20% of the comments are interesting, or
at least informative. On a programming topic, on Reddit, maybe 2% of the
comments are interesting or informative, but there are 10 times as many
comments, the interesting ones are voted to near the top (not necessarily
right up there), and some of the interesting Reddit comments don't have an
analogue on HN.

~~~
GoodIntentions
/agree. SNR on Reddit is maybe 5%, Slashdot 1%; here, I don't know, 50%? I see
some less-than-useful posts (like this one), but almost no outright trolling
or patently wrong posts.

HN makes my daily read list, even if it is a quick glance. Great community.

------
follower
Slightly OT, but I was really intrigued to read about the "Switchboard" Corpus
([http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.re...](http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.readme.html)),
especially given that the transcripts came from audio recordings.

It seems, unfortunately, that the recordings aren't publicly available,
although archive.org seems to have a small sample of the transcripts:
<http://www.archive.org/details/SwitchboardCorpusSample>

One particularly interesting section of the above readme was the section on
technical issues
([http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.re...](http://www.ldc.upenn.edu/Catalog/readme_files/switchboard.readme.html#technical))
including this example:

"ii.) The third problem was small changes in synchrony between A and B, due to
a pseudorandom dropping of 2 ms chunks of data on either side. Over the course
of a 10 minute conversation, these could accumulate to a differential of 30 or
40 msec between sides--enough to change a cross-channel echo from inaudible to
audible, for example, or from barely audible to very noticeable, for a human
listener.

When this bug was finally run down, it turned out to be a piece of code in the
utility which extracts conversations ('messages') from the Robotoperator
message master file. The code performed a check at each data block boundary to
see if the first two bytes had the values 'FF FF'; if so, these were
interpreted as header information, and the 16 bytes beginning with "FF FF"
were discarded as not part of the speech data. This code was a relic from an
earlier version of the Robotoperator which did not deal with mu-law values,
and thus never encountered FF in data. In mu-law data, FF is one of two ways
of representing zero signal level ('minus zero'). The offending lines of code
were removed and the problem ceased."

~~~
woodson
That corpus was frequently used some time ago for automatic speaker
recognition tests and system evaluations. Today, however, there are more
challenging corpora, like those provided by NIST for its bi-annual speaker
recognition evaluations (<http://www.nist.gov/itl/iad/mig/sre10.cfm>). These
provide more mismatched conditions, which require more advanced channel
compensation mechanisms.

------
jzila
One of the commenters on Quora linked to an academic paper providing a
solution to this problem. It was apparently published in ACL-HLT this year.

[http://www.cs.washington.edu/homes/brun/pubs/pubs/Kiddon11.p...](http://www.cs.washington.edu/homes/brun/pubs/pubs/Kiddon11.pdf)

Enjoy. It's easily the most hilarious academic paper I've read.

~~~
GFischer
I've seen several better ones, but it's an interesting read nonetheless :)
("applying a novel approach - Double Entendre via Noun Transfer (DEviaNT) ")

As an example of another funny article, this legal paper on the word "fuck" is
one of the best I've read recently (I found it linked here on HN previously):

<http://moritzlaw.osu.edu/faculty/articles/fairman_fuck>

Of course, the Annals of Improbable Research and the Ig Nobel Prizes yield
several other examples :)

------
zecho
"Explaining a joke is like dissecting a frog. You understand it better but the
frog dies in the process." — E. B. White

I think that answer on Quora, while awesome, should sufficiently kill TWSS.

------
hugh3
It's a lot harder than I was expecting.

~~~
kirubakaran
That's what she said.

~~~
martinkallstrom
I scanned through this page looking to see if this exact comment would appear.
Now that I see it, I realize it's less funny than I imagined. Maybe that's
because the comment you replied to was a trap, and you fell for it.

------
tlrobinson
"Pre-trained That's-What-She-Said (TWSS) classifier in Ruby":

<https://github.com/bvandenbos/twss>

------
shasta
Here's my algorithm: Return false

~~~
kleiba
Probably a pretty good baseline.

------
thedaniel
I would gem install twss

<http://rubygems.org/gems/twss>

------
r00fus
The answer ignores a very public and growing corpus of data: IRC channels.

~~~
enjo
The author makes a great point, however, about the difficulty of linking those
two together. I'm sure you'd end up with a lot of false positives from IRC:
even if the joke is attributed (ircuser: TWSS!), the flow of IRC conversation
means it's not always the last thing _ircuser_ said that is being replied to
(because of network lag and general reply lag on IRC).

Although now that I think about it, Twitter probably suffers from the exact
same issue, though it's somewhat mitigated because people often include the
original tweet in their reply.

~~~
DasIch
Unless I'm missing something, this only occurs if someone sends two messages
and the receiving user responds to the first message but gets the second one
before sending their reply.

Using average read/write times for a message - taking its length into account
- it should be fairly easy to check whether someone could have read a message
and written a response to it. Assuming it is a response whenever they could
have written one in time should be fairly accurate.

You could probably even cheat and use a constant time while still getting good
results.
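The timing heuristic described above can be sketched in a few lines; the
reading and typing rates here are invented placeholders, not measured values:

```python
# Assumed rates (made up for illustration); real values would come
# from measuring actual chat behavior.
READ_CHARS_PER_SEC = 30.0
TYPE_CHARS_PER_SEC = 5.0

def could_be_reply(msg_text, reply_text, seconds_elapsed):
    """Return True if there was plausibly enough time to read msg_text
    and type reply_text in the elapsed interval."""
    needed = (len(msg_text) / READ_CHARS_PER_SEC
              + len(reply_text) / TYPE_CHARS_PER_SEC)
    return seconds_elapsed >= needed
```

Replacing `needed` with a constant is the "cheat" mentioned above: it trades
accuracy for simplicity, which may be fine in practice.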

------
gilesc
I'm working on this; if anyone wants a corpus (positive examples from
twssstories.com, negative examples from fmylife.com):

<https://gist.github.com/945614>

------
lwhi
A British equivalent to 'that's what she said' would probably be 'said the
actress to the bishop'.

------
mitko
Two suggestions:

1) Ask people on Mechanical Turk to write those sentences. Then you can ask
others to verify them - you can get far with a few dollars.

2) Include higher-level features - for example bigrams; there is more
information in them.

Also: there's a corpus at <http://thatswhatshesaid.com/> (I have no relation
to this site)
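The bigram suggestion above can be sketched with a naive whitespace
tokenizer (just for illustration; a real feature extractor would handle
punctuation and casing more carefully):

```python
def bigrams(sentence):
    """Return the list of adjacent word pairs in a sentence."""
    tokens = sentence.lower().split()
    # Pair each token with its successor.
    return list(zip(tokens, tokens[1:]))
```

Each pair then becomes one feature for the classifier, alongside (or instead
of) single-word features.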

------
zach
Hey, let's see Google's Prediction API take on this challenge, if it's so
general. Just find a decent corpus and pop it into a GAE app.

In any case, I think most of us are now primed to try out some implementation
or another. It would be a lot of fun regardless of the quality. Actually, the
false positives are probably funnier.

------
snikolov
I would be very surprised if a simple bag-of-words approach worked in this
case. Intuitively, it's not the presence of certain groups of words that's
important; it's something much more subtle and structural. Something that
might be promising (and I'm being very handwavy here) is to discover
'template' sentence structures, as well as the particular words populating
those templates that result in a TWSS.
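One very rough sketch of the 'template' idea: map each word to a coarse
category and check the resulting sequence against known suggestive templates.
The tiny lexicon and template set here are invented for illustration; a real
system would use a proper POS tagger and learn the templates from data:

```python
# Hypothetical toy lexicon: word -> coarse category.
LEXICON = {"it": "PRON", "that": "PRON", "is": "VERB", "was": "VERB",
           "so": "ADV", "really": "ADV", "hard": "ADJ", "big": "ADJ",
           "long": "ADJ", "wet": "ADJ"}

# Hypothetical category-sequence templates.
TEMPLATES = {("PRON", "VERB", "ADV", "ADJ"),   # e.g. "it is so hard"
             ("PRON", "VERB", "ADJ")}          # e.g. "that was big"

def matches_template(sentence):
    """True if the sentence's category sequence matches a template."""
    tags = tuple(LEXICON.get(w, "OTHER") for w in sentence.lower().split())
    return tags in TEMPLATES
```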

~~~
whosKen
I don't believe templates would work very well. Sentences vary too much, so
you would end up with very low recall.

An alternative is to attack the problem backward: train on terms (words or
phrases) from sex-related conversations (such as adult chatroom transcripts),
then, from a general corpus (Twitter or generic chats), identify terms that
co-occur strongly with those sex terms. I would still use a Bayesian
classifier, with a strong prior against labelling something as a TWSS.
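That last idea can be sketched as a Naive Bayes classifier with a skewed
prior; the toy training data and the 0.05 prior below are placeholders, not
values from any real corpus:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (word_list, label). Returns per-label word counts."""
    counts = {"twss": Counter(), "other": Counter()}
    for words, label in docs:
        counts[label].update(words)
    return counts

def classify(counts, words, prior_twss=0.05):
    """Naive Bayes with Laplace smoothing and a strong prior against TWSS."""
    vocab = len(set(counts["twss"]) | set(counts["other"]))
    scores = {}
    for label, prior in (("twss", prior_twss), ("other", 1 - prior_twss)):
        total = sum(counts[label].values())
        score = math.log(prior)
        for w in words:
            # Add-one smoothing handles words unseen for this label.
            score += math.log((counts[label][w] + 1) / (total + vocab))
        scores[label] = score
    return max(scores, key=scores.get)
```

With a 0.05 prior, a sentence needs fairly strong word evidence before the
classifier will label it TWSS, which is exactly the intended bias.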

------
Skywing
If this were to emulate some of my friends then it could respond to just about
anything with "that's what she said."

------
wging
I think this is marvelous but overthought. Here's my Python code (2.7 and
3.2-compatible):

twss = lambda sentence: True

------
lucisferre
Nice try Skynet.

~~~
TheAmazingIdiot
Skynet? Who's heard anything of Skynet? I'm GLaDOS.

Now, will you stand over.... there?

