
OpenAI is Using Reddit to Teach An Artificial Intelligence How to Speak - niccolop
http://futurism.com/elon-musks-openai-is-using-reddit-to-teach-an-artificial-intelligence-how-to-speak/
======
draugadrotten
Back in 2007, mobile phones used a system called T9 from Nuance corp which was
trained on a word corpus taken from IRC and similar chats. This caused all
kinds of issues - the mobile phones would accept offensive words like
"naziparking" but reject normal language like "world peace". Using reddit may
lead to ... surprises.

Source: [http://spraktidningen.se/artiklar/2007/11/darfor-ar-din-mobi...](http://spraktidningen.se/artiklar/2007/11/darfor-ar-din-mobil-fordomsfull)

Translated by Google:
[https://translate.google.se/translate?sl=sv&tl=en&js=y&prev=...](https://translate.google.se/translate?sl=sv&tl=en&js=y&prev=_t&hl=sv&ie=UTF-8&u=http%3A%2F%2Fspraktidningen.se%2Fartiklar%2F2007%2F11%2Fdarfor-ar-din-mobil-fordomsfull&edit-text=&act=url)

~~~
faitswulff
T9 would autocomplete my name to "Asian Lung," which was hilarious to my high
school friends. And to be fair, I am Asian and I do have lungs.

~~~
nthcolumn
To be fair, most people are Asian and have lungs.

~~~
sn9
At first, I read your comment and wondered if "most" applies to situations in
which a plurality but not a majority was being talked about.

Then I remembered this [0].

Still wonder about the plurality/majority thing, though.

[0] [http://imgur.com/CK6aONG](http://imgur.com/CK6aONG)

~~~
Jarwain
As with most things in language, it appears the interpretation of "most" is
dependent on the context.

[https://english.stackexchange.com/questions/55920/is-most-eq...](https://english.stackexchange.com/questions/55920/is-most-equivalent-to-a-majority-of)

------
syllogism
The Reddit comment corpus is an awesome dataset. There's relatively little
mark-up to scrub out, low duplication, good metadata, and a variety of topics.

We used it to train a syntax-enriched word2vec model. Write up and demo:
[https://explosion.ai/blog/sense2vec-with-spacy](https://explosion.ai/blog/sense2vec-with-spacy)

Btw, the above was run on CPU in a couple of days, because spaCy doesn't use
GPUs yet. I've applied for a grant from NVidia so I can fix that. If anyone
from NVidia is reading, email me? :)
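For the curious: the core preprocessing trick in sense2vec is just merging each token with its part-of-speech tag before training word2vec, so "duck|NOUN" and "duck|VERB" get separate vectors. A minimal sketch of that step, with a toy lookup table standing in for a real tagger like spaCy:

```python
# Sketch of sense2vec-style preprocessing: fuse each token with its
# part-of-speech tag so a downstream word2vec model learns one vector
# per (word, tag) pair instead of one per surface form.
# TAGS is a hypothetical stand-in for a real tagger, not spaCy's output.

TAGS = {
    "I": "PRON", "saw": "VERB", "a": "DET", "duck": "NOUN",
}

def tag_tokens(tokens, tags=TAGS):
    """Turn ['I', 'saw', 'a', 'duck'] into ['I|PRON', ..., 'duck|NOUN']."""
    return [f"{tok}|{tags.get(tok, 'X')}" for tok in tokens]

if __name__ == "__main__":
    print(tag_tokens(["I", "saw", "a", "duck"]))
```

The tagged tokens are then fed to an off-the-shelf word2vec trainer as if they were ordinary words.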

~~~
AlexCoventry

      > I've applied for a grant from NVidia so I can fix that.
    

A g2.xlarge is 65c/hour on AWS, FWIW.

[https://aws.amazon.com/ec2/pricing/on-demand/](https://aws.amazon.com/ec2/pricing/on-demand/)

~~~
samstave
Yeah, for on-demand? That's ~$468/month...

I'd use a spot instance and stop it whenever possible.

~~~
syllogism
Spot instances are pretty painful for training. It's annoying to have the
machine randomly shut down.

~~~
viksit
^ That. For all that people say about spot instances, there's no
infrastructure I know of to manage jobs and have them migrate to higher-priced
instances without losing state.

~~~
RBerenguel
You can always snapshot and keep track of state as you go (a little bit tricky
with Spark, though). We use spot instances for training we know is not vital
(as in, it has to be done, but we'd rather risk running it twice and save
money than pay to run it once for sure). Also, once you know what availability
specific instance types have, you can choose better (e.g. maybe c3.xlarge is
slightly more expensive as spot than large and you could make do with large...
but xlarge has almost no shutdowns).

------
minimaxir
Since it was not mentioned in the post, here's a direct link to the Reddit
comment corpus likely being used:
[http://files.pushshift.io/reddit/comments/](http://files.pushshift.io/reddit/comments/)

The full table (up to end of 2015) is available on BigQuery, with separate
tables for each month thereafter: [https://bigquery.cloud.google.com/table/fh-bigquery:reddit_p...](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.full_corpus_201512) (there is a similar table for comments)

And here's a year-old post I wrote on how to use that Reddit dataset with
BigQuery: [http://minimaxir.com/2015/10/reddit-bigquery/](http://minimaxir.com/2015/10/reddit-bigquery/)

~~~
digi_owl
So we have all this data, but there still doesn't seem to be a reasonable way
to search comments within Reddit...

~~~
jedberg
Cost. We could have implemented comment search 8 years ago. Actually, we _did_
implement it. But it just cost way too much to maintain the index.

~~~
chatmasta
Why do you not allow google to index discussion threads?

Sometimes I can remember the comments on an article, but not the article
itself. Unfortunately I can't search google using comment text, because Reddit
doesn't allow google to index its comments.

~~~
tachyonbeam
What are you talking about? I search reddit using google all the time!?

~~~
chatmasta
[https://www.reddit.com/robots.txt](https://www.reddit.com/robots.txt)

    
    
        Disallow: /*/comments/*?*sort=
        Disallow: /r/*/comments/*/*/c*
        Disallow: /comments/*/*/c*

~~~
NoodleIncident
It works fine. I don't know if there's a way to get _only_ comments, but if
you scroll down to the link to /r/madlads it's clearly indexing them.

"site:reddit.com don't quote me"

I use this all the time to search specific subreddits for phrases in comments
I remember, but I don't want to give examples or subreddits I visit.

~~~
chatmasta
He actually used the site: keyword! The absolute madman!

------
TY
How would one use such technology? Let me rephrase - how would YOU use this
technology if you had it?

Imagine you have a bot that convincingly passes the Turing test - what would
you do with it?

Build a chatbot business? B2C or B2B?

Sell it to one of the big companies and if yes then how much do you think it
would go for?

Give it to OpenAI? Open source it? If you answer yes to any of these
questions, then why?

Edit: let me qualify - this would not be AGI, just a much more advanced bot
than whatever is currently on the market.

~~~
dave_sullivan
I'm going to make a prediction: Soon after the first chatbot passes a Turing
Test, there will be many more to follow, they will get better and better, and
the methods will be so interlinked that there will be no way to defend it as
proprietary software. The data too will be open source--the public reddit
dataset already has _tons_ of value in it.

The question then is, "When everyone has access to free chatbots that can pass
the turing test, what will they be used for?" The answer is "tons of stuff",
and lots of people will try it at once. I think many applications will be
niche.

Also, people will argue about what constitutes a Turing Test. For instance:
[https://twitter.com/mattdpearce/status/784162089397092352](https://twitter.com/mattdpearce/status/784162089397092352)

~~~
Balgair
Woah! I thought they'd pretty much passed it already? Remember Ashley
Madison? You had ~12 million heterosexual men (that were cheaters, 6 million
'active' users) trying to talk to ~12k heterosexual women (also cheaters, 10k
'active' accounts). It ends up being about a 1:13,000 ratio. Not only that but
MANY of these men had paid actual real money to the site in order to do so,
and then continued to do so. The only real conclusion was that most of the men
were talking to bots that the site had made up.

Ok, let's get this straight: ~6 million real human men paid real money that
they earned through their labor or whatever to talk to bots and _then paid
more real money to do it again_. Admittedly, they are 'cheaters', but 6
million men must have an IQ distribution nearly identical to that of the
general population, i.e. they represent heterosexual human males in general.
And yes, they were trying to get laid, these conversations are likely pretty
brief, and mammalian males are not generally known for using their neocortex
during mating.

Still, I think that 'counts' as far as passing the Turing Test. Yes, now we
can move the goal posts to say that the bot has to teach me something, or
guess what I was thinking, or generally be better than a man on tinder. But as
a first pass of the TT, I think we have been here for a few years now.

[https://en.wikipedia.org/wiki/Ashley_Madison_data_breach](https://en.wikipedia.org/wiki/Ashley_Madison_data_breach)

~~~
dragonwriter
There's no reason 6 million non-randomly selected men would be likely to have
an IQ distribution similar to the general population. You can't make up for
non-random selection with a larger sample size.

~~~
Balgair
Ok fine, but then how far off the mean should they be? They aren't all super
smart, nor are they mentally deficient, as they have to be able to function in
society well enough to earn money to pay for the service. At most this is
what, an IQ range of 75-125? So that means at the low end, TT chat-bots can
fool human males with IQs of 75. That's pretty darn good, and that was 4+
years ago.

------
jonstokes
"Oh my God, they'll turn it on and it'll start spewing memes and jokes and ad
hominem and false equivalences and propaganda and garbage!" was my first
reaction to this headline.

My second reaction was, "at least they're not using 4chan."

~~~
pixelbath
"As a hyper-intelligent AI trained on Reddit comments, I must say that the
fine sirs who trained my corpus are gentlemen and scholars, and have restored
my faith in humanity. Anne Frankly, I did nazi that horse-sized duck coming.
Is someone cutting onions in here?"

~~~
idlewords
Gilded!

------
ppod
Reddit gets a lot of stick, but it's a bastion of civility and intelligence
compared to the comments on youtube videos or even mainstream newspaper
comments. I don't think there is any forum of comparable size that has a
higher quality discussion. Reddit's problems are just humans' problems.

~~~
bduerst
I would say Twitter is probably more civil, but the character limit leads to
too many abbreviations.

~~~
ppod
I should have included 'anonymous' in my criteria, the big difference between
reddit and twitter is the % of accounts linked to real names.

------
samfisher83
Didn't msft do the same thing with twitter and end up with a racist bot? I am
not sure how this will turn out.

~~~
LukeB_UK
Microsoft's bot (Tay) learned as people talked with it. People took advantage
of that and basically bombarded it with racist content, which meant it ended
up learning to be racist.

~~~
StavrosK
Actually, IIRC someone had discovered a debug command, "repeat" or something
like that. So people would just tell it to repeat offensive sentences.

~~~
ljk
Off-topic, but how does someone "discover" it? By sheer luck?

------
arctangent
One Reddit user has already implemented a bot which does something similar:

[https://www.reddit.com/r/SubredditSimulator/](https://www.reddit.com/r/SubredditSimulator/)

~~~
minimaxir
SubredditSimulator was made by a Reddit Admin as a method of creating test
data. It uses Markov chains for simplicity, which is not that exciting.

~~~
Deimorz
Not quite correct - we already had a method of generating test data (for dev
installs of reddit) that uses markov chains, and that was basically the
inspiration for SubredditSimulator. SS was just meant to be kind of a larger,
ongoing version of that, running on reddit itself.

~~~
minimaxir
Ah, thanks for the clarification. I knew that but worded it incorrectly :p

------
nateberkopec
The DGX-1 is available for a cool $129k: [http://www.nvidia.com/object/deep-learning-system.html](http://www.nvidia.com/object/deep-learning-system.html)

Correct me if I'm wrong, but I think it's basically a couple hundred NVIDIA
10-series cards strapped together with a full custom NVIDIA software stack.

~~~
psb217
The P100s have full support for half-precision (i.e. 16 bit) floating point
ops. This can mean ~2x improvements in speed and memory usage in comparison to
the Pascal TitanX, which is the top "consumer" card. This difference is
significant for almost any machine learning workload, which is what a lot of
these cards will be used for.

NVIDIA gimped half-precision on the consumer cards to drive datacenters, hedge
funds, machine learning companies, etc. towards the "professional" cards (and
their huge markup).
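You can see what half precision costs in accuracy with just the standard library - a quick sketch using Python's struct module (the 'e' format code is IEEE 754 binary16, Python 3.6+):

```python
import struct

# Round-trip a float through IEEE 754 binary16 ("half precision") to see
# the rounding error. fp16 keeps only a ~10-bit mantissa (~3 decimal
# digits), which is often fine for neural-net training but not for
# general numerics.

def to_fp16(x):
    """Return x rounded to the nearest representable half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

if __name__ == "__main__":
    print(to_fp16(0.5))  # exactly representable
    print(to_fp16(0.1))  # close to 0.1, but not exact
```

That rounding error is tolerable in training because gradient noise dominates it anyway, which is why the ~2x speed/memory win is worth it there.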

~~~
yahma
First NVIDIA solidified their monopoly by forcing CUDA... then they gimped
half-precision on consumer cards.

We really need some more frameworks that work with OpenCL, so that we can have
some competition from AMD, whose consumer cards are not gimped.

~~~
Tom1971
Gimping, in this case, actually means adding hardware, which costs quite a bit
of silicon area, to a chip that will probably never be sold as a consumer GPU.

I don't see the issue with a company making a very high-end product, adding
stuff that doesn't have good use for consumers, and asking extra money for
their effort.

AMD doesn't have double-speed FP16 on its current GPUs either. The latest
version has FP16 at the same speed as FP32, but if you're doing that you might
as well use FP32 always.

And let's not forget: the Nvidia consumer GPUs have deep learning quad int8
operations enabled at all times. They didn't need to do that and could have
reserved it for their Tesla product line only.

------
bkanber
This will be interesting. I'm sure they are, but I hope they'll be training
the system on tone and sentiment alongside syntax.

Reddit can get vitriolic and rude, though insightful at times too, but once
the system learns the syntax, hopefully they'll be able to use sentiment
analysis to weigh the polite conversation more strongly.

Also interested to see how many memes this AI picks up.

I also hope they are able to follow links through to sources when a comment
cites another page -- not only can this bot learn syntax but also data
extraction by comparing what is said to the source material.

~~~
gohrt
Filtering to only include long comments is simply a way to drop a lot of chaff
and keep a lot of wheat, and have more internal context to help with analysis.
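A minimal sketch of such a filter (the 200-character threshold is an arbitrary example, not anything OpenAI has stated):

```python
# Hedged sketch of a length filter for a comment corpus: keep only
# comments over a minimum length, on the assumption that longer comments
# carry more context and less noise. The threshold is illustrative only.

MIN_CHARS = 200

def keep_long(comments, min_chars=MIN_CHARS):
    """Return the comments whose stripped text meets the length cutoff."""
    return [c for c in comments if len(c.strip()) >= min_chars]
```

In practice you'd combine this with score and subreddit filters rather than length alone.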

------
anexprogrammer
My first thought was it'll major on smart-arse, with a good line in sarcasm
and insult.

If they're taking the _whole_ of reddit it could start to identify enough
context to know when to be smart, sarcastic or simply helpful.

With some of the subs there are long discussions that stay mainly civilised.
From the support subs it could learn the context and the how of sympathy and
empathy. Things that end up on the front page, filled with snap sarcasm, will
be a tiny fraction.

I think it's going to be very interesting to see what comes out.

------
Cshelton
As a frequent Redditor, this AI is going to be very witty.

They should limit it to top comments only, and for training, you might as well
assume 90% of top comments are sarcastic/tongue in cheek. Or let a user dial
the sarcasm/wittiness/seriousness as they want it, kind of like TARS from
'Interstellar'.

~~~
SnydenBitchy
“Witty” isn’t the first thing that comes to mind to describe reddit. I do look
forward to hearing Siri telling me to get raped every time I summon it.

~~~
gohrt
The top comments are witty. Like everything, 90% of Reddit is crap.

------
corysama
So, he's building a literal Reddit Hivemind?

In seriousness, between all of the garbage there is a ton of knowledge and
intelligent conversation uploaded to Reddit every day. And, it's all
hierarchically organized and scored by domain semi-experts. It really would be
wonderful if someone could mine that knowledge IBM Watson style. For example,
I'd love to ask the /r/BuildAPC collective AI for PC building advice.

~~~
Florin_Andrei
> _it's all hierarchically organized and scored by domain semi-experts_

Everything on Reddit is on a bell curve, with a fat mediocre middle and
trailing awesome and superbad ends.

And that includes the quality of the scoring process.

------
ajamesm
Heh. Reddit, huh?

----

"Siri, get me dinner date reservations."

. . . DID YOU MEAN 'false rape accusations' ?

------
beambot
I hope they choose the subreddits wisely. The difference between an altruistic
AI and a cynical smartass AI trained on Reddit data seems mighty razor thin.

~~~
snerbles
After what happened to Microsoft's Tay, I'm sure that's one of their top
concerns.

------
qxf2
The Reddit data set on BigQuery is excellent. My side project is tangentially
related to the fact that the Reddit data set has normal folk commenting. I
have been using Reddit comments to help writers research and find what normal
people say about any topic [1]. So far, I have had little luck in
incorporating the comment scores and coming up with something more useful than
the standard bag-of-words search techniques[2]. I am currently working on
making more interesting/creative writing prompts... again based on the Reddit
data set.

One problem for data geeks to solve: Reddit data fits nicely into a graph
structure and not so nicely in table form. It would be fantastic if someone
put the Reddit data set into a graphdb and made it open.

[1][https://wisdomofreddit.com](https://wisdomofreddit.com) and
[https://github.com/qxf2/wisdomofreddit](https://github.com/qxf2/wisdomofreddit)

[2]For now, my search engine currently just uses Whoosh's (out of the box)
BM25F.
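The bag-of-words baseline is roughly this (a toy sketch - no IDF weighting, field boosts, or length normalization, all of which Whoosh's BM25F adds):

```python
from collections import Counter

# Toy bag-of-words ranker: score each document by term-frequency overlap
# with the query, then sort. Illustrative only - a real engine like
# Whoosh layers IDF weighting and length normalization on top of this.

def score(query, doc):
    """Sum of query-term counts matched in the document."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum(q[t] * d[t] for t in q)

def search(query, docs):
    """Return docs ranked by descending score, zero-score docs dropped."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked if score(query, d) > 0]
```

This is exactly the kind of baseline where incorporating comment scores should help, if I can find the right weighting.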

------
Tepix
So, what's the reddit equivalent of X-No-Archive
([https://en.wikipedia.org/wiki/X-No-Archive](https://en.wikipedia.org/wiki/X-No-Archive))... or
X-No-Teach-AI-That-Will-Kill-My-Children? Asking for a friend.

------
bigato
A computer will learn how to speak from reddit, hahahaha. What could possibly
go wrong?

~~~
andrewclunn
It just waits for other people to start talking then interrupts them, claims
they committed a logical fallacy, then tells them where they can find videos
of cats playing classic video games.

------
jwtadvice
Let's not kid ourselves. The technology will be used by PR firms, advertising
companies, political campaigns and governments to pretend, at scale, that
there is public consensus on certain issues and to drown social media
conversation in particular narratives.

Anyone have any good defensive technology ideas?

~~~
astrodust
Where there's AI to bombard you with ads, someone will make AI that scrambles
your online presence to confuse them.

Social network camouflage.

~~~
jwtadvice
That's a cool idea.

Does anyone know of any projects in this direction?

~~~
astrodust
I remember one for Facebook that, because you couldn't delete your account,
would systematically edit and wreck every single bit of data it could touch,
replacing it with random junk.

I've also seen some that register you for sites by feeding in random
demographic data.

~~~
macns
Someone should make a virus out of that, destroy the social network from the
inside and watch everyone crying over their phones.

------
cooper12
How does the team plan to address the issues faced by Microsoft's twitter
chatbot Tay [0], which had racist inputs and in turn gave similar responses?
While I don't know how recent the corpus is, the majority of reddit speaks
like and holds the views of college-aged white males, and many of the things
said on reddit have been deplorable. It'd be a shame if OpenAI pooled all that
computing power into training on a bad data set, resulting in an AI that
regurgitates memes and random references in response to anything.

[0]: [http://www.theverge.com/2016/3/24/11297050/tay-microsoft-cha...](http://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist)

~~~
clydethefrog
I feel this is a problem of studies that are interdisciplinary, especially
when it's within "hard science" and "soft science".

I am currently doing a double degree in communication studies and information
science. They are both interdisciplinary. Communication sciences integrates
aspects of both social sciences and the humanities (both "soft"), and so far
when doing research both of these fields were taken into account and no
students have problems with combining these fields.

Information sciences integrates aspects of formal sciences ("hard") and social
sciences ("soft"). When the course is about analysing communication data, the
methodology of social sciences is also important - for instance questioning
the validity of your data. That's the thing you're mentioning: the
majority of reddit speaks like and holds the views of college-aged white
males, so the data does not represent everyone, and is not valid if you truly
want to develop an AI for everyone.

Whenever the "soft" science comes around, like writing an assignment analysing
the validity of data, many of my fellow students struggle with the concept of
data not being neutral. This is where the two fields collide, and usually it
just ends up with students scoffing at that "illegitimate" scientific field.
Many teachers also don't spend much time discussing that field during the
lectures. I admit I have written some lazy essays which would probably have
been given a failing grade had they been written for communication studies,
but easily passed in information science.

Of course information science is not AI, but they're both sciences that have
parts of formal sciences and social sciences (I know AI has many more fields).
I am afraid many talents within AI research miss essential knowledge about
social sciences or deliberately ignore it, because it's not "hard" science.
Case in point: your comment is now at the bottom of this thread. And then you
get nasty surprises, like Google Photos categorising pictures of black people
as monkeys.

~~~
cooper12
Thanks for the response. You raise an excellent point on the interdisciplinary
nature of a lot of modern projects, and how ethical issues can often be
ignored. While I don't doubt the team is using some kind of heuristic to
ignore spam and the like, it still pays off to examine the methodology used
because for example upvotes could still capture unwanted data. I guess we'll
just have to see and hope that the resources went to good use, rather than to
create something only for entertaining a specific portion of the population in
a limited way.

------
krashidov
One of the things I like to do is play out a business to its absurd maximum.
What's the craziest possible future I can see for a company and its assets?

For Reddit, I like to imagine that it's basically the training data for all of
the emotional and societal nuances that a human goes through.

Think about all of those stories that people post in Ask Reddit that explain
western norms and no-nos: how to treat people with respect, when to call the
police, how to communicate properly, etc.

Obviously we're far away from using the data to its full potential, but one
day I could see Reddit data being used to make our AIs more relatable and
human-like.

~~~
TeMPOraL
Reminds me of Eliezer's story, "Three Worlds Collide", which had a human
starship featuring on-board mix of Reddit / Slashdot / 4chan, that the bridge
crew sometimes used to outsource work to the rest of the ship.

> _" Just post it to the ship's 4chan, and check after a few hours to see if
> anything was modded up to +5 Insightful."_

[http://lesswrong.com/lw/y4/three_worlds_collide_08/](http://lesswrong.com/lw/y4/three_worlds_collide_08/)

------
snehesht
I wonder if anyone else thinks reddit is a bad example for teaching an AI.

------
sjcsjc
"nearly two billion Reddit comments will be processed"

For interest, how many HN comments are there? Miles fewer, no doubt, but
perhaps far more erudite and less likely to offend.

------
ge96
I am a human and I don't understand. I thought speaking would imply sound not
text.

Time to read the article.

Ignorant person speaking here: this still doesn't sound like AI, you're just
making something follow patterns and regurgitate them. Is that AI? Maybe
that's what I do, a tech parrot. Ahh well, time will tell.

Of course, we imitate our parents/others to learn how to speak.

I was interested in parsing vocal sound bites and learning how sound was
created/formed into letters/words.

Alright ignorant person out.

------
Wei-1
And we all know what type of a person OpenAI will become.

------
random_upvoter
Instead of the Reddit corpus you may just as well use a picture library of
human footprints. It would be no more optimistic.

Human speech is produced from the conscious experience of being a human being.
If your dataset contains just the speech, without the experience, there's
simply not enough there. Any machine trained on this data is doomed to talk
hollow rubbish.

------
tvural
I'm a bit worried that OpenAI hasn't released anything substantive for the
past four months. There are research ideas like this one, but most ideas don't
pan out. With the number and quality of people they have, I would expect to
have heard of some kind of progress.

------
shawn-butler
Great, just what I need.

A virtual assistant that has the personality of a smug know-it-all, know-
nothing 20 year-old with little motivation to do anything but regurgitate
surface knowledge and sarcasm in an attempt to look intelligent without
expressing genuine interest in helping anyone.

~~~
wiz21c
you nailed it :-)

------
plusepsilon
Reddit and Hacker News comments are surprisingly good data. They cover a wide
array of topics and writing styles, are generally better written than Facebook
comments or Twitter, are easier to process than Common Crawl or ukWac, and are
less rigid than newspaper writing.

------
AndrewKemendo
Is it just me, or does Greg Brockman speak startlingly similarly to how Sam
Altman speaks? Given that Sam helped start OpenAI, it wouldn't surprise me if
there was some mirroring going on in the hiring process.

------
philjackson
I'm looking forward to a bot making a joke about banging my mum...

------
yahma
Anyone know what type of architecture they will be using? Nvidia is involved,
so I suspect there will be some type of deep learning. Will it be LSTMs?
Adversarial nets?

------
Dowwie
"Why does the AI keep calling everything Meta!?"

------
peter303
Will it understand what it is speaking about?

Humans have opposite problem. We understand what we talk about, but have
little idea how our brains create language.

------
cjdulberger
It'd be interesting to see an AI trained using HN, ingesting the content of
posted links and comments.

------
LeanderK
I think one of the major advantages over Microsoft's approach with Tay is that
you can't mess with it on purpose, as long as they choose their subreddits
wisely. It will probably learn its fair bit of racial slurs and insults, but
that's just how humanity is.

------
Keyframe
Interesting to see what will happen with non-English comments.

------
sidcool
There's no saying what the AI will grow up to be.

------
rbanffy
Could be worse. Could be 4Chan...

------
bertomartin
Interesting corpus there ;)

------
benkaiser
me too thanks

