
How people talk about marijuana on Reddit: a natural language analysis - sararob
https://medium.com/@srobtweets/how-people-talk-about-marijuana-on-reddit-a-natural-language-analysis-a8d595882a7a
======
nonsince
Decent intro to natural language processing, but scientifically rubbish. I
think using "cannabis" and "marijuana" skews heavily towards advocacy and
serious discussion, since general chat about weed will use words such as,
well, weed. Or pot, or hash, or herbs, or the maple leaf emoji, or a link to
/r/trees. The problem with natural language processing on drugs is that the
names people use for drugs are specifically chosen to be easily-confused with
another, innocuous usage. That's the entire point of street names - to hide
the fact that you're talking about drugs. I think you would need some kind of
AI leagues ahead of our technology to accurately analyse people colloquially
chatting about any drug, let alone one as popular and ubiquitous as weed.
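
To make the keyword problem concrete, here is a minimal Python sketch. The term lists and sample comments are invented for illustration and are not taken from the article; it only shows how formal terms miss casual chat while slang terms pull in unrelated text.

```python
import re

# Hypothetical term lists, made up for illustration only.
FORMAL_TERMS = ["cannabis", "marijuana"]
SLANG_TERMS = ["weed", "pot", "hash", "herb"]


def build_pattern(terms):
    """Whole-word, case-insensitive pattern with an optional plural 's'."""
    alternation = "|".join(re.escape(t) for t in terms)
    return re.compile(r"\b(?:" + alternation + r")s?\b", re.IGNORECASE)


def matching_comments(comments, terms):
    """Return the comments that mention at least one of the terms."""
    pattern = build_pattern(terms)
    return [c for c in comments if pattern.search(c)]


# Invented sample comments, not real Reddit data.
SAMPLE = [
    "Cannabis legalization is on the ballot this year.",
    "Anyone know where to get good weed in Denver?",
    "Put the pot on the stove and add fresh herbs.",
    "I need to hash the password before storing it.",
]

formal_hits = matching_comments(SAMPLE, FORMAL_TERMS)
slang_hits = matching_comments(SAMPLE, SLANG_TERMS)
```

On this toy sample the formal terms catch only the policy comment, while the slang terms catch the one casual comment plus a cooking comment and a programming comment. Plain keyword matching cannot tell those apart.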

~~~
Yen
Even more egregious than that:

> For quality control, I looked only at comments with Reddit score > 100

That's a non-trivial popularity score. Also, since it's an absolute score, it
will bias against smaller subreddits, where 100 points on _any_ comment is a
difficult task.

This is much less "how people talk on reddit", and much more "the type of
comment that gets upvotes on the default subreddits"
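
A rough Python sketch of the difference between an absolute cutoff and a per-subreddit relative one. The field names `subreddit` and `score`, the helper names, and all the numbers are assumptions for illustration, not the article's actual query or data.

```python
from collections import defaultdict


def absolute_filter(comments, min_score=100):
    """The article's style of filter: one fixed threshold for every subreddit."""
    return [c for c in comments if c["score"] > min_score]


def top_fraction_per_subreddit(comments, fraction=0.1):
    """Alternative: keep the highest-scoring fraction within each subreddit."""
    by_sub = defaultdict(list)
    for c in comments:
        by_sub[c["subreddit"]].append(c)

    kept = []
    for items in by_sub.values():
        ranked = sorted(items, key=lambda c: c["score"], reverse=True)
        # Always keep at least one comment per subreddit.
        n_keep = max(1, int(len(ranked) * fraction))
        kept.extend(ranked[:n_keep])
    return kept


# Invented scores: one large subreddit and one small one.
COMMENTS = (
    [{"subreddit": "bigsub", "score": s}
     for s in (5000, 900, 450, 120, 80, 30, 12, 8, 3, 1)]
    + [{"subreddit": "smallsub", "score": s}
       for s in (40, 22, 9, 7, 5, 4, 3, 2, 1, 1)]
)

abs_subs = {c["subreddit"] for c in absolute_filter(COMMENTS)}
rel_subs = {c["subreddit"]
            for c in top_fraction_per_subreddit(COMMENTS, fraction=0.2)}
```

With these made-up numbers the absolute filter keeps only comments from the large subreddit, while the relative filter keeps the top comments from both. It still selects for popularity, just not for subreddit size.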

~~~
PascLeRasc
Yikes, that sounds like a great way to bias your data away from controversial
opinions about weed. That would be like taking an exit poll of only people
wearing lots of political apparel.

------
minimaxir
A quick note about using natural language/sentiment APIs: trained machine
learning models must be used apples-to-apples on similar datasets; for
example, you can’t accurately perform Twitter sentiment analysis on a dataset
using a model trained on professional movie reviews since Tweets do not follow
AP Style guidelines. (e.g. for some reason, training Python's NLTK on the IMDb
movie review dataset to predict the sentiment of Donald Trump's tweets is an
oddly common Hello World, even though the results are misleading and may cause
confirmation bias)

Reddit comments are _very_ idiosyncratic, and in this particular case, _even
more so_ than usual. As a result, I am skeptical of trusting the output of such
APIs as gospel, even one trained on massive datasets. (however, training a
model on a _Reddit-only_ dataset might be interesting, and is an idea I have
in the pipeline.)

Last year, spaCy trained a model, sense2vec, on the Reddit dataset and got
interesting results:
[https://explosion.ai/blog/sense2vec-with-spacy](https://explosion.ai/blog/sense2vec-with-spacy)

~~~
ppod
This is a complicated problem and is, I think, best thought of as a type of
overfitting rather than a complete mistake. The dependent or output
variable, sentiment, does have an obvious generalisation from movies to
politicians, unlike, for example, cinematography quality or trustworthiness.
You are also overfitting when you test movie sentiment in the 2010s with
a model trained on reviews from the 90s, as the concept of sentiment might
have shifted if you look at it in that much detail.

(I don't disagree with anything you wrote, just expanding.)

------
kennywinker
Not directly related, but the podcast On The Media recently did a great
episode on the origins of the war on drugs. One thing I didn't know? The word
"marijuana" was actively popularized during the early days of the war on drugs
to make the plant feel like a foreign import, despite the fact that it grew
wild throughout the states.

[http://www.wnyc.org/story/on-the-media-2017-04-14/](http://www.wnyc.org/story/on-the-media-2017-04-14/)

------
SmellTheGlove
This seems more like a quick intro to some Google BigQuery and NLP
capabilities using a keyword that will attract readers. Not a bad thing, but
anyone expecting analysis of the topic in the headline should know it's really
not about that.

However, it worked on me, I'll probably give these tools a spin in the near
future.

~~~
rcarrigan87
I was hoping to see a little more analysis as well. We did a study[1] about
people moving for marijuana and we were surprised to find people were fairly
open about discussing the topic. But I'd be very curious to see more about how
conversations online are forming around marijuana.

[1] [https://www.movebuddha.com/blog/moving-for-marijuana/](https://www.movebuddha.com/blog/moving-for-marijuana/)

------
inuhj
'Marijuana' is considered a racist term in the industry. The preferred word is
'cannabis'.

~~~
metaphorm
upvoted for truth. maybe people downvoting aren't familiar with the history.

so here's some history. the name "marijuana" was pushed by Harry Anslinger[0]
as a way to trigger racial anxiety amongst conservative whites who held
negative views of Mexicans. the other names for the plant are
"hemp" (a non-psychoactive strain used as an industrial fiber crop) and
"cannabis" (Latin name for the genus of the plant).

[0] -
[https://en.wikipedia.org/wiki/Harry_J._Anslinger](https://en.wikipedia.org/wiki/Harry_J._Anslinger)

~~~
logicallee
It has a Mexican name literally IN it. A conservative, white term would be
Marijosepha.

~~~
maxerickson
_It has a Mexican name literally IN it. A conservative, white term would be
Marijosepha._

logicallee, I think you've misunderstood the point made in the gp post. Harry
Anslinger wanted it to sound strange and foreign to white conservatives, not
familiar.

~~~
logicallee
I mean that it has Juan in the name (which I contrast with Joseph). So Mari
_JUAN_ a. Couldn't sound more Mexican if they tried.

------
Muuuchem
Funny how marijuana was most mentioned with Donald Trump. I know the /r/trees
subreddit thought Trump would be good for recreational marijuana but it doesn't
look like that is true. The Trump subreddit (The_Donald) was also extremely
pro marijuana and often boasted that Trump was better for legal marijuana than
Clinton. Then enter Jefferson Beauregard Sessions III.

~~~
cavanasm
I don't really understand why people would think Trump would be supportive of
any kind of intoxicant. His older brother was an alcoholic and died before
their dad (which, depending on the source you're reading, had a big impact on
him).

~~~
code_duck
Many believed Trump to be a libertarian or even a closet Democrat, who
supported saner policy for the practical reasons that many do, setting aside
their personal beliefs about whether individuals should consume drugs.

~~~
ktRolster
More than a closet Democrat, he was an active Democrat who voted Democrat and
donated to the party. Overall philosophically he's probably more 'opportunist'
than anything.

~~~
code_duck
Sure, in the past. If he's a Democrat now or in the past 8 years, it's
definitely a well-kept secret.

------
rwmj
Is filtering by score > 100 a good idea? At least you might want to
counterbalance that with negative-scoring comments, since people downvote to
disagree, and the Reddit audience may be more likely to disagree with
anyone thinking pot should remain illegal. In fact, why filter on score at
all?

------
cool_shit
Sometimes data is not beautiful, but very ugly. These results are based on a
flawed premise. One red flag -- where is the word "dank" in your list? Where
are words used by people who _actually smoke weed_? Also, is "where score >
100" a good heuristic for this kind of study? I would argue that "where score
< 100" is a better heuristic.

For example, a shill or superuser (people getting top comment) will not be
using domain specific language -- they will be using language that caters to a
general audience. If this is true, you would end up squeezing most of the
interesting language out of your study. Have you been to Grass City forums? I
am guessing these people surely aren't using terms like "Donald Trump" in
their everyday conversations about weed.

Reddit is a huge melting pot and probably isn't a _good_ place for insight
about potheads. Grass City might not be either -- Grass City users are not
typical potheads. The best place would be 10th grade high school social
circles and college dorms. It really is amazing how _little_ data is produced
by social networks, in the grand scheme of things. We are all so used to
hearing about how _much_ data is produced by the internet. There are orders
of magnitude more data in the raw world just waiting to be scooped up.

