How people talk about marijuana on Reddit: a natural language analysis (medium.com)
117 points by sararob on Apr 20, 2017 | 40 comments

Decent intro to natural language processing, but scientifically rubbish. I think using "cannabis" and "marijuana" skews heavily towards advocacy and serious discussion, since general chat about weed will use words such as, well, weed. Or pot, or hash, or herbs, or the maple leaf emoji, or a link to /r/trees. The problem with natural language processing on drugs is that the names people use for drugs are specifically chosen to be easily confused with another, innocuous usage. That's the entire point of street names - to hide the fact that you're talking about drugs. I think you would need some kind of AI leagues ahead of our technology to accurately analyse people colloquially chatting about any drug, let alone one as popular and ubiquitous as weed.

Even more egregious than that:

> For quality control, I looked only at comments with Reddit score > 100

That's a non-trivial popularity score. Also, since it's an absolute score, it will bias against smaller subreddits, where 100 points on any comment is a difficult task.

This is much less "how people talk on reddit", and much more "the type of comment that gets upvotes on the default subreddits"
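To make the absolute-threshold bias concrete, here is a minimal sketch that compares the article's "score > 100" filter with a per-subreddit relative cutoff. The subreddit names, the scores, and the 25% cutoff are all invented for illustration; this is not the article's query.

```python
from collections import defaultdict

# Hypothetical toy data: (subreddit, comment score). Not real Reddit figures.
comments = [
    ("AskReddit", 2500), ("AskReddit", 340), ("AskReddit", 120), ("AskReddit", 15),
    ("smallsub", 40), ("smallsub", 12), ("smallsub", 6), ("smallsub", 1),
]

def absolute_filter(rows, threshold=100):
    """Keep comments whose score exceeds a fixed threshold."""
    return [(sub, score) for sub, score in rows if score > threshold]

def relative_filter(rows, top_fraction=0.25):
    """Keep the top fraction of comments within each subreddit (at least one)."""
    by_sub = defaultdict(list)
    for sub, score in rows:
        by_sub[sub].append(score)
    kept = []
    for sub, scores in by_sub.items():
        scores.sort(reverse=True)
        n_keep = max(1, int(len(scores) * top_fraction))
        kept.extend((sub, s) for s in scores[:n_keep])
    return kept

absolute_kept = absolute_filter(comments)
relative_kept = relative_filter(comments)

print("absolute:", absolute_kept)  # only the large subreddit survives
print("relative:", relative_kept)  # each subreddit contributes its top comments
```

With the fixed threshold the small subreddit disappears entirely, while the relative cutoff keeps its best comment.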

Yikes, that sounds like a great way to bias your data away from controversial opinions about weed. That would be like taking an exit poll of only people wearing lots of political apparel.

Using an innocuous encoding of a word is a form of encryption. People who expect to be under surveillance agree on a set of code words to denote illegal things. Though hard, there are multiple ways to semi-automatically break such a linguistic encryption.

Imputation. [1] Remove a word from a sentence then try to predict it from its surrounding context. "when I get home tonight, i vape a ___ then space out". Assign predicted probabilities to imputed word ":leaf emoji:" ["marijuana cigarette", "electronic cigarette", "cigar"].

Active learning. Seed the algorithm with expert knowledge from law enforcement, drug users, and social workers, who know of the encryption keys.

Anomaly detection. Though perhaps easily-confused with other, innocuous usage, street slang is a distinct form of language with its own properties and patterns. Compared to common discourse, it is strange and random. This pattern could be measured.
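The imputation idea above can be sketched with plain context counts. This is a toy stand-in, not a trained language model: the corpus is invented, and a real system would predict the missing word with a model trained on a large dataset.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "i vape a joint then space out",
    "i vape a joint then fall asleep",
    "i smoke a joint then space out",
    "i vape a cigar then cough",
    "i eat a sandwich then nap",
]

def impute(left_context, right_word, sentences):
    """Estimate P(word | left context, next word) from raw counts."""
    k = len(left_context)
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(k, len(tokens) - 1):
            if tuple(tokens[i - k:i]) == left_context and tokens[i + 1] == right_word:
                counts[tokens[i]] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

# "i vape a ___ then space out": which words fill the blank in this corpus?
probs = impute(("vape", "a"), "then", corpus)
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {p:.2f}")
```

An unfamiliar code word that shows up in the same slots as known drug terms would get flagged as a likely synonym.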

Doing this rigorously, like building search engines for illegal drugs or human trafficking on the deep web, requires a lot of expert knowledge. [2] Maybe future deep learning can do this end-to-end on arbitrary domains? [3] Let's see.

[1] https://arxiv.org/abs/1312.3005 "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling"

[2] http://www.darpa.mil/program/memex

[3] https://universe.openai.com/envs#world_of_bits

and then there is the next level:

intentional use of slang terms during serious discourse in order to subtly delegitimize your opponent's arguments. There is currently an Atlanta City Council member who, when responding to a question about, say, medical marijuana, will always change the noun to "pot" or "weed" or even "weed.. or.. pot" with an enunciation implying the concept of medical marijuana is a joke.

Who would agree to legalize the "devil's tobacco"? It's clearly an evil plant! Think of the "reefer madness"!

It's the devil's lettuce, not tobacco. :D

Satan's salad?

What may be interesting is to use cosine similarity between the embeddings of these words to see if synonyms can be accurately identified.
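A minimal sketch of that idea, assuming you already have word vectors. The 3-d embeddings below are made up by hand purely for illustration; real ones would come from a model trained on the comment corpus.

```python
import math

# Hypothetical 3-d embeddings, invented for illustration; not from any real model.
embeddings = {
    "marijuana": [0.9, 0.1, 0.2],
    "weed": [0.8, 0.2, 0.3],
    "tree": [0.2, 0.8, 0.3],
    "maple": [0.1, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, k=1):
    """Return the k words whose vectors are most similar to `word`."""
    others = [w for w in embeddings if w != word]
    others.sort(key=lambda w: cosine(embeddings[word], embeddings[w]), reverse=True)
    return others[:k]

print(nearest("marijuana"))
print(nearest("maple"))
```

Whether slang terms actually cluster this cleanly depends on the training data, which is exactly the open question.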

A while ago, spaCy set up a demo doing just that on the Reddit dataset:



It gets a little more fuzzy when you consider that /r/marijuanaenthusiasts is a subreddit for the discussion of trees (and happens to be subreddit of the day for 4/20).

Having studied culture for more than a decade, this makes me feel some validation that even in 2017, computers have yet to unravel the intricacies of culture.

In a very limited sense they have since the ads you end up seeing are targeted towards your culture based on your browsing history.

A quick note about using natural language/sentiment APIs: trained machine learning models must be used apples-to-apples on similar datasets; for example, you can’t accurately perform Twitter sentiment analysis using a model trained on professional movie reviews, since Tweets do not follow AP Style guidelines. (For some reason, training Python's NLTK on the IMDb movie review dataset to predict the sentiment of Donald Trump's tweets is an oddly common Hello World, even though the results are misleading and may cause confirmation bias.)

Reddit comments are very idiosyncratic, and in this particular case, even more so than usual. As a result, I am skeptical of trusting the output of such APIs as gospel, even one trained on massive datasets. (However, training a model on a Reddit-only dataset might be interesting, and is an idea I have in the pipeline.)
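One crude way to see the apples-to-oranges problem before trusting a model is to measure how much of the target vocabulary the training data never saw. This is only a sketch with invented example sentences; out-of-vocabulary rate is a rough proxy for domain mismatch, not a full diagnostic.

```python
def oov_rate(train_texts, target_texts):
    """Fraction of target tokens that never appear in the training texts."""
    vocab = {tok for text in train_texts for tok in text.lower().split()}
    target_tokens = [tok for text in target_texts for tok in text.lower().split()]
    if not target_tokens:
        return 0.0
    unseen = [tok for tok in target_tokens if tok not in vocab]
    return len(unseen) / len(target_tokens)

# Hypothetical snippets standing in for the two domains.
reviews = [
    "the film was a moving and beautiful story",
    "the acting was dull and the plot was weak",
]
reddit = [
    "lol that dank meme was lit",
    "the plot was weak tbh",
]

rate = oov_rate(reviews, reddit)
print(f"out-of-vocabulary rate on target domain: {rate:.2f}")
```

A high rate suggests the model is being asked about language it has little evidence for.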

Last year, spaCy trained a model, sense2vec, on the Reddit dataset and got interesting results: https://explosion.ai/blog/sense2vec-with-spacy

This is a complicated problem, and I think it is best thought of as a type of overfitting rather than a complete mistake. The dependent (output) variable, sentiment, does have an obvious generalisation from movies to politicians, unlike, for example, cinematography quality or trustworthiness. You are also overfitting when you test movie sentiment in the 2010s with a model trained on reviews from the 90s, as the concept of sentiment might have shifted if you look at it in that much detail.

(I don't disagree with anything you wrote, just expanding.)

the good news is that as long as the training data is known to be accurate (basically human-prepared), you can use a relatively tiny amount of it for very good results on huge datasets.

Not directly related, but the podcast On The Media recently did a great episode on the origins of the war on drugs. One thing I didn't know? The word "marijuana" was actively popularized during the early days of the war on drugs to make the plant feel like a foreign import, despite the fact that it grew wild throughout the states.


This seems more like a quick intro to some Google BigQuery and NLP capabilities using a keyword that will attract readers. Not a bad thing, but anyone expecting analysis of the topic in the headline should know it's really not about that.

However, it worked on me, I'll probably give these tools a spin in the near future.

I was hoping to see a little more analysis as well. We did a study[1] about people moving for marijuana and we were surprised to find people were fairly open about discussing the topic. But I'd be very curious to see more about how conversations online are forming around marijuana.


'Marijuana' is considered a racist term in the industry. The preferred word is 'cannabis'.

upvoted for truth. maybe people downvoting aren't familiar with the history.

so here's some history. the name "marijuana" was pushed by Harry Anslinger[0] as a way to trigger racial anxiety amongst conservative whites who held negative views toward Mexicans. the other names for the plant are "hemp" (a non-psychoactive strain used as an industrial fiber crop) and "cannabis" (the Latin name for the genus of the plant).

[0] - https://en.wikipedia.org/wiki/Harry_J._Anslinger

I didn't downvote, but some research shows that this is a controversial truth, even among those in the industry. The history is not controversial but the treatment of marijuana as a racist term is, from what I can see.

It has a Mexican name literally IN it. A conservative, white term would be Marijosepha.

> It has a Mexican name literally IN it. A conservative, white term would be Marijosepha.

logicallee, I think you've misunderstood the point made in the gp post. Harry Anslinger wanted it to sound strange and foreign to white conservatives, not familiar.

I mean that it has Juan in the name (which I contrast with Joseph). So MariJUANa. Couldn't sound more Mexican if they tried.

Do you have a source for the claim about downvoters' opinions?

Ganja is the drug harvested from the cannabis plant.

Funny how marijuana was most mentioned with Donald Trump. I know the /r/trees subreddit thought Trump would be good for recreational marijuana, but it doesn't look like that is true. The Trump subreddit (The_Donald) was also extremely pro-marijuana and often boasted that Trump was better for legal marijuana than Clinton. Then enter Jefferson Beauregard Sessions III.

I don't really understand why people would think Trump would be supportive of any kind of intoxicant. His older brother was an alcoholic and died before their dad (which, depending on the source you're reading, had a big impact on him).

Well, he said a bunch of pro-weed stuff at one point, so it's not coming from nowhere:


Many of the biggest benefits of marijuana legalisation come from the fact that it tends to displace alcohol.

Many believed Trump to be a libertarian or even a closet Democrat, who supported saner policy for the practical reasons that many do, setting aside their personal beliefs about whether individuals should consume drugs.

More than a closet Democrat, he was an active Democrat who voted Democrat and donated to the party. Overall philosophically he's probably more 'opportunist' than anything.

Sure, in the past. If he's a Democrat now or in the past 8 years, it's definitely a well-kept secret.

Trump's media prominence means Trump is mentioned a lot in any topic.

And I'd note this is not a political point; it's a base-rate comment [1]. Political comments will tend to have other political topics arise in them, and it would require more detailed analysis to know if there is any true signal here. (Not necessarily a lot more detailed... just more than an eyeball glance.)

[1]: https://en.wikipedia.org/wiki/Base_rate
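The base-rate check can be sketched as a lift calculation: compare how often a term appears in marijuana comments with how often it appears in comments overall. All counts below are invented for illustration and are not taken from the article's data.

```python
def lift(n_both, n_topic, n_term, n_total):
    """P(term | topic) divided by P(term); values near 1 mean no real association."""
    p_term_given_topic = n_both / n_topic
    p_term = n_term / n_total
    return p_term_given_topic / p_term

# Hypothetical counts: 100,000 comments, 20,000 mention Trump,
# 1,000 mention marijuana, and 210 mention both.
trump_lift = lift(n_both=210, n_topic=1000, n_term=20000, n_total=100000)
print(f"lift: {trump_lift:.2f}")
```

In this made-up case Trump would top a raw co-occurrence list while being barely more common in marijuana comments than anywhere else.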

> Then enter Jefferson Beauregard Sessions III

People haven't been this obsessed with a politician's middle name since Barack hit the scene almost a decade ago. I'm really glad you felt the need to add it, since I don't think I would've felt the full dramatic effect of your comment otherwise.

I love the 3 named presidents.

Franklin Delano Roosevelt, Lyndon Baines Johnson, Warren Gamaliel Harding.

Trump can't legalize marijuana. Only Congress can do that. If Sessions is enforcing the law, well, that's his job.

Is filtering by score > 100 a good idea? At least you might want to counterbalance that with negative-scoring comments, since people downvote to disagree, and it may be the Reddit audience are more likely to disagree with anyone thinking pot should remain illegal. In fact, why filter on score at all?

Sometimes data is not beautiful, but very ugly. These results are based on a flawed premise. One red flag -- where is the word "dank" in your list? Where are words used by people who actually smoke weed? Also, is "where score > 100" a good heuristic for this kind of study? I would argue that "where score < 100" is a better heuristic.

For example, a shill or superuser (people getting top comment) will not be using domain specific language -- they will be using language that caters to a general audience. If this is true, you would end up squeezing most of the interesting language out of your study. Have you been to Grass City forums? I am guessing these people surely aren't using terms like "Donald Trump" in their everyday conversations about weed.

Reddit is a huge melting pot and probably isn't a good place for insight about potheads. Grass City might not be either -- Grass City users are not typical potheads. The best place would be 10th grade high school social circles and college dorms. It really is amazing how little data is produced by social networks, in the grand scheme of things. We are all so used to hearing about how much data is produced by the internet. There is orders of magnitude more data in the raw world just waiting to be scooped up.
