
Show HN: Subreddit Finder - Trained on 4M Reddit Posts from 4K Subreddits - Arimbr
https://valohai.com/subreddit-finder
======
H8crilA
"What is the penalty for living" ->
[http://reddit.com/r/Poland](http://reddit.com/r/Poland), 28%

"When should I kill my chicken" ->
[http://reddit.com/r/csgo](http://reddit.com/r/csgo), 19%

"Am I conscious" -> [http://reddit.com/r/INTP](http://reddit.com/r/INTP), 25%

"How to not think" ->
[http://www.reddit.com/r/howtonotgiveafuck/](http://www.reddit.com/r/howtonotgiveafuck/),
49%

"Is the government evil" ->
[http://www.reddit.com/r/ENLIGHTENEDCENTRISM/](http://www.reddit.com/r/ENLIGHTENEDCENTRISM/),
19%

"Is the government good" ->
[http://www.reddit.com/r/CoronavirusUK](http://www.reddit.com/r/CoronavirusUK),
10%

"Is the government useful" ->
[http://www.reddit.com/r/iran](http://www.reddit.com/r/iran), 31%

~~~
teslademigod1
Poland for the win :D

------
localcrisis
I tried "find hot local singles in your area" and the top result was /r/vinyls

actually very impressed

~~~
Arimbr
lol

------
jrumbut
A very cool demo and I congratulate the author, but I am always a little sad
for more data science type demos that try to answer the question (that is
proving toxic) "given what I know about you, how can I find a community of
people just like you?"

I would love to see a subreddit finder that answers questions like "what
community would complement your interests?" or "what community needs to hear
what you have to say?" or "what community would be made better by your
presence?". Similarity is at best a proxy for it.

Those are harder but, I think, more useful.

~~~
rhizome
They aren't plug 'n play for advertising, though. "How can I find a community
for you," is the charitable flipside to, "here's a community you might like to
be a part of," where the community is "Coors Lite purchasers."

~~~
cwillu
That's so short sighted though; I'd be so much more likely to engage with an
advert that figured out a new thing that would interest me than the usual
"you've been reading about sc2 for a week, so have more of the same" nonsense.

~~~
rhizome
Yeah nobody knows how to do that. To the degree that anybody's figured any
amount of it out, it's much more likely and lucrative to point that code at
changing your vote than your brand of toilet paper ( _wink wink_ ).

------
Nextgrid
I tried it with "best time tracking app for iOS?" and "I'm looking for a time
tracking app. Any recommendations?"

I expected the iPhone or iOS subreddit to be suggested, but it suggested
GearVR | 13.0%, ringdoorbell | 9.0%, canadacordcutters | 5.0%, TTVreborn |
5.0%, AusSkincare | 4.0%, sideloaded | 4.0%, FlutterDev | 2.0%, shopify |
2.0%, weightwatchers | 2.0%, crossfit | 2.0%.

Congrats on the attempt but it does still need some work.

~~~
burnte
I'm not really sure your query is what it was built for. That's more of a
google search than a community idea.

~~~
folkhack
Not sure if you're familiar with reddit, but question posts like that are incredibly common, especially for technology topics. It's still a forum when you get down to it, so lots of people like myself post questions on niche topics, because you can find small communities of experts on everything from dogs to obscure vintage computers.

Just saying, as a heavy reddit user - I'd expect the same as OP; those seem like reasonable searches to get those results.

~~~
burnte
I'm a long time redditor, and that's why I said what I did. I felt the tool was more for finding your niche community than for finding an answer to a question. Different interpretations, I guess.

------
gnicholas
The intercom chat widget makes the tab title switch back and forth between
"Subreddit Finder" and "Valohai says". There does not appear to be a way to
dismiss the chat widget, so it just keeps flipping back and forth, which is
visually annoying.

I keep many tabs open, but I am going to close this one immediately because I
don't want to have something flashing at me out of the corner of my eye all
day.

------
Der_Einzige
One place to improve this would be to use a better set of word-embeddings.
FastText is, well, fast, but it's no longer close to SOTA.

You're most likely using simple average pooling, which is why many users are getting results that don't look right to them. Try a chunking approach, where you get a vector for each chunk of the document and horizontally concatenate those together (if your vectors are 50d and you use 5 chunks per doc, then you get a 250d fixed vector for each document regardless of length). This partially solves the issue of highly diluted vectors, which is responsible for the poor results that some users are reporting. You can also do "attentive pooling", where you pool the way a transformer head would pool - though that's an O(N^2) operation, so YMMV.
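
The chunking approach can be sketched in a few lines. This is a toy illustration with made-up dimensions (2-d word vectors, 5 chunks), not the demo's actual code:

```python
# Toy sketch of chunked pooling: split a document's word vectors into
# k chunks, average-pool each chunk, and concatenate the chunk means
# into one fixed-size vector (here 5 chunks x 2d -> a 10-d vector).

def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def chunked_pool(word_vectors, k=5):
    """Return a fixed k*dim vector for a document of any length."""
    dim = len(word_vectors[0])
    chunk_size = max(1, -(-len(word_vectors) // k))  # ceiling division
    pooled = []
    for start in range(0, k * chunk_size, chunk_size):
        chunk = word_vectors[start:start + chunk_size]
        # Pad short documents with zero vectors for the empty chunks.
        pooled.extend(mean_pool(chunk) if chunk else [0.0] * dim)
    return pooled

# A 7-word document with 2-d word vectors becomes a 10-d vector.
doc = [[float(i), float(-i)] for i in range(7)]
vec = chunked_pool(doc, k=5)
```

Compared to averaging the whole document into one vector, this keeps some coarse positional information, at the cost of a k-times-larger representation.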

If you have the GPU compute, try something like BERT, or GPT-2, whose training data was sourced from links shared on reddit. Better yet, if you have the compute, try vertically concatenating all of the word-embedding models you can (just stack the embeddings from each model).

To respond to your comment (since HN isn't letting me post because I'm 'posting too fast'):

You can use cheaper and more effective approaches for getting the subword
functionality you want.

Look up "Byte Pair Embeddings". That will also handle the OOV problem but for
far less CPU/RAM overhead. BERT also does this for you with its unique form of
tokenization.
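
The core of the Byte Pair Encoding idea fits in a short sketch: repeatedly merge the most frequent adjacent symbol pair so frequent substrings become single tokens, while rare or OOV words decompose into known subword pieces. This is the merge-learning step only; real subword tokenizers add vocabulary handling on top:

```python
# Toy sketch of BPE merge learning (not a production tokenizer).

from collections import Counter

def learn_bpe_merges(words, num_merges):
    """words: list of strings; returns the learned merges in order."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with one symbol.
        for i, symbols in enumerate(corpus):
            out, j = [], 0
            while j < len(symbols):
                if j + 1 < len(symbols) and (symbols[j], symbols[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(symbols[j])
                    j += 1
            corpus[i] = out
    return merges

# Frequent suffixes like "er" get merged into single subword units.
merges = learn_bpe_merges(["lower", "newer", "wider", "lower"], num_merges=2)
```

Unseen words like "slower" would then decompose into learned pieces instead of being dropped as out-of-vocabulary.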

A home CPU can fine-tune fastText in a day on 4 million documents if you're able to walk away from your computer for a while. It shouldn't cost you anything except electricity. If you set the number of epochs higher, you'll get better performance but correspondingly longer training times.

For BERT/GPT-2, you'll maybe want to fine-tune a small version of the model (say, the 117M-parameter version of GPT-2) and then vertically concatenate that with the regular un-fine-tuned GPT-2 model. That should be very fast and hopefully not expensive (and also possible on your home GPU).

~~~
Arimbr
Cool man! Thanks for sharing :) I wasn't familiar with the chunking approach.
I will read more!

Regarding BERT, it indeed may perform better if fine-tuned correctly. For a baseline, fastText is great because it is super fast and runs on a CPU. It cost me $24 to run a 24h autotune on a 16-CPU-core machine. Also, fastText is great out of the box because it builds word vectors for subwords, which helps with typos and with specific terms that may otherwise be out of vocabulary.
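
For readers unfamiliar with fastText's supervised mode, the training data is just one labeled line per post. A minimal sketch of that preprocessing (the example posts and file names are hypothetical, not the real dataset):

```python
# Sketch of fastText's supervised-input format: one post per line,
# prefixed with "__label__<subreddit>".

def to_fasttext_line(subreddit, title, body):
    """Flatten a post into a single fastText training line."""
    text = f"{title} {body}".replace("\n", " ").lower()
    return f"__label__{subreddit} {text}"

posts = [
    ("wallstreetbets", "$SPY 1000", "to the moon"),
    ("BackYardChickens", "When should I kill my chicken", "first flock"),
]
lines = [to_fasttext_line(s, t, b) for s, t, b in posts]

# A file of such lines can then be fed to fastText's autotune, e.g.:
# fasttext.train_supervised(input="train.txt",
#                           autotuneValidationFile="valid.txt",
#                           autotuneDuration=24 * 3600)
```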

I am betting that fine-tuning BERT will cost me at least 10x more. But this project is a chance to try it out :) Looking forward to v2!

Luckily, with Valohai, I get access to GPU credits for open source projects!

------
BatFastard
Would be nice to have the subreddits be links. So I could just click it to
open a new tab of that subreddit.

~~~
Arimbr
Great idea! Will add that :)

------
weaponizedwords
Tried it with Hearthstone-related content.

Title: turn 2 lethal

Content: I managed to cheat out 4 prophet valens on turn 2 followed up by mind blast.

Results: shadowverse, elderscrollslegends, teamfighttactics, teemotalk,
fioramains, ekkomains, ezrealmains, bobstavern, kaisamains, xcom2

Should include: hearthstone. It did pick up BobsTavern, which is something. I thought you would want some feedback.

~~~
Arimbr
Thanks! That helps a lot, although I am not familiar with that area of knowledge.

Indeed, I have some ML metrics from a test split that give me an idea of its accuracy :) But that's just an estimate, so I am looking out for feedback on its real performance, so I can debug the bad cases and fix those with more data or a better model.

The test performance on subreddit r/hearthstone is a 0.21 f1-score, which is not great. And looking at the confusion matrix for r/hearthstone, it often gets confused with:

r/BobsTavern, r/CompetitiveHS, r/customhearthstone, r/Blizzard

If you are curious, I uploaded the metrics (precision, recall, f1-score) and the confusion matrix for the test dataset to a Google Spreadsheet.

[https://docs.google.com/spreadsheets/d/1NBY1o85ZiNpcm4tcYhKk...](https://docs.google.com/spreadsheets/d/1NBY1o85ZiNpcm4tcYhKk_Z0TNxm2Q-MhptuCi9qJJoQ/edit?usp=sharing)

The sheet 'confusion_matrix_gt2' can be used to find similar subreddits.
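
The per-class precision/recall/f1 above can be derived from confusion-matrix counts. A minimal sketch, with hypothetical counts rather than the spreadsheet's real numbers:

```python
# Per-class precision/recall/f1 from a confusion matrix stored as
# nested dicts: confusion[true_label][predicted_label] = count.

def per_class_scores(confusion, cls):
    tp = confusion.get(cls, {}).get(cls, 0)
    fp = sum(row.get(cls, 0) for true, row in confusion.items() if true != cls)
    fn = sum(n for pred, n in confusion.get(cls, {}).items() if pred != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts: r/hearthstone often predicted as r/BobsTavern.
cm = {
    "hearthstone": {"hearthstone": 20, "BobsTavern": 60, "CompetitiveHS": 20},
    "BobsTavern": {"BobsTavern": 50, "hearthstone": 10},
}
p, r, f1 = per_class_scores(cm, "hearthstone")
```

With these made-up counts, recall is low (most hearthstone posts land in BobsTavern), which drags the f1 down the same way the reported 0.21 suggests.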

------
exegete
Cool. Last year I created something like this as a Chrome extension: you could type in your post and it would show you where on reddit to post it. You could then just select a subreddit by clicking a link. The project is here:
[https://github.com/wesbarnett/insight](https://github.com/wesbarnett/insight)

~~~
Arimbr
Nice work @exegete! The Chrome extension idea is great ;) Nice to also see some metrics for comparison. I will review your work. Looks like you achieved 0.6 accuracy with 600 classes. I got a 0.4 f1-score on 4000 classes, but I have a ton of posts and subreddits with images and no text ;)

For this case, it is also nice to report Recall@k. The current model has a Recall@5 of 0.6, meaning that on the test dataset, 60% of the time the human choice is within the first 5 suggestions.

Currently, it is not supposed to post automatically, but to help the user discover new subreddits :)
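
Recall@k as described is straightforward to compute. A small sketch with made-up predictions (the subreddit lists below are illustrative, not model output):

```python
# Recall@k: the fraction of test posts whose true subreddit appears
# among the model's top-k ranked suggestions.

def recall_at_k(true_labels, ranked_predictions, k=5):
    hits = sum(1 for truth, preds in zip(true_labels, ranked_predictions)
               if truth in preds[:k])
    return hits / len(true_labels)

true_subs = ["Poland", "csgo", "INTP", "iran", "aggies"]
top5_preds = [
    ["Poland", "europe", "polska", "AskEurope", "MapPorn"],
    ["GlobalOffensive", "hearthstone", "csgo", "gaming", "pcgaming"],
    ["mbti", "INTP", "INTJ", "philosophy", "askphilosophy"],
    ["worldnews", "politics", "AskReddit", "news", "geopolitics"],
    ["college", "ApplyingToCollege", "GetStudying", "teenagers", "books"],
]
r_at_5 = recall_at_k(true_subs, top5_preds, k=5)  # 3 of 5 hits -> 0.6
```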

I will probably retrain it on more subreddits, and fine tune a few things.

------
jnwatson
Searching for "marijuana" should point to trees (internal Reddit joke) and not
marijuana primarily.

~~~
TheGallopedHigh
Hilariously enough, those who want to stop smoking post to r/leaves.

~~~
siegelzero
And those who want to see trees go to r/marijuanaenthusiasts

------
applecrazy
This is pretty good. Typed in "$spy 1000" and it said r/wallstreetbets (100%).
Accurate.

------
ramraj07
Tried stocks, stock options, investing - all kept giving Robinhoodpennystocks as the top option. Not sure if the model is fully trained?

What are some examples where the model does recommend meaningful things?

~~~
rjtobin
Here were my two experiences, one I felt would be easy and the other hard:

Title: Build recommendations. Message: "I'd like to upgrade some components.
My current rig has an old i7 and an RTX 2060. Looking for something midrange
that can handle modern games at high settings (but maybe not ultra)."

Matches: Nvidia (19%), IndianGaming (8%), GamingLaptops (8%),
pcgamingtechsupport (6%)

Title: Travel advice. Message: "I'm returning to Ireland in July from the USA.
My visa is up. I know I will have to self-quarantine for two weeks. I cannot
move back to my family home due to elderly parents. Are there any
recommendations for people in this sort of situation? I'm happy to pay for a
hotel, but don't want to put a hotel worker at risk. We have an old house down
in Wexford I could stay in, but would involve taking a train when I arrive,
and the HSE guidance says not to take public transport. Any advice?"

Recommendations: LegalAdviceUK (8%), IWantOut (7%), AskUK (5%)

Overall I think this was pretty good, even if it wasn't perfect. I thought it
would struggle more with the second one (maybe getting confused and suggesting
vacation planning subreddits). A little controversial that it kept suggesting
"UK" reddits for a question about Ireland though :)

~~~
Arimbr
Thanks so much for your detailed feedback. There is definitely something going
on with the UK thing... Need to dig deeper!

~~~
rjtobin
Thanks for making it, super cool! Btw, if it wasn't clear, my "controversial"
comment isn't to be taken seriously

------
duxup
Suggested subreddits for this post:

lostredditors 45%

Well yes that is likely, but maybe not a good suggestion as that is a place
where folks point out people who posted the wrong thing in the wrong sub or
conversation ;)

~~~
Arimbr
Interesting! I didn't know about that subreddit. Which text did you try? :)

~~~
duxup
IIRC I had some text about cat pics ;)

~~~
Arimbr
;)

------
SeekingMeaning
“I got straight A’s this semester!!!”

aggies 19.0%

------
bluetwo
Clicking a sub-reddit name should open the sub-reddit in a new window/tab.

------
SkyPuncher
This is awesome!

I often find that when I'm buying something new, I want to find subs related to that product category.

While this doesn't find me direct results, it shows me communities that I should focus my research on.

------
s_dev
Tried to find /r/DevelEire using the search terms "Irish Software Developers"

No luck but Google will bring it up as a first result if the query is "Irish
Software Developers Reddit".

~~~
Arimbr
Unfortunately, I trained only on what Reddit told me are the most popular 4k
subreddits :) It was not trained on r/DevelEire :/

So, I think I need to train it on more subreddits to make it more useful. Thanks for sharing!

------
jokoon
I wish reddit would allow me to download all my comments.

Apparently it's not possible since they're all archived, because reddit constantly regenerates its webpages.

~~~
benibela
You can download them from google:

[https://bigquery.cloud.google.com/table/fh-bigquery:reddit_c...](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.all?pli=1)

You can run SQL over everyone's comments.

edit: it only has comments up to October - 6 months old, which is when they become archived. Guess Google only has the archived comments.

~~~
behindsight
You can also download them from their original mirror at pushshift[1]

PS: that BQ you linked is maintained by fhoffa[2]

1: [https://files.pushshift.io/reddit/](https://files.pushshift.io/reddit/)

2:
[https://news.ycombinator.com/user?id=fhoffa](https://news.ycombinator.com/user?id=fhoffa)

~~~
benibela
Original mirror? But the BQ has 2019_10, while pushshift only goes up to 2019-09.

~~~
behindsight
Apologies, let me clarify: Pushshift has a live API that mirrors reddit, which you can use to compile a more recent dataset[0].

The uploader of that BQ has cited Pushshift as their source[1].

PS: In the next couple of days the batched archive data for Q4 of 2019 as well
as Q1 of 2020 will be available[2]

0: [https://github.com/pushshift/api](https://github.com/pushshift/api)

1:
[https://www.reddit.com/r/bigquery/comments/fcyu4m/extended_o...](https://www.reddit.com/r/bigquery/comments/fcyu4m/extended_on_reddit_what_proportion_of_all_upvotes/fjdup6t/)

2:
[https://www.reddit.com/r/pushshift/comments/fuoe2d/september...](https://www.reddit.com/r/pushshift/comments/fuoe2d/september_december_submissions_will_be_available/)

------
ulucs
Is r/thedonald not included in the database? I'm trying the usual suspect
titles, but get nothing.

~~~
Arimbr
Yes, unfortunately it is not part of the 4k subreddits I trained on. I will retrain it with more subreddits :)

The list of subreddits, with an estimate of the performance for each one, is in this Google Spreadsheet:

[https://docs.google.com/spreadsheets/d/1NBY1o85ZiNpcm4tcYhKk...](https://docs.google.com/spreadsheets/d/1NBY1o85ZiNpcm4tcYhKk_Z0TNxm2Q-MhptuCi9qJJoQ/edit?usp=sharing)

------
minimaxir
You should probably add how exactly you retrieved the 4 million Reddit posts.

~~~
dan1234
The official API is well documented.

[https://www.reddit.com/dev/api/](https://www.reddit.com/dev/api/)

~~~
minimaxir
True, but 4M posts is a heavy load (even when bypassing the API itself and using unauthenticated requests), so I was curious.

------
dominotw
hey i tried

title: My siberian cat Message: My floof

I was hoping to find r/SiberianCats, where I usually post, but it wasn't in the list.

I googled "siberian cat subreddit" and r/SiberianCats was the first link.

~~~
captn3m0
Maybe not a top 4k subreddit?

~~~
Arimbr
You are right!

~~~
dominotw
oh ok. thank you. I overestimated Siberian cats' popularity :)

------
pmoriarty
Or you could just ask here:

[https://old.reddit.com/r/findareddit/](https://old.reddit.com/r/findareddit/)

~~~
Arimbr
Yes, that is the human solution to the problem. I manually tested it a bit on what people ask there :)

On the other hand, the machine is faster, and a lot of people either don't get an answer there or can't wait for one. The machine is not necessarily better, just a complement.

------
dehrmann
Anyone remember /r/reddit.com? That was around the time reddit looked like Hacker News and people were embarrassed to admit they used it.

~~~
ChefboyOG
HN doesn't get enough credit for the tightrope they walk in maintaining this community. I see some people post that HN should expand to other topics a la Reddit, but the team does a great job of maintaining focus.

It's not just HN's aesthetic that is minimal and no-nonsense, it's their
moderation policies and the tone they set for the community. There is perfect
alignment between their approach to content, community, and UX—no fluff, no
nonsense, no manipulation, just the simplest, most valuable material possible.

If they expand the scope of acceptable content, it will be really hard not to tweak moderation policies, and eventually you end up with something like reddit, where each subreddit might as well be its own (typically under-staffed) site.

EDIT: This is not critical of any post on this thread, just seeing the above
comment about old-school reddit got me thinking.

~~~
whatsmyusername
/r/programming is basically "I reposted this from [https://news.ycombinator.com](https://news.ycombinator.com)"

~~~
jedieaston
I've found accounts (possibly bots) that seem to go find the accompanying HN
thread, and post a comment from it onto the reddit thread for karma. I sent
reddit's anti-evil team a note about it since it's probably karma farming, but
they never responded. Maybe it doesn't matter? It's not like we hold exclusive
rights to stuff on here, so there's no legal issue, it just seems to be an
efficient way of getting karma for that specific subreddit.

The reposts also result in a lot of non-programming-related content getting onto that sub, which none of the mods seem to delete very often (stuff that should go to r/sysadmin or even r/technology).

~~~
drusepth
I used to do something like this any time I detected twin threads (discussing
the same URL) across HN and reddit. If anyone asked any unanswered question at
one source, one of my bots would ask it at the other sources (plus Quora,
usually) and wait for a response elsewhere, then paste that response back to
the original OP with a link/citations. Including the citations got me
autobanned a few times (which reddit admins graciously removed, repeatedly);
if I weren't concerned with plagiarism, bot management would probably have
been much smoother.

It would be pretty trivial to skip all the question/answer stuff and just share comments around sites. In a vacuum, I'd argue that mirroring comments around the Internet could do good in various ways (sharing information, letting people choose which site they want to use, limiting censorship and/or site downtime, getting answers to people who might not know the best place to ask them, etc).

What you saw was probably karma farming, but could also have been someone
trying to help in some abstract way. :)

------
bobberkarl
I just ran some pretty dumb queries. I can _assure_ you something is missing.

------
dzonga
good attempt. but needs a lot of work.

~~~
Arimbr
Thanks man! Looking forward to working on V2 :)

