
Topic Suggestions for Millions of Repositories - dchuk
https://githubengineering.com/topics/?hn
======
buro9
This didn't work so well for one of my projects:
[https://github.com/microcosm-cc/bluemonday](https://github.com/microcosm-cc/bluemonday)

I chose to look at this one as the README is quite descriptive and offers
examples, and it is reasonably well structured.

It's a Go HTML sanitizer.

The suggested topics included "go" but none of the others I have now tagged it
with.

It did include things like "html-element" and "data-uri". I can see how it did
this from word prevalence, but these were far too specific to examples and
documentation, and did not describe the project.

It feels as if the words being counted should be weighted towards the early
part of the README, perhaps going no farther in than the 3rd heading.
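That truncation idea could be sketched roughly like this (a minimal illustration for a markdown README; the function name and cutoff parameter are hypothetical, not anything GitHub has described):

```python
import re
from collections import Counter

def early_terms(readme_md, max_headings=3):
    """Count terms only through the Nth markdown heading's section,
    so example code and API docs deep in the README don't dominate
    topic suggestions. Purely illustrative."""
    kept = []
    headings_seen = 0
    for line in readme_md.splitlines():
        if re.match(r"#{1,6}\s", line):
            headings_seen += 1
            if headings_seen > max_headings:
                break  # stop once we pass the cutoff heading
        kept.append(line)
    # crude tokenizer: lowercase words of 2+ chars, hyphens allowed
    words = re.findall(r"[a-z][a-z0-9-]+", "\n".join(kept).lower())
    return Counter(words)
```

In practice you would probably want a decaying positional weight rather than a hard cutoff, but even truncation like this would keep terms from a trailing "Examples" or "License" section out of the suggestions.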

~~~
justinclift
Thanks for bluemonday, we use it in our GFM sanitisation. Certainly saved time
on having to write our own from scratch and potentially getting it wrong. :)

------
dchuk
I was hoping some folks could jump in here and comment on this topic (pun
slightly intended). I've been researching it heavily as it's something I want
to do on [https://engineered.at](https://engineered.at) and I feel like the
way that Github approached it seems pretty straightforward.

Has anyone else implemented something like this before? War stories?

~~~
jaredchung
We just implemented topic suggestions as well on our Q&A data. The literature
in this field is actually quite solid. Here are a few key insights from our
research:

1) Using recall@5 as your test metric gives you the benefit of being able to
compare your results against the academic literature, which often (but not
always) uses recall@5.

2) We read the articles that introduce the following systems: TagCombine
(state of the art in 2013), NetTagCombine, EnTagRec, TagMulRec, fastText's
supervised topic recommendation, and a few others. Unfortunately, other than
fastText, not many of these have OSS libraries available out of the box, but
that's OK as long as you're willing to use the underlying methods, which
_are_ available off the shelf.

3) Our approach: we tested a few of the above systems, and also mixed and
matched our own systems with tf-idf, topic modeling, multilabel
classification, L-LDA, fastText, and others. For each attempt we calculated
(a) recall@5 against our test set (we used k-fold cross-validation) and (b)
how long the training step took.

4) In the end, we decided that we got the best combination of recall@5,
training time, and engineer efficiency by using ONLY the fastText library. We
spent about a week trying other methods before we tried fastText. fastText
took us about 3 hours to get first results, and then we tweaked the
parameters for another two days before we found the right combination of
learning rate, epochs, n-gram size, etc. (read the docs :) Our current
recall@5 for v1 of this feature is a little above 0.5, which we think is good
enough to provide a solid user experience.
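For anyone unfamiliar with the metric, here is a minimal sketch of recall@5 and of fastText's supervised input format (the `__label__` prefix is fastText's documented default; the function names and data are illustrative, not from our production code):

```python
def recall_at_k(true_tags, predicted_tags, k=5):
    """Fraction of an item's true tags that appear in the top-k
    ranked predictions. Illustrative sketch of recall@5."""
    if not true_tags:
        return 0.0
    top_k = set(predicted_tags[:k])
    return len(set(true_tags) & top_k) / len(true_tags)

def mean_recall_at_k(examples, k=5):
    """Average recall@k over a test set of
    (true_tags, ranked_predictions) pairs."""
    return sum(recall_at_k(t, p, k) for t, p in examples) / len(examples)

def to_fasttext_line(tags, text):
    """fastText supervised mode expects one document per line,
    labels prefixed with __label__ (its default prefix)."""
    return " ".join(f"__label__{t}" for t in tags) + " " + text
```

With training data in that line format, `fasttext supervised` (or the Python bindings' `train_supervised`) handles the rest; the metric above is what you'd compute over held-out folds.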

Caveat: Unlike GitHub, we actually do have user-generated labels to start
with, so our modeling problem is one step less complex than theirs but is
otherwise the same.

(TL;DR If you have gold standard data to start with and no time to read
academic articles, try fastText supervised learning. If you have time to read
academic articles, start with a lit search.)

~~~
dchuk
Great reply, thank you!

Can I ask what your site is?

~~~
jaredchung
CareerVillage.org

------
khc
It did ok for the most part, except it also suggested "goofys" as a tag for
[https://github.com/kahing/goofys](https://github.com/kahing/goofys).

