Topic Suggestions for Millions of Repositories (githubengineering.com)
50 points by dchuk on Aug 29, 2017 | 9 comments



This didn't work so well for one of my projects: https://github.com/microcosm-cc/bluemonday

I chose to look at this one as the README is quite descriptive and offers examples, and it is reasonably well structured.

It's a Go HTML sanitizer.

The suggested topics included "go" but none of the others I have now tagged it with.

It did include things like "html-element" and "data-uri". I can see how it did this from word prevalence, but these were far too specific to examples and documentation, and did not describe the project.

It feels as if the word counts should be weighted towards the early part of the README, perhaps no farther in than the 3rd heading.
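A minimal sketch of that idea, not GitHub's actual method: count words in a README, but give full weight only to text that appears before the third heading. The function name, the `cutoff_heading` and `late_weight` parameters, the default of 0.1, and the reading of "3rd heading" as a hard boundary are all assumptions made for illustration.

```python
import re
from collections import Counter

# Markdown ATX headings such as "# Title" or "## Usage".
HEADING = re.compile(r"^#{1,6}\s")
# Lowercase words of two or more characters; hyphens kept so "data-uri" is one token.
WORD = re.compile(r"[a-z][a-z0-9-]+")


def weighted_term_counts(readme, cutoff_heading=3, late_weight=0.1):
    """Count words, down-weighting everything from the cutoff heading onward.

    Text before the `cutoff_heading`-th heading counts with weight 1.0;
    the cutoff heading and all text after it count with `late_weight`.
    """
    counts = Counter()
    headings_seen = 0
    for line in readme.splitlines():
        if HEADING.match(line):
            headings_seen += 1
        weight = 1.0 if headings_seen < cutoff_heading else late_weight
        for word in WORD.findall(line.lower()):
            counts[word] += weight
    return counts
```

With this weighting, a term repeated several times in a late "Examples" section can still rank below a term used once or twice in the introduction. Setting `late_weight=0` ignores the later sections entirely.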


Thanks for bluemonday, we use it in our GFM sanitisation. Certainly saved time on having to write our own from scratch and potentially getting it wrong. :)


I was hoping some folks could jump in here and comment on this topic (pun slightly intended). I've been researching it heavily as it's something I want to do on https://engineered.at and the way GitHub approached it seems pretty straightforward.

Has anyone else implemented something like this before? War stories?


We just implemented topic suggestions as well, on our Q&A data. The literature in this field is actually quite solid. Here are a few key insights from our research:

1) Using recall@5 as your test metric lets you compare your results against the academic literature, which often (but not always) reports recall@5.

2) We read the articles that introduce the following systems: TagCombine (was state of the art in 2013), NetTagCombine, EnTagRec, TagMulRec, fastText's supervised topic recommendation, and a few others. Unfortunately, other than fastText, not many of these have OSS libraries available out of the box, but that's OK as long as you're willing to build on the underlying methods, which are available off the shelf.

3) Our approach: we tested a few of the above systems, and also mixed and matched our own systems using tf-idf, topic modeling, multilabel classification, L-LDA, fastText, and others. For each attempt we calculated (a) recall@5 against our test set (we used k-fold cross-validation) and (b) how long the training step took.

4) In the end, we decided that we got the best combination of recall@5, training time, and engineer efficiency by using ONLY the fastText library. We spent about a week trying other methods before we tried fastText. fastText took us about 3 hours to get first results, and then we tweaked the parameters for another two days before we found the right combination of learning rate, epochs, n-gram size, etc. (read the docs :) Our current recall@5 for v1 of this feature is a little above 0.5, which we think is good enough to provide a solid user experience.
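For readers unfamiliar with the metric, here is a small sketch of recall@5 as it is commonly defined for tag recommendation: for each item, the fraction of its true labels that appear among the top 5 suggestions, averaged over all items. The function names are hypothetical, and the exact definition used by the commenter's team is an assumption.

```python
def recall_at_k(true_labels, ranked_predictions, k=5):
    """Fraction of an item's true labels found in its top-k ranked suggestions."""
    truth = set(true_labels)
    if not truth:
        raise ValueError("recall is undefined for an item with no true labels")
    top_k = set(ranked_predictions[:k])
    return len(top_k & truth) / len(truth)


def mean_recall_at_k(examples, k=5):
    """Average recall@k over (true_labels, ranked_predictions) pairs.

    Items without any true labels are skipped.
    """
    scores = [recall_at_k(truth, preds, k) for truth, preds in examples if truth]
    if not scores:
        raise ValueError("no labelled examples to score")
    return sum(scores) / len(scores)
```

Under this definition, a score a little above 0.5 means that, on average, slightly more than half of each item's real labels show up in the five suggestions.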

Caveat: Unlike GitHub, we actually do have user-generated labels to start with, so our modeling problem has one less step of complexity than theirs, but the two are otherwise the same.

(TL;DR If you have gold standard data to start with and no time to read academic articles, try fastText supervised learning. If you have time to read academic articles, start with a lit search.)
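As a rough sketch of what "try fastText supervised learning" can look like with the `fasttext` Python package: labelled examples are written one per line with `__label__` prefixes, then passed to `fasttext.train_supervised`. The helper names and the hyperparameter values below are placeholders chosen for illustration, not the settings the commenter arrived at.

```python
import re


def to_fasttext_line(text, labels):
    """Format one example as '__label__a __label__b lowercased single-line text'."""
    prefix = " ".join(
        "__label__" + label.strip().lower().replace(" ", "-") for label in labels
    )
    cleaned = re.sub(r"\s+", " ", text.lower()).strip()
    return f"{prefix} {cleaned}"


def write_training_file(examples, path):
    """Write (text, labels) pairs to `path` in fastText's supervised format."""
    with open(path, "w", encoding="utf-8") as f:
        for text, labels in examples:
            f.write(to_fasttext_line(text, labels) + "\n")


def train_and_suggest(train_path, text, k=5):
    """Train a multilabel model and return the top-k suggested labels for `text`."""
    import fasttext  # third-party package: pip install fasttext

    model = fasttext.train_supervised(
        input=train_path,
        lr=0.5,         # placeholder value; tune on held-out data
        epoch=25,       # placeholder value
        wordNgrams=2,   # placeholder value
        loss="ova",     # one-vs-all loss, suited to multiple labels per item
    )
    # predict() expects a single line of text, so collapse whitespace first.
    labels, _probabilities = model.predict(re.sub(r"\s+", " ", text).strip(), k=k)
    return [label.replace("__label__", "") for label in labels]
```

The learning rate, epoch count, and n-gram size are the knobs mentioned above; the fastText documentation describes them and also offers automatic hyperparameter search.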


Great reply, thank you!

Can I ask what your site is?


CareerVillage.org


I have also been researching this heavily recently. I am more concerned with keyphrase extraction than with topics. I view topics as being generalised across documents, whereas keyphrase extraction is concerned with the phrases or words that best represent each document, which better suits my use case.
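To make the per-document idea concrete, here is a deliberately simple sketch that ranks the words of each document by TF-IDF, so that words shared by every document score zero and words distinctive to one document rise to the top. This is a stand-in technique chosen for illustration, it handles single words rather than multi-word phrases, and it is not the pipeline used by the commenter or by Moz. All names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase words of two or more characters, keeping internal hyphens."""
    return re.findall(r"[a-z][a-z0-9-]+", text.lower())


def top_terms_per_document(docs, n=3):
    """Return the n highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in docs]

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            term: count * math.log(len(docs) / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:n])
    return results
```

A real keyphrase extractor would also generate candidate phrases (for example noun phrases or n-grams) and typically use more features than raw TF-IDF.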

Moz have a good write-up of their keyphrase extraction pipeline here: https://moz.com/devblog/machine-learning-approach-to-keyword...

I would be happy to discuss this with you more if you would like.


That would be awesome! My email is admin@engineered.at


It did ok for the most part, except it also suggested "goofys" as a tag for https://github.com/kahing/goofys.



