
Show HN: Using LDA to suggest GitHub repositories based on what you have starred - c5urf3r
http://www.gitsuggest.com
======
rw
This only uses LDA on the descriptions of your starred repositories, to find
topic terms that describe them. These topic terms are then used to query the
GitHub search API for matching repositories, and the results are sorted by
star count.

That is a clever way to make use of a search API like GitHub's. The more
principled way to do this, though, would be to run LDA over all descriptions on
GitHub and use the resulting similarity index to find similar repositories.
You could run LDA over code, too.

I'll note that there is a cold start problem with this implementation: using
LDA on such a small set of short documents will often lead to uninformative
topics with words that are too specific. You need a big corpus to capture e.g.
synonym relationships.

~~~
painted
Your point is quite interesting, although I'm not sure running LDA over the
entire codebase would be useful. I spent half a year writing my postgraduate
thesis on a recommender system for streaming services based on LDA; in
particular, we wanted to infer who is watching what, and when, in a shared
account. From all the tests I did with LDA, I believe the best thing would be
to run it on the README files.

~~~
rw
Good idea, the READMEs would be best of all.

------
opportune
Typo when you don't have the right kind of repos starred: "yeild" should be
"yield".

I've worked with NLP a bit before, but haven't worked with LDA and have only
read the wikipedia article and gensim documentation. One thing I don't
understand is why you only generate a single topic for each user, and then
query the top n (5) terms. From what I understand of LDA, its usefulness is in
partitioning text into k separate topics based on how often words are used in
similar contexts. In my mind, this is more or less analogous (please tell me
if this is wrong) to finding k centroids for a vector representation of text
after training a word2vec mapping (in an appropriately low dimension given the
document size) on that text. However, if you are only finding a single topic,
you are only using one centroid, so your search will be the n tokens that are
closest to the centroid. I'm pretty sure that the tokens (from the text)
closest to the centroid of a word2vec mapping trained on a text will mostly
consist of high-frequency words and semi-stop words (by this I mean words used
in varied contexts because of their use in language, but not filtered by the
stop word check).

Then, if someone has many different topical interests, LDA might over-represent
whichever topic has the plurality of text dedicated to it. For example, if my
starred repos are something like 30% Fortran, 30% JavaScript, and 40% Java, I
believe your queries will mostly contain Java terms. This seems to run counter
to the goal of using LDA, which (to my understanding) would be to identify
these latent topics and give relevant queries for each one, or for some
combination of them.

I think a good way to address this would be to implement some way to change
the default number of topics. One approach may be to use a word2vec model
(perhaps trained on GitHub data itself) to determine the "spread" of the
incoming tokens: you could construct cliques based on pairwise distance
between vectors and do something with that (let k be the number of cliques, or
the number of cliques of size greater than m, etc.).
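One crude stdlib sketch of the "spread" idea, using connected components of a thresholded distance graph as a simpler stand-in for cliques (the token vectors are entirely made up, and the threshold is arbitrary):

```python
from itertools import combinations

# hypothetical low-dimensional word vectors for a user's query tokens
vecs = {
    "flask":   (0.9, 0.1),
    "django":  (0.8, 0.2),
    "pytorch": (0.1, 0.9),
    "keras":   (0.2, 0.8),
}

def dist(a, b):
    # Euclidean distance between two token vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# union-find: merge tokens whose pairwise distance falls under a threshold
parent = {w: w for w in vecs}

def find(w):
    while parent[w] != w:
        w = parent[w]
    return w

for a, b in combinations(vecs, 2):
    if dist(vecs[a], vecs[b]) < 0.3:
        parent[find(a)] = find(b)

# number of groups -> candidate value for num_topics
k = len({find(w) for w in vecs})
print(k)  # 2: a "web" group and a "deep learning" group
```

Here k would then be passed as `num_topics` instead of the hardcoded 1.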

A different approach might be to precompute the vector average of each GitHub
repo. Then you could perform richer comparisons directly against documents
(e.g. compare each clique's centroid to the repos) without querying GitHub for
tons of repos.

~~~
c5urf3r
Thanks for your suggestions. I'll keep track of them and try to include them
in the next iteration. Please raise an issue if it bothers you a lot.

Also, I have rectified the typo.

------
jbochi
Very interesting. I've implemented something similar[1] using a pure
Collaborative Filtering approach[2][3], which I think works better for me, but
it's unable to recommend unpopular repositories.

The New York Times recommender system uses a hybrid approach (Content Based +
Collaborative Filtering) called Collaborative Topic Modeling on top of LDA[4].
It would be interesting to try that.

[1]: [https://github-recs.appspot.com/](https://github-recs.appspot.com/)

[2]: [https://medium.com/towards-data-science/recommending-github-repositories-with-google-bigquery-and-the-implicit-library-e6cce666c77](https://medium.com/towards-data-science/recommending-github-repositories-with-google-bigquery-and-the-implicit-library-e6cce666c77)

[3]: [https://github.com/jbochi/facts](https://github.com/jbochi/facts)

[4]: [https://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/](https://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/)

~~~
c5urf3r
Really nice links, thanks! I'll take a look and compare approaches for better
tuning. :) Feel free to do the same and raise some pull requests. :)

------
keeran
[https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

For anyone else who didn't know :D

~~~
morazow
I thought it was about Linear Discriminant Analysis..

~~~
bugmen0t
It is.

~~~
painted
no, it's not

~~~
nerdponx
You can, however, use Linear Discriminant Analysis on the topic scores from
Latent Dirichlet Allocation...

~~~
painted
I guess, if you have a fair number of topics, although the point was to
explain the title :)

------
ninjakeyboard
I appreciate the effort, but it is giving me, a Scala guy and an author of
books on subjects like distributed computing and Akka, links to PHP libraries.
I think the ranking should weight language interest very highly.

~~~
c5urf3r
That is a thought which indeed crossed my mind.

I personally play around with a lot of tools and languages and wanted to make
sure a language barrier didn't limit the tool. For example, if there are a lot
of repositories out there in JS for something I am starting in Python, I may
want to take a look at why so many people chose to do it in JS instead, and
then make my decision based on that.

That would definitely be lost if I filtered by language. In its current form,
the tool gives importance to the topic and not the language for this specific
reason.

~~~
nerdponx
There's a variant of LDA called "structured LDA" that allows you to introduce
a linear combination of "structural variables" into the topic distribution
prior. If you wanted to make this project "language-aware", it might be a good
use case for sLDA.

------
iKlsR
Does this only recommend awesome-style repos? I got about 20 of those despite
having over 400 stars, the majority computer graphics. It also suggested a lot
of Ruby, a language I don't use and which makes up less than 1% of my stars.
Odd.

~~~
c5urf3r
The problem is this: the suggestions are generated using the repo descriptions
as the starting point, dropping all references to language and all non-English
words. What this means is that even if you have a lot of stars on Ruby
repositories, if the search on the terms derived from that process leads to
non-Ruby repositories, that is what is shown. I'll look into tuning it better.

~~~
mrkstu
I'd suggest adding a language filter at the top of the results, i.e. a
checkbox for each language so users can specify which results to keep.

------
tschellenbach
Well, not sure why, but it definitely didn't work for me; random suggestions
would have outperformed it. I like the idea though. Suggested GitHub repos
would be fun to have a look at.

~~~
c5urf3r
Take a look at the suggested repositories; you may find something interesting
after some digging, or add a new interest to your list. After some more
starring, the tool might actually start suggesting repositories to your
liking.

Also, if you find something strongly troubling, feel free to create issues on
the repo and I will try to see how it can be better tuned/filtered.

------
jayunit
Cool project! Thanks for publishing and sharing it.

It'd be interesting to know what topic terms it produces for each of my repos.
It looks like it's taking all the repo descriptions, producing a topic model
over that corpus with a single topic (`LdaModel(num_topics=1)`), and
retrieving the top N terms for that topic. Those topic terms will be the most
frequent words from the topic, so I think this will end up producing the most
frequent words from the cleaned token set.
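If that reading is right, the single-topic top terms roughly collapse to frequency counts over the pooled tokens; a stdlib sketch of that equivalence (the token lists are made up):

```python
from collections import Counter

# hypothetical cleaned token lists from a user's starred-repo descriptions
docs = [
    ["python", "web", "framework", "async"],
    ["python", "machine", "learning", "library"],
    ["web", "server", "python", "routing"],
]

# pool all tokens and take the N most frequent -- roughly what a
# single-topic model's top terms reduce to
counts = Counter(token for doc in docs for token in doc)
top_terms = [term for term, _ in counts.most_common(3)]
print(top_terms)  # 'python' and 'web' dominate the query
```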

I'd be curious to see what happens if you could run LDA over the full dataset,
produce multiple topics, and suggest repos based on those topics. This would
be a pretty fun extension to the project!

If you're just running LDA over the repo descriptions (and not looking into
the content of any files, e.g. the README), might
[http://ghtorrent.org/](http://ghtorrent.org/) be able to provide the data?

Alternatively, if you want to include text from the README files, could you
use the Google BigQuery snapshot of GitHub
[https://cloud.google.com/bigquery/public-data/github](https://cloud.google.com/bigquery/public-data/github)
and do analysis like this:
[https://blog.exploratory.io/clustering-r-packages-based-on-github-data-in-google-bigquery-1cadba62eb8d](https://blog.exploratory.io/clustering-r-packages-based-on-github-data-in-google-bigquery-1cadba62eb8d)?

Or, it might be interesting to try producing a vector representation per repo
by taking the description (and readme?), and doing something like: produce
word vectors for each word, and sum the word vectors.
[https://spacy.io/](https://spacy.io/) is a nice-to-use library that could
help here.

Once you have a vector representation for each repo, a distance metric like
cosine similarity could find related repos. Or (depending on the dataset
size / performance) an approximation like spill trees or an LSH forest.
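A toy sketch of that comparison step (the per-repo vectors and names here are made up; in practice they'd be sums of word vectors from spaCy or similar):

```python
from math import sqrt

def cosine(a, b):
    # cosine similarity between two repo vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

# hypothetical summed word vectors for three repos
repo_vecs = {
    "fast-web": (0.9, 0.1, 0.0),
    "tiny-http": (0.8, 0.2, 0.1),
    "deep-net": (0.1, 0.9, 0.3),
}

# rank all repos by similarity to a query repo
query = repo_vecs["fast-web"]
ranked = sorted(repo_vecs, key=lambda name: cosine(query, repo_vecs[name]),
                reverse=True)
print(ranked)  # ['fast-web', 'tiny-http', 'deep-net']
```

At real dataset sizes this brute-force scan is where the spill-tree or LSH approximation would come in.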

Looking forward to seeing where this goes next!

~~~
c5urf3r
Some really good suggestions! Could you please raise an issue on the repo so
we can keep track of them?

------
futureishere
That is one really cool application of LDA!

I'm curious about your implementation: for instance, what sort of
preprocessing did you have to carry out? I wrote a script some time back to
analyze Paul Graham's essays (link:
[https://github.com/futureUnsure/pg-essay-lda](https://github.com/futureUnsure/pg-essay-lda)),
and had to remove dates and times because they appeared a lot and distorted
the top topics. I'm wondering if you had to do something similar for text that
describes code?

Also, did you write an LDA library yourself or did you leverage an existing
library?

I apologize in advance if my questions sound naive/stupid, am just a noob...

~~~
c5urf3r
Thanks! I am using the gensim package for LDA. In a nutshell:

1. Get descriptions of the repos the user is interested in.

2. Cleanup/filtering/tokenization.

3. Use LDA to generate topics.

4. Use the topics to search for repositories GitHub can provide.
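Step 2 might look something like this (the stop-word list and regex are illustrative, not the project's actual code):

```python
import re

# illustrative stop-word list; a real one would be much larger
STOPWORDS = {"a", "an", "the", "for", "and", "in", "of", "with"}

def tokenize(description):
    # lowercase, keep alphabetic runs only, drop stop words and short tokens
    words = re.findall(r"[a-z]+", description.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]

print(tokenize("A fast HTTP server for Python 3"))
# ['fast', 'http', 'server', 'python']
```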

------
serf
Given that stars are public, AFAIK, why do I have to log in? Is there some
'hide my stars' option you're trying to get around?

~~~
c5urf3r
[https://developer.github.com/v3/search/#rate-limit](https://developer.github.com/v3/search/#rate-limit)

------
michaelmior
Got an error :( Going to assume it's the HN effect. Looking forward to trying
it out later!

~~~
c5urf3r
Did you eventually get to try it out?

~~~
michaelmior
Unfortunately not. I tried again several times over the course of multiple
days (including today) with no success.

------
painted
looks like the code for this is here:
[https://github.com/csurfer/gitsuggest](https://github.com/csurfer/gitsuggest)

~~~
c5urf3r
There are fork and star buttons on the website that link directly to the repo.

~~~
painted
sorry, I missed that

------
eggie5
I never know if people mean NLP LDA or Gaussian classifier LDA...

------
Whoaa512
Didn't work for me; I have about 2,700 stars though.

~~~
c5urf3r
That is sad :(. Please raise an issue with more details and I will try to
fine-tune it to work better.

------
supernintendo
Just a heads up, the layout is pretty broken on mobile.

~~~
c5urf3r
Please raise it as an issue. My UI knowledge is pretty limited but will try to
make it a bit cleaner.

------
alfla
Did not work for me; I have about 130 starred repos.

~~~
c5urf3r
That is sad :(. Please raise an issue with more details and I will try to
fine-tune it to work better.

