
Ask HN: Any tools to easily cluster ideas/notes by similarity? - arikr
Would like to pass in my Evernote notes and output something grouped by similar ideas/notes.

Does this exist?
======
stared
If it can be inferred from words alone, try Latent Dirichlet Allocation
(e.g. with [http://radimrehurek.com/gensim/](http://radimrehurek.com/gensim/))
to generate tags. Some sources:

* [http://blog.echen.me/2011/08/22/introduction-to-latent-diric...](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/)

* [http://alexperrier.github.io/jekyll/update/2015/09/04/topic-...](http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html)

* [http://engineering.flipboard.com/2017/02/storyclustering](http://engineering.flipboard.com/2017/02/storyclustering)

Alternatively, if you already know the tags and just want to see which are
similar to each other, methods like word2vec should help; see:

* [http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html)

~~~
rspeer
I think the OP might have been looking for code that already exists, not just
techniques for how to build it. But I have some opinions on these techniques.

LDA is not very controllable. It gives you a set number of clusters. Run it
again and you get different clusters.

It can't give you documents that are similar to a particular document that you
asked for, except to say that all the documents that happen to be in a cluster
at the time are similar.

There are indeed up-to-date things you can do with word vectors -- though I'm
sad that the tutorials always point to word2vec or GloVe, as if it's 2014.
That's ancient in machine learning terms, and we now know of flaws in their
outputs, such as [1].

If you want downloadable, pre-computed word vectors, ConceptNet Numberbatch
[2] (part of the ConceptNet project that I develop) is the best in class right
now, and you can download it in the same format as these older systems. And if
you don't believe me tooting my own horn, at least use something else that's
been updated in the last year, such as NASARI, or maybe fastText's precomputed
English vectors.

Or you can just compare "bags of words", which is still good enough for many
applications and readily available out of the box. I believe this is what you
get with a "More Like This" plugin for a search engine such as Lucene, for
example.
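
A minimal sketch of the bag-of-words comparison, using only the Python standard library. It is not Lucene's "More Like This" implementation; the function names (`bag_of_words`, `cosine`, `more_like_this`) and the sample notes are hypothetical, and the tokenizer is a deliberately crude assumption.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(query_index, notes):
    """Return the indices of the other notes, most similar first."""
    bags = [bag_of_words(note) for note in notes]
    scores = [
        (cosine(bags[query_index], bag), i)
        for i, bag in enumerate(bags)
        if i != query_index
    ]
    return [i for _, i in sorted(scores, reverse=True)]


# Hypothetical sample notes, for illustration only
notes = [
    "clustering notes by topic with word vectors",
    "recipe for tomato soup with basil",
    "topic clustering of notes using word counts",
]
print(more_like_this(0, notes))
```

Unlike a fixed set of clusters, this answers "which notes are closest to this particular note" directly, at the cost of ignoring synonyms and word order.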

[1] [https://arxiv.org/abs/1607.06520](https://arxiv.org/abs/1607.06520)

[2] [https://github.com/commonsense/conceptnet-numberbatch](https://github.com/commonsense/conceptnet-numberbatch)

~~~
stared
It's true that the number of topics is a free parameter of LDA. Yet that may
be unavoidable if we want to tune between just looking at similar words and
looking at very coarse patterns.

For document similarity, tools typically do not need to be as subtle as for
analogy tasks. Thank you for pointing to your research/tools anyway. :) And
well, word2vec/GloVe are by no means the end of the story - just a simple
baseline.

------
jamessb
DevonThink has a "See Also and Classify" feature [1] that will suggest other
notes related to a note, or suggest a category for a note.

[1]: see screenshot here [http://blog.devontechnologies.com/2016/11/a-users-journey-in...](http://blog.devontechnologies.com/2016/11/a-users-journey-into-devonthink-student-academic-workflow/)

------
le-mark
Long ago, before link counting (Google's PageRank algorithm), there was quite a
lot of research around finding similarity among a group of documents. "Scatter
gather"[1] was a method I've often wished I could use on top of a set of
search engine results.

[1]
[https://pdfs.semanticscholar.org/1134/3448f8a817fa391e3a7897...](https://pdfs.semanticscholar.org/1134/3448f8a817fa391e3a7897a95f975ad2873a.pdf)

------
skate22
If you know some Python and are so inclined, you could use scikit-learn's
k-means clustering implementation [1]. I used it to find similar movies based
on their plot summaries for a school project.

[1]: [http://scikit-learn.org/stable/auto_examples/text/document_c...](http://scikit-learn.org/stable/auto_examples/text/document_clustering.html)
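
A dependency-free sketch of the same idea, for readers who want to see the moving parts: TF-IDF vectors plus plain k-means (Lloyd's algorithm). This is not scikit-learn's implementation, whose `TfidfVectorizer` and `KMeans` classes would replace most of this code. The helper names, the idf smoothing, the farthest-first seeding, and the sample plot summaries are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Turn each document into a unit-length {word: weight} dict."""
    bags = [Counter(re.findall(r"[a-z0-9']+", d.lower())) for d in docs]
    doc_freq = Counter(word for bag in bags for word in bag)
    vectors = []
    for bag in bags:
        # Smoothed idf: one of several common variants, chosen for this sketch
        vec = {w: tf * math.log(1 + len(docs) / doc_freq[w]) for w, tf in bag.items()}
        norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
        vectors.append({w: x / norm for w, x in vec.items()})
    return vectors


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(x * b.get(w, 0.0) for w, x in a.items())


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return dot(a, a) - 2 * dot(a, b) + dot(b, b)


def kmeans(vectors, k, iterations=20):
    """Return one cluster label per vector."""
    # Deterministic farthest-first seeding, so repeated runs agree
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                continue  # keep the old centroid for an empty cluster
            mean = defaultdict(float)
            for v in members:
                for w, x in v.items():
                    mean[w] += x / len(members)
            centroids[j] = dict(mean)
    return labels


# Hypothetical plot summaries, for illustration only
plots = [
    "space ship lands on mars",
    "astronaut repairs space ship near mars",
    "detective solves murder in london",
    "london detective hunts a murder suspect",
]
print(kmeans(tfidf_vectors(plots), k=2))
```

The number of clusters `k` still has to be chosen by hand, which is the same free-parameter issue raised about LDA earlier in the thread.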

------
Veen
DevonThink is very good at figuring out related notes. It's Mac only,
unfortunately. You'd have to export from Evernote and then into DevonThink,
which is an Evernote competitor.

[http://www.devontechnologies.com/products/devonthink/overvie...](http://www.devontechnologies.com/products/devonthink/overview.html)

------
bryanph_
I'm currently working on a graph-based approach to this problem. I wrote about
it here: [https://hackernoon.com/building-a-open-source-personal-knowl...](https://hackernoon.com/building-a-open-source-personal-knowledge-base-45c25f5a4324)

------
mozartoz
You may be able to craft something like this in org-mode, by writing a little
bit of elisp.

org-mode already provides all infrastructure to organise your ideas and tag
them. You would just need to do the clustering part.

------
cartercole
[https://radimrehurek.com/gensim/](https://radimrehurek.com/gensim/) makes
topic modeling really easy in Python

------
evolve2k
Different tool but for anyone doing a bunch of academic research and wanting
to group what they are learning using an open source tool; check out
docear.org

They could really do with more developer support also. Had difficulties
finding someone to build a LibreOffice plugin and a (Mac)Word Plugin.

------
projectramo
I assume you mean something that does it automatically, but you should also
know that you can tag notes, which would be a manual way of achieving this.

(For developers: just take the most frequent words that are not "a, the, if,
but" etc and convert them to tags)
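
A rough sketch of that suggestion in Python, assuming a tiny hand-written stop-word list (a real one would be far longer) and a hypothetical `suggest_tags` helper. The minimum word length and the sample note are likewise illustrative choices.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; an assumption, not a standard list
STOP_WORDS = {
    "a", "an", "the", "if", "but", "and", "or", "of", "to",
    "in", "is", "it", "for", "on", "with", "not",
}


def suggest_tags(note, max_tags=3):
    """Return the most frequent non-stop-words in a note as candidate tags."""
    words = re.findall(r"[a-z']+", note.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_tags)]


# Hypothetical note, for illustration only
note = (
    "The garden needs water. Water the tomato plants and the garden beds, "
    "but not the tomato seedlings."
)
print(suggest_tags(note))
```

Notes that end up sharing tags can then be grouped, though raw frequency tends to favor generic words unless the stop-word list is tuned to the collection.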

------
ollybee
[https://piggydb.net](https://piggydb.net) I don't use this myself but it's
been on my list of things to investigate for a very long time as it seems like
a fresh approach to note taking.

------
sciencerobot
There is an option/plugin for Jekyll for finding similar blog posts.
[https://jekyllrb.com/docs/configuration/](https://jekyllrb.com/docs/configuration/)

------
teapot01
I am working on something related - unfortunately it's on the far left back
burner at the moment. Haven't found anything similar though

------
ehudla
I wanted to build something like this for Zotero. Never got around to doing it,
unfortunately.

------
SeaDude
Emacs org-mode. Plain text for everything, for the rest of your life.

