

Ask HN: Semantic relationship distance between terms - dzink

We've been trying to hunt down a data source with semantic relationship data about the conceptual distance between terms. Example:
Cancer and Tumor, being closely related, would have a value of ~1. Cancer and Healthcare would also be closely related, though maybe at 0.8. Cancer and Nurse might be more than 0.6 related. Cancer and Space, at the other end, would be completely unrelated and thus closer to 0. (Obviously the data may be in a different format and weighting, but if it carries the right information I can translate it to the graph database we are considering.) We've looked at WordNet, Freebase, Wikipedia scraping, expert system databases, even TF-IDF algorithms we could let loose on data. Synonyms don't work. Is there something out there that sounds like it would be useful, or should we build something on our own / start a new project?
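
To make the kind of score described above concrete, here is a minimal sketch of one common way such numbers can be derived: count which terms share a document and compare the resulting count vectors with cosine similarity. The corpus, the stopword list, and every function name are assumptions made up for illustration; this is not any of the data sources mentioned, and a five-sentence corpus will not reproduce the 1 / 0.8 / 0.6 / 0 values.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus, invented for illustration; a real run would use a large text dump.
DOCS = [
    "the tumor was diagnosed as cancer by the hospital oncologist",
    "cancer treatment is a major part of healthcare spending",
    "a nurse in the hospital cares for cancer and tumor patients",
    "the space probe left orbit and entered deep space",
    "a rocket carried the satellite into orbit",
]

# Hypothetical minimal stopword list, just enough for the toy corpus.
STOPWORDS = {"the", "a", "as", "by", "is", "of", "in", "for", "and", "was", "into"}


def cooccurrence_vectors(docs):
    """Map each term to a Counter of the terms it shares a document with."""
    vectors = defaultdict(Counter)
    for doc in docs:
        terms = [t for t in doc.lower().split() if t not in STOPWORDS]
        for term in terms:
            for other in terms:
                if other != term:
                    vectors[term][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def relatedness(vectors, a, b):
    """Relatedness of two terms; unseen terms are treated as empty vectors."""
    return cosine(vectors.get(a, Counter()), vectors.get(b, Counter()))


VECTORS = cooccurrence_vectors(DOCS)
for other in ("tumor", "nurse", "space"):
    print(f"cancer ~ {other}: {relatedness(VECTORS, 'cancer', other):.2f}")
```

Each resulting pair score could then be stored as a weighted edge in a graph database. How good the scores are depends almost entirely on the corpus they are computed from.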
======
parsabg
that's a well-studied subject. check out CEPT [1], Word2vec [2] (demo
implementation [3]) and Wikipedia Miner [4].

[1] [http://www.cept.at/demo_retina_viewer.html](http://www.cept.at/demo_retina_viewer.html)

[2] [https://code.google.com/p/word2vec/](https://code.google.com/p/word2vec/)

[3] [http://radimrehurek.com/2014/02/word2vec-tutorial/#app](http://radimrehurek.com/2014/02/word2vec-tutorial/#app)

[4] [http://wikipedia-miner.cms.waikato.ac.nz/demos/compare/?term...](http://wikipedia-miner.cms.waikato.ac.nz/demos/compare/?term1=tumor&term2=cancer)

~~~
dzink
This is very helpful. The odd part is some of these consider cancer as closely
associated with health as it is with outer space. My goal is to connect people
and projects with related people and projects who might be relevant to them. A
user interested in healthcare might find someone working on a cancer project
very relevant. With enough data we can create a structured process that
correlates topics, but until then it seems the context behind each term in the
libraries I'm seeing is very different. Better than nothing, but surprisingly
different from what we want to do.
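
As a rough sketch of the matching step described here, assuming a table of pairwise term scores already exists: score each project by its best-matching topic against the user's interests, then sort. The scores below are the illustrative figures from the original question, not measured data. The project names and all function names are hypothetical, and treating unlisted pairs as 0 is an assumption.

```python
# Pairwise relatedness scores. These are the illustrative figures from the
# question (cancer/tumor ~1, cancer/healthcare 0.8, ...), not measured data.
RELATEDNESS = {
    frozenset({"cancer", "tumor"}): 1.0,
    frozenset({"cancer", "healthcare"}): 0.8,
    frozenset({"cancer", "nurse"}): 0.6,
    frozenset({"cancer", "space"}): 0.0,
}

# Hypothetical projects, each tagged with the topics it covers.
PROJECTS = {
    "tumor-imaging": ["cancer", "tumor"],
    "satellite-tracker": ["space"],
    "nurse-scheduling": ["nurse"],
}


def score(a, b):
    """Symmetric lookup; identical terms score 1.0, unlisted pairs 0.0 (assumption)."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def project_relevance(interests, topics):
    """Best single interest/topic pair score between a user and a project."""
    return max((score(i, t) for i in interests for t in topics), default=0.0)


def rank_projects(interests, projects):
    """Return (relevance, project name) pairs, most relevant first."""
    scored = [(project_relevance(interests, topics), name)
              for name, topics in projects.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


for relevance, name in rank_projects(["healthcare"], PROJECTS):
    print(f"{name}: {relevance:.1f}")
```

With only these four pairs, a user interested in healthcare is matched to the cancer project but not to the nurse one, because healthcare/nurse is missing from the table. A sparse table can be as much of a problem as wrong weights.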

~~~
parsabg
> The odd part is some of these consider cancer as closely associated with
> health as it is with outer space.

wikipedia miner fails at that one -- CEPT performs better. also with CEPT it's
important to consider where the similarities are happening in the 2d matrix.

in data mining problems the data is sometimes as important as the underlying
algorithm or technique, or more so, so it's hard to tell what would work best
for you without knowing more about your application. feel free to drop me an
email if you'd like to discuss: parsa {at} aylien {dot} com

------
bdevine
As a broad first pass, have you looked into LSI and other related topic-
modeling algorithms? I am a Pythonista and I like a library called gensim very
much for playing in that space; maybe look at the documentation for that and
see if it sparks any ideas?

~~~
dzink
Will do. Thanks!

------
skram
Can you elaborate on why Wikipedia scraping and your other attempts have not
worked? Have you looked into
[https://www.google.com/trends/correlate](https://www.google.com/trends/correlate)
and [https://books.google.com/ngrams](https://books.google.com/ngrams) ?

~~~
dzink
Thanks! Wikipedia is close, but it will require a lot of cleanup and it shows
no obvious weighting of the relationships between terms. Google Trends looks
promising. This will require a lot of work, so we're trying to see if there is
a data source or algorithm that will get us as close as possible to what we
need before we put in the hours.

------
alok-g
I'd like to hear what you find. Thanks.

