
Text Matching Using Cosine Similarity - pplonski86
https://kanoki.org/2018/12/27/text-matching-cosine-similarity/
======
PaulHoule
In this world of content marketing in the form of thinly sliced salami, it
would be nice to see something that takes the time to tell a complete story,
as opposed to this.

Right now on the front page there is another article about "deep learning"
classifiers, which are competitive with this technology, and it would be nice
to see an objective comparison. (E.g., a co-worker built a 0.93-accuracy text
classifier in 2004 w/ bag o' words and an SVM.)

~~~
autokad
It's nice to use when you don't have labeled data, e.g. for plagiarism
detection. I use it a lot in the security space to see whether a user has
changed their behavior, by looking at the cosine distance between the
resources they used in one time period and the next.
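
A minimal sketch of what that period-over-period check might look like,
assuming resource usage is summarized as per-period counts (the resource
names and numbers below are invented for illustration):

    # Cosine distance between two periods' resource-usage count vectors.
    from collections import Counter
    import math

    def cosine_distance(a, b):
        # 1 - cosine similarity between two sparse count vectors
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return 1.0 - dot / (norm_a * norm_b)

    last_week = Counter({"/payroll": 40, "/hr-portal": 12, "/wiki": 30})
    this_week = Counter({"/payroll": 2, "/prod-db": 55, "/wiki": 5})

    # A large distance flags a shift in the usage profile between periods.
    print(cosine_distance(last_week, this_week))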

------
0xfaded
Keep in mind that any similarity measure maps a high-dimensional space onto a
one-dimensional one. The set of vectors within a given cosine of another
vector is an N-dimensional cone, whose volume (relative to the whole space)
decays rapidly towards 0 for N > 5 (assuming positive cosine). Therefore
cosine is not a particularly good metric in high-dimensional spaces.
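
A quick way to see the effect (a sketch, not from the article): cosine
similarities between random vectors concentrate around 0 as the dimension
grows, so almost everything starts to look "dissimilar".

    # Cosine similarity of random Gaussian vector pairs concentrates
    # near 0 as the dimension N grows (near-orthogonality).
    import numpy as np

    rng = np.random.default_rng(0)
    for n in (2, 5, 50, 500, 5000):
        a = rng.standard_normal((1000, n))
        b = rng.standard_normal((1000, n))
        cos = (a * b).sum(axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
        print(f"N={n:5d}  mean |cos| = {np.abs(cos).mean():.3f}")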

~~~
SubiculumCode
For n ~ 250k to 1.5M, I've used Manhattan distance (as opposed to Euclidean)
in an analysis of functional neuroimaging data... but I am interested in
people's takes on choosing a distance metric. There are so many exotic ones
out there.
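
For anyone wanting to try both on the same data, they're one-liners in scipy
(a sketch with random stand-in vectors, not real neuroimaging data):

    # Manhattan (L1) vs. Euclidean (L2) distance on the same pair of vectors.
    import numpy as np
    from scipy.spatial.distance import cityblock, euclidean

    rng = np.random.default_rng(1)
    x = rng.standard_normal(250_000)
    y = rng.standard_normal(250_000)

    print("L1 (Manhattan):", cityblock(x, y))  # sum of |x_i - y_i|
    print("L2 (Euclidean):", euclidean(x, y))  # sqrt of sum of squares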

------
CorvusCrypto
> _For a novice_ it looks a pretty simple job of using some Fuzzy string
> matching tools and get this done.

Nice.

That annoying quip aside, as with many things in data processing, it's case
by case. In TF-IDF you lose ordering information by definition. That is
probably fine for this use case, but it means that if ordering does matter,
say because a set of stores share the same words in different orders, this
approach will fail to resolve the difference. The author says he did due
diligence on the data, but there are other ways this can fall short. For
example, ["Walmart", "5280"] compared to ["Store", "5280"] is not going to be
as similar as one would want, because TF-IDF down-weights the identifying
number shared by both names. So IMO the disadvantage mentioned for using BoW
over TF-IDF is sometimes not a disadvantage at all. As with everything, it
depends on your problem and data.
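
To make the ["Walmart", "5280"] vs. ["Store", "5280"] point concrete, here's
a sketch with sklearn: on just these two documents, the shared "5280" gets a
lower IDF than the unique store words, so TF-IDF rates the pair less similar
than raw counts do.

    # TF-IDF down-weights the shared token "5280" relative to the
    # unique tokens, lowering the pair's similarity vs. raw counts.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["Walmart 5280", "Store 5280"]

    for name, vec in (("counts", CountVectorizer()), ("tf-idf", TfidfVectorizer())):
        X = vec.fit_transform(docs)
        print(name, round(cosine_similarity(X[0], X[1])[0, 0], 3))
    # counts -> 0.5, tf-idf -> ~0.34 with sklearn defaults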

To the author: I would hope that in the future you drop statements like "to a
novice, it seems easy to use X". There is nothing novice about going into a
problem with an idea and trying it if it seems to fit the use case.

------
fallingfrog
One interesting fact to note is that there is a link between the cosine rule
and probability; here's an explanation:

[https://www.johndcook.com/blog/2010/06/17/covariance-and-law-of-cosines/](https://www.johndcook.com/blog/2010/06/17/covariance-and-law-of-cosines/)
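
The analogy in that post, sketched briefly: the variance of a difference has
the same shape as the law of cosines, with standard deviations as side
lengths and correlation in the role of the cosine.

    % Law of cosines:
    c^2 = a^2 + b^2 - 2ab\cos C
    % Variance of a difference:
    \mathrm{Var}(X - Y) = \sigma_X^2 + \sigma_Y^2 - 2\,\sigma_X \sigma_Y \rho_{XY}
    % so a ~ \sigma_X, b ~ \sigma_Y, and \cos C ~ \rho_{XY} (the correlation).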

~~~
nerdponx
The link goes even deeper than that. Cosines have an intimate relationship
with _Euclidean distance_, and the fundamental statistics concept of variance
is in turn intimately related to Euclidean distance... and Gaussian
distributions (perhaps the most continuous distribution family because of the
Central Limit Theorem) are parameterized directly by mean and variance. And
Gaussian models have a convenient habit of reducing to straightforward linear
algebra as a result. The fact that Gaussian problems also happen to be nicely
_differentiable_ is yet another bonus. Oh, and least-squares linear regression
(read: fit a line to optimize the Euclidean distance between your prediction
vector and the data) is equivalent to a Gaussian maximum likelihood model.

Basically, everything boils down to high-school trigonometry, because it's the
most natural way to define distances in our world. I still marvel at it.
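
The least-squares/Gaussian-MLE equivalence in the last point is worth seeing
once; a sketch, under the usual i.i.d. fixed-variance assumptions:

    % Log-likelihood of i.i.d. Gaussian residuals with fixed \sigma^2:
    \log L(\beta) = -\tfrac{n}{2}\log(2\pi\sigma^2)
                    - \tfrac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2
    % The first term does not depend on \beta, so maximizing \log L(\beta)
    % is exactly minimizing \sum_i (y_i - x_i^\top \beta)^2: least squares.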

~~~
linearsep
> The link goes even deeper than that. Cosines have an intimate relationship
> with Euclidean distance, and the fundamental statistics concept of variance
> is in turn intimately related to Euclidean distance... and Gaussian
> distributions (perhaps the most continuous distribution family because of
> the Central Limit Theorem) are parameterized directly by mean and variance.
> And Gaussian models have a convenient habit of reducing to straightforward
> linear algebra as a result. The fact that Gaussian problems also happen to
> be nicely differentiable is yet another bonus. Oh, and least-squares linear
> regression (read: fit a line to optimize the Euclidean distance between your
> prediction vector and the data) is equivalent to a Gaussian maximum
> likelihood model.

Thanks for this insight! Can you also suggest a book or other materials that I
can read to understand this in greater depth?

~~~
nerdponx
Note that I meant to write

 _perhaps the most important continuous distribution family because of the
Central Limit Theorem_

instead of

 _perhaps the most continuous distribution family because of the Central Limit
Theorem_

As for a single book, not really. See my response to the sibling comment for
some more insight.

------
atum47
When I was in the middle of the first paragraph I was prompted to sign up for
a newsletter, so I gave up reading.

------
maxaigner
Hacker News has been going downhill for some time now, and articles like this
are perfect examples. This article:

- Lacks a snippet of the data in question

- Lacks proper notation ("Vector(A) = [5, 0, 2]"? never seen anything like
this)

- Has grammar and formatting mistakes everywhere, even failing to properly
copy-paste Wikipedia's end quotes

- Uses a batteries-included sklearn implementation which doesn't go into any
interesting details

------
jagadkanihal
There are already plenty of blog posts explaining the same topic (e.g. this
2013 post: [https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/](https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/)).
I don't know why people have the urge to write a blog post about simple
things. I don't think there is anything in the post that I can't already find
elsewhere.

This is just stupid, people writing about stuff that has been around for some
time as if it were new. I can't believe this link is on the front page of HN.
Sorry if that seemed harsh, but somebody had to say it.

~~~
siculars
Your comment is ridiculous on its face. The reduction of your argument is that
every topic should be written about only once. Well, thankfully, that's not
the case. There is a saying that the best way to learn something is to teach
it. And that is what writing a blog post is: teaching. We are all better off
having multiple voices describe the same thing in multiple ways. I hope to see
your writing on simple things one day too.

Whether or not this specific article should be on the front page of HN is a
different question.

~~~
jagadkanihal
My point is: what's the value added with this article? Is it ease of
explanation? A numerical example? The code?

I can find plenty of blog posts which do a better job at all these criteria
with a simple Google search.

I don't see any value added with this post.

Maybe the author wrote it to teach whatever he has learnt, but it's not worthy
of the HN front page.

~~~
johnday
The value add is not for people who have already seen the article from 2013
that you linked. It's for people who have never been exposed to this idea
before.

You could have added value just as easily by saying something like "this
article from 2013 does a good job of explaining the same topic: ..."

But you chose to be rude.

