Cosine Similarity (algebrica.org)
31 points by kyroz 14 days ago | 12 comments



Hm, that article says cosine similarity ranges from 0 to 1, but it can range from -1 to 1 if the vectors are allowed to have negative components.

Also, FYI to any Postgres users: the pgvector <=> operator is cosine distance, which is 1 - cosine similarity, and thus ranges from 0 to 2 (or 0 to 1 when all components are non-negative).
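
A quick sketch of the relationship in plain NumPy (not pgvector itself, just the math behind it):

    import numpy as np

    def cosine_similarity(a, b):
        # dot product divided by the product of the vector norms
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([1.0, 0.0])
    b = np.array([-1.0, 0.0])   # negative component allowed

    sim = cosine_similarity(a, b)   # -1.0: opposite directions
    dist = 1.0 - sim                # 2.0: the cosine distance pgvector would report here
    print(sim, dist)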


While cosine itself has a range of -1 to 1, cosine similarity is usually computed on tf-idf or bag-of-words vectors, whose components are all non-negative, so the largest possible angle between any two such vectors is 90 degrees and the similarity stays within 0 to 1.
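
You can check this quickly with scikit-learn, if you have it around (assuming TfidfVectorizer's defaults, which produce non-negative weights):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the cat sat on the mat",
            "the dog chased the cat",
            "stock markets fell sharply"]
    X = TfidfVectorizer().fit_transform(docs)

    print((X.toarray() >= 0).all())   # True: tf-idf components are never negative
    print(cosine_similarity(X))       # so every pairwise similarity lands in [0, 1]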


That article seems like a long way to say something very simple:

1) If you have an n-dimensional space where each dimension means/measures something, then points close together in that space will of course be similar in these dimensional qualities.

2) One way to measure closeness/similarity between two points is to consider them as vectors from the origin to the points, then compare the angle (cosine similarity) between the vectors. If the vectors point in the same direction (small angle between them) then the points lie in that same directional region of the n-dimensional space.

3) There are also other ways to measure closeness of points, such as the euclidean distance between them. Cosine similarity tends to work well for high-dimensional embedding spaces such as text embeddings or face embeddings, where distances between different dimensions (e.g. eye-shape vs. size-of-nose) are meaningless since these are not euclidean spaces.
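
To make (2) and (3) concrete, here's a toy NumPy comparison (the numbers are made up, not real embeddings):

    import numpy as np

    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # two vectors pointing in the same direction, at different magnitudes
    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])

    print(cosine_sim(a, b))        # 1.0: identical direction, maximal similarity
    print(np.linalg.norm(a - b))   # ~3.74: euclidean distance still calls them far apart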


This was informative. One thing I think it glosses over is that cosine similarity becomes remarkably more effective (is only effective?) in vector spaces with large dimensionality: hundreds at least (maybe more? I'm not up-to-date).


It actually seems to imply the opposite: “As document length increases, the complexity of the text also increases, rendering cosine similarity less effective in capturing nuanced semantic relationships”

I think it's true that using TF-IDF to calculate the vectors doesn't scale, but we've seen that the OpenAI embedding models are pretty capable of encoding the semantics of a longer document.


The dimensionality of cosine similarity refers to how you describe a given chunk of text, not how long or short the text itself is. It's easy for humans to visualize cosine similarity between two-dimensional vectors: draw two vectors on graph paper, each starting from the origin and going out at some angle for some distance (normally 0-1). You can visually understand that a vector at a 45-degree angle is far from one at a 270-degree angle, unless both happen to be very, very short.

But that is almost of negative help in describing what's going on here: the vectors in question have hundreds of dimensions (or more!), and are (except very abstractly) impossible for humans to visualize. Cosine similarity only becomes useful, and indeed counterintuitively useful, at these higher dimensions.

My point was that the article doesn't describe the above.


I think this is referring to computing a single cosine similarity score for a given piece of text.

The similarity between the strings “dog” and “cat” is qualitatively different from (and substantially more meaningful than) the similarity between the full contents of War and Peace and Anna Karenina, despite both being described by a single value.

In practice, you would never try to calculate the similarity between such long texts. You would use a chunking strategy to break them into pieces more likely to yield meaningful similarity values.
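
For example, a naive fixed-size word chunker (just a sketch; real pipelines usually chunk by sentences or model tokens, often with overlap):

    def chunk_words(text, chunk_size=200, overlap=50):
        # slide a window of chunk_size words, stepping by chunk_size - overlap
        words = text.split()
        step = chunk_size - overlap
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

    # each chunk would then be embedded separately and compared chunk-by-chunk,
    # rather than embedding the whole novel as one vector
    chunks = chunk_words("some very long document text " * 1000)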


TF-IDF for long documents is in general not very informative. TF-IDF works well for shorter documents that are largely about one topic. It will most likely not catch onto the fact that the Iliad is about the Trojan war.

(TF-IDF vectors are largely viewed as a cute trick but ultimately a bit of a dead end in classic information retrieval.)

That said, cosine similarity is definitely still useful for large vectors of things other than tf-idf.


Does removing words like “the” and “and” improve outcomes here?

I’d imagine speed would


Those are typically called “stop words”, and the article removed them for the example. Many NLP algorithms do remove stop words as a first step.
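
A minimal sketch of that first step, using scikit-learn's built-in English stop word list (other NLP libraries ship their own):

    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    def remove_stop_words(text):
        # drop very common words like "the" and "and" before vectorizing
        return [w for w in text.lower().split() if w not in ENGLISH_STOP_WORDS]

    print(remove_stop_words("The dog and the cat sat on the mat"))
    # ['dog', 'cat', 'sat', 'mat']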


Although it's worth noting that more recent techniques (e.g., Transformers) need stop words for context.


Does it actually help, though? I would think the embeddings of “cat” and “the cat” would be functionally similar.



