
Show HN: Using word vectors to classify spam messages - doody_parizada
https://github.com/doodyparizada/word2vec-spam-filter
======
mci
Sounds like a fun project. However, I doubt word vectors buy you anything
more than, say, good old Nilsimsa from 2001
([https://en.wikipedia.org/wiki/Nilsimsa_Hash](https://en.wikipedia.org/wiki/Nilsimsa_Hash)).
Side note: py-nilsimsa should iterate over Unicode code points instead of UTF-8
bytes. As it stands now, the similarity of any two texts in the same language
using a non-Latin script is ~80 rather than ~0.
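
A rough sketch of the byte-versus-code-point effect, for illustration only. It does not call py-nilsimsa; it uses plain trigram sets with Jaccard similarity as a simplified stand-in, and the helper names and sample strings are my own assumptions. The point is that in UTF-8 the common Cyrillic letters all begin with one of two lead bytes (0xD0 or 0xD1), so unrelated texts share byte trigrams that they do not share as code-point trigrams.

```python
def ngrams(seq, n=3):
    """Return the set of length-n windows over a str or bytes sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def byte_similarity(s, t):
    """Compare trigrams taken over the UTF-8 bytes of each text."""
    return jaccard(ngrams(s.encode("utf-8")), ngrams(t.encode("utf-8")))


def codepoint_similarity(s, t):
    """Compare trigrams taken over Unicode code points (Python str items)."""
    return jaccard(ngrams(s), ngrams(t))


# Two unrelated Russian sentences (illustrative samples).
a = "привет как дела"
b = "сегодня хорошая погода"

print("bytes:      ", byte_similarity(a, b))
print("code points:", codepoint_similarity(a, b))
```

The byte-level score comes out above the code-point score because trigrams that straddle two letters pick up the shared lead bytes.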

~~~
laretluval
word2vec has the advantage that you could potentially identify spam messages
that are paraphrases rather than exact copies of the ones in the training set.
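
A minimal sketch of that idea, not the project's actual code: average the word vectors of each message and compare the averages with cosine similarity. The three-dimensional vectors, the vocabulary, and the function names below are invented for illustration; real pretrained vectors have many more dimensions.

```python
import math

# Toy vectors made up for this example; real GloVe or word2vec vectors
# are learned from a corpus and typically have 50 to 300 dimensions.
TOY_VECTORS = {
    "cheap":    [0.9, 0.1, 0.0],
    "discount": [0.8, 0.2, 0.1],
    "pills":    [0.1, 0.9, 0.0],
    "meds":     [0.2, 0.8, 0.1],
    "meeting":  [0.0, 0.1, 0.9],
    "tomorrow": [0.1, 0.0, 0.8],
}


def message_vector(text):
    """Average the vectors of known words; return None if none are known."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


known_spam = message_vector("cheap pills")
paraphrase = message_vector("discount meds")   # no words in common with known_spam
ham = message_vector("meeting tomorrow")

print("spam vs paraphrase:", cosine(known_spam, paraphrase))
print("spam vs ham:       ", cosine(known_spam, ham))
```

Note that a word missing from the vocabulary contributes nothing here, so `message_vector("v1agra")` returns `None`.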

~~~
mci
1. Pedantically: it's GloVe, not word2vec. 2. Nilsimsa, or any other
locality-sensitive hash, detects changed messages too, whether the changes are
synonyms or not. 3. I don't think OP's GloVe contains words like v1agra.

~~~
doody_parizada
We don't have words like v1agra. As I mentioned in the README, we took vectors
pretrained on wikipedia. One of the possible improvements can be to train the
vectors on our own dataset.

------
amelius
Suggestion for better title:

"Collaboratively Filtering Spam with Word Vectors while Respecting Privacy"

~~~
chrbarrol
I was hoping to learn about word2vec by reading the source code, but am I right
that this has nothing to do with word2vec?

~~~
drwl
Looks like it uses GloVe and not word2vec. They're both algorithms for
generating word vectors, but they are different.

~~~
RHSman2
Not by much

------
programmarchy
Slightly tangential, but does anyone know if word2vec can be used in a
compound form to build up "concepts"? I'm interested to know if it could be
used to identify parallelism in works of literature e.g. identifying
plagiarism, parallels between the old and new testament, or intertextual works
like Ulysses by Joyce and the Odyssey.

~~~
physicsyogi
Maybe look into ConceptNet Numberbatch:
[https://github.com/commonsense/conceptnet-numberbatch/blob/m...](https://github.com/commonsense/conceptnet-numberbatch/blob/master/README.md)

------
abc-xyz
This may be off topic, but could this be used for classifying the
trustworthiness of Amazon/App Store/etc. reviews? Or does anyone perhaps know
of an open source project that can be used to achieve this by someone who
doesn't know anything about machine learning?

~~~
codegladiator
> [https://thereviewindex.com/blog/hello-world](https://thereviewindex.com/blog/hello-world)

------
Arnt
This sounds like an early version of DCC:
[https://www.rhyolite.com/dcc/](https://www.rhyolite.com/dcc/)

At first glance, I don't see anything that DCC didn't do. What did I miss?

~~~
EmilStenstrom
It seems DCC isn't using word vectors at all? Using word vectors you can know
that viagra and v14gr4 are the same word, because they are used in the same way
in messages. That in turn means you don't need word lists, and can instead
build from huge knowledge bases like GloVe.

~~~
massaman_yams
That, and the fact that a message is sent in bulk isn't actually a very strong
indicator that the message is spam, at least in the email world. As one input
to a filtering system, it can be useful, but not as a rule applied on its own
without consideration for other factors.
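
To make "one input among several" concrete, here is a small sketch of a weighted score. The signal names, weights, and threshold are hypothetical values picked for illustration, not taken from DCC or from any real filter.

```python
# Hypothetical signals and weights, chosen only to illustrate the idea.
WEIGHTS = {
    "sent_in_bulk": 1.0,        # weak evidence on its own
    "matches_known_spam": 3.0,  # e.g. close to a reported spam message
    "sender_unknown": 1.5,
}
THRESHOLD = 3.5  # assumed cut-off; a real system would tune this


def spam_score(signals):
    """Sum the weights of every signal that is present."""
    return sum(WEIGHTS[name] for name, present in signals.items() if present)


def is_spam(signals):
    """Flag a message only when the combined evidence crosses the threshold."""
    return spam_score(signals) >= THRESHOLD


newsletter = {"sent_in_bulk": True, "matches_known_spam": False, "sender_unknown": False}
campaign = {"sent_in_bulk": True, "matches_known_spam": True, "sender_unknown": False}

print("newsletter:", spam_score(newsletter), is_spam(newsletter))
print("campaign:  ", spam_score(campaign), is_spam(campaign))
```

With these assumed weights, bulk sending alone never crosses the threshold; it only tips the balance when other signals are present.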

------
chasing
Why is Silicon Valley so interested in censoring certain kinds of speech?

~~~
dang
The HN community is international and overwhelmingly not based in Silicon
Valley. From his GitHub profile it looks like the author of this project isn't
either. So what you said is considerably off the mark. Either way, though,
please don't post flamebait here.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

~~~
chasing
'Twas a joke based on a series of other highly active threads on this site. I
assumed it would be taken as such. My error!

