

How do spammers harvest your e-mail address? - karangoeluw
http://karan.github.io/email-spam/

======
minimaxir
This is the third time you've posted this link in as many days. (Although it
appears that your strategy worked.) Note for the future that deleting then
resubmitting links is against HN rules.

~~~
karangoeluw
Apologies for that, but I wanted to refine the post as much as possible before
submitting here, so I was collecting feedback from my friends, and I think now
was a good time to submit.

~~~
dang
Repeatedly deleting and reposting the same story is against the rules—it's an
abuse of deletion. Please don't do that. Accounts that do eventually lose
submission privileges.

~~~
gus_massa
Idea: Perhaps the deleted submissions can be visible in the user’s submission
list as [deleted] (without the title or any other information). So it’s easy
to see the accounts that do this frequently.

------
birken
A couple notes that aren't related to the content of the post:

1) The graphs are not presented in a way that makes them easy to consume. The
font is too small, the bars are too densely combined, the axis labels are not
descriptive enough ("percentage of emails posted"), and there is no
discernible ordering of the bars (alphabetical, by value, etc). Presenting
your data in a way that is easy to consume is just as important as having
worthwhile data to present, because a general audience like this isn't going
to struggle to parse those plots, they are just going to move on.

2) Considering you are doing data analysis with Python, you should check out
pandas ([http://pandas.pydata.org/](http://pandas.pydata.org/)). It will not
only make the data easier to work with, but it will do plotting for you with
better defaults than you have chosen, and you will drastically cut down on
having to write matplotlib code (a worthwhile benefit!).

~~~
bigd
I do agree 100% with the comment about the graphs. Also, I'm surprised to see
UTF-8 in the wordcloud. Is it an error or is it a real spam word? I don't know
if I'm missing something, but I would have added it to the stopwords. But
anyway, very interesting read, thanks.

------
slavik81
_"Tech companies, like Google and Yahoo, use about 30 billion watts of
electricity (1) - that's enough electricity to power 3 million houses for a
year."_ Powering 3 million houses is a measure of power. Powering 3 million
houses for a year is a measure of energy. 30GW, however, is a measure of
power.
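
To make the unit distinction concrete, here is a rough sketch of the arithmetic in Python. The two input figures are the ones quoted from the article; the assumption of continuous draw over a non-leap year is mine, not the article's.

```python
# Figures quoted from the article; only the unit conversion is added here.
POWER_WATTS = 30e9          # "30 billion watts" = 30 GW (a rate, i.e. power)
HOUSES = 3_000_000
HOURS_PER_YEAR = 365 * 24   # 8760 hours, ignoring leap years (assumption)

# Watts divided by houses is still watts: an implied power draw per house.
watts_per_house = POWER_WATTS / HOUSES

# Energy only appears once a duration is attached: watts * hours = watt-hours.
energy_twh = POWER_WATTS * HOURS_PER_YEAR / 1e12

print(f"{watts_per_house / 1e3:.0f} kW per house")
print(f"{energy_twh:.1f} TWh over one year")
```

So "3 million houses" matches the 30GW figure on its own, and adding "for a year" turns it into roughly 262.8 TWh of energy, which is a different quantity.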

------
JacobAldridge
Funnily enough, it was the topic of spam and building a better spam filter
that first introduced me to pg's essays (and thence, to HN).

It doesn't look like his Spam page has been updated in a long time
([http://paulgraham.com/antispam.html](http://paulgraham.com/antispam.html)),
which reflects for me the quality of spam filters now compared to 2002-2005
when most of those essays were written. Incidentally, they're a great way to
learn about Bayesian filtering as well!

------
thaumaturgy
There seem to be two sources missing from your list of "platforms", based on
some recent experiences (I'm a mail server admin and I put a lot of effort
into tracking down and blocking spam):

1\. Hotel registration. I was asked for my email address when staying at a
Hyatt for the BSides Conference in SF a while back. I didn't even think twice
about providing my standard email address, and within a week, started
receiving a lot of extra spam. I tracked some of it down to a company that has
affiliations with hotel networks, so I'm pretty sure it came from the
registration process.

2\. Public wifi hotspots. On this one, I dunno when or where I absent-mindedly
entered my email address, but again, followed some of the spam back to a
marketing company affiliated with public hotspots. Bastards.

It's fairly persistent spam, and it's walking right past greylisting,
SpamAssassin, and my usual filters for bad actors.

------
nedwin
This is an awesome piece of research - I feel your pain that you can't get it
published due to the Gmail data issue!

------
karangoeluw
Some devices are having issues with responsive layout. In that case, use this
link:
[https://github.com/karan/karan.github.io/blob/master/_posts/...](https://github.com/karan/karan.github.io/blob/master/_posts/2014-03-26-email-spam.markdown)

------
betterunix
Does it matter? Email addresses are harvested by spammers all the time. The
key piece of the puzzle now is that spam filtering is advanced enough that we
do not need to care. I almost never see spam in my inbox, and I almost never
see ham in my junk folder.

~~~
karangoeluw
Well it might not matter to you directly, but those spam filters do need this
kind of research.

What else is really important is that we, as webmasters/web programmers, should
be able to protect users' email addresses from ending up on spam lists. Many
people I know use catch-all emails while signing up for websites so they can see
which site sent any spam. If a site did, the trust is broken.
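
For anyone who hasn't seen the catch-all trick, a minimal sketch in Python of how it might work. The domain `example.com` and the helper names `address_for` and `leaked_by` are made up for illustration, and it assumes the domain delivers every local part to one mailbox.

```python
from typing import Dict, Optional

# Assumption: "example.com" stands in for a domain with a catch-all mailbox.
CATCH_ALL_DOMAIN = "example.com"


def address_for(site: str) -> str:
    """Build a per-site address from a site name (hypothetical naming scheme)."""
    # Replace anything that is not a letter or digit with "-" for the local part.
    local = "".join(c if c.isalnum() else "-" for c in site.lower()).strip("-")
    return f"{local}@{CATCH_ALL_DOMAIN}"


def leaked_by(recipient: str, signups: Dict[str, str]) -> Optional[str]:
    """Return the site a spam recipient address was originally given to, if known."""
    return signups.get(recipient.lower())


# Record which address went to which site at signup time.
signups = {address_for(s): s for s in ["shop.example.org", "forum.example.net"]}

# Later, look up the "To:" address of a spam message.
print(leaked_by("Forum-Example-Net@example.com", signups))
```

This only tells you which site the address was given to. It can't tell you whether the site sold it, was scraped, or was breached.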

------
jamesbrownuhh
I must admit to not being entirely surprised that email addresses, when posted
in public, get picked up by spammers. I would have liked to see answers to the
harder questions, e.g. here are email addresses that we've only given to
banks or other large companies, now let's see where the leaks are and
investigate them.

------
privong
It would be interesting to also look at harvesting from PGP keys which have
been posted to keyservers. I'm sure that's a small portion of the population,
but I wonder if it is being (ab)used.

