Hacker News
How do spammers harvest your e-mail address? (karan.github.io)
27 points by _hoa8 on April 22, 2014 | 15 comments

This is the third time you've posted this link in as many days. (Although it appears that your strategy worked.) Note for the future that deleting then resubmitting links is against HN rules.

Apologies for that. I wanted to refine the post as much as possible before submitting it here, so I was collecting feedback from my friends, and I thought now was a good time to submit.

Repeatedly deleting and reposting the same story is against the rules—it's an abuse of deletion. Please don't do that. Accounts that do eventually lose submission privileges.

Idea: Perhaps the deleted submissions can be visible in the user’s submission list as [deleted] (without the title or any other information). So it’s easy to see the accounts that do this frequently.

A couple notes that aren't related to the content of the post:

1) The graphs are not presented in a way that makes them easy to consume. The font is too small, the bars are too densely combined, the axis labels are not descriptive enough ("percentage of emails posted"), and there is no discernible ordering of the bars (alphabetical, by value, etc.). Presenting your data in a way that is easy to consume is just as important as having worthwhile data to present, because a general audience like this isn't going to struggle to parse those plots; they are just going to move on.

2) Considering you are doing data analysis with Python, you should check out pandas (http://pandas.pydata.org/). It will not only make the data easier to work with, but it will do plotting for you with better defaults than you have chosen, and you will drastically cut down on having to write matplotlib code (a worthwhile benefit!).
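
To make the pandas suggestion concrete, here is a minimal sketch. The platform names and percentages are made-up placeholders, not figures from the post, and `plot_shares` is a hypothetical helper that assumes matplotlib is installed alongside pandas.

```python
import pandas as pd

# Placeholder data: NOT taken from the post, purely illustrative.
shares = pd.Series(
    {"platform_a": 12.0, "platform_b": 41.0, "platform_c": 27.0, "platform_d": 20.0},
    name="share_pct",
)

# Sorting by value gives the bars a discernible order.
ordered = shares.sort_values(ascending=False)


def plot_shares(series):
    """Draw a horizontal bar chart with a descriptive axis label.

    Requires matplotlib; pandas delegates the drawing to it.
    """
    # Ascending order here puts the largest bar at the top of a barh chart.
    ax = series.sort_values().plot.barh(figsize=(8, 4), fontsize=12)
    ax.set_xlabel("Percentage of harvested addresses")
    return ax
```

Calling `plot_shares(shares)` returns the matplotlib axes, so the title or tick labels can still be tweaked afterwards.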

I do agree 100% with the comment about the graphs. Also, I'm surprised to see UTF-8 in the wordcloud. Is it an error or a real spam word? I don't know if I'm missing something, but I would have added it to the stopwords. But anyway, very interesting read, thanks.

"Tech companies, like Google and Yahoo, use about 30 billion watts of electricity (1) - that's enough electricity to power 3 million houses for a year." Powering 3 million houses is a measure of power. Powering 3 million houses for a year is a measure of energy. 30GW, however, is a measure of power.

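The unit mix-up in that quote is easy to check with a few lines of arithmetic. This is only a sketch using the two numbers from the quoted sentence (30 billion watts, 3 million houses); the variable names are mine.

```python
# Figures taken from the quoted sentence.
POWER_W = 30e9          # 30 billion watts = 30 GW (a rate, i.e. power)
HOUSES = 3_000_000
HOURS_PER_YEAR = 365 * 24

# Power is already a per-instant comparison: no time span needed.
power_per_house_kw = POWER_W / HOUSES / 1e3

# Energy only appears once the power is multiplied by a duration.
energy_per_year_twh = POWER_W * HOURS_PER_YEAR / 1e12
energy_per_house_kwh = power_per_house_kw * HOURS_PER_YEAR

print(f"{power_per_house_kw:.1f} kW per house")
print(f"{energy_per_year_twh:.1f} TWh per year in total")
print(f"{energy_per_house_kwh:.0f} kWh per house per year")
```

The "for a year" qualifier only makes sense next to the energy figure (TWh); next to a power figure (GW) it adds nothing.
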
Funnily enough, it was the topic of spam and building a better spam filter that first introduced me to pg's essays (and thence, to HN).

It doesn't look like his Spam page has been updated in a long time (http://paulgraham.com/antispam.html), which to me reflects the quality of spam filters now compared to 2002-2005, when most of those essays were written. Incidentally, they're a great way to learn about Bayesian filtering as well!

There seem to be two sources missing from your list of "platforms", based on some recent experiences (I'm a mail server admin and I put a lot of effort into tracking down and blocking spam):

1. Hotel registration. I was asked for my email address when staying at a Hyatt for the BSides Conference in SF a while back. I didn't even think twice about providing my standard email address, and within a week, started receiving a lot of extra spam. I tracked some of it down to a company that has affiliations with hotel networks, so I'm pretty sure it came from the registration process.

2. Public wifi hotspots. On this one, I dunno when or where I absent-mindedly entered my email address, but again, followed some of the spam back to a marketing company affiliated with public hotspots. Bastards.

It's fairly persistent spam, and it's walking right past greylisting, SpamAssassin, and my usual filters for bad actors.

This is an awesome piece of research - I feel your pain that you can't get it published due to the Gmail data issue!

Some devices are having issues with responsive layout. In that case, use this link: https://github.com/karan/karan.github.io/blob/master/_posts/...

Does it matter? Email addresses are harvested by spammers all the time. The key piece of the puzzle now is that spam filtering is advanced enough that we do not need to care. I almost never see spam in my inbox, and I almost never see ham in my junk folder.

Well it might not matter to you directly, but those spam filters do need this kind of research.

What else is really important is that we, as webmasters and web programmers, should be able to protect users' email addresses from ending up with spammers. Many people I know use catch-all emails while signing up for websites so they can see which site any spam came from. If a site is the source, trust in it is lost.
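
For anyone unfamiliar with the catch-all trick, a rough sketch of the idea follows. It assumes you control a domain whose mail server accepts any local part; `example.com`, `alias_for`, and `leaked_by` are hypothetical names for illustration only.

```python
import re
from typing import Optional

# Assumption: a domain you own, configured as a catch-all mailbox.
CATCH_ALL_DOMAIN = "example.com"


def alias_for(site: str) -> str:
    """Build a per-site address, e.g. one alias per website you sign up to."""
    label = re.sub(r"[^a-z0-9]+", "-", site.lower()).strip("-")
    return f"{label}@{CATCH_ALL_DOMAIN}"


def leaked_by(recipient: str) -> Optional[str]:
    """Return the site label a message was addressed to, or None if the
    recipient is not on the catch-all domain."""
    local, _, domain = recipient.lower().partition("@")
    if domain != CATCH_ALL_DOMAIN or not local:
        return None
    return local
```

If spam arrives at the address returned by `alias_for("Shop.Example.org")`, then `leaked_by` on the recipient header points back at that one signup.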

I must admit to not being entirely surprised that email addresses, when posted in public, get picked up by spammers. I would have liked to see answers to the harder questions - e.g. Here are email addresses that we've only given to banks or other large companies, now let's see where the leaks are and investigate them.

It would be interesting to also look at harvesting from PGP keys which have been posted to keyservers. I'm sure that's a small portion of the population, but I wonder if it is being (ab)used.
