Given that the percentages of blocked spam are just 17, 50, 67 & 83, it appears that the sample was likely just 6 spam emails which isn't a lot. Also doesn't really explain the methodology. I assume it's against a control of no obfuscation which was the 6 emails.
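For anyone who wants to check that arithmetic, here is a quick Python sketch. It looks for the smallest sample size whose fractions round to the reported percentages. The function name and the search bound are my own, and it assumes the article rounded to whole percents.

```python
def smallest_sample_size(percentages, max_n=100):
    """Return the smallest n where every percentage equals round(100 * k / n) for some k."""
    for n in range(1, max_n + 1):
        # All whole-percent values that k blocked out of n emails can produce.
        reachable = {round(100 * k / n) for k in range(n + 1)}
        if all(p in reachable for p in percentages):
            return n
    return None  # nothing up to max_n fits


print(smallest_sample_size([17, 50, 67, 83]))  # prints 6
```

Multiples of 6 (12, 18, ...) would fit too, so this only shows that 6 is the smallest sample consistent with those numbers.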
So is the real story here that putting your email on the open web simply isn't a significant "vulnerability" anymore?
Honestly, why would you spider and scrape sites when you can just grab a leaked dataset?
It would be slightly more interesting to put the canaries in a more commonly scanned location, like a plain-text bio field on a popular site (like this one... or FB... or LinkedIn...).
Spam is no longer significant. Back around year 2000 I was getting multiple thousands of spam messages per day (to the very same email address I still use today). Those were the dark and difficult days.
Today I basically don't get any spam. I don't keep precise stats on the spam that makes it to my inbox but it's something like 2 or 3 per month. Compared to a few thousand per day twenty+ years ago!
I want to highlight again that it is the exact same email address I use today.
I block misconfigured connections at the SMTP level in postfix (I run my own mail server), that takes care of pretty much all of it.
After that I run bayesian filtering (spamprobe) to score spam, but there isn't much to catch. Over the last 30 days spamprobe has caught 37 messages, all true positives.
Unlike gmail, I also don't get any false positives (zero so far this year).
I know how bad it was a couple of decades ago, and a lot of people are still traumatized by those days and act like it's still 2001, but the reality in 2023 is that spam is a very minor issue.
I also run postfix on my own domain but instead of bayesian filtering I also run PostFwd and PostGrey. The combination of these 3 is enough in my case to reach zero SPAM emails. I can share my config if someone is interested.
To offer one corroborating point, my email address is posted in cleartext on the front page of my website and it receives spam at a rate of less than one message per day.
Another: Used the same email for ~2 decades, never hidden it, giving it out freely and available on lots of public websites including my own. Get around 30 spam emails per day, all automatically marked as spam. Around 2-3 get through my filter each month.
Difficult to extract any useful info from this post without specifying what or where your spam filter is. Does it run locally? On your own mail server? An ISP's default spam filter? Gmail? CPanel? Have you tweaked any settings?
Or what I'm saying is, if the SMTP server blocks by IP first before determining what mail is being delivered then the actual rate of potential spam to any particular email address is not being discovered.
But he chose to mention that all spam were marked as such and that only 1 or 2 get through. Readers will naturally be curious what methodology and tools are in use.
“Difficult to extract any useful info from this post” reads as undue criticism, as if preemptive satisfaction of your curiosity about the tangent was compulsory.
I disagree that the criticism is "undue"... a post that waves off spam as if it's a non-issue ("2-3 each month"), when in fact many people do struggle with it, helps nobody. A post actually explaining how you built or configured an effective spam filter, however, benefits the HN community.
The conversation was about how many spam messages we received, not how many were caught or how. But since you're curious: sendmail + SpamAssassin, running on my own server.
But you specifically mentioned the percentage of messages which are marked as spam, and how many get through your filter. If you choose to include that info, people are going to have questions.
My experience is similar; I have my personal email, which has been in use since the early 00s, in almost plain text (@ replaced with <at>) on the about page of a fairly popular site which gets several million visits a month. I only get 5-10 spam messages/day, most of which are filtered without issue. I do get a decent amount of email, but not true 'spam'. It's mostly just crap I've signed up for over the years and can't be bothered to get rid of.
I honestly get more at my work email, which has never been posted anywhere... I wonder if spammers have started to assume the easy to get email addresses are suspicious or not valuable for various reasons.
Scraping is pretty dangerous if you're running a spam business, since there are a lot of honeypots out there, and you'll pretty rapidly find yourself blacklisted. Email spammers generally want verified email addresses.
Since it's far safer to buy email lists from some broker, you tend to get a lot more spam for signing up for things than posting your email publicly.
The problem with email is that it fills so many roles. It's both the only universal chat program between trusted entities and the only universal way to allow people to cold call you.
If you're really worried about spam, I recommend just keeping a different email address for each separate purpose.
While it doesn't stop spam, I have been using a catch-all email system for a while now.
The benefit is that I know where someone got my email from, and I can then try to figure out whether the place has been compromised, or whether they're selling my email, etc. And I can just blacklist that particular address forever as well.
Previously, I just did whatever@mydomain.tld, but I've switched to something similar to blame.email [1].
This makes my emails look a little weirder, but it has stopped the weird looks I'd get when walking into a physical place, like my doctor, and telling them "Yeah, email me at <doctor's name>@<first><last>.com".
It also makes it less obvious that it's effectively a throwaway email, particularly combined with my domain; it looks fitting. And since each address is salted and hashed, it pretty much eliminates the risk of someone successfully phishing me by sending an email to something like `paypal@<first><last>.com`.
Lastly, on my HN profile and elsewhere, I've got my "email", but despite them being unique, I still don't want to have to rotate it if it gets picked up by a spambot, so I've tried to do some plaintext simple "obfuscation" like in the article.
I went for <address> ~АТ~ <domain>.<tld> -- with the "AT" being Cyrillic rather than Latin - I figure at least some will get tripped up by not being able to use purely English regex.
So far, I have yet to receive any spam with that strategy. Maybe I'm lucky or just not getting indexed, or maybe it's working a little.
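A rough Python sketch of why the Cyrillic trick might trip up a naive scraper. The regex is a guess at what an ASCII-only harvester could look like, not taken from any real tool, and `someone` / `example.com` are placeholders.

```python
import re

# Hypothetical ASCII-only pattern for "name ~AT~ domain.tld" style obfuscation.
ASCII_AT = re.compile(r"([\w.+-]+)\s*~?\s*AT\s*~?\s*([\w-]+\.[\w.]+)")

latin = "someone ~AT~ example.com"
cyrillic = "someone ~\u0410\u0422~ example.com"  # U+0410 U+0422 render like "AT"

print(bool(ASCII_AT.search(latin)))     # True: the Latin "AT" is matched
print(bool(ASCII_AT.search(cyrillic)))  # False: there is no Latin "AT" to match
```

A scraper that normalizes look-alike characters before matching would not be fooled, so this only raises the bar a little.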
Still torn about how to handle Git or copyright/license headers, though; those addresses need to last a long time, in case anyone needs to reach out and ask for re-licensing/etc, and I figure it'd be annoying doing different emails for each repo.
I have been using a catch-all for about 2 years now with great results. I hadn't thought of hashing/obfuscating the emails though. I think since I use a password manager anyway, I could just generate a random 6-8 character prefix when signing up for a new account, and since it's saved in my password manager it's easy to look up again later (no need for a true hash).
> I think since I use a password manager anyway, I could just generate a random 6-8 character prefix when signing up for a new account, and since it's saved in my password manager it's easy to look up again later (no need for a true hash).
Yeah, same. I store all the addresses in KeePass.
The main reason I don't just totally randomize them is just that there have been a few moments where I do have my salt somehow, but for whatever reason, it is either inconvenient or impossible to immediately open up the password manager and add a new entry.
In those moments, being able to deterministically generate the address and then add it at my leisure without having to double-check what I used is nice.
It also likely wouldn't happen to me, but should I ever somehow lose/lose access to both my old emails and my password manager, as long as I have my salt, I can still "remember" my email addresses for important services (e.g., PayPal or whatever) to re-generate the addresses and reset my passwords.
Whatever route you go, be it randomized addresses or hashed addresses, even though I think I am more vigilant and careful than most, it's still nice having an extra-layer to the catch-all that can't easily be targeted by someone malicious without first either somehow obtaining your salt, compromising the service, etc; it's handy being able to immediately filter and flag anything relating to my bank or whatever else if it isn't sent to the right address.
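For illustration, a minimal sketch of the salt + hash + truncate idea in Python. The exact scheme isn't spelled out above, so HMAC-SHA256, the 8-character truncation and the `alias_for` name are all assumptions on my part; blame.email may well work differently.

```python
import hashlib
import hmac


def alias_for(service: str, salt: bytes, domain: str, length: int = 8) -> str:
    """Derive a stable alias from a service name and a private salt (assumed scheme)."""
    # Lowercase the service name so "PayPal" and "paypal" give the same alias.
    digest = hmac.new(salt, service.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{digest[:length]}@{domain}"


# Same inputs always give the same address, so it can be regenerated
# later without opening the password manager.
print(alias_for("paypal", b"my-private-salt", "example.com"))
```

Anyone without the salt can't predict the alias for a given service, which is what makes a guessed `paypal@...` address easy to flag.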
I just append 4 random characters to the email, e.g. domain.com-xy3j@myname.mail . I have to explicitly configure an alias for each of these addresses - this prevents someone from guessing a correct address for a different domain.
If Spam arrives, I can block that specific address and use different random characters to live in peace again.
Keeping the domain readable makes it easier to explain to people that they must’ve “lost” my email address somehow.
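Something like this would generate those addresses. It's a sketch only: the alphabet and the `tagged_alias` name are made up, and the alias still has to be configured on the mail server by hand as described.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits


def tagged_alias(site: str, mail_domain: str, n: int = 4) -> str:
    """Build '<site>-<n random chars>@<mail_domain>'."""
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(n))
    return f"{site}-{suffix}@{mail_domain}"


print(tagged_alias("domain.com", "myname.mail"))  # e.g. domain.com-xy3j@myname.mail
```

When an address starts getting spam, calling it again gives a fresh suffix for the same site.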
This is something I think about a lot too. I’m interested how you handle verbally communicating an address? Most people probably don’t expect a completely random string. I wonder if there could be an easy “word-sounding” generator that could be integrated into something to manage emails?
> I’m interested how you handle verbally communicating an address?
It depends on how off-guard I'm caught and how important it is to me. I usually have my phone, which has my KeePass and email salt inside, and I usually have at least enough battery to last a conversation, so it's rare that I can't generate the proper email address in <30 seconds in most scenarios.
But yes, having like, 5e5ee440@<domain>.<tld>, has definitely resulted in a few "can you repeat that?" or "just to confirm?" moments (especially over the phone since audio quality often sucks). That said, for whatever reason, people are still seemingly less surprised by "5e5ee440@" vs. "<your place of work>@".
On the rarer occasions where I don't at least have my phone or something, if it's something I know I can update later, I'll tell them whatever is easy to input and remember; I separate emails to unknown recipient addresses, but I don't completely reject them outright, so it's not usually an issue pulling out the confirmation email or whatever later, and then updating the address.
However, if I don't have my phone, and I don't know how easily I could update the email, then it depends more. For example, my doctor wants an email on file for whatever record-keeping reason and for sending appointment confirmations and such. In that scenario, I don't know that I'd necessarily be able to easily change it without going in/calling them.
The first time, I did give them <doctor's practice>@domain.tld, because I figure, despite being an important email, it's unlikely that it'd get abused; if someone somehow knows my GP's full name and practice, and is using it maliciously, I've probably got bigger worries than getting a phishing email sent to it or whatever else.
The second time, though, I just asked her to email me the contact update form and told her I'd send it back with the proper email inside.
> I wonder if there could be an easy “word-sounding” generator that could be integrated into something to manage emails?
I figure you could do something similar to the correct-horse-battery-staple XKCD meme or bitcoin wallet seed phrases, if you wanted to avoid the "sorry, can you repeat that?" moments.
But it might be slightly more annoying to deterministically generate those, if you care about that aspect, compared to simply salt+hash & truncate. If you find a good method, let me know, though.
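One possible way to do the word version deterministically, sketched in Python. The 16-word list is a toy I made up for the example (a real one would need thousands of words to be hard to guess), and HMAC-SHA256 is an assumption, not something anyone above said they use.

```python
import hashlib
import hmac

# Toy word list; 16 entries divides 256 evenly, so "byte % 16" has no modulo bias.
WORDS = [
    "amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "harbor",
    "indigo", "juniper", "kelp", "lotus", "maple", "nectar", "olive", "pebble",
]


def word_alias(service: str, salt: bytes, domain: str, count: int = 3) -> str:
    """Map a salted hash of the service name onto a few easy-to-say words."""
    digest = hmac.new(salt, service.lower().encode("utf-8"), hashlib.sha256).digest()
    words = [WORDS[b % len(WORDS)] for b in digest[:count]]
    return "-".join(words) + "@" + domain


print(word_alias("paypal", b"my-private-salt", "example.com"))
```

With only 16 words and 3 slots there are 4096 possible aliases, so treat this as the shape of the idea and not something to deploy as-is.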
I've been doing this for a few years and I think I've had a single organization use the email in a way I was surprised about (used car dealer gave it to Sirius radio and they spammed me). I get far more spam directly from orgs, which is annoying but expected, and they usually honor unsubscribes, so I've never blocked one of my addresses. At the end of the day I don't think it's been worth the effort, and I'm considering switching to a single public email address that I just expect to receive spam.
This is all making me wonder if email should have been more like some modern messaging systems with friend requests that aren't transferable from one person to another (i.e. a friend can't tell someone else your handle and let someone else message you) rather than the phone system of "if you know the number you can call".
I've discovered that the most effective trick is to use an email address that is considered invalid by most email scrapers but valid by most mail servers.
I have an unobfuscated mailto: link on my blog and I receive barely any spam to my public email address, ~@eligrey.com
Oo just the other day I discovered that "&ers" is a great way to shorten my name. It's also a valid email handle. Maybe I should make it my public address.
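To show what "invalid to scrapers, valid to mail servers" can look like, here is a Python sketch with a typical harvesting regex. The pattern is my assumption of a common one, not lifted from an actual scraper, and `&ers@example.com` uses a placeholder domain. Both `~` and `&` are allowed in the local part under RFC 5322.

```python
import re

# A common "good enough" email pattern (assumed); note that neither "~" nor "&"
# is in its local-part character class.
SCRAPER_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(text: str) -> list[str]:
    """Return whatever the pattern thinks are email addresses."""
    return SCRAPER_PATTERN.findall(text)


print(harvest("mail me: ~@eligrey.com"))     # [] - nothing harvested
print(harvest("mail me: &ers@example.com"))  # ['ers@example.com'] - wrong address
```

In the second case the scraper does pick something up, but it is a truncated address that goes to a different mailbox.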
I have used a keyed email address scheme since the mid 1990s (me_context@domain.org). 99% of the email I get now is sent to the same 5 addresses, in particular me_boingboing, me_linkedin. Those addresses were never on a web page. The keyed addresses I used for my company occasionally get spam, but it's generally on topic (offering relevant services, though still imo spam). I also have a gmail address that is a common first/last name but I don't send from that address, I get a lot of presumably guessed address spam there. This suggests quite a bit of organization in how spam works these days.
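If you want to see which keyed addresses attract the mail, a small tally like this is enough. It's a sketch assuming the `me_context@domain.org` shape described above; the function names are mine.

```python
from collections import Counter
from typing import Optional


def context_of(address: str, prefix: str = "me_") -> Optional[str]:
    """Return the context key of a keyed address, or None if it isn't keyed."""
    local = address.partition("@")[0]
    return local[len(prefix):] if local.startswith(prefix) else None


def tally_by_context(recipients):
    """Count messages per context key, given the recipient address of each message."""
    return Counter(context_of(r) or "(unkeyed)" for r in recipients)


print(tally_by_context([
    "me_linkedin@domain.org",
    "me_boingboing@domain.org",
    "me_linkedin@domain.org",
]))
```

Feeding it the recipient column of a spam folder shows at a glance which context leaked.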
What I'm missing from the article is an explanation of what the control is. The email with no protection, which you'd expect to be the control, blocked 17% of spam - 17% of what amount of spam? How does the author quantify the spam that's out there?
It says in the page that it's because all the mails were on the same page (and presumably then scraped by the same bots), and yet a couple of mails were received by other addresses and not by the unprotected one. So it would seem that the control is the set of unique mails received by all addresses.
"Surprisingly, the unprotected email address appears to have blocked a spam email. Either that message wasn’t received, or an extra message was sent to one of the protected email addresses."
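If that reading is right, the numbers would be computed roughly like this. This is a Python sketch of my interpretation, not the article's actual method; the labels and message IDs are invented purely to show the arithmetic.

```python
def block_rates(received):
    """received maps an address label to the set of spam message IDs it got.

    The control is taken to be the union of all messages seen by any address.
    """
    control = set().union(*received.values())
    if not control:
        return {}
    return {
        label: round(100 * (1 - len(ids) / len(control)))
        for label, ids in received.items()
    }


# Invented data: six distinct spam messages across three addresses.
print(block_rates({
    "unprotected": {1, 2, 3, 4, 5},
    "technique_a": {1, 2, 3},
    "technique_b": {6},
}))  # {'unprotected': 17, 'technique_a': 50, 'technique_b': 83}
```

Under this definition the unprotected address "blocks" 17% simply because one message reached some other address and not it.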
Really? Seems like you'd have to actively try to be blocked by that, i.e. to extract 'email' but not the mailto. I suppose if you didn't actually scan the HTML, but instead all the contained text.. maybe it just tests an implementation detail basically.
Some comments claim that it can break some of the more complex techniques presented in the article. I've tried it a few times myself with varying results that tend towards not working.
Total aside, but I recently purchased a .us domain name for the first time. I was on my phone and signing up with the Namecheap app. I thought it odd that it didn't ask for privacy protection, but I figured I'd quickly turn it on after the fact and didn't get around to it until later that night. I used our home phone as the phone number for the site. The next day we got no fewer than fifty spam phone calls, and the email address I offered was quickly inundated with over a hundred spam messages. It was absolutely nuts. Fortunately, I used an email alias and just turned it off. Likewise, I have a Callcentric VOIP number for the home, so I was able to add a "press six to continue" prompt, but I can't change that number without a lot of pain and I now no longer get the automated phone calls from the school (we still get them through the app). Since then, the call volume has dropped off, but I can see in the logs that we still get several per day.
After all of that I learned that privacy protection is not available for .us domains. No wonder they're so cheap.
A long way of saying, if you sign up for a .us domain, definitely do not give them your real information.
What about this one? I think I found it in a Stack Overflow answer a year or two ago. I've used it, but haven't tested it for effectiveness in avoiding spam.
Turn your email into a PNG. It's absolutely easy to read, and my brother, who worked for a scraping company, said they hadn't cracked that yet (as of 5 years ago). Even today, you would need a neural net to do it, which makes it more expensive. They would also need another neural net to figure out whether or not it was an email address! It's just an image after all.
I use JavaScript obfuscation[1] on my website and get a very low amount of spam.
However I am confident that some modern scrapers are using headless browsers and wait for the DOM to be fully loaded before extracting the email from the text data.
A similar but CSS based method: put each character of the email address in a separate div, place the divs in the DOM in random order, but use CSS absolute positioning to make them appear in the right order to the user. Probably not straightforwardly scrapable by reading the DOM, would mess up text selection though.
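A sketch of how that markup could be generated, in Python. The `ch` unit plus a monospace font is my choice for lining the characters up; nothing above specifies the positioning details, and `scrambled_html` is a made-up name.

```python
import html
import random


def scrambled_html(email: str, seed=None) -> str:
    """Emit one absolutely positioned div per character, in shuffled DOM order."""
    rng = random.Random(seed)
    cells = list(enumerate(email))  # (visual position, character)
    rng.shuffle(cells)              # DOM order no longer matches reading order
    divs = "".join(
        f'<div style="position:absolute;left:{pos}ch">{html.escape(ch)}</div>'
        for pos, ch in cells
    )
    return (
        '<div style="position:relative;font-family:monospace;height:1.5em">'
        f"{divs}</div>"
    )


print(scrambled_html("user@example.com", seed=1))
```

A scraper that reads the `left` offsets and sorts by them gets the address right back, so this only stops bots that take text in DOM order.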
What about accessibility for people with visual impairments?
The author has warnings for the last three versions, because of usability. I see lots of red flags for the other versions too in terms of accessibility.
Ultimately I'm not sure that is a tractable problem; any method a screen reader would need to use to decode to speech would require getting the address in clear text via programmatic means, which is something that one specifically needs to avoid if we want to stop bots scraping addresses.
Not really a block. I’ve worked on CAPTCHA-solving bots. But it does increase the cost. I’d probably not scrape for emails anyways though, using leaked datasets is likely far more economical.
It's an interesting claim that defeating xor requires a JS interpreter but defeating the others (e.g., concatenation, rot18, etc.) does not. I mean, if a bot author wanted to scrape a site or sites where all obfuscation used the same xor routine, the author (not the bot!) could just read (not execute) the JS and customize the bot accordingly. Granted it restricts things to more targeted attacks rather than a bot that just crawls arbitrarily across the web, which is noteworthy.
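To illustrate the point: once someone has read the page's JS and knows the key, re-implementing an XOR scheme outside the browser takes a few lines. This Python sketch assumes a simple repeating-key XOR over UTF-8 bytes with hex output; the article's actual routine may differ, and the key here is made up.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key. Applying it twice restores the input."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def obfuscate(email: str, key: bytes) -> str:
    """What the site owner would embed in the page (hex string)."""
    return xor_bytes(email.encode("utf-8"), key).hex()


def deobfuscate(hex_blob: str, key: bytes) -> str:
    """What a targeted bot could do after its author read the key out of the JS."""
    return xor_bytes(bytes.fromhex(hex_blob), key).decode("utf-8")


blob = obfuscate("user@example.com", b"k3y")
print(blob)
print(deobfuscate(blob, b"k3y"))  # user@example.com
```

No JS interpreter is involved, but the bot author had to look at this specific site first, which is the targeted-versus-generic distinction.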
Most scrapers search for anchor tags. So a span with the aria role=link could work too. This will also keep accessibility, because I'm not sure what a screen reader will read for the encoded e-mail.
https://developer.mozilla.org/en-US/docs/Web/Accessibility/A...
I’m surprised URL encoding works so well. I’ve been base64 encoding and using JS to decode, but if URL encoding is that reliable, it’s probably not worth it.
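For comparison, here is roughly what the two encodings look like, sketched in Python. I'm assuming "URL encoding" means percent-encoding every character of the address for the mailto link; `percent_encode_all` is my own helper, since `urllib.parse.quote` leaves plain letters alone.

```python
import base64
from urllib.parse import unquote


def percent_encode_all(text: str) -> str:
    """Percent-encode every byte, including ordinary letters."""
    return "".join(f"%{b:02X}" for b in text.encode("utf-8"))


email = "user@example.com"

url_encoded = percent_encode_all(email)
print(url_encoded)           # starts with %75%73%65%72%40
print(unquote(url_encoded))  # user@example.com

b64 = base64.b64encode(email.encode("utf-8")).decode("ascii")
print(b64)
print(base64.b64decode(b64).decode("utf-8"))  # user@example.com
```

Both are trivially reversible; the percent-encoded form just has the advantage that it needs no JS on the page to decode.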
It may not be the cleverness of the obfuscation that beats the spammer's bot. If you go to such lengths to obfuscate your email, then you may simply be a bad target for my bot: response rates tiny. No good ROI for the spammer.
Does not matter, it works then!! But not because of technical cleverness.
> Long form communication is done via texting or shared chat.
That's a nightmare.
In reality, long form just doesn't exist in chat/text. For me, long form is at least 500 words. Years go by before I get "long form" text/chat of that order.
How do you review old communication from 8 years ago? Or find texts with particular criteria - “from x, subject like y, received within a week of date z, has attachment”
We are living in strange times where the digital stuff just does not exist unless it is consumed in the present.
I have loving emails written to me in pen and email. I read neither yet for some reason the penned documents are held onto with an emotional tie.
My father, who used to painstakingly capture photos and create slide shows using a rotating projector, laments the days when people would review photos as a time of bonding. That no longer exists, and now, even with 500x more photos, I still long for the days of having the context of a few photos that is forever lost. GPS and timestamps are helpful, I guess, but I will never know the reason for that captured moment.
i'd be curious to know, of the spam emails received, what percentage were blocked by basic spam filtering?
in 2023 email spam seems like mostly a solved problem - i very rarely get any that actually makes it through to my inbox. trying to solve the spam problem by protecting your email address from becoming public might have seemed like a valid strategy 20 years ago, but we have better tools now.
The limitation of spam filters is false positives. For most people it's probably not a big deal to have one or two messages land in the spam filter .. then someone follows up and says "Hey did you get that legit email I tried to send you? I haven't heard back." But for certain business accounts, the amount of spam + false positives can get to unmanageable levels, where important emails are flagged as spam and left undiscovered because sifting through the spam folder regularly is as time consuming and annoying as if the spam just went straight to inbox.
My point is that it's still valuable to try and reduce the overall volume of spam. Spam filters are another angle of attack. Spam is a tough enough problem that it's always worth throwing multiple solutions at it, each good at solving a separate slice of the overall problem.
None, I always use those in tests, because they're owned by the IANA. Unlike random domains that have no MX records now but might in the future, example.org, example.net and example.com are safe.
I do this because sometimes in companies, people will put DB dumps in the wrong environment that has an actual SMTP going to WAN, then shit happens. I also make sure the environments have a dummy smtp or mailcatcher, but it's better to be safe than sorry.
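A sketch of the kind of scrub that could be run on a dump before it goes anywhere near a non-production environment. It's Python with a deliberately simple email regex; the `neutralize` name and the `local+domain@example.com` rewrite are my own choices, not an established tool.

```python
import re

# Domains reserved for documentation and testing (RFC 2606).
RESERVED = {"example.com", "example.net", "example.org"}

EMAIL = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def neutralize(text: str) -> str:
    """Rewrite every address so it points at example.com."""
    def repl(match: re.Match) -> str:
        local, domain = match.group(1), match.group(2)
        if domain.lower() in RESERVED:
            return match.group(0)  # already safe, leave untouched
        # Keep the old domain in the local part so distinct users stay distinct.
        return f"{local}+{domain}@example.com"
    return EMAIL.sub(repl, text)


print(neutralize("mail alice@corp.io now"))  # mail alice+corp.io@example.com now
```

Even if the dump lands on a box with a working SMTP relay, nothing rewritten this way can reach a real inbox.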
Seriously, if you are still running email with HTML and scripting enabled, you need to turn it off. Plain text is fine for everything; use a secure messaging program like Signal for anything else.
On the topic of spam..."Spam" to me these days includes email from companies who abuse the soft opt-in, service emails and "existing commercial relationship" rules.
Cart abandonment, tips on how to stay safe online, requests to go paperless, newsletters, claim your free subscription that came with your purchase, thank you for your purchase (separate to the order confirmation), requests to leave a review, continue your application, offering support, "you haven't logged in for a while", new login from $device, thank you for completing step X ... it's endless.
An endless torrent of excuses to get their company name in front of your eyeballs.
Except all my logins are from "a new device" because I use Cookie AutoDelete, there's no way to opt out of this spam, and I don't give a shit if anyone DID actually hack my Google account because it's only a way to keep track of my Youtube watch history and subs.
That’s ideal, but it’s a lot harder and involves more invasive tracking to determine normal activity.
You really have 2 options from a UX standpoint. Either allow the login and notify you, which gives you less friction in your experience with the application if it really is you.
Or they can stop you right on the login screen and send you an email with a code or a link to click before you go further. It’s more secure, but it adds friction on the more common case that it really is you.
The big spam to me at work is now people using LinkedIn and other tools to automatically find people's emails and send them targeted, automated emails that look like old-style plaintext outreach. For some reason this is just not getting caught in my spam filter.
I deleted my LinkedIn account when it started becoming more like Facebook (nothing against FB, my wife likes it, but FB isn't for me).
Also, I removed my public open source projects from GitHub after Microsoft bought them then removed them from GitLab too when the AI warnings about stealing copyrighted code started to sound more real.
GitLab does that stuff. Almost every time I log in, I get an e-mail about someone logging in from a new device. I am not even using a VPN on that machine. Wonder what their problem is. By now their stuff is marked as spam when it hits my mailbox.