Email obfuscation: What still works in 2023? (spencermortensen.com)
170 points by surprisetalk on Nov 22, 2023 | 104 comments



Given that the percentages of blocked spam are just 17, 50, 67 & 83, it appears that the sample was likely just 6 spam emails which isn't a lot. Also doesn't really explain the methodology. I assume it's against a control of no obfuscation which was the 6 emails.
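The arithmetic behind that guess can be checked with a short sketch. It assumes each reported figure is a whole-number percentage of the form k/n rounded to the nearest integer; the function name `smallest_sample_size` is made up for illustration.

```python
def smallest_sample_size(percentages, max_n=100):
    """Return the smallest n for which every percentage equals round(100 * k / n) for some integer k."""
    for n in range(1, max_n + 1):
        # All whole-number percentages reachable with a sample of size n.
        achievable = {round(100 * k / n) for k in range(n + 1)}
        if all(p in achievable for p in percentages):
            return n
    return None


print(smallest_sample_size([17, 50, 67, 83]))  # prints 6
```

Multiples of 6 (12, 18, ...) would produce the same rounded figures, so this only gives the smallest sample consistent with the numbers.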


So is the real story here that putting your email on the open web simply isn't a significant "vulnerability" anymore?

Honestly, why would you spider and scrape sites when you can just grab a leaked dataset?

It would be slightly more interesting to put the canaries in a more commonly scanned location, like a plain-text bio field on a popular site (like this one... or FB... or LinkedIn...).


> So is the real story here that putting your email on the open web simply isn't a significant "vulnerability" anymore?

Yes. The whole obfuscate-my-email thing is silly. I've had the same email since the mid 90s, and I post it wherever without any care.


Spam isn't a problem for you, I take it? What are you using to filter spam out?


Spam is no longer significant. Back around year 2000 I was getting multiple thousands of spam messages per day (to the very same email address I still use today). Those were the dark and difficult days.

Today I basically don't get any spam. I don't keep precise stats on the spam that makes it to my inbox but it's something like 2 or 3 per month. Compared to a few thousand per day twenty+ years ago!

I want to highlight again that it is the exact same email address I use today.

I block misconfigured connections at the SMTP level in postfix (I run my own mail server), that takes care of pretty much all of it.

After that I run bayesian filtering (spamprobe) to score spam, but there isn't much to catch. Over the last 30 days spamprobe has caught 37 messages, all true positives.

Unlike gmail, I also don't get any false positives (zero so far this year).

I know how bad it was a couple of decades ago, and a lot of people are still traumatized by those days and act like it's still 2001, but the reality in 2023 is that spam is a very minor issue.


I also run postfix on my own domain, but instead of Bayesian filtering I run PostFwd and PostGrey. The combination of these 3 is enough in my case to reach zero spam emails. I can share my config if someone is interested.


Most of the spam I get is the stupid newsletters that you get signed up for automatically and the unsubscribe link is broken.


I have the same situation


You should have posted it right here to prove your point.

/s


To offer one corroborating point, my email address is posted in cleartext on the front page of my website and it receives spam at a rate of less than one message per day.


Another: Used the same email for ~2 decades, never hidden it, giving it out freely and available on lots of public websites including my own. Get around ~30 spam emails per day, all automatically marked as spam. Around 2-3 gets through my filter each month.


Difficult to extract any useful info from this post without specifying what or where your spam filter is. Does it run locally? On your own mail server? An ISP's default spam filter? Gmail? CPanel? Have you tweaked any settings?


We’ve been discussing observed rates of attempted spam, not the effectiveness of downstream filters.


Is that pre or post IP filter attempted spam?

Or what I'm saying is, if the SMTP server blocks by IP first before determining what mail is being delivered then the actual rate of potential spam to any particular email address is not being discovered.


For my case above we have a decent explanation of the “SMTP stage” policy [1] but no visibility into its actual metrics.

[1]: https://www.fastmail.help/hc/en-us/articles/360060591393


> not the effectiveness of downstream filters.

But he chose to mention that all spam was marked as such and that only 2 or 3 get through. Readers will naturally be curious what methodology and tools are in use.


“Difficult to extract any useful info from this post” reads as undue criticism, as if preemptive satisfaction of your curiosity about the tangent was compulsory.


I disagree that the criticism is "undue"... a post that waves off spam as if it's a non-issue ("2-3 each month"), when in fact many people do struggle with it, helps nobody. A post actually explaining how you built or configured an effective spam filter, however, benefits the HN community.


The conversation was about how many spam messages we received, not how many were caught or how. But since you're curious: sendmail + SpamAssassin, running on my own server.


But you specifically mentioned the percentage of messages which are marked as spam, and how many get through your filter. If you choose to include that info, people are going to have questions.


My experience is similar; I have my personal email, which has been in use since the early 00s, in almost plain text (@ replaced with <at>) on the about page of a fairly popular site which gets several million visits a month. I only get 5-10 spam messages/day, most of which are filtered without issue. I do get a decent amount of email, but not true 'spam'. It's mostly just crap I've signed up for over the years and can't be bothered to get rid of.

I honestly get more at my work email, which has never been posted anywhere... I wonder if spammers have started to assume the easy to get email addresses are suspicious or not valuable for various reasons.


Spidering and scraping is 'more' legal?


Scraping is pretty dangerous if you're running a spam business, since there are a lot of honeypots out there, and you'll pretty rapidly find yourself blacklisted. Email spammers generally want verified email addresses.

Since it's far safer to buy email lists from some broker, you tend to get a lot more spam for signing up for things than posting your email publicly.


> Given that the percentages of blocked spam are just 17, 50, 67 & 83, it appears that the sample was likely just 6 spam emails which isn't a lot.

This is mentioned, but a little hidden (in the description for URL encoding):

> This is based on a small sample size: just six bots that were observed over a one-year period.

I think the number of spam emails would be a bad measure, since a single scan could result in many hundreds of spam emails.


The problem with email is that it fills so many roles. It's both the only universal chat program between trusted entities and the only universal way to allow people to cold call you.

If you're really worried about spam, I recommend just keeping a different email address for each separate purpose.


While it doesn't stop spam, I have been using a catch-all email system for a while now.

The benefit is that I know where someone got my email from, and I can then try to figure out whether the place has been compromised, or whether they're selling my email, etc. And I can just blacklist that particular address forever as well.

Previously, I just did whatever@mydomain.tld, but I've switched to something similar to blame.email [1].

This makes my emails look a little weirder, but it has stopped the weird looks I'd get when walking into a physical place, like my doctor, and telling them "Yeah, email me at <doctor's name>@<first><last>.com".

It also makes it less obvious that it's effectively a throwaway email, particularly combined with my domain; it looks fitting. And since each address is salted and hashed, it pretty much eliminates the risk of someone successfully phishing me by sending an email to something like `paypal@<first><last>.com`.
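A minimal sketch of the salt-and-hash idea, assuming HMAC-SHA256 truncated to 8 hex characters. The function name `alias_for`, the salt value, and example.com are made up for illustration, and this is not necessarily how blame.email derives its addresses.

```python
import hashlib
import hmac


def alias_for(service: str, salt: bytes, domain: str, length: int = 8) -> str:
    """Derive a per-service address from a secret salt (hypothetical scheme)."""
    # HMAC keyed with the salt: without the salt, the local part cannot be guessed.
    digest = hmac.new(salt, service.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{digest[:length]}@{domain}"


salt = b"replace-with-a-long-random-secret"  # assumption: kept private, e.g. in a password manager
print(alias_for("paypal", salt, "example.com"))
```

Because the derivation is deterministic, the same service name and salt always give the same address, so it can be regenerated later without looking anything up.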

Lastly, on my HN profile and elsewhere, I've got my "email", but despite them being unique, I still don't want to have to rotate it if it gets picked up by a spambot, so I've tried to do some plaintext simple "obfuscation" like in the article.

I went for <address> ~АТ~ <domain>.<tld> -- with the "AT" being Cyrillic rather than Latin - I figure at least some will get tripped up by not being able to use purely English regex.
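To illustrate why the homoglyph trick might trip up a simple scraper, here is a sketch with a hypothetical de-obfuscation regex that looks for a Latin "AT" between tildes. The pattern and the sample strings are assumptions, not taken from any real scraper.

```python
import re

# Hypothetical scraper rule: "<name> ~AT~ <domain>.<tld>" written with Latin letters.
NAIVE_AT = re.compile(r"(\w+)\s*~AT~\s*(\w+\.\w+)")

latin = "someone ~AT~ example.com"
# U+0410 (CYRILLIC CAPITAL LETTER A) and U+0422 (CYRILLIC CAPITAL LETTER TE) look like "AT".
cyrillic = "someone ~\u0410\u0422~ example.com"

print(bool(NAIVE_AT.search(latin)))     # prints True
print(bool(NAIVE_AT.search(cyrillic)))  # prints False
```

A scraper that normalizes look-alike characters first would not be fooled, so this only raises the bar slightly.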

So far, I have yet to receive any spam with that strategy. Maybe I'm lucky or just not getting indexed, or maybe it's working a little.

Still torn about how to handle Git or copyright/license headers, though; those addresses need to last a long time, in case anyone needs to reach out and ask for re-licensing/etc, and I figure it'd be annoying doing different emails for each repo.

[1]: https://news.ycombinator.com/item?id=31820502 / https://blame.email/


I have been using a catch-all for about 2 years now with great results. I hadn't thought of hashing/obfuscating the emails though. I think since I use a password manager anyway, I could just generate a random 6-8 character prefix when signing up for a new account, and since it's saved in my password manager it's easy to look up again later (no need for a true hash).
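A random prefix like that could be generated along these lines. This is only a sketch: the alphabet, the default length of 8, and the name `random_alias` are assumptions.

```python
import secrets
import string

# Lowercase letters and digits are safe in the local part of an address.
ALPHABET = string.ascii_lowercase + string.digits


def random_alias(domain: str, length: int = 8) -> str:
    """Return a random address on a catch-all domain; store it in the password manager."""
    local = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{local}@{domain}"


print(random_alias("example.com"))
```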


> I think since I use a password manager anyway, I could just generate a random 6-8 character prefix when signing up for a new account, and since it's saved in my password manager it's easy to look up again later (no need for a true hash).

Yeah, same. I store all the addresses in KeePass.

The main reason I don't just totally randomize them is just that there have been a few moments where I do have my salt somehow, but for whatever reason, it is either inconvenient or impossible to immediately open up the password manager and add a new entry.

In those moments, being able to deterministically generate the address and then add it at my leisure without having to double-check what I used is nice.

It also likely wouldn't happen to me, but should I ever somehow lose/lose access to both my old emails and my password manager, as long as I have my salt, I can still "remember" my email addresses for important services (e.g., PayPal or whatever) to re-generate the addresses and reset my passwords.

Whatever route you go, be it randomized addresses or hashed addresses, even though I think I am more vigilant and careful than most, it's still nice having an extra-layer to the catch-all that can't easily be targeted by someone malicious without first either somehow obtaining your salt, compromising the service, etc; it's handy being able to immediately filter and flag anything relating to my bank or whatever else if it isn't sent to the right address.


I just append 4 random characters to the email, e.g. domain.com-xy3j@myname.mail . I have to explicitly configure an alias for each of these addresses - this prevents someone from guessing a correct address for a different domain.

If Spam arrives, I can block that specific address and use different random characters to live in peace again.

Keeping the domain readable makes it easier to explain to people that they must’ve “lost” my email address somehow.


This is something I think about a lot too. I’m interested how you handle verbally communicating an address? Most people probably don’t expect a completely random string. I wonder if there could be an easy “word-sounding” generator that could be integrated into something to manage emails?


> I’m interested how you handle verbally communicating an address?

It depends on how off-guard I'm caught and how important it is to me. I usually have my phone, which has my KeePass and email salt inside, and I usually have at least enough battery to last a conversation, so it's rare that I can't generate the proper email address in <30 seconds in most scenarios.

But yes, having like, 5e5ee440@<domain>.<tld>, has definitely resulted in a few "can you repeat that?" or "just to confirm?" moments (especially over the phone since audio quality often sucks). That said, for whatever reason, people are still seemingly less surprised by "5e5ee440@" vs. "<your place of work>@".

On the rarer occasions where I don't at least have my phone or something, if it's something I know I can update later, I'll tell them whatever is easy to input and remember; I separate emails to unknown recipient addresses, but I don't completely reject them outright, so it's not usually an issue pulling out the confirmation email or whatever later, and then updating the address.

However, if I don't have my phone, and I don't know how easily I could update the email, then it depends more. For example, my doctor wants an email on file for whatever record-keeping reason and for sending appointment confirmations and such. In that scenario, I don't know that I'd necessarily be able to easily change it without going in/calling them.

The first time, I did give them <doctor's practice>@domain.tld, because I figure, despite being an important email, it's unlikely that it'd get abused; if someone somehow knows my GP's full name and practice, and is using it maliciously, I've probably got bigger worries than getting a phishing email sent to it or whatever else.

The second time, though, I just asked her to email me the contact update form and told her I'd send it back with the proper email inside.

> I wonder if there could be an easy “word-sounding” generator that could be integrated into something to manage emails?

I figure you could do something similar to the "correct horse battery staple" XKCD comic or bitcoin wallet seed phrases, if you wanted to avoid the "sorry, can you repeat that?" moments.

But it might be slightly more annoying to deterministically generate those, if you care about that aspect, compared to simply salt+hash & truncate. If you find a good method, let me know, though.
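One way the deterministic, word-based variant might look, as a rough sketch: the 16-word list, the name `spoken_alias`, and the choice of HMAC-SHA256 are all assumptions for illustration. A real setup would want a much larger list (the BIP-39 list has 2048 words), since three words from a list of 16 give only 12 bits.

```python
import hashlib
import hmac

# Tiny illustrative word list; its length (16) divides 256, so byte % 16 is unbiased.
WORDS = [
    "amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "harbor",
    "iris", "juniper", "kestrel", "lotus", "maple", "nectar", "opal", "pine",
]


def spoken_alias(service: str, salt: bytes, domain: str, n_words: int = 3) -> str:
    """Derive an easy-to-say address from a secret salt and a service name."""
    digest = hmac.new(salt, service.lower().encode("utf-8"), hashlib.sha256).digest()
    # Map the first n_words bytes of the digest to words from the list.
    words = [WORDS[b % len(WORDS)] for b in digest[:n_words]]
    return "-".join(words) + "@" + domain


print(spoken_alias("doctor", b"replace-with-a-secret", "example.com"))
```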


I've been doing this for a few years and I think I've had a single organization use the email in a way I was surprised about (used car dealer gave it to Sirius radio and they spammed me). I get far more spam directly from orgs, which is annoying but expected, and they usually honor unsubscribes, so I've never blocked one of my addresses. At the end of the day I don't think it's been worth the effort, and I'm considering switching to a single public email address that I just expect to receive spam.


This is all making me wonder if email should have been more like some modern messaging systems with friend requests that aren't transferable from one person to another (i.e. a friend can't tell someone else your handle and let someone else message you) rather than the phone system of "if you know the number you can call".


I've discovered that the most effective trick is to use an email address that is considered invalid by most email scrapers but valid by most mail servers.

I have an unobfuscated mailto: link on my blog and I receive barely any spam to my public email address, ~@eligrey.com
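A sketch of why this can work, assuming a scraper uses the kind of "typical" email regex often copied around. Both patterns below are illustrative guesses, not taken from any actual bot. The tilde is a legal character in the local part under RFC 5322, but the common pattern leaves it out.

```python
import re

# Commonly seen pattern: the local part is limited to letters, digits and a few symbols.
TYPICAL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# More permissive pattern covering the RFC 5322 "atext" symbols.
PERMISSIVE = re.compile(r"[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

print(TYPICAL.findall("write to jane.doe@example.com"))  # prints ['jane.doe@example.com']
print(TYPICAL.findall("write to ~@eligrey.com"))         # prints []
print(PERMISSIVE.findall("write to ~@eligrey.com"))      # prints ['~@eligrey.com']
```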


~@eligrey.com implies it's not you, I would add an alias for ~~@eligrey.com for the comfort of any programmer friends.


But ~user is user's home for Unix friends.


People have gotten... upset... when I submit pull requests on github as `wget${IFS}r.vc/ghe`@ryanc.org


Oo just the other day I discovered that "&ers" is a great way to shorten my name. It's also a valid email handle. Maybe I should make it my public address.


I guess you have a different email address that you use for website signups, shopping etc?


Indeed! I also maintain a list of entities that have leaked the unique email addresses that I have shared with them.[1]

1. https://gist.github.com/eligrey/5084991


Same. I took over a programming conference earlier this year and it uses a .codes domain.


OP was saying that their email is literally ~@eligrey.com

He’s saying this works bc scrapers might not match a left hand side of an @ that contains no alphanumerics


I get that. I think the less common domain extensions are having a similar effect right now though.


I have used a keyed email address scheme since the mid 1990s (me_context@domain.org). 99% of the email I get now is sent to the same 5 addresses, in particular me_boingboing, me_linkedin. Those addresses were never on a web page. The keyed addresses I used for my company occasionally get spam, but it's generally on topic (offering relevant services, though still imo spam). I also have a gmail address that is a common first/last name but I don't send from that address, I get a lot of presumably guessed address spam there. This suggests quite a bit of organization in how spam works these days.


What I'm missing from the article is an explanation of what the control is. The email with no protection, which you'd expect to be the control, blocked 17% of spam - 17% of what amount of spam? How does the author quantify the spam that's out there?


It says on the page that it's because all the addresses were on the same page (and presumably then scraped by the same bots), and yet a couple of emails were received by other addresses and not by the unprotected one. So it would seem that the control is the set of unique emails received across all addresses.

"Surprisingly, the unprotected email address appears to have blocked a spam email. Either that message wasn’t received, or an extra message was sent to one of the protected email addresses."


<a href="mailto:email@example.com">email</a> blocked 17% but email@example.com blocked 0%


Really? Seems like you'd have to actively try to be blocked by that, i.e. to extract 'email' but not the mailto. I suppose if you didn't actually scan the HTML, but instead all the contained text.. maybe it just tests an implementation detail basically.


Maybe their scraper is so bad it actually captured "mailto:example@..."


Here's an older submission about using ChatGPT to de-obfuscate the more basic methods:

https://news.ycombinator.com/item?id=38150096

Some comments claim that it can break some of the more complex techniques presented in the article. I've tried it a few times myself with varying results that tend towards not working.


Feels like ChatGPT has to show up in every Hacker News thread now.


Total aside, but I recently purchased a .us domain name for the first time. I was on my phone and signing up with the Namecheap app. I thought it odd that it didn't ask for privacy protection, but I figured I'd quickly turn it on after the fact and didn't get around to it until later that night. I used our home phone as the phone number for the site. The next day we got no fewer than fifty spam phone calls, and the email address I offered was quickly inundated with over a hundred spam messages. It was absolutely nuts. Fortunately, I used an email alias and just turned it off. Likewise, I have a Callcentric VOIP number for the home, so I was able to add a "press six to continue" prompt, but I can't change that number without a lot of pain and I now no longer get the automated phone calls from the school (we still get them through the app). Since then, the call volume has dropped off, but I can see in the logs that we still get several per day.

After all of that I learned that privacy protection is not available for .us domains. No wonder they're so cheap.

A long way of saying, if you sign up for a .us domain, definitely do not give them your real information.


What about this one? I think I found it in a Stack Overflow answer a year or two ago. I've used it, but haven't tested it for effectiveness in avoiding spam.

<a href="#" class="cryptedmail" data-name="david" data-domain="davidlane" data-tld="io" onclick="window.location.href = 'mailto:' + this.dataset.name + '@' + this.dataset.domain + '.' + this.dataset.tld; return false;"></a>

Paired with this CSS:

.cryptedmail:after { content: attr(data-name) "@" attr(data-domain) "." attr(data-tld); }

Which results in a clickable link that opens whatever is set up to handle mailto: links (1).

1: https://nexus.armylane.com/files/email-crypted.png


Tip: Almost none of the scrapers and bots run WebAssembly :)


How long until you have a proof of work challenge to see the email?


This has worked incredibly well for about two years now.


Intriguing, but has this been tested in the wild alongside a control eMail address in plaintext?


> 1.3 No protection Blocked 17% of spam

I read the explanation, but it just sounds like something is a bit off in the methodology and metrics. Or at least in my understanding of them :)


What they mean is that at least one bot doesn't read mailto: links and only cares about plaintext e-mails.
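A sketch of how such a bot might behave, assuming it strips the markup and runs a regex over the visible text only. The class and function names are made up, and the regex is deliberately simple.

```python
import re
from html.parser import HTMLParser

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class TextOnly(HTMLParser):
    """Collects text nodes and ignores tags and attributes such as href."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def emails_in_visible_text(markup: str) -> list:
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return EMAIL.findall(" ".join(parser.chunks))


print(emails_in_visible_text('<a href="mailto:email@example.com">email</a>'))  # prints []
print(emails_in_visible_text("<p>email@example.com</p>"))  # prints ['email@example.com']
```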


Ah, thank you for that. I feel a bit dumb now, but that would have been a good inline explanation.


Turn your email into a PNG. It's easy to read, and my brother, who worked for a scraping company, said they hadn't cracked that yet (as of 5 years ago). Even today, you would need a neural net to do it, which makes it more expensive. They would also need another neural net to figure out whether or not it was an email address! It's just an image after all.


The article covers this method of using an image: https://spencermortensen.com/articles/email-obfuscation/#tex...

The problem is that humans who want to cut and paste your email into a different client have to retype it, which is annoying and error prone.


I use JavaScript obfuscation[1] on my website and get a very low amount of spam. However, I am confident that some modern scrapers use headless browsers and wait for the DOM to be fully loaded before extracting the email from the text data.

[1]: https://ianisbernard.com/contact/


A similar but CSS based method: put each character of the email address in a separate div, place the divs in the DOM in random order, but use CSS absolute positioning to make them appear in the right order to the user. Probably not straightforwardly scrapable by reading the DOM, would mess up text selection though.
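A rough sketch of that layout trick, generated from Python. It uses inline spans with `left` offsets in `ch` units and a monospace font rather than separate divs; the function names and those styling choices are assumptions. The `reassemble` helper shows that a scraper written for this exact markup could still undo it.

```python
import html
import random
import re


def shuffled_markup(address: str, seed=None) -> str:
    """Emit one absolutely positioned span per character, in random DOM order."""
    rng = random.Random(seed)
    cells = list(enumerate(address))
    rng.shuffle(cells)
    spans = "".join(
        f'<span style="position:absolute;left:{i}ch">{html.escape(c)}</span>'
        for i, c in cells
    )
    wrapper = (
        "position:relative;display:inline-block;"
        f"width:{len(address)}ch;height:1em;font-family:monospace"
    )
    return f'<span style="{wrapper}">{spans}</span>'


def reassemble(markup: str) -> str:
    """What a targeted scraper could do: sort the characters by their left offset."""
    pairs = re.findall(r'left:(\d+)ch">([^<]*)</span>', markup)
    ordered = sorted((int(i), c) for i, c in pairs)
    return "".join(html.unescape(c) for _, c in ordered)


markup = shuffled_markup("email@example.com", seed=1)
print(reassemble(markup))  # prints email@example.com
```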


What about accessibility for people with visual impairments?

The author has warnings for the last three versions because of usability. I also see lots of red flags for the other versions in terms of accessibility.


ultimately I'm not sure that is a tractable problem; any method a screen reader would use to decode to speech would require getting the address in clear text via programmatic means, which is exactly what needs to be avoided if we want to stop bots scraping addresses


You could have a button that triggers a captcha that then authenticates a request for the plaintext email.

Or you could just use a spam filter.


you know, you're right, an optional audio captcha alongside these could work quite well. I stand corrected!


Not really a block. I’ve worked on CAPTCHA-solving bots. But it does increase the cost. I’d probably not scrape for emails anyways though, using leaked datasets is likely far more economical.


It's an interesting claim that defeating xor requires a JS interpreter but defeating the others (e.g., concatenation, rot18, etc.) does not. I mean, if a bot author wanted to scrape a site or sites where all obfuscation used the same xor routine, the author (not the bot!) could just read (not execute) the JS and customize the bot accordingly. Granted it restricts things to more targeted attacks rather than a bot that just crawls arbitrarily across the web, which is noteworthy.
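To make that concrete, here is a sketch of a bot-side decoder for a hypothetical XOR scheme in which the first byte of a hex string is the key and the rest is the address XORed with it. The scheme and the function names are assumptions; the article's actual routine may differ. Once the routine has been read, no JS interpreter is needed.

```python
def xor_encode(address: str, key: int) -> str:
    """Hypothetical site-side encoding: key byte followed by XORed address bytes, as hex."""
    data = bytes([key]) + bytes(b ^ key for b in address.encode("utf-8"))
    return data.hex()


def xor_decode(encoded: str) -> str:
    """What a bot author could write after reading (not executing) the page's JS."""
    data = bytes.fromhex(encoded)
    key, body = data[0], data[1:]
    return bytes(b ^ key for b in body).decode("utf-8")


token = xor_encode("email@example.com", key=0x5A)
print(token)
print(xor_decode(token))  # prints email@example.com
```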


Most scrapers search for anchor tags. So a span with the ARIA role=link could work too. This would also preserve accessibility, since I'm not sure what a screen reader would read out for the encoded email. https://developer.mozilla.org/en-US/docs/Web/Accessibility/A...


I'm using CSS-based obfuscation on my website[1] and it works quite well.

1 - https://picheta.me


I’m surprised URL encoding works so well. I’ve been base64 encoding and using JS to decode, but if URL encoding is that reliable, it’s probably not worth it.
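For comparison, a sketch of both encodings side by side. It assumes "URL encoding" here means percent-encoding every byte of the address; the helper name `percent_encode_all` is made up. Both are trivially reversible, so any protection comes from bots not bothering to decode.

```python
import base64
from urllib.parse import unquote


def percent_encode_all(address: str) -> str:
    """Percent-encode every byte, including letters that normally stay as-is."""
    return "".join(f"%{b:02X}" for b in address.encode("utf-8"))


address = "email@example.com"

url_encoded = percent_encode_all(address)
b64_encoded = base64.b64encode(address.encode("utf-8")).decode("ascii")

print(url_encoded)
print(unquote(url_encoded))                            # prints email@example.com
print(base64.b64decode(b64_encoded).decode("utf-8"))   # prints email@example.com
```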


Might want to add an intro paragraph. I initially thought that "Email Obfuscation" meant something different, like trying to trick a mail client.


It may not be the cleverness of the obfuscation that beats the spammer's bot. If you go to such lengths to obfuscate your email, then you may simply be a bad target for my bot: response rates would be tiny. No good ROI for the spammer.

Does not matter, it works then!! But not because of technical cleverness.

No idea if what I am saying is true.


I have successfully abandoned email.

Sites that require email for logins get my common email that I do not check unless required to reset a password or links for authentication.

Long form communication is done via texting or shared chat.


> Long form communication is done via texting or shared chat.

That's a nightmare.

In reality, long form just doesn't exist in chat/text. For me, long form is at least 500 words. Years go by before I get "long form" text/chat of that order.


How do you review old communication from 8 years ago? Or find texts with particular criteria - “from x, subject like y, received within a week of date z, has attachment”


What is there to review?

We are living in strange times where the digital stuff just does not exist unless it is consumed in the present.

I have loving letters written to me in pen and in email. I read neither, yet for some reason the penned documents are held onto with an emotional tie.

My father, who used to painstakingly capture photos and create slide shows using a rotating projector, laments the days when people would review photos as a time of bonding. That no longer exists and now, even with 500x more photos, I still long for the days of having the context of a few photos that are forever lost. GPS and timestamps are helpful I guess, but I will never know the reason for that captured moment.


He gave up on that too and leaves it to his parents.


You're getting downvoted quite hard for doing the same as what I suspect is the majority of Internet users right now.


I recently created a new company domain with email on it and I'm amazed at how quickly I'm getting spam on it.

To be fair, most of it is directly from LinkedIn, but still...


i'd be curious to know, of the spam emails received, what percentage were blocked by basic spam filtering?

in 2023 email spam seems like mostly a solved problem - i very rarely get any that actually makes it through to my inbox. trying to solve the spam problem by protecting your email address from becoming public might have seemed like a valid strategy 20 years ago, but we have better tools now.


The limitation of spam filters is false positives. For most people it's probably not a big deal to have one or two messages land in the spam filter .. then someone follows up and says "Hey did you get that legit email I tried to send you? I haven't heard back." But for certain business accounts, the amount of spam + false positives can get to unmanageable levels, where important emails are flagged as spam and left undiscovered because sifting through the spam folder regularly is as time consuming and annoying as if the spam just went straight to inbox.


okay, but no amount of email obfuscation is going to allow me to disable my spam filter, so this isn't solving the false positive problem.


My point is that it's still valuable to try and reduce the overall volume of spam. Spam filters are another angle of attack. Spam is a tough enough problem that it's always worth throwing multiple solutions at it, each good at solving a separate slice of the overall problem.


What service?

I have gmail and O365 and some other web hosting provider. My work email is unknown, probably filtered 5x, plus on O365 so limited to known contacts.


What worked really well for me: storing email in a base64 encoding and then decoding it whenever user hovered over the link for my email.


I wonder how much mail the example.com domain gets!


None. I always use those in tests because they're owned by the IANA, so unlike random domains that have no MX records now but might in the future, example.org, example.net and example.com are safe.

I do this because sometimes in companies, people will put DB dumps in the wrong environment that has an actual SMTP going to WAN, then shit happens. I also make sure the environments have a dummy smtp or mailcatcher, but it's better to be safe than sorry.

See RFC2606: https://www.rfc-editor.org/rfc/rfc2606.html


Given the MX record resolves to `.` I suspect not any (successfully)


The only thing that works is to stop using email in 2023.


Seriously, if you are still running email with HTML and scripting enabled, you need to turn it off. Plain text is fine for everything; use a secure messaging program like Signal for anything else.


On the topic of spam..."Spam" to me these days includes email from companies who abuse the soft opt-in, service emails and "existing commercial relationship" rules.

Cart abandonment, tips on how to stay safe online, requests to go paperless, newsletters, claim your free subscription that came with your purchase, thank you for your purchase (separate to the order confirmation), requests to leave a review, continue your application, offering support, "you haven't logged in for a while", new login from $device, thank you for completing step X ... it's endless.

An endless torrent of excuses to get their company name in front of your eyeballs.


"new login from $device"

That's a security precaution, not spam.


Except all my logins are from "a new device" because I use Cookie AutoDelete, there's no way to opt out of this spam, and I don't give a shit if anyone DID actually hack my Google account because it's only a way to keep track of my Youtube watch history and subs.

I filter these as spam.


As long as you know the reason. For everybody else, it’s not spam.


It's not a precaution. A security precaution would be blocking the login because of unusual activity

I get so many, I tune out. I bet I'm not alone


That’s ideal, but it’s a lot harder and involves more invasive tracking to determine normal activity.

You really have 2 options from a UX standpoint. Either allow the login and notify you, which gives you less friction in your experience with the application if it really is you.

Or they can stop you right on the login screen and send you an email with a code or a link to click before you go further. It’s more secure, but it adds friction on the more common case that it really is you.


I understand, I was just playing along with the bold assertions and nit-picking


The big spam for me at work is now people using LinkedIn and other tools to automatically find people's emails and send them targeted automated emails that look like plain-text, old-school outreach emails. For some reason this is just not getting caught by my spam filter.

Frustrating! I never opted in!


I deleted my LinkedIn account when it started becoming more like Facebook (nothing against FB, my wife likes it, but FB isn't for me).

Also, I removed my public open source projects from GitHub after Microsoft bought them then removed them from GitLab too when the AI warnings about stealing copyrighted code started to sound more real.


I removed my github when they took my code and printed and put it in some archive. All rights reserved?


GitLab does that stuff. Almost every time I log in, I get an email about someone logging in from a new device. I am not even using a VPN on that machine. Wonder what their problem is. By now their stuff is marked as spam when it hits my mailbox.



