
Modern Anti-Spam and E2E Crypto (2014) - Radim
https://moderncrypto.org/mail-archive/messaging/2014/000780.html
======
mikaraento
Previous discussions
[https://news.ycombinator.com/item?id=8275970](https://news.ycombinator.com/item?id=8275970)
and
[https://news.ycombinator.com/item?id=12300953](https://news.ycombinator.com/item?id=12300953)

(not because I want to dismiss conversation here - did want to read up on it
myself)

------
mistermumble
The content in this article about anti-spam techniques is fascinating. But the
title should be edited to indicate its vintage: 2014.

Also, by the by, there is some interesting context about the author (Mike
Hearn) and his recent activities that are not related to the topic of spam.

Warning: off-topic digression below.

Mike Hearn became one of the most visible Bitcoin core developers, working in
that community for 5 years, until a well-publicized departure where he
declared Bitcoin a failure[0].

He then joined R3CEV, a startup venture that is building a private blockchain
platform for a consortium of 70 of the world's largest banks. Hearn's
departure was criticized by members of the cryptocurrency community, such as
Bram Cohen, who famously called his exit a "whiny ragequit" [1].

Regardless of how one feels about the internal politics of the bitcoin dev
community, the R3 project is technically interesting (to me, at least) because
it uses Kotlin [2], a JVM language with functional features, and also because it has
some interesting design approaches that depart from the established bitcoin
blockchain model [3].

I still think that private blockchain platforms have an uphill battle if they
want to compete for developer attention with rapidly evolving broad-based
platforms like Ethereum and Bitcoin, but the R3 Corda platform is nevertheless
worth tracking.

[0] [https://medium.com/@octskyward/the-resolution-of-the-bitcoin...](https://medium.com/@octskyward/the-resolution-of-the-bitcoin-experiment-dabb30201f7)

[1] [https://medium.com/@bramcohen/whiny-ragequitting-cab164b1e88](https://medium.com/@bramcohen/whiny-ragequitting-cab164b1e88)

[2] [https://twitter.com/hhariri/status/790077263572299780](https://twitter.com/hhariri/status/790077263572299780)

[3] [https://gendal.me/2016/10/25/r3-corda-what-makes-it-differen...](https://gendal.me/2016/10/25/r3-corda-what-makes-it-different/)

~~~
generalseven
Just to steer this interesting comment slightly back on topic, see

[http://www.bitcoinwednesday.com/nsa-gchq-tapped-security-sys...](http://www.bitcoinwednesday.com/nsa-gchq-tapped-security-system-designed-by-mike-hearn/)

The debate about permissioned vs. permissionless blockchains ties back into
Hearn's thoughts on privacy.

It will be fascinating to see how his work on Corda develops compared to its
permissionless competitors.

------
bascule
I think this is an area where functional encryption could help:

[https://en.wikipedia.org/wiki/Functional_encryption](https://en.wikipedia.org/wiki/Functional_encryption)

This would allow a client to combine a server-provided function that
calculates a spam score with their private key such that the resulting
function calculates a spam score on encrypted email. The client could then
hand that function back to the server so it can perform server-side spam
detection.

There are a number of drawbacks, including performance and general questions
about the security of such a system. That said, I think this is probably the
biggest problem (from the OP):

"The third problem is that spam filters rely quite heavily on security through
obscurity, because it works well. Though some features are well known (sending
IP, links) there are many others, and those are secret. If calculation was
pushed to the client then spammers could see exactly what they had to
randomise and the cross-propagation of reputations wouldn't work as well."

Using functional encryption to provide server-side spam detection would still
require handing a spam scoring function to the client so they can apply that
function to their private key and hand the server a result. This would expose
the internals of the spam detection routine to all clients, including
spammers.

A difficult tradeoff.

~~~
quantumtremor
The problem with functional encryption is as you say, you need to hand over
the "function" somehow to the server (presumably they use machine learning and
tools that aren't feasible client-side), and there's no guarantee the private
key is hidden unless you use something like indistinguishability obfuscation,
which isn't really practical at all right now.

Did you mean fully homomorphic encryption?
([https://en.wikipedia.org/wiki/Homomorphic_encryption#Fully_h...](https://en.wikipedia.org/wiki/Homomorphic_encryption#Fully_homomorphic_encryption))
The server can compute the spam score under the encryption of an email, and
client side decrypts and sorts it from there, so not even the server knows if
a given email is spam or not. Of course, not that FHE is feasible, but perhaps
this special case is...
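
A rough sketch of the "server scores under encryption, client decrypts" idea. It swaps in Paillier, which is only additively homomorphic (not FHE), so it can express just a linear score over integer features. The primes are toy-sized, and the feature vector, the weights, and all function names are made up for illustration. Who extracts the features is left open.

```python
import math
import secrets

# Two Mersenne primes. Far too small for real security; illustration only.
P = 2**31 - 1
Q = 2**61 - 1

def keygen(p=P, q=Q):
    """Return (public modulus n, private key) for Paillier with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; needs Python 3.8+
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public modulus n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m mod n^2 equals 1 + m*n, so no big exponentiation is needed.
    return (1 + m * n) % n2 * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    """Recover the plaintext integer from ciphertext c."""
    lam, mu = priv
    n2 = n * n
    return (pow(c, lam, n2) - 1) // n * mu % n

def encrypted_score(n, enc_features, weights):
    """Server side: weighted sum of encrypted features, without decrypting.

    Multiplying ciphertexts adds plaintexts; raising a ciphertext to a
    non-negative integer power w multiplies its plaintext by w.
    """
    n2 = n * n
    acc = encrypt(n, 0)
    for c, w in zip(enc_features, weights):
        acc = acc * pow(c, w, n2) % n2
    return acc

# Client publishes n and keeps priv. Features arrive encrypted.
n, priv = keygen()
features = [1, 0, 1, 1]   # hypothetical binary features of one message
weights = [5, 2, 0, 3]    # hypothetical server-side weights (plaintext)
enc = [encrypt(n, f) for f in features]
enc_score = encrypted_score(n, enc, weights)   # server never sees features
print(decrypt(n, priv, enc_score))             # only the client can read this
```

Note that the server still can't act on the result itself, which is exactly the objection raised in the replies: the score only becomes useful once the client decrypts it.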

~~~
bascule
No, I didn't mean FHE, because FHE does not meet the criteria given in the
post, namely that it must happen as quickly as possible and cannot rely on the
liveness of the client. The OP practically rules out schemes that involve
looping in the client.

~~~
quantumtremor
What? With FHE the client just gets an additional piece of encrypted metadata
that is the encryption of whether the attached file is spam or not. No looping
required, whereas your functional encryption scheme seems to necessitate the
client being "live."

~~~
bascule
One of the requirements given in the OP (the one I was referencing in my
previous post) is that the _server_ can tell spam from non-spam without the
client being online. The FHE solution doesn't work for this requirement.

The functional encryption scheme only requires a client to bootstrap it. Once
the client has calculated the appropriate function based on their private key,
they can give it to the server, who can thereafter apply it to incoming emails
regardless of whether the client is online or offline.

~~~
quantumtremor
Okay so to fix mine: create a circuit that decrypts a ciphertext using the
private key, returning 1, 0, or Bottom depending on whether it's an encryption
of a spam marking, a non-spam marking, or not valid, and run it through iO. So
both solutions still require iO...

------
kccqzy
Perhaps we should put (2014) in the title. I'm sure Gmail has improved a lot
in the intervening two years and the information could be outdated.

~~~
zmanian
This post was also extensively discussed back then. It is a classic but yeah
2014 belongs in the title.

It's interesting that encrypted messaging has exploded since 2014 but spam has
not yet become that much of a problem.

~~~
btown
The people who use encrypted messaging, while greatly increased in number, are
still not the target demographic who would click on FREE VIAGRA ads. That
said, I wouldn't be surprised to see a small niche industry arise around
unsolicited encrypted bulk email sending for tech recruiters...

------
Animats
The fundamental problem is not spam messages. It's unlimited free identities.
Encrypted email bodies would play well with spam detection, as long as there's
some effort or cost associated with creating a new sender identity. Google
tries to do this by tying your entire life to a Google account. So do Facebook
and Linkedin. It takes a while for a new account on those systems to develop a
life history, so there's a reputation anchor of sorts that doesn't involve
money. If you could send an encrypted email that was signed with your
Facebook, LinkedIn, Google, or Github ID, that would be a reasonable way to
tie messages to a reputation.

------
tempestn
This tangentially relates to something I was thinking about yesterday.

Does anyone have a sense of how difficult it would be to create a service that
scans your gmail spam folder and categorizes the contents into 'definitely
spam' and 'maybe spam'? I'm probably somewhat of an edge case, but I get over
100 spam emails per day in my spam folder. Almost none make it through to my
inbox. However, every month, one or two legitimate emails land in spam.
Usually these are there for an obvious reason - "cold call" emails from
advertisers and that kind of thing, but still stuff that I want to see, and
obviously (to a human) not on the same level as penis enlargement garbage or
whatever. (Although occasionally there are real head-scratchers, like a thread
where I've already replied to someone twice, and then their third message goes
to spam. I guess gmail's filters only operate on the current message and don't
look at history.)

Anyway, I get enough false positives that I need to scan through the thousands
of spam messages I get each month to try to find them, which is obviously a
huge waste of time. If something could go through there, identify the least
spammy fraction, and label them, it would save me a ton of time. It would be
lovely if gmail offered this themselves, since they already have the spam
_score_ for each message. But barring that it seems conceivable that you could
do it with a browser add-on, perhaps using a neural network. Still seems kind
of like reinventing the wheel though, since you'd basically be building a
reverse spam filter. So I'm wondering if there's an easier way...

~~~
dochtman
I had similar problems. On top of that, I forward my email through my own
server to GMail (so I control the domain, but can use the GMail ecosystem as
UX), and this was posing problems because GMail would greylist my server quite
a bit for sending in too much spam.

I now run rspamd on my own server, which does a pretty great job. With its
Bayes filters properly trained, I now receive on the order of 3
spam messages per day in GMail. Actually, rspamd seems to have fewer false
positives than the GMail spam filter -- I guess this could be because it has
more information as the original receiver, though?

Getting these results did take some very limited tweaking of the rspamd
configuration; I lowered the threshold for what's "definitely" spam (that is,
just gets discarded), and I bumped the weight of the BAYES_SPAM rule.

~~~
tempestn
Actually, that's the exact situation I'm in as well. So to clarify, the spam
that gets through rspamd lands in gmail's spam folder, so you still need to
manually check that, but all the obviously-spam stuff has been cut out before
it ever got to gmail (solving both your problems). Sounds like exactly what I
asked for!

Interesting that you mention being grey-listed too. How did you determine
that happened? Presumably the same thing could happen to me as well.

Guess I should also check out SRS as mentioned by emilburzo.

~~~
dochtman
rspamd has three levels of handling, depending on the spam score: (1) ham,
which gets passed through, (2) spam, which does not, and (3) "not sure", which
gets passed through but gets headers attached with the spam score and how the
score is built up. So I get (1) and (3) in my GMail account, but all the stuff
for which rspamd is confident it's spam no longer makes it into GMail.
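
To make the three levels concrete, here is a minimal sketch of score-based routing. It is not rspamd's code or config format; the `Thresholds` class, the `handle` function, the action names, and the numeric values are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    # Illustrative numbers only, not a recommendation.
    add_header: float = 6.0   # at or above this: deliver, but annotate
    discard: float = 15.0     # at or above this: treat as outright spam

def handle(score: float, t: Thresholds = Thresholds()) -> str:
    """Map a spam score to one of the three outcomes described above."""
    if score >= t.discard:
        return "discard"                 # (2) spam, never forwarded
    if score >= t.add_header:
        return "deliver_with_headers"    # (3) "not sure", score attached
    return "deliver"                     # (1) ham, passed through

# Lowering the discard threshold makes the filter drop more mail outright.
stricter = Thresholds(discard=10.0)
for score in (1.5, 8.0, 12.0, 20.0):
    print(score, handle(score), handle(score, stricter))
```

With these made-up numbers, a message scoring 12 is forwarded with headers under the defaults but discarded under the stricter setting.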

Of course, rspamd lets you tweak the thresholds for these levels. For example,
after a while I lowered the threshold for "spam", increasing the amount of
stuff that gets discarded by rspamd, because I noticed that rspamd was doing a
pretty good job of scoring, and the false positives I was seeing had a lower
score anyway.

I'm actually not sure grey-listing is the correct term, but I noticed in my
server's MTA log that Google was rate limiting me a lot because my server was
sending through significant amounts of email. This was also noticeable
sometimes because it would take quite some time for email to get through,
which I found annoying.

Yes, you probably want to run SRS as well, otherwise GMail will be unable to
correctly understand your headers. However, this effectively puts your server
on the hook for any email forwarded; this is why I don't think you want to go
there without also putting some kind of spam filter in place, otherwise I
assume your server's reputation will deteriorate.

~~~
emilburzo
Have you ever had a legit email marked as spam by rspamd?

And is there a way to (manually) double-check those?

~~~
dochtman
I don't think it has ever marked any legit mail as spam. The rspamd web UI has
an overview of recent history which has the date/time, message ID and score,
but there are no full headers/content for things that qualify as outright
spam.

------
hamandcheese
I like the approach Facebook Messenger takes - I can receive messages from
anyone who can view my profile, but first time messages from non-friends end
up in a special bucket and require approval.

~~~
Crazywater
That only moves the problem, though. I get a friend request from a spammer
every couple of days.

~~~
hobarrera
Exactly. With enough spam, legitimate requests end up hidden behind a large
volume of spam.

------
fiatjaf
Spam is a problem everywhere except Gmail. People have an illusion that the
battle against spam is won because Gmail did the job. All the other email
providers struggle everyday. SpamAssassin is worthless, it is a piece of shit
that does nothing.

We desperately need a service that helps filtering email -- like, for example,
something simple that just accepts reports about some email being spam or not,
and creates a list of spam addresses.

~~~
pmlnr
> SpamAssassin is worthless, it is a piece of shit that does nothing.

This is not entirely true. SpamAssassin, dspam, and all the Bayesian-based
ones need constant training and a feedback loop to work. The time a round of
training lasts for is getting shorter and shorter, but it's not entirely
ineffective. (I'm running dspam on my mailbox.)
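
For anyone wondering what "training and a feedback loop" means in practice, a toy naive Bayes filter shows the shape of it. This is a from-scratch sketch with add-one smoothing, not how SpamAssassin or dspam are implemented; the `TinyBayes` class and the sample messages are invented.

```python
import math
from collections import Counter

class TinyBayes:
    """Word-count naive Bayes with add-one smoothing (illustration only)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.words = {label: Counter() for label in self.LABELS}
        self.docs = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Feedback step: record one message the user has classified."""
        self.docs[label] += 1
        self.words[label].update(text.lower().split())

    def spam_probability(self, text):
        """Return P(spam | text) under the naive independence assumption."""
        vocab = set(self.words["spam"]) | set(self.words["ham"])
        total_docs = sum(self.docs.values())
        log_p = {}
        for label in self.LABELS:
            # Smoothed prior, then smoothed per-word likelihoods.
            lp = math.log((self.docs[label] + 1) / (total_docs + 2))
            denom = sum(self.words[label].values()) + len(vocab) + 1
            for word in text.lower().split():
                lp += math.log((self.words[label][word] + 1) / denom)
            log_p[label] = lp
        # Clamp the log-odds so math.exp cannot overflow on long messages.
        diff = max(-700.0, min(700.0, log_p["ham"] - log_p["spam"]))
        return 1.0 / (1.0 + math.exp(diff))

f = TinyBayes()
f.train("free viagra now", "spam")
f.train("cheap pills free", "spam")
f.train("meeting notes attached", "ham")
f.train("lunch tomorrow with the team", "ham")

before = f.spam_probability("fr33 v1agra")   # unseen wording, close to a coin flip
f.train("fr33 v1agra", "spam")               # user reports it
after = f.spam_probability("fr33 v1agra")    # now scored as likely spam
print(before, after)
```

The filter only catches the obfuscated wording after someone reports it, which is why the feedback loop has to keep running as spammers change their text.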

Combine this with weighted blacklists ( postscreen in my case ), add dkim and
dmarc checks and it's fine for a small provider. Far from ideal, but working.

I'd love to see an open source implementation of what google was referring to
as domain based trust, but it's also a nasty thing.

I've recently tried to change my mail address from a .eu domain to a .net, and
most of my mail landed in the recipients' spam folder. It's a fresh domain, no
one ever sent anything from it, so I'd assume fresh domains are untrusted by
default, which is really bad and is generally wrong. If trust is not OK by
default and spam is whatever users consider spam, that's going to kill
domain-based mailing, which is horrible, and Gmail will be the one to blame
when we have no alternatives to a few providers.

------
yRetsyM
Fascinating insight into one of the biggest email operations. I'd be very
interested to hear how the other companies in this space differ or are
similar.

Question: Is there a curve where pushing this processing back to the phone
will become possible? The most powerful counterpoint at the moment is battery
life, but I do see that improving to a plausible point where this sort of
continual processing is feasible?

------
jimktrains2
> Botnets appeared as a way to get around RBLs, and in response spam fighters
> mapped out the internet to create a "policy block list" \- ranges of IPs
> that were assigned to residential connections and thus should not be sending
> any email at all.

While I understand that _some_ residential ISPs don't let you run services on
your connection, policies like this make me sad because it means the web is
becoming more-and-more something you need other people to do for you.

~~~
pmlnr

        smtpd_helo_restrictions = permit_mynetworks,
          reject_invalid_helo_hostname,
          permit
    
        smtpd_recipient_restrictions =  permit_mynetworks,
          permit_sasl_authenticated,
          reject_invalid_hostname,
          reject_non_fqdn_recipient,
          reject_unknown_recipient_domain,
          reject_unauth_pipelining,
          reject_unauth_destination,
          check_policy_service unix:private/policy-spf,
          check_client_access    pcre:${config_directory}/dspam_filter_access,
          permit
    

This is in my postfix config. The invalid hostname and non_fqdn tests are
working quite well against residential hosts: they don't have valid reverse
DNS or an fqdn, and so they get eliminated fast.

~~~
jimktrains2
How does this address my point that we're moving toward a web that needs to
be done for most people, not by them?

------
rocqua
One of the issues mentioned is that bulk mail and spam are not necessarily the
same thing. However, bulk mail has much less use for e2e than directed mail.

Personalized bulk mail (e.g. bulk mail with a small personalized offer) is an
issue here. For that, partial encryption seems like a nice solution. As the
template remains plaintext, reputation can include the template. As such, you
could levy a lower 'tax' on such email as opposed to fully confidential email.

------
finishingmove
Inside Gmail for Android: ads. Lots of ads.

Disclaimer: if you haven't seen ads in the Gmail android app yet, please
consider yourself lucky and kindly buzz off from this comment.

------
zimbatm
All this is based on the wrong premises.

The reason why spam is an issue with email is, anyone can send emails and it's
not always possible to identify the sender. Once encryption is deployed then
the sender is associated with a public key and it's possible to establish a
web of trust. Gmail could also manage identity-based scores based on all its
users' trusted connections.

------
algesten
> The other reason it sucks is that it confuses bulk mail with spam. This is a
> very common confusion. Lots of companies send vast amounts of mail that
> users want to receive. Think Facebook, for example.

I challenge this. In 2016 I think most companies don't need to rely on massive
bulk mails anymore.

But then I always hated all forms of marketing...

------
duncan_bayne
This was sad to read:

> _Botnets appeared as a way to get around RBLs, and in response spam fighters
> mapped out the internet to create a "policy block list" \- ranges of IPs
> that were assigned to residential connections and thus should not be sending
> any email at all._

 _Your_ residential connection might not, but mine does.

