

A mrjob for processing bounced emails in a Postfix log - derwiki
http://derwiki.tumblr.com/post/3683175232/mrpostfixbounce

======
cnagele
Note: I run Postmark (<http://postmarkapp.com>) where we handle delivery and
bounce processing for a huge volume of email.

It depends on how the emails are being sent. The typical approach is to set a
"Return-Path:" header in your outbound messages. The email that this is set to
will get NDRs from receiving mail servers. You can then collect these NDRs in
a mailbox or Maildir in Postfix and process them as needed. The benefit of
this is having a single place where all of the messages are stored, even when
sending from many FROM addresses.
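In case it helps, here's a minimal sketch of that approach in Python (the
addresses and bounce mailbox are made up for illustration). The key point is
that the envelope sender handed to the MTA, not the "From:" header, is what
receiving servers record as Return-Path and where they send NDRs:

```python
import smtplib
from email.message import EmailMessage

# Hypothetical dedicated mailbox where all NDRs should land.
BOUNCE_MAILBOX = "bounces@example.com"

def build_outbound(from_addr, to_addr, subject, body):
    """Build a message and return the envelope sender to use with it.

    Receiving servers copy the envelope sender into the Return-Path
    header, so NDRs go to BOUNCE_MAILBOX regardless of the From: header.
    """
    msg = EmailMessage()
    msg["From"] = from_addr      # what the recipient sees
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg.set_content(body)
    return BOUNCE_MAILBOX, msg

# Actually sending requires a local MTA, so this part is commented out:
# envelope_from, msg = build_outbound("news@example.com",
#                                     "user@example.net", "Hi", "...")
# with smtplib.SMTP("localhost") as s:
#     s.send_message(msg, from_addr=envelope_from)
```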

I'd recommend a commercial tool to do this unless you already have tens of
thousands of NDRs to learn against. You'll find that every mail server and ISP
has something slightly different and it's very difficult to process each
message accurately all the time. One example of a commercial NDR processing
tool is Boogie Tools:

<http://www.boogietools.com/>

To take this a bit further, if you are using a Return-Path header, just make
sure you also set up the proper SPF and DKIM records in DNS for better
deliverability. You may also want to read up on VERP.

<http://en.wikipedia.org/wiki/Variable_envelope_return_path>
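The idea behind VERP is to encode each recipient into the envelope sender, so
the bounced address can be recovered from the NDR's "To:" address alone,
without parsing the NDR body at all. A rough sketch (the `+`/`=` separators
follow the common convention; any unambiguous encoding works):

```python
def verp_encode(bounce_user, recipient, bounce_domain):
    """Encode the recipient into the envelope sender (VERP)."""
    local, _, domain = recipient.partition("@")
    return f"{bounce_user}+{local}={domain}@{bounce_domain}"

def verp_decode(address):
    """Recover the original recipient from a VERP bounce address."""
    local = address.split("@", 1)[0]
    _, _, encoded = local.partition("+")
    user, _, domain = encoded.rpartition("=")
    return f"{user}@{domain}"
```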

------
rlpb
The traditional way to do bounce processing is to have an email alias that
runs a script. This is how mailing lists work. The really traditional Unix
mechanism is a .forward file with a pipe to a command in it. MTAs (including
Postfix) can be configured to do this a bit more cleanly.

Rather than re-inventing the wheel, take a look at how Mailman does it.
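For reference, the two traditional hooks look roughly like this (the script
path is made up):

```
# /etc/aliases -- pipe mail for "bounces" into a script's stdin
# (run newaliases after editing)
bounces: "|/usr/local/bin/process-bounce"

# ~/.forward -- the per-user equivalent
"|/usr/local/bin/process-bounce"
```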

~~~
thwarted
I'm the one who originally suggested processing the postfix log as a way to
get ahead on bounce processing for mass mailings, and that derwiki decided to
do it with map-reduce is awesome.

Doing it the "traditional way", having an email alias that runs a script, or a
.forward file with a pipe directive, doesn't scale, especially when you have a
cluster of SMTP machines and a centralized store for email address
information. You'd end up putting all your changes into a log file anyway and
bulk processing that; you might as well just use the log file that postfix
already generates (especially if you're using something like scribe to
aggregate logs from entire clusters).
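For the curious, the parsing step itself is tiny. A rough sketch in plain
Python of what the map step does (the regex assumes Postfix's usual smtp
delivery log lines; the exact format and field names vary by version and
config):

```python
import re

# Matches Postfix smtp delivery lines like:
#   ... postfix/smtp[1234]: 4A5B6C7D8E: to=<user@example.net>,
#       relay=..., dsn=5.1.1, status=bounced (host said: 550 ...)
BOUNCE_RE = re.compile(
    r"postfix/smtp\[\d+\]: (?P<queue_id>[0-9A-F]+): "
    r"to=<(?P<rcpt>[^>]+)>.*?dsn=(?P<dsn>[\d.]+), status=bounced")

def bounced_recipients(lines):
    """Map step: yield (recipient, dsn_code) for each hard bounce."""
    for line in lines:
        m = BOUNCE_RE.search(line)
        if m:
            yield m.group("rcpt"), m.group("dsn")
```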

Obviously, you still need to process bounces generated by the recipient/remote
MX mail systems, but that's easier done with a POP/IMAP client, and is
significantly fewer transactions to process overall because most sites reject
mail immediately (rather than accepting it, queuing it, and bouncing it
later).

~~~
rlpb
You can of course do it that way.

I would prefer to have a script fed by the alias do some basic RFC 3464 parsing
and then feed the result to a central database or queue (perhaps via a local
database or queue for availability). This would scale just fine. If the
forking for the script becomes a problem, it could be refactored into an LMTP
listener easily at that stage. This method would avoid some hacky logfile glue
or batching or tailing log files.
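For what it's worth, RFC 3464 DSNs are structured enough that the basic
parsing is short. A sketch using Python's stdlib email package (it assumes a
well-formed multipart/report message; real-world NDRs are much messier, as
cnagele notes above):

```python
import email

def parse_dsn(raw):
    """Extract (final_recipient, action, status) triples from an
    RFC 3464 delivery status notification."""
    msg = email.message_from_string(raw)
    results = []
    for part in msg.walk():
        if part.get_content_type() != "message/delivery-status":
            continue
        # The payload is a list of header blocks: first the per-message
        # fields, then one block per recipient.
        for block in part.get_payload():
            if "Final-Recipient" not in block:
                continue  # skip the per-message block
            # The field value looks like "rfc822; user@example.net"
            rcpt = block["Final-Recipient"].split(";", 1)[1].strip()
            results.append((rcpt, block["Action"], block["Status"]))
    return results
```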

(I haven't looked at what Mailman does; this would be my first port of call
and may change my approach).

~~~
thwarted
_it could be refactored into an LMTP listener easily at that stage._

My mailers only run postfix, and I'm not about to allow custom code to be
deployed to them every time the bounce processing needs to be updated.
Real-time processing is also difficult to debug, and you need to build in
support for caching/storing the input so you can rerun it later in case
something screws up. If you're processing log files asynchronously, the same
logs can be processed multiple times without having to touch the mailers,
independently of how busy the mailers are.

The goal is to eventually get rid of local bounce message generation
altogether: the content is relatively large compared to what's in the postfix
logs, and we don't use it since we read the logs.

 _This method would avoid some hacky logfile glue or batching or tailing log
files._

One man's hack is another man's elegance.

My team already supports massive log file aggregation, storage, and
processing, and writing batch processing map-reduce jobs (specifically through
the mrjob software developed in house and available on github) is a common
task for our developers. There's nothing hacky about this setup; nothing about
it is one-off. Installing mail aliases that send to scripts or custom LMTP
code would be a hacky solution; we don't do that anywhere else in our
production infrastructure.

 _feed the result to a central database or queue (perhaps via a local database
or queue for availability)_

This is the hacky solution: that's a lot of moving parts to maintain just to
extract, from locally generated bounces, the fact that an address should no
longer be mailed.

Our setup is literally postfix logs via syslog -> syslog-ng -> scribe -> our
batch processing map-reduce infrastructure.

 _(I haven't looked at what Mailman does; this would be my first port of call
and may change my approach)._

Mailman is an excellent source for inspiration. Are there many multi-million
recipient Mailman installations that run on clusters of SMTP machines?

------
rlpb
Also, if you parse the Postfix log files, you will only catch bounces
generated at SMTP delivery time and miss bounces generated further downstream.

