
Ask HN: Need help with distributed application infrastructure design - herbst
Hey HN,

I hope some smart brains here can point me in the right direction. I am looking for a simple yet robust data replication and transfer mechanism for a web application I am writing.

The main things I need:

1) Replicate login & configuration data to X nodes (email servers)

2) Count emails by status (dropped, forwarded, ...) [statistics]

3) Save (some) emails back to the web application [logging]

4) 2 & 3 are A LOT of data. It must be able to handle that without losing any

The guys from the Postgres IRC made me realize that multi-master is not only overkill (and many people are afraid of running it) but also simply not the right solution.

1) is easy using Postgres master-slave replication.

2 & 3, however, are not so easy in my head. And 4 is kind of out of my knowledge scope.

I've thought about the following 3 implementations:

a) Doing a master-master replication, with the secondary master doing the hard work: replicating to others, plus receiving statistics (directly over the network). I am not sure how smart it is to use the same database, even more so in such an approach.

b) Doing statistics (and most likely also the email logs) in a separate database that provides a thin API my application can query & cache. Statistics are everywhere in my interface, so I would likely still replicate/cache the relevant data back to my main application. But the heavy writing does not directly impact my web application.

c) Maybe using something like Logstash to handle the information load and drip the relevant info out into the web application's database.

I realize this topic is just hard, but I feel like I am missing something obvious.
======
vlahmot
I don't have any specific experience with emails, but any time I need to move a
lot of data around I go with Apache Kafka and Apache Flume.

Write all of your emails into a kafka topic from your webapp. Read from the
topic to do processing. Use flume to sync results back to your webapp db.
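
A minimal sketch of that first step, using the kafka-python client; the broker address, topic name, and event fields here are placeholder assumptions, not anything specified in this thread:

```python
import json

# Sketch: publish one event per email into a Kafka topic from the webapp.
# Broker address, topic name, and event fields are all placeholders.

def email_event(message_id, status, recipient):
    """Build the JSON payload published per email."""
    return json.dumps({
        "message_id": message_id,
        "status": status,          # e.g. "dropped", "forwarded"
        "recipient": recipient,
    }).encode("utf-8")

def publish_email(producer, message_id, status, recipient, topic="emails"):
    # kafka-python's KafkaProducer; keying by message_id keeps all events
    # for one email in the same partition (and therefore in order).
    producer.send(topic,
                  key=message_id.encode("utf-8"),
                  value=email_event(message_id, status, recipient))

# Wiring it up would look like this (not executed here, needs a broker):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
#   publish_email(producer, "abc-123", "forwarded", "user@example.com")
#   producer.flush()
```

`acks="all"` trades a little latency for the "must not lose any" requirement in 4): the broker acknowledges only once all in-sync replicas have the record.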

1) For this I would probably use something like Chef/Ansible but I don't know
the first thing about configuring email servers. You could have something that
wakes up, reads the latest config off a topic, and then applies that config
via a config management tool.
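
One way that "wakes up, reads the latest config, applies it" loop could look, assuming a compacted config topic and an Ansible playbook (both hypothetical names):

```python
import json
import subprocess

def latest_config(messages):
    """Treat the config topic as a log: the last record wins."""
    if not messages:
        return None
    return json.loads(messages[-1])

def apply_config(config, playbook="mta-config.yml"):
    # Hand the newest config to a config-management tool; here an assumed
    # Ansible playbook that templates it onto the mail nodes.
    subprocess.run(["ansible-playbook", playbook, "-e", json.dumps(config)],
                   check=True)

# In practice these bytes would come from a Kafka consumer reading a
# compacted "mta-config" topic to its end:
messages = [b'{"relay_host": "old.example"}', b'{"relay_host": "new.example"}']
print(latest_config(messages)["relay_host"])  # new.example
```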

2) You can throw Apache Spark onto the Kafka stream to calculate these
aggregations.
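
The aggregation itself is small; in plain Python, what the Spark job would effectively be computing is something like:

```python
from collections import Counter

def count_by_status(events):
    """One counter per email status -- the aggregation a Spark job
    over the topic would produce."""
    return Counter(e["status"] for e in events)

events = [
    {"message_id": "m1", "status": "forwarded"},
    {"message_id": "m2", "status": "dropped"},
    {"message_id": "m3", "status": "forwarded"},
]
print(count_by_status(events))  # Counter({'forwarded': 2, 'dropped': 1})
```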

3) Flume can read the emails and then save them back to wherever you need
(this is typically s3/postgres for me). Flume can scale out over the kafka
topic naturally using the same consumer group id.

I like this approach because you can scale it cheaply and easily: start with
Kinesis streams instead of Kafka if you don't have the ops resources to run
Kafka, and run Spark in standalone mode until you need a cluster.

With spark you can do your statistics in there (streaming over a time window
or batch) and then sink them over to your stats db.
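
For the time-window variant, a pure-Python sketch of tumbling-window counts, i.e. the shape a streaming job would sink to a stats table keyed by (window_start, status):

```python
from collections import defaultdict, Counter

def tumbling_window_counts(events, window_seconds=60):
    """Bucket events into fixed (tumbling) windows by timestamp and
    count statuses per window."""
    windows = defaultdict(Counter)
    for e in events:
        window_start = (e["ts"] // window_seconds) * window_seconds
        windows[window_start][e["status"]] += 1
    return dict(windows)

events = [
    {"ts": 5,  "status": "forwarded"},
    {"ts": 59, "status": "dropped"},
    {"ts": 61, "status": "forwarded"},
]
print(tumbling_window_counts(events))
# {0: Counter({'forwarded': 1, 'dropped': 1}), 60: Counter({'forwarded': 1})}
```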

With the flume/kafka combo you can treat kafka as the "channel" and you get
some nice transaction functionality out of flume that makes handling failures
a breeze.
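
For reference, a sketch of that wiring in a Flume agent properties file, with the Kafka channel feeding a simple file sink; broker, topic, group id, and paths are placeholders:

```
# Flume agent using Kafka itself as the channel: no source needed, since
# the webapp produces straight into the topic. The sink consumes the
# "emails" topic transactionally and rolls events out to local files.
agent.channels = kafka-channel
agent.sinks = log-sink
agent.sinks.log-sink.channel = kafka-channel

agent.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
agent.channels.kafka-channel.kafka.bootstrap.servers = broker1:9092
agent.channels.kafka-channel.kafka.topic = emails
agent.channels.kafka-channel.kafka.consumer.group.id = email-log-sink

agent.sinks.log-sink.type = file_roll
agent.sinks.log-sink.sink.directory = /var/log/email-archive
```

Scaling out is then just running more agents with the same consumer group id, as described above.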

It does take some tooling/monitoring to run confidently, and the whole Apache
"big data" ecosystem is daunting at first, but it's well worth it in my opinion.

~~~
herbst
This sounds super interesting. I've never worked with any of these, but this
sounds more or less like what I need. Kudos

------
dozzie
So basically, 1) is like keeping account related data in LDAP with per-MTA
replicas, 2) is like collecting statistics, and 3) is like forwarding e-mail
messages to another system to handle there? Did I get that correctly? And what
is 4) "A LOT of data"?

~~~
herbst
I was hoping it was clear enough :)

1) Is actually a Web application, but yeah #1 is kind of clear

3) Is only logging as well, so not time critical; emails are handled by the
"nodes" themselves. The reason I mention them outside of statistics is that
statistics are just a few numbers, while this is actual data and may need to
be treated differently.

4) It should handle up to a few thousand emails per second.
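
A quick back-of-envelope on what that volume means, assuming 3,000 msgs/s and ~5 KB per logged email (both made-up round numbers):

```python
msgs_per_sec = 3_000   # "a few thousand emails per second"
avg_size_kb = 5        # assumed average logged size per email

per_day = msgs_per_sec * 86_400
gb_per_day = per_day * avg_size_kb / 1_000_000

print(per_day, "messages/day")  # 259200000
print(gb_per_day, "GB/day")     # 1296.0
```

About 260 million messages and roughly 1.3 TB a day if every email body were logged, which is why "save (some) emails" and the counters need very different write paths.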

