
Monitoring 9600 banks at scale - jeandenis
https://blog.plaid.com/scaling-a-monitoring-platform/
======
steakknife
Interesting writeup. This is also a major issue for us at TradeIt (we do
something similar but for stock brokers and portfolio/trading) as the brokers
we integrate are not always... _ahem_..."robust". We've found that our
upstream users really appreciate that often we can tell them about brokers'
service outages before the brokers even announce it (when the brokers even
bother). Sometimes the brokers don't even realize their system is
malfunctioning until we poke them to ask what's going on.

Our throughput numbers are much lower and our integrations are much fewer
than Plaid, so we have been able to get away with keeping a close eye on
Graphite/Grafana for spikes in request failures/timeouts. Seems like
eventually we will need to implement some kind of statistical monitoring and
alerting.
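
For what it's worth, here is a minimal sketch of what "statistical monitoring" of a failure rate could look like, assuming you already have per-interval failure and request counts (e.g. exported from Graphite). The class name, window size, and z-score threshold are all hypothetical choices for illustration, not anything Plaid or TradeIt actually runs.

```python
from collections import deque
from statistics import mean, stdev


class FailureRateMonitor:
    """Flags an interval whose failure rate is far above the recent baseline."""

    def __init__(self, window=60, threshold=3.0, min_samples=10):
        # Rolling baseline of recent "normal" failure rates.
        self.history = deque(maxlen=window)
        self.threshold = threshold      # z-score above which we alert (assumed)
        self.min_samples = min_samples  # don't alert until a baseline exists

    def observe(self, failures, total):
        """Record one interval; return True if it looks anomalous."""
        rate = failures / total if total else 0.0
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            # Floor sigma so a perfectly flat baseline doesn't divide by zero.
            sigma = max(stdev(self.history), 1e-6)
            anomalous = (rate - mu) / sigma > self.threshold
        if not anomalous:
            # Keep outage intervals out of the baseline so they don't mask
            # the next outage.
            self.history.append(rate)
        return anomalous


monitor = FailureRateMonitor()
# Warm up with ~1-2% failures per interval, then simulate a broker outage.
warmup = [monitor.observe(1 + (i % 2), 100) for i in range(20)]
spike = monitor.observe(30, 100)
recovered = monitor.observe(1, 100)
```

A per-broker instance of something like this would replace eyeballing the dashboard for spikes, though a real setup would probably also want to account for time-of-day seasonality.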

~~~
funkymatt
Grafana has that ability built in!

------
divxflounder
Great article! I'm definitely taking an action item to look into Prometheus. I
own DevOps/Monitoring and Alerting for my org and it's really cool to see how
other companies skin this cat.

I saw Cloudwatch in the pipeline, which is an Amazon product. I know I'm going
to make a very controversial statement here, but - why Amazon? With volumes
like yours, your scale will eventually hit the point where your cost
skyrockets.

Regarding the metrics themselves, you might already do this, but I highly
recommend splitting your metrics into a 50th, 95th, and 99th percentile in
your Grafana graphs. This will give you a solid idea of not only what your
customers experience on average, but edge cases as well.
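
To illustrate why the split matters, here is a small sketch using the nearest-rank definition of a percentile on raw latency samples. The function names and the sample data are made up for the example; in a Prometheus/Grafana setup you would more likely compute this from histogram buckets than from raw samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so integer inputs stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples):
    """Return the p50/p95/p99 split suggested above."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# Hypothetical data: 98 fast requests (100 ms) and 2 very slow ones (5000 ms).
latencies_ms = [100] * 98 + [5000] * 2
summary = latency_summary(latencies_ms)
average = sum(latencies_ms) / len(latencies_ms)
```

With this made-up data the median and p95 are both 100 ms and the average is 198 ms, while p99 is 5000 ms, so only the p99 line shows what the unluckiest customers actually experienced.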

Do you have a regular forum for reviewing these metrics and solving problems
before they escalate? We're still trying to solve this in multiple teams where I
work and have noticed that some teams are great at it and other teams are a
little more reactive.

Love to see this stuff :)

~~~
jeeyoungk
One of the authors here. Thanks for enjoying the article!

Re: AWS. We're not at a point where we are overburdened by AWS spending.
Many things are more efficient with AWS, as we have a fairly small engineering
team. We use a variety of AWS products (Aurora and Kinesis, to name a few).

Regarding metrics & percentiles - Yes I agree. 99th percentile is what we try
to look at the most, as most other metrics tend to be deceiving.

Regular forums - This is something that we need to improve on as we move
forward. The blog post mostly describes the infrastructure we've built, but it
takes time and effort to become a metric-driven organization.

~~~
Terretta
Pretty unofficial here, but I prefer engineering channel to biz dev channel...
Drop me a note, loop in whoever would be interested? I’ve been meaning to get
our companies better acquainted — your fantastic write up reminded me.

------
syastrov
Nice write up. I love reading these kinds of postmortems.

Unlike a lot of those I read, it sounds like you actually set out with a good
set of requirements and really understood the problem.

I had a good experience using Prometheus as well for a smaller project (server
monitoring). It’s interesting to know that it can handle so many metrics and
scale so well to more complex problem areas.

~~~
joyzheng
One of the blog authors here -- thanks!

> I had a good experience using Prometheus as well for a smaller project
> (server monitoring). It’s interesting to know that it can handle so many
> metrics and scale so well to more complex problem areas.

Yep, we started out with a pretty simple Prometheus setup too (two instances
scraping the same metrics, just for redundancy) but have been adding federated
instances and doing some pre-aggregation to scale; the nice part is that we've
been able to do it pretty gradually by updating the config (e.g. splitting out
one bucket of metrics into a separate node for scraping at a time).

~~~
tigre100
We took a similar journey with Prometheus @ Improbable. We found federation to
have its limits & wanted a global query view as well as a few other nice
features:
https://improbable.io/games/blog/thanos-prometheus-at-scale

------
lordxenu
How do you get the data from banks? Are you scraping the webpage after the
user logs in? Not many banks I know of have public apis.

~~~
throwawaymath
Yes, for any bank that doesn't provide them with API access they're scraping
the login pages. They even do this for banks which implement anti-scraping
measures.

------
Rainymood
How do you guys handle user log-in credentials? I mean, you're basically
logging into their bank, right?

------
wbh1
Really enjoyed this write-up. I'm currently in the process of scaling out a
Prometheus-based replacement for an old Nagios setup that was scaled to its
limit and posts like this just make me that much more excited for Prometheus
as a technology.

------
beamatronic
With that many integrations, some small set must be broken at any given time.
How do you handle this without scaling a support staff accordingly?

