
Infrastructure Monitoring with Prometheus at Zerodha - mr-karan
https://zerodha.tech/blog/infra-monitoring-at-zerodha/
======
heliodor
Can attest to the quality of VictoriaMetrics. Happily running it at
[https://HostedMetrics.com](https://HostedMetrics.com) for our hosted
Prometheus + Grafana offering.

~~~
mr-karan
Author of the post here. That's pretty cool to hear that you're running VM.
It's been rock solid for us as well.

~~~
trithagoras
If you haven't already, you may want to check out M3 if you're thinking of
scaling to a cluster and doing zero-downtime upgrades. You can scale out
horizontally with the Kubernetes operator, and it avoids downtime because it
replicates data at write time and does quorum-based, masterless reads.

~~~
mr-karan
Ah alright. Haven't looked at M3DB yet, will check it out. We have a lot of
non-K8s components too, so we wanted a solution outside the operator, but yeah,
I get the idea behind it.

------
gowthamgts12
I was reading your first blog post
([https://zerodha.tech/blog/hello-world/](https://zerodha.tech/blog/hello-world/))
and was amused that your tech team is only 30 people.

~~~
stedaniels
I was more amazed than amused :-)

------
totaldude87
More than the details in the blog, I love their FOSS projects:

[https://zerodha.tech/projects/](https://zerodha.tech/projects/)

------
RabbitmqGuy
This was a really good post. Their first blog post is also amazing:
[https://zerodha.tech/blog/hello-world/](https://zerodha.tech/blog/hello-world/)

They have basically self-hosted almost everything with a team of 30.

------
bogomipz
>"GitLab CI pipelines lint and validate the configurations and then upload
them to an S3 bucket. There’s a sync server on the Alertmanager cluster to
check for new config and automatically reload Alertmanager in case of any
config updates."

I'm curious whether this syncing is done by something custom they wrote or is
something built into Alertmanager now. I had a look at the latest Alertmanager
docs and nothing jumped out at me regarding this.

~~~
mr-karan
Yes, we basically have a short-lived periodic job on our Alertmanager server
which does a `sync` from the S3 bucket that stores the configs. If something
has changed, it just triggers a `docker restart alertmanager`.
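
A minimal sketch of what such a sync job could look like, assuming the remote
config can be fetched as bytes and that a restart is only wanted when the
content actually differs. The names `sync_once`, `fetch_remote` and
`reload_service` are hypothetical and not taken from the actual setup.

```python
import hashlib
from pathlib import Path
from typing import Callable


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest used to compare config versions."""
    return hashlib.sha256(data).hexdigest()


def sync_once(fetch_remote: Callable[[], bytes],
              local_path: Path,
              reload_service: Callable[[], None]) -> bool:
    """Fetch the remote config; rewrite and reload only if it changed.

    Returns True when a reload was triggered.
    """
    remote = fetch_remote()
    local = local_path.read_bytes() if local_path.exists() else b""
    if digest(remote) == digest(local):
        return False  # nothing changed, leave the service alone
    # Write to a temp file first so a crash never leaves a half-written config.
    tmp = local_path.with_suffix(local_path.suffix + ".tmp")
    tmp.write_bytes(remote)
    tmp.replace(local_path)
    reload_service()
    return True
```

In a real job, `fetch_remote` would wrap the S3 download and `reload_service`
could be something like
`lambda: subprocess.run(["docker", "restart", "alertmanager"], check=True)`,
with the whole thing run from cron or a timer.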

Alertmanager doesn't have a built-in solution for this, although if you run
Alertmanager through the Prometheus Operator, you get automatic config
reloads. We decided to keep our AM cluster independent and out of K8s because,
as we pointed out, we have a hybrid environment.

Hope that answers your question.

~~~
bogomipz
Thanks. So is your Alertmanager a standalone node then? If so, how do you
handle HA on that?

One other thing I wanted to ask was how many nodes are in your Victoria
cluster?

~~~
mr-karan
Alertmanager can run in a clustered mode with `--cluster*` flags. You can read
more here [1]. So basically multiple nodes can run, and all Prometheus
instances are configured to send all alerts to all AM instances. The alerts
are deduplicated at the cluster level.

VictoriaMetrics right now runs as a single-node setup. That works out for
now because, as mentioned, Prometheus maintains a WAL. So even if the remote
storage is down for some time, the core functionality of alerts isn't affected.
We've also set up alerts on VM health, so if it goes down, we have to act on it
and get it back up. Once that happens, all the previous data is also ingested
automatically.
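
To illustrate the idea of buffering locally and catching up once the remote
storage is back, here is a toy sketch. It is not how Prometheus implements its
WAL or remote write; `BufferedForwarder` and its in-memory queue are
hypothetical stand-ins for a durable on-disk log.

```python
from collections import deque
from typing import Callable, Deque, Tuple

Sample = Tuple[str, float, float]  # (series name, timestamp, value)


class BufferedForwarder:
    """Toy stand-in for a write-ahead buffer in front of remote storage."""

    def __init__(self, send: Callable[[Sample], None]) -> None:
        self._send = send
        self._pending: Deque[Sample] = deque()

    def append(self, sample: Sample) -> None:
        """Record the sample locally first, then try to drain the backlog."""
        self._pending.append(sample)
        self.flush()

    def flush(self) -> int:
        """Send pending samples in order; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # remote is down; keep the backlog for later
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def backlog(self) -> int:
        """Number of samples not yet accepted by the remote."""
        return len(self._pending)
```

While `send` keeps raising `ConnectionError`, samples pile up in order; the
first successful `flush` after recovery delivers the whole backlog before any
newer sample.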

This is something that we will surely revisit at a later point and set up
proper HA for it if needed. :)

[1]:
[https://github.com/prometheus/alertmanager/blob/master/READM...](https://github.com/prometheus/alertmanager/blob/master/README.md#high-availability)

------
sarangab
I have tried to use the API in Python for algorithmic trading but have
struggled to come up with a simple solution that can do the backtesting and
live components together. Nithin had previously mentioned that this market is
too small for them to spend time on, but are there any leads on building this
from the tech team? A blog post would definitely help.

------
bwood
I've been looking to connect with Zerodha's API, but it doesn't seem to be
available to non-Indian people. Any suggestions on how to get started?

~~~
mr-karan
Ah, regulations permit account opening only for Indian residents.

------
jackhalford
I like Prometheus, but it seems tedious to do an HA setup. Is this solved now?

~~~
mr-karan
So, we maintain a `prometheus_replica` label which carries the pod name of the
Prometheus instance. Since the operator deploys it as a StatefulSet, the pods
are ordered, so the names look like prometheus-k8s-0, prometheus-k8s-1 and so
on. Having these labels helps to deduplicate the metrics at the
ingester/federation level.
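
A rough sketch of what deduplicating on a replica label means, assuming
samples arrive as (labels, timestamp, value) tuples and that replicas scrape at
identical timestamps. Real backends apply their own, more tolerant rules; the
`deduplicate` function here is purely illustrative.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[Dict[str, str], int, float]  # (labels, timestamp_ms, value)


def deduplicate(samples: Iterable[Sample],
                replica_label: str = "prometheus_replica") -> List[Sample]:
    """Keep one sample per (series, timestamp), ignoring the replica label."""
    seen = set()
    result: List[Sample] = []
    for labels, ts, value in samples:
        # Series identity is every label except the one naming the replica.
        identity = tuple(sorted(
            (k, v) for k, v in labels.items() if k != replica_label
        ))
        key = (identity, ts)
        if key in seen:
            continue  # another replica already reported this point
        seen.add(key)
        result.append((dict(identity), ts, value))
    return result
```

The first replica to report a given point wins, and the replica label is
dropped from the output so both pods collapse into a single series.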

Although yes you're right, it was a bit difficult to get it running out of the
box with native Federation capabilities. A secondary long term storage like
Victoria/Cortex/Thanos is something you can check out.

------
KrishnaAnaril
@mr-karan what was the main reason for switching from Angular to Vue.js?

~~~
mr-karan
Angular has been known for breaking its syntax/API across major version
upgrades, and more often than not that leads to a complete rewrite of your app.
We took the decision to rewrite and explored React, Vue etc. Even though Vue
v2.x was still the new kid on the block, it turned out to have a relatively
easier learning curve, its bundle size was far smaller than Angular's or
React's, the docs were amazing, and it was faster in our benchmarks. We made
this decision around 4 years back and are quite happy so far.

