I was already quite convinced VictoriaMetrics is superior to its competitors after reading the author's comprehensive technical write-ups and benchmarks a while back.
If you haven't already, you may want to check out M3 if you're thinking of scaling to a cluster and doing zero-downtime upgrades. You can scale out horizontally with the Kubernetes operator, and it avoids downtime because it replicates data at write time and does quorum-based, masterless reads.
Ah, alright. Haven't looked at M3DB yet, will check it out. We have a lot of non-K8s components too, so we wanted a solution outside the operator, but yeah, I get the idea behind it.
>"GitLab CI pipelines lint and validate the configurations and then upload them to an S3 bucket. There’s a sync server on the Alertmanager cluster to check for new config and automatically reload Alertmanager in case of any config updates."
I would be curious whether this syncing is being done by something custom they wrote, or is it something built into Alertmanager now? I had a look at the latest Alertmanager docs and nothing jumped out at me regarding this.
Yes, we basically have a short-lived periodic job on our Alertmanager server which does a `sync` from the S3 bucket that stores the configs. If something has changed, it just triggers a `docker restart alertmanager`.
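For illustration, a minimal sketch of what such a job could look like (bucket name and paths here are placeholders, not our actual setup):

```bash
#!/usr/bin/env bash
# Rough sketch of the periodic sync job -- bucket and config path are placeholders.
set -euo pipefail

BUCKET="s3://example-alertmanager-configs"
CONF_DIR="/etc/alertmanager"

# `aws s3 sync` prints a line for every file it copies, so empty output
# means nothing changed and we can skip the restart.
changes="$(aws s3 sync "$BUCKET" "$CONF_DIR")"

if [ -n "$changes" ]; then
  docker restart alertmanager
fi
```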
Alertmanager doesn't have any built-in solution for this, although if you run Alertmanager via the Prometheus Operator, you do get automatic config reloads. We decided to keep our AM cluster independent and out of K8s since, as we pointed out, we have a hybrid environment.
Alertmanager can run in a clustered mode with the `--cluster*` flags; you can read more here[1]. So basically multiple nodes run together, and all Prometheus instances are configured to send all alerts to all AM instances. The alerts are then deduplicated at the cluster level.
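To give a rough idea, each node is started with its peers listed, something along these lines (hostnames are placeholders):

```bash
# Run on each Alertmanager node, pointing --cluster.peer at the other members.
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am-1.example.internal:9094 \
  --cluster.peer=am-2.example.internal:9094
```

On the Prometheus side, every instance then lists all the AM nodes as alerting targets, so each alert reaches every node and the cluster handles the dedup.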
VictoriaMetrics right now runs as a single-node setup. That works out for now because, as mentioned, Prometheus maintains a WAL. So even if the remote storage is down for some time, the core functionality of alerting isn't affected. We've also set up alerts on VM health, so if it goes down, we have to act on it and get it back up. Once that happens, all the previous data is ingested automatically as well.
This is something we would surely revisit at a later point in time, and set up proper HA for it if needed. :)
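For context, the Prometheus side of this is just a `remote_write` block pointing at VM (hostname below is a placeholder; 8428 is the default single-node port):

```yaml
# prometheus.yml -- illustrative only, hostname is a placeholder
remote_write:
  - url: http://victoriametrics.example.internal:8428/api/v1/write
```

If that endpoint is unreachable, Prometheus keeps retrying from its WAL, which is why a short VM outage only delays ingestion rather than losing data.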
I have tried to use the API in Python for algorithmic trading but have struggled to come up with a simple solution that can do backtesting and the live component together. Nithin had previously mentioned that this market is too small for them to spend time on, but any leads on building this from the tech team? A blog post would definitely help.
So, we maintain a label `prometheus_replica` which carries the pod name of the Prometheus instance. Since the operator deploys it as a StatefulSet, the pods are ordered, so the naming is prometheus-k8s-0, -1 and so on. Having these labels helps deduplicate the metrics at the ingester/federation level.
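Roughly, the external labels the Operator ends up injecting on each replica look like this (values are illustrative; the second replica would get prometheus-k8s-1, and so on):

```yaml
# Illustrative prometheus.yml fragment generated by the Operator
global:
  external_labels:
    prometheus: monitoring/k8s
    prometheus_replica: prometheus-k8s-0
```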
Although yes, you're right, it was a bit difficult to get it running out of the box with native federation capabilities. A secondary long-term storage like VictoriaMetrics/Cortex/Thanos is something you can check out.
Angular has been known for breaking its syntax/API across major version upgrades, and more often than not that leads to a complete rewrite of your app. We took the decision to rewrite and explored React, Vue, etc. Even though Vue 2.x was still the new kid on the block, it turned out to have a relatively easier learning curve, its bundle size was far smaller than Angular's or React's, the docs were amazing, and it was faster in our benchmarks. So we made this decision around 4 years back and are quite happy so far.