My team is Observability at Twitter; we work on monitoring, and we’re looking for distributed systems engineers and full-stack engineers. We run one of the largest monitoring stacks in the industry, writing up to 15 million metrics per second for all production services at Twitter. We also maintain a front-end service used every day by most engineers at Twitter. We write our services in Scala, use a state-of-the-art Cassandra-like database called Manhattan, and if you join, you’ll get to work on challenging problems from day one.
Here are some of the things we’ve done in the past 12 months:
- Made our alerting execution service seamlessly fail over across datacenters
- Implemented a temporal set membership service for our database to keep track of metric groupings
- Added tiering policies for metrics based on their automatically-derived significance
- Added hybrid online/offline processing of data for different use cases
- Optimized the time-series query language to make reads more efficient
- Built an asynchronous query processor to support expensive queries with looser latency requirements
- Wrote a client-side agent that collects and reports metrics to the storage system
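To give a flavor of the last item, a client-side metrics agent boils down to in-process counters plus a periodic flush to the storage system. Here is a minimal sketch under assumptions of my own; the names (`MetricsAgent`, `report`, `incr`) are hypothetical and not Twitter's actual API:

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import java.util.concurrent.atomic.LongAdder
import scala.jdk.CollectionConverters._

// Hypothetical sketch of a client-side metrics agent: counters are
// incremented in-process and periodically flushed to a reporter that
// ships them to the storage backend.
class MetricsAgent(report: Map[String, Long] => Unit, flushSeconds: Long = 60) {
  private val counters  = new ConcurrentHashMap[String, LongAdder]()
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // LongAdder keeps increments cheap under heavy write contention.
  def incr(name: String, delta: Long = 1): Unit =
    counters.computeIfAbsent(name, _ => new LongAdder()).add(delta)

  // Snapshot the current counter values and hand them to the reporter.
  def flush(): Unit = {
    val snapshot = counters.asScala.map { case (k, v) => k -> v.sum() }.toMap
    report(snapshot)
  }

  def start(): Unit =
    scheduler.scheduleAtFixedRate(() => flush(), flushSeconds, flushSeconds, TimeUnit.SECONDS)

  def stop(): Unit = scheduler.shutdown()
}
```

A real agent would also handle tagging, aggregation windows, and backpressure toward the storage tier, but the shape — cheap local counters, snapshot, report — is the core of it.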
That sounds like you have something else wrong. With ECC memory on server hardware, we've seen zero checksum errors in the last six months, and I've seen only two ever. A typical server has 136 TB of raw HDD space, of which we get about 71 TiB usable, and it's about 80% full.