Curious to know if any TSDBs (time-series databases) were evaluated before finalizing ...

jrockway · on Oct 21, 2020

I don't think time is a particularly strong keying methodology for distributed tracing. You want to be able to quickly select all spans that have a certain trace ID. The time that the span was ingested is relatively unimportant to the querying mechanics, and you don't do any operations like "combine samples into a 1h average after 5d".
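To make the access pattern concrete, here is a minimal sketch of a store keyed on trace ID rather than on time. The `Span` fields and the `SpanStore` class are hypothetical names for illustration only, not any real tracing backend's schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Hypothetical minimal span record; real systems carry many more fields.
    trace_id: str
    span_id: str
    name: str
    start_us: int      # start timestamp in microseconds
    duration_us: int


class SpanStore:
    """Toy in-memory store whose primary key is the trace ID, not time."""

    def __init__(self):
        self._by_trace = defaultdict(list)

    def ingest(self, span):
        # Ingestion order and ingestion time play no role in the key.
        self._by_trace[span.trace_id].append(span)

    def spans_for_trace(self, trace_id):
        # The core query: every span of one trace, ordered for display.
        spans = self._by_trace.get(trace_id, [])
        return sorted(spans, key=lambda s: s.start_us)
```

The timestamp is only used to order spans within a trace once they have been fetched; the lookup itself never touches it.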

jeffbee · on Oct 21, 2020

When I was at Dropbox I did do this ... we only retained traces for a very short time, so I did daily aggregations of traces into summary data structures that could attribute service time to various span types at different depths of the request etc. Don't ignore the value of aggregating and summarizing trace data. Of course, a TSDB is also useless for this, I agree there.
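A rough sketch of that kind of summary, assuming each span is a dict with `trace_id`, `span_id`, `parent_id`, `name`, and `duration_us` keys (all hypothetical field names). It attributes total span duration to each (span name, depth) pair; this is only an illustration of the idea, not the structure actually used at Dropbox.

```python
from collections import defaultdict


def summarize(spans):
    """Aggregate span durations by (span name, depth in the request tree)."""
    spans = list(spans)
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}

    def depth(span):
        # Walk parent links up to the root; stop if a parent is missing
        # or if a malformed trace contains a cycle.
        d = 0
        seen = {span["span_id"]}
        parent_id = span["parent_id"]
        while parent_id is not None and parent_id not in seen:
            parent = by_id.get((span["trace_id"], parent_id))
            if parent is None:
                break
            seen.add(parent_id)
            d += 1
            parent_id = parent["parent_id"]
        return d

    summary = defaultdict(lambda: {"count": 0, "total_us": 0})
    for span in spans:
        bucket = summary[(span["name"], depth(span))]
        bucket["count"] += 1
        bucket["total_us"] += span["duration_us"]
    return dict(summary)
```

Once a day's traces are reduced to a table like this, the raw spans can be discarded and the summary kept much longer.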

jrockway · on Oct 21, 2020

Yeah, using the trace data to get RPC latency and success/error charts is useful, but that's just another branch off the data stream. Your metrics system needs to inspect every trace, so your database only needs the ability to send you a list.

(Jaeger seems to have Kafka support so you can do this in real time. Haven't tried it.)
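The "another branch off the data stream" idea could look roughly like the sketch below: every span goes to each sink, one of which is storage and another a metrics aggregator. `RpcMetrics`, `fan_out`, and the span dict keys are hypothetical; a plain Python iterable stands in for whatever actually carries the stream (Kafka or otherwise).

```python
from collections import defaultdict


class RpcMetrics:
    """Running latency and error counters per RPC name."""

    def __init__(self):
        self.count = defaultdict(int)
        self.errors = defaultdict(int)
        self.total_us = defaultdict(int)

    def observe(self, span):
        name = span["name"]
        self.count[name] += 1
        self.total_us[name] += span["duration_us"]
        if span.get("error"):
            self.errors[name] += 1

    def error_rate(self, name):
        n = self.count[name]
        return self.errors[name] / n if n else 0.0

    def mean_latency_us(self, name):
        n = self.count[name]
        return self.total_us[name] / n if n else 0.0


def fan_out(span_stream, sinks):
    # Each sink is any callable taking one span: a storage writer,
    # a metrics observer, and so on.
    for span in span_stream:
        for sink in sinks:
            sink(span)
```

Because the metrics sink sees every span as it passes, the trace database itself never has to answer aggregate queries.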

user5994461 · on Oct 21, 2020

I don't think there was any free TSDB available at the time. The project is maybe 3 years old.

InfluxDB is more recent and quite limited. Things like sharding are not supported, and they stated it would never be supported except in a paid edition when they make one.

Prometheus is more recent. Similar story with scaling. They changed storage formats and rewrote it once or twice in the past few years; it's moving really fast. It's more of a standalone product for server metrics (node exporter + Prometheus + Grafana), and I wouldn't recommend using it as a general-purpose database.

mnsmns · on Oct 21, 2020

As of today, sharding is supported in InfluxDB. Prometheus is not suitable for traces; it is more for metrics. The problem I see with all these DBs is that they are well suited only to specific applications, like Prometheus for metrics and Influx for IoT. Timescale is still in its nascent stages.

ecnahc515 · on Oct 21, 2020

For what it's worth, Jaeger, one of the most popular FOSS tracing servers, uses ES/Cassandra for storing its traces.

Thaxll · on Oct 21, 2020

Also Netflix is a Java shop so anything JVM based makes sense for them.

geodel · on Oct 21, 2020

I feel like all the Java design pattern/framework/architect experts are located between LinkedIn, Twitter, and Netflix. A pleasant surprise to me is that Oracle seems to have gotten rid of such products and people.