
Stripe’s Veneur: A distributed, fault-tolerant pipeline for observability data - federicoponzi
https://github.com/stripe/veneur
======
chimeracoder
This was a pleasant surprise to see on Hacker News this morning! I work on the
Observability team at Stripe and have been the PM for Veneur (and the rest of
our metrics & tracing pipeline work) pretty much since we released it ~2 years
ago.

If you're interested in learning more about how Veneur works and why we built
it, I gave a talk at Monitorama last year that explains the philosophy behind
Veneur[0]. In short, a massive company like Google is able to build its own
integrated observability stack in-house, but almost any other, smaller company
is going to rely on an array of open-source tools or third-party vendors
for different parts of its observability tooling[1]. When using different
tools, there are always going to be gaps between them, which leads to
incomplete instrumentation and awkward (inter-)operability. By taking control
of the pipeline that processes the data, we're able to provide fully
integrated views into different aspects of our observability data.

The Monitorama talk is a year old at this point, so it doesn't cover some of
the newer things Veneur has helped us to accomplish, but the core philosophy
hasn't changed. I've given updated versions of the talk more recently at
CraftConf (in May) and DevOpsDaysMSP (last week), but neither of those videos
is online yet.

[0] [https://vimeo.com/221049715](https://vimeo.com/221049715)

[1] e.g. ELK/Papertrail/Splunk for logs, Graphite/Datadog/SignalFx for
metrics, and maybe a third tool for tracing if you're lucky.

------
tchaffee
Am I the only one who is always slightly disappointed that neither the README
file on Github nor the landing page at the website tells me why I would want
to use the software in question? What problem it solves? Why might "a
distributed, fault-tolerant observability pipeline" be interesting to
programmers or anyone else? It seems like you've already got to be familiar
with the problem space to understand what this is and what need it fulfills.

I'm not picking on this package. I see it all the time.

Can someone here explain to me what the use case is for this software?

~~~
mrkurt
It's not always possible to describe a problem in a way you will understand if
you aren't already familiar with it. And that's ok! Reading something like
"distributed, fault-tolerant pipeline for observability data", shrugging
¯\_(ツ)_/¯, and moving on can be a good response, both for you _and_ for the
people who built the thing. It's definitely ok that you have to dig a little
more to wrap your head around it.

In short, it's reasonable to expect that people who see a project already
understand the problem space, and to write for the ones who can say "yep, I
need this".

This particular project is probably only useful to people who know what
observability data means.

~~~
aaronbrethorst
Even though Albert Einstein seemingly didn't actually say this, I still think
this aphorism is appropriate:

_If you can't explain it simply, you don't understand it well enough._

~~~
mrkurt
Is it? Because there's an awful lot of base knowledge you need to understand
any kind of OSS project. Can you explain what nginx does to someone who
doesn't have a smart phone in any kind of meaningful way?

~~~
aaronbrethorst
I have fifteen years of professional experience in software development and I
didn't have the slightest clue what Veneur did from its description of "A
distributed, fault-tolerant pipeline for observability data."

- Distributed - Got this one, but it's an adjective and doesn't mean much on
its own.

- Fault-tolerant - Same.

- Pipeline - OK, great, totally meaningless noun. I guess stuff goes in it?

- Observability data - What the heck is observability data?

Nginx is a web server, plus a heck of a lot more. Their website does a
terrific job of explaining it.
[https://www.nginx.com/resources/glossary/nginx/](https://www.nginx.com/resources/glossary/nginx/)

~~~
mrkurt
The scope of software development is so broad I think you could spend 100
years doing it and still not understand everything.

For example, _I_ know what observability data is, but I'd have a difficult
time explaining the problem Redux tackles. If you've spent most of your time
building user facing apps + web apps, how would you immediately understand a
problem that someone working on large scale payment infrastructure has to
solve?

~~~
detaro
> _but I'd have a difficult time explaining the problem Redux tackles._

Do you think you couldn't _understand_ a short description of the problem it
tackles? Because if you could, then a short description in a readme would be
valuable to you, and presumably the developers of the thing _are_ able to
explain the problem.

------
roskilli
It’s definitely interesting to see the different systems being built for
monitoring across the different tech co’s.

M3 aggregator, Uber’s metrics aggregation tier, is similar, except that it has
built-in replication and leader election on top of etcd to avoid any SPOF
during deployments, failed instances, etc. It also uses the
Cormode-Muthukrishnan algorithm for estimating percentiles by default, with
support for T-Digest too. That said, these days submitting histogram bucket
aggregates all the way from the client to the aggregator and then to storage
is more popular, as you can estimate percentiles across more dimensions and
time windows at query time quite cheaply. You need to choose your buckets
carefully, though.

It too is open source, but needs some help to make it plug into other stacks
more easily:
[https://github.com/m3db/m3aggregator](https://github.com/m3db/m3aggregator)

------
dswalter
It always makes me happy to see approximate algorithms/data structures like
hyperloglog being used.

~~~
ambicapter
"probabilistic" is probably the word you're looking for, and yes I agree, I'm
fascinated by the idea of trading off a little bit of accuracy for massive
performance gains.

~~~
dswalter
That's a good term. I'm sure you've found this collection of links on
'streaming algorithms'. It's a gold mine of resources in this space:
[https://gist.github.com/debasishg/8172796](https://gist.github.com/debasishg/8172796)

------
ebikelaw
When I'm evaluating a system like this, what I want to read about is how it is
hardened against client stupidity. For example, someone deploys an application
in my datacenter and it emits metrics that have gibberish in their names
(consider a common Java bug where a class lacks a toString, so the metric gets
barfed out as foo.bar.0xCAFEBABE.baz). How does the system cope with this
enormous, hyper-dimensional input?

------
noncoml
Why is Go so popular in the industry at the moment? What's the decision
process for choosing Go?

~~~
ebikelaw
I doubt anyone can answer this for you, but why shouldn't it be? It is a very
sensible language and toolchain. When writing source, it is easy to write
tests and testable code and to run the tests as part of your build. At build
time, it's fast and it produces easy-to-deploy statically-linked applications.
At runtime, it's pretty fast and compact... compared to Python, Go is most of
the way to C++-level performance.

~~~
dozzie
> I doubt anyone can answer this for you, but why shouldn't it be [popular in
> the industry at the moment]?

Its build system is poor if you want to rebuild the binary from the same
sources (you can't precompile the libraries used), and while a statically
linked binary may be nice for a one-off deployment ("fire and forget" mode),
for repeated deployments and multiple versions running at the same time it
reminds me of the Chinese torture "death by a thousand cuts" in terms of what
you can't do with such a binary (dozens of small things that are hard to
remember, each on its own not enough to drive you away from static linking,
but boy, do they add up).

~~~
ebikelaw
Can you remember even one? I'm interested. I've used Go at Extremely Large
Scale (tm) and never thought it was terribly troublesome.

~~~
dozzie
Sure I can. Checking which libraries the binary uses ("why does this cURL fail
on HTTPS? oh, it's linked against GnuTLS, that explains everything"),
injecting your own library, intercepting a function (maybe a syscall, or
rather its libc wrapper), tracing a function's execution (ltrace). Each of
these things is merely an annoyance if you used to have it and now you don't,
and it's hard to remember them all, but there are a lot of them.

And then there's also sharing memory between different processes that use the
same library. You don't have that for a statically compiled binary.

~~~
ebikelaw
All those things sound like antipatterns to me. If your application is
supposed to fetch HTTPS pages then there should be an integration test for
that, so you would never have to debug it in production. Having shared objects
on the machine actually makes this impossible because your tests are running
with different libraries than in production. Shared libraries on a machine are
a non-hermetic input to your build and are to be avoided. In addition, runtime
shared objects (especially of something performance-critical like crypto)
inhibit all of the most important compiler optimizations like inlining. The
savings from sharing text segments are small in my experience. As for ltrace,
there's a million ways to trace function calls these days, like uprobe or
perf.

~~~
dozzie
> All those things sound like antipatterns to me. If your application is
> supposed to fetch HTTPS pages then there should be an integration test for
> that [...]

It's not something to regularly rely on, but something that helps in debugging
and troubleshooting. Not for a programmer, but for a sysadmin.

> Having shared objects on the machine actually makes this impossible because
> your tests are running with different libraries than in production.

In such a case your deployment process is broken. And if your testing and
production environments differ in this respect, they differ enough to bite
you even with your statically linked binary.

> Shared libraries on a machine are a non-hermetic input to your build and are
> to be avoided.

This is merely stating a generic opinion. I want to see a concrete, coherent,
technical argument supporting this.

> In addition, runtime shared objects (especially of something performance-
> critical like crypto) inhibit all of the most important compiler
> optimizations like inlining.

Crypto, especially, should not be called in a tight loop, but passed a large
chunk of data. Otherwise you inhibit all of the most important defences
against side-channel attacks, and I guarantee that you are not competent
enough to defend against those on your own.

> As for ltrace, there's a million ways to trace function calls these days,
> like uprobe or perf.

So let's break one of them for no good reason?

And still, the lack of any one of the mentioned things is merely an annoyance
when you hit it, but as I said, they are numerous and they add up, while the
other option, static linking, provides little benefit apart from supporting
broken workflows (like differing environments in testing and production).

------
pinko
Know of anyone using this in production outside Stripe?

~~~
galaktor
Intercom use it heavily in production.

Source: am engineer at Intercom.

~~~
chimeracoder
> Intercom use it heavily in production. Source: am engineer at Intercom.

This is awesome to hear! Are you referring to this Intercom?
[https://www.intercom.com/](https://www.intercom.com/)

~~~
galaktor
Yes, that's the one.

------
madspindel
It's from 2016: [https://stripe.com/blog/introducing-veneur-high-performance-and-global-aggregation-for-datadog](https://stripe.com/blog/introducing-veneur-high-performance-and-global-aggregation-for-datadog)

~~~
chimeracoder
> It's from 2016

2016 was the first public release, but the project has grown a lot in that
time. You can take a look at the changelog to see what's new, ever since we
switched to a six-week release cycle last year:
[https://github.com/stripe/veneur/blob/master/CHANGELOG.md](https://github.com/stripe/veneur/blob/master/CHANGELOG.md)

Source: I work on the Observability team at Stripe and I am the PM for Veneur.

------
amelius
What do they mean by "observability data"?

Is this a fancy way of saying "privacy-sensitive user data"?

~~~
Jarred
More like diagnostic stats about servers - memory usage, CPU usage and many
more things like that.
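
As a concrete example of what such a pipeline ingests, here is a sketch in Go
that emits a couple of statsd-style metrics over UDP. Veneur speaks the
DogStatsD wire format, but the address, port, metric names, and the
`statsdLine` helper here are assumptions for illustration; check your own
config for the actual listen address.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// statsdLine builds one metric in the DogStatsD wire format:
// name:value|type|#tag1,tag2
func statsdLine(name string, value float64, typ string, tags []string) string {
	line := fmt.Sprintf("%s:%g|%s", name, value, typ)
	if len(tags) > 0 {
		line += "|#" + strings.Join(tags, ",")
	}
	return line
}

func main() {
	// Hypothetical local agent address; point it at wherever your
	// Veneur (or other statsd) instance listens.
	conn, err := net.Dial("udp", "127.0.0.1:8126")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	// A counter and a timer: the kind of diagnostic stats meant above.
	fmt.Fprintln(conn, statsdLine("api.requests", 1, "c", []string{"endpoint:charge"}))
	fmt.Fprintln(conn, statsdLine("api.latency_ms", 42, "ms", []string{"endpoint:charge"}))
}
```

Because the transport is fire-and-forget UDP, instrumented applications stay
decoupled from the pipeline that aggregates and forwards the data.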

