We have tons of collectors, and tons of graphers. What we don't have is a little bit of smarts in those tools: the ability to predict and the ability to react.
Predict. We've had the Holt-Winters forecasting algorithm implemented in RRDTool since 2005, plus a couple of papers.
React. I'm not talking about 'fix it automagically'. But everyone wants to know 'wtf was that peak on this graph last night?'. Usually you never know, except in the simplest cases, because you can't collect everything about everything all the time. But a monitoring system could enable 'collect everything we can' for a short period when it detects something: something wrong or strange, something out of the pattern. Has anybody heard of a system that does something like that?
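The trigger idea can be sketched with a toy detector: keep an exponentially smoothed forecast plus a smoothed deviation band (a much-simplified cousin of RRDTool's Holt-Winters aberrant-behavior detection — the class name, parameters, and thresholds here are all made up for illustration), and flip on verbose collection when a sample leaves the band.

```python
class AnomalyTrigger:
    WARMUP = 5  # don't flag until the model has seen a few samples

    def __init__(self, alpha=0.3, k=3.0):
        self.alpha = alpha    # smoothing factor
        self.k = k            # band width, in smoothed deviations
        self.forecast = None  # smoothed expectation of the next sample
        self.dev = 0.0        # smoothed absolute deviation
        self.n = 0

    def observe(self, x):
        """Return True when x is 'out of the pattern'."""
        self.n += 1
        if self.forecast is None:
            self.forecast = x
            return False
        error = abs(x - self.forecast)
        anomalous = (self.n > self.WARMUP and self.dev > 0
                     and error > self.k * self.dev)
        # Update after deciding, so the spike itself doesn't widen the band.
        self.forecast = self.alpha * x + (1 - self.alpha) * self.forecast
        self.dev = self.alpha * error + (1 - self.alpha) * self.dev
        return anomalous

trigger = AnomalyTrigger()
for sample in [10, 11, 10, 12, 11, 10, 11, 90]:
    if trigger.observe(sample):
        print("out of pattern; switch to 'collect everything' mode")
```

A real implementation would also model trend and seasonality (the full Holt-Winters triple), but the "detect, then collect harder" loop is the same.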
It'll be released in a week or two. In the meantime, I've been speaking about it: http://devslovebacon.com/conferences/bacon-2013/talks/bring-...
The vast majority of those hundreds of thousands of metrics are actually common to any node/server, so predefined recipes/settings should be used for them. Only application-level metrics for in-house applications have to be tuned manually.
If you're interested in seeing our algorithms in action, email me: jennyinc at gmail.com
- Plugin system: the only way to scale development of the solution.
- Lua for plugins: Yes! Language is not important, but not having to stop and restart the application for changes in logic, etc. is essential.
- Routing. Sounds great, can't wait to take a deeper look.
Kudos to devs. Nicely done!
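The hot-reload point above is worth making concrete. Heka runs Lua plugins in sandboxes; as a language-neutral sketch of the same idea (the `handle` entry point and the whole API here are invented), a host process can keep plugin logic as source text and re-evaluate it on change without a restart:

```python
class Plugin:
    """Toy stand-in for a sandboxed script plugin host."""

    def __init__(self, source):
        self.load(source)

    def load(self, source):
        """(Re)compile the plugin; call again whenever the file changes."""
        ns = {}
        exec(source, ns)
        self.handle = ns["handle"]  # required entry point (our convention)

p = Plugin("def handle(event): return event['value'] * 2")
print(p.handle({"value": 21}))  # 42

# Operator edits the logic; the host reloads it, no process restart:
p.load("def handle(event): return event['value'] + 1")
print(p.handle({"value": 21}))  # 22
```

A production system would add sandboxing and error isolation, which is exactly what an embedded language like Lua buys you.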
Alternatives written in scripting languages like Ruby, Python, etc. require runtimes and libraries, and it can get quite complicated to deploy them (especially if the environment has lots of different OS versions, etc.), keep track of all dependencies, and deal with conflicts with other apps that may require a different version of the runtimes and libraries. So for operational reasons it's very appealing to have a distributable binary that works without having to worry about prerequisites, or whether it would impact anything else on the server.
shh can be extended with custom pollers written in Go, but focuses on collecting system-level metrics. log-shuttle is a general-purpose tool for shipping logs over HTTP. l2met receives logs over HTTP and can be extended with custom outlets written in Go, but requires log statements in a specific format ("measure.db.latency=20" or "measure=db.latency val=20").
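For illustration only (this is not l2met's actual code, just a sketch of the two line shapes quoted above), a parser for those measurement statements might look like:

```python
def parse_measure(line):
    """Return (metric_name, value) or None if the line has no measurement.

    Handles both shapes quoted above:
      "measure.db.latency=20"  and  "measure=db.latency val=20"
    """
    fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
    if "measure" in fields and "val" in fields:
        return fields["measure"], float(fields["val"])
    for key, value in fields.items():
        if key.startswith("measure."):
            return key[len("measure."):], float(value)
    return None

print(parse_measure("measure=db.latency val=20"))  # ('db.latency', 20.0)
print(parse_measure("measure.db.latency=20"))      # ('db.latency', 20.0)
```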
It's great to see so many new tools in this space. Previously I had a bunch of one-off "carbonize" scripts running out of cron, each collecting a specific kind of metric and sending it to Graphite or statsd. This worked OK but required quite a bit of code to get things done. Heka's plugin system looks like a nice way to structure things.
I'm curious if Mozilla is using these two tools in combination internally, and what that architecture looks like.
But Hekad has plugins for all of the above (except Go, though I'm sure it's possible): http://heka-docs.readthedocs.org/en/latest/architecture/inde...
Well I guess we'll now be evaluating whether Hekad looks like it might be a more promising fit.
I particularly like the bullet points on aggregation counters, filters and transformations. We'll have to see how they work in practice, though. The docs are very pretty, but as is usual with early releases, it's a little difficult to picture the whole and how it will actually work in practice from the soup of detail that Sphinx spits out.
If you're not familiar with collectd's capabilities, you can get a quick overview of the official plugins at http://git.verplant.org/?p=collectd.git;a=blob;hb=master;f=R...
And a whole host of other proprietary transports. So it's cool and looks awesome, but what does it give me that the entirety of other monitoring protocols doesn't?
There's a bunch of things going on on your boxes (logs, jmx, syslog, etc), and you want to get them out in a useful unified format. You have to do some ugly things (e.g. parse rails logs for latencies), and then emit the data, preferably in some structured format that knows that render=17ms is a duration so that you can graph it.
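As a sketch of that ugly-parsing step (the log line and field names below are a made-up Rails-style example), pulling `render=17ms` out of a line and tagging it as a duration might look like:

```python
import re

# Match key=<number>ms pairs so the value carries its unit downstream
# and a grapher knows it's a duration, not a bare counter.
DURATION_RE = re.compile(r"(\w+)=(\d+(?:\.\d+)?)ms")

def extract_durations(line):
    return {key: {"value": float(val), "unit": "ms"}
            for key, val in DURATION_RE.findall(line)}

line = "Completed 200 OK in 25ms (render=17ms db=5ms)"
print(extract_durations(line))
# {'render': {'value': 17.0, 'unit': 'ms'}, 'db': {'value': 5.0, 'unit': 'ms'}}
```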
They chose their own transport to speak between Heka nodes because it maps perfectly to their internal representation, but it looks like they're willing to speak any of the protocols you mentioned to the outside world. It's useful to do a limited amount of munging inside the Heka system before sending the data to logstash, Graphite, etc., so it looks like they spent quite a bit of time building a framework for that initial work, letting you move it as close to the edges as you'd like.
To me, the transport and/or protocol isn't interesting, it's that you have a flexible, lightweight agent that's also capable of doing pre-processing and rollups.
This comes from a couple of things.
Go compiles to a single static binary, so you don't have to worry about having dozens of "the right" libraries installed on your machine. Grab the heka binary and run with it.
This greatly eases our operations work as we have fewer dependency conflicts to deal with when we push things to production.
You just copy them to some directory, run them and they work.
And some of them even support building native binaries (e.g. Java through GCC).
Go compiles to native code. Not only do you not need a preinstalled Go runtime on a target system, but there's very little advantage to even having one. The normal way of installing a Golang program is simply to copy the binary and run it. That's powerfully simpler than most other modern programming languages, with the obvious exception(s) of C/C++/ObjC.
† Commenter downthread says the same thing, but let me add that we look at other people's Python/Java/Ruby programs professionally, and I can't recall a single client ever doing anything like this.
The "virtually nobody" is because the main use cases for Python and Java are as server-side languages (both) and scripting languages (Python). In those cases people are expected to have, or to set up, the appropriate runtime beforehand.
But for people who want to ship apps to end users (customers and consumers) with Java and Python, the bundling thing is very very common.
People using them in the end-user space regularly do it this exact way. For most of them, you don't even get to know what they use underneath.
- Dropbox (uses and bundles Python in the app).
- Vuze torrent client (previously Azureus and very popular in its prime) bundles a JRE (for when you don't have an installed one).
- LightTable is just a JS runtime bundled with Webkit as a standalone app.
On that note, Python packaging/deployment/repeatability is still a disaster. If you have code with dependencies on compiled C extensions, there are few good ways to deal with this in prod.
In summary, I think a lot of us find the idea of monolithic binaries appealing (perhaps even to an irrational degree, speaking for myself) because of issues suffered in the past. :)
On Linux there's a large gray area for things like libexpat.so.1 that may or may not be linked dynamically. But libc is LGPL, so I expect it too would be linked dynamically.
We started by extending logstash, but our needs were more "we need a router" and logstash isn't meant to be a router.
Statically linking the world isn't trivial. For our existing Python code bases - how are you going to deal with third party libraries from PyPI?
Come by on #heka on irc.mozilla.org, we're kicking around in there.
Depends on what you want.
You could freeze the pip-requires to always install the same versions and use a virtualenv per application. This is basically the same as bundling everything together; it has all the benefits with the least amount of work.
You could use distribution packages for security, correctness and stability or even roll your own repository inside your infrastructure to absolutely control everything.
Finally you could just bundle everything manually by fooling around with the PYTHONPATH and putting all the dependencies in a single directory. This is kind of like improvising your own virtualenv, it's very hacky, but it can work.
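That last approach can be sketched in a few lines (the `vendor/` directory name and layout are hypothetical, and a real virtualenv is still the cleaner option):

```python
import os
import sys

def vendorize(app_dir):
    """Prepend app_dir/vendor to sys.path so the bundled copies of
    third-party libraries win over anything installed system-wide."""
    vendor = os.path.join(app_dir, "vendor")
    if vendor not in sys.path:
        sys.path.insert(0, vendor)
    return vendor

# e.g. at the very top of the application's entry point, before any
# third-party imports:
vendorize(os.path.dirname(os.path.abspath(sys.argv[0])))
```

From there on, `import requests` (say) resolves against `vendor/` first, which is exactly the isolation a virtualenv gives you, minus the tooling.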
It really bugs me that I have to have a Python interpreter on the frontend web machines (because I'd prefer not to have a C compiler there).
You don't need to have a C compiler installed for the Python interpreter to work. I hope you're joking...
I had a feeling. We may hope your code is hella tight...
Which is what syslog can't do.
Of course, both syslog and collectd have been around and battle-hardened for many(!) years, whereas this first Heka release is being called "0.2-beta-1" for a reason. I wouldn't go rush into replacing any mission-critical infrastructure just yet. ;)
Nobody ever claimed that Rust is "there yet".
The core Rust developers all say that Rust is still in flux, and that a stable version is still many months in the future, possibly 2014. And they advise not to use it in production.
I believe that's about it, though.