StackStorm – IFTTT for Ops

amoghe · on Oct 6, 2015

Orchestration tools (puppet/chef) are already able to get your infrastructure to a target/desired state, keep it in that state, and notify you when deviations occur or the target state cannot be achieved. What does StackStorm do that these tools cannot?

lotyrin · on Oct 6, 2015

This is for the guy who gets a support call because of a disk getting full, and has to ssh into a box and delete old log files because logrotate is fubared by some other team but he finds out that there's a nagios monitoring everything (with yet another team ignoring the noise) so he wants to just have his bash oneliner for deleting old log files run any time the disk monitor hits critical, and all he has to do is sell someone with a purchase card a SaaS app (easy), and doesn't have to sell his entire organization the concept of not being fuckups (hard).

It's sad how big the market for this is.
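The remediation described above (delete old log files whenever the disk monitor goes critical) is small enough to sketch. This is only an illustration in plain Python, not anything StackStorm or Nagios ships: the function names, the 90% threshold, the 7-day cutoff and the `*.log` pattern are all assumptions.

```python
import shutil
import time
from pathlib import Path


def disk_is_critical(path, threshold=0.90):
    """Return True when the filesystem holding `path` is at least `threshold` full."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold


def delete_old_logs(log_dir, max_age_days=7, now=None):
    """Delete *.log files in `log_dir` older than `max_age_days`; return their paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in sorted(Path(log_dir).glob("*.log")):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(str(entry))
    return removed


def remediate(log_dir, threshold=0.90, max_age_days=7):
    """Run the cleanup only when the disk check reports critical."""
    if disk_is_critical(log_dir, threshold):
        return delete_old_logs(log_dir, max_age_days)
    return []
```

In the scenario from the comment, the monitoring alert would call something like `remediate("/var/log/myapp")` (a hypothetical path) instead of paging a human.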

bigdubs · on Oct 6, 2015

I don't like being the voice of the purist in this, but this seems like a bandaid on a bullet wound.

For most of the cases where this would seem to be useful there is probably a failure upstream of that usage that should really be fixed.

doriftoshoes · on Oct 6, 2015

Giant disclaimer: I work at StackStorm...but I also have an extensive Ops background.

This is really the next step in runbook automation. It gives users a way to express procedures and operational patterns in code (the workflow definitions are in yaml). With any sort of automated remediation there is always the concern of "painting over the mold" but at the same time you don't want to get stuck doing a large number of manual steps when you could be focusing your energy on tracking down the root cause of the issue and resolving that. The more important aspect to me personally is the ease of version controlling these ops patterns. Store the workflow definitions in source control and it is easy to diff the changes in your procedure.

mst · on Oct 6, 2015

It seems YAML is the new S-expressions for people trying to pretend they didn't actually invent a programming language.

This is, I guess, less annoying than executable XML.

doriftoshoes · on Oct 6, 2015

YAML is definitely the hotness right now but it works. No need to invent a full blown language for something like this. Way easier than trying to write json.

eropple · on Oct 6, 2015

Or you can use Ruby or Python and Perl and have a scripting language that looks and acts as a scripting language is generally expected to look and act. (And with Ruby in particular it's very easy to provide a flexible and terse DSL that provides significant benefits on its own.)

This "invent your own worse programming language" fad is disappointing.

crdoconnor · on Oct 7, 2015

>Or you can use Ruby or Python and Perl

Then you get turing completeness, and turing completeness harms readability and makes your language a magnet for technical debt.

This is why most of us these days use an intentionally dumb language to template HTML (another non-turing complete language). The alternative was a god-awful fucking mess (remember PHP without a framework/templating language?).

This is what Tim Berners Lee alluded to with this: https://en.wikipedia.org/wiki/Rule_of_least_power

Here's my example (yes, I'm one of those people):

http://hitchtest.com/

I don't think you could write a cleaner, more readable parameterized test case in python.

>This "invent your own worse programming language" fad is disappointing.

There's a lot of disasters out there for sure (in the testing world as well =), but I'm pretty happy to see custom YAML-based declarative languages catching on.

I think Ansible states are, likewise, way easier to deal with than the equivalent in python would end up being.

smw · on Oct 7, 2015

You're making the same exact mistake that the guy who decided ant should use xml as a programming language did.

As soon as you need loops, conditionals, subroutines -- which are all very common when writing tests -- you're making up your own language with horrible warts.

crdoconnor · on Oct 8, 2015

>You're making the same exact mistake that the guy who decided ant should use xml as a programming language did.

Ant was a badly done turing complete language. I did not create a turing complete language; I created a very, very dumb declarative language with no functions or control structures - only data.

>As soon as you need loops, conditionals, subroutines -- which are all very common when writing tests

I already have loops and conditionals via jinja2 (a templating language I did not create) on the high level and in python on the step level.

I am not implementing, and never will implement, any control structures in YAML (a la ant). The YAML will always remain dumb to help maintain a strict separation of concerns and test readability.

>you're making up your own language with horrible warts.

Warts such as what?

smw · on Oct 9, 2015

https://github.com/saltstack-formulas/mysql-formula/blob/mas...

Down this path lies madness. Now the user has to deal with the difference between YAML values and Jinja values. Can't really reference YAML data set in other files, or other places in this file.

It's a tangled mess. Please don't do this to your users.

Maybe use tcl? Guile? Some real language that lends itself to making a clean api/dsl for what you're trying to do.

crdoconnor · on Oct 9, 2015

I agree that YAML block style as on lines 22, 23, 24 should be eliminated (I will probably prevent my framework from parsing this). If you use an unescaped { or } it should signify that you are using Jinja2.

Similarly, I'm no fan of the {% sets %} at the top - it's a code smell.

Apart from those things, though, what you linked to seems easy to read and understand to me.

I'm absolutely positive that Tcl or Guile (or python) would create the potential for bigger messes than what you just linked to under similar circumstances. Simply being turing complete is enough for that.

eropple · on Oct 8, 2015

Such as using a text templating language to define control flow. Literally everything you describe is why you shouldn't do that. =(

crdoconnor · on Oct 8, 2015

The templating language isn't and can't be used to define test control flow.

It's only used to generate test cases - e.g. 7 login scenario tests on 7 different browsers. Or, as on my website, two virtually identical tests running on two different versions of python.

It isn't necessary to write test cases using jinja2 either. YAML on its own is enough.

mst · on Oct 6, 2015

For plain config, I mostly use JSONY now.

For executable declarationish things, I like Tcl (usually embedded in perl so I can use Moo(se) for OO rather than the inferior crap in other languages).

Hashicorp's HCL is an interesting middle ground.

eropple · on Oct 7, 2015

I can respect Tcl. I'm not super familiar with it aside from hacking on eggdrop, but it's in the same ballpark. For configuration, I just source a shell script and grab env vars wherever I can. It's the most portable (between applications, not necessarily between platforms) option available to me and plays nicely with stuff like jails. At a glance, I don't quite understand the value prop of JSONY over YAML, which (as you see used in something like Rails) allows for a good bit of flexibility with references to more tersely communicate intent.

I think HCL is one of the worse decisions made by a somewhat influential software company in quite some time. It's harder to use than YAML (which is funny because the stated reason for its existence is "YAML is harder", and it makes me really, really curious what sort of users they're polling) while being less expressive and they're very careful to not really care about anybody's interop "because you can just use JSON instead", except that, as trying to work with Terraform amply proved, no, you can't use JSON, because that path isn't tested because nobody cares and unit tests are hard. =(

HCL is a regular problem in my life. I needed to vent a little.

crdoconnor · on Oct 7, 2015

>for configuration, I just source a shell script and grab env vars wherever I can.

That can cause you headaches when you want to store lists or associative data in your config.
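To make the headache concrete: environment variables are flat strings, so any list or associative value needs an encoding convention that every consumer has to agree on. A minimal sketch, assuming comma-separated lists and JSON-encoded mappings; the helper names and the `APP_HOSTS` / `APP_LIMITS` variables are made up for illustration.

```python
import json
import os


def get_list(name, default=(), sep=","):
    """Read a separator-delimited environment variable as a list of strings."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return list(default)
    # Drop empty items so "a,,b" and trailing separators do not produce "".
    return [item.strip() for item in raw.split(sep) if item.strip()]


def get_mapping(name, default=None):
    """Read a JSON-encoded environment variable as a dict."""
    raw = os.environ.get(name)
    if raw is None:
        return dict(default or {})
    value = json.loads(raw)
    if not isinstance(value, dict):
        raise ValueError(f"{name} must hold a JSON object")
    return value


# Hypothetical usage: the shell script would export these before the app starts.
os.environ["APP_HOSTS"] = "db1, db2, db3"
os.environ["APP_LIMITS"] = '{"cpu": 2, "mem": "4g"}'
hosts = get_list("APP_HOSTS")
limits = get_mapping("APP_LIMITS")
```

Note that the comma convention breaks as soon as a value itself contains a comma, which is the kind of edge case a structured format handles for you.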

kentonv · on Oct 7, 2015

> This "invent your own worse programming language" fad is disappointing.

Fad? This has been a common anti-pattern basically since the dawn of computing. :)

eropple · on Oct 7, 2015

Fine, fine. Boondoggle?

mst · on Oct 6, 2015

> Way easier than trying to write json.

Which is why ingy and I invented http://p3rl.org/JSONY for config files.

zobzu · on Oct 7, 2015

not sure how yaml is easier than json. also not sure how yaml is easier than python or whatnot

lnkmails · on Oct 7, 2015

StackStorm developer here!

Just to clarify, YAML is one way to write the workflow spec, just like YAML is a way to write a Docker Compose file. We picked YAML over JSON because it is less verbose and allows one to add comments, which is pretty useful if someone is looking at a remediation workflow early in the AM. We do use YAQL expressions for some variable manipulation but that is not a YAML spec. A workflow is a collection of actions, and actions can be written in almost any language (as a script) or python (first class actions). So you can think of the YAML workflow as a declarative spec rather than a programming language. Maybe that helps?

crdoconnor · on Oct 7, 2015

Some nice readable example YAML definitions would be good to show on the README.

ascendantlogic · on Oct 7, 2015

And how do you universally fix the bullet wounds? How do you stop every upstream issue from ever occurring so middle-of-the-night remediation doesn't have to happen? Of course everyone would love to actually solve every problem but real life is messier than that. For every issue that needs to be fixed, there's usually one or more reasons why it can't be fixed the "right way" right this instant: Management bullshit, prioritization, $$$, etc etc.

ewindisch · on Oct 6, 2015

Monitoring exists because software is not perfect and will never be perfect. Software raises errors because it doesn't know what to do when those events occur. Operators usually know what needs to be done for their specific environment, even if the software they're running doesn't.

If we could eliminate the bullet wounds we wouldn't EVER need logs. As long as we have logs, we need some way to process them and react to their events. For many organizations, for decades, this has been to have an alerts system and an active operations team fighting fires. Those teams maintain knowledge-bases to track institutional knowledge of how to manually react to these events.

Solutions such as StackStorm and IBM ZAware seem to be created to allow that institutional knowledge to be automated. I've also seen (and built) proof-of-concepts using Bayesian filters as part of such systems. It's been a long time coming and I'm happy solutions are evolving to address this need.

Finally, I think that bandaids, such as this may be, might be precisely what is needed in some cases. Assuming it's even a legal or technical possibility, the cost of the "right fix" may far exceed the harm of leaving a bug unfixed. Sometimes the right fix has a large time cost, and automation can help operations teams bridge the gap while a permanent fix is developed (or more hardware is acquired, etc).

mst · on Oct 6, 2015

> For most of the cases where this would seem to be useful there is probably a failure upstream of that usage that should really be fixed.

Of course. But in the meantime, it would be nice to keep production up.

spdustin · on Oct 6, 2015

You aren't kidding.

I thought it was a clever framework for other kinds of "ChatOps" in addition to all that you said. I can imagine some clever Hubot scripts taking advantage of some of the components.

Man, I sure do love my Hubot.

jamesfryman · on Oct 6, 2015

Disclaimer: I am an employee at StackStorm

Tools like Puppet and Chef are great at managing the state of a single node. However, with any piece of infrastructure that expands beyond a single node, things can get complex pretty quickly. When the state of the systems changes, there are often a multitude of steps that need to occur to ensure that the intended state is met. However, because tools like Puppet and Chef are node-centric, you often find yourself waiting a long time for eventual convergence as code is executed for a node, updated state data is transferred to some upstream data server (Chef server, PuppetDB), and then other nodes converge with the updated data. Depending on the task, this convergence time can be killer.

In contrast, StackStorm is an event-driven automation framework that will help perform the incremental tasks across many systems necessary to properly move from one state to another. Common examples include ensuring that Load Balancers are up before advertising network services, ensuring SQL standby servers are alive before enabling replication, and so forth. Each of these transitions between states requires an imperative set of steps. With StackStorm, we plug into a multitude of tools (including Puppet and Chef!) that provide event updates as actions take place, intercept these triggers, and execute workflows. In many cases, we have clients that heavily use a configuration management tool like Puppet or Chef, but rely on StackStorm to orchestrate the various runs of these tools on different nodes as checkpoints are reached. In this way, you shorten the feedback loop: StackStorm listens for "run finished" notifications from these CM tools, and then figures out what is next by kicking off actions or workflows as necessary.

There is absolutely a ton more about StackStorm beyond this immediate answer. In addition, we include things like Role Based Access Control, ChatOps support, full audit trails, and more. I encourage you to check it out and provide feedback. We'd absolutely love to help!

helloiamaperson · on Oct 7, 2015

You mention Puppet and Chef, but how does it compare to more holistic tools like terraform, juju, and bosh?

dzimine · on Oct 7, 2015

terraform/juju/bosh are purpose-built for app and infra deployments.

StackStorm is a generic automation platform with no . One can use it to run an arbitrary chain of actions on events. As such, it is used in auto-remediation & automating runbooks. We internally use it for a variety of things like irc-to-slack relay, zombie ec2 vm periodic clean-up, ChatOps-ing JIRA, etc.

Some folks do run complex continuous deployment pipelines and blue/green deployments on StackStorm. Would be good if they can comment here.

helloiamaperson · on Oct 7, 2015

Oh, so kind of like: http://concourse.ci/?

dzimine · on Oct 7, 2015

may be, when it comes to CI/CD. not familiar with concourse.

my comment above got cut out, i meant to say "generic platform with no focus on CI/CD". It's for auto-remediation.

it's not natural to use bosh or concourse to automate cleaning up log files or replacing a failed node in cassandra ring, or respond to identified DDOS attack.

kentonv · on Oct 7, 2015

We should make it possible to run StackStorm on Sandstorm (https://sandstorm.io), mainly for the confusion factor. :)

Apparently our offices are like three blocks apart, too!

lnkmails · on Oct 7, 2015

StackStorm developer here! Don't be surprised if I turn up at your door tomorrow :).

kentonv · on Oct 7, 2015

In all seriousness, email me (kenton at sandstorm.io) if you want to visit us for lunch or come to a LAN party[0] sometime.

[0] http://kentonshouse.com

DevOpsDotCom · on Oct 7, 2015

We actually just posted a piece on monitoring vs. remediating about StackStorm, check it out guys, i think you may find it interesting and relevant. http://devops.com/2015/10/07/enough-monitoring-act/

aaronbrethorst · on Oct 6, 2015

What's the difference between this and all of the other runbook automation suites out there? (besides the open source license)

...or am I missing something?

dzimine · on Oct 6, 2015

Disclaimer: I work with StackStorm. In the past, I built Opalis IIS aka Microsoft SC Orchestrator. Seen both sides.

StackStorm is to legacy runbook automation what chef/puppet are to legacy config management. It's open source, infra as code, and respects devops tools and mindset. Some folks on our team are devops with field experience putting their learnings in.

Our key design principles:

1) infrastructure as code, which means: workflows, rules, action metadata, and other artifacts are readable, source-controllable code (yaml)

2) integrations are "easy", which means: use python, ruby or shell, or turn any existing script into an action by adding yaml meta-data. If you did an integration with something like HP OO or MS SystemCenter you'll appreciate the difference.

3) yes, opensource. I think it's a deal breaker, especially when it comes to integrations.

That's our perspective, how do you guys see it?

cowsay · on Oct 7, 2015

Anyone actively using this?

Seems pretty interesting and looking for any suggestions for what may be the best wow factor.

epowell2015 · on Oct 7, 2015

Netflix is one: https://news.ycombinator.com/item?id=10272955

armabiz · on Oct 7, 2015

Also Cisco, Rackspace.

But it should work well even for small startups/companies.

Owning your infrastructure as code, where you can control everything and tie together Monitoring/Configuration management/Issue creation/ChatOps/Auto-remediation, is a really powerful thing.

zobzu · on Oct 7, 2015

its not infra as code though, its bandaiding as yaml. so say, your logs are filling the disk, nagios complains.

what you do, is write a yaml file that goes and deletes some files around when this happens....

.. instead of... fixing logrotate config

i dont know, it feels wrong: as much work, except it also takes setup, new machines, new stuff that can fail, be misconfigured etc.

lnkmails · on Oct 7, 2015

StackStorm developer here.

I've worked with multiple services in multiple teams where upstream fixes take a while and meanwhile devs and ops people get paged like crazy for a diagnosed and remediable problem. Agreed that the logrotate config needs to be fixed for this case but it is only a simple demo for auto-remediation. For years, Cassandra dead node replacement has been a 6 step manual process. You'd think upstream should be fixed but unfortunately not. So StackStorm fills the gap between what is ideal and what is running in production. Usually, there is a gap. See http://docs.datastax.com/en/cassandra/2.0/cassandra/operatio... vs https://stackstorm.com/2015/09/22/auto-remediating-bad-hosts.... That is just another example.

armabiz · on Oct 7, 2015

It's not only about that; cleaning logs is just a simple example. The main big thing is IF-Then-Else, and it's up to you to choose what you put after that IF.

Things like:

* Building fully automated and really complex CI/CD workflows from several tools

* Do something with your AWS or RackSpace clusters based on monitoring events from NewRelic, Sensu, Nagios

* Automatic node replacement in cluster, migrating MySQL master (sleep well!)

* Security automation, based on detecting erroneous events and automatically freezing account/activity and then notifying human about the incident

* Create a JIRA issue as part of a Workflow, as a kind of detailed report after some action has been done

* Listen for new events/changes in Trello/Kafka/GitHub/RabbitMQ/anything even Twitter and trigger an action

* Folks are even using it for Smart Home Automation

* ChatOps thing: obtain info about your infrastructure from Chat or trigger your favorite CM tool: Puppet, Chef, Ansible, Salt.

Most probably anyone can imagine lots of use cases with their favorite DevOps tools and how to tie them together.

Moar Automation, - less routine!
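The IF-Then-Else idea behind all of the cases above boils down to three parts: a trigger (which kind of event), criteria (the IF), and an action (the THEN). Here is a rough sketch of that shape in plain Python. It is not StackStorm's rule format or API; the `Rule` class, `dispatch` function, event layout and the `monitoring.disk` trigger name are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Rule:
    name: str
    trigger: str                                  # event type this rule listens for
    criteria: Callable[[Dict[str, Any]], bool]    # the IF
    action: Callable[[Dict[str, Any]], Any]       # the THEN


def dispatch(event: Dict[str, Any], rules: List[Rule]) -> List[Tuple[str, Any]]:
    """Run every rule whose trigger and criteria match; return (rule name, result) pairs."""
    results = []
    for rule in rules:
        if rule.trigger == event["type"] and rule.criteria(event["payload"]):
            results.append((rule.name, rule.action(event["payload"])))
    return results


rules = [
    Rule(
        name="clean_logs_on_full_disk",
        trigger="monitoring.disk",
        criteria=lambda p: p["status"] == "CRITICAL",
        action=lambda p: f"clean logs on {p['host']}",
    ),
    Rule(
        name="notify_chat_on_any_disk_event",
        trigger="monitoring.disk",
        criteria=lambda p: True,
        action=lambda p: f"post to chat: disk {p['status']} on {p['host']}",
    ),
]

event = {"type": "monitoring.disk", "payload": {"host": "web1", "status": "CRITICAL"}}
fired = dispatch(event, rules)
```

The actions here just return strings; in a real setup each one would call out to the tool in question (a CM run, a JIRA ticket, a chat message).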

dzimine · on Oct 7, 2015

let's scrutinize. And please do challenge and point out what still feels wrong.

* first, a library of scripts (actions), a shared one. each action is atomic, linux style, doing one thing well. A common pattern in ops. now with CLI, API and UI. Feels right so far?

* second, combine these actions, building blocks, into workflows (workflow is action comprising actions). why not script? a) transparency of state (it ran 3 steps and failed on 4th) b) reliability, like 'restart workflow from a point of failure' c) carrying data - scripts pipe strings, workflows pipe JSON.

* Add chatops. Any of these actions or workflows exposed in any chat with couple lines of meta. And any events sent to chat with rules

Good things begin to happen here, even before wiring events with actions. Shared context, integrations, quickly building more actions from existing actions, full audit...

* now, add IFTTT - firing these actions on events. Quite a lot of cases fall into this.
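The three reasons given above for workflows over scripts (visible per-step state, restart from the point of failure, structured data passed between steps) fit in a few lines. This is a toy illustration in plain Python, not StackStorm's workflow engine; `run_workflow` and the step names are hypothetical.

```python
def run_workflow(steps, context, state=None):
    """Run named steps in order, passing a dict from one step to the next.

    `state` records the status of each step. Passing a previous state (and the
    context returned with it) back in resumes from the first step that has not
    succeeded yet.
    """
    state = {} if state is None else state
    for name, func in steps:
        if state.get(name) == "succeeded":
            continue  # already completed in an earlier run
        try:
            context = func(context)
            state[name] = "succeeded"
        except Exception as exc:
            state[name] = f"failed: {exc}"
            break  # stop here; later steps stay unrecorded
    return context, state


# Hypothetical three-step workflow where the second step fails once.
attempts = {"drain": 0}


def find_node(ctx):
    return {**ctx, "node": "cass-3"}


def drain_node(ctx):
    attempts["drain"] += 1
    if attempts["drain"] == 1:
        raise RuntimeError("timeout")
    return {**ctx, "drained": True}


def replace_node(ctx):
    return {**ctx, "replaced": True}


steps = [("find_node", find_node), ("drain_node", drain_node), ("replace_node", replace_node)]

ctx, state = run_workflow(steps, {})
first_state = dict(state)                     # snapshot after the failed run
ctx, state = run_workflow(steps, ctx, state)  # resume from the failed step
```

After the first call, `first_state` shows exactly where the run stopped; the second call skips `find_node` and picks up at `drain_node` with the data gathered so far.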

It's a challenge to single out one use case. A trivialized example, such as log-file delete, is dismissed as a "bandaid". Complex examples are domain specific and harder to grasp. We think we are on to something here. We think it's not a bandaid, it's a glue. Needed in many cases.

Again, I work for StackStorm.