Show HN: Keep – GitHub Actions for your monitoring tools (github.com/keephq)
181 points by talboren on Sept 4, 2023 | hide | past | favorite | 65 comments
Hi Hacker News! Shahar and Tal from Keep here.

A few months ago, we introduced Keep here on HN (https://news.ycombinator.com/item?id=34806482) as an “open source alerting CLI” and got some interesting feedback - mainly around UI, automation, and supporting more tools. We were VERY early back then, and we understood that although the current DX around creating alerts is not great, it's not that critical - developers don't need another tool just for that.

But we did find something else.

While talking to developers and devops, we found that a lot of companies use many tools that generate alerts - from CloudWatch, Prometheus, Grafana, and Datadog to tools such as Zabbix or Nagios. We definitely agree consolidation in the observability space is a real thing, but while talking to those companies we felt that there are still real use cases for having more than one tool (for example, according to Grafana's 2023 observability survey, 52% of companies use more than six observability tools: https://grafana.com/observability-survey-2023/).

So with that in mind, we rebuilt Keep with a simple mindset: (1) Integrate with every tool that triggers alerts - either tools pushing alerts to Keep via webhooks or routing policies, or Keep pulling alerts via the tools' APIs. (2) Create a simple abstraction layer to run workflows on top of these alerts. (3) Maintain a great developer experience - open source, API-first, workflows as code, and generally having a developer mindset while building Keep.

While we were rebuilding Keep, Datadog released their workflow automation tool (https://docs.datadoghq.com/service_management/workflows/), which led us to the understanding that this is exactly what we solve - but for everyone who uses tools other than Datadog.

A short demo of Keep with a simple use case: https://www.youtube.com/watch?v=FPMRCZM8ZYg

You can try it yourself by signing into https://platform.keephq.dev

Like always - we invite you to try Keep and we are eager to hear any feedback.




I'm looking at this and thinking, "you know what, this could be an awesome personal tool as well".

This is definitely outside of the use cases described but I can definitely see myself hooking this up in an IFTTT style to funnel things into my todo systems using the HTTP provider.

Will poke around this soon.


that's definitely something we want people to do - use it for their small annoying manual things. at the end of the day I think this will help Keep become very user friendly.


IMO the readme docs make it seem confusingly like it's built-in/really well integrated with Actions, because the syntax is so similar. It takes some light digging to find it's actually entirely separate (but similar) and run as `keep run --alerts-file=path`, from GH Actions or anything else at all, because it's a separate file parsed by a third-party program that just so happens to have a similar syntax.

Nice tool though, looks useful, added to the list.


thanks for the feedback here, that’s really important for us and we actually just refactored the readme a couple of days ago. any concrete action item you’d suggest?


Try to read it as someone approaching the whole project for the first time, particularly the 'Workflows' section. It just doesn't really make sense: it kind of implies there's something special about running it in GitHub Actions (maybe it means to say that the syntax is similar?) and doesn't tell you how to actually pass the Keep workflow file to keep (which is what will parse it, not GH Actions) at all.

Start with `keep run --actions-file`, then show the file format. Don't mention specific CI/CD, it's not relevant.


would "datadog workflow automation alternative" would sounds more reasonable? we did want to do that but people told us no one knows datadog workflow automation.


Honestly it would be better to just describe what your tool is, rather than trying to describe it in relation to something else. People know what workflow automation is, and people know what monitoring is. I was initially confused at what this product had to do with Github Actions.

Also, it would do well to link to your website (https://www.keephq.dev) rather than just something that immediately prompts me to log in.

And one final thing, having a demo video with some sort of narration to explain what I am watching people click on would also be helpful.

Obviously these are just my opinions, and they are worth exactly what was paid for them :)


and your feedback is much appreciated! noted for next time. (i actually did the demo in a rush so it was more just showing a simple usage of the platform)

we'll think about some changes to the README asap!

thanks!


thanks for the feedback, appreciated. We do have a demo at keephq.dev and we also dumped the youtube link on the post itself. what else would help you understand better?


I watched that video, but my thought was it would be better with narration or subtitles explaining what is going on.


You can keep a mention of another tool if you really want, but just make it clearer that you only mean it has similar syntax.

Concretely:

> Run a Keep workflow with `keep run --actions-file=./path/to/workflow`; or `keep run` assumes the file is called `blah` in the current working directory by default if not specified.

> The actions file has the following format, which may seem familiar if you've used GitHub Actions:

> ```

> ...

> ```

> further actions file examples are available here and a full reference is available in the docs here.


This genuinely seems useful. I’ll give it a shot. Nice!


let me know how I can help


This looks great!

I'm the maker of an alert-generating tool (OnlineOrNot), how would I go about adding an integration for Keep?


well, actually the documentation around that is still under construction (https://docs.keephq.dev/development/adding-a-new-provider), but adding a new integration (provider in our terms) to Keep is a piece of cake! happy to chat in our Slack (https://slack.keephq.dev) or over a zoom, whatever works for you :)


Since this is 2023 and we are still releasing things that solve X and Y problems in YAML, I do want to take the opportunity to question whether solving problems in YAML is really the thing we should be building businesses around these days. I've spent the greater part of the last year or so undoing the pain of "reasonably complex GHA in YAML" in my organization. It's one of those things that sounds great conceptually and works really well in simple cases, but once your use case evolves beyond even remotely simple (for example, abstracting and maintaining this code in an engineering org in the tens of people, not even hundreds), it becomes a slow-growing cancer: a huge time suck, an unmaintainable, untestable mess, and technical debt for your org.


Did you perhaps have an alternative solution you don't mind sharing? In my opinion yaml is good enough for gitops. Easy to read, understand, modify.


I’ve been using Dagger for this replacement specifically, but that is CI/CD-specific, for caching and workflow execution. For things like workflow automation and orchestration I would reach for something like Prefect or Dagster. The point is to do it in an actual programming language, so that you not only get typing, readability, reusability, unit testability, local execution, and language-specific tooling, but also so that it doesn’t suck for end users to write, debug, and maintain. This also gives end users an escape hatch for when your abstraction is inevitably not good enough for them.

Cuelang etc., like siblings mentioned, are decent enough, but the real scalable solutions here are the ones available in general purpose programming languages.
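To make the "actual programming language" point concrete, here's a minimal sketch - plain Python, no framework, with the alert shape and routing rules entirely invented for illustration - of the kind of logic that usually gets buried in workflow YAML, expressed as ordinary testable code:

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str     # e.g. "prometheus", "zabbix"
    severity: str   # e.g. "critical", "warning"
    message: str


def route(alert: Alert) -> str:
    """Decide where an alert goes. Plain code: typed, greppable, debuggable."""
    if alert.severity == "critical":
        return "pagerduty"
    if alert.source == "prometheus":
        return "slack"
    return "email"


# Because this is ordinary code, testing the routing logic is trivial:
assert route(Alert("prometheus", "critical", "disk full")) == "pagerduty"
assert route(Alert("prometheus", "warning", "disk at 80%")) == "slack"
```

The same branching written as YAML conditions would be string-matched at runtime by an interpreter you can't step through; here it's just a function your existing test suite covers.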


the "workflows as code" is something we are thinking about too. I guess there are pros and cons for every approach and eventually we would want to support both (but need to start with something)


Check temporal.io, they are used by Uber, Netflix, Datadog...


temporal is for managing events. How is it related to CI/CD? Genuinely curious.


It’s more generally a workflow manager and orchestration tool - kind of a more general version of the tools I mentioned, Dagster and Prefect. It can be used to spawn and manage CI/CD tasks asynchronously.

Though I will say that Temporal's use case is probably not really well mapped to CI/CD - though it could be used for it (which is why I didn't mention it). Its primary strength is robust, long-lived workflows with intelligent retries and the like. You typically want your CI/CD to be as fast as possible, and while you want retries and resilience, they're not as important as some other things (like being hermetic, reproducible, and cached).


Here's a Temporal v Prefect comparison I wrote: https://community.temporal.io/t/what-are-the-pros-and-cons-o...

tldr is Temporal is more general-purpose: for reliable programming in general, vs data pipelines. It supports many languages, and combining languages, has features like querying & signaling, and can do very high scale.

CI/CD is a common use case for Temporal—used by HashiCorp, Flightcontrol, Netflix: https://www.youtube.com/watch?v=LliBP7YMGyA


And what was the solution? How did you eventually address those issues? While I agree that GitHub Actions has its downsides, it's also widely used and simple to start with, which we thought was a good approach. Would you be more comfortable with 'Zapier for Monitoring' or an alternative to 'Datadog Workflow Automation'?


The complaint doesn’t seem to be about GitHub Actions, but YAML. I agree 100%: as soon as I saw that Keep is using YAML, I closed the tab.

Nope. Nope. Nope.

It’s like going back to Mongo without schemas and relational checks. We have perfectly good configuration languages with schemas, checks, imports, logic, etc. YAML is unacceptable in this profession.


Could you point to the mentioned configuration languages with schemas, checks, imports, logic etc?



These are interesting. I've seen both before but never understood - if I have a config that cannot be easily written out in yml, why would I force my team to learn either one of these DSLs instead of using our main dev language (say python) to generate the yml instead? What's the value proposition of jsonnet or cue?


One very important thing to point out here is that you’re not just writing config with this YAML. If you look at the example on the GitHub link, it’s a workflow orchestration and execution context. There’s a code runtime involved and logic is executed in the YAML. That is where YAML falls apart.

If all you’re doing is defining configuration (e.g. Kubernetes manifests, Helm charts, etc.) then great. But that isn’t what this is.

To your original question, I would actually advocate using a general purpose programming language for most use cases. Learning a new DSL, like you mentioned, is overhead from both a usability and maintainability perspective. I haven’t used jsonnet before, but I know that cuelang gives you some power tools around typing, config validation, templating, etc. It’s essentially purpose-made for configuration management and tooling, so it’s probably going to be really good at that. I don’t know if it’s worth using over a suite of language-specific tools like Pydantic + Jinja though, because with a general purpose language like Python you also have access to a whole, much larger ecosystem of tools and libraries to pull from.
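For what it's worth, the "generate config from your main language" approach doesn't even need extra dependencies. A rough stdlib-only sketch (all class and field names invented for illustration), leaning on the fact that JSON is a subset of YAML 1.2, so any YAML consumer can read the output:

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Step:
    name: str
    provider: str


@dataclass
class Workflow:
    trigger: str
    steps: list = field(default_factory=list)

    def validate(self) -> None:
        # Checks run at generation time, long before any runtime parses the file.
        if not self.steps:
            raise ValueError("workflow needs at least one step")


wf = Workflow(trigger="alert", steps=[Step("notify", "slack")])
wf.validate()
# asdict() recursively converts nested dataclasses; json.dumps emits valid YAML 1.2.
print(json.dumps(asdict(wf), indent=2))
```

Pydantic buys you richer validation and schemas on top of this, but the basic move - fail in your language's type system and tests instead of in the consuming tool - works with nothing but the standard library.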


I agree with this for internal tools only intended to be used by a relatively small organization. Using a DSL may well be preferable for open source, and for larger organizations where not every team is proficient in your Turing-complete language of choice.

Some drawbacks of plain YAML, and of tools that use string templating to render YAML:

    - difficult to extend features not exposed by upstream
    - composition is often messy, resulting in duplication
    - validation is often impractical (at least identifying the exact source of the error… I’m looking at you Helm!)
Unrelated to OP, but you can leverage Tanka to extend helm charts with functionality not provided by upstream.

https://tanka.dev/


I agree with you on that as well. The YAML aspect is somewhat of a 'low-level' concern that you shouldn't have to worry about unless you need something highly customized.

Now, let me reverse the question: what would make you keep the tab open?


“Need[ing] something highly customized” is not some uncommon occurrence for your end users. It’s an inevitability for a large portion of them.

Give me some well supported libraries in common general purpose languages to do this, codegen is pretty good these days and supporting 3 or 4 languages shouldn’t be an insurmountable achievement.


got you, so you would imagine some typescript/python/golang sdk that lets you define workflows?


Precisely. Let me interweave that into my implementation as I see fit. Maybe for some people that’s just literally pasting an example into GitHub Actions and saying “python keep.py myargs”, but it doesn’t have to be; it’s just another toolchain in the general purpose environment.
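For illustration only - Keep doesn't ship such an SDK, and every name below is invented - a decorator-based registry is one way that "just another toolchain" might look in Python:

```python
from typing import Callable

# Hypothetical sketch: a module-level registry mapping workflow names to functions.
REGISTRY: dict[str, Callable] = {}


def workflow(name: str):
    """Register a function as a named workflow - GitHub-Actions-style, but in code."""
    def wrap(fn: Callable) -> Callable:
        REGISTRY[name] = fn
        return fn
    return wrap


@workflow("disk-space")
def disk_space(alert: dict) -> str:
    # The alert payload and threshold are made up for the example.
    if alert.get("value", 0) > 90:
        return "page on-call"
    return "post to slack"


# "python keep.py myargs" could then just look up and run a registered workflow:
print(REGISTRY["disk-space"]({"value": 95}))
```

An entry point script would parse its args, pick the workflow out of `REGISTRY`, and call it - leaving users free to wrap, compose, or unit-test the functions however their own codebase does.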


will definitely have it in mind, thanks for the input. btw, do you know of any other tools doing this?


AWS CDK has bindings for a few languages


I have been working on a data validation tool for a while. I even tried creating an extended YAML parser for data validation. You made me realize I wasted my time with that approach. Better now than later. I would love to talk to you before I throw away more code. Can we connect?


Hey! I've missed your reply here. Sent an email.


I'm the system architect and code quality gate in my company, and I feel you... my job is to keep things sane, consistent, and extendable. GHA as well as Azure Logic Apps are both helpful at small scale but, omg, so far away from being reusable or even able to deploy the same damn thing to different stages from code. As for GHA: I find GHA looks just the same as Azure DevOps Pipelines, yet GHA doesn't hold your hand when designing and evaluating the steps.


Under the hood, GHA uses the same backend as Azure DevOps Pipelines, so it would make sense that they look the same.


Yeah but people freak out when they see Gradle, Bash, Bazel, or even wacky raw Python.

The real competition is: what will LLMs write better? Because I have zero interest in learning new DSLs; I just want whatever is most text-based to use through an LLM.


Then, honestly, you want it to write something that is statically verifiable.


You probably want Python then. I think it's been well demonstrated that it is probably the language into which the largest amount of LLM training effort has gone, in multiple facets.


Haha, my team maintains something just like this internally - ours is called “info-radiator”. Great idea for a product. You should add an Amazon referral link to a small Lenovo tablet and some Velcro for developers to have a dedicated Keep monitor!


thanks! happy to help your team migrate to Keep and send out a few complimentary monitors ;)


We have a similar internal tool, and it's also called keep!

Besides alerts, it also tracks and displays things such as which MongoDB server is the primary, or which Elasticsearch node is the controller.


what are the odds! would be super interesting to talk, are you up for it?


Looks really interesting! Does the self-hosted version support OAuth or other authentication methods to manage users through an external identity provider?


The self-hosted version supports both single tenant mode (no users/login at all) and multi tenant mode, where you can configure your own Auth0 account and work with all of their supported IdPs. It's not so well documented, but I can have that documented ASAP.


This is similar to something I saw before: https://n8n.io


yea, n8n is really neat. in our vision Keep is gonna be "n8n for monitoring".


does it do WebSocket or check at an interval? or does it have another approach?


How does that compare to similar existing tools like StackStorm?


i think a few major points would be: 1. we're much more integrated and focused on the tools (and even more specifically on the monitoring tools) - we actually integrate with their APIs and don't only receive events from them. 2. trying to be a lot simpler - modern UI, zero-clicks concepts, etc. 3. they're super cool and we learn a lot from them


Does it do Rollup’s and how is it different from Stackstorm st2


wdym by "Rollup's"? I answered re:stackstorm st2 in another comment:

> i think a few major points would be: 1. we're much more integrated and focused on the tools (and even more specifically on the monitoring tools), we actually integrate with their APIs and not only receive events from. 2. trying to be a lot simpler, modern UI, zero-clicks concepts, etc. 3. they're super cool and we learn a lot from them


In your demo, the message for disk space keeps repeating, but we just need one alert instead of hundreds. Also good would be flap detection when something flaps up/down, and auto-resolving the alert once the issue is resolved.


the demo is super simple - we just send a slack message for every Prometheus alert, so fine-tuning Prometheus would resolve it. In addition, we have a few throttling strategies that we didn't show in the demo. Lastly, we will add flap detection!


Maybe don’t call a reliability tool “GitHub Actions for X”, which doesn’t exactly inspire confidence? Having used relatively high frequency GitHub Actions for alerts, I get more random errors from Actions (Actions down, failed git pull, failed cache pull, etc.) than actual alerts.


I was thinking just the same!

When I think of Github Actions I think of horrific inline YAML scripting, missing core functionality, clunky UI and frequent outages.


lol

yea, you’re definitely right. we should’ve mentioned something like “GitHub actions but reliable”.

actually our readme mentions datadog workflow automation as a reference though people are less familiar with it


yea the original thought was "Datadog Workflow Automation" but too many people told us they don't know it.


I've used Datadog as an application engineer for years (not an expert / ops person) and I just learned about Workflow Automation from this post... so good choice.


mic drop


Datadog or us? ^^'




