Gremlin Free – Run chaos experiments to prevent outages (gremlin.com)
164 points by dpritchett 22 days ago | 52 comments

Ha, after working on building Uber’s chaos monkey (which was hard and took a while to build) and working with Netflix’s chaos monkey — it’s super nice to see Gremlin release this service so anyone can see the benefits of chaos engineering. I hope they add a “random chaos” feature to keep engineers on their feet. ;-)

Hey folks, I work at Gremlin and we're super excited to announce this launch. Drop any questions, comments, or concerns, we're happy to help!

I like the look of this and love that you have released a free version. I am a little dismayed, though, that the two options are $0 and $1,000/month (paid annually) with nothing in between. The free version seems great for getting started, and I'd really like more of the attacks that the paid version has, but $12,000 is much, much too high a price for a startup or personal project. That's quite a jump in cost.

I can’t speak for this vendor in particular, but one common reason for pricing like this is the vendor doesn’t want to deal with smaller customers as they often have the highest support requirements.

Pretty much. An enterprise customer won't bat an eye at $12K/year, and I imagine it'd pay for itself pretty quickly.

I can definitely relate to GP, though. It feels frustrating to learn about an interesting product only to find that it's priced way outside of your budget.

At least Gremlin has a public sticker price. Sometimes enterprise services just skip that completely and require you to set up a call with someone in their sales department, which usually means the service is outside of your budget.

> which usually means the service is outside of your budget.

I wonder if this is true though.

Maybe it puts off people who otherwise might be able to have a product tailored to their budget???

It's possible. I know a guy who isn't put off by "call us" pricing notices and he seems to get good deals, so it's certainly possible. Many of us will never find out though, because "call us" means "close tab". Even if I really want their product and am willing to pay a lot for it, my time is too precious to me to waste on calls.

Sure, but then why tease us with a cut-down free tier?

Thanks for putting this out! I caught a Gremlin talk at a recent conference and was very impressed with how knowledgeable the developers were.

How does the Gremlin platform interact with one of my hosts? Do I need to install an agent or something? Does it need root access to my host, hypervisor, cloud console?

Simply install an agent, authenticate with our control plane, and create attacks through our webapp. No root access required.

Check out more info at https://www.gremlin.com/docs/infrastructure-layer/installati... .

If root is not required, how does the agent issue a shutdown or restart?

Hello! I am a Solutions Architect for Gremlin. Great question! It uses four Linux Capabilities to accomplish that listed here: https://www.gremlin.com/docs/security/overview/#linux-capabi...

I see that you are on the Rust production user page [0]. Can you talk a little bit about what Rust is used for and how the experience has been?

[0]: https://www.rust-lang.org/production/users

Hey, I'm an engineer at Gremlin! When you install Gremlin onto your Linux hosts for infrastructure experiments, you're using binaries that were written entirely in Rust. I would be lying if I said there wasn't a bit of a learning curve (coming from mostly working with Java). Most of that can be attributed to the memory management concepts built into Rust. At first you fight the compiler a bit (asking things like, why am I not allowed to reference this variable?!), but you soon learn to love and rely on the compiler as it builds more confidence in the runtime behavior of the product.

One game changer for Rust is the treatment of Errors as first class citizens. It's literally built into the native types that Rust wants you to work with. That's huge for our product, given it runs in an inherently error-prone environment.

Thanks for the reply. I always anticipated Rust being a good fit for a daemon-like tool. Not having to install a separate runtime and having things statically linked is a nice benefit. I know it's not the only language that is capable of this, but being able to leverage the other bits of Rust helps with productivity as well.

I laughed out loud at "Failure as a service". Thanks for that.

Hi, we are a startup using a lot of Lambda, Fargate, RDS and DynamoDB. Will Gremlin work for this? I didn't see any mention of support for Fargate or Lambda on your website.

We've got you covered! Gremlin supports serverless products with application layer fault injection.

Take a look at our docs for more: https://www.gremlin.com/docs/application-layer/overview/

Thanks, that will work nicely with our Lambda functions, which are in Java. How about Python? We are running Python Django in Fargate. Is it possible to bring up a new container or add Gremlin to the existing container?

Glad to hear it! Additional language support is top of mind. Node is up next and Python is high on the list.

Regarding your app running in Fargate, you can do either. Hop over to our #support Slack channel and we can help out more!


Failure as a service doesn't make all that much sense considering that many failure scenarios would make the target host inaccessible to Gremlin.

How does Gremlin handle this?

Good question! All of the network attacks have a whitelisting capability, to keep the host accessible. This isn't an issue with state attacks, as the client will come back online once the host reboots. And with resource attacks the client typically remains active, if your application is handling starved resources well.

Is anyone aware of a chaos tool that isn't a SaaS (free or not) and doesn't require using Spinnaker like the current Netflix chaos tool does?

Yes, we compiled a list of all the OSS alternatives to Chaos Monkey here!


The old chaos monkey didn’t require spinnaker. You can find it here: https://github.com/Netflix/SimianArmy

So here is what I don't get about this stuff.

What happens to the in-flight requests? Don't a few users run into random errors whenever a host is killed unexpectedly?

You could have your loadbalancer retry everything that fails, but then wouldn't every single request in your app have to be idempotent?
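One common way to make blind retries safe is an idempotency key: the client attaches a unique key to each logical request, and the server stores the response under that key so a retried request does not repeat the side effect. The sketch below is only an illustration of that idea; `PaymentService`, `charge`, and the in-memory dictionaries are hypothetical names, not part of any product discussed in this thread.

```python
import uuid


class PaymentService:
    """Toy server that deduplicates requests by idempotency key (hypothetical)."""

    def __init__(self):
        self.results = {}  # idempotency key -> stored response
        self.charges = []  # side effects actually performed

    def charge(self, idempotency_key, amount):
        # A retried request with a known key returns the stored response
        # instead of performing the side effect a second time.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges.append(amount)
        response = {"status": "ok", "charge_id": len(self.charges)}
        self.results[idempotency_key] = response
        return response


service = PaymentService()
key = str(uuid.uuid4())          # generated once per logical request
first = service.charge(key, 25)
retry = service.charge(key, 25)  # e.g. a retry after the first reply was lost
```

With this shape only the handlers that have side effects need the key; read-only requests are already safe to retry.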

Server crashes happen. This forces you to deal with them instead of pretending they won’t.

Well yes, but I would suggest that they are uncommon enough that a few requests failing isn't a problem when those happen.

It's an entirely different story when you are killing processes constantly.

If "a few requests failing isn't a problem" is a reasonable statement in your company then this type of service isn't really aimed at you. The point is to help shake out fault tolerance issues. If you can already tolerate faults there isn't much of an issue.

It doesn’t _need_ to be constantly. The important part is that it is done _deliberately_ to understand what happens when failures occur.

Idempotent requests, stateless services, etc. are all parts of a fault-tolerant system.

Your service has a few ways to deal with a dependency going down -- maybe it's a retry, maybe it's opening a circuit breaker and returning a default payload instead of calling that service.
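As a rough illustration of the circuit-breaker option, here is a minimal sketch. The class name, the thresholds, and the injectable `clock` parameter are all assumptions made for the example; a production system would more likely use an existing library or a service mesh feature than hand-rolled code like this.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker that returns a default payload while open."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, default):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return default  # open: do not touch the dependency at all
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return default
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_recommendations, default=[])`, where `fetch_recommendations` stands in for whatever downstream call the service makes.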

It really depends on what specifically the service is and what it's calling (so it's a very case by case issue).

One of the very neat features of Istio is that you can do this tuning in real time -- spin up your services, simulate faults, and then test your service while tuning your retry logic to see what the best user experience is.

Well, for example, in our systems all API calls only move from one known state to another known state, and any call failure redirects the client/user to the dashboard through an error handler, so they have to reload the last good state saved in the database.

Not perfect, but having a server crash is not much different from having a connection reset by a Wi-Fi status change, an upload timing out due to the mobile network going away, or the user navigating away or closing the browser.

It sounds like you are saying "The in-flight requests fail" to me.

I really don't like the idea of saying that it's simply okay to give random users a bad user experience like that when you are actually killing servers yourself all the time.

It's a different approach to managing risk -- minimizing impact of failure rather than minimizing the likelihood of failure.

It's nice to know that you can kill a process and the only impact is that in-flight requests fail, rather than having a more significant outage if a process crashes and the failover doesn't work, or the process doesn't automatically restart, etc.

If you accept that requests will fail you can build retries into the system. It's a lot harder to make a system more resilient if you avoid testing the failure scenarios.

Exactly! Chaos engineering is all about thoughtfully planned out experiments, to observe what the user experience will be when something fails. Doing this on your own terms allows you to improve the experience so that your customers aren't affected.

You can decide what happens when an in-flight request is dropped, whether you hold onto the state somehow and retry or the client could fail gracefully with a relevant error message.
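The two options mentioned here (retry, or fail gracefully with a relevant message) can be combined. The following is a minimal sketch under stated assumptions: `call_with_retries`, its parameters, and the error text are invented for illustration, and only `ConnectionError` is treated as retryable.

```python
import time


def call_with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func on ConnectionError, then fail gracefully with a message."""
    for attempt in range(attempts):
        try:
            return {"ok": True, "value": func()}
        except ConnectionError:
            # Back off exponentially between attempts, but not after the last.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return {
        "ok": False,
        "error": "Service temporarily unavailable, please try again.",
    }
```

Retrying like this is only safe when the wrapped call is idempotent, which is the concern raised earlier in the thread.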

Another thing that's not often caught by "normal" testing but that chaos engineering can capture is when multiple things fail together in random ways. It can be surprising how otherwise robust services can fail badly when multiple things go wrong at once.

When you have nested service calls, a single downstream failure shouldn’t fail all the way back to the root request, in most cases.

How do you prevent abuse of this tool?

Security is extremely important to us. Clients authenticate to our control plane with either a secret string or a certificate. Clients can be revoked at any point from our webapp, and if a client loses communication with our control plane, any ongoing attack is halted.
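The "halt when the control plane goes quiet" behavior is a fail-safe pattern often called a dead man's switch. The sketch below shows the general idea only; it is not Gremlin's implementation, and `AttackWatchdog`, its timeout, and its methods are hypothetical names chosen for the example.

```python
import time


class AttackWatchdog:
    """Halts a running experiment when control-plane heartbeats stop."""

    def __init__(self, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_heartbeat = clock()
        self.attack_running = True

    def heartbeat(self):
        # Called whenever the (hypothetical) control plane is reached.
        self.last_heartbeat = self.clock()

    def tick(self):
        # Called periodically by the agent loop; fails safe on silence.
        silent_for = self.clock() - self.last_heartbeat
        if self.attack_running and silent_for > self.timeout:
            self.attack_running = False  # stop injecting faults
        return self.attack_running
```

The key design choice is that silence stops the experiment by default, so a network partition caused by the experiment itself cannot leave it running unattended.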

Check out our security page for more: https://gremlin.com/security

What infrastructure size does one need to have where this technique is beneficial? Genuinely curious where the threshold is.

Multiple criteria:

1. When you go from one machine running the code to more than one.
2. Any system that may experience failures, where detecting and recovering from such failures is desirable.
3. Most distributed systems, due to the failure scenarios inherent in such systems.

NB: This company "Gremlin, Inc", its product "Gremlin Free", and its use of the Gremlin name are in no way affiliated with or related to Apache TinkerPop™ Gremlin, its ASF marks, name, the open-source Gremlin graph programming language, ASF TinkerPop Gremlin Graph Traversal Machine (GTM), associated libraries, or the Gremlin Graph developers group formed in 2009.


It's also probably not related to the 1984 movie "Gremlins" or the 1970s car of the same name (listed as one of the ugliest cars of all time[0])

[0] https://www.cbsnews.com/pictures/worlds-15-ugliest-cars/7/

ASF marks were filed at formation. Identifying and distinguishing the use of similar and potentially confusing marks (esp in software products) is one of the required duties. The use of ASF marks in software products is prohibited to prevent confusion.

TinkerPop is a graph database. This is an infrastructure testing system. The likelihood of confusion between those two use cases is comparable to the likelihood of confusion with a movie or a car. You have a trademark, that's awesome. Even McDonald's trademark, which is one of the broadest, is scoped.

I was confused when I first saw the project, cartoon logo and "Gremlin Free" name, and I'm intimately familiar with both the Apache TinkerPop Gremlin open-source project, its third-party libraries, and the extensive ASF legal process we went through registering and identifying the use of marks.

Read up on your trademark, IP and copyright law. The use of similar marks is not permitted when there is potential for confusion, such as two different software projects, esp with overlapping audiences.

NB: TinkerPop is NOT a "graph database", it is a collection of software libraries for connecting to, using, and managing graph databases and distributed processing platforms. Gremlin is a primary part of that stack -- Gremlin proper is the programming language, and the Gremlin GTM is the runtime.

It doesn't sound like you were confused about whether they were the same project, it sounds like you were immediately concerned whether the other endeavor was in conflict with your trademark. That may constitute confusion but it's not the sort of confusion trademarks were designed to mitigate.

We did register the Gremlin trademark early on to be sure we had all of our bases covered:


So you begin by imitating Google's old logofont, and then you switch to something resembling Google's new logofont [1].

And for a company name, you decide to use the name of an established programming language and registered mark of a top-level Apache project [2], which has been in use since its inception over a decade ago, and incidentally it's a project and programming language that both your previous employers know well [3].

That's some inspired work, overflowing with creative originality. And to top it off, you have animated graphs floating in the background. Yeah, no possibility for confusion there. There's a name for that, you know? Google it and see if you can find the word -- its definition has to do with siphoning goodwill. Who advised you on this and agreed these were wise choices and that this would be a good way to begin?

[1] https://www.underconsideration.com/brandnew/archives/new_log...

[2] https://tinkerpop.apache.org/

[3] https://www.gremlin.com/team/

No, there are ~70 or more third-party projects building on the Apache TinkerPop Gremlin platform, and new ones are popping up all the time. My first thought was this is likely a new project site for a Gremlin client library or service provider that was still using placeholder images for branding [1].

When a new third-party project site goes online, SOP is to review the new site/code, understand what it is and how it fits into the overall ecosystem, and then put their project lead in touch with TinkerPop's resident artist and illustrator, Ketrina Yim [2], who is available to create custom artwork/logos and Gremlin character illustrations tailored around the new project's theme.

However, when I dug into this site's docs to see how it was related, I couldn't find a clear connection. And when I checked with our team, they said Apache Legal has been aware of the issue since last year.

[1] Branding https://tinkerpop.apache.org/policy.html

[2] Ketrina Yim, Illustrator https://tinkerpop.apache.org/index.html#committers

> the names of all Apache® projects, software products, and their logos are trademarks owned by the Apache Software Foundation on behalf of our project communities. Note that while some Apache project names and logos are registered in the US and various countries, even unregistered names and logos are still trademarks of the ASF and should be treated with respect.

Only trademarks registered with various legal jurisdictions are enforceable, such as the "Apache TinkerPop Gremlin" name you registered. Unregistered marks used by an Apache project are simply claims on the mark, which in many cases are not enforceable. One example is the Apache Groovy project, where its PMC members erroneously claim the exclusive use of "Groovy" for their programming language, when in fact their only valid claim is to "Apache Groovy" (and its previous monikers "org.codehaus.groovy" and "Codehaus Groovy").

Gremlin is the original/previous moniker and has been since 2009. Apache TinkerPop was added upon joining Apache in 2016.
