How We Build Code at Netflix (netflix.com)
624 points by hepha1979 on Mar 9, 2016 | 135 comments

I'm interested in knowing more about the "25 Jenkins masters" that they have, and how much they have modified/built for Jenkins to make it work for them.

We are currently in a state of "big ball of plugins and configuration". A bunch of plugins have been installed, and lots of manual configuration has been put into jobs so that everybody has what they need to build their software. It has led to Jenkins being a "do everything" workflow system. The easy path that Jenkins provides, to me, seems like the wrong one - it makes it easy to just stuff everything in there because it "can" do it. This seems to lead to tons of copy/paste, drift, all types of different work being represented, and it is starting to become unmanageable.

Have others seen this happen when using Jenkins? How have you dealt with it?

Netflix has been quite involved in the Jenkins project, including the Job DSL Plugin, which enables the automated creation of new Jenkins jobs, e.g. when a new Git branch is created, by defining the job structure with a simple Groovy-based DSL.
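For illustration, a minimal Job DSL sketch along those lines (the repo URL and the hard-coded branch list are hypothetical; a real setup would discover branches via the Git API):

    // One build job per branch, generated from a Groovy seed script.
    def branches = ['master', 'feature-x']   // e.g. discovered automatically

    branches.each { branch ->
      job("myapp-${branch}") {
        scm {
          git('https://example.com/myapp.git', branch)
        }
        triggers {
          scm('H/5 * * * *')      // poll for changes every five minutes
        }
        steps {
          gradle('clean build')   // or shell('./ci/build.sh')
        }
      }
    }

Because the seed script lives in version control, job definitions get code review and history like everything else.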

Taking this further, the upcoming release of Jenkins 2.0 is going to put a lot more emphasis on pipelines-as-code, where entire workflows can be defined in code, and version-controlled, as opposed to clicking everything together via the web UI.

See https://jenkins-ci.org/2.0/

Some things are already available, see this: https://github.com/jenkinsci/workflow-plugin/blob/master/REA...
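As a taste of what the workflow plugin already supports, here is a minimal scripted-pipeline sketch (stage names, build commands, and the test-report path are hypothetical):

    // Checked into the repo; Jenkins runs this instead of click-configured steps.
    node {
      stage 'Checkout'
      checkout scm                 // the pipeline definition lives with the code

      stage 'Build'
      sh './gradlew clean build'   // the rest is versioned shell/Gradle

      stage 'Test'
      sh './gradlew test'
      step([$class: 'JUnitResultArchiver', testResults: '**/build/test-results/*.xml'])
    }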

For the last 4 months, Groovy has been known as "Apache Groovy".

I've spoken to multiple companies about how they do builds, and it's incredible (to me) how many of them use Jenkins. In the end, almost everyone ends up with the kinds of problems you can see on this thread:

  * Plugins are useless
  * Configuration drifts all over the place
  * Jenkins ends up just being used as a job runner to run shell scripts
  * Every repository or project implements similar but subtly different build scripts
We faced this exact problem while writing Deployboard (https://www.youtube.com/watch?v=tgmJa7FciDg, especially starting at around 18:45), specifically in the build system there. I explicitly wanted to avoid having engineers write shell scripts. People don't really know how to write them properly, nobody ever tests them, and they're just not really treated the same way you treat "real code".

Instead, we ended up using resque. We're a rails shop and already use resque extensively in production. Resque scales really, really well, so we were pretty confident that we would be able to run builds on this system indefinitely without needing to split anything into separate clusters. And resque jobs are ruby, so they can be written and tested just like ruby.

As a result, we were able to standardize on just a few jobs (e.g., BuildRailsJob, GradleBuildJob, NodeBuildJob) with a few arguments for each job. In the process, we wrote a lot of really nice primitives, so if we do need a different kind of job (or a modification to an existing job) then those can be made pretty easily. On the whole, I've been extremely pleased with the resulting system.
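A Resque build job in that style can be quite small. This is a sketch only; the class name GradleBuildJob comes from the comment above, but the queue name, arguments, and command layout are assumptions:

```ruby
# A Resque-style build job: plain Ruby, so it can be unit tested like any
# other code rather than living as an untested shell script in a CI config.
class GradleBuildJob
  @queue = :builds  # Resque reads the target queue from this class variable

  # Compose the Gradle wrapper invocation for a list of tasks.
  def self.command(tasks = %w[clean build])
    (['./gradlew'] + tasks).join(' ')
  end

  # Resque entry point: cd into the checked-out repo and run the build,
  # raising (and thereby failing the job) if Gradle exits non-zero.
  def self.perform(repo_dir, tasks = %w[clean build])
    Dir.chdir(repo_dir) do
      system(command(tasks)) or raise "build failed in #{repo_dir}"
    end
  end
end
```

Because `command` is separated from `perform`, the interesting logic can be asserted on directly in a test suite without shelling out.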

However, you're not tipping your hat to the huge value Jenkins provides:

    * Reporting such as unit test reporting
    * Emailing and notification
    * A well known plugin system
    * User security
    * Master / Slave management
Jenkins is not just a job runner. It's a job runner that collects reports, maintains a history of the jobs being run, manages user security, and has a public API (REST) and a plugin system. That is a lot of stuff to implement on your own.

Really Jenkins just needs to fix the configuration drift issue.

The Pipeline Plugin lets you keep a Jenkinsfile in your repo: https://wiki.jenkins-ci.org/display/JENKINS/Pipeline+Plugin

DotCi gives you a .ci.yml like .travis.yml: https://github.com/groupon/DotCi

Together they solve every single problem you have listed.

Looks like Jenkins has discoverability issues. I am curious what kind of things you tried with Jenkins before writing your own build system.

I think most people do. We did a couple of things to get out of that state at Conjur.

First, we made all of our builds into docker containers. Under this system, slaves (we call 'em executors now) only contain very basic software - docker, git, and make to be specific. This means that our builds are entirely self contained, and we don't have to worry about, for example, messing around with RVM on executors. If the containerized build works locally, it pretty much always works on Jenkins as well.
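The executor side of that scheme can be a one-liner. The sketch below is illustrative only; the registry, image name, and `make test` entry point are assumptions, not the commenter's actual setup:

```shell
# All the executor needs is docker, git, and make; the build itself runs
# inside the project's own image, so it behaves the same locally and on CI.
IMAGE="registry.example.com/myproject/build:latest"

build_cmd() {
  # Emit the exact invocation so it can be inspected or dry-run.
  echo "docker run --rm -v $(pwd):/src -w /src $IMAGE make test"
}

run_build() {
  eval "$(build_cmd)"
}
```

Mounting the working copy read-write and running a fixed command means no per-executor toolchain (RVM, JDKs, etc.) ever needs configuring.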

Second, we started using the Job DSL plugin to manage configuration and brought in some autoscaling and machine identity. Beyond what I do as a platform engineer to turn my projects into Jenkins builds, I don't completely understand it (nor do I have to, which is a good thing!), so I'll let our DevOps guy take it from here:


You might like to look at Concourse[1], which makes explicit, checked-in pipelines of containerized builds its central model.

I am very bullish on Concourse.

[1] https://concourse.ci

We fought the copy+paste drift for a while. Most jobs were very similar, but just different enough that debugging things when something went wrong was often both time consuming and frustrating.

Ultimately, we took an approach similar to Travis CI, or GitLab CI [1], only using shell scripts since that plugs into Jenkins easily enough. Every project has a CI script and a release script in a common location relative to the project root that take care of everything needed to take a fresh clone of a repository, run the tests, and deploy the project (depending on a few environment variables) if the tests pass.
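A skeleton of that per-project convention might look like the following. The `ci/` paths, variable names, and `make test` entry point are hypothetical; only the shape (fresh clone -> test -> conditionally deploy) comes from the comment above:

```shell
set -e

# Decide whether this run should also deploy, based on a couple of
# environment variables set by the CI job configuration.
should_deploy() {
  [ "${CI_BRANCH:-}" = "master" ] && [ "${DEPLOY:-false}" = "true" ]
}

# The single entry point every project exposes: run the tests from a fresh
# clone, then hand off to the project's release script when appropriate.
run_ci() {
  make test
  if should_deploy; then
    ./ci/release.sh "$CI_BRANCH"
  fi
}
```

The Jenkins job then just runs this one script, so all the interesting behavior is versioned with the code.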

We have a 1-click operation to set up a new job in Jenkins, and it handles all the configuration based on an XML template. Everyone understands that they're not supposed to make manual tweaks to the jobs once they're set up, and a year later, things are working pretty smoothly.

[1]: http://doc.gitlab.com/ce/ci/quick_start/README.html

This way is the only way we've found to do it, even on a smaller scale. Once you get above 20-25 jobs, or create a self service way for teams to create projects+github+jira+cloudformation+jenkins etc, you have to aggressively standardize. Often this means standardizing on the lowest common denominator (shell script).

Thanks for sharing.

Yes - we've definitely seen that. The most common thing we've seen is that you start with a Jenkins job that just runs a build process, and then you end up with a Jenkins job that has a huge, unversioned shell script which calls the build process somewhere in the middle.

To resolve that we've tried to push the shell script into a file in the repository, which Jenkins then checks out and runs, which makes it easier to maintain and faster to set up new build machines.

We haven't done much in the way of modification of Jenkins, and the big ball of plugins and configuration (with the associated deadlocks from bad plugins and some bad core Jenkins foo) is now 25 smaller balls of plugins and configuration. We're looking at ways to manage the new pain and are considering a number of routes, from writing some tooling to help manage the world, to introducing a smaller CI system that handles the very basic cases, keeping Jenkins for more complex solutions.

Our biggest problems are probably plugin creep (and the poor state of plugin maintenance in some cases), the cluttered UI with chunks of configuration hidden behind "Advanced" buttons, the inability to store job configuration with code a la Travis (which may be better with workflow), the difficulty of handling routine maintenance across shards (without something in place like Operations Center), and API inconsistencies (and the inability to fully manage via the API, forcing things like Groovy scripts over Jenkins remoting). Once we have some direction on where to go, I'm sure we'll follow up with some more blog posts and share things at our meetups.

I recommend, strongly, looking into Concourse. It's well suited to this sort of case, because that's exactly what it was built for.

I've seen pipelines with hundreds of inputs (an embedded software company), others with over a dozen stages (other teams at Pivotal), both kinds with fan-in/fan-out as necessary. Today I even saw a generic pipeline that could test and build identically-structured product files in a uniform way across quite different products (Redis, Apache Geode etc).

People have written resources (the main means of extension) in Go, Ruby and Python so far. If you can put it in a Docker image and execute it, then it can be taught to behave like a Concourse resource.
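For a sense of the model, a minimal pipeline sketch (resource and job names, the repo URL, and the task image are all hypothetical; consult the Concourse docs for current syntax):

    # A git resource feeding one containerized test job; the pipeline itself
    # is a checked-in YAML file, not clicked-together UI state.
    resources:
    - name: source-code
      type: git
      source: {uri: "https://example.com/app.git", branch: master}

    jobs:
    - name: unit-tests
      plan:
      - get: source-code
        trigger: true            # run on every new commit
      - task: run-tests
        config:
          platform: linux
          image_resource:
            type: docker-image
            source: {repository: ruby}
          inputs: [{name: source-code}]
          run: {path: sh, args: [-c, "cd source-code && rake test"]}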

So far it's working well for us and others trialling it. I am very bullish about the future of Concourse.

I am going to read documentation of Concourse. Btw have you written like a blog or article about your experience with Concourse?

No, I haven't.

My first approach was as an individual, which was quite difficult, because it took me a while to work out the differences between jobs and tasks and how to lay it all out. Lots of copying and pasting from other pipelines I studied.

At this point pretty much every team in Pivotal's Cloud Foundry division is running a Concourse pipeline, including the one I belong to. What's been interesting is how each team is experimenting with patterns that Concourse makes possible.

For example, I and my peers are now turning various things into "executable documents" -- the terminology sucks at the moment. Think of all the stuff buried in READMEs and wikis and long-forgotten cron jobs. How to build that special docker image that you only update every few months. The database backup. Keeping the blue-green deploy codepath warm.

When we find another one of these, we now encode it into our pipeline. That way, if I need to find it, it's there. And if it goes bad, the pipeline definition points me at where to go looking for everything of interest.

Another pattern that is emerging for us is "enforce project invariants". For example, we have multiple repos, so multiple Gemfiles and Dockerfiles, all of them with ruby versions set. At the front of our pipeline there is a little gateway to check that these are all identical. If not, it prints a table of versions found, so again, I know instantly where to go look. We have various other little invariants that turn days of insane debugging into a few seconds of sanity-checking.
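A gateway like that can be a few lines of script. This sketch is an assumption about how such a check might look, not the team's actual code; the pin formats ("ruby '2.3.0'" in a Gemfile, "FROM ruby:2.3.0" in a Dockerfile) are illustrative:

```ruby
# Extract the ruby version pinned in a Gemfile or Dockerfile, or nil.
def pinned_ruby_version(text)
  text[/^ruby ['"]([\d.]+)['"]/, 1] || text[/^FROM ruby:([\d.]+)/, 1]
end

# files is a hash of {path => file contents}. Returns true when every file
# pins the same version; otherwise prints the table of versions found so
# you know instantly where to go look.
def versions_consistent?(files)
  versions = files.map { |path, text| [path, pinned_ruby_version(text)] }
  return true if versions.map(&:last).uniq.size == 1
  versions.each { |path, v| puts format('%-24s %s', path, v) }
  false
end
```

Run at the front of the pipeline, a mismatch fails fast with a readable report instead of surfacing as a confusing build failure hours later.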

None of this is novel. What Concourse lets me do is hoist good design practices out of code and apply them to CI. The final feedback loop I rely on can itself now be checked in, broken apart according to SOLID principles and even (my colleagues in Buildpacks are pioneering this) unit tested. Some colleagues in London have designed a fully generic pipeline that can be configured at runtime to build any of the five products they manage.

tl;dr fuck yeah Concourse.

It's really the same Jenkins AMI, but sharded for different teams.

One of the things Spinnaker does is plug in Jenkins jobs as a reusable stage that is parameterized and scoped to an application deploy. This has allowed one team to move from 40 Jenkins jobs created via the Job DSL plugin to just 6.

> A bunch of plugins have been installed, and lots of manual configuration has been put into jobs so that everybody has what they need to build their software.

You could use something like this if you are using GitHub: https://github.com/groupon/DotCi

Job configurations can be version controlled and reviewed like everything else.

Also, the plugin does other optimizations, like storing job/build data in a database so Jenkins doesn't slow down as you create more builds/jobs. So you don't need "25 Jenkins masters".

For multi-job pipelines, check out https://wiki.jenkins-ci.org/display/JENKINS/Pipeline+Plugin

I think the copy/paste is not inherently bad as long as it is version controlled and visible to everyone. That is why we have all CI configuration in the repo, in .gitlab-ci.yml.

I was surprised about needing 25 different masters. On GitLab.com we have a single clustered application that handles over 1600 GitLab Runners (called build slaves in Jenkins).

Agreed, we have a basically arbitrary number of runners (near releases it gets into the hundreds), and one master handles it just fine.

Glad to hear that. We'll soon announce an autoscaling runner that allows you to boot up new instances automatically. I love that you sometimes run hundreds of runners, would you like to do a guest post? Email me at website@sytse.com

I've seen Jenkins devolve into this at one of the previous companies I worked at. There were (at one point) two or three people simply tasked with writing, improving, and debugging Jenkins plugins, as well as random configuration issues and failures. Not a pleasant experience.

Our team ended up forgoing the plugins and simply writing shell scripts (checked into our repository) that would handle various pieces of the build workflow. Our Jenkins job then became:

  Call script 1
  Call script 2
I'm sure there are better alternatives, but at that time it allowed us to version changes to our build, and turned Jenkins into nothing more than a glorified task runner - which we were fine with.

I've seen something similar happen and switched to https://www.go.cd/

GoCD is very difficult to version-control and the interface is, to put it politely, in need of some love.

At Pivotal my colleagues working on Cloud Foundry poured a lot of engineering effort into making GoCD scale across multiple teams, repos, sites and so on, and it just never worked out.

Alex Suraci wrote Concourse, dogfooded it on a project team, and now pretty much the whole of Cloud Foundry is being built with Concourse pipelines.

Missed this reply - agree that the interface is confusing. Concourse looks promising!

>now pretty much the whole of Cloud Foundry is being built with Concourse pipelines.

What are some of your thoughts on,

* How it compares to Jenkins pipeline plugin https://wiki.jenkins-ci.org/display/JENKINS/Pipeline+Plugin

* Could it have been written as a Jenkins plugin instead of a whole new CI system? I like some of the features of the Concourse pipeline, but it doesn't have support for the wide range of remoting/plugins that Jenkins supports.

I'll defer to the authors for their experiences with Jenkins: http://concourse.ci/concourse-vs.html

As for writing a plugin, no, it would not have been possible. Concourse has an entirely different model of operation and needs easy access to containerisation facilities to achieve it.

Concourse doesn't really think in terms of "plugins". What you become accustomed to is wondering "is there a resource type for this?".

Right now I work on a software repo with a moderately complex API for uploading final binaries and releasing them to clients to download. Instead of telling people to write scripts or install a plugin, I can point them to the resource that another team has written.

"Just add the resource". Released software is now just a stream of events, no different from git commits, S3 files, points in time, Tracker stories etc etc. Every resource has the same interface, so it makes it possible to click together stuff into clever combinations, rather than lashing things together awkwardly and hoping it'll work.

Is it more reasonable to try to avoid this problem in the first place, or to accept that it's likely to happen and deal with it then? Both options seem reasonable to me (and I'm trying to get my company to adopt Jenkins).

Jenkins DSL plugin and/or configuration management (chef/puppet) is necessary.

For those wondering how this applies to Node.js users at Netflix like myself, it's in there towards the bottom of the article:

> "As Netflix grows and evolves, there is an increasing demand for our build and deploy toolset to provide first-class support for non-JVM languages, like JavaScript/Node.js, Python, Ruby and Go. Our current recommendation for non-JVM applications is to use the Nebula ospackage plugin to produce a Debian package for baking, leaving the build and test pieces to the engineers and the platform’s preferred tooling. While this solves the needs of teams today, we are expanding our tools to be language agnostic."

Twitter has an agnostic tool called Pants [1]. I run a JVM shop and we are just now using some Node.js, so we have been trying to figure that out as well.

[1]: https://pantsbuild.github.io/dev.html

> The Netflix culture of freedom and responsibility empowers engineers to craft solutions using whatever tools they feel are best suited to the task.

I absolutely love that. I'm a huge fan of what Hastings and company have done over there in terms of culture and making Netflix a unique and desirable place to work.

I think it's time for another round of "find a way to make Netflix hire me."

The real interesting thing to me is the language around "paved road"s, and management support for tools teams to maintain those paved roads.

It's easy to say, "Engineers should use the tools they deem appropriate" and make the engineers happy. What's harder is to support those individual efforts with attention and money to build a common consensus and a shared toolchain.

"Paved roads" is nice language to describe the utility of consensus without condemning those who depart from it. I dig it.

There's an open house coming up. Let me know if you'd like to come (if you're in the bay area) zgtjyizv@abyssmail.com

Awesome. I just sent you a message. Thank you!

Spinnaker is an amazing tool. Really makes it easy to confidently deploy applications via immutable infrastructure.

Immutable infrastructure is the future and it seems that even Netflix is planning to use containers for that: "Containers provide an interesting potential solution to the last two challenges and we are exploring how containers can help improve our current build, bake, and deploy experience."

I think the future of deployments is close integration with your source code management. Every new push builds a container that can go through the following steps:

1. Deployment (server with only test traffic)

2. Post-deploy test (smoke test)

3. Canary (part of the traffic)

4. Live / multiregion deploy

5. Manual overrides

From http://techblog.netflix.com/2015/11/global-continuous-delive... : "Spinnaker also provides cluster management capabilities and provides deep visibility into an application’s cloud footprint. Via Spinnaker’s application view, you can resize, delete, disable, and even manually deploy new server groups using strategies like Blue-Green (or Red-Black as we call it at Netflix). You can create, edit, and destroy load balancers as well as security groups."
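That staged flow maps fairly naturally onto CI configuration kept next to the code. As a purely illustrative sketch in a .gitlab-ci.yml style (job names, scripts, and the manual gate are all assumptions):

    stages: [build, test, canary, deploy]

    build_container:
      stage: build
      script: docker build -t registry.example.com/app:$CI_BUILD_REF .

    smoke_test:
      stage: test
      script: ./ci/smoke-test.sh      # post-deploy test against test traffic

    canary:
      stage: canary
      script: ./ci/deploy.sh --canary # part of the traffic

    multiregion_deploy:
      stage: deploy
      script: ./ci/deploy.sh --all-regions
      when: manual                    # step 5: manual override/promotion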

Also see https://gitlab.com/gitlab-org/gitlab-ce/issues/3286

> Immutable infrastructure is the future

Are there any downsides to immutable infrastructure?

I'm on a team developing an immutable infrastructure solution with (Docker) containers at a large financial institution. The biggest downsides so far:

* It can get complicated very quickly since it involves a lot of stages, systems, and loads of brand new tools where you're unlikely to be able to hire existing talent (you need people who can just read the docs on a complicated tool/architecture like Kubernetes and then start using it).

* Speaking of talent, you need developers who are reasonably well-versed in Linux systems administration. Could your developers write a shell script that can configure every little aspect of a Linux host from packaging to authentication? Your developers also need to understand some hard core topics of TCP/IP (e.g. anycast), firewalls (e.g. port mapping), and DNS (TTLs, views, SRV records, and more). They also need to be well-versed in security (e.g. the impact of running as root inside a container) and especially authentication (e.g. HMAC, securely storing shared secrets) and encryption (SSL, managing certificates, knowing which hashing algorithm to use)!

* Dealing with existing bureaucracy. Especially rules and policies in regards to security. For example, some people will think that containers and immutable VMs should be treated like regular hosts. That can mean waiting hours for an inventory system to acknowledge their existence before you can put them into production. Or having to wait 30 minutes after every container/VM is brought up so that a security tool can perform a scan even though each container/VM would be identical in every way and have already been scanned before production deployment. Sigh.

Of course, a lot of that depends on where you work. At my work right now the biggest challenge is the bureaucracy for certain. If I can get containers classified differently than VMs then all my policy/BAU problems will (likely) go away. If I can't get that then, really, there's no point to our efforts since there would be no advantage to end users (the consumers of what we're building).

I may be able to live with a pointless, half-hour-long security scan every time we bring up a new container but I absolutely cannot live with an 8-hour-long wait for a container to show up in our system-of-record (inventory system).

We are doing something very similar (mesos rather than k8s), also in a financial organisation.

Our main issue is also security - as docker containers aren't very good at actually containing, from a security PoV, we're using SELinux MLS, assigning each container a unique SELinux level/category (fedora-atomic sets this up nicely for you if anyone wants to try).

This is great for security, but breaks so many things - lots of the cutting edge work with docker and related tech isn't done by people that care deeply about security, or really understand this sort of usecase, so we run into a huge amount of edgecases and bugs. We spend a lot of time fighting this.

It also makes having a really good build/deploy pipeline essential, as well as a good environment set up to reproduce and diagnose issues.

Along a similar grain, I'm curious if anyone's used Joyent's infrastructure stack internally... I know they offer it, but not familiar enough with it. It seems Solaris containers as a base for docker containers is a better security model, but not sure what parts, if enough is open to implement without paying consultation from Joyent to get started even. It's definitely a compelling model.

I am only slightly surprised that there's still a lot of bugginess in this aspect, though I haven't had to deal with that level of security need on the infrastructure side in a number of years. I'd rather be doing app-dev over ops-dev.

HashiCorp also has a pretty complete solution called Atlas : https://www.hashicorp.com/ ( different from the Netflix Atlas http://techblog.netflix.com/2014/12/introducing-atlas-netfli... )

Joyent's Triton looks cool: https://www.joyent.com/ as does CoreOS's Tectonic (built on top of k8s): https://tectonic.com/

I've been following CoreOS with great interest, I was just thinking from a security standpoint Joyent's Triton looks amazing... I don't think their pricing is competitive with Google/AWS/Azure though.

Hadn't looked at Atlas... I changed jobs into a larger company, and don't really have to deal with the ops side ever. Though eagerly awaiting a blessed docker based solution to become available in the org... it'll mean that projects can be developed more readily without being limited to the Java based infrastructure in place now.

Do you have links pointing at what docker containers leak from a security PoV ?

This largely depends on how you're implementing them. If you're using Kubernetes, for example, all Docker containers being managed by Kubernetes can talk to each other without restrictions. From each container's perspective they're part of a big, flat network with no firewalls between them.

So if you have a container running the database for client A that means that when client B spins up their little malicious container they can directly connect to client A's DB.

That's one way that Docker containers can "leak" (sort of). Another way is that the filesystem for running containers is directly accessible from outside the container. That means that anyone with the power to start/stop containers has the power to read the filesystems for all other containers.

There used to be a problem with, "root inside the container is the same as root outside the container" but that was fixed with Docker 1.9 via the new UID mapping feature. So the filesystem and processes running as UID 0 inside the container will actually be some high UID in the parent OS. So unless you're running Red Hat's version of Docker (which is currently hard set to be Docker 1.8!) you can upgrade and solve that problem pretty quickly with a few tweaks.
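For reference, that remapping is enabled on the daemon side; in later Docker releases it is exposed as the userns-remap setting (sketched here as a daemon.json fragment - exact flag names and defaults have varied across versions):

    {
      "userns-remap": "default"
    }

With that in place, UID 0 inside containers maps to an unprivileged high UID range on the host.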

I'm a network engineer with a strong interest in container networking -- where does anycast come into play in your immutable infrastructure? I am highly curious!

Right now for us it mostly just DNS using anycast but we may use anycast as part of a (forthcoming) microservices architecture.

The current popular method is to simply query a service discovery resolver of some sort (e.g. https://docs.docker.com/swarm/discovery/) to figure out which server to query, say, the weather (very important in finance! haha =). That's great and necessary if your architecture spans across your internal cloud, Amazon, and cloud-provider-of-the-day but if you have a fixed set of data centers where you know your applications will always be running (say, because of legal reasons =) then you can just use anycast as long as you keep ports consistent.

For most microservices deployments I'd imagine they would want their stuff to be accessible via a global DNS name and if they're going to do that then they might as well take the next step and just standardize on a port as well. If you have a single (global) name and a single port for an API why do you need service discovery (for that)? You don't. You just need to make sure that when clients query your API that they don't wind up using that one server you've got on the other side of the world on a 56k ISDN line (sadly yes, that does happen).

Having said that we may end up using popular service discovery tools anyway because they're convenient and we can use them in horrible, do-as-I-say-not-as-I-do ways to hack legacy applications into containers =D

Thanks for sharing your insight. Comments like this are why HN is my favorite place on the net.

Sounds like Devops.

I think immutable infrastructure still requires a lot of tools and moving parts today. We're spending a lot of time thinking about how to improve this. Two ideas we have:

1. Deploy directly from GitLab to Kubernetes: https://gitlab.com/gitlab-org/gitlab-ce/issues/14040

2. Add a container registry to GitLab itself: https://gitlab.com/gitlab-org/gitlab-ce/issues/3299

Any comments or suggestions are welcome.

You need to commit to it; you can't just toe the line. If you currently depend on being able to SSH into boxes and tail logs, you will feel a lot of friction.

However, practices like those tend to be a crutch anyway.

Build wait times invoke 90's nostalgia.

It takes a while to bake the image, and storing thousands of images can add up. However containers can address both points.

Google is actively adding Kubernetes support as well: https://github.com/spinnaker/spinnaker/issues/707

Can someone explain to me what "immutable" means in terms of "infrastructure"?

The other replies here are great but let me give you the layman's version of what "immutable infrastructure" means:

If it works for me it works for everyone.

You never patch or upgrade immutable infrastructure. You just replace what you've got with a new VM or container. Containers being preferred because they can be started & stopped near instantaneously and there's nothing like a virtual BIOS that could have different configurations like with VMs.

You don't stand up a VM or container then "log in to configure it". Once the VM is "up" that's it. You're done. At that point you just need to point your load balancers/DNS at the new stuff then take down the old stuff.

One interesting aspect of immutable infrastructure such as this is that it is completely incompatible with loads of existing security policies and what would have been considered "best practices" just a few years ago. For example, you might have a security policy that states that everything must be scanned within 30 days for malware/out-of-date packages/whatever. Yet with immutable infrastructure your hosts or containers may only be up for a few days before being replaced!

So when your security team freaks out because none of your hosts/containers are showing up in their systems you'll have a lot of explaining to do =D

"We need to scan your hosts so we can ensure that you're installing security patches."

"We don't do that."

"You don't install security patches?!?"

"Yeah, well, you see..."

Trust me when I say that trying to explain how it all works and why it's more secure than old school deployments is not easy!

This is why your security team should be integrated into how you (securely!) build and deploy software. In this specific case, that would mean maintaining the OS-layer Dockerfile and scanning the dependency tree for build artifacts.

Yeah, it's actually a lot more complicated than that. Let's assume the security team creates the Dockerfile. It'll look like this:

    FROM ubuntu
    # default working directory; override with --build-arg workdir=...
    ARG workdir=/bar
    WORKDIR ${workdir}
    COPY developer_script.sh .
    RUN ./developer_script.sh
So now with each new container update you'd need someone from the security team to audit/review `developer_script.sh`. It's kind of pointless if your goal is fast deployments.

If you just need to make sure your developers don't make a mistake in terms of securely configuring their containers (and making sure to always use the latest software) then you simply scan them before the canary stage. The problem there is, "what are you looking for?"

Also, there exists only one tool to scan Docker containers (OpenSCAP) and if your security team doesn't like it, well, you're screwed: https://github.com/OpenSCAP/container-compliance

The other problem is that OpenSCAP only checks the container's packages for compliance. It doesn't actually scan the container's filesystem for things like JREs and bundled libs. So if you're using Docker best practices by keeping your images as minimal as possible you may not even have a package tool inside your containers. In that case how do you check for things like out-of-date versions of Java?
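One low-tech way to catch those bundled runtimes is to scan an exported container root filesystem directly (e.g. obtained via `docker export <container> | tar -xf - -C rootfs/`). The file-name patterns below are illustrative, not an exhaustive detection rule:

```shell
# List files that look like JVM launchers or libraries anywhere under the
# given tree - things a package-based scanner like OpenSCAP would miss.
find_bundled_jres() {
  find "$1" -type f \( -name java -o -name 'libjvm*' \) 2>/dev/null
}
```

Each hit can then be version-checked (or flagged for the owning team) before the image reaches the canary stage.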

Another problem is you assume there's some modicum of control over what's inside the containers before they're deployed. We're in charge of creating the infrastructure for running containers with the promise that end users (who would be various application teams) can make their own containers (or at least their own Dockerfiles) and deploy them on our infrastructure.

We can work with them to help develop, say, a Kubernetes pod config for their app but as far as what their app is or what gets bundled with it we'll have no knowledge after the first successful deployment.

I never said you need or should have package tools inside your container, but you do need to be able to track the lineage of what was put into those containers, so you can easily search for, e.g., deployed container revisions with an out-of-date Java version. This is where you need a team devoted to build that creates this kind of infrastructure. Otherwise, you end up with difficult-to-answer questions about your infrastructure.

tl;dr I assume there's some control because there should be.

> So when your security team freaks out because none of your hosts/containers are showing up in their systems you'll have a lot of explaining to do =D

> "We need to scan your hosts so we can ensure that you're installing security patches."

> "We don't do that."

> "You don't install security patches?!?"

> "Yeah, well, you see..."

If you use a tool like zypper-docker, you can create a new image quickly that applies just the security patches. Currently it only works for SUSE-derived containers but we're planning on making it distribution agnostic.

There's also some work we're doing to connect Docker containers running on SUSE enterprise systems with SUSE Manager (aka Spacewalk), so they would show up in their systems.

You can find most of this stuff in github.com/SUSE.

zypper-docker is kinda pointless. Just re-build from your Dockerfile using the latest upstream image. It takes like 10 seconds (depending on how many external dependencies need to be fetched).

Our Docker registry pulls down the latest images (that we use with our Dockerfiles) multiple times daily. So if the author of the Dockerfile just puts "FROM whatever:latest" they are guaranteed to have all the latest/patched software whenever they "docker build".

Sometimes you don't want to update all of your dependencies (isn't that the whole point of Docker?); you might want to just apply security updates. Just putting "FROM whatever:latest" is a bad idea, because you don't know whether "latest" will suddenly break your image.
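The two approaches being debated differ by a single line in the Dockerfile (image name and tag hypothetical):

```
# Rebuild-from-latest: every `docker build` picks up whatever the
# upstream publisher last pushed -- patches included, breakage included.
FROM whatever:latest

# Pinned: the base image is reproducible, but security fixes only
# arrive when you bump the tag deliberately.
# FROM whatever:1.2.3
```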

The point of zypper-docker is to quickly roll out security fixes with essentially no downtime. zypper-docker also allows you to quickly check the health of all of the images and containers on your servers, so you can figure out which ones need updating. Also, zypper-docker allows you to apply RPM patches (something that's quite crucial for enterprises).

So no, it's not pointless. In fact it solves a problem that not many people have been working on solving: updating containers.

Actually, the correct and only way to find out if ":latest" breaks your image is to create your image using ":latest"!

How are you going to know if the latest version of the upstream image breaks your package if you don't try it? The whole point of continuous integration is that there are no surprises when it comes time to push to production.

If ":latest" breaks your image you had better know:

A) Immediately.

B) What went wrong ASAP.

The integrated testing (that you're supposed to use with Docker best practices) should reveal if there's a problem and if it doesn't you'll still catch it during QA testing of your image.

The fact that ":latest" breaks your image should be nothing more than a few minutes to a few days of troubleshooting. While you're doing that, your existing production images will keep on chugging away.

The only rule that you must not break is that you have the next release of your image out the door, ready for production within 30 days. Why? Because that's the maximum (hard-coded) lifetime we (and you should too) allow images to stay running.
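A maximum-lifetime rule like that is easy to enforce mechanically. A sketch (the 30-day limit comes from the comment above; the function and image names are made up for illustration):

```python
from datetime import datetime, timedelta, timezone

# Hard limit on how long a built image may stay in service (assumed policy).
MAX_IMAGE_AGE = timedelta(days=30)

def expired_images(images, now=None):
    """Given {image_name: built_at}, return the names of images that
    have exceeded the maximum allowed lifetime and must be rebuilt."""
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, built in images.items()
                  if now - built > MAX_IMAGE_AGE)
```

Run something like this nightly and page the owning team on any non-empty result, and the 30-day rule enforces itself.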

But then you risk having untold other packages move under your feet. This is what concepts like "unattended-upgrades" address.

The risk associated with breakage from an automated upgrade is an order of magnitude less than the risk associated with being hacked because you weren't keeping up to date.

It means updating infrastructure by making changes to a versioned service definition instead of on running instances.

Would you fix a software bug by editing the code on a running server and tell yourself that you will add it to the repository later? Of course not. You would end up with a running instance of the code that is impossible to replicate.

Immutable infrastructure applies that same idea to running services.

Would you fix a software bug by editing the code on a running server and tell yourself that you will add it to the repository later?

Absolutely yes in the right circumstances (have you seen how this website works?).

Not every service needs to be written as "scale out to a billion nodes" architecture with eight layers of checks and balances between idea to production.

We can't globally say "everybody must use a 16 step verifiable app development pipeline" when some people run 3 servers and others run 3 million.

(Plus, not all services are stateless! Redeploying stateful servers is painful—you'll be killing user experience. Updating code live on a running server while maintaining internal continuity of state can maintain sessions/caches/game-state without annoying users by kicking out of their current flows.)

I agree with parts of this, sort of.

I maintain a personal server that hosts a variety of VMs for things like mail, static site hosting, nameservice, etc. Last time I rebuilt it, I thought, hey, I'll try doing it with these interesting tools. The immutable-infrastructure approach turns out to be pretty heavy in that case. A lot of times it feels like I have to perform a triple bank shot just to make a minor configuration change.

That said, I think the one-off server, carefully maintained by one person, is becoming very rare. When I build things that other people will work on, I think immutable infrastructure and the cattle-not-pets approach is the only responsible way to go. Even if traffic volume won't be huge, I think the clarity and ease of debugging you get is vital when it doesn't all live in one wizard's head.

I agree that stateful servers are a challenge in this context. But they're a challenge regardless. Having that one thing we're afraid to upgrade or restart can cause all sorts of development and business process issues. That might have been a worthwhile tradeoff 10 years ago when you had to buy all your hardware and you needed ops people to do a lot of stuff manually. Servers go down, and we might as well accept that from the beginning.

That's a false dilemma. There are more options than "edit code on running server" and "use a 16-step verifiable app deployment pipeline".

For example: update the code in your source control repository, and then build the new version of the software.

> That's a false dilemma

It's not, and that's coming from someone who does devops for thousands of virtual machines in production. Sometimes I don't have an hour to burn to update a repo, build new AMIs, roll them into production, roll the old ones out, and watch logs to ensure canaries pass acceptance tests and that production traffic to new instances isn't erroring out, all for minor changes (example: an nginx MIME type change).

Try something like Heroku. Commit the code to Git, push the repo to the Heroku repo and it deploys it to the servers for you. Easy.

You mean the same Heroku that has the uptime of a server under the dev's desk? No thank you.

Or, since we're talking about adding tooling for better efficiency - just update the source code, and let the tooling build and deploy! :)

Automatic (and safe, obviously) production cutover is still kind of a holy grail, but it's definitely doable, and being done as we speak by a number of pretty neat companies.

Roughly, that services are not upgraded in place but completely replaced, here by creating a new image and launching it in fresh instances, then shutting down the instances running old code. This makes sure that all instances are exactly the same; otherwise there might be differences between freshly created and upgraded ones, etc.

An example would be whenever new version of an app is deployed an entirely new EC2 instance is started from a fixed image. The app is copied onto it and then load balancers are changed etc...

The image the EC2 instance is started from is immutable while it is running. Of course the image gets updated regularly by devops with security fixes, but the running production instance(s) is/are never changed on the fly. Instead it is completely redeployed.
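The replace-don't-modify cycle can be sketched as pure bookkeeping (the function and instance names here are invented; real tooling such as Spinnaker layers health checks, canaries, and rollback on top of this):

```python
def immutable_deploy_plan(running, new_image, count):
    """Plan an immutable deployment: launch `count` fresh instances from
    the new image, shift traffic to them, then terminate everything that
    was running before. Nothing is ever upgraded in place."""
    launch = ["%s-%d" % (new_image, i) for i in range(count)]
    return {
        "launch": launch,            # new instances baked from the new image
        "shift_traffic_to": launch,  # repoint the load balancer
        "terminate": sorted(running) # old instances are discarded, not patched
    }
```

The key property is the last line: the old instances are never touched, so every instance in the fleet is bit-for-bit identical to what came out of the bake.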

You build the VM one time, deploy it to dev. If all tests pass, you deploy the exact same VM to test/stage/prod/whatever.

It's the difference between a new build (of the "same code") for each environment versus building the VM once and deploying that.

Simplified: Instead of modifying a deployment by uninstalling the old software, updating the config files and installing a new version, you create a new virtual machine image with all you need and deploy instances of that. Even if you just want to change a setting in a file you deploy new virtual machines. There are variations of this such as using Docker.

And the next step would be moving your config to something distributed like etcd or Archaius.

The article actually discusses this directly.

A short summary would be that your infrastructure (instances) are replaced for every deployment instead of updated in place. This is all done via continuous delivery tools (spinnaker, jenkins etc) and is very important to be able to easily scale microservices horizontally.

As far as I understood it, you don't make changes to running servers. Say you need to update a package, you spin up a new server with the updated package and once the traffic is migrated to new server, you spin down the old server.

It means that your releases take forever because you have to bake new images each time you want to push out a new release.

Very cool article. Amazing how much tooling Netflix has built themselves.

What's amazing is that Netflix let their teams develop solutions from scratch. Usually at big companies when their developers say something like, "can't we just build a solution ourselves?" they're laughed out of the room or, more likely, marked down as candidates in the next round of layoffs.

It helps that they have repeatedly proven this strategy works. When the CTO goes to the CEO and says "we're going to build this, not buy it," the CEO obviously trusts his decision making because he's seen him successfully execute his strategy so many times.

It sounds like Netflix engineering culture is built a lot around trust. Management trusts that only top engineering talent works there, and the sink-or-swim culture, coupled with performance bonuses and a rising stock price, ensures that the engineers are making the best decisions for the company. It's a lot easier to approve a "build" vs "buy" decision when you know that the interests of the engineers pushing for it are actually aligned with the company business interests.

Contrast this to a company with a large separation of incentives between the stock price and the engineers. In that scenario you end up with a bunch of engineers who are bored and want to prematurely optimize systems because there is no clear personal cost to doing so.

> It helps that they have repeatedly proven this strategy works.

I laughed at this because you can't "repeatedly prove" you're capable of developing your own solutions if you're never allowed to do it in the first place!

The enterprise mantra is: COTS or it doesn't happen.

Fairly new Netflix employee here (started last November). From what I can see so far, the "highly aligned, loosely coupled" culture here works very well. The company has a clear focus, we all get it, and we all work to make it happen. Teams have freedom to accomplish their goals the best way they see fit. There is quite simply not a lot of focus on particular means; the focus is on outcomes.

I've been in so many places where that response would be the correct one. I've spent a lot of time decommissioning some rather horrible outcomes of "can't we just build it ourselves".

When companies treat engineers like disposable commodities they should expect solutions that will be disposed.

NIH is the opposite of "disposable engineers".

It's saying your engineers are so much more precious and able than the barbarians outside the gate, that they should write their own test harness/DOM selection library/insert commodity software genre here.

To be fair most COTS solutions made for enterprises are absolute crap. It doesn't take much to improve upon them.

Enterprise tools are well-known to have absolutely horrible user experiences, extremely complicated configuration and architecture, and even worse documentation. I don't blame any manager/PM for thinking, "we can do better" with an internal team.

"let's wait for the community to fix the opensource code we rely on".

I've heard this a lot in my career (even academia)

Do they? The only big company I've been at (Amazon) has an absolute monster of an inhouse build and deployment/pipelining system. Although it could just be that they've been around longer than open source alternatives.

Netflix cares about "it works" and they have to have tools.

They also seem to have technical management as well. That makes a huge difference in building out and maintaining your own tools and also in hiring and retaining effective engineering talent.

I think it's more interesting to note that they couldn't exactly buy that from anyone else.

I believe at least some parts of the tooling were developed by an embedded team from Pivotal Labs.

Huh? The article describes a massive system and there is a lot of abstraction here, are you suggesting Pivotal was a part of the large abstraction or just grabbing credit for some minor part?

Not that I am aware of. It's possible they contributed to some of our OSS efforts, but I don't believe anything used internally was developed in that manner.

Pivotal ( the company that engulfed Pivotal Labs and Spring ) did contribute the Cloud Foundry integration in Spinnaker. This is not used at Netflix, though.

Major outage being reported worldwide.


Anything interesting deployed in the last hour?

If so, something in the CI/CD tool chain (Spinnaker) must have failed for it to move all the way to live without being caught.

There are many reasons that a site can go down, aside from a code deployment.

Not deployment or spinnaker related

can't connect from the UK...

What are the reasons for Netflix choosing nodejs for their front-end server and not java like in their back-end?

Since all frontend code is written in JavaScript, it makes sense to have the server code written in JavaScript as well, as we do quite a bit of work on the server. Also, we serve isomorphic views, so Node is a big help in that regard.

Java, even with WebJars, hot reloading (JRebel), and whatnot, just can't keep up with the superior Node tooling, not to mention that more and more frontend developers know Node and not Java.

I still prefer Java/JVM languages though for backend work.

Because using Java for a front end server is a horribly inefficient and painful experience.

Freedom and Responsibility! Teams are free to choose the tools that work for them. Netflix technical direction operates more like an ecosystem than a hierarchy.

I'm guessing it's their use of React both on the front-end and back-end.

I'm a Netflix fan, as a consumer and an engineer, and this blog post just reinforces my fanboi status. Amidst the descriptions of deployment tools and pipelines one thing stood out for me: the fact that AMI bake times are now a large factor, and that "installing packages" and the "snapshotting process" were a big piece of this. Containers are definitely the answer to this problem. You can deploy base images with the OS and common dependencies, and have the code changes be a thin final layer. Of course with such a sophisticated pipeline based on AMI deployment this change would not be trivial for Netflix, but the bottom line is they have described the primary container use case perfectly, imo.

I am fairly certain it is on the Spinnaker roadmap. There was a recent talk by Sangeeta Narayanan, "Containers so Far" [1]. At around 37:10 you can see how they will essentially allow containers to replace AMIs in the "bake" process. I'm looking forward to seeing Netflix roll out more container support, and seeing how Spinnaker progresses.

[1] http://www.infoq.com/presentations/netflix-containers

At Button we switched from baking AMIs to deploying containers on ECS and it's had a pretty amazing effect on deploys, both in terms of the speed and (due to them being quick and painless) frequency.

How do they 'externalize config' with respect to http://12factor.net/config?

There are many ways to accomplish an external configuration. I do not know how they do it at Netflix, but one could have the prebaked AMI reach out to a service discovery tool to pull its configuration at run time. For example, Consul [1] or etcd [2].

One advantage of this approach is that you can use the exact same AMI in testing as in production. All you would have to do is change the configuration which is pulled. You could determine which configuration to pull based on AWS tags on the EC2 instance, for example.
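A minimal sketch of that idea (the tag name, Consul KV path scheme, and agent URL are all assumptions for illustration, not how Netflix does it):

```python
import json
from urllib.request import urlopen

CONSUL = "http://localhost:8500"  # local Consul agent (assumed)

def config_key_for_instance(tags, service):
    """Derive a Consul KV path from an instance's AWS tags, so the exact
    same AMI pulls test config in test and prod config in prod."""
    env = tags.get("Environment", "test")  # default to the safe side
    return "config/%s/%s" % (service, env)

def fetch_config(key):
    """Pull the raw config value for `key` from Consul's HTTP KV API."""
    with urlopen("%s/v1/kv/%s?raw" % (CONSUL, key)) as resp:
        return json.load(resp)
```

The point is that the baked image contains no environment-specific values at all; only the tags on the instance (and thus the key it pulls) differ between environments.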

[1] https://www.consul.io/

[2] https://github.com/coreos/etcd

A combination of things such as Spinnaker providing granular environmental context to an instance by way of user data (things that the ec2 metadata service doesn't readily provide to an instance without API calls), dynamic properties which can be controlled down to the instance-level (http://techblog.netflix.com/2012/06/annoucing-archaius-dynam... ), and other solutions depending on a team's needs.

Good point - curious what do you exactly mean by this though, and how did you compare Spinnaker with it. Can you elaborate a bit more?

Really just a naming collision. IBM Spinnaker is a data store, Netflix Spinnaker is a software builder. The paper I linked is interesting if you're researching distributed data stores.

Isn't Netflix using Mesos (see http://techblog.netflix.com/2015/08/fenzo-oss-scheduler-for-... )? I don't understand what role it plays here.

The Fenzo scheduler is, if I recall correctly, used to do scheduling for Mantis and Titus (formerly Titan). Think of these as a streaming job infrastructure and a container service, respectively. Spinnaker can assist you in deployments to both (I use Spinnaker to deploy Mantis jobs, for example), but the two platforms are not yet ubiquitous -- we still deploy much of the ecosystem as baked AMIs.

Seems like a nice system, but would be improved by building with Bazel or Buck instead of Gradle.

What do you see as advantages/disadvantages of each tool?



Please stop posting these.

why? I'm trying to keep track of things that catch my eye

You can upvote a story and it will appear under "saved stories" when you click on your profile. Commenting to save stories for yourself is spamming everyone else. So yes, please stop.

It's an abuse of HN threads. Comments here are supposed to be substantive.

if you upvote something it'll show up in the list at https://news.ycombinator.com/saved?id=pitt1980. So just upvote things that catch your eye.

Sure, it's useful to you, but it adds to the clutter that hundreds and thousands of others must wade through.

You could just bookmark the thread...

thanks, I hadn't noticed that feature before

innovative solution. I'm fine with it.

This is how my company used to build software 10 years ago.

Now we have Docker containers, cloud VMs, GitHub, 1 click deployment, advanced metrics, Grafana, Go microservices, Slack bots, etc.

Sounds like Netflix is stuck in the past.
