
How We Build Code at Netflix - hepha1979
http://techblog.netflix.com/2016/03/how-we-build-code-at-netflix.html
======
mkobit
I'm interested in knowing more about the "25 Jenkins masters" that they have,
and how much they have modified/built for Jenkins to make it work for them.

We are currently in a state of "big ball of plugins and configuration". A
bunch of plugins have been installed, and lots of manual configuration has
been put into jobs so that everybody has what they need to build their
software. It has led to Jenkins being a "do everything" workflow system. The
easy path that Jenkins provides, to me, seems like the wrong one - it makes it
easy to just stuff everything in there because it "can" do it. This seems to
lead to tons of copy/paste, drift, all types of different work being
represented, and it is starting to become unmanageable.

Have others seen this happen when using Jenkins? How have you dealt with it?

~~~
christop
Netflix have been quite involved in the Jenkins project, including the Job DSL
Plugin, which enables the automated creation of new Jenkins jobs, e.g. when a
new Git branch is created, by defining the job structure with a simple Groovy-
based DSL.

Taking this further, the upcoming release of Jenkins 2.0 is going to put a lot
more emphasis on pipelines-as-code, where entire workflows can be defined in
code, and version-controlled, as opposed to clicking everything together via
the web UI.

See [https://jenkins-ci.org/2.0/](https://jenkins-ci.org/2.0/)

~~~
vorg
For the last 4 months, Groovy has been known as "Apache Groovy".

------
vlucas
For those wondering how this applies to Node.js use at Netflix, like myself,
it's in there towards the bottom of the article:

> "As Netflix grows and evolves, there is an increasing demand for our build
> and deploy toolset to provide first-class support for non-JVM languages,
> like JavaScript/Node.js, Python, Ruby and Go. Our current recommendation for
> non-JVM applications is to use the Nebula ospackage plugin to produce a
> Debian package for baking, leaving the build and test pieces to the
> engineers and the platform’s preferred tooling. While this solves the needs
> of teams today, we are expanding our tools to be language agnostic."

~~~
agentgt
Twitter has a language-agnostic tool called Pants [1]. I run a JVM shop and we
are just now using some NodeJS, so we have been trying to figure that out as
well.

[1]: [https://pantsbuild.github.io/dev.html](https://pantsbuild.github.io/dev.html)

------
Gratsby
> The Netflix culture of freedom and responsibility empowers engineers to
> craft solutions using whatever tools they feel are best suited to the task.

I absolutely love that. I'm a huge fan of what Hastings and company have done
over there in terms of culture and making Netflix a unique and desirable place
to work.

I think it's time for another round of "find a way to make Netflix hire me."

~~~
seanp2k2
There's an open house coming up. Let me know if you'd like to come (if you're
in the bay area) zgtjyizv@abyssmail.com

~~~
Gratsby
Awesome. I just sent you a message. Thank you!

------
moondev
Spinnaker is an amazing tool. Really makes it easy to confidently deploy
applications via immutable infrastructure.

~~~
sanjeetsuhag
Can someone explain to me what immutable means in terms of "infrastructure"?

~~~
riskable
The other replies here are great but let me give you the layman's version of
what "immutable infrastructure" means:

If it works for me it works for everyone.

You never patch or upgrade immutable infrastructure. You just replace what
you've got with a new VM or container. Containers are preferred because they
can be started and stopped near-instantaneously, and there's nothing like a
virtual BIOS that could have different configurations, as with VMs.

You don't stand up a VM or container then "log in to configure it". Once the
VM is "up" that's it. You're done. At that point you just need to point your
load balancers/DNS at the new stuff then take down the old stuff.
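The replace-then-repoint flow can be sketched in a few lines; `Instance`,
`LoadBalancer`, and `deploy` here are hypothetical stand-ins for illustration,
not any real cloud API:

```python
# Sketch of an immutable "replace, don't patch" rollout: launch a new fleet
# from the new baked image, repoint traffic, retire the old fleet untouched.
from dataclasses import dataclass, field

@dataclass(frozen=True)  # frozen: an instance is never mutated after launch
class Instance:
    image_id: str        # the baked image (AMI/container) it was launched from

@dataclass
class LoadBalancer:
    targets: list = field(default_factory=list)

def deploy(lb, new_image, count):
    """Launch replacements from the new image, repoint traffic, retire the old."""
    old = lb.targets
    lb.targets = [Instance(image_id=new_image) for _ in range(count)]  # repoint
    return old  # the old fleet gets terminated, never patched in place

lb = LoadBalancer([Instance("ami-v1"), Instance("ami-v1")])
retired = deploy(lb, "ami-v2", count=2)
assert all(i.image_id == "ami-v2" for i in lb.targets)
assert all(i.image_id == "ami-v1" for i in retired)
```

The `frozen=True` is the whole point: nothing is ever edited in place, so every
running instance is a known-good copy of its image.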

One interesting aspect of immutable infrastructure such as this is that it is
completely incompatible with _loads_ of existing security policies and what
would have been considered "best practices" just a few years ago. For example,
you might have a security policy that states that everything must be scanned
within 30 days for malware/out-of-date packages/whatever. Yet with immutable
infrastructure your hosts or containers may only be up for a few days before
being replaced!

So when your security team freaks out because none of your hosts/containers
are showing up in their systems you'll have a _lot_ of explaining to do =D

"We need to scan your hosts so we can ensure that you're installing security
patches."

"We don't do that."

"You don't install security patches?!?"

"Yeah, well, you see..."

Trust me when I say that trying to explain how it all works and why it's more
secure than old school deployments is _not_ easy!

~~~
dastbe
this is why your security team should be integrated into how you (securely!)
build and deploy software. in this specific case, that would mean maintaining
the OS layer Dockerfile and scanning the dependency tree for build artifacts.

~~~
riskable
Yeah, it's actually a lot more complicated than that. Let's assume the
security team creates the Dockerfile. It'll look like this:

    
    
        FROM ubuntu
        WORKDIR /bar
        COPY developer_script.sh .
        RUN ./developer_script.sh
    

So now with each new container update you'd need someone from the security
team to audit/review `developer_script.sh`. It's kind of pointless if your
goal is fast deployments.

If you just need to make sure your developers don't make a mistake in terms of
securely configuring their containers (and making sure to always use the
latest software) then you simply scan them before the canary stage. The
problem there is, "what are you looking for?"

Also, there exists only _one_ tool to scan Docker containers (OpenSCAP) and if
your security team doesn't like it, well, you're screwed:
[https://github.com/OpenSCAP/container-compliance](https://github.com/OpenSCAP/container-compliance)

The other problem is that OpenSCAP only checks the container's packages for
compliance. It doesn't actually scan the container's filesystem for things
like JREs and bundled libs. So if you're using Docker best practices by
keeping your images as minimal as possible you may not even _have_ a package
tool inside your containers. In that case how do you check for things like
out-of-date versions of Java?
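One way to close that gap, as a rough sketch: walk the exported container
filesystem and read bundled JRE `release` files directly instead of asking a
package manager. The directory-naming convention and the set of bad versions
below are assumptions for illustration:

```python
# Sketch: find bundled Java runtimes in a container filesystem export
# without relying on any package manager being present inside the image.
import os

VULNERABLE = {"1.7.0_45", "1.8.0_20"}   # hypothetical out-of-date builds

def find_java_releases(rootfs):
    """Yield (path, version) for each JRE 'release' file found under rootfs."""
    for dirpath, _dirs, files in os.walk(rootfs):
        if "release" in files and os.path.basename(dirpath).startswith("jre"):
            with open(os.path.join(dirpath, "release")) as f:
                for line in f:
                    if line.startswith("JAVA_VERSION="):
                        yield dirpath, line.split("=", 1)[1].strip().strip('"')

def out_of_date(rootfs):
    """List every bundled JRE whose version is on the bad list."""
    return [(p, v) for p, v in find_java_releases(rootfs) if v in VULNERABLE]
```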

Another problem is you assume there's some modicum of control over what's
inside the containers before they're deployed. We're in charge of creating the
_infrastructure_ for running containers with the promise that end users (who
would be various application teams) can make their own containers (or at least
their own Dockerfiles) and deploy them on our infrastructure.

We can work with them to help develop, say, a Kubernetes pod config for their
app but as far as _what their app is_ or what gets bundled with it we'll have
no knowledge after the first successful deployment.

~~~
dastbe
I never said you need or should have the package tools inside your container,
but you do need to be able to track the lineage of what was put into those
containers, so you can easily search for, e.g., deployed container revisions
with an out-of-date Java version. This is where you need a dedicated build
team that creates this kind of infrastructure. Otherwise, you end up with
difficult-to-answer questions about your infrastructure.

tl;dr I assume there's some control because there should be.
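The lineage idea could look something like this minimal sketch, with an
in-memory dict standing in for a real build-metadata store (all names and
versions are made up):

```python
# Sketch of build-time lineage tracking: record what went into each image
# at bake time, then answer audit questions without touching the running
# containers at all.
lineage = {}   # image id -> {component: version}, written by the build system

def record_build(image_id, components):
    """The build system records exactly what went into each image."""
    lineage[image_id] = dict(components)

def images_with(component, bad_versions):
    """Which deployed image revisions carry a bad version of a component?"""
    return [img for img, parts in lineage.items()
            if parts.get(component) in bad_versions]

record_build("app:42", {"openjdk": "8u20", "nginx": "1.9.4"})
record_build("app:43", {"openjdk": "8u72", "nginx": "1.9.4"})
assert images_with("openjdk", {"8u20"}) == ["app:42"]
```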

------
mattiemass
Very cool article. Amazing how much tooling Netflix has built themselves.

~~~
riskable
What's amazing is that Netflix _let_ their teams develop solutions from
scratch. Usually at big companies when their developers say something like,
"can't we just build a solution ourselves?" they're laughed out of the room
or, more likely, marked down as candidates in the next round of layoffs.

~~~
chatmasta
It helps that they have repeatedly proven this strategy works. When the CTO
goes to the CEO and says "we're going to build this, not buy it," the CEO
obviously trusts his decision making because he's seen him successfully
execute his strategy so many times.

It sounds like Netflix engineering culture is built a lot around trust.
Management trusts that only top engineering talent works there, and the sink-
or-swim culture, coupled with performance bonuses and a rising stock price,
ensures that the engineers are making the best decisions _for the company._
It's a lot easier to approve a "build" vs "buy" decision when you know that
the interests of the engineers pushing for it are actually aligned with the
company business interests.

Contrast this to a company with a large separation of incentives between the
stock price and the engineers. In that scenario you end up with a bunch of
engineers who are bored and want to prematurely optimize systems because there
is no clear personal cost to doing so.

~~~
riskable
> It helps that they have repeatedly proven this strategy works.

I laughed at this because you can't "repeatedly prove" you're capable of
developing your own solutions if you're never allowed to do it in the first
place!

The enterprise mantra is: COTS or it doesn't happen.

------
gjkood
Major outage being reported worldwide.

[http://downdetector.com/status/netflix](http://downdetector.com/status/netflix)

Anything interesting deployed in the last hour?

Something in the CI/CD tool chain (Spinnaker) must have failed for this to
move all the way to live without being caught.

~~~
codingdave
There are many reasons that a site can go down, aside from a code deployment.

------
Scarbutt
What are the reasons for Netflix choosing Node.js for their front-end server
and not Java, like in their back-end?

~~~
salehenrahman
I'm guessing it's their use of React both on the front-end and back-end.

~~~
dominotw
there is also this
[https://github.com/Netflix/falcor](https://github.com/Netflix/falcor)

------
markbnj
I'm a Netflix fan, as a consumer and an engineer, and this blog post just
reinforces my fanboi status. Amidst the descriptions of deployment tools and
pipelines one thing stood out for me: the fact that AMI bake times are now a
large factor, and that "installing packages" and the "snapshotting process"
were a big piece of this. Containers are definitely the answer to this
problem. You can deploy base images with the OS and common dependencies, and
have the code changes be a thin final layer. Of course with such a
sophisticated pipeline based on AMI deployment this change would not be
trivial for Netflix, but the bottom line is they have described the primary
container use case perfectly, imo.

~~~
mkobit
I am fairly certain it is on the Spinnaker roadmap. There was a recent talk by
Sangeeta Narayanan on "Containers so Far" [1]. At around 37:10 you can see how
they will essentially allow containers to replace AMIs in the "bake" process.
I'm looking forward to seeing how Netflix starts to roll out more container
support, and seeing how Spinnaker progresses.

[1] [http://www.infoq.com/presentations/netflix-containers](http://www.infoq.com/presentations/netflix-containers)

------
neduma
How do they 'externalize config' with respect to
[http://12factor.net/config](http://12factor.net/config)?

~~~
conorgil145
There are many ways to accomplish an external configuration. I do not know how
they do it at Netflix, but one could have the prebaked AMI reach out to a
service discovery tool to pull its configuration at run time. For example,
Consul [1] or etcd [2].

One advantage of this approach is that you can use the exact same AMI in
testing as in production. All you would have to do is change the configuration
which is pulled. You could determine which configuration to pull based on AWS
tags on the EC2 instance, for example.

[1] [https://www.consul.io/](https://www.consul.io/)

[2] [https://github.com/coreos/etcd](https://github.com/coreos/etcd)
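That tag-driven approach might be sketched like this, with an in-memory dict
standing in for the Consul/etcd config store and a plain dict standing in for
the instance's AWS tags (keys and values here are invented for illustration):

```python
# Sketch of 12-factor external config: the same baked image decides at boot
# which config to pull, keyed only by a tag on the instance it runs on.
CONFIG_STORE = {                 # would live in Consul/etcd, not in the image
    "test": {"db_url": "db.test.internal", "log_level": "DEBUG"},
    "prod": {"db_url": "db.prod.internal", "log_level": "WARN"},
}

def load_config(instance_tags):
    """Pull the environment's config at run time; nothing is baked in."""
    env = instance_tags["environment"]   # e.g. an AWS tag on the EC2 instance
    return CONFIG_STORE[env]

# The identical image behaves differently only via its tags:
assert load_config({"environment": "test"})["log_level"] == "DEBUG"
assert load_config({"environment": "prod"})["db_url"] == "db.prod.internal"
```

Because the image contains no environment-specific state, the artifact you
tested is byte-for-byte the artifact you promote to production.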

------
oconnore
Another Spinnaker:
[http://arxiv.org/pdf/1103.2408.pdf](http://arxiv.org/pdf/1103.2408.pdf)

~~~
simonebrunozzi
Good point - curious what exactly you mean by this though, and how you'd
compare Spinnaker with it. Can you elaborate a bit more?

~~~
oconnore
Really just a naming collision. IBM Spinnaker is a data store, Netflix
Spinnaker is a software builder. The paper I linked is interesting if you're
researching distributed data stores.

------
x0rg
Isn't Netflix using Mesos (see [http://techblog.netflix.com/2015/08/fenzo-oss-scheduler-for-...](http://techblog.netflix.com/2015/08/fenzo-oss-scheduler-for-apache-mesos.html))? I don't understand what role it plays here.

~~~
diab0lic
The Fenzo scheduler is, if I recall correctly, used to do scheduling for
Mantis and Titus (formerly Titan). Think of these as a streaming job
infrastructure and a container service, respectively. Spinnaker can assist you
in deployments to both (I use Spinnaker to deploy Mantis jobs, for example),
but the two platforms are not yet pervasive -- we still deploy much of the
ecosystem as baked AMIs.

------
sayrer
Seems like a nice system, but would be improved by building with Bazel or Buck
instead of Gradle.

~~~
metanoia
What do you see as advantages/disadvantages of each tool?

------
pyman
This is how my company used to build software 10 years ago.

Now we have Docker containers, cloud VMs, GitHub, 1 click deployment, advanced
metrics, Grafana, Go microservices, Slack bots, etc.

Sounds like Netflix is stuck in the past.

