
Automating Datacenter Operations at Dropbox - dmicher
https://blogs.dropbox.com/tech/2019/01/automating-datacenter-operations-at-dropbox/
======
nemothekid
Going by the article, Pirlo appears to be written in Python (it leverages
SQLAlchemy), which is interesting given that most providers are writing new
infrastructure code largely in Go (Spinnaker is another Python exception).

I'm guessing that Python is still heavily used inside Dropbox, but does anyone
know if they have published any style guides or tooling for managing Python
codebases at their scale?

~~~
lwf
Dropbox sponsors most of the developers of
[http://mypy-lang.org/](http://mypy-lang.org/), an optional static typing
system for Python.

I've found it to be hugely useful in safely working in a large Python codebase
:)
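
A minimal sketch of the kind of mistake optional static typing catches. The
function and variable names here are hypothetical and purely illustrative, not
taken from any Dropbox codebase; it assumes Python 3.9+ for the built-in
generic `dict[str, str]` syntax.

```python
from typing import Optional


def find_owner(rack_owners: dict[str, str], rack_id: str) -> Optional[str]:
    """Return the owner of a rack, or None if the rack is unknown."""
    return rack_owners.get(rack_id)


def owner_label(rack_owners: dict[str, str], rack_id: str) -> str:
    """Return an upper-cased owner label, with a fallback for unknown racks."""
    owner = find_owner(rack_owners, rack_id)
    # If this None check is removed, the code still runs until the first
    # unknown rack shows up; a type checker such as mypy flags the
    # `owner.upper()` call ahead of time because `owner` may be None.
    if owner is None:
        return "UNASSIGNED"
    return owner.upper()
```

The annotations change nothing at runtime; the checker runs as a separate
step, which is what makes the typing "optional".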

(NB: I work at Dropbox)

~~~
mroche
Curious if you’ve ever tried Cython out. Not quite the same, but a
semi-similar end goal to a degree. I started looking into it the other day and
it might provide some nice improvements to heavier applications.

------
peterwwillis
_"While there are some excellent job queue systems such as Celery, we didn’t
need the whole feature set, nor the complexity of a third-party tool.
Leveraging in-house primitives gave us more flexibility in the design and
allows us to both develop and operate the Pirlo service with a very small
group of SREs."_

I don't know any more than this paragraph or two explains, but it sounds like
NIH syndrome. If you chose to write your own solution just because a 3rd party
one was complicated or expensive, you've underestimated the complexity and
expense of developing and supporting new software. Not only do you have
software developers developing your business products, but now you have
software developers developing the IT tools that support the software
developers writing the business products.

_"Using the network database and configuration tool developed by our Network
Reliability Engineering (NRE) team,"_

Another custom tool? Network inventory and config management tools do exist
already...

_"Rather than having engineers manually running tests using playbooks, Pirlo
performed an automated sequential battery of tests that reduced the need for
hands-on attention and concurrently increased diagnostic accuracy."_

Or you could, like, install Jenkins, write your tests, and do all this without
writing your own distributed job queue system.
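
For a sense of what "a sequential battery of tests" pulled off a queue looks
like at its smallest, here is a toy sketch using only the Python standard
library. It is an assumption-laden illustration, not Pirlo's design: the check
functions are hypothetical stand-ins, and it runs in a single process, with
none of the persistence, retries, or multi-host scheduling that a distributed
job queue has to provide.

```python
import queue
import threading
from typing import Callable


# Hypothetical diagnostic checks; each returns True on success.
def check_power() -> bool:
    return True


def check_network() -> bool:
    return True


def check_disks() -> bool:
    return False  # simulate a failing check


def run_battery(checks: list[Callable[[], bool]]) -> list[tuple[str, bool]]:
    """Run checks one at a time, in order, on a single worker thread."""
    jobs: queue.Queue = queue.Queue()
    results: list[tuple[str, bool]] = []

    def worker() -> None:
        while True:
            check = jobs.get()
            if check is None:  # sentinel: no more work
                break
            results.append((check.__name__, check()))

    for check in checks:
        jobs.put(check)
    jobs.put(None)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return results
```

Calling `run_battery([check_power, check_network, check_disks])` returns one
`(name, passed)` pair per check, in the order the checks were queued.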

~~~
mrbanks
Developers at these big companies get bored and assume they know better than a
mature open source solution, so they reinvent a perfectly good wheel.

~~~
inferiorhuman
Or they've actually used Jenkins (or any of these other suggested
alternatives). I've used Jenkins personally and professionally.

Most recently I've been rebuilding my own CI stack and never really gave much
thought to going back to Jenkins. So I've been asking around and one of the
only common complaints I've heard so far is that getting the initial
configuration done is painful and generally orthogonal to automation. Plus the
documentation is atrocious.

No off-the-shelf product will be a perfect fit, but with CI software I was
truly surprised at just how large the gaps were.

So, sure, if you're Dropbox and you want to automate everything Jenkins is
almost certainly not the right tool for the job. If Dropbox already had a
supported, mature in-house job queue system, why not use it?

Conversely at megacorp, they spent 4+ years claiming to work on deploying a
Jenkins (CloudBees) cluster and still came up with bupkis. Our own internal
job and message queuing systems were astoundingly bad (and support for
internal tools was almost entirely forbidden).

At megacorp I absolutely decried any sort of home grown solution. But if
Dropbox were to actually tackle the problems of internal testing and support
and come up with mature solutions, why shouldn't they use them?

~~~
kohsuke
Hi, I'm Kohsuke, the creator of Jenkins. I'm sorry to hear that you had a bad
experience.

Would you be willing to let me interview you so that I can learn where it
failed your expectations? I'm honestly trying to learn where we can do things
better, and often what's obvious to one person is completely incomprehensible
to another. So I think this is a great opportunity for me to learn a fresh
perspective.

My contact information is in my personal profile.

~~~
inferiorhuman
Jenkins isn't a terrible experience, and I've used it personally and
professionally (and would do so again where it's a good fit), but for my
current project it missed a few of the requirements. In trying to rationalize
the whole NIH thing, I talked to some friends and peers about their CI
experiences. I got pretty consistent responses on Jenkins.

My relevant requirements:

1.) The software needs to be self-hostable and run on the BSDs. For the most
part this narrows down the options to Buildbot and the Java-based CI options
(Jenkins and GoCD). Travis could probably be run on FreeBSD, but the open
source bits are essentially abandoned (e.g. some repos are missing) with no
documentation. Nearly everything else these days is strongly tied to Linux via
Docker. Some free hosted services offer a FreeBSD target, but I'm looking to
test on DF/Free/Net/OpenBSD.

2.) The software needs to scale down. The GoCD folks suggested that the agent
would need around 500 MB of RAM. I haven't profiled Jenkins, but I can't
imagine the agent being that much lighter weight. Certainly the Jenkins server
process is glacially slow. By contrast my prototype in Rust is showing memory
usage of under 5 MB for each process (agent + server). I expect that to grow a
little bit, but not by an order of magnitude.

3.) The software needs to handle multi-arch builds. Travis does this extremely
well. Buildbot and GoCD, kinda. Jenkins does not handle this use case (e.g.
pipelines + matrix builds are not supported). I really like the way Travis
basically handles these as sub jobs.
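
The "sub jobs" idea in 3.) can be sketched as a Cartesian product over build
axes. This is only an illustration: `expand_matrix` and the axis names are
made up for the example, and it is not how Travis or any other CI system
actually implements matrix builds.

```python
from itertools import product


def expand_matrix(matrix: dict[str, list[str]]) -> list[dict[str, str]]:
    """Expand a build matrix into one sub-job per combination of axis values."""
    axes = list(matrix)
    combos = product(*(matrix[axis] for axis in axes))
    return [dict(zip(axes, combo)) for combo in combos]


# Hypothetical axes: two operating systems times two architectures.
jobs = expand_matrix({
    "os": ["freebsd", "openbsd"],
    "arch": ["amd64", "arm64"],
})
```

Each entry in `jobs` is an independent sub-job description (four in this
case) that a scheduler could hand to whichever agent matches its `os` and
`arch`.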

My experience:

A.) The Jenkins documentation is terrible, if it exists at all. I've heard
that this has been improved in the year or so since I've looked at Jenkins
(but that hasn't been my experience). I mentioned this to one of the CloudBees
guys at the DevOps Days conf I went to last year and got an ack that this is a
known issue (although CloudBees has driven a ton of Jenkins documentation and
improvement). At MegaCorp we paid a fortune to CloudBees, which helped a ton
but didn't really help end users. I cannot overstate just how much of a
detriment the documentation is.

At the opposite end of the spectrum, the docs for Rust (except for the async
stuff) and Postgres are just a dream come true. If it's any consolation the
GoCD documentation is pretty atrocious as well. Almost none of it is up to
date with the current UI.

B.) The Jenkins community tends to cargo cult Jenkins-Groovy snippets like
crazy, potentially as a result of #A. Having a good community helps
documentation and helps when there are gaps in the documentation.

C.) Bootstrapping Jenkins is not something easily done in an automated way.
The CLI is not stable and I had tons of trouble trying to get plugins and
dependencies sorted without having to drop into the GUI. For homelab stuff
I've automated bootstrapping of nearly everything except for Jenkins with
Ansible.

I don't think these are new or unknown issues as in talking to friends and
peers I've found that the typical responses regarding Jenkins are along the
lines of: Jenkins works well enough so that we're not motivated to switch, but
A & C are our main pain points.

~~~
kohsuke
Thanks for taking time to put this together. Yes, much of it isn't new, but
it's always good to hear how these dots are connected in other people's view
to form a theme.

On C, I think we've made good progress with
<https://github.com/jenkinsci/configuration-as-code-plugin>, which I think
you'd like.

