
Stack Overflow: How We Do Deployment - Nick-Craver
http://nickcraver.com/blog/2016/05/03/stack-overflow-how-we-do-deployment-2016-edition/
======
taurath
How people manage git and source control tells you a lot of things about a
company's culture. They said that most commits go directly onto master and
this works for them, which indicates:

\- Good, rapid communication about who is working where. People are generally
not touching the same code, or else you'd run into frequent collisions
(solvable via rebasing, of course, but I'd suspect they would be doing more
branching if collisions happened very frequently)

\- The developers are given autonomy and have assumed some level of mastery
over whatever their domain is. Trust in each developer's ability to commit
well-formed and considered code.

\- They have a comprehensive test stack, which helps verify the above point
and keep it sane

~~~
manacit
I found this very curious - by their own admission, this also means that most
code _does not get reviewed_ before it lands in production. To me, this is
quite scary, and I would be very hesitant to adopt this for any large-scale
project or company.

IMO, code review is a cornerstone of code quality and production stability -
the dumb (and smart!) mistakes in my code that have been caught in CR are
numerous, and it's a big portion of my workflow. There are times when I feel
it's redundant (one-line changes, spelling mistakes, etc.), but I wouldn't
trade those slowdowns for a system where I only got review when I explicitly
wanted it.

Of course, for pre-production projects and/or times when speed is of the
utmost concern, dropping back to committing to master might make sense, but
for an established and (I'm assuming) fairly large/complex codebase, I would
think it best for maintainability and stability to review code before it's
deployed.

~~~
Nick-Craver
If we want a code review on anything risky, we may push a branch or we may
just post the commit in chat for review before we build out. Which is chosen
depends on how big or blocking the change may be.

We ask for code reviews all the time, we simply don't mandate them - I think
that's the main difference.

~~~
otis_inf
> or we may just post the commit in chat for review before we build out.

Isn't that 'after the fact', considering your TeamCity polls the GitLab repo
frequently, so a commit will trigger a build right after it and, if everything
goes well, deploy it too?

So you have to know up front whether a thing is 'risky', but that's a
subjective term.

~~~
Nick-Craver
It only deploys to our development/CI environment automatically. Deploying out
to the production tier is a button press still.

So yes, it will build to dev, but we're using this in situations where we're
_very_ confident the changes are correct already. I'd argue blind pushes are
the problem otherwise. If the developer is not very certain, they can open a
merge/pull request or just hop on a hangout to do a review.

~~~
otis_inf
> It only deploys to our development/CI environment automatically. Deploying
> out to the production tier is a button press still.

Ah missed / overlooked that!

------
adontz
I've always wondered whether there is some way of doing database deployment
that doesn't suck. All I see is that every professional team ends up with a
bunch of 074-add-column-to-table.sql files. Code deployment can be organized
much better: you can do graceful restarts, transactional updates, etc. Almost
nobody backs up the old code version in case deployment of the new one fails,
but database upgrades are so fragile that backups are a must - not only
because the upgrade process may be interrupted and leave the database in an
inconsistent state, but because even a properly executed upgrade may result in
corrupted data.
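
For illustration, a minimal sketch of the kind of numbered migration file I
mean (table and column names are hypothetical, SQL Server dialect), written
defensively so it's safe to re-run:

    -- 074-add-column-to-table.sql (hypothetical example)
    -- Guarded so the script is idempotent: re-running it is harmless.
    IF NOT EXISTS (SELECT 1 FROM sys.columns
                   WHERE object_id = OBJECT_ID('dbo.Posts')
                     AND name = 'CloseReasonId')
    BEGIN
        ALTER TABLE dbo.Posts ADD CloseReasonId INT NULL;
    END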

~~~
mwhite
It's called NoSQL, which removes the need for schema migrations for things
like adding or deleting columns.

This could be solved for relational databases by implementing application-
level abstractions that store all your data as JSON, but create non-JSON views
so you can query it in your application using traditional ORMs, etc.

So, store all data using these tables, which never have to be changed:

\- data_type

\- data (int type_id, int id, json data)

\- foreign_key_type (...)

\- foreign_keys (int type_id, int subject_id, int object_id)

(we'll ignore many-to-many for the moment)

And then at deploy time, gather the list of developer-facing tables and their
columns from the developer-defined ORM subclasses, make a request to the
application-level schema/view management abstraction to update the views to
the latest version of the "schema", along the lines of
[https://github.com/mwhite/JSONAlchemy](https://github.com/mwhite/JSONAlchemy).
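
To make the view half concrete, here's a minimal sketch assuming Postgres JSON
storage; the `data_type` columns and the JSON field names are hypothetical:

    -- Hypothetical generated view: exposes JSON fields as typed columns
    -- that an ORM can query like a normal table.
    CREATE VIEW users AS
    SELECT d.id,
           (d.data ->> 'name')       AS name,
           (d.data ->> 'age')::int   AS age
    FROM data d
    JOIN data_type t ON t.id = d.type_id
    WHERE t.name = 'users';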

With the foreign key table, performance would suffer, but probably not enough
to matter for most use cases.

For non-trivial migrations where you have to actually move data around, I
can't see why these should ever be done at deploy time. You should write your
application to be able to work with both the old and new versions of the
schema, and have the application do the migration on demand as each piece of
data is accessed. If you need to run the migration sooner, run it all at once
using a management tool that's not tied to deployment -- with the migration
for each row in its own transaction, eliminating downtime for migrating large
tables.

I don't have that much experience with serious production database usage, so
tell me if there's something I'm missing, but I honestly think this could be
really useful.

~~~
Nick-Craver
> With the foreign key table, performance would suffer, but probably not
> enough to matter for most use cases.

Citation needed :) That's going to _really_ depend.

I'm not for or against NoSQL (or any platform). Use what's best for you and
your app!

_In our case_, NoSQL makes for a bad database approach. We do _many_ cross-
sectional queries that cover many tables (or documents in that world). For
example, a Post document doesn't make a ton of sense; we're looking at
questions, answers, comments, users, and other bits across _many_ questions
all the time. The same is true of users: showing their activity would be very,
very complicated. In our case, we're simply very relational, so an RDBMS fits
the bill best.

~~~
mwhite
Sorry for being unclear. I'm not proposing NoSQL. I'm saying that many NoSQL
users mainly want NoDDL, which can be implemented on top of Postgres JSON
storage while retaining SQL.

\- data (string type, int id, json fields)

\- fk (string type, int subj_id, int obj_id)

    
    
      select
        data.id,
        data.fields,
        fk_1.obj_id as foo_id,
        fk_2.obj_id as bar_id
      from data
      join fk as fk_1 on data.id = fk_1.subj_id
      join fk as fk_2 on data.id = fk_2.subj_id
      where
        data.type = 'my_table'
        and fk_1.type = 'foo'
        and fk_2.type = 'bar'
    

What would the performance characteristics of that be versus if "foreign keys"
are stored in the same table as the data, if fk has the optimal indexes?
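
For concreteness, the indexes I have in mind are something like this (a
sketch, not benchmarked):

    -- Each join probes by (type, subj_id); including obj_id makes it covering.
    CREATE INDEX fk_subj ON fk (type, subj_id, obj_id);
    -- The reverse direction, for queries that start from the object side:
    CREATE INDEX fk_obj ON fk (type, obj_id, subj_id);

I'd guess the trade-off is then mostly join overhead versus a wider row, but
that's exactly the part I haven't measured.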

------
clio
Additional anecdata: At my place of employment, after the required code
review, we must write an essay about the change and have at least two
coworkers approve it. Then we must email the essay to a mailing list of
several hundred people. One of those people is the designated external
reviewer, who must sign off. However, lately, this final reviewer has been
requesting 10+ modifications to the deployment essay due to a managerial
decision to "improve deployments". Moreover, deployments are not allowed 1/4th
of the year unless a vice president has signed off.

Any code change requires this process.

~~~
TheRealDunkirk
What industry sector? Medical? Defense?

~~~
clio
Logistics

------
ohitsdom
This feels really clunky to me, but maybe I'm just not getting it. I'm trying
to implement a more automated build/deploy process at my current place of
employment and am basically modeling it off of GitHub's [0], which seems to
have a better feel.

Obviously the quality of the process needs to be high, but when it's
effortless and "fun" then everybody wins.

[0] [http://githubengineering.com/deploying-branches-to-github-com/](http://githubengineering.com/deploying-branches-to-github-com/)

------
richardwhiuk

       Fun fact: since Linux has no built-in DNS caching, most of the DNS queries are looking for…itself. Oh wait, that’s not a fun fact — it’s actually a pain in the ass.
    

Surely that should just be a very fast lookup in /etc/hosts?

~~~
Nick-Craver
The problem here is that these services move - so if it's in /etc/hosts, our
failover mechanisms (to a DR data center which has a replica server) are
severely hindered. We're adding some local cache, but there are some nasty
gotchas with subnet-local ordering on resolution. By this I mean: New York
resolves its local /16 first, and Denver resolves its local /16...but BIND
doesn't care (by default) and likes to auth against, let's say, the London
office. Good times!

~~~
KaiserPro
But that's what DNS scope is for, surely?

We had _n_ datacenters, each named after its city: ldn.$company.com,
ny.$company.com, etc. In the DHCP we pushed out the search order so that a
lookup would try to resolve locally and, if that failed, try a level up until
something worked.

This meant that when you'd bind to _service_, it would first look up
service.$location.$company.com; if that's not there, it'd try to find
service.$company.com.
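
A rough sketch of the resolver configuration that scheme implies (hypothetical
addresses, $-style placeholders as above):

    # /etc/resolv.conf as pushed by DHCP in, say, the New York datacenter
    search ny.$company.com $company.com
    nameserver 10.1.0.53    # local datacenter resolver first
    nameserver 10.2.0.53    # another site as fallback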

This cuts down on the need for nasty split-horizon DNS, and moving
VMs/services/machines between datacenters was simple and zero-config.

If you were taking a service out of commission in one datacenter, you'd CNAME
service.$location.$company.com to a different datacenter, do a staged kick of
the machines, and BOOM failed over with only one config change.

On a side note, you can use SSSD or _shudder_ NSCD to cache DNS.

~~~
Nick-Craver
We do, but in the specific case of Active Directory, we _want_ to fail over
and auth against another data center if the primary is offline. This means for
our domain, the local (to the /16) domain controllers are returned first and
then the others. The problem is BIND locally doesn't preserve this order and
applications are suddenly authenticating across the planet.

DNS devolution isn't a good idea here, since the external domain is a
wildcard. We'll be paying for that mistake from long ago until (if ever) we
change the internal domain name.

This is a pretty recent problem we're just now getting to, because the DNS
volume has been a back-burner issue - we'll look into permanent solutions for
all Linux services after the CDN testing completes. Recommendations on Linux
DNS caching are much appreciated - we'll review each. It's something that just
hasn't been an issue in the past, so we're not experts in that particular
area. I am surprised caching hasn't landed natively in most of the major
distros yet, though.

~~~
KaiserPro
Aha, gotcha. I was under the impression that SSSD chose the fastest AD server
it could find (either via the SRV records, or via a pre-determined list)? I've
not had too much trouble with it stubbornly binding to the furthest-away
server. (That's with AD doing the DNS and delegating to BIND.)

NSCD (name service caching daemon) is in RHEL and Debian, so I assume it'll be
in Ubuntu as well. The problem is that it fights with SSSD if you're not
careful. [https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/usingnscd-sssd.html](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/usingnscd-sssd.html)

Out of interest, what are you using to bind to AD?

------
sakopov
I think if you're deploying .NET code you're almost certainly going to follow
a similar build architecture, with TeamCity doing most of the grunt work. We
have a very similar build structure, but a bit more polished, I think. Our
TeamCity build agents build NuGet packages, deploy them to Octopus, and run
unit and integration tests. Octopus handles promotion of code from dev to QA,
to staging, and to all production environments. We also write migrations for
database updates using FluentMigrator, which works with just about any
database driver. It's a joy deploying code in an environment like this.

~~~
rjbwork
Agreed on the Octopus bit. TeamCity + Octopus is practically magical. Until
literally yesterday, I'd yet to find something that didn't work with minimal
effort between the two.

------
JdeBP
> _Fun fact: since Linux has no built-in DNS caching, most of the DNS queries
> are looking for … itself._

This is wrong in two ways, and isn't factual at all.

First, the cause of the queries is nothing to do with whether DNS query
answers are cached locally or not. There is no causal link here. What causes
such queries is applications that repeatedly look up the same things, over and
over again; not the DNS server arrangements. One could argue that this is poor
design in the applications, and that they should remember the results of
lookups. But there's a good counterargument to make that this is _good_
design. Applications _shouldn't_ all maintain their own private idiosyncratic
DNS lookup result caches. History teaches us that applications attempting to
cache their own DNS lookup results invariably do it poorly, for various
reasons. (See Mozilla bug #162871, for one of several examples.) Good design
is to hand that over to a common external subsystem shared by all
applications.

Which brings us to the second way in which this is wrong. A common Unix
tradition is for all machines to have local proxy DNS servers. Linux operating
systems have _plenty_ of such server software that can be used: dnsmasq,
pdnsd, PowerDNS, unbound...

One of the simplest, which does _only_ caching and doesn't attempt to wear
other hats simultaneously, is even named "dnscache". Set this up listening on
the loopback interface, point its back-end at the relevant external servers,
point the DNS client library at it, and -- voilà! -- a local caching proxy DNS
server.
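
Roughly, the setup is something like this (a sketch; it assumes the dnscache
and dnslog accounts exist, and the upstream server address is hypothetical):

    # Create a dnscache instance listening on the loopback interface.
    dnscache-conf dnscache dnslog /etc/dnscache 127.0.0.1
    # Point its back-end at the relevant internal/external DNS servers,
    # and make it forward-only rather than resolving from the roots.
    echo 10.0.0.53 > /etc/dnscache/root/servers/@
    echo 1 > /etc/dnscache/env/FORWARDONLY
    # Start it under daemontools supervision.
    ln -s /etc/dnscache /service
    # Point the DNS client library at the local cache.
    echo 'nameserver 127.0.0.1' > /etc/resolv.conf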

* [http://cr.yp.to/djbdns.html](http://cr.yp.to/djbdns.html)

* [http://cr.yp.to/djbdns/run-cache.html](http://cr.yp.to/djbdns/run-cache.html)

* [http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/djb...](http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/djbdns-problems.html#wrong-icann-root)

* [http://cr.yp.to/djbdns/install.html](http://cr.yp.to/djbdns/install.html)

I run server machines that have applications that repeatedly look up the same
domain names again and again. Each runs a local dnscache instance, which
ameliorates this very well.

~~~
Nick-Craver
I don't believe your assessment is correct. I _very specifically_ said built-
in. This remains true. If curious, we're on CentOS 7 specifically. I didn't
say there aren't _any_ options, only that there aren't any built-in. What you
described as alternatives are totally true, but they still aren't built-in.
It's a manual/puppet/chef/etc. config everyone has to do.

As for the applications - we have little _direct_ input to TeamCity or GitLab
(the problem children here). And even if we did, I think we agree: the
application level shouldn't cache anyway.

That being said, we're looking at `dnscache` as one of a few solutions here.
But the point remains: _we have to do it_.

~~~
JdeBP
You are employing a faulty concept of what constitutes built-in. This is not
Windows, or one of the BSDs. You're using one of the Linux distributions where
_everything_ is made up of installing packages. There is no meaningful "built-
in"/"not-built-in" difference between installing one of these DNS server
packages from a CentOS 7 repository and installing any other CentOS packages
from a CentOS 7 repository.

> _All software on a Red Hat Enterprise Linux system is divided into RPM
> packages which can be installed, upgraded, or removed._

\-- [https://www.centos.org/docs/5/html/Deployment_Guide-en-US/pt-pkg-management.html](https://www.centos.org/docs/5/html/Deployment_Guide-en-US/pt-pkg-management.html)

A (very) quick check indicates that the CentOS 7 "main" and "updates"
repositories have at least three of the DNS packages that I mentioned. Ubuntu
16 is better endowed: it has all of the ones I mentioned, plus a "Debian fork
of djbdns" that I did not, in its "main" and "universe" repositories.

------
radicalbyte
It's funny to see that Stack Overflow came to exactly the same solution for
database migrations on the Microsoft stack as my team did, even down to the
test procedure.

Simple, safe and very effective :)

~~~
jdc0589
It's also pretty much the same solution colleagues and I came up with a few
years ago for a migration tool we were working on. It's kind of abandonware
at this point, but this version is pretty far along:
[https://github.com/jdc0589/mite-node](https://github.com/jdc0589/mite-node)

------
daddykotex
> If Gitlab pricing keeps getting crazier (note: I said “crazier”, not “more
> expensive”), we’ll certainly re-evaluate GitHub Enterprise again.

Shots fired :P

------
goldbrick
When did everybody decide that chatbots were the new hotness for deployments?

~~~
Nick-Craver
If you mean pinbot - that's literally _all_ it does. It takes a message and
pins it, knocking the old one off the pins.

The bot that posts build messages...that's also literally all it does. It
simply puts handy notices in the chatroom. Why _wouldn't_ you want that
integration? Everyone going to look at the build screen and polling it to see
what's up is a far less efficient system. A push-style notification, no matter
the medium, causes far less overhead.

I doubt we'll ever build from chat directly, for anything production at least,
simply because those are two different user and authentication systems in
play. It's too risky, IMO.

------
richardwhiuk

       A developer is new, and early on we want code reviews
       A developer is working on a big (or risky) feature, and wants a one-off code review
    

This implies you don't normally do code review??

~~~
manigandham
Code review is overrated. Over my career I've built 8/9 figure revenue
platforms without code review or serious test setups.

A lot of this cruft is unnecessary when compared to good domain knowledge and
solid coding focus.

~~~
ohitsdom
> Over my career I've built 8/9 figure revenue platforms without code review
> or serious test setups.

This does not diminish code reviews. Would you be a better engineer today if
you had regularly participated in code reviews? Would your coworkers?

~~~
meddlepal
One of the problems with code reviews is they're often done really poorly, for
a number of reasons:

1\. Reviewers are often poorly trained to provide good design reviews and
default to nit-picky stuff a code linter should pick up. Human linting is just
a poor use of time and money.

2\. Nobody ever seems to have time to deep-dive into the code.

3\. Few engineers seem to ever actually want to do them.

4\. Reviews can become hostile.

Code reviews are probably really important in some fields - medical equipment,
aviation, etc. - but for the vast number of projects where we're shoveling A
bits to B bucket or transforming C bits into D bits, they're overkill, and
companies would be better off investing the massive amount of wasted time in
better CI/CD infrastructure.

------
nxzero
SE/SO is, for its size, an amazingly high-performance team. It's rare to hear
them say it; in fact, I don't recall ever hearing them say this.

The cohesion and trust among their team are critical to their deployments. In
fact, I would say they're vital to how they're able to get away with minimal
amounts of code review, for example.

It's dangerous to believe this is easy or reproducible. New teams need
extensive controls in place to make sure the quality of their deployments will
not negatively impact the group.

------
NKCSS
A few things strike me as odd/sub-optimal:

    
    
        - Migration IDs free-form in chat -> why not a central DB with an auto-increment column?
        - Using chat to prevent 'migration collisions' -> same central DB/MSMQ/whatever to report start/stop of migrations, and lock based on that...
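
A minimal sketch of what that central registry could look like (hypothetical
schema, SQL Server dialect):

    -- Hypothetical table: reserving a migration ID is one atomic insert,
    -- replacing the free-form "call dibs in chat" step.
    CREATE TABLE Migrations (
        Id       INT IDENTITY(1,1) PRIMARY KEY, -- the auto-increment column
        Name     NVARCHAR(200) NOT NULL,
        Started  DATETIME NULL,
        Finished DATETIME NULL
    );
    -- Reserve the next number and read it back:
    INSERT INTO Migrations (Name) VALUES ('add-column-to-table');
    SELECT SCOPE_IDENTITY();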

~~~
GordonS
I guess this is because using team chat is 'good enough' for them, without
adding another layer of tooling.

------
ngrilly
The article says:

> there is a slight chance that someone will get new static content with an
> old hash (if they hit a CDN miss for a piece of content that actually
> changed this build)

Does anyone have a solution to this problem?

~~~
mfontani
Push new static content to all servers/the CDN first, _then_ bust the CDN
cache.

~~~
dpark
This doesn't resolve the issue. The fundamental issue here is that two
versions are running in parallel.

* If you push static content and web pages together, you get V1 and V2 of _both_ static and web, and you end up with incorrect static resources served in both directions. This approach is only reasonable if your deployment strategy is to take a service outage to upgrade all machines together.

* If you push web first, you get the ugly scenario described in the article where V1 resources get served with V2 hashes and cached for 7 days.

* If you push static content first, you still have V2 static content being served for V1 web pages. The "cache bust" doesn't matter. Somewhere a cache will expire and someone will get V2 static resources for a V1 page.

You have to deal with the two versions somehow if you want to resolve the
issue fully.

------
infocollector
Stack Overflow uses Windows? Any particular reason to do this?

~~~
sklivvz1971
The founders and first devs were proficient in it.

It works well enough for our needs (e.g. C# has one of the best GCs on the
market), and no one is a platform/language zealot, so we keep on using it.

Stuff we added later runs on other platforms, as needed (e.g. we run Redis and
Elasticsearch on CentOS, and our server monitoring tool, Bosun, is written in
Go...).

------
streeter
I was surprised they weren't using Kiln
[[https://www.fogcreek.com/kiln/](https://www.fogcreek.com/kiln/)], a Fog
Creek product. I know SO is independent from Fog Creek now, but still a bit
surprised at it. I wonder if there was a migration off at some point.

~~~
Nick-Craver
Yep...for the on-premise reasons listed in the article. Once upon a time a lot
of projects were on Mercurial, hosted by Kiln. The Stack Overflow repo
specifically has always been on an internal Mercurial and then Git server.
Originally this was for speed, now it's for speed and reliability/dependency
reduction.

------
gnahckire
Did you choose polling over webhooks for a reason? Or were webhooks recently
added as a feature to GitLab?

~~~
Nick-Craver
Webhooks didn't _use to_ work well for many builds off a single repo, but I
think this changed very recently in TeamCity. Thanks for the reminder - I'll
take another look this week at _adding_ webhooks. We'd still want the poll in
case of any hook failures.

At the moment, GitLab knows nothing about our builds - and we'd want to keep
it simple in that regard. If we can generically configure a hook to hit
TeamCity to alert it of _any_ repo updates, though, that's tractable...I need
to see if that's possible now.

------
msane
TeamCity is a great CI system.

------
totally
What's with all the backslashes?

~~~
Nick-Craver
Yeah! What's wi...wait, what?

~~~
JdeBP
Enjoy
[https://news.ycombinator.com/item?id=11433139](https://news.ycombinator.com/item?id=11433139)
.

