Stack Overflow: How We Do Deployment (nickcraver.com)
428 points by Nick-Craver on May 3, 2016 | 140 comments

How people manage git and source control tells you a lot of things about a company's culture. They said that most commits go directly onto master and this works for them, which indicates:

- Good rapid communication about who is working where. People are generally not touching the same code or else you'd run into frequent collisions (solvable via rebasing of course but they would be doing more branching if it were a thing to happen very frequently I'd suspect)

- The developers are given autonomy and have assumed some level of mastery over whatever their domain is. Trust in each developer's ability to commit well-formed and considered code.

- They have a comprehensive test stack, which helps verify the above point and keep it sane

I found this very curious - by their own admission, this also means that most code _does not get reviewed_ before it lands in production. To me, this is quite scary, and I would be very hesitant to adopt this for any large-scale project or company.

IMO, code review is a cornerstone of code quality and production stability - the number of dumb (and smart!) mistakes in my code that have been caught in CR are numerous, and it's a big portion of my workflow. There are times when I feel it's redundant (one-line changes, spelling mistakes, etc), but I wouldn't trade those slowdowns for a system where I only got review when I explicitly wanted it.

Of course, for pre-production projects and/or times when speed is of the utmost concern, dropping back to committing to master might make sense, but for an established and (I'm assuming) fairly large/complex codebase, I would think that it would be best for maintainability and stability to review code before it's deployed.

You can commit code directly to master and still code review all code before production if you want to.

We do.

Our tooling tells us which cases have been introduced between our currently deployed version and the to-be-released version, and whether all of those cases have been reviewed, tested, and are ready to go. We usually deploy constantly, so all ready cases are deployed ASAP.

The only issue is that you can get 'blocking' cases, but that's fairly rare. Big cases get a feature switch.
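
A rough sketch of the kind of check described above, not the commenter's actual tooling: given the history of master and a set of cases marked ready, list the cases between the deployed commit and the release candidate and flag any that would block. The `Commit` type, the `CASE-n` IDs, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    case_id: str  # hypothetical: issue/ticket ID taken from the commit message


def cases_between(commits, deployed_sha, candidate_sha):
    """Case IDs touched after deployed_sha, up to and including candidate_sha.

    `commits` is the history of master, oldest first.
    """
    shas = [c.sha for c in commits]
    start = shas.index(deployed_sha) + 1
    end = shas.index(candidate_sha) + 1
    seen = []
    for commit in commits[start:end]:
        if commit.case_id not in seen:
            seen.append(commit.case_id)
    return seen


def blocking_cases(cases, ready):
    """Cases in the release range that are not yet reviewed and tested."""
    return [case for case in cases if case not in ready]


history = [
    Commit("a1", "CASE-1"),
    Commit("b2", "CASE-2"),
    Commit("c3", "CASE-3"),
    Commit("d4", "CASE-2"),
]
pending = cases_between(history, deployed_sha="a1", candidate_sha="d4")
blockers = blocking_cases(pending, ready={"CASE-2"})
print(pending, blockers)
```

Here a release would wait on CASE-3, which is the 'blocking' situation mentioned above; a feature switch is one way to ship the surrounding commits anyway.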

What are cases? Is it short for use case?

Pretty sure he means like an issue. A github issue, jira ticket, etc.

Yes, we use the word case for issue here

i'm guessing a scrum use case.

If we want a code review on anything risky, we may push a branch or we may just post the commit in chat for review before we build out. Which is chosen depends on how big or blocking the change may be.

We ask for code reviews all the time, we simply don't mandate them - I think that's the main difference.

> or we may just post the commit in chat for review before we build out.

Isn't that 'after the fact', considering your teamcity polls the gitlab repo a lot, so a commit will trigger a build right after it, and if everything goes well, deploy it too?

So you have to know up front whether a thing is 'risky', but that's a subjective term.

It only deploys to our development/CI environment automatically. Deploying out to the production tier is a button press still.

So yes, it will build to dev, but we're using this in situations where we're very confident the changes are correct already. I'd argue blind pushes are the problem otherwise. If the developer is not very certain: they can open a merge/pull request or just hop on a hangout to do a review.

> It only deploys to our development/CI environment automatically. Deploying out to the production tier is a button press still.

Ah missed / overlooked that!

Cool! That clarifies things a lot - the way I read the article sounded like you rarely asked for reviews.

It's a false dichotomy to think you can either move fast or have good process. Etsy commits directly to master on their large, monolithic php webapp and they have a pretty strong code review process where code isn't deployed until it's reviewed. They still manage to move fast with autonomy and trust, at least they did when I worked there for the past couple of years.

I just happened to listen to an old Stack Overflow podcast where they talked about code reviews. They said they do in-person code reviews before committing code.

(I find this is the most valuable way of doing code reviews vs pull requests/sending comments back and forth. In-person conversation about the code is so much higher bandwidth.)

We did in-person code reviews at my company, and for me they are mostly useless. I like to spend some time looking at the code, and doing so on someone's screen or when they are looking at yours makes me (and fellow devs) want to finish it quickly, so we frequently missed design problems.

Code reviews aren't just about catching problems either: the reviewer can learn new coding techniques from this, the reviewer will become more familiar with following the team guidelines by applying them this way and the reviewer is also made more familiar about parts of the codebase they might have not known about.

Shouldn't the types of mistakes you're worried about be caught by the automated tests? IMO code review is more about bigger picture design, e.g. if a piece of code would yield the correct result but with sub-optimal performance (which can also be caught by automated tests but not always).

Prior work in CompSci on various engineering techniques showed code review to be among the most effective at any part of the lifecycle. Testing can miss lots of things. So, best to do both.

There's a lot that fits between automated tests and code review. We have an extensive system of static analysis and code quality bots that run, but there's still a lot of design patterning and higher level functionality that machines (or, at least, our machines) don't always catch.

Obviously, it depends widely on the codebase and the number/quality of engineers working on code, but it's been my experience that a team reviewing each other's code is still something that can't be 100% replaced with automated tests.

Most companies I work for, many on the Fortune 500 roster, don't ever do code reviews.

Every time such process was put into place, it quickly faded out after a few months.

Long time dev at the company here.

Yes, Yes, and No.

#1 kind of happens naturally as we're all working on different things. There are "a lot" (< 100) of people slinging code/design/SRE/IT, but only a few work in the same areas at the same time and rapid good communication generally happens among those subsets of the overall team. We also seem to have perpetuated a tendency toward good citizenship, so we generally talk to people before traipsing through areas of the code we aren't familiar with.

#2 Absolutely. This is some basic principle stuff. We don't hire people to not trust them.

#3 Related to #2: Automated acceptance testing is done as is deemed appropriate by the person developing the system. I've been on teams that valued and developed automated testing more, and less. My personal experience at the company has been it is neither necessary nor sufficient for success.

Much more important than any pre-deploy automated testing is our effort to monitor what's deployed (both in terms of software/hardware metrics and business goals). Bosun (http://bosun.org), developed by our SRE team, gives us some pretty great introspection/alerting abilities. I'd be incredibly sad to not have it. Bosun monitoring combined with the ability to have a build out in <5 minutes keeps me pretty happy.

Just out of curiosity... you guys seem very pragmatic about most things. Are you using any development methodology/principles (Agile, waterfall, etc.)?

Again, up to the individual teams to decide what's right for them. The most globally applicable principle we have is "Make sure to use your brain."

Currently most teams don't subscribe to a particular methodology, though there is a Scrum effort happening on one team.

The comprehensive test stack is mostly the users on meta, which gets the deployed code before the other sites.

From what I heard so far, Stack Overflow and the Stack Exchange sites don't have a significant amount of automated tests.

A team of developers all working off of master doesn't necessarily require much communication about who's working where.

If your code is well organized into modules broken down by functional area, it should reduce the number of potential conflicts.

Also, fear of merge conflicts is somewhat unjustified; most conflicts can be resolved and rebased against using git rebase without that much work; the git rerere option [1] and git imerge [2] can also help with this.

If developers would actually learn how to resolve merge conflicts, and not be afraid of the occasional conflict resolution which requires understanding the other change and how to write new code that incorporates both changes, it's less overhead than communicating about pending changes.

[1] https://git-scm.com/docs/git-rerere [2] https://github.com/mhagger/git-imerge

The number of devs who screw up rebase, or worse, merge, is astonishing. "I just used theirs" :|

You're right, there are a lot of devs like that. In this day and age of DVCS any developer worth their salt should be able to manage merges properly. It sounds like SO might not hire people who are incapable of understanding merging.

I agree with 2 but not 1 & 3. In my experience

- Committing to master directly is the simplest thing to do. That is why most people choose to do that. It works well if everyone is working on their own pieces not touched by others. Of all the teams I have worked with in the past, almost 4 in 5 did this.

> They have a comprehensive test stack, which helps verify the above point and keep it sane

This may or may not be true.

If you are frequently pushing to master which is deploying to prod, not having a good test stack is asking for trouble - their uptime indicates this isn't happening.

Not that I think it's an invalid argument, but almost all of our outages have been either a database issue or (far more often) CloudFlare's inability to reach us (the origin) across the internet. Deploying code rapidly very, very rarely causes an issue.

To us, the speed of deployment and overhead savings we get 24/7 is also absolutely worth those very rare issues.

That's pretty amazing actually! While I certainly have seen a fair amount of code that passes (even good) automated testing but fails on validation, it feels somewhat strange but starts to make sense after a bit to value heavy monitoring more.

Good modularization of code I think makes this more possible - the point of a big test suite is to catch unintended consequences of a change, and the less coupling the less likely this is to happen. Stuff like routers, session object and other stateful/lower level logic I'd imagine is more tricky to change without a test suite

Thanks again for the post and the candid answers!

In my experience, committing directly to master is a helpful way to decrease merge conflicts. They may happen more frequently, but you get little conflicts more often, rather than large conflicts or conflicts that need to be resolved multiple times.

I have always wondered whether there is some way of doing database deployment that does not suck. All I see is that every professional team ends up with a bunch of 074-add-column-to-table.sql files. Code deployment can be organized much better: you can do graceful restarts, transactional updates, etc. Almost nobody backs up the old code version in case deployment of the new one fails, but database upgrades are so fragile that making backups is a must, not only because you are afraid the upgrade process may be interrupted and leave the database in an inconsistent state, but also because even a properly executed upgrade may result in a corrupted state.

There is a fundamental difference between code deployment and DB upgrades: DB upgrades must preserve (and upgrade) the existing state. Code deployment only needs to ensure that the new code is in place. It can do so by deploying to a new location and then redirecting the entry points, or it can do so by laying over the new code. Either way, the ways in which it can go wrong are few (at least compared to DB upgrades). DB upgrades, on the other hand, must take existing state (which can measure GBs and TBs) and transform it into a new state. If you're really unlucky, this may involve a size-of-data transformation (e.g. update every row in a big table). I've seen 'upgrades' that lasted weeks. Having witnessed DB upgrades at MS Azure scale (see [0]), and having had to code several SQL DB upgrade steps myself (internal system upgrades, the kind done when you install a new DB engine, not user upgrades), I can say with confidence that DB upgrades are hard.

What we all end up with are either DB migration steps (a la Rails migrations), of which I approve (see [1]), or schema comparison tools (a la vsdbcmd.exe), of which I'm very skeptical after being burned several times on large DBs (they are only as good as the tool, and some tools are better than others; still, I find explicit migrations much better).
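
To make "explicit migration steps" concrete, here is a minimal sketch of a migration runner. It assumes SQLite via Python's standard `sqlite3` module purely for illustration; the `schema_version` table, the `MIGRATIONS` list, and `migrate` are hypothetical names, and this is not how Rails or any tool named in this thread implements it.

```python
import sqlite3

# Hypothetical ordered migrations: (version, SQL statement).
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]


def migrate(conn, migrations):
    """Apply every migration not yet recorded in schema_version."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    ran = []
    for version, sql in sorted(migrations):
        if version in done:
            continue
        # One explicit transaction per step: the schema change and its
        # bookkeeping row either both land or both roll back.
        conn.execute("BEGIN")
        try:
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (version,)
            )
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
        ran.append(version)
    return ran


# isolation_level=None puts sqlite3 in autocommit mode so BEGIN/COMMIT are ours.
conn = sqlite3.connect(":memory:", isolation_level=None)
first_run = migrate(conn, MIGRATIONS)
second_run = migrate(conn, MIGRATIONS)  # nothing left to apply
```

Because each step is recorded, rerunning the script against an already-upgraded database is a no-op, which is the main property the numbered .sql files rely on.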

As a side note, my startup DBHistory.com is recording all DB schema changes, like a DB flight-recorder of sorts, and one of my future goals is to add the capability to generate compensating actions for any DB schema change and thus be able to revert the DB schema to any previous state (by simply rolling back every change, in reverse order). But I must admit I'm quite far from having such a thing working, and I'm not even considering migrations that modify data, not schema.

    [0] https://medium.com/@rusanu/why-i-have-high-hopes-for-the-quality-of-sql-server-2016-release-6173bc1fbc82#.wp1zt5pn0
    [1] http://rusanu.com/2009/05/15/version-control-and-your-database/

Perhaps one day I will learn how to write HN clickable URLs...

>Perhaps one day I will learn how to write HN clickable URLs...

Leading space is for preformatted text and is mostly useful for code. For everything else, don't add leading space. Do add extra line breaks, though.

Here are your links with extra line breaks and without leading space, thus nicely presented and clickable.

[0] https://medium.com/@rusanu/why-i-have-high-hopes-for-the-qua...

[1] http://rusanu.com/2009/05/15/version-control-and-your-databa...

It would be great if formatting instructions were shown in the /post and /submit pages.

occasionally i wonder if we are making a fundamental mistake when we replace the old code with the new. perhaps data versions should be first class concepts in our domains.

Perhaps I can add some context on top of the post. All updates are in transactions unless explicitly opted out. Also remember our case is a little special: we have hundreds of databases with the same schema. What if #328 fails? The failure/rollback scenario is a bit more complicated when you go past a single database.
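
A small sketch of why the many-database case is harder, assuming in-memory SQLite databases as stand-ins; `apply_to_all`, `make_db`, and the `site_n` names are hypothetical, and this is not Stack Overflow's actual migrator. Each database gets its own transaction, so a failure partway through leaves earlier databases already committed.

```python
import sqlite3


def apply_to_all(databases, sql):
    """Run `sql` against every database, each in its own transaction.

    Returns (succeeded, failed); failed maps database name to the error text.
    """
    succeeded, failed = [], {}
    for name, conn in databases.items():
        try:
            conn.execute("BEGIN")
            conn.execute(sql)
            conn.execute("COMMIT")
            succeeded.append(name)
        except sqlite3.Error as exc:
            if conn.in_transaction:
                conn.execute("ROLLBACK")
            failed[name] = str(exc)
    return succeeded, failed


def make_db():
    # Autocommit mode so the explicit BEGIN/COMMIT above control transactions.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
    return conn


databases = {f"site_{i}": make_db() for i in range(1, 4)}
# Simulate one database that has drifted: the column already exists there.
databases["site_2"].execute("ALTER TABLE posts ADD COLUMN score INTEGER")

succeeded, failed = apply_to_all(
    databases, "ALTER TABLE posts ADD COLUMN score INTEGER"
)
print(succeeded, list(failed))
```

The per-database transaction protects each database individually, but by the time `site_2` fails, `site_1` is already upgraded; deciding whether to continue, stop, or undo the earlier ones is the complicated part.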

As for backups: absolutely. We handle this independently though. We do full backups every night as well as T-logs every 15 minutes. If I had to restore every database we had to a very specific point in time or just before a migration command was run: we have T-logs to do that going back 4 days at all times.

I'm sure there are good solutions for single database applications way more fully featured than our approach, they just do little to solve any problems we actually run into.

Why hundreds of databases? One for each stackexchange site?

Two per site - one for main Q&A, one for its meta.

You couldn't do one with some different tables? What was the justification for both the separate DBs and the Q&A/meta split? Just curious.

If you did different tables, that's even more complicated by making every query dynamic. It also makes backups, etc. far more complicated as well. Multiple databases is simply the simplest solution for multiple things that need a database with the same schema :)

Appreciate the tip.

There definitely is some tooling available to help - Redgate, many ORMs - but I agree that it's somewhat lacking.

I think the bigger problem is cultural - many programmers either don't really understand databases/data modelling or they don't care about it. After all, you don't really have to worry about it when you're just starting out - almost any schema will work. That is, right up until you have to modify it. By the time it becomes an issue, the culture has crystallised and changing the database is too risky.

For some reason, a lot of companies are largely unwilling to spend money on good database management/migration tools - even if they're paying a stack of cash for SQL Server.

> For some reason, a lot of companies are largely unwilling to spend money on good database management/migration tools - even if they're paying a stack of cash for SQL Server.

If you use SQL Server, you can do database migrations using DACPAC files. DACPAC migrations are idempotent, so you can deploy to an existing database; it will add missing columns without deleting data etc.

Personally I like manual migration scripts better. For my last projects I have been successful embedding these migration scripts in the source code of the application itself, so it integrates with source control, makes it very nice to use in test and development, and avoids many of the possible mistakes with separate code and database deployments.

There are many open source tools for this in .NET: I've used FluentMigrator, SqlFu and Insight Database Schema, and they all worked well.

Oh to be using SQL Server. Big old Oracle databases with lots of big packages and inter-related tables equals an impact analysis fun zone.

If your release script doesn't measure in the tens of thousands of lines then you're living far too comfortably.

I use sqitch, which is kind of like a specialized git for database schemas, and it has been pretty awesome so far. It includes dependency management, selective deployment of updates to the schema, testing before deployment with rollbacks on failure, etc. Work your way through the tutorial and I think you'll be pleasantly surprised at how many problems it solves.

Have you checked out something like LiquiBase or Flyway? They offer a good improvement over the ad hoc solution.



A lot of our apps use Flyway and it works pretty well, though it tries to do everything in a transaction. Normally that's desirable -- we don't want a migration to partially succeed -- but we would often use CREATE INDEX CONCURRENTLY in Postgres databases, which can't run in a transaction. In those cases we would need to manually run the migrations and update Flyway's schema_version table, which is annoyingly complex (11 columns, some with no obvious purpose).

Microsoft has SSDT for SQL Server.

The principle is brilliant. The SSDT project is a file tree of the complete schema of your system, including security objects. There is also a simple macro-language built-in for supporting conditional-compilation in your SQL. And because it's part of Visual Studio, you get source-control and msbuild integration.

This thing does static analysis of your SQL, and does deltas on SQL schema against running databases and creates the delta scripts to publish. Works very nice for continuous deployment to testing servers.

Any change that risks data-loss will be blocked, so for those you have to execute a change manually and then publish your SSDT package. You can generate the script directly against the target database, or generate a "DACPAC" package and use a command-line tool to publish the compiled DACPAC against a target database.

The problem is that the SQL server side of Microsoft is the polar opposite of the Satya OSS side of Microsoft developers. Lots of tedious designers and slow GUIs.

Also, it has no story for deploying configuration data or initialization data. At all. We've rolled our own tooling for that.

Also, performance-wise it's a goddamned dumpster fire, and it's buggy as hell.

So yes, MS got the concept right... but the implementation leaves a lot to be desired.

I concur - when I saw SSDT I thought this looks great, I can't recall the specifics now but it seems not ready for primetime.

> the SQL server side of Microsoft is the polar opposite of the Satya OSS side of Microsoft developers. Lots of tedious designers and slow GUIs.

Someone needs to come in and send some people packing and tell the rest to get with the program. For how central MSSQL seems to be to MS strategy going forward, there sure are a lot of things that still just suck.

I know. The command-line workflow for publishing SSRS reports is awful, the SSIS packages use that crazy graphical programming language and are hyper-brittle, and Management Studio still calls all its tabs sqlquery## until you save them and uses a moronic non-standard regexp for find. So many obvious bone-headed things an open development process would have taken around back and shot. It mars an otherwise-great and featureful platform.

I wonder about this too. It seems like some ideas from category theory and pure functional programming could really help here to provide an abstraction over the top of the (pseudo)relational model.

While doing some research on an unrelated topic, I stumbled on some work [1, 2, 3] by researchers at MIT that could be relevant to database deployment/migration. I have not had a chance to dig into these references to see whether there is any relevance or promise there, though it looks like there is some kind of commercialization effort [4].

[1] Patrick Schultz et al. Algebraic Databases. http://arxiv.org/abs/1602.03501

[2] David Spivak. Functorial Data Migration. http://arxiv.org/abs/1009.1166

[3] http://categoricaldata.net/

[4] http://www.catinf.com/

if you aren't stopping-the-world during upgrade, you need to ensure the old code and the new code both execute happily against the new schema.

it's even worse if you want to be able to rollback.

if the old version can't handle the new schema, is there anything that can be done when upgrading multiple servers?

You have to do a multi-step transition

1. Add new database structure (new columns, new tables, whatever) but leave all the old structure in place

2. Update all servers with code that writes in the new format but understands how to read both the new and old structures

3. Migrate the data that only exists in the old structure

4. Get rid of the old stuff from the database

5. Get rid of the code that is responsible for reading the old format

Conceptually it's straightforward but it can take a long time in calendar days depending on your deployment schedule, it can be tough to keep track of what's been migrated, and the data migration will cause performance issues if you don't plan it properly (e.g. trying to do a migration that locks an important table). You just have to do it in a way where each individual change is backward compatible and you don't move on to the next change until the previous one is rolled out everywhere.
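
The first three steps above can be sketched roughly as follows, assuming SQLite through Python's standard `sqlite3` module and a hypothetical rename of a `fullname` column to `display_name`; the table, columns, and helper functions are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (id, fullname) VALUES (1, 'Ada Lovelace')")

# Step 1: add the new structure, leaving the old column in place.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


# Step 2: application code that writes the new format and reads both.
def write_name(conn, user_id, name):
    conn.execute(
        "UPDATE users SET display_name = ? WHERE id = ?", (name, user_id)
    )


def read_name(conn, user_id):
    row = conn.execute(
        "SELECT COALESCE(display_name, fullname) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0] if row else None


before_backfill = read_name(conn, 1)  # falls back to the old column

# Step 3: migrate the rows that only exist in the old structure.
conn.execute(
    "UPDATE users SET display_name = fullname WHERE display_name IS NULL"
)
conn.commit()
after_backfill = conn.execute(
    "SELECT display_name FROM users WHERE id = 1"
).fetchone()[0]

# Steps 4 and 5 (dropping `fullname`, then removing the COALESCE fallback)
# would follow only once every server is running the step-2 code.
```

On a large table the step-3 backfill would normally be done in batches rather than one statement, for the locking reasons mentioned above.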

I've used Flyway for that. Conceptually similar, but it's a real tool. Very solid.

I put a bunch of effort into database change management over the years and work on a tool called AliaSQL that is a simple command line database migrator for SQL Server.

Using this tool has led to a dramatic increase in productivity on our team since we really don't have to worry about database changes anymore. I won't waste space here with the details but these links will fill you in if you have any interest.

https://github.com/ClearMeasure/AliaSQL http://sharpcoders.org/post/Introducing-AliaSQL

After using SQL for most of my career, I'm now working on a product using MongoDB. Removing the need for schema migrations has been such a boon for productivity. You essentially push the work of migrating to the app code, where it can be done incrementally, and with a far better programming language than SQL. It's been well worth the trade-offs, in my opinion.

It's called NoSQL, which removes the need for schema migrations for things like adding or deleting columns.

This could be solved for relational databases if you implemented application-level abstractions that allowed you to store all your data using JSON storage, but create non-JSON views in order to query it in your application using traditional ORMs, etc.

So, store all data using these tables, which never have to be changed:

- data_type

- data (int type_id, int id, json data)

- foreign_key_type (...)

- foreign_keys (int type_id, int subject_id, int object_id)

(we'll ignore many-to-many for the moment)

And then at deploy time, gather the list of developer-facing tables and their columns from the developer-defined ORM subclasses, make a request to the application-level schema/view management abstraction to update the views to the latest version of the "schema", along the lines of https://github.com/mwhite/JSONAlchemy.

With the foreign key table, performance would suffer, but probably not enough to matter for most use cases.

For non-trivial migrations where you have to actually move data around, I can't see why these should ever be done at deploy time. You should write your application to be able to work with both the old and new version of the schema, and have the application do the migration on demand as each piece of data is accessed. If you need to run the migration sooner, then run it all at once using a management application that's not connected to deploy -- with the migration for each row in a single transaction, eliminating downtime for migrating large tables.
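
A minimal sketch of the migrate-on-access idea, using a plain dict as a stand-in for the data store; the `version` field, the v1-to-v2 name split, and the function names are hypothetical and only meant to show the shape of the approach.

```python
CURRENT_VERSION = 2


def upgrade(record):
    """Upgrade a record to CURRENT_VERSION; returns a new dict."""
    record = dict(record)
    if record.get("version", 1) == 1:
        # Hypothetical v1 -> v2 change: split "name" into first/last.
        first, _, last = record.pop("name").partition(" ")
        record.update(first_name=first, last_name=last, version=2)
    return record


def load(store, key):
    """Read a record, migrating it in place the first time it is touched."""
    record = store[key]
    if record.get("version", 1) < CURRENT_VERSION:
        record = upgrade(record)
        store[key] = record  # write back so the work happens only once
    return record


def migrate_all(store):
    """The 'management application' path: touch every record, one at a time."""
    for key in list(store):
        load(store, key)


store = {1: {"name": "Grace Hopper"}, 2: {"name": "Alan Turing"}}
first = load(store, 1)  # migrated on demand
migrate_all(store)      # the rest, ahead of access
```

With a real database, each `load` write-back would be its own small transaction, which is what keeps large tables available while the migration proceeds.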

I don't have that much experience with serious production database usage, so tell me if there's something I'm missing, but I honestly think this could be really useful.

> With the foreign key table, performance would suffer, but probably not enough to matter for most use cases.

Citation needed :) That's going to really depend.

I'm not for or against NoSQL (or any platform). Use what's best for you and your app!

In our case, NoSQL makes for a bad database approach. We do many cross-sectional queries that cover many tables (or documents in that world). For example, a Post document doesn't make a ton of sense, we're looking at questions, answers, comments, users, and other bits across many questions all the time. The same is true of users, showing their activity for things would be very, very complicated. In our case, we're simply very relational, so an RDBMS fits the bill best.

Sorry for being unclear. I'm not proposing NoSQL. I'm saying that many NoSQL users really mainly want NoDDL, which can be implemented on top of Postgres JSON storage while retaining SQL.

- data (string type, int id, json fields)

- fk (string type, int subj_id, int obj_id)

  select
    fk_1.obj_id as foo_id,
    fk_2.obj_id as bar_id
  from data
  join fk as fk_1 on data.id = fk_1.subj_id
  join fk as fk_2 on data.id = fk_2.subj_id
  where
    data.type = 'my_table'
    and fk_1.type = 'foo'
    and fk_2.type = 'bar'

What would the performance characteristics of that be versus if "foreign keys" are stored in the same table as the data, if fk has the optimal indexes?

In no specific order:

If your database doesn't enforce the schema you still have a schema, it's just ad-hoc and spread across all your different processes, and no one quite agrees what it is. In the real world as requirements change and your app/service increases in complexity this becomes a constant source of real bugs while simultaneously leading to garbage data. This is not theoretical, we have a lot of direct painful experience with this. Best case scenario your tests and tooling basically replicate a SQL database trying to enforce the schema you used NoSQL to avoid in the first place.

Indexes are fast but they aren't magic. A lot of what a traditional SQL database does is providing a query optimizer and indexes so you can find the data you need really fast. Cramming everything into a few tables means everything has to live in the same index namespace. Yes you can use views and sometimes even indexed views, but then you have a schema so why jump through hoops to use non-optimized storage when the database has actual optimized storage?

Separate database tables can be put on separate storage stacks. A single table can even be partitioned onto separate storage stacks by certain column values. Cramming everything into four tables makes that a lot more complicated. It can also introduce contention (depending on locking strategies) where there wouldn't normally be any.

IMHO most systems would be better served by sharding databases than by using NoSQL and pretending they don't have a schema. If application design prevents sharding then scaling single-master, multiple-read covers a huge number of cases as well. The multiple-master scenario NoSQL systems are supposed to enable is a rare situation and by the time you need that level of scale you'll have thrown out your entire codebase and rewritten it twice anyway.

The key to schema migrations is just to add columns and tables if needed, don't bother actually migrating. Almost all database engines can add columns for "free" because they don't go mutate existing rows. Some can drop columns for "free" too by marking the field as obsolete and only bothering to remove it if the rows are touched.

Postgres (and at least one other RDBMS) has partial indexes, which pretty much solves the index namespace problem you mention: http://www.postgresql.org/docs/8.0/static/indexes-partial.ht... Partial indexes are integrated into the proof-of-concept repo I linked.

Storing a data type field in the generic storage table enables the same partitioning ability as a standard schema.

99% of NoSQL database users just don't want to deal with migrations, even if they're "free" (another big issue is synchronizing application code state and DB migration state of production, testing, and developer machines), so what they really need is NoDDL, YesSQL.

> Almost all database engines can add columns for "free" because they don't go mutate existing rows. Some can drop columns for "free" too by marking the field as obsolete and only bothering to remove it if the rows are touched.

Didn't know that, thanks.

> It can also introduce contention (depending on locking strategies) where there wouldn't normally be any.

Didn't think of that. I'm aiming this at the 99% of NoSQL users for whom doing things they could do with SQL requires much more effort, so they can accept a modest performance degradation in exchange for being able to use SQL. If you have any good links on how this storage design would affect lock contention, please share.

A .NET abstraction for PostgreSQL: https://github.com/JasperFx/marten

Additional anecdata: At my place of employment, after the required code review, we must write an essay about the change and have at least two coworkers approve it. Then we must email the essay to a mailing list of several hundred people. One of those people is the designated external reviewer, who must sign off. However, lately, this final reviewer has been requesting 10+ modifications to the deployment essay due to a managerial decision to "improve deployments". Moreover, deployments are not allowed 1/4th of the year unless a vice president has signed off.

Any code change requires this process.

What industry sector? Medical? Defense?


This feels really clunky to me, but maybe I'm just not getting it. I'm trying to implement a more automated build/deploy process at my current place of employment and am basically modeling it off of Github's [0], which seems to have a better feel.

Obviously the quality of the process needs to be high, but when it's effortless and "fun" then everybody wins.

[0] http://githubengineering.com/deploying-branches-to-github-co...

   Fun fact: since Linux has no built-in DNS caching, most of the DNS queries are looking for…itself. Oh wait, that’s not a fun fact — it’s actually a pain in the ass.
Surely that should just be a very fast lookup in /etc/hosts?

The problem here is that these services move - so if it's in /etc/hosts, our failover mechanisms (to a DR data center which has a replica server) are severely hindered. We're adding some local cache, but there are some nasty gotchas with subnet-local ordering on resolution. By this I mean, New York resolves the local /16 first, and Denver resolves its local /16...instead BIND doesn't care (by default) and likes to auth against let's say: the London office. Good times!

But that's what DNS scope is for, surely?

We had n datacenters, each named after its city: ldn.$company.com, ny.$company.com, etc. In DHCP we pushed out the search order so that it would try to resolve locally and, if that failed, try a level up until something worked.

This meant that when you bound to a service, it would first look up service.$location.$company.com; if that's not there, it'd try to find service.$company.com.

This cuts down the need for nasty split horizon DNS, moving VMs/services/machines between datacenters was simple and zero config.

If you were taking a service out of commission in one datacenter, you'd CNAME service.$location.$company.com to a different datacenter, do a staged kick of the machines, and BOOM failed over with only one config change.
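A rough sketch of the search-order lookup described above, under stated assumptions: `example.com` stands in for `$company.com`, the `records` dict is a hypothetical stand-in for the DNS servers, and real resolvers add details (such as the `ndots` rule) that are left out here.

```python
def candidate_names(name, search_domains):
    """Return the fully qualified names to try, most local first."""
    if name.endswith("."):
        # Already fully qualified: the search list is not applied.
        return [name]
    return [f"{name}.{domain}." for domain in search_domains]


def resolve(name, search_domains, records):
    """Return (fqdn, address) for the first candidate found, else None.

    `records` is a hypothetical in-memory zone used only for illustration.
    """
    for candidate in candidate_names(name, search_domains):
        if candidate in records:
            return candidate, records[candidate]
    return None


# Hypothetical data: a London client whose DHCP-supplied search list
# tries the local datacenter first, then the company-wide domain.
records = {
    "db.ldn.example.com.": "10.1.0.9",
    "service.example.com.": "10.2.0.5",
}
search = ["ldn.example.com", "example.com"]

print(resolve("db", search, records))
print(resolve("service", search, records))
```

With this shape, pointing `service.ldn.example.com` somewhere else (the CNAME trick above) changes what London clients get without touching the clients themselves.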

On a side note, you can use SSSD or (shudder) NSLCD to cache DNS.

We do, but in the specific case of Active Directory, we want to fail over and auth against another data center if the primary is offline. This means for our domain, the local (to the /16) domain controllers are returned first and then the others. The problem is BIND locally doesn't preserve this order and applications are suddenly authenticating across the planet.

DNS devolution isn't a good idea here, since the external domain is a wildcard. We'll be paying for that mistake from long ago until (if ever) we change the internal domain name.

This is a pretty recent problem we're just now getting to because the DNS volume has been a back-burner issue - we'll look into permanent solutions for all Linux services after the CDN testing completes. Recommendations on Linux DNS caching are much appreciated - we'll review each. It just hasn't been an issue in the past, so we're not experts in that particular area. I am surprised caching hasn't landed natively in most of the major distros yet, though.

Aha, gotcha. I was under the impression that SSSD chose the fastest AD server it could find (either via the SRV records, or via a pre-determined list)? I've not had too much trouble with it stubbornly binding to the furthest-away server. (That's with AD doing the DNS and delegation to BIND.)

NSCD (name service caching daemon) is in RHEL and Debian, so I assume it'll be in Ubuntu as well. The problem is that it fights with SSSD if you're not careful. https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...

Out of interest, what are you using to bind to AD?

> The problem is BIND locally doesn't preserve this order

Nor need any other DNS server software do so. The actual DNS protocol has no notion of an ordering within a resource record set in an answer.

I suspect, from your brief description here, that what you'll end up with is using the "sortlist" option in the BIND DNS client library's configuration file /etc/resolv.conf . Although SRV RRSets will introduce some interesting complexities.

* http://homepage.ntlworld.com./jonathan.deboynepollard/FGA/dn...
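To illustrate the idea behind client-side sorting (since the protocol itself guarantees no order within a record set), here is a minimal sketch. It is not how the resolver library implements `sortlist`; the function name and the addresses are made up for the example.

```python
import ipaddress


def sort_by_preferred_networks(addresses, preferred_networks):
    """Order addresses so those in earlier preferred networks come first.

    Addresses matching no preferred network keep their relative order
    at the end, because Python's sort is stable.
    """
    networks = [ipaddress.ip_network(n) for n in preferred_networks]

    def rank(address):
        ip = ipaddress.ip_address(address)
        for index, network in enumerate(networks):
            if ip in network:
                return index
        return len(networks)

    return sorted(addresses, key=rank)


# Hypothetical answer set: domain controllers in three sites, returned
# by the server in arbitrary order. The client sits in 10.1.0.0/16.
answers = ["10.3.0.10", "10.1.0.10", "10.2.0.10", "10.1.0.11"]
print(sort_by_preferred_networks(answers, ["10.1.0.0/16"]))
```

The point is that the preference lives on the client, so it survives whatever order an intermediate cache hands back.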

It will. AFAIK systemd-resolved does caching by default.

I'm confused. The lookup is for the localhost, so how would this alter failover mechanisms? You don't want a lookup for the localhost being responded to with an address of a different data centre surely?

It's not for localhost, it's for the server name. While Gitlab and Teamcity normally are on the same box, they can operate on different boxes or in different data centers. It's looking up a DNS name which happens to point at the same box...does that explain it more clearly?

Can't you just have all traffic in a DC to go only through your local DNS resolvers?

The first lookup might take longer, but subsequent ones should be fast.

caching DNS resolvers can fit in a 256MB RAM VM and use virtually 0 CPU.

Also, Linux has "built in" (whatever that means) DNS caching. It's called nscd. It's just usually not enabled by default (which is sensible, since it's better off shared).

nscd also has a known TTL bug that hasn't been fixed in 9 years.


Wow. A testament to focus on quality that is.

I think if you're deploying .NET code you're almost certainly going to follow a similar build architecture, with TeamCity doing most of the grunt work. We have a very similar build structure, but a bit more polished I think. Our TeamCity build agents build NuGet packages, deploy them to Octopus, and run unit and integration tests. Octopus handles promotion of code from dev to QA, to staging, and to all production environments. We also write migrations for database updates using FluentMigrator, which works with just about any database driver. It's a joy deploying code on an environment like this.

Agreed on the Octopus bit. TeamCity + Octopus is practically magical. Until literally yesterday, I'd yet to find something that didn't work with minimal effort between the two.

Happy to see someone else using FluentMigrator. It's a fantastic library that doesn't get mentioned enough imo.

> Fun fact: since Linux has no built-in DNS caching, most of the DNS queries are looking for … itself.

This is wrong in two ways, and isn't factual at all.

First, the cause of the queries is nothing to do with whether DNS query answers are cached locally or not. There is no causal link here. What causes such queries is applications that repeatedly look up the same things, over and over again; not the DNS server arrangements. One could argue that this is poor design in the applications, and that they should remember the results of lookups. But there's a good counterargument to make that this is good design. Applications shouldn't all maintain their own private idiosyncratic DNS lookup result caches. History teaches us that applications attempting to cache their own DNS lookup results invariably do it poorly, for various reasons. (See Mozilla bug #162871, for one of several examples.) Good design is to hand that over to a common external subsystem shared by all applications.

Which brings us to the second way in which this is wrong. A common Unix tradition is for all machines to have local proxy DNS servers. Linux operating systems have plenty of such server softwares that can be used: dnsmasq, pdnsd, PowerDNS, unbound ...

One of the simplest, which does only caching and doesn't attempt to wear other hats simultaneously, is even named "dnscache". Set this up listening on the loopback interface, point its back-end at the relevant external servers, point the DNS client library at it, and -- voilà! -- a local caching proxy DNS server.

* http://cr.yp.to/djbdns.html

* http://cr.yp.to/djbdns/run-cache.html

* http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/djb...

* http://cr.yp.to/djbdns/install.html

I run server machines that have applications that repeatedly look up the same domain names again and again. Each runs a local dnscache instance, which ameliorates this very well.
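For readers who want to see why a shared local cache absorbs repeated lookups, here is a toy model of the caching behaviour. It is only a sketch of the concept, not how dnscache works internally; `TtlCache`, the `lookup` callable, and the injectable `clock` are all names invented for this example.

```python
import time


class TtlCache:
    """Remember lookup results until their TTL expires."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup    # callable: name -> (address, ttl_seconds)
        self._clock = clock      # injectable so the cache can be tested
        self._entries = {}       # name -> (address, expires_at)

    def resolve(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]     # still fresh: no upstream query
        address, ttl = self._lookup(name)
        self._entries[name] = (address, now + ttl)
        return address


# Hypothetical upstream that counts how often it is actually queried.
upstream_calls = []


def fake_upstream(name):
    upstream_calls.append(name)
    return "10.0.0.1", 30


fake_now = [0.0]
cache = TtlCache(fake_upstream, clock=lambda: fake_now[0])
for _ in range(1000):
    cache.resolve("build.example.com")
print(len(upstream_calls))
```

An application that asks for the same name a thousand times within the TTL costs the upstream server a single query, which is the effect described above.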

I guess this goes to show that you can be a full-fledged expert in one development "world," and miss easy things in another. I see this a lot in the apparent expectations of people because of their ingrained presumptions of the OS platform. This article is of interest to HN because Mr. Craver is on a leading edge of .NET development and deployment, and this is an insightful oversight.

I spent 20 years getting deep with PHP, and then Rails, on Linux, where, IMO, doing CI builds with something like Jenkins or Heroku is pretty straightforward. For the past couple years, I've been doing .NET, and, while I never really stopped doing ".NET" since the VB 3.0 days, I've had a lot to learn to get into doing serious, enterprise-level stuff. I just took my first steps in setting up auto-building with Visual Studio Team Services, and my experience has left me really disappointed. Dealing with changing variable values per environment is really hacky, no matter which of a handful of tricks you want to try. So much so, I left the whole thing hanging, and have gone back to doing releases by hand.

Skimming the article, and seeing discussion about this topic in the comments, leads me to conclude that my experience wasn't just limited by my ignorance, and that this is still an area that is underserved in the .NET world. You'd think someone would have neatly sorted this by now. I keep looking for the oversight in MY case, but I'm not finding it.

> gone back to doing releases by hand.

Did you try Octopus Deploy [1] or Deployment Cockpit [2]?

[1] https://octopus.com

[2] http://deploymentcockpit.com

No. Thanks for the tip. I'm going to try to crack this nut again. I'll check them out.

I don't believe your assessment is correct. I very specifically said built-in. This remains true. If curious, we're on CentOS 7 specifically. I didn't say there aren't any options, only that there aren't any built-in. What you described as alternatives are totally true, but they still aren't built-in. It's a manual/puppet/chef/etc. config everyone has to do.

As for the applications - we have little direct input to TeamCity or Gitlab (the problem children here). And even if we did, I think we agree: the application level shouldn't cache anyway.

That being said, we're looking at `dnscache` as one of a few solutions here. But the point remains: we have to do it.

You are employing a faulty concept of what constitutes built-in. This is not Windows, or one of the BSDs. You're using one of the Linux distributions where everything is made up of installing packages. There is no meaningful "built-in"/"not-built-in" difference between installing one of these DNS server packages from a CentOS 7 repository and installing any other CentOS packages from a CentOS 7 repository.

> All software on a Red Hat Enterprise Linux system is divided into RPM packages which can be installed, upgraded, or removed.

-- https://www.centos.org/docs/5/html/Deployment_Guide-en-US/pt...

A (very) quick check indicates that the CentOS 7 "main" and "updates" repositories have at least three of the DNS softwares that I mentioned. Ubuntu 16 is better endowed, and has all of them that I mentioned, and an additional "Debian fork of djbdns" that I did not, in Ubuntu's "main" and "universe" repositories.

I too read that part of your post and paused. Your argument about it not being built into Linux is interesting because it gets down to what is Linux versus what is the distribution, etc. Which I don't think is worth getting into. The truth is I'm almost positive you rely on plenty of other core components not built into Linux proper. DNS caching is certainly available to you and if you don't have config mgmt or an immutable infrastructure model in place, I'd start considering it soon.

It's funny to see that stackoverflow came to exactly the same solution for database migrations on the Microsoft stack as my team did, even down to the test procedure.

Simple, safe and very effective :)

It's also pretty much the same solution colleagues and I came up with a few years ago for a migration tool we were working on. It's kind of abandon-ware at this point, but this version is pretty far along: https://github.com/jdc0589/mite-node

> If Gitlab pricing keeps getting crazier (note: I said “crazier”, not “more expensive”), we’ll certainly re-evaluate GitHub Enterprise again.

Shots fired :P

When did everybody decide that chatbots were the new hotness for deployments?

If you mean pinbot - that's literally all it does. It takes a message and pins it, knocking the old one off the pins.

The build messages build...that's also literally all it does. It simply puts handy notices in the chatroom. Why wouldn't you want that integration? Everyone going to look at the build screen and polling it to see what's up is a far less efficient system. A push style notification, no matter the medium, causes far less overhead.

I doubt we'll ever build from chat directly for anything production at least, simply because those are 2 different user and authentication systems in play. It's too risky, IMO.

> A developer is new, and early on we want code reviews
>
> A developer is working on a big (or risky) feature, and wants a one-off code review

This implies you don't normally do code review??

Code review is overrated. Over my career I've built 8/9 figure revenue platforms without code review or serious test setups.

A lot of this cruft is unnecessary when compared to good domain knowledge and solid coding focus.

> Over my career I've built 8/9 figure revenue platforms without code review or serious test setups.

This does not diminish code reviews. Would you be a better engineer today if you had regularly participated in code reviews? Would your coworkers?

One of the problems with code reviews is they're often done really poorly, for a number of reasons:

1. Reviewers are often poorly trained to provide good design reviews and default to nit-picky stuff a code linter should pick up. Human linting is just a poor use of time and money.

2. Nobody seems to ever have time for them to deep dive into the code.

3. Few engineers seem to ever actually want to do them.

4. Reviews can become hostile.

Code reviews are probably really important in some fields, for example, medical equipment, aviation, etc, but for the vast number of projects where we're shoveling A bits to B bucket or transforming C bits into D bits it's overkill and companies would be better off investing the massive amount of wasted time in better CI/CD infrastructure.

> Would you be a better engineer today if you had regularly participated in code reviews? Would your coworkers?

Maybe, but probably not. It's not like I never see someone's code, it's right there when I'm working in the same code base and I can go through commits to see the high-level changes.

There are lots of ways to become a better engineer and code reviews are pretty far down on the list in my view. They usually just turn into tedious ordeals that burn up actual productive time.

I believe s/he would, and likely knows it. They just said it was overrated, not useless. They just draw the tradeoff line differently.

Code reviews have a twofold purpose:

1. Make sure you don't do something dumb + mentor/educate to better standards.

2. Share the knowledge of how a codebase works so that someone else will know how to fix something at 3am when you can't be reached.

I wonder if there is a better way of doing code reviews.

For example, a way for reviewers to just mark a review as 'acknowledged' and submit a list of potential concerns (which may freely be ignored by the author). This makes them much lower friction, as the reviewer is scanning the code to understand its purpose and help think of potential pitfalls at a high level, rather than nit-picking little details.

This is how we do it (using Stash). Most reviewers are across the code base and so essentially rubber-stamp approve. External stakeholders (mostly operations) go over the code with more rigor since this is their one shot at getting errors corrected. "NBC" (non blocking comment) is used to describe a nit-pick (formatting, suboptimal but not awful variable names) that isn't amenable to automatic linting. Additional review comments are assumed to be blocking and require at least an acknowledgement. Big changes are usually acknowledged with a "my sprint is in danger" and put into the backlog.

I've mentioned it in previous threads, but we try to prevent hostile reviews by separating the code from the coder. Comments should not reference the author, only the code.

> A lot of this cruft is unnecessary when compared to good domain knowledge and solid coding focus.

The counterpoint is that this good domain knowledge is bettered by considering others' changes.

> Code review is overrated.

Scientific research suggests otherwise though.

Yes, we don't normally code review stuff we don't need to. We trust each other.

I don't think doing code reviews implies a lack of trust. I trust everyone on my team but we code review everything. For that matter, the fact that we trust/have good relationships with everyone makes code reviews more effective because we can be more candid.

We often find issues in code reviews like edge cases that weren't thought of, code that could be refactored to use an existing utility or pattern the author wasn't aware of, etc.

I work for a company that averages around 50 production deployments per day for our customer-facing ERP, and we only do code reviews for new devs and for changes to the underlying framework, for mostly the same reason sklivvz1971 mentions. We roll back very infrequently, and a majority of our devs can deploy to prod with the push of a button as needed (this includes both application and database code). I'm not arguing that code review is unnecessary, just that proper training and devs with good judgement can help reduce the likelihood of breaking things when deploying small changes frequently.

We rarely catch things breaking in code reviews, I agree they are really bad at finding bugs. Automated tests and linters are better at finding stuff like that. The things we usually address in code reviews are architecture and code design issues, and occasionally edge and interaction issues that are outside the scope of what might have been considered when implementing.

We also have frequent production deployments that everyone on the team can do, I view that as something that is independent of code review.

Not a bad practice, but costly in dev time and a trade off many aren't willing to make especially in smaller companies where mistakes aren't as costly as all that extra developer time.

Personally I do code reviews mostly to share knowledge and culture rather than looking for bugs. Occasionally a bug is found, but I don't generally have the time to review the logic, just the style.

Don't get me wrong, we review code. We simply don't review commits. Reviewing absolutely every commit would feel like a waste of our time and an efficiency issue. I can see where it could be useful, but in our case it's simply a solution to a problem we don't have. We can definitely live with a bug in production for 5 minutes.

As a commenter below notes, there are always two pilots in an airplane -- and that is pretty much also a trust issue -- but we don't pilot planes, we don't have actual lives depending on us.

The captain trusts the co-pilot on an airliner.

There are still two people flying a plane.

Don't forget the role who does most of the work, the autopilot :)

Well, it is not an enterprise application. It is a free web site. Yeah, not a very good practice, but not very critical either.

I dunno that I'd say an enterprise app is inherently more valuable than Stack Overflow. ;)

But to be clear - it's not that we never do reviews. It's more that we have an "ask for it when you need it" type of policy. New hires get regular reviews, so initial architecture/style concerns are addressed then... along with teaching the logistics of our code reviews (push to a branch & submit a merge request).

SE/SO is, for its size, an amazingly high-performance team. It's rare to hear them say it; in fact, I don't recall ever hearing them say this.

The importance of the cohesion and trust among their team is critical to their deployments. In fact, I would say it's vital to how they're able to get away with minimal amounts of code review, for example.

It's dangerous to believe this is easy or reproducible. New teams need extensive controls in place to make sure the quality of the deployments will not negatively impact the group.

A few things strike me as odd/sub-optimal:

    - Migration IDs freeform in chat -> why not a central db with an auto-increment column?
    - Using chat to prevent 'migration collisions' -> the same central db/msmq/whatever could report start/stop of migrations and lock based on that...

I guess this is because using a team chat is 'good enough' for them without adding another layer of tooling
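For what it's worth, the auto-increment suggestion above could look roughly like this. It is a sketch only: SQLite stands in for whatever shared server database a team would actually use, and the table, column, and function names are invented for the example, not taken from the article.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (and create if needed) a hypothetical migration registry."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS migrations ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " author TEXT NOT NULL,"
        " description TEXT NOT NULL)"
    )
    return conn


def claim_migration_id(conn, author, description):
    """Reserve the next migration ID; the database hands out the number."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO migrations (author, description) VALUES (?, ?)",
            (author, description),
        )
    return cursor.lastrowid


registry = open_registry()
first = claim_migration_id(registry, "alice", "add index on Posts")
second = claim_migration_id(registry, "bob", "drop legacy column")
print(first, second)
```

Whether this beats typing "taking the next ID" into chat is exactly the trade-off raised above: it removes the chance of two people claiming the same number, at the cost of one more moving part.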

The article says:

> there is a slight chance that someone will get new static content with an old hash (if they hit a CDN miss for a piece of content that actually changed this build)

Does anyone have a solution to this problem?

Without poking at their site to see how they behave, I assume they just append a cache breaking param onto the URL for a static entity. Assuming that assumption is correct, the fix is to make the hash meaningful to the process serving the static entities.

This adds complexity as now your static site needs to serve requests based on the hash. This isn't conceptually complex but it means you must deploy at least two versions of your static resource (one for each hash in the wild). And you still have to do two phase deployment in this model (static resources first, then the "real" site). Or you can build redirect or proxying support from v2 to v1 and vice versa, which is a much uglier problem to solve, but eliminates the need for two phase deployment.

Since they have short deployments, their solution is pretty elegant. If you have long deployments, the hash aware server becomes sensible. If you're a masochist, the two way proxying becomes attractive.
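A minimal sketch of the "hash aware server" idea, assuming the hash is derived from the file contents and that old versions stay deployed alongside new ones. `VersionedAssets` and the `?v=` parameter are illustrative choices, not what Stack Overflow actually does.

```python
import hashlib


class VersionedAssets:
    """Keep every published version and serve by (path, hash)."""

    def __init__(self):
        self._by_key = {}  # (path, content_hash) -> bytes

    def publish(self, path, content):
        """Store one version and return the URL a page should reference."""
        digest = hashlib.sha256(content).hexdigest()[:12]
        self._by_key[(path, digest)] = content
        return f"{path}?v={digest}"

    def serve(self, path, digest):
        """Return the bytes for exactly that hash, or None if unknown."""
        return self._by_key.get((path, digest))


assets = VersionedAssets()
url_v1 = assets.publish("/js/app.js", b"console.log('v1');")
url_v2 = assets.publish("/js/app.js", b"console.log('v2');")

# A V1 page still in the wild asks for the old hash and gets old content,
# even though V2 has already been published.
old_hash = url_v1.split("?v=")[1]
print(assets.serve("/js/app.js", old_hash))
```

Because the server answers by hash rather than "whatever is newest", a CDN miss during a rollout can never pair new content with an old hash; an unknown hash returns nothing instead of the wrong file.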

Nope, we solve it the same way, it's a pain. I did a presentation once for devs where I unraveled the up-to-9 different layers of caching between an end user and our website (when you take into account the browser and any tiered CDN caching)

It's a pest of a problem but pre-deploying static assets is the best answer.

Push new static content first to all servers/cdn, _then_ bust the CDN cache / push a cache bust

This doesn't resolve the issue. The fundamental issue here is that two versions are running in parallel.

* If you push static content and web pages together, you get V1 and V2 of both static and web, and you end up with incorrect static resources served in both directions. This approach is only reasonable if your deployment strategy is to take a service outage to upgrade all machines together.

* If you push web first, you get the ugly scenario described in the article where V1 resources get served with V2 hashes and cached for 7 days.

* If you push static content first, you still have V2 static content being served for V1 web pages. The "cache bust" doesn't matter. Somewhere a cache will expire and someone will get V2 static resources for a V1 page.

You have to deal with the two versions somehow if you want to resolve the issue fully.

Stack Overflow uses Windows? Any particular reason to do this?

The founders and first devs were proficient in it.

It works well enough for our needs (e.g. C# has one of the best GCs on the market), and no one is a platform/language zealot, so we keep on using it.

Stuff we added later runs on other platforms, as needed (e.g. we run Redis and Elasticsearch on CentOS, our server monitoring tool Bosun is written in Go...)

Written in C#

Joel was a former product manager at Microsoft.

I was surprised they weren't using Kiln [https://www.fogcreek.com/kiln/], a Fog Creek product. I know SO is independent from Fog Creek now, but still a bit surprised at it. I wonder if there was a migration off at some point.

Yep...for the on-premise reasons listed in the article. Once upon a time a lot of projects were on Mercurial, hosted by Kiln. The Stack Overflow repo specifically has always been on an internal Mercurial and then Git server. Originally this was for speed, now it's for speed and reliability/dependency reduction.

Did you choose polling over webhooks for a reason? Or were webhooks only recently added as a feature to Gitlab?

Webhooks didn't use to work well for many builds off a single repo, but I think this changed very recently in TeamCity. Thanks for the reminder - I'll take another look this week at adding webhooks. We'd still want the poll in case of any hook failures.

At the moment, Gitlab knows nothing about our builds - and we'd want to keep it simple in that regard. If we can generically configure a hook to hit TeamCity to alert of any repo updates though, that's tractable...I need to see if that's possible now.

TeamCity is a great CI system.

What's with all the backslashes?

Windows. The terminal window is Cmder [1]

[1] http://cmder.net/

Yeah! What's wi...wait, what?
