How We Release So Frequently (skybettingandgaming.com)
201 points by TomNomNom on Feb 7, 2016 | 50 comments

This is a good solid write-up of techniques that I think are emerging as best practices generally for high scale, complex sites. The approach described matches what we do at Eventbrite pretty closely, and I know companies like Etsy and Slack use the same kind of process.

Feature flags in particular are a very powerful tool for managing feature releases independent of the deploy cycle. Here's a good recent article that dives into those in more detail: http://martinfowler.com/articles/feature-toggles.html
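To make the idea concrete, here is a minimal Python sketch of a flag check with a percentage rollout. It is not taken from the linked article: the `FLAGS` store, the `new_checkout` flag and the `is_enabled` helper are all hypothetical names, and a real system would load flags from config or a database so they can change without a deploy.

```python
import hashlib

# Hypothetical in-memory flag store (an assumption for illustration only).
FLAGS = {
    "new_checkout": {"enabled": True, "rollout_percent": 25},
}


def is_enabled(flag_name: str, user_id: str) -> bool:
    """Return True if the flag is switched on for this user."""
    flag = FLAGS.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False
    # Hash flag + user so each user lands in a stable bucket in [0, 100).
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < flag["rollout_percent"]


def checkout(user_id: str) -> str:
    # The new code path is deployed, but only released to some users.
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"
```

Raising `rollout_percent` releases the feature to more users, and setting `enabled` to `False` turns it off again, in both cases without a new deploy.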

I just want to throw out that there's nothing about this that won't work for !("high scale, complex") sites. I worked at a ~20 person startup and we arrived at the same decision regarding database migrations and, you know what, it wasn't that hard and it didn't take that much time and it made deploys a lot smoother.

I see a lot of other people mentioning the annoyance factor. Like anything else, you get used to it, and appreciate its advantages.

All the principles in here sound exactly like what is SOP at Google (and probably Facebook and others). It's a pain in the butt sometimes, especially when making schema changes, but ensuring everything works properly even when you have multiple versions running in production can really help with confidence in making changes.

You do have to think several steps ahead when making big changes; the (internal-only) project I work on is in the process of completely changing our DB schema and we're redoing our API completely as well. We're attempting to keep our old API running in parallel while migrating literally everything underneath it, which is a fun challenge. It results in a lot of what can feel like busywork, but when the alternative is bringing down your service temporarily, it's worth it. An hour of planned downtime to do an offline migration can easily turn into several days when Murphy strikes. That's OK when you're first building a system, but once anyone is relying on you to get their work done, it is just pure pain.

The biggest difficulty is managing database rollbacks. When you mess up your database, rolling back can be tough.

These guys avoid the problem by never rolling back the database, and never making changes that might require that.

Flyway author here. Rollback is an illusion once your transaction has been committed. At best you can attempt to issue a compensating transaction to undo some of the effects. More details: https://flywaydb.org/documentation/faq.html#downgrade

Currently using Flyway on an enterprise Java project. Brilliant library. Excellent documentation; developers sing its praises all day, every day.

Database schema changes can almost always be moved forward in an always-safe manner using the expand-contract pattern: http://martinfowler.com/bliki/ParallelChange.html

As for reducing the risk of damage to the data itself from badly behaved application code, I think the best approach is to design your architecture such that you can't lose important data.

There are various other techniques that can help. I wrote about some of them here: http://benjiweber.co.uk/blog/2015/03/21/minimising-the-risk-...

Releasing less frequently actually only makes the problem worse. Infrequent changes are often too big for you to have a chance of understanding their potential effect on the production system. You're also less likely to immediately know how to respond to a problem.

Never rolling back makes a lot of sense, and is often supported by the fact that many migration frameworks, e.g. Flyway, don't even provide a rollback facility. A database migration is a major, high-risk change IMO and should be well thought through.

If you add a column and real customer data is written to that column, then "rolling back" means deleting it. That isn't a problem that can be "solved".

I never think about doing a rollback on production; a downgrade is nice on a dev machine when you work on multiple branches. I'd rather do a database restore and a code restore to the previous version.

I would not roll out new code with migrations without an ad-hoc DB backup. The same goes for any new code.

It's always possible to make changes to your schema without needing to potentially roll back the database. It may just be more annoying than you are willing to attempt. You might need to roll back your code to a previous version, but if you structure your changes right, you'll do it in multiple steps. (These are basically what's laid out in the article, but broken down more.)

  1. Change your code to write to old and new schema; keep reading from old schema. 
  2. Migrate data to new schema in background. 
  3. Add a flag to control where you're reading from. Default it to read from old location. Keep writing to both. 
  4. Flip the flag on some subset of your jobs. Ensure everything is still running smoothly for as long as you like. 
  5. Change the default to read from new schema. Wait as long as you like to be comfortable that the change is working properly. 
  6. Delete the code that reads from the old schema. 
  7. Delete the code that writes to the old schema. 
  8. Drop the data in the old schema.
At any point prior to deleting the old data, if you encounter problems you can roll back to an old version of the code. If your schema changes are incompatible, you can make an entire new database with the new schema. This may temporarily waste some storage space, but it's very safe.
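Steps 1 to 5 above can be sketched roughly as follows, using Python's built-in `sqlite3`. This is only an illustration under stated assumptions: the `users_old` and `users_new` tables, the name-splitting change and the `read_from_new` flag are all hypothetical, not something from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical old schema (one name column) and new schema (split names).
conn.execute("CREATE TABLE users_old (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, first TEXT, last TEXT)")
# A row written before the dual-write code was deployed.
conn.execute("INSERT INTO users_old VALUES (1, 'Ada Lovelace')")
conn.commit()

# Step 3: a flag controls where reads come from; it defaults to the old location.
FLAGS = {"read_from_new": False}


def split_name(name):
    first, _, last = name.partition(" ")
    return first, last


def save_user(user_id, name):
    # Step 1: write to both the old and the new schema.
    with conn:  # one transaction, so both writes commit together
        conn.execute("INSERT OR REPLACE INTO users_old VALUES (?, ?)", (user_id, name))
        conn.execute(
            "INSERT OR REPLACE INTO users_new VALUES (?, ?, ?)",
            (user_id, *split_name(name)),
        )


def backfill():
    # Step 2: copy rows that so far exist only in the old schema.
    rows = conn.execute(
        "SELECT id, name FROM users_old WHERE id NOT IN (SELECT id FROM users_new)"
    ).fetchall()
    with conn:
        for user_id, name in rows:
            conn.execute(
                "INSERT INTO users_new VALUES (?, ?, ?)",
                (user_id, *split_name(name)),
            )


def load_user(user_id):
    # Steps 4-5: flipping the flag (or its default) switches the read path.
    if FLAGS["read_from_new"]:
        row = conn.execute(
            "SELECT first, last FROM users_new WHERE id = ?", (user_id,)
        ).fetchone()
        return None if row is None else f"{row[0]} {row[1]}".strip()
    row = conn.execute("SELECT name FROM users_old WHERE id = ?", (user_id,)).fetchone()
    return None if row is None else row[0]
```

Because writes keep going to both places, setting the flag back to `False` returns reads to the old schema with no data loss, which is what keeps a code rollback safe until the old data is dropped.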

Get off NFS while you still can. Thank me later.

I think NFS often gets an undue bad reputation. I work at a company which uses NFS at scale and it does the job without too much trouble. NFS is used as the storage for vmware datastores, xen primary storage and also for shared storage mounted between servers.

For the latter case, the mounting of these partitions can be automated with config management on the Linux servers. You have to be careful with UIDs and GIDs, but config management helps with this.

The filers supplying the NFS storage can be used to provide replication to other datacenters and snapshots, and also to provide redundancy with multiple heads serving the volumes.

In the past I've used Fibre Channel (found it overly complex) and iSCSI. iSCSI was fairly straightforward to use, but I've never tried to automate it. I guess there isn't a reason you couldn't, however. For complexity I guess it's Fibre > iSCSI > NFS.

Performance-wise we don't have any issues with NFS itself; the bottleneck is sometimes the filer trying to keep up :-)

Anyhow, in complex environments, sometimes it's good to keep things simple where you can. NFS helps with that: it's stable, scalable, and the performance is comparable to iSCSI.

Removing the need for shared storage on the OS where possible is the ultimate aim though.

I wonder how much the experience differs based on the NFS version being used.

Agreed. I've run Oracle with thousands of commits/sec on NFS with no problems. Or no more problems than we'd have had on any storage.

Yep. I work for a big company with a bureaucratic system engineering department that puts all program code on NFS (SANs).

It's been the cause of virtually all of our service outages and many of our performance problems---and it's completely obsolete in the era of Jenkins and Ansible.

It sounds like they are using NFS for distributing their code, which should be almost entirely read-only - the only place that writes new files to the file system is the single Jenkins machine that manages the deploys. It seems likely to me that this would avoid most of the risks inherent in running NFS at scale.

Let's say you distribute a binary over NFS -- compile some C or C++ or whatever you like. Then various hosts run /path/to/binary. At some point, the NFS server changes that file out from underneath those hosts, because, well, it can. The usual "text file busy" you'd get when trying to do that on a local filesystem never happens.

At some point after that, the hosts running the binary will try to page something involving that binary and will SIGBUS and die.

That's just one of many failure modes.

They say they put out a new package; I take that to imply they don't rewrite files once deployed.

But what you're describing has interesting failure modes indeed: from just successfully patching the running process, to a SIGSEGV, to having the target process end up in an infinite loop (BTST).

That's not the only issue with running NFS with a horizontally scaled web cluster, stuff like this starts to hurt: http://www.serverphorums.com/read.php?7,655118

That is something we're working on at the moment. It is read-only from the web servers' perspective, but it's definitely not without its problems.

Looks like a large distributed company...I have some experience with casinos...similar...

I'd be interested to know more about your experiences with an NFS "share"...which he mentions...?


In the old days of relational databases we had an abstraction layer between actual storage and applications called a query language, decoupling them with functions and views, which were helpful to change the schema independently of the code...

We don't use database views, but kinda-sorta mimic them by separating the persistence layer from the API layer. It can be annoying to maintain (at the big G there are lots of SWEs who joke that their job is writing code to copy fields between protobufs), but the alternative is to couple your API directly to what you're storing in the database. That is a road that leads to pain.

The benefit over using views is that your code is written in the same language, instead of having a whole bunch of logic running semi-hidden in the database. If you have a bug in your view, you have to update your DB schema (or at least roll out new PL/SQL DB code or whatever). And if you're working with a planet-scale distributed app, it just plain won't work.

I literally checked the date twice to make sure this wasn't 10+ years old.

I guess your point being that these aren't novel techniques?

The author sure didn't invent them, but they aren't widespread enough yet. In fact e.g. Rails puts you in the opposite mindset (which is OK at early stages).

Betting is a highly regulated environment. The kind of mistake that startups make all the time could result in a heavy fine or the loss of a license. In such a climate, this workflow is exceptionally modern.

For a company like Sky Betting this is actually fairly monumental. Not every company looks and behaves like every trending SV startup.


PS - They actually make pretty good money too: http://www.cvc.com/Our-Portfolio.htmx?ordertype=1&itemid=112...

Neat. Interestingly, for all its faults PHP in my experience has made this a little easier to achieve than other languages we also use: shared-nothing and no internal process state between requests makes cutting over a bit easier than our equivalent node servers. Some great practical advice in this article.

That's a flaw in how you use node, not node itself. We run dozens of instances of node without internal process state, the need for sticky sessions, etc. You're losing a lot of load-balancing ability by not keeping this discipline too.

I can see the infrastructural scenarios where this would be beneficial, but maybe even then, I think this is a convoluted way to not use what I consider critical workflow tools, specifically environmental files, config files, git merge, git rebase, branches. I think you would be better off looking to see if you can organize your files better, restrict merging major branches to employees who are properly trained/competent in merging.

Are frequent releases some sort of advantage?

Whatever you are developing has no value until someone actually uses it. Therefore the quicker you can confidently release something (feature, bugfix, etc.), the more value you deliver.

This idea comes from the concept of inventory waste in lean manufacturing.

Another side benefit is that it generally leads to higher quality releases and more stability, for a few reasons:

1) If you can do quick, easy releases, people release small, narrowly-scoped changes instead of huge batches of changes that may interact with each other in all sorts of unknown ways.

2) If you need to fix something, having a process that lets you release quickly and easily means you can get a fix deployed that much sooner. I hate being in a situation where an emergency patch that is really small still takes hours to get out.

Also, if something goes wrong in production, it's easier to find the cause. Just look at the most recently deployed code change.

From that standpoint, frequent releases can also make things harder, depending on the tightness of your feedback loop. If it takes two days for user reports to percolate to you, it doesn't help if there have been 16 releases to examine since then. For the work I do, error reports are usually "A patient called us to report that their daily email had some wrong info", so for that, multiple-times-per-day releases would not make things any easier to debug.

Presuming your "productivity" stays the same, you would either have 16 releases of size "1" after two days, or 1 release of size "16". Either way, the size of the code to examine would be the same, right? You would still have the benefit of getting the new features faster to the user, so I would say quick releases are still beneficial.

In your situation, I would look for ways to detect the issues earlier. Ideally, you know about it before the user does (which is easier said than done, I know). Added benefit of this: quicker feedback will help you pinpoint the problematic release ;-)

In addition, faster releases can also reduce sunk costs. For example, if you release new features in stages, you can correct course quickly from the beginning based on actual user use/feedback instead of spending months building up something that may not meet actual needs.

This is hugely important. My current employer wants our grand release to have a powerful search feature. We have zero customers to use it and zero data to search through.

Agreed...the primary advantage is being able to backtrack quickly...diagnostics are focused...

Provided all else is equal (bugs, feature requests, customer feedback,) yes. If it gives any business advantage, the sooner you release, the sooner it starts giving returns, and those returns eventually compound over time. So, yes.

End result to a user...potentially yes...though they may never be aware of your efforts...

What language, framework, and tools do you use to manage these phased migrations? It sounds like a lot of extra code to write.

Also 1 site update, or version number, could really be N releases until fruition -- which don't sound like traditional releases to me.

>> It sounds like a lot of extra code to write.

What he's saying is more about convention than writing code. For instance, instead of adding a column "abc" and doing:

foo.abc = 123;

they would do something like:

if (foo.abc) { foo.abc = 123; }

make sure all tests pass, and then migrate the db.
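One way to read the guarded assignment above is "only write the new column once it exists". A rough Python sketch of that convention using the built-in `sqlite3` module follows. The table `foo` and column `abc` come from the comment; the helpers `column_exists` and `save_foo` are hypothetical, and checking the schema on every write is an assumption made for brevity (real code would more likely cache the result or use a flag).

```python
import sqlite3


def column_exists(conn, table, column):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    # The table name is interpolated, so it must come from trusted code only.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)


def save_foo(conn, foo_id, abc_value):
    with conn:  # commit both statements together
        conn.execute("INSERT INTO foo (id) VALUES (?)", (foo_id,))
        # Only touch the new column once the migration has added it.
        if column_exists(conn, "foo", "abc"):
            conn.execute("UPDATE foo SET abc = ? WHERE id = ?", (abc_value, foo_id))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY)")
save_foo(conn, 1, 123)                                  # before the migration: abc is skipped
conn.execute("ALTER TABLE foo ADD COLUMN abc INTEGER")  # the migration itself
save_foo(conn, 2, 123)                                  # after the migration: abc is written
```

The same code runs correctly against both the old and the new schema, so the deploy and the migration can happen in either order.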

If you're asking about tools to run migrations, all popular languages have one (e.g. Django comes with one that's already really good).

Seems like there is a race condition on the docroot switchover, but maybe with their forward-only migrations it's a non-issue.

Interesting 8/10 years ago!

Hm, is it any better working there than for the TV part?

There are many tech teams at Sky. Each one is very different in terms of culture, tech, product etc.

As I understand it, Sky Bet is quite separate from the rest of the company. This is probably a good thing...

Separate company these days.

Yes! Although I am obviously biased.
