The right way to turn off your old APIs (httptoolkit.tech)
135 points by pimterry on Jan 23, 2021 | 31 comments



This is fantastic. Maintaining API compatibility is a fine art.

A fun fact: for my previous job I developed a set of libraries to make Node API development type safe and generally easy to use safely with a lot of help from the TS compiler. When I got the chance to prove it out for production use, I was tasked (well I volunteered) with porting a service that had one of the worst API designs I’ve ever encountered. It had no design, totally off the cuff for every route handler. But it was in production and critical, and API stability was a hard requirement.

Wonder of wonders: these libraries worked with almost no rework, mapped to an internal implementation that was much better to work with, and auto generated documentation that was 99% correct (I fixed the two bugs filed over 10 months).

Before I left we were planning a v2 API, and as far as I could tell literally no business logic/service code would need to change. Only the service boundaries.

And on my way out, doing knowledge transfer sessions, all the things that seemed inscrutable about the libraries doing all that work really clicked with the rest of my team. It was already doing the API translation that was the ugly part. Once they saw how it worked with APIs that were designed for use and development, nothing felt painful for the team I was leaving it with.

Planning ahead for this kind of thing is important. You don’t have to have the perfect API. You just have to have good abstractions in place for the inevitability you don’t have the perfect API.


Following up since this is seeing a little bit of attention: I unfortunately couldn’t open source those libraries but I’m already working on their successor. And I have really good ideas for how to improve on the original, including:

- snapshot testing to identify unexpected API changes

- auto generated integration/E2E fuzzing tests to validate API boundaries so you can trust you don’t need to test them yourself and can just trust the type system and your business logic tests

- a fully transport layer agnostic interface so you’re not married to HTTP (or REST, or whatever)

- the API boundaries are defined with standards-based primitives (JSON Schema) but exposed through simple declarative interfaces like io-ts for end users; this internal structure makes documentation generation a first-principles assumption

- 400/validation errors will be automatically documented along with success responses

- a “choose your own idiom” interface where you can go full on Option type/exhaustive pattern matching for responses, error/result tuple responses, or more TS/Node idiomatic try/catch semantics. But each interface will be type safe from request to response and you won’t have to sacrifice any of the other guarantees listed above. The only tradeoff here is since TS doesn’t have type checked exceptions the other interfaces will probably have richer documentation
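To make the “choose your own idiom” point concrete, here is a minimal TypeScript sketch of what an error/result tuple interface might look like. This is purely illustrative (the library described above was never open sourced); `parsePort` and the `Result` shape are hypothetical names, not part of any real API.

```typescript
// Hypothetical error/result tuple idiom: a handler returns either
// [null, value] on success or [error, null] on failure, so callers
// are forced to check the error branch before touching the value.
type Result<T, E> = [null, T] | [E, null];

function parsePort(input: string): Result<number, string> {
  const port = Number(input);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    return [`invalid port: ${input}`, null];
  }
  return [null, port];
}

// Destructuring keeps the happy path short while the type system
// still knows `port` may be null until `err` is checked.
const [err, port] = parsePort("8080");
```

The same underlying handler could equally be wrapped in an Option type or plain try/catch without changing the request/response types underneath, which is the tradeoff the comment describes.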


I wish I could reply to the dead reply. But since I can’t...

> And now the business is left with another abstraction no one maintains on top of a legacy system no one understands. Good job. Well deserved promotion.

The business made a business decision to evaluate the tech in order to save a critical service that had serious maintenance problems, and made another business decision to use the stack. I barely even advocated for it, I mentioned it one time, we decided to do a spike into feasibility, everyone was pleased with the outcome.

The libraries themselves were extensively documented from the start. I made a presentation introducing them before the service port was launched.

A greenfield service was under active development for several months before I resigned, and a grand total of two bugs were filed related to that (by me). The team using it had maybe five questions for me over that period. Other services were planned to be ported when feasible.

I did two weeks of knowledge transfer during my transition out of the business. I took great care to spend extra time on the most complex parts and had special sessions for what I was unashamed to call the “bad parts”.

Overall the reception was very good, and by the end of that there were very few questions. I’m reasonably sure I left my old team and adjacent teams on good footing to stay productive. And I offered to make myself available for free after the fact if anything needed further explanation (which hasn’t been needed).

Honestly I know “NIH syndrome” is a smell, but goodness I wish people had a little bit more inclination to take a charitable interpretation and ask questions before attacking.


any way to "follow" you? Either via a git{hub,lab} repo or twitter or something? Really curious to hear when this is released.

Also, I've implemented my own typesafe route abstraction for express before, so this is def a project I would enjoy contributing to.


Same handle on GitHub! And Twitter. Currently building my personal tech site/blog (same name on GH Pages, probably everywhere since my last LiveJournal), then getting more focus on this project which I’ll also be blogging about on my site. I hope to have the site up and running next week.


This is all great advice.

I feel like there is one important detail missing. There is no way to turn off an API like this while guaranteeing you won't introduce breakage, even for a well-behaved client that has enough foresight to anticipate API deprecation.

You only get the specified headers AFTER calling the API endpoint.

So for example if a device sits in a drawer for a year, then calls the API your response is undefined if the API has been turned off. No chance to get the date beforehand.

The way I handled situations like this is to explicitly make 'turned off' part of every single API response.

Something like APIResponse = SpecificResponseForEndpoint | TurnedOff

That way API interactions are always well defined, clients can implement a global handler that does the appropriate thing, locking the client in a 'please update' state if it is an in house developed app for example.

For HTTP I usually reserve the 410 status code for that since it usually does not collide with the more common 404.
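A minimal TypeScript sketch of the union-type idea described above; all names (`TurnedOff`, `sunsetDate`, `migrationUrl`) are illustrative, not from any specific API.

```typescript
// Every response is either the endpoint's payload or an explicit
// "turned off" marker, so retirement is part of the contract.
type TurnedOff = {
  status: "turned_off";
  sunsetDate: string;   // when the API was retired
  migrationUrl: string; // where clients can read migration docs
};
type Ok<T> = { status: "ok"; data: T };
type APIResponse<T> = Ok<T> | TurnedOff;

// A single global handler can lock the client into a
// "please update" state when any endpoint reports retirement.
function handleResponse<T>(res: APIResponse<T>, onData: (data: T) => void): void {
  if (res.status === "turned_off") {
    console.warn(`API retired since ${res.sunsetDate}; see ${res.migrationUrl}`);
    return;
  }
  onData(res.data);
}
```

The discriminated union forces every call site (or one shared wrapper) to acknowledge the turned-off case at compile time, which is exactly the "always well defined" property the comment wants.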


After lots of notifications, warnings and headers:

> first they disabled it for one hour, then reenabled it, then they disabled it permanently two weeks later.

> There's other tricks too: Android added increasing delays to deprecated native APIs in 2015, eventually going up to a full 16 second wait, before finally turning off the API entirely.

I feel that this is the crux. You never know who actually uses the API until it is really unavailable.


> You never know who actually uses the API until it is really unavailable.

You should have a pretty good idea who's using an HTTP API from access logs. In an ideal world, User-Agent headers help you trace usage, although that's hard to enforce on a public API.

Depending on how much introspection is going on, Google should have a pretty good idea who's using deprecated APIs among those who submitted their apks to Google Play.

Yeah, there are probably going to be things that slip through the cracks, but you should be able to track down a lot of them first. And if you can't shim your old API onto your new API, maybe people aren't transitioning because your new API sucks :P


Another trick is failure injection, e.g. go for a 404/410 for a small % of all requests or clients to figure out potential impact and give an early warning, then slowly ramp up the %.
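A sketch of how that ramp-up might work, in TypeScript. The hashing scheme is an assumption on my part: hashing the client id (rather than rolling a die per request) means each client sees consistent behaviour at a given ramp level, which makes the early warning much easier for them to reproduce and debug.

```typescript
// Deterministically fail a configurable percentage of clients on a
// deprecated endpoint, ramping failurePercent up over time (0..100).
function shouldInjectFailure(clientId: string, failurePercent: number): boolean {
  // Simple string hash so a given client id always maps to the same
  // bucket in [0, 100); clients "fail over" in a stable order as the
  // percentage ramps up.
  let hash = 0;
  for (const ch of clientId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % 100 < failurePercent;
}

// At 0% nothing fails; at 100% every request would get e.g. a 410.
```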


One thing that isn't called out is cost to your clients.

Last week I was fixing a bug caused by a vendor who had updated an API in an incompatible way (probably via a shim of some sort); this was putting our shared ultimate end customer at risk of regulatory fines. Thankfully, as it turns out, the last 1.xx version of the API was mostly adding functionality we don't use and we can swap to using a new field without lots of pain.

However this does mean we need to spend a couple of weeks of developer and tester time retesting the interface code. Cost: I would guess $10,000 by the time we add in all the PM/PO/DevOps time; multiply that by a few customers and it adds up fairly quickly.

Moving to version 2.xx doesn't look too bad, but the monolithic protobuf file has been split into parts and the upgrade is probably going to take 6-8 weeks, as they've also changed some of the enum values and properly deprecated stuff. My gut feel from having spent a couple of hours looking at V2.xx is that it's mostly compatible, but this vendor has form for hiding a surprise change (they changed from representing money as an int with an implied decimal place to a long+int).

Again probably not too much work overall but it will cost $20,000 to swap versions and the opportunity cost of not doing something else instead.

We have the same problem with the API for external applications we provide. We know folks are running very elderly versions (over 3 years old) and our next version is a huge change for client applications. We've taken advantage of this change to rename a load of stuff and explicitly add types in various places. I doubt we'll be able to sunset the old API for many years.


If you're designing a new one, I heartily recommend trying to go with https://aip.dev guidelines; they are built on a lot of practical techniques focused on making APIs future-proof (and also internally consistent).


A good additional measure is a scheduled brownout. Turn off the API for a couple of hours or a day to make the consumers notice, then turn it back on for some weeks to give them time to migrate.

Google did this with their old Helm chart repository.
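A small TypeScript sketch of the brownout idea, assuming the windows are announced in advance; the `BrownoutWindow` shape and dates are purely illustrative.

```typescript
// During an announced brownout window the server would return 503 on
// the deprecated endpoint; outside the windows it behaves normally.
interface BrownoutWindow {
  start: Date;
  end: Date;
}

function inBrownout(now: Date, windows: BrownoutWindow[]): boolean {
  return windows.some((w) => now >= w.start && now < w.end);
}

// Example: a two-hour brownout, announced ahead of time.
const windows: BrownoutWindow[] = [
  { start: new Date("2021-02-01T10:00:00Z"), end: new Date("2021-02-01T12:00:00Z") },
];
```

Checking a list of windows (rather than a single cutover flag) is what lets you run the "off for an hour, back on for weeks" pattern the parent describes.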


Then you might find some cases where an API endpoint is called only once a month or once a year but is critical, and that will bite you back really hard.


One solution I’ve seen posted here (can’t remember the link) is to put a sleep in some call and step it up every day/week/month until retirement.

That way when the application slows down, people complain, a story is created to figure out why, and the answer will be the library is deprecated and needs to be migrated.

The calls can always be made, they just get more expensive.
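One way this ramp-up might be computed, sketched in TypeScript. The doubling schedule and the 16-second cap mirror the Android example quoted earlier in the thread; the function name and weekly cadence are my own assumptions.

```typescript
// Delay grows with time since the deprecation announcement: nothing in
// the first week, then 1s, 2s, 4s, ... doubling weekly up to a cap
// (Android's native-API deprecation eventually reached a 16s wait).
function deprecationDelayMs(now: Date, announced: Date, capMs = 16_000): number {
  const weekMs = 7 * 24 * 3600 * 1000;
  const weeks = Math.floor((now.getTime() - announced.getTime()) / weekMs);
  if (weeks <= 0) return 0;
  return Math.min(1000 * 2 ** (weeks - 1), capMs);
}
```

The server would sleep for `deprecationDelayMs(...)` before answering deprecated calls, so the slowdown shows up in clients' latency dashboards long before anything actually breaks.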


Naturally, you would combine any planned API delay or outage with conventional deprecation steps like updating documentation well in advance, posting to your blog, twitter and mailing list, e-mailing every identifiable user of the deprecated API, and having your account managers reach out to paying customers who use the API.


The best way to turn off old APIs is to not need to.

If your API is simple enough that it can be reimplemented as a shim to the new API, and that shim layer needs no state or special permissions, then it becomes effectively maintenance free. Your shim cannot have security issues (since it has no special permissions). It likely won't have scaling issues (no state makes scaling easy). You can leave it sitting there forever if necessary, and the total cost will be well below the risks of turning off an external API.
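A toy TypeScript sketch of such a shim; the old/new field names and the field rename are invented for illustration. The shim holds no state of its own: it just reshapes the request, delegates, and reshapes the response.

```typescript
// Old wire format, kept alive for legacy clients.
interface OldUser {
  user_name: string;
  user_email: string;
}

// New wire format, served by the replacement API.
interface NewUser {
  name: string;
  email: string;
}

// The old endpoint survives as a stateless translation layer: it calls
// the new API and maps the new shape back to the old one.
async function oldGetUser(
  id: string,
  fetchNewUser: (id: string) => Promise<NewUser>,
): Promise<OldUser> {
  const user = await fetchNewUser(id); // delegate to the new API
  return { user_name: user.name, user_email: user.email };
}
```

Because the shim has no storage and no credentials of its own, it inherits the new API's scaling and security properties, which is the parent comment's point.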


Actually pointed out in the document:

> The next question, if you're not napping, is to ask yourself whether there's an alternative to shutting down this API. Everything you turn off will break somebody's code and take their time to fix it. It's good for the health of your client ecosystem and the web as a whole if APIs keep working.

There are valid reasons to shut down an API (outside of "the company will shut down"). Some API shutdowns are security-driven; for example, the API relies on an outdated authentication scheme that can easily be cloned or brute-forced. Assuming you are in 2012, when credential sharing was acceptable, it's impossible to develop a backwards-compatible API simply because the method of authentication is now vastly different (tokens versus classic username-password credentials). Some will argue "oh, just generate app passwords!", but unless your audience is savvy enough, leaving the API on will shift the burden from a one-time change to a continual cycle of explaining to users how to use an app password (which will hit you very badly).

At other times, you need to shut down an API due to an external change. For example, separating APIs by region (note that the people saying "oh, just geodetect it" haven't been sent letters by a government agency in charge of data and privacy).


I don't understand, what use is a REST API with no state? Who is making remote HTTP calls for calculating e.g. math equations or unit conversions? A stateless REST API is completely useless; just do everything locally.

For everything else, there will be state to manage, and it's not easy to shove a new version on top of an old version (hence why every API provider agrees to just offer two or more separate versions of the API server rather than having one unified API).


Address lookups, weather lookup, company lookup, telephone lookup, autocomplete lookups, ...

Why wouldn't there be a stateless API?


Depends on your definition of "stateless".

A database is commonly considered state, so database queries are never stateless in some definitions. Answers to stateless queries can be cached indefinitely. Your examples can't.

Your definition probably just says that one query shouldn't influence the next one in a session/context/user. The other definition says that any query with the same parameters should yield the same result.


By this logic, there is no such thing as a stateless API because even a trivial "add those numbers" API would rely on the "state" of the server binary or code - which could also change at any time.

I think a more useful definition is your second paragraph - i.e. does the server have to keep state specifically for the client that did the last request? This is the case for sessions and for many non-HTTP protocols that need to track state per connection (e.g. TCP tracks connection flags, statistics and sequence numbers, TLS tracks your keys, FTP tracks your current directory, SMTP tracks which data fields were sent, etc etc).

It does not apply for simple database searches or even CRUD endpoints because while there is global state (there always is), the server does not care which client in particular made the requests.

APIs that have no per-client state can generally be implemented by a shim that internally calls a newer API pretty easily. It's not that easy with APIs that do have per-client state.


I think they meant that the shim can be stateless, as it just defers requests to the new API.


Exactly.


In addition to the DB lookup point, even if the task is only doing calculations that a user could in theory also do locally, there are valid reasons why you'd want to put it behind a network boundary: The calculations may be so complex they require specialized or powerful hardware (e.g. graphics or neural network computations), they may rely on data that you don't want to show to your users - or you may want to keep the details of the calculation itself secret for IP reasons.


Think about stateless this way: imagine you scaled your API to two instances. Stateless here means you don't care which server you hit.


A trained model is just a bunch of stateless math, but models are often deployed behind an API.


One surprisingly underutilised technique is specification-driven API design and development. You start with the API spec -- using e.g. OpenAPI, or basically any other specification standard -- and have all implementations, both server and client side, built based on that.


The original article should have been titled "Software Transition Engineering" instead.


This one is about HTTP APIs. Shouldn't the title reflect this?


Some of these suggestions (like the sleep(16)) also apply to non-HTTP APIs.

In fact, I find the headers part of the suggestion the least useful. No client will check these headers.


Well, the context matters a lot:

  - Is it a distributed system?
  - Are the clients yours?
  - Is it one server, or are there thousands of components with the same API?
These things influence what's possible and what's a good/bad strategy.



