Advice for Operating a Public-Facing API (jcs.org)
178 points by CharlesW on July 23, 2023 | 63 comments



That last suggestion (of returning 200 but with a message explaining that the usage limit has been reached) seems odd to me. I'm not saying I haven't seen code that simply retries on any error, but that could just as well include an error trying to parse the response and extract meaningful values from it. What I have observed is that client code often logs or relays HTTP status codes (and 429 exists explicitly for this case) but rarely bothers doing the same for unexpected message content, which at best might be logged or conveyed to an end user as some sort of "failed to parse" exception. Surely a better option to avoid getting hammered is to return 429 after a 15-20 second delay?
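Roughly what I mean, as a sketch: Flask and the naive in-memory counter here are stand-ins for whatever you actually run, and stalling a worker for 15 seconds is its own trade-off.

    import time

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    WINDOW = 60   # seconds per rate-limit window
    LIMIT = 60    # requests allowed per window
    hits = {}     # client key -> (window start, request count)

    @app.route("/api/send", methods=["POST"])
    def send():
        key = request.headers.get("Authorization", request.remote_addr)
        start, count = hits.get(key, (time.time(), 0))
        if time.time() - start > WINDOW:
            start, count = time.time(), 0
        hits[key] = (start, count + 1)
        if count + 1 > LIMIT:
            time.sleep(15)  # slow the retry loop down before answering
            return jsonify(error="rate limit exceeded"), 429, {"Retry-After": str(WINDOW)}
        return jsonify(status=1), 200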


I fully agree. An error should never return a 2xx code. All clients should be safe to assume that 2xx is cool and they can just move forward and use the output. If suddenly 2% of your 2xx responses are errors, it becomes fun time sorting out where data got corrupted a few months after you started using an API. Handling errors as a client also becomes nasty.


I think the premise here is that a 4xx is an “expected” server issue, which the code may be able to handle (i.e. retry) whereas a 200 followed by a parse error is an “application problem” that they’re not prepared to handle.

A 200 may mean the payload has changed, and how would an app handle an arbitrary format change? This could well fall into the “don’t test for something you can’t handle”.

At a minimum it’ll ideally get the app to quiet down.


> I think the premise here is that a 4xx is an “expected” server issue

That premise is incorrect: 4xx means client errors, as in the client formatted the request wrong (or similar) and therefore it couldn't be fulfilled. Rate-limiting is the client hitting the server too much, and it's up to the client to handle this.

5xx is for server issues.

2xx should be success in any shape or form, so clearly 2xx shouldn't be used in this case either.

I agree that 429 should just be used for its intended purpose here, handling rate-limiting requests/responses. You can still add a body if you want, with the current quotas.


From my reading, this "hand out a 2xx and an error message" advice is for badly behaved clients who are retrying when they get 4xx.

It's not what "should" be used, it's what the author found to be effective.


> From my reading, this "hand out a 2xx and an error message" advice is for badly behaved clients who are retrying when they get 4xx.

But trying to handle clients who mishandle things like that is a fool's errand. What client, in their right mind, would retry a request that is failing because of what the client is sending? In no case does that make sense, ever.

Similarly, should everything just be 200 then just in case clients mishandle redirect requests?


A lot of developers are idiots. There is tons of code out there doing exactly this kind of thing.

People will copy random snippets from SO and smack them with a hammer until they seem to work then move on to the next thing. I've seen some incredibly stupid code out there, code I can only assume the author either didn't understand or truly didn't give a fuck about. Probably both.


> A lot of developers are idiots. There is tons of code out there doing exactly this kind of thing.

Sure, I agree a lot with this, but that doesn't mean you and I should also do idiotic things. Let's just return correct status codes, and the ones who misuse them will misuse them :)


I like the detail from the top comment of replying with a 429 after a many-second delay. That would mitigate any retry storm.


Which is slightly strange because it contradicts the previous suggestion: don't be too liberal in what you accept.


To be fair, in some cases rate-limiting could be triggered by a server-side issue; technically there's a 529 error for that, but I can't recall seeing it used. But 4xx errors should exist as an indication to clients that there's no point immediately retrying the same request. If they're not honouring that, delaying before responding seems the most reasonable course of action.


I'm even guilty of writing client code that just retries on failing to parse the response, precisely because the server I was working with would randomly return empty/corrupt responses for no discernible reason (it was actually Amex's API sandbox). Come to think of it, perhaps it was trying to rate limit us, as it happened a lot when running our integration test suites! But the fact it returned 200 and even text/html as the content type was a bit ridiculous.


> Pushover's API has a message size limitation of 1,024 characters. If the message parameter is larger than that, I could reject the request because it's not correct, but then the user's message is lost and they may not have any error handling. In this case, I truncate the message to 1,024 characters and process it anyway

This guy should not be designing APIs.


Given what Pushover does, I would say it is reasonable. It is a text notification service, not a bank instructions processor. "I'm not receiving any notifications" is likely to be a much more common customer complaint than "my message got truncated". Given that many of the users would not be coding the API integration themselves but using a tool like Zapier, it is reasonable to expect that they don't do text processing on the client side, and it would be the server's responsibility to deal with overflow text.


Silently doing the wrong thing / silently discarding user data is bad engineering because it hides an error condition. You want failures like this to be loud not silent, especially if they are infrequent.

If a client desires this behaviour, then add a `truncate=true` flag that has to be explicitly passed.
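Sketched in Python/Flask with hypothetical route and parameter names; the point is just that truncation only happens when the caller asks for it, otherwise the request fails loudly with a 413:

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    MAX_MESSAGE = 1024

    @app.route("/api/send", methods=["POST"])
    def send_message():
        message = request.form.get("message", "")
        truncate = request.form.get("truncate", "0") == "1"
        if len(message) > MAX_MESSAGE:
            if not truncate:
                # loud failure: the caller hears about it immediately
                return jsonify(error="message exceeds 1024 characters"), 413
            message = message[:MAX_MESSAGE]  # the caller explicitly opted in
        # ... queue the (possibly truncated) message for delivery ...
        return jsonify(status=1), 200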


What would you do in the case where a developer never anticipates that a message goes above the limit, and hence does not catch and process that category of errors? Later the software is repurposed, and some low proportion, e.g. 1%, goes above it. But this error is silent.


> What would you do in the case where a developer never anticipates that a message goes above the limit, and hence does not catch and process that category of errors? Later the software is repurposed, and some low proportion, e.g. 1%, goes above it.

This is exactly the type of scenario I was thinking about.

> But this error is silent.

You’re conflating two different systems: the API and the client that accesses it. These are different systems worked on by different people in different organisations.

This error is not silent at all. It’s correctly flagging the error as soon as it happens. The client is told in no uncertain terms that the call failed. This is good engineering. Hiding a problem only makes it harder to detect, which prolongs time to fix. Errors should be flagged immediately and loudly. For an HTTP-based API that means responding with a 4xx class error and a response body containing an informative message. So – not silent at all.

Separately, in a different organisation, some client developer has screwed up by ignoring the response from an API call. In the general case, you can’t fix this externally. Bad developers who assume all their calls succeed and don’t check error conditions are going to make that mistake all over their code and their shortcomings are going to manifest as numerous bugs. Furthermore, whoever reviewed their code missed something obvious – these kinds of bugs are easy to spot in code review. So generally speaking, if you’re asking what I would do then the answer is nothing. I’m not responsible for fixing somebody else’s dysfunctional team in somebody else’s organisation and I’d just be scratching the surface by trying to work around just one of their bugs at the expense of my own service’s quality.

In this particular case, however, you can surface 4xx class errors in a number of ways if you really wanted to, e.g. email the client a daily report of all client errors. But you should be designing your API so that correct usage provides a robust solution and that means rejecting invalid data immediately.


If possible, redesigns preceded by targeted warning emails sent to business customers based on message length stats, with a timetable for when the fix will roll out.

If that data doesn’t exist, a public announcement that the API will break long messages sent without the new parameter.

If the API can’t be broken, version the API and break it in the new one and for new customers.

In all cases, disclosure in the documentation: if you know there’s a limit and you still exceed it, at that point, as the kids say, FAFO.


In semver, that's a major version upgrade since it changes the API without being backwards compatible. API consumers have responsibilities too.


Imagine having the financial promotion disclosures truncated off, unbeknownst to you, and getting your operating license taken away by the FCA.


Imagine that you think the financial promotion disclosures are sent, but they are not.


Sorry, that solution doesn’t work. The API in question is designed to send push notifications. Push notifications are delivered by each platform on a best-effort basis. You simply cannot rely upon any particular push notification being delivered.

If somebody is sending a disclosure in a separate notification to the message that needs the disclosure, they are already running afoul of the law because they have no way of ensuring the disclosure is delivered.

The only way for an API like this to operate correctly is to reject invalid messages and not truncate. Silent data mangling is not acceptable.


You would know, because in the alternative setup it wouldn't fail silently.


How would you design it? Accept messages of any length, or reject messages over the maximum length?


HTTP 413 Payload Too Large is the appropriate response, not silently truncating content because you wanted to be ‘optimistic’ in your replies to clients.


> Pushover's API might be unusual in that it is used by a wide range of devices (embedded IoT things ...

If you did that, and no error handling was implemented in the IoT device, it would merely result in a non-event that no one knows about.

In this specific instance, the author is doing the right thing. The developer will see truncated messages running through.

In the end it is a question of desired semantics. The semantics of the operation in question is obviously to truncate and process rather than fail.


The message in question is being delivered to an end-user device, and is not expected to serve any purpose other than being displayed on a screen. In this context, it seems totally reasonable to truncate overly-long messages.


Certain types of messages have legal and regulatory requirements about their contents. Assuming no such messages traverse your API and truncating them silently is a recipe for disaster.


Yes, that's true.

What makes you think that the API described in this post would be subject to those kinds of legal and/or regulatory requirements?

Could it be the case that the author knows prima facie that no such requirements exist for their service?


The API described is not subject to those kinds of legal/regulatory requirements. The users of the API may be.


Users don't dictate terms, though. How my API works is up to me.

If I target use cases that prefer truncation of messages to errors, then surely it's acceptable for me to build it that way, right? I'm not saying it's a good idea in general. I'm saying it can be acceptable in the appropriate context.


Postel's Law?

(I actually agree with you, I think it's a bad rule of thumb. I prefer Fail Fast (and loudly)).


Postel’s Law is bad and has been the cause of countless bugs, including many security vulnerabilities. Some discussion here:

https://news.ycombinator.com/item?id=9824638


I'd rather my request fail with an error than have it be silently truncated.


While a developer might like that, an end user who expects a notification and now gets nothing will be asking, “Why do I not get notifications?”.


Agreed, any kind of error message is better than nothing; otherwise it makes my code calling the API really hard to read and debug.


It's amazing how far adrift the industry has gone with authentication.

This post is saying to avoid OAuth and use bearer tokens because OAuth is too complicated. I agree with OAuth being too complicated, but I don't really think bearer tokens are the solution either.

Now there are JWTs, and passkeys, and all these other solutions, which, from where I'm standing, just look like someone else's resume-builder that I'm going to have to understand in a few years in order to do something simple and unrelated.

We are already connecting over TLS; just have the client authenticate with a long-lived asymmetric key. Let me see that key in whatever web framework; I should have access to it, same as an HTTP header. Then I'll stick it in the database, maybe hashing it first if it's big. It doesn't have to be harder than that: your identity is (the hash of) your public key.
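Roughly, assuming the framework can hand you the client's public key from the TLS handshake (how you get at it varies by server; the fingerprint below is a placeholder):

    import hashlib

    # Assume the client's public key arrives here as DER bytes,
    # pulled out of the TLS layer by whatever server/framework you run.
    def client_id(public_key_der: bytes) -> str:
        return hashlib.sha256(public_key_der).hexdigest()

    # "Your identity is the hash of your public key": store the digest and
    # look it up like any other credential.
    KNOWN_CLIENTS = {"9f86d081884c7d65...": "acme-integration"}  # placeholder

    def account_for(public_key_der: bytes):
        return KNOWN_CLIENTS.get(client_id(public_key_der))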


The main problem with OAuth is that it's a three-party system.

You have an API provider service, an API client service, and a human telling the provider that the client should be allowed to access a limited subset of data on the provider service.

That's fine in cases where you actually need three parties involved, e.g. an API that allows access to a user's photo library stored in the cloud... but that's not every situation and if you can avoid it, you should.

A simple two party system, where the API provider authorises the API client, is vastly simpler. And just as secure (more secure even, because complexity is the enemy of security).


Users want to be able to use their existing identity providers. At a minimum it's a red flag for consumers if you don't give them the option to sign in with Google, etc. - and it's simply a deal breaker for many enterprise customers if you can't integrate with their IDP.

I wish implementing auth wasn't so complicated now, but the idea that we can just decide to make it so by doing it another way is a fantasy. That ship has sailed.


My hunch is that this would ring true among some user segments, but I'd love to see the research on this (both in day-to-day and corporate contexts).

From personal experience and that of family members, I'd say we maintain many accounts left over from bespoke sign-in processes and one-off services. That hasn't deterred us from signing up for these services, in spite of cumbersome onboarding (it's just one of those unfortunate 'hidden costs' of the web).

I say this because we recently went through the trouble of deactivating many of those accounts that were no longer in use. Although the ship may have sailed for alternatives, it will be some time before legacy sign-ons are phased out, if they ever are.


I'm in the process of finding the things that I used my Google login for and moving them back to "legacy" logins with my email. Part of the process of de-Googling my life.


How's your experience been so far? Have you substituted some services completely?


Gmail is the bugger to get rid of. I need a backup email for my FastMail account, but I don't want to pay for two accounts when I'm only ever going to use the second one as the backup (and the obvious hack of creating a second alias for the FastMail account is equally obviously bound to fail).

Drive is still useful sometimes, mostly because my wife is addicted to it, but for a lot of the stuff I was using this for, I now use my Remarkable (and their SaaS).

Photos is one of those where I debate the usefulness of any of it. 99% of my photos are shite and not worth saving. The 1% that are, I need to curate a lot more than I do, and the curation is the problem not the storage. Ideally I'd like 2 buttons on the camera, one for "take this photo and delete it in 24 hours" and the other "take this photo and add it to my long-term storage with $this tag"

Getting third parties off Google login and on to email/password (using a paid password manager) was pretty simple, though I still find occasional sites that are just "welcome back <gmail address>!" and I have to work out how to get that changed. Haven't had any huge problems with that so far, apart from the usual "change email address, try to login again, get wrong password, reset password, 745395739 captcha attempts, generate password in password manager, save new password, login successful finally" dance.


JWTs (and related token-type approaches) are a solution to the question: how do I avoid spending time doing authn on every request?
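For example, a sketch using PyJWT; the shared secret and claim handling are placeholders:

    import jwt  # PyJWT

    SECRET = "signing-secret-from-config"  # placeholder

    def authenticate(token: str):
        # The signature is checked locally, so there's no user-store lookup
        # on every request; the trade-off is that revocation gets harder.
        try:
            return jwt.decode(token, SECRET, algorithms=["HS256"])
        except jwt.InvalidTokenError:
            return None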


If you use mTLS, client identity is proven cryptographically (normally via the x509 /CN field); no need to exchange the key during authentication. You still need to generate it, distribute it to the user, and manage its lifecycle, which gets you to the same place as OAuth except with worse library support.
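A sketch of the server side, assuming a TLS terminator has already verified the client cert and passes it through PEM-encoded (using the cryptography package; the plumbing and header names will vary):

    from cryptography import x509
    from cryptography.x509.oid import NameOID

    # Assumes the forwarded cert actually carries a CN attribute.
    def common_name(pem: bytes) -> str:
        cert = x509.load_pem_x509_certificate(pem)
        return cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)[0].value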


> no need to exchange the key during authentication

The public key does need to be exchanged, along with a signature relating it to the current session. This is all handled by TLS, there is no need for the client to send the key in the application data.

> You still need to generate it and distribute it to the user

This approach avoids distributing secret key material at all. Private keys should ideally never move. They are generated randomly, used to derive the corresponding public key, and then persisted as appropriate. The public key is sent around to other parties.


How do you ensure someone else didn't just create a new cert with the same user ID? At a minimum there needs to be a step to sign the public key (with another flow to prove the CSR requester's identity). Do you see how this is a lot more moving pieces than OAuth for the user to figure out?

If you’re suggesting just storing a cert thumbprint, that means a DB call on every request, no different than just a secret token.


I like mTLS; I've worked in scenarios where mTLS and OAuth are used both separately and together. But if the comment here is suggesting certificates will be less complicated than OAuth, I'd say I spent an equal amount of time banging my head against the wall learning and wrangling both. Maybe that's just me; I'd appreciate anyone else with experience in both adding their take.


Adding more structure to prefixed tokens could help avoid false positives when someone signs up for GitHub secret scanning. I notice that Pushover is apparently not signed up: https://docs.github.com/en/code-security/secret-scanning/sec...


> Be descriptive in your error responses

> Assume a human will read them

Yes, but... on the flip-side, make sure there's also some sort of code that the machines can read. Depending on the framework you started with, one or the other may be the more-obvious one to default to. But if you don't have both, you either have humans looking at cryptic codes (the point the article was making) or you have machines parsing through human-readable error messages with the risk of breaking backward-compatibility if you try to fix a typo in an error message or later decide to reword it or include more pointers.
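For example, something along these lines (field names are just one possible convention, not from the article):

    # A stable machine-readable code alongside the human-readable message,
    # so rewording the message can't break clients that switch on the error.
    error_body = {
        "error": {
            "code": "rate_limited",                         # for machines
            "message": "Too many requests; retry in 60s.",  # free to reword for humans
        }
    }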


> Serve your API at api.example.com, never at example.com/api. As your API's usage grows, it will expand beyond your website/dashboard server and need to move to a separate server or many separate servers. You'll want to be able to move things around by just pointing its IP somewhere else rather than trying to proxy things from your dashboard server.

This makes a rather specific assumption that there is no L7 load balancer or the like in front.


What about CORS and CSP headers? Should public/open APIs use them and prevent building client-side-only apps?


You don’t use CORS to prevent client-side only apps, you use CORS to allow them. CORS reduces security restrictions, it doesn’t add them.
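For a public API that authenticates with tokens rather than cookies, allowing any origin is usually the whole point; a sketch in Flask (the framework choice is incidental):

    from flask import Flask

    app = Flask(__name__)

    @app.after_request
    def allow_browser_clients(response):
        # A wildcard origin lets browser-based clients call the API;
        # it removes a browser restriction rather than adding a server-side one.
        response.headers["Access-Control-Allow-Origin"] = "*"
        return response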


Why would you want to block that?


I have stumbled upon several "open" APIs that had such "security measures" implemented, making them rather unusable. It would be useful advice on what not to do for anyone building public-facing APIs.


The token prefix is a nice hack, but users will find it out and then it becomes part of the informal API.


And that's fine. In fact, I would document the prefixes in the docs and make it official.


I don't think the AKIA/ASIA prefix for AWS access keys is formalized, which may be a good example of this: https://github.com/localstack/localstack/blob/6f846a8278e439...



It's the kind of thing I would rather see a base64-encoded bit of JSON for. Easy to understand for support, and easy for clients when debugging.
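Something like this, as a rough sketch (field names made up):

    import base64, json, secrets

    # A small self-describing JSON blob up front, then the random secret part.
    meta = json.dumps({"svc": "example-api", "kind": "app-token", "v": 1})
    token = (
        base64.urlsafe_b64encode(meta.encode()).decode().rstrip("=")
        + "."
        + secrets.token_urlsafe(32)
    )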


Curious to straw/iron man this from the perspective of Reddit and their API "conversations" lately.


Did this page just turn into flying toasters when I wasn't looking?




