
It is actionable, sort of, but it takes a lot of careful thinking about what you can safely do with data you do not trust.

E.g. you need to assume that every function you pass that data to will be subjected to malicious or accidentally broken input. I originally wrote "unless/until you have sanitised the content", but really, when building applications taking user data, just assume that you're dealing with malicious or accidentally broken input everywhere unless you have proven otherwise.

This goes from the trivial: range-check numbers; check the lengths of strings; and verify any other constraints (encoding, character-set limitations) that you may later depend on.
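To make the trivial end concrete, here's a minimal sketch in Python (the field names and bounds are hypothetical, purely for illustration):

```python
def validate_quantity(raw: str) -> int:
    """Range-check a user-supplied number (bounds are illustrative)."""
    value = int(raw)  # raises ValueError on non-numeric input
    if not 1 <= value <= 1000:
        raise ValueError(f"quantity out of range: {value}")
    return value


def validate_username(raw: str) -> str:
    """Constrain length and character set before anything else sees it."""
    if not 1 <= len(raw) <= 32:
        raise ValueError("username length out of range")
    # Allow only ASCII alphanumerics plus underscore -- a deliberately
    # narrow whitelist that later code can then safely depend on.
    if not raw.isascii() or not raw.replace("_", "").isalnum():
        raise ValueError("username contains disallowed characters")
    return raw
```

The point of returning the validated value (rather than just a boolean) is that downstream code can only ever see data that has passed through the checks.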

To the very complex: going to re-display end-user HTML? It may sound simple, but you basically need an HTML parser with explicit white-listing of tags and attributes, and if you allow CSS you need to parse and white-list CSS properties too. The biggest risks: executing malicious JS in the context of another logged-in user, including in your admin interface; and unintended side effects if the HTML can trigger HTTP requests. At a minimum - even assuming nobody is still careless enough to allow side effects on GET requests, and that GETs are all the attacker can trigger - those requests have privacy impacts, including the chance of leaking details about your admin systems or any third-party systems you pass the HTML on to.
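As a sketch of what "parser with explicit white-listing" means, here's a deliberately tiny version built on Python's stdlib HTMLParser. The whitelist is illustrative and far too small for real use; for production you'd reach for a maintained sanitiser library rather than roll your own:

```python
from html import escape
from html.parser import HTMLParser

# Illustrative whitelist -- a real one needs far more care
# (URL scheme checks on href, CSS handling, nesting rules, etc.)
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a"}
ALLOWED_ATTRS = {"a": {"href"}}


class WhitelistSanitizer(HTMLParser):
    """Rebuild HTML keeping only whitelisted tags/attributes.

    Anything not on the whitelist is dropped; all text is re-escaped,
    so unknown markup degrades to inert text rather than live HTML.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag entirely (its text content still comes through)
        kept = [(k, v) for k, v in attrs
                if k in ALLOWED_ATTRS.get(tag, set())
                and not (v or "").lower().startswith("javascript:")]
        attr_str = "".join(f' {k}="{escape(v or "")}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))


def sanitize(html_text: str) -> str:
    s = WhitelistSanitizer()
    s.feed(html_text)
    s.close()
    return "".join(s.out)
```

Note that even in this toy, the decisions multiply fast: dropped tags leave their text behind, `javascript:` URLs need filtering, and the output is rebuilt from scratch rather than patched in place. That's the shape of the problem, not a solution to it.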

In general it means you have to understand all the ways the type of data you accept can go from being an innocuous, inert sequence of bytes to triggering effects under the control of a potentially malicious user - and if you don't know, you have to assume the format is not inert when passed to any given piece of code.

E.g., to take a much simpler example than HTML: consider passing arbitrary XML to an XML parser in order to validate it against a schema as a sanitisation step. That could be a smart thing to do. Except that, even assuming the schema is strict enough, a malicious XML document passed to a parser that isn't explicitly configured to forbid it may be able to make HTTP requests with a source IP on your internal network (by specifying a suitable URL for the doctype).

It doesn't need to be malicious either - I've seen plenty of systems' throughput fall through the floor because someone didn't handle this case and suddenly got a bunch of XML documents whose doctype URL pointed at some third-party domain with a downed nameserver, so every parse sat waiting for DNS requests to time out.
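One blunt but effective way to close that hole, sketched in Python: refuse any untrusted document containing a DTD at all, before it ever reaches the parser, rather than trusting parser defaults. (In practice you'd likely also want a hardened parser such as the defusedxml library; this is just the shape of the idea.)

```python
import xml.etree.ElementTree as ET


def parse_untrusted_xml(data: str) -> ET.Element:
    """Parse XML from an untrusted source, refusing DTDs outright.

    External DTD/entity fetching is what enables both the
    internal-network requests and the nameserver stalls described
    above, and no legitimate data feed of ours needs a DTD
    (an assumption you'd need to confirm for your own feeds).
    """
    if "<!DOCTYPE" in data or "<!ENTITY" in data:
        raise ValueError("DTDs are not allowed in untrusted XML")
    return ET.fromstring(data)
```

The string check is crude, but erring toward rejecting the occasional odd-but-legitimate document is the right trade-off for untrusted input.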

In this case you'd also better be sure you don't have any services that are "protected" only by being behind a firewall and that allow side effects via GET requests. (There's a good reason to never allow side effects via GET requests and to never run unauthenticated services even behind your firewall: assume that somewhere, sometime, you will slip up in this area and let a user-supplied URL be retrieved from an internal IP, given the multitude of formats that can include URLs.)
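A common defence-in-depth companion to that rule is to refuse to fetch any user-supplied URL that resolves to an internal address. A minimal sketch (it deliberately ignores redirects and the DNS-rebinding race between check and fetch, both of which a real guard must handle):

```python
import ipaddress
import socket
from urllib.parse import urlparse


def assert_url_is_external(url: str) -> None:
    """Raise ValueError unless every resolved address is publicly routable.

    Sketch only: a real guard must also pin the resolved IP for the
    actual request (DNS answers can change between check and use) and
    re-check the target of every redirect.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"unsupported URL: {url!r}")
    for info in socket.getaddrinfo(parsed.hostname, None):
        addr = info[4][0].split("%")[0]  # strip IPv6 scope id if present
        ip = ipaddress.ip_address(addr)
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            raise ValueError(f"{url!r} resolves to internal address {ip}")
```

The key design point is that the check happens on the *resolved* addresses, not the hostname string - otherwise `http://127.0.0.1/`, a DNS name pointing at 10.x.x.x, and friends all sail straight through.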

And yes, if there's a risk of strpos() having a buffer overflow, you are now SOL unless you've validated your input in a way that prevents triggering it. While that's an unlikely case, it's an important illustration of the overall point:

All third-party data is unsafe until proven safe in the context of the code it will be passed to.

As a wider point, you should consider not only your own immediate usage, but whether a given piece of data may ever be passed on to a third-party API etc. - because even if you consider their security lapses to be their problem, those lapses can also harm you.

As a corollary, you should assume any data coming from a trusted partner is as unsafe as data passed to you by a known hacker.

It's with data as with unprotected sex: when you take data from someone, you're exchanging data not just with them, but with everyone with access to their systems and anyone they exchange data with.

Don't assume they're being safe - it takes just a single slip-up in their data handling before what you might think are "safe" data fields provided by your partner are actually unvalidated content provided by a malicious user. You may think you know the source of the data when taking a feed from a trusted partner, but you don't - not really.

To the extent that you should not just treat individual fields as supplied by potentially malicious users: you should treat their entire data feed as supplied by a potentially malicious user. As for why, consider the equivalent of SQL injection applied to whatever format your partner is passing you. Or they may simply have been hacked.
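The SQL case itself shows the pattern: the fix isn't to trust the source, it's to keep the value bound as inert data all the way down. A small sketch in Python with sqlite3 (the table and the hostile value are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# A hostile value arriving in a "trusted" partner feed:
partner_supplied = "x'); DROP TABLE users; --"

# Never interpolate, e.g. f"INSERT INTO users VALUES ('{partner_supplied}')".
# Bind the value instead, so it stays inert data rather than becoming SQL:
conn.execute("INSERT INTO users (name) VALUES (?)", (partner_supplied,))

# The table survives and the hostile string is stored as a plain value.
assert conn.execute("SELECT count(*) FROM users").fetchone()[0] == 1
```

The same principle applies to whatever format your partner actually sends you: escape or bind at every boundary, regardless of how much you trust the sender.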

The TL;DR boils down to pretty much the comment you replied to. Anything longer, including the above, needs to come with a big, huge caveat: it's NOT complete.

You can write books about the ways data-validation can go wrong and things to look for, and what I've written above just scrapes the surface in a few very unsatisfactory ways (except, hopefully, by terrifying you). You need to always approach it assuming the worst.




    It's with data as with unprotected sex: when you take data from someone, you're exchanging data not just with them, but with everyone with access to their systems and anyone they exchange data with.
I'll start calling airgapped systems abstinence-only networking.


As we all know, abstinence-only doesn't work, so maybe there are stronger parallels here than at first glance. ;)


Well, it works if you actually practice it...


In both cases, it's much easier said than done.



