
If your code accepts URIs as input, filter out “file://” - stevekemp
https://blog.steve.fi/If_your_code_accepts_URIs_as_input__.html
======
hughperkins
Wrong way around: only allow [http://](http://) and [https://](https://) (and
generally filtering out anything that's not letters, numbers, slashes or dots
is probably a good idea). Remove any sequence of more than one slash or dot.

~~~
pdkl95
Can we please stop trying to _enumerate badness_ [1]? When parsing input it is
possible to define the set of _valid_ input, but not all possible invalid
inputs.

Also, anybody accepting input from an untrusted source (such as anything from
a network or the user) without verifying the data with a formal recognizer is
doing it wrong [2]. Instead of writing another weird machine, guarantee that
the input is valid: use a parser generator (or whatever) to recognize the
input and drop anything even slightly invalid.

[1]
[http://www.ranum.com/security/computer_security/editorials/d...](http://www.ranum.com/security/computer_security/editorials/dumb/)

[2] [https://media.ccc.de/v/28c3-4763-en-the_science_of_insecurit...](https://media.ccc.de/v/28c3-4763-en-the_science_of_insecurity)

~~~
ubernostrum
_guarantee that the input is valid with a parser generator_

OK, that works really well... until you learn how much non-RFC-specified
behavior is built into web browsers. Simply building a parser to the RFC will
leave you wide open to all sorts of nastiness!

The is_safe_url() internal function in Django is a bit of a historical dive
into things we've learned about how browsers interpret (or, arguably,
misinterpret) various types of oddball URLs:

[https://github.com/django/django/blob/master/django/utils/ht...](https://github.com/django/django/blob/master/django/utils/http.py#L287)

~~~
pdkl95
> non-RFC-specified behavior

I never said anything about limiting the parser to what's defined in an RFC.
The acceptable input to "quirks mode" is just another (non-RFC) grammar, which
still needs to be defined and validated.

~~~
ubernostrum
Then I do wish you luck, but I don't think you'll ever be able to produce a
suitably complete grammar since parts of it will require knowledge of
undocumented proprietary internals of Internet Explorer.

Hence we scrape along doing our best with what we can figure out from
observing behavior and collecting bug reports. But even with that,
is_safe_url() is one of the most prone-to-security-issues functions in
Django's codebase.

------
wkd
"Despite reporting the problem to the author on Friday, and following up the
report via Twitter this has not yet been fixed, but after four days I assume
I'm not alone in spotting this."

Giving someone a weekend to fix something doesn't exactly sound like
responsible disclosure. I understand getting excited because you found a
flaw, but if you find something like this, please be more responsible about
publishing your findings.

~~~
kuschku
Well, I found the same by pure chance before reading this article, and I
suspect many more in the HN crowd did.

If half a dozen people on HN already report that they've found it and emailed
the person about it, it's likely too late for responsible disclosure.

~~~
4ndr3vv
Agreed! Better get that karma before someone else does /snark

~~~
kuschku
The issue is more that if so many people have already found it, who else has?

~~~
23andwalnut
Disclosing it publicly before it's been fixed only increases the number.

------
rybosome
The vulnerability happens because the web server is running as a user with
elevated permissions. If it were running as a user who only had permission to
read files from the webserver's serving directory, that would obviate the
problem.

Not that you shouldn't also validate input (in a whitelist rather than
blacklist as others have said): just goes to show that security is a
multifaceted problem with lots of ways to be paranoid. :)

~~~
nebulous1
You don't need elevated privileges to read /etc/passwd

~~~
daveloyall
That is correct.

    
    
        sebboh@namagiri:~$ ls -al /etc/passwd
        -rw-r--r-- 1 root root 1450 Jun  7 09:08 /etc/passwd
    

So, there's this idea of running publicly facing services as a user with
fewer privileges than a normal interactive user. This user might be called
'nobody' or 'apache', etc.

However, on your average distribution, /etc/passwd is readable by the
'nobody' user or the 'apache' user.

There are various workarounds:

\+ run your service in a chroot or jail.

\+ employ some kind of SELinux thing.

\+ ...?

I've only ever used the first one.
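A sketch of the first workaround in Python, assuming a Unix-like system and a service that starts as root (the directory and username here are placeholders):

```python
import os
import pwd

def confine(chroot_dir="/var/empty", username="nobody"):
    """Chroot into an empty directory and shed root privileges.
    Must be called as root, once, before the service accepts requests."""
    entry = pwd.getpwnam(username)
    os.chroot(chroot_dir)    # filesystem root is now chroot_dir
    os.chdir("/")            # don't keep a working dir outside the jail
    os.setgid(entry.pw_gid)  # drop the group first...
    os.setuid(entry.pw_uid)  # ...then the user, while we still may
```

After this, /etc/passwd on the real system is simply not reachable by any path, whatever the URL handling lets through.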

~~~
delinka
Since ages now:
[https://en.wikipedia.org/wiki/Shadow_password](https://en.wikipedia.org/wiki/Shadow_password)

~~~
daveloyall
That doesn't mean you should allow your app to access `file://` ... (Nor allow
it out of a chroot.)

Simply knowing some usernames on your system could provide an attacker with
clues...

------
r1ch
Also be sure not to allow loopback connections (i.e. your own site or
localhost), or you can cause a deadlock if the user asks your site to
download one of its own URLs recursively. Choose an appropriate timeout too,
to prevent users from tying up backend processes with an HTTP server that is
slow to respond, and don't follow Location headers, to avoid bypassing your
initial filters. A Range header should also be used to prevent users from
telling your server to download multi-GB files, causing bandwidth waste /
denial of service.

~~~
JoshTriplett
> Also be sure not to allow loopback connections (ie your own site or
> localhost) or you can cause a deadlock if the user requests the same URL to
> download from your site recursively.

Keep in mind that any arbitrary domain can point DNS to a loopback or LAN
address. So the code that fetches a URL needs to include this filtering.
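A sketch of that check in Python with the stdlib `ipaddress` module (the helper name is made up; note this is still race-prone against DNS rebinding unless the fetch itself pins the resolved address):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def resolves_only_public(url):
    """False if any address the host resolves to is loopback,
    private (RFC 1918), or link-local (e.g. 169.254.169.254)."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for family, _, _, _, sockaddr in infos:
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_loopback or ip.is_private or ip.is_link_local:
            return False
    return True
```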

> don't follow Location headers to avoid bypassing of your initial filters

Often, you'll want to follow redirects, but re-apply your filters when doing
so.

> A Range header should also be used to prevent users from telling your server
> to download multi-GB files and causing bandwidth waste / denial of service.

You can't count on support for Range, and the server might also just behave
unexpectedly. Your fetching code needs to limit how much data it accepts from
a server, and drop the connection after some upper bound.
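A sketch of that cap using only the stdlib, with the byte-counting split out into its own helper (the names and the 10 MB limit are arbitrary choices, not from the comment):

```python
from urllib.request import urlopen

MAX_BYTES = 10 * 1024 * 1024  # example cap: 10 MB

def read_capped(stream, max_bytes=MAX_BYTES):
    """Read at most max_bytes from a file-like object and error out
    past the cap, rather than trusting any declared Content-Length."""
    body = b""
    while True:
        chunk = stream.read(64 * 1024)
        if not chunk:
            return body
        body += chunk
        if len(body) > max_bytes:
            raise ValueError("response exceeded size cap")

def bounded_fetch(url, timeout=30):
    # Caveat: urlopen follows redirects and happily opens file://
    # URLs, so the scheme and redirect checks discussed elsewhere in
    # this thread still have to happen before this call.
    with urlopen(url, timeout=timeout) as resp:
        return read_capped(resp)
```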

~~~
detaro
> _Your fetching code needs to limit how much data it accepts from a server,
> and drop the connection after some upper bound._

And if you accept compressed responses, remember that they can be _a lot_
bigger when decompressed: gzip, for example, allows for a factor of ~1000.
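The same cap therefore has to apply after decompression. A sketch with `zlib`, using `max_length` so the inflater never produces more output than allowed (the cap value is arbitrary):

```python
import zlib

def gunzip_capped(data, max_bytes=10 * 1024 * 1024):
    """Decompress a gzip body, refusing to inflate past max_bytes."""
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 = gzip framing
    out = d.decompress(data, max_bytes + 1)
    # unconsumed_tail is non-empty iff max_length cut decompression short
    if len(out) > max_bytes or d.unconsumed_tail:
        raise ValueError("decompressed size exceeds cap")
    return out
```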

------
byuu
Also be very wary of ../ or possibly ..\ in URIs.

Say you have
[http://example.org/download?file=release/software-1.0.zip](http://example.org/download?file=release/software-1.0.zip)
so that you can log download statistics.

If you just fetch GET["download"] and return it, you're gonna have a bad time
if they try
[http://example.org/download?file=../../../../../etc/hosts](http://example.org/download?file=../../../../../etc/hosts)
(browsers strip it out automatically, but it's easy enough to type such a
request into a telnet session.)

And don't just assume your safeguard works. Act like a hacker and try to
break it yourself. Add an assert(validate("../") == false); at startup on
your server, so it won't even run otherwise. Forbid the use of fopen() and
instead go through your own file::open() function that calls validate()
internally, then #define fopen ERROR_DONT_USE afterward (or the equivalent
for whatever language you use.)

It is a hostile world; you can never be too safe, so add multiple (even
seemingly redundant) layers of protection.

~~~
idbehold
I've done something similar in node to ensure URIs like those cannot walk up
past a base directory:

    
    
      function saferesolve(base, target) {
        // A leading '/' before normalize() clamps any '../' at the
        // root, so the result can never climb above base.
        return base + path.posix.normalize('/' + target)
      }
    
      saferesolve("./datasource", "a/b")          === "./datasource/a/b"
      saferesolve("./datasource", "a/b/../c")     === "./datasource/a/c"
      saferesolve("./datasource", "../..")        === "./datasource/"
      saferesolve("./datasource", "../../a/b")    === "./datasource/a/b"
      saferesolve("./datasource", "../../a/b/..") === "./datasource/a"

~~~
ajanuary
I've done something similar in Java, though more focused on file paths
[https://github.com/aJanuary/basepath](https://github.com/aJanuary/basepath)

------
throwawayReply
Also make sure you don't follow 301/302, or someone can set up an http link
that redirects to file:// .

~~~
kuschku
Or just, when following 301/302, call the same function again – which then
validates the link completely again.

~~~
throwawayReply
I'm not sure why you're getting downvoted, it's a little unfair for people to
downvote a sensible suggestion without explaining why.

One possible downside is that someone could redirect A -> B and B -> A, which
risks tying up your resources in a redirect loop; but browsers limit how many
redirects they will follow, so it ought to be possible to impose a similar
cap on your side.
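A sketch of that cap, following redirects by hand and re-running the filter on every hop (`fetch_one` and `is_allowed` are hypothetical hooks standing in for your HTTP client and your URL filter):

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5  # browsers enforce a similarly small limit

def fetch_with_checks(url, fetch_one, is_allowed):
    """fetch_one(url) -> (status, location_or_body). Every hop is
    re-validated, and A -> B -> A loops die after MAX_REDIRECTS."""
    for _ in range(MAX_REDIRECTS + 1):
        if not is_allowed(url):
            raise ValueError("blocked URL: " + url)
        status, payload = fetch_one(url)
        if status in (301, 302, 303, 307, 308):
            url = urljoin(url, payload)  # Location may be relative
            continue
        return payload
    raise ValueError("too many redirects")
```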

------
ppierald
The u= parameter in the OP's article is also vulnerable (even if http/https
are whitelisted, file:// blacklisted, etc.) to the #10 vulnerability on the
OWASP Top Ten 2013 list, namely Unvalidated Redirects and Forwards.
[https://www.owasp.org/index.php/Top_10_2013-A10-Unvalidated_...](https://www.owasp.org/index.php/Top_10_2013-A10-Unvalidated_Redirects_and_Forwards)

I can easily get you to click the link to drive-by malware, adult sites,
pharma, phishing, etc. because the site doesn't ensure where the link is
actually going to.

------
greenleafjacob
This is the confused deputy problem. The most general solution to this class
of vulnerabilities, SELinux, has been largely ignored. Does SELinux need more
work to "bring it to market", or is it just too complicated and needs to be
simplified?

~~~
pawadu
Configuring SELinux is way too complicated for the average user.

~~~
nerdponx
And poorly documented. I learned what little I know about it from online
tutorials, not the docs.

------
konceptz
Quick FYI: it's difficult to filter URIs with a regex, because URIs are
described by a context-free grammar and belong to a larger class of language
than regular expressions can match.

Recursive descent parsers are the way to go, like the URI objects in most
languages.

Ex: URL url = URL.fromString(inputString); url.getScheme();

------
fpgeek
The original link seems down, but it has been saved at archive.org:
[http://web.archive.org/web/20160912105232/https://blog.steve...](http://web.archive.org/web/20160912105232/https://blog.steve.fi/If_your_code_accepts_URIs_as_input__.html)

~~~
stevekemp
Looks fine to me (the author), but glad to see a cached copy; the content is
pretty minimal.

Rate-limiting hasn't kicked in, and I'm seeing a steady stream of visitors.

~~~
tomhoward
It seems to be inaccessible for me here in Australia, but works when I access
it via a VPN to Europe. Possibly traffic is blocked from certain
networks/regions.

------
homakov
Also, if you're using ruby's open-uri:
[http://sakurity.com/blog/2015/02/28/openuri.html](http://sakurity.com/blog/2015/02/28/openuri.html)

------
Eridrus
This is a special case of a Server-Side Request Forgery vulnerability.
Validating schemes is part of the answer, but not the whole answer, because
attackers can still forge requests to internal resources you have firewalled
off from the internet.

These were recently released to help people deal with these issues, since the
details can be finicky:
[http://blog.includesecurity.com/2016/08/safeurl-server-side-...](http://blog.includesecurity.com/2016/08/safeurl-server-side-request-forgery-protection-library.html)

~~~
RegW
Really? Surely if I tried to enter
[http://localhost:8983/solr/admin/cores?action=UNLOAD&core=co...](http://localhost:8983/solr/admin/cores?action=UNLOAD&core=core0&deleteIndex=true)
the backend service would be password protected.

Oh, wait a minute.

------
majke
file:// URLs are awesome! People forget about them, and nobody actually
understands what to expect from them. Consider the cross-origin policy
implications: is file:///home/john/bar.html on the same origin as
file:///home/john/foo.html?

Also, XHR is totally allowed to go to file://

------
jwilk
> Actually the actual output all newlines had been stripped.

Not stripped, but replaced by spaces. Also, the linked image looks like
/etc/passwd, not /etc/hosts.

> Weird.

Not weird. That's how whitespace in HTML works.

~~~
kijin
OP tested his hack on an online Markdown converter, so it probably has more to
do with Markdown's treatment of whitespace than HTML's.

As it happens, Markdown ignores most non-consecutive newlines.

~~~
detaro
It's a converter taking HTML as input, converting it to Markdown.

------
phibit
Another fun one that's becoming more common with cloud hosted services: access
to internal metadata services that issue credentials.

Blocking file:// doesn't prevent access to
[http://169.254.169.254/latest/meta-data/iam/security-credent...](http://169.254.169.254/latest/meta-data/iam/security-credentials/..).

------
milankragujevic
I usually only do this, if I'm putting a user-submitted URL into
file_get_contents: if(substr($_GET['url'], 0, 4) != 'http') { exit; }

~~~
philo23
If that's the only validation on calls to file_get_contents, it could very
easily be bypassed. Entering something like just "/etc/passwd", for example.

~~~
milankragujevic
/etc != http :)

~~~
philo23
Ah yes, you're right. Totally misread the code.

It'd still leave access to any files in the same (or a sub) directory whose
names start with http, which realistically would probably be none, but it's
still something to bear in mind.

~~~
nitrogen
Right, best to check the whole URL. Something like "http/../../../etc/passwd"
might get through otherwise.

~~~
milankragujevic
Like....

    
    
       $url = rawurldecode($_GET['url']);
       $url_without_protocol = str_replace(array('https://', 'http://'), '', $url);
       $protocol = (stristr($url, 'https://') ? 'https' : 'http');
       $page = file_get_contents($protocol . '://' . $url_without_protocol);

~~~
kh_hk
What about

    
    
        from urllib.parse import urlparse
    
        u = urlparse(url)
        if u.scheme not in ['http', 'https']:
            return 400

------
Bino
There are other things to consider when fetching data from untrusted HTTP
resources:

A. Can it be used to DDoS?

B. How much data do you read? (Inbound bandwidth may cost you real money.)

------
int_19h
I think handling (rejecting) this in the libraries that handle URIs would be
the wrong fix. While this is a vulnerability in a web app, it is a handy
feature in an app that runs locally, and the library doesn't really know the
context. It's up to the app to filter the URLs according to its isolation
requirements.

------
ara24
This is no different from failing to validate input parameters and letting an
attacker take a dump of your database. A service connecting to a remote
location should check the URI scheme.

------
Grue3
I tried Python's requests and Common Lisp's drakma, and neither of them will
handle the "file://" URL scheme. Which HTTP client libraries are actually
vulnerable to this?

~~~
simscitizen
The most popular HTTP library is libcurl, and it supports file URLs by
default.

------
kazinator
Don't write server code that opens URIs that come in as input, period. If you
take URIs as input, do so only to turn them around and spit them out in some
JavaScript sent back to the same session. Whatever can or cannot be accessed
that way is the browser's problem.

~~~
elmigranto
This works as long as your business model doesn't have opening users' URIs at
its core, e.g. a monitoring service.

------
_RPM
A blacklist is generally not a good idea in any case. You should use a
whitelist instead.

------
beilharz
fuckyeahmarkdown fixed it just now.

