
Nobody expects ENTITY sections in XML, either - mikeknoop
http://mikeknoop.com/lxml-xxe-exploit
======
kijin
Most XML libraries have an option to disable external entity loading. But this
is not enough. Libraries should be secure by default, not by toggling an
obscure option. (On a related note, why do we still have so many templating
engines that don't block XSS by default?)

Recent versions of libxml (2.9.0+) disable external entity loading by default,
so any implementation based on libxml (such as PHP's SimpleXML) should be
secure as long as the defaults are left untouched. But if you use an XML
parser implemented natively in another language, or one that links to an older
version of libxml, you should look really carefully at the default settings.
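
As one way to "look carefully at the defaults", here is a minimal sketch using Python's standard-library `xml.sax` (not libxml or lxml). It switches external general and parameter entities off explicitly instead of trusting whatever the installed version defaults to. The names `TextCollector` and `parse_text_without_external_entities` are made up for illustration.

```python
import io
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
)


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers all character data it is given."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text_without_external_entities(xml_bytes):
    """Parse xml_bytes and return its text, never loading external entities."""
    parser = xml.sax.make_parser()
    # Set both features explicitly rather than relying on the default.
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)

    collector = TextCollector()
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(collector.chunks)
```

With these settings a reference to an entity declared as `SYSTEM "file:///..."` is skipped rather than expanded, so the file's contents never reach the handler. Internal entities are still expanded, so this alone does not address entity-expansion attacks.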

~~~
bazzargh
Minor correction: 2.9.2+. (2.9.0 and 2.9.1 had buggy attempts to disable entity
expansion and were still exploitable.) See:
[http://www.cvedetails.com/cve/CVE-2014-3660/](http://www.cvedetails.com/cve/CVE-2014-3660/)

------
acdha
In case anyone missed dalke's comment below:

If you use Python, use defusedxml:

[https://pypi.python.org/pypi/defusedxml](https://pypi.python.org/pypi/defusedxml)

It's a drop-in replacement for the stdlib XML parsers and lxml which makes it
trivial to import an instance with secure defaults – to quote the docs:

    
    
        Instead of:
    
        >>> from xml.etree.ElementTree import parse
        >>> et = parse(xmlfile)
    
        alter code to:
    
        >>> from defusedxml.ElementTree import parse
        >>> et = parse(xmlfile)

------
mikeknoop
Note: I submitted this last night to HN but it didn't get picked up. I thought
it worthwhile to re-post given that the visibility of a similar HN thread was
what clued me into this security problem in the first place.

------
0942v8653
I realize this is nothing new, but these vulnerabilities are so incredibly
_simple_ (e.g. Shellshock, this) and stem from obscure features no one
touches (e.g. Shellshock, this). Maybe that's too much to conclude from the two
things I'm thinking of at the moment. Any other examples?

~~~
_delirium
It does make me feel like my decision to "parse" RSS feeds using ~5 lines of
Perl regex, which seemed dumb at the time, is maybe still sorta-dumb, but at
least not worse than using a default XML parsing library. All I really need
out of an RSS feed is to find the author, URL, date, and body, which I was
just too lazy to do "properly", so used some regexes as a quick hack. But
seeing what stuff "proper" XML parsers have buried in them, I think I might
stick with the Perl script...

------
anonfunction
The article states that GitHub gists add the correct Content-Type header based
on file extension, which is untrue.

    
    
      curl -i https://gist.githubusercontent.com/mikeknoop/e7b3c526738b66950eb4/raw/1d46d432ed380abc986cf15028221318b836395b/text.xml
      // other headers...
      Content-Type: text/plain
      // headers and body...
    

There is a great service called rawgit[1] that does actually add the correct
headers.

[1] [https://rawgit.com/](https://rawgit.com/)

~~~
mikeknoop
Good catch, looks like maybe GitHub removed this feature (I know for sure it
used to work with `.json` extensions).

Updating the post, thank you!

------
wglb
Good description, but a pretty well-known problem.

~~~
mikeknoop
From what I can tell, it is very well known in security circles.

Even so, we as a team of 8 didn't catch the problem for some time because we
weren't aware of it. That's part of my rationale for giving an
easy-to-test-yourself POC.

~~~
dalke
While I have little experience with it, I know the goal of the defusedxml
package (see
[https://pypi.python.org/pypi/defusedxml](https://pypi.python.org/pypi/defusedxml)
) is to make it so that teams like yours can worry less about these details.

It has a module which "acts as an example how you could protect code that uses
lxml.etree. It implements a custom Element class that filters out Entity
instances, a custom parser factory and a thread local storage for parser
instances. It also has a check_docinfo() function which inspects a tree for
internal or external DTDs and entity declarations."

It sounds like it would have defused your example of lxml, and perhaps a few
others you haven't considered.
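
The idea of refusing documents that declare DTDs or entities can be sketched with the standard library alone. This is not defusedxml's code and it is not lxml-specific; it is a rough illustration, assuming Python's `xml.parsers.expat`, of rejecting input as soon as a DOCTYPE or entity declaration shows up. `ForbiddenXML` and `check_no_dtd_or_entities` are hypothetical names.

```python
from xml.parsers import expat


class ForbiddenXML(ValueError):
    """Raised when a document declares a DTD or an entity (hypothetical name)."""


def check_no_dtd_or_entities(xml_bytes, allow_dtd=False):
    """Raise ForbiddenXML if xml_bytes declares any entity.

    A DOCTYPE declaration is rejected too unless allow_dtd is True.
    """
    parser = expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        if not allow_dtd:
            raise ForbiddenXML("DOCTYPE declaration for %r" % name)

    def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
        # Fires for internal and external entity declarations alike.
        raise ForbiddenXML("entity declaration for %r" % name)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.EntityDeclHandler = on_entity
    # Exceptions raised inside the handlers propagate out of Parse().
    parser.Parse(xml_bytes, True)
```

Running a check like this on untrusted input before handing it to the real parser means the parser never sees an entity declaration at all, at the cost of parsing the document twice and rejecting legitimate documents that rely on entities.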

~~~
fryguy
But how do you know to use defusedxml instead of the regular one? I've never
even heard of it.

~~~
dalke
The base problem is that XML is not secure by default.

Every solution requires either figuring out the problems yourself (which is
impossible, given the number of problems that exist), or learning about it
from elsewhere. No matter what, there will be people asking the same question
you did.

I found out about defusedxml because I read planet.python.org, where
[http://blog.python.org/2013/02/announcing-defusedxml-fixes-for-xml.html](http://blog.python.org/2013/02/announcing-defusedxml-fixes-for-xml.html)
came up, and because I had enough general understanding of the
security problems with XML to recognize why it was created.

------
abalone
So uh, has this been patched in lxml?

Seems like parsers should at least disable file-based external entity URIs by
default.

