
Billion laughs - khet
http://en.wikipedia.org/wiki/Billion_laughs
======
astrojams
It isn't obvious at first glance that this small XML file actually expands to
a billion "lols". You really have to give the bad guys credit for ingenuity.

~~~
tisme
:(){ :|: & };:

Even simple bash scripts can do weird things like this. And that's a lot
smaller.

~~~
chubot
Yeah, but there's a HUGE difference: bash is intended to be a Turing-complete
language, while XML is a data format. It's trivial to use infinite resources
if you can execute arbitrary code.

It shouldn't be possible to use resources exponential to the input size for
just PARSING (not doing anything else with) XML. This is a great example of
why you want to use simple serialization formats like JSON.

It's the same reason I never liked YAML. The spec and implementations are just
way too big. There's got to be something hiding in there like this.

~~~
burke
YAML is actually trivially compatible with this exploit, in a sense, though
the result is not particularly disastrous with Psych.

    
    
        lol1: &lol1
          "lol"
        lol2: &lol2
          [*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1]
        lol3: &lol3
          [*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2]
        lol4: &lol4
          [*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3]
        lol5: &lol5
          [*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4]
        lol6: &lol6
          [*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5]
        lol7: &lol7
          [*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6]
        lol8: &lol8
          [*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7]
        lol9: &lol9
          [*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8]

~~~
clarkevans
In YAML, this is 9 objects in memory, one of them is a scalar, the other 8 are
arrays of size 10 that happen to be pointers to the other objects. Not very
big at all... much less than 1k. It's not even interesting since YAML
competently serializes graphs that are far more complex. Lots of reasons to
trash YAML, this isn't one of them.

As aardvark179 noted, a usage concern with YAML is more with how an unguarded
application might fail to check for serialized graph cycles.
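
A rough sketch of the point above, in plain Python rather than a real YAML loader. It assumes the loader resolves every alias to a reference to the same object, so the nested lists below stand in for what the document would load as. The names `build_lols`, `count_distinct`, and `count_leaves` are made up for illustration. Note that the YAML document starts at lol1, so a full expansion has 10^8 leaves rather than the 10^9 of the XML version.

```python
def build_lols(levels=9, fanout=10):
    """Mimic the YAML document: each level is a list of references to the previous one."""
    node = "lol"
    for _ in range(levels - 1):
        node = [node] * fanout  # ten references to one object, not ten copies
    return node


def count_distinct(root):
    """Count distinct objects reachable from root, following each one only once."""
    seen = set()
    stack = [root]
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        if isinstance(obj, list):
            stack.extend(obj)
    return len(seen)


def count_leaves(levels=9, fanout=10):
    """Number of "lol" strings a naive full expansion (e.g. pretty-printing) would visit."""
    return fanout ** (levels - 1)


root = build_lols()
print(count_distinct(root))  # 9
print(count_leaves())        # 100000000
```

Nine objects in memory, but a hundred million leaves for anything that walks the structure as if it were a tree, which would explain the pretty-printing REPL struggling.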

~~~
burke
That makes sense. I tested this in a REPL that tried to pretty-print the
result.

------
dguido
Probably should rename this to "billion reposts."

Can we move beyond this simple issue and discuss more complicated aspects of
security on HN?

~~~
dguido
Fine, you guys asked for it.

<https://news.ycombinator.com/item?id=259458>

<https://news.ycombinator.com/item?id=3859853>

<https://news.ycombinator.com/item?id=1674911>

<https://news.ycombinator.com/item?id=301296>

<https://news.ycombinator.com/item?id=4619344>

People that exploit these kinds of things continue to innovate, but HN seems
to be stuck with XSS, SQLi, and malformed XML.

~~~
daeken
While I'm all with you on talking about advanced security, the reality is that
most people here don't understand basic security. I think talking about the
low-hanging fruit is important -- everyone has to start somewhere. And as
always, if you want to see more advanced security stuff, post it! I'll upvote
it for sure.

Edit: This does make me think that I've been meaning to write a blog post
about a security issue I discovered for about 6 months now. Time to do that.

~~~
wglb
Which he did: <http://news.ycombinator.com/item?id=4678309>. This is a tale in
which cody pretty much ends up owning ccbill.

------
wtallis
So, how much memory would a real-world parser actually consume given this
file? I'd try it, but I had to RMA my workstation's motherboard yesterday,
leaving me with a machine that only has 3GB, which is the obvious minimum for
a full expansion. But I could imagine an XML parser might use UCS-2
internally, inflating this to 6GB. Or, some parsers might be clever and not
attempt a full expansion.
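
For what it's worth, the back-of-the-envelope numbers above work out like this. It is only the raw character payload, assuming 10^9 copies of "lol" and a fixed width per character; any real parser's bookkeeping overhead is ignored, and `expanded_bytes` is a made-up helper name.

```python
LOLS = 10 ** 9       # nine tenfold expansion steps, as in the Wikipedia example
CHARS_PER_LOL = 3    # len("lol")


def expanded_bytes(bytes_per_char):
    """Raw size of the fully expanded text for a fixed-width character encoding."""
    return LOLS * CHARS_PER_LOL * bytes_per_char


print(expanded_bytes(1))  # 3000000000, one byte per character
print(expanded_bytes(2))  # 6000000000, two bytes per character (e.g. UCS-2)
```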

~~~
zurn
So you're asking how much memory this resource exhaustion attack consumes
when you run it.

Each line in the WP example amplifies by a factor of 10. It has 9 lines. It's
10e9. That's a billion times 3, which is just enough to exhaust the 32-bit
virtual memory space in common operating systems.

Of course the XML implementation could be smart and short-circuit this while
preserving the semantics.
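
One way a short-circuit could look, as a sketch only: compute how long the expansion would be from the entity table, without ever building the string. This is not how any particular parser works. It assumes the entity definitions are acyclic and have already been collected into a dict, and `expanded_length` and `ENTITY_REF` are hypothetical names.

```python
import re
from functools import lru_cache

# Matches general entity references such as &lol3; (simplified name syntax).
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def expanded_length(entities, name):
    """Length of the fully expanded text of entity `name`, computed without building it.

    Assumes the definitions are acyclic, as in the billion laughs example.
    """
    @lru_cache(maxsize=None)
    def length(entity):
        text = entities[entity]
        refs = ENTITY_REF.findall(text)
        literal = len(ENTITY_REF.sub("", text))  # characters that are not references
        return literal + sum(length(ref) for ref in refs)

    return length(name)


# Rebuild the entity table from the Wikipedia example: lol, lol1 ... lol9.
entities = {"lol": "lol"}
for i in range(1, 10):
    prev = "lol" if i == 1 else f"lol{i - 1}"
    entities[f"lol{i}"] = f"&{prev};" * 10

print(expanded_length(entities, "lol9"))  # 3000000000
```

Because each entity's length is cached, the work is proportional to the size of the DTD, so a parser could in principle refuse the document before expanding anything.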

~~~
dbaupp
I think you mean 10^9 = 1e9 rather than 10e9.

(And the parent actually mentioned the 3 GB minimum, I think it was
essentially asking how much extra memory a real world XML parser uses.)

~~~
irishcoffee
Nope, 10e8 is what you're both looking for, as 1^9 is.. 1.

~~~
BCM43
Actually, dbaupp is correct. The e9 part means *10^9.

See: <https://www.wolframalpha.com/input/?i=1e9>
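
The same check in Python, whose float literals use the usual scientific notation (the variable names are just labels for the three forms mentioned in this subthread):

```python
dbaupp_form = 1e9        # 1 * 10^9
irishcoffee_form = 10e8  # 10 * 10^8, the same number
zurn_form = 10e9         # 10 * 10^9, ten times larger

assert dbaupp_form == 10 ** 9
assert irishcoffee_form == 10 ** 9
assert zurn_form == 10 ** 10
```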

------
caseydurfee
Is there a legitimate use case for being able to recursively define entities
like that?

~~~
halter73
Look again. There's no actual recursion going on. It's fairly trivial to
identify recursion in DTD entities.
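
To illustrate why that check is easy, here is a minimal sketch of cycle detection over entity definitions. It assumes the declarations have already been pulled into a dict of name to replacement text; `has_recursive_entity` is a hypothetical helper, not part of any XML library.

```python
import re

# Matches general entity references such as &lol3; (simplified name syntax).
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def has_recursive_entity(entities):
    """Return True if any entity refers, directly or indirectly, to itself."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state = {name: WHITE for name in entities}

    def visit(name):
        state[name] = GREY
        for ref in ENTITY_REF.findall(entities[name]):
            if ref not in entities:
                continue  # undefined or predefined (e.g. &amp;): ignored in this sketch
            if state[ref] == GREY:
                return True  # reference back into the current path: a cycle
            if state[ref] == WHITE and visit(ref):
                return True
        state[name] = BLACK
        return False

    return any(state[name] == WHITE and visit(name) for name in entities)


laughs = {"lol": "lol", "lol1": "&lol;" * 10, "lol2": "&lol1;" * 10}
looped = {"a": "&b;", "b": "&a;"}
print(has_recursive_entity(laughs))  # False
print(has_recursive_entity(looped))  # True
```

The billion laughs definitions pass this check, which is the point: the blow-up comes from repeated references, not from recursion.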

~~~
lmm
Rather than nitpicking the terminology, can we answer the actual question: is
there any legitimate use case for defining an entity in terms of another
entity?

~~~
byuu
Right, that's the root of the problem: feature creep. Common with design-by-
committee.

As with any feature, there are possible use cases. Perhaps you want to create
a form document and use custom entities that you modify later to fill the
document out.

But the number of times that's useful is not likely to be worth the potential
harm of things like billion laughs. It's easy to separate that functionality
and have your XML parser do an element data replace with your own custom
tokens. E.g. node.replace("{name}", customerName);
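
A sketch of that idea with Python's standard `xml.etree.ElementTree`. The `{name}` token syntax comes from the example above; `fill_placeholders`, the sample document, and the values are invented for illustration.

```python
import xml.etree.ElementTree as ET


def fill_placeholders(root, values):
    """Replace {key} tokens in element text and tails with plain strings."""
    for element in root.iter():
        for attr in ("text", "tail"):
            content = getattr(element, attr)
            if content:
                for key, value in values.items():
                    content = content.replace("{" + key + "}", value)
                setattr(element, attr, content)
    return root


form = ET.fromstring(
    "<letter><to>Dear {name},</to><body>Order {order} shipped.</body></letter>"
)
fill_placeholders(form, {"name": "Alice", "order": "1234"})
print(ET.tostring(form, encoding="unicode"))
# <letter><to>Dear Alice,</to><body>Order 1234 shipped.</body></letter>
```

The substitution happens after parsing, on ordinary strings, so the output grows only by what the application itself puts in.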

------
alexrbarlow
I have to say, I love this. Crazy, for a language that is really for
transferring data.

I guess you could do this with YAML too?

~~~
aardvark179
To pull this particular style of trick you require a schema definition that
allows for one object to be expanded into a whole set of objects, and for the
resulting data structure to be a tree rather than simply a rooted directed
graph.

I don't know YAML well but I believe if you tried this trick with something
like alias nodes then you would end up with a lol9 node with ten separate
connections to a single lol8 node with ten separate connections to a single
lol7 node and so on. This would not produce the same problem in the parser,
though might trigger problems in whatever processed the resulting graph.

~~~
clarkevans
That's definitely correct: since you can produce cycles in YAML (it's a
deliberate choice), programs which don't check for graph cycles and blindly go
about traversing a serialized graph are subject to DDOS.
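
A minimal sketch of the kind of guard meant here, for data that has already been loaded into Python dicts and lists. The cyclic structure is built by hand to stand in for what an anchor/alias pair can encode, and `count_nodes` is a made-up name.

```python
def count_nodes(root):
    """Walk a graph of dicts and lists, visiting each container once even if it is cyclic.

    Returns (number of distinct containers, number of scalar values visited).
    """
    seen = set()
    stack = [root]
    scalars = 0
    while stack:
        obj = stack.pop()
        if isinstance(obj, dict):
            children = list(obj.values())
        elif isinstance(obj, list):
            children = obj
        else:
            scalars += 1
            continue
        if id(obj) in seen:
            continue  # already visited: a shared node or a cycle
        seen.add(id(obj))
        stack.extend(children)
    return len(seen), scalars


cyclic = {"name": "loop"}
cyclic["self"] = [cyclic, cyclic]  # the container refers back to itself
print(count_nodes(cyclic))  # (2, 1)
```

Without the `seen` set, the same loop would never terminate on this input.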

------
055static
This doesn't work with my sed-based XML parsers. :(

------
ilcavero
so, how do I protect myself against this?

~~~
Evbn
Use a parser or data structure that doesn't duplicate identical objects.

Functional programming wins here.
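
A different and blunter defence than object sharing, shown only as a sketch: refuse any document that declares its own entities. It uses the `EntityDeclHandler` hook of Python's standard `xml.parsers.expat`; `EntitiesForbidden` and `parse_without_entities` are hypothetical names, and the approach assumes your application never needs custom entities.

```python
import xml.parsers.expat


class EntitiesForbidden(ValueError):
    """Raised when a document declares its own entities (hypothetical name)."""


def parse_without_entities(data):
    """Parse XML with expat, refusing any <!ENTITY ...> declaration."""
    parser = xml.parsers.expat.ParserCreate()

    def reject_entity(name, is_parameter_entity, value, base,
                      system_id, public_id, notation_name):
        # Raising inside a handler aborts the parse and propagates to the caller.
        raise EntitiesForbidden(f"entity declaration not allowed: {name}")

    parser.EntityDeclHandler = reject_entity
    parser.Parse(data, True)


# A tiny two-level stand-in for the real attack document.
bomb = """<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol1 "&lol;&lol;">
]>
<lolz>&lol1;</lolz>"""

parse_without_entities("<ok>fine</ok>")  # no declarations, parses normally
try:
    parse_without_entities(bomb)
except EntitiesForbidden as err:
    print("rejected:", err)
```

The document is rejected at the first declaration, before any entity is expanded.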

~~~
Nitramp
You might, but the expansion happens on a level that might be really hard to
implement if you want to keep XML semantics. E.g. in the DOM API, each element
has its own identity - you can't collapse identical objects (or you could, but
the objects generated this way aren't identical, they have different parent
nodes for example).

