Bad JSON Parsers (github.com)
211 points by lovasoa 11 days ago | hide | past | web | favorite | 191 comments





The JSON spec [1] clearly states, in section 9:

  An implementation may set limits on the maximum depth of
  nesting.
So, this is kind of a pointless exercise. The author would do well to inform themselves better before slinging epithets such as "bad" at a library.

[1]: https://tools.ietf.org/html/rfc7159


I would be perfectly OK with that answer if they documented that they use the C stack and therefore have a small finite depth limit, and/or threw some sort of TooMuchNesting exception before they hit that limit.

AFAICT, many/most of these libraries do neither of the above. There's nothing in the documentation to suggest they'll have trouble with deeply nested inputs, and their response to such a situation is to simply crash the process. I'd call that a bug.

I don't consider it acceptable for a library to simply say "the spec says we're technically allowed to 'set limits' on what we accept, so here's some nasal demons". If you've got limits, you've got to document them, or at least make them recoverable.


It's hard for a library to portably know when the C stack will exhaust or give a concrete estimate that makes sense for documentation purposes.

Firstly, that concept is not exposed at all to standard-compliant C code. The standard is vague enough to allow for many possible variations of the program-stack abstraction, so you would not be writing portable, standard C.

Second, the library author doesn't know ahead of time how much stack space you, the caller, have consumed before calling the library. Do you call it deep inside some ui framework with a big stack? With a huge i/o buffer on the stack? They don't know.

Translating all of this into "this is how deeply nested your json can be" is not a sane exercise; at best you can give a machine-specific guess that will vary a bit with your own application's stack needs.

But perhaps one can side-step all of that and get most of the way with an artificial, arbitrary-constant runtime depth check that is safely below likely actual stack limits.
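A minimal sketch of that idea in Python (names and the constant are hypothetical; this handles nested arrays only, no scalars or commas, just to show where the depth check goes):

```python
MAX_DEPTH = 512  # arbitrary constant, safely below likely actual stack limits

def parse_array(text, pos=0, depth=0):
    # Minimal recursive-descent parser for nested arrays only.
    # The explicit depth check turns a would-be stack overflow into
    # a normal, recoverable parse error.
    if depth > MAX_DEPTH:
        raise ValueError("nesting exceeds implementation limit")
    if text[pos] != "[":
        raise ValueError("expected '['")
    pos += 1
    items = []
    while text[pos] != "]":
        value, pos = parse_array(text, pos, depth + 1)
        items.append(value)
    return items, pos + 1
```

With this check, a pathological input fails with a catchable exception instead of crashing the process.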


Sure, that’s all true. So don’t use the C stack for parsing. It’s not difficult.

> I don't consider it acceptable for a library to simply say "the spec says we're technically allowed to 'set limits' on what we accept, so here's some nasal demons".

Welcome to C, enjoy your stay.


Most of these libraries are for HLLs.

I just wrote a standard-conforming JSON parser:

    cat standard-still-bad.sh

    #!/bin/sh
    echo "ERROR: JSON parser implementation limit." 1>&2
    exit 1 # terminate and indicate error

--- https://tools.ietf.org/html/rfc7159#section-9

> An implementation may set limits on the size of texts that it accepts. An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers. An implementation may set limits on the length and character contents of strings.


"Conforming" is not identical to "useful." A car that doesn't start is still a car.

It's funny, the spec they reference[0] makes no mention of parsers or nesting. But RFC 8259[1] clearly does. Moreover, I notice they were both published in December 2017. Why two specs for the same standard?

[0] http://www.ecma-international.org/publications/files/ECMA-ST...

[1] https://tools.ietf.org/html/rfc8259


As is almost always the case when there are multiple specs purporting to describe the same thing: politics.

The Ecma spec is meant to be a spec defining a format and its semantics; the RFC places requirements on implementations of that format as well as describing its semantics.


There was a rumor that the IETF was going to specify a bunch of new breaking changes in RFC 7159 (March 2014), and because of that (irrational) fear ECMA published the 1st edition of ECMA-404 very quickly in October of 2013.

RFC 8259 and the 2nd edition of ECMA-404 contain the same change, a single sentence, specifying that JSON must use UTF-8.


Maybe the headline is bad, but making a table of implementation limits of various JSON parsers seems useful?

Maybe we focus on headlines too much?


I dunno, the author claims they're sorting the table from worst to best, instead of by minimum maximum depth to maximum maximum depth

Maybe I shouldn't have called this repository "bad json parsers". I just added the following to the README:

> The name of this repository is intentionally provocative. It's goal is to document limitations of existing json parsers, not to denigrate the work of the many contributors who worked on the various libraries presented here.


The word you're looking for is clickbait ;)

"We Ranked The Top JSON Parsers By Call Stack Depth... Number 6 Will Shock You!"


First, I appreciate the change to the readme. I think that goes a long way.

Second, "bad" here is highly subjective. I've never ever hit the recursive depth limit in python with JSON, and if I do, I'll probably think about doing it a different way, in a way that won't require python to do deep recursion.


Its*. "It is goal" makes no sense.

fixed

Yeah, this is dumb.*

I'd sooner criticize a parser for mangling numbers above 2^53 (also allowed by the spec) than not allowing thousands of nested arrays.

---

* Except that this is a potential DoS vector. From that angle, it has some interest.


> I'd sooner criticize a parser for mangling numbers above 2^53 (also allowed by the spec)

The spec has something specifically to say about this[0]:

> Note that when such software is used, numbers that are integers and are in the range [-(2^53)+1, (2^53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.

Basically, allow up to 2^53, and you ought to be fine.

[0] https://tools.ietf.org/html/rfc8259#section-6
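The edge is easy to demonstrate in any language whose JSON numbers become IEEE-754 doubles. Python's floats make the loss visible (note Python's own json module keeps integers exact; the floats below just illustrate what a double-based parser does):

```python
# 2^53 is the last integer range exactly representable as an IEEE-754 double;
# parsers that decode every number as a double silently round beyond it.
n = 2**53
assert float(n) == n              # exact: still interoperable
assert float(n + 1) == float(n)   # 2^53 + 1 rounds back down: silently mangled
```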


Agreed, such a parser would be "to spec". It says "An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers."

On a practical level, parsers not handling 64-bit ints has bitten me more than parsers not handling 1000+ levels of nesting.


As a side note, the Haskell implementation also supports numbers of arbitrary size, again limited by memory, for both parsing and encoding.

From TFA

>The two most recent JSON standards RFC 8259 and RFC 7159 both say "An implementation may set limits on the maximum depth of nesting". However, the ECMA-404 specification doesn't contain any limit on how deeply nested JSON structures can be.


To be fair, that text was added after the above comment was posted.

https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...


Not stating a limit doesn't mean there's NO limit. It means the limit is implementation defined. If the way you interpret a spec implies a requirement of infinite memory or infinite stack, then the way you interpret it is wrong.

Reading "MUST" when the specification says "may" is also wrong.

Nevertheless, the parser failing at only 100 levels of nesting is shockingly bad.

I work with complex JSON every day. Even painfully complex structures like Facebook's Ad Insights API or AWS Lambda Events only get around 5 levels deep, so 10 deep is already exotic, and I can't even conceive of a use case for 100. The method of failure is important too--would it really be better to just keep going forever, or die when maximum stack depth was reached, or the machine runs out of RAM (or swap space...)? Treating absurdly nested JSON the same way you would any invalid JSON input seems like a much friendlier and less dangerous approach overall.

Think of error cases and malicious input instead of well-written APIs.

When I was testing Gumbo [1] on all the HTML files in Google's index, I ran into one that had 48,000 levels of nested HTML tags. (Ironically, Gumbo itself handled this fine, but the test script I used to verify the output died.) I posted it to Memegen with a link to the original source URL, and then got called out for crashing Chrome. Apparently I wasn't the only one who didn't think about recursion limits in a parser. (Both bugs have since been fixed.)

What was the wayward HTML file? It was an XML file served with the wrong content type. The file contained 48,000 self-closing tags. When you parse an unknown element with a self-closing tag in HTML, it ignores the trailing />, treats it as an ordinary unknown element, and happily keeps generating a deeper and deeper DOM.

A stack overflow that results in a segfault is a pretty serious DoS vulnerability in a JSON parser. You probably could take down a good portion of the Internet by sending JSON requests to their AJAX endpoints that consist of 1M of {.

[1] https://github.com/google/gumbo-parser


Was Gumbo used for parsing crawled pages for indexing reasons? How was it used internally? I presume they use some headless-Chrome-like technology for most of that nowadays.

No, the indexer used a custom HTML parser that was ancient (IIRC, when I was there it wasn't even fully HTML5-compliant - it doesn't have to be, because if you're tokenizing pages for indexing the only really important parts are text, anchors, meta tags, formatting, and headers, and as long as you basically get those right it's not important that you have the exact parse tree that a browser would.) It was very tightly optimized for speed, eg. it was SAX-style, with callbacks instead of heap allocations; the data that it returned to the caller was indexes into the original source data rather than case-normalizing and copying buffers; and the liberties it took with the HTML spec were largely so that it could minimize branching and state during the parse.

Headless Chrome existed within the indexing system while I was there but it was turned on for only a very small section of the index for CPU reasons. Their public statements since have indicated it's used for everything now.

Gumbo grew out of a templating language that I was working on with a larger team within Search Infrastructure and Maps. AFAIU the templating language is still being used, though I think Gumbo isn't being used for it anymore (its syntax diverged from pure HTML5 at some point). It was really intended for tools like refactoring & static analysis: that's why it's slower than the indexing HTML parser; why it was done in C rather than C++ (easier binding to scripting languages); why the API is intended to be round-trippable, with access to the original text that data structures refer to; and why it put a premium on generating accurate parses over speed.


Stick any structure in DynamoDB and you’ve already halved your nesting depth due to its encoding a superset of JSON in JSON. Then start encoding more complex structures like the ones the others have mentioned and these limitations start to bite pretty quickly.

Python limits recursion depth (1000 by default). I’ve run into that limitation with recursive functions a few times, but I can override it by setting a new maximum in code. Couldn’t something similar be done with JSON?
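That override is sys.setrecursionlimit. A sketch using a plain recursive function (whether the json module itself honors the limit varies by CPython version, so this demonstrates the mechanism directly):

```python
import sys

def depth(node, d=0):
    # Recursively measure the nesting of an already-parsed structure.
    return depth(node[0], d + 1) if node else d

nested = []
for _ in range(2000):
    nested = [nested]          # build 2000 levels of nesting

sys.setrecursionlimit(1000)    # CPython's usual default
try:
    depth(nested)
    raised = False
except RecursionError:
    raised = True              # blows past the interpreter limit cleanly

sys.setrecursionlimit(10000)   # raise the maximum, as the comment describes
assert raised and depth(nested) == 2000
```

Raising the limit too far is its own hazard: the interpreter limit exists to fail before the real C stack does.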

> I can't even conceive of a use case for 100

A naive serialization of cons cells?
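For instance (a hypothetical encoding, not any particular library's): each cell becomes a [head, tail] pair, so nesting depth grows linearly with list length, and a 100-element list is already 100 levels deep.

```python
import json

def to_cons(xs):
    # Naive cons-cell encoding: each cell is [head, tail],
    # terminated by null. Depth == number of elements.
    cell = None
    for x in reversed(xs):
        cell = [x, cell]
    return cell

print(json.dumps(to_cons([1, 2, 3])))  # [1, [2, [3, null]]]
```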


> I can't even conceive of a use case for 100

A serialized trie?


Lots of recursive data-structures (trees/graphs/*) lend themselves nicely to lots of nesting. Such documents aren't really intended to be read/written by humans.

If they aren't meant to be read/written by humans, JSON is probably a bad choice.

I couldn't disagree more.

If you need to serialize and deserialize data between tools written in different languages -- e.g. write data to a file and read it in another tool -- then what would be better than JSON?

Nothing about JSON is inherently good or bad for deeply hierarchical/recursive data.


Deeply hierarchical/recursive data is equally well served by XML and Protobuf.

However, JSON itself can be... painful for some subsets of data that you may wish to move between applications: binary data, well-typed data. JSON tends [0] to use floats, so you may need to adjust your expectations about how to store numbers if precision matters. You can work around a lot of these, such as through various encodings, but they are workarounds.

A lot of those workarounds depend on re-encoding, which unfortunately means using strings, which can run afoul of "An implementation may set limits on the length and character contents of strings." and silently drop important bytes, or simply cut off the end of the payload.

[0] The grammar is well-defined, the behaviour is not. "This specification allows implementations to set limits on the range and precision of numbers accepted." https://tools.ietf.org/html/rfc8259#page-7


Wouldn’t XML increase file size even with compression applied? Obviously there are better machine-readable formats out there, but XML is a middle ground between efficiency and support.

I would lean toward Protobuf wherever you can use it. It is designed to deal with issues like this. However, you can't always fulfill some of the requirements for it.

In those areas, XML strikes a practical balance. Especially if you can use a significant subset of XML. (If you don't need DOM, then use a parser that doesn't produce it, etc.)


I would say protobufs.

No, and if they can't even be read by a machine, that's bad.

This works out well in practice -- the ruby parser that ships with the language (the 100 levels one) works for most purposes and will even fall back to an all-ruby implementation if it can't load its C extension. It's a reasonable default parser for almost all cases.

However, if you need the all-singing, all-dancing, handles the most complex use cases super fast parser, which is Oj at the bottom (which handles infinite levels), then you have the option to install it and use it.


"Bad" is a value statement; they didn't say "not to spec", "invalid", or similar.

I think you’re missing a link.

I don't like it when programs let physical limits turn into physical failures based on the input. An explicit "JSON nesting depth of 23456 levels exceeds maximum memory of 23456MB" error is a very different thing from being kill -9'd by the OOM killer (after it gets your ssh session first, of course).

If you decide your limit in advance and bail out of parsing, that's one thing. If you just use up all the stack and cross your fingers, that's a pain.


I know this is too late to be useful, but it only takes one bit per nesting level to remember which bracket type is open, or just a single 64-bit integer to track the current depth in O(1) space. I wonder if there is a sublinear algorithm? At best, right now, it looks to me like it could be made to resemble bitstream compression.

The author has now added this as an aside: https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...

It is pointless. Also, nobody ever cares much how well a text editor handles obscure text files (file too large, broken encoding, ...). What matters is how it handles well-formatted files; the others aren't worth opening in an editor anyway :)

The commenter would do well to RTFA before slinging complaints that have already been addressed in the article.

The article is cool-headed and factual, so don't get hung up on the repo name.


To be fair, the article was updated to address those complaints after the comment you're responding to was made.

https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...



It is better to not set limits.

There is always a limit. The question is whether you want to use up all available memory reading bad input, or reject it sooner. If you're worried about denial-of-service attacks, it makes sense to be explicit about how big an input you'll accept.
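Being explicit can be as simple as a length check before the parser ever runs. A sketch (the limit and helper name are hypothetical):

```python
import json

MAX_BODY_BYTES = 1 << 20  # 1 MiB: an explicit, documented limit

def parse_body(raw: bytes):
    # Reject oversized input up front, instead of discovering the
    # limit via the OOM killer halfway through parsing.
    if len(raw) > MAX_BODY_BYTES:
        raise ValueError("request body exceeds %d bytes" % MAX_BODY_BYTES)
    return json.loads(raw)
```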

It would've been interesting to know which platform the JavaScript test was performed on (judging by the source code, likely Node.js). However, on my current machine Chrome chokes with a stack overflow above nesting level 125,778.

More amazingly, Firefox is currently happily parsing a JSON array with 50,000,000 levels of nesting at full single-core CPU utilization, consuming about 12 GB of memory. This was after an array of depth 10,000,000 was successfully parsed, peaking at around 8 GB of RAM. So Firefox's parser seems to be constrained only by available memory!

Also - while this is running in the console the Firefox UI is completely usable and I can browse the web in another tab. Mozilla has come a long way!

E: The 50,000,000 nesting level array parsing eventually failed with "Error: out of memory".


Great. Is anything else usable too?

Pretty much, but RAM utilization is of course closing in on 100%, so starting any new programs would fail at the OS level.

My takeaway from this is that you can submit a long string of brackets to REST apps built with Haskell and Ruby/Oj and crash the process and/or machine by consuming all available RAM. Whereas the smarter JSON processors near the top of the list will abort processing pathological inputs before serious harm occurs.

Wait, which JSON parsers are supposed to be "bad"?


Just because a structure is deeply nested doesn’t mean that it consumes a lot of memory.

And just because your payload is massive and consumes a lot of memory doesn’t mean it’s deeply nested.

I doubt any of the parsers are resilient against storing large documents in memory. What makes them bad is that structures that would easily fit in memory are unparsable.

In most cases your web server imposes a limit on the size of the HTTP payload, so you wouldn’t normally be able to deliver a 128GiB JSON file anyway. It’s not like nested structures are zip-bombs.


My comment was mostly tongue-in-cheek but there's some truth to it. Since the receiver creates a navigable data structure with 64-bit pointers, every two bytes sent ("[]") becomes 20+ bytes in RAM (low estimate!). An order of magnitude magnification helps.

Well, most JSON parsers (haskell being a notable exception) use an amount of memory that is proportional to the size of the input, independently of whether they can parse deeply nested structures or not.

Saying that a parser that does not limit the nesting depth will crash your server does not make any sense. You could as well say that str_replace will crash your server.


You can do exactly the same by submitting an endless string.

The problem is not DoS by memory exhaustion; the problem is stack overflow. You can trivially control the return address of the decoder. On mostly public interfaces!

And I think I was the very first one whose JSON library knew about that, probed for the max depth (which is system dependent) and added hard and soft limits to it. It's not even mentioned in this post.


> You can trivially control the return address of the decoder.

With a stack overflow? Stack overflows usually write to unmapped memory and cause a segmentation fault; how would you use that to control the return address? Or perhaps you were thinking about a stack buffer overflow instead?

https://en.wikipedia.org/wiki/Stack_overflow vs https://en.wikipedia.org/wiki/Stack_buffer_overflow


In this case it's the same. The max stack depth is known, and the overflow easily triggerable and controllable from outside. It cannot get much easier.

Wikipedia is a bit confused.


It’s trivial to cause a stack overflow (as in allocating too much memory and running out of stack space), but how would you overwrite the return address? From what I can tell, the attacker can only cause the application to allocate lots of memory, not to overwrite memory.

Memory DoS will not happen with normal recursion, only a traditional stack overflow. And by adding some content after the fixed number of [ or {, this will be on the stack as an argument to the recursive function. There are so many possible exploits. See e.g. https://jon.oberheide.org/files/infiltrate12-thestackisback.... or search for "stack overflow recursion exploit"

"The tar archive spec does not specify an upper limit on archive size, but my kernel keeps saying 'no space left on device'. It's obviously a bad kernel."

Well, in your example, there is a physical limit on how much space is available. In the case of deeply nested json, we are talking about structures that perfectly fit into memory, and could be decoded in a fraction of a second if only a different algorithm were used.

On one hand, yes. On the other hand, the standard Ruby "json" gem can be made to overflow on just 202 bytes of valid JSON input.

I mean, it'd be a pathological 202 byte JSON, but yeah (I guess it could be a DOS attack of some kind, actually, hmm...)

Ruby is a good example, because the `oj` gem, on presumably the same system, is listed as "infinite." Obviously not truly "infinite", eventually it'll run out of RAM -- but this shows it is _not_ an issue of machine resources really.

As the OP notes, it's because if you implement using recursion, you'll run out of stack in most languages (now let's start talking about tail call optimization...). But this isn't the only implementation choice; you _can_ implement it without this limitation, and probably should.


Anywhere you accept JSON from untrusted users, you should be catching parse errors anyway.

Tail call optimization wouldn't help with parsing JSON.

Usually the solution is to create an array on the heap and manage the stack yourself.
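A sketch of that approach, for nested arrays only: the nesting state lives in a heap-allocated list instead of the call stack, so depth is bounded by available memory rather than stack size.

```python
def parse_nested_arrays(text):
    # Iterative parse of arrays-of-arrays, using a plain list as an
    # explicit stack instead of native recursion. A sketch, not full JSON.
    stack = []
    root = None
    for ch in text:
        if ch == "[":
            node = []
            if stack:
                stack[-1].append(node)  # attach to current open array
            else:
                root = node
            stack.append(node)
        elif ch == "]":
            if not stack:
                raise ValueError("unbalanced brackets")
            stack.pop()
        elif ch in " \t\r\n,":
            continue
        else:
            raise ValueError("this sketch handles arrays only")
    if stack:
        raise ValueError("unbalanced brackets")
    return root

parse_nested_arrays("[" * 100000 + "]" * 100000)  # no RecursionError
```

Same algorithm, same space complexity as the recursive version; only the stack it grows is different.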

Nope, it stops itself at 100 levels with a JSON::NestingError.

If you run it as JSON.parse(ARGF.read, max_nesting: false) you get 65492 instead of 101.


> the standard Ruby "json" gem can be made to overflow on just 202 bytes of valid JSON input.

It is not overflowing, but aborting with a proper error.

You can argue whether 100 is a reasonable default, but I think it is not too stupid to have a maximum depth here and bail out as soon as possible. Because what will happen if you accept arbitrary nesting? The next guy in the chain who actually works with the parsed JSON tree will also have to handle arbitrary nesting. And if you are not careful, you get some error deep in some quick-and-dirty user-written code (which might actually be exploitable, instead of the JSON simply not being accepted in the first place).

I would say you can think of the 100 as some kind of input sanitization.


    max_nesting: The maximum depth of nesting allowed in the parsed data structures.
    Disable depth checking with :max_nesting => false. It defaults to 100.

Referring to it in bytes purely serves to spread FUD. It's >100 levels of nesting which is a bit silly.

I did that intentionally, and not to spread FUD. I hear ">100 levels of nesting", and think "huge document", which as demonstrated here isn't necessarily true. So I said "202 bytes" to emphasize "not huge document"; to emphasize that deep nesting doesn't really imply much about document size.

100+ levels of nesting implies nothing about the size of a JSON file, just that it's an extremely niche use case. OTOH almost all JSON files are 202+ bytes. Couple that with the title and it just leads to confusion.

>niche use case

Like an attack.


Why 202 bytes? Wouldn’t 101 [ characters suffice?

But then it wouldn't be valid JSON.

But it could still cause a naive parser to fall over rather than report the json as invalid.

The JSON::XS documentation has this to say about the $max_depth setting for the parser:

    Second, you need to avoid resource-starving attacks. That means you should
    limit the size of JSON texts you accept, or make sure then when your
    resources run out, that's just fine (e.g. by using a separate process that
    can crash safely). The size of a JSON text in octets or characters is
    usually a good indication of the size of the resources required to decode
    it into a Perl structure. While JSON::XS can check the size of the JSON
    text, it might be too late when you already have it in memory, so you
    might want to check the size before you accept the string.

    Third, JSON::XS recurses using the C stack when decoding objects and
    arrays. The C stack is a limited resource: for instance, on my amd64
    machine with 8MB of stack size I can decode around 180k nested arrays but
    only 14k nested JSON objects (due to perl itself recursing deeply on croak
    to free the temporary). If that is exceeded, the program crashes. To be
    conservative, the default nesting limit is set to 512. If your process has
    a smaller stack, you should adjust this setting accordingly with the
    "max_depth" method.
So accepting an unlimited depth structure by default is probably not a good idea. There's a flag you can set if you need to parse something like that, but the defaults are set to parse anything sensible and reject documents that may be malicious.

My takeaway would be "using the C stack to parse JSON is probably not a good idea", not "accepting an unlimited depth structure by default is probably not a good idea".

This limit is entirely artificial, and due only to the design of the parser.

As the documentation you quote says, you should limit the size of JSON texts you accept. If you don't, then you are at risk of a memory exhaustion attack. But this is completely unrelated to the maximum depth you accept. If your parser doesn't use the C stack to parse recursive structures, then deeply nested payloads will parse just fine, without using an abnormal amount of memory.


I disagree with your conclusion. I would instead say that the combination of using the stack to parse JSON and allowing unlimited depth is the problem. In practice, almost no one ever deals with JSON structures even a hundred levels deep (I’d quite seriously guess it to be less than one in a million applications that desire it), so if you artificially cap it to protect against stack overflow you can then safely use the stack, thereby getting better performance by reducing the required allocations, while also protecting against one DoS attack vector.

You never "artificially cap it to protect against stack overflow": Most libraries that cap it artificially cap it to a fixed value, that may or may not protect against stack overflow depending on how large the stack already is when you call the parser.

These caps we’re talking about are normally a small fraction of the total stack size, and few applications run anywhere near the total stack size. I acknowledge your point (and had it in mind when I wrote my first comment), but maintain my position that this capping is protecting against the stack overflow DoS attack: if you’re able to get a stack overflow with your capped JSON parser, you were probably already able to get a stack overflow, because it’s not adding overly much to the stack.

The problem that’s being protected against is user-controlled stack overflow—that’s what makes it a DoS attack.


> As the documentation you quote says, you should limit the size of JSON texts you accept. [...] But this is completely unrelated to the maximum depth you accept.

I'm not sure that's right.

The documentation points out that a deep structure returned from the parser can crash the interpreter, because freeing data also uses recursion. ("due to perl itself recursing deeply on croak to free the temporary".)

Anyway, I found the 512 depth limit too much... When using coroutines in Perl, the stack is smaller and it crashes trying just to print a JSON object with depth 512. I hit the limit at about depth 150, with nothing else on the C stack, so in robust diagnostic code that uses the JSON formatter to log data structures, I limit the max depth to 20.


I'd argue this is a pitfall that languages can and should solve. There's no reason why writing a naive recursive descent parser should cause this problem. The fix, using a separate stack data structure, is telling, since it's the same algorithm, with the same space complexity, you're just saying "my language's call stack isn't a usable implementation, so I need a different stack." Why not just fix the call stack?

In languages like rust/c/c++, this might be hard, since you can't grow the stack by moving it (you'd need to update pointers, which are too visible in those languages, and even identifying a pointer at runtime is undecidable), and segmented stacks cause the "hot-split" problem.

But most languages could just make this problem go away by growing the call stack when needed. A few of them (incl. Haskell) do. I'm following suit in a language I'm working on.


Most purely functional languages (like Haskell) don't really have a "call stack", but the computation is carried out by graph reduction, so it can obviously detect cycles and do some optimizations there, depending on the type of recursion.

If you want the gory details https://www.microsoft.com/en-us/research/wp-content/uploads/...


Yeah, I've read the STG paper, though it's been a while. IIRC there isn't a call stack as we'd normally think of it, but there's still a stack of pending pattern matches, so you actually do hit the same issues. I vaguely recall making deep recursion "just work" in GHC was something that happened just in the past couple years.

> In languages like rust/c/c++

In another HN thread there is much discussion about Rust using, in effect, segmented stacks to implement async-await.

Perhaps you could code a recursive descent parser in Rust using async functions, so that the Rust compiler converts it to a heap-allocated state machine for you.



The repo is calling it bad, but in production I'd actually prefer a parser that throws an appropriate error based on a setting that can be configured with reasonable defaults over something that just silently chugs along until it runs out of memory.

I can imagine it'd be a lot easier to overwhelm and DDoS a server that attempts to parse incoming JSON requests without any depth bounds too.


Just because your data structure is deeply nested doesn’t mean it takes a lot of memory.

[[[[]]]] is still only 4 objects worth of memory on a stack.


Most JSON parsers use an amount of memory that is proportional to the size of the input, independently of whether they can parse deeply nested structures or not.

A parser that limits the maximum depth of the input can still be made to consume gigabytes of memory on an input of several gigabytes, and there is nothing wrong about that.

Size limits on a production server should be enforced before even starting to parse the JSON, but this has nothing to do with the issue highlighted here.


I think the reason that the nesting level limit is 128 and the file size is 256 bytes for serde_json in this test is that the recursion_limit attribute is not configured.

https://doc.rust-lang.org/reference/attributes/limits.html

The default value of this attribute is 128 which would explain both results, I'm guessing with configuration you could significantly increase both values at the expense of compilation time.


That's for the recursion limit on compile time stuff, like how many layers of macros it will expand.

serde does have this though: https://docs.rs/serde_json/1.0.41/serde_json/struct.Deserial...


Interestingly, I recently encountered an issue in the wild where these two recursion limits are both prominently involved.

Yew, the Wasm web framework, provides an html! macro that allows you to write templates that look like HTML within your Rust code. The macro is notorious for requiring the macro expansion `recursion_limit` to be raised even for fairly small templates. Now, if you happen to have a syntax error in a template and ask cargo to return the build errors as JSON, it will diligently report every macro expansion that led to the problem as a nested structure. Which is what the commonly used cargo-web tool does, so when it sees JSON with 128+ levels of macro expansion details describing a build error (again, not uncommon with Yew's html! macro), serde_json fails to parse it with `ErrorCode::RecursionLimitExceeded` and cargo-web panics.

https://github.com/yewstack/yew/issues/661#issuecomment-5467...


The recursion limit in the deserializer is a deliberate one https://github.com/serde-rs/json/pull/163

For some reason I thought serde was using macros to perform deserialisation. Now I know that isn't the case, and it's a coincidence that the default macro recursion limit and serde's default nesting limit are both 128.

Even if serde does indeed use macros, they run at compile time. The limit on macro recursion depth has an impact on what programs do compile, but it has no impact on the runtime behavior of these programs.

Is this relevant/are there situations where people nest JSON even 100x/1000x? I have not come across something like that.

The question isn't necessarily what might be found in nature, but what might be intentionally constructed in order to attack something. Without knowing more, the segfault in the C++ library is somewhat alarming.

On the topic of malicious input, the serde_json example seems to be deliberately limiting recursion depth in order to head off DoS attacks:

"serde_json has a recursion limit to protect from malicious clients sending a deeply recursive structure and DOS-ing a server. However, it does have an optional feature unbounded_depth that disables this protection. It still uses the stack though, so it will still eventually blow the stack instead of using all available memory"

https://github.com/lovasoa/bad_json_parsers/issues/7

This begins to suggest that it may be downright desirable for servers to toss out deeply-nested JSON, regardless of the JSON specification.


Google made a similar design decision regarding protobuffers (https://developers.google.com/protocol-buffers/docs/referenc...).

Why not implement parsing without recursion? It doesn't look like a particularly hard task for me. Why restrict users with artificial limitations?

Why make something more complicated to add capability users aren't asking for.

Why do you decide for all users? I'm user and I'd like to have a reliable parser able to parse anything given sufficient memory. That's the point of library functions: provide flexibility for any use case.

If you have a service that takes in JSON from unknown sources then it's a pretty easy DDoS vector for abusing memory limits. Send a few hundred k of nested {"a":[{"a":[...]}]} that's easy to generate from a couple different CnC clients and you can kill a server pretty readily.

Similar for password login systems that don't restrict bad attempts by IP/Username. If you know a user exists you can send many requests for username:passphrase, and the hashing time has a cost... good passphrase hashing has a pretty high cost usually 1/10 of a second on a single core, or more. If you aren't tracking attempts, then you can easily topple a server with a few hundred simultaneous requests usually.

Anything that is an unguarded memory or compute hog can be used as a pretty easy DDoS vector unless otherwise mitigated somehow. I usually restrict POST bodies to 5-10k regardless, and chunk file uploads into separate requests. Restricting recursion depth for JSON parsing is a pretty great option as well.
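For a sense of scale, such a payload is trivially cheap for an attacker to construct. A hypothetical illustration in Python (the `nested_payload` helper is made up for this sketch):

```python
import json

def nested_payload(depth):
    # Each repetition adds one object and one array level:
    # 8 bytes of input buy two levels of nesting.
    return '{"a":[' * depth + ']}' * depth

payload = nested_payload(50_000)
print(len(payload))  # 400000 — under half a megabyte for 100,000 nesting levels
json.loads(nested_payload(10))  # small depths parse fine; large ones may not
```

Whether a given parser survives the 50,000-repetition version depends entirely on how it handles depth, which is exactly what the linked test measures.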


Whether the stack is the implicit C stack or an explicit heap stack does not change anything about the argument you've made. I.e., your comment is not at all responsive to Grandparent's:

> Why not implement parsing without recursion?


Because even if the stack doesn't run out, it still processes for a long time. Non-recursion doesn't help you there, you need to look at every byte, after all.

And it's worth noting that many JSON parsers assume the problem domain is "shipping data over a network," where the input may be malicious. If the non-recursive parser has an N^2 piece of algorithm embedded in it, its ability to consume large data structures becomes a DOS liability.

You'll need a stack anyway, because the structure is logically a tree. Better to use the call stack, that is already there and heavily optimized.

Naive person here. Isn’t the call stack a lot heavier than pushing data structures on a std::stack?

A recursive parsing call stack has the overhead of pushing/popping return addresses and maybe a couple other registers. Maybe a bit more if there are security protections like stack cookies that didn't get optimized away, or if shadow stacks are enabled. It's not free, but is pretty cheap. A non-recursive use of std::stack dodges some of that, but instead has the overhead of dynamic allocation, which I'm generally going to consider more expensive - but does have the advantage of being more of a once-off cost - so perhaps it'd be cheaper for long but shallow JSON objects? The cost of the call stack vs std::stack is probably dwarfed by the cost of allocating arrays/objects/strings and decoding in either case.

If you want to over-optimize for that specific case anyways, you can take a hybrid approach with smaller stacks kept on the stack in a single linear root blob, with larger stacks falling back onto the heap, using a non-recursive style.

However, if you're capping JSON depths as an anti-DoS measure anyways (your parser isn't the only thing possibly vulnerable to blowing the stack via recursion - anything dealing with the parsed result is also in danger if it can be recursive), the heap fallback may be entirely dead code anyways... so why bother coding it?


It depends on how big the stack frame of the recursive function is vs how much data needs to be saved for each recursion level.

But… most likely yes.


You'll need a data structure to store the results to return to the caller. This data structure by definition cannot be the call stack. So using a call stack is technically redundant.

Exactly. Another example is Python's standard library JSON parser, which is documented to throw a JSONDecodeError on invalid input. But if the input contains deeply nested structures, it raises a RecursionError instead, and this is not documented anywhere. It's easy to imagine how code that only catches JSONDecodeError while decoding untrusted inputs could be exploited by an attacker.
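A quick demonstration of this behavior (CPython; the exact depth at which it triggers can vary with the interpreter build and the recursion limit in effect):

```python
import json

# Valid JSON, just very deeply nested.
nested = "[" * 100_000 + "]" * 100_000

try:
    json.loads(nested)
except json.JSONDecodeError:
    print("JSONDecodeError")   # what a caller might expect to catch
except RecursionError:
    print("RecursionError")    # what CPython actually raises here
```

Code that only wraps `json.loads` in `except json.JSONDecodeError` lets the RecursionError propagate, potentially crashing the handler.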

Ugh Python exceptions are always a source of anxiety for me. Without combing through the source and locking yourself to specific versions it’s nigh impossible to be sure you’ve caught everything without a “catch everything” block.

Documented or not, you shouldn't rely on the lib only throwing JSONDecodeError in security-critical code.

But what can you do instead? It’s a really hard problem to validate generic JSON without accidentally parsing it.

Validation better be done by the parser. Otherwise, subtle differences between the validator and the parser can lead to security flaws. Your solution is to make sure you use a good parser.

When the OS-provided stack is overflowed, the program accesses a "guard page" (or several) which is explicitly non-readable by the process, which in turn triggers a segfault. It's not an exploitable issue, but in the interest of robustness, the library probably should've failed more gracefully instead of blowing through the stack.

Is it any more alarming than the other results?

You probably wouldn't see it in "natural" use but it's a potential attack vector if you ever deserialize arbitrary JSON (e.g. JSON bodies in API requests). Of course you should have general limits that would catch these things anyway...

General limits for what? If you can segfault a program with 28kb of JSON, that's far below the maximum many APIs will return, so you'd need to do some sort of pre-parsing of the JSON to determine the nesting level before parsing it...

For example, Java won't segfault. It'll throw a StackOverflowError, which will be caught and served with some standard error like HTTP 500. It shouldn't affect other queries.

Well, recursion limits would be more for input... I usually limit all POST data to 5-10k and don't like to go over that. I also tend to use chunked uploads, which are a little slower, but serve to limit the size and time of any inbound work.

Procedurally-generated JSON can nest data as deep as it needs to in order to describe arbitrarily complex data, but when that happens I'd usually consider it an indicator that a problem thought to be simple has overflowed the bounds of its initial assumptions.

If you're representing a tree data structure in a natural way, you can easily run into this limitation with real-world data.

I have seen JSON output with thousands of levels of nesting. It was generated by a sort of compiler tool that compiled to JSON.

I know, Halloween is over, but currently I'm working with an API where the returned JSON is created by manual string concatenation. Sometimes commas and quotation marks go missing.

Does the API specify under which conditions you can expect missing characters? There should be an undefined behaviour angle here somewhere

I wish.

The dev simply writes new string concatenation code every time a new endpoint is needed, so I can't even hope for some bugs being gone for good after he has tested his implementation with a few endpoints.


They’ve changed the title[1] to remove the phrasing “bad”; I think the title should be updated here as well, even if the repo name still reflects the original title. It’s not reasonable to use this single criterion as a measure of “badness,” as there are many different variables. I would place security/memory safety 1000x higher than recursion limits. I have never hit JSON recursion limits in production.

[1]: https://github.com/lovasoa/bad_json_parsers/commit/16f6ed75b...


Coincidentally, I just wrote a simple JSON parser the other day as a toy exercise. A simplified snippet of parsing code (handling only arrays) relevant to the discussion here would be something like:

  def parse(text):
      # Explicit heap-allocated stack instead of recursion, so the
      # nesting depth is limited only by available memory.
      stack = []
      index = 0
      result = None
      while index < len(text):
          char = text[index]
          val = None
          if char == '[':
              stack.append([])       # open a new array
          elif char == ']':
              val = stack.pop()      # close the current array
          index += 1
          if val is not None:
              if stack:
                  stack[-1].append(val)  # attach to the enclosing array
              else:
                  result = val           # the top-level array is complete
      return result
Using the test utilities in the repo indicates that the parsing logic can handle arbitrarily nested arrays (e.g. up to the 5,000,000 max in the test script), bounded only by the limits of the heap.

It seems like the main criticism here is against recursive implementations. Or am I missing something?


A better question to ask is: Is it a success of CS education that developers know when to make tradeoffs in software engineering and go for the most easily understandable code (the recursive implementation used in most pseudocode), since deeply recursive cases are rare? Or is it a failure that software engineers cannot figure out how to convert a recursive algorithm into an iterative implementation for greater memory efficiency?

It doesn't necessarily mean anything that deeply nested cases are rare. If you're making a general library, someone is going to throw user input at it, and some users will try to break your system. If it can avoid a crash in that case, it usually should.

One of JSON's selling-points is to be human-readable. I would struggle reading any JSON document with a nesting level greater than 6 or so. Why would I ever want a nesting level > 100?

Human-readable as in "not binary". Other than that, it doesn't offer a lot that helps in that regard.

This doesn't really make sense to me. Just because JSON is human-readable doesn't mean that humans are meant to read any possible serialized data nor that it should always be an ergonomic read. Not everything you serialize is just a config file or hand-edited data. It may be arbitrarily dense with arbitrary data.

To me, this is kinda like seeing a JSON array of user IDs coming over the network and going "Huh? Why not usernames? I thought JSON was supposed to be human-readable!"


I see your point; although if you look at the specs for a lot of easy-to-parse formats for computers, a stated design goal is also easy-for-humans (e.g. Markdown, YAML).

Large, complex object hierarchies with lots of nesting might make more sense represented in binary (e.g. Avro).

I realize I'm making a little bit of a McLuhan-esque argument in a largely computer science-oriented context, but I hope you can see what I'm getting at.


Another example: source code is human readable (by definition), but that doesn't mean I can read any possible program. I struggle to read many programs with deep nesting, but that doesn't mean I'd find it acceptable for my compiler to crash in such situations.

Maybe because you use a JSON library all the time anyway, so using it even when it's not ideal but good enough is the path of least resistance. Or you're a cargo cultist.

When dealing with interconnected systems it's a pretty decent format. That it is human-readable also allows for relatively easy inspection of the data without building one-off tooling.

When I'm dealing with non-sql data, one thing I've also done is fire the record into a service that compresses to .gz and a worker sends it to S3_PATH/data-type/recordid.json.gz as a recovery option. Not that it's the first or best recovery option, but when your core data is only a few hundred thousand records, potentially faster than a full DB restore. Not to mention, services being written in one language, and a recovery script in a scripting language.

It just depends on how you are doing things. Now, for configurations, I tend to prefer TOML or YAML.


Recursive parsers need not use the stack for recursion, they can use the heap instead through trampolining, and not have an issue until memory is fully consumed. That would impact performance somewhat - so users should also be allowed to choose when they need the trampolined version. They should still limit recursion depth and allow the user to set a value higher than the default.

I can't help but think that the lazily evaluated Haskell version score of infinity is probably somewhat misleading, as once you access something within a large enough tree you'll probably run out of memory pretty quickly.
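The trampolining idea above can be sketched in Python using generators, so that "recursive" frames live on a heap-allocated list rather than the call stack. This is a toy depth computation under that scheme, not a full parser; the names `trampoline` and `depth` are invented for the sketch:

```python
def trampoline(gen):
    # Drive a generator-based recursion: yielding a sub-generator is a
    # "recursive call"; a return ends the current "frame". All pending
    # frames sit in a plain Python list on the heap.
    stack, result = [gen], None
    while stack:
        try:
            stack.append(stack[-1].send(result))
            result = None  # prime the freshly pushed generator
        except StopIteration as stop:
            stack.pop()
            result = stop.value  # pass the return value to the caller
    return result

def depth(value):
    # Maximum nesting depth of already-parsed JSON-like lists.
    if not isinstance(value, list) or not value:
        return 0
    best = 0
    for child in value:
        best = max(best, (yield depth(child)))  # heap-recursive call
    return 1 + best

v = []
for _ in range(200_000):   # far beyond any default recursion limit
    v = [v]
print(trampoline(depth(v)))  # 200000
```

The cost is exactly what the comment predicts: every "call" allocates a generator object, which is noticeably slower than a native call, so offering both variants and letting the user choose seems reasonable.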


It’d be nice if it was possible to annotate recursive functions as using the heap instead of the stack for recursion. I’ve seen too many implementations of people writing a poor man’s stack frame to store on the heap and it just totally destroys any sort of elegance the original code had.

Hence "available RAM is the only limit" for Haskell.

Shameless self-promotion: I wrote a JSON parser for node.js that specifically solves this problem of parsing deeply-nested and/or huge JSON payloads.

https://www.npmjs.com/package/bfj

It's entirely asynchronous, yielding frequently to avoid monopolising the event loop. And it uses a fixed-length buffer so it never exhausts available memory. The tradeoff is that it's slow. But afaik it parses JSON payloads of any size and any depth.


I thought i'd try some other Java parsers, but the Maven build doesn't work with current Maven, and it's not clear what you'd feed to test_parser.sh anyway, as there's no driver script for the Java case.

One of the things i'm passionate about is always setting up a project in such a way that someone new to it can download, build, and use it with an absolute minimum of fuss. That means making the most of the build tools, and scripting everything you can't do through the build tool. I don't always meet this standard myself, because i am a weak and wretched human being, but it's great when it's done.

Gradle is good here, because the wrapper means you always get a specific Gradle version, although it can't control the JDK version, which is increasingly painful in the post-8 age. Cargo is pretty good, because the user just needs to install rustup, then they get a precisely controlled Rust and Cargo version via the rust-toolchain file. I'd love it if there was a rustup-wrapper which i could check in, which would obviate the need for a global rustup installation. Ruby, Python, and Node do alright, but you have to ask the user to install the version manager of your choice. C is absolutely miserable.

Since this project has all of the above and more, making it as self-sufficient as i would like would be quite an effort!


I agree the project could use some better documentation. You can run the test with any program by running

    ./test_parser.sh my_program [arguments]
Where my_program is a command that reads JSON on its standard input, and returns a non-zero return code if the parsing failed.
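For instance, a minimal Python driver matching that contract (hypothetical, not part of the repo) could be:

```python
import io
import json
import sys

def parse_ok(stream) -> int:
    """Exit code for test_parser.sh: 0 if the JSON parses, 1 otherwise."""
    try:
        json.load(stream)
        return 0
    except Exception:
        return 1

# As a standalone driver, the script would end with:
#     sys.exit(parse_ok(sys.stdin))
print(parse_ok(io.StringIO('[[[]]]')))  # 0
```

You would then run something like `./test_parser.sh python3 my_driver.py`.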

How are C build systems miserable? If you're building for a general Linux distro, you'll not have much control over the dependencies, but that's not a problem with C, but with how programs are distributed in the GNU/Linux distribution model.

AFAIK, there is no tool that lets you specify a compiler version in your project, then automatically build with that compiler. The equivalent of rustup, rvm, nvm, etc.

There is also no standard package manager.

It is very hard to persuade GCC to use a specific set of libraries that you have obtained using a package manager, or vendored or whatever, rather than getting distracted by random libraries it happens to find in /lib etc. The root of the pain is that you really want to use the host system's platform libraries, the ones which implement the POSIX API, but once the compiler can see those, it tends to go off looking for other things in the same place. In my experience, at least!

These deficiencies are tied up with the historical interrelationship of C and the operating system, but that doesn't mean that they aren't real.


Related question: why is the stack limit so tiny in modern operating systems? Why do we allow programs free rein of the heap but not the stack?

Edit: I hadn’t considered that all of a processes’s threads share the same address space. In a 32-bit address space, a 4MB stack limits you to 1024 threads (with nothing left over for heap).

However, perhaps the question still stands: why use a small stack size on a 64-bit system.


One of the reasons is that if you momentarily use a very large amount of stack to calculate something (say gigabytes), and then return, the large part of the stack which is no longer used stays allocated, full of unused data taking up RAM for the remaining life of the process, which might be a long time.

If a heap-allocating algorithm does the same thing, it's likely to free the memory when it's finished with it.


the sort order is incorrect. he says they sort from worst to best, but the worst one is the one that can segfault and the perfectly good ones are all the other ones that, if i read this correctly, just throw an exception.

In Clojure, we've taken to using a macro to generate custom parser drivers (using Jackson underneath) from schemas. If we encounter an unexpected token or invalid (according to the schema) value, we drop the whole input.

I was expecting a report on parsers that crash on challenging syntax. eg: { hello: 'world\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\' }

Independently of the backslashes, this is not valid JSON: keys have to be quoted, and JSON uses double quotes, not single quotes...
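For illustration, Python's stdlib parser rejects both variants of that input for exactly these reasons:

```python
import json

# Unquoted and single-quoted keys both violate the JSON grammar.
for bad in ['{ hello: "world" }', "{ 'hello': 'world' }"]:
    try:
        json.loads(bad)
    except json.JSONDecodeError as exc:
        # e.g. "Expecting property name enclosed in double quotes"
        print(type(exc).__name__)
```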

Hey look, a human-based JSON parser. I wonder what their maximum nesting level is...

Over-under on when the first "ML-based approach to parsing JSON" story hits HN: 1.5 years from now.

Just a couple of weeks ago I had to debug a production problem where the Jackson parser failed on parsing prod-specific configuration (in JSON). It turned out the problem was a trailing comma in an array of values. The IntelliJ IDEA JSON syntax highlighter did not flag it, due to some obscure setting.

I also had problems with some JSON parsers allowing NaN and Infinity as double values and others failing on them.


For future reference, if your JSON parser accepts NaN, Infinity, or trailing commas, it is broken.

I suspect IntelliJ is using their JavaScript grammar for JSON.
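By that standard, Python's stdlib parser is "broken" out of the box: it accepts NaN and Infinity by default, and you have to override `parse_constant` to get strict behavior (the `reject` helper below is made up for this sketch):

```python
import json
import math

# Accepted by default, even though the JSON spec disallows these tokens:
v = json.loads('{"x": NaN, "y": Infinity}')
assert math.isnan(v["x"]) and math.isinf(v["y"])

# Strict behavior: reject the non-standard constants explicitly.
def reject(name):
    raise ValueError(f"non-standard JSON constant: {name}")

try:
    json.loads('{"x": NaN}', parse_constant=reject)
except ValueError as exc:
    print(exc)  # non-standard JSON constant: NaN
```

Trailing commas, on the other hand, it does reject, so parsers really are inconsistent about which extensions they tolerate.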


For a moment, I was worried that someone had found the JSON parser I wrote in Python years ago. My career would be over.

Regardless of the quality of this post, the author does highlight a security issue here. The RFC states that it's up to the parser to define the limits and different parsers may behave differently in the face of malicious inputs. That's why you should probably test them...

Nice effort, but it seems to miss things. RapidJSON for C++ is a very popular library. And Go has both the standard library JSON parser and a couple of other popular implementations. I would gladly add implementations for Go, but it is unclear how to do that.

Since the document is on GitHub, it's perhaps a safe assumption the author may be amenable to pull requests (and if they're not, you can always fork their repo and add additional information; they MIT licensed the whole thing).

Indeed, I'd love to get pull requests. It's mentioned in the "Remarks" at the end.

Right at the bottom:

> If you want to add a new language or a new library, feel free to open a pull request.


Plans are to remove RapidJSON from Envoy proxy - https://github.com/envoyproxy/envoy/issues/4705

I added RapidJSON to the tests. It segfaults at 87,289 nested arrays.

Guile Scheme expands the call stack until memory is exhausted and then raises an exception. That is pretty damn comfy, because it means that building lists with recursion is just as fast as using tail recursion and a reverse! at the end.

Interesting results e.g. on Ruby. I did give the javascript test file a whirl and was able to run it just fine with a nesting level of 1,000,000, so I wonder if something else is happening there.

JSON nested more than 100 levels deep??

Why call a JSON parser "bad" when the JSON object has already gone far beyond "bad"?

PLEASE don't call anything bad just because it doesn't fit your own standards.


Much more detailed comparison of JSON parsers: http://seriot.ch/parsing_json.php

For the Java test, I’d love to see how Jackson compares to Gson.

Someone just contributed a pull request with results for Jackson in java

And it’s already been merged (nice turnaround time!). It narrowly beats Gson.

PicoLisp successfully parsed 20M nested brackets, the maximum I can do on an 8GB laptop.

"Ubuntu Linux 4.10.0-35-generic SMP x86_64 with 8Gb RAM" Gb != GB.

if you allow a JSON parser to use all available memory to go deep, you know what attackers will do... As other commenters mentioned, the JSON spec allows setting a limit on the depth.

It isn't a matter of memory usage; the memory usage is still linear in the size of the input regardless of the depth. If an attacker can get you to oom by feeding you a deeply nested data structure, they can do the same with a shallow one just as easily:

"[0, 0, 0, 0, 0, <...gigabytes of zeros>, 0 0, 0, 0]"

So you need to limit the overall input size regardless of whether you already have a depth limit in place.
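A byte-size guard applied before parsing, as suggested, might look like this (the limit value and the `safe_loads` name are arbitrary choices for the sketch):

```python
import json

MAX_BYTES = 1 << 20  # 1 MiB; pick a limit that fits your application

def safe_loads(raw: str):
    # Reject oversized payloads before the parser touches them; a depth
    # limit alone doesn't prevent memory exhaustion from wide inputs.
    if len(raw.encode("utf-8")) > MAX_BYTES:
        raise ValueError("JSON payload exceeds size limit")
    return json.loads(raw)
```

The same check rejects a gigabyte of shallow zeros and a gigabyte of nested brackets alike, which is the point: the size limit is the real defense, and the depth limit only protects the parser's own stack.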


The memory used by parsers that do not limit the maximum depth is still bounded by the size of the input.

An application that does not limit the maximum size of its input in bytes is vulnerable, one that does is not. This is completely independent of the JSON parser.


Has anyone run this for Newtonsoft.Json?

there's a .net parser in the source code but not in the results

Would you be interested in making a PR adding the .NET project to travis and the results to the README ?

Java Jackson is bad too.


