
Parsing JSON Is a Minefield (2018) - panic
http://seriot.ch/parsing_json.php
======
userbinator
I suppose you could say that parsing _any_ text-based protocol in general "Is
a Minefield". They look so simple and "readable", which is why they're
appealing initially, but parsing text always involves lots of corner cases,
and I've always thought it a huge waste of resources to use text-based
protocols for data that isn't actually meant for human consumption the vast
majority of the time.

Consider something as simple as parsing an integer in a text-based format:
there may be whitespace to skip, an optional sign character, and then a loop
to accumulate digits and convert them (itself a subtraction, multiply, and
add), and there are still questions about all the invalid cases and what they
should do. In contrast, in a binary format, all that's required is to read the
data, and the most complex thing which might be required is endianness
conversion. Length-prefixed binary formats are almost trivial to parse, on par
with reading a field from a structure.
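
A rough sketch of the difference in JavaScript (my own illustration; DataView stands in for reading raw bytes):

    // Text: skip whitespace, handle an optional sign, loop over digits.
    const fromText = parseInt("  -42", 10);

    // Binary, length-prefixed: read the length, then the fixed-width value.
    // (DataView even handles the endianness flag for you.)
    const view = new DataView(new ArrayBuffer(8));
    view.setUint32(0, 4, true);      // length prefix: 4 bytes of payload follow
    view.setInt32(4, -42, true);     // the payload itself, little-endian
    const payloadLength = view.getUint32(0, true);   // 4
    const fromBinary = view.getInt32(4, true);       // -42
    console.log(fromText === fromBinary);            // true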

~~~
robocat
Binary formats have their own serious problems.

> Length-prefixed binary formats are almost trivial to parse

They definitely are not, as shown by the fact that binary lengths are the
root cause of a huge number of security flaws. JSON mostly avoids that.

> the most complex thing which might be required is endianness conversion

That is a gross simplification. When you look at the details of binary
representations, things get complex, and you end up with corner cases.

Let's look at floating point numbers: with a binary format you can transmit
NaN, Infinity, -infinity, and -0. You can also create two NaN numbers that do
not have the same binary representation. You have to choose single or double
precision (maybe a benefit, not always). Etc.

Similarly, in JSON, integers or arrays of integers are nothing special. It is
mostly a benefit not to have to specify UInt8Array.

JSON is one of many competitors within an ecology of programming, including
binary formats, and yet JSON currently dominates large parts of that ecology.
So far a binary format mutation hasn't beaten JSON, which is telling since
binary had the early advantage (well: binary definitely wins in parts of the
ecology, just as JSON wins in other parts).

~~~
kentonv
> with a binary format you can transmit NaN, Infinity, -infinity, and -0. You
> can also create two NaN numbers that do not have the same binary
> representation. You have to choose single or double precision (maybe a
> benefit, not always). Etc.

Boy you are really stretching to make this sound complicated. It's not. You
transmit 4 bytes or 8 bytes. Serialization is a memcpy().

You don't have to think about NaNs and Infinities because they Just Work --
unlike with textual formats where you need to have special representations for
them and you have to worry about whether you are possibly losing data by
dropping those NaN bits. If you _want_ to drop the NaN bits in a binary
format, it's another one-liner to do so.

It's funny that you choose to pick on floating-point numbers here, because
converting them to decimal text and back is _insanely_ complicated.
One of the best-known implementations of converting FP to text is dtoa(),
based on the paper (yes, a whole paper) called "How to Print Floating-Point
Numbers Accurately". Here's the code:

[http://www.netlib.org/fp/dtoa.c](http://www.netlib.org/fp/dtoa.c)

Go take a look. I'll wait.

dtoa() is not even the state of the art anymore. Just in the last few years
there have been significant advances, e.g. Grisu2, Grisu3, and Dragon4...

Again, in binary formats, all that is replaced by a memcpy() of 4 or 8 bytes.
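
To make this concrete, here's a quick illustrative sketch in JavaScript, with DataView standing in for the memcpy():

    const view = new DataView(new ArrayBuffer(8));

    view.setFloat64(0, -0);                            // "serialize": write the 8 bytes
    console.log(Object.is(view.getFloat64(0), -0));    // true: -0 round-trips

    view.setFloat64(0, Infinity);
    console.log(view.getFloat64(0) === Infinity);      // true: no special text form needed

    // The textual route needs an explicit policy for these values:
    console.log(JSON.stringify({ x: Infinity }));      // '{"x":null}' -- data silently dropped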

(A previous rant of mine on this subject:
[https://news.ycombinator.com/item?id=17277560](https://news.ycombinator.com/item?id=17277560)
)

> > Length-prefixed binary formats are almost trivial to parse

> They definitely are not, as displayed by the fact that binary lengths are
> the root cause of a huge number of security flaws. JSON mostly avoids that.

Injection (forgetting to escape embedded text) is the root cause of a huge
number of security flaws _for text formats_. Length-prefixed formats do not
suffer from this.

What "huge number of security flaws" are you referring to that affect length-
delimited values? Buffer overflows? Those aren't caused by values being
length-delimited, they are caused by people deserializing variable-length
values into fixed-length buffers without a check. That mistake can be made
just as easily with a text format as with a binary format. In fact I've seen
it much more often with text.

> JSON currently dominates large parts of that ecology.

JSON wins for one simple reason: it's easy for human developers to think
about, because they can see what it looks like. This is very comforting. It's
wasteful and full of pitfalls ("oops, my int64 silently lost a bunch of bits
because JSON numbers are all floating point"), but comforting. Even I find it
comforting.

Ironically, writing a full JSON parser from scratch is much more complicated
than writing a full Protobuf parser. But developers are more comfortable with
the parser being a black box than with the data format itself being a black
box. ¯\\_(ツ)_/¯

(Disclosure: I am the author of Protobuf v2 and Cap'n Proto. In addition to
binary native formats, both have text-based alternate formats for which I
wrote parsers and serializers, and I've also written a few JSON parsers in my
time...)

~~~
Avamander
I think people undervalue clean-looking (alphabet-only, few special characters)
things, things that don't require people to use the symbol-parsing part of
their brain. Basically, easily human-parseable things. I suspect this
phenomenon can be observed in the case of the relative popularity of JSON, TOML,
YAML and Python plus the relative unpopularity of Lisp, Haskell, Rust and XML.
And if we look at protobuf in this context, it is not easy for humans to parse,
which makes people not want to use it; developers are not

> more comfortable with the parser being a black box

they're more comfortable with the parser being a black box but the format
being relatively easy to parse, rather than with the parser being easy to
understand but the format being basically unreadable to a human.

~~~
dragonwriter
> I think people undervalue clean-looking (alphabet-only, few special
> characters) things, things that don't require people to use the symbol-
> parsing part of their brain. Basically, easily human-parseable things.

The symbol parsing part of the human brain is what parses letters and numbers,
as well as other abstract symbols. The division of symbols into letters,
numbers, and others is fairly arbitrary. Most people would say “&” is not a
letter, but the modern name of that symbol (“ampersand”) is a smoothing-over of
the way it was recited when it was considered part of the alphabet and recited
with it.

> I suspect this phenomenon can be observed in the case of relative popularity
> of JSON, TOML, YAML and Python plus the relative unpopularity of Lisp,
> Haskell, Rust, XML.

I suspect not: Lisp and Haskell have _less_ use of non-alphanumeric characters
than most more-popular general purpose languages, and not significantly more
than Python; also, if this was the phenomenon in play, popularity would be
TOML > YAML > JSON but in reality it's closer to the reverse.

~~~
Avamander
> The symbol parsing part of the human brain is what parses letters and
> numbers, as well as other abstract symbols.

I really don't think that's true when you talk about someone using the
Latin alphabet and words in that alphabet, compared to some other alphabet (e.g.
{}():!) and "words" (or meanings) in those. Just as a crude example, parsing "c
= a - b", where equals and minus are one symbol each and have been taught for
a while, is different from parsing "c := a << b", where ":=" and "<<" basically
act as a separate meaning someone has to learn to understand. Similar to the
difference between the Latin alphabet and, say, simplified Chinese.

> also, if this was the phenomenon in play, popularity would be TOML > YAML >
> JSON but in reality it's closer to the reverse.

There could be somewhat of a sigmoid response to the effect: a decreased
reaction if you go to either extreme, compared to deviating from the average.

I'm not a linguist, so this is just my speculation; don't take it too seriously :D

------
juliusmusseau
Because of this article (which I encountered a year ago) I would say Parsing
JSON is _no longer_ a minefield.

I had to write my own JSON parser/formatter a year ago (to support Java 1.2 -
don't ask), and this article and its supporting GitHub repo
([https://github.com/nst/JSONTestSuite](https://github.com/nst/JSONTestSuite))
were an unexpected gift from the heavens.

~~~
AgentOrange1234
Wait. How is this no longer a minefield just because there is a test suite
that identifies some tricky cases?

Doesn’t the test suite’s matrix demonstrate that there are tons of cases that
aren’t handled consistently across these parsers?

~~~
juliusmusseau
Good point. I am presuming the test suite is comprehensive. Does it cover 100%
of all JSON mines? Probably not. But it surfaced about 30 bugs in my own
implementation - things I would have never dreamed of.

So it certainly helped me. And just based on how thorough and insane the test
suite is, I think I'm in good hands. Not perfect hands - but definitely a
million times better than anything I would have come up with on my own.

The test suite made my parser blow up many times, and for each blow up I got
to make a conscious decision in my bugfix: how do I want to handle this?

(I decided to let the 10,000 depth nested
{{{{{{{{{{{{{{{{{{"key","value"}}}}}}}}}}}}}}}}}}} guy blow up even though it
is legal. Yes, I'm too lazy to implement my own stack.) :-)

------
Multicomp
This might be throwing a lit match into a gasoline refinery, but why not opt
for XML in some circumstances?

Between its strong schema support and WSDL support for internet standards like
SOAP web services, XML covers a lot of ground that JSON encoding doesn't
necessarily have without add-ons.

I say this knowing this is an unfashionable opinion and XML has its own
weaknesses, but in the spirit of using web standards and LoC approved
"archivable formats", IMO there is still a place for XML in many serialization
strategies around the computing landscape.

JSON is perfect for serializing between client and server operations or in
progressive web apps running in JavaScript. It is quite serviceable in other
places as well, such as microservice REST APIs, but in other areas of the
landscape - middleware, database record excerpts, desktop settings, data
transfer files - JSON is not much better, and is sometimes even slightly worse,
than XML.

~~~
AtlasBarfed
XML cannot be parsed into nested maps/dictionaries/lists/arrays without
guidance from a type or a restricted XML structure.

JSON can do that. It also maps pretty seamlessly to types/classes in most
languages without annotations, attributes, or other serialization guides.

It also has explicit indicators for lists vs. subdocuments vs. values for keys,
which XML does not. XML tags can repeat, can have subtags, and then there are
tag attributes. A JSON document can also be a list, while XML documents must
be a tree with a single root element.
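
A small illustration of the ambiguity (my own example):

    <!-- Without a schema, is "item" a one-element list or a single value? -->
    <items><item>1</item></items>

In JSON, {"items": [1]} and {"items": 1} are structurally distinct, so the parser never has to guess.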

XML may be acceptable for documents. But seeing as how XHTML was a complete
dud, I doubt it is useful even for that.

And we didn't even need to get into the needless complexity of validation,
namespaces, and other junk.

~~~
crispyambulance
XML is a perfectly serviceable data exchange format. The parsers and
serializers work great when used properly. It's nice to have schema.

But I think people just got sick of XML because it was abused so badly with
"web services", SOAP, WSDL and all those horrible technologies from the early
aughts. Over-complicated balls of mud that made people miserable.

~~~
eknkc
Apple's plist format might be the weirdest abuse of XML as far as I can tell.
The SOAP envelopes and shit like that were horrible but plist is plain weird.

Everyone abused XML some way or another. JSON is not that "abusable" I'd say.

~~~
amaccuish
PLIST is pretty flexible though, the underlying storage can be XML, binary or
even JSON now.

~~~
cellularmitosis
Implementing a binary plist encoding on your REST endpoints is actually pretty
great for iOS devices.

------
inopinatus
Once you've parsed the first minefield, another crop emerges: interpreting the
result. Even the range of values seen in the wild for a supposedly simple
boolean attribute is just mind-boggling. Setting aside all the noise from
jokers trying it on with fuzzing engines, we'll see all of these presented to
various APIs:

    
    
        true
        false
        null
        0 | 1
        "true" | "false"    (with assorted variation by
        "yes"  | "no"        case and initial character)
        "" | "0" | "1"
        "\u2713"            (hi DHH)
        -1                  (with complements)
        "[object Object]"
        { "value": true }   (and friends)
                            (attribute not present)
        "敵牴"
    

That last one looks like a doozy, but old lags will guess what's going on right
away. It's the octets of the 8-bit string "true", misinterpreted as UCS-2
(16-bit wide character) code points and then spat out as UTF-8. Google
translates it, quite appropriately, as "Enemy".

Oddly though, according to my records, I've never seen a "NULL".

------
umvi
I'm fine with a parser that doesn't get all of the corner cases as long as it
fails gracefully.

Really, the only time it would matter is if you are parsing user-provided JSON
and said user was trying to exploit your parser somehow.

But 99% of the time, I'm not parsing user-provided JSON, so I don't ever
encounter these corner cases and parsing/serialization works great.

~~~
juliusmusseau
What about the 2^63 corner-case?

Consider this JSON: {"key": 9223372036854775807}. With most parsers it never
fails.

But... some JSON parsers (including JS eval) parse it to 9223372036854776000 and
continue on their merry way.

The problem isn't user-provided JSON here. The problem is user-provided data
(or computer-provided data) that's inside the JSON.

rachelbythebay's take
([http://rachelbythebay.com/w/2019/07/21/reliability/](http://rachelbythebay.com/w/2019/07/21/reliability/)):

 _On the other hand, if you only need 53 bits of your 64 bit numbers, and
enjoy blowing CPU on ridiculously inefficient marshaling and unmarshaling
steps, hey, it's your funeral._
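
You can see the silent loss in any JS console (my own illustration):

    // Both literals snap to the same nearest double, 2^63:
    JSON.parse('9223372036854775807') === JSON.parse('9223372036854775806')   // true
    Number.isSafeInteger(JSON.parse('9223372036854775807'))                   // false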

~~~
jacobolus
> _some JSON parsers (including JS eval) parse it to 9223372036854776000 and
> continue on their merry way_

This is correct behavior though...? Every number in JSON is implicitly a
double-precision float. JSON doesn’t distinguish other number types.

If you want that big a string of digits in JSON, put it in a string.

Edit: let me make a more precise statement since several people seem to have a
problem with the one above:

Every number that you send to a typical JavaScript JSON parser is implicitly a
double-precision float, and _it is correct behavior_ for a JavaScript JSON
parser to treat a long string of digits as a double-precision float, even if
that results in lost precision.

The JSON specification itself punts on the precise semantic meaning of
numbers, leaving it up to producers and consumers of the JSON to coordinate
their number interpretation.
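
For example (my own sketch), that coordination can be as simple as agreeing that int64 values travel as strings:

    // Producer encodes the int64 as a string...
    const payload = JSON.stringify({ id: "9223372036854775807" });

    // ...and the consumer opts in to a wider type explicitly:
    const id = BigInt(JSON.parse(payload).id);   // 9223372036854775807n, nothing lost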

~~~
justincormack
Every number in JavaScript is a double-precision float; JSON itself does not specify that.

~~~
jacobolus
“JavaScript Object” is right there in the name.

~~~
overgard
Java is also in the name JavaScript, but we know how much that has to do with
it.

~~~
TheRealPomax
Indeed, we've all read that history. And we all know how much JavaScript has
to do with "JavaScript Object Notation", too, right? Basically everything?

~~~
overgard
Other than shared syntax, JSON is its own thing.

------
truth_seeker
Just recently, V8, the JS engine, rewrote its JSON parsing code to achieve up to
2.7x faster parsing while also making it more memory efficient.

Ref link:
[https://v8.dev/blog/v8-release-76](https://v8.dev/blog/v8-release-76)

~~~
majewsky
Ah, so that's the source of that Chrome bug that we saw last week. Customers
on Chrome for Windows (only that, not Chrome for Linux or macOS) were
complaining that the search on our statically-generated documentation site was
not working. The search is implemented by a JavaScript file that downloads a
JSON containing a search index, and it turns out that this search index had
too much nesting for Chrome on Windows's JSON parser. This would reliably
produce a stack overflow:

    
    
      JSON.parse(Array(3000).join('[')+Array(3000).join(']'))
    

We were about to report a bug when we noticed that the problem was fixed in
Chrome 76, and the users in question were still on Chrome 75.

~~~
zazagura
Pretty weird for a JSON parser to be platform dependent.

~~~
CamouflagedKiwi
Maybe the stack is shallower on Windows, or the calling convention takes more
space per function and is enough to push it over the edge.

~~~
nh2
Recursion-based implementation of parsers in languages with limited stack size
is a programmer mistake.

~~~
justincormack
All languages have limited stack size.

~~~
dwohnitmok
Eh... not really if you're referring to call stacks. Rust at one point had
growable stacks. That was removed for performance reasons. Haskell with GHC
kind of has growable stacks (basically IIRC most function calls occur on the
heap) and its stack overflows take a different form. SML I think at one point
also had an implementation with a growable call stack.

~~~
stkdump
I guess this is a nitpick of limited vs. fixed. Even if you grow the stack, at
some point you can't grow anymore.

~~~
dwohnitmok
If by limited you mean limited by the amount of memory your machine has then
yes it's limited, but I don't think that's what parent was getting at, since
in that sense everything about a computer is limited.

------
iamleppert
Check out the simdjson project if you’re interested in a super-fast JSON
parser:

[https://github.com/lemire/simdjson](https://github.com/lemire/simdjson)

I’ve been using it to process and maintain giant JSON structures, and it’s
faster than any other parser I’ve tried. I was able to replace my previous
batch job with this, as it gives real-time performance.

~~~
kthejoker2
How does it do on the article's test suite?

~~~
glangdale
[ Original designer of much of simdjson here ]

We haven't used that particular suite, but almost everything in that suite is
something we've thought about. In many cases we do the right thing by not
innovating and randomly allowing stuff that isn't in the spec.

I see exactly one thing we didn't think about, as our construction of a parse
tree is pretty basic and we don't build an associative structure even when
building up an object - thus we would not register an error when confronted
with the malformed input listed under "2.4 Objects Duplicated Keys", but
happily build a parse tree with duplicated keys (which will be built up
strictly as a linear structure, not an associative one).

There seems to be leeway on this point as to what an implementation should do.
It certainly doesn't fit our usage model very well to build an associative
structure right there on the spot - some of our users wouldn't want that much
complexity/overhead.

------
carapace
ASN.1

> Abstract Syntax Notation One (ASN.1) is a standard interface description
> language for defining data structures that can be serialized and
> deserialized in a cross-platform way. It is broadly used in
> telecommunications and computer networking, and especially in cryptography.

[https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One](https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One)

Or keep re-inventing the wheel. It's not like the people paying you will
notice or care, eh?

~~~
nimish
ASN.1 is incredibly hard to actually implement. There are dozens of cases of
security bugs based on bad parsers. Also, there are a dozen different encodings
of ASN.1 data, including JSON (JER). Its age also means that it has a bunch of
obsolete datatypes.

Protobuf and friends have most of the power without a lot of the drawbacks.

~~~
carapace
> ASN.1 is incredibly hard to actually implement.

For whom?

> There are dozens of cases of security bugs based on bad parsers.

You're saying this on a thread called "Parsing JSON Is a Minefield", eh?

In any event, this is not unique to ASN.1. I haven't checked but I don't doubt
there are similar cases for Protobuf, etc.

> Also, there are a dozen different encodings of ASN.1 data, including JSON
> (JER).

So what? That's the opposite of a problem.

> Its age also means that it has a bunch of obsolete datatypes.

So don't use them.

- - - -

My point is that if the time and effort that was spent on Protobuf and
CapnProto and all the others had somehow been spent instead on perfecting
ASN.1 then, uh, that would have been good...

~~~
kentonv
> My point is that if the time and effort that was spent on Protobuf and
> CapnProto and all the others had somehow been spent instead on perfecting
> ASN.1 then, uh, that would have been good...

I wrote proto2 in 20% time at Google and I developed Cap'n Proto entirely on
my own time, unpaid. If you think ASN.1 could be perfected with a similar
amount of work then why don't you do it?

~~~
carapace
It seems like I may have offended you, I didn't mean to, and I apologize.

I'd love to discuss this but don't want to get in a flame war.

In re: ASN.1, if I ever have to de/serialize some messages again (I'm quasi-
retired ATM), I would use ASN1SCC, "an ASN.1 compiler that was developed for ESA
to cover all data modelling needs of space applications."

> The compiler is targetting safe systems and generate either Spark/Ada or C
> code. Runtime library is minimalistic and open-source. The tool handles
> custom binary encoding layouts, is fully customizable through a code
> templating engine, generates ICDs and automatic test cases."

[https://essr.esa.int/project/asn1scc-asn-1-space-certifiable-compiler](https://essr.esa.int/project/asn1scc-asn-1-space-certifiable-compiler)

------
exabrial
I miss the days of strongly typed schemas. It's much easier to fail
gracefully.

~~~
Spivak
Throw your support behind [https://json-schema.org](https://json-schema.org) -
it's a great effort.
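
A minimal example of what a schema buys you (my own illustration, using draft-07 syntax):

    {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "required": ["id", "active"],
      "properties": {
        "id":     { "type": "integer" },
        "active": { "type": "boolean" }
      }
    }

A validator can then reject {"active": "true"} outright, instead of letting the string-vs-boolean ambiguity leak into application code.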

~~~
rapsey
Or use something sane like protocol buffers

~~~
nh2
Protobuf crashes with data larger than 2GB (JSON does not).

This severely limits its usefulness, unless you want to put workarounds for
this shortcoming into your application.

Other protobuf-like things, like Cap'n Proto, don't have that restriction,
because they don't use 32-bit integers for sizes.

~~~
kentonv
You really don't want to put >2GB in a single protobuf (or JSON object). That
would imply that in order to extract any one bit of data in that 2GB, you have
to parse the entire 2GB. If you have that much data, you want to break it up
into smaller chunks and put them in a database or at least a RecordIO.

Cap'n Proto is different, since it's zero-copy and random-access. You can in
fact read one bit of data out of a large file in O(1) time by mmap()ing it and
using the data structure in-place.

Hence, it makes sense for Cap'n Proto to support much larger messages, but it
never made sense for Protobuf to try.

Incidentally the 32-bit limitation on Protobuf is an implementation issue, not
fundamental to the format. It's likely some Protobuf implementations do not
have this limitation.

(Disclosure: I'm the author of Protobuf v2 and Cap'n Proto.)

------
failrate
Parsing is a minefield. General purpose computing systems are minefields. Of
all human readable formats I've ever worked with, only S-expressions have
proven easier and safer to parse. Json.org even has unambiguous railway
diagrams!

~~~
filoeleven
I wish EDN would catch on. The simplicity of JSON with better number handling,
a few more very useful data types like namespaced keywords and sets,
arbitrarily complex keys, and an even terser, more readable syntax. [k v k v]
beats [k: v, k: v] hands down, and since commas are just whitespace you can
use them if you want.

[https://github.com/edn-format/edn](https://github.com/edn-format/edn)
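
A rough taste of it (my own example):

    {:user/name "Ada"                 ; namespaced keyword
     :roles     #{:admin :ops}        ; a set
     :scores    [98 87 93]
     [3 4]      "a composite key"}    ; keys don't have to be strings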

------
nullwasamistake
JSON sucks. Maybe half our REST bugs are directly related to JSON parsing.

Is that a long or an int? Boolean or the string "true"? Does my library
include undefined properties in the JSON? How should I encode and decode this
binary blob?

We tried using OpenAPI specs on the server and generators to build the
clients. In general, the generators are buggy as hell. We eventually gave up,
as about 1/4 of the endpoints generated directly from our server code didn't
work. One look at a spec doc will tell you the complexity is just too high.

We are moving to gRPC. It just works, and takes all the fiddling out of HTTP.
It saves us from dev slap fights over stupid cruft like whether an endpoint
should be PUT or POST. And saves us a massive amount of time making all those
decisions.

~~~
hu3
Off-topic, but I'd want to work at a place where half the REST bugs are from
JSON parsing.

~~~
craigds
Yeah I don't believe I've _ever_ seen a json parsing problem in 11 years of
software development.

------
rendaw
Plugging my amazing JSON-like format!
[https://gitlab.com/rendaw/luxem](https://gitlab.com/rendaw/luxem)

In case anyone doesn't click the link: it's much simpler than JSON, leaves
interpretation up to the reader, supports polymorphic data, and has some
tweaks to make it nicer to edit by hand. I've used this in a bunch of personal
projects and it maps all the models I've come across perfectly. If you need
more power in your format, you're better off using Lua than YAML.

I have a Rust Serde implementation 90% complete I could finish up if anyone
wants it.

~~~
mkl
Trailing commas are of course very sensible!

This doesn't seem simpler than JSON otherwise, though, e.g. type declarations
and optional quotes.

Why is leaving interpretation up to the reader desirable? Shouldn't things
always come out the same?

Asterisks are an unusual choice of comment syntax. What if your comment needs
to contain an asterisk? Why not "//..." and/or "/* ... */", or "#..."?

------
jasonhansel
Compared to XML, Markdown, and other human readable formats, this
is...actually not too bad. I was expecting worse.

------
Arrezz
The bigger question is, what is there to be done? What is the road to more
uniform handling of JSON? I've handled some JSON before, and it's usually
fairly easy until you catch one of these strange implementation quirks. But
I'm not sure those quirks can be ironed out at this point.

~~~
erikpukinskis
Can you help me understand the problem? These things seem like corner cases
that you could just Not Do(TM) and then you don’t have to worry about it.

What am I missing, when do these gotchas become an actual problem for you as a
developer?

I’m not sure I’ve used any technology that was free of footguns, and JSON
appears to have fewer of them than the average programming language or
library.

~~~
falcolas
I call this kind of answer “The C Answer”. “Who cares if this particular
combination of code results in undefined behavior? Just don’t do it!”

> when do these gotchas become an actual problem for you as a developer?

Whenever you have to deal with JSON produced by “not you”, or when you have to
deal with JSON that may have been corrupted in some fashion along the way.

~~~
erikpukinskis
I’m probably just not used to pure “all behavior is defined” systems, so I
appreciate your perspective.

What industry do you work in? I don’t see many systems like that in my
industry. I work like hell to push things in that direction, but it’s a best
case of “we went from 5% well defined behavior to 50%” after many years of
effort.

------
peterwwillis
The product owner perspective on this should be "nothing supports anything
unless it is tested".

If you pick up a standard and just assume other products will be able to work
with it, you're in for a surprise. I don't care if it's TCP sockets or .ini
files; if you didn't test compatibility with the product you expect will
interact with yours through the standard, consider it unstable, and don't
advertise support for it.

Sometimes you _have_ to support a standard itself, like WPA2, so you implement
the standard according to internet engineering best practice: be liberal with
what you accept, and conservative with what you transmit (or something to that
effect). Then test compatibility with the major products you know will want to
use it, and fix the bugs you find.

------
SigmundA
I've been of the opinion for a while that a lot of issues could be resolved if
we agreed on a streamable binary format that had good definitions for data
types (including integers and dates).

String formats are great and all for viewing in whatever text viewer, but
they're so inefficient, and then you have the whole business of escaping
strings inside of strings and string-encoding binary data.

If we all agreed on a binary format then there would be a viewer for it in
every debugging tool.

ASN.1, Protobuf, BSON, Ion, MessagePack, whatever. I would prefer a binary
format that doesn't repeat keys, for efficiency, where the schema can be sent
separately or inlined. But even one that's basically binary JSON with more
types would be a step up.

------
mirimir
I'm not a professional coder. And I mostly work with tabular data, in
spreadsheets and SQL. I like to get my data as delimited text files. Ideally,
delimited with some character that's 100% guaranteed to never occur in the
data. In my experience, "|" is often a good option, but you never know. And
CSV, even with quotes, can be a nightmare, especially if the data contains
addresses. Or names with quoted nicknames.

Anyway, given the choice, I always pick JSON over XML. Because with JSON, I
can always identify the data blocks that I need, and parse them out with bash
and spreadsheets. Not with XML, however. Just as not with HTML.

------
saagarjha
> For example, Xcode itself will crash when opening a .json file made of the
> character [ repeated 10000 times, most probably because the JSON syntax
> highlighter does not implement a depth limit.

FWIW, this appears to have been fixed recently.

~~~
tonyedgecombe
Slightly off topic but Xcode crashing has been very common for me, as much as
I like macos the tooling leaves me missing Windows development.

------
warmfuzzykitten
I certainly would not criticize such a thorough examination for being facile,
but I do want to point out that the conclusion "But sometimes, simple
specifications just mean hidden complexity" is not supported by the article.
Almost all of the edge cases are caused by implementors ignoring or extending
the simple specification.

------
ufo
The crashing test cases look scary from a security perspective, especially in
the C-based parsers. Does anyone know if these results are still up to date or
if the bugs have already been fixed?

------
ape4
Relaxed JSON is pretty good.
[http://www.relaxedjson.org/](http://www.relaxedjson.org/)

~~~
krispbyte
Relax it a bit more and you get Neon: [https://ne-on.org/](https://ne-on.org/)

------
jstewartmobile
This is all well and good, but which decoders are already in the browser?

XML and JSON last I checked.

------
rurban
Again: parsing is a minefield in general, but parsing JSON is one of the
easiest tasks of all serialization formats. It's also the only secure format.

Its various spec bugs (by omission) are not that dramatic, and the various
"enhancements" only made it worse, i.e. more insecure. Still, bad, but not a
minefield. What worries me most is that my JSON module is the de facto Perl
standard, passes all these tests, was the very first to add all these tests,
is the fastest, and still is not included in that list - just some outdated
modules which should not be used at all. Checking best practices, besides
maintaining a spec, obviously also is a minefield.

~~~
rurban
I forgot another major JSON minefield problem which is not mentioned nor
tested here: stack overflow.

This is in fact the most important problem to test against, because it might
lead to an exploitable stack ROP.

JSON is usually parsed recursively, and deeply nested structures are mostly
not depth-counted. One can trivially construct a nested array or map of 500 to
30,000 elements, and at some point the parser either fails or crashes with an
overflow. This number is fixed, thus trivially exploitable. The test spec
should state the max depth for arrays and maps, and whether there's a fixed
built-in limit, a compile-time limit, an implicit limit by crash, or none.
Non-recursive parsers are fine.
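
A sketch of this kind of guard (illustrative JavaScript, not from any particular parser): pre-scan the input and enforce an explicit depth limit before handing it to a recursive parser.

    // Illustrative sketch: cap nesting depth before parsing, so
    // attacker-controlled input can't blow the stack.
    const MAX_DEPTH = 512;   // arbitrary illustrative limit

    function parseWithDepthLimit(text) {
      let depth = 0, inString = false, escaped = false;
      for (const ch of text) {
        if (escaped)                 { escaped = false; continue; }
        if (inString && ch === '\\') { escaped = true;  continue; }
        if (ch === '"')              { inString = !inString; continue; }
        if (inString)                continue;
        if (ch === '[' || ch === '{') {
          if (++depth > MAX_DEPTH) throw new SyntaxError("nesting too deep");
        } else if (ch === ']' || ch === '}') {
          depth--;
        }
      }
      return JSON.parse(text);
    }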

