
Parsing JSON is a Minefield - moks
http://seriot.ch/parsing_json.php
======
skywhopper
While this is true of JSON, it's also true of any other non-trivial
serialization and/or encoding format. The main lessons to learn here are that:

1) implementation matters

2) "simple" specs never really are

It's definitely important to have documents like this one that explore the
edge cases and the differences between implementations, but you can replace
"JSON" in the introductory paragraph with any other serialization format,
encoding standard, or IPC protocol and it would remain true:

"<format> is not the easy, idealised format as many do believe. [There are
not] two libraries that exhibit the very same behaviour. Moreover, [...] edge
cases and maliciously crafted payloads can cause bugs, crashes and denial of
services, mainly because <format> libraries rely on specifications that have
evolved over time and that left many details loosely specified or not
specified at all."

~~~
SamReidHughes
No, this is not true of many reasonable formats. You don't have to make an
obtusely nontrivial format to encode the data JSON does.

~~~
arghwhat
JSON is fairly trivial. The post is a nonsensical rant about parsers accepting
non-JSON compliant documents (as the JSON spec specifically states that
parsers may), such as trailing commas.

In the large colored matrix, the following colors mean everything is fine:
Green, yellow, light blue and deep blue.

Red means crashes (things like 10000 nested arrays causing a stack overflow,
which is a non-JSON-specific parser bug), and dark brown means constructs that
should have been supported but weren't (things like UTF-8 handling, again a
non-JSON-specific parser bug).

Writing parsers can be tricky, but JSON is certainly not a hard format to
parse.

~~~
SamReidHughes
As a question of fact, programs put out JSON that gets misparsed by other
programs. Some simply parse floating point values differently, or they treat
Unicode strings incorrectly, or output them incorrectly. Different parsers
have different opinions about what a document represents. This has a real
world impact.
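
For a concrete illustration of the number problem, a minimal sketch in Python
(the JavaScript behaviour is noted in comments):

    import json

    # Python's json module parses integers with arbitrary precision, so this
    # value survives exactly:
    n = json.loads('9007199254740993')   # 2**53 + 1
    print(n)                             # 9007199254740993

    # A parser that stores every number as an IEEE-754 double (for example
    # JavaScript's JSON.parse) rounds the same document to 9007199254740992:
    # one document, two values, and round-tripping silently changes the data.
    print(float(n))                      # 9007199254740992.0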

Accepting invalid or ambiguous or undefined JSON is not an acceptable
behavior. It means bugs get swallowed up and you can't reliably round trip
data.

~~~
repsilat
> Accepting invalid or ambiguous or undefined JSON is not an acceptable
> behavior

Just to make it explicit (and without inserting any personal judgement into
the conversation myself): JSON parsers should reject things like trailing
commas after final array elements because it will encourage people to emit
trailing commas?

Having asked the question (and now explicitly freeing myself to talk values)
it's new to me -- a solid and rare objection to the Robustness Principle.
Maybe common enough in these sorts of discussions, though? Anyway, partial as
I might be to trailing commas, I do quite like the "JSON's universality shall
not be compromised" argument.

~~~
SamReidHughes
Postel's Law or the "Robustness Principle" is an anti-pattern in general.

Accepting trailing commas in JSON isn't as big a deal as having two different
opinions about what a valid document is. But you might think a trailing comma
could indicate a hand-edited document that's missing an important field or
array element.

------
JasonFruit
This is interesting and important in one way: anything poorly specified will
eventually cause a problem for someone, somewhere. That being said, my first
response was to complete the title, ". . . yet it remains useful and nearly
trouble-free in practice." There's a lot of, "You know what I mean!" in the
JSON definition, but in most cases, we really _do_ know what Crockford means.

~~~
Someone
If your API takes json input, some of those issues are potential security or
DoS issues.

For example, if you validate your json in your web front-end (EDIT: I used the
wrong term. What I meant here is the server-side process that’s in front of
your database) and then pass the string received to your json-aware database,
you’re likely using two json implementations that may have different ideas
about what constitutes valid json.

For example, a caller might pass in a dictionary with duplicate key names, and
the two parsers might each drop a different one, or one might see json where
the other sees a comment.
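
As a quick sketch of the duplicate-key case (Python shown; other parsers
differ):

    import json

    # RFC 8259 says object names "SHOULD" be unique, but doesn't say what a
    # parser must do when they aren't. CPython's json keeps the last value:
    print(json.loads('{"role": "user", "role": "admin"}'))
    # {'role': 'admin'}

    # A front-end validator that keeps the first value would have checked
    # "user" while the database acts on "admin".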

~~~
helaan
Reminds me of last year's CouchDB bug (CVE-2017-12635), which was caused by
two JSON parsers disagreeing on duplicate keys: it was possible to add a
second key with user roles, allowing users to grant admin rights to
themselves. JSON parser issues are real.

~~~
xenadu02
One of the benefits of serialization technology (like Codable+JSONEncoder in
Swift or DataContract in C#) is that you get a canonical representation of the
bits in memory before you pass the document on to anyone else.

By representing fields with enums or proper types you get some constraints on
values as well, eg: If a value is really an integer field then your type can
declare it as Int and deserialization will smash it into that shape or throw
an error, but you don't end up with indeterminate or nonsense values.

This can be even more important for UUIDs, Dates, and other extremely common
types that have no native JSON representation, nor even any agreed-upon
consensus around them.

You get less help from the language with dynamic languages like Python but you
can certainly accomplish the same thing with some minimal extra work. Or
perhaps it would be more accurate to say languages like Python offer easy
shortcuts that you shouldn't take.

In any case I highly recommend this technique for enforcing basic sanitization
of data. The other is to use fuzzing (AFL or libFuzzer).
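
A minimal sketch of the same idea in Python (the `User` type and its fields
are hypothetical, purely for illustration):

    import json
    from dataclasses import dataclass

    @dataclass
    class User:
        name: str
        age: int

    def load_user(payload: str) -> User:
        raw = json.loads(payload)
        # Smash the loose JSON values into declared types, or fail loudly,
        # instead of passing indeterminate values downstream.
        return User(name=str(raw["name"]), age=int(raw["age"]))

    user = load_user('{"name": "Ada", "age": "36"}')  # age coerced to int
    # load_user('{"name": "Ada"}') raises KeyError: missing fields can't
    # slip through silently.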

------
freshhawk
It would be great if programmers learned from markdown and json.

Here is the lesson:

1. We need something simpler, so I will make a simple solution to this
problem.

2. Simple should also mean no strict spec, no support for versioning or any of
those engineer things. All that engineer shit is boring and I can tell myself
this laziness is "staying simple".

3. OH SHIT, I was totally right about #1 so this got popular, and having been
designed as just a toy is causing a lot of problems for a massive number of
people ... now my incompetence regarding #2 is on display and there is nothing
I can do about it.

I'm not saying "Thou shalt always add bureaucracy to your toy projects", but
look at what happened and think about how Gruber and Crockford will be
remembered, partly if not mostly, for being "the asshole who screwed up X". If
you go the other way programmers will think "damn I hate these RFCs, these
suits are messing up the beautiful vision of Saint [your name here]".

~~~
mchanson
Oh yes RFC process always keeps things from having compatibility issues.

I definitely never saw any issues with all those XML based standards like SOAP
or XSLT.

~~~
eadmund
A lot of that is because XML is objectively insane: it's a monumentally over-
specified version of something that a sane community would have sketched out
on the back of a cocktail napkin. XML is S-expressions done wrong. It's a
massive amount of ceremony & boilerplate, IMHO due to the pain of dealing with
dynamic data in static languages. It's basically the Java of data-transfer
languages.

And it shouldn't even be used for data transfer: it's a _markup_ language, for
Pete's sake!

~~~
Mikhail_Edoshin
Please. The XML specification is much shorter than that of YAML, for example,
even though XML 1.0 includes a simple grammar-based validation spec (DTD). "A
markup language"? What does that mean? Are there any special "data-transfer
languages" we neglect? :) Data gets serialized; we need to mark different
parts of it; XML can totally do it. For some cases it's not the best fit, but
nothing is.

------
borplk
I think the idea of humans sharing a language with computers is problematic at
a fundamental level.

(the whole $dataformat "easy to read for humans")

It becomes a source of never-ending lose-lose compromises, where the more
points you give to human convenience the more points you take away from
machine convenience, and vice versa.

Then you end up having to "settle" for something in between that is just
ambiguous and problematic enough that machines can still deal with it, and
just noisy enough that humans can still cope with looking at it and editing
it. That's basically what JSON is.

If we accept to use a transformation step and better tooling we can free the
representation from this tension of "friendly for computers vs friendly for
humans".

It's also a bit odd that we apply this readability obsession only to these
data formats.

I don't hear people wanting a human readable text representation of their
audio, video or images.

~~~
rainbowmverse
>> _I don't hear people wanting a human readable text representation of their
audio, video or images._

This is, in fact, a huge concern for people who think about accessibility.

~~~
borplk
Ok but what I'm talking about is a little more specific.

Talking about "human readability" of JSON and XML is a little bit like talking
about human readability of JPEG or MP3.

Chasing after it creates a lot of problems.

Formats like JSON and XML often carry lots of textual information so it's
tempting to want them to be like "just like text but with some extra stuff"
but that creates its own problems.

So it would be interesting to have something like JSON but philosophically
treat it like MP3. Meaning, don't assume that humans must fiddle with the
bytes in a text editor with great ease so that the representation can be
designed without the influence of "would the raw bytes look pretty to
people?".

~~~
zenhack
What you're proposing sounds like cbor and/or messagepack (which are virtually
identical in their design), or argdata[1].

I agree it's a pretty solid spot in the design space.

[1]: [https://github.com/NuxiNL/argdata](https://github.com/NuxiNL/argdata)
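
A rough sketch of that philosophy in Python (assuming the third-party
`msgpack` package and its usual `packb`/`unpackb` API):

    import msgpack  # pip install msgpack (third-party library)

    # No text-editor ergonomics to serve: the wire format is compact bytes,
    # and the types survive the round trip without a human-readability tax.
    blob = msgpack.packb({"id": 7, "tags": ["a", "b"]})
    print(len(blob))              # noticeably smaller than the JSON text
    print(msgpack.unpackb(blob))  # {'id': 7, 'tags': ['a', 'b']}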

------
eponeponepon
This is the one thing that the JSON-against-XML holy warriors need to
understand properly. Yes, JSON's less verbose; yes, it's just "plain text" (in
as much as there is such a thing); yes, XML makes you put closing tags in -
but if you need reliable parsing and rock-solid specifications (and it's
reasonably likely that you do, even if you think you don't...), then XML, for
all its faults, is very likely the better way.

~~~
3pt14159
Disagree.

I can always make my JSON act like XML if I want to. When I'm following
something like JSON API v1.1 I get a lot of the advantages that I'd get from
XML with 99% less bloat. You want types? Go for it! There are even official
typed JSON options out there. The security / parsing issues with XML alone are
enough for me to rule it out.

How many critical security issues are the result of libxml? Nokogiri / libxml
accounts for 50% of my emergency patches to my servers. ONE RUBY GEM is the
result of _half_ of my security headaches. That's insane. I only put up with
it because _other people_ choose to use XML and I want their data.

How many issues are the result of browsers having to deal with broken HTML (a
form of XML)?

JSON isn't perfect, and I wouldn't use it absolutely everywhere, but it's dead
simple to parse[0], readable without the whitespace issues of YAML, and I
can't think of one place I'd use XML over it.

[0] [http://json.org/](http://json.org/)

~~~
looperhacks
HTML isn't XML. It's close, but it isn't. There's XHTML for that.

~~~
eponeponepon
Just for the record - XML and HTML are both subsets of SGML, somewhat
overlapping, but by no means coterminous with each other (at least until HTML
5 - I'm honestly not sure what its relationship to SGML is).

And, speaking from experience, the XML nay-sayers should largely be glad if
they never had to deal with SGML :)

~~~
teddyh
HTML pretended to be a subset of SGML, but never really was, and the illusion
quickly dispersed as time went on, since HTML was strictly pragmatic and ran
in resource-constrained environments (the desktop), while SGML was academic,
largely theoretical, and ran on servers, analyzing text.

XML, on the other hand, was more of a back-formation – a generalization of
HTML; it was not, as I understand it, directly related to SGML in any way. The
existence of XML was a reaction to SGML being impractical, so it would be
strange if XML directly derived from SGML.

~~~
tannhaeuser
> _XML [...] was not [...] directly related to SGML in any way_

That's incorrect. XML is by definition a proper subset of _WebSGML_ , the SGML
revision specified in ISO 8879:1986 Annex K. These two specifications were
published around the same time and authored by the same people.

In a nutshell, XML added _DTD-less_ SGML (eg. such that every document can be
parsed without markup declarations, unlike eg. HTML which has `img` and other
empty elements the parser needs to know about) and XML-style empty elements.
The features removed from SGML to become XML were tag inference/omission (as
used in HTML), short references (for things such as Wiki syntax, CSV, and even
JSON parsing), uses of marked sections other than `CDATA`, more complex use
cases for notations, and link process declarations ("stylesheets") plus a
couple others.

------
inglor
> For instance, RFC 8259 mentions that a design goal of JSON was to be "a
> subset of JavaScript", but it's actually not.

Actually, it's really close [https://github.com/tc39/proposal-json-
superset](https://github.com/tc39/proposal-json-superset)

This is a stage 3 proposal likely to make it to the next version of the spec.
At which point JSON would truly be a subset of JavaScript.
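
The gap is tiny: JSON allows unescaped U+2028/U+2029 inside strings, while
JavaScript string literals did not before ES2019. A quick check (Python used
only to build and parse the document):

    import json

    # U+2028 LINE SEPARATOR is legal unescaped inside a JSON string...
    doc = '"a\u2028b"'
    print(json.loads(doc))  # parses fine
    # ...but the same character inside a JavaScript string literal was a
    # SyntaxError until the "JSON superset" proposal landed in ES2019.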

~~~
tenken
Better late than never ...

------
kcolford
This brings to mind the old internet motto (someone correct me on the actual
source): "be liberal in what you accept, and be conservative in what you
send".

JSON is pretty clear on what certain things should mean: strings are Unicode
plus escape sequences, objects map keys to values, arrays are ordered
collections of values, the whole serialized payload should be Unicode, etc.
Even those things can be relaxed further. This IMHO is what makes JSON so
robust on the internet and the perfect choice for a non-binary communication
protocol.

~~~
0xcde4c3db
> "be liberal in what you accept, and be conservative in what you send"

This is commonly known as Postel's Law, and comes from one of the TCP RFCs
[1].

[1]
[https://en.wikipedia.org/wiki/Robustness_principle](https://en.wikipedia.org/wiki/Robustness_principle)

~~~
bhldr
This is also widely considered a bad idea now. Making liberal consumers allows
for sloppy producers. Over time this requires new consumers to conform to
these sloppy producers to maintain compatibility.

Just look at the clusterfuck that HTML5 has become. You need to have extremely
deep pockets to enter that market.

~~~
taeric
Do you have a survey or other citation for it being a bad idea? I get that it
enables bad behavior, per se. However, the idea of rejecting a
customer/client because they did not form their request perfectly seems rather
anti-customer.

Ideally, you'd both accept and correct. But that is the idea, just reworded.

~~~
shagie
The Harmful Consequences of Postel's Maxim -
[https://tools.ietf.org/html/draft-thomson-postel-was-
wrong-0...](https://tools.ietf.org/html/draft-thomson-postel-was-wrong-00) (HN
from 2015
[https://news.ycombinator.com/item?id=9824638](https://news.ycombinator.com/item?id=9824638)
)

Wrestling with Postel’s Law [https://techblog.workiva.com/tech-blog/wrestling-
postel’s-la...](https://techblog.workiva.com/tech-blog/wrestling-postel’s-law)

~~~
taeric
The discussion was fun. And seems evenly split, at a quick reading.

More, I think it split on how you read it. If you view it as an absolute maxim
to excuse poor implementations, it is panned. If you view it as a good faith
behavior not to choke on the first mistake, you probably like it.

This is akin to grammar police. In life encounters, there is no real place for
grammar policing. However, you should try to be grammatically correct.

~~~
still_grokking
> This is akin to grammar police. In life encounters, there is no real place
> for grammar policing. However, you should try to be grammatically correct.

That's because most humans have feelings. But most machines don't. So that's
not comparable.

~~~
taeric
I meant that grammar policing does little to help the exchange of information.
Feelings aside.

~~~
caf
A lack of enforced language standards does end up producing a language full of
difficult-to-learn inconsistencies, though.

~~~
taeric
It makes for inconsistencies that are difficult to codify. Most aren't that
difficult to learn, oddly. Especially if you are just trying to be
conversational.

Edit: I'm specifically going off evidence of teaching my kids. They have
basically picked up language completely by talking to us. Even pronouns,
adjectives, adverbs, etc. What they have not learned is the reasons some
words are used when another could have worked.

------
Groxx
This seems like more of a problem with parsers _not following the spec_. It's
a simple spec, but so strict and restrictive that it's a bit of a pain to give
to humans, and small extensions (like comments) are immensely useful for its
(ab)uses as configuration "DSL"s. And some edges, like the string format not
being fully specified, are (basically) fine - it's not a
connection-negotiation protocol.

So JSON parsers tend to implement some weird, unspecified, inconsistent
superset of JSON. I haven't encountered one yet that fails to parse _valid_
JSON though. That doesn't seem to imply that parsing JSON is a minefield, only
that parsing _human input_, _nicely_, is a minefield. No spec simply avoids
the human-ergonomics problem.

------
protomikron
In my opinion most parsers are just too lax. They support some fancy extension
at the beginning (like comments, which are not a good idea), get more and more
"feature-rich", and end up supporting syntax that is not specified in the
standard.

Other parsers then have to lower their "standard" (no pun intended) to
compete, which leads to more complex edge cases, much like the ones we find
with undefined behaviour in compilers.

E.g. if your HTML is broken, it mostly still renders somehow in your browser,
which is in my opinion bad design - the same is probably true of JSON.

~~~
snowpanda
I totally agree. JSON to me is actually pretty straightforward; it's the
parsers that interpret it differently.

------
teliskr
There are many things in tech which have caused great grief in my life as a
programmer. JSON is not one of them.

------
hsivonen
This document is missing the XMLHttpRequest/Fetch JSON profile. ECMA operates
on a sequence of Unicode code points (really code points, not scalar values!).
WHATWG defines how you go from bytes over HTTP to something you can pass to
the ECMA-specified JSON parser.

------
jwilk
Previously:

[https://news.ycombinator.com/item?id=12796556](https://news.ycombinator.com/item?id=12796556)

------
gumby
We had a nasty liberal-in-what-you-accept JSON problem: using JSON to
communicate between services written in various languages (Python, Java,
Javascript, C++): the python client was simply writing maps out which were
automatically serialized into something _almost_ JSON: {'label': 123} (using '
to delimit the label strings, not "). The Javascript JSON parser would
silently accept this, as would some of the Java libraries, while both the C++
parsers we used would reject it. This was a pain to debug since some of the
modules communicated seemingly perfectly, and of course those developers
didn't see why they should change.

~~~
bastawhiz
JSON is mostly a strict subset of Python. That's not unexpected, but it makes
me question how something like this actually happened. Your bug likely
resulted from someone doing a `str(obj)` instead of `json.dumps(obj)`. Hardly
the fault of JSON for being very similar to Python's default string
serialization.
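
For anyone who hasn't been bitten by this, the difference is a one-character
trap:

    import json

    obj = {'label': 123}
    print(str(obj))         # {'label': 123}   -- Python repr, single quotes
    print(json.dumps(obj))  # {"label": 123}   -- actual JSON, double quotes

    # Lenient parsers shrug at the first form; strict ones rightly reject it.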

~~~
gumby
That is precisely what they did, yes, and some of the client code (e.g.
JavaScript) DWIMed it (accepted it).

------
smel
It's inevitable; nothing is perfect ... there are only popular things that
everyone complains about and things that nobody cares about :D

We're not machines; we're more comfortable with messy and forgiving systems.

Do you want to build successful products? Be liberal on input and conservative
on output. You need to reduce entropy and give people a feeling of magic. It's
when you're old and have enough scars on your skin that you learn to hate
magic and become a control freak :D

------
keymone
i wish there was a chance for EDN[1] to replace JSON. it's a shame the
industry defaulted to a subset of javascript as a data notation format
considering all its shortcomings =/

yeah, i get it, "but it has native support in all browsers" is a valid
argument, i just wish it wasn't.

[1] [https://github.com/edn-format/edn](https://github.com/edn-format/edn)

~~~
xfer
In what regard is EDN "better" than JSON? The point of the post was that the
RFC specification is not tight and there are corner cases. I don't see any
rigor in the given link either.

~~~
keymone
it is better in these regards:

- there exists a single reference implementation, which rules out things like
scalars not being valid JSON in some parsers despite being part of the spec

- it is extensible in a manner that never invalidates the syntax for parsers
that do not use the corresponding extensions (this is huge, actually)

- comments are part of the spec, so it essentially replaces both JSON and YAML

- richer set of primitive types

- commas are whitespace (best feature ever)

the rest should definitely be handled in a BNF spec, but the above makes EDN
immediately _much_ better than JSON

------
Walkman
If parsing JSON is a minefield, what about YAML? :D

~~~
Groxx
A minefield where the mines hunt you down, rather than waiting for you to step
near.

------
rurban
Parsing JSON is not a minefield. It is technically trivial and pretty secure.
Compared to other specs it's not that bad, but of course there are still some
security concerns, esp. in the last two JSON RFC updates, which made it worse
and not better.

But most other commonly used transport formats are much worse, and much harder
to parse. Start reading at [http://search.cpan.org/~rurban/Cpanel-JSON-
XS-4.02/XS.pm#RFC...](http://search.cpan.org/~rurban/Cpanel-JSON-
XS-4.02/XS.pm#RFC7159)

~~~
rgovostes
I don't think it's a rule that JSON parsers are, in general, "pretty secure."
Even if the parser itself is not vulnerable (to, say, hitting recursion
limits), how duplicate keys are handled between parsers has led to security
vulnerabilities in the past for other things such as GET parameters. Or
suppose an attacker gets a message through a few layers and that then causes a
backend server to fail, like with the Swift errors he talks about, causing
data loss.

------
ourcat
Try RSS.

Having built a system years ago to try and parse tens of thousands of feeds,
there's a huge amount of 'fuzzy logic' required to put it all in order.

Despite the spec.

~~~
bmn__
There were nine different and incompatible versions.

[http://web.archive.org/web/2004/http://diveintomark.org/arch...](http://web.archive.org/web/2004/http://diveintomark.org/archives/2004/02/04/incompatible-
rss)

------
nnq
In a different realm, there are people who find even ol' JSON annoyingly
strict and prefer to just grab JSON5 instead
([https://github.com/json5/json5](https://github.com/json5/json5)), at least
for system-local configs.

..there was a wise saying about how you gotta "stop worrying and love the
bomb" ;)

------
rbalsdon
I'm happy to say that my own JSON parser
([https://github.com/ryanbalsdon/cerial](https://github.com/ryanbalsdon/cerial))
passed a lot more of these tests than I expected! It isn’t a general parser
though (requires a pre-defined schema) so is probably cheating.

------
falcor84
I just noticed that the recursion depth test mentions 10000 opening brackets,
while the test code uses `'['*100000` (one order of magnitude more). I am
curious about the actual recursion depth they can handle but don't have
access to Xcode myself.
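
For comparison, a sketch of the same probe against CPython's stdlib parser
(not the Swift code the article tests; limits vary per parser):

    import json

    # CPython's parser recurses once per nesting level, so deeply nested
    # arrays trip the interpreter's recursion limit (default ~1000) long
    # before 100000 brackets are consumed.
    try:
        json.loads('[' * 100000)
    except RecursionError:
        print("blew the recursion limit")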

------
k__
Reminds me of some JSON I got from an API.

It was always malformed and I always wrote the dev that he should fix it.

He always did, but every new endpoint was malformed again.

One day I looked at the code and it was full of string concatenations of DB
results...

~~~
caf

      select '{ "user": { "name": "' || u.name || '", "email": ' || u.email || '" } }' as json from users u;
    

oh my.

~~~
Izkata
...I don't know if it was intentional or not, but you're missing a comma.

------
edejong
If you think parsing JSON is hard, try parsing/generating streaming JSON while
limiting memory bandwidth requirements. Fun exercise, with a push-down
automaton.
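
A taste of the exercise, as a minimal sketch in Python: track nesting depth
over a chunked stream without building a tree, handling strings and escapes
but validating nothing else:

    def max_depth(chunks):
        """Scan a stream of JSON text chunks and report maximum nesting."""
        depth = max_seen = 0
        in_string = escaped = False
        for chunk in chunks:
            for ch in chunk:
                if in_string:
                    if escaped:
                        escaped = False
                    elif ch == '\\':
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                elif ch in '[{':
                    depth += 1
                    max_seen = max(max_seen, depth)
                elif ch in ']}':
                    depth -= 1
        return max_seen

    print(max_depth(['{"a": [1, ', '[2, "]"]]}']))  # 3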

------
Animats
Parsing UTF-8 in the presence of errors is a huge headache in itself.

UTF-8 with "byte order marks"? That makes no sense.
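
CPython's stdlib illustrates the confusion (RFC 8259 says implementations MUST
NOT add a BOM, and parsers MAY ignore one):

    import json

    # json.loads() on bytes sniffs the encoding and strips a UTF-8 BOM:
    print(json.loads(b'\xef\xbb\xbf{"a": 1}'))   # {'a': 1}

    # But the same BOM surviving into an already-decoded str is just a
    # stray U+FEFF, and the parse fails:
    try:
        json.loads('\ufeff{"a": 1}')
    except json.JSONDecodeError:
        print("rejected")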

------
nickthemagicman
Why can't there be a subset of JSON or XML, called something like "strict
mode", that's a much more sane version of the DDL?

~~~
habitue
Why can't there be?

------
drawkbox
Before JSON, XML and standard binary formats, there were just CSV/TSV and
random binary formats, which were a bigger minefield. Simply exchanging data
was a project in itself.

At least JSON and XML are text-based when it comes to data exchange. Back in
the day, before APIs that needed to exchange data cleanly, without JSON/XML,
exchanging data was not only a minefield but one with constant carpet bombing.
The fact that rarely-hit edge cases are all that is left of data-exchange
issues is a huge advancement.

What is great is that in most cases JSON works fantastically and simplifies
data exchange and APIs all the way from front ends to backends. XML is
available if needed. So are standard binary formats now, for really compact
areas like performance-sensitive messaging that humans may never see, or where
no third party needs to parse it. Parsing XML in client-side javascript is
especially not fun, and neither is binary parsing, where adding a value can
break the whole object; JSON keys can come and go.

The engineer can choose the tool for the job, but there had better be a good
reason to use anything over simple JSON; almost any problem can be solved with
it. Engineers should aim to take data complexity and make it as simple as
possible, not take the simple and make it complex for job security. Real
engineers always move toward more simplicity when possible and away from
Vogon ways.

For data that is exchanged between services and front-end/backend, JSON is the
simplified format that makes things move faster. XML got tarred and devolved
into Vogon sludge with SOAP services and nested namespacing/schemas, but it is
still needed in some areas. Standard binary formats fit when you control both
sides, or no one else needs to connect to it, or you don't need it on the
front end, or you need performant real-time messaging. There is also YAML if
you need more typing, or BSON where binary is needed but you want to stay
simple. All formats have good and bad uses, but using binary when JSON will
suffice is not being as simple as possible.

JSON is easy to get around and is more lightweight; if you run into a problem
you can just restructure your JSON to make it work, where binary or XML take
more work to change without breaking things, especially downstream, causing
many more versions and conversions. JSON is a data messaging format meant to
simplify. Most of the issues in the OP article could be solved by storing the
values in a string with a "type" or "info" key that allows conversion in the
backend, i.e. long numbers or hex etc., or by storing binary as base64, etc.

JSON is based on basic CS types: objects, lists, and simple data types like
string, number and bool. This simplifies every system that serializes and
deserializes to it. JSON helps spread simplicity while staying dynamic.

JSON works best with the ever-changing, dynamic data/code/projects we build
today, and in seconds you can be consuming data from third-party APIs faster
and more simply than with any other format; that is why it won.

~~~
eadmund
> Before JSON, XML and standard binary formats, there were just CSV/TSV and
> random binary formats which was a bigger minefield.

S-expressions predate both, are simpler to parse than either, are more legible
than both and are cheaper than either.

Here's a JSON example from
[http://json.org/example.html](http://json.org/example.html):

    
    
        {
            "glossary": {
                "title": "example glossary",
                "GlossDiv": {
                    "title": "S",
                    "GlossList": {
                        "GlossEntry": {
                            "ID": "SGML",
                            "SortAs": "SGML",
                            "GlossTerm": "Standard Generalized Markup Language",
                            "Acronym": "SGML",
                            "Abbrev": "ISO 8879:1986",
                            "GlossDef": {
                                "para": "A meta-markup language, used to create markup languages such as DocBook.",
                                "GlossSeeAlso": ["GML", "XML"]
                            },
                            "GlossSee": "markup"
                        }
                    }
                }
            }
        }
    

In XML it'd be:

    
    
        <!DOCTYPE glossary PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
         <glossary><title>example glossary</title>
          <GlossDiv><title>S</title>
           <GlossList>
            <GlossEntry ID="SGML" SortAs="SGML">
             <GlossTerm>Standard Generalized Markup Language</GlossTerm>
             <Acronym>SGML</Acronym>
             <Abbrev>ISO 8879:1986</Abbrev>
             <GlossDef>
              <para>A meta-markup language, used to create markup
        languages such as DocBook.</para>
              <GlossSeeAlso OtherTerm="GML">
              <GlossSeeAlso OtherTerm="XML">
             </GlossDef>
             <GlossSee OtherTerm="markup">
            </GlossEntry>
           </GlossList>
          </GlossDiv>
         </glossary>
    

And as an S-expression it'd be:

    
    
        (glossary (title "example glossary")
                  (div
                   (title S)
                   (list
                    (entry (id SGML)
                           (sort-as SGML)
                           (term "Standard Generalized Markup Language")
                           (acronym SGML)
                           (def (para "A meta-markup language, use to create markup languages such as DocBook.")
                                (see-also GML XML))
                           (see markup)))))
    

Which is, I believe, a huge improvement.

~~~
djur
The S-expression has cleaner whitespace and field names than the JSON, which
makes it harder to make an apples-to-apples comparison.

But the biggest problem with that S-expression is that I don't know how to
parse it. Is SGML a symbol, identifier, a quoteless string? How do I know when
parsing the 'entry' field that what follows is going to be a list of key/value
pairs without parsing the whole expression? Is 'see-also GML XML' parsed as a
list? How do we distinguish between single element lists and scalars? Is it
possible to express a list at the top level, like JSON allows? How do you
express a boolean, or null?

Of the problems outlined in the OP, S-expressions solve one: there's no
question of how to parse trailing commas because there are no trailing commas.
They do not solve questions of maximum levels of nesting. They have the same
potential pitfalls with whitespace. They have exactly the same problems with
parsing strings and numbers. They have the same problem with duplicated keys.

My point here isn't that you can't represent JSON as S-expressions. Clearly
you can. My point is that in order to match what JSON can do, you have to
create rules for interpreting the S-expressions, and those rules are the hard
part. Those rules, in essence, _are_ JSON; once you've written the logic to
serialize the various types supported by JSON to and from S-expressions,
you've implemented "JSON with parentheses and without commas".
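
A toy illustration of that point (a minimal sketch that ignores quoted strings
and escapes; real S-expression parsers differ): parsing the shape takes a
dozen lines of Python, but nothing in it can tell you whether `SGML` is a
string, a symbol, or something else.

    def parse_sexp(text):
        """Parse one S-expression into nested Python lists of atom tokens."""
        tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()

        def walk(i):
            if tokens[i] == '(':
                out, i = [], i + 1
                while tokens[i] != ')':
                    node, i = walk(i)
                    out.append(node)
                return out, i + 1
            return tokens[i], i + 1  # an atom: its type is left to the app

        node, _ = walk(0)
        return node

    print(parse_sexp('(entry (id SGML) (see-also GML XML))'))
    # ['entry', ['id', 'SGML'], ['see-also', 'GML', 'XML']]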

~~~
eadmund
> Is SGML a symbol, identifier, a quoteless string?

It's a sequence of bytes — a string, if you like.

> How do I know when parsing the 'entry' field that what follows is going to
> be a list of key/value pairs without parsing the whole expression?

You wouldn't, and as a parser you wouldn't need to. The thing which accepts
the parsed lists of byte-sequences would need to know what to do with whatever
it's given, but that's the same issue as is faced by something which accepts
JSON.

> Is 'see-also GML XML' parsed as a list?

(see-also GML XML) is a list.

> How do we distinguish between single element lists and scalars?

'(single-element-list)' is a single-element list; 'scalar' is a scalar. Just
like '["single-element-list"]' & '"scalar"' in JSON.

> Is it possible to express a list at the top level, like JSON allows?

That whole expression is a list at top level.

> How do you express a boolean, or null?

The same way that you represent a movie, a post or an integer: by applying
some sort of meaning to a sequence of bytes.

> They do not solve questions of maximum levels of nesting.

They don't solve the problem of finite resources, no. It'll always be possible
for someone to send more data than one can possibly process.

> They have the same potential pitfalls with whitespace.

No, they don't, because Ron Rivest's canonical S-expression spec indicates
exactly what is & is not whitespace.

> They have exactly the same problems with parsing strings and numbers.

No they don't, because they don't really have either strings or numbers: they
have lists and byte-sequences. Anything else is up to the application which
uses them — just like any higher meaning of JSON is up to the application
which uses _it_.

> They have the same problem with duplicated keys.

No, they don't — because they don't have keys.

> My point is that in order to match what JSON can do, you have to create
> rules for interpreting the S-expressions, and those rules are the hard part.

 _My_ point is that JSON doesn't — and can't — create all the necessary rules,
and that trying to do so is a mistake, because applications do not have
mutually-compatible interpretations of data. One application may treat JSON
numbers as 64-bit integers, another as 32-bit floats. One application may need
to hash objects cryptographically, and thus specify an ordering for object
properties; another may not care. _Every_ useful application will need to do
more than just parse JSON into the equivalent data structure in memory: it
needs to validate it & then work with it, which almost certainly means
converting that JSON-like data structure into an application-specific data
structure.

The key, IMHO, is to punt on specifying all of that for everyone for all time
and instead to let each application specify its protocol as necessary. The
reason to use S-expressions for that is that they are structured and capable
of representing anything.

Ultimately, we can do more by doing less. JSON is seductive, but it'll
ultimately leave one disappointed. It does a lot, but not enough.
S-expressions do enough to let you do the rest.

~~~
djur
I hope you understand that those questions were rhetorical -- they're
questions that do not need to be asked about the equivalent JSON
representation. Questions developers don't have to ask each other about the
data they're sending each other.

The canonical S-expression representation solves some of the problems JSON
has, true, but the example you provided is not a canonical S-expression. It
wouldn't make sense for it to have been, because canonical S-expressions are a
binary format and not comparable in this context to JSON or XML.

Application developers voted with their feet for serialization formats with
native representations of common data types (strings, numbers, lists, maps,
booleans, null). There's a lot of reasons that JSON has supplanted XML, but
one of them is that JSON has these types built in and XML does not. A lot of
real-world data interchange and storage can make good use of those primitives.
Many problems boil down to "how do I pass around a list of key/value records".
There is a lot to say for not having to renegotiate that kind of basic detail
every time two applications need to communicate.

You can represent S-expressions as JSON strings and arrays. I've done it. It
was the best way to represent the data I was trying to store, but that's
because the data was already represented as S-expressions. I've never seen
anyone else do it, and that doesn't surprise me. For most purposes JSON is
used for, it is more useful than S-expressions -- not necessarily more
powerful, but more useful.

~~~
tonyg
Interpreted as a Rivest S-expression, the example given above conforms to the
"advanced transport representation" [1], and so can automatically and
straightforwardly be converted to the "canonical representation" [2].

In an important sense, then, I'd claim that it _is_ a "canonical
S-expression".

The reason this works is because SPKI S-expressions aren't just a grammar for
a syntax, they also come with [3] a total /equivalence relation/, which is
exactly what JSON lacks and which is what makes JSON such a pain to work with.

In other words, SPKI S-expressions have a semantics. JSON doesn't.

Lots of other "modern" data languages also lack equivalence relations, making
them similarly difficult to use at scale.

[ETA: Of course, your point about lacking common data types is a good one! My
fantasy-land ideal data language would be something drawing from both SPKI
S-expressions and BitTorrent's "bencoding", which includes integers and hashes
as well as binary blobs and lists.]

\---

[1] Section 6.3 of
[http://people.csail.mit.edu/rivest/Sexp.txt](http://people.csail.mit.edu/rivest/Sexp.txt)

[2] Section 6.1 of
[http://people.csail.mit.edu/rivest/Sexp.txt](http://people.csail.mit.edu/rivest/Sexp.txt)

[3] The SPKI S-expression definition is still a draft and suffers a few
obvious problems - ASCII-centrism and the notion of a "default MIME type"
being two major deficits. Still, I'd love to see the document revived,
updated, and completed. Simply _having an equivalence relation_ already lifts
it head and shoulders above many competing data languages.

------
pedrorijo91
I'm always saying that I can't understand how we have a new hipster
programming language/framework every year, yet we still struggle to deal with
JSON

------
fwdpropaganda
Damn, the people who work on these kinds of things are heroes.

------
Froyoh
Gson is the best one out there

------
threepipeproblm
TOML is supposed to be easy to parse.

------
your-nanny
Doing God's work, man. Helluva job

