
Unintuitive JSON Parsing - 2038AD
https://nullprogram.com/blog/2019/12/28/
======
ufo
In cases like this, the parser and lexer can often produce a better error
message if they are written to accept a more lax input and then check it for
errors.

For example, instead of faithfully implementing the grammar from the
specification, allow numbers with leading zeroes and then produce an error for
them.
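
A minimal sketch of that "accept, then check" approach (hypothetical lexer, not any particular implementation):

```javascript
// Lex the whole digit run first, then validate it, so the error can
// point at the entire offending number rather than at a lone "0".
function lexNumber(input, pos) {
  let end = pos;
  while (end < input.length && /[0-9]/.test(input[end])) end++;
  const text = input.slice(pos, end);
  if (text.length > 1 && text[0] === "0") {
    throw new SyntaxError(
      `number "${text}" has a leading zero at position ${pos}`
    );
  }
  return { value: Number(text), end };
}
```

With this, `lexNumber("[01]", 1)` rejects the whole token `01` with a message naming the leading zero, instead of silently splitting it into `0` and `1`.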

Another situation where this comes up is parsing language keywords. Instead of
writing a separate lexer rule for every keyword, write a single rule for
keyword-or-identifier, and then use a hash table lookup inside of that to
determine if it is a keyword or identifier.
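
Sketched out (illustrative only; the names are made up, and a real lexer would accept a wider identifier alphabet):

```javascript
// One lexer rule for identifier-like tokens; a set lookup then decides
// whether the result is a keyword.
const KEYWORDS = new Set(["true", "false", "null"]);

function lexWord(input, pos) {
  let end = pos;
  while (end < input.length && /[a-z]/.test(input[end])) end++;
  const text = input.slice(pos, end);
  return { type: KEYWORDS.has(text) ? "KEYWORD" : "IDENTIFIER", text, end };
}
```

A nice side effect: `truefalse` lexes as one unknown identifier, so the error can mention the whole word rather than failing partway through `true`.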

~~~
raverbashing
> instead of faithfully implementing the grammar from the specification, allow
> numbers with leading zeroes and then produce an error for them.

But that's the problem. The tokenizer doesn't talk to the grammar parser (and
vice versa).

The tokenizer could understand numbers with leading zeroes and throw an error
there.

Something to think about: do languages (not JSON) interpret -1.2 as
[MINUS][NUMBER] or just [NUMBER]? And how do languages deal with 1.0-2.0
compared to 1.0+-2.0?

~~~
zwkrt
If you know you are expecting an expression, then it is easy for the parser to
understand '-' as a unary negation if it is the first symbol; otherwise it
must be a binary subtraction operator.

    
    
        Expr: '-' Expr {
                return -1 * $2;
                }
            | Expr '-' Expr {
                return $1 - $3;
                }
            | '(' Expr ')' {
                return $2;
                }
            | ...

------
nicoburns
> The parser will not complain about leading zeros because JSON has no concept
> of leading zeros.

Of course there is no logical reason why the parser shouldn't have this
concept just because the spec doesn't require it. IMO, beyond basic correctness,
user friendly error messages are the main differentiator between excellent
parsers and crappy parsers.

~~~
mjevans
Reporting the last valid input and the start of the first invalid location
(possibly repeating the first couple characters of invalid content, filtered
for safety) is what I'd generally prefer in an error message.
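
A sketch of that style of message (hypothetical helper, not from any particular parser):

```javascript
// Report where parsing stopped, echoing a short snippet of the rejected
// input with non-printable characters filtered out for safety.
function describeError(input, failPos, snippetLen = 8) {
  const bad = input
    .slice(failPos, failPos + snippetLen)
    .replace(/[^\x20-\x7e]/g, "?");
  return `parsed ${failPos} character(s) successfully; ` +
         `invalid input starts at position ${failPos}: "${bad}..."`;
}
```

For example, `describeError("[01]", 1)` points at position 1 and echoes `01]` as the start of the invalid content.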

~~~
evilpie
For some reason the author truncated the error messages reported by Chrome and
Firefox to not include the line/column.

The message in Firefox is: JSON.parse: expected ',' or ']' after array element
at line 1 column 3 of the JSON data

In Chrome: Unexpected number in JSON at position 2

------
abaines
I was initially surprised that all the lexers treat "[01]" as four tokens, but
it makes sense from the state diagram.

In the past I've encountered JSON lexing that only considers token boundaries
on "special" characters i.e. ",}]:" and whitespace. This will return a lexing
error when it sees "01" (equivalently "truefalse").
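
That boundary-based splitting can be sketched like this (assumed behavior, not the article's lexer):

```javascript
// Split only at structural characters and whitespace, so "01" and
// "truefalse" each come out as a single (invalid) token that a later
// pass can reject with a clear message.
function roughTokens(input) {
  return input
    .split(/([,\[\]{}:])|\s+/)
    .filter((t) => t !== undefined && t !== "");
}
```

Here `roughTokens("[01]")` yields `["[", "01", "]"]`: three tokens, with `01` kept whole.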

~~~
s_gourichon
You're right. The current parser violates the
[principle of least astonishment](https://en.m.wikipedia.org/wiki/Principle_of_least_astonishment)
when it breaks a string of characters without any whitespace into several
tokens.

One could imagine first tokenizing only based on whitespace, then only
starting to figure out what the tokens are. Which means parsing them
individually. Which means another parsing step.

I think this would match human reading more closely: structure is more obvious
from visual separation than from detailed analysis.

I guess it wasn't done that way because the current way of operating means one
parser to rule all sources, and that parser can handle more complicated cases.
That kind of design decision is more surprising later, but it's understandable
when you draft a language at the same time as your first parser.

------
svnpenn
What possible reason could someone have for wanting to do this? It is
explicitly not recommended in JavaScript:

[https://developer.mozilla.org/Web/JavaScript/Reference/Errors/Deprecated_octal](https://developer.mozilla.org/Web/JavaScript/Reference/Errors/Deprecated_octal)

See for yourself:

    
    
        > 'use strict'; 01;
        SyntaxError: "0"-prefixed octal literals and octal escape sequences are
        deprecated; for octal literals use the "0o" prefix instead

~~~
ekimekim
Octal notation is traditionally used in several contexts - file mode probably
being the most common. If you were writing a JSON object to describe a file to
be created, and you were under the mistaken impression that JSON supported
octal with a leading zero (like most languages), it would be entirely
reasonable to write something like:

    
    
        {
            "path": "/foo",
            "mode": 0644,
            "contents": "bar"
        }

~~~
svnpenn
Yeah ok, but it's also explicitly not allowed by the specification, both in
text:

> A number is very much like a C or Java number, except that the octal and
> hexadecimal formats are not used.

and image:

[https://json.org/img/number.png](https://json.org/img/number.png)

as shown literally on the JSON home page:

[https://json.org](https://json.org)

I am all for good error handling, but at some point you do have to blame the
user.

~~~
setr
Error messages are "blaming the user": their job is to inform the user of the
mistake he made.

You can silently beat your child everytime he makes a mistake, until he
accidentally does the job correctly (and doesn't get beaten), but it seems to
me that making use of our ability to communicate can be much more efficient
(and significantly less painful for the child).

And JSON is merely a (very inefficient, and somewhat problematic) protocol
for information exchange; it's not something you should expect people to have
read the spec for, especially when its whole popularity stems from it being
"intuitive" -- that is, you don't really need to read the spec to deal with it
effectively.

------
spankalee
The article leads with an incorrect statement. JSON is now a subset of
JavaScript: [https://github.com/tc39/proposal-json-superset](https://github.com/tc39/proposal-json-superset)

~~~
wereHamster
According to the proposal, this has been shipped in V8 in Chrome 66 (from the
V8 bug report: The ECMAScript ⊃ JSON proposal shipped in V8 v6.6 and Chrome
66). And yet my Chrome version 79 does not parse "[01]", throws the same error
as described in the article. Same error in Node 12.14.0 (which includes V8
7.7.299.13). Something doesn't add up.

~~~
will4274
Not sure what's confusing. "[01]" is not valid JSON. JSON being a subset of
JavaScript means that all valid JSON constructs are valid JavaScript
constructs. So, the subset statement says nothing at all about "[01]".

~~~
jchook
That said, in theory, shouldn't:

JSON.parse("0o10") === 8?

I get SyntaxError: Unexpected token o in JSON at position 1

~~~
ufedvjot13467
No. Valid json is valid JavaScript. Valid JavaScript is not, necessarily,
valid json.

~~~
jchook
Thanks, I misunderstood the proposal, which apparently applies mostly to
"unescaped LINE SEPARATOR or PARAGRAPH SEPARATOR characters" within strings.

------
olliej
The problem with octal-formatted numbers, and why JSON (and strict-mode JS)
explicitly disallows them, dates back to the JS engines of the time.
Essentially you had Netscape and IE. Netscape added support for "octal", IE
did not, which meant code like `x = 017` had different values in the two
engines. Given that the early JSON parsers essentially just called eval() on
the string, that wasn't OK behavior for a data interchange format.

Then you have the absurd behavior of the Netscape octal implementation, which
leads to such wonders as `018-017==3`, making it a super terrible footgun.

Sensibly, modern syntax makes the difference between octal and decimal very
explicit with a 0o prefix, just like 0x, 0b, etc. I wish I knew why it was
originally decided not to use 0o when 0x was already in use.
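
For comparison, the explicit prefixes mentioned above work in any modern engine, strict mode included (a quick sketch, not from the original comment):

```javascript
console.log(0o17);  // 15: octal, explicit "0o" prefix
console.log(0x17);  // 23: hexadecimal
console.log(0b101); // 5: binary
// The legacy form "017" is 15 in sloppy mode, while "018" silently
// falls back to decimal 18 -- the footgun described above. Strict
// mode rejects the legacy form outright.
```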

------
will4274
Why is concatenated json a thing?

In what sense is: {0}{1} better than [{0},{1}]? Presumably, if a few bytes are
a major concern, you aren't using JSON anyway.

~~~
PeterisP
If you have a file or network stream with millions of separate JSON items,
then you might want to parse and process each item separately as it is
received, and the surrounding structure just gets in the way. That being said,
it's probably better to explicitly acknowledge that you're using something-
like-json-but-not-really-json, as [http://jsonlines.org](http://jsonlines.org)
does, instead of simply concatenating json objects.
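
A sketch of consuming a jsonlines-style stream one record at a time (assumes newline-delimited input; the helper name is made up):

```javascript
// Parse each newline-delimited record independently, so millions of
// items never need to live in memory as one giant array.
function* eachJsonLine(text) {
  for (const line of text.split("\n")) {
    if (line.trim() !== "") yield JSON.parse(line);
  }
}
```

Each item is yielded as soon as its line is parsed; nothing in the format forces you to hold the whole stream at once.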

~~~
will4274
With the application understanding that the top level object is an array of
independently parsable json objects, it should still be possible to stream the
format I suggested, assuming you use a streaming / sax parser.

~~~
im3w1l
Streaming parsers are very inconvenient to work with.

------
donpdonp
It's possible the parser was laid out to choke on octals as a way to protect
the 'standard'. It's one decision not to support octals, and it's another to
make octal-style numbers an error so they are not parsed as base 10.

~~~
zAy0LfpBZLC8mAC
You have it all backwards. What a string in any language means is defined by
the language specification. If the JSON spec doesn't say that '01' is an octal
number, then it's not an octal number. What you would like it to be, or what
other language specs say, is completely irrelevant for what that string means
in JSON.

Also, making a language X parser accept anything that is not in fact language
X is nothing but a terrible idea. If there is one thing that standards are
good for, it's interoperability. And if there is one thing that hurts
interoperability, it's having different implementations of supposedly the same
standard accept and reject different inputs.

That's how you get websites that work in one browser, but not another, because
one browser was so helpful as to make up some meaning for your creative markup
instead of rejecting it with an error message, which obviously helps you
absolutely nothing with the next browser that is of a different opinion.

If you think the spec is stupid, you have to change the spec; if you don't
manage to do that, you still should implement the spec, because
interoperability is more important than whether your program can read some
input that isn't JSON and that therefore no other JSON parser is guaranteed to
understand anyway.

------
ComputerGuru
So is it incorrect (technically, at any rate) for a parser to support leading
zeroes in its implementation?

~~~
jchook
Yep. It's technically incorrect.

It seems honoring this type of technical correctness matters a lot. For
example, imagine if ECMA added a new feature (e.g. 0-prefixed octal literals)
in 2020.

Another issue: security. Imagine a hacker figured out that you used a mix of
JSON parsers on your application (e.g. V8 and jq), and they produced different
output.

For a vaguely related example, consider that some URL parsers interpret Ｎ
(U+FF2E, fullwidth latin N) as ".", meaning you can sneakily add a ".." to the
URL with ＮＮ (see
[https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf](https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf))

~~~
ComputerGuru
I believe what you're _actually_ saying is that regardless of whether or not
it is technically correct, it would be incorrect (and I agree with you there).

My question was more "for inputs not defined as being valid by the spec, is
the result undefined (a la C++ UB where anything and everything is legal in
response) or is it required to reject said input".

The sibling response says extensions are allowed, but that wouldn't come into
play if an input is _specifically_ called out as disallowed (vs simply not
taken into account whatsoever).

~~~
jchook
Section 9 of the RFC would indeed technically allow interpreting 01 as you
please, but section 6 states:

“Numeric values that cannot be represented in the grammar below... are not
permitted”

Regardless I agree we should not do such things.

------
johnhenry
> Either the leading zero is ignored, or it indicates octal, as it does in
> many languages, including JavaScript.

This is false. In JavaScript, a leading zero, unless accompanied by a
lowercase oh ('o'), does not indicate the number is written in octal.

08 === 0o10; // true

Here, the left side is still base 10, while the right side is base 8.

~~~
reificator
I wouldn't say it's false, I would say it's _no longer true in strict mode_.

[https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Errors/Deprecated_octal](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Errors/Deprecated_octal)

Javascript for many years has assumed a leading zero means an octal prefix,
and it's only recently that behavior has changed.

EDIT: Also your example does not disprove this. Here's another example that
you should try running in the console:

    
    
        011 === 11; // false
        011; // 9

~~~
johnhenry
Thanks for the link!

EDIT: Also, yes, I see the mistake with my initial example -- as '8' doesn't
exist in octal, we have no choice but to interpret the leading '0' as padding
and '08' as base ten, whereas '11' can be interpreted as base eight.

------
userbinator
IMHO if it's not going to support octal anyway, it makes zero(!) sense to
artificially limit/special-case things like this, because then it's much
simpler and more consistent to have leading zeros behave like any other digit.

~~~
ChrisSD
JSON was made to be "based on a subset" of Javascript. The only way to be
compatible with JS while removing octals is to disallow leading zeroes
entirely. Doing otherwise would lead to JSON and JS behaving differently with
the same input.

Of course, until recently JSON wasn't a strict subset of JS but that was an
oversight rather than by design.

~~~
magicalhippo
I've been programming for 30 years, across many different languages from
assembler and up. I've yet to use octals for any code.

What am I missing out on? Why are they included in modern languages like JS?

~~~
peterwwillis
If you ever want to count in base-8 (or a subset), or have an abbreviated form
of binary, or pack decimal in a way that's easier to reason about, or divide a
number in half (down to 1) without getting fractions, or represent file
permissions (a grouping of four octals).
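
The file-permission case is where octal still earns its keep: each octal digit is exactly one rwx triple. A small illustration (hypothetical helper name):

```javascript
// Render a numeric mode as rwx triples. Each octal digit is three
// permission bits, which is why modes are conventionally written in
// base 8.
function modeToString(mode) {
  const bits = "rwx";
  let out = "";
  for (let shift = 6; shift >= 0; shift -= 3) {
    const digit = (mode >> shift) & 7;
    for (let i = 0; i < 3; i++) {
      out += digit & (4 >> i) ? bits[i] : "-";
    }
  }
  return out;
}

// modeToString(0o644) === "rw-r--r--"
```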

Is JS a modern language? It was made 24 years ago as a prototype for a
scripting language loosely mimicking Java. Presumably octals were still in
common use on the machines of the time.

~~~
magicalhippo
> If you ever want to count in base-8 (or a subset)

So far I haven't ever had that need it seems.

> have an abbreviated form of binary

Yeah ok, but why octal over hex? After all, hex maps better to the underlying
storage.

> pack decimal in a way that's easier to reason

How'd that work? I know about BCD but I don't see how octal improves the
situation, being base-8.

> divide a number in half (down to 1) without getting fractions

Huh?

> represent file permissions (a grouping of four octals)

Ok, I get that for C and such, but how often do you do that in JS?

> Is JS a modern language?

Compared to C, where octal support is understandable, I'd say yes.

------
quotemstr
I think JSON would be vastly improved if it were to just allow comments.
Maintaining configuration in JSON is unnecessarily painful due to this
pointless feature gap.

~~~
reificator
While I agree, that seems off-topic, and it comes up almost every time issues
with JSON are brought up.

I found this to be a rather interesting article, and it'd be a shame if the
discussion around it centered on such a well-trodden topic.

