
Parsing malformed JSON - p8donald
https://peteris.rocks/blog/parsing-malformed-json/
======
bhaak
Great, after the tag soup of modern browsers, are we now also going to see JSON
soup?

Sometimes it's obvious what's wrong with malformed data you receive. A classic
would be encoding errors.

But as soon as you start supporting broken components and APIs, you will never
be able to unsupport them.

The prime example would be HTML. Granted, in the beginning it was supposed to
be written by humans, but that rationale quickly stopped being a major factor,
and even a human can produce valid HTML with the help of a syntax checker.

~~~
drakenot
I've written a relatively popular Atom/RSS feed parser for Go [0].

I struggled with this very issue but I ultimately ended up attempting to be
robust against out-of-spec feeds. A super strict feed parsing library is less
useful than one that can successfully parse certain classes of broken feeds.

It is a fine line to walk -- I won't add a great deal of complexity to support
overly broken feeds, but if it is relatively simple to support certain types
of common mistakes I'll do it.

[0] [https://github.com/mmcdole/gofeed](https://github.com/mmcdole/gofeed)

~~~
treve
I'm doing this with WebDAV too. When I come across a bug that's clearly an
implementation problem, I weigh how prevalent the software is and how likely
they are to fix it, and if possible I add a user-agent-specific workaround so
new clients can't rely on the same bug in my server.

~~~
kr0
But then we add the IE nightmare of a new product using an accepted user-agent
to work around cases like this.

~~~
treve
That nightmare had to do with misbehaving servers. IE had to advertise as
Mozilla so servers would serve the better response.

In this case it would be possible for a client to fake a UA, but it's more
likely that they weren't aware they were doing things incorrectly and will
correct the behavior, rather than opting in to mimicking a different UA to get
the server to behave in a non-standard way.

I haven't seen this happen, and this is one of the most popular DAV
implementations. I have seen people fix broken implementations as I've slowly
been making the server more strict over the last 10 years.

------
peterkelly
Please don't do things like this. It only encourages people to be lazy about
producing conforming documents, and different parsers that try to compensate
for syntax errors are going to do so in different ways. We learnt this the
hard way with HTML.

~~~
ludamad
I'm not quite seeing who you think would be encouraged here. Bad JSON output
is usually created in a rush by someone who didn't test their output. It's
unlikely that someone who does test their JSON output would become lazy
because a few lenient parsers exist.

~~~
hueving
Once there are parsers accepting bad input, people will inevitably test with
those parsers and assume their output is okay.

------
captainmuon
Many people here are wondering how you can end up with JSON this bad, and who
is "sending" it to them. Well, the poster is not necessarily running a REST
service. At work, I've dealt with plenty of little JSON (and XML) files,
created by "little tools" and passed around via files and pipes. Since I work
in science, most of our coders are the users of their own code, so you can
imagine both code quality and UX are poor. And the main reason something like
this happens is that people don't use proper serialization, because they never
heard of it, or they don't have the right tools. They just construct JSON by
string interpolation. If they are lucky, they remember to replace `"` with
`\"`. In fact, that looks a lot like what happened here (plus one or two
levels of escaping).
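
A minimal Python illustration of how interpolation goes wrong (the values here
are hypothetical):

    import json

    comment = 'She said "hi" to the team'

    # String interpolation breaks as soon as the value contains a quote:
    broken = '{"comment": "%s"}' % comment
    # json.loads(broken) raises json.JSONDecodeError

    # A real serializer handles quoting and escaping for you:
    ok = json.dumps({"comment": comment})
    print(json.loads(ok)["comment"])  # She said "hi" to the team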

Apropos escaping: people are most likely to get this wrong if they never wrote
PHP websites as kids and never went through `urlencode` and
`mysql_escape_string` hell.

------
RMarcus
I wrote a library to handle many cases of invalid JSON, motivated by a similar
experience. [https://github.com/RyanMarcus/dirty-json](https://github.com/RyanMarcus/dirty-json)

I'm on my phone now, but later today I'll test whether it would have worked
for the author. It's good for cleaning up JSON, but I would be wary of putting
it (or anything like it) anywhere near production.

------
k2xl
I'm hoping nobody actually does this in production. As an academic exercise it
is interesting.

Maybe I'm old-fashioned - I'm all for flexible APIs and all, but to the
article's point: if a customer sends rotten stuff, it should just be rejected
with a 4xx code.

At minimum, check to make sure it is proper JSON... I know that a lot of
stream processors will put it into a queue, return 200 right away, and then
process in the background, but I don't think that ensuring it is at least
valid JSON and doesn't have a content size of more than X would be too
intensive.
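
A minimal Python sketch of that kind of gate (the function name and the 1 MB
cap are made up for illustration):

    import json

    MAX_BYTES = 1 << 20  # hypothetical 1 MB cap

    def accept_payload(raw: bytes):
        # Cheap checks before queuing: a size limit plus a real parse.
        # Returns the parsed document, or raises ValueError (map it to a 4xx).
        if len(raw) > MAX_BYTES:
            raise ValueError("payload too large")
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            raise ValueError(f"invalid JSON: {e}") from None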

In this case, if the data was already accepted and you've got no choice but to
deal with it, you've gotta do what you got to do. I've been there, and it
ain't fun cleaning up a 900 GB JSON file.

~~~
junke
I don't deal with such huge files. Honestly, what use case requires _900GB_ of
_JSON_?

~~~
hyperman1
I've got one for you. We have to upload JSON files containing, for a bunch of
articles, some encoded rules plus the legal text of the law explaining why the
encoded rules are what they are.

The law part was supposed to be a few lines of text. Except when they don't
know which article to cite; in that case they provide the full text of the
law, including scanned PDFs, base64-encoded. All 2 GB of it. Basically you
have something with the meaning null, encoded in a huge string.

Now, the creation of this file was handed to a third party, who don't bother
finding out the relevant law and just paste the 2 GB blob into every article
they modify, to be safe. At this point we have 500,000 articles in that file.
We get a new one every month.

Not fun at all. But at least it's modern; in the past it was a COBOL flat
file.

~~~
junke
This looks like TheDailyWTF.com, but thanks.

------
devy
This reminds me of the "Parsing JSON is a minefield" post from a few weeks
ago. TL;DR: JSON is not really standardized (or rather, has multiple
standards), making parsing/validating JSON data very tricky in edge cases.

[https://news.ycombinator.com/item?id=12796556](https://news.ycombinator.com/item?id=12796556)

~~~
beejiu
What are the multiple standards of JSON? I am only aware of one standard; it
is the implementations that are the problem.

~~~
devy
Read the original article linked in the HN discussion and skip to the section
that says "Yet JSON is defined in at least six different documents". You're
welcome.

------
Analemma_
How well do you know the sender? Because this looks like an attack, or at
least a probe: something to try to crash the parser and see what response
comes back, to check whether you are vulnerable to some kind of heap-corruption
attack.

------
aikah
> I have no idea how something like this was generated.

It would be interesting to ask the sender how.

> If the file is small enough or the data regular enough, you could fix it by
> hand with some search & replace.

Of course.

> But the file I had was gigabytes in size and most of it looked fine.

I suspect a faulty JSON library. It's important to figure out how the file was
generated so an issue can be opened and the bug fixed.

------
junke
> I had this "JSON" file sent to me

Why? By whom? Did you complain loudly?

------
PaulHoule
Malformed data is a scalability problem. Unusual failure modes from coding
problems to random bit flips become inevitable as the data volume approaches
infinity.

~~~
thwd
Agreed, but the payload from the article doesn't seem to have suffered from
astral radiation. Rather, it looks like random attempts at quote-escaping by
someone who doesn't understand what they're doing. Also notice the "nan"
value -- JSON has no concept of NaN.

------
mSparks
My first reaction would be to parse until you hit a problem, then use a
string-distance function and a genetic algorithm to find the problematic
characters.

In other words: find multiple possibilities that result in a valid JSON object
and choose the one with the shortest distance.

Then, of course, log out the changes.
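
A crude breadth-first version of that idea for JSON (no genetic algorithm,
just single-character edits at the parser's reported failure position; a
sketch, not production code):

    import json

    def try_repair(s, max_edits=5):
        # Breadth-first search over single-character edits at the failure
        # position; the first success has the fewest edits, i.e. the
        # shortest edit distance among the candidates explored.
        queue = [(s, 0)]
        seen = {s}
        while queue:
            text, edits = queue.pop(0)
            try:
                return json.loads(text), edits
            except json.JSONDecodeError as e:
                if edits >= max_edits:
                    continue
                i = e.pos
                candidates = (
                    text[:i] + text[i + 1:],     # drop the offending char
                    text[:i] + "\\" + text[i:],  # or escape it (stray quotes)
                )
                for c in candidates:
                    if c not in seen:
                        seen.add(c)
                        queue.append((c, edits + 1))
        raise ValueError("could not repair input within the edit budget")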

I do something similar with CSVs; MSSQL is notorious for spitting out junk
inside CSV files.

Also, I can guess how this one was created: the code is probably in C, and a
rare edge case is overwriting memory before it hits the file.

------
wccrawford
It's a neat trick, but not something I'd deploy into production. If I have to
try to guess at what the customer is sending me, I'm not going to apply it to
their account.

In an emergency, I might hand-edit it and make it right, but I'd absolutely
insist that further files be in the correct format.

~~~
k__
Isn't this mainly used in editors that want to provide some hints even for
JSON you haven't finished yet?

~~~
wccrawford
That's a legit use for it, sure. But when a non-techie sees something like
this, they immediately think of all the hassle they can save a customer that
is having trouble making valid JSON. "We'll just parse it for them!" They
completely ignore that it's not possible to know for sure what the customer
really wanted, and it's the start of a lot of headaches.

------
mwkaufma
Or, how I made my service a DDoS target.

It's not just the extra compute, it's the lack of a formal specification. If
different services applied this kind of ad hoc "Postel's principle", they
might parse the malformed markup differently and end up introducing downstream
inconsistencies.

~~~
hueving
Or even vulnerabilities. Imagine a scenario where a parser for an
authentication engine reads a different value for a given key than the value
the authorization logic reads.
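
A concrete instance of that split is duplicate keys, whose handling the JSON
spec leaves open. Python's parser keeps the last occurrence; other parsers may
keep the first:

    import json

    # Duplicate keys: the standard doesn't say which value wins.
    doc = '{"user": "alice", "user": "admin"}'
    print(json.loads(doc))  # {'user': 'admin'} -- Python keeps the last

    # A parser that keeps the *first* value would see "alice" here, so an
    # authentication check and the authorization logic could disagree.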

~~~
brassic
This isn't theoretical, I've seen it with HTTP, HTML and elsewhere. Any time
two pieces of software disagree on how to parse a chunk of data, especially if
one of them is supposed to be doing some sort of security check, you should
expect to find a vulnerability lurking.

I don't know if there's a name for this class of problem. I'd be interested to
know.

------
latch
Not a Python developer, so I was surprised to find that the built-in json
library has a flag, allow_nan, which is True by default.

Also, not invalid, but surprising/annoying (it took a while to debug): an
empty Lua table is the same as an empty Lua array: {}. This causes ambiguity.

    -- will print {}
    print(cjson.encode(cjson.decode('[]')))
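
Concretely, the Python allow_nan surprise:

    import json

    # Python's json module emits and accepts NaN/Infinity by default,
    # even though standard JSON has no such literals.
    print(json.dumps(float("nan")))  # prints: NaN (not valid JSON)
    print(json.loads("NaN"))         # prints: nan

    # allow_nan=False opts into strict, spec-conforming output:
    try:
        json.dumps(float("nan"), allow_nan=False)
    except ValueError as e:
        print("rejected:", e)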

~~~
amyjess
Another nice feature of the built-in JSON library is that you can choose what
class to instantiate with the data. The default is a dict, but if you're
trying to parse Avro records (or something else that cares about field order),
you can change that to an OrderedDict.
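
That hook is object_pairs_hook; a small example (the field names are made up):

    import json
    from collections import OrderedDict

    # object_pairs_hook controls which mapping type json.loads builds; here
    # every JSON object becomes an OrderedDict, preserving field order.
    record = json.loads('{"b": 1, "a": 2}', object_pairs_hook=OrderedDict)
    print(record)  # OrderedDict([('b', 1), ('a', 2)])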

------
anentropic
just send the file back where it came from

------
nommm-nommm
Why would a JSON file be GBs in size? I think that's the more interesting
question.

~~~
nilved
Because it has GBs of data? There's no size limit on JSON.

~~~
mertd
I think nommm-nommm is trying to imply that if you're passing GBs of JSON
around, "human readability" probably isn't a concern, so you could go for an
efficient binary format.

~~~
ludamad
JSON isn't just about human readability, it's about being a 'good enough'
standard for data exchange. What binary format would you use that people could
parse as reliably as JSON?

~~~
ezrast
In case your question wasn't rhetorical, I believe MessagePack is the leading
schema-less binary serialization format (which does not contradict your point
as it is still less ubiquitous than JSON).
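
For the curious, a MessagePack round trip in Python (assumes the third-party
msgpack package):

    import json
    import msgpack  # pip install msgpack

    doc = {"id": 42, "name": "widget"}
    packed = msgpack.packb(doc)
    print(len(packed), "bytes vs", len(json.dumps(doc)), "as JSON")
    print(msgpack.unpackb(packed))  # {'id': 42, 'name': 'widget'}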

------
agounaris
Wouldn't it be easier to just remove the wrong characters manually? :P

Validate the JSON and if it's wrong, just throw it away. It makes no sense to
try to fix/guess the correct form of an input.

------
nkrisc
Should you really assume the data in malformed JSON is even correct?

------
ekiara
Wouldn't a better option be an error log? You reply to the client: "I can
accept 398,500 of your 400,000 submitted records; attached are the records
that do not conform to the expected template. Choose either to (1) submit only
the validated records and discard the malformed ones, or (2) reformat the
malformed records and resubmit the entire batch."
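
A sketch of that flow, assuming newline-delimited records (the names are
illustrative):

    import json

    def partition_records(lines):
        # Split records into parseable and rejected, keeping enough
        # context to send the bad ones back to the client.
        good, bad = [], []
        for n, line in enumerate(lines, 1):
            try:
                good.append(json.loads(line))
            except json.JSONDecodeError as e:
                bad.append({"line": n, "error": str(e), "raw": line})
        return good, bad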

------
ape4
JSON should have a nicer way of dealing with double quotes in data. That would
avoid many encoding mistakes.

~~~
JadeNB
> JSON should have a nicer way of dealing with double quotes in data. That
> would avoid many encoding mistakes.

So you update the standard to this nicer way of dealing with double quotes,
and now people forget to indicate whether they're using the nice new way or
the ugly old way, or they mix the two approaches ….

~~~
ape4
It would have to be phased in... like HTML5 or any browser improvement.

------
fbreduc
If I get malformed JSON, I just tell the sender to re-send the data as valid JSON.

------
bborud
Don't.

