
Json vs. simplejson vs. ujson - harshulj
http://jyotiska.github.io/blog/posts/json_vs_simplejson_vs_ujson.html
======
jmoiron
When I wrote the same kind of article in Nov 2011 [1], I came to similar
conclusions; ujson was blowing everyone away.

However, after swapping a fairly large and json-intensive production spider
over to ujson, we noticed a large increase in memory use.

When I investigated, I discovered that simplejson reused allocated string
objects, so when parsing/loading you basically got string compression for
repeated string keys.

The effects were pretty large for our dataset, which was all API results from
various popular websites and featured lots of lists of things with repeating
keys; on a lot of large documents, the loaded mem object was sometimes 100M
for ujson and 50M for simplejson. We ended up switching back because of this.

[1] [http://jmoiron.net/blog/python-serialization/](http://jmoiron.net/blog/python-serialization/)
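
The dedup jmoiron describes can also be retrofitted onto any parser's output. A hypothetical post-parse pass (the function name is mine, not from any library) that shares one instance of each repeated string might look like:

```python
import json

def dedupe_strings(obj, seen=None):
    """Walk a parsed JSON object and share one instance of each
    repeated string, mimicking simplejson's key reuse."""
    if seen is None:
        seen = {}
    if isinstance(obj, str):
        return seen.setdefault(obj, obj)
    if isinstance(obj, list):
        return [dedupe_strings(x, seen) for x in obj]
    if isinstance(obj, dict):
        return {dedupe_strings(k, seen): dedupe_strings(v, seen)
                for k, v in obj.items()}
    return obj

records = dedupe_strings(json.loads('[{"user_id": 1}, {"user_id": 2}]'))
keys = [next(iter(r)) for r in records]
assert keys[0] is keys[1]  # both dicts now share one "user_id" string
```

It costs an extra pass over the document, but for key-heavy API dumps like the ones described above the memory savings can dominate.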

~~~
JoshTriplett
Seems like there should be a standard Python mechanism for constructing
"atoms" or "symbols" that automatically get commoned up.

~~~
rat87
I'm pretty sure symbols aren't meant to be created from "user" input where the
user is untrusted -- can't that lead to denial-of-service attacks? The same
goes for interning. De-duping doesn't have that risk.

~~~
munificent
Lua has an interesting approach here. In Lua, _all_ strings are interned. If
you have "two" strings that consist of the same bytes, you are guaranteed that
they have the same address and are the same object. Basically, every time a
string is created from some operation, it's looked up in a hash table of the
existing strings and if an identical one is found, that gets reused.

However, that hash table stores _weak_ references to those strings. If nothing
else refers to a string, the GC can and will remove it from the string table.

This gives you great memory use for strings and optimally fast string
comparisons. The cost is that creating a string is probably a bit slower
because you have to check the string table for the existing one first.

It's an interesting set of trade-offs. I think it makes a lot of sense for
Lua, which uses hash tables for _everything_, including method dispatch, and
where string comparison must be fast. I'm not sure how much sense it would
make for other languages.
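
Lua's scheme can be sketched in Python with a weak-value table. A plain `str` can't be the target of a weak reference in CPython, hence the trivial subclass; all names here are illustrative, not any library's API:

```python
import weakref

class _InternedStr(str):
    """Plain str subclass: unlike str itself, its instances
    can be the target of weak references."""

_table = weakref.WeakValueDictionary()

def intern_weak(s):
    """Return the canonical instance for s, creating it on first use.
    The table entry vanishes once nothing else holds the string."""
    canonical = _table.get(s)
    if canonical is None:
        canonical = _InternedStr(s)
        _table[s] = canonical
    return canonical

a = intern_weak("hello" * 2)
b = intern_weak("hellohello")
assert a is b  # same bytes -> same object, as in Lua
```

As in Lua, equal strings compare by identity after interning, and the weak table means unused strings are still collectable.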

~~~
TheLoneWolfling
A problem with that approach:

You can discover what internal strings are held in a web application via a
timing attack.

Better hope you never hold onto a reference to internal credentials inside the
application! (Say... DB username / password? Passwords before they're hashed?
Etc.)

------
borman
The problem with all the (widely known) non-standard JSON packages is that
they all have their gotchas.

cjson's way of handling unicode is just plain wrong: it uses utf-8 bytes as
unicode code points. ujson cannot handle large numbers (somewhat larger than
2**63; I've seen a service that encodes unsigned 64-bit hash values in JSON
this way, and ujson fails to parse its payloads). With simplejson (when using
the speedups module), a string's type depends on its value, i.e. it decodes
strings as the 'str' type if their characters are ascii-only, but as 'unicode'
otherwise; strangely enough, it always decodes strings as unicode (like the
standard json module) when speedups are disabled.
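
For reference, Python's stock json has no trouble beyond 2**63, since it round-trips through Python's arbitrary-precision int (the ujson failure is as reported above; this only demonstrates the stdlib side):

```python
import json

# an unsigned 64-bit hash value, larger than 2**63 - 1
h = 2**64 - 1
payload = json.dumps({"hash": h})
assert json.loads(payload)["hash"] == h  # stock json round-trips exactly
```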

~~~
smerritt
Agreed, especially about simplejson. I work on a project that uses simplejson,
and it leads to ugly type checking all over the place because you never know
what your JSON string got turned into. For example:

[https://github.com/openstack/swift/blob/39c1362a4f5a7df75730...](https://github.com/openstack/swift/blob/39c1362a4f5a7df75730d3388bf37c3b5fbdc9c8/swift/account/reaper.py#L362)

and
[https://github.com/openstack/swift/blob/39c1362a4f5a7df75730...](https://github.com/openstack/swift/blob/39c1362a4f5a7df75730d3388bf37c3b5fbdc9c8/swift/common/db.py#L51)

and
[https://github.com/openstack/swift/blob/39c1362a4f5a7df75730...](https://github.com/openstack/swift/blob/39c1362a4f5a7df75730d3388bf37c3b5fbdc9c8/swift/common/middleware/slo.py#L291)

and many more just like those.

The worst part is the bugs that appear or disappear depending on whether
simplejson's speedups module is in use or not.

------
Drdrdrq
I disagree with the conclusion. How about this: you should use the tool that
most of your coworkers already know and which has large community support and
adequate performance. In other words, stop fooling around and use the json library.
If (IF!!!) you find performance inadequate, try the other libraries. And most
of all, if optimization is your goal: measure, measure and measure! </rant>

------
jbergstroem
I just want to add another library here which – at least in my world – is
replacing json as the number one configuration and serialisation format. It's
called libucl, and its main consumer is probably the new package tool in
FreeBSD: `pkg`

Its syntax is nginx-like, but it can also parse strict JSON. It's pretty fast, too.

More info here:
[https://github.com/vstakhov/libucl](https://github.com/vstakhov/libucl)

~~~
michaelmior
Out of curiosity, what do you mean by "my world"? Is this a particular domain
you're working in, or just your personal usage?

~~~
leojg
He is obviously an alien and he is trying to introduce the tools he uses on
his home world.

OnTopic: I think it's an unusual way of saying "the environment I use".

------
wodenokoto
How hard is it to draw a bar graph? I'd imagine it is easier than creating an
ASCII table and then turning that into an image, but I've never experimented
with the latter.

~~~
azinman2
Not helpful.

~~~
gpvos
Actually sounds like an honest question to me.

~~~
azinman2
"How hard is it" is sarcastic here.

~~~
metaphorm
I didn't read it that way. Try to be more charitable in your interpretation of
things other people say on the internet.

------
chojeen
Maybe this is a dumb question, but is json (de)serialization really a
bottleneck for python web apps in the real world?

~~~
MagicWishMonkey
Depends on the app. My previous job required processing thousands of address
book contact records uploaded to the server in a massive list. It was not
unusual for some of these objects to exceed 10 MB (when serialized to disk).

The default json module took close to 5 seconds to deserialize the payload
once it hit the server, while ujson could do the same work in a fraction of
the time (less than a second). 5 seconds might not seem like a whole lot when
the import process as a whole could take 30 seconds or so, but when the user
is stuck staring at their device it makes sense to cut down the response time
any way you can.
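
An easy way to check whether deserialization is really your bottleneck before swapping libraries -- a sketch with illustrative record shapes and sizes, not the actual data described above:

```python
import json
import timeit

# build a payload of contact-like records (several MB serialized)
records = [{"name": "user%d" % i, "email": "u%d@example.com" % i,
            "phone": "555-0100"} for i in range(100_000)]
payload = json.dumps(records)

# average wall-clock time for one json.loads of the whole payload
seconds = timeit.timeit(lambda: json.loads(payload), number=3) / 3
print("avg loads time: %.3fs for %.1f MB" % (seconds, len(payload) / 1e6))
```

Swapping `json` for `ujson`/`simplejson` in the lambda gives a like-for-like comparison on your own data.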

------
michaelmior
> ultrajson ... will not work for un-serializable collections

So I can't serialize things with ultrajson that aren't serializable? I must be
missing something in this statement.

> The verdict is pretty clear. Use simplejson instead of stock json in any
> case...

The verdict seems clear (based solely on the data in the post) that ultrajson
is the winner.

~~~
kelseyfrancis
> The verdict seems clear (based solely on the data in the post) that
> ultrajson is the winner.

ultrajson isn't a drop-in replacement, though, because it doesn't support
sort_keys.

~~~
michaelmior
Fair enough. Although I'm not sure why one would want that behaviour given
that there is no guarantee of ordering when a particular JSON file is
processed with any other library.

~~~
mdaniel
I don't know what they do with it, but it's handy for writing tests against an
expected JSON file: assert json.dumps(expected, sort_keys=True) ==
json.dumps(obj, sort_keys=True) # where expected was json.load()-ed and obj
was produced by the function
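
The point of sort_keys there is a canonical encoding: two equal dicts built in different key orders serialize to different text without it (in CPython 3.7+, where dumps follows insertion order), but to identical text with it:

```python
import json

a = {"b": 1, "a": 2}
b = {"a": 2, "b": 1}

assert a == b                          # equal as dicts
assert json.dumps(a) != json.dumps(b)  # but different text...
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```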

~~~
michaelmior
I don't understand your example and why you wouldn't just do assert expected
== obj.

------
jroseattle
> keep in mind that ultrajson only works with well defined collections and
> will not work for un-serializable collections. But if you are dealing with
> texts, this should not be a problem.

Well-defined collections? As in, serializable? Well sure, that's requisite for
the native json package as well as simplejson (as far as I can recall --
haven't used simplejson in some time.)

But does "texts" refer to strings? As in, only one data type? The source code
certainly supports other types, so I wonder what this statement refers to.

~~~
random567
ujson doesn't error out if you have a collection that isn't serializable, so
you can lose individual keys. It also has issues with ints and floats that are
too big (it just fails out).
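
By contrast, the stock json module fails loudly on unserializable values, and its `default=` hook makes any coercion an explicit choice -- a sketch:

```python
import json

data = {"ids": {1, 2, 3}}  # a set is not JSON-serializable

try:
    json.dumps(data)
except TypeError:
    pass  # stock json raises instead of silently dropping the key
else:
    raise AssertionError("expected TypeError")

# opt in to an explicit coercion instead: default= is called on
# any value the encoder can't handle
text = json.dumps(data, default=sorted)
assert json.loads(text) == {"ids": [1, 2, 3]}
```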

------
foota
I disagree with the verdict at the end of the article; it seems like json
would be better if you were doing a lot of dumping, and it also has the added
maintenance guarantee of being an official package.

------
jkire
> We have a dictionary with 3 keys

What about larger dictionaries? With such a small one I would be worried that
a significant proportion of the time would be simple overhead.

[Warning: Anecdote] When we were testing out the various JSON libraries we
found simplejson much faster than json for dumps. We used _large_
dictionaries.

Was the simplejson package using its optimized C library?

~~~
jkire
> In this experiment, we have stored all the dictionaries in a list and dumped
> the list using json.dumps()

I completely failed to read this the first time I went through. I guess this
is equivalent to dumping bigger dictionaries.

> [Warning: Anecdote] When we were testing out the various JSON libraries we
> found simplejson much faster than json for dumps.

Turns out we were using _sort_keys=True_ option, which apparently makes
simplejson much faster than json.
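
This is easy to check for your own payloads; a minimal harness (data shape is illustrative, and results will vary by Python and library version):

```python
import json
import timeit

# a list of dicts with many keys, the shape where sort_keys matters
payload = [{"key_%03d" % i: i for i in range(100)} for _ in range(1_000)]

plain = timeit.timeit(lambda: json.dumps(payload), number=20)
skeys = timeit.timeit(lambda: json.dumps(payload, sort_keys=True), number=20)
print("plain: %.3fs  sort_keys: %.3fs" % (plain, skeys))
```

Running the same two lines against simplejson's dumps shows whether the anecdote above holds for your data.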

------
ktzar
The usage of percentages in the article is wrong. 6 is not 150% faster than 4.

~~~
anon4
6 = 1.5 * 4. I'm not seeing the problem.

~~~
icebraining
"150% faster" implies the speed is 2.5 times, not 1.5.

------
stared
But ujson comes at the price of slightly reduced functionality. For example,
you cannot set indent. (And I typically set indent for files under 100 MB;
when working with third-party data, manual inspection is often necessary.)
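
That is, the stdlib's pretty-printing, which makes manual inspection practical:

```python
import json

doc = {"results": [{"id": 1, "tags": ["a", "b"]}]}
pretty = json.dumps(doc, indent=2, sort_keys=True)
print(pretty)
```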

(BTW: I got tempted to try ujson exactly because of the original blog post,
i.e.
[http://blog.dataweave.in/post/87589606893/json-vs-simplejson-vs-ultrajson](http://blog.dataweave.in/post/87589606893/json-vs-simplejson-vs-ultrajson).)

Plus, AFAIK, at least in Python 3, json IS simplejson (but a few versions
older). So every comparison of these libraries is going to give different
results over time (likely with the difference getting smaller). Of course,
simplejson is the newer version of the same thing, so it's likely to be
better.

------
willvarfar
(My own due diligence when working with serialisation:
[http://stackoverflow.com/questions/9884080/fastest-packing-of-data-in-python-and-java](http://stackoverflow.com/questions/9884080/fastest-packing-of-data-in-python-and-java)

I leave this here in case it helps others.

We had other focus such as good for both python and java.

At the time we went with msgpack. As msgpack is doing much the same work as
json, it just shows that the magic is in the code, not the format..)

------
apu
Also: weird crashes with ultrajson, lack of nice formatting in outputs, and
high memory usage in some situations.

------
dbenhur
> Without argument, one of the most common used data model is JSON

JSON is a data representation, not a data model.

------
js2
I'll have to try ultrajson for my use case, but when I benchmarked pickle,
simplejson and msgpack, msgpack came out the fastest. I also tried combining
all three formats with gzip, but that did not help. Primarily I care about
speed when deserializing from disk.

------
velox_io
I know it goes against the grain, but I wish that binary json (UBJSON) had
much more widespread usage. There's no reason tools can't convert it back to
json for us old humans.

The speed difference between working with binary streams and parsing text is
night and day.

~~~
zapov
No it's not.

[http://hperadin.github.io/jvm-serializers-report/report.html](http://hperadin.github.io/jvm-serializers-report/report.html)

------
akoumjian
We took a look at ujson about a year ago and found that it failed to load even
JSON structures that were only 3 layers deep. I also recall issues handling
unicode data.
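
For what it's worth, the failure isn't inherent to nesting itself; the stock module handles arbitrary depth (up to its recursion limit) without trouble:

```python
import json

doc = {"a": {"b": {"c": [1, 2, {"d": "deep"}]}}}  # more than 3 layers
assert json.loads(json.dumps(doc)) == doc  # round-trips cleanly
```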

It was a big disappointment after seeing these kinds of performance
improvements.

------
MagicWishMonkey
It kills me that the default JSON module is _so_ slow; if you're working with
large JSON objects you really have no choice but to use a 3rd-party module
because the default won't cut it.

------
bpicolo
Python version? Library version? Results are meaningless without that info.

------
fijal
The standard JSON has an optimized version in PyPy (that does not beat ujson,
but is a lot faster than the stdlib one in cpython)

------
UUMMUU
I was aware of simplejson but had not seen ultrajson. This is awesome to see.
Thanks for the writeup.

------
aaronem
*(Python)

