Sonic: A fast JSON serializing and deserializing library (github.com/bytedance)
98 points by ngaut on Nov 23, 2021 | 43 comments



    Small (400B, 11 keys, 3 layers)
    Medium (13KB, 300+ key, 6 layers)
    Large (635KB, 10000+ key, 6 layers)
I would be so happy to work somewhere where a 0.000635 GB json file is "large".


From the documentation it seems to me that they're mostly concerned about resource utilization when processing APIs, and not dealing with files. 635KB may be representative of a large API payload in their environment.


Just another case where a library tests and publishes results for all competing libraries slower than it, but none faster. cough simdjson [1] cough

---

[1] https://github.com/simdjson/simdjson


simdjson is listed though.


It is mentioned here: https://github.com/bytedance/sonic/blob/a577eafc253adb943924..., but it isn't included in the benchmarks graphs. Seems this repo is specifically focused on Golang and isn't necessarily motivated by being the fastest JSON [de]serializer on the planet.


There is this one though: https://github.com/minio/simdjson-go

EDIT: Apologies, simdjson-go is included in the benchmarks and discussed as well.


Well it wasn't in the benchmark graphs and it didn't show in Ctrl+F in the README.

Admittedly, I didn't do an exhaustive search.


A port of simdjson, simdjson-go, is in the benchmark graphs. It seems to be too fast to be visible in most of the graphs.


the graphs show rate, not time


Right, they show MB/s of parsed JSON, and it definitely surprised me: simdjson-go is among the slowest! It looks like there's something fishy going on there, as it is _orders of magnitude_ slower than vanilla simdjson.


If someone runs at a 'higher' rate (m/s) than someone else, is it not right to say that they are faster?


Yes. But on a bar graph showing rate, taller bars mean faster, while on a bar graph showing time to complete a task, taller bars mean slower.

The graphs are showing rate, and simdjson-go has shorter bars (or nonexistent bars, which suggest a quantity too low to show up at the resolution of the graphs). TkTech thought the graphs were showing time, and so thought they were showing that simdjson-go is very fast, which skavi is pointing out is reversed because they are showing rate.


Oh right, yes, sorry, I didn't really tie the discussion back to the graphs (or lacked caffeine or something!).


I regularly deal with JSON documents several MB in size, but do developers frequently deal with JSON documents several GB in size? If so, where do you encounter something like that? Surely "processing" that much data (for whatever definition of process you have) is orders of magnitude slower than parsing it.

I love the idea of a library trying to squeeze every last bit of performance out of the CPU, but I'm genuinely curious at the problems it solves in the real world.


Depends. I've had multi-gigabyte `[{..}, {..}, ...]` json arrays from database dumps, and doing even basic things with that with jq takes ages unless you use the (highly obtuse IMO) streaming methods. Sometimes you can pre-grep to filter the results to something trivial to process, but sometimes the structure is not unique enough to let you do that, or it depends on multiple field values - filtering that with a json parser makes perfect sense, and then speed can matter.
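
For the array-dump case, here's a rough sketch of the same streaming idea using Go's stock encoding/json instead of jq (the file name, the Row fields, and the filter are all made up for illustration):

    // Hedged sketch: stream a huge top-level JSON array element by element
    // so memory stays bounded by one element, not the whole dump.
    package main

    import (
    	"encoding/json"
    	"fmt"
    	"log"
    	"os"
    )

    type Row struct {
    	ID   int64  `json:"id"`
    	Name string `json:"name"`
    }

    func main() {
    	f, err := os.Open("dump.json") // assumed file: [{...}, {...}, ...]
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	dec := json.NewDecoder(f)

    	// Consume the opening '[' of the array.
    	if _, err := dec.Token(); err != nil {
    		log.Fatal(err)
    	}

    	// Decode one element at a time.
    	for dec.More() {
    		var r Row
    		if err := dec.Decode(&r); err != nil {
    			log.Fatal(err)
    		}
    		if r.ID%1000 == 0 { // stand-in for whatever filter you actually need
    			fmt.Println(r.Name)
    		}
    	}

    	// Consume the closing ']'.
    	if _, err := dec.Token(); err != nil {
    		log.Fatal(err)
    	}
    }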

That said, a 2x+ improvement for a couple megabytes, especially if done many times per second, is still a significant improvement.


Monthly general ledger entries for the largest real estate companies. We tried XML and JSON, and eventually landed on compressed CSV as the best trade-off between human-readable large files (~1-3GB) and compressibility.


GeoJSON data can be just as big as you’d like. I have a big archive of GIS data where the files are enormous.


Not JSON, but we process XMLs on the order of a GB. Largest ones are consolidated invoices (ie a lot of separate invoices in one file). Other large ones contain rules and codelists in multiple languages.


All the time. It's not that there's a single record that is that size, but all sorts of things log in JSON, so you wind up with multi-GB jsonl files. As an example: AWS CloudTrail logs.


Huge log files are one use case. Large datasets are another, in data analysis, machine learning, and ETL tasks.


Aren't log files processed a line at a time? Last time I had to deal with some structured log, I streamed lines concurrently into json parser and it went pretty fast.
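
Roughly what that looks like in Go, as a hedged sketch (the Entry fields, file name, and worker count are assumptions, not anything from a real codebase):

    // Read a .jsonl log line by line and fan the lines out to worker
    // goroutines that each run a JSON parse.
    package main

    import (
    	"bufio"
    	"encoding/json"
    	"log"
    	"os"
    	"sync"
    )

    type Entry struct {
    	Level string `json:"level"`
    	Msg   string `json:"msg"`
    }

    func main() {
    	f, err := os.Open("app.jsonl") // assumed: one JSON object per line
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	lines := make(chan []byte, 1024)
    	var wg sync.WaitGroup
    	for i := 0; i < 8; i++ { // 8 workers, picked arbitrarily
    		wg.Add(1)
    		go func() {
    			defer wg.Done()
    			for line := range lines {
    				var e Entry
    				if err := json.Unmarshal(line, &e); err != nil {
    					continue // skip malformed lines
    				}
    				_ = e // process the entry here
    			}
    		}()
    	}

    	sc := bufio.NewScanner(f)
    	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow lines up to 1 MiB
    	for sc.Scan() {
    		// Copy the line: the Scanner reuses its buffer between calls.
    		line := append([]byte(nil), sc.Bytes()...)
    		lines <- line
    	}
    	close(lines)
    	wg.Wait()
    	if err := sc.Err(); err != nil {
    		log.Fatal(err)
    	}
    }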


Hardware config file for a small SOC. It faithfully describes every addressable physical byte.


Since when is 635 KB “large”? I suppose it depends on your use case; since they consider 400 B small, they probably use lots of JSON APIs for many small things.


635KB is indeed "large" if it's one message. (You will have millions of messages in one file if you're doing batch processing.)


How many libs have we seen claiming to be the fastest deserializing library, until you try to open files with corner cases, like UTF-8 characters...


I thought UTF-8 was the standard encoding that the json spec requires?


Because ASCII text is valid UTF-8, you will sometimes encounter "UTF-8" code that doesn't actually work for multi-byte characters
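
A quick way to sanity-check a decoder on multi-byte input, sketched here against Go's stock encoding/json (swap in whatever library you're actually testing):

    package main

    import (
    	"encoding/json"
    	"fmt"
    	"unicode/utf8"
    )

    func main() {
    	in := `{"name":"héllo 世界 🚀"}`
    	var m map[string]string
    	if err := json.Unmarshal([]byte(in), &m); err != nil {
    		panic(err)
    	}
    	// Check the decoded string is still valid UTF-8 and round-trips intact.
    	fmt.Println(utf8.ValidString(m["name"]), m["name"]) // true héllo 世界 🚀
    }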


Why does it feel like there are consistently new JSON parsing libraries popping up? People joke about JS frameworks, but JSON parsers almost feel like they're at the same level.


Somehow when I saw it's a fast JSON serializing library, I knew it was for Go. For some reason there are always new JSON libraries for it.


Go is kinda first class for microservice web apps (and that's probably its only strong point)


web development -> REST -> JSON -> JSON parsing libraries

So like js frameworks, json parsers are everywhere


It seems it's mostly written in assembly? I'd be worried about portability, maintainability, and security.

I'd be way more comfortable if it was written in C, or better yet Rust.


It sounds like it essentially is written in C. INTRODUCTION.md says:

> As for insufficiency in compiling optimization of go language, we decided to use C/Clang to write and compile core computational functions, and developed a set of asm2asm tools to translate the fully optimized x86 assembly into plan9 and finally load it into Golang runtime.

GitHub says it's 59.6% assembly and 6.5% C. Possibly the assembly is just the checked-in result of compiling+translating the C?

Gross that they have to do this. I know the Go folks really prioritize speed of compilation, but I wish they'd support debug builds with their own backend and release builds with LLVM so you could get this kind of performance when actually writing Go. I see there have been a few attempts at Go + general-purpose backend (gccgo, llgo, gollvm) but none seem to use the official Go frontend written in Go, so I think they're doomed to be second-class at best.

edit: and/or, if the Go folks and the GCC and/or LLVM folks could negotiate a shared ABI (not necessarily switching to the platform's default C ABI, but having "cc --abi=special-golang-stack-copying-thing"), folks could just link against something compiled in C/Rust/whatever without requiring this separate compilation+translation step (the high-maintenance, high-performance path) or CGo overhead (the easy but slow-for-frequent-calls path).


I really don't care what language it is written in. As long as there is full test coverage, I'm happy.

I've been a long time user of jsoniter and it is much faster than the standard lib. It really makes a difference for the work I do. If this is even faster and has tests, even better.
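
For anyone who hasn't used it, the usual jsoniter drop-in pattern looks roughly like this (a sketch, not a full migration guide):

    // Shadow the stdlib package name with a jsoniter config that aims to be
    // compatible with encoding/json.
    package main

    import (
    	"fmt"

    	jsoniter "github.com/json-iterator/go"
    )

    var json = jsoniter.ConfigCompatibleWithStandardLibrary

    func main() {
    	out, err := json.Marshal(map[string]int{"a": 1})
    	if err != nil {
    		panic(err)
    	}
    	fmt.Println(string(out)) // {"a":1}
    }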


Using Go assembly lets you avoid the Cgo overhead IIRC.


Released by ByteDance, who operate TikTok.


Which probably means it's solving a real-world and modern problem they encounter, and it carries the same weight as a library released by Google.


Cool. What's the interop story with Go's existing implementation for parsing objects with duplicate keys? Keep first, overwrite, or abort?
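
For reference, Go's stock encoding/json keeps the last occurrence of a duplicate key; a quick check (this says nothing about what sonic does):

    package main

    import (
    	"encoding/json"
    	"fmt"
    )

    func main() {
    	var m map[string]int
    	_ = json.Unmarshal([]byte(`{"a":1,"a":2}`), &m)
    	fmt.Println(m["a"]) // 2: the last value wins
    }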


Seems like it's similar in performance to Jackson?


I had to try working with a single 100GB JSON file. Couldn't find anything that could handle a file that size.


Out of curiosity, what does "working with" mean here? What operations did you need to perform on it? Streaming reads, transforms, indexed reads, appends, edits?

I'm thinking that any general-purpose JSON loader is likely to perform badly for a 100GB file, purely because it'll use 2x (or more likely 10x) as much memory for the parsed representation. So you'd want some kind of special-case reader for huge files -- maybe it just builds some kind of sparse index with pointers back into an mmap of the raw data.
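
A minimal sketch of that sparse-index idea with the standard decoder, assuming the file is one huge top-level array (the file name and index shape are made up; a real tool would mmap the file and trim separators properly):

    // Record byte offsets of each top-level array element so later reads can
    // seek straight to an element instead of re-parsing the whole file.
    package main

    import (
    	"encoding/json"
    	"fmt"
    	"log"
    	"os"
    )

    func main() {
    	f, err := os.Open("huge.json") // assumed: [ {...}, {...}, ... ]
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	dec := json.NewDecoder(f)
    	if _, err := dec.Token(); err != nil { // consume '['
    		log.Fatal(err)
    	}

    	type span struct{ start, end int64 }
    	var index []span
    	for dec.More() {
    		// InputOffset points just after the previous token, so start may
    		// include a leading comma/whitespace; good enough for an index.
    		start := dec.InputOffset()
    		var skip json.RawMessage
    		if err := dec.Decode(&skip); err != nil {
    			log.Fatal(err)
    		}
    		index = append(index, span{start, dec.InputOffset()})
    	}
    	fmt.Printf("indexed %d elements\n", len(index))
    	// Later: f.ReadAt with index[i].start pulls out element i without
    	// touching the rest of the file.
    }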


At that point you might as well ignore the fact it's JSON and just treat it as a stream.


Does it escape data by default? It appears to be a low priority for the author in the issues.
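
For comparison, Go's stock encoding/json HTML-escapes <, >, and & by default, and you opt out per encoder; that says nothing about sonic's own defaults:

    package main

    import (
    	"encoding/json"
    	"os"
    )

    func main() {
    	v := map[string]string{"html": "<b>&</b>"}

    	enc := json.NewEncoder(os.Stdout)
    	enc.Encode(v) // {"html":"\u003cb\u003e\u0026\u003c/b\u003e"}

    	enc2 := json.NewEncoder(os.Stdout)
    	enc2.SetEscapeHTML(false)
    	enc2.Encode(v) // {"html":"<b>&</b>"}
    }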



