Sonic: A fast JSON serializing and deserializing library (github.com/bytedance)
98 points by ngaut on Nov 23, 2021 | 43 comments



    Small (400B, 11 keys, 3 layers)
    Medium (13KB, 300+ key, 6 layers)
    Large (635KB, 10000+ key, 6 layers)
I would be so happy to work somewhere where a 0.000635 GB json file is "large".


From the documentation it seems to me that they're mostly concerned about resource utilization when processing APIs, and not dealing with files. 635KB may be representative of a large API payload in their environment.


Just another case where a library tests and publishes results for all competing libraries slower than it, but none faster. cough simdjson [1] cough

---

[1] https://github.com/simdjson/simdjson


simdjson is listed though.


It is mentioned here: https://github.com/bytedance/sonic/blob/a577eafc253adb943924..., but it isn't included in the benchmarks graphs. Seems this repo is specifically focused on Golang and isn't necessarily motivated by being the fastest JSON [de]serializer on the planet.


There is this one though: https://github.com/minio/simdjson-go

EDIT: Apologies, simdjson-go is included in the benchmarks and discussed as well.


Well it wasn't in the benchmark graphs and it didn't show in Ctrl+F in the README.

Admittedly, I didn't do an exhaustive search.


A port of simdjson, simdjson-go, is in the benchmark graphs. It seems to be too fast to be visible in most of the graphs.


the graphs show rate, not time


Right, they show MB/s of parsed JSON, and it definitely surprised me: simdjson-go is among the slowest! It looks like there's something fishy going on there, as it is _orders of magnitude_ slower than vanilla simdjson.


If someone runs at a 'higher' rate (m/s) than someone else, is it not right to say that they are faster?


Yes. But on a bar graph showing rate, taller bars mean faster, while on a bar graph showing time to complete a task, taller bars mean slower.

The graphs are showing rate, and simdjson-go has shorter bars (or nonexistent bars, which suggest a quantity too low to show up at the resolution of the graphs). TkTech thought the graphs were showing time, and so thought they were showing that simdjson-go is very fast, which skavi is pointing out is reversed because they are showing rate.


Oh right, yes, sorry, I didn't really tie the discussion back to the graphs (or lacked caffeine or something!).


I regularly deal with JSON documents several MB in size, but do developers frequently deal with JSON documents several GB in size? If so, where do you encounter something like that? Surely "processing" that much data (for whatever definition of process you have) is orders of magnitude slower than parsing it.

I love the idea of a library trying to squeeze every last bit of performance out of the CPU, but I'm genuinely curious at the problems it solves in the real world.


Depends. I've had multi-gigabyte `[{..}, {..}, ...]` json arrays from database dumps, and doing even basic things with that with jq takes ages unless you use the (highly obtuse IMO) streaming methods. Sometimes you can pre-grep to filter the results to something trivial to process, but sometimes the structure is not unique enough to let you do that, or it depends on multiple field values - filtering that with a json parser makes perfect sense, and then speed can matter.
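
For the array-dump case, here's a rough sketch of the same streaming idea using Go's stock encoding/json instead of jq (the file name, the Row fields, and the filter are all made up for illustration):

    // Hedged sketch: stream a huge top-level JSON array element by element
    // so memory stays bounded by one element, not the whole dump.
    package main

    import (
    	"encoding/json"
    	"fmt"
    	"log"
    	"os"
    )

    type Row struct {
    	ID   int64  `json:"id"`
    	Name string `json:"name"`
    }

    func main() {
    	f, err := os.Open("dump.json") // assumed file: [{...}, {...}, ...]
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	dec := json.NewDecoder(f)

    	// Consume the opening '[' of the array.
    	if _, err := dec.Token(); err != nil {
    		log.Fatal(err)
    	}

    	// Decode one element at a time.
    	for dec.More() {
    		var r Row
    		if err := dec.Decode(&r); err != nil {
    			log.Fatal(err)
    		}
    		if r.ID%1000 == 0 { // stand-in for whatever filter you actually need
    			fmt.Println(r.Name)
    		}
    	}

    	// Consume the closing ']'.
    	if _, err := dec.Token(); err != nil {
    		log.Fatal(err)
    	}
    }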

That said, a 2x+ improvement for a couple megabytes, especially if done many times per second, is still a significant improvement.


Monthly general ledger entries for the largest real estate companies. We tried XML and JSON, and eventually landed on compressed CSV as the best trade-off between human-readable large files (~1-3GB) and compressibility.


GeoJSON data can be just as big as you’d like. I have a big archive of GIS data where the files are enormous.


Not JSON, but we process XMLs on the order of a GB. Largest ones are consolidated invoices (ie a lot of separate invoices in one file). Other large ones contain rules and codelists in multiple languages.


All the time. It's not that there's a single record that is that size, but all sorts of things log in JSON, so you wind up with multi-GB jsonl files. As an example: AWS CloudTrail logs.


Huge log files are one use case. Large datasets are another, in data analysis, machine learning, and ETL tasks.


Aren't log files processed a line at a time? Last time I had to deal with some structured log, I streamed lines concurrently into json parser and it went pretty fast.
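
Roughly what that looks like in Go, as a hedged sketch (the Entry fields, file name, and worker count are assumptions, not anything from a real codebase):

    // Read a .jsonl log line by line and fan the lines out to worker
    // goroutines that each run a JSON parse.
    package main

    import (
    	"bufio"
    	"encoding/json"
    	"log"
    	"os"
    	"sync"
    )

    type Entry struct {
    	Level string `json:"level"`
    	Msg   string `json:"msg"`
    }

    func main() {
    	f, err := os.Open("app.jsonl") // assumed: one JSON object per line
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	lines := make(chan []byte, 1024)
    	var wg sync.WaitGroup
    	for i := 0; i < 8; i++ { // 8 workers, picked arbitrarily
    		wg.Add(1)
    		go func() {
    			defer wg.Done()
    			for line := range lines {
    				var e Entry
    				if err := json.Unmarshal(line, &e); err != nil {
    					continue // skip malformed lines
    				}
    				_ = e // process the entry here
    			}
    		}()
    	}

    	sc := bufio.NewScanner(f)
    	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow lines up to 1 MiB
    	for sc.Scan() {
    		// Copy the line: the Scanner reuses its buffer between calls.
    		line := append([]byte(nil), sc.Bytes()...)
    		lines <- line
    	}
    	close(lines)
    	wg.Wait()
    	if err := sc.Err(); err != nil {
    		log.Fatal(err)
    	}
    }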


Hardware config file for a small SOC. It faithfully describes every addressable physical byte.


Since when is 635 KB “large”? I suppose it depends on your use case; since they consider 400 B small, they probably use lots of JSON APIs for many small things.


635KB is indeed "large" if it's one message. (You will have millions of messages in one file if you're doing batch processing.)


How many libs have we seen claiming to be the fastest deserializing library, until you try to open files with corner cases, like UTF-8 characters...


I thought UTF-8 was the standard encoding that the json spec requires?


Because ASCII text is valid UTF-8, you will sometimes encounter "UTF-8" code that doesn't actually work for multi-byte characters
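
A quick way to sanity-check a decoder on multi-byte input, sketched here against Go's stock encoding/json (swap in whatever library you're actually testing):

    package main

    import (
    	"encoding/json"
    	"fmt"
    	"unicode/utf8"
    )

    func main() {
    	in := `{"name":"héllo 世界 🚀"}`
    	var m map[string]string
    	if err := json.Unmarshal([]byte(in), &m); err != nil {
    		panic(err)
    	}
    	// Check the decoded string is still valid UTF-8 and round-trips intact.
    	fmt.Println(utf8.ValidString(m["name"]), m["name"]) // true héllo 世界 🚀
    }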


Why does it feel like there are consistently new JSON parsing libraries popping up? People joke about JS frameworks, but JSON parsers almost feel like they're at the same level.


Somehow when I saw it's a fast JSON serializing library, I knew it was for Go. For some reason there are always new JSON libraries for it.


Go is kinda first class for microservice web apps (and that's probably its only strong point)


web development -> REST -> JSON -> JSON parsing libraries

So like js frameworks, json parsers are everywhere


It seems it's mostly written in assembly? I'd be worried about portability, maintainability, and security.

I'd be way more comfortable if it was written in C, or better yet Rust.


It sounds like it essentially is written in C. INTRODUCTION.md says:

> As for insufficiency in compiling optimization of go language, we decided to use C/Clang to write and compile core computational functions, and developed a set of asm2asm tools to translate the fully optimized x86 assembly into plan9 and finally load it into Golang runtime.

GitHub says it's 59.6% assembly and 6.5% C. Possibly the assembly is just the checked-in result of compiling+translating the C?

Gross that they have to do this. I know the Go folks really prioritize speed of compilation, but I wish they'd support debug builds with their own backend and release builds with LLVM so you could get this kind of performance when actually writing Go. I see there have been a few attempts at Go + general-purpose backend (gccgo, llgo, gollvm) but none seem to use the official Go frontend written in Go, so I think they're doomed to be second-class at best.

edit: and/or, if the Go folks and the GCC and/or LLVM folks could negotiate a shared ABI (not necessarily switching to the platform's default C ABI, but having "cc --abi=special-golang-stack-copying-thing"), folks could just link against something compiled in C/Rust/whatever without requiring this separate compilation+translation step (the high-maintenance, high-performance path) or CGo overhead (the easy but slow-for-frequent-calls path).


I really don't care what language it is written in. As long as there is full test coverage, I'm happy.

I've been a long time user of jsoniter and it is much faster than the standard lib. It really makes a difference for the work I do. If this is even faster and has tests, even better.
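
For anyone who hasn't used it, the usual jsoniter drop-in pattern looks roughly like this (a sketch, not a full migration guide):

    // Shadow the stdlib package name with a jsoniter config that aims to be
    // compatible with encoding/json.
    package main

    import (
    	"fmt"

    	jsoniter "github.com/json-iterator/go"
    )

    var json = jsoniter.ConfigCompatibleWithStandardLibrary

    func main() {
    	out, err := json.Marshal(map[string]int{"a": 1})
    	if err != nil {
    		panic(err)
    	}
    	fmt.Println(string(out)) // {"a":1}
    }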


Using Go assembly lets you avoid the Cgo overhead IIRC.


Released by ByteDance, who operate TikTok.


Which probably means it's solving a real-world and modern problem they encounter, and it carries the same weight as a library released by Google.


Cool. What's the interop story with Go's existing implementation for parsing objects with duplicate keys? Keep first, overwrite, or abort?
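
For reference, Go's stock encoding/json keeps the last occurrence of a duplicate key; a quick check (this says nothing about what sonic does):

    package main

    import (
    	"encoding/json"
    	"fmt"
    )

    func main() {
    	var m map[string]int
    	_ = json.Unmarshal([]byte(`{"a":1,"a":2}`), &m)
    	fmt.Println(m["a"]) // 2: the last value wins
    }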


Seems like it's similar in performance to Jackson?


I had to try working with a single 100GB JSON file. Couldn't find anything that could handle a file that size.


Out of curiosity, what does "working with" mean here? What operations did you need to perform on it? Streaming reads, transforms, indexed reads, appends, edits?

I'm thinking that any general-purpose JSON loader is likely to perform badly for a 100GB file, purely because it'll use 2x (or more likely 10x) as much memory for the parsed representation. So you'd want some kind of special-case reader for huge files -- maybe it just builds some kind of sparse index with pointers back into an mmap of the raw data.
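
A minimal sketch of that sparse-index idea with the standard decoder, assuming the file is one huge top-level array (the file name and index shape are made up; a real tool would mmap the file and trim separators properly):

    // Record byte offsets of each top-level array element so later reads can
    // seek straight to an element instead of re-parsing the whole file.
    package main

    import (
    	"encoding/json"
    	"fmt"
    	"log"
    	"os"
    )

    func main() {
    	f, err := os.Open("huge.json") // assumed: [ {...}, {...}, ... ]
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer f.Close()

    	dec := json.NewDecoder(f)
    	if _, err := dec.Token(); err != nil { // consume '['
    		log.Fatal(err)
    	}

    	type span struct{ start, end int64 }
    	var index []span
    	for dec.More() {
    		// InputOffset points just after the previous token, so start may
    		// include a leading comma/whitespace; good enough for an index.
    		start := dec.InputOffset()
    		var skip json.RawMessage
    		if err := dec.Decode(&skip); err != nil {
    			log.Fatal(err)
    		}
    		index = append(index, span{start, dec.InputOffset()})
    	}
    	fmt.Printf("indexed %d elements\n", len(index))
    	// Later: f.ReadAt with index[i].start pulls out element i without
    	// touching the rest of the file.
    }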


At that point you might as well ignore the fact it's JSON and just treat it as a stream.


Does it escape data by default? It appears to be a low priority for the author in the issues.
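
For comparison, Go's stock encoding/json HTML-escapes <, >, and & by default, and you opt out per encoder; that says nothing about sonic's own defaults:

    package main

    import (
    	"encoding/json"
    	"os"
    )

    func main() {
    	v := map[string]string{"html": "<b>&</b>"}

    	enc := json.NewEncoder(os.Stdout)
    	enc.Encode(v) // {"html":"\u003cb\u003e\u0026\u003c/b\u003e"}

    	enc2 := json.NewEncoder(os.Stdout)
    	enc2.SetEscapeHTML(false)
    	enc2.Encode(v) // {"html":"<b>&</b>"}
    }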



