
This is very cool. Meanwhile, in the xi-editor project, we're struggling with the fact that Swift JSON parsing is very slow. My benchmarking clocked it at 0.00089 GB/s for Swift 4, and things don't seem to have improved much with Swift 5. I'm encouraging people on that issue [1] to do a blog post.

[1]: https://github.com/xi-editor/xi-mac/issues/102

I wrote my own Swift JSON parser quite a while ago, https://github.com/postmates/PMJSON. In my limited benchmarking it parses slower than Foundation's JSONSerialization (by a factor of 2–2.5 IIRC) but encodes faster, and my impression was most of the time was spent constructing Dictionaries, but I didn't do too much performance work on it. It might be interesting to have someone else take a crack at improving the performance.

That said, it also includes an event-based parser (called JSONDecoder), so if you want to handle events in order to decode into your own data structure and skip the intermediate JSON data structure, you might be able to get faster than JSONSerialization that way.

Why does Xi use JSON in the first place? It would be easier and faster to use a binary format, e.g. Protobufs or Flatbuffers, or, if JSON's semantics are needed, CBOR.

From “Design Decisions”[1]:

> JSON. The protocol for front-end / back-end communication, as well as between the back-end and plug-ins, is based on simple JSON messages. I considered binary formats, but the actual improvement in performance would be completely in the noise. Using JSON considerably lowers friction for developing plug-ins, as it’s available out of the box for most modern languages, and there are plenty of the libraries available for the other ones.

1: https://github.com/xi-editor/xi-editor/blob/master/README.md...

So is it too slow or not?

We actually do get 60fps, but JSON parsing on the Swift side takes more than its share of total CPU load, affecting power consumption among other things. So (partly to address the trolls elsewhere in the thread), the choice of JSON does not preclude fast implementation (as the existence of simdjson proves), but it does make it dependent on the language having a performant JSON implementation. I made the assumption that this would be the case, and for Swift it isn't.

At some point though, isn't it maybe easier just to use an inherently more efficient format than trying to rely on clever implementations to save you?

I totally get json for public internet services where you want to have lots of consumers and using a more efficient format would be significant friction, but writing an editor frontend is a very large endeavor -- it seems like the extra work of adopting something more efficient than json (like flatbuffers or whatever) would really be in the noise.

It's a complicated tradeoff. It's not just performance, the main thing is clear code. Another factor was support across a wide variety of languages, which was thinner for things like flatbuffers at the time we adopted JSON. Also, "clever implementations" like simdjson don't have a high cost, if they're nice open source libraries.

The problem with clever implementations isn't that they can't be reused or that they have abnormally high cost for end-users (though this is sometimes the case). It's that they inherently require more work to author, maintain, and debug over time. When you're talking about a cross-language protocol that will have myriad available implementations (each with different constraints), it's not unreasonable to look at how much work a third party must engage in to get such a "clever" implementation (or, in other words, "how many people could reimplement simdjson?"). And if those existing clever implementations aren't available (or viable) for some use case, then you're out of luck and start at square one. This happens more often than you think.

In this case there's a lot of work already put into fast JSON parsers, but in general JSON is not a very friendly format to work with or write efficient, generalized implementations of. Maybe it's not worth switching to something else. I'm not saying you should, it seems like a fine choice to me. But clever implementations don't come free and representation choice has a big impact on how "clever" you need to be.

Re clear code, to my mind it comes out pretty much the same regardless of serialization† format: best approach is to have protocol be written down in some real language (e.g. flatbufs schema or annotated rust structs or whatever), and codegen for target languages.

My guess is it's easier to write an efficient flatbuffers (or similar) serializer+deserializer than an efficient json serializer+deserializer. And the top end of performance is definitely higher.

So if you're already reaching the point of needing to write your own json deserializers...

(† Unless you're talking about some hand-written bespoke binary format, but that would almost certainly be crazy.)

One of the other performant libraries in the comparison section of simdjson has a Swift wrapper: https://github.com/chadaustin/sajson. Haven't tried it, but one option would be to bring that up to date. Another option: now that Swift 5 strings use UTF-8 as their native encoding, it may be possible to write a fast JSON parser in native Swift. Likely someone already has done or is doing that.

It's not a binary yes/no question.

Given equally-high quality JSON and binary serdes, JSON is sufficiently fast. Raphlinus is saying that Swift's built-in deserialiser is obnoxiously slow.

Any reason not to just use a third party Swift JSON library?

Xi has multiple components written in multiple languages. In the rust core, json de/serialization is not a problem, but swift is lacking a similar high-performance library.

I'm going off topic at this point but I'd think for a native app the main advantages of a binary format would be the static typing and code generation that come from using an IDL.

Rust (the language the Xi core is developed in) has static typing for JSON, as well as other serialization formats: https://github.com/serde-rs/json

I'm familiar with serde. It's an incredible project, but I wouldn't quite call it "static typing for JSON". You still have to unwrap the parse at some point. However, I will concede the point that if you have Rust on both sides then you'll get most of the benefits.

You can have a binary format that's self-describing. It's important to understand all of the independent parts that go into a format.


You misread the rationale. He is arguing that, with all conditions same, the difference between binary formats and JSON would be in the noise. It is often the case that the object construction is more costly than the JSON parsing, and you can't fix that with binary formats.

As a minimal and extremely non-scientific benchmark, I've constructed a simple fixed data structure that encodes to JSON (using Python `json` module) and simple binary formats (that would be an ideal case for Python `struct` module). Decoding the same simple value 1,000,000 times in CPython 3.6.4 took...

    Format  Size  Iters.     Speed
    ------  ----  ---------  ----------
    JSON      28    205,000   5.75 MB/s
    Struct     6  2,400,000  14.4  MB/s
Of course YMMV, but even the `struct` module was only 2–12 times faster (depending on what you care about) than the `json` module in this particular case. And this is a really minimal case; you need (slow) interpreted code for more complex binary formats. Right, you can use PyPy for JIT compilation or binary modules to sidestep the interpreter overhead! The point is that it of course matters, but not with the drastic improvements you'd imagine.
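For the curious, a sketch of that kind of comparison (the record layout and format string here are my assumptions, not the original benchmark):

```python
import json
import struct
import timeit

# Hypothetical fixed record: two small integers and a float.
record = {"a": 1, "b": 2, "x": 3.5}
packer = struct.Struct("<iid")  # matching binary layout: int, int, double

json_blob = json.dumps(record)
bin_blob = packer.pack(record["a"], record["b"], record["x"])

n = 100_000
t_json = timeit.timeit(lambda: json.loads(json_blob), number=n)
t_bin = timeit.timeit(lambda: packer.unpack(bin_blob), number=n)

print(f"json:   {len(json_blob):3d} bytes, {n / t_json:,.0f} decodes/s")
print(f"struct: {len(bin_blob):3d} bytes, {n / t_bin:,.0f} decodes/s")
```

Note that `packer.unpack` returns a bare tuple; rebuilding the dict (i.e. the "object construction" part) would eat into the binary format's lead, which is exactly the point above.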

> "It is often the case that the object construction is more costly than the JSON parsing, and you can't fix that with binary formats."


  typedef struct _some_struct_t {
      unsigned long some_long;
      unsigned long some_other_long;
  } some_struct_t;

      some_struct_t foo = { 0 };

      foo.some_long = 1;
      foo.some_other_long = 2;
Is somehow comparable to using JSON?

C is one of extreme cases; that's why Cap'n Proto works pretty well in C++ and its cousins, for example (it amortizes the decoding cost into accessors, and accessors are really cheap in those languages). There are many languages and implementations where decoding cost is not as significant.

> "C is one of extreme cases"

I would say it's the other way around.

We've had the knowledge and tools to build performant, scalable and highly maintainable systems for a while now. The learning curve is there, but that's part of the trade. We've been too occupied with reducing the entry barrier though - the end result being people shoving JSON into places it should have never been in.

JSON can absolutely be a part of a text editor's architecture - with areas that don't necessarily require near real time performance (think configuration, metrics). Anything beyond that - C structs would be a great way to go, and I don't see why there's a debate here.

Because the idea of Xi is that it can support different frontends for different platforms, and that probably wouldn't work out too well if they all had to be in C.

The Xi backend is already written in Rust, a relatively low-level language with a somewhat C-like FFI/ABI. The choice to use JSON in time-critical code, when more performant alternatives are available, seems to me like a mistake.

The whole point is JSON is not in the time-critical path.

This is a super flawed argument. Clearly flat buffers and even protocol buffers are faster to serialize and deserialize than json, regardless of what you benchmark in python.

And for the amount of messages that are being sent, the speed difference is irrelevant.

This is the same conclusion sqlite developers came to. They tested turning JSON column types to binary and the speed difference was not large enough to warrant maintaining that code so they kept the data in JSON.

If the speed difference is irrelevant, why are they struggling with it?

Because most implementations are reasonably efficient. Swift default one is apparently not.

Python might be the one language that isn't true for. In my Python experience, the Google protobuf library is frustratingly slower than the built-in json module for any data structures I've cared about, which is why things like pyrobuf exist to solve that performance problem: https://github.com/appnexus/pyrobuf

So you claim that decoding flat buffers and protobuf is faster than decoding with `struct`? I'm pretty much aware of the various flaws and even stated some, but I can hardly buy that claim without a separate benchmark (which I really welcome, by the way).

At least I fully understand what the `struct` module actually does under the hood: it sorta compiles the format down to a list of fields and "interprets" that with a dead simple VM in C. Oh, and of course I've used the precompiled `struct.Struct` for that reason (but it was only 20% faster). Anyway, this arrangement is typical for most schematic serialization formats in any language: a bunch of function calls for gluing the desired format together, plus a set of well-optimized core functions (not necessarily written in C :-). Hence my justification that this is close to a "bare-bones" serialization format.
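For reference, the precompiled form looks like this (the two-unsigned-long layout is just an assumption to mirror the C struct upthread):

```python
import struct

# Precompiling the format avoids re-parsing the format string on every call.
two_longs = struct.Struct("<2L")  # two little-endian unsigned 32-bit longs

buf = two_longs.pack(1, 2)
assert len(buf) == two_longs.size  # 8 bytes total
assert two_longs.unpack(buf) == (1, 2)
```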

> with all conditions same, the difference between binary formats and JSON would be in the noise.

But, seemingly, in this case the conditions aren't the same.

Are you using the slowness of Python's `struct` module to prove that binary formats in fast languages are slow?

I've benchmarked Capnp Vs JSON for Modern C++ in C++, and Capnp was something like 8 times faster.

If you're struggling with JSON performance how is moving to a binary format like Capnp (or Flatbuffers etc.) not a better solution?

It seems they're getting parsing times 1,000x slower than any other parser, 10,000x slower than simdjson. The complaint is understandable, but ironic :)

These numbers are not quite right for a variety of reasons (performance measurement methodology is hard), but to do something more of an apples-to-apples comparison, it's about 50x slower than serde in Rust. That's still a lot, obviously.

But... how else will the people that have never seen a byte array or had to flip endianness be able to write plugins for my text editor?

Because JSON encoding/decoding was not found to be a typical performance bottleneck, and because JSON is supported in virtually every programming language (Xi allows you to write frontends in pretty much any language you want).

After spending most of a year doing deep surgery on systems that used CBOR extensively, I can report that the common CBOR parsers are not faster than common JSON parsers; surprisingly, they are actually slower. CBOR is also not easier; it's much less widely supported, and you need a separate debugging representation. It does have three real advantages over JSON: it supports binary strings, it's a monument to Carsten Bormann's ego, and data encoded in CBOR takes slightly fewer bytes than the same data encoded in JSON. (The second is only an advantage if you're Carsten Bormann.)

There are a few more advantages to CBOR:

1) there's a distinction between integers and floating point values;

2) you can semantically tag values (yes, this is a text string, but treat it as a date; this is a binary string, but treat it as a big number; etc.);

3) you can have maps with non-text keys.
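To illustrate point 1 concretely, here's a toy hand-encoding per the RFC (not a real CBOR library): the integer 1 and the float 1.0 get distinct encodings, a distinction JSON can't express:

```python
import struct

def cbor_small_uint(n: int) -> bytes:
    # Major type 0 (unsigned int); values 0..23 fit in the initial byte.
    assert 0 <= n <= 23
    return bytes([n])

def cbor_float64(x: float) -> bytes:
    # Initial byte 0xfb introduces a big-endian IEEE 754 double.
    return b"\xfb" + struct.pack(">d", x)

print(cbor_small_uint(1).hex())   # 01
print(cbor_float64(1.0).hex())    # fb3ff0000000000000
```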

I'm not sure what Carsten Bormann's ego has to do with CBOR, but I found RFC-7049 one of the better written specs, with plenty of encoding examples. It made it real easy to write an encoder/decoder [1] and use the examples as test cases.

[1] https://github.com/spc476/CBOR

All three of those could be advantages under some circumstances, but I've more often found them to be disadvantages. What do you do with maps with non-text keys when you're deserializing in JS or Perl? For that matter, what do you do in Python when the key of a map is a map? When you have a date, do you decode it as a datetime object, as a text string, or as some kind of wrapper object that gives you both alternatives?
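Concretely, in Python a map-valued key has no direct representation because dicts aren't hashable, so a CBOR decoder has to pick some convention for you (the tuple-of-pairs freeze below is just one possible choice):

```python
inner = {"a": 1}

# A dict can't be a dict key directly:
try:
    {inner: "x"}
except TypeError as e:
    print(e)  # unhashable type: 'dict'

# One possible convention: freeze the map into a sorted tuple of pairs.
frozen = tuple(sorted(inner.items()))
table = {frozen: "x"}
assert table[(("a", 1),)] == "x"
```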

I agree that having lots of examples in the spec is good.

> What do you do with maps with non-text keys when you're deserializing in JS or Perl?

Um, use another language? I use Lua, which can deal with non-text keys. As for decoding dates (if they're semantically tagged, which you can with CBOR) I convert it to a datetime object, on the grounds that if I care about tagged dates, I'm going to be using them in some capacity.

But that's not to say you have to use the flexibility of CBOR. But for me, having distinct integer and floating point values, plus distinct text and binary data, is enough of a win to use it over JSON.

While theoretically true, in practice the actual character parsing tends to be a small-to-negligible part of the overall time. Which leads to the measurable fact that on macOS/iOS, the JSON serialization stuff is actually one of their fastest, faster than their binary stuff.

I ran one of the Codable benchmarks in instruments, and here's what the top functions were:

  19.98 s   swift_getGenericMetadata
  19.15 s   newJSONString
  16.17 s   objc_msgSend
  15.33 s   _swift_release_(swift::HeapObject*)
  14.45 s   tiny_malloc_should_clear
  12.81 s   _swift_retain_(swift::HeapObject*)
  11.28 s   searchInConformanceCache(swift::TargetMetadata<swift::InProcess> const*, swift::TargetProtocolDescriptor<swift::InProcess> const*)
  10.46 s   swift_dynamicCastImpl(swift::OpaqueValue*, swift::OpaqueValue*, swift::TargetMetadata<swift::InProcess> const*, swift::TargetMetadata<swift::InProcess> const*, swift::DynamicCastFlags)
So it looks like a lot of the time is going into memory management or the Swift runtime performing type checking.

Yeah, I've done some analysis, it's creating a ton of objects to conform to the Codable protocol, and a lot of those objects are for codingPath, which is updated for basically every node in the tree. It's not a mystery, we just don't know the best way to fix it.

Is there a reason you need to use Codable? Sorry if this sounds uninformed, I haven't taken that much time to look at what you're doing exactly (I just ran https://github.com/jeremywiebe/json-performance).

That's one of the things we're considering. But it is by far the most idiomatic way to do things in Swift. One of the alternatives we're considering is implementing the line cache (including the update protocol) in Rust, which would be a huge performance jump.

No, I don’t think the project needs to use Codable. The point of that benchmark was to evaluate Codable’s performance under Swift 5. It was suggested that performance was much improved. The benchmark shows that it has improved a little, but not significantly.

Codable is desirable because it encodes/decodes directly to structs vs manually picking fields out of dicts.

Can you see any differences with different levels of optimization? I recall a presentation at some point where the old obj-C style compiled code did a lot of checks before and after calling a method ("does this object listen to this message?"), while with an optimization option enabled (whole module optimization?) these calls could be optimized out. That is, with Swift they can make the resulting machine code less er, "checking for safety", so to speak.

This was done at -O I believe (whatever the default is for "Profiling" in Xcode). This is anecdotal, but the fact that the code isn't littered with _swift_retain/_swift_release calls probably means that most of the standard reference-counting boilerplate has been optimized away.

Yeah, Swift-most-everything is pretty slow, but particularly parsing/generating. Pre-Swift Foundation serialisation code was already...majestic, and in the Swift conversion they've typically managed to slow things down even further. Which didn't seem possible, but they managed.

I have given a bunch of talks[1] on this topic, there's also a chapter in my iOS/macOS performance book[2], which I really recommend if you want to understand this particular topic. I did really fast XML[3][4], CSV[5] and binary plist parsers[6] for Cocoa and also a fast JSON serialiser[7]. All of these are usually around an order of magnitude faster than their Apple equivalents.

Sadly, I haven't gotten around to doing a JSON parser. One reason for this is that parsing the JSON at the character level is actually the smaller problem, performance-wise, same as for XML. Performance tends to be largely determined by what you create as a result. If you create generic Foundation/Swift dictionaries/arrays/etc. you have already lost. The overhead of these generic data structures completely overwhelms the cost of scanning a few bytes.

So you need something more akin to a streaming interface, and if you create objects you must create them directly, without generic temporary objects. This is where XML is easier, because it has an opening tag that you can use to determine what object to create. With JSON, you get "{", so basically you have to know what structure level corresponds to what objects.
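In Python terms, the closest stdlib approximation of "create objects directly" is `json.loads` with an `object_hook`, which swaps your own types in for the generic dicts as the tree is built (though it still materializes each dict first, so it only approximates a true streaming interface; the `Point` type and key-based dispatch are illustrative assumptions):

```python
import json
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

def build(d):
    # Called for every JSON object as it's parsed, innermost first.
    # With only "{" to go on, we dispatch on the key set -- exactly the
    # "know what structure level corresponds to what objects" problem.
    if d.keys() == {"x", "y"}:
        return Point(d["x"], d["y"])
    return d

doc = json.loads('{"points": [{"x": 1.0, "y": 2.0}]}', object_hook=build)
assert doc["points"][0] == Point(1.0, 2.0)
```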

Maybe I should write that parser...

[1] https://www.google.com/search?hl=en&q=marcel%20weiher%20perf...

[2] https://www.amazon.com/gp/product/0321842847/

[3] https://github.com/mpw/Objective-XML

[4] https://blog.metaobject.com/2010/05/xml-performance-revisite...

[5] https://github.com/mpw/MPWFoundation/blob/master/Collections...

[6] https://github.com/mpw/MPWFoundation/blob/master/Collections...

[7] https://github.com/mpw/MPWFoundation/blob/master/Streams.sub...

That resonates well with my conclusions that led to the Replicated Object Notation project [1]. If the parser creates an AST or some number of dictionaries or some other bullshit... "now you have two problems", that's it.

I settled on a tabular-log format, which is streamed and immediately consumed most of the time, no intermediate object structures.

Then, that "text vs binary" distinction became mostly moot. The binary is slightly more efficient, but grossly less readable, so no big gain, unless at grand scale.

[1] http://replicated.cc

What are you using? Have you tried NSJSONSerialization? It’s quite fast (am very curious how it shows in these benchmarks), but I don’t think it does the fancy Codable stuff.

You might want to check out the benchmark I wrote to compare exactly that.


Swift has JSONEncoder and JSONDecoder types to do Codable, though internally they have to encode to/decode from the Foundation objects that JSONSerialization produces.

Hey Raph, have you seen https://github.com/bmkor/gason? Seems like a low-cost bridge to a high-performance C++ implementation.

Hadn't seen that particular wrapper, but if we're going to take on an FFI solution, we're more likely to use Rust for this, and implement more logic than just JSON parsing.
