
It strikes me the bottleneck for this problem isn't Vec or List, it's the serde_json Value type that has to be used. That type is great for serializing/deserializing values into Rust types, but if you're trying to validate JSON against a schema you don't actually need the JSON value data, just the types of the nodes (or more precisely, you only need some of the value data, and probably not much of it, so don't pay for it when you don't have to).

If you implemented your own parser for both the schema and the JSON, and validated using only an AST plus span information (which can be as small as a pair of u16s for the start/end of a token), you could collect the error locations very, very quickly and generate the actual error messages once validation is finished.
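For illustration, a minimal sketch of that idea (the names and layout here are just assumptions, not anything from the article): a flat, span-based node type that carries only a kind and offsets back into the source, with the text sliced out lazily when an error message actually has to be rendered.

    // Hypothetical compact AST node: no text is stored, only a kind and a
    // span pointing back into the original source buffer.
    #[derive(Clone, Copy, Debug)]
    enum NodeKind { Object, Array, String, Number, Bool, Null }

    #[derive(Clone, Copy)]
    struct Node {
        kind: NodeKind,
        start: u32, // byte offset of the token in the source
        len: u16,   // token length in bytes
    }

    // Error messages are rendered after validation by slicing the source.
    fn describe(src: &str, node: &Node, expected: NodeKind) -> String {
        let start = node.start as usize;
        let text = &src[start..start + node.len as usize];
        format!("expected {:?}, found {:?} at byte {}: {}", expected, node.kind, node.start, text)
    }

    fn main() {
        let src = r#"{"id": "abc"}"#;
        // Pretend the parser produced this node for the value "abc".
        let node = Node { kind: NodeKind::String, start: 7, len: 5 };
        println!("{}", describe(src, &node, NodeKind::Number));
    }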

Heavily optimized compilers will do this for semantic analysis and type checking, where the IR they use can be constructed quickly and then used with the input text to get helpful messages out, while the guts of the algorithm is only working with semantic information in a structure that's compact and easy to access/manipulate.

All that said, serde_json is incredibly convenient, and giving it up to write your own parser is a big hammer for a problem that probably doesn't need it.



> All that said, serde_json is incredibly convenient, and giving it up to write your own parser is a big hammer for a problem that probably doesn't need it.

I had a thought in my reply [0] on this that actually might let him eat his cake and have it too in this regard. I think you can heavily abuse serde::de::Visitor to schema-validate without actually parsing (or with less parsing, at any rate). I went into more detail in my comment but I wanted to ping you (@duped).

[0]: https://news.ycombinator.com/item?id=40357159
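Roughly the shape of what I mean, as a minimal sketch (the "id" field and this one-field schema are made up purely for illustration): a Visitor that type-checks fields as serde_json streams them, discarding the values with IgnoredAny instead of building a serde_json::Value tree.

    use serde::de::{self, Deserializer, IgnoredAny, MapAccess, Visitor};
    use std::fmt;

    // Hypothetical "schema": the document must be an object with an unsigned
    // integer "id" field. Everything else is skipped without allocating.
    struct SchemaCheck;

    impl<'de> Visitor<'de> for SchemaCheck {
        type Value = ();

        fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
            write!(f, "an object with an unsigned integer \"id\" field")
        }

        fn visit_map<A: MapAccess<'de>>(self, mut map: A) -> Result<(), A::Error> {
            let mut saw_id = false;
            while let Some(key) = map.next_key::<&str>()? {
                if key == "id" {
                    map.next_value::<u64>()?; // type check only, value discarded
                    saw_id = true;
                } else {
                    map.next_value::<IgnoredAny>()?; // skip without building a Value
                }
            }
            if saw_id { Ok(()) } else { Err(de::Error::missing_field("id")) }
        }
    }

    fn validate(input: &str) -> Result<(), serde_json::Error> {
        let mut de = serde_json::Deserializer::from_str(input);
        de.deserialize_map(SchemaCheck)
    }

    fn main() {
        assert!(validate(r#"{"id": 42, "extra": [1, 2, 3]}"#).is_ok());
        assert!(validate(r#"{"id": "nope"}"#).is_err());
        assert!(validate(r#"[1, 2, 3]"#).is_err());
    }

One trade-off: the Visitor itself never sees byte offsets, so you're leaning on serde_json's own line/column error reporting rather than your own spans.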


Or, use sonic_rs::Value, which uses SIMD and arena allocation for keys; in my case, for big payloads, it was 8x faster than serde_json.

Or, use sonic_rs::LazyValue if you want to parse JSON on demand as you traverse it.


Hey, that's a nifty looking library. Thanks for pointing it out to me!


I've seen a lexer/parser scheme that encodes the lexer token type along with the token file location information into a u64 integer, something like

    struct Token {
        token_type: u8,
        type_info: u8,
        token_start: u32,    // offset into the source file.
        token_len: u16
    }
It's blazing fast; the lexer/parser can process millions of lines per second. The type information is encoded right in the token, and the location information lets you pull the original text back out whenever you need it.
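A hypothetical helper along those lines (assuming the Token struct above): the token stores only offsets, so the text is sliced out of the source buffer on demand rather than copied into the token.

    // Hypothetical: recover the token's text from the source on demand.
    fn token_text<'a>(src: &'a str, tok: &Token) -> &'a str {
        let start = tok.token_start as usize;
        &src[start..start + tok.token_len as usize]
    }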


This is roughly what my JSON parser does. It does type-checking, albeit not against JSON Schema but against an object descriptor that you have to define to parse and generate JSON.

It's been developed for embedded systems (it was written originally for a NATS implementation in the Zephyr RTOS), so it's a bit limited and there's no easy way to know where some parsing/type validation error happened, but the information is there if one wants to obtain it: https://github.com/lpereira/lwan/blob/master/src/samples/tec...


I think the key insight is that the true benchmark is bytes/second (bandwidth) through the lexer/parser, so reducing the size of the output data (tokens/AST nodes) is a massive gain in how much input you can process in the same amount of time.

The fewer bytes you pack your data into, the more data you can process per second. To make it concrete: at a fixed memory bandwidth, shrinking a node from 32 bytes to 8 bytes lets you touch four times as many nodes in the same amount of time. Computers may be the most complex machines ever built, but the simple fact remains: with fewer things to touch, you can touch more of them in the same amount of time.


That is a really cool idea! Thank you.


Just use Cap'n Proto. No deserialization needed!


What's your experience like using it? Is it ergonomic, or does it require lots of type gymnastics?


This repo has a nice pub/sub implementation based on capnp: https://github.com/commaai/cereal/blob/master/log.capnp



