
It strikes me the bottleneck for this problem isn't Vec or List, it's the serde_json Value type that has to be used. That type is great for serializing/deserializing values into Rust types, but if you're trying to validate JSON against a schema you don't actually need the JSON value data, just the types of the nodes (or more precisely, you only need some of the value data, and probably not much of it, so don't pay for it when you don't have to).

If you implemented your own parser for both the schema and the JSON, and validated using only an AST plus span information (which can be as small as a pair of u16s for the start/end of a token), you could collect the error locations very, very quickly and generate the actual error messages once validation is finished.
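For illustration, a minimal sketch of that idea (the names and layout here are just assumptions, not anything from the article): a flat, span-based node type that carries only a kind and offsets back into the source, with the text sliced out lazily when an error message actually has to be rendered.

    // Hypothetical compact AST node: no text is stored, only a kind and a
    // span pointing back into the original source buffer.
    #[derive(Clone, Copy, Debug)]
    enum NodeKind { Object, Array, String, Number, Bool, Null }

    #[derive(Clone, Copy)]
    struct Node {
        kind: NodeKind,
        start: u32, // byte offset of the token in the source
        len: u16,   // token length in bytes
    }

    // Error messages are rendered after validation by slicing the source.
    fn describe(src: &str, node: &Node, expected: NodeKind) -> String {
        let start = node.start as usize;
        let text = &src[start..start + node.len as usize];
        format!("expected {:?}, found {:?} at byte {}: {}", expected, node.kind, node.start, text)
    }

    fn main() {
        let src = r#"{"id": "abc"}"#;
        // Pretend the parser produced this node for the value "abc".
        let node = Node { kind: NodeKind::String, start: 7, len: 5 };
        println!("{}", describe(src, &node, NodeKind::Number));
    }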

Heavily optimized compilers will do this for semantic analysis and type checking, where the IR they use can be constructed quickly and then used with the input text to get helpful messages out, while the guts of the algorithm is only working with semantic information in a structure that's compact and easy to access/manipulate.

All that said, serde_json is incredibly convenient, and giving it up to write your own parser is a big hammer for a problem that probably doesn't need it.



> All that said, serde_json is incredibly convenient, and giving it up to write your own parser is a big hammer for a problem that probably doesn't need it.

I had a thought in my reply [0] on this that actually might let him eat his cake and have it too in this regard. I think you can heavily abuse serde::de::Visitor to schema-validate without actually parsing (or with less parsing, at any rate). I went into more detail in my comment but I wanted to ping you (@duped).

[0]: https://news.ycombinator.com/item?id=40357159
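Roughly the shape of what I mean, as a minimal sketch (the "id" field and this one-field schema are made up purely for illustration): a Visitor that type-checks fields as serde_json streams them, discarding the values with IgnoredAny instead of building a serde_json::Value tree.

    use serde::de::{self, Deserializer, IgnoredAny, MapAccess, Visitor};
    use std::fmt;

    // Hypothetical "schema": the document must be an object with an unsigned
    // integer "id" field. Everything else is skipped without allocating.
    struct SchemaCheck;

    impl<'de> Visitor<'de> for SchemaCheck {
        type Value = ();

        fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
            write!(f, "an object with an unsigned integer \"id\" field")
        }

        fn visit_map<A: MapAccess<'de>>(self, mut map: A) -> Result<(), A::Error> {
            let mut saw_id = false;
            while let Some(key) = map.next_key::<&str>()? {
                if key == "id" {
                    map.next_value::<u64>()?; // type check only, value discarded
                    saw_id = true;
                } else {
                    map.next_value::<IgnoredAny>()?; // skip without building a Value
                }
            }
            if saw_id { Ok(()) } else { Err(de::Error::missing_field("id")) }
        }
    }

    fn validate(input: &str) -> Result<(), serde_json::Error> {
        let mut de = serde_json::Deserializer::from_str(input);
        de.deserialize_map(SchemaCheck)
    }

    fn main() {
        assert!(validate(r#"{"id": 42, "extra": [1, 2, 3]}"#).is_ok());
        assert!(validate(r#"{"id": "nope"}"#).is_err());
        assert!(validate(r#"[1, 2, 3]"#).is_err());
    }

One trade-off: the Visitor itself never sees byte offsets, so you're leaning on serde_json's own line/column error reporting rather than your own spans.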


Or, use sonic_rs::Value, which uses SIMD and arena allocation for keys; in my case, for big payloads, it was 8x faster than serde_json.

Or, use sonic_rs::LazyValue if you want to parse JSON on demand as you traverse it.


Hey, that's a nifty looking library. Thanks for pointing it out to me!


I've seen a lexer/parser scheme that encodes the lexer token type along with the token file location information into a u64 integer, something like

    struct Token {
        token_type: u8,
        type_info: u8,
        token_start: u32,    // offset into the source file.
        token_len: u16
    }
It's blazing fast; the lexer/parser can process millions of lines per second. The type information is encoded right in the token, and the location information lets you pull the original text back out whenever you need it.
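A hypothetical helper along those lines (assuming the Token struct above): the token stores only offsets, so the text is sliced out of the source buffer on demand rather than copied into the token.

    // Hypothetical: recover the token's text from the source on demand.
    fn token_text<'a>(src: &'a str, tok: &Token) -> &'a str {
        let start = tok.token_start as usize;
        &src[start..start + tok.token_len as usize]
    }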


This is roughly what my JSON parser does. It does type-checking, albeit not against JSON Schema but against an object descriptor that you have to define to parse and generate JSON.

It's been developed for embedded systems (it was written originally for a NATS implementation in the Zephyr RTOS), so it's a bit limited and there's no easy way to know where some parsing/type validation error happened, but the information is there if one wants to obtain it: https://github.com/lpereira/lwan/blob/master/src/samples/tec...


I think the key insight is that the true benchmark is bytes/second (bandwidth) through the lexer/parser, so reducing the size of the output data (tokens/AST nodes) is a massive gain in how much input you can process in the same amount of time.

The fewer bytes you pack your data into, the more data you can process per second. To make it concrete: at a fixed memory bandwidth, shrinking a node from 32 bytes to 8 bytes lets you touch four times as many nodes in the same amount of time. Computers may be the most complex machines ever built, but the simple fact remains: with fewer things to touch, you can touch more of them in the same amount of time.


That is a really cool idea! Thank you.


Just use Cap'n Proto. No deserialization needed!


What's your experience like using it? Is it ergonomic, or does it require lots of type gymnastics?


This repo has a nice pub/sub implementation based on capnp: https://github.com/commaai/cereal/blob/master/log.capnp



