Show HN: Gosax – A high-performance SAX XML parser for Go (github.com/orisano)
76 points by orisano 11 months ago | 32 comments
I've just released gosax, a new Go library for high-performance SAX (Simple API for XML) parsing. It's designed for efficient, memory-conscious XML processing, drawing inspiration from quick-xml and pkg/json.

https://github.com/orisano/gosax

Key features:

- Read-only SAX parsing
- Highly efficient parsing using techniques inspired by quick-xml and pkg/json
- SWAR (SIMD Within A Register) optimizations for fast text processing

gosax is particularly useful for processing large XML files or streams without loading the entire document into memory. It's well-suited for data feeds, large configuration files, or any scenario where XML parsing speed is crucial. I'd appreciate any feedback, especially from those working with large-scale XML processing in Go. What are your current pain points with XML parsing? How could gosax potentially help your projects?




Very nice, thank you!

Unhelpfully my only pain point with XML parsing is colleagues refusing to use XML in favour of json or, in really grim moments, yaml.

So I'm delighted to see a sensible modern web language implementation of the one true data exchange format. Thank you for sharing it.


I agree YAML is awful. JSON is OK if you at least allow comments (for configuration use cases). There are a couple of variants that do: JSON5 and Jsonnet. I like JSON5 for its relative simplicity, but Jsonnet has much better ecosystem support, so I'd probably go with that.

XML is just terrible though. Unless you have a proper schema everything is entirely untyped (tbf the schema support is pretty good). But more to the point it just doesn't map to normal programming language objects cleanly. It's a document markup language, not an object encoding.

That means there's an annoying mismatch when parsing for 99% of use cases.

Couple that with the crazy verbosity and the weird confusing features like namespaces... I think I would rather use YAML to be honest, even though it is really bad.

Since YAML is a superset of JSON I sometimes actually use JSON with `#` comments, and read it as YAML. Only downside is nothing checks if you are using that format correctly.
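
For anyone curious, a minimal sketch of that trick in Go, assuming gopkg.in/yaml.v3 as the YAML library (the config keys are made up):

  package main

  import (
      "fmt"

      "gopkg.in/yaml.v3"
  )

  func main() {
      // JSON with '#' comments: not valid JSON, but valid YAML,
      // since YAML 1.2 is a superset of JSON.
      data := []byte(`
  # hypothetical service config
  {"host": "localhost", "port": 8080}
  `)
      var cfg map[string]interface{}
      if err := yaml.Unmarshal(data, &cfg); err != nil {
          panic(err)
      }
      fmt.Println(cfg["host"], cfg["port"]) // localhost 8080
  }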


JSON numbers have such huge issues that I would question whether JSON really has types either.

I have seen real-life examples of some developers putting money as a decimal in JSON, then other developers using some default parsing into float, and finally truncating that float on conversion back to decimal money, losing 1 cent.

I have also seen examples of developers putting int64 random IDs in JSON interacting with Azure Cosmos, then discovering that, when fetching the data back, the numbers had been silently rounded to 53 bits (i.e., float64 precision)!

JSON does not document at all how its numbers should behave, making it useless for a lot of things.
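
To make the 53-bit point concrete, a rough sketch in Go (the ID value is just 2^53 + 1, picked for illustration):

  package main

  import (
      "encoding/json"
      "fmt"
  )

  func main() {
      // 2^53 + 1: a perfectly ordinary int64 ID, but not representable in float64.
      src := []byte(`{"id": 9007199254740993}`)

      // Default decoding of JSON numbers is float64, so the value silently rounds.
      var m map[string]interface{}
      if err := json.Unmarshal(src, &m); err != nil {
          panic(err)
      }
      fmt.Printf("%.0f\n", m["id"].(float64)) // 9007199254740992

      // Decoding into int64 (or using json.Number) keeps the value intact.
      var v struct {
          ID int64 `json:"id"`
      }
      if err := json.Unmarshal(src, &v); err != nil {
          panic(err)
      }
      fmt.Println(v.ID) // 9007199254740993
  }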

Honestly, I prefer XML, which is at least honest about not having typing.

Don't get me started on YAML and types..


Out of curiosity, what are your top reasons to pick XML over JSON (+jsonschema) or Msgpack/Protobuf as a data interchange format? I came of age as a professional software engineer around the time the industry started switching from XML to JSON, and as a consequence I'm in the JSON camp, but I am always curious to hear from folks with a different opinion.


Not OP:

I'm in the same boat, but I've found XML has some nice properties that I sometimes miss in JSON, provided XML is used well ("correctly"): the differentiation of metadata (attributes) and data (nodes), namespaces, standard query languages, XSLT, etc. (You can even use XSLT on the web.)

Think of all the custom, ad-hoc code that turns JSON into HTML vs having a declarative standardized way of doing so.

https://developer.mozilla.org/en-US/docs/Web/XSLT


When to use XML/What XML is good at: https://news.ycombinator.com/item?id=11446984


Great writeup. To add an example, I personally use JSON for most of my work, but have found myself using XML for certain AI use cases that require annotating an original text.

For example, if I wanted an AI to help me highlight to the user where in a body of text I mentioned AI, I might have it return something like:

<text>Great writeup. To add an example, I personally use JSON for most of my work, but have found myself using XML for certain <ai-mention>AI</ai-mention> use cases that require annotating an original text with segments.</text>


Msgpack is broadly fine (I shipped a parser for it a while ago; relatively few ambiguities in the spec). JSON is kind of OK if you don't do numbers, but schemas for it are a pain and the tooling gives me hacked-together vibes.

I like being able to read and edit the data files easily in a text editor (bias against the binary formats) and for there to be a decent chance tools written by other people will interact predictably with the format (so it can't be bespoke).

I'd say the main feature is that an XML document with a schema tells you a lot about the various shapes of the file that you might need to worry about. It's essentially an extensible type system for tree shaped data.

XML has an annoying collection of spurious limitations but that's what you get with lowest common denominator / popular-cos-old systems.


Have you tried CBOR/CDDL tooling, nonlogical? What is your opinion of it?


Uninteresting fact: I wrote the code to download TomTom map updates, in a Mozilla XUL app.

The XML required a good 4GB of RAM to load the model. So… I just read the stream to get to the token I needed and read until the end token.

Obviously, it was faster and required much less memory. The takeaway: if you don't need to parse the whole model, don't.
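
For reference, the same trick is easy with Go's stdlib today; a rough sketch (the element names are made up, not TomTom's actual schema):

  package main

  import (
      "encoding/xml"
      "fmt"
      "io"
      "strings"
  )

  // Hypothetical element we actually care about.
  type Update struct {
      URL string `xml:"url"`
  }

  func main() {
      src := `<model><lots-of-stuff/><update><url>https://example.com/map.bin</url></update></model>`

      dec := xml.NewDecoder(strings.NewReader(src))
      for {
          tok, err := dec.Token() // walk the stream token by token
          if err == io.EOF {
              break
          }
          if err != nil {
              panic(err)
          }
          if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "update" {
              var u Update
              // Decode just this subtree; the rest of the model is never materialised.
              if err := dec.DecodeElement(&u, &se); err != nil {
                  panic(err)
              }
              fmt.Println(u.URL)
              break
          }
      }
  }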

I assume that nowadays they're using a more sensible format.



Thank you for sharing this benchmark, and the library. I was expecting ideal Go performance to approach that of Java and C#/.NET for large files, which last time I checked (a while ago) was about half the throughput of C code using libxml2. Beating libxml2 by a significant margin is very impressive.


Had a similar implementation been ported to C or C#, it would have ended up performing faster - the Go compiler is relatively weak, and Go the language lacks certain crucial performance primitives that C, C++, and C# (and Rust) have.


Is there any improvement on the deficient namespace handling in the stdlib?


The "deficient namespace handling in the stdlib" is only relevant when parsing XML, then trying to re-emit it. Since this library does not support re-emitting XML, it is either "worse" or "n/a", depending on your mood.

However, looking at the output data structures, yes, it would have the same problem if the obvious modification to re-emit XML was made.

The deficiency is actually very common, to the point that I'm surprised when I encounter an XML parser, in any language, that handles the problem you are referring to correctly. I've had to hack it in to every XML parser I've ever used when I care about preserving namespace abbreviations.


Oh nice, I've recently been looking into streaming XML parsing in Go without a cgo dependency and found the available options pretty lacking.

Great to see this sort of thing!


Wish I'd had this a few years ago. I had to parse Confluence wiki backups which, for reasons only known to Atlassian and god, lacked any closing tags. I ended up writing something similar to this, but mine was a lot kludgier.


A little trick with xml.Decoder: unlike Unmarshal, the decoder ignores any garbage after the XML, which is nice if you want to pull data out of HTML without dealing with the DOM.
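
Something like this (the struct and input are made up):

  package main

  import (
      "encoding/xml"
      "fmt"
      "strings"
  )

  type Title struct {
      Text string `xml:",chardata"`
  }

  func main() {
      // Trailing junk after the element we want.
      src := `<title>hello</title> assorted non-XML garbage here`

      var t Title
      dec := xml.NewDecoder(strings.NewReader(src))
      // Decode reads a single value and stops, so the junk is never touched.
      if err := dec.Decode(&t); err != nil {
          panic(err)
      }
      fmt.Println(t.Text) // hello
  }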


Does this support DTD/custom entities stuff? I would hope the answer is no, but just checking


What's wrong with the standard library parser?


It loads the whole document to memory?


I upvoted you just because I made a Go library with the same name but a different purpose:

https://github.com/artpar/gosax/

It's a high-performance Go implementation of Symbolic Aggregate approXimation (SAX).


This feels ... 20 years too late?

But excellent. Thanks!


As Go only emerged in late 2009, it can't really be more than 15 years too late, can it?


I think he means SAX parsers and XML were all the rage 20 years ago. Today, not so much thankfully!


This.


libexpat was released in 1998 - the original high-performance streaming parser for XML, written in C.


Nice. I like the event-based/callback-based parsing tools for XML a lot. A little more cognitive work up front, but much more efficient. A little sad, if unsurprising, that XML is still a thing in 2024, but if you have to read it, use a streaming parser.


If you've ever tried to read data from an XLSX file, you'll find that streaming XML parsing is quite beneficial.

And the world runs on Excel files.
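
If anyone wants to try it, a rough sketch with just the standard library (the file name and sheet path are assumptions about a typical workbook):

  package main

  import (
      "archive/zip"
      "encoding/xml"
      "fmt"
      "io"
      "log"
  )

  func main() {
      // Assumption: "book.xlsx" and the sheet path are illustrative.
      r, err := zip.OpenReader("book.xlsx")
      if err != nil {
          log.Fatal(err)
      }
      defer r.Close()

      for _, f := range r.File {
          if f.Name != "xl/worksheets/sheet1.xml" {
              continue
          }
          rc, err := f.Open()
          if err != nil {
              log.Fatal(err)
          }
          dec := xml.NewDecoder(rc)
          rows := 0
          for {
              tok, err := dec.Token() // stream tokens; the sheet never sits in memory at once
              if err == io.EOF {
                  break
              }
              if err != nil {
                  log.Fatal(err)
              }
              if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "row" {
                  rows++
              }
          }
          rc.Close()
          fmt.Println("rows:", rows)
      }
  }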


I really hate SAX. Callback based parsing is really unergonomic, and means you always have to code an explicit state machine. You can't use your control flow as implicit state.

It's like choosing to use `.then()` instead of `await`. I seriously don't understand why it is so popular in the XML world when pull based parsing is much easier to use and surely just as efficient? Just brain damaged Java design patterns maybe?


Because of messages that are larger than I want to allocate. Explicit state machines force one to think through the problem, and they force the solution to be one pass over the input data. I almost never am forced to use Java, so I'm unsure about that reference.


Pull parsers can deal with arbitrarily large messages too. And they also do one pass over the input data.

Yeah if you're unfamiliar, SAX is like this (pseudocode):

  interface SAXCallbacks {
     void onBeginToken(string name);
     void onAttribute(string key, string val);
     void onText(string text);
     void onEndToken(string name);
  }

  void parse(Reader input, SAXCallbacks yourCallbackImplementations);
Whereas pull parsers are like this:

  enum Token {
    Begin(string name),
    Attribute(string key, string val),
    Text(string text),
    End(string name),
  }

  class PullParser {
    void open(Reader input);
    Token next();
  }
They are much easier to use because you can trivially write a recursive descent parser:

  void parseThing(parser) {
    let token = parser.next();
    if (token == Begin("foo")) {
       parseFoo(parser);
    } else if ...
Whereas with SAX you're going to end up with some monstrous hand-coded state machine like

   class ThingParser {
     enum State {
       ParsingThing,
       ParsingFoo,
       ParsingFooExpectingAttributes,
       ParsingFooExpectingEndTag,
       ...
So painful. Honestly, pull parsing is so obviously the right way to do tokenisation and parsing that I have yet to see another language ecosystem that even has separate names for the two styles. They all just use pull parsers. Nobody else does callback-based parsing like SAX because it's obviously ridiculous.
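
For what it's worth, Go's encoding/xml already exposes a pull interface; a minimal sketch:

  package main

  import (
      "encoding/xml"
      "fmt"
      "io"
      "strings"
  )

  func main() {
      dec := xml.NewDecoder(strings.NewReader(`<thing><foo a="1">hi</foo></thing>`))
      for {
          tok, err := dec.Token() // pull the next token; no callbacks anywhere
          if err == io.EOF {
              break
          }
          if err != nil {
              panic(err)
          }
          switch t := tok.(type) {
          case xml.StartElement:
              fmt.Println("start:", t.Name.Local, t.Attr)
          case xml.CharData:
              fmt.Println("text:", string(t))
          case xml.EndElement:
              fmt.Println("end:", t.Name.Local)
          }
      }
  }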



