I am very interested in parsers and optimizing them. Currently I have an HTML parser written in Go. I have used Go's html parser, but it is waaay slower than mine. Here's what I do:
- stream the content (either from a file or from network) and send chunks to the tokenizer
- tokenizer is dead simple with a state machine
- when an html tag is found, emit the token with positions (the problem is that someone needs to hold the whole incoming buffer; currently it copies the incoming bytes, but I plan to hold the bytes only until a token is found, and after emitting the token with its buffer (copying), free the inner buffer). A rough sketch of what I mean follows this list.
- parser does its thing and creates a DOM
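Roughly, the tokenizer loop I have in mind looks like this (a simplified, made-up sketch of my approach: only tags are tokenized, and all names are invented):

    package main

    import "fmt"

    // Token carries absolute byte offsets into the stream plus a copy of
    // the tag bytes (the copy is what I want to eliminate later).
    type Token struct {
        Start, End int    // offsets in the overall stream
        Bytes      []byte // the tag, including '<' and '>'
    }

    type Tokenizer struct {
        buf    []byte // unconsumed bytes since the last emitted token
        base   int    // absolute stream offset of buf[0]
        inTag  bool
        tagPos int // index in buf of the current '<'
        emit   func(Token)
    }

    // Feed is called with each chunk as it arrives from the file or network.
    func (t *Tokenizer) Feed(chunk []byte) {
        t.buf = append(t.buf, chunk...) // today: copies every incoming byte
        for i := 0; i < len(t.buf); i++ {
            switch {
            case !t.inTag && t.buf[i] == '<':
                t.inTag, t.tagPos = true, i
            case t.inTag && t.buf[i] == '>':
                tag := append([]byte(nil), t.buf[t.tagPos:i+1]...) // copy out
                t.emit(Token{Start: t.base + t.tagPos, End: t.base + i + 1, Bytes: tag})
                t.inTag = false
            }
        }
        if !t.inTag {
            t.base += len(t.buf) // everything consumed: free the inner buffer
            t.buf = t.buf[:0]
        } else {
            t.base += t.tagPos // keep only the unfinished tag
            t.buf = append(t.buf[:0], t.buf[t.tagPos:]...)
            t.tagPos = 0
        }
    }

    func main() {
        tok := &Tokenizer{emit: func(tk Token) { fmt.Printf("%d-%d %q\n", tk.Start, tk.End, tk.Bytes) }}
        tok.Feed([]byte("<p>hi</"))
        tok.Feed([]byte("p>")) // a tag split across chunks still gets correct offsets
    }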
As you can see, I am not well versed in optimizing down to the last bit, but I am very interested in diving deep into the optimizations.
Currently I am looking at zero-copy networking. There is supposedly a library for Go (fasthttp) that avoids copying bytes from the network interface, but I haven't tried it yet.
What kinds of algorithms would you recommend for tokenizing and parsing HTML/XML very efficiently?
There are so many 'it depends' answers here, based on your use cases and perhaps details specific to Go optimisation that I am unaware of.
The first step, tokenization, may well be very Go-specific around zero-copy networking and such, sorry. The general idea, though, is to use a nice little finite-state machine that eats characters and recognises when tags open and close, the name of the tag, the attributes and values, the comments, the body and so on. And if you can avoid allocating any kind of object on the heap at this stage, it will almost always be a big win.
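To make the allocation point concrete in Go terms (my own illustration, not from the article): a sub-slice of the input buffer is just a pointer and a length, while converting to a string is exactly the kind of per-token heap allocation you want to keep out of this stage.

    package tokenizer

    // Zero-copy view: a sub-slice allocates nothing.
    func tagName(buf []byte, start, end int) []byte {
        return buf[start:end]
    }

    // The allocating version: string(...) copies the bytes onto the heap.
    func tagNameString(buf []byte, start, end int) string {
        return string(buf[start:end])
    }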
But you need to then create a DOM, and you're going to need memory allocation for that.
But the arena approach in the article is good for this: use a big byte array as an arena, and write a cleaned-up, normalized copy of the parsed HTML into it, with length prefixes, distance-to-next-tag and so on baked in. Typically a DOM has nodes with parent, previous-sibling and next-sibling pointers. You'll probably still want these, but they can be offsets written into the byte buffer between fragments of the parsed HTML, rather than maintained as a separate structure.
Then, if the user of your DOM can modify it, you can have a node-navigation API that seamlessly stores rewritten DOM parts in a new byte arena, or appends them to the end of the original one.
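As a rough Go illustration of the arena idea (the per-node layout here is invented for the example, not prescriptive):

    package arena

    import "encoding/binary"

    // One big []byte holds every node; links are uint32 offsets into the
    // same buffer instead of pointers, so the whole DOM is one allocation.
    //
    // Invented per-node layout, little-endian uint32s:
    //   [parent][prevSibling][nextSibling][textLen][textLen bytes of html]
    const header = 16

    type Arena struct{ buf []byte }

    // New reserves offset 0 so it can serve as the nil "pointer".
    func New() *Arena { return &Arena{buf: make([]byte, 1)} }

    // Append writes one node and returns its offset, which acts as a pointer.
    func (a *Arena) Append(parent, prev uint32, text []byte) uint32 {
        off := uint32(len(a.buf))
        var h [header]byte
        binary.LittleEndian.PutUint32(h[0:], parent)
        binary.LittleEndian.PutUint32(h[4:], prev)
        binary.LittleEndian.PutUint32(h[8:], 0) // next sibling, patched below
        binary.LittleEndian.PutUint32(h[12:], uint32(len(text)))
        a.buf = append(a.buf, h[:]...)
        a.buf = append(a.buf, text...)
        if prev != 0 {
            // patch the previous sibling's next-pointer in place
            binary.LittleEndian.PutUint32(a.buf[prev+8:], off)
        }
        return off
    }

    func (a *Arena) NextSibling(off uint32) uint32 {
        return binary.LittleEndian.Uint32(a.buf[off+8:])
    }

    func (a *Arena) Text(off uint32) []byte {
        n := binary.LittleEndian.Uint32(a.buf[off+12:])
        return a.buf[off+header : off+header+n]
    }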
It may seem unexpected given all the hype around Go, but it is a surprisingly poor choice for this. If you want a language more convenient than C++ or Rust that retains the ability to reach optimal performance, C# could serve you much better.
A quick Google search finds https://github.com/ffenix113/fastxml which seems to be applying the 'tips and tricks' of arena parsing. Any idea how it compares for speed once you get past memory allocations and are just seeing how the compiler handles basic byte manipulation?
The description indicates it is not production-ready, and the repository is archived as well.
If you pull out all the stops in each respective language, C# will always end up winning at parsing text: it offers C-style structs, pointers, zero-cost interop, Rust-style struct generics, a cross-platform SIMD API, and simply has a better compiler. You can win back some performance in Go by writing the hot parts in Go's assembly dialect, at much greater effort and only for one specific platform at a time.
For example, Go has to resort to this https://github.com/golang/go/blob/4ed358b57efdad9ed710be7f4f... in order to scan memory efficiently, while in C# you write the following once and it compiles for all supported ISAs with their respective SIMD instructions at a given vector width: https://github.com/dotnet/runtime/blob/56e67a7aacb8a644cc6b8... (there is a lot of code because C# covers a much wider range of scenarios and does not accept sacrificing performance on odd lengths and edge cases, which Go does).
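To be clear, from ordinary Go you only reach that kind of assembly indirectly, through the standard library; a rough sketch of what the hot scanning loop tends to look like, leaning on bytes.IndexByte (plain Go cannot express the SIMD itself):

    package scan

    import "bytes"

    // scanTags finds each '<' via bytes.IndexByte, whose inner loop is
    // backed by hand-written assembly like the file linked above.
    func scanTags(buf []byte, visit func(pos int)) {
        base := 0
        for {
            i := bytes.IndexByte(buf[base:], '<')
            if i < 0 {
                return
            }
            visit(base + i)
            base += i + 1
        }
    }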
There is a lot more of this. Performance, and the low-level primitives to achieve it, have been an area of focus for .NET for a long time, so it is disheartening to see a tenth of that effort in Go receive so much more of the spotlight.