I am very interested in parsers and optimizing them. Currently I have an HTML parser written in Go. I have used Go's html parser, but it is waaay slower than mine. Here's what I do:
- stream the content (either from a file or from network) and send chunks to the tokenizer
- tokenizer is dead simple with a state machine
- when an html tag is found, emit the token with positions (the problem is that someone needs to hold the whole incoming buffer; currently it copies the incoming bytes, but I plan to hold the bytes only until a token is found, and after emitting the token with its buffer (copying), free the inner buffer). A rough sketch of what I mean follows this list.
- parser does its thing and creates a DOM
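Roughly, the tokenizer loop I have in mind looks like this (a simplified, made-up sketch of my approach: only tags are tokenized, and all names are invented):

    package main

    import "fmt"

    // Token carries absolute byte offsets into the stream plus a copy of
    // the tag bytes (the copy is what I want to eliminate later).
    type Token struct {
        Start, End int    // offsets in the overall stream
        Bytes      []byte // the tag, including '<' and '>'
    }

    type Tokenizer struct {
        buf    []byte // unconsumed bytes since the last emitted token
        base   int    // absolute stream offset of buf[0]
        inTag  bool
        tagPos int // index in buf of the current '<'
        emit   func(Token)
    }

    // Feed is called with each chunk as it arrives from the file or network.
    func (t *Tokenizer) Feed(chunk []byte) {
        t.buf = append(t.buf, chunk...) // today: copies every incoming byte
        for i := 0; i < len(t.buf); i++ {
            switch {
            case !t.inTag && t.buf[i] == '<':
                t.inTag, t.tagPos = true, i
            case t.inTag && t.buf[i] == '>':
                tag := append([]byte(nil), t.buf[t.tagPos:i+1]...) // copy out
                t.emit(Token{Start: t.base + t.tagPos, End: t.base + i + 1, Bytes: tag})
                t.inTag = false
            }
        }
        if !t.inTag {
            t.base += len(t.buf) // everything consumed: free the inner buffer
            t.buf = t.buf[:0]
        } else {
            t.base += t.tagPos // keep only the unfinished tag
            t.buf = append(t.buf[:0], t.buf[t.tagPos:]...)
            t.tagPos = 0
        }
    }

    func main() {
        tok := &Tokenizer{emit: func(tk Token) { fmt.Printf("%d-%d %q\n", tk.Start, tk.End, tk.Bytes) }}
        tok.Feed([]byte("<p>hi</"))
        tok.Feed([]byte("p>")) // a tag split across chunks still gets correct offsets
    }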
As you can see, I am not well versed in optimizing down to the last bit, but I am very interested in diving deep into the optimizations.
Currently I am looking at zero-copy networking. There is supposedly a library for Go (fasthttp) that avoids copying bytes from the network interface, but I haven't tried it yet.
What kinds of algorithms would you recommend for tokenizing and parsing HTML/XML very efficiently?
There are so many 'it depends' answers here, based on your use cases and perhaps details specific to Go optimisation that I am unaware of.
The first step, tokenization, may well be very Go-specific around zero-copy networking and such, sorry. The general idea, though, is to use a nice little finite-state machine that eats characters and recognises when tags open and close, the name of the tag, the attributes and values, the comments, the body and so on. And if you can avoid allocating any kind of object on the heap at this stage, it will almost always be a big win.
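To make the allocation point concrete in Go terms (my own illustration, not from the article): a sub-slice of the input buffer is just a pointer and a length, while converting to a string is exactly the kind of per-token heap allocation you want to keep out of this stage.

    package tokenizer

    // Zero-copy view: a sub-slice allocates nothing.
    func tagName(buf []byte, start, end int) []byte {
        return buf[start:end]
    }

    // The allocating version: string(...) copies the bytes onto the heap.
    func tagNameString(buf []byte, start, end int) string {
        return string(buf[start:end])
    }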
But you need to then create a DOM, and you're going to need memory allocation for that.
But the arena approach in the article is good for this: use a big byte array as an arena, and write a cleaned-up, normalized copy of the parsed HTML into it, with length prefixes, distance-to-next-tag and so on baked in. Typically a DOM has nodes with parent, previous-sibling and next-sibling pointers. You'll probably still want these, but they can be offsets written into the byte buffer between fragments of the parsed HTML, rather than maintained as a separate structure.
Then, if the user of your DOM can modify it, you can have a node-navigation API that seamlessly stores rewritten DOM parts in a new byte arena, or appends them to the end of the original one.
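As a rough Go illustration of the arena idea (the per-node layout here is invented for the example, not prescriptive):

    package arena

    import "encoding/binary"

    // One big []byte holds every node; links are uint32 offsets into the
    // same buffer instead of pointers, so the whole DOM is one allocation.
    //
    // Invented per-node layout, little-endian uint32s:
    //   [parent][prevSibling][nextSibling][textLen][textLen bytes of html]
    const header = 16

    type Arena struct{ buf []byte }

    // New reserves offset 0 so it can serve as the nil "pointer".
    func New() *Arena { return &Arena{buf: make([]byte, 1)} }

    // Append writes one node and returns its offset, which acts as a pointer.
    func (a *Arena) Append(parent, prev uint32, text []byte) uint32 {
        off := uint32(len(a.buf))
        var h [header]byte
        binary.LittleEndian.PutUint32(h[0:], parent)
        binary.LittleEndian.PutUint32(h[4:], prev)
        binary.LittleEndian.PutUint32(h[8:], 0) // next sibling, patched below
        binary.LittleEndian.PutUint32(h[12:], uint32(len(text)))
        a.buf = append(a.buf, h[:]...)
        a.buf = append(a.buf, text...)
        if prev != 0 {
            // patch the previous sibling's next-pointer in place
            binary.LittleEndian.PutUint32(a.buf[prev+8:], off)
        }
        return off
    }

    func (a *Arena) NextSibling(off uint32) uint32 {
        return binary.LittleEndian.Uint32(a.buf[off+8:])
    }

    func (a *Arena) Text(off uint32) []byte {
        n := binary.LittleEndian.Uint32(a.buf[off+12:])
        return a.buf[off+header : off+header+n]
    }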
It may seem unexpected given all the hype around Go, but it is a surprisingly poor choice for this. If you want a language more convenient than C++ or Rust that retains the ability to reach optimal performance, C# could serve you much better.
A quick Google search finds https://github.com/ffenix113/fastxml which seems to be applying the 'tips and tricks' of arena parsing. Any idea how it compares for speed once you get past memory allocations and are just seeing how the compiler handles basic byte manipulation?
The description indicates it is not production-ready, and the repository is archived as well.
If you pull out all the stops in each respective language, C# will always end up winning at parsing text: it offers C-style structs, pointers, zero-cost interop, Rust-style struct generics, a cross-platform SIMD API, and simply has a better compiler. You can win back some performance in Go by writing the hot parts in Go's assembly dialect, at much greater effort and only for one specific platform at a time.
For example, Go has to resort to this https://github.com/golang/go/blob/4ed358b57efdad9ed710be7f4f... in order to scan memory efficiently, while in C# you write the following once and it compiles for all supported ISAs with their respective SIMD instructions at a given vector width: https://github.com/dotnet/runtime/blob/56e67a7aacb8a644cc6b8... (there is a lot of code because C# covers a much wider range of scenarios and does not accept sacrificing performance on odd lengths and edge cases, which Go does).
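To be clear, from ordinary Go you only reach that kind of assembly indirectly, through the standard library; a rough sketch of what the hot scanning loop tends to look like, leaning on bytes.IndexByte (plain Go cannot express the SIMD itself):

    package scan

    import "bytes"

    // scanTags finds each '<' via bytes.IndexByte, whose inner loop is
    // backed by hand-written assembly like the file linked above.
    func scanTags(buf []byte, visit func(pos int)) {
        base := 0
        for {
            i := bytes.IndexByte(buf[base:], '<')
            if i < 0 {
                return
            }
            visit(base + i)
            base += i + 1
        }
    }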
There is a lot more of this. Performance, and the low-level primitives to achieve it, have been an area of focus for .NET for a long time, so it is disheartening to see a tenth of that effort in Go receive so much more of the spotlight.