Faster XML Stream Processing in Go (2019) (thegreenplace.net)
74 points by PaulHoule 26 days ago | 41 comments



I reran the results on my M1 MacBook Pro, since this post is six years old:

    python etree: 3.25s
    python lxml: 2.01s
    go stdlib: 3.35s
    go gosax: 1.45s
    C: 0.38s
So, python etree gives the same result 6 years on, while lxml has gotten faster. Go's standard library does about as well as Python's stdlib, and the cgo binding (gosax) has gotten faster.

Procedure:

    - clone https://github.com/eliben/xmlgen
    - run `xmlgen -f 2 > out.xml` to generate a test xml file
    - clone https://github.com/eliben/code-for-blog/ and switch to the 2019/xml-stream dir
    - mv out.xml from above to the current directory
    - `time python etree-count.py out.xml`
    - `pip install lxml && time python lxml-count.py out.xml`
    - `go build go-stdlib-count.go && time ./go-stdlib-count out.xml`
    - `cd c-libxmlsax-count && make && time ./c-libxmlsax-count ../out.xml`
    - `cd gosax-count && go build gosax-count-simple.go && time ./gosax-count-simple ../out.xml`


C# takes 0.55s on an M1 Pro with the relatively ancient and not especially optimized System.Xml.XmlReader:

    ./xmlgen -f 2 > dotnet/out.xml
    cd dotnet && dotnet publish -o .
    time ./XMLParse out.xml
https://gist.github.com/neon-sunset/6ba67f23e58afdb80f6be868...


Downvoting this won't improve Go's performance :)


I wrote this quick little program on my laptop:

    (ql:quickload "cxml")
    
    (defclass finder (sax:default-handler)
      ((in-location :initform nil)
       (count :initform 0)))
    
    (defmethod sax:start-element ((finder finder) namespace-uri local-name qname attributes)
      (setf (slot-value finder 'in-location) (string= local-name "location")))
    
    (defmethod sax:characters ((finder finder) data)
      (when (and (slot-value finder 'in-location) (search "Africa" data))
        (incf (slot-value finder 'count))))
    
    (let ((finder (make-instance 'finder)))
      (cxml:parse #P"out.xml" finder)
      (slot-value finder 'count))
and got:

    Evaluation took:
      5.316 seconds of real time
      5.339639 seconds of total run time (5.330888 user, 0.008751 system)
      [ Real times consist of 0.032 seconds GC time, and 5.284 seconds non-GC time. ]
      [ Run times consist of 0.049 seconds GC time, and 5.291 seconds non-GC time. ]
      100.45% CPU
      8,547,510,059 processor cycles
      1,082,877,232 bytes consed
For comparison:

    python etree: 3.04s
    python lxml: 3.65s
    go stdlib: 5.18s
    go gosax: 2.37s
    C: 0.54s
Really disappointing that the Common Lisp version performs worse and is less readable than the Python lxml version.

This one is more readable, but even slower:

    (klacks:with-open-source (s (cxml:make-source #P"out.xml")) 
      (loop for key = (klacks:peek s)
            with count = 0
            while key
            when (and (eq key :start-element)
                      (string= (klacks:current-lname s) "location"))
              do (progn 
                   (setq key (klacks:peek-next s))
                   (when (and (eq key :characters) (search "Africa" (klacks:current-characters s)))
                     (incf count)))
            do (klacks:consume s)
            finally (return count)))
It gets:

    Evaluation took:
      9.996 seconds of real time
      10.054716 seconds of total run time (9.979712 user, 0.075004 system)
      [ Real times consist of 0.092 seconds GC time, and 9.904 seconds non-GC time. ]
      [ Run times consist of 0.101 seconds GC time, and 9.954 seconds non-GC time. ]
      100.59% CPU
      16,077,773,140 processor cycles
      2,338,978,688 bytes consed


Java 21, naive implementation, javax.xml.stream.XMLStreamReader, 2015 MBP:

1.2s warm

1.6s cold


~700ms cold on a 7800X3D - maybe it's time to upgrade my laptop.


I love this type of optimisation story.

Interesting to note that his "in theory" best performance still calls into lxml. A true in-theory best would surely be in the multiple-gigabytes-per-second range (if simdjson's performance is taken as the goal), not 0.56 seconds to process 230 MiB, which is only about half a GB/s. Half a gigabyte per second is probably more than sufficient for most uses, so I'm splitting hairs somewhat.

Here is the issue for slow sax parsing in encoding/xml. No work on it since around 2020:

https://github.com/golang/go/issues/21823


> A true in theory best performance would surely be in the multiple Gigabyte per second range (if simdjson's performance is taken as a goal)

I don't think it makes any sense for the parser for a completely different and much simpler format to be "taken as a goal".


Full XML is hard if you’re going to go nuts with entity declarations and such (I’ve heard SVG output from Adobe tools is fond of those for some reason?), but the part that people commonly use is not really that much more complicated than JSON; perhaps even simpler in some places, given the absence of backslash escapes (that the simdjson paper spends quite a lot of time on). So for the fast path I think it’d be reasonable to expect JSON-like performance.


Perhaps "goal" is the wrong word, I just mean the "in theory best performance". It doesn't matter that json is a different format, they both have the same worst case complexity for the particular example problem.


Great post.

An issue I have with it is the way it frames the speed of a language, e.g., Python, C, etc., especially here:

> I implemented the same program using pure C with libxml, which has a SAX API. [...] It takes just 0.56 seconds to process our 230 MiB input file, which is very impressive given the other results, but also not very surprising. This is C, after all.

What the author is actually comparing here, of course, is the quality of the GCC (or LLVM or whatever) code generator, not "C" versus e.g. Go (the programming language and/or its standard libraries).

There are enough misapprehensions already about where good or bad performance comes from: runtimes, not languages. Quips like this (that it's "C, after all") don't help that discussion.


Some language paradigms are always going to produce less optimised binaries.

Such as garbage collected languages vs non-garbage collected languages. And Python enables meta programming, which itself massively reduces the performance of the language (from a purely computational perspective).

While C compilers do generally have much better support for optimisations, the language itself was initially designed to sit pretty close to assembly. So even unoptimised C will perform better than the equivalent code written in Go or Python using their respective idioms.


Garbage collection is completely orthogonal to the quality of emitted codegen.

What GC does have impact on is the way VM handles memory management:

- Cost of allocation helper calls / are they (partially) inlined

- Cost of write barriers

- Cost of GC pauses, their frequency or the cost paid in other places if it's shifted from GC pauses (which Go does)

- Cost of thread suspension/yielding strategy - relates to the previous point but is distinct should a GC have full stop-the-world phase

Generally speaking, Go's compiler is weak but where Go manages to get performance back are two areas:

- Go forces you to write a lot of, effectively, manually inlined code, by virtue of making abstractions and generics painful to use

- By constraining the range of tools you can use, it focuses on providing lightweight implementation for those (e.g. channels)

This allows Go to excel at the particular niche it targets (networked back-end services), but results in severe underperformance relative to expectations the moment you step away from that use case.

Both JVM and .NET are good counterexamples to Go that being garbage collected does not mean sacrificing performance. Unfortunately, historical choices in JVM prevent it from reaching zero-cost abstraction + low/zero-allocation style of performance oriented code in C, C++ and Rust so C# remains pretty much the only GC-based language that lets you do so.

Swift receives honorable mention but it appears to consistently have much worse performance due to ARC and virtual dispatch so it can't really compete unless you write totally-not-C code in it.


You’re arguing against my point while going on to make the same point.

Please note that I said “optimised” and not “quality”. The GP said quality but I deliberately didn’t. I want to be clear that myself and the author were strictly talking about code that is optimised for execution speed.

There are a few other issues with your comment too. Such as conflating inlining with code deduplication wrt generics.


The C runtime doesn't really show up on a profiler though. It's crawl the init_array then jump to main. Small enough that a lot of people don't notice the C runtime exists at all. Compiling C to efficient machine code is a much easier problem than compiling Go to similar machine code since all the heavy lifting is done by the programmer.


C doesn't really have a "run time". The startup code is optional and I wouldn't really call it a "run time" in the sense that Go has a runtime for managing goroutines and GC.

It's not that easy to write Go code with no dynamic memory allocations and no GC pressure.

I think it's fair to say "C is faster than Go". It's not purely a function of the language but a combination of factors including the quality of the compiler/optimizer, available libraries, and language features (and their performance cost).


It certainly does have a runtime and it isn't optional. It's usually in object files starting with crt, e.g. crt1.o. "c-run-time-1.o". That does whatever is necessary to make the machine look like the C world expects.

Setting up the stack pointer before jumping to main is important. Arranging for argc/argv to work. Walking the init array or section for global constructors.

There's also libc which is roughly a pile of more-or-less useful functions, some of them implemented in terms of syscalls. That one is nominally optional.

There's probably a compiler runtime involved as well, libgcc or libclang_rt.builtins or similar, which is likely to be cyclically dependent on libc. Compiler-dependent whether that is optional, and thus likely whether libc is optional in practice.


You absolutely can write void _start(void) and as long as you finish using exit() everything will work just fine. I've written some super small programs this way (on the order of few kilobytes) which don't interface with system libraries in any way.


That'll work on some systems and not others. Whatever compiled and linked your C arranged for _start to be at the place the instruction pointer starts from. Setting up your own stack from C is dicey, but sometimes the default value for the stack register is usable as-is.

I tend to go with ffreestanding and the crt*.o for the platform instead of going from _start directly but sure, you can write whatever startup code your platform wants yourself and link with -nostdlib or similar. Some platforms are OK with just the address of the entry point burned into an elf. Compiler test code sometimes looks like that.


C code conforming to the ISO C standard generally requires a (very small) runtime.


I would not call this a runtime myself. Go's runtime is "stuff Go does while your program is running", which is very different from something like crt0, which does some one-time initial setup. Either way, this is optional in C. Some setup is done by the OS as well (it's not like the "run time" decides where the stack is going to be).


Any other language can implement the same conventions, and those conventions aren't specified in the C standard but really platform specific.


Yes it typically does have a runtime. For example, it has persistent structures in memory related to malloc/free, and the manipulation of those most definitely show up when you profile.


It's a bit funny to call this "runtime" when malloc/free are nothing but library functions.


Yes, it is a runtime provided by libc, but how on earth is it not a runtime?


Maybe the poster considers a runtime something more similar to a global event loop than a set of global functions.


But libc is more than a set of global functions - it has a bunch of global state associated with them as well. And manipulation of that is what I consider a runtime.


You can live entirely without malloc(). You don't need it. You can even allocate memory dynamically without malloc, just call a different API. On Windows, it's typical to use e.g. VirtualAlloc(). So thinking of malloc as runtime seems extremely strange to me.

libc is not the only library that has "global state". There is nothing special about it. You don't need libc (yes, some compilers / platforms somewhat expect you to bundle libc). And malloc is not libc, although some of the functionality in libc uses malloc internally / gives you back memory that you are expected to free().


You can get different standard libraries and runtimes for other languages too. That doesn't mean that their de facto compilation and execution method isn't via a runtime.


Call it what you want then; I will continue to refer to malloc/free as library functions since that's what they are to me -- functions that I can call, and if I don't call them, they don't come haunting me.


The Go runtime has a startup penalty because it sets up memory for the GC. It is legitimately C runtime vs. Go runtime.


So the measurement is total application execution, not just the XML processing part itself? That's a bit of an unfair comparison, but then, it's a "micro"benchmark that should be taken with a grain of salt.


So you agree with me.


(2019)


The task is to find how many times "Africa" appears in the data of the <location> tag throughout the document.

I wonder how fast grep/strstr compares, since they can run at "faster than memcpy" speeds, which means dozens of GB/s on any relatively recent machine. I know it's not the pedantically correct way to do it, but in practice it works well. Only mode escapes like CDATA break the guarantee that an XML tag cannot otherwise occur as a substring of character data, so a plain substring match can work. This is especially true for JSON, which doesn't have such escapes.
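A sketch of that substring approach in Go, knowingly non-conforming: it ignores CDATA, comments, entities, and any attributes on the tag, exactly the corner cases the replies below object to:

```go
package main

import (
	"fmt"
	"strings"
)

// countNaive counts <location> elements whose raw text contains
// "Africa" by pure substring search -- no XML parsing at all. It
// misses tags written with attributes or whitespace, and miscounts
// around CDATA sections and comments.
func countNaive(doc string) int {
	count := 0
	for {
		i := strings.Index(doc, "<location>")
		if i < 0 {
			return count
		}
		doc = doc[i+len("<location>"):]
		j := strings.Index(doc, "</location>")
		if j < 0 {
			return count
		}
		if strings.Contains(doc[:j], "Africa") {
			count++
		}
		doc = doc[j+len("</location>"):]
	}
}

func main() {
	fmt.Println(countNaive("<r><location>North Africa</location><location>Asia</location></r>")) // prints 1
}
```

On well-behaved generator output like xmlgen's this gives the same answer as a real parser, which is precisely why the shortcut is tempting.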

Of course not using XML may gain more than an order of magnitude in speed, so if you have the chance to control both ends and get rid of XML in favour of a simpler binary protocol, I strongly recommend advocating for it.


Please never do this. I beg you.

This is why we can't have nice things, because someone somewhere always wants to cut a corner in a way that isn't as robust as they think it is.

I once had an argument with a junior developer that he can't just use a hash code directly instead of an equality check. It can be used as a first step to quickly reject unequal values, but equality must be verified. He disagreed right up until a finance report had "unexpected" dollar figures in it.
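The pitfall in that anecdote, sketched in Go with a hypothetical FNV-based check (the names are mine, not from the story): equal hashes only prove the values might be equal.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// equalByHashOnly is the shortcut from the anecdote: any hash
// collision makes two unequal strings compare as equal.
func equalByHashOnly(a, b string) bool { return hash32(a) == hash32(b) }

// equalChecked uses the hash only as a cheap first-step reject and
// then verifies with a real comparison, as the comment describes.
func equalChecked(a, b string) bool { return hash32(a) == hash32(b) && a == b }

func main() {
	fmt.Println(equalChecked("ACC-1043", "ACC-1043")) // true
	fmt.Println(equalChecked("ACC-1043", "ACC-1044")) // false
}
```

equalByHashOnly returns the same answers on these inputs, which is the trap: it works in every test until two account strings happen to collide in production.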

There is nothing I value more in the code I write than 100.0% pedantic correctness.

The opposite of this is inserting Thread.sleep(1000) into multi-threaded code to "fix" it.

It's a tiny, slippery step from skipping a correct parsing step to that eldritch madness.


Correctness and robustness are not always the requirement. Sometimes it's simplicity. Sometimes it's time-to-market. Sometimes you get fed a broken XML and regex actually does a better job.

I've also been a junior developer and thought that my code has to be correct, look correct and perform correctly. Then I got a boss. Boy, was I wrong.


> Sometimes you get fed a broken XML

How do you think that XML was generated?

By people who took similar shortcuts, that's how.

Layering badness on top of badness results in a tyre fire.


Performance is another one; there was one benchmark post I read a while ago that compared a stdlib implementation with a custom implementation, and the gist of it was that the stdlib had a few extra safety checks to prevent crashes and the like.

It's setting your sliders. If you have robust and predictable input, you can forego some of the safety checks in favor of performance.


> ... forego some of the safety checks in favor of performance.

Are you a manager at Boeing, by any chance?


> There is nothing I value more in the code I write than 100.0% pedantic correctness

When dealing with messy input data, this goal can be akin to perfect being the enemy of the good.



