

Show HN: A 6KB, fast and mostly-correct XML parser in C - yorhel
http://dev.yorhel.nl/yxml

======
mtdewcmu
This looks interesting. Is it brand new? I've seen parsers with similar names,
but, of course, almost all short letter combinations containing xml have been
used, so that's expected.

It doesn't allocate or buffer, so when it says that it verifies proper
nesting, I assume that doesn't mean verifying that start tags and end tags
have matching names. I think that would be impossible.

I've been looking for a decent C parser similar to this to try some
experiments with. Dealing with XML can be dreary, but I like to try to think
of how you might make an API that doesn't make your code suck. It would be at
a higher level, so I might be able to use this.

It looks like you pass the parse function a character at a time. To be fast,
you'd want to somehow encourage that to get inlined. I'm not sure how easy or
possible it is when you put the function in a separate .c file. There would
probably be ways to stuff everything into the header.

I'm interested particularly in parsers that don't buffer. It seems like a
fundamentally more flexible and potentially better way to parse XML is by
memory mapping and letting the VM do what it does best.

~~~
yorhel
It's new, yes. I wrote it about two months ago because I needed a parser and
found the existing solutions too bloated.

> It doesn't allocate or buffer, so when it says that it verifies proper
> nesting, I assume that doesn't mean verifying that start tags and end tags
> have matching names. I think that would be impossible.

It does match start tags and end tags, and in order to do so it does, indeed,
need a small buffer. However, the only thing that needs to be stored in this
buffer is a stack of element names, and that rarely needs much space - a
simple fixed-size buffer is sufficient. It can be safely allocated on the
stack if the application wants to. See the documentation for the yxml_init()
for more details. The "mostly-bufferless" part of the homepage refers to the
lack of a complicated input buffer and the avoidance of complex buffer
management for output where possible.

> It looks like you pass the parse function a character at a time. To be fast,
> you'd want to somehow encourage that to get inlined.

The function is quite large. You can probably get it inlined when put inside a
header file and when it's only called in a single place, and you may indeed
see a performance boost. I haven't measured how significant such an
improvement may be.

However, if performance is the primary goal (it wasn't for yxml), a completely
different design would probably be better. Yxml needs to manage an explicit
state object for each input character. For each new character it has to look
at the state to know how it's handled, handle it, and then update the state
for the next character. I was honestly surprised to see that yxml performed
quite well in my benchmarks.

A more efficient design is probably the goto-based state machine approach
that, e.g. ragel[1] can output. The reason I didn't go for that approach is
that it complicates buffering. You either need to have the full file mapped
into memory (not possible when reading from a socket or from a compressed
stream), or you need to implement your own buffering and provide hooks so that
the application can fill the buffer when it is depleted. The latter approach
is used by most other XML parsers.

[1] [http://www.complang.org/ragel/](http://www.complang.org/ragel/)

~~~
mtdewcmu
I was interested in yours specifically because it's the only one I've seen
that doesn't buffer (almost). You might as well drop the last little buffer
and be completely bufferless. Then it will never cause an error or block, and
you'll have no dependencies. If you want to support validation, make a higher-
level interface to this and do it there. It will be just as easy.

~~~
yorhel
Hmm? The stack buffer in yxml will never cause parsing to block, and there's
no added dependencies on... anything? I don't think that buffer causes any
problems even on a size-restricted microcontroller that doesn't have malloc().
As long as you can find ~512 bytes or so of free memory you can parse a lot of
files.

The only situation in which that buffer would cause an error is when the
application used a too small buffer, or when the document is far too deeply
nested or has extremely long element/property names. Both the maximum nesting
level name lengths should, IMO, be limited in the parser in order to protect
against malicious documents. Most parsers have separate settings for that,
yxml simplifies that by letting the application control the size of a buffer.

The stack buffer in yxml is also used to make the API a bit easier to use.
With the buffer I can pass element/property names as a single zero-terminated
C string to the application, without it I would have to use the same mechanism
as used for attribute values and element contents, and that mechanism isn't
all that easy to use. (This is the one case where I chose convenience over
simplicity, but I kinda wanted the validation anyway so that wasn't really a
problem)

~~~
mtdewcmu
Ah. I have not yet looked too far into how yxml works. I never came up with a
perfectly satisfactory solution to the zero-buffer problem myself, but you've
hit on a lot of the things that make it a hassle either way. It's almost
impossible to do a usable xml parser under the assumption that it will not
buffer and it will not be guaranteed access to more than one character at a
time. I started developing one like that based on a goto-driven state machine,
but I stopped working on it, because the interface was going to be too
inconvenient.

What I meant about blocking was blocking on malloc. It sounds like you're
expecting the caller to take care of allocation?

