Tested on over 2.5 billion pages from Google's index.
The crazy edge cases were...challenging. The source code to the parser is very assert-heavy, so if there's anything that's amiss, it tends to blow up with an assertion failure. I'd run the MapReduce and it would blow up a few hundred times, then MapReduce would stop trying and kill the job. Then when I had a spare moment, I'd look at the assertion failures, pick off the most common ones, and run it again. This time it would get farther, I'd pick off another couple of bugs, and run it again.
As expected, the triggering frequency of bugs follows a power-law distribution. It took a long time before I could get it to parse one HTML document, and then it would fail on 1% of documents, then 0.1% of documents, then 0.01%, and so on. It got stuck at a roughly 1-in-a-million failure rate for a long time, until I figured out that it was crashing because of a stack overflow in the testing code, which would recursively sanity-check the produced DOM. Some documents generate a DOM >20,000 nodes deep, which is evidently too deep to fit in typical C stacks, although Gumbo can handle them. (I found one page with a DOM tree 100,000 nodes deep - it was really an XML document masquerading as HTML, with a bunch of self-closing nodes that don't self-close under HTML5 parsing rules - and when I posted the link to say "Look what I found!", I got a bunch of "Kind of a dick move, linking to a page that crashes WebKit.")
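The recursive sanity check in question looked roughly like this (a minimal sketch, not the actual test code): each level of document depth costs one C stack frame, so a 20,000-deep DOM blows a typical stack even though the parser itself is fine.

    #include <assert.h>
    #include <stddef.h>

    typedef struct Node {
      struct Node* parent;
      struct Node** children;
      size_t num_children;
    } Node;

    static void sanity_check(const Node* node) {
      for (size_t i = 0; i < node->num_children; ++i) {
        assert(node->children[i]->parent == node);
        sanity_check(node->children[i]);  /* one stack frame per DOM level */
      }
    }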
The 5-second overview is really that it's the same as getting good at any new skill. You find an area that you don't know how to do, and then keep working at it until you do know how to do it. Then repeat with finer-grained details. There were a bunch of skills involved in this project - C, HTML5, UTF-8 decoding, debugging, testing, autotools, ctypes, API design, documentation - that I wasn't all that good at when I started and had to pick up along the way.
1. Hubbub provides a SAX-style (callback-based) API, while Gumbo gives you a DOM-style (tree-based) struct directly. Hubbub is likely faster in this regard; Gumbo is easier to use out of the box (see the sketch after this list).
2. Gumbo is better tested. It's unclear whether Hubbub's 90% test coverage is "90% of the code is tested" or "90% of the tests pass", but Gumbo has 100% code coverage, 100% of html5lib tests pass (as of 0.95; the html5lib maintainer has pointed out that additional tests were added to trunk recently that don't pass), and it's run without crashing on ~4.5B documents from Google's index.
3. Gumbo has better support for source locations and for mapping between the original text and the parse tree.
4. Hubbub has character encoding detection, Gumbo doesn't.
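To make the API-style difference in point 1 concrete, here's a minimal sketch of the tree-based usage, modeled on Gumbo's public header (treat the exact fields as illustrative):

    #include <stdio.h>
    #include "gumbo.h"

    int main(void) {
      /* The whole document comes back at once as a tree of plain C
       * structs; no callbacks to register, no state threaded through
       * a SAX handler. */
      GumboOutput* output = gumbo_parse("<h1>Hello, world!</h1>");
      GumboNode* root = output->root;
      printf("root tag enum: %d, children: %u\n",
             root->v.element.tag, root->v.element.children.length);
      gumbo_destroy_output(&kGumboDefaultOptions, output);
      return 0;
    }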
It's an XML parser that is neither DOM nor SAX. I haven't seen much mention of it before, except as a recommendation for Java devs. There's a C version too. It makes bold claims about performance:
"Comparing with DOM, VTD-XML is significantly faster (up to 10x), more memory-efficient (up to 5x).
Comparing with SAX/PULL, VTD-XML is not only faster, but also is capable of random-access, therefore is easier to use."
Basically, by building a DOM-style model in SAX fashion.
It wouldn't have worked for Gumbo's purposes because
1. Gumbo captures a lot more information than can fit in a 64-bit token. For example, Gumbo decodes entity references; this requires that text be available in a fresh buffer, because each individual character might be something different from the source text (see the sketch after this list).
2. One of Gumbo's goals was to make it easy to write bindings in other languages. Most languages can bind to C structs easily, but binding to C function calls often requires a much more verbose preamble to set up args, return types, conversions, etc. (I was actually thinking of LLVM when I designed Gumbo's API, since the project it was initially written for was looking at LLVM as an embedded JIT at the time. Binding to a struct that's C-formatted just requires defining a new type, but binding to a function call requires codegenning a lot of argument setup.)
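To illustrate point 1, here's a minimal sketch (the names are mine, loosely modeled on Gumbo's text representation) of why decoded text needs its own buffer:

    #include <stddef.h>

    /* Once entities are decoded, the text no longer matches the raw
     * source bytes, so a packed (type, source-offset) token can't
     * represent it; you need both the fresh buffer and the original
     * span. */
    typedef struct {
      const char* data;  /* pointer into the original input buffer */
      size_t length;
    } StringPiece;

    typedef struct {
      const char* text;          /* freshly allocated, e.g. "a < b"  */
      StringPiece original_text; /* raw source span, e.g. "a &lt; b" */
    } TextNode;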
I wasn't suggesting that you should have used the approach; I was wondering if you had used it. I've learned a little bit about the limits of this tokenising parser method; thanks for your reply.
EDIT: Badgar thanks for your comment, I'll search out your lecturer's work if I ever have to parse something. Shame your account seems banned or something. I looked through your history, and it was over saying you had trouble quitting weed or something stupid.
FWIW my house mate kicked his weed addiction by cutting out triggers: people, places and things that would encourage him to light one up. He had all the problems with it you list. He had to stop drinking for a while to have sufficient willpower. He resumed drinking after successfully kicking weed.
The token stream is an array of 32-bit integers, each of which packs the token type into the high bits of an index into the input file where the token starts. If you need the token text, you reparse. Caches can be implemented as small hash maps from token index to cached value.
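A sketch of that packing; the exact bit split below is my assumption, since the comment doesn't specify one:

    #include <stdint.h>

    /* Assumed split: 6 bits of token type, 26 bits of byte offset
     * into the input file (enough for 64MB of source). */
    #define OFFSET_BITS 26
    #define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

    static inline uint32_t make_token(uint32_t type, uint32_t offset) {
      return (type << OFFSET_BITS) | (offset & OFFSET_MASK);
    }

    static inline uint32_t token_type(uint32_t tok) {
      return tok >> OFFSET_BITS;
    }

    /* To recover the token's text, re-lex the source at this offset. */
    static inline uint32_t token_offset(uint32_t tok) {
      return tok & OFFSET_MASK;
    }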
The canonical AST is the CFG parse tree, with fixups to convert recursion into a node type with a variable number of children. It is stored as an array of integers. Each node is a sequence of integers: one for the node itself, followed by one for each child in sequence. An internal node's value is the index of the CFG rule evaluated to produce it, and its children correspond to the CFG rule's right-hand side (minus keywords). Terminals are stored as integer indexes into the token stream; nonterminals are the integer index of the child internal node.
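A sketch of the flat layout, using a made-up grammar (the rule numbers and the example input are mine):

    #include <stdint.h>

    /* Flat AST for "1 + 2" under a made-up rule #7, expr := term '+' term
     * (the '+' keyword is omitted from the children).  Which child slots
     * are terminals vs. nonterminals is determined by the rule index. */
    static const int32_t ast[] = {
      7,   /* [0] root: rule #7, expr := term '+' term    */
      3,   /* [1] child 1: nonterminal, node at index 3   */
      5,   /* [2] child 2: nonterminal, node at index 5   */
      12,  /* [3] rule #12, term := NUMBER                */
      0,   /* [4] terminal: index 0 into the token stream */
      12,  /* [5] rule #12, term := NUMBER                */
      2,   /* [6] terminal: index 2 into the token stream */
    };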
Bill has been a big name in compilers for the better part of five decades now, and he said he's been using this pattern for almost as long. It's ridiculously fast, which is why he used it decades ago at DEC.
I just forked Gumbo and gave it a shot, but my limited experience with V8 didn't get me very far. I was able to create and build a basic plugin, but returning the scope with a Gumbo object was beyond my limited capabilities.
I think as a little side project I may continue to work on this.
I'm giving it another crack now.
You can see my first attempt at https://github.com/jbrooksuk/gumbo-parser/commit/d64f78b125f... - unfortunately it's not quite there. I can't figure out how to send back the struct created by Gumbo.
Under the PyPy JIT, html5lib is about 8x faster than it is on CPython.
from gumbo import html5lib
I'm not sure offhand what the speed would be - I'd imagine the Gumbo backend would be significantly faster than html5lib by virtue of being written in C, but speed was not a design goal, and so I suspect it's currently significantly slower than lxml. What Gumbo gives you over lxml is HTML5 compatibility - lxml does an HTML4-approximate parse.
- Byte strings (as opposed to Unicode ones) have their encoding sniffed and are decoded accordingly in html5lib, whereas they're all handled as UTF-8 in Gumbo.
- There's a namespaceHTMLElements option in html5lib which, when disabled, avoids putting HTML elements in the HTML namespace; this is useful for some legacy HTML processing tools.
- html5lib can read directly from a file object, which might in extreme cases be a useful memory saving (though the parse tree will likely use 100x the amount of memory anyway), but perhaps is more useful when dealing with network streams (it doesn't block waiting for all the data before starting to parse).
- html5lib supports fragment parsing, as is used by innerHTML.
Otherwise, given that it takes a normal html5lib tree builder, it should support almost everything else (the tree walkers, albeit with indirection through Gumbo's own representation of the tree, and related stuff like the serialiser).
Compared with libxml2, it provides what is likely a better-tested parsing algorithm (ultimately, libxml2's HTML handling is just the libxml2 parser with a few bits of non-fatal error handling and a few bits of variant behaviour). I know the experience of Hubbub's author was that it had a fair few bad bugs, like infinite loops and the like, as well as behaviour radically different from any browser and from what most web authors expect to get.
Speed-wise, from a quick trial it appears to be a few times quicker than html5lib under PyPy and an order of magnitude quicker under CPython. This will likely differ with the input given.
This can therefore be used with Gumbo by passing the lxml tree builder into its html5lib.parse-like method.
There were a bunch of reasons for the choice of C over C++:
1. At the time, we were doing a bunch of stuff with LLVM in the templating language. I'd previously been responsible for trying to integrate LLVM with C++ generated code, and it is painful, mostly because of name mangling and vtable dispatch. Providing a C API sidesteps this entirely, as LLVM can call into C code and use C structs no problem, and once the API is in C there's little reason to make the internals be in C++.
2. We wanted to provide tooling for this templating language, and the easiest way to write tooling is in Python or some other scripting language. It's easier to provide Python etc. bindings with a C API than a C++ API.
3. We'd intended from the start to open-source this. One of the team members had significant open-source experience, and he pointed out that within the open-source community, there are a number of people who basically refuse to use C++ and will instantly disqualify a C++ library. So regardless of whether these people are right, to reach the maximum number of people and prospective projects it should be in C.
Do you often read the output of some library macros that the C preprocessor generates?
I do when they have bugs in them. :(
The parts of C++ that I most missed on this project were the standard library string and vector. Many times I was just accumulating or munging values that would eventually end up in the C-API parse tree, so if I'd written that code in C++, I'd just have needed to translate it to a C implementation afterwards. I could potentially have used classes & objects for some of the states, but the array of function pointers that it currently uses is basically just as easy, and simpler (sketched below).
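For the curious, the array-of-function-pointers pattern looks roughly like this (a sketch with made-up names, not Gumbo's actual internals):

    #include <stdbool.h>

    typedef struct Tokenizer Tokenizer;
    typedef bool (*StateHandler)(Tokenizer* t, int c);

    enum { STATE_DATA, STATE_TAG_OPEN, STATE_TAG_NAME, NUM_STATES };

    struct Tokenizer { int state; };

    static bool handle_data(Tokenizer* t, int c)     { (void)t; (void)c; return true; }
    static bool handle_tag_open(Tokenizer* t, int c) { (void)t; (void)c; return true; }
    static bool handle_tag_name(Tokenizer* t, int c) { (void)t; (void)c; return true; }

    /* One handler per tokenizer state, indexed by the state enum;
     * dispatch is a single array lookup and an indirect call. */
    static const StateHandler kHandlers[NUM_STATES] = {
      [STATE_DATA]     = handle_data,
      [STATE_TAG_OPEN] = handle_tag_open,
      [STATE_TAG_NAME] = handle_tag_name,
    };

    static bool step(Tokenizer* t, int c) {
      return kHandlers[t->state](t, c);
    }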
Why do we need yet another HTML5 Parser?
What's wrong with WebKit?
What's wrong with the new Gecko2 HTML5 parser?
And what license is it?
Compared to Gecko and WebKit, this gives you just the parser, which is significantly simpler than the whole engine and is all you want for many applications.