Hacker News new | comments | show | ask | jobs | submit login
Google Open-Sources Gumbo: C Library for Parsing HTML5 (github.com)
262 points by bryanh on Aug 14, 2013 | hide | past | web | favorite | 59 comments

A good thing about developing tools at Google is the access to test data with their index.

The line:

  Tested on over 2.5 billion pages from Google's index.
That's quite awesome, and would cover quite a few edge cases.

Indeed, although anyone can also get 3.8 billion pages from CommonCrawl.


And anyone can pay for the EC2 instances to test all 3.8B pages!

Although not anyone can crawl 2+ billion pages AND fix all the crazy edge cases without going insane :)

Who says I didn't go insane? ;-)

The crazy edge cases were...challenging. The source code to the parser is very assert-heavy, so if there's anything that's amiss, it tends to blow up with an assertion failure. I'd run the MapReduce and it would blow up a few hundred times, then MapReduce would stop trying and kill the job. Then when I had a spare moment, I'd look at the assertion failures, pick off the most common ones, and run it again. This time it would get farther, I'd pick off another couple of bugs, and run it again.

As expected, the triggering frequency of bugs follows a power-law distribution. It took a long time before I could get it to parse one HTML document, and then it would fail on 1% of documents, then 0.1% of documents, then 0.01%, and so on. It got stuck at a roughly 1-in-a-million failure rate by a long time, until I figured out that it was crashing because of a stack overflow in the testing code, which would recursively sanity-check the produced DOM. Some documents generate a DOM >20,000 nodes deep, which is evidently too much to fit in typical C stacks, although Gumbo can handle them. (I found one page with a DOM tree 100,000 nodes deep - it was really an XML document masquerading as HTML, with a bunch of self-closing nodes that don't self-close under HTML5 parsing rules - and when I posted the link to say "Look what I found!", I got a bunch of "Kind of a dick move, linking to a page that crashes Webkit.")

I like programming assert-heavy C. As soon as something is out of wack with my mental model the whole thing explodes. I sure as hell don't want to try handling things I don't already understand.

You can use asserts in an exploratory fashion to document code you don't understand too.

Not so good for code that runs in servers, since you have no chance to drop requests or degrade gracefully.

If you hit it with enough examples of input as was done here, you can be fairly confident. But you're right, short of that, assert heavy code is going to cause headaches in servers. C is a nice and quick language to run tests on which makes it doable.

I'd be interested in reading about your learning when moving from a UI -> Algo heavy engineer :)

It's a long story, and it's also not complete yet (I'm actually doing very UI heavy work right now as a tech lead). It's also not really correct to say it started with UI - I was big into programming language theory in college, even implementing a bunch of toy interpreters/compilers, one of which even got some measure of fame on the Internet.

The 5 second overview is really that it's the same as getting good at any new skill. You find an area that you don't know how to do, and then keep working at it until you do know how to do it. Then repeat with finer-grained details. There were a bunch of skills involved in this project - C, HTML5, UTF-8 decoding, debugging, testing, autotools, CTypes, API design, documentation - that I wasn't all that good at when I started that I had to pick up along the way.

This is the most inspiring thing I've read in a long time. There's a lot of chatter on HN about how people became an "expert" in this or that, but for some reason, the way you phrased it really resonated with me. And to see the end result -- holy crap. HTML is complicated.

How does this compare to Hubbub, another C library for parsing HTML5?


From a cursory glance at the Hubbub source:

1. Hubbub provides a SAX-style (callback-based) API, while Gumbo gives you a DOM-style (tree-based) struct directly. Hubbub is likely faster in this regard, Gumbo is easier to use out-of-the-box.

2. Gumbo is better tested. It's unclear whether Hubbub's 90% test coverage is "90% of the code is tested" or "90% of the tests pass", but Gumbo has 100% code coverage, 100% of html5lib tests pass (as of 0.95; the html5lib maintainer has pointed out that additional tests were added to trunk recently that don't pass), and it's run without crashing on ~4.5B documents from Google's index.

3. Gumbo has better support for source locations and going between original text and parse tree.

4. Hubbub has character encoding detection, Gumbo doesn't.

Have you seen this before? http://vtd-xml.sourceforge.net/faq.html#How_do_I_get_started...

It's an xml parser that is neither DOM nor SAX. I haven't seen much mention of it before, except as a recommendation for Java devs. There's a C version too. It makes bold claims about performance.

"Comparing with DOM, VTD-XML is significantly faster (up to 10x), more memory-efficient (up to 5x).

Comparing with SAX/PULL, VTD-XML is not only faster, but also is capable of random-access, therefore is easier to use."

Basically by building a DOM style model in SAX fashion.

Interesting. I'm not familiar with the library. I'm familiar with the general programming model of parsing a document into a number of tokens and then encoding document structure into offsets between tokens.

It wouldn't have worked for Gumbo's purposes because

1. Gumbo captures a lot more information than can fit in a 64 bit token. For example, Gumbo decodes entity references; this requires that text be available in a fresh buffer because each individual character might be something different than the source text.

2. One of Gumbo's goals was to make it easy to write bindings in other languages. Most languages can bind to C structs easily, but binding to C function calls often requires a much more verbose preamble to setup args, return types, conversions, etc. (I was actually thinking of LLVM when I designed Gumbo's API, since the project it was initially for at the time was looking at LLVM as an embedded JIT. Binding to a struct that's C-formatted just requires defining a new type, but binding to a function call requires codegenning a lot of argument setup.)

It's a shame, I wish vtd-xml was a more popular library, so I could read more about it rather than have to do it myself. libxml2 seems to rule the roost. vtd-xml doesn't have a debian package and the C files gave a lot of warnings when I compiled. I don't know enough about its performance to say if the bold claims are true. The author says the Java version is a little faster than the C version, which strikes me as odd - I wonder is he basing that on long duration benchmarks.

I wasn't suggesting that you should have used the approach, I was wondering if you had used the approach. I've learned a little bit about the limits of this tokenising parser method, thanks for your reply.

EDIT: Badgar thanks for your comment, I'll search out your lecturer's work if I ever have to parse something. Shame your account seems banned or something. I looked through your history, and it was over saying you had trouble quitting weed or something stupid.

FWIW my house mate kicked his weed addiction by cutting out triggers: people, places and things that would encourage him to light one up. He had all the problems with it you list. He had to stop drinking for a while to have sufficient willpower. He resumed drinking after successfully kicking weed.

My undergrad thesis advisor, Bill McKeeman, wrote his parsers in this fashion. I implemented a parser using this model and extended its existing lexer.

The token stream is an array of 32 bit integers, each of which is the token type bitmasked onto an index into the input file of the start of the token. If you need the token text, you reparse. Caches can be implemented as small hash maps from token index to cached value.

The canonical AST is the CFG parse tree with fixups to convert recursion to children of a node type for a variable number of children. It is stored as an array of integers. The node is a sequence of integers, with one integer for the root followed by each child in sequence. Each internal node's value is the index of the CFG rule evaluated to produce the node, and the children of the node correspond to the CFG rule's right hand side (minus keywords). Terminals are stored as integer indexes into the token stream. Nonterminals are the integer index of the child internal node.

Bill has been a big name in compilers for the better part of 5 decades now, and he said he's been using this pattern for almost as long. It's ridiculously fast, which is why he used it decades ago for DEC.

Node.js bindings would be awesome, especially for writing a headless website parser.

I just forked Gumbo and gave it a shot, but my limited experience of v8 didn't get me very far. I was able to create and build a basic plugin, but returning the scope with a Gumbo object was beyond my limited capabilities.

I think as a little side project I may continue to work on this.

Nice idea, give the repo an I'll give it a push =)

Currently it's just a straight fork - https://github.com/jbrooksuk/gumbo-parser

I'm giving it another crack now.

You can see my first attempt at https://github.com/jbrooksuk/gumbo-parser/commit/d64f78b125f... - unfortunately it's not quite there. I can't figure out how to send back the struct created by Gumbo.

I would be quite interested in Gumbo as the backend to the awesome pure Python but otherwise rather-slow https://github.com/html5lib/html5lib-python, which actually has great whitelisting/cleaning facilities but is easily an order of magnitude slower than lxml's more limited clean_html.

PyPy JIT and html5lib is about 8x faster as it is cpython.

Gumbo's Python wrapper should be a drop-in replacement for html5lib. Just replace

     import html5lib

     from gumbo import html5lib
The tree generated from gumbo.html5lib.HTMLParser should be API-compatible with the one generated by html5lib.HTMLParser. (Possibly modulo some minor features...html5lib's maintainer has filed a bug about implementing treewalkers in the html5lib adaptor.)

I'm not sure offhand what the speed would be - I'd imagine the Gumbo backend would be significantly faster than html5lib by virtue of being written in C, but speed was not a design goal, and so I suspect it's currently significantly slower than lxml. What Gumbo gives you over lxml is HTML5 compatibility - lxml does an HTML4-approximate parse.

Well, differences off hand compared with html5lib:

- Byte strings (opposed to Unicode ones) have encoding sniffed and parsed according to that in html5lib whereas they're all handled as UTF-8 in Gumbo.

- There's a namespaceHTMLElements option in html5lib which avoids putting HTML elements in the HTML namespace, useful for some legacy HTML processing tools.

- html5lib can read directly from a file object, which might in extreme cases be a useful memory saving (though the parse tree will likely use 100x the amount of memory anyway), but perhaps is more useful when dealing with network streams (it doesn't block waiting for all the data before starting to parse).

- html5lib supports fragment parsing, as is used by innerHTML.

Otherwise, given it takes a normal html5lib tree builder, it should support almost everything else (the tree walkers, albeit with indirection from Gumbo's own representation of the tree, and related stuff like the serialiser).

Compared with libxml2, it provides what is likely a better tested parse algorithm (ultimately, libxml2's is just a few bits of error handling of the non-fatal type in the libxml2 parser with a few bits of variant behaviour. I know the experience of HubHub's author was it had a fair few bad bugs like infinite loops and the like, as well as radically different behaviour to any browser and what most web authors expect to get.

Speed wise, quickly trying to appears to be a few times quicker than html5lib under PyPy and an order of magnitude quicker under CPython. This will likely differ with the input given.

Okay, digging about some more, and actually running Gumbo in its html5lib wrapper, it appears no quicker than html5lib itself (the cost of the tree building dominates the actual parsing). :(

They already provide adapters for standard Python HTML parsing libraries[1], specifically html5lib and BeautifulSoup. This is how they suggest it be used with Python[2].

[1]: https://github.com/google/gumbo-parser/tree/master/python/gu...

[2]: https://github.com/google/gumbo-parser#python-usage

From a quick glance, it looks like you'd need to make Gumbo a backend for lxml. Is that even possible?

https://github.com/html5lib/html5lib-python/issues/105 seems to imply that such a thing is possible. I am unsure about the requirement for lxml. I was under the impression that lxml is an optional walker, the default is the slower pure python walker.

You're misunderstanding the level at which html5lib operates: it merely parses to a tree (using a "tree builder" to provide a common API to the parser to build the tree, which can be a DOM tree, an ElementTree, an lxml tree, whatever) and provides a generic "tree walker" API that walks over one of those tree formats and provides a common stream of events (start tag, end tag, text, comment, etc.) which can then be used, e.g., in the serialiser.

This can therefore be used with Gumbo by passing the lxml tree builder into its html5lib.parse like method.

I'm curious as to what was the motivation behind this project at Google. It seems to me that the only benefit of writing something like this in pure C would be performance gains over existing parsers, but it specifically says in the README that parsing performance was not one of the goals.

It actually arose out of a templating language project within Google, which was written in C++. We evaluated the existing C/C++ parsers (which at the time were Webkit, the auto-generated port of validator.nu, and another Google-internal parser - we didn't learn about Hubbub until later), and found that the effort needed to integrate with our project, and the number of dependencies they would bring into the serving system, precluded us from using them easily. Hixie suggested "Just write your own! It shouldn't be too hard, the algorithm is all specified in the HTML spec" (har, har, famous last words), and Gumbo was born out of naivete and youthful optimism. :-)

There were a bunch of reasons for the choice of C over C++:

1. At the time, we were doing a bunch of stuff with LLVM in the templating language. I'd previously been responsible for trying to integrate LLVM with C++ generated code, and it is painful, mostly because of name mangling and vtable dispatch. Providing a C API sidesteps this entirely, as LLVM can call into C code and use C structs no problem, and once the API is in C there's little reason to make the internals be in C++.

2. We wanted to provide tooling for this templating language, and the easiest way to write tooling is in Python or some other scripting language. It's easier to provide Python etc. bindings with a C API than a C++ API.

3. We'd intended from the start to open-source this. One of the team members had significant open-source experience, and he pointed out that within the open-source community, there are a number of people who basically refuse to use C++ and will instantly disqualify a C++ library. So regardless of whether these people are right, to reach the maximum number of people and prospective projects it should be in C.

Most languages can be extended by C code without too much fuss. If the C is written in a platform-independent manner (which should be possible for this library), it's pretty much write-once, run-anywhere. Even if optimal performance wasn't a goal, it's still nice to have a single portable, performant html5 parser that's been tested against billions of pages.

I think the big move here is that they didn't publish it to their own http://code.google.com site.

They post quite a bit to GitHub: https://github.com/google/

This is interesting isn't it. To their credit, perhaps, they posted it to the place where it will get the most exposure.

Nice to see they've written the parser from scratch rather than some BISON and FLEX construct that generates impenetrable code.

Just like with any generated code, you are not supposed to read the code bison and flex generate (except maybe if you are a flex/bison developer). 'Impenetrable' is neither an advantage nor a disadvantage of generated code.

Do you often read the output of some library macros that the C pre-compiler generates?

> Do you often read the output of some library macros that the C pre-compiler generates?

I do when they have bugs in them. :(

Is this what Google actually uses to parse HTML5 pages? Seems like there might be some SEO value in studying it if so. Not for any "black hat" purposes, more to understand what they can and can't easily get to.

I'm very excited because even though is useless for me is an enough small project to actually learn something of C programming...and from Google no less.

Coming from Google doesn't mean it's good code to learn from. Be careful not to pick up bad (and potentially disasterous, security-wise) habits like not checking for arithmetic overflows in functions such as enlarge_vector_if_full (vector.c) or maybe_resize_string_buffer (string_buffer.c) -- or not checking the return value from malloc.

You could file bugs for these rather than posting them in a comment thread that will likely be soon forgotten.

Can you comment on why this is C and not C++? (Although some examples are in C++)

If you are writing this as a library you're most likely going to end up having to write a C wrapper for most language bindings. In the end it may just be easier to write the whole library in C and then write the specific binding for Python, Ruby, etc.

It's very easy to re-use C code in most other languages.

Sure, but why not write in C++ and just expose a C interface?

That's possible, but it doesn't actually save all that much effort, and the interface layer would slow things down needlessly.

The parts of C++ that I most missed with this project were standard libraries for string and vector. Many times they were just accumulating or munging values that would eventually end up in the C-API parse tree, and so if I wrote them in C++, I'd just need to translate to a C implementation afterwards. I could potentially have used classes & objects for some of the states, but the array-of-function-pointers that it currently uses is basically just as easy and simpler.

Great! As a developer of a live html code editor, I've got my excellent html parser (in Delphi) already, but still have to bookmark this one for future reference

Are there plans to add bindings for PHP? Would be awesome.

I don't have plans to. It's an open-source library, though, so there's nothing stopping an enterprising programmer familiar with PHP extensions to add some herself. That's what Gumbo was designed for: to serve as a building block for other tools.

Hopefully someone will get busy on that. Great project!

any chance to open source DOM too? Thug made a solid DOM with pyV8 but still not 100% compatible as far as i know.

Um... Incoming Naive Questions.

Why do we need yet another HTML5 Parser? What's wrong with Webkit? What's wrong with the new Gecko2 HTML5 parser?

And what license is it?

A standalone parser written in C is a great asset. Pretty much any language worth mentioning has C bindings, so they are now just a bindings implementation away from having a reasonably fast (the fact that performance was a non-goal notwithstanding), standards compliant HTML parser. This is an improvement over the status-quo where most languages have bindings to lxml which is fast but has made-up error handling and a tendency to deal poorly with quite a lot of content, and some languages have slow, native implementations of the HTML standard parsing algorithm (I wrote much of Python's html5lib so I am aware both that it is slow and that it is non-trivial to speed up).

Compared to Gecko and WebKit, this gives you just the parser, which is significantly simpler than the whole engine and all you want for many applications.

Because WebKit and Gecko are not only parsers for starters. They are much more complex layout engines. Which is a whole other can of worms.

The new Gecko parser is based on a Java->C++ translation from this standalone parser: http://about.validator.nu/htmlparser/

Is it me or do the comments in the source seem sparse?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact