
Google Open-Sources Gumbo: C Library for Parsing HTML5 - bryanh
https://github.com/google/gumbo-parser#gumbo---a-pure-c-html5-parser
======
thejosh
A good thing about developing tools at Google is the access to test data with
their index.

The line:

    Tested on over 2.5 billion pages from Google's index.

That's quite awesome, and would cover quite a few edge cases.

~~~
tlb
Indeed, although anyone can also get 3.8 billion pages from CommonCrawl.

[http://commoncrawl.org/a-look-inside-common-crawls-210tb-201...](http://commoncrawl.org/a-look-inside-common-crawls-210tb-2012-web-corpus/)

~~~
faddotio
And anyone can pay for the EC2 instances to test all 3.8B pages!

------
sanxiyn
How does this compare to Hubbub, another C library for parsing HTML5?

[http://www.netsurf-browser.org/projects/hubbub/](http://www.netsurf-browser.org/projects/hubbub/)

~~~
nostrademons
From a cursory glance at the Hubbub source:

1\. Hubbub provides a SAX-style (callback-based) API, while Gumbo gives you a
DOM-style (tree-based) struct directly. Hubbub is likely faster in this
regard; Gumbo is easier to use out-of-the-box.

2\. Gumbo is better tested. It's unclear whether Hubbub's 90% test coverage is
"90% of the code is tested" or "90% of the tests pass", but Gumbo has 100%
code coverage, 100% of html5lib tests pass (as of 0.95; the html5lib
maintainer has pointed out that additional tests were added to trunk recently
that don't pass), and it's run without crashing on ~2.5B documents from
Google's index.

3\. Gumbo has better support for source locations and going between original
text and parse tree.

4\. Hubbub has character encoding detection, Gumbo doesn't.
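
To make the SAX-versus-DOM distinction in point 1 concrete, here is a minimal sketch on a toy word splitter. It uses neither Hubbub's nor Gumbo's real API; every name in it (`parse_words_sax`, `parse_words_dom`, `word_list`, and so on) is made up for illustration. The callback version pushes each item to the caller as it is found, while the struct version hands back a result to inspect afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Callback (SAX-style): the parser pushes each word to the caller. */
typedef void (*word_handler)(const char *start, size_t len, void *user_data);

void parse_words_sax(const char *text, word_handler on_word, void *user_data) {
    assert(text != NULL && on_word != NULL);
    const char *p = text;
    while (*p != '\0') {
        while (*p == ' ') p++;                 /* skip separators */
        const char *start = p;
        while (*p != '\0' && *p != ' ') p++;   /* consume one word */
        if (p > start) on_word(start, (size_t)(p - start), user_data);
    }
}

/* Struct (DOM-style): the parser fills a result the caller reads later. */
#define MAX_WORDS 16
typedef struct {
    const char *start[MAX_WORDS];
    size_t len[MAX_WORDS];
    size_t count;
} word_list;

/* Handler that stores every word; words past MAX_WORDS are dropped. */
void collect_word(const char *start, size_t len, void *user_data) {
    word_list *list = user_data;
    if (list->count < MAX_WORDS) {
        list->start[list->count] = start;
        list->len[list->count] = len;
        list->count++;
    }
}

/* The struct-returning API is the callback API plus a collecting handler. */
word_list parse_words_dom(const char *text) {
    word_list list;
    memset(&list, 0, sizeof list);
    parse_words_sax(text, collect_word, &list);
    return list;
}

/* A callback consumer that keeps no per-word storage at all. */
void count_word(const char *start, size_t len, void *user_data) {
    (void)start;
    (void)len;
    ++*(size_t *)user_data;
}
```

A caller that only wants a count can pass `count_word` and never pay for storage, which is the speed argument for callbacks. A caller that wants to walk the result more than once gets that for free from `parse_words_dom`, which is the ease-of-use argument for the struct.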

~~~
eliasmacpherson
Have you seen this before?
[http://vtd-xml.sourceforge.net/faq.html#How_do_I_get_started...](http://vtd-xml.sourceforge.net/faq.html#How_do_I_get_started_learning_VTD-XML)

It's an xml parser that is neither DOM nor SAX. I haven't seen much mention of
it before, except as a recommendation for Java devs. There's a C version too.
It makes bold claims about performance.

"Comparing with DOM, VTD-XML is significantly faster (up to 10x), more memory-
efficient (up to 5x).

Comparing with SAX/PULL, VTD-XML is not only faster, but also is capable of
random-access, therefore is easier to use."

Basically by building a DOM style model in SAX fashion.

~~~
nostrademons
Interesting. I'm not familiar with the library. I'm familiar with the general
programming model of parsing a document into a number of tokens and then
encoding document structure into offsets between tokens.

It wouldn't have worked for Gumbo's purposes because

1\. Gumbo captures a lot more information than can fit in a 64 bit token. For
example, Gumbo decodes entity references; this requires that text be available
in a fresh buffer because each individual character might be something
different than the source text.

2\. One of Gumbo's goals was to make it easy to write bindings in other
languages. Most languages can bind to C structs easily, but binding to C
function calls often requires a much more verbose preamble to set up args,
return types, conversions, etc. (I was actually thinking of LLVM when I
designed Gumbo's API, since the project it was initially for at the time was
looking at LLVM as an embedded JIT. Binding to a struct that's C-formatted
just requires defining a new type, but binding to a function call requires
codegenning a lot of argument setup.)
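
As a rough illustration of the "offsets between tokens" model from point 1, the sketch below packs a token into 64 bits. The bit layout is an assumption made up for this example (32-bit offset, 24-bit length, 8-bit kind) and is not VTD-XML's actual record format; the function names are hypothetical too.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, NOT VTD-XML's real record format:
 * bits 0-31 byte offset, bits 32-55 length, bits 56-63 token kind. */
typedef uint64_t token;

token token_pack(uint32_t offset, uint32_t length, uint8_t kind) {
    assert(length < (1u << 24));   /* length must fit in 24 bits */
    return (uint64_t)offset | ((uint64_t)length << 32) | ((uint64_t)kind << 56);
}

uint32_t token_offset(token t) { return (uint32_t)(t & 0xFFFFFFFFu); }
uint32_t token_length(token t) { return (uint32_t)((t >> 32) & 0xFFFFFFu); }
uint8_t  token_kind(token t)   { return (uint8_t)(t >> 56); }
```

A token like this can only point at a slice of the original bytes. For the source text `a &amp; b` the slice is 9 bytes long, but the decoded text `a & b` is 5 bytes that appear nowhere contiguously in the source, so decoded text needs its own buffer.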

~~~
eliasmacpherson
It's a shame, I wish vtd-xml was a more popular library, so I could read more
about it rather than have to do it myself. libxml2 seems to rule the roost.
vtd-xml doesn't have a debian package and the C files gave a lot of warnings
when I compiled. I don't know enough about its performance to say if the bold
claims are true. The author says the Java version is a little faster than the
C version, which strikes me as odd - I wonder if he is basing that on
long-duration benchmarks.

I wasn't suggesting that you should have used the approach, I was wondering if
you had used the approach. I've learned a little bit about the limits of this
tokenising parser method, thanks for your reply.

EDIT: Badgar thanks for your comment, I'll search out your lecturer's work if
I ever have to parse something. Shame your account seems banned or something.
I looked through your history, and it was over saying you had trouble quitting
weed or something stupid.

FWIW my house mate kicked his weed addiction by cutting out triggers: people,
places and things that would encourage him to light one up. He had all the
problems with it you list. He had to stop drinking for a while to have
sufficient willpower. He resumed drinking after successfully kicking weed.

------
jbrooksuk
Node.js bindings would be awesome, especially for writing a headless website
parser.

I just forked Gumbo and gave it a shot, but my limited experience of v8 didn't
get me very far. I was able to create and build a basic plugin, but returning
the scope with a Gumbo object was beyond my limited capabilities.

I think as a little side project I may continue to work on this.

~~~
joshguthrie
Nice idea, give the repo and I'll give it a push =)

~~~
jbrooksuk
Currently it's just a straight fork -
[https://github.com/jbrooksuk/gumbo-parser](https://github.com/jbrooksuk/gumbo-parser)

I'm giving it another crack now.

You can see my first attempt at
[https://github.com/jbrooksuk/gumbo-parser/commit/d64f78b125f...](https://github.com/jbrooksuk/gumbo-parser/commit/d64f78b125fa42e7af47f1dc8ed6ca9d71371019)
- unfortunately it's not quite there. I can't figure out how to send back the
struct created by Gumbo.

------
bryanh
I would be quite interested in Gumbo as the backend to the awesome pure-Python
but otherwise rather slow
[https://github.com/html5lib/html5lib-python](https://github.com/html5lib/html5lib-python),
which actually has great whitelisting/cleaning facilities but is easily an
order of magnitude slower than lxml's more limited clean_html.

html5lib under the PyPy JIT is about 8x faster than it is under CPython.

~~~
nostrademons
Gumbo's Python wrapper should be a drop-in replacement for html5lib. Just
replace

    import html5lib

with

    from gumbo import html5lib

The tree generated from gumbo.html5lib.HTMLParser should be API-compatible
with the one generated by html5lib.HTMLParser. (Possibly modulo some minor
features...html5lib's maintainer has filed a bug about implementing
treewalkers in the html5lib adaptor.)

I'm not sure offhand what the speed would be - I'd imagine the Gumbo backend
would be significantly faster than html5lib by virtue of being written in C,
but speed was not a design goal, and so I suspect it's currently significantly
slower than lxml. What Gumbo gives you over lxml is HTML5 compatibility - lxml
does an HTML4-approximate parse.

~~~
gsnedders
Well, differences off hand compared with html5lib:

\- Byte strings (opposed to Unicode ones) have encoding sniffed and parsed
according to that in html5lib whereas they're all handled as UTF-8 in Gumbo.

\- There's a namespaceHTMLElements option in html5lib which avoids putting
HTML elements in the HTML namespace, useful for some legacy HTML processing
tools.

\- html5lib can read directly from a file object, which might in extreme cases
be a useful memory saving (though the parse tree will likely use 100x the
amount of memory anyway), but perhaps is more useful when dealing with network
streams (it doesn't block waiting for all the data before starting to parse).

\- html5lib supports fragment parsing, as is used by innerHTML.

Otherwise, given it takes a normal html5lib tree builder, it should support
almost everything else (the tree walkers, albeit with indirection from Gumbo's
own representation of the tree, and related stuff like the serialiser).

Compared with libxml2, it provides what is likely a better tested parse
algorithm (ultimately, libxml2's is just a few bits of error handling of the
non-fatal type in the libxml2 parser with a few bits of variant behaviour). I
know the experience of Hubbub's author was that libxml2 had a fair few bad
bugs, like infinite loops, as well as radically different behaviour from any
browser and from what most web authors expect to get.

Speed-wise, from quickly trying it, it appears to be a few times quicker than
html5lib under PyPy and an order of magnitude quicker under CPython. This will
likely differ with the input given.

~~~
gsnedders
Okay, digging about some more, and actually running Gumbo in its html5lib
wrapper, it appears no quicker than html5lib itself (the cost of the tree
building dominates the actual parsing). :(

------
avian
I'm curious as to what was the motivation behind this project at Google. It
seems to me that the only benefit of writing something like this in pure C
would be performance gains over existing parsers, but it specifically says in
the README that parsing performance was not one of the goals.

~~~
nostrademons
It actually arose out of a templating language project within Google, which
was written in C++. We evaluated the existing C/C++ parsers (which at the time
were Webkit, the auto-generated port of validator.nu, and another Google-
internal parser - we didn't learn about Hubbub until later), and found that
the effort needed to integrate with our project, and the number of
dependencies they would bring into the serving system, precluded us from using
them easily. Hixie suggested "Just write your own! It shouldn't be too hard,
the algorithm is all specified in the HTML spec" (har, har, famous last
words), and Gumbo was born out of naivete and youthful optimism. :-)

There were a bunch of reasons for the choice of C over C++:

1\. At the time, we were doing a bunch of stuff with LLVM in the templating
language. I'd previously been responsible for trying to integrate LLVM with
C++ generated code, and it is _painful_ , mostly because of name mangling and
vtable dispatch. Providing a C API sidesteps this entirely, as LLVM can call
into C code and use C structs no problem, and once the API is in C there's
little reason to make the internals be in C++.

2\. We wanted to provide tooling for this templating language, and the easiest
way to write tooling is in Python or some other scripting language. It's
easier to provide Python etc. bindings with a C API than a C++ API.

3\. We'd intended from the start to open-source this. One of the team members
had significant open-source experience, and he pointed out that within the
open-source community, there are a number of people who basically refuse to
use C++ and will instantly disqualify a C++ library. So regardless of whether
these people are right, to reach the maximum number of people and prospective
projects it should be in C.

------
pokoleo
I think the big move here is that they didn't publish it to their own
[http://code.google.com](http://code.google.com) site.

~~~
tvon
They post quite a bit to GitHub:
[https://github.com/google/](https://github.com/google/)

------
xedarius
Nice to see they've written the parser from scratch rather than some BISON and
FLEX construct that generates impenetrable code.

~~~
idlewan
Just like with any generated code, you are not supposed to read the code bison
and flex generate (except maybe if you are a flex/bison developer).
'Impenetrable' is neither an advantage nor a disadvantage of generated code.

Do you often read the output of some library macros that the C pre-compiler
generates?

~~~
munificent
> Do you often read the output of some library macros that the C pre-compiler
> generates?

I do when they have bugs in them. :(

------
Lasher
Is this what Google actually uses to parse HTML5 pages? Seems like there might
be some SEO value in studying it if so. Not for any "black hat" purposes, more
to understand what they can and can't easily get to.

------
cliveowen
I'm very excited because, even though it's useless for me, it's a small enough
project to actually learn something about C programming from...and from Google
no less.

~~~
clarry
Coming from Google doesn't mean it's good code to learn from. Be careful not
to pick up bad (and potentially disastrous, security-wise) habits like not
checking for arithmetic overflows in functions such as enlarge_vector_if_full
(vector.c) or maybe_resize_string_buffer (string_buffer.c) -- or not checking
the return value from malloc.
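
For anyone learning from this, a minimal sketch of what those checks can look like in a grow-on-demand array. This is not Gumbo's code: `int_vector` and `int_vector_push` are hypothetical names, and the growth policy (start at 4, then double) is just an assumption for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable array; not Gumbo's GumboVector. */
typedef struct {
    int *data;
    size_t length;
    size_t capacity;
} int_vector;

/* Appends value. Returns 1 on success, 0 on arithmetic overflow or
 * allocation failure. On failure the vector is left unchanged. */
int int_vector_push(int_vector *v, int value) {
    assert(v != NULL);
    if (v->length == v->capacity) {
        size_t new_capacity;
        if (v->capacity == 0) {
            new_capacity = 4;
        } else if (v->capacity > SIZE_MAX / 2) {
            return 0;                       /* doubling would overflow */
        } else {
            new_capacity = v->capacity * 2;
        }
        if (new_capacity > SIZE_MAX / sizeof *v->data) {
            return 0;                       /* byte count would overflow */
        }
        int *grown = realloc(v->data, new_capacity * sizeof *v->data);
        if (grown == NULL) {
            return 0;                       /* old buffer is still valid */
        }
        v->data = grown;
        v->capacity = new_capacity;
    }
    v->data[v->length++] = value;
    return 1;
}
```

Assigning the result of `realloc` to a temporary matters here: writing it straight into `v->data` would leak the old buffer whenever the call fails.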

~~~
nostrademons
You could file bugs for these rather than posting them in a comment thread
that will likely be soon forgotten.

------
swah
Can you comment on why this is C and not C++? (Although some examples are in
C++)

~~~
desas
It's very easy to re-use C code in most other languages.

~~~
wtetzner
Sure, but why not write in C++ and just expose a C interface?

~~~
nostrademons
That's possible, but it doesn't actually save all that much effort, and the
interface layer would slow things down needlessly.

The parts of C++ that I most missed with this project were standard libraries
for string and vector. Many times they were just accumulating or munging
values that would eventually end up in the C-API parse tree, and so if I wrote
them in C++, I'd just need to translate to a C implementation afterwards. I
could potentially have used classes & objects for some of the states, but the
array-of-function-pointers that it currently uses is basically just as easy
and simpler.
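
The array-of-function-pointers idea can be sketched on a toy two-state scanner. This is an illustration only and not Gumbo's tokenizer; the states, the `machine` struct, and `count_tags` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { STATE_DATA, STATE_TAG, STATE_COUNT } state;

typedef struct {
    state current;
    int tags_seen;
} machine;

typedef void (*state_handler)(machine *m, char c);

/* Outside a tag: a '<' opens one. */
void handle_data(machine *m, char c) {
    if (c == '<') m->current = STATE_TAG;
}

/* Inside a tag: a '>' closes it and counts it. */
void handle_tag(machine *m, char c) {
    if (c == '>') {
        m->tags_seen++;
        m->current = STATE_DATA;
    }
}

/* One handler per state, indexed by the enum value. */
static const state_handler handlers[STATE_COUNT] = {
    [STATE_DATA] = handle_data,
    [STATE_TAG]  = handle_tag,
};

/* Feeds each character to the handler for the current state. */
int count_tags(const char *text) {
    assert(text != NULL);
    machine m = { STATE_DATA, 0 };
    for (const char *p = text; *p != '\0'; p++) {
        handlers[m.current](&m, *p);
    }
    return m.tags_seen;
}
```

Adding a state means adding one enum value, one function, and one table entry, which is roughly what a class per state would buy in C++.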

------
edwinyzh
Great! As a developer of a live html code editor, I've got my excellent html
parser (in Delphi) already, but still have to bookmark this one for future
reference

------
frik
Are there plans to add bindings for PHP? Would be awesome.

~~~
nostrademons
I don't have plans to. It's an open-source library, though, so there's nothing
stopping an enterprising programmer familiar with PHP extensions to add some
herself. That's what Gumbo was designed for: to serve as a building block for
other tools.

~~~
soulclap
Hopefully someone will get busy on that. Great project!

------
ksec
Um... Incoming Naive Questions.

Why do we need yet another HTML5 Parser? What's wrong with Webkit? What's
wrong with the new Gecko2 HTML5 parser?

And what license is it?

~~~
venomsnake
Because WebKit and Gecko are not only parsers for starters. They are much more
complex layout engines. Which is a whole other can of worms.

~~~
mjn
The new Gecko parser is based on a Java->C++ translation from this standalone
parser:
[http://about.validator.nu/htmlparser/](http://about.validator.nu/htmlparser/)

------
carsonreinke
Is it me or do the comments in the source seem sparse?

