But when you think about it, it is kind of logical: inner loops and low-level code tend to be fairly static and are often re-used, so it pays off to write them in a language optimized for that use-case. The structural code, on the other hand, can afford to run much more slowly and sees relatively little re-use from one application to the next, so it pays off to write it in another language.
So we get this split roughly halfway down the application stack, where we switch from an interactive, interpreted high-level language to a compiled low-level language.
It's a very interesting optimum.
A very similar split goes right through the middle of Objective-C. Although it is compiled, the "Objective" part always felt pretty close to interactive and is pretty high-level, and C is, well, C. Same concept, less mismatch between the parts.
One of Objective-Smalltalk's goals is to invert the Objective-C trade-off: make the interactive, high-level part primary, while still allowing a smooth transition to the low-level parts, without trying to make that transition automatic via either a heldencompiler or a super-smart JIT.
Yes, it is a very interesting optimum, though I think it is one that can be improved on.
Take parsing a config file: it doesn't need to be fast unless it's truly in your hot path and you're doing millions of them, and I like not having to worry about memory management for that kind of code. On the other hand, if I have a hot path or a slow operation, being able to compile a library to do the heavy lifting and glue it together from a higher-level language means you spend a lot less time fighting your language and a lot more time solving your problems.
For example, today I was doing some homomorphic encryption experiments, and leveraged the GNU GMP library for some integer calculations. Crossing that "boundary" in Nim was the easiest thing I've ever done. Take a high level "ref" type (`Int`) which is actually `type Int = ref mpz_t`: This means I can write nice high-level procs over the top of the low-level GMP types, but also just de-reference any `Int` to pass it to the "real" GMP functions as needed!
I'd love to see other languages explore this, too, as I think it's really powerful. It's my biggest gripe with Node.js: FFI was a bit "ick" the last time I tried it!
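(For readers who want the shape of that pattern without Nim, here is a rough Python/ctypes analog of the same idea: a thin high-level wrapper over the raw GMP calls. The helper names `new_int`/`add` are made up for illustration, and it assumes libgmp is installed:)

    import ctypes
    from ctypes.util import find_library

    # Mirror of GMP's __mpz_struct (mpz_t is an array of one of these).
    class mpz_t(ctypes.Structure):
        _fields_ = [("_mp_alloc", ctypes.c_int),
                    ("_mp_size", ctypes.c_int),
                    ("_mp_d", ctypes.c_void_p)]

    gmp = ctypes.CDLL(find_library("gmp"))
    gmp.__gmpz_get_str.restype = ctypes.c_char_p

    def new_int(value):
        # "High-level" constructor wrapping the raw GMP init call.
        x = mpz_t()
        gmp.__gmpz_init_set_str(ctypes.byref(x), value.encode(), 10)
        return x

    def add(a, b):
        # High-level proc layered over the "real" GMP function.
        r = new_int("0")
        gmp.__gmpz_add(ctypes.byref(r), ctypes.byref(a), ctypes.byref(b))
        return r

    a = new_int("123456789012345678901234567890")
    b = new_int("1")
    print(gmp.__gmpz_get_str(None, 10, ctypes.byref(add(a, b))).decode())

The difference is exactly the commenter's point: in Nim the `ref mpz_t` makes the low-level type and the high-level type the same thing, whereas here the wrapper layer has to be maintained by hand.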
But yes, as long as you write good C, it's a pretty awesome pattern.
I have worked on document formats (Office, PDF, etc.) and I understand that these standards are simply tedious!
It shows that a good engineer took the time to think not only about speed, but also about correctness of the implementation. Compiling things with `-pedantic-errors -Wall -Werror` is a practice I very rarely see in the wild, and it should be heralded as good practice. It's far too common to see hundreds of warnings zip past while compiling, and sometimes they really do tell you about problems you need to solve.
You should release your code with no warnings on the current compiler, but new warnings shouldn't break everything.
html -> gumbo parse tree -> lxml tree
What makes it efficient even with two tree transformations?
Is one of the above trees constructed sequentially, in a less recursive way?
The slightly longer answer is that the vast majority of time in parsing is spent actually parsing - examining each character and adjusting your state machine accordingly. When I was benchmarking the gumbo-python bindings, it was 95%+ parsing, < 5% spent on tree construction. Speed up the parsing part by 100x (which isn't all that unreasonable, when you consider that a field reference in Python involves hashing a string and looking up an object in a dictionary attached to the object), and you'll get an equivalent speedup in total runtime that can pay for a lot of tree reconstructions. The Gumbo parse tree itself will often fit in L2 cache (IIRC I benchmarked it, and about 90% of documents used < 400K RAM), so it doesn't take much time to traverse.
Doing the Gumbo => lxml translation in C rather than Python gives a similar speedup for tree construction: instead of having to look up fields in Python as dictionary references, you can do it in C by memory offset.
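(For a sense of what that pipeline looks like from the Python side, a minimal sketch, assuming the html5-parser package and lxml are installed:)

    from html5_parser import parse     # Gumbo parses in C and hands back an lxml tree
    from lxml.etree import tostring

    root = parse("<p>Hello <b>world")  # deliberately malformed; HTML5 rules repair it
    print(tostring(root))              # serialized, repaired document
    print(root.xpath("//b/text()"))    # from here on it's ordinary lxml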
>> benchmarking the gumbo-python bindings, it was 95%+ parsing, < 5% spent on tree construction
95% on parsing - this speeds up since it's done in C instead of Python - ok.
5% spent on tree construction - I was a little surprised by this. I had assumed dictionaries might help: construction seems expensive for a tree, but traversal becomes optimal.
But as I write this, I see you have mentioned that the trees in question are accessed by memory OFFSET. That may explain why it's quick. Otherwise, DOM-style trees (which is what I assumed) would be expensive to construct and traverse, even in C.
Finally, this raises another question: memory-offset trees? How efficient are they to construct? Without knowing the depth of a tag and of its children in advance, it seems this would have to be a two-pass parse, with some bookkeeping data structure (a tree) from the first pass used to build the memory-offset tree on a second pass. Can this be done in a single parse? Could you share some insight into this?
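(For what it's worth, the HTML5 tree-construction algorithm builds the tree in a single pass using a stack of open elements, so no depth information is needed up front. A toy sketch of that idea in Python, not Gumbo's actual code:)

    import re

    # Minimal single-pass tree builder: children are appended to whatever
    # element is currently on top of the stack of open elements, so the
    # depth of a tag never has to be known in advance.
    def build_tree(html):
        root = {"tag": "#root", "children": []}
        open_elements = [root]
        for m in re.finditer(r"<(/?)(\w+)>|([^<]+)", html):
            closing, tag, text = m.groups()
            if text:
                open_elements[-1]["children"].append({"tag": "#text", "text": text})
            elif not closing:
                node = {"tag": tag, "children": []}
                open_elements[-1]["children"].append(node)
                open_elements.append(node)
            elif len(open_elements) > 1:
                open_elements.pop()   # real parsers match tag names / error-recover here
        return root

    print(build_tree("<p>foo<b>bar</b>baz</p>"))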
I know that lxml won't parse as many cases as this module, but in many cases lxml can do the job.
Yes, it does need a new maintainer, or a maintained fork. I finally lost my commit rights to the /google GitHub org about a year ago, 2 years after leaving Google, so I don't actually have the ability to merge pull requests anymore. I still have some friends at Google, some of whom were involved in Gumbo's development, but they've moved on to other projects as well and likely wouldn't be able to take over. I'm working on a new startup idea myself, so my time is fairly limited, though I can answer questions if they come up.
Without further context, that looks like a pretty serious security lapse on Google's part.
There are quite a few changes in the original upstream since then: gumbo_malloc -> gumbo_parser_allocate, free_node -> destroy_node, passing the parser around as an argument, &c.
I have no objection to merging in changes from gumbo-parser that are not in sigil-gumbo over time, but it will need to be done gradually, as there is only so much time I can devote to html5-parser now that it meets the needs of calibre.
I have some doubts that html5-parser would be 30x faster than html5lib without the performance work in 0.9.4; Gumbo up through 0.9.3 was really slow for a C library.
And if you really insist on doing that, why not just parse XHTML as HTML? HTML5 parsing rules can already handle some XHTML-like constructs; it's what browsers do when they're served XHTML as text/html. If it's good enough for them, it should be good enough for you.
On the web: very rarely. I've actually never seen a self-closing <title /> tag, but I've seen other cases where docs that are actually XML are served with a text/html MIME type and this creates pathological DOM structures.
(I have a funny story from Gumbo's testing that illustrates this. I ran across a document that was actually some XML dialect with 45,000 self-closing elements in a row, but was served with a text/html MIME type. Since HTML5 doesn't recognize self-closing elements that aren't in the spec, this created a DOM tree with 45,000 levels of nesting. Gumbo could handle this because it uses an iterative state machine for parsing, but my testing code did recursive descent on this and choked. I posted it on MemeGen - Google's internal water-cooler website - with a link to the offending web page ... and then got a few emails from other Googlers about how it was kinda rude of me to crash their browsers. It turns out Chrome couldn't handle the page, and would die with a stack overflow when viewing it.)
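(If it's not obvious why the iterative parser survives where the recursive code dies: recursion burns one stack frame per nesting level, while an explicit stack doesn't. A toy illustration in Python, nothing to do with Gumbo's actual code:)

    # Build a chain of 45,000 nested nodes and walk it two ways.
    def make_deep_tree(depth):
        node = {"tag": "leaf", "children": []}
        for _ in range(depth):
            node = {"tag": "div", "children": [node]}
        return node

    def count_recursive(node):
        return 1 + sum(count_recursive(c) for c in node["children"])

    def count_iterative(node):
        count, stack = 0, [node]
        while stack:
            n = stack.pop()
            count += 1
            stack.extend(n["children"])
        return count

    tree = make_deep_tree(45_000)
    print(count_iterative(tree))         # fine: 45001 nodes
    try:
        print(count_recursive(tree))     # blows past the default recursion limit
    except RecursionError:
        print("RecursionError - the fate of the recursive-descent test code")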
I'm assuming the tokenizer successfully tokenizes through "<title", and that we're in the "tag name state"; from there, we see the space, that takes us to the "before attribute name state"; then we consume the solidus, and switch to the "self-closing start tag state"; we see the ">" and the spec says,
> Set the self-closing flag of the current tag token. Switch to the data state. Emit the current tag token.
So, now the tokenizer is here:

    <title /></head><body><p>foo
             ^ about to consume this
So, let's say the tokenizer escapes okay. The parser is another story…
Let's assume at this point we're in the "in head" insertion mode; we get a start tag from the tokenizer as per above. This state tells us,
> A start tag whose tag name is "title"
> Follow the generic RCDATA element parsing algorithm.
but that algorithm seems to be directed at the tokenizer, which runs a bit counter to
> The input to the tree construction stage is a sequence of tokens from the tokenization stage.
Regardless, the self-closing flag on the start tag that got emitted doesn't get acknowledged, which seems to cause a "parse error". But what should a parser do with that parse error? How should we behave? If we follow the link, we end up with
> The error handling for parse errors is well-defined (that's the processing rules described throughout this specification)
…which, no spec, no you don't.
The best I've got is that the tokenizer and tree construction stages aren't independent; that is, if we go back to the bit where we're supposed to be following the "generic RCDATA element parsing algorithm" (which is under tree construction), it states,
> switch the tokenizer to the RCDATA state.
which seems to heavily imply that we adjust the tokenizer's state from within tree construction. Okay. So if we follow the algorithm for RCDATA, noting that the tokenizer was about to consume </head>, AFAICT, it gets very quickly into the "RCDATA end tag open state" after eating the </ from </head>, then "RCDATA end tag name state", then eventually consumes the > and we hit:
> "If the current end tag token is an appropriate end tag token, then switch to the data state and emit the current tag token. Otherwise, treat it as per the "anything else" entry below."
So what's an appropriate end tag token?
> An appropriate end tag token is an end tag token whose tag name matches the tag name of the last start tag to have been emitted from this tokenizer,
The last tag we emitted was that weird <title/> thing, so no, this is not an appropriate end tag, and we skip to "anything else"
> Switch to the RCDATA state. Emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS character token, and a character token for each of the characters in the temporary buffer (in the order they were added to the buffer). Reconsume the current input character.
Uh-oh. From here, my mental parse just loops around in the RCDATA state; the title of the document ends up being "</head><body><p>foo".
I threw up a simple webserver to host that string: Both Chrome and Firefox agree! But as a parser, I feel abused.
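(You can check the same thing from Python without standing up a webserver, using a spec-following parser; a quick sketch, assuming html5lib is installed:)

    import html5lib

    # html5lib builds a namespaced ElementTree by default.
    doc = html5lib.parse("<title /></head><body><p>foo")
    title = doc.find(".//{http://www.w3.org/1999/xhtml}title")
    # Per the walkthrough above, this should print the swallowed markup:
    # '</head><body><p>foo'
    print(repr(title.text))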
The HTML5 algorithm could be a bit clearer here, but I think it does specify a well formed result, and additionally would flag this document as having a parse error (for an unacknowledged self-closing tag, but hilariously? oddly?, not for the rampant consumption of tags by the RCDATA algorithm…)
But that seems to be what the spec says to do. I'm going to go out on a limb and guess that ebook readers don't follow the spec, and that if you attempted to, people would (incorrectly) blame your software even though it's doing the right thing, putting you between a rock and a hard place because the ebook writers can't write well-formed markup? (I really feel for you here, if that's correct; ideally, you could just emit a blank book and tell the ebook writer to fix it… or maybe emit a warning at the bottom that states "This eBook contained errors; a best effort has been made to correct them; if you experience problems with this book, notify the publisher.") But then, my SO shows me plenty of plain typographical errors in her ebooks, and she can't even see the markup.
Has anyone benchmarked this against html5lib compiled with PyPy?
html5lib is slow in CPython because there are a lot of low-level per-character tests, as required by the HTML5 spec. The spec specifies in great detail how bad HTML is to be parsed, which means a lot of if statements. CPython is a dog on code like that.
PyPy, not so much.
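(To make the "lots of per-character if statements" point concrete, here is a toy fragment in the style the spec mandates; the state names echo the spec, but the code is just an illustration of the shape of the work, not html5lib's code:)

    # Toy per-character tokenizer loop: every character goes through a chain
    # of state tests and comparisons, which is exactly what CPython executes
    # slowly and a tracing JIT handles well.
    def tokenize(data):
        state, buf, tokens = "data", "", []
        for c in data:
            if state == "data":
                if c == "<":
                    if buf:
                        tokens.append(("chars", buf))
                        buf = ""
                    state = "tag open"
                else:
                    buf += c
            elif state == "tag open":
                if c == "/":
                    state = "end tag open"
                elif c.isalpha():
                    state, buf = "tag name", c
                # ... the real spec has ~80 states and many more branches per state
            elif state == "tag name":
                if c == ">":
                    tokens.append(("start tag", buf))
                    state, buf = "data", ""
                else:
                    buf += c
            elif state == "end tag open":
                if c.isalpha():
                    state, buf = "end tag name", c
            elif state == "end tag name":
                if c == ">":
                    tokens.append(("end tag", buf))
                    state, buf = "data", ""
                else:
                    buf += c
        if buf and state == "data":
            tokens.append(("chars", buf))
        return tokens

    print(tokenize("<p>hi there</p>"))
    # [('start tag', 'p'), ('chars', 'hi there'), ('end tag', 'p')]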
Yes, http://speed.pypy.org/comparison/ has html5lib parsing (an old, static copy of) the HTML5 spec as a benchmark. It's about 3x faster than CPython, which is to say, still very slow.
I agree there's a good argument that writing new parser code in C is not a great idea, but I don't think it's possible — or at least easy — to get reasonable parsing speed in pure Python. My ideal solution here would be bindings to a Rust parser e.g. html5ever, although there is a tradeoff there in terms of ease of distribution.
The main missing feature in html5ever for these purposes is the ability to construct the final parse tree in the Rust code (cf. lxml), without which the performance bottleneck becomes constructing the Python objects to represent the document. Of course a streaming API like SAX can be even faster, but often isn't all that useful.
https://speed.python.org/comparison/?exe=12%2BL%2Bmaster%2C1... has an up-to-date version of html5lib, albeit only on CPython: notably, both 3.6 and the latest 3.7 build are significantly faster than 2.7.
That said, I don't think html5lib is going to become massively quicker: string allocations are just going to become an ever bigger issue (i.e., s[1:10] causes an allocation in Python v. just referencing the subsequence), and even using Cython, at least under CPython, isn't going to help with that.
I initially had the API for Gumbo use string slices a lot more than the final released API, and then found that I couldn't do it and needed to allocate in order to maintain correctness. I'd done a patch that arena-allocated all memory used in the parse, which gave a fairly significant CPU speedup, but it also bloated max memory usage in ways that some clients found unacceptable, so I never merged it. Small C strings at least are quite lightweight; Python strings have a lot of additional overhead, and much of the PyObject structure itself requires chasing pointers.
(I run a web crawler written in Python, which means code exposed to arbitrarily bad HTML. I've had to report and fix bugs in BeautifulSoup and html5lib. At least they fail in a well-defined way. C code can fail in arbitrarily bad ways, and C string processing is notorious for being troublesome.)
Funny how somebody submitted exactly this on HN a few hours ago: https://news.ycombinator.com/item?id=14591017
For example, constructing a BeautifulSoup object without explicitly naming a parser prints:

    /usr/lib/python3.6/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
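(For reference, naming the parser explicitly silences the warning; a minimal sketch, assuming bs4 and lxml are installed:)

    from bs4 import BeautifulSoup

    html = "<p>Hello <b>world"
    soup = BeautifulSoup(html)          # emits the UserWarning above
    soup = BeautifulSoup(html, "lxml")  # explicit parser choice, no warning
    print(soup.find("b").text)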
Basically your language needs an equivalent of lxml
Strictness works well when you have a small number of active producers and consistent, constantly renewed content.
The web is the opposite - a huge number of producers and widely varying content, some of which is 1-2 decades old and will never be updated.
You can design and build a great, strict HTML parser - it just won't work well for a significant portion of the www.
And if you make one that is popular enough, you can change the direction of evolution of the web! (See mobile safari and flash)