
MyHTML – HTML Parser on Pure C with POSIX Threads Support - yxlx
http://lexborisov.github.io/myhtml/
======
leeoniya
> By the way, SCRIPT tag tokenization is a hell of an effort. I had to draw a
> _graph_ [...] Next in turn are the CSS parser and Renderer.

CSS parsing should be ok, but layout computation is hard, especially with all
the latest specs. The graph presented in the article will be the size of a
postage stamp on an aircraft carrier deck.

Take a look at the Cassowary constraint solver, btw:
[http://overconstrained.io/](http://overconstrained.io/)

> I'm writing them all by myself, still full of energy.

I wish the author the best of luck.

~~~
vjeux
The great thing about writing the layout computation code is that specs are
mostly additive. You can start by supporting only a few properties and then as
you progress support more and more.
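
As a rough illustration of that additive approach (a hypothetical sketch, not css-layout's actual algorithm), here is a layout pass that only understands `width`, `height` and `padding`, and stacks children vertically. The `layout` function and the dict-based node shape are assumptions made up for this example; any other property is simply ignored until support for it is added.

```python
def layout(node, left=0, top=0):
    """Compute boxes for a node tree, stacking children in a column.

    Hypothetical node shape: {"style": {...}, "children": [...]}.
    Only 'width', 'height' and 'padding' are read; other properties are ignored.
    """
    style = node.get("style", {})
    padding = style.get("padding", 0)
    cursor = top + padding          # y position of the next child
    widest = 0
    children = []
    for child in node.get("children", []):
        box = layout(child, left + padding, cursor)
        children.append(box)
        cursor += box["height"]
        widest = max(widest, box["width"])
    # Explicit sizes win; otherwise shrink-wrap the children plus padding.
    width = style.get("width", widest + 2 * padding)
    height = style.get("height", cursor - top + padding)
    return {"left": left, "top": top, "width": width,
            "height": height, "children": children}
```

Supporting another property, say margins, would mean reading one more key and adjusting the cursor arithmetic, without touching the rest.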

I've used this approach for css-layout[1] and in 2 weeks I got enough
implemented to support most use cases we've needed to build mobile apps.

Also, Cassowary won't really help you there. It's going to take a much bigger
effort to translate CSS into constraints than just reimplementing the steps
themselves.

[1] [https://github.com/facebook/css-layout](https://github.com/facebook/css-layout)

~~~
leeoniya
cool project!

flex-box is a mostly self-contained and powerful spec (if i understand
correctly).

however, when you don't need to account for floats, relative layout, mixed
box-sizing, negative margins, complex overflow conditions and interaction with
a ton of older specs, you vastly simplify the problem space for yourself.

it makes perfect sense for a modern system but is quite far from a general
impl that can compute layout from html+css unconditionally. the article starts
with:

"Once I got an X idea, but its implementation required a calculated DOM with
all its styles and goodies"

so the goal is not "the most useful subset". flexbox is currently the
least-used (& least supported) layout, so for the author's purposes, which
sound like scraping existing markup, it would not help very much.

------
scrollaway
This is incredibly clean code. Large, long-term single-person hobby projects
make for some kickass codebases. Well done.

------
lxe
Amazing work. How does this compare (in terms of speed mostly) to Google's
gumbo parser?

~~~
fabrice_d
It looks much faster:
[http://lexborisov.github.io/myhtml/bm/time_0_100.png](http://lexborisov.github.io/myhtml/bm/time_0_100.png)

~~~
kudosall
could you elaborate please? it looks quite radical.

------
legulere
[https://github.com/servo/html5ever](https://github.com/servo/html5ever) seems
to also have a (not yet complete) C API

------
mablae
Putting "my" in front of anything should be forbidden.

Just "my" two cents.

~~~
vardump
MySQL. Although that My refers to the author's daughter's name. You can guess
the rest of his kids' names from other products: MaxDB and MariaDB.

------
agumonkey
Interesting to see. I just took on a handmade XML parser as a personal
challenge, in Python though, and I've been hitting nasty performance issues
compared to libxml2.

~~~
Mikhail_Edoshin
Basic XML parsing should be very simple: it's deterministic with one-symbol
lookahead. There are a number of small C parsers out there and even one
written in assembly. If you want to validate it, though, or parse DTDs, that's
a different story.
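
To make the one-symbol lookahead point concrete, here is a minimal sketch in Python. It is a toy covering only open tags, close tags and text, so no attributes, self-closing tags, comments, CDATA, entities or DTDs. The names `tokenize` and `parse` are made up for this example and do not come from any library.

```python
def tokenize(xml):
    """Yield ('open', name), ('close', name) or ('text', data) tuples."""
    i, n = 0, len(xml)
    while i < n:
        if xml[i] == "<":
            # One symbol of lookahead tells a close tag from an open tag.
            closing = i + 1 < n and xml[i + 1] == "/"
            end = xml.find(">", i)
            if end == -1:
                raise ValueError("unterminated tag at offset %d" % i)
            name = xml[i + (2 if closing else 1):end].strip()
            yield ("close" if closing else "open", name)
            i = end + 1
        else:
            end = xml.find("<", i)
            if end == -1:
                end = n
            text = xml[i:end]
            if text.strip():        # skip whitespace-only runs
                yield ("text", text)
            i = end


def parse(xml):
    """Build a nested (name, children) tree and check tag balance."""
    root = ("#document", [])
    stack = [root]
    for kind, value in tokenize(xml):
        if kind == "open":
            node = (value, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif kind == "close":
            if len(stack) == 1 or stack[-1][0] != value:
                raise ValueError("unexpected </%s>" % value)
            stack.pop()
        else:
            stack[-1][1].append(value)
    if len(stack) != 1:
        raise ValueError("unclosed <%s>" % stack[-1][0])
    return root
```

For example, `parse("<a><b>hi</b></a>")` gives a `#document` root holding an `a` node, which holds a `b` node with the text `hi`. Mismatched tags raise `ValueError`.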

~~~
agumonkey
Indeed, that's why I was surprised to be 200x slower than libxml2. A lesson in
performance.

~~~
khedoros
Any idea what the cause of the bottleneck is?

------
chris_wot
I wonder how easy it would be to adapt the API to a set of C++ classes?

