
Show HN: Compile-time HTML parsing in C++14 - rep_movsd
https://github.com/rep-movsd/see-phit
======
Iv
After going through several love-hate cycles toward C++ over my career, I must
say I have a kind of admiration for what the C++ authors are trying to
achieve.

It took me a while to understand why every iteration of C++ brings so many
(often Turing-complete or close) side-effect horrors. The reason is that C++
is more than a language: it is part of a genuinely philosophical quest about
what language design can be.

It is about trying to bridge the language of the machines and the language of
human abstractions as closely as possible. In itself it does not necessarily
lead to the best language possible, but it explores an interesting limit.

One can use higher-level languages like Haskell or Prolog, or slightly
lower-level ones like Ruby or Python, and code nice abstractions. By doing so,
however, the programmer typically loses track of what the machine
implementation will be.

C++ strives to be a language where you can still feel what the machine will
actually implement while you code high-level abstractions.

That is their goal, that is their quest. In doing so there are many side-
effects that pop up, but their effort is commendable.

That so many people still use it is impressive but, I think, irrelevant to the
priorities they typically set.

~~~
pcwalton
> The reason is that C++ is more than a language: it is part of a genuinely
> philosophical quest about what language design can be.

> It is about trying to bridge the language of the machines and the language
> of human abstractions as closely as possible. In itself it does not
> necessarily lead to the best language possible, but it explores an
> interesting limit.

> C++ strives to be a language where you can still feel what the machine will
> actually implement while you code high-level abstractions.

This has been false ever since the earliest days of ANSI C. The C and C++
standards define an abstract machine that is quite far from any machine that
exists today. Type-based aliasing rules, to name one important example, have
almost nothing to do with anything that exists in hardware.

It's quite enlightening to read the description of LLVM IR [1] and observe how
far it is from anything a machine does. In fact, LLVM IR is quite a bit
_lower_ level than C is, as memory is untyped in LLVM IR without metadata:
this is not at all the case in C.

In reality, C++ is an attempt to build a high-level language on top of the
particular abstract virtual machine specification that happened to be the
accidental byproduct of a consensus process among hardware/compiler vendors in
1989. It turns out that this has been a very helpful endeavor for a lot of
people, but I don't think we should claim that it's anything more than that.
There's nothing "philosophically" interesting about the C89 virtual machine.

[1]: [https://llvm.org/docs/LangRef.html](https://llvm.org/docs/LangRef.html)

~~~
Iv
I don't see how type-based aliasing negates anything of what I am saying.
Types are an abstraction, but C++ allows you to take a pointer to a value, to
apply sizeof() to it, to do pointer casting and arithmetic.

You can go low level with most high-level languages; what sets C++ apart is
that low-level operations generally have zero overhead. ptr2 =
static_cast<uint8_t*>(ptr) + 5; pretty much maps to the assembly code you
would expect.

~~~
pcwalton
> Types are an abstraction, but C++ allows you to take a pointer to a value,
> to apply sizeof() to it, to do pointer casting and arithmetic.

And the semantics of those operations are dictated not by what the underlying
machine does but by what the C++ _abstract machine_ does, which is very
different. The underlying memory is still typed.

> You can go low level with most high-level languages; what sets C++ apart is
> that low-level operations generally have zero overhead.

That's true for the JVM, .NET, even JS...

> ptr2 = static_cast<uint8_t*>(ptr) + 5; pretty much maps to the assembly
> code you would expect.

No, it doesn't, not after optimizations and undefined behavior. It's perfectly
acceptable for the compiler to turn that into a no-op if, for example, "ptr"
was a null pointer.

------
Mmrnmhrm
I am very disappointed with constexpr in C++14 and C++17.

The work shown here is impressive, but the point is that it shouldn't have to
be: D has shown that variable initialization at compile time can be painless.

The main limitations I find are:

a) Lack of support in the standard library (e.g., you have to code your own
sort, vector class, etc.)

b) Terrible, terrible, terrible compilation times.

c) Lack of dynamic memory allocation within constexpr.

constexpr functions are not natively executed during compilation (i.e., the
code is interpreted rather than compiled and then run). My use case, which was
to avoid a 1-second preprocessing step when loading a library, took more than
20 minutes to compile and used more than 60 GB of RAM (on GCC 7; Clang did not
even manage to compile it).

Stuff like initializing a bitset can easily eat all your RAM. For example,
this fails to compile:

    
        #include <bitset>
        int main() { std::bitset<1024 * 1024 * 1024> bs; }
    

[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63728](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63728)

~~~
rep_movsd
It's coming soon, as always.

A lot of the STL is already being constexpr-ized (as Stephan T. Lavavej has
mentioned).

This code I wrote takes about 5 seconds on GCC and 1.5 seconds on Clang to
compile a 1000-node HTML template (about 28 KB). Definitely slow, but not
painfully so.

I have a feeling the recursion slows down compilation; I will eventually
attempt an iteration-based version and see.

------
mhh__
For reference: in D, you could just take an HTML parser and run it at compile
time without writing any new code.

~~~
reikonomusha
Likewise in Common Lisp, arguably much more easily.

~~~
systems
how ... macros?

~~~
TeMPOraL
Macros, or (eval-when (:compile-toplevel) [your code here]). Caveats apply.

------
setzer22
Sadly, the thing that stood out the most to me is the sentence _We attempt to
make the compiler generate the most sensible error message_, followed by an
incomprehensible error message completely unrelated to the problem.

C++ should really get proper metaprogramming, with support for user-defined
error messages. All this template stuff always seemed like an awful hack to
me: Fighting the compiler instead of being the compiler.

~~~
rep_movsd
It is a hack, but what else is there, eh?

Going from templates to constexpr is like moving from a war zone to a forest.
It's still scary and dark, but at least you know what hit you. static_printf
is coming soon.

The error message looks incomprehensible, but the last couple of lines give
you enough information - the error string and the line number. It looks much
better on the console, since it's colorized.

~~~
avar

> It is a hack, but what else is there eh?

The non-hacky way to do this sort of thing in both C and C++ is to just use
less clever two-phase compilation as part of your build process.

I.e. you'd have your template compiler extract the HTML from your source (or
more easily, dedicated template files), parse, validate and compile them. The
end result of that would then be embedded in your binary somehow.

The gettext set of tools is a good example of how this works in another
related domain. You extract strings from your program, they and the contents
of corresponding *.po files are validated, then compiled to efficient *.mo
binary files for runtime use.

~~~
rep_movsd
Why rely on an external tool?

All someone has to do here is #include my header and write their templates
with a small suffix and prefix.

No need to install, configure, script and so on.

Your argument is similar to saying "Why should an IDE parse your code and
underline errors? We could just run make every time."

Another advantage of my approach is that you have access to the parsed
template as a data structure - that you can compose or modify as you wish.

The only other way to do this is to parse at runtime - which is definitely
slower.

~~~
avar
Most people using gettext get it via their package system. Your library is
also an external tool they need to similarly fetch & install.

And no, my argument is not similar to your IDE comparison. You'd get the exact
same thing, parsing / compiling when you compile your project. You'd just
offload slightly more work to make & the linker.

You can also get access to the parsed version as a data structure. This is
what the gettext library does with its compiled *.mo files.

Anyway, I don't think your thing is a useless approach. It's very hacky in C++
but this sort of thing is the best way to do something like this in many other
languages.

I was just pointing out that there are decades of precedent for achieving the
same results in C, i.e. parsing some custom language out of the project at
compile time and shipping it with the binary - which you (with your "what else
is there" question) seemed to be unaware of.

------
cletus
I have nothing to add other than that's both cool and scary.

~~~
rep_movsd
Nothing is as scary as boost :)

This is as scary as a bunny painted in camouflage.

~~~
MrMorden
I don't know what's scarier: that I used the Boost preprocessor library a few
months ago; or that the file format I had to deal with sucked the stars out of
the sky in such a manner that a reasonable person could have agreed that
BOOST_PP was the best possible choice under the circumstances.

------
tlb
Wow.

How does one debug complex constexpr code? I assume there's no printf.

It'd be really cool if it supported some kind of interpolation, like Jinja
templates, so it could generate dynamic page templates at compile time.

~~~
Jeaye
Oftentimes, the easiest way to debug complex template code is to intentionally
fail the compilation with a minimal context, so you can read the compiler's
output showing what the types and values are. Here's a simple example of a
function I used to use:

    
    
        template <typename T, typename ...Ts>
        void show_types()
        { static_assert((T*)nullptr, "Type log"); }
    
        int main()
        {
          show_types<int, float, bool>();
        }
    

When trying to compile this, using something like `g++ show-type.cpp`, you'll
get this output:

    
    
        show-type.cpp: In instantiation of ‘void show_types() [with T = int; Ts = {float, bool}]’:
        show-type.cpp:7:32:   required from here
        show-type.cpp:3:3: error: static assertion failed: Type log
         { static_assert((T*)nullptr, "Type log"); }
    

The same can be done for non-type template args, of course.

~~~
ubadair
I prefer this:

T::asdfasdf();

Although I might start using yours.

------
kbenson
_The program will fail to compile if the HTML is malformed_

By malformed, are we talking about incorrectly closed tags, or actually
invalid HTML? HTML doesn't require that all tags be closed...

~~~
rep_movsd
The HTML that browsers accept is loosely defined.

The HTML specification has a grammar and defines what's allowed - a
subset/derivation of XHTML/SGML and their ilk.

There are a bunch of test template files in the test/ folder that demonstrate
what kinds of errors are caught.

~~~
lucideer
> _The HTML that browsers accept is loosely defined_

This has changed since the advent of the HTML5 specification, the primary
purpose of which was to retroactively describe existing browser HTML parsing
behaviours and to document and specify them comprehensively in all of their
complexity.

The wisdom of this abominably complex approach may be questionable, but it's
certainly no longer "loosely" defined.

I'd be curious as to whether this implements HTML5...

~~~
taeric
I remember the promise of XHTML and the sad reality that, by and large, nobody
actually cared about well-formed documents.

The specification allowing for "implicit" tags is something that just doesn't
make any bloody sense to me. It feels like someone looked at all the warnings
in a compiler and said: I'll just explicitly define exactly how I want each
warning to behave, so that it's no longer dangerous behavior.

~~~
tannhaeuser
HTML _always_ had tag omission and "self-closing" elements - these are
features from its SGML roots. A formal SGML grammar for modern W3C HTML 5.1
can be found at [1] (my project).

[1]:
[http://sgmljs.net/blog/blog1701.html](http://sgmljs.net/blog/blog1701.html)

~~~
lucideer
I don't think the gp was saying this is a new thing, just an old thing that
never made sense to her/him.

That looks like a very nice resource though, thank you.

~~~
taeric
Indeed.

And to be fair, I'm actually somewhat OK with self-closing tags. Though I
can't easily see why I'd want tags that can't self-close.

~~~
tannhaeuser
To be clear, HTML's "self-closing elements" are img, meta, and others which
never have content or end tags. These are based on SGML empty elements, and
HTML merely tolerates the XML-style form

    
    
        <img ... />
    

but does _not_ allow

    
    
        <img ...></img>
    

though parser recovery might be able to deal with it, and WebSGML can formally
reject or accept it.

Apart from empty elements, HTML's tag inference can also be formalized using
SGML. Tag inference is what makes this

    
    
        <title>Title</title>
        <p>Body Text
    

a valid HTML document and be treated as though

    
    
        <html>
          <head>
            <title>Title</title>
          </head>
          <body>
            <p>Body Text</p>
          </body>
        </html>
    

had been written.

~~~
taeric
Right, the "self-closing elements" don't actually bother me. I do slightly
prefer the <foo /> form, but in general I don't care. What used to baffle me
was that I couldn't write "<div />", as that wasn't allowed, and I couldn't
think of a good reason to care about that. (It seemed you had to go out of
your way to disallow it, for no apparent reason.)

The tag inference, I just don't get. I can /almost/ understand the example you
used, but the examples include crap like:

    
    
        <table>
          <thead>
            <tr>
            <tr>
    

Where the second "<tr>" is actually part of the inferred "<tbody>". Just, why?

Edit: And to be perfectly clear, I do expect most of this to be handled for me
by whatever framework I'm using nowadays. And I don't actually generate
documents directly that much. For docs, I typically go with LaTeX or friends.
(Honestly, probably org-mode more so, but even that is light nowadays.)

~~~
tannhaeuser
Well, in my paper, like you, I'm criticizing (tag omission in) HTML5's table
content models, and I discourage aggressive use of it ([1]), so I'm probably
not the one to defend it ;)

Even the HTML specification text itself got its tables wrong ([2]; also
explained in [1]).

[1]: [http://sgmljs.net/docs/html5.html#start--and-end-element-tag-omission-in-table-content](http://sgmljs.net/docs/html5.html#start--and-end-element-tag-omission-in-table-content)

[2]: [https://github.com/whatwg/html/commit/6e305c457e42276bf275b8432302a32c929b0eb8](https://github.com/whatwg/html/commit/6e305c457e42276bf275b8432302a32c929b0eb8)

~~~
taeric
Apologies, I did not mean to put you on the defensive. Just adding to the
point. If anyone skipped your link, they shouldn't have. Thanks for sharing!

