> Are you talking about etree.HTML() being garbage?
Yes.
> And what are your thoughts on parsing it as xml (e.g. etree.fromstring(), etree.parse() )
No problem there. XML is much stricter and thus easier to "get right", so to speak. lxml's HTML parser is built on libxml2's HTML parser[0], which predates HTML5, has not been updated to handle it, and is, as its documentation notes,
> an HTML 4.0 non-verifying parser
This means it harks back to an era when every parser did its own thing and tried its best on the garbage it was given, without necessarily taking into account what its neighbours did with the same input.
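To make the difference concrete, here is a minimal sketch, assuming lxml is installed. The sample markup and the helper names (parses_as_xml, parses_as_html) are made up for illustration. The XML entry point rejects malformed input outright, while etree.HTML() quietly returns a tree built by its own recovery rules, which are not guaranteed to match what an HTML5 parser or a browser would build:

```python
from lxml import etree

# Hypothetical sample: unclosed <b> and <p>, the kind of markup found in the wild.
broken = "<p>unclosed <b>bold<p>next paragraph"


def parses_as_xml(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        etree.fromstring(markup)
    except etree.XMLSyntaxError:
        # XML has one answer for malformed input: reject it.
        return False
    return True


def parses_as_html(markup):
    """Return True if etree.HTML() hands back a tree for the markup."""
    # The HTML parser recovers silently instead of raising; how it repairs
    # the tree is up to libxml2's own heuristics.
    return etree.HTML(markup) is not None


xml_ok = parses_as_xml(broken)
html_ok = parses_as_html(broken)
print("accepted as XML: ", xml_ok)
print("accepted as HTML:", html_ok)
```

The point is not that recovery is bad in itself, only that with the XML functions you find out immediately when the input is wrong, whereas with etree.HTML() you get a tree whose shape you have to inspect before trusting it.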
[0] http://xmlsoft.org/html/libxml-HTMLparser.html