
> Are you talking about etree.HTML() being garbage?

Yes.

> And what are your thoughts on parsing it as XML (e.g. etree.fromstring(), etree.parse())?

No problem there. XML is much stricter and thus easier to "get right", so to speak. lxml's HTML parser is built upon libxml2's HTML parser[0], which predates HTML5, has not been updated to handle it, and is, as its documentation notes:

> an HTML 4.0 non-verifying parser

This means it harks back to an era when every parser did its own thing and tried its best on the garbage it was given, without necessarily taking into account what its neighbours did.
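
To make the contrast concrete, here is a rough sketch, not a definitive recipe. The helper name `parse_strict_xml` and the sample strings are made up for illustration, and the fallback to the stdlib ElementTree is only there so the XML half runs without lxml installed. The XML parser refuses tag soup outright, while etree.HTML() quietly hands back whatever tree it managed to recover:

```python
# Prefer lxml; fall back to the stdlib ElementTree, which exposes the same
# fromstring() and ParseError names, so the XML half still runs without lxml.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree


def parse_strict_xml(markup):
    """Hypothetical helper: return the root element, or None if the
    markup is not well-formed XML."""
    try:
        return etree.fromstring(markup)
    except etree.ParseError:
        # In lxml, XMLSyntaxError is a subclass of ParseError.
        return None


well_formed = "<doc><p>closed</p></doc>"
tag_soup = "<doc><p>unclosed</doc>"  # <p> is never closed

good_root = parse_strict_xml(well_formed)  # parses fine
bad_root = parse_strict_xml(tag_soup)      # rejected by the XML parser

# etree.HTML() only exists in lxml. It goes through libxml2's lenient HTML
# parser, which recovers from the same input instead of raising.
if hasattr(etree, "HTML"):
    html_root = etree.HTML(tag_soup)
    print(etree.tostring(html_root))
```

With XML you find out immediately that the input is broken. With the HTML parser you get a tree either way, and it is not necessarily the one an HTML5 browser would have built.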

[0] http://xmlsoft.org/html/libxml-HTMLparser.html


