
Reformatting Bad HTML - timrosenblatt
http://www.cloudspace.com/blog/2011/02/18/reformatting-bad-html/
======
alcuadrado
I think a better (or more modern) approach would be to use an HTML5 parser
(like html5lib), as it has a standard way to parse INVALID documents too, and
then recreate the HTML from the DOM.
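
A rough sketch of the parse-then-reserialize idea, to make it concrete. It does NOT use html5lib or the HTML5 parsing algorithm: as a stand-in it uses Python's stdlib `html.parser`, and the names `Node`, `TreeBuilder`, `serialize` and `reserialize` are made up for this example. The recovery rules (close whatever is still open, ignore stray end tags) are simplifying assumptions, and `<script>`/`<style>` contents are not treated specially.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, enough for a sketch).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """Hypothetical minimal tree node; children are Nodes or text strings."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = list(attrs)
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a small tree from tag soup. Assumption: simplified recovery
    rules, not the error handling defined by the HTML5 spec."""

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        # A stray end tag (nothing matching is open) is simply ignored.
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def serialize(node):
    """Recreate markup from the tree, closing every non-void element."""
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#root":
        return inner
    attrs = "".join(
        f" {name}" if value is None else f' {name}="{escape(value)}"'
        for name, value in node.attrs
    )
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def reserialize(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()  # flush any text still buffered by the parser
    return serialize(builder.root)
```

Calling `reserialize("<div><p>one<b>two</div>")` gives back markup in which `<b>` and `<p>` are closed before `</div>`. A real HTML5 parser would apply the spec's recovery rules instead of these ad hoc ones.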

BTW: Paul Irish has this ticket asking for help/someone to update lazyweb to
use an HTML5 parser:
[https://github.com/paulirish/lazyweb-requests/issues#issue/2...](https://github.com/paulirish/lazyweb-requests/issues#issue/20)

------
benvanderbeek
[http://validator.w3.org/check?uri=http://www.cloudspace.com/...](http://validator.w3.org/check?uri=http://www.cloudspace.com/blog/2011/02/18/reformatting-bad-html/&charset=\(detect+automatically\)&doctype=Inline&group=0)

~~~
PaulHoule
the W3 validator is a joke. it rejects many constructions which are widely
used and at worst harmless. it overwhelms you with so many BS messages that
you can't see where the real problems are.

overall, the bookmarklet version of this doesn't impress me. i deal with web
pages wholesale rather than retail, so libtidy and the command-line tidy
float my boat.

~~~
benvanderbeek
totally. i thought it was a 60% obnoxious 40% funny thing to post. i took a
risk!

------
cheald
My preferred method is "html2haml < filename.html | haml". Doesn't care about
document validity - just cleans up attributes and tag nesting/indentation.
Generally speaking, I don't want the document corrected, just reformatted.
Works for any XML, too, not just HTML.
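
For comparison, a minimal sketch of "reformat, don't correct" in Python. This is not what html2haml or haml does internally: it is a stand-in built on the stdlib `html.parser`, and `Indenter` and `reindent` are hypothetical names. It assumes one tag or text run per output line, copies start tags verbatim, collapses whitespace in text (so `<pre>` content would be altered), and never adds or removes a tag.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, enough for a sketch).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Indenter(HTMLParser):
    """Re-indents markup token by token without repairing it."""

    def __init__(self, indent="  "):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.indent = indent
        self.depth = 0
        self.lines = []
        self.text = []  # pending text pieces, flushed at the next tag

    def emit(self, line):
        self.lines.append(self.indent * self.depth + line)

    def flush_text(self):
        text = " ".join("".join(self.text).split())
        self.text = []
        if text:
            self.emit(text)

    def handle_starttag(self, tag, attrs):
        self.flush_text()
        self.emit(self.get_starttag_text())  # start tag exactly as written
        if tag not in VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        self.flush_text()
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.flush_text()
        self.depth = max(self.depth - 1, 0)
        self.emit(f"</{tag}>")

    def handle_data(self, data):
        self.text.append(data)

    def handle_entityref(self, name):
        self.text.append(f"&{name};")

    def handle_charref(self, name):
        self.text.append(f"&#{name};")

    def handle_comment(self, data):
        self.flush_text()
        self.emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.flush_text()
        self.emit(f"<!{decl}>")


def reindent(markup, indent="  "):
    parser = Indenter(indent)
    parser.feed(markup)
    parser.close()
    parser.flush_text()  # text after the last tag, if any
    return "\n".join(parser.lines)
```

Because nothing is repaired, an unclosed `<li>` just pushes the following lines one level deeper, which makes nesting mistakes easy to spot in the output.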

------
tetsuharu
this even works for .html.erb files

~~~
mdaniel
> this even works for .html.erb files

And ".erb" stands for ...?

~~~
graywh
[http://www.ruby-doc.org/stdlib/libdoc/erb/rdoc/classes/ERB.h...](http://www.ruby-doc.org/stdlib/libdoc/erb/rdoc/classes/ERB.html)

