

Did You Know: BeautifulSoup's bits are rotting - andrewljohnson
http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

======
artlogic
I've been using BeautifulSoup on a project and noticed the exact problems he's
mentioning. I actually ended up filtering the source with a regexp to remove
script tags and their contents prior to parsing because of the HTMLParser
weirdness. It wasn't a pleasant experience. The whole time I was doing this, I
kept looking at my nice Firebug element tree and wondering "Why am I even
going to this trouble?"
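Roughly, the pre-filter amounted to something like this (a simplified sketch of the idea, not the exact code from the project):

```python
import re

# Strip <script> blocks (tags and contents) before handing the source to
# the parser. DOTALL lets the pattern cross newlines; IGNORECASE catches
# <SCRIPT> and friends. Crude, but it sidesteps the HTMLParser weirdness.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                       re.DOTALL | re.IGNORECASE)

def strip_scripts(html):
    return SCRIPT_RE.sub('', html)
```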

Does anyone else wonder why we're writing all these parsers when both Mozilla
and WebKit have reliable, robust parsers that are actively maintained? How
difficult would it be to package up the existing code and distribute it with
wrappers for python, ruby, etc... I assume there's something I don't know,
because not only has it not been done, but no one seems to want to talk about
it.

~~~
earl
I briefly looked into doing this. The answer is pretty damn difficult, at
least in the case of mozilla.

~~~
jrockway
Actually, I think it would be pretty easy if you are willing to have a running
Mozilla process. Just connect to it with MozRepl, get it to render a page, and
then inspect the DOM with JavaScript. (This could be library-ed up so that you
get a W3C DOM back on the Python side, or whatever.)
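An untested sketch of what that could look like (it assumes the MozRepl extension is installed and listening on its default port 4242; none of these function names are from a real packaged library):

```python
import socket

# Drive a running Mozilla through MozRepl (a telnet-style REPL, port 4242
# by default) and read values back out of the live DOM. The JS string is
# evaluated inside the browser, not in Python.

def dom_title_js(url):
    # Navigate, then evaluate an expression against the resulting DOM.
    return 'content.location.href = %r; content.document.title;' % url

def fetch_title(url, host='localhost', port=4242):
    conn = socket.create_connection((host, port))
    try:
        conn.sendall(dom_title_js(url).encode() + b'\n')
        return conn.recv(65536).decode()
    finally:
        conn.close()
```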

I use a similar technique to get emacs to syntax-highlight my slides. Connect
to the running emacs (with all my settings), run htmlify via emacsclient
--eval, and enjoy perfect highlighting!

~~~
earl
Sorry, yes -- I definitely don't want a running mozilla process. Plus it's not
at all clear that it's possible to run mozilla headless, though I didn't look
that hard.

~~~
jrockway
You can run any X app headless with Xvfb.

~~~
earl
Ah, cool -- it's just that my servers don't run X, or really have enough RAM
to spare for 30 copies of X, Mozilla, and other associated stuff. I really
just need a relatively compact parsing engine.

~~~
jrockway
I'm not sure why you would need 30 copies of X or Mozilla.

Either way, it is kind of inelegant, but it is hard to pick and choose parts
of Mozilla. This is probably the simplest way to let Mozilla parse your HTML.
(That, however, may not be necessary. I have done a lot of screen-scraping,
and I have never encountered anything that HTML::TreeBuilder got confused on.
Lately, I've been using libxml2, and that has also worked very well. Zero
problems.)

------
andrewljohnson
This is so unfortunate. It's such a great piece of software that so many of us
depend on.

It's really too bad that there's not enough money in it for Leonard to keep it
up. But, I have no bitterness, just thanks!

~~~
calambrac
Your title really rubs me the wrong way. This isn't bitrot; it's actually
quite the opposite: the problem showed up because he _does_ actively maintain
the code, making the latest release compatible with future versions of the
standard Python distribution.

He's standing up to say he's going to honor his responsibility to this code
even though he doesn't enjoy it anymore, but that that doesn't include writing
HTML parsers, and you come along and scream 'bitrot'. Sorry, but that's kind
of an assholish thing to do.

~~~
lacker
Its performance is getting worse over time because keeping it fast requires
more maintenance than anyone is willing to give it, at least so far. I would
call that bit rot too.

I think both of you agree that the original author deserves only thanks.

~~~
calambrac
But that's not what the linked article is about at all. If you have benchmarks
and you want to write that article, by all means, do it.

------
jsrfded
After having run various html/xml/rss parsers against a 1B page web crawl, I'd
have to say that it's pretty rare to find ones that can actually pass the web
fuzz test. Most seem to have been written from a more spec-driven approach.
This is fine in a controlled environment, but pretty useless if you want to
turn the code loose on real world web data.

Some of the stuff we find, like 1-in-80M core dumps, is to be expected,
because failures that rare need more test data than most folks have. But many
other bugs could be found by simply running a parser against a few hundred
random URLs from the dmoz RDF dump. I wish more lib developers would do this.
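In miniature, that kind of harness is easy to sketch (illustrative only, using Python's stdlib parser; the real runs obviously used crawl data, not synthetic corruption):

```python
import random
from html.parser import HTMLParser

# Toy version of the fuzz test described above: randomly corrupt a sample
# document and check that the parser survives rather than raising.

def mangle(html, rng, flips=5):
    data = bytearray(html, 'utf-8')
    for _ in range(flips):
        data[rng.randrange(len(data))] = rng.randrange(256)
    return data.decode('utf-8', 'replace')

def survives_fuzz(sample, trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        parser = HTMLParser()
        try:
            parser.feed(mangle(sample, rng))
            parser.close()
        except Exception:
            return False
    return True
```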

~~~
thristian
I'm sure the html5lib guys would love to hear about parser bugs exposed by a
corpus that large:

<http://code.google.com/p/html5lib/>

Especially since html5lib is supposed to follow the HTML5 parsing rules, which
were basically reverse-engineered from IE's HTML parsing, so they ought to
work for every web-page in existence.

~~~
lacker
I don't think anything is going to work on every web page in existence.
Perhaps strlen.

~~~
andrewljohnson
Yeah, since I just wrote a spider last night using html5lib and had to wrap
it up in a try block, I can categorically say that it doesn't work for all
webpages:

    
    
    import html5lib
    from html5lib import treebuilders

    parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
    try:
        document = parser.parse(response)
    except Exception, e:
        print 'parse failed ' + str(e)
        return

------
biohacker42
Such a great project should have little trouble finding good devs. Imagine how
many bright young hackers would kill to be an official contributor to
BeautifulSoup.

He could make an earnest attempt at finding other people to work on it and
just do code reviews. Be a figure of advice and authority while doing no real
work. That would be great IMHO.

~~~
yesimahuman
Well it is free software. If he's throwing in the towel, couldn't someone just
put the code on github/bitbucket and run with it?

------
apgwoz
Wait, why not just port SGMLParser to Python 3.0? Did I miss something?

~~~
etal
Here's the note in PEP 3108:

    
    
    sgmllib [done]
        * Does not fully parse SGML.
        * In the stdlib for support to htmllib which is slated for removal.
    

Based on that and the standard docs, it looks like it was lost in the standard
library reorganization. The HTMLParser and other HTML-related libraries were
merged into a new html module; sgmllib's parser was an incomplete
implementation of SGML that only HTMLParser used, so apparently in the reorg
it was found unnecessary and scrapped.

<http://www.python.org/dev/peps/pep-3108/#id53>
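For anyone porting, the parser that survived the reorg lives in Python 3's html.parser module. A minimal subclass covering the typical sgmllib use case looks something like this (a sketch, not BeautifulSoup's actual code):

```python
from html.parser import HTMLParser

# The usual sgmllib.SGMLParser job, redone on html.parser:
# collect href attributes from anchor tags via start-tag callbacks.
class LinkCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="http://example.com/">example</a></p>')
```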

~~~
apgwoz
But sgmllib as it existed in Python 2.5 worked for BeautifulSoup, so that last
good version would be suitable for BeautifulSoup to continue with. I'm not
suggesting it be re-added to the standard library, just that it be bundled
with BeautifulSoup.

------
tdavis
Personally, I _far_ prefer lxml to BeautifulSoup. The latter is incredibly
slow and leaks memory like a sieve unless you manually tear apart object
trees. That said, BS is easier to write with in many cases; just don't use it
for any heavy work.

------
jrockway
Does Python have a libxml2 binding? I have had pretty good luck with its
parse_html_string function.

Failing that, you can always use Perl and HTML::TreeBuilder / HTML::Parser.
They work pretty well on malformed input.

~~~
jinglebells
That it does; it works like this:

    import libxml2

    parse_options = (libxml2.HTML_PARSE_RECOVER +
                     libxml2.HTML_PARSE_NOERROR +
                     libxml2.HTML_PARSE_NOWARNING)

    xml_document = libxml2.htmlReadDoc(junk_html, None, None, parse_options)

    clean_xhtml = xml_document.getRootElement().serialize()
Note: this method of "cleaning" works by building an XML tree out of the
HTML, but HTML is not XML. An empty tag such as <textarea></textarea> comes
back closed XML-style, and a browser will then render any HTML after the tag
_inside_ the textarea on screen. So don't use this if you still want to
output to a browser.

EDIT: fixed formatting.

