Lxml: an underappreciated web scraping library

lunchbox · on Dec 11, 2008

When people think about web scraping in Python, they usually think BeautifulSoup.

When I think about web scraping in Python, I think Mechanize. BeautifulSoup is great for parsing already downloaded HTML files, but it lacks Mechanize's web-navigating features, such as stateful browsing and easy form filling (unless I'm missing something).

kaens · on Dec 11, 2008

I've been thinking about checking out Mechanize, but I've been using BeautifulSoup and a small library of helper functions for a while and haven't found time to familiarize myself with Mechanize.

BeautifulSoup lacks easy form-filling functions, but they're pretty easy to write, assuming the site doesn't do anything really funky with its parameters. Basically, just grab all the input and select elements from a given form, put them in a dict, update the values, and urlencode.
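
A rough sketch of that idea, for illustration only. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs, and the names FormFieldCollector and build_query are made up for this example. It assumes you pass in the HTML of a single form, and it treats every input the same way, so unchecked checkboxes, radio groups, and textareas would need extra handling.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect the default values of <input> and <select> elements into a dict."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._select_name = None  # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""
        elif tag == "select" and attrs.get("name"):
            self._select_name = attrs["name"]
        elif tag == "option" and self._select_name:
            # Default to the first option, but prefer one marked "selected".
            if "selected" in attrs or self._select_name not in self.fields:
                self.fields[self._select_name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "select":
            self._select_name = None


def build_query(form_html, updates):
    """Return a urlencoded string of the form's fields with `updates` applied."""
    collector = FormFieldCollector()
    collector.feed(form_html)
    fields = dict(collector.fields)
    fields.update(updates)
    return urlencode(fields)


if __name__ == "__main__":
    # Hypothetical form markup, just to show the call.
    sample = (
        '<form><input name="q" value="old">'
        '<input type="hidden" name="token" value="abc">'
        '<select name="lang"><option value="en">English</option>'
        '<option value="de" selected>German</option></select></form>'
    )
    print(build_query(sample, {"q": "lxml"}))
```

The resulting string can be sent as a GET query string or, encoded to bytes, as a POST body. Hidden fields are carried along unchanged, which is usually what a site expects.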

utnick · on Dec 11, 2008

+1 for Mechanize. I've been doing a lot of scraping in Ruby lately, and the Ruby Mechanize (I'm assuming it's a port?) is quite nice.

ianb · on Dec 11, 2008

Perl's Mechanize was probably the basis for both.

EastSmith · on Dec 11, 2008

Love HN! I've been trying to accomplish "Cleaning up HTML" for some time now and Lxml seems to have the exact functionality I was looking for :)

inovica · on Dec 11, 2008

We use it extensively in what we do, and it's also incredibly fast (which is why we started using it).