Hacker News
Lxml: an underappreciated web scraping library (ianbicking.org)
24 points by astrec on Dec 11, 2008 | 6 comments



When people think about web scraping in Python, they usually think BeautifulSoup.

When I think about web scraping in Python, I think Mechanize. BeautifulSoup is great for parsing already downloaded HTML files, but it doesn't have the same web-navigating features Mechanize does such as stateful web browsing and easy form filling (unless I'm missing something).


I've been thinking about checking out Mechanize, but I've been using BeautifulSoup and a small library of helper functions for a while and haven't found time to familiarize myself with Mechanize.

BeautifulSoup is lacking easy form-filling functions, but they're pretty easy to write, assuming that the site doesn't do really funky stuff with their parameters. Basically, just grab all the input and select elements from a given form and put them in a dict, update values, and urlencode.
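
A rough sketch of that approach, for illustration only. To keep it runnable without third-party installs it uses the standard library's html.parser instead of BeautifulSoup, and it only looks at the first form on the page. The names FormFieldParser, build_form_data, and SAMPLE are made up for this example, and textarea elements and multi-select fields are ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect default values of <input> and <select> fields in the first form."""

    def __init__(self):
        super().__init__()
        self.fields = {}            # field name -> default value
        self._in_form = False
        self._done = False          # set once the first form has been closed
        self._select = None         # name of the <select> currently being read
        self._first_option = True

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        a = dict(attrs)
        if tag == "form":
            self._in_form = True
        elif not self._in_form:
            return
        elif tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            # Buttons and file inputs are left out of this simple sketch.
            if kind in ("submit", "button", "image", "reset", "file"):
                return
            if kind in ("checkbox", "radio"):
                # Unchecked boxes are not submitted; checked ones default to "on".
                if "checked" in a:
                    self.fields[a["name"]] = a.get("value") or "on"
                return
            self.fields[a["name"]] = a.get("value") or ""
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self._first_option = True
        elif tag == "option" and self._select:
            # Use the first option unless a later one is marked selected.
            if self._first_option or "selected" in a:
                self.fields[self._select] = a.get("value") or ""
            self._first_option = False

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_form_data(html, updates):
    """Return a urlencoded string of the form's defaults merged with updates."""
    parser = FormFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)
    fields.update(updates)          # overwrite defaults with our own values
    return urlencode(fields)


# Hypothetical login form used only to exercise the helper.
SAMPLE = """
<form action="/login" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="user">
  <input type="password" name="password">
  <select name="lang">
    <option value="en">English</option>
    <option value="fr" selected>French</option>
  </select>
  <input type="submit" name="go" value="Log in">
</form>
"""

query = build_form_data(SAMPLE, {"user": "alice", "password": "s3cret"})
print(query)
```

The resulting string can be sent as a POST body or appended to a URL for a GET. Sites that build parameters with JavaScript would need more than this.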


+1 for Mechanize. I've been doing a lot of scraping in Ruby lately, and the Ruby Mechanize (I'm assuming it's a port?) is quite nice.


Perl's Mechanize was probably the basis for both.


Love HN! I've been trying to accomplish "Cleaning up HTML" for some time now and Lxml seems to have the exact functionality I was looking for :)


We use it extensively in what we do, and it's also incredibly fast (which is why we started using it).



