

Better Python Scraping - Installing lxml and Beautiful Soup - wesleyzhao
http://wesleyzhao.com/python-web-page-scraping-installing-libxml

======
ljlolel
I've scraped dozens of sites. Beautiful Soup is great, but if you want the job
done as quickly and cleanly as possible, PyQuery is the best.

It's the same as jQuery, but in Python.

Working with Beautiful Soup quickly becomes long, messy, and tedious.

With PyQuery, you get what you want with just a couple of CSS 3 selectors.
Simple and nice.

(Wow, Android 2.2 is terrible for inputting text.)

~~~
wesleyzhao
Hm interesting thought. Any good tutorials/sites I can use to get going on
this? Or is it so simple I won't even need that. I find myself scraping a lot,
so finding the best lib for that is a priority.

~~~
pxm
Just tried this out last week. The docs are pretty sparse, but it seems to
mirror the jQuery interface fairly closely so if you're familiar with one,
you'll have a fair idea of the other:
<http://packages.python.org/pyquery/api.html>

That's the key advantage as I see it. If I'm scraping something it's often in
a hurry and I just want it done. Not having to internalise a new API is a
significant win in that respect.

------
VuongN
I'm learning Python on the fly, but I tend to ask a lot of questions on
freenode's #python. Installing lxml wasn't so bad. I just did "pip install
lxml" (easy_install lxml should work too) on my Debian VPS and home server.
Seemed to work for me.

I am sticking with lxml only for my scraping and html5lib for my rich-text
parsing.

~~~
wesleyzhao
Glad to hear it was so easy for you! For some reason, before I found a great
tutorial, I kept running into errors on my installations. First it was because
I didn't even have setuptools installed, then I didn't have some other dev
dependencies, then I just got plain stuck. But after I got it up and running
it was smooth sailing RE scraping!

~~~
oinksoft
Installing development headers is an essential sysadmin skill. Any failed
Python extension install should make it clear which headers are missing, and
you will want to search for the matching package, usually suffixed with
"-dev" on Debian systems or "-devel" on RedHat systems. So if you see an issue
with libjpeg, you would look for libjpeg-dev/libjpeg-devel, zlib ->
zlib-dev/zlib-devel, etc.
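For lxml in particular, the headers that typically turn up missing are libxml2's and libxslt's. A hedged example of the usual fix (package names as on Debian/Ubuntu; your distribution's names may differ):

```shell
# Debian/Ubuntu: install the C headers lxml compiles against,
# then retry the install.
sudo apt-get install libxml2-dev libxslt1-dev python-dev
pip install lxml

# RedHat/CentOS equivalent (note the -devel suffix):
# sudo yum install libxml2-devel libxslt-devel python-devel
```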

------
cdr
I much prefer Scrapy (<http://scrapy.org/>). BeautifulSoup is pretty outdated.

~~~
iam
XPath/CSS path selectors for scraping are definitely the "now" way to do it. I
recommend Nokogiri (<http://nokogiri.org/>) for the Ruby users, it does pretty
much the same thing in a nutshell.

What it really comes down to is 2x-3x less code, plus it's much faster to
write since you can just test your CSS selectors in your web browser, such as
with Firebug, before sticking them into your code.

~~~
wesleyzhao
If I'm not mistaken, lxml supports XPath and CSS selectors too, right?

------
lamby
Why not "sudo apt-get install python-lxml python-beautifulsoup"? Difficult to
make the "olde" argument when you're installing dependencies from apt.

~~~
wesleyzhao
Does that work? If I remember correctly, I may have tried that and it was one
of the things that failed; something would not compile, I believe. I am still
a noob, so if that works and I just messed it up, I wouldn't be surprised. Do
you know from experience that that's all you need?

~~~
oinksoft
... of course it will work. And there's no compilation step involved, either,
as I've never encountered a Debian package that was not a binary distribution
(but maybe there are some, requiring build-essential? This package surely does
not depend on build-essential).

~~~
wesleyzhao
I'll run this on a clean Ubuntu instance and if this works I'll update my
post! Thanks!!

------
tsumnia
Is mechanize (<http://wwwsearch.sourceforge.net/mechanize/>) considered
outdated or convoluted? It's what I've used for my scrapings.

Also, how well do these other scrapers handle JavaScript? I've had to abandon
some scrapes of ASP pages because they wouldn't handle it properly.

~~~
danohuiginn
It's a different tool. Mechanize concentrates on navigating: downloading
pages, following links, handling cookies, etc. BeautifulSoup and lxml parse
information out of the HTML.

There's some overlap, but not much. I have tended to use BeautifulSoup and
mechanize together. As mentioned above, BeautifulSoup is no longer being
actively maintained, and I'd recommend starting with lxml in most cases. I'm
still using BeautifulSoup mainly because I have most of the package memorized.
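The division of labor described above can be sketched with stdlib stand-ins for the two roles (urllib for the mechanize-style fetching, html.parser for the soup-style extraction); the names here are illustrative, not any library's API:

```python
# Sketch of the fetch/parse split: one piece downloads pages,
# the other pulls information out of the HTML.
from html.parser import HTMLParser
from urllib.request import urlopen  # the "fetching" role


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen (the "parsing" role)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links


# Parsing demonstrated on a static page; the fetching side would be
# something like: extract_links(urlopen(url).read().decode())
sample = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'
print(extract_links(sample))  # ['/one', '/two']
```

Mechanize replaces the `urlopen` half with cookie handling, form submission, and link following; BeautifulSoup or lxml replaces the `HTMLParser` half with a real parse tree.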

~~~
tsumnia
Thanks, I use the same combination, but haven't needed to make any new parses
in a while. I'll read up on lxml for the next time.

------
llambda
Maybe I missed it: why aren't you using pip? As I recall, the setup is as
simple as: sudo pip install lxml or sudo pip install BeautifulSoup. If you're
learning Python, definitely learn pip. Pip will make your life easier! :)

------
imgabe
Just to throw in my own "I like X better": I recently had some pages that
Beautiful Soup just choked on and couldn't parse.

I like html5lib, which will even spit out a Beautiful Soup parse tree if
that's your thing.

------
Torn
Does this page kill the chrome tab for anyone else?

