
Web Scraping with Beautiful Soup (2014) - xcoding
http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html
======
Animats
Cute. That's what BeautifulSoup is good for.

I'm a longtime user of BeautifulSoup.

BeautifulSoup does not use or create "the DOM". It does convert HTML into a
tree, but that tree is somewhat different from a browser's Document Object
Model. For most screen-scraping purposes, this doesn't matter. But if the page
uses JavaScript to manipulate the DOM on page load, it does matter.

I have a tool for looking at a web page through BeautifulSoup. This reads the
page from a server, parses it into a tree with BeautifulSoup using the HTML5
parser, discards all Javascript, makes all links absolute, and turns the tree
back into HTML in UTF-8, properly indented. If you run a page through this and
it still makes sense, scraping will probably work. If not, simple scraping
won't work and you'll probably have to use a program-controlled browser that
will execute JavaScript.
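
Something along these lines (a simplified sketch, not the actual viewer
script; it assumes requests, beautifulsoup4, and html5lib are installed):

```python
# Sketch of a "view the page as BeautifulSoup sees it" pre-flight check.
# Assumes: requests, beautifulsoup4, and html5lib are installed.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def soup_view(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html5lib")   # HTML5 parser, like a browser

    # Discard all JavaScript.
    for script in soup.find_all("script"):
        script.decompose()

    # Make links and image sources absolute.
    for tag, attr in (("a", "href"), ("img", "src"), ("link", "href")):
        for el in soup.find_all(tag, **{attr: True}):
            el[attr] = urljoin(url, el[attr])

    # Back to properly indented HTML (a str; encode to UTF-8 if needed).
    return soup.prettify()

if __name__ == "__main__":
    print(soup_view("https://news.ycombinator.com"))
```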

Some examples:

HN looks good:
[http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.y...](http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.ycombinator.com)

AFL-CIO, the site used in the article, looks great:
[http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org](http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org)

Twitter's images disappear:
[http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.tw...](http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.twitter.com)

Adobe's formatting disappears:
[http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.ad...](http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.adobe.com)

Intel complains about the browser but looks OK:
[http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com](http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com)

Grubhub gives us nothing as plain HTML:
[http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com](http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com)

Same for Doordash:
[http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com](http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com)

(No scraping restaurant menus with BeautifulSoup.)

Cool stuff in pure CSS works fine:
[http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawe...](http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawenterprises.com/cfimg/)

(You don't really need Javascript any more just to get the page up.)

~~~
gnahckire
Ooo this is really neat. Thanks for sharing!

------
madenine
My web scraping toolkit (Python):

- Beautiful Soup

- Requests Lib

- JSON Lib

- Selenium

- Urllib2

- Cookielib

Handles 99% of things I encounter. With dynamic sites, you're often better off
simulating the request than controlling the browser.

Then you get to parse your way through someone's annoyingly formatted JSON.
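
For example, instead of driving a browser, you can often watch the network
tab, find the JSON endpoint the page's JavaScript calls, and hit it directly
with requests. A sketch (the endpoint and field names below are made-up
placeholders):

```python
# Hypothetical example of "simulate the request": call the JSON endpoint the
# page's JavaScript would call, rather than rendering the page in a browser.
# The endpoint, params, and fields are placeholders found by watching the
# browser's network tab.
import requests

resp = requests.get(
    "https://example.com/api/v1/listings",          # made-up endpoint
    params={"page": 1, "per_page": 50},
    headers={"User-Agent": "Mozilla/5.0", "X-Requested-With": "XMLHttpRequest"},
    timeout=30,
)
resp.raise_for_status()

data = resp.json()   # now parse your way through the annoyingly formatted JSON
for item in data.get("results", []):
    print(item.get("name"), item.get("price"))
```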

------
pryelluw
I enjoy using Scrapy because it offers a bit more functionality. Check it
out if Beautiful Soup is too simple for your needs.

~~~
hanniabu
Which would you say is easier for a beginner (to scraping and to Python in
general) to use?

~~~
reubano
BS4 is easier, but Scrapy is more powerful. E.g., with Scrapy you can program
it to follow all 'next' links and download hundreds of pages in parallel. BS4
is more for grabbing content from a single page.
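
A minimal sketch of that pattern (the start URL and selectors are placeholders,
not taken from the article):

```python
# Minimal Scrapy spider sketch: follow every 'next' link and let Scrapy's
# scheduler fetch the queued pages concurrently. Selectors are placeholders.
import scrapy

class NextLinkSpider(scrapy.Spider):
    name = "next_links"
    start_urls = ["https://example.com/listings"]   # placeholder start page

    def parse(self, response):
        for row in response.css("div.item"):                  # placeholder selector
            yield {"title": row.css("a::text").get()}

        next_page = response.css("a.next::attr(href)").get()  # placeholder selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Run it with `scrapy runspider spider.py -o pages.json`; the parallelism comes
from Scrapy's scheduler, not from anything in the spider itself.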

------
Ed10101
Absolutely love BeautifulSoup; I use it almost every day in my job. There isn't
really a Scrapy vs. BS4 divide, as you can still use BS4 within Scrapy in place
of Scrapy's standard parsing functionality. It also works well with lxml, which
is considerably faster.

It's also possible to build very performant, large-scale crawlers with just
BS4 and requests. Managing the architecture is a bit of a pain, but it
definitely can be done.

There are also a number of cases where it's better to use BS4 than Scrapy.

Also, using PhantomJS, Selenium, and BS4 together can provide you with a very
powerful data scraping solution.
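
A sketch of that combination (PhantomJS is no longer supported by recent
Selenium releases, so this assumes headless Chrome instead):

```python
# Sketch: let a real browser execute the JavaScript, then hand the rendered
# HTML to BeautifulSoup (with the faster lxml parser) for extraction.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")   # "--headless" on older Chrome versions
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.grubhub.com")   # a JS-heavy page from the list above
    html = driver.page_source               # HTML after the scripts have run
finally:
    driver.quit()

soup = BeautifulSoup(html, "lxml")
print(soup.title.string if soup.title else "no <title>")
```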

------
prions
I recently used BeautifulSoup in a Wikipedia scraping project. It's definitely
a great tool, but it has a few annoying functionality issues.

I had some preprocessing that stripped elements out with decompose(). find_all()
can take lists of tag names and attribute filters, but it can't match two
differently filtered kinds of elements, such as anchors with a certain title
plus table elements, in one call. It felt cumbersome to call it multiple times.
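
For illustration, the multi-pass workaround looks something like this (toy
markup, not the Wikipedia pages in question):

```python
# Sketch of the cleanup described above: decompose() removes one tag at a
# time, so elements matched by different filters need separate find_all()
# passes.
from bs4 import BeautifulSoup

html = """<div>
  <a title="edit this page" href="#">edit</a>
  <table class="infobox"><tr><td>box</td></tr></table>
  <p>keep me</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Pass 1: anchors with a certain title.
for a in soup.find_all("a", title="edit this page"):
    a.decompose()

# Pass 2: table elements.
for table in soup.find_all("table"):
    table.decompose()

print(soup.get_text(" ", strip=True))   # -> "keep me"
```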

It seems like Scrapy does exactly what I needed though (click on the first
link in the article), which would have saved me a ton of headache. So it goes.

------
geooooooooobox
NEWBIES BEWARE!!! USE THE LXML PARSER ... screw the inbuilt html parser
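
For anyone new to this, the parser is just the second argument to the
BeautifulSoup constructor (lxml has to be installed separately, e.g.
`pip install lxml`):

```python
# The built-in parser needs no extra install, but lxml is much faster, and
# the two can build different trees from broken markup.
from bs4 import BeautifulSoup

broken = "<a><p>unclosed tags"

print(BeautifulSoup(broken, "html.parser"))   # Python's built-in parser
print(BeautifulSoup(broken, "lxml"))          # faster, more lenient C parser
```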

