
How to Scrape Web Using Python, Selenium and Beautiful Soup - chsasank
https://swethatanamala.github.io/2018/09/01/web-scraping-using-python-selenium-and-beautiful-soup/
======
xarball
Why would you switch from selenium to beautiful soup halfway through what
you're trying to do, and force your program to re-request the same information
from the web server? Selenium has access to the entire DOM, and the entire
JavaScript session already loaded in a running web browser. It has way more
power for data mining than beautiful soup does.

It looks like they're just trying to use selectors, but these directions seem
to completely miss that functionality in Selenium's API. Just search the
WebDriver documentation for 'find_element_by_':

[https://selenium-python.readthedocs.io/api.html](https://selenium-python.readthedocs.io/api.html)

I use Selenium for all my web crawling, exactly because I would rather have
one crawler with all the backing support of a modern web browser, than corner
myself into not having something as crucial as a JavaScript parser halfway
through implementing a bot that's designed to hook what's basically an end-
user interface sitting on top of all that.

The most obvious benefit of Selenium to me, is that by having all that, I can
make my interactions with a web server look _more_ like a user, and fly under
the radar a little more. This tends to require less work on my part when I
treat websites more like a whole package (though more RAM, yes!)

~~~
chsasank
One reason to use Beautiful Soup is that Selenium is slow: you have to load
the whole webpage, including images, CSS, etc. With requests/Beautiful Soup
you can parse the collected URLs very quickly.
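A minimal sketch of the parsing half of that split, assuming the raw HTML has already been fetched (e.g. via `requests.get(url).text`; it's inlined here, and the markup and selector are made up):

```python
from bs4 import BeautifulSoup

# Stand-in for HTML fetched elsewhere, e.g. requests.get(url).text
html = """
<html><body>
  <h2 class="title">First post</h2>
  <h2 class="title">Second post</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
titles = [h.get_text() for h in soup.select("h2.title")]
print(titles)  # ['First post', 'Second post']
```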

~~~
xarball
Selenium sets up the browser profile for you, so you can disable images,
videos, css, javascript, embeds, all to your heart's content.

I've recently started using Selenium with the privoxy proxy, exactly because
browser headless modes are still fairly new tech. They don't all necessarily
support all the standard profile features (addons, settings, etc), or behave
the same way. It's really neat seeing where they're going, but they sometimes
need a bit of help MITM-ing traffic, so that's where a good filter comes in
handy.

In the user-facing web world, 'slow' is a relative term. Even on a barebones
system, you're nearly always going faster than most servers will put out. I
take my chances bringing in bigger tools, because keeping an under-equipped
tool up to date as your target site evolves usually wastes more of my time
than waiting for variably-optimized background work to perform its duties.

~~~
chsasank
Thanks, that was insightful!

------
haloux
Ryan Mitchell did an excellent talk at DEFCON23 about defeating bot checks and
other common barriers that web scrapers face. Excellent watch for anyone
interested in scraping:
[https://youtu.be/PADKIdSPOsc](https://youtu.be/PADKIdSPOsc)

Shameless plug: her O’Reilly book “Web Scraping with Python” (and its
associated GitHub repo) is an excellent read.

------
fareesh
Coming from a Ruby background I've always been curious about Python's
libraries for scraping. I've tried scrapy and beautiful soup, but somehow kept
going back to Nokogiri and mechanize.

I found the CSS selector or xpath based syntax and the DSL to be a lot more
convenient and less verbose to deal with.
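In Python the closest equivalents to Nokogiri's two styles are probably lxml's `xpath()` and Beautiful Soup's CSS-based `select()`. A sketch with made-up markup, showing both hitting the same elements:

```python
from lxml import html
from bs4 import BeautifulSoup

markup = "<ul><li class='item'>a</li><li class='item'>b</li></ul>"

# XPath via lxml:
tree = html.fromstring(markup)
xpath_items = tree.xpath("//li[@class='item']/text()")

# CSS selectors via Beautiful Soup:
soup = BeautifulSoup(markup, "html.parser")
css_items = [li.get_text() for li in soup.select("li.item")]

print(xpath_items, css_items)  # ['a', 'b'] ['a', 'b']
```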

Is Selenium still the best bet for parsing JS-powered pages these days? I was
under the impression that headless Chrome was more memory- and performance-
efficient.

I do a lot of scraping work but my methods have not really evolved in the past
3-4 years, always on the lookout for something more elegant / quicker.

~~~
onesmallcoin
Look into chromedriver. It's maintained by the Chromium team and provides an
executable that lets WebDriver control Chrome. I've used it successfully in
the past running Chrome in headless mode, and it seems pretty scalable. If I
had to build the same product again, I'd still use chromedriver, though I'd
also consider Sikuli for image recognition / automation outside of the
browser. Check out the chromedriver project here:
[http://chromedriver.chromium.org/getting-started](http://chromedriver.chromium.org/getting-started)

~~~
theshadowknows
+1 for Sikuli! I only very rarely see it mentioned anywhere!

