
A Practical Exercise in Web Scraping - zrail
https://www.petekeen.net/a-practical-exercise-in-web-scraping
======
danso
I've always thought web scraping was one of the best ways to teach the
practical use of code. My first practice with Ruby was scraping apartment
listings from Craigslist and concatenating them into one page for easy
browsing. Simple to do, useful in real life, and, as a bonus, more practice
time with programming.

(Obviously I don't endorse scraping Craigslist, because of its TOS. Just
confessing to a past infraction, back when the NYC housing market was even
more absurd than it is today.)
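A scrape-and-concatenate job like that one can be sketched in a few lines of Ruby. Everything below is illustrative, not real Craigslist markup: the HTML is a canned stand-in, and a real job would fetch pages over HTTP and use a proper parser such as Nokogiri instead of a regex.

```ruby
# Illustrative sketch: pull listing titles out of a page and concatenate
# them into one document. The HTML here is a made-up stand-in; a real
# scraper would fetch with Net::HTTP and parse with Nokogiri.
html = <<~HTML
  <li class="listing"><a href="/apt/1">Sunny 1BR, Astoria</a></li>
  <li class="listing"><a href="/apt/2">Studio near the L train</a></li>
HTML

# Extract the link text of each listing (regex is fine for a fixed sample,
# fragile for real-world HTML).
titles = html.scan(%r{<a href="[^"]+">([^<]+)</a>}).flatten

# Concatenate everything into one page for easy browsing.
combined = titles.map { |t| "<h2>#{t}</h2>" }.join("\n")
```

The same loop, run once per listings page, gives you the single browsable page described above.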

~~~
wslh
It's also a good way to help friends, because you can usually deliver a lot
of value to others very quickly.

In the same way, I also helped a birdwatcher download thousands of bird
calls. My own rant: [http://blog.databigbang.com/the-call-of-the-web-
scraper/](http://blog.databigbang.com/the-call-of-the-web-scraper/)

------
FBT
The author very carefully avoids mentioning what web serial this was
precisely, but from context I am pretty confident he is talking about Worm
[1], for anyone who is interested.

[1] [https://parahumans.wordpress.com/](https://parahumans.wordpress.com/)

~~~
Tyr42
How'd you figure that out?

~~~
Dru89
More to the point, I think: why would you post it? The author clearly didn't
want fan-made ebooks out there, and the blog post basically provides a step-
by-step converter, given the correct URL.

~~~
FBT
Because I don't feel that an author should be allowed to restrict the
distribution of his or her works, at all. It's exactly the concept of Death of
the Author [1], just on a slightly different level.

If the author forbade, say, the translation of his work into French, or
forbade the reading of it by people with brown hair, or forbade the reading of
it on browsers other than Internet Explorer, I'm sure you'd agree with me that
such a restriction is insane, and would refuse to cooperate with it.

This restriction he is trying to impose is a bit less crazy, but he has just
as little right to make it.

[1]
[http://tvtropes.org/pmwiki/pmwiki.php/Main/DeathOfTheAuthor](http://tvtropes.org/pmwiki/pmwiki.php/Main/DeathOfTheAuthor)

------
tmikaeld
Web scraping is nice to learn, and when you hit the wall where it starts
getting really complex (e.g. you need to log in to a page, handle cookies,
etc.), there are really good tools available, like OutWit Hub, which uses the
whole browser as a scraper.
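The "log in and carry cookies" step is where hand-rolled scrapers usually start to hurt. A minimal sketch of the idea in Ruby, using only stdlib `Net::HTTP` request objects (no request is actually sent here; the paths, form fields, and cookie value are all placeholders):

```ruby
require 'net/http'

# Sketch of the login-then-cookie dance that makes scraping harder.
# Nothing is sent over the network; this only shows how a session
# cookie from a login response would be carried on later requests.
login = Net::HTTP::Post.new('/login')
login.set_form_data('user' => 'me', 'password' => 'hunter2') # placeholder creds

# In a real run you'd take this from the login response:
#   session_cookie = response['Set-Cookie']
session_cookie = 'session=abc123' # stand-in value

# Every subsequent page request carries the cookie back to the server.
page = Net::HTTP::Get.new('/members/articles')
page['Cookie'] = session_cookie
```

Browser-based tools handle this state for you, which is exactly why they become attractive at this point.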

------
prolways
Isn't this wildly illegal? I thought all web scraping violated CFAA... not to
mention copyright...

~~~
hfsktr
In many cases, how is it different from visiting each page and
printing/saving the HTML? It's useful for creating a backup of a site[1].

I'm not saying it's good or bad. It could be used for both. I know at my
previous job we used it (a bit shady I think) to scrape and compare competitor
prices to ours.

[1] Though I can't understand why you'd want to back up your own site, I am
pretty sure I've heard of it being done.

------
mtrimpe
Wasn't there some parser other than Nokogiri which was focused more on being
lenient than correct?

~~~
famousactress
Python, not Ruby, but Beautiful Soup is a lenient parser and a really awesome
library:
[http://www.crummy.com/software/BeautifulSoup/](http://www.crummy.com/software/BeautifulSoup/)

------
mimiflynn
Isn't it more convenient to use something like Pocket for this?

~~~
zrail
Does Pocket follow all of the Next links and build an ebook for you?
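That follow-the-Next-link loop is the heart of the technique. Sketched here against canned in-memory pages rather than live HTTP (the page table, keys, and chapter text are all made up for illustration):

```ruby
# The follow-the-Next-link crawl behind the ebook build, run against
# canned pages instead of live HTTP. Keys and contents are illustrative.
pages = {
  '/ch1' => { body: 'Chapter 1 text', next_url: '/ch2' },
  '/ch2' => { body: 'Chapter 2 text', next_url: nil },
}

chapters = []
url = '/ch1'
while url
  page = pages.fetch(url)
  chapters << page[:body]
  url = page[:next_url] # nil when there is no Next link, ending the crawl
end
```

Once the chapters are collected in order, they can be handed to an ebook converter; a read-later service only captures the single page you give it.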

