A Practical Exercise in Web Scraping (petekeen.net)
52 points by zrail on Dec 16, 2013 | 16 comments



I've always thought web scraping was one of the best ways to teach the practical use of code. My first practice with Ruby was to scrape apartment listings from Craigslist and concatenate them into one page for easy browsing. Simple to do, useful in real life, and, as a bonus, more practice time with programming.

(Obviously I don't endorse scraping Craigslist, because of the TOS. Just confessing to a past infraction from back when the NYC housing market was even more absurd than it is today.)
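
For the curious, the core of that kind of scraper is only a few lines of Ruby. A rough sketch using open-uri and Nokogiri (the URL and CSS selector here are invented; real Craigslist markup differs, and again, the TOS forbids this):

    require 'open-uri'
    require 'nokogiri'

    # Fetch a hypothetical listings page and parse it.
    page = Nokogiri::HTML(open('http://example.com/apartments'))

    # Pull out each listing's title and link (the selector is made up).
    listings = page.css('p.listing a').map do |a|
      { title: a.text, link: a['href'] }
    end

    # Concatenate everything into one local page for easy browsing.
    rows = listings.map { |l| "<p><a href=\"#{l[:link]}\">#{l[:title]}</a></p>" }
    File.write('listings.html', rows.join("\n"))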


It's also a good way to help friends, because you can usually deliver a lot of value to others very quickly.

In the same way, I also helped a birdwatcher download thousands of bird calls. My own rant: http://blog.databigbang.com/the-call-of-the-web-scraper/


The author very carefully avoids mentioning what web serial this was precisely, but from context I am pretty confident he is talking about Worm [1], for anyone who is interested.

[1] https://parahumans.wordpress.com/


How'd you figure that out?


More to the point, I think: why would you post it? The author clearly didn't want fan-made ebooks out there, and the blog basically gave a step-by-step converter, given the correct URL.


Because I don't feel that an author should be allowed to restrict the distribution of his or her works, at all. It's exactly the concept of Death of the Author [1], just on a slightly different level.

If the author forbade, say, the translation of his work into French, or forbade the reading of it by people with brown hair, or forbade the reading of it on browsers other than Internet Explorer, I'm sure you'd agree with me that such a restriction is insane, and would refuse to cooperate with it.

This restriction he is trying to impose is a bit less crazy, but he has just as little right to make it.

[1] http://tvtropes.org/pmwiki/pmwiki.php/Main/DeathOfTheAuthor


The author of this post drops enough hints. Also, Worm is enjoying a large surge of popularity right now, as it was just completed.


Web scraping is nice to learn, and when you hit that wall where it starts getting really complex (e.g. you need to log in to a page, handle cookies, etc.), there are really good tools available, like OutWit Hub, which uses the whole browser as a scraper.
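
In plain Ruby, the Mechanize gem covers a lot of that ground too, since it keeps cookies across requests. A minimal sketch, assuming a hypothetical login form (the URL and field names are made up):

    require 'mechanize'

    agent = Mechanize.new
    page  = agent.get('http://example.com/login')

    # Fill in and submit the login form (field names are assumptions).
    form = page.forms.first
    form['username'] = 'me'
    form['password'] = 'secret'
    agent.submit(form)

    # The session cookie is stored on the agent, so this request is authenticated.
    members_page = agent.get('http://example.com/members')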


Isn't this wildly illegal? I thought all web scraping violated CFAA... not to mention copyright...


In many cases, how is it different from visiting each page and printing/saving the HTML? It can also be useful for creating a backup of a site [1].

I'm not saying it's good or bad; it could be used for both. I know at my previous job we used it (a bit shady, I think) to scrape competitors' prices and compare them to ours.

[1] Though I can't understand why you'd want to back up your own site this way, I'm pretty sure I've heard of it being done.


> I thought all web scraping violated CFAA... not to mention copyright...

I hope this doesn't apply to meta tags and outbound links; otherwise one of my projects just got a lot more awkward.


Wasn't there some parser other than Nokogiri which was focused more on being lenient than correct?


Python, not Ruby, but Beautiful Soup is a lenient parser and a really awesome library: http://www.crummy.com/software/BeautifulSoup/


Maybe Hpricot [1]? It's been deprecated for a long time, but from what I remember it was much more lenient about the structure of the incoming document.

[1] https://github.com/hpricot/hpricot
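
Worth noting that Nokogiri itself (via libxml2) will also recover from malformed markup rather than bail out; whether its fix-ups match what a browser would do is another question. A quick illustration:

    require 'nokogiri'

    # Unclosed tags everywhere, yet parsing still succeeds.
    doc = Nokogiri::HTML('<p>unclosed <b>bold <i>italic text')
    puts doc.at_css('b').text  # => "bold italic text"

    # Parse errors are recorded on the document instead of being raised.
    puts doc.errors.length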


Isn't it more convenient to use something like Pocket for this?


Does Pocket follow all of the Next links and build an ebook for you?
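
For what it's worth, the follow-the-Next-link loop itself is small. A bare-bones sketch (the starting URL, chapter selector, and link text are guesses, not the article's actual code):

    require 'open-uri'
    require 'nokogiri'

    url = 'http://example.com/chapter-1'  # hypothetical first chapter
    chapters = []

    while url
      page = Nokogiri::HTML(open(url))
      # Selector is a guess; use whatever wraps the chapter text on the real site.
      chapters << page.at_css('div.entry-content').to_html
      next_link = page.css('a').find { |a| a.text =~ /next/i }
      url = next_link && next_link['href']
    end

    File.write('book.html', chapters.join("\n"))
    # A tool like Calibre can then convert book.html into an epub/mobi.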



