Ask HN: Best resources for web-scraping? What resources are missing?
19 points by webmaven on Aug 21, 2014 | 21 comments
Tools, libraries, services, books, blog posts, etc. all count as 'resources' for the purposes of this question.

Feel free to characterize your recommendations by whether the intended audience is an experienced dev or a n00b.

UPDATE: If any patterns emerge for pain points or missing docs, examples, or resources for learning I might create a page or an ebook project on GitHub that covers that area.




Two great tools for web scraping are https://import.io/ and https://www.kimonolabs.com/. There are also lots of developer tools out there; the most popular is probably http://scrapy.org/.

If you are interested in web crawling, which is often necessary if you want to extract data from very large sites (or many sites), I just wrote up a blog post comparing open source web crawling systems (including scrapy): http://blog.blikk.co/comparison-of-open-source-web-crawlers
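
If you haven't used Scrapy before, a minimal spider is only a few lines. This is just a sketch; the spider name, URL, and selector are placeholders:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        # Placeholder name, URL, and selector; swap in your target site.
        name = "example"
        start_urls = ["http://example.com/listing"]

        def parse(self, response):
            # Pull one field per item with a CSS selector and yield it.
            for title in response.css("h2.title::text").extract():
                yield {"title": title}

Saved as example_spider.py, running "scrapy runspider example_spider.py -o items.json" will crawl the start URL and dump the extracted items as JSON.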


Thanks for linking to those services, and your comparison of crawler libraries is very well done.


I wrote this earlier this year; it covers quite a few details. It's been featured on the HN front page a couple of times, though, so maybe you have already seen it.

http://jakeaustwick.me/python-web-scraping-resource/


Thanks! Great overview of the basics.


The crawling aspect always seems overlooked to me. It's really easy to get a single page and pull the required data from it. However, what strategies do you use to crawl? How do you get around IP blocks? Continuous crawling? Etc.

In the end, it's the infrastructure that powers the extraction that requires all my attention. I've got a bunch of techniques I use, but I'd love to compare with how other people do it.

Love that more scraping resources are coming online. I see scraping as the important link between the web as it is now, and the web as it will be in 20 years. (web2 -> web3 for the jargon geeks.) The whole semantic web isn't going to be useful for most non-academics without considerable structuring effort put into existing data.


I used to grab the page with curl, then use XSLT on it to extract what I needed. The language is less important, IMHO, than the need for simplicity.

Edit: Thinking about what's available now, maybe something using PhantomJS or similar?


With so many single-page and AJAX-powered websites these days, I've pretty much abandoned traditional fetching tools like curl in favor of headless browsers like PhantomJS, Selenium, etc. Here's a pretty good list: http://stackoverflow.com/questions/18539491/headless-browser...
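
As a rough sketch, fetching a JavaScript-rendered page through Selenium's PhantomJS driver looks something like this (the URL is a placeholder, PhantomJS needs to be on your PATH, and the Firefox or Chrome drivers work the same way):

    from selenium import webdriver

    # Placeholder URL; any AJAX-heavy page works the same way.
    driver = webdriver.PhantomJS()
    try:
        driver.get("http://example.com/some-ajax-page")
        # page_source reflects the DOM *after* JavaScript has run,
        # unlike the raw bytes you'd get back from curl.
        html = driver.page_source
        print(len(html))
    finally:
        driver.quit()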

Of course this only covers the actual gathering of the content, not storing/indexing in an efficient way.


Thanks for the link!


If you need infrastructure to speed up your scraping, or want to analyze some pre-crawled copies of the web, there is Zillabyte. They have examples on their blog as well; Growth Hackers tweets this one out a lot:

http://blog.zillabyte.com/2014/06/23/5-easy-steps-to-build-a...

There is also Import.io for web extraction. Do you have something you are looking for in particular?


Nope. Nothing in particular. Curious what people are already using and what they think might be missing, including docs, examples, etc.



For something quirky I use Python with BeautifulSoup, or lately Node.js with PhantomJS and server-side jQuery.
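
The BeautifulSoup version of a basic scraper really is only a handful of lines; something like this, where the URL and tag names are placeholders and requests/beautifulsoup4 are assumed to be installed:

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL; any mostly-static page works.
    resp = requests.get("http://example.com/articles")
    soup = BeautifulSoup(resp.text, "html.parser")

    # Print the text of every h2 on the page.
    for h2 in soup.find_all("h2"):
        print(h2.get_text(strip=True))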

And a shameless plug: for anything more trivial and on-demand, or for websites that have good structure (which is most of them), I use http://redpluck.com. It's easy and fast to set up, supports scraping behind login walls, and does powerful deep scraping.


Interesting, thanks for pointing to those tools and services. I hadn't heard of RedPluck before.


Is anyone using the Common Crawl dataset prior to scraping, or are there certain sites missing from that archive?

http://commoncrawl.org/common-crawl-url-index/

http://commoncrawl.org/july-2014-crawl-data-available/


I've used it and it's great. However, many sites disallow all crawling via robots.txt and only whitelist certain crawlers such as Google or Bing. These sites will be "missing" from CommonCrawl.
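
If you want to check how a particular site treats crawlers, Python's built-in robots.txt parser makes that a one-off script; the site and user-agent strings here are just examples:

    try:
        from urllib.robotparser import RobotFileParser  # Python 3
    except ImportError:
        from robotparser import RobotFileParser         # Python 2

    rp = RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # placeholder site
    rp.read()

    # A site that whitelists the big search engines will typically
    # answer True for Googlebot and False for an unknown crawler.
    print(rp.can_fetch("Googlebot", "http://example.com/some/page"))
    print(rp.can_fetch("MyLittleCrawler", "http://example.com/some/page"))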

One thing CommonCrawl really needs is a good index. https://github.com/trivio/common_crawl_index is great, but it only exists for the data from 2012. Apparently there was something in the works for 2014 data but there hasn't been any update for months: https://groups.google.com/forum/#!topic/common-crawl/mrZBnvD...


It depends on what you want to scrape. What do you hope to accomplish with an answer to this question? Do you intend to study/play around with a bunch of the libraries/tools mentioned? Do you have some sort of project in mind? Are you trying to get us to write a 'top 10 web scraping' blog post for you?


Hah! No, not trying to get a blog post written, but if any patterns emerge for pain points or missing docs, examples, or resources for learning I might create a page or an ebook project on GitHub that covers that area.

I probably will play around with the tools that are new to me regardless.

UPDATE: Updated the OP.


There's no single answer that will stand the test of time, because the structure of Web pages changes over time, to some extent because of scrapers.

This general advice should serve:

1. Get the page content the simplest way possible.

2. Use regular expressions to extract the desired content.

For both the above goals, and because so much revision is needed over time, I recommend Python.
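
As a bare-bones sketch, those two steps can be as little as this (the URL and the pattern are placeholders, and requests is assumed to be installed):

    import re
    import requests

    # Step 1: get the page content the simplest way possible.
    html = requests.get("http://example.com/prices").text  # placeholder URL

    # Step 2: pull out the pieces you care about with a regex.
    # This works best on small pages with stable, predictable markup.
    prices = re.findall(r'<span class="price">([^<]+)</span>', html)
    print(prices)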


Well, sure, but even if your recommendation is to do it yourself with Python (rather than using a service) and regexes rather than BeautifulSoup, then the question becomes mechanize vs. scrapy vs. robobrowser, etc. etc.


  Use regular expressions to extract the desired content.
You shouldn't use regex for parsing HTML. It's neither reliable nor suitable. Use a proper DOM parser. Read http://blog.codinghorror.com/parsing-html-the-cthulhu-way/
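
For instance, with lxml (one of the usual Python DOM parsers; the URL and class name below are placeholders) the same extraction survives changes in attribute order, whitespace, and nesting that would break a regex:

    import requests
    from lxml import html

    page = requests.get("http://example.com/prices")  # placeholder URL
    tree = html.fromstring(page.content)

    # XPath navigates the parsed DOM rather than the raw text, so minor
    # markup changes don't silently break the extraction.
    prices = tree.xpath('//span[@class="price"]/text()')
    print(prices)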


I run this service: https://screenslicer.com

It can search, extract fields, and page through results--automatically (no config). A full-featured developer API will be available soon for finer-grained control.


Webpage to RSS/XML feeds: http://feedity.com



