
Ask HN: Best resources for web-scraping? What resources are missing? - webmaven
Tools, libraries, services, books, blog posts, etc. all count as 'resources' for the purposes of this question.

Feel free to characterize your recommendations by whether the intended audience is an experienced dev or a n00b.

__UPDATE:__ If any patterns emerge for pain points or missing docs, examples, or resources for learning, I might create a page or an ebook project on GitHub that covers that area.
======
dennybritz
Two great tools for web scraping are [https://import.io/](https://import.io/)
and [https://www.kimonolabs.com/](https://www.kimonolabs.com/). There are also
lots of developer tools out there, the most popular of which is probably
[http://scrapy.org/](http://scrapy.org/).

If you are interested in web crawling, which is often necessary if you want to
extract data from very large sites (or many sites), I just wrote up a blog
post comparing open source web crawling systems (including scrapy):
[http://blog.blikk.co/comparison-of-open-source-web-crawlers](http://blog.blikk.co/comparison-of-open-source-web-crawlers)
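
For anyone who hasn't used it, a minimal Scrapy spider looks roughly like this (a sketch; the site and CSS selectors are placeholders, not from any of the posts above):

    # quotes_spider.py -- minimal Scrapy spider (illustrative sketch)
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # one record per quote block on the page
            for quote in response.css("div.quote"):
                text = quote.css("span.text::text").extract()
                author = quote.css("small.author::text").extract()
                yield {
                    "text": text[0] if text else None,
                    "author": author[0] if author else None,
                }

Run it with `scrapy runspider quotes_spider.py -o quotes.json` to dump the results as JSON.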

~~~
webmaven
Thanks for linking to those services, and your comparison of crawler libraries
is very well done.

------
Jake232
I wrote this earlier this year; it covers quite a few details. It's been
featured on the HN front page a couple of times, though, so maybe you have
already seen it.

[http://jakeaustwick.me/python-web-scraping-resource/](http://jakeaustwick.me/python-web-scraping-resource/)

~~~
webmaven
Thanks! Great overview of the basics.

------
theworst
The crawling aspect always seems overlooked to me. It's really easy to get a
single page and pull the required data from it. However, what strategies do
you use to crawl? How do you get around IP blocks? Continuous crawling? Etc.

In the end, it's the infrastructure that powers the extraction that requires
all my attention. I've got a bunch of techniques I use, but I'd love to
compare with how other people do it.
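
For concreteness, the skeleton I start from is roughly this (a sketch, not production code; the user agent, delay, and page limit are arbitrary):

    # Bare-bones polite crawler: breadth-first, rate-limited, same-host only.
    import time
    import urllib.parse
    from collections import deque

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed, delay=1.0, max_pages=100):
        host = urllib.parse.urlparse(seed).netloc
        seen, queue = {seed}, deque([seed])
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            resp = requests.get(url, headers={"User-Agent": "example-crawler"})
            time.sleep(delay)  # throttle so you don't hammer the server
            for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
                link = urllib.parse.urljoin(url, a["href"])
                if urllib.parse.urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)
            yield url, resp.text

Everything hard lives outside this loop: retries, robots.txt handling, IP rotation, deduplication of near-identical URLs, and scheduling re-crawls.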

Love that more scraping resources are coming online. I see scraping as the
important link between the web as it is now, and the web as it will be in 20
years. (web2 -> web3 for the jargon geeks.) The whole semantic web isn't going
to be useful for most non-academics without considerable structuring effort
put into existing data.

------
bennyp101
I used to grab the page with curl, then use XSLT on it to grab what I needed.
The language is less important, IMHO, than the need for simplicity.

Edit: Thinking about what's available now, maybe something using PhantomJS or
similar?
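
The same fetch-then-extract approach in Python, with lxml's XPath standing in for XSLT (a sketch; the URL and expression are placeholders):

    # Fetch a page, then extract with an XPath expression.
    import requests
    from lxml import html

    tree = html.fromstring(requests.get("http://example.com/").content)
    titles = tree.xpath("//a/text()")  # e.g. the text of every link
    print(titles)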

------
MalcolmDiggs
With so many single-page and ajax-powered websites these days, I've pretty
much abandoned traditional fetching tools like curl in favor of headless
browsers like PhantomJS, Selenium, etc. Here's a pretty good list:
[http://stackoverflow.com/questions/18539491/headless-browser-and-scraping-solutions](http://stackoverflow.com/questions/18539491/headless-browser-and-scraping-solutions)

Of course this only covers the actual gathering of the content, not
storing/indexing in an efficient way.
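
A minimal sketch of the gathering step, driving PhantomJS through Selenium's Python bindings (this assumes the `phantomjs` binary is on your PATH; the URL is a placeholder):

    # Render a JS-heavy page in a headless browser, then scrape the
    # resulting DOM rather than the raw HTML.
    from selenium import webdriver

    driver = webdriver.PhantomJS()
    try:
        driver.get("http://example.com/ajax-heavy-page")
        html = driver.page_source  # the DOM *after* JavaScript has run
        links = driver.find_elements_by_tag_name("a")
        print(len(links), "links found")
    finally:
        driver.quit()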

~~~
webmaven
Thanks for the link!

------
cblock811
If you need infrastructure to speed up your scraping or want to analyze some
pre-crawled copies of the web, there is Zillabyte. They have examples on their
blog as well; Growth Hackers tweets this one out a lot:

[http://blog.zillabyte.com/2014/06/23/5-easy-steps-to-build-a-saas-lead-generation-app/](http://blog.zillabyte.com/2014/06/23/5-easy-steps-to-build-a-saas-lead-generation-app/)

There is also Import.io for web extraction. Do you have something you are
looking for in particular?

~~~
webmaven
Nope. Nothing in particular. Curious what people are already using and what
they think might be missing, including docs, examples, etc.

__UPDATE:__ If any patterns emerge for pain points or missing docs, examples,
or resources for learning, I might create a page or an ebook project on GitHub
that covers that area.

------
Ronsenshi
For something quirky I use Python with BeautifulSoup, or lately Node.js with
PhantomJS and server-side jQuery.
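
A quick one-off with BeautifulSoup looks roughly like this (a sketch; the URL and selector are placeholders):

    # Quick one-off scrape with requests + BeautifulSoup.
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get("http://example.com/").text, "html.parser")
    for item in soup.select("h2 a"):  # e.g. every headline link
        print(item.get_text(strip=True), item["href"])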

And shameless plug: for anything more trivial and on-demand, or websites that
have good structure (which is most of them), I use
[http://redpluck.com](http://redpluck.com). It's easy and fast to set up, and
supports scraping behind login walls as well as powerful deep scraping.

~~~
webmaven
Interesting, thanks for pointing to those tools and services. I hadn't heard
of RedPluck before.

------
walterbell
Is anyone using the Common Crawl dataset prior to scraping, or are there
certain sites missing from that archive?

[http://commoncrawl.org/common-crawl-url-index/](http://commoncrawl.org/common-crawl-url-index/)

[http://commoncrawl.org/july-2014-crawl-data-available/](http://commoncrawl.org/july-2014-crawl-data-available/)

~~~
dennybritz
I've used it and it's great. However, many sites disallow all crawling via
robots.txt and only whitelist certain crawlers such as Google or Bing. These
sites will be "missing" from CommonCrawl.
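
That whitelist pattern looks like this, along with how a well-behaved crawler checks it (an illustrative sketch using Python's standard robotparser; the rules are made up but typical):

    # A "whitelist" robots.txt allows named crawlers and blocks the rest.
    import urllib.robotparser

    ROBOTS_TXT = """\
    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /
    """

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())
    print(rp.can_fetch("Googlebot", "http://example.com/page"))  # True
    print(rp.can_fetch("MyCrawler", "http://example.com/page"))  # False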

One thing CommonCrawl really needs is a good index.
[https://github.com/trivio/common_crawl_index](https://github.com/trivio/common_crawl_index)
is great, but it only exists for the data from 2012. Apparently there was
something in the works for 2014 data but there hasn't been any update for
months: [https://groups.google.com/forum/#!topic/common-crawl/mrZBnvDvVvo](https://groups.google.com/forum/#!topic/common-crawl/mrZBnvDvVvo)

------
reefoctopus
It depends on what you want to scrape. What do you hope to accomplish with an
answer to this question? Do you intend to study/play around with a bunch of
the libraries/tools mentioned? Do you have some sort of project in mind? Are
you trying to get us to write a 'top 10 web scraping' blog post for you?

~~~
webmaven
Hah! No, not trying to get a blog post written, but if any patterns emerge for
pain points or missing docs, examples, or resources for learning I might
create a page or an ebook project on GitHub that covers that area.

I probably will play around with the tools that are new to me regardless.

__UPDATE:__ Updated the OP.

------
lutusp
There's no single answer that will stand the test of time, because the
structure of Web pages changes over time, to some extent because of scrapers.

This general advice should serve:

1. Get the page content the simplest way possible.

2. Use regular expressions to extract the desired content.

For both the above goals, and because so much revision is needed over time, I
recommend Python.
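
That recipe comes to only a few lines of Python (a sketch; the URL and pattern are placeholders):

    # Step 1: fetch the page simply. Step 2: extract with a regex.
    import re
    import requests

    page = requests.get("http://example.com/").text
    # e.g. pull every email-like string out of the page
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", page)
    print(emails)

When the page changes, you revise the regex; Python keeps that revision cycle cheap.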

~~~
webmaven
Well, sure, but even if your recommendation is to do it yourself in Python
(rather than using a service), with regexes rather than BeautifulSoup, the
question then becomes mechanize vs. scrapy vs. robobrowser, etc.

------
logn
I run this service: [https://screenslicer.com](https://screenslicer.com)

It can search, extract fields, and page through results automatically (no
config). A full-featured developer API will be available soon for
finer-grained control.

------
nreece
Webpage to RSS/XML feeds: [http://feedity.com](http://feedity.com)

