Ask HN: What are best tools for web scraping? - pydox
======
sharmi
If you are a programmer, scrapy [0] is a good bet. It can handle
robots.txt, request throttling by IP, request throttling by domain, proxies
and all the other common nitty-gritties of crawling. The only drawback is handling
pure-JavaScript sites: you have to manually dig into the API or add a headless
browser invocation within the Scrapy handler.

Scrapy also has the ability to pause and restart crawls [1], run crawls
distributed [2], etc. It is my go-to option.

[0] [https://scrapy.org/](https://scrapy.org/)

[1]
[https://doc.scrapy.org/en/latest/topics/jobs.html](https://doc.scrapy.org/en/latest/topics/jobs.html)

[2] [https://github.com/rmax/scrapy-redis](https://github.com/rmax/scrapy-redis)
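Scrapy handles the robots.txt part for you, but the underlying check is easy to see with the standard library alone. A sketch (the rules and URLs below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt directly instead of fetching one over HTTP
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch("mybot", "https://example.com/private/page"))  # False
print(rp.can_fetch("mybot", "https://example.com/public/page"))   # True
```

A polite crawler calls something like `can_fetch` before every request; Scrapy does this for you when `ROBOTSTXT_OBEY` is enabled.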

~~~
Bromskloss
Would you still recommend Scrapy if the task wasn't specifically crawling?

~~~
sharmi
Nope. It is very specifically tailored to crawling. If you just need something
distributed why not check out RQ [0], Gearman [1] or Celery [2]? RQ and Celery
are python specific.

[0] : [http://python-rq.org/docs/](http://python-rq.org/docs/)

[1] : [http://gearman.org/](http://gearman.org/)

[2] : [http://docs.celeryproject.org](http://docs.celeryproject.org)

------
jackschultz
I've actually written about this! General tips I've found from doing more
than a few projects [0], and then an overview of the Python libraries I use [1].

If you don't want to click on the links: requests and BeautifulSoup / lxml are
all you need 90% of the time. Throw gevent in there and you can get a lot of
scraping done in less time than you would think.
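The extract step looks roughly like this; the sketch below uses only the standard library's html.parser (BeautifulSoup/lxml give you a much nicer selector API on top of the same idea, and the HTML here is invented):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags as the parser streams the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

parser = LinkExtractor()
parser.feed('<p><a href="/one">first</a> and <a href="/two">second</a></p>')
print(parser.links)  # ['/one', '/two']
```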

And as long as we're talking about web scraping, I'm a huge fan of it. There's
so much data out there that's not easily accessible and needs to be cleaned
and organized. When running a learning algorithm, for example, a very hard
part that isn't talked about a lot is getting the data before throwing it in a
learning function or library. Of course, there's the legal side of it if
companies are not happy with people being able to scrape, but that's a
different topic.

I'll keep going. The best way to learn which tools are best is to do
a project on your own and test them all out. Then you'll know what suits you.
That's absolutely the best way to learn something about programming -- doing
it instead of reading about it.

[0] [https://bigishdata.com/2017/05/11/general-tips-for-web-scrap...](https://bigishdata.com/2017/05/11/general-tips-for-web-scraping-with-python/)

[1] [https://bigishdata.com/2017/06/06/web-scraping-with-python-p...](https://bigishdata.com/2017/06/06/web-scraping-with-python-part-two-library-overview-of-requests-urllib2-beautifulsoup-lxml-scrapy-and-more/)

~~~
Bromskloss
> BeautifulSoup / lxml

When should one use one or the other, would you say?

~~~
ivan_ah
You can use the BeautifulSoup API with the `lxml` parser:
[https://www.crummy.com/software/BeautifulSoup/bs4/doc/#insta...](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser)

I've heard that `lxml` can choke on certain badly-formed markup, but it's very
fast. Personally, it has never failed on me.

~~~
nerdponx
lxml is also known to have memory leaks [0][1], so be careful using it in any
kind of automated system that will be parsing lots of small documents. I
personally encountered this issue, and it actually caused me to abandon a project
until months later, when I found the references I linked. It works nicely
and fast for one-off tasks, though.

Also, a question: how often do you really encounter badly-formed markup in the
wild? How hard is it really to get HTML right? It seems pretty simple, just
close tags and don't embed too much crazy stuff in CDATA. Yet I often read
about how HTML parsers must be "permissive" while XML parsers don't need to
be. I've never had a problem parsing bad markup; usually my issues have to do
with text encoding (either being mangled directly or being correctly-encoded
vestiges of a prior mangling) and the other usual problems associated with
text data.

[0]: [https://benbernardblog.com/tracking-down-a-freaky-python-mem...](https://benbernardblog.com/tracking-down-a-freaky-python-memory-leak-part-2/)

[1]:
[https://stackoverflow.com/q/5260261](https://stackoverflow.com/q/5260261)

------
samtc
I maintain ~30 different crawlers. Most of them are using Scrapy. Some are
using PhantomJS/CasperJS but they are called from Scrapy via a simple web
service.

All the data (zip files, PDF, HTML, XML, JSON) we collect is stored as-is
(/path/to/<dataset name>/<unique key>/<timestamp>) and processed later using a
Spark pipeline. lxml.html is WAY faster than BeautifulSoup and less prone to
exceptions.
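A sketch of that storage layout as a path-building helper (the function and example names are hypothetical, not the parent's actual code):

```python
from pathlib import Path

def raw_path(root: str, dataset: str, key: str, timestamp: str) -> Path:
    """Build the /path/to/<dataset name>/<unique key>/<timestamp> layout."""
    return Path(root) / dataset / key / timestamp

p = raw_path("/data", "corporate-registry", "company-123", "2017-08-01T12:00:00Z")
print(p.as_posix())  # /data/corporate-registry/company-123/2017-08-01T12:00:00Z
```

Keeping raw responses keyed by dataset and timestamp like this is what lets the later Spark pass re-parse history without re-crawling.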

We have cronjobs (cron + Jenkins) that trigger dataset updates and discovery.
For example, we scrape a corporate registry, so every day we update the 20k
oldest company versions. We also implement "discovery" logic in all of our
crawlers so they can find new data (e.g. a newly registered company). We use
Redis to send tasks (update / discovery) to our crawlers.

~~~
frik
> We use Redis to send task (update / discovery) to our crawlers.

Some kind of queue implemented with Redis? How does it work?

~~~
CGamesPlay
Probably not what the GP uses, but Resque does this in Ruby land.

~~~
bdcravens
Sidekiq has emerged as a better option than Resque.

------
danso
Always fascinated by how diverse the discussion and answers are for HN threads
on web scraping. It goes to show that "web scraping" has a ton of connotations,
everything from automated fetching of URLs via wget or cURL, to data
management via something like Scrapy.

Scrapy is a whole framework that may be worthwhile, but if I were just
starting out for a specific task, I would use:

\- requests [http://docs.python-requests.org/en/master/](http://docs.python-requests.org/en/master/)

\- lxml [http://lxml.de/](http://lxml.de/)

\- cssselect
[https://cssselect.readthedocs.io/en/latest/](https://cssselect.readthedocs.io/en/latest/)

Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But
using the web developer tools you can usually figure out the requests made by
the browser and then use the Session object in the Requests library to deal
with stateful requests:

[http://docs.python-requests.org/en/master/user/advanced/](http://docs.python-requests.org/en/master/user/advanced/)

I usually just download pages/data/files as raw files and worry about
parsing/collating them later. I try to focus on the HTTP mechanics and, if
needed, the HTML parsing, before worrying about data extraction.

~~~
upofadown
>Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize.

Did the version of Mechanize written in Py2 stop being supported?

~~~
danso
Looks like it's recently been updated but no big announcement that it's Python
3 ready: [https://github.com/python-mechanize/mechanize](https://github.com/python-mechanize/mechanize)

I've also seen these alternatives:

\-
[https://robobrowser.readthedocs.io/en/latest/](https://robobrowser.readthedocs.io/en/latest/)

\-
[https://github.com/MechanicalSoup/MechanicalSoup](https://github.com/MechanicalSoup/MechanicalSoup)

MechanicalSoup seems well maintained, but the last time I tried these libraries
they were buggy (and/or I was ignorant) and I just couldn't get things
to work the way I was used to in Ruby with Mechanize.

------
marvinpinto
I would recommend using Headless Chrome along with a library like
puppeteer [0]. You get the advantage of a real browser: you can run
pages' JavaScript, load custom extensions, etc.

[0]:
[https://github.com/GoogleChrome/puppeteer](https://github.com/GoogleChrome/puppeteer)

~~~
pteredactyl
I second this. I built scrapers using Beautiful Soup before and found Puppeteer
much easier for interacting with the web, especially nasty .NET sites.

------
beernutz
The absolute best tool i have found for scraping is Visual Web Ripper.

It is not open source, and runs in windows only, but it is one of the easiest
to use tools that i have found. I can set up scrapes entirely visually, and it
handles complex cases like infinite scroll pages, highly javascript dependent
pages and the like. I really wish there were an open source solution that was
as good as this one.

I use it with one of my clients professionally. Their support is VERY good
btw.

[http://visualwebripper.com/](http://visualwebripper.com/)

------
hydragit
WebOOB [0] is a good Python framework for scraping websites. It's mostly used
to aggregate data from multiple websites by having each site backend
implement an abstract interface (for example the CapBank abstract interface
for parsing banking sites), but it can be used without that part.

On the pure scraping side, it has a "declarative parsing" style to avoid painful
plain-old procedural code [1]. You can parse pages by simply specifying a
bunch of XPaths and indicating a few filters from the library to apply to
those XPath elements, for example CleanText to remove whitespace nonsense,
Lower (to lower-case), Regexp, CleanDecimal (to parse as a number) and a lot
more. URL patterns can be associated with a Page class of such declarative
parsing. If declarative parsing becomes too verbose, it can always be replaced
locally by writing a plain-old Python method.
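The declarative-filter idea is easy to picture with plain functions. These are hypothetical stand-ins sketched from the description above, not WebOOB's actual filter classes:

```python
import re

def clean_text(s):
    """Collapse whitespace nonsense down to single spaces (CleanText-style)."""
    return re.sub(r"\s+", " ", s).strip()

def clean_decimal(s):
    """Strip everything but digits and the decimal point, then parse (CleanDecimal-style)."""
    return float(re.sub(r"[^\d.]", "", s))

def apply_filters(value, filters):
    """Run a raw value through a chain of filters, declarative-parsing style."""
    for f in filters:
        value = f(value)
    return value

price = apply_filters("  1 234.50 EUR ", [clean_text, clean_decimal])
print(price)  # 1234.5
```

The appeal is that a page definition becomes data (an XPath plus a filter list) rather than a pile of procedural extraction code.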

A set of applications is provided to visualize extracted data, and other
niceties are provided to ease debugging. Simply put: « Wonderful, Efficient,
Beautiful, Outshining, Omnipotent, Brilliant: meet WebOOB ».

[0] [http://weboob.org/](http://weboob.org/)

[1] [http://dev.weboob.org/guides/module.html#parsing-of-pages](http://dev.weboob.org/guides/module.html#parsing-of-pages)

------
mping
I use nightmarejs
[https://github.com/segmentio/nightmare](https://github.com/segmentio/nightmare)
which is based on electron; I recommend it if you're on js

~~~
Cyph0n
That looks like a pretty interesting scraping library.

------
zapperdapper
No one has mentioned it so I will: consider Lynx, the text-mode web browser.
Being command-line, you can automate it with Bash or even Python. I have used it
quite happily to crawl largeish static sites (10,000+ web pages per site). Do
a `man lynx`; the options of interest are -crawl, -traversal, and -dump. Pro
tip: use it in conjunction with HTML Tidy prior to the parsing phase (see
below).

I have also used custom written Python crawlers in a lot of cases.

The other thing I would emphasize is that a web scraper has multiple parts,
such as crawling (downloading pages) and then actually parsing the page for
data. The systems I've set up in the past typically are structured like this:

1\. crawl - download pages to file system

2\. clean then parse (extract data)

3\. ingest extracted data into database

4\. query - run adhoc queries on database
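The ingest and query steps can be as light as SQLite. A minimal sketch of that half of the pipeline (the table and records are invented for illustration):

```python
import sqlite3

# Step 3: ingest extracted records into a database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT)")
records = [("https://example.com/a", "Page A"),
           ("https://example.com/b", "Page B")]
conn.executemany("INSERT INTO pages (url, title) VALUES (?, ?)", records)
conn.commit()

# Step 4: run ad-hoc queries over what was extracted
titles = [row[0] for row in conn.execute("SELECT title FROM pages ORDER BY url")]
print(titles)  # ['Page A', 'Page B']
```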

One of the trickiest things in my experience is managing updates. So when new
articles/content are added to the site you only want to have to get and add
that to your database, rather than crawl the whole site again. Also detecting
updated content can be tricky. The brute force approach of course is just to
crawl the whole site again and rebuild the database - not ideal though!
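One cheap way to detect updated content is to keep a content hash per URL and re-process a page only when the hash changes. A sketch (it still requires fetching the page, so it saves parsing and database churn rather than bandwidth):

```python
import hashlib

seen = {}  # url -> sha256 digest of the last-seen body

def changed(url: str, body: bytes) -> bool:
    """Return True if this page's content differs from the previous crawl."""
    digest = hashlib.sha256(body).hexdigest()
    if seen.get(url) == digest:
        return False
    seen[url] = digest
    return True

print(changed("https://example.com/a", b"<html>v1</html>"))  # True (first visit)
print(changed("https://example.com/a", b"<html>v1</html>"))  # False (unchanged)
print(changed("https://example.com/a", b"<html>v2</html>"))  # True (updated)
```

Pairing this with sitemaps or "recently updated" listings, where the site offers them, avoids even the refetch for most pages.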

Of course, this all depends really on what you are trying to do!

------
phsource
For someone on a Javascript stack, I highly recommend combining a requester
(e.g., "request" or "axios") with Cheerio, a server-side jQuery clone. Having
a familiar, well-known interface for selection helps a lot.

We use this stack at WrapAPI ([https://wrapapi.com](https://wrapapi.com)),
which we highly recommend as a tool to turn webpages into APIs. It doesn't
do all the scraping for you (you still need to write a script), but it does
make turning an HTML page into a JSON structure much easier.

~~~
nn757
Isn't cheerio only for static content?

------
baldfat
I use R, since that is the language I mostly use: httr and rvest. (Edit: I
missed typing rvest; thanks for the comments, you use the two together.)

[https://cran.r-project.org/web/packages/httr/vignettes/quick...](https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html)

~~~
amrrs
rvest is another nice option in R.

------
indescions_2017
Headless Chrome, Puppeteer, NodeJS (jsdom), and MongoDB. Fantastic stack for
web data mining. Async-based, using promises for explicit user-input flow
automation.

~~~
jdc0589
I had a ton of issues with jsdom historically. They may have been fixed since,
but Cheerio always worked out better for me.

------
Risse
If you use PHP, Simple HTML DOM[0] is an awesome and simple scraping library.

[0]
[http://simplehtmldom.sourceforge.net/](http://simplehtmldom.sourceforge.net/)

~~~
ge96
I also have used Simple HTML Dom

One thing I haven't worked on yet is waiting for stuff to load, if that is a
problem. Otherwise you try to limit hitting a site by using sleep/cron.

What's also interesting is session tokens: on one site I was able to hunt down
the generated token breadcrumb which the JS produced, but it wasn't valid. Still
had to visit the site. Interesting.

------
levi_n
I use a combination of Selenium and Python packages (BeautifulSoup). I'm
primarily interested in scraping data that is supplied via JavaScript, and I
find Selenium to be the most reliable way to scrape that info. I use BS when the
scraped page has a lot of data, thereby slowing down Selenium: I pipe the
page source from Selenium, with all JavaScript rendered, into BS.

I use explicit waits exclusively (no direct calls like
`driver.find_foo_by_bar`), and find it vastly improves selenium reliability.
(Shameless plug) I have a python package, Explicit[1], that makes it easier to
use explicit waits.

[1]
[https://pypi.python.org/pypi/explicit](https://pypi.python.org/pypi/explicit)
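The explicit-wait pattern isn't Selenium-specific: stripped of WebDriver, it's just polling a condition against a deadline. A generic sketch (this is not the Explicit package's API, just the underlying idea):

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate until it returns a truthy value or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Example: a condition that only becomes true on the third poll,
# standing in for "element is present and clickable"
calls = {"n": 0}
def ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(ready, timeout=2.0))  # True
```

Wrapping every element lookup in a wait like this is what makes scripts robust against slow renders, instead of sprinkling fixed sleeps around.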

~~~
bluntfang
>I'm primarily interested in scraping data that is supplied via javascript,
and I find Selenium to be the most reliable way scrape that info.

Have you found that you aren't able to find accessible APIs to request
against? Have you ever tried to contact the administrators to see if there's
an API you could access? Are you scraping data that would be against ToS if
you tried to get it in a way that would benefit both you and the target web
site?

~~~
levi_n
>Have you found that you aren't able to find accessible APIs to request
against?

I'm scraping from a variety of different websites (1000+) that my org doesn't
own. Reconfiguring to hit APIs would be complex, and a maintenance problem,
both of which I easily avoid by using Selenium to drive an actual browser, at
the expense of time.

>Have you ever tried to contact the administrators to see if there's an API
you could access?

Just not feasible given the scope and breadth of the scraping.

>Are you scraping data that would be against ToS if you tried to get it in a
way that would benefit both you and the target web site?

I inspect and respect the robots.txt

------
giarc
For non-coders, import.io is great. However, they used to have a generous free
plan that has since gone away (you are limited to 500 records now). Still a
great product; the problem is they don't have a small plan (it starts at
$299/month and goes up to $9,999).

~~~
adventured
I was looking at services in this area a few weeks ago to automate a small
need I had and ran across these guys. They offer a free 5,000 monthly request
basic plan. I gave it a try, worked fine (I ended up building my own solution
for greater control). It's just for scraping Open Graph tags (with some
fall-back capability), though.

[https://www.opengraph.io/](https://www.opengraph.io/)

~~~
iagovar
I use Grepsr. I really recommend them; they have a Chrome extension that works
like Kimono. Really easy for non-technical people. If you have someone in
Marketing or wherever who needs some data, maybe the only thing they need to
know is how to use CSS selectors and so on.

------
cholmon
I recently stumbled across [http://go-colly.org/](http://go-colly.org/), which
looks well thought out and simple to use. It seems like a slimmed-down Go
version of Scrapy.

------
elchief
Anyone who suggests a tool that can't understand JavaScript doesn't know what
they are talking about

You should be using Headless Chrome or Headless Firefox with a library that
can control them in a user-friendly manner

~~~
sp0rk
There are a great many sites that degrade gracefully when JS support is not
available. It makes absolutely no sense to waste the resources required to run
a full headless browser when simple HTTP requests will retrieve the same
information faster, more efficiently, and in a way that's easier to
parallelize.

~~~
xur17
A lot of times you can also watch the api calls JS pages (or apps) make and
retrieve nice structured json data.

I personally avoid executing js unless it's necessary, as it adds more
complexity, and is noticeably more brittle.
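When you do find the underlying API call in the network tab, the payoff is that the response is already structured. A sketch with a made-up response body:

```python
import json

# A hypothetical XHR response captured from the browser's network tab
raw = '{"products": [{"sku": "ABC-1", "price": 19.99}, {"sku": "ABC-2", "price": 4.5}]}'

data = json.loads(raw)
skus = [p["sku"] for p in data["products"]]
print(skus)  # ['ABC-1', 'ABC-2']
```

No HTML parsing, no JS execution: just the same JSON the page's own front end consumes.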

~~~
jordanpg
Using an undocumented API, however, carries significant risk for production
operations.

~~~
flukus
If you're web scraping then you've already decided that this risk is
worthwhile; the site itself is already an undocumented API.

------
jmkni
I've had a surprising amount of success with the HTML Agility Pack in .net, if
you have a decent understanding of HTML it's pretty usable.

~~~
inglor
Try CsQuery, it's much nicer in terms of APIs.

------
khuknows
Shameless plug - I built this tiny API for scraping and it works a treat for
my uses: [https://jsonify.link/](https://jsonify.link/)

A few similar tools also exist, like [https://page.rest/](https://page.rest/).

------
ravenstine
It depends on what you're trying to do.

For most things, I use Node.js with the Cheerio library, which is basically a
stripped-down version of jQuery without the need for a browser environment. I
find using the jQuery API far more desirable than the clunky, hideous
Beautiful Soup or Nokogiri APIs.

For something that requires an actual DOM or code execution, PhantomJS with
Horseman works well, though everyone is talking about headless Chrome these
days, so IDK. I've not had nearly as many bad experiences with PhantomJS as
others purportedly have.

~~~
imjasonmiller
I have been playing around with Cheerio for a short while and it is quite
cool! Although extracting comments wasn't as straightforward as I thought it
would be.

Do you have any experience with processing and scraping large files using
Cheerio? It doesn't support streaming, does it? I am currently faced with
processing a ~75 MB XML file and I am not sure if Cheerio is suited for that.

------
Doctor_Fegg
If you speak Ruby, mechanize is good:
[https://github.com/sparklemotion/mechanize](https://github.com/sparklemotion/mechanize)

~~~
DrSayre
I generally use mechanize when I need to scrape something from the web. I
found this a while back and it's helped me:
[https://www.chrismytton.uk/2015/01/19/web-scraping-with-ruby...](https://www.chrismytton.uk/2015/01/19/web-scraping-with-ruby/)

------
polote
I maintain about 8 crawlers and I use only vanilla Python

I have a function to help me search:

    
    
       def find_r(value, ind, array, stop_word):
           indice = ind
           for i in array:
               indice = value.find(i, indice) + 1
           end = value.find(stop_word, indice)
           return value[indice:end], end
    
    

You can use it like this:

    
    
       resulting_text , end_index = find_r(string, start_index, ["<td", ">"], "</td")
    
    

For finding text it is quite fast, and you don't need to master a framework.

------
CGamesPlay
If you can get away without a JS environment, do so. Something like Scrapy
will be much easier than a full browser environment. If you cannot, don't
bother going halfway; go straight for headless Chrome or Firefox.
Unfortunately Selenium seems to be past its useful life, as Firefox dropped
support and Chrome wraps it with chromedriver. PhantomJS is woefully out of
date, and since it's a different environment than the one your target site was
designed for, it just leads to problems.

~~~
AutomatedTester
I manage the WebDriver work at Mozilla making Firefox work with Selenium. I
can categorically state we haven't killed Selenium. We have, over the last few
years, invested more in Selenium than other browser vendors.

Selenium IDE no longer works in Firefox for a number of reasons: 1) Selenium
IDE didn't have a maintainer; 2) Selenium IDE is a Firefox add-on, and Mozilla
changed how add-ons work. They did this for numerous security reasons.
~~~
CGamesPlay
My apologies, I was mistaken, but I can't edit my post now. It looks like the
selenium code has moved into something called geckodriver, which I suppose is
a wrapper around the underlying Marionette protocol.

------
deathemperor
I've just finished my research on web scraping for my company (it took me
about 7 days). I started with import.io and scrapinghub.com for point-and-click
scraping, to see if I could do it without writing code. Ultimately, UI
point-and-click scraping is for non-technical users. There is a lot of data you
would find hard to scrape. For example, lazada.com.my stores the product's SKU
inside an attribute that looks like <div data-sku-simple="SKU11111"></div>,
which I couldn't get. import.io's pricing is also an issue: $999 a month for
API access is just too high.

So I decided to use scrapy, the core of scrapinghub.com.

I haven't written much Python before, but Scrapy was very easy to learn. I
wrote 2 spiders and ran them on Scrapinghub (their serverless cloud).
Scrapinghub supports job scheduling and many other things, at a cost. I prefer
Scrapinghub because our team doesn't have DevOps. It also supports Crawlera to
prevent IP banning, Portia for point-and-click (still in beta; it was still
hard to use), and Splash for SPA websites, but Splash is buggy and the GitHub
repo is not under active maintenance.

For DOM query I use BeautifulSoup4. I love it. It's jQuery for python.

For SPA websites I wrote a Scrapy middleware which uses puppeteer. The
puppeteer instance is deployed on AWS Lambda (1M free requests for the first
365 days, more than enough for scraping) using this:
[https://github.com/sambaiz/puppeteer-lambda-starter-kit](https://github.com/sambaiz/puppeteer-lambda-starter-kit)

I am planning to use Amazon RDS to store scraped data.

------
dsacco
I've done this professionally in an infrastructure processing several
terabytes per day. A robust, scalable scraping system comprises several
distinct parts:

1\. A crawler, for retrieving resources over HTTP, HTTPS and sometimes other
protocols a bit higher or lower on the network stack. This handles data
ingestion. It will need to be sophisticated these days - sometimes you'll need
to emulate a browser environment, sometimes you'll need to perform a
JavaScript proof of work, and sometimes you can just do regular curl commands
the old fashioned way.

2\. A parser, for correctly extracting specific data from JSON, PDF, HTML, JS,
XML (and other) formatted resources. This handles data processing. Naturally
you'll want to parse JSON wherever you can, because parsing HTML and JS is a
pain. But sometimes you'll need to parse images, or outdated protocols like
SOAP.

3\. A RDBMS, with databases for both the raw and normalized data, and columns
that provide some sort of versioning of the data at a particular point in
time. This is quite important, because if you collect the raw data and store
it, you can re-parse it in perpetuity instead of needing to retrieve it again.
This will happen somewhat frequently if you come across new data while
scraping that you didn't realize you'd need or could use. Furthermore, if
you're updating the data on a regular cadence, you'll need to maintain some
sort of "retrieved_at", "updated_at" awareness in your normalized database.
MySQL or PostgreSQL are both fine.

4\. A server and event management system, like Redis. This is how you'll
allocate scraping jobs across available workers and handle outgoing queuing
for resources. You want a centralized terminal for viewing and managing a) the
number of outstanding jobs and their resource allocations, b) the ongoing
progress of each queue, c) problems or blockers for each queue.

5\. A scheduling system, assuming your data is updated in batches. Cron is
fine.

6\. Reverse engineering tools, so you can find mobile APIs and scrape from
them instead of using web targets. This is important because mobile API
endpoints a) change _far_ less frequently than web endpoints, and b) are _far_
more likely to be JSON formatted, instead of HTML or JS, because the user
interface code is offloaded to the mobile client (iOS or Android app). The
mobile APIs will be private, so you'll typically have to reverse engineer the
HMAC request signing algorithm, but that is virtually always trivial, with the
exception of companies that really put effort into obfuscating the code.
apktool, jadx and dex2jar are typically sufficient for this if you're working
with an Android device.

7\. A proxy infrastructure, this way you're not constantly pinging a website
from the same IP address. Even if you're being fairly innocuous with your
scraping, you probably want this, because many websites have been burned by
excessive spam and will conscientiously and automatically ban any IP address
that issues something nominally more than a regular user, regardless of
volume. Your proxies come in several flavors: datacenter, residential and
private. Datacenter proxies are the first to be banned, but they're cheapest.
These are proxies resold from datacenter IP ranges. Residential IP addresses
are IP addresses that are not associated with spam activity and which come
from ISP IP ranges, like Verizon Fios. Private IP addresses are IP addresses
that have not been used for spam activity before and which are reserved for
use by only your account. Naturally this is in order from lower to greater
expense; it's also in order from most likely to least likely to be banned by a
scraping target. NinjaProxies, StormProxies, Microleaf, etc are all good
options. Avoid Luminati, which offers residential IP addresses contributed by
users who don't realize their IP addresses are being leased through the use of
Hola VPN.

Each website you intend to scrape is given a queue. Each queue is assigned a
specific allotment of workers for processing scraping jobs in that queue.
You'll write a bunch of crawling, parsing and database querying code in an
"engine" class to manage the bulk of the work. Each scraping target will then
have its own file which inherits functionality from the core class, with the
specific crawling and parsing requirements in that file. For example,
implementations of the POST requests, user agent requirements, which type of
parsing code needs to be called, which database to write to and read from,
which proxies should be used, asynchronous and concurrency settings, etc
should all be in here.
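A toy version of the queue-and-workers shape, using threads and an in-process queue in place of Redis (illustrative only; the fetch-and-parse step is faked with a string):

```python
import queue
import threading

def worker(jobs, results):
    """Pull URLs off the queue until a None sentinel arrives."""
    while True:
        url = jobs.get()
        if url is None:
            break
        results.append("fetched:" + url)  # stand-in for crawl + parse + store
        jobs.task_done()

jobs = queue.Queue()
results = []
workers = [threading.Thread(target=worker, args=(jobs, results)) for _ in range(2)]
for w in workers:
    w.start()
for url in ["https://example.com/a", "https://example.com/b"]:
    jobs.put(url)
jobs.join()  # block until every queued job has been processed
for w in workers:  # one sentinel per worker shuts them down
    jobs.put(None)
for w in workers:
    w.join()
print(sorted(results))
```

In the real system each target site gets its own queue with its own worker allotment, so a slow or blocked site can't starve the others.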

Once triggered in a job, the individual scraping functions will call to the
core functionality, which will build the requests and hand them off to one of
a few possible functions. If your code is scraping a target that has
sophisticated requirements, like a JavaScript proof of work system or browser
emulation, it will be handed off to functionality that implements those
requirements. Most of the time, this won't be needed and you can just make
your requests look as human as possible - then it will be handed off to what
is basically a curl script.

Each request to the endpoint is a job, and the queue will manage them as such:
the request is first sent to the appropriate proxy vendor via the proxy's API,
then the response is sent back through the proxy. The raw response data is
stored in the raw database, then normalized data is processed out of the raw
data and inserted into the normalized database, with corresponding timestamps.
Then a new job is sent to a free worker. Updates to the normalized data will
be handled by something like cron, where each queue is triggered at a specific
time on a specific cadence.

You'll want to optimize your workflow to use endpoints which change
infrequently and which use lighter resources. If you are sending millions of
requests, loading the same boilerplate HTML or JS data is a waste. JSON
resources are preferable, which is why you should invest some amount of time
before choosing your endpoint into seeing if you can identify a usable mobile
endpoint. For the most part, your custom code is going to be in middleware and
the parsing particularities of each target; BeautifulSoup, QueryPath, Headless
Chrome and JSDOM will take you 80% of the way in terms of pure functionality.

~~~
kbenson
> 3\. A RDBMS, with databases for both the raw and normalized data

I've found the filesystem (local or network, depending on scale) works well
for the raw data. A normalized file name with a timestamp and job identifier
in a hashed directory structure of some sort (I generally use
$jobtype/%Y-%m-%d/%H/ as a start) works well, and reading and writing gzip is
trivial (and often you can just output the raw content of gzip encoded
payloads). The filesystem is an often overlooked database. If you end up
needing more transactional support, or to easily identify what's been
processed or not, look at how Maildir works.
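A sketch of that filesystem layout (names invented; a real system would also record the job id somewhere queryable):

```python
import gzip
import tempfile
import time
from pathlib import Path

def store_raw(root: str, jobtype: str, job_id: str, payload: bytes) -> Path:
    """Write a gzipped payload under the $jobtype/%Y-%m-%d/%H/ scheme described above."""
    subdir = time.strftime("%Y-%m-%d/%H", time.gmtime())
    path = Path(root) / jobtype / subdir / (job_id + ".gz")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(gzip.compress(payload))
    return path

with tempfile.TemporaryDirectory() as tmp:
    p = store_raw(tmp, "product-pages", "job-0001", b"<html>raw page</html>")
    roundtrip = gzip.decompress(p.read_bytes())
    print(roundtrip)  # b'<html>raw page</html>'
```

The date/hour directories keep any single directory from accumulating millions of entries, which is the usual failure mode of filesystem-as-database.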

After normalization, the database is ideal though.

That said, I was doing a few gigabytes a day, not a few terabytes, so you
might have run into some scale issues I didn't. I was able to keep it to
mostly one box for crawling and parsing, but crawlers ended up being complex
and job-queue driven enough that expanding to multiple systems wouldn't have
been all that much extra work (an assessment I feel confident in, having done
similar things before).

------
austincheney
This is perhaps the fastest way to screenscrape a dynamically executed
website.

1\. First go get and run this code, which allows immediate gathering of all
text nodes from the DOM:
[https://github.com/prettydiff/getNodesByType/blob/master/get...](https://github.com/prettydiff/getNodesByType/blob/master/getNodesByType.js)

2\. Extract the text content from the text nodes and ignore nodes that contain
only white space:

    let text = document.getNodesByType(3),
        a = 0,
        b = text.length,
        output = [];
    do {
        if ((/^(\s+)$/).test(text[a].textContent) === false) {
            output.push(text[a].textContent);
        }
        a = a + 1;
    } while (a < b);
    output;

That will gather ALL text from the page. Since you are working from the DOM
directly you can filter your results by various contextual and stylistic
factors. Since this code is small and executes stupid fast it can be executed
by bots easily.

Test this out in your browser console.

~~~
AznHisoka
And how do you do #1? Node, I presume?

~~~
austincheney
No, manually go there and copy/paste the code. Then when building your scraper
bot use that code.

~~~
AznHisoka
but how do you use that code? It's JavaScript, right? How would you use it if
your crawler is written in Ruby or Python?

~~~
austincheney
You could write a crawler in any language. Crawling is easy as you are
listening for HTTP traffic and analyzing the HTML in the response.

To accurately get the content in dynamically executed pages you need to
interact with the DOM. This is the reason Google updated its crawler to
execute JavaScript.

~~~
AznHisoka
Yep, I know, but that means if I am writing the crawler in Ruby/Python, this
is not something I can do, right?

~~~
austincheney
Yes. The crawler can be written in nearly any language. The actual scraper
probably has to be written in JavaScript in order to access and interact with
the DOM as the user would and thereby gain access to content that is not
present by default.

------
jacinda
If you're specifically looking at news articles, go for the Python library
Newspaper:
[http://newspaper.readthedocs.io/en/latest/](http://newspaper.readthedocs.io/en/latest/)

It auto-detects languages and will automatically give you things like the
following:

    >>> article.parse()
    >>> article.authors
    [u'Leigh Ann Caldwell', 'John Honway']
    >>> article.text
    u"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
    >>> article.top_image
    u'http://someCDN.com/blah/blah/blah/file.png'
    >>> article.movies
    [u'http://youtube.com/path/to/link.com', ...]

------
mmmnt
For very simple tasks Listly seems to be a fast and good solution:
[http://www.listly.io/](http://www.listly.io/)

If you need more power, I heard good stuff about
[http://80legs.com/](http://80legs.com/) though never tried them myself.

If you really need to do crazy shit like crawling the iOS App Store really
fast and keeping things up to date, I suggest using AWS Lambda and a custom
Python parser. Though Lambda is not meant for this kind of thing, it works
really well and is super scalable at a reasonable price.

------
jppope
Headless chrome in the form of puppeteer
([https://github.com/GoogleChrome/puppeteer](https://github.com/GoogleChrome/puppeteer))
or Chromeless
([https://github.com/graphcool/chromeless](https://github.com/graphcool/chromeless))
or for smaller gigs use nightmare.js
([http://www.nightmarejs.org/](http://www.nightmarejs.org/)).

scrapy is fine but selenium, phantom, etc. are all outdated IMO

~~~
blowski
> are all outdated IMO

For what reason? Genuine question.

~~~
CGamesPlay
Phantom is woefully out of date; you need a polyfill even for Function.bind.
Firefox dropped support for Selenium in 47, and Chrome only supports it
through a wrapper called chromedriver.

~~~
hugs
Are you talking about Selenium WebDriver or Selenium IDE (the record/playback
tool for Firefox)? Those are two separate things. Selenium WebDriver
implements a cross-browser W3C standard, and Firefox very much still
supports it.

~~~
CGamesPlay
Hmm, I guess through geckodriver, which is a parallel to chromedriver? Just
reading through [https://developer.mozilla.org/en-
US/docs/Mozilla/QA/Marionet...](https://developer.mozilla.org/en-
US/docs/Mozilla/QA/Marionette/WebDriver) which starts with a warning about
"rough edges" and "substantial differences".

------
btb
We have been using Kapow RoboSuite for close to 10 years now. It's a
commercial GUI-based tool which has worked well for us; it saves us a lot of
maintenance time compared to our previous hand-rolled code extraction
pipeline. The only problem is that it's very expensive (pricing seems catered
towards very large enterprises).

So I was really hoping this thread would have revealed some newer commercial
GUI-based alternatives (on-premise, not SaaS), because I don't ever want to go
back to the maintenance hell of hand-rolled robots again :)

------
kanishkalinux
For mostly static pages, requests/pycurl + BeautifulSoup are more than
sufficient. For advanced scraping, take a look at scrapy.

For javascript-heavy pages most people rely on selenium webdriver. However,
you can also try hlspy ([https://github.com/kanishka-
linux/hlspy](https://github.com/kanishka-linux/hlspy)), which is a little
utility I made a while ago for dealing with javascript-heavy pages in simple
use cases.

------
bootcat
One important avenue for scraping AJAX-heavy sites that block PhantomJS is
Google Chrome's extension support. An extension can mirror the DOM and send it
to an external server for processing, where we can use Python's lxml to XPath
to the appropriate nodes. This worked for me to scrape Google, before we hit
the captcha. If anyone is interested, I can share the code I wrote to scrape
websites!

I have also successfully scraped the findthecompany database!

~~~
visarga
> This worked for me to scrape Google, before we hit the captcha.

If Google wanted to give back something to the community, it would offer cheap
automated searches (current prices are absurd). Another thing - more depth
after the first 1000 results. Sometimes you want to know the next result. We
shouldn't need to do all these stupid things to batch query a search engine,
it should be open. That makes it all the more important to invent an open-
source, federated search engine, so we can query to our heart's content (and
have privacy).

~~~
zapperdapper
Agree 100% too.

As for 'federated search engine' - it's not 'federated' per se, but check out
the Gigablast search engine. Open source (source on GitHub) and a TOTALLY
AWESOME piece of software written by one guy. You can do good searches at the
Gigablast site [1], or set up your own search engine. Gigablast also offers an
API (I may be wrong, but I think DuckDuckGo uses that API for some tasks).

[1] [http://gigablast.com](http://gigablast.com)

------
etatoby
If you need to scrape content from complex JS apps (eg. React) where it
doesn't pay to reverse engineer their backend API (or worse, it's
encrypted/obfuscated) you may want to look at CasperJS.

It's a very easy-to-use frontend to PhantomJS. You can code your interactions
in JS or CoffeeScript and scrape virtually anything with a few lines of code.

If you need crawling, just pair a CasperJS script with any spider library like
the ones mentioned around here.

------
theden
I've had good success with scrapy ([https://scrapy.org/](https://scrapy.org/))
for my personal projects

------
Jeaye
I've written a bit on web scraping with Clojure and Enlive here:
[https://blog.jeaye.com/2017/02/28/clojure-
apartments/](https://blog.jeaye.com/2017/02/28/clojure-apartments/)

That's what I'd use, if I had to scrape again (no JS support).

------
mrskitch
I’d recommend puppeteer or some other Chrome driver. It’s fast and resilient
even on single page apps.

If you’re looking to run it on a Linux machine also take a look at
[https://browserless.io](https://browserless.io) (full disclosure I’m the
creator of that site).

~~~
mrskitch
I should note that this doesn't lock you into any particular lib, just solves
the problem of running on Chrome in a service like fashion.

------
riekus
Depends on your skillset and the data you want to scrape. I am testing the
waters for a new business that relies on scraped data. As a non-programmer I
had good success testing stuff with ContentGrabber. Import.io also gets
mentioned a lot. I tried out Octoparse, but it wasn't stable with the
scraping.

~~~
selllikesybok
I find the desktop tool by import.io a little challenging to work with. Their
toy web-demo is solid for simple table extraction, though.

~~~
wtfdaemon
It's gotten light-years better since the desktop tool existed.

They've completely deprecated/sun-setted the desktop tool in favor of a
greatly improved web application.

~~~
selllikesybok
Belated, but thanks - will check out the web tool.

------
vrathee
If you are looking for SaaS or managed services, try
[https://www.agenty.com/](https://www.agenty.com/)

Agenty is a cloud-hosted web scraping app. You can set up scraping agents
using their point-and-click CSS-selector Chrome extension to extract anything
from HTML with these 3 modes:

- TEXT: simple clean text

- HTML: outer or inner HTML

- ATTR: any attribute of an HTML tag, like an image src or hyperlink href

Or advanced modes like REGEX, XPATH, etc.

Then save the scraping agent to execute on the cloud-hosted app, with advanced
features like batch crawling, scheduling, and scraping multiple websites
simultaneously without worrying about IP-address blocks or speed.

------
doominasuit
If you need to interpret javascript, or otherwise simulate regular browsing as
closely as possible, you may consider running a browser inside a container and
controlling it with selenium. I have found it’s necessary to run inside the
container if you do not have a desktop environment. This is better suited for
specific use cases rather than mass collection because it is slower to run a
full browsing stack than to only operate at the HTTP layer. I have found that
alternatives like phantomJS are hard to debug. Consider opening VNC on the
container for debugging. Containers like this that I know of are SeleniumHQ
and elgalu/selenium.

------
hmottestad
If you know Java, then my go-to library is Jsoup
[https://jsoup.org/](https://jsoup.org/)

It lets you use jQuery-like selectors to extract data.

Like this: Elements newsHeadlines = doc.select("#mp-itn b a");

~~~
jasondc
+1 Saves a ton of time, and very simple to use

------
cdolan
Outwit Hub, specifically the advanced or enterprise levels.

It has a GUI on it that is not designed very well, and documentation that is
complete, but hard to search...

But it can do just about any type of scrape, and it can even be started from a
command-line script.

~~~
selllikesybok
Second this. My go-to for years now. Inexpensive for what it does. Factor in
the cost of building out its features in your home-rolled solution, and
you'll be saving a ton. Plus the team is very responsive if you need support,
and is open to small consulting projects if you need something beyond your own
abilities.

------
jpetersonmn
I used to use a combo of python tools. Requests, beautifulsoup mostly. However
the last few things I've built used selenium to drive headless chrome
browsers. This allows me to run the javascript most sites use these days.

------
jancurn
Apify ([https://www.apify.com](https://www.apify.com)) is a web scraping and
automation platform where you can extract data from any website using a few
simple lines of JavaScript. It's using headless browsers, so that people can
extract data from pages that have complex structure, dynamic content or employ
pagination.

Recently the platform added support for headless Chrome and Puppeteer, and you
can even run jobs written in Scrapy or any other library as long as it can be
packaged as a Docker container.

Disclaimer: I'm a co-founder of Apify

------
servitor
I agree with others, with curl and the likes you will hit insurmountable
roadblocks sooner or later. It's better to go full headless browser from the
start.

I use a python->selenium->chrome stack. The Page Object Model [0] has been a
revelation for me. My scripts went from being a mess of spaghetti code to
something that's a pleasure to write and maintain.

[0] [https://www.guru99.com/page-object-model-pom-page-factory-
in...](https://www.guru99.com/page-object-model-pom-page-factory-in-selenium-
ultimate-guide.html)
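
The Page Object Model can be sketched like this (page names and selectors are
invented for illustration; a real version would pass a selenium webdriver as
`driver`, so a small stub is included here to make the sketch runnable):

```python
# Page Object Model: each page gets a class that owns its selectors and
# exposes high-level actions, so scraper scripts read like a narrative
# and a selector change is fixed in exactly one place.

class SearchPage:
    QUERY_BOX = ("css selector", "input#q")  # locators live with the page
    SUBMIT = ("css selector", "button[type=submit]")

    def __init__(self, driver):
        self.driver = driver

    def search(self, term):
        self.driver.find_element(*self.QUERY_BOX).send_keys(term)
        self.driver.find_element(*self.SUBMIT).click()
        return ResultsPage(self.driver)  # actions return the next page object


class ResultsPage:
    RESULT_TITLES = ("css selector", ".result .title")

    def __init__(self, driver):
        self.driver = driver

    def titles(self):
        return [el.text for el in self.driver.find_elements(*self.RESULT_TITLES)]


# A tiny stub driver so the sketch runs without a browser:
class _StubElement:
    def __init__(self, text=""):
        self.text = text

    def send_keys(self, value):
        pass

    def click(self):
        pass


class _StubDriver:
    def find_element(self, by, selector):
        return _StubElement()

    def find_elements(self, by, selector):
        return [_StubElement("2 bed flat"), _StubElement("Studio")]


results = SearchPage(_StubDriver()).search("flats")
print(results.titles())  # ['2 bed flat', 'Studio']
```

The scraper script then reads as `SearchPage(driver).search("flats").titles()`
with no selectors in sight.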

------
sl0wik
I had great experience with www.apify.com.

------
mfontani
Whatever you end up using for scraping, I beg you to pick a unique user-agent
which allows a webmaster to understand which crawler it is, to better allow it
to pass through (or be banned, depending).

Don't stick with the default "scrapy" or "Ruby" or "Jakarta Commons-
HttpClient/...", which end up (justly) being banned more easily than unique
ones, like "ABC/2.0 - https://example.com/crawler" or the like.

~~~
danso
Note that for some libraries, the agent is set to empty or whatever the
default is for the tool (e.g. `curl/7.43.0` for curl). It's always worth
setting it to _something_.

As a frequent scraper of government sites, and sometimes commercial sites for
research purposes, I avoid faking a User-Agent as much as possible, i.e.
copying the default strings for popular browsers:

`Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/41.0.2228.0 Safari/537.36`

Almost always, if a site rejects my scraper on the basis of agent, they're
doing a regex for "curl", "wget" or for an empty string. Setting a user-agent
to something unique and explicit, i.e. "Dan's program by danso@myemail.com"
works fine without feeling shady.

Maybe for old government sites that break on anything but IE, you'll have to
pretend to be IE, but that's very rare.
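
The advice above can be sketched with Python's standard library (the
User-Agent string echoes the example in the comment; the URL is a
placeholder):

```python
import urllib.request

# Identify the scraper honestly instead of relying on the library default.
UA = "Dan's program by danso@myemail.com"

req = urllib.request.Request(
    "https://example.com/page",
    headers={"User-Agent": UA},
)

# The header is attached before any request is ever sent.
print(req.get_header("User-agent"))  # Dan's program by danso@myemail.com
```

Libraries like requests take the same header via `headers={"User-Agent": UA}`
on each call or on a `Session` object.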

------
Softcadbury
With node, you can use cheerio [0]. It allows you to parse HTML pages with a
jQuery-like syntax. I use it in production on my project [1]

[0]
[https://github.com/cheeriojs/cheerio](https://github.com/cheeriojs/cheerio)
[1] [https://github.com/Softcadbury/football-
peek/blob/master/ser...](https://github.com/Softcadbury/football-
peek/blob/master/server/updaters/scorersUpdater.js)

------
colinchartier
We had a really tough time scraping dynamic web content using scrapy, and both
scrapy and selenium require you to write a program (and maintain it) for every
separate website that you have to scrape. If the website's structure changes
you need to debug your scraper. Not fun if you need to manage more than 5
scrapers.

It was so hard that we made our own company JUST to scrape stuff easily
without requiring programming. Take a look at
[https://www.parsehub.com](https://www.parsehub.com)

------
256cats
I use Node and either puppeteer[0] or plain Curl[1]. IMO Curl is years ahead
of any Node.js request lib. For proxies I use (shameless plug!)
[https://gimmeproxy.com](https://gimmeproxy.com) .

[0]
[https://github.com/GoogleChrome/puppeteer](https://github.com/GoogleChrome/puppeteer)

[1] [https://github.com/JCMais/node-libcurl](https://github.com/JCMais/node-
libcurl)

~~~
sagivo
Really nice concept.

------
mitchtbaum
I made this
[https://www.drupal.org/project/example_web_scraper](https://www.drupal.org/project/example_web_scraper)
and produced the underlying code many years ago. The idea is to map xpath
queries to your data model and use some reusable infrastructure to simply
apply it. It was very good, imho (for what it was). (I'm writing this comment
since I don't see any other comments with the words map or model :/ )

------
bbayer
I am really surprised nobody mentioned pyspider. It is simple, has a web
dashboard and can handle JS pages. It can store data to a database of your
choice, and it handles scheduling and recrawling. I have used it to crawl
Google Play; a $5 DigitalOcean VPS with pyspider installed could handle
millions of pages crawled, processed, and saved to a database.

[http://docs.pyspider.org/en/latest/](http://docs.pyspider.org/en/latest/)

------
OzzyB
A good host xD

Preferably one that doesn't mind giving you a bunch of IPs, and if they do,
don't charge a fortune for them.

Then you can worry about what software you're gonna use.

~~~
eccfcco15
Which hosts have you used, or would you recommend?

~~~
OzzyB
OVH

You can get up to 256 IPs per server and _not_ pay monthly fees -- just a $3
upfront setup charge.

You're welcome xD

~~~
gerenuk
+1 for ovh ips.

------
mrkeen
I made a crawler
[https://github.com/jahaynes/crawler](https://github.com/jahaynes/crawler)

It outputs to the warc file format
([https://en.wikipedia.org/wiki/Web_ARChive](https://en.wikipedia.org/wiki/Web_ARChive)),
in case your workflow is to gather web pages and then process them afterwards.

------
ngneer
[https://github.com/featurist/coypu](https://github.com/featurist/coypu) is
nice for browser automation. A related question: what are good tools for
database scraping, meaning replicating a backend database via a web interface
(not referring to compromising the application, rather using allowed queries
to fully extract the database).

------
dineshr93
If you know Java then jsoup will be very handy. [1]
[https://jsoup.org/](https://jsoup.org/)

------
charlus
For a little diversity on tools: if you're looking for something quick where
others can easily access the data, Google Apps Script in a Google Sheet can be
quite useful.

[https://sites.google.com/site/scriptsexamples/learn-by-
examp...](https://sites.google.com/site/scriptsexamples/learn-by-
example/parsing-html)

------
buildops
Why are you looking to scrape? Here's a list of some scraper bots:
[https://www.incapsula.com/blog/web-scraping-
bots.html](https://www.incapsula.com/blog/web-scraping-bots.html)

What about Botscraper:
[http://www.botscraper.com/](http://www.botscraper.com/)

------
wiradikusuma
I tinkered with Apache Nutch
([http://nutch.apache.org/](http://nutch.apache.org/)), but I found it
overkill. In the end, since I use Scala, I use
[https://github.com/ruippeixotog/scala-
scraper](https://github.com/ruippeixotog/scala-scraper)

------
laktek
One of the challenges with modern day scraping is you need to account for
client-side JS rendering.

If you prefer an API as a service that can pre-render pages, I built Page.REST
([https://www.page.rest](https://www.page.rest)). It allows you to get
rendered page content via CSS selectors as a JSON response.

------
blueadept111
Jaunt [[http://jaunt-api.com](http://jaunt-api.com)] is a good java tool.

------
0xdeadbeefbabe
The best tool for web scraping, for me, is something easy to deploy and
redeploy; and something that doesn't rely on three working programs--
eliminating selenium sounds great.

For those reasons I like
[https://github.com/knq/chromedp](https://github.com/knq/chromedp)

------
ksahin
I wrote a blog post about Java web scraping here:
[https://ksah.in/introduction-to-web-scraping-with-
java/](https://ksah.in/introduction-to-web-scraping-with-java/)

As others said, phantomJS (and now headless Chrome) are good tools to deal
with heavy js websites

------
teremin
I use Colly[0][1] which is a young but decent scraping framework for Golang.

[0] [http://go-colly.org/](http://go-colly.org/) [1]
[https://github.com/gocolly/colly](https://github.com/gocolly/colly)

------
tmaly
I just tried puppeteer yesterday for the first time. It seems to work very
well. My only complaint is that it is very new and does not yet have a
plethora of examples.

I previously have used WWW::Mechanize in the Perl world, but single page
applications with Javascript really require something with a browser engine.

~~~
kjullien
Don't use puppeteer unless it's to fiddle 10 minutes with it. I tried to use
it for front-end tests on our stack at work and spent more time debugging 1.
puppeteer and 2. headless Chrome (yes, not even the lib itself) than doing any
work on the project. For instance, headless Chrome will sometimes randomly
exit for no reason, making the lib crash; it has been this way since the lib
came out and it's still not fixed. This is just the first example, and that
single problem alone means I can't work with the lib. Not going to get into
font rendering, random multiple-tab crashes, and other stupid issues that are
a result of bad integration.

------
RandomBookmarks
The "best tool" is different for web developers and non-coders. If you are a
non-technical person that just needs some data there is:

(1) hosted services like mozenda

(2) visual automation tools like Kantu Web Automation (which includes OCR)

(3) and last but not least outsourcing the scraping on sites like
Freelancer.com

------
thallian
I used CasperJS [0] in the past to scrape a javascript-heavy forum (ProBoards)
and it worked well. But that was a few years ago; I have no idea what new
strategies have come up in the meantime.

[0] [http://casperjs.org/](http://casperjs.org/)

------
tn_
Check out Heritrix if you're looking for an open-source webscraping archival
tool:
[https://webarchive.jira.com/wiki/spaces/Heritrix](https://webarchive.jira.com/wiki/spaces/Heritrix)

------
brycematheson
Shameless plug. I wrote a blog post on how I use Powershell to scrape sites:
[http://brycematheson.io/webscraping-with-
powershell/](http://brycematheson.io/webscraping-with-powershell/)

------
frausto
Been getting blocked by recaptcha more and more, do any of these tools handle
dealing with that or workarounds by default? Tried routing through proxies and
swapping IP addresses, slowing down, etc... Any specific ways people get
around that?

~~~
jakubbalada
You can use services like Anti-captcha [1]

We have a public API on Apify for that [2]

[1] [https://anti-captcha.com/mainpage](https://anti-captcha.com/mainpage)

[2] [https://www.apify.com/petr_cermak/anti-captcha-
recaptcha](https://www.apify.com/petr_cermak/anti-captcha-recaptcha)

------
jschuur
If you want to extract content and specific meta data, you might find the
Mercury Web Parser useful:

[https://mercury.postlight.com/web-parser/](https://mercury.postlight.com/web-
parser/)

------
Karupan
I've had some success using Portia [1]. It's a visual wrapper over scrapy, but
is actually quite useful.

[https://github.com/scrapinghub/portia](https://github.com/scrapinghub/portia)

------
traviswingo
I’ve been using puppeteer to scrape and it’s been fantastic. Since it’s a
headless browser, it can handle SPA just as well as server side loaded
traditional websites. It’s also incredibly easy to use with async/await.

~~~
ajcodez
I assume this puppeteer:

-
[https://github.com/GoogleChrome/puppeteer](https://github.com/GoogleChrome/puppeteer)

------
askz
A friend released a little tool to scrape only the HTML from websites, with
Tor and proxy chaining

[https://github.com/AlexMili/Scraptory](https://github.com/AlexMili/Scraptory)

------
freeslugs
If you need simple scraping, I like traditional http request lib. For more
robust scraping (ie clicking buttons / filling text), use capybara and either
phantomjs or chromedriver - easy to install using homebrew!

------
mateuszf
`clj-http`, `enlive`, `cheshire` in case of `clojure` worked fine for me

~~~
tuddman
and 'hickory'
[[https://github.com/davidsantiago/hickory](https://github.com/davidsantiago/hickory)]
to work with the site data however you want.

------
thegrif
A ton of people recommended Scrapy - and I am always looking for senior Scrapy
resources that have experience scraping at scale. Please feel free to reach
out - contact info is in my profile.

------
sananth12
If you are looking for image scraping:
[https://github.com/sananth12/ImageScraper](https://github.com/sananth12/ImageScraper)

------
pudo
We're about to announce a new Python scraping toolkit, memorious:
[https://github.com/alephdata/memorious](https://github.com/alephdata/memorious)
\- it's a pretty lightweight toolkit, using YAML config files to glue together
pre-built and custom-made components into flexible and distributed pipelines.
A simple web UI helps track errors and execution can be scheduled via celery.

We looked at scrapy, but it just seemed like the wrong type of framing for the
type of scrapers we build: requests, some html/xml parser, and output into a
service API or a SQL store.

Maybe some people will enjoy it.

------
kbd
For simple tasks, curl into pup is very convenient.

[https://github.com/ericchiang/pup](https://github.com/ericchiang/pup)

------
kopos
Scrapy [[https://github.com/scrapy/scrapy](https://github.com/scrapy/scrapy)]
works really well.

------
vinitagr
[https://github.com/matthewmueller/x-ray](https://github.com/matthewmueller/x-ray)

------
Lxr
Python requests + lxml, with Selenium as a last resort.
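
A sketch of that stack, parsing from a string so it runs without a network
call (the HTML and XPath expressions are invented for illustration; in real
use the string would come from `requests.get(url).text`):

```python
from lxml import html

# Stand-in for a fetched page.
page = """
<html><body>
  <div class="listing"><a href="/item/1">First</a></div>
  <div class="listing"><a href="/item/2">Second</a></div>
</body></html>
"""

tree = html.fromstring(page)

# XPath pulls out attributes and text in one pass.
links = tree.xpath('//div[@class="listing"]/a/@href')
titles = tree.xpath('//div[@class="listing"]/a/text()')

print(links)   # ['/item/1', '/item/2']
print(titles)  # ['First', 'Second']
```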

------
bantersaurus
beautifulsoup

~~~
cjsuk
Using this as well with Requests to automate eBay/gumtree/craigslist. Works
very well

~~~
djaychela
Any details on this anywhere, or is it not for public consumption? I'm just
getting started in Python and want to do something with Gumtree and eBay as an
idea to help me in a different sphere.

~~~
cjsuk
It's not really for public consumption because it's embarrassingly badly
written :)

It's pretty dumb really. Just figured out the search URLs and then parse the
list responses. It then stores the auctions/ad IDs it has seen in a tiny redis
instance with 60 days' expiry on each ID it inserts. If there are any items it
hasn't seen each time it runs, it compiles them in a list and emails them to
me via AWS SNS. Runs every 5 minutes from cron on a Raspberry Pi Zero plugged
into the back of my XBox 360 as a power supply and my router via a
USB/ethernet cable.
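
That seen-ID pattern can be sketched in Python (a plain dict with timestamps
stands in for the Redis instance and its 60-day expiry; all names are
illustrative):

```python
import time

EXPIRY = 60 * 24 * 60 * 60  # 60 days, in seconds

# item_id -> timestamp of first sighting (SETEX with a TTL in the real setup)
seen = {}

def new_items(item_ids, now=None):
    """Return only the IDs not seen in the last 60 days, and record them."""
    now = time.time() if now is None else now
    # Drop expired entries, as Redis would do automatically via key TTLs.
    for item_id, stamp in list(seen.items()):
        if now - stamp > EXPIRY:
            del seen[item_id]
    fresh = [i for i in item_ids if i not in seen]
    for item_id in fresh:
        seen[item_id] = now
    return fresh  # in the real setup these get compiled into the email

print(new_items(["a1", "b2"]))  # ['a1', 'b2']  first run: everything is new
print(new_items(["b2", "c3"]))  # ['c3']        b2 was already seen
```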

The main bulk of the work went into the searches to run which are a huge list
of typos on things with a high return. I tend to buy, test, then reship them
for profit. Not much investment gives a very good return - pays for the food
bill every month :)

~~~
djaychela
Thanks for the info - I'm sure mine will be of lower quality when I do write
it - hoping to compile real-world info on sold vehicles by scraping info from
eBay and Gumtree, but that will take time and more skills than I currently
possess. Good to hear someone's made something out of a similar idea, though.

~~~
cjsuk
Sounds like a good idea. Good luck - you can do it! :)

------
fazkan
scrapy and BS4 for serious stuff. Selenium for automating logins and other
UI-related stuff; you can even play games with it.

------
kazinator
TXR: [http://www.nongnu.org/txr](http://www.nongnu.org/txr)

------
crispytx
I did a little web scraping project a few years ago using:

* cURL

* regex

------
thejosh
If you are scraping specific pages on a site, curl. Then transform that into
the language you use.

------
cm2012
For non developers dexi.io is great.

------
novaleaf
i wrote a tool: PhantomJsCloud.com

it's getting a little long in the tooth, but I will be updating it soon to use
a Chrome based renderer. If you have any suggestions, you can leave it here or
PM me :)

------
aaronhoffman
This tool takes a list of URIs and crawls each site for contact info. Phone,
email, twitter, etc

[https://github.com/aaronhoffman/WebsiteContactHarvester](https://github.com/aaronhoffman/WebsiteContactHarvester)

------
jpepinho
WebDriver.io using Selenium and PhantomJS would be a good way to go!

------
kzisme
So in general, what do most people use web scraping for? Is it building up
their own database of things not available via an API or something? It always
sounds interesting, but the need for it is what confuses me.

~~~
tmuir
I've generally used it to sort data in some way that's not available on the
original webpage. Either into a csv file, making large lists easier to view,
or to determine some optimum, such as the best price.

- Which squares have historically hit the most often in Superbowl Squares
([http://www.picks.org/nfl/super-bowl-squares](http://www.picks.org/nfl/super-
bowl-squares))

- Search a job website for a search term and list of locations, collecting
each job title, company, location, and link, to view as one large spreadsheet,
instead of having to navigate through 10 results per page.

- Collect cost of living indices in a list of cities
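
The job-listings case can be sketched with the standard library's csv module
(the field names and rows are invented for illustration; in practice they
would come from the scraper's parse step):

```python
import csv
import io

# Rows as they might come out of a scraper, one dict per job listing.
jobs = [
    {"title": "Data Engineer", "company": "Acme", "location": "Boston",
     "link": "https://example.com/1"},
    {"title": "QA Analyst", "company": "Initech", "location": "Austin",
     "link": "https://example.com/2"},
]

# io.StringIO keeps the sketch self-contained; use
# open("jobs.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "company", "location", "link"])
writer.writeheader()
writer.writerows(jobs)

print(buffer.getvalue())
```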

------
greyfox
I did a quick search and didn't see this listed here:

[https://www.httrack.com/](https://www.httrack.com/)

------
etattva
Scrapy and Jsoup are the best combination

------
tomc1985
Perl or Ruby and Regular Expressions

------
herbst
Nokogiri

------
vsupalov
That really depends on your project and tech stack. If you're into Python and
are going to deal with relatively static HTML, then the Python modules Scrapy
[1], BeautifulSoup [2] and the whole Python data crunching ecosystem are at
your disposal. There's lots of great posts about getting such a stack off the
ground and using it in the wild [3]. It can get you pretty darn far, the
architecture is _solid_ and there are lots of services and plugins which
probably do everything you need.

Here's where I hit the limit with that setup: dynamic websites. If you're
looking at something like discourse-powered communities or similar, and don't
feel a bit too lazy to dig into all the ways requests are expected to look,
it's no fun anymore. Luckily, there's lots of js-goodness which can handle
dynamic websites, inject your javascript for convenience and more [4].

The recently published Headless Chrome [5] and puppeteer [6] (a Node API for
it), are really promising for many kinds of tasks - scraping among them. You
can get a first impression in this article [7]. The ecosystem does not seem to
be as mature yet, but I think this will be the foundation of the next go-to
scraping tech stack.

If you want to try it yourself, I've written a brief intro [8] and published a
simple dockerized development environment [9], so you can give it a go without
cluttering your machine or find out what dependencies you need and how the
libraries are called.

[1] [https://scrapy.org/](https://scrapy.org/)

[2]
[https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

[3] [http://sangaline.com/post/advanced-web-scraping-
tutorial/](http://sangaline.com/post/advanced-web-scraping-tutorial/)

[4] [https://franciskim.co/dont-need-no-stinking-api-web-
scraping...](https://franciskim.co/dont-need-no-stinking-api-web-
scraping-2016-beyond/)

[5]
[https://developers.google.com/web/updates/2017/04/headless-c...](https://developers.google.com/web/updates/2017/04/headless-
chrome)

[6]
[https://github.com/GoogleChrome/puppeteer](https://github.com/GoogleChrome/puppeteer)

[7] [https://blog.phantombuster.com/web-scraping-
in-2017-headless...](https://blog.phantombuster.com/web-scraping-
in-2017-headless-chrome-tips-tricks-4d6521d695e8)

[8] [https://vsupalov.com/headless-chrome-puppeteer-
docker/](https://vsupalov.com/headless-chrome-puppeteer-docker/)

[9] [https://github.com/vsupalov/docker-puppeteer-
dev](https://github.com/vsupalov/docker-puppeteer-dev)

------
21stio
golang

~~~
deathemperor
I signed up for proxycrawl, used the javascript api to access a SPA website
written in React and it just show a blank page.
[https://api.proxycrawl.com/?token=aDcC1lB-
NZ5_r4vMSN-L3A&url...](https://api.proxycrawl.com/?token=aDcC1lB-
NZ5_r4vMSN-L3A&url=https://www.shopee.vn) (I don't mind my token is exposed)

------
pwaai
hey I'm working on this thing called BAML (browser automation markup language)
and it looks something like this:

    
    
        OPEN http://asdf.com
        CRAWL a
        EXTRACT {'title': '.title'}
    

It's meant to be super simple and built from ground up to support crawling
Single Page Applications.

Also, creating a terminal client (early ver:
[https://imgur.com/a/RYx5g](https://imgur.com/a/RYx5g)) for it which will
launch a Chrome browser and scrape everything.
[http://export.sh](http://export.sh) is still very early in the works; I'd
appreciate any feedback (_email in profile, contact form doesn't work_).

------
dor_jack
If you need to perform a web-scale crawl I strongly recommend
[https://www.mixnode.com](https://www.mixnode.com).

