

Ask HN: How to create a Web crawler? - gurpreet42

I am creating a price comparison website. I want to fetch the price of a single product from multiple shopping carts.
I have some questions about this:
1) What is the best way to fetch prices from different websites?
2) Sometimes a single product has different names on different shopping carts. How do I handle this?
3) When I send multiple requests to a website (shopping cart) using my web crawler (or web spider), will they block me, or can they take legal action against me?
4) How can I automate the process to avoid manual errors?

I will start with 3 shopping carts initially. Most of the shopping carts do not provide an API or any other kind of access to their products.
Currently my approach is to parse the HTML and extract the required information from it.

I want to use C# & .NET because I am good at it (or so I thought).

Please suggest the best way of doing this.
======
Piskvorrr
Okay, step #1: go search the web for "web spidering" and "web crawling". Go
read the relevant articles (and Wikipedia's, too; a good starting point IMHO).
Step #2: update this question when you have a question that's answerable. See,
entire _books_ can be (and have been) written on the subjects that you ask
about (and that's even before getting started with the legal aspects).

As for question number 4: Impossible unless you have strong AI (see also: How
Apple automatically collated their map data from three different sources and
what they got as a result). You see, fully automated data collection from
multiple sources will introduce _more_ errors than manual methods (which are
slow, OTOH); automated collection plus manual verification is necessary.
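
To illustrate the "automated collection, manual verification" split, here is a minimal Python sketch. It assumes the prices for one product have already been collected per shop; the function name `split_for_review` and the 25% tolerance are hypothetical choices made for this example. Prices that sit far from the median are routed to a human instead of being published.

```python
from statistics import median


def split_for_review(prices, tolerance=0.25):
    """Split {shop: price} into (accepted, needs_review).

    A price is sent to manual review when it differs from the median
    of all collected prices by more than `tolerance` (a fraction).
    The threshold is an assumption; tune it on real data.
    """
    if not prices:
        return {}, {}
    mid = median(prices.values())
    accepted, needs_review = {}, {}
    for shop, price in prices.items():
        # Relative distance from the median; skipped if the median is 0.
        if mid and abs(price - mid) / mid > tolerance:
            needs_review[shop] = price
        else:
            accepted[shop] = price
    return accepted, needs_review
```

A price of 10.4 next to two prices around 100 would land in `needs_review`, which is the kind of misplaced-decimal scraping error a person catches quickly.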

~~~
gurpreet42
I have searched the web and have created one solution. It is semi-automated,
meaning I have to give it some URLs and some parameters to get the desired result.

It has worked for me to some extent.
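
For readers wondering what "URLs plus some parameters" can look like in practice, here is a minimal Python sketch using only the standard library. It is not the poster's solution: the names `PriceExtractor`, `parse_price` and `extract_price` are hypothetical, the per-shop parameter is assumed to be a CSS class name, and prices are assumed to use `1,299.99`-style formatting. Fetching the page is left out so the example stays self-contained.

```python
import re
from decimal import Decimal
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PriceExtractor(HTMLParser):
    """Collects the text of the first element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self._depth = 0      # > 0 while inside the target element
        self._chunks = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if self.done or tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.done = True

    def handle_data(self, data):
        if self._depth and not self.done:
            self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks).strip()


def parse_price(text):
    """Return the first number in `text` as a Decimal, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


def extract_price(html, css_class):
    """Find the element with `css_class` in `html` and parse its price."""
    parser = PriceExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parse_price(parser.get_text())
```

Each shop then needs only its own URL and class name as configuration. Real pages are messier (prices rendered by JavaScript, several currencies, malformed markup), so treat this as a starting point.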

------
philipDS
An alternative to Python or C# for web crawling could be Node.js. It's pretty
good at it and you have a few libraries that can help you:

* Node.io (<https://github.com/chriso/node.io>)
* Phantomjs-node for dynamic content (<https://github.com/sgentle/phantomjs-node>)
* Cheerio for a jQuery server-side implementation (<https://github.com/MatthewMueller/cheerio>)
* Node-jquery as an alternative to Cheerio (<https://github.com/coolaj86/node-jquery>)

A single product might have a different name, but you might try to scrape the
product IDs if they exist. Product IDs should be unique. If both websites
provide those IDs you could compare those in your database. If that's not
possible: as a small hack you could also use Amazon Mechanical Turk to issue
manual tasks to compare product names. This way, real people will check if two
products are the same in case there is doubt. This will cost you a little, but
you could give those people 5 cents per product comparison or something like
that.
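
The matching idea above (IDs first, people only when in doubt) could be sketched in Python roughly as follows. This is an illustration under assumptions: the function name `match_products`, the record layout (`name`, optional `product_id`) and both thresholds are hypothetical, and comparing IDs only makes sense if both shops use the same ID scheme. Name similarity uses the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return " ".join(name.lower().split())


def match_products(a, b, same_at=0.95, review_at=0.6):
    """Classify two product records as 'same', 'different' or 'review'.

    Records are dicts with a 'name' and an optional 'product_id'.
    Thresholds are assumptions; calibrate them on real listings.
    """
    id_a, id_b = a.get("product_id"), b.get("product_id")
    if id_a and id_b:
        # Assumes both shops expose the same kind of ID.
        return "same" if id_a == id_b else "different"
    ratio = SequenceMatcher(
        None, normalize(a["name"]), normalize(b["name"])
    ).ratio()
    if ratio >= same_at:
        return "same"
    if ratio >= review_at:
        return "review"  # queue for a human, e.g. a Mechanical Turk task
    return "different"
```

Only the `review` bucket needs to be paid for, which keeps the per-comparison cost down.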

For question 3, some websites don't allow you to crawl their content. Read
their ToS :-)
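
Besides the ToS, many sites publish a machine-readable `robots.txt`. It is a separate convention and does not settle the legal question, but honoring it and pausing between requests makes being blocked less likely. Here is a minimal Python sketch with the standard library's `urllib.robotparser`; the helper names, the `price-bot` user agent and the 2-second default delay are assumptions for this example.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse the text of a robots.txt file that was fetched elsewhere."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed(rules, user_agent, url):
    """True if robots.txt permits `user_agent` to fetch `url`."""
    return rules.can_fetch(user_agent, url)


def polite_delay(rules, user_agent, default=2.0):
    """Seconds to wait between requests: the site's Crawl-delay if it
    declares one, otherwise `default` (an arbitrary assumption)."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

A crawler would check `allowed(...)` before each request and call `time.sleep(polite_delay(...))` between requests to the same shop.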

For the rest I agree with Piskvorrr, you could do some trial and error and
learn on the fly or read some books (and still do trial and error and learn on
the fly ;-)). Good luck!

~~~
gurpreet42
thanks ..

------
mmariani
If you're willing to veer off from C# there's a great framework in Python
called Scrapy [0], it's fast and easy to pick up. You could crawl and save the
results in your database with Scrapy, and get them back with your C# backend.

[0] <http://scrapy.org/>

~~~
gurpreet42
I can switch to any platform, provided it is the best one. I will check
this framework and try to create a demo application so that I can get a better
idea. There are many companies that use web crawling. Are they using Python or
something else?

------
dgunn
Some sites will say you can't scrape. They'll say so in their TOS if that's
the case. If you absolutely need that info, there are ways to not look like a
spider. One way is to use a browser driver (pretty heavy option so I would
just not use those sites for now). I can't remember the names of the browser
drivers available, but that's one possibility. I doubt they'll have libraries
for your language though.

I would definitely recommend Python for web scraping. There are a ton of
tutorials out there. I believe there are even a few from Google which are
pretty good.

~~~
gurpreet42
Thanks for your suggestions. I will try Python for my further efforts.

------
knes
Check the Udacity CS101 course, where you build a web crawler with Google guys.

<http://www.udacity.com/view#Course/cs101/CourseRev/apr2012>

