
Show HN: crawl a website and store it in S3 from your browser - spullara
https://github.com/spullara/browsercrawler
======
ryan-allen
Is there a header set to identify this crawler so I can limit it?

For what it's worth, when people crawl sites I am responsible for and it
impacts our product's performance, I block the offending IP.

It doesn't get unblocked until the system reboots (rarely) or someone lodges a
support ticket to say they cannot access the site.

It's heavy-handed, but crawlers can cause a significant amount of trouble given
their unusual usage pattern. Spiders often wait between hits; I hope
you have programmed a delay in rather than going full speed!

~~~
spullara
As it only does one URL at a time and uploads to S3 between requests, it
shouldn't unduly load any reasonable system. I'll add an additional
"BrowserCrawler" string to the user-agent; that seems very reasonable.

Update: It appears jQuery can't set the User-Agent header on Ajax requests. I
have instead set a new X-User-Agent header to BrowserCrawler.
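
As a rough illustration of the approach described in this comment (one URL at a time, a store step between requests, and an identifying X-User-Agent header), here is a minimal sketch. It is written in Python rather than the project's browser JavaScript, and it is not the project's actual code. The function names, the `store_fn` callback standing in for the S3 upload, and the one-second default delay are all assumptions made for the example.

```python
import time
import urllib.request
from typing import Callable, Iterable, List

# Header name and value come from the comment above; everything else
# in this sketch is a hypothetical stand-in.
CRAWLER_HEADERS = {"X-User-Agent": "BrowserCrawler"}


def build_request(url: str) -> urllib.request.Request:
    """Build a request that carries the identifying header."""
    return urllib.request.Request(url, headers=CRAWLER_HEADERS)


def fetch(url: str) -> bytes:
    """Fetch a single URL and return the response body."""
    with urllib.request.urlopen(build_request(url), timeout=30) as resp:
        return resp.read()


def crawl_sequentially(
    urls: Iterable[str],
    store_fn: Callable[[str, bytes], None],
    fetch_fn: Callable[[str], bytes] = fetch,
    delay_seconds: float = 1.0,  # assumed value, not from the thread
    sleep_fn: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch and store URLs strictly one at a time, pausing between hits."""
    done: List[str] = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first.
            sleep_fn(delay_seconds)
        body = fetch_fn(url)
        # Storing before the next fetch keeps requests strictly serial.
        store_fn(url, body)
        done.append(url)
    return done
```

Passing `fetch_fn` and `sleep_fn` as parameters is only a convenience so the loop can be exercised without network access or real waiting.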

~~~
JoachimSchipper
What's this X-User-Agent stuff? The entire internet isn't going to special-
case your hack. Just honour robots.txt.
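
For readers unfamiliar with what honouring robots.txt involves, a minimal sketch using Python's standard `urllib.robotparser` follows. It is not part of the project under discussion (which runs as JavaScript in the browser). The example robots.txt content, the helper names, and the use of "BrowserCrawler" as the agent token are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical agent token, borrowed from the string proposed in the thread.
USER_AGENT = "BrowserCrawler"

# Made-up robots.txt content, used only for this example.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)
```

A crawler would call `allowed(...)` before each fetch and skip any URL for which it returns False.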

~~~
spullara
I don't think that an individual storing web pages has any requirement to obey
robots.txt. You should be able to click "Save Page" on anything you have
access to. Obviously redistributing that is another can of worms.

------
raptrex
How hard would it be to modify it to store it locally on your computer?

~~~
dotBen

      wget --spider
    

is probably your friend

------
neuromancer2600
Great! I had the same idea, only with GAE integration and an option to download
locally. Any plans on expanding that interface?

~~~
spullara
I have it going to S3 because it gives me the ability to instantly send
someone a link or access from wherever I am. Also, I planned on doing larger
crawls + analysis. Should be pretty easy though to hook it into a local file
API like I mention above if that is more convenient for you.
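
To make the local-storage idea concrete, here is one possible sketch of a store step that writes each page to disk instead of S3. It is in Python rather than the project's browser JavaScript and does not reflect how the project would actually do it. The function names and the choice to percent-encode the whole URL into a single flat file name are assumptions for the example.

```python
from pathlib import Path
from urllib.parse import quote


def local_path_for(url: str, root: Path) -> Path:
    """Map a URL to one flat file name under root."""
    # Percent-encode every reserved character (including "/" and ":")
    # so the result never contains a path separator.
    return root / quote(url, safe="")


def store_locally(url: str, body: bytes, root: Path) -> Path:
    """Write the fetched body to disk and return the path used."""
    path = local_path_for(url, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path
```

Very long URLs can exceed file-name length limits on some file systems, so a real implementation would likely need a hashing or truncation scheme on top of this.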

------
petervandijck
That is _very_ cool.

