

IndexTank / 80legs Crawlathon (developer contest) - diego
http://blog.indextank.com/759/new-contest-indextank-80legs-crawlathon-2/

======
va_coder
IndexTank, as opposed to WebSolr, has a proprietary API. I don't think you can
easily move the data to your own servers like you can with WebSolr.

~~~
personalcompute
Another problem is that 80legs will only crawl pages <100kb with the
free/contest plan. This means it is useless for all but the smallest pages.

~~~
jdrock
Er.. not true, the contest plan (which is different from the free plan) allows
up to 10 MB download. Registered contestants should have access to their plans
within an hour of registering.

And if you don't have it, just contact us:
<http://www.80legs.com/contact.html>.

~~~
personalcompute
Ah, awesome, thank you for the correction; this allows me to continue with my
plan. I posted that within an hour of registering, but had also asked on their
support page.

------
revorad
If I'm building a search engine, say focused on telescopes, can I use 80legs
to crawl youtube videos (or results from google video search)? I mean can I
add this url to my list of urls to crawl? -
<http://www.youtube.com/results?search_query=telescope>

~~~
jdrock
Our default crawler obeys robots.txt and it looks like the /results URLs are
not allowed. However.. I think you could start from a URL like
<http://www.youtube.com/watch?v=sAzhOSbxMiI> and then crawl to the linked
videos from there...
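
For anyone who wants to check a seed list against robots.txt before starting a crawl, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `ASSUMED_ROBOTS_TXT` are an assumption based on the comment above (that `/results` is disallowed), not a copy of YouTube's actual file, and `filter_allowed` is a hypothetical helper name. It says nothing about how 80legs implements this internally.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: illustrative rules only, based on the remark that /results
# is disallowed. This is not the real robots.txt of any site.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /results
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_allowed(parser: RobotFileParser, urls, user_agent: str = "*"):
    """Return only the URLs the given user agent may fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    seeds = [
        "http://www.youtube.com/results?search_query=telescope",
        "http://www.youtube.com/watch?v=sAzhOSbxMiI",
    ]
    for url in filter_allowed(build_parser(ASSUMED_ROBOTS_TXT), seeds):
        print("allowed:", url)
```

In a real crawl you would load the site's live robots.txt (for example with `set_url()` and `read()`) instead of a hard-coded string.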

~~~
revorad
Oh I didn't know about the robots.txt rule for /results. Thanks for pointing
it out or I would have built my own crawler and got banned. I think I'll go
with playlists.

------
orborde
What's the goal of the contest? "Build the most awesome app on our platform"?

It's not clear to me reading the post.

~~~
diego
Create a small web search engine. Crawl stuff with 80 legs, index it with
IndexTank. Example:

- Crawl everything you can find about bitcoin and do a bitcoin search engine
(just popped up in my mind, may not be the most interesting idea).

~~~
personalcompute
Then post the bitcoin-related whatever to HN. Instant winner.

------
btucker
Is there any documentation on how to get started using these two products
together?

~~~
diego
It should be pretty straightforward:

- run a crawl at 80legs
- download the results as a csv or xml
- feed them to IndexTank using your client of choice

Over the weekend I'll put on my (rusty) hacker hat, do an example and blog
about it.
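
Until that example exists, the three steps above could be sketched roughly as follows. This is only a sketch under stated assumptions: the CSV column names `url` and `text` are hypothetical (check the actual 80legs export format), and `add_document` stands in for whatever indexing call your IndexTank client provides, so no real client API is shown here.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, Tuple

Document = Tuple[str, Dict[str, str]]


def read_crawl_results(csv_text: str) -> Iterator[Document]:
    """Yield (docid, fields) pairs from crawl results exported as CSV.

    ASSUMPTION: the export has 'url' and 'text' columns. These names are
    hypothetical and may not match the real export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        url = (row.get("url") or "").strip()
        if not url:
            continue  # skip rows with no URL to use as a document id
        text = (row.get("text") or "").strip()
        yield url, {"url": url, "text": text}


def feed_index(
    docs: Iterable[Document],
    add_document: Callable[[str, Dict[str, str]], None],
) -> int:
    """Send each document to the index via the supplied callable.

    `add_document` is a placeholder for your client's indexing call.
    Returns the number of documents sent.
    """
    count = 0
    for docid, fields in docs:
        add_document(docid, fields)
        count += 1
    return count


if __name__ == "__main__":
    sample = (
        "url,text\n"
        "http://example.com/a,Bitcoin basics\n"
        "http://example.com/b,Mining guide\n"
    )
    fake_index: Dict[str, Dict[str, str]] = {}  # in-memory stand-in for an index
    sent = feed_index(read_crawl_results(sample), fake_index.__setitem__)
    print(sent, "documents sent to the index")
```

Passing the indexing call in as a function keeps the CSV handling testable without an account, and lets you swap in a real client later.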

------
Karhan
Is there a good reason that reddit has banned 80legs crawling?

~~~
jdrock
True story: We crawled them a while back (before they expanded their
engineering team) and because of our distributed system, "alarms" were going
off. Rather than take time to tell their system we were not a DDOS attack,
they put us in robots.txt. I imagine their small team had other stuff to work
on.

~~~
xtacy
So, 80legs caused more traffic volume to reddit than Google's crawler? That
seems interesting; why would it be the case?

~~~
jdrock
Not necessarily more traffic, though that's possible. More likely it was the
fact that it was coming from multiple IP addresses.

------
Omnipresent
can't sign up yet. Seems to be down

80legs is currently down for maintenance.

We're working on upgrades, new releases, and other fun stuff! Be back soon!

~~~
jdrock
Hm.. can you try again? Seems to be working from our end!

