

Download the internet - daleharvey
http://www.dotnetdotcom.org/

======
HendrikR
The number of pages crawled keeps counting up, one per second. A glance at the
source code shows that the counter is faked.

------
juliend2
I love their navigation:

# Information on how to block our crawler. (Hint, it doesn't involve legal
action)

# Our purpose and goal. (Yes we have one and no it doesn't involve spam)

# Our technology. (Thanks open source!)

[...]

------
gojomo
Their dump would be more useful if they...

(1) Used a preexisting aggregate web content format. Their ad hoc format is
simple enough, but can't handle content with NULLs, and loses valuable
information (such as time of capture -- you can't trust server 'Date' headers
-- and resolved IP address at time of collection).

They could use the Internet Archive classic 'ARC' format (not to be confused
with the older compression format of the same name):

<http://www.archive.org/web/researcher/ArcFileFormat.php>

Or the newer, more involved and chatty but still relatively straightforward
'WARC' format (a short writing sketch follows below):

<http://archive-access.sourceforge.net/warc/>

(2) Explained how the 3.2 million pages in their initial dump were chosen.
(That's only a tiny sliver of the web; where did they start and what did they
decide to collect and put in this dataset?)
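
To make point (1) concrete, here's a minimal sketch of writing a single capture
as a WARC record, assuming the Python warcio library (not anything dotbot
itself uses); the URL, timestamp, body, and IP address are placeholder values:

    import io
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open('capture.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        body = b'<html>placeholder page</html>'
        http_headers = StatusAndHeaders('200 OK',
                                        [('Content-Type', 'text/html')],
                                        protocol='HTTP/1.1')
        record = writer.create_warc_record(
            'http://example.com/',          # placeholder URL
            'response',
            payload=io.BytesIO(body),
            http_headers=http_headers,
            # capture time and resolved IP live in the record's WARC headers,
            # so they survive even when the server's own Date header is wrong
            warc_headers_dict={'WARC-Date': '2009-01-25T12:00:00Z',
                               'WARC-IP-Address': '192.0.2.1'})
        writer.write_record(record)

The point is simply that the container format, not the page body, is where the
capture metadata belongs.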

(FYI, I work at the Internet Archive.)

~~~
bravura
gojomo, I have looked at the file formats. Could you propose some off-the-
shelf web spiders? I would like to accumulate a lot of text for NLP research.

~~~
gojomo
At the Internet Archive we've created Heritrix for 'archival quality' crawling
-- especially when you want to get every media type, and sites to
complete/arbitrary depth, in large but polite crawls. (It's possible, but not
usual for us, to configure it to only collect textual content.)

The Nutch crawler is also reasonable for broad survey crawls, and HTTrack works
well for 'mirroring' large groups of sites to a filesystem directory tree.

~~~
bravura
Could you outline how to configure it to collect only textual content?

~~~
gojomo
Very roughly:

(1) Add a scope rule that throws out discovered URIs with popular non-textual
extensions (.gif, .jpe?g, .mp3, etc.) before they are even queued.

(2) Add a 'mid-fetch' rule to the FetchHTTP module that early-cancels any
fetches with unwanted MIME types. (These rules run after the HTTP headers are
available.)

(3) Add a processor rule to whatever is writing your content to disk (usually
ARCWriterProcessor) that skips writing results of unwanted MIME types (such as
the early-cancelled non-textual fetches above).
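
The same three stages can be sketched generically; this is a hypothetical
Python illustration of the idea (using the requests library), not Heritrix's
actual configuration or class names:

    from urllib.parse import urlparse
    import requests

    SKIP_EXTENSIONS = ('.gif', '.jpg', '.jpeg', '.png', '.mp3', '.zip')  # illustrative
    WANTED_MIME_PREFIXES = ('text/', 'application/xhtml')

    def in_scope(url):
        """Stage 1: drop URIs with obviously non-textual extensions before queueing."""
        return not urlparse(url).path.lower().endswith(SKIP_EXTENSIONS)

    def fetch_if_textual(url):
        """Stage 2: abort mid-fetch once the headers reveal an unwanted MIME type."""
        resp = requests.get(url, stream=True, timeout=30)
        mime = resp.headers.get('Content-Type', '').split(';')[0].strip()
        if not mime.startswith(WANTED_MIME_PREFIXES):
            resp.close()  # headers were read, the body is never downloaded
            return None, mime
        return resp.content, mime

    def write_if_textual(url, body, mime, out):
        """Stage 3: the writer skips anything that still has the wrong type."""
        if body is not None and mime.startswith(WANTED_MIME_PREFIXES):
            out.write(f"{url}\t{mime}\t{len(body)}\n".encode())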

Followup questions should go to the Heritrix project discussion list,
<http://groups.yahoo.com/group/archive-crawler/> .

------
Shamiq
How _could_ you use this?

Creative ideas please!

~~~
braindead_in
Write a better page rank algorithm?
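
For a sense of what that involves, here is a toy power-iteration PageRank over
a link graph extracted from a dump (purely a hypothetical sketch; the graph,
damping factor, and iteration count are illustrative):

    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = set(links) | {dst for dsts in links.values() for dst in dsts}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / len(pages) for p in pages}
            for src, dsts in links.items():
                if dsts:
                    share = damping * rank[src] / len(dsts)
                    for dst in dsts:
                        new[dst] += share
                else:
                    # dangling page: spread its rank evenly over all pages
                    for p in pages:
                        new[p] += damping * rank[src] / len(pages)
            rank = new
        return rank

    # Example: A -> B, A -> C, B -> C, C -> A
    print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))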

~~~
daleharvey
I think an open source Google could be a pretty great project. I would imagine
it's been tried before, but by separating out the steps -- these guys crawl,
other people build indexes, and others handle lookups -- it sounds more
reasonable than one project taking on the whole thing.

~~~
snprbob86
The biggest problem here is hosting the index... in RAM...

It takes A LOT of machines to power a modern search engine which serves any
real amount of traffic. One key component of an open source search engine
would be a sort of peer-to-peer distributed infrastructure. When I suggested
this in an earlier thread, people were quick to point out the liability
concerns... but maybe it could work somehow. And then how do you get people to
sign up for it?

That said, I think this is incredibly interesting stuff. I would really love
to see open source, peer-served web utilities. For example, I'd want access to
many of the components of a web search engine, not just the search results
themselves. Things like a language model for spell checking or word
segmentation. Or a set analysis tool for detecting synonyms.
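
As one small example of the kind of component he means, here is a hypothetical
sketch of unigram-model word segmentation; the counts would come from the
crawled text, and the numbers here are made up:

    import math
    from functools import lru_cache

    # Stand-in counts; in practice these would be accumulated from the dump.
    COUNTS = {"open": 500, "source": 400, "search": 900, "engine": 300}
    TOTAL = sum(COUNTS.values())

    def word_prob(word):
        # unseen words get a small, length-penalized probability
        return COUNTS.get(word, 0.01 / len(word)) / TOTAL

    @lru_cache(maxsize=None)
    def segment(text):
        """Split text into its most probable sequence of words."""
        if not text:
            return ()
        splits = ((text[:i], text[i:]) for i in range(1, len(text) + 1))
        candidates = [(first,) + segment(rest) for first, rest in splits]
        return max(candidates, key=lambda ws: sum(math.log(word_prob(w)) for w in ws))

    print(segment("opensourcesearchengine"))  # ('open', 'source', 'search', 'engine')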

~~~
braindead_in
Ya. That's true. But purely from a data standpoint, having the same index as
Google would be valuable, just because you can do so many things with it. Like
figuring out how to increase the page rank of your site!

------
ojbyrne
It's funny but just before I read this, somebody sent me some pictures of
Barcelona (where I lived back in 2000-2001). I had a bunch of pictures on my
(non-corporate) website from 2000 to about 2003. Rather than digging them up
from somewhere on my various hard drives, I just turned to the internet
archive. And there they were.

It also turned up this error message, forever preserved in amber, so to speak:
[http://web.archive.org/web/20030802202553/www.permafrost.net...](http://web.archive.org/web/20030802202553/www.permafrost.net/gallery/Barcelona/unknown)

------
meatbag
dotbot is one of the data sources that SEOmoz uses for their Linkscape
crawler. Not sure why it was submitted here but I suspect promotional motives.

It is interesting data, but when people build crawlers to index the entire
www, especially if the data is intended as an SEO intelligence tool, certain
issues arise. Some background on this particular issue:
[http://incredibill.blogspot.com/2008/10/seomozs-new-linkscap...](http://incredibill.blogspot.com/2008/10/seomozs-new-linkscape-creates-webmaster.html)

~~~
inovica
Are you sure that Linkscape uses the submitted organisation's bot? I agree that
it could just be promotional -- they are only giving access to some of their
content (around 10% by my calculation) -- but I'm not sure we're talking about
the same bot in this instance.

~~~
meatbag
from <http://www.seomoz.org/linkscape/help/sources>

    * Dotnetdotcom.org
    * Grub.org/Wikia
    * Page-Store.com
    * Amazon/Alexa’s crawl and internet archive resources
    * Exalead’s commercially available data
    * Gigablast’s commercially available data
    * Yahoo!’s BOSS API and other data sources
    * Microsoft’s Live API and other data sources
    * Google’s API and other data sources
    * Ask.com’s API and other data sources
    * Additional crawls from open source, commercial and academic projects

In my experience, the single most useful feature (main selling point) of the
Linkscape tool is that it reports http status codes (for a price) so SEOs can
detect 301 redirects, etc. AFAIK, dotnetdotcom.org has the only free, publicly
available crawl data which also includes http status codes. Not sure about
Exalead and Gigablast, but I am pretty sure the other SEs don't release this
information. To clarify: I don't have any proof, and things may have changed,
but I've read some intelligent speculation (from people smarter than me) which
claims that dotbot/dotnetdotcom.org provides the majority of the data
(especially the unique info, like status codes) for the Linkscape tool.
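
For context, "detecting 301 redirects" just means seeing the raw status code
before any redirect is followed; a hypothetical check with the requests library
(the URL is a placeholder):

    import requests

    # Fetch without following redirects so the original status code is visible.
    resp = requests.get("http://example.com/old-page", allow_redirects=False)
    print(resp.status_code, resp.headers.get("Location"))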

------
thorax
search.wikia.com uses Grub: <http://search.wikia.com/about/crawl.html>

And that index is also open for download, though I haven't looked much into
it.

------
parenthesis
Do check out the "Dotbot Spider Statistics" table at the bottom:

    # of Tubes Found Clogged    7188420

~~~
CalmQuiet
Yes: they _do_ have a sense of humor.

And: are thoughtful enough to include typos, because (as we all know) some
people appreciate the opportunity to find errors). [ "...discussion of
girlfriend/boyfriend/husband/wife issues are stickily prohibited." ]

------
YoavShapira
It's good to see activity in this space, with more people and offerings (even
if some of them are semi-dodgy). It's been too quiet for a while, with just
the major search engines and other big players doing their own proprietary
indices.

------
braindead_in
Cool idea. They should add a BitTorrent link for downloading.

~~~
jonursenbach
There's one there already.

~~~
KrisJordan
Is their tracker working for anyone? Not so much here.

~~~
soult
The tracker is down, but I found some peers by adding open trackers and
activating DHT:

* <http://tracker.thepiratebay.org:80/announce>

* <http://denis.stalker.h3q.com:6969/announce>

* <http://tracker.soultcer.net:80/announce>
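
A hypothetical way to carry those extra trackers along when sharing the torrent
is a magnet link that lists them as tr parameters (the infohash below is a
placeholder):

    from urllib.parse import urlencode

    infohash = "INFOHASH_OF_THE_DUMP_TORRENT"  # placeholder
    trackers = [
        "http://tracker.thepiratebay.org:80/announce",
        "http://denis.stalker.h3q.com:6969/announce",
        "http://tracker.soultcer.net:80/announce",
    ]
    params = [("xt", "urn:btih:" + infohash)] + [("tr", t) for t in trackers]
    print("magnet:?" + urlencode(params))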

~~~
Zev
You found the peers through DHT. Randomly adding trackers doesn't really do
much to help the torrent unless _others_ have added the same trackers as well,
which is unlikely if you added them on your own.

~~~
soult
Since the first two are quite common "open trackers" (where open refers to the
fact that they track any hash you submit), you are quite likely to find sources
there, because other people add them as well when they want more sources.

Furthermore, my comment will lead more people who might not have a DHT-
supporting client to add those open trackers.

~~~
jonursenbach
Which torrent clients out there don't support DHT these days?

------
Allocator2008
Awesome idea. I like the flat-file-based indexing they mentioned. Gosh, I
almost wish they had a link whereby one could send them one's C.V.! Keep up the
good work.

------
ajkirwin
Download the internet!

Or, at least, whatever doesn't block robots. :/

~~~
CalmQuiet
Yes: doesn't seem like it's going to be very "deep web", does it?

