

Share code that uses new URL Search tool and win AWS credit - LisaG
http://commoncrawl.org/url-search-tool/

======
djoerd
While I know that some of the pages of my home page are in the crawl, they do
not show up with the following query:
[http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F...](http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F~hiemstra)
nor with:
[http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F...](http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F%7Ehiemstra)
(no, this is not only an ego search problem ;-) )

~~~
anjackson
Yes, that's a little odd. If you shorten the search term, the results come up
just fine:
[http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome/~h...](http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome/~h&start=60)

------
LisaG
I hope that some of you who use/play around with the Common Crawl data will
try out using the JSON files from the URL Search and then share your code.

If you didn't see the details in the blog post, Common Crawl is giving out
$100 in AWS credit to the first five people who share code that incorporates a
JSON file from the URL Search.

~~~
visarga
Is it possible to get a list of webhosts, like all the domains and subdomains,
stripped from the rest of the url?
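
One way to get such a list yourself, once you have a set of URLs from the search results, is to strip everything but the hostname. This is a minimal sketch using only the Python standard library; the `unique_hosts` helper and the sample list are made up for illustration and say nothing about how the URL Search output is actually structured.

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return the sorted set of hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        # .hostname is lowercased and excludes any port or user info
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


# Hypothetical input: any iterable of absolute URLs will do.
sample = [
    "http://commoncrawl.org/url-search-tool/",
    "http://urlsearch.commoncrawl.org/?q=com.google",
    "http://urlsearch.commoncrawl.org/",
]
print(unique_hosts(sample))
```

Strings without a scheme and host are skipped, so the result only contains real hostnames.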

------
LisaG
From @djoerd Why does @CommonCrawl URL search
(<http://urlsearch.commoncrawl.org/> ) need 'tld.domain' format rather than
'domain.tld'? Read Google's BigTable paper.
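
The BigTable paper stores pages under reversed hostnames so that pages from the same domain end up next to each other in a sorted key space. A small sketch of that effect, with a hypothetical `reverse_host` helper and made-up hostnames:

```python
def reverse_host(host):
    """Reverse the dot-separated labels: 'maps.google.com' -> 'com.google.maps'."""
    return ".".join(reversed(host.split(".")))


# Made-up hostnames, only to show the ordering.
hosts = [
    "maps.google.com",
    "example.org",
    "www.google.com",
    "google.com",
    "news.example.org",
]

# Natural order: hosts of the same domain are scattered.
print(sorted(hosts))

# Reversed order: each domain and its subdomains form one contiguous run.
print(sorted(reverse_host(h) for h in hosts))
```

With reversed keys, a lookup for everything under a domain becomes a single prefix scan over the sorted keys.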

------
frederi
Why can't they just write code that reverses the input?

~~~
srobertson
The main intent of the search is to retrieve a list of URLs that were crawled
for a given subdomain, domain, or TLD. For now you can do that using reversed
URL notation, which I admit is not very intuitive.

We're toying with the idea of implementing some sort of wildcard so that we
can present the URLs in natural order, something like *.google.com to retrieve
all URLs under google. But we wanted to judge the level of interest first.
After all, "done" is better than "perfect".
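
Until something like that exists, the translation can be done on the client side. The sketch below is an assumption about how such a wildcard might map to the current query format, not a description of the tool: `wildcard_to_reversed_prefix` and `search_url` are hypothetical helpers, and only the `?q=` parameter is taken from the search links posted in this thread.

```python
from urllib.parse import urlencode

# Base URL as it appears in the links posted in this thread.
SEARCH_BASE = "http://urlsearch.commoncrawl.org/"


def wildcard_to_reversed_prefix(pattern):
    """Turn '*.google.com' (or 'google.com') into the reversed form 'com.google'."""
    host = pattern[2:] if pattern.startswith("*.") else pattern
    return ".".join(reversed(host.lower().split(".")))


def search_url(pattern):
    """Build a search link for a natural-order host or wildcard pattern."""
    query = urlencode({"q": wildcard_to_reversed_prefix(pattern)})
    return SEARCH_BASE + "?" + query


print(wildcard_to_reversed_prefix("*.google.com"))
print(search_url("*.google.com"))
```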

~~~
LisaG
"Done" is better than "perfect" should be on a sign hanging in every startup.

------
lubujackson
I'd love it if there were a feature to search for a specific URL. For example,
putting "com.google" in quotes would just load the Google homepage.

~~~
srobertson
good suggestion

------
lubujackson
Top results for "com" are a little odd. Seems like @ wasn't filtered from the
domain part of the URL (though it should be, I would think).

~~~
srobertson
I think it's just that the site converts Unicode URLs for display purposes. If
you click on one of the links with "@" in it, you'll see the real URL in
URL-encoded format:

<http://%2E%2E%2E@harunyahya.com/Ajax/get.comments/oid/4612> 176.34.181.212
20120516214328 text/html 912
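
To see what is going on in a record like this, it can be split and decoded with the standard library. This is only a sketch: the field names (`ip`, `stamp`, `mime`, `last`) are guesses based on what the values look like, not documented column names.

```python
from datetime import datetime
from urllib.parse import unquote, urlsplit

# The record from above, with the angle brackets removed.
line = ("http://%2E%2E%2E@harunyahya.com/Ajax/get.comments/oid/4612 "
        "176.34.181.212 20120516214328 text/html 912")

# Assumed layout: URL, IP address, timestamp, MIME type, one trailing number.
url, ip, stamp, mime, last = line.split()

parts = urlsplit(url)
print(parts.hostname)            # the actual host, after the "@"
print(unquote(parts.username))   # the part before "@", percent-decoded
print(unquote(url))              # the URL as shown in decoded form
print(datetime.strptime(stamp, "%Y%m%d%H%M%S"))  # if it is a YYYYMMDDhhmmss timestamp
```

Under standard URL syntax, the text before "@" is user info rather than part of the domain, which would explain why these entries sort to the top for "com".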

------
djoerd
The first FAQ link seems to be broken (maybe a web server setting gone bad?)
BTW, this is a great resource. Thanks for sharing this!

~~~
LisaG
Thanks for the catch, Djoerd! We will fix it now.

