Share code that uses new URL Search tool and win AWS credit (commoncrawl.org)
on Mar 5, 2013

While I know that some of the pages of my home page are in the crawl, they do not show up with the following query: http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F... nor with: http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F... (no, this is not only an ego search problem ;-) )

Yes, that's a little odd. If you shorten the search term, the results come up just fine: http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome/~h...

Works for me!


So there is a bug, but it does not affect every query containing ~.

I hope that some of you who use/play around with the Common Crawl data will try out using the JSON files from the URL Search and then share your code.

If you didn't see the details in the blog post, Common Crawl is giving out $100 in AWS credit to the first five people who share code that incorporates a JSON file from the URL Search.

Is it possible to get a list of webhosts, like all the domains and subdomains, stripped from the rest of the url?
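One way to do this yourself, once you have a list of URLs in hand, is to parse each URL and keep only the hostname. This is a minimal sketch using Python's standard library; the function name `unique_hosts` is just illustrative, and it assumes the URLs are plain strings in normal (not reversed) form.

```python
from urllib.parse import urlsplit

def unique_hosts(urls):
    """Return the sorted set of hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        # .hostname drops the scheme, userinfo, port, path and query,
        # and lowercases the result.
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)

if __name__ == "__main__":
    sample = [
        "http://www.example.com/a",
        "http://www.example.com/b?x=1",
        "https://blog.example.org:8080/",
    ]
    print(unique_hosts(sample))
```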

From @djoerd: "Why does @CommonCrawl URL search (http://urlsearch.commoncrawl.org/) need 'tld.domain' format rather than 'domain.tld'?" Read Google's BigTable paper.

Why can't they just write code that reverses the input?

The main intent of the search is to retrieve the list of URLs in the crawl for a given subdomain, domain, or TLD. For now you can do that using reversed URL notation, which I admit is not very intuitive.
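Until the search accepts natural order, converting a URL to the reversed notation by hand is straightforward. The sketch below is only a guess at the shape of the key (reversed hostname labels followed by the path, as in the queries earlier in this thread); `to_reversed_key` is a made-up helper, not part of any Common Crawl tool.

```python
from urllib.parse import urlsplit

def to_reversed_key(url):
    """Turn 'www.example.com/page' into 'com.example.www/page'."""
    # urlsplit only recognises the host if a scheme is present,
    # so add one when the caller typed a bare domain.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + parts.path

if __name__ == "__main__":
    print(to_reversed_key("google.com"))
    print(to_reversed_key("http://www.example.com/index.html"))
```

With the labels reversed, every URL under one domain shares a common string prefix, which is why a prefix lookup can return a whole domain or TLD at once.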

We're toying with the idea of implementing some sort of wildcard so that we can present the URLs in natural order, something like *.google.com to retrieve all URLs under google. But we wanted to gauge the level of interest first. After all, "done" is better than "perfect".
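A wildcard like that could be layered on top of the existing reversed lookup rather than replacing it. The following is a rough sketch of the idea, not how the site works: `wildcard_to_prefix` and `matches` are hypothetical names, and the matching rule (prefix must end at a label or path boundary) is an assumption.

```python
def wildcard_to_prefix(pattern):
    """Convert '*.google.com' (or 'google.com') to the prefix 'com.google'."""
    if pattern.startswith("*."):
        pattern = pattern[2:]
    labels = pattern.lower().split(".")
    return ".".join(reversed(labels))

def matches(reversed_key, prefix):
    """True if reversed_key is the prefix itself or lies beneath it."""
    if not reversed_key.startswith(prefix):
        return False
    rest = reversed_key[len(prefix):]
    # Require a label or path boundary so that 'com.google' does not
    # accidentally match 'com.googleplex'.
    return rest == "" or rest[0] in "./"

if __name__ == "__main__":
    prefix = wildcard_to_prefix("*.google.com")
    for key in ["com.google/", "com.google.mail/inbox", "com.googleplex.www/"]:
        print(key, matches(key, prefix))
```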

"Done" is better than "perfect" should be on a sign hanging in every startup.

Just an oversight. Most of our work is done by people who graciously volunteer their time. We'll get that fixed.

I'd love it if there were a feature to search for a specific URL. For example, putting "com.google" in quotes could load just the Google homepage.

Good suggestion.

Top results for "com" are a little odd. Seems like @ wasn't filtered from the domain part of the URL (though it should be, I would think).

I think it's just that the site converts Unicode URLs for display purposes. If you click on one of the links with "@" in it, you'll see the real URL in URL-encoded format:

http://%2E%2E%2E@harunyahya.com/Ajax/get.comments/oid/4612 20120516214328 text/html 912
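For what it's worth, a standard URL parser treats everything before the "@" as userinfo rather than as part of the domain, which would explain why such entries sort oddly if the raw string is used. A small sketch with Python's standard library (the helper name `describe_url` is made up for illustration):

```python
from urllib.parse import urlsplit, unquote

def describe_url(url):
    """Split a URL into decoded userinfo, hostname and path."""
    parts = urlsplit(url)
    return {
        # The text before '@' is userinfo; percent-decode it for display.
        "userinfo": unquote(parts.username) if parts.username else None,
        "host": parts.hostname,
        "path": parts.path,
    }

if __name__ == "__main__":
    print(describe_url("http://%2E%2E%2E@harunyahya.com/Ajax/get.comments/oid/4612"))
```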

The first FAQ link seems to be broken (maybe a web server setting gone bad?) BTW, this is a great resource. Thanks for sharing this!

Thanks for the catch, Djoerd! We will fix it now.

