

Tell HN: I'm giving away a sample dataset -- URI scheme frequency - coderdude

Hi everyone,<p>I've released a small sample dataset based on a 7.5MM-page crawl. The dataset lists how often each URI scheme appears in the anchor element's href attribute. The list was truncated at 300 URI schemes (for this sample) and is available in TSV format. The columns are: uri_scheme, frequency, unique_pages, unique_domains.<p>http://webscaled.com/static/samples/uri_scheme_part-00000.tsv<p>Please keep in mind that this is rough data. This file came straight from a MapReduce job. The values have not been classified (as they will be in the full version of the dataset), and a few are just failed attempts to write "http:".<p>You are free to do whatever you please with this dataset. I just ask that you reference http://webscaled.com/ somewhere. Also, send me an email from the Webscaled contact form if you're interested in the sample dataset for top Doctypes.<p>Edit: Probably best to release this under a license: Creative Commons Attribution-Share Alike 3.0 United States License.<p>Thanks for your time.
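
For anyone who wants to poke at the file, here is a minimal sketch of reading the four columns named in the post and pulling out the most frequent schemes. It assumes the file is plain tab-separated text with no quoting; whether it has a header row is unknown, so rows whose counts are not integers are simply skipped. The names `SchemeRow`, `parse_rows` and `top_schemes` are made up for this example, and the sample rows are invented values, not figures from the dataset.

```python
import csv
from typing import Iterable, List, NamedTuple


class SchemeRow(NamedTuple):
    # Column names as listed in the post.
    uri_scheme: str
    frequency: int
    unique_pages: int
    unique_domains: int


def parse_rows(lines: Iterable[str]) -> List[SchemeRow]:
    """Parse tab-separated lines into SchemeRow records.

    Assumption: four columns per row. Rows with a different column
    count or non-integer counts (e.g. a header row) are skipped.
    """
    rows = []
    for fields in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(fields) != 4:
            continue
        scheme, *counts = fields
        try:
            freq, pages, domains = (int(c) for c in counts)
        except ValueError:
            continue
        rows.append(SchemeRow(scheme, freq, pages, domains))
    return rows


def top_schemes(rows: List[SchemeRow], n: int = 30) -> List[SchemeRow]:
    """Return the n rows with the highest frequency, largest first."""
    return sorted(rows, key=lambda r: r.frequency, reverse=True)[:n]


# Invented sample rows, for illustration only.
sample = [
    "http\t1000\t900\t300\n",
    "mailto\t200\t150\t120\n",
    "javascript\t500\t400\t100\n",
    "broken line\n",
]
rows = parse_rows(sample)
top = top_schemes(rows, n=2)
```

To run it against the real file, pass an open file object instead of the list, e.g. `parse_rows(open("uri_scheme_part-00000.tsv", encoding="utf-8"))`, assuming the file has been downloaded locally under that name.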
======
jacquesm
I've looked at the data, and 'rough' seems like a pretty good description.
Cleaning it up shouldn't take more than a couple of minutes, though.

Most of the misspellings are for http and for mailto (with matilto as the
funniest), and there are a bunch of entries that are clearly nonsense and can
be lost without any penalty.
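
A rough sketch of that kind of cleanup, assuming fuzzy string matching from Python's standard `difflib` is good enough to fold near-misses into `http` and `mailto`. The `KNOWN` list, the 0.8 cutoff and the function names are assumptions picked for illustration, not anything taken from the dataset, and the counts in the example are invented.

```python
import difflib
from collections import Counter
from typing import Iterable, Tuple

# Schemes that misspellings get folded into (per the comment above).
CANONICAL = ("http", "mailto")

# Hypothetical allow-list of valid schemes that must be left alone,
# so that e.g. "https" is not mistaken for a typo of "http".
KNOWN = frozenset({"http", "https", "mailto", "ftp", "javascript"})


def canonicalize(scheme: str, cutoff: float = 0.8) -> str:
    """Map a near-miss spelling to 'http' or 'mailto'; else return it cleaned."""
    s = scheme.strip().lower().rstrip(":")
    if s in KNOWN:
        return s
    match = difflib.get_close_matches(s, CANONICAL, n=1, cutoff=cutoff)
    return match[0] if match else s


def merge_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Sum frequencies after canonicalizing each scheme name."""
    merged = Counter()
    for scheme, freq in pairs:
        merged[canonicalize(scheme)] += freq
    return merged


# Invented counts, for illustration only.
merged = merge_counts([("http", 10), ("htp", 2), ("matilto", 1), ("mailto", 5)])
```

Only the frequency column is merged here. The unique_pages and unique_domains columns cannot be summed this way, since the same page or domain may contain both the correct and the misspelled scheme. Entries that are plain nonsense would still need to be dropped by hand or with a separate allow-list.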

aol: is listed 75 times. One could only wish it were that easy to get
rid of aol :)

~~~
coderdude
Certainly, I could have eyed it and removed a ton of the crap. I figured most
people would be interested in the top 30 or so anyway, and this also serves to
give people an idea of just how dirty the Web really is. ;)

The final dataset will contain MUCH less noise.

