

Ask HN: Would you buy datasets of Web authoring methods & the Web's structure? - coderdude

I want to sell datasets that I create from doing regular Web crawls. An example dataset would be the link graph. I would sell the link graph in chunks of 1 billion edges for a price. The dataset would include the source and destination URLs, the anchor and title text of the link, any rel or rev values, and so on. Other datasets would include the top 1,000 [HTML editors, CMSs, forum and blog software, etc.], big lists of sites using X technology (AdSense, Feedburner chiclet, etc.), how many sites are using which advertising platform, and so on.

Are you personally interested in such datasets? Are there certain niches that would benefit from the data that I should be targeting?

Edit: I should note that I've already done a tremendous amount of work on this project. For example, I've completed each of the datasets I listed above from a [rather small] 10 million page crawl. I'm also running a Hadoop cluster with HBase and I'm using MapReduce for data processing.
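
To make the link graph fields listed above concrete, here is a rough sketch of what one edge record might look like and how it could be pulled out of a page. This is only an illustration using Python's standard library, not the actual pipeline (which runs on Hadoop/MapReduce); the names `LinkEdge`, `LinkExtractor`, and `extract_edges` are made up for the example.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class LinkEdge:
    """One edge of the link graph (hypothetical record layout)."""
    source: str
    destination: str
    anchor_text: str
    title: str
    rel: str
    rev: str


class LinkExtractor(HTMLParser):
    """Collects one LinkEdge per <a href=...> element in a page."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.edges = []
        self._open = None  # attributes of the <a> currently being read
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if attrs.get("href"):  # skip named anchors without a target
                self._open = attrs
                self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            a = self._open
            self.edges.append(LinkEdge(
                source=self.source_url,
                # resolve relative hrefs against the page URL
                destination=urljoin(self.source_url, a["href"]),
                # collapse runs of whitespace in the anchor text
                anchor_text=" ".join("".join(self._text).split()),
                title=a.get("title") or "",
                rel=a.get("rel") or "",
                rev=a.get("rev") or "",
            ))
            self._open = None


def extract_edges(source_url, html):
    parser = LinkExtractor(source_url)
    parser.feed(html)
    parser.close()
    return parser.edges
```

Calling `extract_edges(page_url, page_html)` returns a list of `LinkEdge` records, one per link that has an `href`. A real crawl would also need to deal with `<base>` tags, malformed markup, and URL normalization, which this sketch ignores.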
======
olalonde
What would be interesting would be to let users hook their own custom
functions into your web crawler. In other words, whenever you crawl a link,
you feed the link to the developer's script, which does whatever it wants with
it. I've had plenty of ideas that would involve a web crawler, but I just
don't have the time to customize an open source crawler. There are too many
complications, like not falling into infinite loops (such as calendar-type
pages).
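
For what it's worth, the idea could be sketched roughly like this: a breadth-first crawl loop that hands every discovered link to a user-supplied callback. Everything here is hypothetical (`crawl`, `fetch_links`, `on_link`, and the default limits are names and values I made up). Note that a visited set alone does not stop calendar-style traps, since every "next month" URL is new, so the sketch also caps depth and total pages.

```python
from collections import deque
from urllib.parse import urldefrag


def crawl(seeds, fetch_links, on_link, max_depth=3, max_pages=1000):
    """Breadth-first crawl that calls on_link(source, destination) per edge.

    fetch_links(url) is a hypothetical function that returns the outgoing
    URLs of a page; on_link is the developer's custom hook.
    Returns the number of pages visited.
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        url = urldefrag(seed).url  # drop #fragments so duplicates collapse
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    pages = 0
    while queue and pages < max_pages:  # page budget bounds the whole crawl
        url, depth = queue.popleft()
        pages += 1
        for link in fetch_links(url):
            link = urldefrag(link).url
            on_link(url, link)  # the developer's script sees every edge
            # depth cap keeps endless "next page" chains from growing forever
            if link not in seen and depth < max_depth:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

A developer would then pass in something as small as `lambda src, dst: print(src, "->", dst)` as `on_link`. A production crawler would additionally need politeness delays, robots.txt handling, and per-host limits.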

~~~
coderdude
I have something in mind along those lines, and I'm going to provide at least
two or three ways to create custom datasets from my 'snapshot of the Web.' I
thought about your suggestion of allowing customers to execute their own
scripts over the snapshot but the security issues involved are my big concern.
I would need a way to sandbox it. My plan right now is to offer the ability to
run simple data tasks on the snapshot. For example, regex (filtered for things
like SSN matchers, and so on), a language with a limited vocabulary for
grabbing features with certain textual or visual characteristics, or even
something as simple as xpath. I would run these tasks on the cluster and the
result would be returned to the customer.
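
As a rough sketch of the filtered-regex idea, the following rejects a customer pattern if it fully matches any of a few probe strings that look like sensitive data, then runs accepted patterns over the snapshot. This is a crude heuristic for illustration only, not how the service is implemented; `SENSITIVE_SAMPLES`, `is_allowed`, and `run_regex_task` are made-up names, and the snapshot is assumed to be a plain dict of URL to page text.

```python
import re

# Hypothetical probe strings shaped like US Social Security numbers.
SENSITIVE_SAMPLES = ["123-45-6789", "078-05-1120"]


def is_allowed(pattern):
    """Reject patterns that fully match any sensitive-looking probe string."""
    compiled = re.compile(pattern)
    return not any(compiled.fullmatch(sample) for sample in SENSITIVE_SAMPLES)


def run_regex_task(pattern, pages):
    """Run an accepted pattern over {url: text}; return matches per URL."""
    if not is_allowed(pattern):
        raise ValueError("pattern rejected: it matches sensitive-looking data")
    compiled = re.compile(pattern)
    results = {}
    for url, text in pages.items():
        matches = [m.group(0) for m in compiled.finditer(text)]
        if matches:  # only report pages with at least one hit
            results[url] = matches
    return results
```

For example, a pattern like `pub-\d+` would be accepted while `\d{3}-\d{2}-\d{4}` would be rejected. A probe-string check like this is easy to evade, so it could only ever be one layer of the filtering, alongside limits on regex running time.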

------
il
Send me an email at ilya [at] unviral.com about this, I would be very
interested in talking about applying this for tracking ad campaigns and
correlating with some of the data I have been collecting.

There's definitely a lucrative market for this if your crawling and data can
be targeted enough.

------
coderdude
Update: I created a landing page for the time being: <http://webscaled.com/>

