

Ask HN: You must build a backlink index of the web. How do you do it? - rubyrescue

I was using Duck Duck Go today, and decided this is a good question for the community.

The question: Say you have to build a backlink index of most (yes, intentionally vague) of the web that is updated relatively frequently (at least every 30 days). A full scrape of even a reasonable slice of the web using 80legs (at least under the old pricing model) would cost more than USD $200k. Yet many people are building products, from search engines to SEO tools, that need and use this data.

How are they getting it, and how would you get it? Would you buy this data or build it yourself, from the perspective of:

- The CEO/entrepreneur who has to pay for the creation or acquisition of this index and the engineers to maintain it.
- The CTO who has to build the index or integrate with someone else's.

It's currently easy to get around Yahoo Site Explorer's per-IP/day limits, but word on the street is that it's going away soon. Aside from Yahoo Site Explorer, I'm also aware of Majestic SEO, SEOmoz, lots of desktop SEO apps, and then more "bare metal" services like 80legs. Are there other ideas, tools, or technologies I'm missing?
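For concreteness, the core data structure being asked about is just an inverted link graph: crawlers emit (source page, outbound links) records, and the index maps each target URL back to the pages that link to it. A minimal sketch in Python, using toy URLs as placeholders (real systems would shard this across machines and normalize URLs first):

```python
from collections import defaultdict

# Toy crawl output: each crawled page maps to the outbound links found on it.
crawled_pages = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/"],
}

def build_backlink_index(pages):
    """Invert the outbound-link graph: target URL -> set of linking pages."""
    backlinks = defaultdict(set)
    for source, outlinks in pages.items():
        for target in outlinks:
            backlinks[target].add(source)
    return backlinks

index = build_backlink_index(crawled_pages)
print(sorted(index["http://c.example/"]))
# → ['http://a.example/', 'http://b.example/']
```

The expensive part is not this inversion but acquiring and refreshing the crawl on the left-hand side every 30 days, which is what drives the $200k-scale costs mentioned above.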
======
randfish
We did exactly this at SEOmoz: we built an index that's primarily focused on
link data. It's accessible through a free API - www.seomoz.org/api - up to 1
million calls a month free, with more at fairly low rates. Our index isn't
quite Google-sized; our last crawl (updated once per month) was 65 billion
URLs, which we estimate is between 50% and 65% of Google's main index.

I'm happy to answer any other questions you've got about it if you'd like.
Feel free to email me (rand at seomoz dot org).
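For anyone evaluating that API, requests were authorized with signed URLs. A rough sketch of building one, assuming the AccessID/Expires/HMAC-SHA1 Signature scheme described in the docs of that era (the endpoint path, signing string, and parameter names here are my recollection, not gospel; credentials are placeholders):

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def signed_url_metrics_request(target_url, access_id, secret_key, ttl=300):
    """Build a signed request URL for the Linkscape url-metrics endpoint.

    Assumed scheme: sign "AccessID\nExpires" with HMAC-SHA1 of the
    member's secret key, then pass AccessID, Expires, and the base64
    signature as query parameters.
    """
    expires = int(time.time()) + ttl
    to_sign = f"{access_id}\n{expires}".encode()
    signature = base64.b64encode(
        hmac.new(secret_key.encode(), to_sign, hashlib.sha1).digest()
    ).decode()
    query = urllib.parse.urlencode(
        {"AccessID": access_id, "Expires": expires, "Signature": signature}
    )
    encoded_target = urllib.parse.quote(target_url, safe="")
    return f"http://lsapi.seomoz.com/linkscape/url-metrics/{encoded_target}?{query}"

print(signed_url_metrics_request("www.example.com/", "member-xxxx", "your-secret"))
```

Fetching the returned URL with any HTTP client would yield a JSON blob of link metrics for the target page, counted against the monthly call quota mentioned above.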

------
jdrock
Are you trying to build your own index or do you just want to make a limited
number of calls to someone else's index?

~~~
rubyrescue
Right now, we could use between 100k and 1MM queries/day. If we had enough
capital, we'd love to find a way to gather and productize the data. We think
there are a lot of people who could use it, but that's a lot of work and a
huge up-front cost.

We're in the position of not needing enough volume to justify building our
own index, but not having enough revenue to afford licensing a full one
either.

~~~
jdrock
It's always tough to be in the in-between stage. Providing an index is
something we'll probably consider in the near future. We surveyed our users
and found that a fair number would like an index of our/their crawled data.

