
The woes of building an index of the web - jennita
https://moz.com/blog/mozscape-index-2015
======
ChuckMcM
This is a really great description of why building crawlers (and indexes) is a
really hard problem. Basically 90% of the "web" is now crap, and by crap I
mean stuff you would never ever want to visit as a real human being. Our
crawler once found an entire set of subdomains with nothing but Markov-chain-
generated "forum" pages, and of course SEO links for PageRank love (note to
SEO types: this hasn't fooled Google for at least 6 years).

The explosion of cheap CPU and storage means that a single server with a few
terabytes of disk can serve up a billion or more spam pages. And seemingly
everyone who gets into the game starts with "I know, we'll create a lot of web
sites that link to this thing I'm trying to get to rank in Google results ..."
Worse, when it doesn't work they don't bother taking that crap down; they just
link to it from more and more other sites in an attempt to improve its host
authority. That doesn't work either (for getting PageRank).

But what it means is that 99.9% of all new web pages created on a given day
are created by robots or algorithms or other agencies without any motive to
provide value, merely to provide "inventory" for advertisements. You are lucky
if you can pull a billion "real" web pages out of a crawl frontier of 100
billion URIs.

~~~
bambax
Wow, very interesting comment, thanks! Wouldn't it make sense to build (and
maintain) a kind of "official" reference of all pure spam domains? Or does
this list already exist?

~~~
ChuckMcM
Well, every crawler has to have this list; the Blekko crawler tries to keep
these pages out of the index (with varying levels of success). But it's not
particularly useful for non-crawlers, and since every crawler will have its
own way of evaluating hosts (possibly uniquely), it isn't really portable.
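
As a rough illustration of what that host-level evaluation looks like from a
crawler's side, here is a minimal sketch in Python; the host names, scores,
and threshold are hypothetical, not Blekko's actual pipeline:

    from urllib.parse import urlsplit

    # Hypothetical per-host reputation scores produced elsewhere in the
    # pipeline (link patterns, content signals); 0.0 = pure spam, 1.0 = clean.
    host_scores = {
        "example-blog.com": 0.92,
        "markov-forum-farm.net": 0.03,
    }

    SPAM_THRESHOLD = 0.2  # hosts scoring below this stay out of the index

    def keep_url(url):
        """Return True if the URL's host looks reputable enough to crawl."""
        host = urlsplit(url).hostname or ""
        # Unknown hosts get a neutral default so new sites still get crawled.
        return host_scores.get(host, 0.5) >= SPAM_THRESHOLD

    frontier = [
        "http://example-blog.com/post/42",
        "http://markov-forum-farm.net/thread/99871",
    ]
    crawlable = [u for u in frontier if keep_url(u)]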

That said, if you have ever wondered why domains that used to have web sites
on them suddenly become huge spam havens, it is because spammers buy up the
domain as soon as it expires and try to exploit its previous reputation as a
non-spam site to push link authority into some crawl (generally Google's).

------
greglindahl
Note that they're building a graph of the web for SEO purposes, not a search
engine index.

~~~
b4hand
Our index and API are primarily used by web marketers. We have no interest in
building a general search engine, but there's nothing about the index that is
specific to SEO.

For the most part, we just provide general facts about the web, and we've been
contacted by academics on more than one occasion for data sets.

~~~
greglindahl
Is that kind of data called an "index" by any community, though? Seems quite
confusing to me, and you can see from some of the other folks commenting here
that it's confusing.

~~~
b4hand
We do have a traditional inverted index over anchor text. That's just one
aspect of our data set though.
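
For readers unfamiliar with the term, an inverted index over anchor text maps
each word that appears in link anchors to the URLs those links point at. A
minimal sketch of the idea with made-up links, not Moz's implementation:

    from collections import defaultdict

    # (anchor_text, target_url) pairs extracted from crawled pages
    links = [
        ("seo tools", "https://moz.com/tools"),
        ("free seo tools", "https://moz.com/tools"),
        ("index blog post", "https://moz.com/blog/mozscape-index-2015"),
    ]

    # term -> set of URLs whose incoming anchors contain that term
    inverted = defaultdict(set)
    for anchor, target in links:
        for term in anchor.lower().split():
            inverted[term].add(target)

    print(inverted["seo"])  # {'https://moz.com/tools'}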

Originally, that was a larger focus and that's the name we've always called
it. I could see how that might be confusing.

However, I do think the common name for the data we collect is "backlink
index," at least in our industry.

~~~
greglindahl
Ah, that's pretty neat that you have an inverted index in there. I'm about to
build one for the Wayback machine.

------
sqldba
I read the whole thing and did a search and still don't know what the index
is, what it's for, or how to use it.

~~~
elsewhen
They have an index of the web intended to mimic Google's index, which they use
to provide reports to marketers. With their service, a search engine optimizer
can answer questions such as: how many incoming links does this URL have?
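
Under the hood, answering that question from a link graph is just an
aggregation over crawled edges. A minimal sketch with a made-up edge list, not
the actual Mozscape API:

    from collections import Counter

    # (source_url, target_url) edges extracted from a crawl
    edges = [
        ("http://a.example/page1", "http://target.example/"),
        ("http://b.example/post", "http://target.example/"),
        ("http://a.example/page2", "http://other.example/"),
    ]

    incoming = Counter(target for _, target in edges)
    print(incoming["http://target.example/"])  # -> 2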

~~~
b4hand
We provide a public-facing API that can answer lots of questions about URLs,
FQDNs, and root domains. The example you gave is one of them. We calculate a
large number of statistics about every URL, FQDN, and root domain that we know
about on the web.

We also calculate MozRank, which is our version of Google's PageRank, as well
as some in-house higher-level metrics like PageAuthority and DomainAuthority,
which are machine learning models derived from all of the other metrics we
compute.
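
Since MozRank is described as a version of PageRank, a textbook power
iteration over a link graph gives the flavor; this is the generic algorithm,
not Moz's actual formula or parameters:

    def pagerank(graph, damping=0.85, iterations=50):
        """graph: dict mapping each page to the list of pages it links to."""
        pages = list(graph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in graph.items():
                if not outlinks:
                    # Dangling page: spread its rank evenly over all pages.
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
                else:
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share
            rank = new_rank
        return rank

    # Toy graph: a -> b, c; b -> c; c -> a
    print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))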

