
Trolling the search engines - Buetol
http://blog.dam.io/trolling-the-search-engines/
======
ChuckMcM
This is a common grey/black hat SEO trick: basically, create URL text that
matches as many phrases as possible, and, as demonstrated, it's easy to do
with pretty simple CGI scripts and a bit of link seeding.

The theory behind it is that if your site has a lot of relevant "targets" then
it must be more important than a site that has only a few targets. (Consider
wikipedia as the poster child for this)

When people write naive web spiders in an attempt to create their own crawler,
sites like this 'trap' them in an infinite web of apparently unique links.
Always a good idea to stop after a few hundred and kick it back to a human to
see what is up with that :-)
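That "stop after a few hundred and kick it back to a human" guard takes only a few lines. A minimal sketch of a per-host page cap (the class, names, and the 300-page threshold are all illustrative, not from the comment):

```python
from collections import defaultdict
from urllib.parse import urlparse

class CrawlGuard:
    """Stop crawling a host once it yields suspiciously many unique URLs."""

    def __init__(self, max_pages_per_host=300):
        self.max_pages = max_pages_per_host
        self.seen = defaultdict(set)   # host -> URLs already accepted
        self.flagged = set()           # hosts kicked back for human review

    def should_crawl(self, url):
        host = urlparse(url).netloc
        if host in self.flagged or url in self.seen[host]:
            return False
        self.seen[host].add(url)
        if len(self.seen[host]) > self.max_pages:
            self.flagged.add(host)     # likely a trap: defer to a human
            return False
        return True
```

A trap site serving endless "unique" links blows past the cap almost immediately, while a normal site of a few hundred pages never trips it.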

------
solox3
> Can’t believe that: No one did it before me (to my knowledge)

This has certainly been done before:
[http://en.wikipedia.org/wiki/Spider_trap](http://en.wikipedia.org/wiki/Spider_trap)

~~~
Buetol
Thanks, corrected it

~~~
ozh
I had the same "fun" idea 6 years ago. In 3 months Yahoo Site Explorer
reported 150 _million_ pages indexed. As of today Google still shows 55
million pages (in comparison,
[https://www.google.com/search?q=site:http://en.wikipedia.org...](https://www.google.com/search?q=site:http://en.wikipedia.org/)
reports 34 million pages).

I had to kill the experiment (no more new "pages" crawled) because of the CPU
load and bandwidth costs, even with robots throttled.

------
dchuk
This is nothing new; anyone in the SEO world knows this has been possible
for years. It's not particularly long-lasting as a strategy, but it can be
leveraged for things like link spam via massive amounts of cloaking sites.

~~~
vladtaltos
I concur it is known - however, it is still impressive that Google does not
have a counter-measure against it. He says they indexed 140k pages of it.
Phew. Consider having a few thousand of these pots set up and cross-linking to
each other left and right. They'll be crippled... I sense a bug somewhere...

~~~
dchuk
I think you're overestimating Google's capabilities a bit. How are they
supposed to detect that the content is generated on the fly? If it's unique
enough upon generation time, they really can't tell the difference between
this site and a site with legitimate content on it.

This is the core of why Google uses things like links as a ranking signal. If
they based it entirely on the content of the site, they'd be easily duped. While
it's somewhat trivial to manipulate your backlink profile to increase your
rankings, it's rather hard to fake links from known high-quality sites like
the New York Times, for instance. So they can "trust" links more than they can
trust the content of the site they're crawling.

So even though this site has a large amount of content indexed, the chances of
it ranking for anything more than gibberish are so low that it doesn't even
matter.

------
RealGeek
I tested this a while ago by auto-generating websites with millions of
keywords. Surprisingly, I was able to get the websites ranked for many long
tail keywords and it started to bring over 100,000 visitors per day.

It didn't last very long, though: Google penalized the websites after a couple
of months. I did it for testing, but it could be an effective strategy for
spammers, who could rinse, repeat and scale.

There are many websites like that still ranking and generating traffic; some
of them have Alexa rank below 1000.

------
xavian
I maintain web crawlers for a large internet portal, and this stuff is enough
of a pain to deal with when people unintentionally make their websites
recursive, much less INTENTIONALLY. _facepalm_ At least Google's crawler had
issues as well. Thanks a lot, troll. ;P

~~~
arbuge
You need to update those crawlers to enforce some hard limits on crawl depth,
methinks, especially for low-PageRank domains (if there's some easy way to
estimate PageRank, that is).
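A hard depth limit like that is a one-line check in a breadth-first crawler. A minimal sketch, with the link-fetching function injected so it can be exercised without network access (all names here are illustrative):

```python
from collections import deque

def crawl(start_url, fetch_links, max_depth=3):
    """Breadth-first crawl with a hard depth limit.

    fetch_links(url) -> list of URLs on that page; it's passed in so the
    sketch stays testable without hitting any actual site."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue            # hard stop: don't follow deeper links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Against an "infinite" site where every page links to a brand-new one, this visits exactly max_depth + 1 pages per chain instead of running forever.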

------
Houshalter
Creating an "infinite" website sounds kind of cool. You could generate all the
content with a good language model, then have humans correct it where it makes
mistakes. Maybe let them upvote or downvote pages they like.

------
jw2013
When you type in some query strings, the backend just fails, for example:

[http://inf.demos.dam.io/sdfsdfs](http://inf.demos.dam.io/sdfsdfs)

I have yet to find a pattern, but I guess the author does some sort of hashing
on the query string and uses the hashcode as the seed to generate random text.
For certain strings, the backend failure probably has to do with how the
hashcode is used (e.g. something like [some_variable_or_constant_here]/[hashcode -
CONSTANT]: when hashcode == CONSTANT, the backend divides by zero and has no
exception handling). Just a guess.
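If that guess is right, the generation side could look something like this. A hypothetical sketch of hash-seeded text generation plus the suspected divide-by-zero failure mode (the word list, the choice of CRC32, and CONSTANT are all made up for illustration):

```python
import random
import zlib

WORDS = ["spider", "trap", "infinite", "crawler", "index", "page"]
CONSTANT = 7  # stand-in for whatever magic number the backend might use

def page_text(path, n_words=20):
    """Deterministically derive 'content' from a URL path: hashing the
    path and seeding an RNG means the same URL always returns the same
    text, so every page looks stable and unique to a crawler."""
    rng = random.Random(zlib.crc32(path.encode()))
    return " ".join(rng.choice(WORDS) for _ in range(n_words))

def buggy_link_count(path, constant=CONSTANT):
    """The guessed failure mode: dividing by (hashcode - CONSTANT) with
    no exception handling crashes for paths whose hash hits CONSTANT."""
    return 1000 // (zlib.crc32(path.encode()) - constant)
```

Seeding from the hash is what makes the trap cheap: no pages are stored, yet re-crawling any URL yields identical "content".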

~~~
eli
Makes sense; otherwise this is a very easy way for spiders to realize they're
in a trap. I'm pretty sure Google tests bogus query strings to see what the
404 page looks like.

~~~
myle
Some legit websites, instead of displaying a 404 error, redirect (3xx) to a
page like the main page or an about page.

~~~
eli
Or return 200 with an error message. Something Google would want to know in
any case.

------
DocFeind
As the rest have said, this is old news. Mass indexing is a simple hack in
search, especially for nonsensical content abusing non-standard terms.

If you want to impress us, rank a few 100k pages for competitive terms with
garbage content.

------
kalleboo
One fun implementation of this is directory.io: a site listing every single
possible Bitcoin private key and its corresponding address. The theory is
that eventually, millions of years from now, Google will have indexed all
10^74 pages and you can just google a Bitcoin address to steal its
balance.

[http://directory.io](http://directory.io)

------
Buetol
Seems down, here is a cached version:
[http://webcache.googleusercontent.com/search?q=cache:DA5Tih3...](http://webcache.googleusercontent.com/search?q=cache:DA5Tih3fS50J:blog.dam.io/trolling-
the-search-engines/+&cd=1&hl=en&ct=clnk&gl=us)

------
alex_doom
I'm pretty sure this is done on a different scale for the sites that fake data
to get traffic.

~~~
Dorian-Marie
Just looking at the other results in Google: bestwordlist, zyzzyva and other
generated websites.

------
fear91
There's a whole industry in Eastern Europe that does just this and resells
the link space. It's nothing new, but don't be surprised when your whole
domain gets deindexed.

------
yeukhon
I am slow. Where is the recursion? There are a lot of hyperlinks on the page,
but I can't see the recursion. I am trying to find duplicates.

------
elwell
Yeah, it will probably be blacklisted soon.

------
elwell
How much traffic in visits?

~~~
Buetol
I put the access logs at the bottom of the post, so you can calculate them

EDIT: Installed AWStats, just wait an hour:
[http://stats.demos.dam.io/](http://stats.demos.dam.io/)

EDIT2: Can't manage to make AWStats parse the old logs...

~~~
Stormcaller
For those interested, I did look at it a bit; before I touched it, it was
~116k lines. After I removed everything matching "bot" or "spider" (but not
"google" or "mozilla" or anything that might appear in real user agents - and
yes, I'm aware Chrome doesn't send "google" and that "mozilla" shows up in 99%
of those strings, I just couldn't give any better examples), it went down to
~1k. I scrolled through, and most of those requests came from 216.151.137.36,
so I removed that too. I was left with ~500 that weren't "obviously" coming
from bots. At this point I stopped, since what I wondered was "did this site
get a significant amount of real visits?", and the answer was clearly no - or
if any, fewer than 500.
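That kind of rough pass can be done in a few lines rather than by scrolling. A sketch assuming combined-log-format lines (IP first, user agent quoted at the end); the bot pattern and blocked IP mirror the manual filtering above and are just as rough:

```python
import re

BOT_PATTERN = re.compile(r"bot|spider|crawl", re.IGNORECASE)

def human_lines(log_lines, blocked_ips=("216.151.137.36",)):
    """Filter combined-format access-log lines down to plausibly-human
    hits: drop lines from blocked IPs and lines containing obvious bot
    markers anywhere ("mozilla"/"google" alone are left in, since real
    browsers send those tokens too)."""
    kept = []
    for line in log_lines:
        ip = line.split(" ", 1)[0]
        if ip in blocked_ips or BOT_PATTERN.search(line):
            continue
        kept.append(line)
    return kept
```

Like the manual pass, this is only a lower bound on bot traffic: well-behaved bots announce themselves, and anything spoofing a browser user agent sails through.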

Btw, does anyone know who this IP (216.151.137.36) belongs to? All I get is
spam reports when I google it, and it doesn't seem to belong to any (major)
spider.
[http://stopforumspam.com/ipcheck/216.151.137.36](http://stopforumspam.com/ipcheck/216.151.137.36)

On another question: how does Google deal with web apps that have
unlimited pages (dynamic URLs based on GET parameters and so on)? As in, how
does it decide "this is legit" and that site there is not? Surely backlinks
are one thing, but these can be "faked" too?

~~~
toast0
> Btw, does anyone know who this IP (216.151.137.36) belongs to?

FWIW, I see this IP hitting my spider trap too: 14 requests in 30 seconds on
March 14th, and 21 requests in 45 seconds on Feb 20th.

------
spada
Wouldn't it be more effective to use your target keywords as the word list
and create dozens or hundreds of slightly relevant pages vs. 148k?

Am I wrong in thinking that might actually move SERPs?

