
“Fetch as Googlebot” tool helps to debug hacked sites - Garbage
http://www.mattcutts.com/blog/fetch-as-googlebot-tool-hacked-sites/
======
Zenst
I know one approach from over 10 years ago was to have your editable site on an
isolated server that would periodically copy its content to the hosting site.
This meant that any changes to the hosted site would get stamped over by the
master copy.

Still, nothing is perfect, and this is a good read about an issue a lot of
people are not aware of.
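The "stamp over" scheme above can be sketched in a few lines. This is a hypothetical illustration (the function name and layout are mine, not from the comment); in practice you'd likely use `rsync -a --delete` over SSH rather than a full delete-and-copy:

```python
import shutil
from pathlib import Path

def stamp_over(master: str, live: str) -> None:
    """One-way mirror: wipe the live copy and replace it with the master,
    so defacements and injected files disappear on each periodic sync."""
    live_path = Path(live)
    if live_path.exists():
        shutil.rmtree(live_path)  # drop everything, including any injected files
    shutil.copytree(master, live_path)  # restore the pristine master copy
```

The crude delete-and-copy keeps the sketch obvious; a real deployment would sync incrementally so the site never disappears mid-copy.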

~~~
thaumaturgy
What's frustrating to me -- as someone who's getting closer and closer to
publicly launching a shared hosting service -- is that this problem _should be
solved_ already: we have twice-daily backups for our websites, going back up
to a year. It's all automated, and customers can access the backups directly.
If your site's been hacked, you can log in to the backup server, view the
changes for your site's directory, and download the most recent good copy.

For most sites that store their templates as regular files and their contents
in a database, this is plenty good enough. For sites that store their content
as regular files too, it only takes a few extra minutes to separate the good
stuff from the bad stuff.

This is super easy to implement. Every web host should be doing it.
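The "view the changes for your site's directory" step can be sketched with the standard library's `filecmp` module. A minimal version (function name and return shape are my own) that separates files modified in place from files the attacker added:

```python
import filecmp

def report_tampering(backup_dir: str, live_dir: str):
    """Compare a known-good backup against the live site directory.
    Returns (changed, added): files whose contents differ, and files
    that exist only on the live site (e.g. dropped web shells)."""
    changed, added = [], []

    def walk(cmp: filecmp.dircmp, prefix: str = "") -> None:
        changed.extend(prefix + f for f in cmp.diff_files)
        added.extend(prefix + f for f in cmp.right_only)  # only on the live site
        for name, sub in cmp.subdirs.items():
            walk(sub, prefix + name + "/")

    walk(filecmp.dircmp(backup_dir, live_dir))
    return changed, added
```

Anything in `added` is an immediate red flag; anything in `changed` is a candidate for restoring from the backup.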

~~~
ars
What would be the point? If you got hacked once, after you restore the files
you'll just get hacked again.

As a reporting tool it could be interesting though.

~~~
Zenst
Content and software are separate. We're talking about content, not the
software. Sure, you fix the flaw in the software that allowed the exploit, or
the weak password, or however the site was taken over, but that is a separate
issue.

~~~
justincormack
Not in PHP they aren't always, which is what the vast majority of hacks are
on...

~~~
dangrossman
They are separate in the PHP software that gets targeted by "the vast majority
of hacks". Those hacks are against popular CMS packages that can be scanned
for and exploited in an automated fashion. In a CMS, the software being
exploited and the content are separate, in PHP as in every other language.

The PHP files where the content and the system are one and the same (hand
written pages not using a packaged CMS) aren't part of "the vast majority of
hacks" category. Compared to exploiting a WordPress vulnerability in 50+
million installs, someone trying to mess with the black box that is someone's
custom-written page is vanishingly rare. Your retort doesn't hold water.

~~~
duskwuff
> The PHP files where the content and the system are one and the same (hand
> written pages not using a packaged CMS) aren't part of "the vast majority of
> hacks" category.

Speaking from experience, this is simply not true. There are automated
scanners in the wild which will attempt to detect and exploit common
vulnerabilities in simple PHP templating systems and CMSes. One frequently
exploited vulnerability is in applications which use URLs of the form:

    index.php?page=foobar

With supporting code along the lines of:

    $page = $_GET["page"]; /* if register_globals isn't set */
    include("pages/$page.html");

Until relatively recently, when PHP started rejecting filenames with embedded
null bytes, code like this was vulnerable to input such as:

    index.php?page=../../../../../../proc/self/environ%00

Applications like this are relatively easy to detect in an automated fashion,
and were for a time being exploited on a very large scale.
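The underlying flaw (path traversal / local file inclusion) and its usual fix generalize beyond PHP. A sketch in Python, with names of my own choosing: resolve the user-supplied page against the pages directory and reject anything that escapes it:

```python
from pathlib import Path

PAGES_DIR = Path("pages").resolve()

def safe_include(page: str) -> Path:
    """Resolve a user-supplied page name, rejecting traversal input like
    '../../../../../../proc/self/environ' from the example above."""
    target = (PAGES_DIR / f"{page}.html").resolve()
    if PAGES_DIR not in target.parents:  # resolved path escaped pages/
        raise ValueError("path traversal attempt")
    return target
```

A stricter variant whitelists page names against an explicit set, which also neutralizes tricks like embedded null bytes without relying on the runtime to reject them.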

------
Sukotto
Interesting to learn that there's a communication method one can use to
directly speak to a human at google and get a _helpful same-day reply_.

How do I, as someone outside of the music industry, gain access to that
communication channel?

~~~
packetslave
Become a paying customer

------
cantlin

        $ curl -v -A Googlebot example.org       # spoof the Googlebot user agent
        $ curl -v -e www.google.com example.org  # spoof a Google referer

~~~
michaelmior
I was going to post the same thing. Fetch as Googlebot is nice but really
nothing too special.

~~~
joshu
You are wrong. Stop guessing and then acting like your guesses are facts.

~~~
michaelmior
What am I guessing? Your comment does nothing to clarify any mistake I might
have made, which I'll gladly admit to.

There are certainly differences between just setting the user agent and
running Fetch as Googlebot. (The incoming IP address being an obvious one.)
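That IP difference matters because of how a real Googlebot is verified: Google's documented method is a reverse-DNS lookup on the IP, a domain check, then a forward lookup back to the same IP, which a curl-with-spoofed-UA request can't pass. A sketch (function names are mine):

```python
import socket

def valid_crawler_host(host: str) -> bool:
    """True if a reverse-DNS hostname is under Google's crawler domains."""
    return host.endswith((".googlebot.com", ".google.com"))

def is_real_googlebot(ip: str) -> bool:
    """Reverse-DNS the IP, check the domain, then confirm the hostname
    forward-resolves back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not valid_crawler_host(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

The forward-confirmation step is what stops an attacker who controls reverse DNS for their own IP range from simply naming a host `crawl.googlebot.com`.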

------
essayist
This may help clear up something that has puzzled me about a site I use and
often search via Google.

[https://www.google.com/search?num=100&hl=en&safe=off...](https://www.google.com/search?num=100&hl=en&safe=off&q=site%3Ancdd.org+%22for+sale%22&oq=site%3Ancdd.org+%22for+sale%22)

(In this case, I specifically put in "For Sale" to highlight the spammy drug
ads, but they come up even without this).

Puzzle: I scan through looking for pages with "XYZ For Sale" in the title and
then check out Google's cached version of the page. Sometimes, I see the spam
in the cache, but often enough I don't.

So: how is it that the search result is different from Google's own cache for
that page?

------
mef
For the curious, the site mentioned in the article appears to be either Alanis
Morissette or The Doors:
<https://www.google.com/search?q=Generic+synthroid+bad+you>

------
drivebyacct2
The sad thing is, even people who ought not to be amateurs, but are, fall for
this trick. My mother's company's site was hosted on GoDaddy (I offered to
move it and pay for the hosting, as it's a non-profit, but she declined). They
swore up and down for weeks and weeks that they were not responsible and that
there was nothing wrong, when it was a hosted WordPress instance.

Most of the time they (the hackers; sorry, pronoun overload) just naively
check the referrer. Going to Google, searching for the site, and clicking
through is often sufficient.
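The referrer-based cloaking described above is trivially small, which is why owners rarely notice: the injected code serves spam only to visitors arriving from a search engine. A hypothetical sketch (names and the engine list are mine):

```python
def choose_payload(headers: dict, clean_html: str, spam_html: str) -> str:
    """Cloaking sketch: only visitors arriving from a search result see
    the spam, so the owner typing the URL directly sees a clean site."""
    referer = headers.get("Referer", "")
    if any(engine in referer for engine in ("google.", "bing.", "yahoo.")):
        return spam_html
    return clean_html
```

This is also why the "click through from a Google search" test in the comment works where a direct visit does not.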

~~~
ben0x539
Referrer header considered harmful yet? :-)

