

The (humorous) dangers of Google's new deep search - technoguyrob

I was reading the Slashdot article on Google's new "deep search" (http://tech.slashdot.org/tech/08/04/16/2052206.shtml), where the crawler submits forms and sees what the results are. One user posted this insightful and interesting anecdote:

http://tech.slashdot.org/comments.pl?sid=525058&cid=23096424

"When I interned at Google, someone told me a funny anecdote about a guy who emailed their tech support insisting that the Google crawler had deleted his web site. At first, I think, he was told, 'Just because we download a copy of your site doesn't mean your local copy is gone' (à la the obligatory bash.org quote). But the guy insisted, and finally they double-checked, and his site was in fact gone. It turned out to be a home-brewed wiki-style site where each page had a 'delete' button. The only problem was that the 'delete' button sent its query via GET, not POST, so the Google spider happily followed those links one by one and deleted the poor guy's entire site. The Google guys were feeling charitable, so they sent him a backup of his site, but told him he wouldn't be so lucky next time, and that he should change any forms that make changes to POSTs -- GETs are only for queries.

So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data to random forms on the web. Like the guy above, people may not design their sites with such a spider in mind, and despite that lack of foresight, this could kill a lot of goodwill if done improperly."
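The advice at the end of the quote can be sketched in a few lines. This is a hypothetical toy dispatcher (not the wiki's actual code): a crawler only ever issues GETs, so if mutating actions refuse anything but POST, link-following spiders can't destroy content.

```python
# Toy in-memory "site" for illustration (hypothetical names throughout).
pages = {"home": "...", "about": "..."}

def handle_request(method, path):
    """Minimal request dispatcher; 'method' is the HTTP verb."""
    if path.startswith("/delete/"):
        page = path[len("/delete/"):]
        # The broken design in the anecdote deleted on GET, so a spider
        # merely following <a href="/delete/home"> links wiped the site.
        # The fix: require POST (a form submission) for anything that
        # mutates state, since GET is supposed to be a safe, read-only
        # method per the HTTP spec.
        if method != "POST":
            return 405  # Method Not Allowed: refuse to mutate on GET
        pages.pop(page, None)
        return 200
    return 200 if path.lstrip("/") in pages else 404

# A crawler issues only GETs, so the site survives:
assert handle_request("GET", "/delete/home") == 405
assert "home" in pages
# A real form submission uses POST and succeeds:
assert handle_request("POST", "/delete/home") == 200
assert "home" not in pages
```

In a real app the POST handler would also check a CSRF token, but the method check alone is what would have saved the guy in the story.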
======
sdurkin
Very funny anecdote. What bothers me, though, is the implication that Google
thinks they own the web.

For example, they told the guy he was lucky that they were willing to give
him a backup, but it seems to me that Google is the one that should be taking
responsibility for its actions.

It's a short jump from "you have to use POST or we'll delete your stuff" to
"you have to follow Google standard X or we won't index your site."

Welcome to the first web empire.

~~~
Tichy
No, any web crawler would have done the same thing. It is simply an error to
modify content with a GET.

~~~
sdurkin
Yes, it is an error. What alarms me is Google's attitude.

What happens when we're dealing with interfaces that are a little bit more
ill-defined? Will Google continue to demand that you follow their way of doing
things?

Google's attitude in this case suggests they will.

~~~
neilc
It's not "their way of doing things"; it is just the way the web works (per
the HTTP spec). Any crawler would have done the same thing in that situation --
the fact that it happened to be Google is merely coincidental. Given the
scale at which they operate, you can't expect Google or any other web-scale
crawler to be a mind-reader.

------
keshet
In a way, Google is doing QA on the whole web. This will expose all kinds of
bugs that sites have hidden behind their form-processing scripts. Databases
will get filled up with random junk ('Google was here'), but that is also good
for QA. On the other hand, a lot of that junk data will be reflected back to
the web by sites that post this stuff -- not so good for the SNR of the
Internet as a whole.

------
graywh
I've seen the same thing happen with a Rails app a co-worker created and
populated with data for a demo/tutorial. It got indexed by the university's
web crawler overnight and was empty the next day.

------
TrevorJ
Oh man, THAT is awesome! ....Backing up my site now :-P

------
redorb
I bet they will only use drop-downs, since most don't contain such harsh
options. I want to hear a good case for getting behind those forms... if they
do enter data into text boxes, they will do it from a dictionary list that is
certified safe.

- Sites still use drop-down nav? (Is that such a case?)

