The (humorous) dangers of Google's new deep search
22 points by robertk on April 17, 2008 | 11 comments
I was reading the Slashdot article on Google's new "deep search" (http://tech.slashdot.org/tech/08/04/16/2052206.shtml), where it submits forms and sees what the results are. One user posted this quite insightful and interesting anecdote:

http://tech.slashdot.org/comments.pl?sid=525058&cid=23096424

When I interned at Google, someone told me a funny anecdote about a guy who emailed their tech support insisting that the Google crawler had deleted his web site. At first, I think he was told "just because we download a copy of your site doesn't mean your local copy is gone" (à la the obligatory bash.org quote). But the guy insisted, and finally they double-checked, and his site was in fact gone. It turns out it was a home-brewed wiki-style site, and each page had a "delete" button. The only problem was, the "delete" button sent its query via GET, not POST, and so the Google spider happily followed those links one by one and deleted the poor guy's entire site. The Google guys were feeling charitable, so they sent him a backup of his site, but told him he wouldn't be so lucky the next time, and that he should change any forms that make changes to POSTs -- GETs are only for queries.

So, long story short, I wonder how Google will avoid more of this kind of problem if they're really going off the deep end and submitting random data to random forms on the web. Like the guy above, people may not design their sites with such a spider in mind, and despite that lack of foresight, this could kill a lot of goodwill if done improperly.
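
To make the failure mode concrete, here is a rough sketch of how a site like that could end up being wiped, and what the fix looks like. It's plain Python using the standard http.server module; the page names and paths are invented for illustration and aren't from the actual site in the story.

    # Hypothetical sketch of the kind of site described above (not the actual code).
    # A wiki-style server whose "delete" action is reachable by GET: any crawler
    # that follows <a href="/delete?page=..."> links will wipe pages one by one.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    PAGES = {"home": "welcome", "todo": "buy milk"}  # stand-in page store

    class WikiHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            url = urlparse(self.path)
            if url.path == "/delete":  # the bug: a GET that changes state
                page = parse_qs(url.query).get("page", [""])[0]
                PAGES.pop(page, None)  # a spider following the link deletes the page
                self.send_response(302)
                self.send_header("Location", "/")
                self.end_headers()
                return
            body = "".join(
                '<p>%s <a href="/delete?page=%s">delete</a></p>' % (name, name)
                for name in PAGES
            )
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body.encode())

        def do_POST(self):
            # The fix: only a POSTed form may delete, so link-following crawlers stay read-only.
            url = urlparse(self.path)
            if url.path == "/delete":
                length = int(self.headers.get("Content-Length", 0))
                page = parse_qs(self.rfile.read(length).decode()).get("page", [""])[0]
                PAGES.pop(page, None)
            self.send_response(303)
            self.send_header("Location", "/")
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), WikiHandler).serve_forever()

The crawler never knows (or cares) that the GET has side effects; it only sees another link to follow.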




Very funny anecdote. What bothers me, though, is that it implies Google thinks they own the web.

For example, they told the guy that he was lucky that they were willing to give him a backup, but it seems to me that Google's the one that should be taking responsibility for their actions.

It's a short jump from "you have to use POST or we'll delete your stuff" to "you have to follow Google standard X or we won't index your site."

Welcome to the first web empire.


No, any web crawler would have done the same thing. It is simply an error to modify content with a GET.


I agree. Google simply followed the semantics of HTTP. They had similar problems with the Google Web Accelerator before. But it's really hard to blame Google for the mistakes of other programmers.


Additionally, the interface to web crawlers (robots.txt) is well defined.
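
For instance, a couple of lines of robots.txt would have kept any well-behaved crawler away from those delete links. Here's a rough sketch of how a polite crawler checks the rules before fetching, using Python's standard urllib.robotparser; the rules and URLs are made up for illustration.

    # Made-up robots.txt rules and URLs, just to show the well-defined interface.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /delete",  # keep every crawler off the destructive links
    ]

    rp = RobotFileParser()
    rp.parse(rules)  # a real crawler would do rp.set_url("http://example.com/robots.txt"); rp.read()

    print(rp.can_fetch("Googlebot", "http://example.com/delete?page=home"))  # False
    print(rp.can_fetch("Googlebot", "http://example.com/home"))              # True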


Yes, it is an error. What alarms me is Google's attitude.

What happens when we're dealing with interfaces that are a little bit more ill-defined? Will Google continue to demand that you follow their way of doing things?

Google's attitude in this case suggests they will.


It's not "their way of doing things"; it's just the way the web works (per the HTTP spec). Any crawler would have done the same thing in that situation -- the fact that it happened to be Google is merely coincidental. Given the scale at which they operate, you can't expect Google or any other web-scale crawler to be mind-readers.
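
At its core, a crawler is just "GET every link you can find." A bare-bones sketch (standard-library Python only; the start URL is a placeholder) shows why a delete-via-GET link gets hit exactly like any other:

    # Bare-bones crawler sketch: it GETs every <a href> it can find, with no idea
    # which links are harmless queries and which ones change state on the server.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def crawl(start_url, limit=50):
        seen, queue = set(), [start_url]
        while queue and len(seen) < limit:
            url = queue.pop()
            if url in seen or not url.startswith("http"):
                continue
            seen.add(url)
            html = urlopen(url).read().decode("utf-8", "replace")  # just a plain GET
            collector = LinkCollector()
            collector.feed(html)
            # Every href gets queued the same way -- "/delete?page=..." included.
            queue.extend(urljoin(url, href) for href in collector.links)
        return seen

    if __name__ == "__main__":
        crawl("http://example.com/")  # placeholder start URL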


At least search engines can be told not to touch a certain link; if a bored user did the same thing, who would you blame?


In a way, Google is doing QA on the whole web. This will expose all kinds of bugs sites have hidden behind their form-processing scripts. Databases will get filled up with random junk ('Google was here'), but that is also good for QA. On the other hand, a lot of that junk data will be reflected back to the web by sites that post this stuff... not so good for the SNR of the Internet as a whole.


I've seen the same thing happen with a Rails app a co-worker created and populated with data for a demo/tutorial. It got indexed by the university's web crawler overnight and was empty the next day.


Oh man, THAT is awesome! ....Backing up my site now :-P


I think they will only use drop-downs, and most of those don't contain such harsh options. I want to hear a good case of getting behind those forms... if they do enter data into text boxes, they will do it off a dictionary list that is certified safe.

- Do sites still use drop-down nav? (Is that even a real case?)



