A website that deletes itself once indexed by Google (github.com)
233 points by cjlm on Mar 8, 2015 | 121 comments



I had a client once who had something similar, although unintentionally. She approached me because her website "kept getting hacked" and she didn't trust the original developers to solve the security problems... And rightly so!

There were two factors that, together, made this happen: first, the admin login form was implemented in JS, and if you tried to log in with JS disabled, it wouldn't verify your credentials. And it submitted via a GET request. Second, once you were in the admin interface, you could delete content from the site by clicking an X in the CMS, which, as was the pattern, presented you with a JS alert() prompt before deleting the content... via a GET request.
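
Roughly the kind of markup involved, sketched from memory rather than the actual code:

    <!-- hypothetical sketch of the pattern, not the client's actual markup -->
    <a href="/admin/delete?page=42"
       onclick="return confirm('Really delete this page?')">X</a>

Whether the dialog is an alert() or a confirm(), it only gates a human clicking in a browser; anything that simply fetches the href, crawler included, performs the delete with no prompt at all.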

Looking at the server logs around the time it got "hacked", you could see GoogleBot happily following all the delete links in the admin interface.


> I had a client once who had something similar, although unintentionally.

I did that too. I was aware of the problem, but at the time (1996) I did not know how to fix it.

So I just documented it and warned that they should keep the site away from AltaVista.

This was back before cookies had wide support, so login state was in the URL. If a search spider ever learned that URL, it would have deleted the entire site by spidering it.

I did eventually fix it by switching to forms, and strengthening the URL token to expire if unused for a while. And then eventually switching to cookies (at one point it supported both url tokens and cookies).

I have not thought about those days in such a long time.


Why not POST requests for anything that changed server-side state?


Obviously that is the solution. I know that now, I didn't then. (As I wrote: "I did eventually fix it by switching to forms.")

The POST-vs-GET distinction that everyone knows today (GET for reads, POST for writes) was not that well known back then.

Back then you used GET for things with a small number of variables, and POST when you expected enough data that it wouldn't fit in the URL. It was all about the URL, not about the effect of the request.


Ah, I see. Should have picked that up.

I guess there was no Wikipedia article on HTTP back then. Wikipedia has been an invaluable resource for me in understanding some of the intricacies of my work.


I remember those days. Back then only two methods existed, GET and POST! ;)


htaccess would have been your friend. How did you prevent any visitor from deleting the site?


> htaccess would have been your friend

htaccess didn't exist in 1996.

This site ran on IIS 1.0 on Windows NT 3.51. For scripting we used a prerelease ColdFusion version (i.e. the version before 1.0, which was released as we were developing the site, partially based on feedback we provided as we tested it).

> How did you prevent any visitor from deleting the site?

A secret security token in the URL. The worry was that some admin would try to submit the site to AltaVista for indexing without removing the token from the URL first.


> htaccess didn't exist in 1996.

Obviously not for IIS, but .htaccess files go back at least as far as NCSA httpd, and so definitely existed before 1996.


> This site ran on IIS (anything)

There's your first problem.


Unnecessary condescending snark.


Are you this guy http://thedailywtf.com/articles/The_Spider_of_Doom? Or http://craigandera.blogspot.com/2004/04/beware-googlebot_12....? (They seem like the same story but have different names.)


Probably not. This happens more than you might think. I got called in to consult on a project where something similar was happening. Client would add products to their web store and the next day the products were missing.

Unsecured access and 'GET' based deletes were everywhere.


I accidentally deleted about half of the database at a startup where I’d recently started working by approximately the same method. I was running a copy of the web interface on my laptop, connecting over the internet to our MySQL server, and also running ht://dig’s spider on localhost from cron. It started spidering the delete links. Fortunately, I’d also started running daily MySQL backups from cron (there were no backups before I started working there), so we only lost a few hours of everyone’s work. As you can imagine, though, they weren’t super happy with me that day.


If they weren't making backups before, and you instituted them, they should have been super happy with you.


Cowboy coders don't like to see the holes in their development process.


They were unhappy to lose several hours of the rest of the company’s work.


Someone should make a website for indexing bots to play with!


What's the ideal solution for this? I would drop a cookie and use that to verify admin privileges on each page. Is that right?


First, use the correct HTTP verb: POST (or possibly DELETE). Googlebot only GETs.

Also as you note, destructive changes should be authenticated, whether by Basic Auth over TLS or the more common cookie tokens.


Fundamentally, authentication when someone tries to delete a thing needs to happen in server-side logic, not on the client side. The rest is flavouring.


Authentication should happen server side, but it need not happen at the time of the delete. When deleting, you should be authorizing and validating; that can safely be done client side for purely client-side actions, but if you are doing something server side (e.g., a delete), the authorization and validation should also happen server side.


HTTP is stateless. You need to generate some kind of token that can be checked for each admin action a user takes.


Authentication, POST requests and CSRF tokens, at the bare minimum.
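
As a rough sketch of how those three pieces fit together (a hypothetical Flask handler; the route, session keys, and delete logic are made up for illustration):

    # Sketch only: server-side auth + POST-only route + CSRF token check.
    import hmac
    from flask import Flask, request, session, abort

    app = Flask(__name__)
    app.secret_key = "replace-with-a-real-secret"   # signs the session cookie

    @app.route("/admin/delete/<int:item_id>", methods=["POST"])   # no GET
    def delete_item(item_id):
        if not session.get("user_id"):               # authentication, server side
            abort(401)
        sent = request.form.get("csrf_token", "")
        kept = session.get("csrf_token", "")
        if not kept or not hmac.compare_digest(sent, kept):   # CSRF check
            abort(403)
        # ... actually delete item_id here ...
        return "", 204

Googlebot only issues GETs and never carries your session cookie, so it can't reach this at all, and a logged-in admin's browser can't be tricked into firing it from a bare link either.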


Basically do the exact opposite of everything they did. Doing authentication in client-side JavaScript is an absolute no-no. Using GET requests for things that have side effects (like deleting content) is another.


Don't authenticate on the client, it's that simple. Authenticate only server side.


The solution is to use a mature web framework that already solves this, e.g. Django.


You win "Most Hilarious Bug" for the day.


Clearly! I just came back to HN and wondered what the hell had happened to my karma.


I'm surprised there are so many people on Hacker News asking "why?".

Hackers don't need a reason, other than it being clever, novel, fun, etc. But if you want a reason there are plenty:

* art: there are numerous interpretations of this

* fun: this is sort of the digital equivalent of a "useless box" http://www.thinkgeek.com/product/ef0b/

* science: experiment to see how widespread a URL can be shared without Google becoming aware of it

* security: embed unique tokens in your content to detect if it has leaked to the public


I agree that there are lots of reasons that someone would make a site like this, but I think people are curious as to the maker's specific reason. From the github:

> Why would you do such a thing? My full explanation was in the content of the site. (edit: ...which is now gone)

I'm curious as to what the website said originally.


In that case, I'd guess "art".


I think the word hacker has more and more lost its original meaning, at least in this community. If I were reading a similar story on a Tor hidden service, say, I would not be asking why, but here I do.


It's totally twisted, because "business types" got involved and everyone started confusing "hacking" with "working". This story is a cool but simple hack, in the original meaning of the word. I find that a good definition of a hack is "a project or a trick that you can tell to your tech friends over a beer and have some good laugh from it".


My first thought was filesharing.


Auto-DMCA-ing yourself?


Exactly.


My first question was "_why? Is that you?"


It's a digital embodiment of coolness; once the masses can find out about it, it isn't cool anymore and the coolness is gone. Literally.


I think Hipsterism is what you're actually referring to.


I was cool long before it was hip.


Much like tattoos.


An alternative would be to check the user agent, delete the website right at that point, and return a 404 page to the Google crawler. Then Google won't have a static copy of the website.


Your approach is "a website that irrevocably deletes itself once indexed by Google".

What OP has done is "a website that irrevocably deletes itself once Google decides to publicly reveal the fact that it has indexed said website".

OP's approach has no way of knowing when the site was indexed. It's conceivable that Google indexed it on the very first day and decided not to share it publicly until 21 days later.


Technically, the former is when it is "crawled" and the latter is when it is "indexed".

In practice, since 2010 these two events have generally been separated by minutes.


If you really want to get "technical", then the first one is when the site is "crawled" and the latter is when it's "served". "Indexing" happens in-between the two.


Even if the request that claims to be from Googlebot is actually from Googlebot (which it might not be), that doesn't guarantee the site is indexed. It's impossible to know when the site is indexed without direct access to Google's index.


Hah, good point. For all intents and purposes (for most), "indexed" is when it appears in the search results.


The problem with that is that you could spoof the user agent.


Actually, you could do a reverse DNS lookup on the IP of any request claiming to be Googlebot, followed by a forward DNS lookup on the hostname you get back. Legitimate Googlebots will be in the *.googlebot.com space.

Source: https://support.google.com/webmasters/answer/80553?hl=en
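
A minimal sketch of that check using only the Python standard library (the host suffixes follow the linked doc; everything else is illustrative):

    # Reverse-then-forward DNS check for a request claiming to be Googlebot.
    import socket

    def is_real_googlebot(ip):
        try:
            host, _, _ = socket.gethostbyaddr(ip)    # reverse lookup on the IP
            # per the linked doc, genuine crawl hosts end in googlebot.com (or google.com)
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return socket.gethostbyname(host) == ip  # forward lookup must round-trip
        except (socket.herror, socket.gaierror):
            return False

    # e.g. call is_real_googlebot() on the request's source IP before trusting its UA string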


But Google probably doesn't bother, right?


What I meant is that a human could spoof the user agent and pretend to be Googlebot.


They do tell google not to save a static copy:

> the NOARCHIVE meta tag is specified which prevents the Googles from caching their own copy of the content.


Wouldn't that only prevent the user from seeing the cache? I mean, if it's indexed, then google must have it cached, right?


That meta tag prevents Google from publicly showing their cached version of the page. In practice this means the "Cached" link, within the results, doesn't appear when a given page asks Google to NOARCHIVE -- which I believe can be 'asked for' via either the meta tag or via a special response header.

Edit:

Yeah, 'noarchive' can be specified via the meta tag or via header. Also available to you are a handful of other directives such as NoIndex, NoFollow, NoArchive, NoSnippet, NoTranslate, etc...
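
For reference, the two delivery mechanisms look like this, either in the page's <head>:

    <meta name="robots" content="noarchive">

or as a response header:

    X-Robots-Tag: noarchive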

See these links for more in-depth info about the directives & which search engines support what:

Directives & Usage in Meta tags - http://noarchive.net/meta/

Usage in Response Headers - http://noarchive.net/xrobots/


What about the opposite? A website that is created when it is indexed? Start with nothing, and content is added each time the site is visited by Googlebot, or shared on Facebook, tweeted, posted on Reddit, etc. The website exists only so that it can be shared, and the act of sharing it defines what the website is.


This is an uber cool idea. Especially if, when the website is shared by someone, it attempts to scan the sharer's public feed, latest submissions, latest comments, latest tweets, etc. (depending on where it got shared) and generates additional content based on what it finds.

Sounds like an awesome weekend project.


Cool, but why? (And shouldn't we invent digital Baroque art before inventing digital postmodernism?)


Both exist.

Postmodernism is a lot more relevant to the digital age than anything, imo. It emphasizes pointing out ways of thinking and doing, which I think is especially relevant when we are actually automating most of our ways of thinking and doing.

I know it gets a bad rap because of the ridiculous examples, but the real point of it is to engage the viewer in a serious kind of contemplation concerning the massive infrastructure that exists and how that shapes our culture, thoughts, understanding, action...

We have the expectation that the generations to come will accept this infrastructure and what it says about how the human mind functions. But much of it is founded on belief systems of how thought and action operate in the real world. Most of these systems are baseless, the idea of a base obfuscated only by the sheer complexity involved in understanding each layer.


Please don't tell me Geocities was our Renaissance.


I really look forward to when we, as academics, historically document and seriously examine the various phases of the internet, from a variety of alternative perspectives.

It's interesting while it's being built, but it's also interesting to look back and reflect on the bigger picture, outside of the buzzwords and technical terminology used to pull the creation through, and make it actualized.

I look forward to when critics and theorists start thinking about the goal of the internet from a social perspective, as a collective cultural subconscious directive. I look forward to the kinds of art-historical methodology used to explain the significance of Picasso or Manet in their respective time periods being applied to reason about the relation between the internet and everything that is not the internet.

It's interesting when some information gets washed away and other information is retained through time, and it isn't always the stuff that is indexed that is retained. The idea that art critics can even agree to call the same collection of works "cubism" or "impressionism" fascinates me, and I look forward to the same kinds of invented vocabularies being used to describe various processes, movements, and patterns throughout internet culture (way beyond studying memes and tropes - there are so many layers to the collective psyche of the internet, it is dumbfounding).

I don't know what Geocities represents. I'd have to define its 'kind' and compare and contrast it to other 'kinds' throughout time. I know this was meant to be a humorous comment, but I love to weave theories, and some of them even turn out to be descriptive of the nature of things.


And if you want to help archive the data needed for that kind of work, ArchiveTeam needs your help: http://archiveteam.org/index.php?title=Main_Page :)


AAA games: where someone is paid to do nothing but design the details on imaginary Dwarven armor for an entire year.

If that ain't baroque I don't know what is.


Because.


Normally this kind of comment gets downvoted, and rightly so. But in this context, it's perfect. Well done.


The reason why "Laconian wit" is normally frowned upon is because it's actually almost disruptively lazy. In the event that almost everyone agrees with you, then that's okay.

But should anyone disagree with you, now they're going to have to do the heavy lifting for YOUR side. That disrupts the willingness someone has to even converse with you, and if someone retorts with similarly Laconian wit, you can see the conversation breaks down really fast, because nobody is willing to put in the extra effort to flesh out someone else's opinion when there's no reciprocity or show of effort.


I wouldn't mind the downvotes at all. But I really thought it was the only acceptable answer for the why. :)


Yup. If aiming for less laconism, I like to quote Cave Johnson - "Science isn't about WHY. It's about WHY NOT.".


I'm laconic like that. Great quote, btw.


Just check out any of the MIDI music forums for some sweet digital baroque art.


Thank you.

http://i.imgur.com/cjDeLEb.png

EDIT: What's with the downvote hate? Somebody actually posted a valid key...


As far as I can tell, you just posted part of a random screengrab from your web browser for no obvious reason. Striking's response suggests that this is actually a reference to a site which, per the OP, is gone forever, along with any chance of getting your joke. So...I'm not really sure what you were expecting.


People likely didn't understand that someone posted a key for a game on that website and thought that you just posted an unrelated image.


>Why would you do such a thing? My full explanation was in the content of the site. (edit: ...which is now gone)

So did anyone really understand why he did this?


My guess: because he could, and likely had a good laugh discussing it with friends.


Anyone know the origin or have an archive?


The origin is this: http://eep40h.herokuapp.com/



If anything, that's a much deeper comment than the website itself. No matter how hard you try, it's impossible to really destroy something once it's been on the web. Resistance is futile.


That's not quite the point being made, since the site didn't even attempt to block indexing via robots.txt or a meta tag.


Is this indexed by google? Doesn't this make the attempt a failure (this time)?


I've submitted it to Google, hopefully it makes it.


I see what you did there, I think.


Not sure if I see this as "art" or something. I mean, "irrevocably deletes itself" could be attached to a thousand arbitrary things.

- deleted after 100 visitors

- deleted if visited with IE 6.0 for the first time

- deleted if referrer is Facebook

- ...


Also, irrevocability seems a bit questionable (Google cache, archive.org, etc.)


    <meta http-equiv="Cache-Control" content="no-cache, no-store, must-revalidate" />
    <meta http-equiv="Pragma" content="no-cache" />
    <meta http-equiv="Expires" content="0" />
+

    Cache-Control: no-cache, no-store, must-revalidate
    Pragma: no-cache
    Expires: 0
+

    User-agent: ia_archiver
    Disallow: /
Of course, this won't prevent crawlers which do not honor these headers/meta tags from caching your site, but if you're not in Google's index you're likely not getting traffic from said crawlers.


Good point. I wonder if the meta tags were updated later or if archive.org ignored them - https://web.archive.org/web/20150213152238/http://eep40h.her...


Snapchat for websites...hmmmm perhaps.


I see some potential use for this: for example, as soon as Google's crawlers reach the site, I know that it is accessible from outside and I destroy the site.


That seems to be the exact use case. Did you want to elaborate on why you find that useful?


What is the purpose of a website that is inaccessible "from outside"?


Maybe it's a resource that should only be used by people in a particular organization?


It's not a purpose. It is a detection of state, however.


"Death is reason for the beauty of butterfly"


Who said that? I could not agree less. Butterflies are beautiful for their color, not their death.


It's from "Sohrap Sepehri" [1] an Iranian poet and painter. and I think the replies to your comments answer you question.

[1] https://en.wikipedia.org/wiki/Sohrab_Sepehri


Whoever said that, probably meant selection pressure.


Potentially also that if every butterfly that ever existed were still alive, we wouldn't be very fond of them.


I have to say I'm not usually a fan of conceptual art, but kudos - the concept is great. Keep experimenting!


I would be interested in similar experiments but with a couple of minor variations to see the effects of each:

1. Sending the NOINDEX meta tag

2. Combining meta tags

3. Monitoring for a referrer URL that matches a Google search page to catch the 1st non-sneaky user coming from the index.

4. Monitoring other search engines and their behaviors.


    grep Googlebot /var/www/log/* && rm -rf /var/www/site


How about detecting GoogleBot traffic and deleting when it has crawled your website?


Then anyone would be able to trigger the autodestruct by spoofing their UA.


Googlebot's identity can be authenticated to prevent spoofing:

https://support.google.com/webmasters/answer/80553?hl=en


I actually wasn't aware of that! Thanks for the link.



Like a snow angel? Art that self-destructs? Stay in the moment.


What problem does it solve? EDIT: that was an honest question.


The problem of creating something interesting, a.k.a. creating art.


It leaves a gender-bending aftertaste on society.


It helps provide interesting questions.


solves a broken system.


Guess I better clone the source before it's deleted...


Oh, I was thinking something very similar a few minutes ago, and when I opened Hacker News and saw this post I was amazed.


a) One can also use the referrer to check whether a visitor has come from Google and trigger the deletion (in addition to "seek itself in Google"); see the sketch below.

b) robots.txt would get the same result, plus no cached content at Google, unlike "deleting itself", where the cached content remains at Google.
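
A rough sketch of (a) as a hypothetical Flask hook; the self_destruct() helper is made up for illustration:

    # Sketch only: tear the site down the first time a visitor arrives
    # from a Google results page. self_destruct() is hypothetical.
    from urllib.parse import urlparse
    from flask import Flask, request

    app = Flask(__name__)

    def self_destruct():
        pass  # delete the content, flip a "gone" flag, etc.

    @app.before_request
    def check_referrer():
        host = urlparse(request.headers.get("Referer", "")).hostname or ""
        if host == "google.com" or host.endswith(".google.com"):
            self_destruct()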


You mean a website which can't be used with Chrome or even with Android itself, on any browser.


This makes me think of the immensely cool self-destructing sunglasses in Mission: Impossible.


I have a thought that I will forget immediately once somebody asks me what it is.

Now I am an artist, yay :-)


What is it?


@cjlm, what type of problem does it solve?


A Google-worshipping sand mandala (http://en.wikipedia.org/wiki/Sand_mandala)?



