
What one may find in robots.txt - cellover
http://xn--thibaud-dya.fr/robots.txt.html
======
roel_v
I needed some real-world MS Word & Excel documents years ago to test some
parsing code against. So I started crawling Google results for 'filetype:
_.doc_.xls' style queries. Left it running for a weekend, then the whole
Monday was wasted as I was sucked into looking through the results - some
stuff in there was certainly not meant for public disclosure...

~~~
frik
Did you really crawl Google? That must have been a long time ago. But speaking
of searching on Google as a user:

Google's Advanced Search used to be a great tool, until around 2007/08. For
some reason it never received an upgrade, and several things are broken, no
longer work, or were removed (e.g. '+', which is now a keyword for Google+,
with '"' supposed to do the same job; some filetypes are blocked, some show
only a few results).

~~~
rsync
Google never had an advanced search as useful as AltaVista's, where I
regularly (daily!) searched for things like:

("term1" and "term2") not ("term3" or "term4")

and whatever tiny "power user" features Google did have, like "allinsite:term
term term" or '+', don't seem to work at all now.

Google is not optimized for finding things. Google is optimized for ad views
and clicks.

~~~
nostrebored
So you mean like "term1" "term2" -"term3" -"term4"? Or, if I wanted to do this
without returning results from Hacker News, "term1" ... -"term4"
-site:news.ycombinator.com ?

I don't see how AltaVista is superior here.

~~~
dbbolton
The problem is "whatever tiny 'power user' features that google had... don't
seem to work at all now."

I think I know what they were talking about. A lot of the time it appears that
adding advanced terms to a query will change the estimated number of results,
yet all the top hits will be exactly the same. Also, punctuation seems to be
largely ignored, e.g. searching "etc apt sources list" and
"/etc/apt/sources.list" both give me the exact same results. Putting the
filename in quotes also gives the same results as before.

Searching for specific error messages with more than a few key words or a
filename is usually a nightmare.

~~~
nostrebored
This is all true. I do wish there were a flag you could set like
searchp:"/etc/apt/sources.list"

------
philip1209
I find this submission interesting because it underscores the inconsistent
handling of I18N/Punycode domains. The domain is "thiébaud.fr". Should
submission sites (like HN) show it in ASCII? Is there a fraud risk? Should the
web browser show the domain in ASCII?

For me, at no point was I shown the domain decoded to its Unicode form (either
on HN or in the browser). I recognized the pattern and decoded it manually. For
users who are not technical, this is a failed experience because the domain
looks suspicious and at no point was it decoded.
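
For anyone who wants to do the decoding programmatically, here is a minimal
Python sketch using the standard library's built-in 'idna' codec (the domain
below is the one from this submission):

    # Convert between the Unicode and Punycode (ASCII) forms of an IDN,
    # using only the Python standard library's "idna" codec.
    unicode_domain = "thiébaud.fr"
    ascii_domain = unicode_domain.encode("idna").decode("ascii")
    print(ascii_domain)   # xn--thibaud-dya.fr

    decoded = b"xn--thibaud-dya.fr".decode("idna")
    print(decoded)        # thiébaud.fr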

I wonder when punycode decoding will begin to get attention from developers.
Last year's Google IO had a great talk about how Google realized the
inconsistency of their domain handling with regard to I18N:

[https://www.google.com/events/io/schedule/session/22ce27dc-7...](https://www.google.com/events/io/schedule/session/22ce27dc-7cbf-e311-b297-00155d5066d7)

~~~
aidenn0
I'm on firefox, and it showed me the decoded domain.

~~~
dbbolton
Also on FF, 37.0.2 Linux x64 build, and I see `xn--thibaud-dya.fr` on HN but
`thiébaud.fr` in the status/address bars.

~~~
aidenn0
38.0.1 Linux x64

------
nickhalfasleep
User-agent: *

Allow: /

# A robot may not injure a human being or, through inaction, allow a human being to come to harm.

# A robot must obey the orders given it by human beings, except where such orders would conflict with the First Law.

# A robot must protect its own existence, as long as such protection does not conflict with the First or Second Laws.

------
ar-jan
Hm, the article says several times "required not to be indexed", while
robots.txt is more like "request not to be _crawled_". An important
distinction, because a page may well be indexed without being crawled,
typically when there is a link to it. Better to use the noindex meta tag (at
least if you are only concerned with search indexes, not access control).

~~~
polaco
Yes. This needs to be emphasized: disallowing URLs in robots.txt will not
necessarily exclude them from search results. Search engines will still find
those pages if they are linked to or mentioned somewhere else.

The search result will consist only of the URL, and the snippet will say "A
description for this result is not available because of this site's
robots.txt".

Use the noindex tag, folks. Also, Google Webmaster Tools allows you to remove
URLs from Google's index.

Ref.: [https://yoast.com/prevent-site-being-indexed/](https://yoast.com/prevent-site-being-indexed/)
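
To make the crawl-vs-index distinction concrete, here is a small Python sketch
using the standard library's robots.txt parser (the site and URL are
hypothetical): robots.txt only governs whether a polite crawler fetches a
page; keeping a known URL out of the index is the job of the noindex meta tag
or an X-Robots-Tag header.

    from urllib.robotparser import RobotFileParser

    # A well-behaved crawler checks robots.txt before *fetching* a page.
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # hypothetical site
    rp.read()

    url = "https://example.com/private/report.html"  # hypothetical URL
    if rp.can_fetch("*", url):
        print("allowed to crawl", url)
    else:
        # Disallowed: the crawler won't fetch the page, but a search engine
        # that saw the URL linked elsewhere may still list the bare URL.
        # To keep it out of the index, the page must remain fetchable and
        # carry <meta name="robots" content="noindex"> or an X-Robots-Tag
        # header.
        print("not allowed to crawl", url)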

------
Gigablah
> "At worse, only the internet etiquette has been breached."

Proceeds to announce the name of a stalking victim. Classy.

~~~
camillomiller
It's a name that's in a plain-text document that anybody can look up in their
browser. What's the point of hiding it, if it makes a very good point for his
article?

~~~
MatthewWilkes
He could have made exactly the same point with a fake name and institution.
The information became public due to incompetence; he's made it _much_ more
visible… I don't care to speculate on what personal failing might be the
reason.

~~~
rlidwka
It won't be _exactly_ the same point.

With a fake name there is no proof that the information in question ever
existed.

I would probably have masked the name, but the institution URL should stay
there, so anyone could check that the point is valid.

~~~
MatthewWilkes
Okay, that's fair, but it would be the same point for practical purposes, in
my opinion. I don't personally believe it's necessary to hand-hold people
through independently verifying his individual claims. Certainly other people
could copy his methods to reproduce the results, but I'm not sure what the
benefit is of linking to the individual leaks of sensitive information.

------
psykovsky
I like to put a Disallow rule for a randomly named directory containing an
index.php file that blocks any IP that accesses it. Then, for a bit of added
fun, I put another Disallow rule for a directory named "spamtrap", which does
an .htaccess redirect to the block script in the randomly named directory.
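
The setup described above is PHP plus .htaccess, but the idea translates to
other stacks; here is a rough, illustrative Python sketch of such a trap
endpoint (the path, port and blocklist file are made up, and a real deployment
would block at the firewall or web-server level rather than in a toy HTTP
handler):

    # Sketch of a robots.txt honeypot trap: any client that requests the
    # disallowed path gets its IP appended to a blocklist. Illustrative only.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    TRAP_PATH = "/x7f3q9/"          # the randomly named, Disallow'ed directory (hypothetical)
    BLOCKLIST = "blocked_ips.txt"   # consumed elsewhere, e.g. by a firewall script

    class TrapHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path.startswith(TRAP_PATH):
                with open(BLOCKLIST, "a") as f:
                    f.write(self.client_address[0] + "\n")
                self.send_error(403, "Forbidden")
            else:
                self.send_error(404, "Not Found")

    if __name__ == "__main__":
        HTTPServer(("", 8080), TrapHandler).serve_forever()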

~~~
asddubs
You understand that all you are doing with that is handing an attack vector to
people who might be looking for one, right? All an evildoer has to do is embed
the blacklisting directory as an image somewhere, send it to someone they want
to lock out of the service, etc.

~~~
laumars
It wouldn't be too hard to prevent that, e.g. the honeypot directory name
could be a hash of the originating IP. You'd then need a dynamic robots.txt,
but that's easily done.

The destination directory doesn't even need to exist. Worst case, you could
handle the hash via your 404 handler, or via an .htaccess file if all of your
hashes are prefixed. Those are only examples though - there's a multitude of
ways you could handle the incoming request.
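
A minimal Python sketch of that idea (the secret salt and path prefix are made
up; serving the generated robots.txt and doing the actual blocking are left
out):

    import hashlib

    SECRET = b"some-private-salt"   # hypothetical; keeps the hash unguessable for other IPs

    def trap_path(client_ip: str) -> str:
        """Honeypot path derived from the visiting IP, so a link planted by a
        third party only ever matches the trap path of the third party's own IP."""
        digest = hashlib.sha256(SECRET + client_ip.encode()).hexdigest()[:16]
        return "/trap-" + digest + "/"

    def robots_txt(client_ip: str) -> str:
        # Dynamic robots.txt: each visitor sees a Disallow line unique to them.
        return "User-agent: *\nDisallow: " + trap_path(client_ip) + "\n"

    def is_trap_hit(client_ip: str, requested_path: str) -> bool:
        # e.g. called from the 404 handler, since the directory never really exists
        return requested_path.startswith(trap_path(client_ip))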

~~~
Yengas
Just check the Referer header.

~~~
laumars
You can't trust the referrer header for anything security related.

~~~
Yengas
[https://www.owasp.org/index.php/Cross-Site_Request_Forgery_(CSRF)_Prevention_Cheat_Sheet#Checking_The_Referer_Header](https://www.owasp.org/index.php/Cross-Site_Request_Forgery_\(CSRF\)_Prevention_Cheat_Sheet#Checking_The_Referer_Header)

~~~
laumars
Your citation does reiterate my point. I quote: " _However, checking the
referer is considered to be a weaker form of CSRF protection._ "

The referrer header is subject to all sorts of subtle edge cases, such as
switching between secure and insecure content (or is it the other way around?
I can't recall offhand), in which case many browsers will refuse to send a
referrer header at all. So while checking the referrer might work most of the
time, it's really not robust enough to be considered trustworthy for anything
security-related.

------
noobie
What's with the domain name? It says _thiébaud.fr_ but when copied and pasted
it becomes _xn--thibaud-dya.fr_.

Edit: Thanks for the links! :)

~~~
Flimm
Interestingly, Hacker News doesn't support this standard, and the link on the
front page shows the unfriendly version.

~~~
ozh
It's because IDN domains became a standard after tables stopped being used in
page layouts </sarcasm>

------
look_lookatme
I like:

    
    
      [...]
    
      User-agent: nsa
      Disallow: /
    

From slack.com/robots.txt

------
neil_s
Wow, all the US Department of State files have just gone missing from
archive.org. The servers hosting those files are conveniently down.

~~~
benyami
Check out the Internet Archive FAQ on how to remove a document from their
archives.
[https://archive.org/about/exclude.php](https://archive.org/about/exclude.php)

It looks like they used robots.txt to do that.

~~~
neil_s
Huh, so the wild-card user-agent will block not just searchbots, but also
archivebots. Wonder how OP managed to get screenshots of archive.org having
archives available for those documents.

------
grandfish456
Regarding the Knesset website, it is actually just boring recordings of the
parliament's discussions. Nothing to see here, move along... :-)

------
Scoundreller
Now if someone could do an analysis of humans.txt, that would be cool.

~~~
jenscow
You'd have to get a bot to do that, since us humans can't access the content
it points to.

------
wtbob
> xn--thibaud-dya.fr

Looks like HN needs to learn how to decode Punycode…

~~~
ryan-c
It's _very_ difficult to render decoded punycode domains in a way that does
not facilitate spoofing.
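
A classic illustration of the problem is a homograph domain; here is a small
Python sketch (again using the standard library's 'idna' codec) showing how a
Cyrillic 'а' produces a lookalike that only the Punycode form distinguishes:

    # Homograph example: a Cyrillic "а" (U+0430) looks like a Latin "a".
    spoof = "\u0430pple.com"          # renders almost identically to "apple.com"
    print(spoof == "apple.com")       # False - it is a different domain
    print(spoof.encode("idna"))       # the xn-- form reveals the non-ASCII label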

------
eridal
I wonder how you could gather a huge _domain name_ list.

I guess by using DNS, or by querying some engine, like Google or archive.org.

Is there a service somewhere?

~~~
anc84
1 million domains ranked by Alexa: [http://s3.amazonaws.com/alexa-static/top-1m.csv.zip](http://s3.amazonaws.com/alexa-static/top-1m.csv.zip)

For more try [https://commoncrawl.org/](https://commoncrawl.org/)
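
If you want to pull that list programmatically, a quick Python sketch using
only the standard library (assuming the Alexa URL above is still being
served):

    import csv, io, urllib.request, zipfile

    URL = "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"

    # Download the zip, pull out top-1m.csv and read its "rank,domain" rows.
    with urllib.request.urlopen(URL) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))

    with archive.open("top-1m.csv") as f:
        reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
        domains = [domain for rank, domain in reader]

    print(len(domains), "domains; first few:", domains[:5])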

