

Facebook's robots.txt - sander
https://www.facebook.com/robots.txt

======
perryh2
[http://disqus.com/humans.txt](http://disqus.com/humans.txt)

~~~
glomph
[http://www.last.fm/robots.txt](http://www.last.fm/robots.txt)

------
viana007
[http://www.google.com/robots.txt](http://www.google.com/robots.txt)

~~~
easy_rider
/* would have sufficed

~~~
darkmighty
[https://www.google.ca/search?q=google+search+domain&oq=googl...](https://www.google.ca/search?q=google+search+domain&oq=google+search+domain&aqs=chrome..69i57j0l5.2849j0j9&sourceid=chrome&espv=210&es_sm=93&ie=UTF-8#es_sm=93&espv=210&filter=0&q=site:www.google.com/search+meta)

------
kr1m
You don't scrape Facebook, Facebook scrapes you!

~~~
jgalt212
In the US, you catch a cold. In Soviet Russia, cold catches you!

[http://en.wikipedia.org/wiki/Russian_reversal](http://en.wikipedia.org/wiki/Russian_reversal)

------
yalogin
So what does it mean that Facebook whitelists a scraping service? Do they
actively block scrapers?

~~~
dblacc
I could be wrong, but I believe the default is that spiders are blocked and
only the "User-Agents" listed are allowed to scrape (except for the
disallowed pages).
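
For illustration, here's roughly how such a whitelist-style file behaves when
run through Python's standard urllib.robotparser (the rules below are made up
for the example, not Facebook's actual file):

    import urllib.robotparser

    # Whitelist-style rules: a named crawler gets specific Disallow lines,
    # everyone else falls through to "User-agent: *" and is blocked entirely.
    rules = """\
    User-agent: Googlebot
    Disallow: /ajax/
    Disallow: /album.php

    User-agent: *
    Disallow: /
    """.splitlines()

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)

    print(rp.can_fetch("Googlebot", "/somepage"))     # True:  listed, not disallowed
    print(rp.can_fetch("Googlebot", "/ajax/feed"))    # False: listed, but disallowed
    print(rp.can_fetch("SomeOtherBot", "/somepage"))  # False: hits "User-agent: *"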

~~~
elbear
You are correct.

------
pdfcollect
Is there a way to replace this robots.txt with a null robots.txt? :)

~~~
toomuchtodo
You just ignore the robots.txt file, crawl slowly, and do it from distributed
virtual machines.

Not that you should do that. Robots.txt is a nicety, though: the client doesn't
have to respect it, and the server doesn't have to allow your HTTP requests.
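
The "crawl slowly" part is just a delay between requests; a minimal sketch
(URLs and delay are placeholders):

    import time
    import urllib.request

    PAGES = ["https://example.com/page1", "https://example.com/page2"]
    DELAY_SECONDS = 10  # long enough that the target barely notices you

    for url in PAGES:
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        print(url, len(body))
        time.sleep(DELAY_SECONDS)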

------
bibstha
What is a User Agent: Yeti?

~~~
unfunco
It's the crawler for Naver, a South Korean search engine.

------
decasteve
Even Facebook's robots.txt has a hatred for my pseudo-anonymous browser
settings. Facebook gives me this (for any page): "Sorry, something went wrong.
We're working on getting this fixed as soon as we can."

~~~
startling
robots.txt isn't enforced.

~~~
easy_rider
Maybe it should be. Gentlemen's agreements do not apply to robots.

~~~
cheald
And how exactly do you propose verifying that a user agent purporting to be
Googlebot or Firefox is actually who it claims to be? The rules are inherently
unenforceable.

robots.txt is basically a list of rules that lay out "This is how we'd like
you to crawl us. We might stop serving you if you don't comply", rather than a
hard-and-fast set of directives that specify how a webcrawler will be
guaranteed to behave.

~~~
easy_rider
You can implement some strict enforcement in Apache using some crafty
mod_rewrite rules:
[http://andthatsjazz.org/defeat.html](http://andthatsjazz.org/defeat.html)

The User-agent header is too easily spoofed, but we could check whether the
robots are indeed Google (whitelisted) and not some other crawler that just
wants to scrape your content.
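
One concrete version of that check is the forward-confirmed reverse DNS lookup
Google itself recommends for verifying Googlebot; a rough Python sketch (the
function name is mine, the hostname suffixes are the ones Google publishes):

    import socket

    def is_real_googlebot(ip):
        """Reverse-resolve the client IP, check the hostname suffix, then
        confirm the hostname resolves back to the same IP."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)   # reverse DNS lookup
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            forward_ips = {info[4][0] for info in socket.getaddrinfo(host, None)}
        except OSError:
            return False
        return ip in forward_ips                    # forward lookup must match

    # A scraper that merely spoofs the Googlebot User-agent string fails this
    # check, because its IP won't reverse-resolve to a Google hostname.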

In the realm of mail servers we have something called SPF:
[http://en.wikipedia.org/wiki/Sender_Policy_Framework](http://en.wikipedia.org/wiki/Sender_Policy_Framework)

Just thinking out of the box here, but beyond checking IP ranges: maybe a hash
could be sent as a header in the GET request by the crawler to verify that
they are who they say they are.
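
That hash idea amounts to a shared-secret signature on each request; a toy
sketch of both sides (header names, key, and scheme are all hypothetical):

    import hashlib
    import hmac
    import time

    SHARED_KEY = b"key-exchanged-out-of-band"  # hypothetical pre-shared secret

    def sign_request(path):
        """Crawler side: attach a timestamped HMAC of the requested path."""
        ts = str(int(time.time()))
        sig = hmac.new(SHARED_KEY, f"{path}:{ts}".encode(), hashlib.sha256).hexdigest()
        return {"X-Crawler-Timestamp": ts, "X-Crawler-Signature": sig}

    def verify_request(path, headers, max_age=300):
        """Server side: recompute the HMAC and reject stale or forged headers."""
        ts = headers.get("X-Crawler-Timestamp", "")
        sig = headers.get("X-Crawler-Signature", "")
        if not ts.isdigit() or abs(time.time() - int(ts)) > max_age:
            return False
        expected = hmac.new(SHARED_KEY, f"{path}:{ts}".encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, sig)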

