

Hello Facebook Crawler - mwmnj
http://www.mwmeyer.com/blog/hello-facebook-crawler/

======
whalesalad
This reminds me of a recent experience I had with the Bing bot.

This most recent YC round, my co-founder and I used Skydrive to edit our
application. Skydrive integrates pretty nicely with Word, even on a Mac, to
allow for collaborative editing. It's like the best parts of Sharepoint, minus
all the crap, and inside of a modern UI. I'm a diehard Apple user, but I also
subscribe to the "right tool for the job" principle ... in this case it worked
pretty well.

Anyway, inside the document were links to some private areas of our website
that contained demo materials for YC. As requested, they were not password
protected, but also not linked from anywhere else. While submitting I ensured
that our nginx logs would capture visits to these URLs in a separate log, so
we'd know when they were being looked at (sidenote, seeing visitors coming from
inside justin.tv + the rincon hill towers is kind of exhilarating).

What surprised me was that almost immediately after we began working on the
document, the Bing bot was going apeshit exploring the domain and the
'private' URLs. I had to quickly add a robots.txt to deny all on the root. I
thought it was pretty interesting. At first I felt almost violated. But then
it seems logical that they'd be indexing every URL in every document stored in
their datacenter, why not?
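
For anyone wondering what "deny all on the root" amounts to, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt body and the example.com URL are illustrative assumptions, not the actual file or site from this story, and the check only tells you what a well-behaved crawler would do with the rules.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt body: deny every path to every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical unlinked demo URL, checked for a crawler named "bingbot".
    print(is_allowed(ROBOTS_TXT, "bingbot", "http://example.com/private/demo"))
```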

~~~
eli
You assume they were indexing skydrive documents. It could well be that one of
the people who visited the link had a Bing toolbar installed.

Either way, all publicly accessible documents _will_ get indexed sooner or
later.

~~~
whalesalad
This was before the document had been sent to anyone. It was still being
edited, only my friend and I were working on it. Also, the documents were not
public.

~~~
eli
I would be surprised if Microsoft is intentionally indexing links in private
documents, but my point stands: Google et al are remarkably good at indexing
the web. If you don't want an otherwise public URL indexed you _must_ use
robots.txt or equivalent.

~~~
drivebyacct2
>If you don't want an otherwise public URL indexed you must use robots.txt or
equivalent.

Which only blocks bots that respect the file...

------
cddotdotslash
Why is this even news? Facebook has been crawling links for ages every time
you post on the site. The crawler is how the link you paste gets a title,
description, and sometimes a thumbnail.

~~~
mars
+1. not sure how a post like this can make it to the front page.

~~~
whalesalad
There was a period where Hacker News consisted primarily of people on the
right-hand side of the spectrum. People who were working inside of startups or
had lots of experience with the web and our industry. Pretty much everyone
knew what sharding was, and MongoDB wasn't very popular.

These days we've got a lot more people and they show up all across the board.

Clearly if this is on the homepage, it was voted there by your peers. This
kind of knowledge is completely obvious to many of us, but not everyone is on
your level. Cut 'em some slack.

~~~
omarchowdhury
Even so, for those who are up to that point, that headline could give the
implication that Facebook is getting into the search business.

~~~
jacquesm
That's why you should read the articles and not just the headlines. Headlines
more often than not give a wrong impression.

------
maxjaderberg
By looking at the headers you now have a great way of writing some analytics
tools to see how much your website is shared on Facebook...
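
A rough sketch of that idea, assuming the access log is in the common "combined" format and that the crawler identifies itself with a User-Agent containing `facebookexternalhit`. The regex and the function name are my own. It counts crawler fetches per path, which is only a proxy for shares.

```python
import re
from collections import Counter

# Assumed layout: "METHOD path HTTP/x" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_facebook_fetches(lines):
    """Count requests per path whose User-Agent looks like Facebook's crawler."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and "facebookexternalhit" in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits


if __name__ == "__main__":
    # Hypothetical log location; adjust for your server.
    with open("/var/log/apache2/access.log") as log:
        for path, count in count_facebook_fetches(log).most_common(10):
            print(count, path)
```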

~~~
jabo
I would imagine that they cache the page contents and hence hit a URL only
once in a certain period of time, thus skewing any analytics built around
this.

~~~
dannyr
Yeah, it's cached by Facebook. That's why if you want to change your meta
and/or Open Graph tag info, you need to feed your page to Facebook's URL
Linter (<https://developers.facebook.com/tools/lint/>).

~~~
mpeg
You can also programmatically force it to refresh the cache via a POST to
<https://graph.facebook.com/?id=http://google.com&scrape=true>
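
A minimal sketch of building that POST with Python's standard library. The helper name is made up, it only constructs the request without sending it, and it assumes the endpoint accepts `id` and `scrape` as query parameters exactly as in the URL above. Current versions of the Graph API may also want an access token, which is not shown here.

```python
from urllib.parse import urlencode
from urllib.request import Request

GRAPH_ENDPOINT = "https://graph.facebook.com/"


def build_scrape_request(page_url: str) -> Request:
    """Build (but do not send) a POST asking Facebook to re-scrape page_url."""
    query = urlencode({"id": page_url, "scrape": "true"})
    # An empty body plus method="POST" makes this a POST rather than a GET.
    return Request(f"{GRAPH_ENDPOINT}?{query}", data=b"", method="POST")


if __name__ == "__main__":
    request = build_scrape_request("http://google.com")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```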

------
edouard1234567
I'm surprised this post made it to the homepage... They've been doing that
forever, no need to look at your logs to figure this out. How else would they
find and display an image from the page you're providing a link to?

------
justinph
12 lines of code instead of:

    tail -f /var/log/apache2/access.log

------
eli
I would imagine they're checking the URL for malware as well.

~~~
nwh
Probably, I've seen them ban whole domains (droplr.com) previously for
distributing malware.

------
spyder
Also, it would be smart to run a malware check on these URLs if they aren't
already doing it.

------
slajax
I wish I had enough karma to down vote this.

