
Facebook crawls every page recorded by its tracking pixel - greenone
So yesterday we figured out that Facebook's Facebot crawler will crawl _every_ URL that was recorded by their tracking pixel.

I find this highly concerning since:

1. they are crawling potentially sensitive information granted by links with tokens

2. they are triggering potentially harmful and/or confusing actions in your website by repeating links

3. they are repeating requests in a broken way by not encoding URL parameters correctly; for instance, a URL-encoded %2B ends up as a bare "+" and thus becomes a whitespace (same goes for slashes etc.)

4. I could not find a warning or note in their tracking-pixel documentation that pages tracked would be crawled later
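Point 3 is easy to reproduce with Python's standard-library query-string parser (a minimal sketch; the `token` parameter name is illustrative). An encoded %2B decodes to a literal plus, but a bare + in a query string decodes to a space, so a crawler that replays the already-decoded URL sends a different value:

```python
from urllib.parse import parse_qs

# Original request: the client correctly encodes "+" as %2B.
original = parse_qs("token=abc%2Bdef")
print(original["token"])   # ['abc+def']

# A crawler that decodes the URL once and re-requests it sends a bare "+",
# which query-string rules turn into a space -- a different token.
replayed = parse_qs("token=abc+def")
print(replayed["token"])   # ['abc def']
```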
======
K0nserv
> 1\. they are crawling potentially sensitive information granted by links
> with tokens

Don't put Facebook tracking on sensitive pages. Actually, as a service to your
users, don't put it anywhere it doesn't add value.

> 2\. they are triggering potentially harmful and/or confusing actions in your
> website by repeating links

They only perform idempotent[0]* requests, which should not have any negative
effect if performed multiple times.

0:
[http://restcookbook.com/HTTP%20Methods/idempotency/](http://restcookbook.com/HTTP%20Methods/idempotency/)

* They probably only actually perform GET in reality

~~~
d33
GET is only idempotent in theory. Way too many people abuse GET when creating
websites.
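The abuse d33 describes looks roughly like this (hypothetical handlers, not any real framework): the unsafe variant mutates state on a GET, so any crawler that follows the link destroys data just by visiting the URL.

```python
posts = {1: "hello", 2: "world"}

def handle_get_delete(post_id):
    # Anti-pattern: a state change on GET. A crawler fetching
    # /delete?id=1 deletes the record just by visiting the URL.
    posts.pop(post_id, None)
    return "deleted"

def handle_post_delete(method, post_id):
    # Correct: only mutate on POST/DELETE; GET stays safe,
    # so repeated crawls have no effect.
    if method != "POST":
        return "405 Method Not Allowed"
    posts.pop(post_id, None)
    return "deleted"
```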

~~~
VMG
Those people should suffer the consequences.

I'm not a fan of facebook in the slightest, but they are crawling websites
they were essentially invited to.

~~~
CapacitorSet
>websites they were essentially invited to

Using an analytics pixel is _not_ an invitation to crawl a website.

~~~
eli
No, merely posting it on a public server was the invitation

~~~
thecatspaw
It is not. Leaving your door unlocked is not inviting everybody in to take
your stuff. You might make it easier for them to break in, but it is still a
break-in.

Making something available to the public is not the same as going to the
google webmaster tools and telling them to index your page.

~~~
freehunter
You can argue it's abuse or illegal or fraud or whatever you want, but here's
the thing: how are you going to stop them? Sure, maybe you stop Facebook with
a lawsuit... but everyone else is still doing it, even people outside of your
legal jurisdiction. They're still going to do it, so it's up to you to stop
them with your design. If someone breaks into your website and destroys a
user's data or steals their credit card, that user is _not_ going to want to
hear "but what they did was against our ToS!"

This isn't your house, where there are police patrolling and ready to respond
at a moment's notice when they're called. This is the Internet, accessible by
almost literally everyone on the planet, and they don't give a shit about your
policy. That's why best practices and application security were invented. So
use them.

------
detaro
If the security of sensitive information depends on tokens in the URL, _don't
just give those URLs to a third party_. How would that ever be a reasonable
thing to do? (Especially since the third party apparently hasn't given you any
guarantees on how they treat them, otherwise we wouldn't be having this
conversation.)

Do your users, your broken software and yourself a favor and don't put
Facebook tracking crap everywhere.

~~~
radicalbyte
That's because our industry is full of hopelessly underqualified people who
somehow manage to create software by randomly throwing together bits of code
from stackoverflow / books and tweaking it until it "works".

~~~
tpllaha
That's true. But trackers (especially big ones) also deliberately try to make
it as easy as possible to accidentally include them in your page: e.g. if you
use their CDN for fonts/style sheets, if you include a FB like button, etc.

A friend of mine covers this more extensively in this blog post which I found
a very interesting read: [https://remusao.github.io/posts/static-
comments.html](https://remusao.github.io/posts/static-comments.html)

So I think it's still concerning that once they're in, they start crawling.
Although not very surprising I have to say... That's their business after all.

~~~
radicalbyte
Yeah, of course they'll try that. They're businesses; they make their money by
tracking people. When I inherit a team or project, these CDN links are the
second thing to get removed / fixed (after the inevitable unencrypted
passwords / home-rolled security and home-rolled SQL).

------
Quppa
I don't mind Facebook crawling pages as long as it respects robots.txt, but
for the last few weeks we've been _hammered_ by requests from Facebook-owned
IP addresses (millions of hits daily, 50+ for the same URL at times). They
don't even set the User-Agent header.

There's a bug report regarding the missing header here:
[https://developers.facebook.com/bugs/1654459311255613/](https://developers.facebook.com/bugs/1654459311255613/)

Unfortunately it seems impossible to get in touch with Facebook devs directly.

~~~
ikeboy
Send a cease and desist to the CTO. Wait 30 days, then sue under the CFAA.
LinkedIn did it

~~~
icebraining
No, LinkedIn _was sued_ by the recipient of the letter.

------
gnud
I assume the crawler only does HEAD/GET requests. It's your fault if your
webpage changes anything based on a GET.

Now, if the crawler doesn't honor robots.txt, then you can complain (loudly).

------
slig
> they are triggering potentially harmful and/or confusing actions in your
> website by repeating links

Not their fault. GET requests should not modify anything.

------
throwaway2016a
This is a great example of outrage by someone who doesn't understand how the
web works. Unfortunately this is a problem with lots of web developers, but the
author shouldn't take it personally and should instead try to learn from it. I
can understand if they don't, though, because some of the replies here are a
little harsh.

The summary of what most people are saying, including some takeaways:

\- If you put something on the Internet it is public. Period. It is up to you
to keep prying eyes away from that page. You can do that with strong
mechanisms (like passwords and firewalls) or weak (like robots.txt) but you
need to do something. You can't expect a page on the Internet to be private.

\- Requests should never, ever have anything sensitive in the query string. The
query string is inherently logged: by your browser history, your web server,
any tracking pixels (like Facebook's) you put on the page, etc. If you
absolutely must include a token in the URL (as with OAuth), make sure it is a
temporary token that is immediately replaced with something more durable like a
cookie or local storage, that no unnecessary HTML is rendered, and that the
user is redirected to a new page that doesn't have it in the URL.
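That token-exchange pattern can be sketched like this (hypothetical names and storage; a real OAuth flow adds state/PKCE checks on top):

```python
import secrets

sessions = {}                        # session_id -> user, kept server-side
valid_codes = {"tmp-abc": "alice"}   # one-time codes, e.g. from an OAuth redirect

def handle_callback(query_token):
    # Exchange the short-lived URL token immediately and invalidate it,
    # then move auth into a cookie so nothing sensitive lingers in the URL.
    user = valid_codes.pop(query_token, None)
    if user is None:
        return None  # token unknown or already used
    session_id = secrets.token_urlsafe(16)
    sessions[session_id] = user
    # Redirect to a clean URL; auth now travels in a header, not the URL.
    return {"status": 302, "location": "/home",
            "set_cookie": f"session={session_id}; HttpOnly; Secure"}
```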

\- GET requests should be safe and idempotent. They should avoid changing any
data and should not have side effects. This is specified directly in the HTTP
spec.

\- If your page displays sensitive data, clients should send the security
tokens in a header field (like cookies or authentication). Requests that hit
the page without that header field should be answered with a 404.

\- Your point #3 is an odd one. It is a bug on Facebook's side, yes, but it
doesn't support your primary argument. In fact, if they fixed that bug it
would make the perceived issues in your primary argument worse.

\- Re #4 they don't need to warn you. See the first bullet. If it is on the
internet it is public. Skype, Slack, Twitter, Google, all do the same thing.

------
Artemix
Isn't it obvious? For which reason, if not tracking and information gathering,
would such a feature even exist?

Best solution is still to block Facebook's infrastructures, as always.

~~~
xstartup
I disagree; it's not at all obvious. Pixels are used for conversion tracking
(cookie pixels / cookieless pixels). Crawling isn't necessary for a pixel to
function.

~~~
lclarkmichalek
If you're trying to transmit trusted information from client-side JS, one
common pattern is to have the user's browser fire the data initially, then
crawl the page to obtain a trusted copy. The last company I worked for did
this but skipped the second step, which led to all kinds of XSS attacks.
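The verify-by-crawl pattern described above could be sketched like this (hypothetical names; `fetch_page` stands in for a real server-side HTTP fetch). Skipping the second step means trusting arbitrary client input:

```python
def record_conversion(reported, fetch_page):
    # Step 1: accept the client-side report (untrusted).
    # Step 2: crawl the URL ourselves and compare against a trusted copy.
    page = fetch_page(reported["url"])
    trusted_value = page.get("order_total")
    if trusted_value != reported["order_total"]:
        return "rejected"   # client-side report didn't match the crawl
    return "recorded"
```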

------
dotdi
<rant> Shocking!

Abuse of power and shady tracking techniques by Facebook? Unheard of! </rant>

Seriously, this cannot be surprising after learning that the Messenger app
listens to everything you do, all the time. That's just off the top of my
head. They are doing this and much more.

~~~
rock_hard
Please get your facts straight before posting here.

[https://www.wired.com/story/facebooks-listening-
smartphone-m...](https://www.wired.com/story/facebooks-listening-smartphone-
microphone)

~~~
nukeop
What facts? This article is full of inaccurate speculation but there are no
facts in it. For example:

>To make it happen, Facebook would need to record everything your phone hears
while it's on.

This assumption treated as an axiom here is false, and it makes the rest of
the article inaccurate as well.

------
agopaul
A while ago, while looking at the Apache logs, I noticed that the AdWords
remarketing pixel does the same: it was trying to crawl private URLs that are
only accessible to 'admins' and are not linked publicly. I'm not sure if this
is still the case, as I blocked it using robots.txt.

Also, the same crawler ignores the "User-agent: *" directive in the robots.txt
file and you have to add specific rules for it: "User-agent: Adsbot-Google"
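If that observation holds, a robots.txt along these lines would be needed (the path is illustrative; Google does document that its AdsBot crawlers ignore the `*` group and must be named explicitly):

```text
# The wildcard group does not cover AdsBot, so it gets its own group.
User-agent: *
Disallow: /admin/

User-agent: AdsBot-Google
Disallow: /admin/
```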

------
unicornporn
> So yesterday we figured out that facebooks Facebot crawler will crawl
> _every_ url that was recorded by their tracking pixel.

Not surprising at all. Would be interesting to see a write up on this.

------
dspillett
> we figured out that facebooks Facebot crawler will crawl _every_ url that
> was recorded by their tracking pixel.

I would be more surprised to find out that they didn't crawl everything they
can, specifically pages that invite them in.

> 1\. they are crawling potentially sensitive information granted by links
> with tokens

If the page contains sensitive information you absolutely should not have code
on it that you do not control (_any_ code loaded from third-party hosts, not
just facebook's bits).

As a matter of security due diligence, if you have third-party hosted code
linked into any such pages you should remove it with some urgency and
carefully review the design decisions that led to the situation. If you
really must have the third-party code in that area, then you'll need to find a
way of removing the need for the tokens to be present.

Furthermore, if the information is sensitive to a particular user then your
session management should not permit a request from facebook (or any other
entity that has not correctly followed your authentication procedure) to see
the content anyway.

> 2\. they are triggering potentially harmful and/or confusing actions in your
> website by repeating links

Possibly true, but again that suggests a design flaw in the page in question.
I assume that they are not sending POST or PUT requests? GET and HEAD requests
should at very least be idempotent (so repeated calls are not a problem) and
ideally lack any lasting side effect (with the exception of logging).

> 3\. they are repeating requests in a broken way by not encoding url-
> parameters correctly

That does sound like a flaw, but one that your code should be immune to being
broken by. Inputs should always be validated, and action not taken unless they
are valid. This is standard practice for good security and stability. The
Internet is a public place; the public includes both deliberately nasty people
and damagingly stupid ones, so your code needs to take proper measures to stop
malformed inputs from causing problems.
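A minimal sketch of that validation (the token shape is an assumption for illustration): anything that doesn't match the expected pattern is rejected before any action is taken, so a "+"-mangled-into-space token simply fails here.

```python
import re

# Hypothetical token format: 20-64 URL-safe characters.
TOKEN_RE = re.compile(r"^[A-Za-z0-9_-]{20,64}$")

def validate_token(raw):
    # Reject anything that doesn't match the expected shape; a token
    # corrupted by bad URL decoding (e.g. containing a space) fails here.
    if raw is None or not TOKEN_RE.fullmatch(raw):
        return None
    return raw
```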

You can't use "the page isn't normally linked from other sources so won't
normally be found by a crawler" as a valid mitigation because the page could
potentially be found by a malicious entity via URL fuzzing.

> 4\. I could not find a warning or note on their tracking-pixel documentation
> that pages tracked would be crawled later

A warning would be nice, but unless they explicitly say they won't do such
things, I would be more surprised to find that they didn't than that they do.

------
eli
Does it crawl URLs blocked by robots.txt? I doubt it. If you don't want well-
behaved crawlers to crawl your site, there's your answer. But not all crawlers
are well behaved...

------
dna_polymerase
It is the fucking internet, if you put something on there you should expect
someone to find it, be it a crawler or an attacker.

> 1\. they are crawling potentially sensitive information granted by links
> with tokens

If tokens in GET params are your security concept: please leave the entire
field.

> 2\. they are triggering potentially harmful and/or confusing actions in your
> website by repeating links

So you built something that can be triggered by a simple HTTP request and may
have a harmful potential? Wow.

> 3\. they are repeating requests in a broken way by not encoding url-parameters
> correctly

You are kidding, right? That's a problem for you? Either your webserver drops
these or your routes don't match, end of story.

> 4\. I could not find a warning or note on their tracking-pixel documentation
> that pages tracked would be crawled later

Not a problem: you put it on the web and it will be crawled. Did you ever use
Chrome? They report every URL you type to the Google crawler. Read that
anywhere lately?

~~~
rurounijones
Everything you said was technically correct, yet the message will probably be
lost due to the manner in which you decided to deliver it.

------
boraturan
Highly likely that they are feeding all the data into a deep network for their
ad recommendation engine.

------
gaius
Sorry, if you are using their tracking pixel then you deserve no sympathy for
the consequences.

------
zerostar07
Is this the same crawler they use for external links? If yes, you can exclude
them; their User-Agent ID is "facebookexternal".

------
Angostura
Does it take notice of robots.txt?

------
receptor
This is borderline criminal. Practically CSRF attack.

~~~
smt88
This sounds a little extreme at first, but I actually totally agree. It's in
murky waters when it comes to GDPR, for starters.

Where do they draw the line? Why not run a keylogger through embedded like
buttons and widgets? That sounds worse, but isn't all that much worse.

~~~
dspillett
> It's in murky waters when it comes to GDPR, for starters.

I'm not sure about facebook's side, but from the point of view of how GDPR
applies to the side being crawled: if they, as custodians of PII and other
sensitive data, are handing it out to unauthenticated requests, _they_ might
be liable for punishment for lack of due diligence.

~~~
throwaway2016a
I agree with this. The website author is potentially liable for providing
inadequate protections to the user's PII. I don't see anything that would
implicate Facebook here.

Although, there is an interesting side effect that applies to all crawlers in
that website owners failing to protect their customer PII like this means that
crawlers inadvertently gather and store personal data as a side effect. I
can't help but wonder if there is some liability there and if there is if
something like AI or pattern matching can help to scrub the info before it is
stored.

~~~
dspillett
facebook might have an issue with having collected the data too, of course,
but the source site definitely should be taking appropriate measures to avoid
handing it out in the first place.

