Facebook crawls every page recorded by its tracking pixel
92 points by greenone 5 months ago | 72 comments
So yesterday we figured out that Facebook's Facebot crawler will crawl _every_ URL that was recorded by their tracking pixel.

I find this highly concerning since:

1. they are crawling potentially sensitive information granted by links with tokens

2. they are triggering potentially harmful and/or confusing actions in your website by repeating links

3. they are repeating requests in a broken way by not encoding url-parameters correctly; for instance, a url-encoded %2B arrives as a literal "+", which is then decoded as a space (the same goes for slashes etc.)

4. I could not find a warning or note on their tracking-pixel documentation that pages tracked would be crawled later
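
A minimal Python sketch of the encoding problem in point 3. It assumes the crawler decodes the recorded URL once and then replays it without re-encoding; that mechanism is a guess, and the token value is made up for illustration.

```python
from urllib.parse import parse_qs, quote, unquote

# Hypothetical token containing characters that must be percent-encoded.
token = "a+b/c"

# What the site puts into the link: "+" becomes %2B, "/" becomes %2F.
encoded = quote(token, safe="")
original_query = "token=" + encoded

# The site decodes the original request correctly.
seen_from_user = parse_qs(original_query)["token"][0]

# Assumed crawler behaviour: decode once, then resend without re-encoding.
replayed_query = "token=" + unquote(encoded)

# In a query string a bare "+" means a space, so the token is now corrupted.
seen_from_crawler = parse_qs(replayed_query)["token"][0]

print(encoded)            # a%2Bb%2Fc
print(seen_from_user)     # a+b/c
print(seen_from_crawler)  # a b/c
```

The replayed request therefore carries a different token than the one the user received, which is why it can hit unexpected code paths.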




> 1. they are crawling potentially sensitive information granted by links with tokens

Don't put Facebook tracking on sensitive pages. Actually as a service to your users don't put it anywhere where it doesn't add value.

> 2. they are triggering potentially harmful and/or confusing actions in your website by repeating links

They only perform idempotent[0]* requests, which should not have any negative effect if performed multiple times.

0: http://restcookbook.com/HTTP%20Methods/idempotency/

* They probably only actually perform GET in reality


> Actually as a service to your users don't put it anywhere where it doesn't add value.

So don't put it anywhere.


I believe the tracking stuff comes with Like buttons and other Facebook widgets; those are what I am referring to when I say "add value". But it could be argued that the tracking alone never adds value to the user.


It could also be argued that not even the like buttons add value


It could also be argued the exact opposite.


GET is only idempotent in theory. Way too many people abuse GET when creating websites.


Those people should suffer the consequences.

I'm not a fan of facebook in the slightest, but they are crawling websites they were essentially invited to.


>websites they were essentially invited to

Using an analytics pixel is _not_ an invitation to crawl a website.


No, merely posting it on a public server was the invitation


It is not. Leaving your door unlocked is not inviting everybody in to take your stuff. You might make it easier for them to break in, but it still is a break-in.

Making something available to the public is not the same as going to Google Webmaster Tools and telling them to index your page.


You can argue it's abuse or illegal or fraud or whatever you want, but here's the thing: how are you going to stop them? Sure, maybe you stop Facebook with a lawsuit... but everyone else is still doing it, even people outside of your legal jurisdiction. They're still going to do it, so it's up to you to stop them with your design. If someone breaks into your website and destroys a user's data or steals their credit card, that user is not going to want to hear "but what they did was against our ToS!"

This isn't your house where there are police patrolling and ready to respond at a moment's notice when they're called. This is the Internet, accessible by almost literally everyone on the planet, and they don't give a shit about your policy. That's why best practices and application security were invented. So use them.


"Hello, I am a HTTP client, can I have /some/super/secret/page?" "200 OK, here it is"

That's your server complying with the request. Whether by intent or by oversight, doesn't matter: the client comes and asks, and your server can refuse. If it complies, well, you told it to. Whether you have merely exposed the page to the public or also shouted its URL from the rooftops, that's completely irrelevant. If it's not supposed to be public, don't make it public.
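
The "your server can refuse" part could be sketched roughly like this in Python. The function name and the user-agent substrings are assumptions for illustration, so check them against your own logs. The user-agent test is only a courtesy filter, since that header can be absent or spoofed; the authentication check is what actually protects the page.

```python
import re
from typing import Optional

# Assumed user-agent substrings for Facebook's crawlers; verify against your logs.
BLOCKED_AGENTS = re.compile(r"facebookexternalhit|facebot", re.IGNORECASE)


def status_for(user_agent: Optional[str], authenticated: bool) -> int:
    """Return the HTTP status a protected page would answer with."""
    if user_agent and BLOCKED_AGENTS.search(user_agent):
        return 403  # known crawler: refuse outright
    if not authenticated:
        return 401  # everyone else must authenticate first
    return 200  # authenticated, non-crawler request


print(status_for("facebookexternalhit/1.1", authenticated=False))
print(status_for(None, authenticated=False))
print(status_for("Mozilla/5.0", authenticated=True))
```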

"Hello, I am a HTTP client, can I have /some/super/secret/page?" "Oh, but you are ^User-agent$=.acebook ? Nope, 403 Forbidden, no data for you." (Or, more generally, "And who are you? 401 Authorize!" - or any other sort of mandatory access control)


Someone viewing a webpage you put online is not at all like someone stealing something you own.


That's like saying "having a public website is an invitation to DoS attacks".

There are conventions and reasonable expectations. Until now I did not expect that a tracking pixel would be the basis for crawling; so far most crawlers tend to crawl what's publicly linked, not what's potentially publicly reachable if one knows every URL there is.


Posting a file to a public web server is an implicit invitation for clients (human or automated) to download that file. That's why "secret urls" are universally considered to provide very little security.

There are common conventions (not always followed) around robots.txt and what files to crawl, but I'm not aware of any rules or conventions or standards around URL discovery. Plenty of crawlers attempt to crawl every registered domain name, for example.

"DOS Attack" is sort of a loaded term since it implies malice. Clearly running a web server doesn't mean you invite malicious attacks (though perhaps you should expect them). Some people consider Googlebot to be a DOS attack since it can easily bring poorly designed sites to their knees.


I watch my site's Google index and I can tell you 100% I never gave Google explicit permission to crawl 90% of the pages that show up there.


I believe that is the point that @K0nserv is making. The 'abuse' part I mean.


As an example of the problem, see the "Issues and criticism" of the related topic "Link prefetching" in Wikipedia - https://en.wikipedia.org/wiki/Link_prefetching#Issues_and_cr... .


Almost all of those problems only apply to in-browser prefetchers, which reuse the user's session and connection.


Point taken. Thanks for the correction.


Even according to that document, idempotent methods still can update resources as long as the representation doesn't change:

> Again, this only applies to the result, not the resource itself. This still can be manipulated (like an update-timestamp, provided this information is not shared in the (current) resource representation.

This means that tracking could still potentially affect some stuff, but honestly not by much.


I fixed it:

Don't put tracking on sensitive pages.


If security of sensitive information depends on tokens in the URL, don't just give those URLs to a third party, how would that ever be a reasonable thing to do? (especially since the third party apparently hasn't given you any guarantees on how they treat that, otherwise we wouldn't be having this conversation?)

Do your users, your broken software and yourself a favor and don't put Facebook tracking crap everywhere.


That's because our industry is full of hopelessly under qualified people who somehow manage to create software by randomly throwing together bits of code from stackoverflow / books and tweaking it until it "works".


That's true. But also trackers (especially big ones) deliberately try to make it as easy as possible to accidentally include them in your page, e.g. if you use their CDN for fonts/style sheets, or if you include an FB like button, etc.

A friend of mine covers this more extensively in this blog post which I found a very interesting read: https://remusao.github.io/posts/static-comments.html

So I think it's still concerning that once they're in, they start crawling. Although not very surprising I have to say... That's their business after all.


Yeah, of course they'll try that. They're businesses, they make their money by tracking people. When I inherit a team or project, these CDN links are the second thing to get removed / fixed (after the inevitable unencrypted passwords / homerolled security and homerolled SQL).


I feel many "web agencies" don't even have devs (or any technical people), they just sell WordPress installations.


You mean "hopelessly", right?


Yup, thanks :-)


I don't mind Facebook crawling pages as long as it respects robots.txt, but for the last few weeks we've been hammered by requests from Facebook-owned IP addresses (millions of hits daily, 50+ for the same URL at times). They don't even set the User-Agent header.

There's a bug report regarding the missing header here: https://developers.facebook.com/bugs/1654459311255613/

Unfortunately it seems impossible to get in touch with Facebook devs directly.


Send a cease and desist to the CTO. Wait 30 days, then sue under the CFAA. LinkedIn did it


No, LinkedIn was sued by the recipient of the letter.


what site do you own (if you can tell)?


Not sure I can say.

On a positive note, it's given us an opportunity to focus on performance improvements :)


I assume the crawler only does HEAD/GET requests. Your fault if your webpage changes anything based on a GET.

Now, if the crawler doesn't honor robots.txt, then you can complain (loudly).


> they are triggering potentially harmful and/or confusing actions in your website by repeating links

Not their fault. GET requests should not modify anything.


This is a great example of outrage by someone who doesn't understand how the web works. Unfortunately this is a problem with lots of web developers, but the author shouldn't take it personally and should try to learn from it. I can understand if they don't, though, because some of the replies here are a little harsh.

The summary of what most people are saying, including some takeaways:

- If you put something on the Internet it is public. Period. It is up to you to keep prying eyes away from that page. You can do that with strong mechanisms (like passwords and firewalls) or weak (like robots.txt) but you need to do something. You can't expect a page on the Internet to be private.

- Requests should never ever have anything sensitive in the query string. The query string is inherently logged. By your browser history, your web server, any tracking pixels like Facebook you put on the page, etc. If you absolutely must include a token in the URL (like with OAuth) make sure it is a temporary token and is immediately replaced with something more durable like a cookie or local storage, no unnecessary HTML is rendered, and the user is redirected to a new page that doesn't have it in the URL.

- GET requests should be idempotent. They should avoid changing any data as much as possible and should not have side effects. This is specified directly in the HTTP spec.

- If your page displays sensitive data, it should expect the security tokens in a header field (like cookies or authentication). Users who hit the page without that header field should get a 404.

- Your point #3 is an odd one. It is a bug on the Facebook side, yes, but it doesn't support your primary argument. In fact, if they fixed that bug it would make the perceived issues in your primary argument worse.

- Re #4 they don't need to warn you. See the first bullet. If it is on the internet it is public. Skype, Slack, Twitter, Google, all do the same thing.
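
The temporary-token advice in the list above might look roughly like the following Python sketch. All names (TokenStore, handle_landing, the /account path) are hypothetical, and the in-memory dict stands in for whatever storage a real site would use. The idea is that the URL token works exactly once, is swapped for a cookie, and the user is redirected to a URL without the token, so a crawler replaying the recorded URL only gets a 404.

```python
import secrets


class TokenStore:
    """In-memory store of single-use URL tokens (illustration only)."""

    def __init__(self):
        self._pending = {}  # token -> user id

    def issue(self, user_id):
        token = secrets.token_urlsafe(32)
        self._pending[token] = user_id
        return token

    def redeem(self, token):
        # pop() makes the token single-use: a second attempt finds nothing.
        return self._pending.pop(token, None)


def handle_landing(store, token):
    """Return (status, headers) for a GET request carrying a URL token."""
    user_id = store.redeem(token)
    if user_id is None:
        return 404, {}  # unknown or already used token
    session_id = secrets.token_urlsafe(32)
    headers = {
        # Redirect to a URL that no longer contains the token.
        "Location": "/account",
        "Set-Cookie": f"session={session_id}; HttpOnly; Secure; SameSite=Lax",
    }
    return 303, headers


store = TokenStore()
token = store.issue("user-1")
first_status, first_headers = handle_landing(store, token)
second_status, _ = handle_landing(store, token)
print(first_status, second_status)  # 303 404
```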


Isn't it obvious? For which reason, if not tracking and information gathering, would such a feature even exist?

Best solution is still to block Facebook's infrastructures, as always.


I disagree, it's not at all obvious. Pixels are used for conversion tracking (cookie pixels/cookieless pixels). Crawl isn't necessary for a pixel to function.


If you're trying to transmit trusted information in clientside js, then one common pattern is to have the user's browser fire the data initially, then crawl to obtain a trusted copy. The last company I worked for did this, but skipped the second step, which led to all kinds of XSS attacks.


<rant> Shocking!

Abuse of power and shady tracking techniques by Facebook? Unheard of! </rant>

Seriously, this cannot be surprising after learning that the Messenger app listens to everything you do, all the time. That's just off the top of my head. They are doing this and much more.


On iOS this is simply not possible without the user being explicitly notified.

Can you provide some evidence of this happening on Android ?

Also Facebook categorically denies this: http://www.bbc.com/news/technology-41776215


I've heard a lot of anecdotal reports of the messenger app listening to what you do. So much so that I've uninstalled the app.

However, I've never seen a non-anecdotal source or even a source that gathers all anecdotes and gives a decent meta-analysis. Would you happen to have one?


The Reply All podcast did a decent discussion of the topic.

The messenger app probably doesn’t listen to you, but it’s abusive in other ways and shouldn’t be installed. The main creepy feature of Facebook apps is that they continuously track your location. That’s a source of much of the creepy targeting that people notice.

Also, Facebook’s weasel-worded response to the issue implies that they do not use pervasive audio targeting, but buy data from people who do. The Wired article that claimed that using audio was impractical is nonsense.


Please get your facts straight before posting here.

https://www.wired.com/story/facebooks-listening-smartphone-m...


What facts? This article is full of inaccurate speculation but there are no facts in it. For example:

>To make it happen, Facebook would need to record everything your phone hears while it's on.

This assumption treated as an axiom here is false, and it makes the rest of the article inaccurate as well.


I've read the article before, and it's not convincing.

I'm not one for conspiracy theories, but the article lists some forms of how it might work and reasons how it would be unfeasible. There are other ways this could work, IMHO.


How is this instance an abuse of power or shady?

I would be surprised if a service didn't index my site after I put one of its pixels on it.


A while ago, while looking at the Apache logs, I noticed that the AdWords remarketing pixel does the same: it was trying to crawl private URLs that are only accessible to 'admins' and are not linked publicly. I'm not sure if this is still valid, as I blocked it using robots.txt.

Also, the same crawler ignores the "User-agent: *" directive in the robots.txt file and you have to add specific rules for it: "User-agent: Adsbot-Google"
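
A robots.txt with such an explicit group could look like the one embedded in this Python sketch, which checks the rules with the standard library's parser. The /admin/ path is a made-up example. Note that Python's parser would also apply the "*" group to AdsBot-Google if the explicit group were missing, so this only verifies that the file says what you intend, not how the real crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: the wildcard group plus an explicit AdsBot-Google group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: AdsBot-Google
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

admin_allowed = parser.can_fetch("AdsBot-Google", "/admin/users")
public_allowed = parser.can_fetch("AdsBot-Google", "/products")

print(admin_allowed)   # False
print(public_allowed)  # True
```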


> So yesterday we figured out that Facebook's Facebot crawler will crawl _every_ URL that was recorded by their tracking pixel.

Not surprising at all. Would be interesting to see a write up on this.


> we figured out that Facebook's Facebot crawler will crawl _every_ URL that was recorded by their tracking pixel.

I would be more surprised to find out that they didn't crawl everything they can, specifically pages that invite them in.

> 1. they are crawling potentially sensitive information granted by links with tokens

If the page contains sensitive information you absolutely should not have code that you do not control (any code loaded from third party hosts, not just facebook's bits).

As a matter of security due diligence, if you have third party hosted code linked into any such pages you should remove it with some urgency and carefully review the design decisions that led to the situation. If you really must have the third party code in that area then you'll need to find a way of removing the need for the tokens being present.

Furthermore, if the information is sensitive to a particular user then your session management should not permit a request from facebook (or any other entity that has not correctly followed your authentication procedure) to see the content anyway.

> 2. they are triggering potentially harmful and/or confusing actions in your website by repeating links

Possibly true, but again that suggests a design flaw in the page in question. I assume that they are not sending POST or PUT requests? GET and HEAD requests should at very least be idempotent (so repeated calls are not a problem) and ideally lack any lasting side effect (with the exception of logging).

> 3. they are repeating requests in a broken way by not encoding url-parameters correctly

That does sound like a flaw, but one that your code should be immune to being broken by. Inputs should always be verified and action not taken unless they are valid. This is standard practise for good security and stability. The Internet is a public place, the public includes both deliberately nasty people and damagingly stupid ones so your code needs to take proper measures to not allow malformed inputs to cause problems.

You can't use "the page isn't normally linked from other sources so won't normally be found by a crawler" as a valid mitigation because the page could potentially be found by a malicious entity via URL fuzzing.
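
As a rough illustration of "verify inputs before acting", here is a small Python sketch. The token format (20 to 64 URL-safe characters) and the function name are assumptions, not anything from the original post. A token mangled in transit, for example with a "+" turned into a space, fails the format check and no action is taken.

```python
import hmac
import re

# Assumed token format: 20 to 64 URL-safe characters, nothing else.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9_-]{20,64}")


def token_is_acceptable(received: str, expected: str) -> bool:
    """Reject malformed input first, then compare in constant time."""
    if not TOKEN_PATTERN.fullmatch(received):
        return False
    return hmac.compare_digest(received.encode(), expected.encode())


expected_token = "AAAAAAAAAA_-bbbbbbbbbb"
print(token_is_acceptable(expected_token, expected_token))
print(token_is_acceptable("AAAAAAAAAA bbbbbbbbbbb", expected_token))
```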

> 4. I could not find a warning or note on their tracking-pixel documentation that pages tracked would be crawled later

A warning would be nice, but again, unless they explicitly say they won't do such things, I would be surprised to find that they didn't, not that they do.


Does it crawl URLs blocked by robots.txt? I doubt it. If you don't want a well-behaved crawler to crawl your site, there's your answer. But not all are well behaved...


It is the fucking internet, if you put something on there you should expect someone to find it, be it a crawler or an attacker.

> 1. they are crawling potentially sensitive information granted by links with tokens

If tokens in GET params are your security concept: please leave the entire field.

> 2. they are triggering potentially harmful and/or confusing actions in your website by repeating links

So you built something that can be triggered by a simple HTTP request and may have a harmful potential? Wow.

> 3. they are repeating requests in a broken way by not encoding url-parameters correctly

You are kidding right? That's a problem to you? Either your Webserver drops these or your routes don't match, end of story.

> 4. I could not find a warning or note on their tracking-pixel documentation that pages tracked would be crawled later

Not a problem, you put it on the web and it will be crawled. Did you ever use Chrome? They report every URL you type to the Google Crawler. Read that anywhere lately?


Everything you said was technically correct, yet the message will probably be lost due to the manner in which you decided to deliver it.


You're correct, but there's no need to be a dick about it


Can we make a minor exception for this case? Please? Let's trust the OP has a good sense of humor and can interpret critique apersonally.


While I certainly don't disagree with what you said, I think you need to look at his arguments as a way to protect user data. Not all users that use your "mediocre" technical solution are aware of how "mediocre" it is. And if tokens are sent with GET requests or whatever stupid thing.


> Not a problem, you put it on the web and it will be crawled. Did you ever use Chrome? They report every URL you type to the Google Crawler. Read that anywhere lately?

Do you have a source for this? I Googled (!) and found this: https://www.stonetemple.com/google-chrome-discover-pages , which implies the opposite.

I don't use Chrome personally, but I do occasionally dump [none-too critical] preview files on open but otherwise 'hidden' urls on a domain for clients to view. I just find it easier for clients to deal with than inevitably lost passwords, etc, and tend to ask them to let me know when they're done so I can delete the folder.

I'd be interested to know whether their likely use of Chrome means that Google has a pattern of understanding of my domain space!


to clarify:

- marketing wants some tracking, some developer adds it

- ecommerce websites in the real world tend to "need" these tracking/conversion codes

- you do have legitimate get-requests like password-reset links with tokens, also we do use payment providers who send the customers back to us with get links which include payment tokens, newsletter-unsubscribe links are also often simple token links

- and yes, normally a get-request should not change anything (at least not when it's just repeated), but the sheer fact that they have access to it _and_ are crawling it is bad

my point being that I find it concerning that they would just crawl everything they recorded instead of just crawling pages which are linked publicly or which are targeted in ad-campaigns, combined with the fact that they don't warn you about it


> my point being that I find it concerning that they would just crawl everything they recorded instead of just crawling pages which are linked publicly or which are targeted in ad-campaigns

There's no way to know which pages are linked publicly without crawling every page for links. So you're right back at square one.

Ultimately, if it's on an Internet-facing web server and not hidden behind an IP whitelist or secure login function, then you have to assume it is public. All you are arguing about is different degrees of "public", which somewhat misses the real issue of website security.

Some crawlers do deliberately hit random URLs to check how you're handling 404s. Other crawlers are entirely dishonest and will try to find content that wasn't intended to be made public. How are you going to handle them if you're stumped by the Facebook crawlers that you invited onto your site?

> ...combined with the fact that they don't warn you about it

It's pretty obvious behavior in my opinion but maybe they could have been more explicit. However going back to my previous point, no other crawler advertises what it's going to crawl beforehand. So where do you draw the line? Ranting that Google indexed your site? What about visitors buying stuff on your ecommerce package without prior communication requesting access to the site?

You wouldn't ask customers in a bricks-and-mortar store to state their intentions the moment they walked through the shop door so why should every HTTP user agent have to do the same? While web security can be both complex and maddening, responsibility of hardening the site is still yours; not Facebook's.


Stuff like this currently exists in the real world. Therefore, I can understand the complaints of the OP.


Highly likely that they are feeding all the data into a deep network for an ad recommendation engine.


Sorry, if you are using their tracking pixel then you deserve no sympathy for the consequences.


Is this the same crawler they use for external links? If yes, you can exclude it; its User-agent id is "facebookexternal".


Does it take notice of robots.txt?


This is borderline criminal. Practically a CSRF attack.


You might be able to argue that, though you are arguing against accepted practise (are you wanting to ban all web crawling?).

While two wrongs don't make a right, assuming we accept that Facebook is wrong in this instance, which I don't think I do, the code for the page handing out sensitive information to an unauthenticated request or taking action based on malformed inputs is negligent.

"Information wants to be free" is not just a hippie ideal it is a technical warning. Unless you take proper measures to control and protect sensitive data it will find a way out.


No it's not. It's commonplace for other websites to crawl you.

Just add a robots file or block the user agent with your firewall.


This sounds a little extreme at first, but I actually totally agree. It's in murky waters when it comes to GDPR, for starters.

Where do they draw the line? Why not run a keylogger through embedded like buttons and widgets? That sounds worse, but isn't all that much worse.


> It's in murky waters when it comes to GDPR, for starters.

I'm not sure about Facebook's side, but from the point of view of how GDPR applies to the side being crawled: if they, as custodians of PII and other sensitive data, are handing it out to unauthenticated requests, they might be liable for punishment for lack of due diligence.


I agree with this. The website author is potentially liable for providing inadequate protections to the user's PII. I don't see anything that would implicate Facebook here.

Although, there is an interesting side effect that applies to all crawlers in that website owners failing to protect their customer PII like this means that crawlers inadvertently gather and store personal data as a side effect. I can't help but wonder if there is some liability there and if there is if something like AI or pattern matching can help to scrub the info before it is stored.


facebook might have an issue with having collected the data too, of course, but the source site definitely should be taking appropriate measures to avoid handing it out in the first place.



