
A Facebook crawler was making 7M requests per day to my stupid website - napolux
https://coding.napolux.com/a-facebook-crawler-was-making-7m-requests-per-day-to-my-stupid-website/
======
lun4r
We've had the same issue. They were doing huge bursts of tens of thousands of
requests in very short time several times a day. The bots didn't identify as
FB (used "spoofed" UAs) but were all coming from FB owned netblocks. I've
contacted FB about it, but they couldn't figure out why this was happening and
didn't solve the problem. I found out that there is an option in the FB
Catalog manager that lets FB auto-remove items from the catalog when their
destination page is gone. Disabling this option solved the issue.

~~~
stiray
I don't really understand what the issue is. On my welcome page (while all
other URLs are impossible to guess) I give the browser something that requires
a few seconds of CPU at 100% to crunch. And I track some user actions in
between, tarpit certain URLs, etc. In the last few years no bot has come through.
Why bother with robots.txt? Just give them something to break their teeth on...

(I would give you the URL, but I just don't want it to be visited)

~~~
cblades
I don't think most users would appreciate having a site spike their CPU for a
few seconds when they visit...at least I wouldn't.

------
leafo
Same thing has happened to me:
[https://twitter.com/moonscript/status/1124888489298808834](https://twitter.com/moonscript/status/1124888489298808834)

The network address range falls under Facebook's ownership, so I don't think
it's someone spoofing. I do think it's very possible someone found a way to
trigger crawl requests in large quantities. Alternatively, I would not be
surprised if it's just a bug on Facebook's end.

~~~
_jal
They've done this before to me, too. First I tried `iptables -j DROP`, which
made the machine somewhat usable, but didn't help with the traffic. After
trying a few things, I tried `-j TARPIT`, and that appeared to make them back
off.

Of course, sample size of 1, etc. It could have been coincidental.

~~~
hinkley
Tarpits are an underappreciated solution to a pool of bad actors.

You can add artificial wait times to responses, or you can just route all of
the 'bad' traffic to one machine, which becomes oversubscribed (be sure to
segregate your stats!). All bad actors fighting over the same scraps creates
proportional backpressure. Just adding 2 second delays to each request won't
necessarily achieve that if multiple user agents are hitting you at once.
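A minimal sketch of that backpressure idea (all names hypothetical, not any particular framework's API): instead of a fixed per-request delay, each request stuck in the pit raises the delay for everything currently in it, so concurrent bad actors slow each other down.

```python
import asyncio
import time


class Tarpit:
    """Shared tarpit: delay grows with current occupancy, so concurrent
    bad actors create backpressure for each other."""

    def __init__(self, base_delay: float = 1.0):
        self.base_delay = base_delay
        self.in_pit = 0

    async def serve(self) -> None:
        self.in_pit += 1
        try:
            # The more requests already stuck here, the longer this one waits.
            await asyncio.sleep(self.base_delay * self.in_pit)
        finally:
            self.in_pit -= 1


async def demo() -> float:
    pit = Tarpit(base_delay=0.01)
    start = time.monotonic()
    # 10 concurrent "bad" requests: the last one in waits ~10x base_delay.
    await asyncio.gather(*(pit.serve() for _ in range(10)))
    return time.monotonic() - start


elapsed = asyncio.run(demo())
```

With a single client the penalty stays near `base_delay`; it only ramps up when many abusive requests pile in at once, which is the point hinkley makes about fixed 2-second delays not creating real backpressure.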

~~~
apocalyptic0n3
I never looked into the TARPIT option in iptables before reading your comment.
That seems really useful. I've been dealing with on and off bursts of traffic
from a single AWS region for the last month. They usually keep going for about
90 minutes every day, regardless of how many IPs I block, and consume every
available resource with about 250 requests per second (not a big server and
I'm still waiting for approval to just outright block the AWS region). I'm
going to try a tarpit next time rather than a DROP and see if it makes a
difference.

~~~
rocho
Be careful, as tarpitting connections can consume your resources faster than
those of the attacker.

~~~
hinkley
Most spiders limit the number of requests per domain, so if it's stupidity and
not malice, you probably don't have a runaway situation.

... unless you're hosting a lot of websites for people in a particular
industry. In which case the bot will just start making requests to three other
websites you're also responsible for.

Then if you use a tarpit _machine_ instead of routing tricks, the resource
pool is bounded by the capacity of that single machine. If you have 20 other
machines that's just the Bad Bot Tax and you should pay it with a clean
conscience and go solve problems your human customers actually care about.

------
omershapira
I once made a PHP page which fetches a random wikipedia article, but changes
the title to "What the fuck is [TOPIC]?". It was incredibly funny, until I
started getting angry emails from people, mostly threatening a form of libel
lawsuit.

Turns out, since it was a front for all of Wikipedia[1], Google was
aggressively indexing it, but the results rarely made it to the first search
page. And since this isn't exactly an important site, old results would stick
around.

Hence a pattern:

1. Some Rando creates a page about themselves

2. Wikipedia editors, being holy and good, extinguish that nonsense

1.5. GOOGLE INDEXES IT

3. Rando, by nature of being a rando, googles themselves, doesn't find their
Wikipedia page anymore (that's gone), but does find a link to my site on the
first page of Google results for their name.

4. Lawyers get involved, somehow

Details: [http://omershapira.com/blog/2013/03/randomfax-
net/](http://omershapira.com/blog/2013/03/randomfax-net/)

[1] Hebrew wikipedia. I can't imagine what would've happened on the English
version.

~~~
smileybarry
I understand why people got angry, though. The title was roughly "What is this
'<article name>' shit?"[1], with "shit" in this Hebrew context also able to be
interpreted as calling the subject "shit".

Saying "מה זה לעזאזל X?"[2] is closer to "what the fuck is X?" (except it's
more "what the hell" than "what the fuck").

[1] I wasn't sure how to portray this to non-Hebrew speakers but,
surprisingly, Google Translate actually nailed it:
[https://translate.google.com/#view=home&op=translate&sl=auto...](https://translate.google.com/#view=home&op=translate&sl=auto&tl=en&text=%D7%9E%D7%94%20%D7%96%D7%94%20%D7%94%D7%97%D7%A8%D7%90%20%D7%94%D7%96%D7%94%20%22%D7%90%D7%A4%D7%9C%22%3F)

[2] Google Translate got this example right, too:
[https://translate.google.com/#view=home&op=translate&sl=auto...](https://translate.google.com/#view=home&op=translate&sl=auto&tl=en&text=%D7%9E%D7%94%20%D7%96%D7%94%20%D7%9C%D7%A2%D7%96%D7%90%D7%96%D7%9C%20%22%D7%90%D7%A4%D7%9C%22%3F)

------
bberenberg
Doesn't ignoring headers and taking someones site down fall afoul of the CFAA?
Especially given how comments here are showing that this is a recurring issue?
Depending on which side of the fence you're on, there could be standing to go
after FB for this either for money, or for their lawyers to help set a
precedent to limit CFAA further.

~~~
yupyup54133
When someone does this to Facebook it's malicious and they go to jail. When
Facebook does it to someone else... "oops".

~~~
occamrazor
From the point of view of the law, the intent (malicious or benevolent) is
often as important as the action itself.

~~~
yupyup54133
Nah, it only matters whether you are a large corporation or not. If you are a
large corporation you will ruin your opponent financially with legal fees
(whether they are guilty or not). If you are a small-timer going up against a
large corporation, they will delay the court case until you can no longer
afford the legal fees.

------
jokoon
That's 81 requests per second on average.

Shouldn't anybody doing such thing be liable, and be sued for negligence and
required to pay damages?

That sounds like a lot of bandwidth (and server stress).

~~~
jrockway
Shouldn't any public facing website have a rate limit? If someone attempts to
circumvent simple rate limits (randomizing the source IP or header content),
then that could demonstrate intent to cause damage, and you'd have a better
case. But if you don't set a limit, how can you be mad that someone exceeded
it?

(I know they're ignoring robots.txt, but robots.txt is not a law. And, it
doesn't apply to user-generated requests, for things like "link unfurling" in
things like Slack. I am guessing the crawler ignores robots.txt because it is
doing the request on behalf of a human user, not to create some sort of index.
Google is attempting to standardize this widely-understood convention:
[https://developers.google.com/search/reference/robots_txt](https://developers.google.com/search/reference/robots_txt))
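The rate limit jrockway describes is often implemented as a token bucket; here is a small sketch (names and numbers illustrative, not any specific server's API) that returns `429 Too Many Requests` once a client exceeds its budget:

```python
import time


class TokenBucket:
    """Allow `rate` requests/second with bursts of up to `burst` requests."""

    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle(bucket: TokenBucket) -> int:
    # One bucket per client (IP, API key, ...); over budget -> HTTP 429.
    return 200 if bucket.allow() else 429


bucket = TokenBucket(rate=1.0, burst=3)
statuses = [handle(bucket) for _ in range(5)]
```

In practice you would key buckets by client IP or network block; a well-behaved crawler backs off when it sees 429s, which is exactly the signal the limit exists to send.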

~~~
sammitch
Plenty of crawlers, including Facebook's, are operating out of huge pools of
IPs usually smeared across multiple blocks. If you have an idea on how to
ratelimit a crawler doing 2 req/s from 300+ distinct IPs with random spoofed
UAs let me know when your startup launches.

~~~
jrockway
There are many features that you can pick up on. TCP fingerprints, header
ordering, etc. You build a reputation for each IP address or block, then block
the anomalous requests.

If you don't want to do this yourself and you're already using Cloudflare...
congratulations, this is exactly why they exist. Write them a check every
month for less than one hour of engineering time, and their knowledge is your
knowledge. Your startup will launch on time!
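One of the cheaper signals mentioned above, header ordering, can be hashed into a crude client fingerprint. A toy sketch (everything here is illustrative, not a production fingerprinting scheme):

```python
import hashlib


def header_fingerprint(headers: list[tuple[str, str]]) -> str:
    """Hash the order and names of request headers (not their values).

    Real browsers emit headers in a stable order; many bots that spoof a
    browser UA don't reproduce it, so requests can be grouped by the
    client implementation that actually sent them.
    """
    names = ",".join(name.lower() for name, _ in headers)
    return hashlib.sha256(names.encode()).hexdigest()[:16]


browser_like = [("Host", "example.com"), ("User-Agent", "X"), ("Accept", "*/*")]
bot_like = [("User-Agent", "X"), ("Host", "example.com"), ("Accept", "*/*")]

fp_a = header_fingerprint(browser_like)
fp_b = header_fingerprint(bot_like)
```

Same headers, different order, different fingerprint; combined with TCP-level features this is the kind of reputation signal CDNs build at scale.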

------
rcfox
Did anyone notice the branded IP address? 2a03:2880:20ff:d::face:b00c

"face:b00c"

~~~
cranekam
FB has 2a00::/12 (and probably other blocks?) and can make something that
looks like its company name from [0-9a-f]. Why wouldn't it do something
harmless and fun like this? It's not as if it requires special dispensation or
is breaking any rules.

~~~
detaro
A single company does not get a /12 prefix. 2a00::/12 is almost half of the
space currently allocated to all of RIPE NCC. Facebook seems to have
2a03:2880::/29 out of that /12, and a /40 through ARIN (2620:0:1c00::/40)

~~~
vvG94KbDUtRa
ugh why is ipv6 impossible to understand :/

~~~
detaro
Is the notation the problem here? IPs have 16 bytes now, so the notation is a
bit more compact: hex instead of decimal numbers, and :: is a shortcut meaning
"replace with as many zeros as necessary to fill the address to 16 bytes"
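Python's standard `ipaddress` module can expand the shorthand; for example, with the address from this thread:

```python
import ipaddress

addr = ipaddress.ip_address("2a03:2880:20ff:d::face:b00c")
# `::` expands to however many zero groups are needed to reach 8 groups,
# and each group is padded to 4 hex digits.
expanded = addr.exploded
```

Here the six explicit groups leave room for two all-zero groups, so `::` stands for `0000:0000`.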

------
lun4r
Related thread at SO: [https://stackoverflow.com/questions/49577546/facebook-
crawle...](https://stackoverflow.com/questions/49577546/facebook-crawler-is-
hitting-my-server-hard-and-ignoring-directives-accessing-sa)

~~~
lun4r
And FB bug report:
[https://developers.facebook.com/support/bugs/189402442061080...](https://developers.facebook.com/support/bugs/1894024420610804/)

~~~
nickysielicki
It bugs me that they require a login to see a bug report, especially in the
case of this bug where those affected aren't necessarily facebook users.

------
TheChaplain
Reading through the article and comments here I find it odd that a company
like FB can't get a crawler to respect robots.txt or 429 status codes.

Even I would stop and think "maybe I should put in some guards against large
amounts of traffic" when writing a crawler, and I'm certainly not one of those
brilliant minds who manage to pass their interview process.

~~~
callalex
If the dev respected others’ boundaries and norms of expected behavior, they
would probably work somewhere else.

------
danschumann
It's an interesting thing to think about, that someone can make a request to
your website and you pay for it.

This is not how the mail works, where someone needs to buy a stamp to spam
you.

~~~
ortusdux
My office still gets spam faxes to this day. The paper and toner only add up
to a few cents a month, so it's not worth doing anything about.

I knew a realtor that had a sheet of black paper with a few choice expletives
written on it that they would send back to spammers. There was an art to
taping it into a loop so it would continuously feed. This was a few decades
ago when the spam faxes could cost more than a stamp.

~~~
mulmen
My desk phone at an old job used to get dialed by a fax machine. Not fun
picking that up. I redirected it to a virtual fax line and it turns out it was
a local clinic faxing medical records. I faxed them back with some message
about you have the wrong number but they never stopped.

~~~
fennecfoxen
If you really wanted to make them stop, you might find a contact for their
lawyer and make HIPAA noises at them ;)

~~~
mulmen
I considered it. That was after my time in hospital IT. Really I wanted to
stop getting my eardrums blown out by a robot. Luckily I happened to be
working at a phone company so changing my number just took a couple clicks.

------
avian
This reminds me of a case years ago when I had a small-ish video file on my
server that would cause Google Bot to just keep fetching it in a loop. It
wasted a huge amount of bandwidth for me since I only noticed it in a report
at the end of the month. I had no way of figuring out what was wrong with it,
so I just deleted the file and the bot went away.

------
spiderfarmer
Was it a real FB crawler or a random buggy homemade bot masquerading as one?

Check here:
[https://developers.facebook.com/docs/sharing/webmasters/craw...](https://developers.facebook.com/docs/sharing/webmasters/crawler/)

~~~
hinkley
Someone up to no good using a false flag would not surprise me in the
slightest.

"who do people already like to hate? I'll pretend to be them."

~~~
napolux
Doesn't seem to be the case [https://apps.db.ripe.net/db-web-
ui/query?searchtext=2a03:288...](https://apps.db.ripe.net/db-web-
ui/query?searchtext=2a03:2880:20ff:d::face:b00c)

~~~
Wingy
IP packets have a source field, which can be fake and not their actual IP.
That's a Facebook IP, but the packet might not have actually come from it.

~~~
detaro
If they made actual requests that went through, a connection got established.
That won't happen with a faked source.

~~~
Wingy
Ah okay, thanks for helping me learn :)

------
crazygringo
If you can verify it's actually coming from Facebook's IP address (as many
other comments here suggest), you should absolutely get in touch with them,
though I'm not sure how. Perhaps their security team. That's a very serious
bug that they'd be glad to catch -- if it's affecting you it's surely
affecting others.

Otherwise it's a bot/malware/etc. spoofing Facebook and gone wrong, which
sucks. And yeah just block it by UA, and hopefully eventually it goes away.

~~~
mikece
From the sounds of it the difference between Facebook's actual crawler and
malware could be hard to define or differentiate, given the circumstances.

------
hodgesrm
This is one of the better HN threads in a while. Starting from an easily
described problem, it goes into a number of interesting directions including
how FB seems to DDOS certain sites for reasons unknown, defenses against such
attacks and DDOS in general, GPL licensing, how to collapse HN conversations,
etc.

Not a profound comment. But it was really fun to follow the path down the
various rabbit holes. This is why I like HN.

------
sradman
Interesting. According to the Facebook Crawler link provided by the OP [1], it
makes a range request of compressed resources. I wonder if the response from
the OP’s PHP/SQLite app isn’t triggering a crawler bug.

Maybe Cloudflare can step up and act as an intermediary to isolate the
problem. Isolating DoS attacks is one of their comparative advantages.

[1]
[https://developers.facebook.com/docs/sharing/webmasters/craw...](https://developers.facebook.com/docs/sharing/webmasters/crawler/)

------
ablanco
It would be nice to understand how you came to the conclusion that this was a
Facebook bot.

~~~
akerro
You can check list of IP addresses?

~~~
huac
the OP did not (explicitly) check, but you can check if the IP falls into a
range allocated to Facebook e.g.
[https://ipinfo.io/AS32934](https://ipinfo.io/AS32934)

~~~
napolux
Of course I've checked. Here is a recent example.
[https://news.ycombinator.com/item?id=23491455](https://news.ycombinator.com/item?id=23491455)

------
lclarkmichalek
Make sure you file a bug - there are a myriad of sources internally, but we
can often hunt it down easily enough (assuming it gets triaged to eng).
Important info is the host of the URLs being crawled and the User-Agent
attached to the requests (full headers are good too). A time-series graph of
the hits (with date & timezone specified) can also help.

~~~
falcolas
Why is this the owner's problem? Someone at Facebook should be filing the bug,
or better yet instrumenting their systems so that incidents like this issue a
wake-up-an-engineer alert.

Fuck this culture of "it's up to the victim of our fuckup to file a bug report
with us".

~~~
mulmen
Not just “file a bug report” but also compile a time series graph (lol) and
then pray that Facebook triages it correctly, which they have no incentive to
do.

~~~
rurp
And to make things even more ridiculous, you need to sign in with a facebook
account to even file a bug there.

------
namanaggarwal
Did you check the IP and not just the UA? Does it match the one from FB?

~~~
napolux
Yes, one random ip from cloudflare:
[https://www.ultratools.com/tools/ipv6InfoResult?ipAddress=2a...](https://www.ultratools.com/tools/ipv6InfoResult?ipAddress=2a03%3A2880%3A20ff%3A1b%3A%3Aface%3Ab00c&as_sfid=AAAAAAVa_685ehdliHL-
fKlh3v8lQwcfqf1MvjqW2qgkQyyv8g2-z-m-XBtLtF3u6thsALQbH7xaIC7fHiD2oxyY4P-1HH-
WSJe-3D4Cxyl64YVHDn9XR25Hm8GScudbWCNEWUw%3D&as_fid=4cf20d4e5d982453717d1efee86a730cba0705ee)

------
snorrah
Surprise, Facebook not caring about playing nice with others!

Anyway, what reason would they have for so many requests? Why would they need
to do 300 per second?

------
nraynaud
We had that with Google in the past. We put a KML database on a web server and
made the Google Maps API read the files. Google fetched them without any
throttling (my guess is that they primed their edge servers straight from us);
the server went down so fast we thought there had been a hardware failure. We
ended up putting a CDN in front just for Google's servers.

------
amelius
What would happen if you created a "rabbithole" for webcrawlers? Would they
fall for it?

~~~
sgift
Modern web crawlers usually have mechanisms against rabbit holes. For Apache
Nutch, e.g., you can configure it to only follow links to a certain depth from
a root, to only follow a redirect a few times, and so on. It's always a
trade-off though: if you cut off too aggressively you don't get everything, if
you cut off too late you waste resources.
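The depth cutoff described above can be sketched generically (this is the idea, not Nutch's actual configuration):

```python
from collections import deque


def crawl(start, get_links, max_depth=3):
    """Breadth-first crawl that stops following links past max_depth,
    a simple guard against 'rabbit hole' link traps."""
    seen, queue = {start}, deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # fetch the page, but don't follow its links further
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# An infinite "rabbit hole": every page links one level deeper, forever.
links = lambda url: [url + "/a"]
order = crawl("root", links, max_depth=2)
```

With `max_depth=2` the crawler visits only three pages of an otherwise infinite chain, which is the trade-off sgift mentions: a low cutoff wastes nothing but may miss content.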

------
staycoolboy
I'm new to web admin in the cloud and my website is getting hammered by
Baidu, Google, FB, and others, causing traffic I/O costs to increase.

What's an AWS Load Balancer way of blocking this traffic? Again, noob here.
Thanks.

~~~
namibj
Don't host this stuff on AWS if you care about cost.

~~~
staycoolboy
So you don't know how to do this in AWS is what I'm hearing?

~~~
namibj
My point is that you don't want to do this inside of a load balancer there. I
don't recall any traffic-filtering abilities that would suffice, but I'm not
fully up-to-date with the configurability either. If the load balancer
supports it, a short search of the net or docs should surface an easily
applicable guide; if not, I'd probably put that blocking closer to my app
server.

And the reasons against AWS for this are the general cost (AWS is not
cost-efficient in many cases, unless you have complicated infrastructure that
takes a lot of management, where the infrastructure-as-code approach can give
you a sizable benefit), the bandwidth cost in particular, and the lack of
configurability of their services to e.g. apply suitable tarpitting against
such crawlers.

------
forgotmypw17
Maybe someone at Facebook is testing a project that didn't work as expected?

~~~
hinkley
I wish we had better tools for things like this.

What I've heard happen is someone puts a governor on the system, and much
later your new big customer starts having errors because your aggregate
traffic exceeds capacity planning you did 2 years ago and forgot about.

------
pmlnr
Are you sure it's a crawler and not a proxy of some sort? E.g. one of your
links is on something high-traffic on Facebook, and all the requests are
human, running through FB machines.

~~~
napolux
Nah, it's a specific user-agent for their crawler

------
dan15
Facebook caches URLs pretty aggressively (often you need to explicitly purge
the cache to pick up updates) so I don't quite understand exactly what
happened here. The article is very light on details. Do you have 7 million
unique URLs that have been shared on Facebook? Or is it the same URL being
scraped over and over?

Are you correctly sending caching headers?

------
itsjloh
When I enabled IPv6 traffic at home and logged default denies I constantly see
Facebook owned IPv6 blocks trying to reach addresses browsing Facebook.

I have no idea what its trying to do but its legitimately the only IPv6
inbound traffic that isn't related to normal browsing.

------
zippy786
There was a very old post:
[https://news.ycombinator.com/item?id=7649025](https://news.ycombinator.com/item?id=7649025),
lots of ways to abuse Facebook network.

------
hank_z
I’m a bit confused. Facebook is a social media company; why do they send out
crawlers?

~~~
thenickdude
Facebook shows previews of pretty much everything you share a link to, so at a
minimum they have to fetch the page/image to build those previews.

~~~
hank_z
I see, thanks for the explanation.

------
_drimzy
Why would a crawler want to crawl the same page every second (let alone 71
times a second)? Even if I were Google, wouldn't crawling any page every few
minutes (or even every minute) be more than enough for freshness?

------
Too
> _I own a little website I use for some SEO experiments [...] can generate
> thousands of different pages._

Do you have more details on this? One could read that as if the page is
designed to attract crawlers.

------
cascader
Is it worth duplicating the fb specific parts of your website at another web
address as a test site, so that If it was similarly affected, you could
provide the address info to help troubleshoot?

------
pier25
Shouldn't Cloudflare have considered this a DDoS attack?

------
mesozoic
Detect the crawler and delay the responses as long as you can. Have it use as
many resources as you can; that should be interesting.

------
alpb
Hate to say but this is probably a FB engineer running a test/experiment and
not a production crawler taking robots.txt etc into account.

~~~
mcguire
Or malware infecting some FB-internal machine?

~~~
x0
Doubt it. If you're a bad enough dude to get control over Facebook's internal
infrastructure, I doubt you'd blow that access by using it to spam random
sites.

------
tlrobinson
> And then they’ll probably ask you to pay the bill

Can anyone comment on whether there would be any legal basis for this?

------
JoshMcguigan
Thanks for sharing. I’d like to hear more about the website described here, it
sounds very interesting.

~~~
napolux
I can add I work in SEO (tech side), so the website is "super optimized", but
it's not rocket science.

~~~
Accacin
Super optimized? The site takes longer to load than Hacker News for me.

~~~
dewey
They explicitly didn't mention the site this affected. It's not the one linked
in the OP.

~~~
Accacin
Super optimized with Facebook button on each post then. Not my definition of
'super optimized', but you know.

~~~
dewey
Super optimized in a SEO sense, not in a pleasing the HN crowd kind of sense
I'd assume ;)

~~~
Accacin
You're right, but for some reason the whole SEO thing just winds me up. It's
my opinion that 'good' SEO makes sites worse for actual people to use.

~~~
napolux
SEO is for machines, not for users IMHO.

~~~
dewey
Still affects users.

Potential Positive: page speed, https

Negative: All blog posts with the same length, keywords dropped in every
paragraph

------
kuu
I wonder what the crawler does that requires so many requests per day to the
same site...

------
drewmol
Title change? A stupid Facebook crawler was making 7M requests per day to my
website

~~~
napolux
Nice one. ;)

------
keevitaja
We had a similar experience with Yandex and had to block it. The crawler
started crawling a larger e-shop by trying out every single possible filter
combination. In other words, we got a DDoS attack from Yandex.

------
shadykiller
Then your website isn't stupid but rather the FB crawler

------
accurrent
Why does facebook have crawlers in the first place?

------
tejas5
Do you have some sort of fb plugin in your SEO site?

------
magwa101
Their 2FA page was broken yesterday...it goes on

------
historyremade
Someone could use this for DDOS or Coinhive

------
laurentdc
Unrelated, but has anyone written a Chrome/Firefox extension to browse the web
sending out Googlebot or Facebook user agent?

I wonder if you can bypass paywalls or see things that aren't generally
presented to regular users

~~~
spiderfarmer
Safari has this built in: [https://osxdaily.com/2013/01/16/change-user-agent-
chrome-saf...](https://osxdaily.com/2013/01/16/change-user-agent-chrome-
safari-firefox/)

~~~
monkpit
You chose to mention Safari, but link to an article about Chrome, Firefox, and
Safari? Looks like all major browsers (probably Edge too) have this built in.

------
chirau
I think sending back a `429 Too Many Requests` can solve this. Unless you want
to block their IPs completely, which I doubt you'd want to do.

~~~
jandrese
A well made crawler that would support 429 codes isn't likely to have this
problem in the first place.

------
tejas5
Did you reach out to the fb contact?

------
chrismarlow9
Bit rot? A single bit flipped anywhere in a DNS request could slam someone

------
liveoneggs
check these hits for an X-Purpose header or similar

------
6510
How do you monetize a robot?

~~~
thephyber
Establish contractual damages in a ToS for the site. Prove violation and
offender. Take to court and collect damages.

Converting the effort into cash is tough, but the strategy exists.

Project HoneyPot is an API which allows any website to do this for honeypot
email addresses which are injected into the website, along with a ToS which says:

> By continuing to access the Website, You acknowledge and agree that each
> email address the Website contains has a value not less than US $50 derived
> from their relative secrecy.[1]

[1]
[https://www.projecthoneypot.org/terms_of_use.php](https://www.projecthoneypot.org/terms_of_use.php)

~~~
the_jeremy
Has that ever worked? I can't find any record of judgments one way or the
other, on their website or elsewhere.

~~~
thephyber
I know ProjectHoneyPot was pushing a $x billion litigation against a spammer.
I don't remember if/how that was resolved.

This guy[1] apparently spent years just suing email spammers and occasionally
winning.

[1]
[https://www.danhatesspam.com/index.html](https://www.danhatesspam.com/index.html)

~~~
the_jeremy
I also found the $1 billion lawsuit (against a bunch, not just one spammer, I
believe), and could also not find any sort of resolution - not in legal docs,
news pages, or project honeypot itself.

------
ruuda
Hi Napolux,

It looks like your site is using a theme based on my website
([https://ruudvanasseldonk.com/](https://ruudvanasseldonk.com/), source at
[https://github.com/ruuda/blog](https://github.com/ruuda/blog)). That is fine
— it is open source after all, licensed under the GPLv3. But I can’t find the
source code for your site, and I can’t find any prominent notices saying that
you modified my source. Could you please add those?

~~~
RhodesianHunter
Despite how it sounds, I ask this with zero judgment and pure curiosity.

Why do you care?

~~~
timy2shoes
Please read [https://sfconservancy.org/copyleft-
compliance/principles.htm...](https://sfconservancy.org/copyleft-
compliance/principles.html). My understanding is that if you do not enforce
your copyright (or copyleft in this case) you can lose the copyright.

~~~
0x44
Whilst you can lose a trademark for not enforcing it, you cannot lose a
copyright by not enforcing it (in the United States).

------
shultays
Isn't this what robots.txt is for? Can't you block Facebook?

~~~
detaro
As the article clearly states: no

~~~
shultays
whoops, my laziness. sorry

------
dylan604
Why does FB need a crawler? Their users provide the content for their site. Is
there an FB web search engine?

~~~
youeseh
When you paste in a URL to share it with your friends, Facebook tries to grab
some information from that webpage to provide a summary.

~~~
toomuchtodo
What happens if that link performs an action upon a GET request?

Edit: Folks, I agree with you all, but I've seen a lot of garbage out there.
Just asking the question for the discussion.

~~~
dragonwriter
Then the action needs to be harmless, because GET is defined as not merely
idempotent but also safe.

Didn't we already learn this lesson after all the unsafe-GET problems unveiled
when prefetching browser accelerators came on the scene in, IIRC, the late
1990s?

~~~
thaumasiotes
> Then the action needs to be harmless to repeat, because GET is defined as
> idempotent.

No, this is a terrible response. The action needs to be harmless to execute
_every time_ , not just every time after the first time.

HTTP DELETE is conceptually idempotent, but you don't want to be deleting
stuff with GET requests. That's why the standard provides a DELETE method! The
distinction that really matters is safe/unsafe, not idempotent/unique.

(Do you need to use DELETE for deleting stuff? No, POST is fine.)
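The safe-vs-unsafe distinction can be enforced mechanically in a router: refuse to run any state-changing action reached via a safe method. A sketch with hypothetical names:

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def dispatch(method: str, action, mutates: bool) -> int:
    """Run `action` only if the method is allowed to cause side effects.

    Crawlers and prefetchers will issue any GET they find, so a
    state-changing action reachable via GET is a bug waiting to happen.
    """
    if mutates and method in SAFE_METHODS:
        return 405  # Method Not Allowed: require POST/DELETE instead
    action()
    return 200


log = []
get_result = dispatch("GET", lambda: log.append("deleted"), mutates=True)
post_result = dispatch("POST", lambda: log.append("deleted"), mutates=True)
```

The GET attempt is rejected before the action runs; only the POST actually mutates state, which is exactly the guard that would have prevented the 1990s prefetcher incidents mentioned above.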

~~~
bkanber
> No, this is a terrible response.

As an aside, this kind of hyperbole really gets under my skin. It wasn't a
terrible response. That statement is already _technically correct_: GET _is_
idempotent, and the definition of idempotency is that it is harmless to
repeat.

Your gripe is that OP didn't mention that GET is not only idempotent _but must
also be "safe"_; i.e. that it should not alter the resource. OP got it 50%
correct.

Does that omission make his comment a "terrible response"? No -- just
incomplete.

~~~
thaumasiotes
Yes, it was a terrible response. Here are some examples of idempotent
requests:

- Change the email address registered to my account from owner@gmail.com to
new_owner@136.com.

- Instead of sending my direct deposit to account XXXX XXXX at Bank of
America, from now on, send it to account YYYY YYYY at Wells Fargo.

- Delete my account.

- Drop the database.

None of these have any business being available to GET requests. Objecting to
a misconfigured endpoint on the grounds that the functionality it implements
is not idempotent implies that the lack of idempotence _is what was wrong_.
That's a bad thing to do - anyone who takes your lesson to heart is still
going to screw themselves over, because you gave them terrible advice. They
may do it more than they otherwise would have, because you gave them advice
that directly endorses really bad ideas. Idempotence or the lack thereof is
_beside the point_.

Messing up on endpoint idempotence means you might hurt the feelings of a
document. Messing up on endpoint _safety_ means you might lose all your data
as soon as anyone else links to your homepage. Or worse.

------
president
I just thought of a malicious idea. If I were Amazon or some other cloud
provider, I could hammer my customers' web sites with tons of requests to make
them pay more for resource usage (network bandwidth, s3 calls, misc per/unit
usage etc). It would be hard to trace as well. Wonder if people are already
doing that today.

~~~
eloff
The amount of bad will that would generate when, not if, it came to light
would dwarf the possible gains. So ethical concerns aside, it would be
idiotic.

Sometimes that doesn't stop the pointy haired bosses, because sometimes they
are the perfect storm of unethical and moronic.

------
k2xl
This is actually a common tactic of malicious actors: pretend to be other
bots to get through to websites. If you reverse-lookup the IP address you can
see whether it is part of the Facebook network.

The UA is very untrustworthy
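That ownership check can be done against Facebook-announced address blocks (AS32934); the ranges below are the ones mentioned elsewhere in this thread, not an exhaustive or current list:

```python
import ipaddress

# Example Facebook-announced blocks (from AS32934; illustrative only).
FACEBOOK_NETS = [
    ipaddress.ip_network("2a03:2880::/29"),
    ipaddress.ip_network("2620:0:1c00::/40"),
]


def is_facebook_ip(ip: str) -> bool:
    """True if the address falls inside one of the known Facebook blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in FACEBOOK_NETS)


result = is_facebook_ip("2a03:2880:20ff:d::face:b00c")
```

In production you would refresh the range list from the AS announcement data rather than hard-coding it, since blocks get added and withdrawn over time.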

~~~
napolux
This is one of the IPs

[https://www.ultratools.com/tools/ipv6InfoResult?ipAddress=2a...](https://www.ultratools.com/tools/ipv6InfoResult?ipAddress=2a03%3A2880%3A20ff%3A1b%3A%3Aface%3Ab00c&as_sfid=AAAAAAVa_685ehdliHL-
fKlh3v8lQwcfqf1MvjqW2qgkQyyv8g2-z-m-XBtLtF3u6thsALQbH7xaIC7fHiD2oxyY4P-1HH-
WSJe-3D4Cxyl64YVHDn9XR25Hm8GScudbWCNEWUw%3D&as_fid=4cf20d4e5d982453717d1efee86a730cba0705ee)

