
You can HTTP GET tweets again by changing your user agent to Googlebot.

curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://google.com/bot.html)" "https://twitter.com/zarfeblong/status/1339742840142872577"

Peak SEO when users are faced with more friction than Googlebots and crawlers.




In 2020, the only way for netizens to get what they naturally deserve is by hacking.


The original web browser, NCSA Mosaic, encouraged users to change their User-Agent string, so-called "spoofing" or "masquerading".

https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...

The User-Agent header is not mandatory and was never intended to be used by tech companies for denying access or fingerprinting. It was supposed to be used, at the user's discretion, to help with interoperability problems. RFC7231 specifically refers to user-agent masquerading by the user as a useful practice. It explicitly discourages using this header as a means of supposed user identification, e.g., fingerprinting.

https://tools.ietf.org/html/rfc7231#section-5.5.3


Setting your user agent would only be considered hacking by the same people who think the Internet is a series of pipes. The browsers themselves copy each other's user agents for interoperability, so it's far past the point that changing it to look like another agent would be considered devious.


Yeah, but from the POV of whoever runs the network, circumventing such blocks is "abuse"


How do you “naturally deserve” to access the contents of the Twitter website?


Since it became a dissemination service for public officials. The moment it became illegal for the US President to block people on Twitter, it should have become illegal for Twitter to restrict the public's access to that information, for the same reason.


I agree, that's just cruel. No one deserves being subjected to Twitter.


Given that this trick is spreading across several sites now, it won't last long. Google could, for example, generate secret unique user agents for the biggest players, who would then only allow requests from that secret UA.


I think Google shares IP range blocks so you could implement a check like "if(isGooglebot(user_agent) && isGooglebotIp(ip_addr))" in your system.

Edit: ah no, they stopped https://developers.google.com/search/docs/advanced/crawling/.... I don't think 2 DNS lookups are an acceptable cost for gating a GET request, but it can be done out of band, i.e. the isGooglebotIp function fires off a Redis query and, if nothing is found, puts the IP into a DNS-verify queue. A few requests later, the fake bot gets banned thanks to the new record in Redis.
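For reference, the check Google documents is a forward-confirmed reverse DNS lookup: PTR the IP, check the name ends in googlebot.com or google.com, then resolve that name forward and confirm it maps back. A rough shell sketch of the out-of-band idea, something a worker consuming the verify queue could run (assumes dig and redis-cli are available; the key name and TTL are made up):

verify_googlebot_ip() {
  ip="$1"
  ptr=$(dig +short -x "$ip")                    # reverse (PTR) lookup, note the trailing dot
  case "$ptr" in
    *.googlebot.com.|*.google.com.)
      fwd=$(dig +short "$ptr" | head -n1)       # forward-confirm the name
      if [ "$fwd" = "$ip" ]; then
        redis-cli SET "googlebot:$ip" 1 EX 86400 >/dev/null   # cache the positive result
        return 0
      fi ;;
  esac
  redis-cli SET "googlebot:$ip" 0 EX 86400 >/dev/null         # cache the negative result
  return 1
}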


No need to use a Googlebot UA string. Others will work, such as:

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0
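
For example, plugging that string into the curl command from the top comment (same tweet URL, untested here):

curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0" "https://twitter.com/zarfeblong/status/1339742840142872577"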


No need for a secret unique UA, just PTR the IP and check the host.

This method is already used by stackoverflow to hide the sitemap.

https://meta.stackexchange.com/a/37272/158100
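
e.g. by hand, the IP here being just an illustrative one from Google's published crawl range:

dig +short -x 66.249.66.1                      # expect something like crawl-66-249-66-1.googlebot.com.
dig +short crawl-66-249-66-1.googlebot.com.    # should resolve back to 66.249.66.1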


Things that would be inadvisable while most of these companies are actively facing antitrust lawsuits.


It already happens. Try to create your own search engine and index Amazon: captchas everywhere, and their robots.txt is just for show.


This trick is almost older than the internet, so if someone cared they would have blocked it already. Sending google.com as the Referer is another variant of it. Before the Stack Overflow days this was very useful for getting past the paywall on Experts Exchange, for example.

I was under the impression that serving different content to Google would hurt your PageRank and could even pull you off the search results completely.


So much for Google embracing the open web.


When Twitter announced they were going to stop supporting browsers not on their approved list, I figured their blocking would involve something more than just checking the value of the User-Agent header.

They should just announce that users must use a particular user-agent header value and provide a list of approved values. If no one else compiles a list of acceptable user-agent header values for Twitter, I might have to do it.

Every user should just use the same user-agent header value. That would negate any utility of the user-agent header.


It’s been received wisdom until now that Google penalizes websites which behave differently when scraped by the Googlebot. Is that no longer the case?


Pinterest has proven, by spamming the SERPs for years, that if you're big enough Google will turn a blind eye.


That applies to more than just google

If you are big enough, there are separate rules for you (or no rules)


They serve the same content to users with JS enabled and to legit Googlebots, while blocking clients without JS and other bots. I don't think it violates Google's rules, but it's of course of questionable decency.


And Google can verify by crawling the site with and without JavaScript.

Also, Google has a commercial relationship with them: https://www.convinceandconvert.com/social-media-research/twi...


You just need the word "Bot" in your user agent. It's required for fetching Twitter Cards for link previews too. This changed earlier this year.
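
If that claim holds, something as simple as this should work (UA invented for illustration, URL from the top comment):

curl -A "ExampleLinkPreviewBot/1.0" "https://twitter.com/zarfeblong/status/1339742840142872577"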



