
You can HTTP GET tweets again by changing your user agent to Googlebot.

curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://google.com/bot.html)" "https://twitter.com/zarfeblong/status/1339742840142872577"

Peak SEO when users are faced with more friction than Googlebots and crawlers.




In 2020, the only way for netizens to get what they naturally deserve is by hacking.


The original web browser, NCSA Mosaic, encouraged users to change their User-Agent string, so-called "spoofing" or "masquerading".

https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...

The User-Agent header is not mandatory and was never intended to be used by tech companies for denying access or fingerprinting. It was supposed to be used, at the user's discretion, to help with interoperability problems. RFC7231 specifically refers to user-agent masquerading by the user as a useful practice. It explicitly discourages using this header as a means of supposed user identification, e.g., fingerprinting.

https://tools.ietf.org/html/rfc7231#section-5.5.3


Setting your user agent would only be considered hacking by the same people who think the Internet is a series of pipes. The browsers themselves copy each other's user agents for interoperability, so it's far past the point that changing it to look like another agent would be considered devious.


Yeah, but from the POV of whoever runs the network, circumventing such blocks is "abuse"


How do you “naturally deserve” to access the contents of the Twitter website?


Since it became a dissemination service for public officials. The moment it became illegal for the US President to block people on Twitter, it should have become illegal for Twitter to restrict the public's access to that information, for the same reason.


I agree, that's just cruel. No one deserves being subjected to Twitter.


Given that this trick is spreading across several sites now, it won't last long. Google could, for example, generate secret unique user agents for the biggest players, who would then only allow requests from that secret UA.


I think Google shares IP range blocks so you could implement a check like "if(isGooglebot(user_agent) && isGooglebotIp(ip_addr))" in your system.

Edit: ah no, they stopped https://developers.google.com/search/docs/advanced/crawling/.... I don't think 2 DNS lookups are an acceptable cost for gating a GET request, but it can be done out of band, i.e. the isGooglebotIp function fires off a Redis query and, if nothing is found, puts the IP into a DNS-verify queue. A few requests later, the fake bot gets banned thanks to the new record in Redis.
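For reference, the check Google documents is a forward-confirmed reverse DNS lookup: PTR the IP, check the name ends in googlebot.com or google.com, then resolve that name forward and confirm it maps back. A rough shell sketch of the out-of-band idea, something a worker consuming the verify queue could run (assumes dig and redis-cli are available; the key name and TTL are made up):

verify_googlebot_ip() {
  ip="$1"
  ptr=$(dig +short -x "$ip")                    # reverse (PTR) lookup, note the trailing dot
  case "$ptr" in
    *.googlebot.com.|*.google.com.)
      fwd=$(dig +short "$ptr" | head -n1)       # forward-confirm the name
      if [ "$fwd" = "$ip" ]; then
        redis-cli SET "googlebot:$ip" 1 EX 86400 >/dev/null   # cache the positive result
        return 0
      fi ;;
  esac
  redis-cli SET "googlebot:$ip" 0 EX 86400 >/dev/null         # cache the negative result
  return 1
}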


No need to use a Googlebot UA string. Others will work, such as:

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0
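
For example, plugging that string into the curl command from the top comment (same tweet URL, untested here):

curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0" "https://twitter.com/zarfeblong/status/1339742840142872577"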


No need for a secret unique UA, just PTR the IP and check the host.

This method is already used by stackoverflow to hide the sitemap.

https://meta.stackexchange.com/a/37272/158100
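
e.g. by hand, the IP here being just an illustrative one from Google's published crawl range:

dig +short -x 66.249.66.1                      # expect something like crawl-66-249-66-1.googlebot.com.
dig +short crawl-66-249-66-1.googlebot.com.    # should resolve back to 66.249.66.1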


Things that would be inadvisable while most of these companies are actively facing antitrust lawsuits.


It already happens. Try to create your own search engine and index Amazon: captchas everywhere, and their robots.txt is just for show.


This trick is almost older than the internet, so if someone cared they would have blocked it already. Sending google.com as the Referer is another variant of it. Before the Stack Overflow days this was very useful for getting past the paywall on Experts Exchange, for example.

I was under the impression that serving different content to Google would hurt your PageRank and could even pull you off the search results completely.


So much for Google embracing the open web.


When Twitter announced they were going to stop supporting browsers not on their approved list, I figured their blocking would involve something more than just checking the value of the User-Agent header.

They should just announce that users must use a particular user-agent header value and provide a list of approved values. If no one else compiles a list of acceptable user-agent header values for Twitter, I might have to do it.

Every user should just use the same user-agent header value. That would negate any utility of the user-agent header.


It’s been received wisdom until now that Google penalizes websites which behave differently when scraped by the Googlebot. Is that no longer the case?


Pinterest has proven, by spamming the SERPs for years, that if you're big enough Google will turn a blind eye.


That applies to more than just google

If you are big enough, there are separate rules for you (or no rules)


They serve the same content to users with JS enabled and to legit Googlebots, while blocking clients without JS and other bots. I don't think it violates Google's rules, but it's of course of questionable decency.


And Google can verify by crawling the site with and without JavaScript.

Also, Google has a commercial relationship with them: https://www.convinceandconvert.com/social-media-research/twi...


You just need the word "Bot" in your user agent. It's required for fetching Twitter Cards for link previews too. This changed earlier this year.
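
If that claim holds, something as simple as this should work (UA invented for illustration, URL from the top comment):

curl -A "ExampleLinkPreviewBot/1.0" "https://twitter.com/zarfeblong/status/1339742840142872577"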



