I guess the business model is to inject their ads into someone else's content, so kinda like Facebook. That would also surely generate more money from the ads than the cost of subscribing to multiple newspapers.
I would expect to see login information rather than "Sign In" and "Subscribe" buttons on archived articles then. Unless they're stripping that from the archive?
I wouldn't be surprised. IIRC, the whole thing is privately funded by one individual, who must have a lot of money to spare.
They can somehow even bypass the Baltimore Sun's paywall, and I doubt they have subscriptions to every regional paper. Or do they?
So I tried convincing them that if it’s okay here on PG’s creation, then it should be okay on his other creation.
I thought knowledge was free, and the Baltimore Sun sucked anyway. They charge money and don't even write good stuff anymore. They laid off a bunch of people and moved printing to Delaware. My bet is the next step is announcing they're shutting down all Locust Point operations and selling out so Kevin Plank can build some new buildings there.
I think I had to appeal my ban with a mod, and they mentioned that the automod posts all over the sub that sharing links to websites that bypass paywalls is against their subreddit rules :(
I even tried an official proposal to r/Baltimore to reconsider and lift that rule. The general consensus in the poll was that the Baltimore Sun and its writers should be getting paid for their work, and that I shouldn't be bypassing their paywalls lol.
I still can't find anything in the subreddit rules that clearly says this. (Not that most people read the rules first.) Why don't they just add it to the rules?
This is one of the things I dislike most about Reddit: it seems common to ban people on a single violation of a poorly documented or outright unstated rule.
My main problem with reading the Sun online is it has so much adware that my browser slows to a crawl and sometimes crashes when I try to read it!
This is a plausible explanation but is it true?
Just in the past day or so, sfgate.com put in some kind of anti-scraping measure, so urllib, curl, lynx, etc. all fail with 403 now. Maybe I'll take on the bigger, slower hassle of reading the site with Selenium, or maybe I'll just give up on newspapers and get my news from HN ;).
I wonder if archive.is has had its sfgate.com experience change. Just had to mention them to stay slightly on topic.
    $ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: curl/7.54.1' | head -1
    $ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0' | head -1
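If the block keys purely on the User-Agent header, the first command should presumably come back with something like HTTP/2 403 and the second with HTTP/2 200, consistent with the 403s described above. (I'm assuming those status lines; run the commands to check.)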
It might not be what's actually happening, since there are other ways around paywalls, but it's a real possibility for how it could be done (at least until websites that allow other user agents through decide they want to stop archive.is, etc.).
edit: the probability is high that they have multiple methods for archiving a website. Several people in this post mention that archive.is has previously stated it just converts the link to an AMP link and archives that. I doubt that's all they do, but it could be.
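For what it's worth, a rough sketch of that AMP trick in Python is below. The subdomain rewriting follows Google's documented AMP Cache (cdn.ampproject.org) scheme; whether archive.is actually fetches pages this way is purely my assumption.

    # Sketch: turn an article URL into its Google AMP Cache URL.
    # The subdomain rewriting follows Google's documented scheme; no claim
    # that this is what archive.is really does.
    from urllib.parse import urlparse

    def amp_cache_url(url: str) -> str:
        parsed = urlparse(url)
        # Per Google's scheme: "-" becomes "--", then "." becomes "-".
        subdomain = parsed.hostname.replace("-", "--").replace(".", "-")
        prefix = "s/" if parsed.scheme == "https" else ""  # "s/" marks an HTTPS origin
        return f"https://{subdomain}.cdn.ampproject.org/c/{prefix}{parsed.hostname}{parsed.path}"

    print(amp_cache_url("https://www.baltimoresun.com/example-article.html"))
    # https://www-baltimoresun-com.cdn.ampproject.org/c/s/www.baltimoresun.com/example-article.html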
Using the robots.txt file in this way might not be how the authors of the website intended it to be used, and I could see that being held against them in a legal system if someone ever tried to stop them. In the past I've seen websites tell people writing bots to deliberately set their user agent to one the site defines, but using that agent for a non-allowed purpose is what I was getting at. There are multiple ways they could be archiving a website, though, so this is not necessarily how it's being done.
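As a concrete illustration of reading robots.txt this way (my own sketch; the URL is a placeholder):

    # Sketch: print the User-agent groups a site declares in robots.txt,
    # which hints at which crawlers the site treats specially.
    import urllib.request

    with urllib.request.urlopen("https://news.example.com/robots.txt") as resp:
        for line in resp.read().decode("utf-8", "replace").splitlines():
            if line.lower().startswith(("user-agent:", "allow:", "disallow:")):
                print(line.strip())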
This is my best guess. I've put real thought into this, and it's the best logical assumption I've arrived at. I used to be a master of Selenium crawlers years ago, but that burned me out a bit, so I moved on.
To test my hypothesis, find any article on Google that you know is probably paywalled. Click the content Google shows you, navigate into the site, and bam! Paywall.
If it has a paywall for me, then how did Google crawl and index all the metadata for the SERP?
I have a long-running theory that archive.is knows how to exploit the same SEO trick Google uses to get the content: websites like the Baltimore Sun don't want humans to view their content for free, but they do want Googlebot to see it for free.
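If that theory is right, the core trick is as simple as presenting Googlebot's User-Agent. A minimal sketch, assuming a hypothetical paywalled URL (no claim this is archive.is's actual code):

    # Sketch: request the article while claiming to be Googlebot and see
    # whether the full text comes back.
    import urllib.request

    req = urllib.request.Request(
        "https://news.example.com/paywalled-article",  # hypothetical URL
        headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                               "+http://www.google.com/bot.html)"},
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", "replace")
    print(len(html), "bytes of HTML")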
archive.is does "something" to get the full page from sites that deliberately don't send all the content to non-logged-in user agents, and it's definitely different from / more complex than simply running NoScript.
In that case, the government is fine with it.
Also, the extension likely has terms of use prohibiting uploading copyrighted content, shifting liability onto users.
They went after seeders.
I think the distinction between the two is moot in this post. The question could very well have been "How does archive.org bypass paywalls?" Though it's interesting that archive.is seems to just crawl the URL. Indeed that means they wouldn't necessarily be able to bypass the paywall.
Theoretically, they could just publish the list of IP ranges that canonically "belong" to archive.is. That would let websites distinguish whether a request identifying itself as archive.is is actually from them (it falls in one of the ranges) or from a fraudster.
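A site could then verify a claimed archive.is visitor with a simple range check. Sketch below; the published range is made up, since archive.is publishes no such list as far as I know:

    # Sketch: check whether a visitor's IP falls inside a published range.
    import ipaddress

    PUBLISHED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]  # hypothetical

    def is_from_archive(ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in PUBLISHED_RANGES)

    print(is_from_archive("203.0.113.7"))   # True
    print(is_from_archive("198.51.100.9"))  # False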
AMP pages are paywalled:
archive.is isn't: https://archive.md/LaiOX
It might be using other techniques for bypassing paywalls as well, be it Referer or User-Agent spoofing (some old archives of sites that echo back HTTP request headers show archive.is sending a Referer of google.co.uk).
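Referer spoofing along those lines would look something like this (a sketch; the google.co.uk value is taken from those old header-echo archives, and the URL is a placeholder):

    # Sketch: pretend the visit came from Google search results. Some
    # paywalls historically let search-engine referrals through
    # ("first click free").
    import urllib.request

    req = urllib.request.Request(
        "https://news.example.com/paywalled-article",  # hypothetical URL
        headers={"Referer": "https://www.google.co.uk/"},
    )
    html = urllib.request.urlopen(req).read()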
There are readers who will see a paywall and pay, and readers who will try to bypass it or simply not read at all. Articles spread through social media attention, and a paywalled article gets much less of it, so it's non-negligibly beneficial to have people who would otherwise skip the article read it for free.
Conceivably one could build a system that accesses a given document once from a unique IP address and then caches the HTML version of the page for further serving.
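Something like this sketch, with the unique-IP part hand-waved away (all names here are mine):

    # Sketch of fetch-once-then-cache: each document is fetched a single
    # time, and every later reader gets the stored copy.
    import urllib.request

    _cache: dict[str, bytes] = {}

    def get_document(url: str) -> bytes:
        if url not in _cache:
            # In the real system this one request would go out from a
            # unique, not-yet-rate-limited IP address.
            _cache[url] = urllib.request.urlopen(url).read()
        return _cache[url]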
Google doesn't want everyone to know what a Google indexing request looks like, for fear the SEO mafia will institute shenanigans. And the content providers (NYT, WaPo, etc.) don't want people to know 'cause they don't want people evading their paywalls.
Or maybe they're okay with letting the archive index their content...
But it's kind of an open secret; you just have to look in the right place.
My guess is that most content providers with paywalls serve the entire content so search engines can pick it up, and then use scripts to raise the paywall; archive.is takes its snapshot before that happens / doesn't trigger those scripts.
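You can test that guess yourself: fetch the raw HTML without executing any JavaScript and see whether the article text is already present (a sketch; the URL and the naive string checks are mine):

    # Sketch: a plain HTTP fetch never runs the client-side script that
    # raises the paywall overlay, so if the full text is served it stays
    # visible in the raw HTML.
    import urllib.request

    html = urllib.request.urlopen("https://news.example.com/article").read()
    text = html.decode("utf-8", "replace")
    print("<article" in text, "paywall" in text.lower())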
But I don't know the answer either.
Google recommends using reverse DNS to verify whether a visitor claiming to be Googlebot is legitimate or not: https://developers.google.com/search/docs/crawling-indexing/...
You can also verify IP ownership using WHOIS, or by examining BGP routing tables to see which ASN is announcing the IP range. Google also publishes their IP address ranges here: https://www.gstatic.com/ipranges/goog.json
This is almost right. They recommend two steps:
1. Use reverse DNS to find the hostname the IP address claims to have. (The IP address block owner can put any hostname in here, even if they don't own/control the domain.)
2. Assuming the claimed hostname is on one of Google's domains, do a forward DNS lookup to verify that the original IP address is returned.
The second step is the important one.
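A minimal version of that two-step check, using the standard library's resolver (the sample IP is one I believe belongs to Googlebot):

    # Sketch of Google's documented two-step verification:
    # 1) reverse DNS on the caller's IP,
    # 2) forward DNS on the returned hostname, requiring the original IP
    #    to appear in the answers.
    import socket

    def is_real_googlebot(ip: str) -> bool:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)  # step 1
        except socket.herror:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # step 2
        return ip in forward_ips

    print(is_real_googlebot("66.249.66.1"))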
It's trivial to bypass most paywalls, isn't it?