Ask HN: How does archive.is bypass paywalls?
133 points by flerovium on May 24, 2023 | 102 comments
If it simply visits sites, it will face a paywall too. If it identifies itself as archive.is, then other people could identify themselves the same way.



Nice try, media company employee ;)

/jk


My sentiments exactly.


Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.
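
If that's the approach, the login automation is easy to sketch. Below is a rough Python sketch using Playwright instead of Puppeteer, just to show the shape of it; the URL, selectors, and environment variables are all made up:

    import os
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # All of these values are hypothetical; a real site has its own login form
        page.goto("https://newspaper.example.com/login")
        page.fill("#username", os.environ["NEWS_USER"])
        page.fill("#password", os.environ["NEWS_PASS"])
        page.click("button[type=submit]")
        # Once the session cookie is set, article pages render in full
        page.goto("https://newspaper.example.com/some-article")
        html = page.content()
        browser.close()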

I guess the business model is to inject their ads into someone else's content, so kinda like Facebook. That would also surely generate more money from the ads than the cost of subscribing to multiple newspapers.


> Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

I would expect to see login information rather than "Sign In" and "Subscribe" buttons on archived articles then. Unless they're stripping that from the archive?


Exactly. It also would not be difficult for website operators to embed hidden user info in their served pages, thereby finding out the archive.is account. This approach seems risky for archive.is.


They could just copy the div with the content over to evade detection by the website's owner.


> Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

I wouldn't be surprised. IIRC, the whole thing is privately funded by one individual, who must have a lot of money to spare.


I don't think anyone knows who runs archive.is. I've tried looking into it a couple of times in the past, but there is surprisingly little information to be found. It must cost thousands if not tens of thousands a month to host all that data, and AFAIK they do not monetize it in any way. From what I gather it is probably some Russian person, as some old Stack Overflow conversations about the site led to an empty GitHub account with a Russian name. Also, back in 2015 the site owner blocked all Finnish IP addresses due to "an incident at the border"[1]. Finnish IPs have since been unblocked. Apparently the owner thought he could end up on an EU-wide blacklist, which seemed like very conspiratorial thinking on his part.

1: https://archive.is/Pum1p


When I visit, each page has three ads: left, right, and bottom. Maybe you have an ad blocker?


Would it be possible to check if archive.is is logged into a newspaper site by archiving one of the user management pages?


Negative. I used to assume this as well, but they somehow also bypass local paywalls, which has gotten me temporarily banned from r/Baltimore lol.

They can somehow even bypass the Baltimore Sun's paywall, and I doubt they have subscriptions to every regional paper... or do they?


Wait, you got banned from /r/Baltimore for posting archive.is links there? That's against the rules there? I would not have known that myself! (Also a Baltimorean).


I even tried to get them into the mindset that Paul Graham created Hacker News to get more mindshare for YC. He gave the idea for Reddit to the three brilliant Ivy League founders who applied to YC with, I think, a basic Gmail extension that copied emails or something.

So I tried convincing them that if it’s okay here on PG’s creation, then it should be okay on his other creation.


Yeah! Hahah

I thought knowledge was free, and the Baltimore Sun sucked anyway. They charge money and don't even write hood stuff anymore. They laid off a bunch of people and moved printing to Delaware. My bet is the next step is that they announce they're shutting down all Locust Point operations and selling out so that Kevin Plank can build some new buildings there.

I think I had to appeal my ban with a mod, and they mentioned how it's posted all over by the auto bot that sharing links to websites that bypass paywalls is against their subreddit rules :(

I even made an official proposal to r/Baltimore to reconsider and lift that rule. The general consensus on the poll was that people felt the Baltimore Sun and its writers should be getting paid for their work, and I shouldn't be bypassing their paywalls lol.


You did it ONCE and got banned?

I still can't find anything in the subreddit rules that clearly says this. (Not that most people read the rules first). Why don't they just add it to the rules?

This is one of the things I dislike most about Reddit: it seems to be common to ban people for a single violation of a poorly documented, effectively unstated rule.

My main problem with reading the Sun online is it has so much adware that my browser slows to a crawl and sometimes crashes when I try to read it!


But is it true? What evidence is there?

This is a plausible explanation but is it true?


So scihub but for newspapers


Off topic but for years I've been using a one-off proxy to strip javascript and crap from my local newspaper site (sfgate.com). It just reads the site with python urllib.request and then does some DOM cleanup with beautiful soup. I wasn't doing any site crawling or exposing the proxy to 1000s of people or anything like that. It was just improving my own reading experience.
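
For anyone curious, a minimal sketch of that kind of cleanup proxy (assuming the page is plain server-rendered HTML; which tags are worth stripping is a judgment call):

    import urllib.request
    from bs4 import BeautifulSoup

    def fetch_clean(url):
        # Fetch the raw HTML (no JS execution, just the server response)
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        soup = BeautifulSoup(html, "html.parser")
        # Strip scripts, embeds and other junk before re-serving the page
        for tag in soup(["script", "iframe", "noscript"]):
            tag.decompose()
        return str(soup)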

Just in the past day or so, sfgate.com put in some kind of anti scraping stuff, so urllib, curl, lynx etc. now all fail with 403. Maybe I'll undertake the bigger and slower hassle of trying to read the site with selenium or maybe I'll just give up on newspapers and get my news from HN ;).

I wonder if archive.is has had its sfgate.com experience change. Just had to mention them to stay slightly on topic.


They are probably just checking headers such as user agent and cookies. Would copy whatever your normal browser sends and put it in the urllib.request. If that doesn’t work, then it is likely more sophisticated.
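
A rough example of that, assuming it really is just headers (the header values here are illustrative, copied-from-a-browser stand-ins):

    import urllib.request

    req = urllib.request.Request(
        "https://www.sfgate.com/",
        headers={
            # Values copied from a normal browser session; these are examples
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0",
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.5",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)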


I will try that, but a quick look at the error page makes me think it tries to run a javascript blob.


They're just checking the user agent

    $ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: curl/7.54.1' | head -1
    HTTP/2 403
    
    $ curl -s -I 'https://www.sfgate.com/' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0' | head -1
    HTTP/2 200
One "trick" is that Firefox (and I assume Chrome?) allow you to copy a request as curl - then you can just see if that works in the terminal, and if it does you can binary search for the required headers.


It probably does. But there are better modern tools like headless Chrome / Puppeteer that can fully render a page with scripts.


Sounds like an ADA lawsuit waiting to happen. I'd send the editor an email explaining how they've reduced usability of the site; especially if you're a paying customer.


I think they might just try all the user agents in the robots.txt. [1] I've included a picture showing an example. In this second image, [2] I receive the paywall with the user agent left as default. There might also just be an archival user agent that most websites accept, but I haven't looked into it very much.

[1] https://i.imgur.com/lyeRTKo.png

[2] https://i.imgur.com/IlBhObn.png
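
To make that idea concrete, here's a rough sketch that walks robots.txt and retries the article with each listed user agent. The paywall check is a crude, made-up heuristic, and which agents actually work depends entirely on the site:

    import re
    import urllib.error
    import urllib.request

    def candidate_agents(site):
        # Pull the User-agent lines out of robots.txt
        with urllib.request.urlopen(site + "/robots.txt") as resp:
            robots = resp.read().decode("utf-8", errors="replace")
        return re.findall(r"(?im)^user-agent:\s*(\S+)", robots)

    def try_agents(article_url, site):
        # Retry the article with each crawler identity the site itself names
        for agent in candidate_agents(site):
            if agent == "*":
                continue
            req = urllib.request.Request(article_url, headers={"User-Agent": agent})
            try:
                with urllib.request.urlopen(req) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except urllib.error.HTTPError:
                continue
            if "subscribe" not in html.lower():  # crude paywall heuristic
                return agent, html
        return None, None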


That user-agent seems to be in the robots.txt as _disallowed_, but somehow it gets through the paywall? That seems counter-intuitive.


It's just blocking the root. Look up the specifications for the robots.txt for more information. One purpose is to reduce loads on parts of the website that they do not want indexed.


Definitely incorrect: the paths in robots.txt are prefixes, so `/` means anything starting with `/`, that is, everything. Look up the specifications for robots.txt for more information! (Or, for instance, look up how you'd block the whole site in robots.txt if you wanted to!)


No, / means the entire site: the root and everything below it.


That's an interesting idea, but is it true?


Websites usually want their pages indexed by search engines, as it increases the traffic they receive. They also often try to allow archival use. The robots.txt usually has the user agents used by search engines defined, since one of its purposes is to reduce load on the website by keeping crawlers away from pages that do not need to be indexed.

It might not be what is actually happening, as there are other ways around paywalls, but this is a real possibility for how it could be done (at least until the websites allowing those user agents decide they want to try to stop archive.is usage, etc.).

edit: I think the probability is high that they have multiple methods for archiving a website. Many people in this post say archive.is has previously stated that it just converts the link to an AMP link and archives that. I doubt that's all they do, but it could be part of it.

Using the robots.txt file in this way might not be how the authors of the website intended it to be used, and I could see that maybe being used against them in a legal system if someone ever tried to stop them. In the past, I've seen websites tell people creating bots to set their user agent to a specific string the site defines, but using such a string for a non-allowed purpose is the kind of thing I was getting at. Still, there are multiple ways they could be archiving a website, so this is not necessarily how it is being done.


Just archived a website I created. It looks like it runs HTTP requests from a server to pull the HTML, JS, and image files (it shows the individual requests completing before the archival process is complete). It must then snapshot the rendered output and serve those assets from its own domain. Buttons on my site don't work after the snapshot, since the scripts were stripped.


You're missing the point of "how does it bypass paywalls".


Surprisingly, nobody has mentioned this here yet. I'm thinking the key to this is SEO, SERPs, and newspapers wanting Google to find and index their content.

This is my best guess. I've really put some thought into it, and it's the best logical assumption I've arrived at. I used to be a master of crawlers using Selenium years ago, but that burned me out a little bit so I moved on.

To test my hypothesis, go and find any article on Google that you know is probably paywalled. You click the result Google shows you, navigate into the site, and... bam, paywall!

If it has a paywall for me, well then how did Google crawl and index all the metadata for the SERP if it has a paywall?

I have a long running theory that Archive.is knows how to work around an SEO trick that Google uses to get the content. Websites like the Baltimore Sun don’t want humans to view their content for free, but they do want Googlebot to see it for free.


Sorry, thought it was obvious. Since it's using backend infrastructure to fetch the assets, it can crawl them as a bot in the same way that search engines do, without allowing cookies to be saved. Since scripts are often involved in the full rendering of a page, it clearly does allow for the scripts to load before snapshotting the DOM. But only the DOM and the assets and styles are preserved. Scripts are not. Most paywalls are simple scripts. If you disable JS and cookies, you'll often see the full text of an article.
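
One way to approximate that flow (not necessarily what archive.is actually runs) is a headless browser that lets scripts execute, snapshots the live DOM, and then strips the scripts out of the saved copy. A sketch, assuming Playwright and BeautifulSoup are installed:

    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright

    def snapshot(url):
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # Let the page (and its scripts) render, then grab the live DOM
            page.goto(url, wait_until="networkidle")
            html = page.content()
            browser.close()
        # Preserve markup, styles and asset references, but drop the scripts
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup("script"):
            tag.decompose()
        return str(soup)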


Some paywalls don't hide the content with JavaScript. It's just not there. They make you pay and then redirect you to another page.


I browse with scripts disabled by default, and while some paywalls rely on JS to block interaction after the page loads, many simply send only partial content and a login dialog.

archive.is does "something" to get the full page for sites that specifically do not send all the content to non-logged-in user agents, and it's definitely different / more complex than simply running noscript.


There are a lot of paywalls that are done server-side - for instance the Herald Sun, which is one of the biggest newspapers in Australia, does it like this. Even if you check the responses there's nothing in them but links to subscribe and a brief intro to the article.


paywalls*


I think it's a browser extension which people who have access to the article use to send the article data to the archive server.


You mean the pages are crowdsourced? I don’t think so because many pages are archived only upon request. If I ask to archive a new page, archive.is provides it very quickly. This is not possible if the archive is built from crowdsourced data.


That is how RECAP works ("Pacer" spelled backwards).

In that case, the government is fine with it.


I think that's how Sci-hub works, at least at some time in the past.


I thought people would send their journal credentials to Sci-hub


Can you explain? Who has purchased the subscription? I'm sure there's a no-redistribution clause in the subscription agreement.


The person who installed the browser extension would be paying the subscription and ignoring said clause.


Curious if companies will eventually start watermarking articles and catch and sue extension users.


I suspect most content publishers would go to the source. If there are people who are already willing to pay for subscriptions and ignore the terms of those subscriptions, it's not much of a stretch that they'll ignore the fact that they got their subscription cancelled once (or twice, or however many times). The publisher would more likely see results taking legal action against the archivist.


It didn't stop the RIAA from suing loads of people over downloading mp3s in the past 2 decades, claiming damages of thousands of dollars per song the individual downloaded.


In this case (archive.is) they have a stronger case, since many people who could potentially buy a subscription read the article on archive.is because an extension user violated the terms of their subscription.

Also, the extension likely has its own terms of use prohibiting uploading copyrighted content, shifting liability onto its users.


*uploaded

They went after seeders


downloaders also received legal letters.


But what is the relationship between archive.is and the user who installed the extension?


The user helps free the Internet by using archive.is as an openly accessible backup platform.


dude... haha it's a random person on the internet who is doing it for free.


They (archive.is) would have built the extension to send the current page content to their servers and the user would have installed it so they can archive internet pages. https://help.archive.org/help/save-pages-in-the-wayback-mach... (item 2)


You are confusing archive.is with archive.org. Although archive.is does have an extension[1], it doesn't appear to capture any of the page contents; it just sends the URL for archive.is to crawl.

1: https://chrome.google.com/webstore/detail/archive-page/gcaim...


I wasn't exactly confusing them but yeah, I did link to an archive.org article. I was having difficulty finding something specific to archive.is.

I think the distinction between the two is moot in this post. The question could very well have been "How does archive.org bypass paywalls?" Though it's interesting that archive.is seems to just crawl the URL. Indeed that means they wouldn't necessarily be able to bypass the paywall.


> If it identifies itself as archive.is, then other people could identify themselves the same way.

Theoretically, they could just publish the list of IP ranges that canonically "belongs" to archive.is. That would allow websites to distinguish if a request identifying itself as archive.is is actually from them (it fits one of the IP ranges), or is a fraudster.
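
The check on the website's side would be tiny. A sketch with standard-library ipaddress, where the published ranges are entirely hypothetical:

    import ipaddress

    # Hypothetical ranges archive.is might publish
    ARCHIVE_RANGES = [ipaddress.ip_network("203.0.113.0/24")]

    def is_archive_ip(ip):
        # Accept the "archive.is" identity only if the source IP is in a published range
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in ARCHIVE_RANGES)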


It would be far better and more secure for archive.is to publish a public key on its site and then sign requests with its private key, which sites could optionally verify.
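
For illustration, a minimal sketch of that idea using Ed25519 via the cryptography package. The header name and the signed string are made up, and a real scheme would also need timestamps or nonces to prevent replay:

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Archiver side: sign the request target with a long-lived private key
    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()  # this is what would be published

    target = b"GET https://example.com/article/123"
    signature = private_key.sign(target)  # sent in e.g. a hypothetical X-Archive-Signature header

    # Publisher side: verify against the published public key
    public_key.verify(signature, target)  # raises InvalidSignature if forged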


You just described client certificate auth


+1 on this!


In theory, this might work. But is it true? Do lots of sites have an archive.is whitelist?


I really don't see why they would, if they're using a paywall in the first place.


Follow the magnolia trail...




For WSJ at least, it appears that archive.is is fetching the AMP page, which returns the full content of the article hidden with CSS, and modifying the page to unhide the paywalled content and hide ads.

It might be using other techniques as well for bypassing paywalls, be it referer/user-agent spoofing (some old archives of sites that echo back HTTP request headers have archive.is sending a Referer of google.co.uk).
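
A very rough sketch of that unhiding step. How the AMP variant URL is formed and how the content is hidden differ per publisher, so treat the amp-access-hide class here as an assumption about the markup:

    import urllib.request
    from bs4 import BeautifulSoup

    def unhide_amp(amp_url):
        # amp_url is the AMP variant of the article; the URL pattern varies per site
        with urllib.request.urlopen(amp_url) as resp:
            soup = BeautifulSoup(resp.read(), "html.parser")
        # amp-access-style paywalls often hide gated sections with a CSS class
        # until an authorization script grants access; assumed class name here
        for el in soup.select(".amp-access-hide"):
            del el["class"]
        # Drop ad components while we're at it
        for ad in soup("amp-ad"):
            ad.decompose()
        return str(soup)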



I can access the wsj article without any account using https://gitlab.com/magnolia1234/bypass-paywalls-firefox-clea... (bypass paywall clean)


Wow, an actually good use for Amp? Amazing.


I'm sure it was an accident or honest mistake!


A lot of sites don't seem to care about their paywall. Plenty of them load the full article, then "block" me from seeing the rest by adding a popup and `overflow: hidden` to `body`, which is super easy to bypass with devtools. Others give you "free articles" via a cookie or localStorage, which you can of course remove to get more free articles.

There are your readers who will see a paywall and then pay, and there are your readers who will try to bypass it or simply not read at all. And articles spread through social media attention, and a paywalled article gets much less attention, so it's non-negligibly beneficial to have people read the article for free who would otherwise not read it.

Which is to say: the methods archive.is uses may not be that special. Clear cookies, block JavaScript, and make deals with or special-case the few sites which actually enforce their paywalls. Or identify yourself as archive.is, and if others do that to bypass the paywall, good for them.


Not specifically related to archive.is, but news sites have a tightrope to walk.

They need to both allow the full content of their articles to be accessed by crawlers so they can show up in search results, but they also want to restrict access via paywalls. They use 2 main methods to achieve this: javascript DOM manipulation and IP address rate limiting.

Conceivably one could build a system which directly accesses a given document one time from a unique IP address and then cache the HTML version of the page for further serving.
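
A toy version of that fetch-once-and-cache idea (the unique-IP part is waved away here; assume each first fetch goes out through a fresh proxy):

    import hashlib
    import pathlib
    import urllib.request

    CACHE = pathlib.Path("cache")
    CACHE.mkdir(exist_ok=True)

    def get_cached(url):
        key = hashlib.sha256(url.encode()).hexdigest()
        path = CACHE / f"{key}.html"
        if path.exists():
            # Serve the stored copy; the origin is only ever hit once per URL
            return path.read_text(encoding="utf-8")
        with urllib.request.urlopen(url) as resp:  # first (and only) origin fetch
            html = resp.read().decode("utf-8", errors="replace")
        path.write_text(html, encoding="utf-8")
        return html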


Many (most?) "big content" sites let Google and Bing spiders scrape the contents of articles so when people search for terms in the article they'll find a hit and then get referred to the pay wall.

Google doesn't want everyone to know what a Google indexing request looks like for fear the SEO mafia will institute shenanigans. And the content providers (NYT, WaPo, etc.) don't want people to know 'cause they don't want people evading their paywall.

Or maybe they're okay with letting the archive index their content...


Just FYI, Google and Bing publish the user agent strings[1][2] for their crawlers. At least in my experience, most of the typical ad-infested and paywalled news sites won't display the paywall if you change the user agent to a crawler they prefer.

[1] https://developers.google.com/search/docs/crawling-indexing/... [2] https://www.bing.com/webmasters/help/which-crawlers-does-bin...


Doesn't almost every site on the web know exactly what the Google bot looks like?


Google gives precise details about how to verify their bot is crawling your site and how to denote what content is paywalled and what isn’t.


Bingo. This is what I use to incentivize using a nonmonopolistic search engine to find the few sites I run.


If the people who know that tell you, they could lose access to said resources.

But it's kind of an open secret; you're just not looking in the right place.


I just tried it with a local newspaper. It did remove the floating pane, but it didn't unblur the text, which is also scrambled (it used to be much less well protected; Firefox reader mode could easily bypass it).

(https://archive.is/1h4UV)


I thought they used this browser extension: https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean


That extension does work, but do we know they use it?


They don't always use it, because I can archive a new page from my mobile phone browser, which doesn't even support extensions.

My guess is that most content providers with paywalls serve the entire content, so search engines can pick it up, and then use scripts to raise the paywall - archive.is takes their snapshot before that happens / doesn't trigger those scripts.


It's actually the opposite: for some news sites this extension links to archive.is, because that's the only known way to bypass the paywall.


There are known ways to bypass paywalls that are simply impossible to implement within a browser extension while trivial on 12ft or archive.is. For example, using a Ukrainian residential proxy, since some news websites granted free access from Ukraine.


Every once in a while I _do_ get a retrieval from archive.is that has the paywall intact.

But I don't know the answer either.


I don't know about archive.is, but 12ft.io does identify as google to bypass paywalls afaik


12ft.io also doesn't work or is disabled for many sites that archive.is still works on


Maybe because the creator of 12ft.io isn't anonymous


Wouldn't sites be able to see that requests from 12ft.io aren't coming from Google's IPs?


Yes.

Google recommends using reverse DNS to verify whether a visitor claiming to be Googlebot is legitimate or not: https://developers.google.com/search/docs/crawling-indexing/...

You can also verify IP ownership using WHOIS, or by examining BGP routing tables to see which ASN is announcing the IP range. Google also publishes their IP address ranges here: https://www.gstatic.com/ipranges/goog.json


https://search.google.com/test/rich-results?url= operates from legit Googlebot IPs, so it allows anyone to get paywalled content that even archive.is fails to fetch (from theinformation.com, for example).


"Google recommends using reverse DNS to verify..."

This is almost right. They recommend two steps:

1. Use reverse DNS to find the hostname the IP address claims to have. (The IP address block owner can put any hostname in here, even if they don't own/control the domain.)

2. Assuming the claimed hostname is on one of Google's domains, do a forward DNS lookup to verify that the original IP address is returned.

The second step is the important one.
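
In code, that two-step check looks roughly like this (standard library only; the accepted hostname suffixes follow Google's crawler-verification docs):

    import socket

    def is_real_googlebot(ip):
        # Step 1: reverse DNS - what hostname does the IP claim?
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        # Googlebot reverse DNS should land on googlebot.com or google.com
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: forward DNS - does that hostname really resolve back to the IP?
        try:
            _, _, addresses = socket.gethostbyname_ex(hostname)
        except socket.gaierror:
            return False
        return ip in addresses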


My hypothesis is that they use a set of generic methods (e.g., robot UA, transient cache, and JS filtering) and rely on user reports (they have a tumblr page for that) to identify and manually fix access to specific sites. Having a look at the source of the bypass-paywalls-clean extension will give you a good idea of the most useful bypassing methods. Indeed, most publishers are only incentivized to paywall their content to the degree where most of their audience is directed to pay, and they have to leave backdoors here and there for purposes such as SEO.


Follow the magnolia trail...


What happens when you first load a paywalled article? 9 times out of 10 it shows the entire article before the JS that runs on the page pops up the paywall. Seems like it probably just takes a snapshot prior to JS paywall execution combined with the Google referrer trick or something along those lines.


Your browser usually downloads the entire article and certain elements are overlaid on top.

It's trivial to bypass most paywalls, isn't it?


Not for some (I think the Wall Street Journal). Apparently the AMP version of the page does work this way for WSJ though, which is how IA gets around the paywall.


They use you, as a proxy. If you (who archives it) have access to the site (either because you paid or have free articles), they can archive it too. If you don't have access, they only archive a paywall.


every time you visit they force some kid in a third world country to answer captchas until they can pay for one article's worth of content


It’s internet magic. <rainbowmagicsparkles.gif> ;)


Alas, it doesn't allow access to the comment section of the WSJ, which is the only reason I would visit the site. WSJ comments reinforce my opinion of the majority of humanity. My father allowed his subscription to lapse and I won't send them my money, so I will just have to imagine it.



