Aren't AMP Caches committing copyright infringement? (ctrl.blog)
199 points by superkuh on March 27, 2019 | hide | past | favorite | 95 comments

Whether or not it is "technically" copyright infringement, it is going to be impossible for a publisher to argue that in court considering they themselves have to publish an AMP-compatible version of their page in the first place, and caching is sold as a core feature of AMP (https://www.ampproject.org/learn/overview/).

Yup, feeding AMP to consumers is a lot different than, say, https://outline.com/ which is straight up copyright infringement.

Does anyone actually know who is running outline.com? Are they definitely infringing on copyrights or is that just a popular assumption?

I've been wondering for awhile, given how successful it is at navigating paywalls, if it might actually be run by the newspapers themselves as a way of attracting potential customers. Giving them content to keep them interested while guilting them into paying.

> Does anyone actually know who is running outline.com?

The only hint I can get is their Terms of Service are governed by California law and the courts of Santa Clara County [1]. (It also references “business transfers” in its privacy policy [2], implying it’s a for-profit entity.)

Otherwise—and this is unusual for any legitimate activity—I can find no reference to any legal person anywhere on their site.

[1] https://outline.com/terms.html

[2] https://outline.com/privacy.html

Unusual, but not as far as I know illegal or anything.

It would be unusual for an illegal operation to

- agree to the jurisdiction of a US court

- mention business transfers and be a for-profit entity in that sense, or, hell, even to have a privacy policy.

(of course, they could also have just written that for fun)

You can pretty easily do a historical record look-up on the domain and its DNS. It does not appear to have any ownership tie to major media.

This is less useful than it was in the 90s:

    Registrant Name: PERFECT PRIVACY, LLC

    Name Server: NS-497.AWSDNS-62.COM
    Name Server: NS-1669.AWSDNS-16.CO.UK
    Name Server: NS-861.AWSDNS-43.NET
    Name Server: NS-1406.AWSDNS-47.ORG

Exactly, this tells us nothing.

If media were running this, the whole point would be to not be publicly known, and to make people feel guilty for using the service.

If criminals are running this, obviously they don't want to be publicly known.

What if outline.com simply did a GET request on the source from the browser, stripped out paywalls and ads and served the content from the source server. Is that still copyright infringement?

That's basically what Brave does - web browser with a built-in adblocker. Or, for that matter, Chrome with any adblocking extension. It's not illegal, but it's not exactly popular with webmasters, and they're within their rights to block access to these browsers. In practice it tends to evolve into a cat-and-mouse game where adblockers block the ads, websites try to detect the adblockers and show you a pop-up encouraging you to turn off the adblocker, adblockers try to block the pop-ups, and so on.

Brave does more than that: https://news.ycombinator.com/item?id=18734999

Adblock Plus should fork its own web browser with built-in Acceptable Ads whitelisting. It'd be more honest than Brave.

Acceptable Ads is anything but honest. They take money to whitelist ads, including ads from Taboola, one of the worst ad networks. https://www.businessinsider.fr/us/google-microsoft-amazon-ta...

Now, I'm not a user of Acceptable Ads, but I don't think that their specific policy of taking money to whitelist AA-compliant ads from large companies, or the fact that the policy applies to entities with otherwise scummy ads, is necessarily dishonest.

- Taboola ads are scummy

- Acceptable Ads is not supposed to allow scummy ads

- Taboola paid to get their ads accepted

- Therefore, Acceptable Ads is dishonest

If you google a bit, you'll find that the ads that get whitelisted under Acceptable Ads are nothing different from the normal Taboola bullshit. In fact, the whitelist is quite simple: They allow the whole taboola network to operate.

Webmasters hate it when you do this but there’s nothing they can do to stop you

Yep, it's one secret technique they don't want you to know about.

That sounds like Reader View. I don’t think copyright law requires rendering the entire webpage exactly as the server requests.

I see no difference between visiting a webpage with a browser and visiting it through a program which modifies the content prior to delivery. That's exactly what any browser plugin does.

The big difference is that if the program runs on somebody else's server, that server needs redistribution rights (copyright), since it is a different entity from you (who would be running a local program).

Exactly. People want to make this a simple case (outline is really just a browser by a different name), but copyright isn't a bright-line domain: intent matters, and outline is just re-hosting other people's content for broad consumption.

This would be true for every switch on the route.

Those don't modify or cache the content; they only serve it to the user whom the publisher approved serving it to, and increasingly, with HTTPS, they can't even see it. Moreover, the publisher implicitly accepts that by being on the Internet.

Of course, a lot of content is cached along the route for frequently accessed URLs.

Not by network hardware, which is the context of this discussion.

It's also not the type of activity being discussed, which is unauthorized caching: a content provider who uses a CDN is doing so intentionally and while shared local proxies are increasingly uncommon they also respect the Cache-Control headers set by the source — see e.g. https://redbot.org/?uri=https%3A%2F%2Fwww.wsj.com — so again there's the distinguishing factor of authorization.
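Shared proxies decide what they may store by parsing those Cache-Control directives. A minimal sketch of that decision; the function names are my own, not from any particular proxy implementation:

```python
def parse_cache_control(header: str) -> dict:
    """Parse a Cache-Control header value into a {directive: value} dict."""
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        # Valueless directives (e.g. "no-store") are stored as True.
        directives[name.lower()] = value.strip('"') if value else True
    return directives

def shared_cache_may_store(header: str) -> bool:
    """True if a shared (multi-user) cache may store the response at all.

    "no-store" forbids any caching; "private" forbids shared caches.
    """
    d = parse_cache_control(header)
    return not ("no-store" in d or "private" in d)

print(shared_cache_may_store("max-age=300, public"))  # True
print(shared_cache_may_store("private, max-age=0"))   # False
```

This only covers the storage decision; a real cache also honors `max-age`, `s-maxage`, revalidation, and `Vary`, but the point stands: a source that sets `private` or `no-store` has not authorized shared caching.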

I'm absolutely foggy about the specifics, but I wasn't talking about CDN, rather about the ISP-side.

Those have an implied license by the copyright holder.

Because this is disallowed by CORS / the same-origin policy.

I think the OP stated a "what if", ie "What if CORS didn't exist?" I also think you could argue "What if outline loaded articles in an iframe?" (and at the same time "what if same-origin policy wasn't a thing?") If it was technologically possible, would it be infringing?

> If it was technologically possible

It's actually pretty easy; you can start Chrome with the --disable-web-security flag [1].

> I also think you could argue "What if outline loaded articles in an iframe?"

I'm sure this would be legal as it's equivalent to loading the site in a tab. The parent site wouldn't be able to manipulate any of the content/ads/paywalls/functionality, and the content site gets the full hit.

[1] https://stackoverflow.com/questions/3102819/disable-same-ori...

> The parent site wouldn't be able to manipulate any of the content/ads/paywalls/functionality

What? What do we disable CORS for, if not to allow JavaScript from one domain to manipulate content in an iframe of another domain? Am I missing something?

Disabling CORS would allow you to make straight requests to foreign content from your site and manipulate the responses exactly as though they came from your own servers - no iframe needed. CORS does not disable iframe sandboxing.
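To illustrate that point: the Access-Control-Allow-Origin check is something the browser performs on behalf of the page's script before exposing a cross-origin response. A rough sketch of that client-side decision (the function name is hypothetical, and real browsers also handle credentials and preflights):

```python
def browser_exposes_response(requesting_origin, allow_origin_header):
    """Simplified model of the browser-side CORS check.

    The server already sent the full response; this only decides whether
    the browser lets the requesting page's script read it. A non-browser
    client (curl, a scraper) ignores this entirely.
    """
    if allow_origin_header is None:
        return False  # no ACAO header: response is opaque to the script
    return allow_origin_header == "*" or allow_origin_header == requesting_origin

print(browser_exposes_response("https://outline.com", None))  # False
print(browser_exposes_response("https://outline.com", "*"))   # True
```

Note what this implies: the restriction lives in the client, which is why CORS cannot function as a usage-rights mechanism.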

CORS is just a security feature, it does not imply anything about copyright or terms of use.

The DMCA ties the two by prohibiting users of copyrighted works from circumventing technological protection measures. It could be argued that bypassing CORS applies.

CORS isn't a technological protection like DRM and isn't designed as such; it's purely a security measure. By default you don't even specify it. Browsers are free to ignore it as they wish (but with increased security risk, of course).

I agree. CORS is something my user agent does to protect me. It has nothing to do with the upstream site; I could easily browse it with a user agent that doesn't support CORS and nothing would break. CORS is just some annotations that lets my user agent determine "hey these scripts might be up to something shady". It is not a copy-protection mechanism by any means.

Yeah, but my point is what was being suggested is physically impossible with CORS in place, so it does imply something about what is in the realm of possibility.

CORS is really just a security measure for embedded pages and elements. It's not intended to, and cannot, enforce usage restrictions/rights, since it requires the client (browser) to honor the setting. If I wget a page and strip the text from it, I'm not embedding the page in any manner, so CORS is irrelevant. The current 'agreement' for respecting copyright (whether it would hold up in court even with a TOS is beyond my knowledge) is robots.txt, which, I'll admit, is pretty dated and a very poor solution for dynamic pages, and it still requires client cooperation.

The best solution for copyright/paywall enforcement is to roll your own. If the request doesn't have the required cookie to access the full article, don't respond with the full page. This works very well for dealing with sites such as outline.com .
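A minimal sketch of that roll-your-own enforcement; the names here (serve_article, VALID_SESSIONS) are hypothetical stand-ins for a real session store and handler:

```python
FULL_ARTICLE = "Full article text that only subscribers should receive."
TEASER = FULL_ARTICLE[:24] + "... [Subscribe to read more]"

VALID_SESSIONS = {"abc123"}  # stand-in for a real session store

def serve_article(cookies):
    """Return the full article only if the request carries a valid
    subscriber cookie; everyone else gets the teaser."""
    session = cookies.get("session_id")
    if session in VALID_SESSIONS:
        return FULL_ARTICLE
    return TEASER  # a scraper without the cookie only ever sees this
```

Because the full text never leaves the server for unauthenticated requests, there is nothing for a client-side paywall-stripper to uncover.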

Sites like outline.com would be really interesting/useful if they allowed you to upload your login cookies so that they could get paywalled articles and still strip the ads.

> If the request doesn't have the required cookie to access the full article, don't respond with the full page

The way outline.com works is by loading the article unsuspiciously once from their server, then serving it any number of times from their infrastructure. How would this stop that from happening?
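That fetch-once, serve-many pattern can be sketched as follows (fetch_from_publisher is a stand-in for one ordinary HTTP GET, which may well pass a metered paywall on the first hit):

```python
article_cache = {}
fetch_count = 0  # counts how often the publisher's server is contacted

def fetch_from_publisher(url):
    """Stand-in for one normal HTTP GET against the publisher."""
    global fetch_count
    fetch_count += 1
    return f"<article from {url}>"

def serve(url):
    """First request triggers one fetch; every later request is answered
    from the service's own copy, so the publisher's cookie check runs once."""
    if url not in article_cache:
        article_cache[url] = fetch_from_publisher(url)
    return article_cache[url]

serve("https://example.com/story")
serve("https://example.com/story")  # served locally; publisher never sees it
```

Which is the crux of the comment above: cookie-gating stops per-reader access, but not a service that obtained one legitimate copy and then redistributes it.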

Yes, it is. It's a form of republication, and as such, restricted under copyright law. The difference between a third-party server-side solution and client-side solutions is easy to see: client-side solutions don't redistribute.

If that is copyright infringement then so is Google Chrome.

Outline is a browser in a browser.

If it were accessing unauthorised content in a shady way, I would say publishers have a leg to stand on; but serving Google different content from what is served to the user is already borderline illegal for a variety of reasons (it's a misrepresentation). I don't think anyone wants to go down that road.

Remember with the law intent is usually what matters most.

The technical stance that outline is serving up content that the user requests from another site is by design not redistribution.

The funny part is that the argument that allows this is the same stupid argument that permits copyright to exist in a world where every time you open or read a file you are technically copying it.

> serving up content to Google different from what it serves to the user is already borderline illegal for a variety of reasons

This is called cloaking. Google doesn’t like it because it makes for a bad user experience, but it is certainly not illegal (or even borderline).

I think the legal argument you’re trying to make died with the Aereo supreme court decision. The ”outline is a browser in a browser” statement is cute but it doesn’t pass the duck test.

> I think the legal argument you’re trying to make died with the Aereo supreme court decision. The ”outline is a browser in a browser” statement is cute but it doesn’t pass the duck test.

About Aereo, could you elaborate more on that? I'd never heard of it; checking the Wikipedia article, I cannot find the word "browser" inside.

I'm seriously interested in that argument because years ago I was considering an idea to do exactly that. I mean, look at Rubinius (Ruby in Ruby) or PyPy (Python in Python). Those are actually serious projects that are more than just research; as far as I know, some things can even be done faster that way, and they have given inspiration to the reference implementations.

Speaking of JS, React is basically a re-implementation of the DOM in JS, with an XML-like language (JSX).

Nobody minds that Chrome and Firefox include translation features that transform websites, that people use screen readers, etc. I think there are reasonable limits to what content providers can restrict.

Outline is doing a transformation and then serving a copy of it, not just doing it in place with the copy that the user lawfully obtained.

Aereo tried to do the same for TV broadcasting (they claimed they didn't copy, just digitized and transmitted on behalf of the user), and the courts struck that down.

Right. In Aereo's case they even went as far as to nominally "rent" the colocated hardware to the user such that it was the user who was receiving the broadcast and transmitting it to themself for personal use. Which is clever, but the court ultimately decided that it doesn't work that way.

How is this different from what the web archive does?

Google also caches and serves everything else its robot finds, so if this was a problem it was already a problem long before AMP.

Caching can be disabled https://support.google.com/webmasters/answer/79812?hl=en

But per the wording, "noarchive - Prevents Google from showing the Cached link for a page", it seems like it is technically just avoiding showing the cached link.

But we are in f'ed up territory. Does EU law say anything about meta tags? If not, then unless explicitly allowed, you can't copy it.

On so many levels.

Copyright infringement is a tort though, so it's down to content owners to sue Google if they feel damaged by this "caching".

I think countries added workarounds for computer caching, allowing transient copies. But Google's "cache" is more of a short-term archive; I'd guess they called it a "cache" to semantically bypass the issue of it being an infringing copy.

Yes, actually it does. So do AU, US, and CA laws. Or, not expressly, but they do say a caching service must respect recognized industry standards for updating, removing, and excluding content from being cached. That covers HTML meta elements, HTTP caching headers, /robots.txt files, etc.
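Concretely, those recognized exclusion standards look something like the following (an illustrative fragment, not a complete policy; the path in the robots.txt example is made up):

```text
# robots.txt: keep a crawler out of a section entirely
User-agent: Googlebot
Disallow: /premium/

<!-- HTML meta element: allow indexing, but forbid showing a cached copy -->
<meta name="robots" content="noarchive">

# HTTP response header: forbid any cache from storing the response
Cache-Control: no-store
```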

I'm pretty sure you can disable AMP for your site too.

You can stop me copying all your works, just find me and ask nicely. So, I can never be successfully sued for copyright infringement now, because it's easy to "disable" my potential infringement. Yay. /s

So we returned to the original comment: "Google also caches and serves everything else its robot finds, so if this was a problem it was already a problem long before AMP."

That's not really a valid defense. The default state is copyrighted unless otherwise stated; this assumes the contrary.

This is where the caching exceptions in copyright laws come into play. A service can automatically cache content passing through it and not be held liable. Caching is defined really broadly, so just about anything can be considered caching.

Any communications channel transmitting data is committing copyright infringement. It’s not like the source bit is deleted once received at the destination. That’s how the internet works.

Copyright law is more complicated than that, but regardless, OCILLA drastically reduced copyright liability for the pipes.


I was just taking the argument to its logically absurd conclusion. Of course, copyright law is more complicated than that.

I think this is looking at things from the perspective of an engineer, not a lawyer.


As others have said, the publishers opt in to it... but I wonder whether litigious photographers could argue that their contracts never indicated their photos could exist anywhere but on the specific domain in the contract.

In any case, that wouldn’t be Google’s fault.

Depends. If Google seeks permission from the page owner and the owner warrants that the photos are licensed for AMP use, then it's all good.

If Google doesn't bother, just assuming it has the rights, then it's an infringer. In copyright law the distributor is usually down for the largest punishment; that would be Google.

"I didn't bother to check if it was an infringement for me to distribute the work" doesn't seem like it would negate a claim of tortious infringement?

They did check. Now, that’s not necessarily enough, but it is a defense.

> [...] contracts never indicated their photos could exist anywhere but on the specific domain in the contract

I doubt many contracts are written for specific domain names. I expect most contracts allow for a publisher to use it in their publications without specifying exactly whether that publication is a website, magazine, publisher branded app, or an Apple News article.

You would think so, but with publications on the ropes, it gets sliced and diced more than you might think. Photo rights are cheaper to obtain by the publication if they limit the platforms (and the time frame of rights, and a bunch of other factors), and savvy photographers will call out pubs on that.

But the way Google uses AMP goes way beyond caching, at least in the way news articles are presented. Do I really give anyone an implied license to republish AMP content on their own properties as soon as I put some AMP pages online? There might be some limited control by using robots.txt but that's rather coarse-grained.

My main irk with AMP is that it's mostly an all-or-nothing solution. I'd love to publish AMP pages without allowing anyone to republish my content. The way Google implements AMP also seems like a huge antitrust issue given their near-monopoly in search. I wouldn't be surprised if one of the next EU fines will be over AMP.

The new EU copyright directive expressly excludes caching services from the new publishers' rights (the "link tax").

But don’t you have to opt-in to AMP to begin with? Wouldn’t there be some sort of implicit permission in providing the AMP formatted content to begin with?

You can deploy an AMP page and set a meta element that announces it exists. This is not explicit opt-in to Google's AMP cache, though. That's what the author means by implicit license.

You opt in to the internet by making content available on it too, so are we rolling back to the free copying of last millennium?

As discussed in the article, some content management systems (like hosted WordPress.com instances) produce AMP variants of each published page. The author/rightsholder may not be aware of this and thus has neither expressly nor implicitly agreed to anything of the sort.

Then the author should pay attention.

If WordPress by default said in the footer "content licensed under CC", how would it be my fault if I reused the content, just because the author didn't bother to change that?

Do you have to opt-in to AMP e-mail? Keep in mind that the sender then holds the copyright. For that matter, can your web hosting platform opt-in to AMP without your express consent?

Seems like an overly-technical interpretation of "copying". But then again, EULAs were based on a similar principle.

One thing I've never seen addressed is the way AMP hijacks Android Chrome behavior. It hides and then locks out the address/menu bar until the user scrolls all the way to the top of the page and then drags down again. If any other website is allowed to do this, I haven't seen it, and I would love a way to disable it.

This one thing is the single biggest reason why I avoid AMP like the plague. It's like this thing was designed for someone who doesn't know what tabbed browsing is.

It does a similar thing on iOS and is super annoying. I run into it on Firefox Focus (where I google most often), but presumably it affects other browsers as well.

The problem doesn't occur for me. Sounds like a bug.

I wondered the same about text in search results. My guess is that web page authors would rather let it happen than lose out on the traffic.

This is exactly the type of thing that “fair use” is supposed to cover. It gets blurrier though when you get into fuller media like images and video. Does a thumbnail count? How about a 10-second clip? How about a speech-to-text transcript?

No, because you have to do things that allow AMP caching. There is implicit approval and even request.

I think we are all just overthinking this stuff. The purpose of the web is to share things with other people.

> The purpose of the web is to share things with other people.

How does that in any way remove the need for a sound legal framework defining the limits of IP rights for this use case? Do you think everything put on the web should be considered public domain?

> Do you think everything put on the web should be considered public domain?

In my personal opinion -- Emphatic yes. Copyright law is tyrannical and should be abolished.

If you don't want to serve the content then don't serve the content.

Agree. Copyright and the patent system are leftover vestiges of monarchical rule.

I prefer to share things with other people without a corporation injecting itself into the middle of that transaction for the purposes of surveillance and ad revenue.

Doesn't matter, still sharing. Add a robots.txt if you explicitly want to disable access to Google's crawler.

Summer child, the web exists to extract money from consumers for the benefit of property owners. If it were about sharing, copyright law would not apply.

The best way to kill AMP is to increase bandwidth limits on mobile internet; that's the only justifiable reason for this monstrosity to exist. I can't wait to see AMP die; until then, I will bide my time...

AMP is mostly a response to sites that want to load 20 MB video and endless popups timed to appear just when you want to click on "Next".

Build reasonable web sites and don't use AMP.

I won't even use the mobile web without Firefox and uBlock anymore. The difference between that and stock Chrome is ridiculous. Browsing with Chrome results in about 5x the data usage.

Actually, you'd have to solve the physics that is holding back latency on mobile networks to compete with AMP.

I don't experience much noticeable latency on 4G. 5G is supposed to fix that with 1 ms latency.

Tl;dr: probably not. But we're not lawyers, and Google/Bing have enough lawyers to be confident that it isn't.

Isn't web crawling the same? Or scanning whole books, surfacing snippets, and making them searchable?

It's all fair use.

And what about the rest of the world outside the USA? We don't have fair use in the UK, for example.
