Understanding and managing the impact of machine learning models on the web (w3.org)
138 points by kaycebasques 9 months ago | 36 comments



I agree with the general idea of tagging content to help classify.

I'd given this some thought via MIME and ended up with a kind of BioNFT.. so named because it uses NFTs piecewise, tracking the creation events and agent types (bio, AI, etc.) as part of the content lifecycle.

https://twitter.com/PMayrgundter/status/1638016474483683328

Highlight..

What if:

  - devices sign source creations with a biosignature

  - editing tools sign input

  - media types include that, effectively saying:

    ai_edited(human_created(photo))

  and do this under the experimental namespace in MIME:

  image/x.bio(pablo@example.com/photo123).html

  image/x.adobe.photoai(http://x.bio(pablo@example.com/photo123)).html
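
A minimal sketch of how such a nested claim could be parsed (Python; the grammar and names here are illustrative only, not part of MIME or any standard):

    import re

    # Hypothetical claim grammar: agent(agent(...(subject)))
    CLAIM_RE = re.compile(r"^(\w+)\((.*)\)$")

    def parse_claim(claim: str) -> list[str]:
        """Return the chain of agents, outermost first."""
        chain = []
        while True:
            match = CLAIM_RE.match(claim)
            if match is None:
                chain.append(claim)  # innermost subject, e.g. "photo"
                return chain
            chain.append(match.group(1))
            claim = match.group(2)

    print(parse_claim("ai_edited(human_created(photo))"))
    # ['ai_edited', 'human_created', 'photo']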


Perhaps I am being too cynical, but how do you protect this scheme against hostile actors?

Bits don't have color after all (see https://ansuz.sooke.bc.ca/entry/23 if you don't get this reference), and it would be fairly trivial to manually alter such media types to anything you want. For example, if someone posts an `ai_edited(human_created(photo))` online, it would be straightforward to take the pixels of the photo and re-publish them as a "new" image with only the `human_created(photo)` tag. You could also randomly start adding `ai_edited` tags to things you want discredited, etc.


Ah no problem. Just a sketch..

Looking at C2PA, I'd now use "claim" to describe those tags. They're short and human-readable but not self-contained; they'd have to be authenticated with tool signing to be very useful. As C2PA also emphasizes, it's about establishing trust for a domain, e.g. can Samsung's photo signing be trusted?

E.g. for a normal photo, your device could sign it with.. an OTP coordinated with the manufacturer? Then include the signing metadata in the EXIF tags of the image. To validate, you send the image + signed claims to the manufacturer; they know the OTPs (?) and can verify the signing.

Doesn't make it unforgeable, because you could intercept the hardware OTP/signing protocol.. but it would raise the cost quite a bit. Maybe with frequent key resets you can raise it further.
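
To make that flow concrete, a rough Python sketch with an HMAC over the image bytes standing in for the hardware OTP protocol. The per-device key registry and all names are assumptions for illustration, not any manufacturer's actual scheme:

    import hashlib
    import hmac
    import secrets

    # Manufacturer's registry; on real hardware the key never leaves the device.
    DEVICE_KEYS = {"device-123": secrets.token_bytes(32)}

    def sign_image(device_id: str, image_bytes: bytes) -> str:
        """Device side: produce a claim, e.g. to store in the EXIF metadata."""
        key = DEVICE_KEYS[device_id]
        return hmac.new(key, image_bytes, hashlib.sha256).hexdigest()

    def verify_image(device_id: str, image_bytes: bytes, claim: str) -> bool:
        """Manufacturer side: recompute the MAC and compare."""
        expected = sign_image(device_id, image_bytes)
        return hmac.compare_digest(expected, claim)

    photo = b"...raw sensor data..."
    tag = sign_image("device-123", photo)
    print(verify_image("device-123", photo, tag))         # True
    print(verify_image("device-123", photo + b"x", tag))  # False: pixels changed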

Something like that sounding more useful?


The analogue hole makes this all pointless. Ultimately, at some point down the chain there are regular camera sensors that can be replaced with anything you want.


Exactly. This won't stop any determined bad actor; if anything, it's just going to boost their credibility when they inevitably come up with a bypass that allows them to sign any piece of media as "authentic" (within a few days or weeks after a device implementing such a scheme is released).

What then? 99.9% of honest actors are paying with their privacy[1] and usability again, while the remaining 0.1% of dishonest actors are using a bypass, which sounds a lot like piracy and various DRM schemes. We've been here before, and it never works.

[1] While this might not be an immediate goal, I think it's pretty obvious any such scheme would eventually be corrupted to encode a traceable device ID into every photo and video taken with the device - to protect the children and stop CSAM, of course!


Hmm, how about watermarking the photo with some function of the OTP? (like bit steganography; invisible to normal human use)

Seems to preserve easy checking by the device manufacturer, but it's not immediately clear to me how to replace the sensor in a way that gets a signed image.
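
A rough sketch of that, modeling pixel data as a flat byte buffer: the HMAC is computed over the pixels with their least significant bits masked out, then written into those bits. A real implementation would work on decoded image planes and would need to survive re-encoding, which this does not:

    import hashlib
    import hmac

    def embed_watermark(pixels: bytearray, otp_key: bytes) -> bytearray:
        masked = bytes(b & 0xFE for b in pixels)  # content with LSBs zeroed
        mark = hmac.new(otp_key, masked, hashlib.sha256).digest()
        bits = [(byte >> i) & 1 for byte in mark for i in range(8)]
        out = bytearray(pixels)
        for i, bit in enumerate(bits):
            out[i] = (out[i] & 0xFE) | bit  # one watermark bit per pixel byte
        return out

    def check_watermark(pixels: bytearray, otp_key: bytes) -> bool:
        masked = bytes(b & 0xFE for b in pixels)
        expected = hmac.new(otp_key, masked, hashlib.sha256).digest()
        bits = [(byte >> i) & 1 for byte in expected for i in range(8)]
        return all((pixels[i] & 1) == bit for i, bit in enumerate(bits))

    key = b"per-shot OTP from the manufacturer"  # stand-in for the real OTP
    image = bytearray(range(256)) * 4            # 1024 fake pixel bytes
    marked = embed_watermark(image, key)
    print(check_watermark(marked, key))   # True
    marked[500] ^= 0xF0                   # tamper with a pixel's high bits
    print(check_watermark(marked, key))   # almost certainly False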

And if it seems like a bit of trouble, I think it's still worth consideration by the manufacturer, as a feature for their images and in support of a community of users who want to produce images that can be validated this way.


A business selling Internet-connected cameras that automatically hash the photos they take and commit those hashes, unaltered, would give such a claim much more credence. I suppose this idea doesn't come up all that often because of the hardware changes involved, but I could see this being very handy for anyone who needs to verify that photos are real in a professional capacity.
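
A minimal sketch of that capture-time flow, with a local append-only file standing in for whatever public commitment service (timestamping authority, transparency log) such a camera would actually use:

    import hashlib
    import json
    import time

    LOG_PATH = "capture_log.jsonl"  # stand-in for a public, append-only log

    def commit_photo(image_bytes: bytes) -> str:
        """Camera side: hash the photo at capture time and record the hash."""
        digest = hashlib.sha256(image_bytes).hexdigest()
        with open(LOG_PATH, "a") as log:
            log.write(json.dumps({"sha256": digest, "ts": time.time()}) + "\n")
        return digest

    def verify_photo(image_bytes: bytes) -> bool:
        """Verifier side: a photo checks out if its hash was committed."""
        digest = hashlib.sha256(image_bytes).hexdigest()
        with open(LOG_PATH) as log:
            return any(json.loads(line)["sha256"] == digest for line in log)

    photo = b"...sensor readout at capture time..."
    commit_photo(photo)
    print(verify_photo(photo))          # True: hash was committed at capture
    print(verify_photo(photo + b"!"))   # False: no record of the altered bytes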


> Doesn't make it unforgeable, because you could intercept the hardware OTP/signing protocol

If hardware-driven attestation like that used by the Xbox One is used, intercepting the protocol becomes impossible.

Preventing computing hardware from running unauthorized processes is already a largely solved problem.


What you're describing here is basically what C2PA is


Reference, for context: https://c2pa.org/

And: the BBC just started using C2PA across some content. The BBC's R&D team talking about it: https://www.bbc.co.uk/rd/blog/2024-03-c2pa-verification-news...


Related critique of the BBC's use of C2PA, and C2PA in general: https://www.hackerfactor.com/blog/index.php?/archives/1024-I...


That was an interesting rabbit hole of articles, thanks. From an earlier article: [1]

> At FotoForensics, I'm already seeing known fraud groups developing test pictures with C2PA metadata. (If C2PA was more widely adopted, I'm certain that some of these groups would deploy their forgeries right now.)

> To reiterate:

> * Without C2PA: Analysis tools can often identify forgeries, including altered metadata.

> * With C2PA: Identifying forgeries becomes much harder. You have to convince the audience that valid, verifiable, tamper-evident 'authentication and provenance' that uses a cryptographic signature, and was created with the backing of big tech companies like Adobe, Microsoft, Intel, etc., is wrong.

> Rather than eliminating or identifying fraud, C2PA enables a new type of fraud: forgeries that are authenticated by trust and associated with some of the biggest names on the tech landscape.

[1] https://www.hackerfactor.com/blog/index.php?/archives/1013-C...


This C2PA scheme looks flawed from the start. It just helps make the untrustworthy look more trustworthy.


Thanks for the ref! Checking it out


C2PA would be better on-chain, so maybe you could do a proof-of-concept implementation of that.


Trying to parse recursive URLs seems like it will break everything everywhere.


> the copyright system creates a (relatively) shared understanding between creators and consumers that, by default, content cannot be redistributed, remixed, adapted or built upon without creators' consent. This shared understanding made it possible for a lot of content to be openly distributed on the Web.

That is not remotely a shared understanding, is wrong, and has nothing to do with making it possible for a lot of content to be openly distributed on the web. Content is distributed quite widely without concern for copyright.

> A number of AI systems combine (1) automated large-scale consumption of Web content, and (2) production at scale of content, in ways that do not recognize or otherwise compensate content it was trained from.

> While some of these tensions are not new (as discussed below), systems based on Machine Learning are poised to upend the existing balance. Unless a new sustainable equilibrium is found, this exposes the Web to the following undesirable outcomes:

> Significantly less open distributed content (which would likely have a disproportionate impact on the less wealthy part of the population)

That's even more ridiculous. The wealthy have the most to gain from restricting the flow of information to channels that collect rent on behalf of their capital. It's the "less wealthy" who routinely find ways to distribute content outside of rent-seeking channels. It's the "less wealthy" who benefit the most from the commoditization of creative content via generative algorithms.

Quite frankly, I expected better from W3C.


> > the copyright system creates a (relatively) shared understanding between creators and consumers that, by default, content cannot be redistributed, remixed, adapted or built upon without creators' consent. This shared understanding made it possible for a lot of content to be openly distributed on the Web.

> That is not remotely a shared understanding, is wrong, and has nothing to do with making it possible for a lot of content to be openly distributed on the web. Content is distributed quite widely without concern for copyright.

I'm not sure if the switch from active voice in the original quote to passive in yours was deliberate or not, but "understanding between creators and consumers" is very different from your "content is distributed".

It is the case, yes, that people widely distribute content on the web with no regard for copyright law. But those people aren't generally creators of that content.

The article is talking about the incentives that the web places on content creators. If the result of AIs harvesting every bit of content on the web is that it gets regurgitated without sending consumers over to the creator's website, then creators will stop putting stuff online.

People cloning and resharing content without regard to copyright has not so far seemed to have systemic negative effects on the web. Search engines seem to be pretty good at pointing users to the upstream original sources of copyrighted content, so plagiarism is common but apparently not common enough to cause content authors to stop putting it online.

AI risks tipping that balance such that content creators really might stop posting stuff online. Why waste a meaningful chunk of your life creating a thing and putting it on the web if the only thing that will ever see it and know that it came from you is an AI slurping it up?

> It's the "less wealthy" who routinely find ways to distribute content outside of rent-seeking channels.

Again, I think you're presuming a world where content magically exists a priori and the network is simply a mechanism for deploying it. The article is about what happens when the system discourages people from making things at all.

Poor people can find ways to pirate just about every book on Earth... except for those books that never ended up getting written because the incentives placed on the author didn't work out.


> Again, I think you're presuming a world where content magically exists a priori

Plenty of works were created before copyright. I think the background section of Wikipedia's page on copyright is telling:

> The concept of copyright developed after the printing press came into use in Europe[16] in the 15th and 16th centuries.[17] It was associated with a common law and rooted in the civil law system.[18] The printing press made it much cheaper to produce works, but as there was initially no copyright law, anyone could buy or rent a press and print any text. Popular new works were immediately re-set and re-published by competitors, so printers needed a constant stream of new material. Fees paid to authors for new works were high, and significantly supplemented the incomes of many academics.[19]

> Printing brought profound social changes. The rise in literacy across Europe led to a dramatic increase in the demand for reading matter.[16] Prices of reprints were low, so publications could be bought by poorer people, creating a mass audience.[19] In German language markets before the advent of copyright, technical materials, like popular fiction, were inexpensive and widely available; it has been suggested this contributed to Germany's industrial and economic success.[19] After copyright law became established (in 1710 in England and Scotland, and in the 1840s in German-speaking areas) the low-price mass market vanished, and fewer, more expensive editions were published; distribution of scientific and technical information was greatly reduced.[19][20]

So basically: cheap reproduction meant poorer people could get in on the production game and/or afford to participate in the consumer side of the market, lack of copyright drove higher fees for new works from authors, and the introduction of copyright increased the price of works and reduced the distribution of scientific knowledge. Sounds great, sign me up.


If you’re the middle man, sounds like a great deal. Pay less, charge more, waste less paper.


> If the result of AIs harvesting every bit of content on the web is that it gets regurgitated without sending consumers over to the creator's website, then creators will stop putting stuff online.

Which is a thing that is already happening. Not in significant enough numbers to matter at this point, but I expect this trend to get larger with time.


I think you are not taking into consideration the new content being created by AI with a human in the loop, such as in the ChatGPT interface. With 100M users, OpenAI might be generating on the order of 1 trillion tokens per month. These logs are mixed AI and human text, the human part containing tasks and feedback.

By transforming these chat logs into training examples and fine-tuning the model, there is a way to integrate LLM and human signals. Of course, it is necessary to be mindful of copyright and PII during this process; not everything is generally useful to include in the model. But having hundreds of millions of people inputting feedback into the model can scale even more than unassisted content publishing.
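
As a hedged sketch of what that transformation might look like, assuming the common JSONL "messages" fine-tuning format; contains_pii() below is a naive placeholder for the real copyright/PII screening mentioned above, not an actual filter:

    import json

    def contains_pii(text: str) -> bool:
        return "@" in text  # naive stand-in for a real PII/copyright filter

    def logs_to_examples(chat_logs):
        """Yield fine-tuning examples, dropping conversations that fail screening."""
        for log in chat_logs:
            if any(contains_pii(turn["content"]) for turn in log):
                continue
            yield {"messages": log}

    logs = [[{"role": "user", "content": "Summarize this thread."},
             {"role": "assistant", "content": "The thread debates provenance..."}]]

    with open("train.jsonl", "w") as f:
        for example in logs_to_examples(logs):
            f.write(json.dumps(example) + "\n")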

Besides the AI users, we have the social networks. Billions of people comment on the news; if you scrape a comment thread such as this one, you can readily generate an article from it, grounded in human feedback (sample based on this thread: https://pastebin.com/R229b41s).

Ultimately I don't believe publishers will stop creating content, because there are many reasons to do it besides collecting advertising fees. But as a backup we have LLMs continuously learning by assisting their users or reshaping informal communications into well-formed content. It's a two-way street: both the LLM and the human get help from each other.

In turn, LLMs learn from other LLMs; typically Mistral and LLaMA would learn from GPT-4, and this process works very well. So any skills learned by a SOTA model with massive human chat logs would eventually be extracted into small models. Feedback percolates back to all AI agents.

tl;dr Turning comment threads or human-AI chat logs into articles is a possible solution.


Exactly. This feels like the same arguments used against adblocking.

Similarly, the analysis is quite one-sided. Publishers' weakening hold on copyright does disincentivize them from publishing online, but that's counterbalanced by their economic need to be on the Internet.

AI weakens copyright. That's a good thing. It has been absurdly strong for way too long. Don't just consider the wants of the publishers and authors; consider the wants of the consumers as well, to be able to redistribute, remix, adapt or build upon existing materials without creators' consent (e.g. memes). [1]

I also note that their use of "open distribution" refers to free as in beer, not as in speech. This article goes completely against the principles of free culture [2].

[1] https://www.techdirt.com/2011/04/08/if-youre-arguing-that-so...

[2] https://en.wikipedia.org/wiki/Free-culture_movement


> Don't just consider the wants of the publishers and authors; consider the wants of the consumers as well

The problem is that copyright can only be enforced by those with the means to do so; but more importantly, it can be abused by anyone with the means to do so. The food chain is roughly:

1. Mega corporations

2. Small corporations

3. Individual for-profit leeches (e.g. the vast majority of "react" content on Youtube)

4. Small creators

Nobody will use "AI" tools to erode the copyright of megacorporations like Disney or Nintendo. You could be entirely within your rights to do what you're doing, but they'll sue you regardless. The overwhelming majority of people will roll over because they don't have the resources to fight it.

On the other end of the spectrum, everyone already steals from small creators, and AI tools will make it even easier to do so. As a rule they don't have the resources to start lawsuits against corporations. Most of them don't even fight against leech ("react") content because it comes with an implicit threat of public retaliation, harassment, or worse. This kind of infringement is thriving across various platforms and very few people seem to care.

So in summary, I'd say that you're not so much weakening copyright as a concept; you're just further weakening the rights of small creators who are already being infringed on by everyone. Not cool.


> Nobody will use "AI" tools to erode the copyright of megacorporations like Disney or Nintendo.

People have been using every tool at their disposal to erode the copyright of megacorporations for decades. Their love of suing hasn't stopped us yet. I can't really see it stopping any time soon.


What I tried to get across is that AI won't erode their copyright in any meaningful way. If they don't like what you're doing, they'll sue regardless of what the copyright laws say.


I disagree with the first part (but maybe I'm just too optimistic) but certainly not the second.


This is exactly the thought trap I was talking about. Your analysis solely focuses on the creator, but leaves out the rights of the consumer. You view copyright as a moral right (it being "not cool" to violate it), but forget to consider that it is first and foremost an economic tool.

Remember that copyright is a temporary monopoly granted by the government: a compromise to incentivise creators, ultimately to benefit the public. The public's needs should come first, and should be restricted only when required for limited purposes, e.g. incentives.

And I believe it is in the public's interest for data to be readily analysed and remixed, no matter what the creators want, with restrictions only there to allow things like paywalls and (actually) purchasing digital media.


I hope your comment doesn't get downvoted too heavily, because I think you raise good points.

What seems to be happening, and is happening in this document by W3C as well, is that the social value of information and the economic value of information are being conflated. Social media has created markets for creative works where these two values become entangled.

Another way to say this is that commercial art and fine art are different things, but they are treated the same by the web and perhaps they shouldn't.

When someone creates fine art, they are not creating art for the sake of its economic value. They are creating a work of art for its social value, and want it distributed as widely as possible.

When someone creates commercial art, they are creating art specifically for its economic value. That value may be enhanced by wider distribution, but it may also be diluted by wider distribution.

Because these two types of art need to be treated differently by the web, we can't have one solution that benefits both kinds of art.

We need copyright to protect commercial artworks, but we also need a system that encourages wide distribution of the collective information of humanity, giving equal weight to everyone's ideas, outside of their economic value.

i.e. Kafka's ideas are more valuable to humanity than Beavis and Butthead.

This W3C draft doesn't take that into account. It needs to. We need to think beyond the needs of artists who rely on social media to ply their wares, while also taking those needs into account.

We should not, however, codify the social media influencer art market, because that is not a market worth protecting. It's an aberration that encourages people to share personal information and works in ways that benefit the platforms and harm society.

We want to build something that benefits society and the individuals who contribute artworks that benefit society, today and into the future. And if you can find a way to make a buck in the middle there somewhere, we should encourage that, too.


I think you're making distinctions that artists themselves don't make, and underestimating how much they can be financially motivated. The most surprising example I know is illustrated by this apocryphal story:

"...someone visited [Picasso's] studio, stood in front of a painting for several minutes, and asked Picasso, 'What does it represent?' Picasso replied without hesitation, 'Two hundred thousand dollars.'"


Whether it is possible to find pure-economic or pure-societal art is not relevant. And we can't dismiss the existence of these two dimensions, which you can also call "consumption mode" or "purpose".

In (my) definition, art is everything a human makes or does that is capable of evoking an emotional reaction in another human. So it follows that, IMO, art's value is primarily societal in nature. The economic value comes afterwards.

The economic value of art is mainly harnessed by people who want to evoke feelings in other people; those in the entertainment industry, like movie producers and game directors, are an example of this. And this is where the push to make art-making labor cheaper mostly comes from.

The societal value mainly comes in two forms: 1) capturing the world around us for the record, and 2) serving as a medium for communication on an emotional or subconscious level. This value is separate from the economic one, and I think it is the most important one.


> When someone creates fine art, they are not creating art for the sake of its economic value.

They often are, though.

"Fine art" only means that it's art without practical utility beyond being art. A painting is fine art, for instance, where an ornate silver teapot is not, as it has practical utility.

Whether or not either type was made for economic reasons doesn't enter into it.


You are correct to say that the distinction between fine and commercial art is beyond an artwork's practical utility, but I think we could both agree that the market price of art does not necessarily equate to its overall social value. That is what I was getting at. It's more about economic markets not capturing social value, and copyright's role in protecting economic value to the detriment of social value.

Outside the scope of my comment is whether copyright is even capable of protecting social value (I tend to think it isn't), but if it is, W3C should be the organization that steps up to make the attempt.


Prediction: soon we will want to label things written by actually useful AI, because it will be more helpful than things written by humans and parrot-tier AI.


This would have been a better link: https://www.w3.org/reports/ai-web-impact/


Ok, we've changed to that from https://github.com/w3c/ai-web-impact above. Thanks!



