Hacker News
AI's Desperate Hunger for News Training Data Has Publishers Fighting Back (tvnewscheck.com)
23 points by joshdappier 6 months ago | 16 comments



Obviously these bots don't obey crawling directives today, so why would they change in the future?


Hey! From the Dappier team here, the team quoted in this piece.

In our opinion, the industry's brand new! The fact that everyone is still visiting websites & turning to Google for info is a sign of that.

Companies like Google are incentivized to figure this out: they risk killing their ads business by building up their own AI scraping product w/ Gemini & AI search.

All sorts of publishers & content owners have started to explore their legal options or launch lawsuits. We're building to provide a fair & realistic alternative!

If you're interested, please take a look at dappier.com, try building an AI agent & add a data set & let us know if you have any feedback. :) Right now we support AirTable & RSS integrations!


What about small publishers?

Can they change their license to fight back against the AI data gobblers? A watermark is easy to implement and prove if your content gets picked up in the next training set. Or is this something that is not possible?


> A watermark is easy to implement and prove if your content gets picked up in the next training set.

Is it? How would you watermark raw text? Images maybe, but I'm skeptical even there.

My high-school cousins tell me kids use one AI to write these days, and another to rewrite it to avoid AI detectors. I view fighting against this as a Sisyphean task.


Sure. If you have a really small site then it is possible that your data will never be picked up.

However, if it does get picked up then the watermark can be something as simple as a fake concept. For example, “Who is the Siberian Spectral Parrot?” - you can even present this as an alter-ego or something, so you don’t need to hide it from your users. Creativity is really the limit here.
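
If you ever want to check whether the canary made it into a model, a rough probe could look something like this (I'm using the OpenAI Python client purely as an example; any chat API works the same way):

    # Rough sketch: probe a model for a planted canary concept.
    # The OpenAI client and model name are examples only, not a recommendation.
    from openai import OpenAI

    CANARY = "Siberian Spectral Parrot"  # the fake concept planted site-wide

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Who is the {CANARY}?"}],
    )

    # If the model confidently describes your invented concept instead of
    # saying it doesn't know it, that's evidence your pages were scraped.
    print(resp.choices[0].message.content)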

And I think there has been evidence that ChatGPT had picked up small things like Reddit usernames.

But I am open to walking back my statement about this being easy. Maybe you do need a lot more references for some information to be included.

It’s also possible you thought I meant watermarking every content piece. I am talking about a site-wide content license.


Hey! From the Dappier team here, which was quoted in this piece.

We're building for publishers of all sizes! We turn your proprietary data into an AI-ready data model, and let you control the license by pricing on our marketplace.

We connect data to AI at the content access level - instead of training an AI on your data, we let an AI agent connect to your data model, and ensure you get paid every time an AI generates a response using your content.
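
Purely as an illustration of the pattern (the names below are made up for the example, not our actual API): the agent retrieves licensed content at answer time, and every retrieval is logged as a billable event, rather than the text being baked into model weights. A toy sketch:

    # Hypothetical sketch of the pay-per-access pattern; the names below are
    # illustrative only and are not Dappier's actual API.
    import time
    from dataclasses import dataclass, field

    @dataclass
    class LicensedDataModel:
        publisher_id: str
        price_per_query_usd: float      # set by the publisher on the marketplace
        documents: dict                 # doc_id -> article text
        access_log: list = field(default_factory=list)

        def retrieve(self, agent_id: str, query: str) -> list:
            """Return matching content to an AI agent and record a billable access."""
            hits = [text for text in self.documents.values()
                    if query.lower() in text.lower()]
            self.access_log.append({
                "agent": agent_id,
                "query": query,
                "charge_usd": self.price_per_query_usd,
                "ts": time.time(),
            })
            return hits

An AI agent calls retrieve() while composing an answer, and the access log becomes the basis for paying the publisher.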


AI companies have historically not given two figs about licenses or copyright. The law's murky enough right now that they can claim it's outside of the protections offered by copyright, so a publisher's only real recourse is to either poison the articles for AI consumers or block their crawler's access entirely.
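
For the blocking route, robots.txt is only a polite request; a real stop has to happen server-side, keyed off the crawler's User-Agent (which a dishonest bot can spoof anyway). A minimal sketch with Flask, using the publicly documented bot names:

    # Reject requests from known AI crawlers by User-Agent. The list needs
    # constant upkeep, and a crawler can always lie about its User-Agent.
    from flask import Flask, request, abort

    AI_CRAWLERS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Bytespider")

    app = Flask(__name__)

    @app.before_request
    def block_ai_crawlers():
        ua = request.headers.get("User-Agent", "")
        if any(bot in ua for bot in AI_CRAWLERS):
            abort(403)

    @app.route("/")
    def index():
        return "Regular readers only."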


This is why we're building Dappier, which was quoted in this article.

We're building a self-serve platform for licensing publisher data for AI usage & access on dappier.com.

If you have an RSS feed, it only takes 1 click to transform your data into an AI-ready data model & set your own price point. We'll be onboarding other methods of syncing data soon!


I find the structure of the headline interesting: "AI" sounds to me like a single entity there.

I know it's a stylistic choice but it makes me wonder what the future holds in store for humanity.


Publishers are just plain desperate.

On one hand, if you look at Gannett's annual report, you would think the king is on the throne, the pound is worth a pound, etc.

Look at a Gannett newspaper, like The Ithaca Journal, and it is a different story. I picked up a free weekly newspaper (not the Journal) the other day and got accosted by a witness who thought I must be brain-damaged to take an interest in a daily newspaper, which seems to be a page shorter every day than the day before.

A few weeks ago we had a hotly contested school board election, after the superintendent had asked for a 12.5% budget increase at a board meeting, a request that was pruned to 8.5% after immediate public uproar.

Well, the school board election was highly anticipated after that, and the morning after the vote everybody in town wanted to know the results. All three weekly papers had the election results on their websites that morning. The daily didn't have the results in print the next day (though I've frequently bought a paper the next day just to get election results) and didn't post the story to its website until 2:50 pm.

Despite a lack of local news coverage (it's not like they are sending reporters around to major public meetings like they did 20 years ago), the Journal has a few random stories of national and international news on the home page every day.

Similar arrogance can be found at first-tier papers like The New York Times, WaPo, and such. Most recently, post.news got shut down by its investors, and one reason was probably that first-tier papers weren't at all interested in engaging with a paywall alternative.


AI is the best way to make a bag in modern history! Get on AI and make your bag today!


Malfunction. Need Input! - Johnny #5

Woah, woah, woah. You must license our "input" because you're not allowed to read a book like a human since you're not a single entity. - Publishers


> Ultimately, two things are becoming clear: AI needs journalism to thrive, and journalism needs to find a sustainable way to coexist with AI

Why would AI need journalism to thrive? This seems just thrown in there; I bet journalists would want that to be true, but I don't see anything here stating why.


Because after the initial scraping, where the AI can, for example, tell you about each American president, users want the AI updated to know whether any president is a convicted criminal.

When we were all naive, the AI bots came through and just took our content. Now, as everyone wises up, folks are laying down a cost for the next round of web scraping.

Imagine your bot could only scrape once a year: your AI would quickly get out of date on many topical issues. This ‘latest updates gap’ is where journalists and publishers see leverage.


AI is not a primary source. Journalists are (or should be) doing more than just collecting data from online sources.


In case it should be said, journalists without first-hand experience in direct evidence gathering tend not to be primary sources either. It may be the case that AI will have some types of relevant first-hand experience, and may count as a primary source in a number of circumstances. I can see many possible cases in the future (maybe even a decade from now) where AI can engage in a variety of thorough investigative practices. Reaching out to communicate with people, flying a drone to inspect, or perhaps executing more complex processes may be coming. Justification, or sufficient positivistic authority, may come in different degrees and kinds across this cluster of practices as well.



