
There are a couple of reasons, but the biggest one is that intercepting HTML at the network level allows you to filter inline script tags from documents[0]. If you want to stop an inline script from running (say to defuse an anti-adblocker popup/redirect), you can't wait until after it gets parsed to remove it. Once it gets added to the page, the browser will execute it.
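For illustration, an HTML filtering rule looks something like this (the hostname and the matched text are placeholders, not a filter you'd actually deploy):

    ! strip any inline <script> containing "adsBlocked" before the parser ever sees it
    example.com##^script:has-text(adsBlocked)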

Of course, note that uBlock Origin also supports scriptlets[1] that can be used to rein in scripts after they get added to the page, instead of through resource rewriting. You're not restricted to doing your filtering only at the network level; HTML filtering is just another tool that filter authors have at their disposal.
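A scriptlet rule, by contrast, reaches into the page after the fact. Roughly (again, the hostname and property name are made up):

    ! neuter a hypothetical detection flag once scripts are already on the page
    example.com##+js(set, adBlockDetected, false)
    ! or abort any script that tries to read it
    example.com##+js(aopr, adBlockDetected)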

Helpfully, each entry on the scriptlet wiki page I've linked also shows a practical example rule that relies on that scriptlet. You can also take a look at uBlock Origin's default filter lists[2] if you want more examples of people using both HTML filtering and scriptlets in the wild. Many of these block rules are impossible to implement in Safari.

[0]: https://github.com/gorhill/uBlock/wiki/Inline-script-tag-fil...

[1]: https://github.com/gorhill/uBlock/wiki/Resources-Library

[2]: https://github.com/uBlockOrigin/uAssets



Nice ELI5s/TLDRs, thank you.

Sharing a notion (dumb question) here, since you're smart about this stuff:

At what point will "ad blocking" flip to "content scraping"? All this cat-and-mouse arms racing seems like a bad ROI.

I imagine an adaptive super reader view mode. Meaning emphasis on content scraping rules vs ad blocking rules.

Take snapshots of top websites, do some rendered-page-aware content diffing, with some user-directed settings & curation, infer where the content is, and distill that down to "good enough" scraping rules.
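Something like this toy sketch is the kind of diffing I have in mind (everything here is invented for illustration; no real tool implied):

    from bs4 import BeautifulSoup

    # Toy heuristic: text that repeats across two snapshots from the same site
    # is probably nav/ads/chrome; text unique to one snapshot is probably the
    # content worth keeping.
    def text_blocks(html):
        soup = BeautifulSoup(html, "html.parser")
        return [el.get_text(" ", strip=True)
                for el in soup.find_all(["h1", "h2", "p", "li"])]

    def likely_content(snapshot_a, snapshot_b):
        boilerplate = set(text_blocks(snapshot_b))
        return [block for block in text_blocks(snapshot_a)
                if block and block not in boilerplate]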

Anyway, thanks for answering some of my lingering questions.


I am not an expert on this stuff; I've just spent slightly more time reading some of the wikis than other people. Take what I say with a grain of salt; people who are actually building ad blockers or managing blocklists would have more insight.

If you want to swap from a blacklist model to a whitelist model, there are a couple of problems to solve off the top of my head:

- you need a way to refer to content that supports re-hosting: a way to convert the Facebook/Twitter link someone shares into the scraped version without ever loading the original link. See IPFS, but also see Invidious, Nitter, and the Internet Archive for a lower-tech, more straightforward version of what that might look like (there's a small sketch of that rewriting step after this list).

- you need good enough copyright exemptions that it's OK to re-host or proxy the content somewhere else. This is kind of a gray area: people are rehosting web content without getting sued, but it's not clear to me whether that scales from sites like YouTube to the Internet as a whole. I guess nobody's called for Pocket to be taken down, so maybe the situation there is better than I think.

- you need the web to stay relatively semantic. There are a lot of sites that don't work with Pocket/reader mode, and a lot of the sites that do work only do so because there isn't a highly adversarial relationship between Pocket and site operators.
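To make the first bullet concrete, the conversion step could be as dumb as rewriting hostnames before anything gets fetched. A rough sketch, with the mirror hostnames as placeholder assumptions rather than recommendations:

    from urllib.parse import urlparse, urlunparse

    # Map original hosts to alternative frontends; which instances to trust
    # is a whole separate problem.
    MIRRORS = {
        "twitter.com": "nitter.net",      # assumed Nitter instance
        "www.youtube.com": "yewtu.be",    # assumed Invidious instance
    }

    def to_mirror(url):
        parts = urlparse(url)
        mirror = MIRRORS.get(parts.netloc)
        if mirror is None:
            return url                    # no known mirror, leave the link alone
        return urlunparse(parts._replace(netloc=mirror))

    # to_mirror("https://twitter.com/someone/status/123")
    # -> "https://nitter.net/someone/status/123"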

I'm not sure whether a cat-and-mouse game around content scraping would be better or worse than what we have. I can imagine it would be better in the sense that you'd only need to do it once per page, and then distribute the scraped version. But that's assuming that copyright would allow you to do that.

And I suspect that breaking a scraper is easier than breaking an ad blocker. But maybe someone could prove me wrong there.


Belated response, apologies.

re "whitelist" vs "blacklist"

Nice phrasing. Stealing it.

re rehosting, copyright exemptions

I hadn't connected those dots. Thanks. Very interesting.

Tangentially, I've been chewing on an idea similar to quotebacks (recently on HN's front page). My naive take was to create a URL shortener to support my use cases. For example, my shortener would attempt to link to the OC, falling back to the Internet Archive (or whatever). I'll now learn about IPFS, Invidious, Nitter.
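In code, the fallback I had in mind is roughly this (using the public Wayback Machine availability endpoint; no caching and no real error handling):

    import requests

    def resolve(url):
        # Prefer the original content if it still resolves.
        try:
            if requests.head(url, allow_redirects=True, timeout=5).ok:
                return url
        except requests.RequestException:
            pass
        # Otherwise fall back to the closest Internet Archive snapshot, if any.
        r = requests.get("https://archive.org/wayback/available",
                         params={"url": url}, timeout=5)
        snapshot = r.json().get("archived_snapshots", {}).get("closest", {})
        return snapshot.get("url", url)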

Also, I didn't do a good job explaining my half-baked notion for implementing a "whitelist"-based content scraper.

I imagined distributing the whitelist to clients' browsers, which would do the actual transformation locally. Your explanation of how uBlock can also rewrite the HTML DOM client-side sparked this line of thinking. For the whitelist's curators, my notion was to provide tools for capturing and diffing, like a better debugger, which could also be used by front-end developers.
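As a sketch, a whitelist entry might be nothing more than "keep this, strip that" selectors that the client applies locally. The rule format and selectors below are made up, and a real browser client would work against the live DOM rather than re-parsing HTML like this:

    from bs4 import BeautifulSoup

    # Hypothetical distributed whitelist: per-site selectors for the content
    # to keep and the clutter to drop.
    WHITELIST = {
        "example.com": {"keep": "article.main", "strip": [".promo", "aside"]},
    }

    def distill(host, html):
        rule = WHITELIST.get(host)
        if rule is None:
            return html                       # no rule: show the page untouched
        soup = BeautifulSoup(html, "html.parser")
        content = soup.select_one(rule["keep"]) or soup
        for selector in rule["strip"]:
            for node in content.select(selector):
                node.decompose()
        return str(content)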

Back to your notion of server-side processing, for some combo of caching, transformation, and rendering: I love it.

Opera did something similar with their mini mobile browser, ages ago. The server would render pages to GIF (?) with an image map for interaction, pushed out to the mobile device. Squint a bit and that architecture resembles MS Remote Desktop, PC-Anywhere, X-Windows, etc.

I keep hoping someone trains some advertising-hating AI that will scrape content for me.

re cat-and-mouse

No doubt.

I have written a few scrapers in anger, on mostly structured data, often mitigating compounding ambiguity from standards, specifications, tool stacks, and partners' implementations.

So I just tossed the traditional ETL stuff (mappings, schema compilers) and treated the problem space as scraping. Generally speaking, I used combos of "best match" and resilient xpath-like expressions (e.g. preferring relative to absolute paths) to find data.

Worked pretty well. Super easy to troubleshoot.
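For a flavor of what I mean by resilient expressions, a contrived lxml example (markup and ids invented):

    from lxml import html

    page = """
    <html><body><div><div>
      <table id="order-summary">
        <tr><td>Widget</td><td class="price">9.99</td></tr>
      </table>
    </div></div></body></html>
    """
    doc = html.fromstring(page)

    # Brittle: breaks the moment a wrapper div is added or removed.
    absolute = doc.xpath("/html/body/div/div/table/tr/td[2]/text()")

    # Resilient: anchored on a stable id, relative from there.
    relative = doc.xpath('//table[@id="order-summary"]//td[@class="price"]/text()')

    # Both return ['9.99'] on this markup; only the second survives a redesign.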



