Screenshots as the Universal API (matt-rickard.com)
108 points by robinhouston on Jan 18, 2023 | 90 comments



This is brilliant.

"Hey we have this gazillion flop/s available, shall we try to put it to some good use?" - "Nah, let's just waste them recovering information that was lost because it's so much simpler to share a screenshot instead of text or even html."

Sometimes, I wonder if the hardware industry sponsors a certain subset of software developers to make sure every possible performance gain is immediately wasted. Electron was a strong hint, but this is a whole new level...


I think they actually have a point. Parsing HTML isn’t trivial: aside from bad/invalid HTML (think: missing opening/closing tags, quotes, etc.), there’s also a lot of content that requires javascript to render in the first place, which means the page needs to be rendered with access to window, the DOM, etc.

Then, of course, you have embedded objects, such as iframes, or content that isn’t text, which traditional parsing can’t easily identify/extract even if everything renders OK. For example, a video, animations, interactivity, etc. Another contrived example which illustrates the point would be a site that uses images for buttons/links rather than text, or even non-semantic HTML such as a <div> with click handlers as "links", or buttons built from <a> tags.

I can think of several other examples.

A "screenshot API" enables automation to capture "picture of a group of people celebrating" distinct from advertisements that might also appear on the page without the need to dig through CSS selectors/classes, domain name/query parameters, and handles cases where the image might be base64 embedded directly. Another example might be to simply/easily extract data from a Google Sheets/Excel table, which itself might include embedded images or non-text/HTML objects. Such an API could help accessibility by enabling screen readers for sites that weren't built to be accessible.

I learned from this post that you can tap/copy objects from photos in iOS 16! I just did this for the first time and it’s CRAZY, using a photo of my basement and a bunch of shoe boxes, tools, etc. I pressure tapped the group of shoe boxes and hit copy, pasted into the Bear app (a markdown editor for Apple devices), and it pasted just the shoe boxes, perfectly clipped around the edges as if I’d used photoshop!

I think ultimately, the point is that such an API which uses object detection, image-to-text, sentiment analysis, etc., on the backend could make trivial the tasks and edge cases that today require non-trivial effort and time, and could enrich the data prior to its retrieval.
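
To make the image-to-text piece concrete, here is a minimal sketch using Tesseract OCR via pytesseract; the file name is a placeholder, and this is just one way such a backend could be wired up, not how any particular service does it.

    # Minimal sketch of the image-to-text step, assuming Tesseract and
    # pytesseract are installed; the screenshot path is a placeholder.
    from PIL import Image
    import pytesseract

    def screenshot_to_text(path: str) -> str:
        """OCR a screenshot (or any image) into plain text."""
        return pytesseract.image_to_string(Image.open(path))

    print(screenshot_to_text("sheet_screenshot.png"))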


> Parsing HTML isn’t trivial: aside from bad/invalid HTML (think: missing opening/closing tags, quotes, etc.), there’s also a lot of content that requires javascript to render in the first place, which means the page needs to be rendered with access to window, the DOM, etc.

Double standard. If you're going to make a fair comparison, then you need to compare like with like; you need to compare the subset of things about e.g. HTML that give you what you can also get with a screenshot. It makes no sense to hold the performance penalty of script execution against browser runtimes when (a) you don't have to execute any scripts to effect anything that gives you parity with a static image, and (b) you can't with static images do anything like what executable scripts enable.

And whether or not parsing HTML is trivial (which is debatable), it's still not strictly greater than the computational resources that are needed for the kind of computer vision and widgetry that lets you e.g. select the text in a screenshot...


Perhaps you could think of it like operating at a different abstraction layer, a sort of overlay network for social communication. It's easier to screenshot new technology than it is to make it interoperable with old technology, so there will always be a lead time where screenshots are the dominant sharing method driving a new hype cycle. Look at ChatGPT for example - they launched with no share feature, and people were sharing screenshots of their conversations. It's probably still the most common method of sharing them, even after OpenAI added a share feature. Screenshots have the advantage of working everywhere and being cheap to produce.

The loss of information is a solvable problem. In fact, if you take a screenshot on an iPhone, quite a bit of information is already stored along with the photo, but it's stripped when you upload the photo to a website through WebKit's file input. Also, Safari has a "screenshot full page" option, meaning it already has a screenshotting implementation specialized for its own use case of capturing a likely text-heavy web page (which the same program taking the screenshot already rendered for the user anyway).

On the receiving side, Apple's automatic device-local OCR, which already runs on every photo in your camera roll, could increase fidelity of information in a screenshot that you receive. Apple could add a feature for your device to locate the original URL of a screenshot you receive, using both the metadata of screenshots generated on other Apple devices (if shared), and the results of searching the Web for text in an image.

This feature could have high accuracy for the limited cases where the content is available on the public web rather than on some ephemeral screen inside an app. But in fact, those ephemeral cases are even more of a reason that screenshotting will remain a dominant way of storing and sharing information. Screenshots are completely decentralized and interoperable, with acceptable tradeoffs that double as advantages, like robustness (a screenshot of content will survive longer than a server will continue serving that content) and privacy (the server of the original content doesn't get notified when you read it in a screenshot).


Kind of the opposite of the semantic web.

Not only will people not mark up their data in useful ways, often they're actively hostile to you using or remixing it, but the screenshot will always get through. Unless the display-driver level DRM gets involved.


A camera will take the picture of the screen that could not be screenshotted. Inconvenient, lower quality, unstoppable.


A few years ago there was talk of having cameras shut down when they detect watermarked content, to close the analog hole. Luckily it didn't go anywhere, but start hoarding cameras!


Cinavia does this for audio.


"Unstoppable" only as long as camera manufacturers don't add the stop. You cannot make a color photocopy of (at least European) paper currency. That sort of thinking could be pushed to cameras too.

(Though the original EURion was more about counterfeiting than about Hollywood studios.)

https://en.wikipedia.org/wiki/EURion_constellation


Could you use machine learning to reconstruct the original screenshot?


I've wondered about this, although you have to be careful with both the "enhance" process and subsequent OCR to not just insert lies. See this old Xerox bug: https://hexus.net/tech/news/peripherals/58605-serious-number...

I expect someone will do "reconstruct original HD movie from one or more theater cam pirated copies" at some point too.


Or even better, a way of reproducing the work of Harmy's Despecialized Edition of Star Wars. You have a lot of frames of video in the blu-ray reissues that didn't really differ in content from the frames of video in the old laser disc, VHS, bad interlaced DVD transfers. I imagine it wouldn't be hard for AI to fill in the gaps to produce despecialized high-definition frames from the old Star Wars transfers.

In other cases, I'd imagine you could use old post production NTSC broadcast and the OG 35mm films to do something much more akin to how Star Trek TNG was updated, than a simple interpolated upsampling.


Maybe perspective transformations would be enough, plus some pixel polishing.


The analogue aperture, so to speak.


> the screenshot will always get through

Dubious. Cloudflare and friends promote a culture of demonizing bots. For this to work you'd need to put as much energy into pretending to be a human as parsing the screenshot.

At least with a dedicated API you declare that you aren't hostile to a user who wants to use the website on their own terms.


> Cloudflare and friends promote a culture of demonizing bots

What does this mean? Usually it's the website owners who are hostile to scraping for commercial reasons, or simply bandwidth, to which Cloudflare is the solution; it's not some prejudice which has to be marketed to people.


Here's one example of Cloudflare marketing.

https://www.bigmarker.com/innovatus-digital/The-newest-appro...

> We know a third of web traffic comes from bots with their insatiable appetite for attacks. From credential stuffing, to stealing inventory, to price and content scraping, stopping bots is critical to a strong web experience for customers that is not undermined by bots.

This seems like classic demonizing tactics to me. Generalizing a whole class of users as an out-group, associating them with negative behaviour, painting them as criminals.


> Generalizing a whole class of users as an out-group, associating them with negative behaviour

Isn't this the reverse: the only criterion for this "out group" is behavior. Specifically that some people use a large number of automated clients to engage in behavior which the site owner regards (rightly or wrongly) as harmful. I can see that you're trying to draw racism analogies but that's not going to work.

And ("stealing inventory") sometimes this is at the expense of other customers, who might want to buy at the RRP from the official website rather than have to go through scalpers.


Then talk about 'scalpers' or 'scammers'. Not 'bots'.


They added explicit qualifiers to “bots” that make apparent they are referring to abuse, not “bots” in the more general sense, such as a search crawler/indexer.

Do you disagree that credential stuffing, scalping, content-scraping (to build fake social media profiles, phish, and scam advertisers/consumers) are problems and that bots perform those activities?

Your defensiveness doesn’t seem justified, to me, unless you’re using bots for those purposes.


This is marketing copy so I assume that the sentence was deliberately constructed.

> We know a third of web traffic comes from bots with their insatiable appetite for attacks.

I read that run-on "with their" as a universal quantifier. There are no bots in that sentence that do not have an insatiable appetite. They could have added that quantifier to indicate the subset of bots they refer to:

> We know a third of web traffic comes from THOSE bots WITH AN insatiable appetite for attacks.

or even, simpler:

> We know a third of web traffic comes from bots with AN insatiable appetite for attacks

That's how I read the grammar. Language can be slippery, and can be made slipperier. If we read it different ways, we read it different ways.

To your question, of course those things are bad, and of course people use bots to do this. They also use web browsers and humans to do it. Some bots are bad. Some people are bad.


No. Mobile screenshots are a horrendous way to pass information around.

My father would send me screenshots of Amazon product listings and suggest I buy the item. There was never a visible URL, and typically no or only a small fraction of the product name. If he'd just sent a link, I could just click it, instead it became an investigation/research project.

Better yet was when the listing was obviously not Amazon, and not a site I could identify, or had ever seen before. I'd get zero context clues. Perhaps an AI could figure it out, but again, I'd rather he just sent a link.

When I get a screenshot, I'll have to submit it to some service, of which there will be a dozen different services, each requiring an account, each building a profile and tracking me based on whatever crap my dad sent me.

No. Just no. Stop. I pre-emptively opt out.


> Screenshots are always available

Unfortunately not, certainly not on mobile. I don't know if this is the case on iPhone, but on Android some apps make it really hard or impossible to screenshot them.

I think in recent Android versions they can even prevent the screen from getting mirrored to a connected PC / other casting device.


Even on Windows and Mac (don't know about Linux), when browsing you sometimes can't take a screenshot of a portion of a website (I've only seen this with videos, though).

We are not allowed to take a screenshot on a device we own, for content we paid for. What a world.


This smells like DRM to me, and it's not a thing on Linux (thankfully). You'd probably have to install DRM yourself for this to be a thing.

Note: I mean Digital Restrictions Management[1] here, not Direct Rendering Manager[2].

[1]: https://www.defectivebydesign.org/ [2]: https://www.kernel.org/doc/html/v5.4/gpu/introduction.html


I think this has to do with the content in question being rendered on the GPU separately from the website itself. Historically I recall this being an issue with all forms of video display in some browsers, but at this point (with website rendering generally being at least partly GPU-accelerated) it shouldn't be an issue, I think, unless for some reason the browser is set up to use software rendering for the website itself.

I do faintly recall not being able to see video player content in screenshots taken on Ubuntu (but the player itself could take snapshots just fine) because of a divide like this some years ago.


> This smells like DRM to me, and it's not a thing on Linux (thankfully). You'd probably have to install DRM yourself for this to be a thing.

That is also the reason why Linux doesn't (and likely never will) get served high-resolution content by any significant paid service, however.


I think it's called HDCP? When you take a screenshot of a paid movie in Apple TV or something it's just a black box.


It's DRM. I refuse to deal with any company that tries to push this BS on us.


I have never experienced that on any device, do you have an example?


I still find it weird that this has no override.

I can vaguely understand the security concern, but it's my device, running my OS, why can't I make a simple screenshot?


Probably because it’s also tied in with DRM video decoding / HDCP where the prevailing opinion of those bodies is that it very much isn’t your device.


A banking app on Android showed multiple obscure error messages when trying to log in, and reporting the error was a PITA because they forbid screenshots.

I know sharing a screenshot of my bank statement is a bad idea, but forbidding it completely is a ridiculous extreme.


FWIW the concern here isn't you taking a screenshot, it's that an illicit app in the background is screen recording your phone and sending it off.


Well, prevent that, then. That's undesirable outside of banking apps too.

(On Android, you can only get into that swamp by opting in to "accessibility" features, which specifically need access to screen contents.)


You prevent that by stopping any software in userland from being able to access the video path, which is an all or nothing thing.

Accessibility is only required if the app doesn't have system permissions, or if it hasn't carried out a privilege escalation. Though granted, a privilege escalation might also be able to gain root and turn off setting a secure surface (that's not enough to get past Widevine if that's being used, but banks generally don't use it).


> I can vaguely understand the security concern, but it's my device, running my OS, why can't I make a simple screenshot?

It's not about screenshots, the point is that screen recording is blocked because the OS can't see the encrypted video path. As it can't there's nothing to take a screenshot of.

(Indeed, you can take a screenshot. It just comes out with all the video data nulled, so it's a black square.)


With some loss of fidelity, convenience and reaction time, you can always just point a camera at the screen. If no other device is available, a mirror might help.


A mirror doesn't help as you can't simultaneously take a photo and be in a different non-camera app looking at something.


I think my phone will take a photo without bringing up the camera app UI with a certain button combination. But I am not sure if I could make this use the front camera.

EDIT: I stand corrected, the button combination does not work when the phone is unlocked and it always uses the rear camera. But I guess at least in principle one could build this.


The silliest example of this I've seen is "private browsing" mode not allowing screenshots. The point of the mode for me and everyone I know is to have isolated browser sessions for the websites I visit and not have the browser add them to my browsing history. Not to shield what I'm seeing from the OS itself when I try to disclose it.


It's not the case on iOS. On iOS applications can't prevent screenshots -- though applications are notified when you take a screenshot.

I briefly used Android and the fact that applications can prevent screenshots is a horrible decision, closer to DRM than to a user-centric design.


> On iOS applications can't prevent screenshots

Maybe not, but it’s all the same if they can make the screenshot useless. Using Netflix as an example: when you take a screenshot video controls will be visible, but the content behind them won’t be.


DRM’d video is the one exception.


iPhone can always screenshot, no restrictions there.


Super easy on iOS.


Do you mean super easy to take a screenshot of any app, or super easy for apps to prevent that?


I meant easy to do as user. Sorry.


impossible to prevent


Not sure. Try screenshotting a Netflix movie.


That’s DRM’d video, the one exception.


> Screenshots are always available (similar to the era of web crawling)

DRM enters the room…

But on a serious note: I like the idea of using screenshots as a form of storage, but a lot of metadata is lost in the process, like data hierarchy, data available in different visual states, etc.


I was skeptical about this concept when I first heard it, until I actually tried some of the latest image models.

Now, I’m fully bought in that this is correct.

There’s a model you can use on hugging face, for example, where you can feed it any PDF or image of a document, ask it a query (“What is the total of the invoice?”), and it just spits out the right answer.

Turns out decades of work trying to make universal data interoperability standards will most likely be replaced by screenshots and images! (Again, this is for inter-app data movement)
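
For readers who want to try this, a rough sketch with the Hugging Face transformers pipeline is below; the model name is just one publicly available example and may not be the exact model the parent used, and the file name is a placeholder.

    # Sketch of document question answering with the transformers pipeline.
    # The model name is an example, not necessarily the one the parent meant.
    # Requires: pip install transformers pillow pytesseract
    from transformers import pipeline

    doc_qa = pipeline("document-question-answering",
                      model="impira/layoutlm-document-qa")

    answer = doc_qa(image="invoice.png",
                    question="What is the total of the invoice?")
    print(answer)  # e.g. [{'answer': '$42.00', 'score': 0.97, ...}]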


From an invoice in a standardized XML format (some countries require this by law), the total amount can be extracted in milliseconds if not microseconds. How long does it take to do this from an image?
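
For a sense of the structured path, a sketch is below; the tag name is hypothetical, since each national e-invoicing schema defines its own element names.

    # Sketch of extracting the total from a structured XML invoice.
    # "TotalAmount" is a hypothetical tag; real schemas define their own names.
    import xml.etree.ElementTree as ET

    tree = ET.parse("invoice.xml")
    total = tree.getroot().findtext(".//TotalAmount")
    print(total)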


How long would it take you to get your plumber to convert his handwriting into a standardized XML?



Do you remember what the model was?


>"Easier to parse than highly complex layout formats."

I disagree, the complexity is just hidden inside the model you have to train. What about a low-resource device that could have difficulty running the model? And how do you handle the mistakes that the model will make?


I recently saw a post on Mastodon where someone figured out they could build a simple Automation[1] on their iPhone to assist them in avoiding foods that contain allergens. They take a picture of the label, then the automation OCRs the text and searches for $ALLERGEN.

It seems the technology for making "screenshot APIs" a less zany proposition is emerging.

[1]: https://support.apple.com/guide/iphone/create-an-automation-...
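
The same pipeline is easy to reproduce outside of Shortcuts; here is a rough Python sketch of the idea (the allergen list and photo path are made up, and this is not the poster's actual automation).

    # Rough Python equivalent of the Shortcuts automation: OCR a label photo
    # and flag allergens. The allergen list and file name are illustrative.
    from PIL import Image
    import pytesseract

    ALLERGENS = {"peanut", "milk", "soy", "wheat"}

    def flagged_allergens(photo_path: str) -> set:
        text = pytesseract.image_to_string(Image.open(photo_path)).lower()
        return {a for a in ALLERGENS if a in text}

    print(flagged_allergens("label.jpg"))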



Hey! Nice! I can now read my corporate emails for the day, because I most certainly already read the dumbest thing I will read today.


I disagree. Information is presented to humans visually. If your models can process visual information, it is the most robust of possible solutions.


Talked with a founder building something in this space that was inspired by Matt's post in October. lynq.ai (no affiliation, I just know Paul).

I think when you look at this problem originally, you say - that's a bad idea. Why would you take structured data, output it to non-structured format and then have ML parse that. Lots of wasted CPU cycles all around.

However, when you think about the complex dynamics of standards around documents, decades of digital formats, hundreds of standards, lack of adherence to those standards, proprietary formats, hundreds of years of print and legal documents, the argument becomes akin to the one for self-driving cars.

The state we have today around data & dashboards is a hugely emergent & dynamic system, just like our road transport infrastructure. We are closer to a machine being able to navigate it the same way a human can than we are to one simple standard (or set of standards) that works the way a machine would want to consume data.

Screenshots as a universal API simply meets the world where it is vs assuming the world is going to change towards something simpler and more elegant.

I think part of the problem with how this comes across at first glance is how it's framed. "Screenshots" as an API evokes some dirty feelings for most of us in tech because the format is so unstructured.

I think if you think about the idea of building something once that both a human and a machine consume from the same target (the UI), this makes a lot more sense in many ways even if it feels like there's an expensive level of indirection in there.

Excited to see what'll happen here over time.


This seems like a last resort API. I would much rather have the information in a more convenient form than a screenshot... Also, some apps (e.g. banking apps), do not allow you to take screenshots on certain devices.


If someone wants to build a screenshot-based bookmark/reminder/everything tool with me … holler.


Hey dude, I am working on exactly this, would love to chat if you're free? My email is paul (at) lynq (dot) ai


I've actually been building something like this: a desktop app that basically records everything you see and hear, and makes it searchable. Maybe we could team up: my email is govind <dot> gnanakumar <at> outlook <dot> com.


The Arc browser has a feature called Easels that does this; I quite like it even though it lacks some polish. I imagine they want to do more with it in the future.

The browser is in beta and macOS only currently, but a friend recently invited me. E-mail me (in profile) if you'd like an invite to explore what they've done with it.

I can give out 5 invites I think, if anyone else wants it. So, first come first served!


Arc is great!


Print-to-PDF, build your own local Internet .. best bookmark system ever, literally doesn't require anything to be installed, just put the resulting PDFs in folders and use pdfgrep and ag to your heart's content .. "ls -alF | grep thatarticlesubjectiremembervaguely"


this is also something I’ve been thinking about for a while. I messaged you on IG (I’m semistrict there)


Hey dude, would you mind shooting me an email too?


How would you design it? What are your thoughts?


One example: take a screenshot of an email or flight schedule. Automatically add all the relevant details to my calendar.


> Screenshots-as-API solves a few problems

They forgot to mention that the website owners may deliberately make it difficult to interface with the HTML directly, e.g. by changing the internal formatting structure every now and then.


At this point, AI also solves this problem with pretty reasonable accuracy. You can feed gpt3.5 some html and ask it to write a python script to parse all of the button text.
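
The script it hands back tends to be a few lines of BeautifulSoup; something like the sketch below (the file name is made up, and the exact output will of course vary).

    # Sketch of the kind of script GPT-3.5 produces for "extract all button
    # text" from HTML; the input file name is a placeholder.
    from bs4 import BeautifulSoup

    with open("page.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    button_texts = [b.get_text(strip=True) for b in soup.find_all("button")]
    print(button_texts)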


I use this technique to build a personal dashboard. Rather than try to scrape data, then come up with a nice presentation for it, I just find a nice representation on the web for the data I want on the dashboard, then use Puppeteer[1] to automatically screenshot the specific DOM element that contains the thing I want. Works like a champ.

[1]: https://github.com/puppeteer/puppeteer
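
Puppeteer itself is a Node.js library; for readers who prefer Python, the same element-screenshot idea can be sketched with Playwright instead (the URL and selector are placeholders, not the parent's actual dashboard).

    # Same idea as the parent's Puppeteer setup, sketched with Playwright's
    # Python API instead (Puppeteer is Node.js). URL and selector are
    # placeholders.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/weather")
        # Screenshot only the DOM element containing the widget we want.
        page.locator("#forecast-widget").screenshot(path="forecast.png")
        browser.close()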



An unexpected plot-twist for out-of-work digital artists, they can reboot their careers ...in online security testing.


>Permissionless. Many applications won't allow you to export data. Screenshots are always available (similar to the era of web crawling).

False.

Some PDF readers and games will make your screenshot black.

But you can always use your phone

PS. I do wonder when they will add some watermark so phones will make the pictures black too


It’s actually possible to at least make it harder to do screen grabs on iOS. But you can always take a photo of the phone’s screen using your other iPhone :)

https://stackoverflow.com/questions/18680028/prevent-screen-...


Signal on Android blocks screenshots too.


I suppose this is satire? I mean, how would I call such an API to make any changes to the application state? Create an image of the effect that I would like to see? Am I missing something here, or is the joke just lost on me?

If this is serious and the idea is to use this only for quick and dirty read-only access, I'll stick to my time-tested "select+copy" API for now. Usually does a better job at extracting what I care about (particularly for content longer than one screen page). Yes, app owners can make that harder if they want to, but the same is true for screenshots.


An existing implementation of this idea is screenotate:

https://screenotate.com/

> Screenotate is an app for macOS and Windows that might help you with your screenshots. Every time you take a screenshot, Screenotate steps in to recognize and save the text inside (using Optical Character Recognition), along with the URL and the title of the place where you took the screenshot (where possible).


I already use it a lot for stupid things that don't let me copy text:

Twitter, Reddit on mobile and some keyword protected PDFs.

iOS's recent built-in OCR is a boon! You don't even have to save it to Photos.

Sidenote: when will Apple finally organize my screenshots into a folder distinct from my family photos??


There already is a screenshots album in photos (under media types). It’s just mixed in your library if that’s what you meant.


Thanks, really helpful! Do you also know of a way to see (media type) "Photos" without those screenshots?


I don’t think that’s possible?


I'd love to see an ad-blocker based on this. Perhaps I can even use it on my TV.


Isn't this what Wolfram Alpha uses? Or used?



