Web scraping via JavaScript runtime heap snapshots (adriancooney.ie)
354 points by adriancooney on April 29, 2022 | 65 comments



Very interesting. Can't wait to give it a shot.

I personally use a combination of XPath, basic math and regex, so this class/id obfuscation isn't a major deterrent. A couple of times I did find it a hassle to scrape data embedded in iframes, and I can see that heap snapshots treat iframes differently.

Also, if a website takes the extra steps to block web scrapers, identification of elements is never the main problem. It is always IP bans and other security measures.

After all that, I do look forward to using something like this and making the switch to a Node.js-based solution soon. But if you are doing web scraping at scale, reverse engineering should always be your first choice. Not only does it give you a faster solution, it is more ethical (IMO) because you minimize your impact on the site's resources. Rendering the full website is always my last choice.


> But if you are doing web scraping at scale, reverse engineering should always be your first choice. Not only does it give you a faster solution, it is more ethical (IMO) because you minimize your impact on the site's resources. Rendering the full website is always my last choice.

I find my time is by far the most limited resource. I am usually scraping huge corporations at scale and don't worry about (and doubt) impacting their resources. If they opened up their APIs, I would use those.

That being said, I often end up reverse engineering to preserve my own resources. I can and do run thousands of instances of chrome but it isn't cheap.

Also, related to IPs, carrier grade NAT has been a blessing ;)


> carrier grade NAT

Are you using something you have built or a service?


> is more ethical

How do you deal with pages that use JS to load their content (e.g. SERPs) and restrict those endpoints so they can only be called from within that page?

I'm lucky if I can use cheerio to just traverse the DOM on a given page, but increasingly I have to render the page. That "scales" as well, at least in terms of maintainability, since I can more or less use the same API to traverse the (then JS-modified) DOM.
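For illustration, a minimal sketch of that "same selector, two backends" idea - the '.result-title' selector and URLs are placeholders, not from any real site:

    // Static page: fetch the HTML and traverse it with cheerio.
    const cheerio = require('cheerio');
    const fetch = require('node-fetch');

    async function scrapeStatic(url) {
      const html = await (await fetch(url)).text();
      const $ = cheerio.load(html);
      return $('.result-title').map((_, el) => $(el).text()).get();
    }

    // JS-heavy page: render it first, then reuse the same selector.
    const puppeteer = require('puppeteer');

    async function scrapeRendered(url) {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle0' });
      const titles = await page.$$eval('.result-title', els =>
        els.map(e => e.textContent.trim()));
      await browser.close();
      return titles;
    }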


You can observe the network requests that the JS makes under the Network tab of the Developer Tools console. The restrictions you mention can be bypassed by setting the Origin and Referer HTTP headers to whatever satisfies the server.
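For example, a rough sketch of replaying such an endpoint outside the page - the URL and header values are placeholders, and node-fetch is used because an in-browser fetch won't let you override Origin/Referer:

    // Replay an XHR endpoint spotted in the Network tab, spoofing the headers
    // the server checks. Adjust the values to whatever satisfies the site.
    const fetch = require('node-fetch');

    async function callEndpoint() {
      const res = await fetch('https://example.com/api/search?q=foo', {
        headers: {
          Origin: 'https://example.com',
          Referer: 'https://example.com/search?q=foo',
          'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
        },
      });
      return res.json();
    }

    callEndpoint().then(data => console.log(data));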


In a similar vein, I have found success using request interception [1] for some websites where the HTML and API authentication scheme is unstable, but the API responses themselves are stable.

If you can drive the browser using simple operations like keyboard commands, you can get the underlying data reliably by listening for matching 'response' events and handling the data as it comes in.
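For anyone curious what that looks like in practice, a hedged sketch (the URL fragment and page interaction are placeholders):

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // Collect matching API responses as the page itself makes them.
      const results = [];
      page.on('response', async (response) => {
        if (response.url().includes('/api/items')) { // placeholder URL fragment
          try {
            results.push(await response.json());
          } catch (e) {
            // Non-JSON response body; ignore.
          }
        }
      });

      await page.goto('https://example.com/list');
      // Drive the page with simple inputs; more responses stream in as we go.
      await page.keyboard.press('End');
      await new Promise((r) => setTimeout(r, 3000)); // crude wait for the XHRs

      console.log(results);
      await browser.close();
    })();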

[1] https://github.com/puppeteer/puppeteer/blob/main/docs/api.md...


For this use-case, selenium-wire for Python could be really useful.


You can also inspect the application storage, monitor for cookie changes, etc., using the DevTools protocol.
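E.g., a small sketch over Puppeteer's raw CDP session - the method and event names are standard DevTools protocol (DOMStorage is an experimental domain), but treat the whole thing as illustrative:

    // Inside an async function with a Puppeteer `page` already open.
    const client = await page.target().createCDPSession();

    // All cookies the browser knows about, not just document.cookie for this origin.
    const { cookies } = await client.send('Network.getAllCookies');
    console.log(cookies.map((c) => `${c.name}=${c.value}`));

    // Watch localStorage/sessionStorage mutations as you drive the page.
    await client.send('DOMStorage.enable');
    client.on('DOMStorage.domStorageItemAdded', (event) => {
      console.log('storage item added:', event.key, event.newValue);
    });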


Awesome. I wonder if it's possible to create a Chrome extension that works like the Vue devtools and shows the heap and changes in real time, maybe even allowing editing. That would be amazing for learning/debugging.

> We use the --no-headless argument to boot a windowed Chrome instance (i.e. not headless) because Google can detect and thwart headless Chrome - but that's a story for another time.

Use `puppeteer-extra-plugin-stealth` (1) for such sites. It defeats a lot of bot detection, including reCAPTCHA v3.
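For reference, usage is roughly a drop-in wrapper around Puppeteer (a minimal sketch; the launch options and test URL are just examples):

    const puppeteer = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');

    // Register the stealth evasions before launching.
    puppeteer.use(StealthPlugin());

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto('https://bot.sannysoft.com'); // common fingerprinting test page
      await page.screenshot({ path: 'fingerprint-check.png', fullPage: true });
      await browser.close();
    })();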

(1) https://www.npmjs.com/package/puppeteer-extra-plugin-stealth


Not _quite_ what you're describing, but Replay [0], the company I work for, _is_ building a true "time-traveling debugger" for JS. It works by recording the OS-level interactions with the browser process, then re-running those in the cloud. From the user's perspective in our debugging client UI, they can jump to any point in a timeline and do typical step debugging. However, you can also see how many times any line of code ran, and also add print statements to any line that will print out the results from _every time that line got executed_.

So, no heap analysis per se, but you can definitely inspect the variables and stack from anywhere in the recording.

Right now our debugging client is just scratching the surface of the info we have available from our backend. We recently put together a couple small examples that use the Replay backend API to extract data from recordings and do other analysis, like generating code coverage reports and introspecting React's internals to determine whether a given component was mounting or re-rendering.

Given that capability, we hope to add the ability to do "React component stack" debugging in the not-too-distant future, such as a button that would let you "Step Back to Parent Component". We're also working on adding Redux DevTools integration now (like, I filed an initial PR for this today! [2]), and hope to add integration with other frameworks down the road.

[0] https://replay.io

[1] https://github.com/RecordReplay/replay-protocol-examples

[2] https://github.com/RecordReplay/devtools/pull/6601


Wowzers, that must make you an impressive attack target for all the session data that gets uploaded to your site. How do you deal with user consent in those cases?

I was curious to see what that experience was like from a client side, but it seems https://newaer.com/ is bombing the .min.js include, which of course doesn't turn on said session capture

BuiltWith alleges you use replay on replay.io but I didn't see any references to it on the main page, and app.replay.io is a white screen due to getInitialTabsState blowing up in src/ui/setup/index.ts


Ah... sorry, not sure I followed all the train of thought there.

The marketing page at https://replay.io is built with Webflow.

The data privacy and security approach is described at https://www.replay.io/security-privacy .

The actual client app at https://app.replay.io is a Next.js app, albeit a fairly complex one:

https://github.com/RecordReplay/devtools

If you're seeing a bug in our client app, could you try using it again and see if it's fixed now? (Or even better, use the Replay browser to record that error happening, submit it, and file a GitHub issue with the recording link so we can fix it! Seriously, we dogfood Replay to debug Replay on a daily basis.)


I guess your description of the product omitted the "one must use a special build of Firefox" part because when I go to https://app.replay.io with a non-magic browser, it is a white page. However, after downloading your copy of Firefox, it seems to default to app.replay.io and presents a login page

Confusingly, however, clicking on the "Sign in" button launches a link in my system's browser. Fascinating.

> The data privacy and security approach is described at https://www.replay.io/security-privacy

Fine, but my experience with these time traveling session recording frameworks is that they exfiltrate all events to The Cloud, so I wondered how any user of a website that uses that technology would give consent to being monitored in that way. However, after hearing you say "download the browser," maybe that's where the misunderstanding happened

Is it a development time tool, or a sniffer that is designed to be included in a "script" tag on random websites and record all interactions? because those are entirely different threat models

> If you're seeing a bug in our client app, could you either try using it again and see if it's fixed now?

I dunno if "client app" means your rebrand of Firefox, or https://app.replay.io. If the former, well, that's not at all what I was testing. If the latter, no, definitely not fixed


Hmm. Lemme step back and see if I can clarify.

The first step is to make the recording. That does currently require that you download our custom fork of Firefox, which has been instrumented to capture all the OS syscalls. (We also have a Node fork for CI usage, and are working on Chrome for Mac support). In the FF browser, sign in to Replay so it knows who's making the recording. Open the app you want to debug, press the "Record" button, reproduce the issue, and hit "Stop". That uploads the recording data to the cloud. So, no other "script tags" or anything. It's about recording all browser operations as you interact with a given page, while you have the "Record" button active.

The "client" that I'm referring to is https://app.replay.io . This is just a standard web app, and you can use any browser to open that page, log in, and start debugging. (I typically do my dev work in Chrome, personally.) If you're seeing problems with that, I'm very curious what's going on - we've had some backend issues this week, but I haven't seen any reports of client-side bugs.

The data that gets uploaded to the Replay cloud is exactly what _you_ had in the FF browser as you were navigating that page and recording it. So, sure, if you had a proprietary page open, the data from that page is getting uploaded. However, the recording itself is private by default, so only you have access to it, and Replay's client UI will tell you what sorts of domains have data included in the recording as you're looking at it.

Happy to chat more about all this if you'd like. Best place would be to drop by our Discord ( https://replay.io/discord ) - feel free to ping me directly if you do.


Before the whole project was killed, Node-ChakraCore had a time travel debugger that worked pretty well. I don't know how easy it would be to port the methods it used to a chrome extension (my guess is somewhere between difficult and impossible), but browser vendors could implement this natively.


That's an exceedingly clever idea, thanks for sharing it!

Please consider adding an actual license text file to your repo, since (a) I don't think GitHub's licensee looks inside package.json, and (b) I bet most "license" properties in package.json files are a "yeah, yeah, whatever" rather than an intentional choice: https://github.com/adriancooney/puppeteer-heap-snapshot/blob... I'm not saying that applies to you, but an explicit license file in the repo would make your wishes clearer.


Ah thank you for the reminder. Added it now!


If this catches on, web developers may start employing memory obfuscation techniques like game developers do:

https://technology.riotgames.com/news/riots-approach-anti-ch...


Love this approach, thanks for sharing!

I am trying this on a website that Puppeteer has trouble loading, so I got a heap snapshot directly in Chrome. I was trying to search for relevant objects directly in the Chrome heap viewer, but I don't think the search looks inside objects.

I think your tool would work: "puppeteer-heap-snapshot query -f /tmp/file.heapsnapshot -p property1" - or really any JSON parser, but that requires extra steps. Would you say this is the easiest way to view/debug a heap snapshot?


Wow this is brilliant. I've sometimes tried to reverse engineer APIs in the past, but this is definitely the next level.

I used to think ML models could be good for scraping too, but this seems better.

I think this + a network request interception tool (to get data that is embedded into HTML) could be the future.


The article brings up two interesting points for web preservation:

1. The reliance on externally hosted APIs

2. Source code obfuscation

For 1, in order to fully preserve a webpage, you'd have to go down the rabbit hole of externally hosted APIs and preserve those as well. For example, sometimes a webpage won't render LaTeX notation because a MathJax endpoint can't be reached. To save that webpage, we would need a copy of the MathJax JS too.

For 2, I think WASM makes things more interesting. With WebAssembly, I'd imagine it's much easier to obfuscate source code: a preservationist would need a WASM decompiler for whatever source language was used.


This is great, thanks a lot.

It's my understanding that Playwright is the "new Puppeteer" (even with core devs migrating). I presume this sort of technique would be feasible on Playwright too? Do you think there's any advantage or disadvantage of using one over the other for this use case, or it's basically the same (or I'm off base and they're not so interchangeable)?

I'm basing my personal "scraping toolbox" off Scrapy which I think has decent Playwright integration, hence the question if I try to reproduce this strategy in Playwright.


My understanding of Playwright is that it's trying to be the new Selenium, in that it's a programming language orchestrating the WebDriver protocol

That means that if you are running against Chromium, this will likely work, but unless Firefox has a similar heapdump function, it is unlikely to work[1]. And almost certainly not Safari, based on my experience. All of that is also qualified by whether Playwright exposes that behavior, or otherwise allows one to "get into the weeds" to invoke the function under the hood

1 = as an update, I checked and Firefox does have a memory snapshot feature, but the file it saved is some kind of binary encoded thing without any obvious strings in it

I didn't see any such thing in Safari


Well, kind of, for Firefox: there is this (semi-built-in) profiling tool you could use:

https://github.com/firefox-devtools/profiler. It lets you save a report in json.gz format.


I had understood that Playwright actually used the DevTools protocol rather than the WebDriver protocol, as mentioned here:

https://github.com/microsoft/playwright/issues/4862

And there's a bit of detail about how they're different here:

https://stackoverflow.com/q/50939116/142780

However, that's more of a detail and doesn't really undermine your point about Firefox/Safari being handled differently; it's just that Playwright implemented its own versions of the protocol for those two non-Chromium browsers.
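Right, and since it's CDP under the hood for Chromium, you can reach the HeapProfiler domain that heap snapshots come from through Playwright's raw CDP session. A hedged sketch (Chromium only; the method and event names are standard DevTools protocol):

    const { chromium } = require('playwright');
    const fs = require('fs');

    (async () => {
      const browser = await chromium.launch();
      const context = await browser.newContext();
      const page = await context.newPage();
      await page.goto('https://example.com');

      // Raw DevTools protocol session - Chromium only in Playwright.
      const session = await context.newCDPSession(page);

      // The snapshot arrives as a stream of JSON chunks.
      let snapshot = '';
      session.on('HeapProfiler.addHeapSnapshotChunk', ({ chunk }) => {
        snapshot += chunk;
      });
      await session.send('HeapProfiler.takeHeapSnapshot', { reportProgress: false });

      fs.writeFileSync('page.heapsnapshot', snapshot);
      await browser.close();
    })();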


> Firefox does have a memory snapshot feature, but the file it saved is some kind of binary encoded thing without any obvious strings in it

Those .fxsnapshot files are gzipped binary heap snapshots. There is a third-party decoder for them:

https://github.com/jimblandy/fxsnapshot


A neat idea for sure. I just wanted to point out that this is why I prefer XPath over CSS selectors.

We all know the display of the page and the structure of the page should be kept separate, so why would you base your selectors on display? Particularly if you're looking for something on a semantically designed page, why would I look for an .article class, which may disappear with the next redesign, when they're unlikely to stop using the article HTML tag?


CSS selectors don't have to select purely by classes. They can be something like:

div > div > * > *:nth-child(7)

XPath doesn't have any additional abilities, it's just verbose and difficult to write. It's a lemon.


I might be wrong, but XPath has contains(), where you can look for text content inside an element, which I don't think CSS can do.
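For example, something like this in a browser console (the link text is a placeholder) - matching on text content, which plain CSS selectors can't do:

    // Find the first <a> whose text contains "Next page" - no CSS equivalent.
    const next = document.evaluate(
      '//a[contains(normalize-space(.), "Next page")]',
      document,
      null,
      XPathResult.FIRST_ORDERED_NODE_TYPE,
      null
    ).singleNodeValue;

    if (next) next.click();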


Yeah, for sure XPath is the more powerful of the two, so much so that Scrapy's parsel library parses CSS selectors and transforms them into the equivalent XPath for execution.

To the best of my knowledge, CSS selectors care only about the structure of the page, lightly dipping their toes into the content only for attributes and things like ::first-letter and ::first-line.


Well that is 100% originally an XPath selector (:nth-child), so kudos if CSS selectors support it now.

Still, using // instead of multiple *'s (and the two divs) seems better for longer-term scraping.


> Developers no longer need to label their data with class-names or ids - it's only a courtesy to screen readers now.

In general, screen readers don't use class names or IDs. In principle they can, to enable site-specific workarounds for accessibility problems. But of course, that's as fragile as scraping. Perhaps you were thinking of semantic HTML tag names and ARIA roles.


Anything relying on id/class names has been broken since the advent of the machine-generated names that come part and parcel with the most popular SPA frameworks. They're all gobbledygook now, which makes writing custom ad-block cosmetic filters a real PITA.


React doesn't do that. You may still find gibberish on hostile sites like Twitter, which intentionally obfuscate class names using something like React Armor.


React by itself doesn't do that, but CSS-in-JS libraries are fairly common in the React ecosystem, and most of them have random autogenerated class names.


Even though this is a good point, I am somewhat disappointed that people downvoted a trivially correct statement. React is arguably the most popular web framework and it definitely doesn't do this, nor does anything about it make it especially sensible to do. Not every React app uses styled-components or a CSS-in-JS solution at all; arguably, a fair number don't. Anything using Blueprint doesn't entirely use CSS-in-JS.


Scraping is inherently fragile due to all the small changes that can happen to the data model as a website evolves. The important thing is to fix these things quickly. This article discusses a related approach of debugging such failures directly on the server: https://talktotheduck.dev/debugging-jsoup-java-code-in-produ...

It's in Java (using jsoup), but the approach will work for Node, Python, Kotlin, etc. The core concept is to discover the cause of the regression instantly on the server and deploy a fix fast. There are also user-specific regressions in scraping that are again very hard to debug.


This isn't future-proof at all. Game devs have been using automatic memory obfuscation forever. If this becomes popular, it will take no more than adding a webpack plugin to defeat it, with no data structure changes required.
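For the curious, something along the lines of Terser's property mangling would do it. A hedged sketch of a webpack config - the underscore-prefix convention is just an example, and mangling too broadly breaks anything that relies on stable property names (DOM APIs, JSON from the server, etc.):

    // webpack.config.js - rename matching object property names at build time,
    // so the heap no longer contains recognizable keys.
    const TerserPlugin = require('terser-webpack-plugin');

    module.exports = {
      mode: 'production',
      optimization: {
        minimizer: [
          new TerserPlugin({
            terserOptions: {
              mangle: {
                // Only mangle properties matching this regex; mangling everything
                // tends to break code that talks to the outside world.
                properties: { regex: /^_/ },
              },
            },
          }),
        ],
      },
    };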


Very interesting! I have a feeling this will break if people use the advanced mode of the Closure Compiler. It's able to optimize away object property names. Is this not commonly done anymore?


Nice, this won't work anymore then.


Exactly my thoughts. The author is using it 'in production'; speaking out loud in a forum where Facebook/Meta employees (and other Silicon Valley folk) are definitely watching is a rookie mistake.


How would you prevent it from being possible?


Awesome experimentation! I'd be curious to see how you navigate the heap dump in some real website examples.


I've used a similar technique on some web pages that come back from the server with an intact Redux state object just sitting in a <script> tag. Instead of parsing the HTML, I just pull out the state object. Super convenient.
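A rough sketch of that extraction, assuming the common window.__INITIAL_STATE__ convention - adjust the marker (and the parsing) to whatever the site actually embeds:

    // Pull a server-embedded state object straight out of the HTML,
    // skipping DOM parsing entirely.
    const fetch = require('node-fetch');

    async function getEmbeddedState(url) {
      const html = await (await fetch(url)).text();
      // Assumes something like: <script>window.__INITIAL_STATE__ = {...};</script>
      const match = html.match(/window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;?\s*<\/script>/s);
      if (!match) throw new Error('state object not found');
      // Only works if the embedded object is valid JSON, which it usually is.
      return JSON.parse(match[1]);
    }

    getEmbeddedState('https://example.com/some-page')
      .then((state) => console.log(Object.keys(state)));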


This sadly does not help if the JS code is minified/obfuscated and the data is exchanged using some binary or binary-like protocol like gRPC. Unfortunately this is increasingly common.

The only long-term way is to parse the visible text.


I've never seen grpc from a browser on a consumer-facing site; do you have an example I could see?

That said, for this approach something like grpc would be a benefit since AIUI grpc is designed to be versioned so one could identify structural changes in the payload fairly quickly versus the json-y way of "I dunno, are there suddenly new fields?"


Not aware of any actual gRPC websites, but given that grpc-web has 6.5k stars on GitHub, something must be out there.

Google's websites frequently use binary-like formats where the JSON is just an array of values with no property names, and most of those values are numbers. See for example Gmail.


That's likely just an encoding into JSON of protobuf messages. The typical wire format contains a sequence of field tag followed by contents. They can just make it an even-sized array.


Is he scraping the heap because the data isn't present in the HTML, or because the API response present in the heap changes less often than the HTML?


Seems easy to defeat by deleting objects after generating the HTML or DOM nodes? Although I suppose taking heap snapshots before the deletions would avoid that.


Depending on how exactly the page is loading data, it might be easier to use something like mitmproxy and observe the data flow and intercept there.


Would this method work if the website obfuscated its HTML as per the usual techniques, but also rendered everything server side?


If it's rendered server-side: no. The data likely won't be loaded into the JS heap (the DOM isn't included in the heap snapshots) when you visit the page. However, you might be in luck if the website executes JavaScript to augment the server-side rendered page. If it does, your data may be loaded into memory in a way you can extract it.


Does anyone know if a Chrome browser extension has access to heap snapshots?


You'd want to use the debugging API, which is available either via WebSockets or via the chrome.debugger extension API. The latter requires a specific permission though, I think.
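The chrome.debugger route looks roughly like this - a sketch, assuming the "debugger" permission is in the manifest and the tabId comes from wherever your extension gets it:

    // In an extension background script / service worker.
    function snapshotTab(tabId) {
      const target = { tabId };
      let snapshot = '';

      chrome.debugger.onEvent.addListener((source, method, params) => {
        if (source.tabId === tabId && method === 'HeapProfiler.addHeapSnapshotChunk') {
          snapshot += params.chunk;
        }
      });

      chrome.debugger.attach(target, '1.3', () => {
        chrome.debugger.sendCommand(
          target,
          'HeapProfiler.takeHeapSnapshot',
          { reportProgress: false },
          () => {
            console.log('heap snapshot size:', snapshot.length);
            chrome.debugger.detach(target);
          }
        );
      });
    }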


Why doesn't the example chosen, YouTube, use something like Cloudflare "anti-bot" protection or Google reCAPTCHA?

When I request a video page, I can see the JSON in the page, without the need for examining a heap snapshot.


Because Cloudflare, reCAPTCHA, etc. mean this is not possible in general. You need to quack like a normal user for it to work. If a site is really against scraping, it could probably make it completely uneconomical by tracking user footprints and detecting unexpected patterns of usage.


They detect and block headless browsers just as easily.


Really cool approach, great work


[flagged]


You're banned, which means everything you post will be marked as "Dead". Only those with "showdead" enabled in their profile will be able to see your comments and posts. Others can "vouch" for your post to make it not "dead" so it can be replied to (which is what I have done)

As for why you were banned: https://news.ycombinator.com/item?id=30275804


Odd, I see his comment and I'm playing HN with undeads off ("showdead" is "no").


I had to vouch for the comment (make it not dead) to reply to it


Oh I see, you're playing medic.

Thanks, I always wondered what "showdead" meant (tho not enough to Google it I guess).



Yeah, found when it happened - https://news.ycombinator.com/item?id=30275804



