Tips for reliable web automation and scraping selectors (medium.com/brick-by-brick)
122 points by tschiller 9 months ago | 18 comments

Another tip I've found extremely helpful for web scraping: check the <head> for <meta> tags or a <script type="application/ld+json"> tag that may already have the information you want collected neatly in one place. You can save yourself a lot of time and grief.
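As a minimal sketch of that approach (the helper name is mine, and a regex is workable here only because it targets this one specific tag; a real parser is safer for anything more):

```javascript
// Hypothetical helper: extract and parse every
// <script type="application/ld+json"> block from a raw HTML string.
// Malformed blocks are skipped rather than failing the whole page.
function extractJsonLd(html) {
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results = [];
  let match;
  while ((match = re.exec(html)) !== null) {
    try {
      results.push(JSON.parse(match[1]));
    } catch {
      // Ignore blocks that are not valid JSON.
    }
  }
  return results;
}
```

On a product page, for instance, extractJsonLd(html) might return an object whose "@type" is "Product" with name, price, and availability already structured.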

We built a library for extracting this data - https://github.com/indix/web-auto-extractor

Also, if the site is based on WordPress, the API is often open for read-only access, so you can fetch richer information and you won’t have to parse the full HTML document to get the content in question.
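WordPress sites expose this under the /wp-json/wp/v2/ prefix by default (sites can disable or relocate it). A small sketch that just builds the endpoint URL, leaving the actual fetch to whatever HTTP client you use:

```javascript
// Build a WordPress REST API URL. The /wp-json/wp/v2/ prefix is the
// WordPress default; individual sites may disable or move the API.
function wpApiUrl(site, resource, params = {}) {
  const url = new URL(`/wp-json/wp/v2/${resource}`, site);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}
```

For example, wpApiUrl("https://example.com", "posts", { per_page: 5 }) yields https://example.com/wp-json/wp/v2/posts?per_page=5, and each returned post carries the rendered content as structured JSON.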

Unfortunately this data is often inconsistent with what the page actually displays. For example, web shops often mark a product as 'InStock' regardless of the actual stock status. Since products are in stock most of the time, you may not notice the discrepancy and will silently extract wrong data later.

This was especially apparent when I tried to get my hands on some weights early in the COVID-19 pandemic. All the web shops were out of stock, but in about 80% of them the schema markup indicated otherwise.

That's definitely the easiest when it's there. In some cases the microdata will instead be embedded in the HTML tags in the body: https://schema.org/docs/gs.html

Here's a browser extension for working with selectors that was shared on the front page sometime last year: https://github.com/hermit-crab/ScrapeMate

Edit: I think it was from this discussion: https://news.ycombinator.com/item?id=24057228

Thanks for the links! Will check them out

As far as selectors go, we're currently working through how to support more specific selection when there are multiple HTML element alternatives for a single visual element (e.g., nested tags with no padding/margin).

This isn't too big of an issue when scraping text, but it comes into play when pulling attributes, or when using the selectors to modify the page structure. (Our tool lets you place buttons, panels, etc., copying the style/structure from existing elements.)

Author here, happy to answer any questions

For our product (PixieBrix) we actually generally grab the data directly from the front-end framework (e.g., React props). It's a bit less stable since it's effectively an internal API, but it means you can grab a lot of data with a single selector and can generally avoid parsing values out of text
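To illustrate the idea (not PixieBrix's actual code): recent React versions attach keys like __reactProps$<random> to the DOM nodes they render. This is an internal, unversioned implementation detail that can change between React releases, but a sketch of reading it looks like:

```javascript
// Sketch: read React's internal per-node props. React attaches a key
// like "__reactProps$<random suffix>" to each DOM element it renders.
// This is an undocumented internal detail and may break across releases.
function getReactProps(domNode) {
  const key = Object.keys(domNode).find((k) => k.startsWith("__reactProps$"));
  return key ? domNode[key] : null;
}
```

One lookup like this can hand you a whole structured object (IDs, prices, flags) that the page only renders as formatted text.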

Beyond it being an internal API: unless the app is compiled in debug mode, don't you get compiled/name-mangled/tree-shaken code and symbols?

I would assume this might change on recompiles or at least library updates, never mind internal code changes. Do you find that it works in practice?

For React and similar frameworks, the component names do get minified. In practice, though, JS compilers/bundlers can't mangle property names because 1) alias analysis is hard, 2) property-name string logic is ubiquitous in JS, and 3) the data often flows in from APIs, where mangling would break the contract. Google's Closure Compiler and other whole-program static compilers are the exception and do cause problems.
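A tiny example of why point 1 blocks property mangling: the same property can be reached through a computed string that no static analysis can reliably follow, so renaming it would silently break the lookup.

```javascript
// Why bundlers leave property names alone: this property is accessed
// through a string built at runtime, which static analysis can't track.
const order = { totalPrice: 42 };
const field = "total" + "Price"; // assembled dynamically
const price = order[field];      // renaming `totalPrice` would break this
```

This is also why Closure's "advanced optimizations" mode, which does rename properties, requires the whole program (and its externs) to be visible to the compiler.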

As I mentioned in the post, dynamic CSS classnames are also tricky depending on how much gets mangled. We have some techniques in the pipeline for better handling those

I see, so you see class and function names that make no sense, but they get called with nice clean JSON objects. Neat trick!

What is your experience with frameworks other than React? Which one is the least stable, i.e., the hardest to scrape?

Great question. Our general approach is to look up the devtools browser extension for the framework, and use that as a reference point for determining how to interface with the framework

The most popular framework we haven't implemented support for yet is Angular. (AngularJS, the old version, is straightforward.) Any of the compiled frameworks, e.g., anything built with the Google Closure Compiler, is difficult because identifiers get mangled. I suspect Svelte might also be tricky, but we haven't tried it yet

At the end of the day, though, every framework has to write to the DOM and stay accessible. So you can use selectors, or in the worst case OCR/computer vision. (IIRC, FB actively inserts dummy elements to try to defeat structural scraping.)

Both the :has and the :contains selector (as in ul:has(> li:contains("Built"))) were new to me. Thanks to the author for sharing that little trick! Worth noting that :contains() is a jQuery/Sizzle extension rather than standard CSS, so it only works through tools that use that selector engine.
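In plain DOM code you can emulate :contains() by selecting with a standard selector and filtering on textContent. A sketch (the helper name is mine; it works on anything exposing querySelectorAll, which keeps it easy to test):

```javascript
// Emulate jQuery's non-standard :contains() with standard DOM calls:
// select with an ordinary CSS selector, then filter on textContent.
function selectContaining(root, selector, text) {
  return Array.from(root.querySelectorAll(selector))
    .filter((el) => el.textContent.includes(text));
}
```

For example, selectContaining(document, "ul > li", "Built") returns only the list items whose text mentions "Built".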

For e2e testing I have seen various patterns; the article mentions data-test-id, for instance. In my own tests I have opted for something similar that has given a bit more flexibility.

Singular elements: data-test-save-button, data-test-name-input

Elements that are a part of a list: data-test-user={user.id}, data-test-listing={listing.id}

This allows us to name our elements with data test attributes, but also provide values to them where applicable.

I have also created a testSelector function that takes id and value, and spits out either [data-test-${id}="${value}"] or [data-test-${id}].
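A minimal version of that helper, matching the behavior described above, might look like:

```javascript
// Build a data-test attribute selector: with a value it matches one
// specific element, without a value it matches any element with the id.
function testSelector(id, value) {
  return value === undefined
    ? `[data-test-${id}]`
    : `[data-test-${id}="${value}"]`;
}
```

So testSelector("save-button") gives [data-test-save-button], and testSelector("user", "42") gives [data-test-user="42"].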

We have also experimented with letting shared components populate their own data-test-* attribute automatically based on other props, like our modal component, which sets data-test-modal={title}. So instead of data-test-delete-user-modal you get data-test-modal="Delete user", and the dev does not need to provide the data-test-* attribute manually, since the component takes care of it.

Selectors are very brittle. I do not use them and IMO the scrapers I create are less likely to break and easier to fix if they do.

Nice list, especially for anyone getting started. Web scraping was my entry point into web development. I take it for granted now, but 15+ years ago I loved the idea of being able to completely mine a website of all its content.

Same! I find it very satisfying to see the scraper in action when it's finished.
