My favorite one implemented CSRF protection by polling an endpoint and attaching the hashed data from that endpoint, plus a timestamp, to every request.
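A minimal sketch of how a scraper might replicate that scheme in Python with requests; the /csrf-token path, the response shape, and the header names are all made-up placeholders, since the site wasn't named:

    import time
    import requests

    BASE = "https://example.com"  # placeholder for the actual site
    session = requests.Session()

    def fresh_csrf_hash():
        # Poll the (hypothetical) endpoint that hands out the hashed data.
        resp = session.get(f"{BASE}/csrf-token")
        resp.raise_for_status()
        return resp.json()["hash"]  # assumed response shape

    def api_get(path):
        # Attach the hash plus a timestamp to every request, mirroring
        # what the site's own frontend did.
        headers = {
            "X-CSRF-Hash": fresh_csrf_hash(),      # assumed header name
            "X-Timestamp": str(int(time.time())),  # assumed header name
        }
        return session.get(f"{BASE}{path}", headers=headers)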
When I hear a junior dev give up on something because the API doesn't provide the functionality of the UI, it makes me very sad that they're missing out.
If all else fails, no website can withstand OCR-based screen scraping. It is slow(er), but fast enough for many use cases.
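As a rough illustration of that approach in Python, using Playwright for the screenshot and pytesseract for the OCR (the URL is a placeholder):

    from playwright.sync_api import sync_playwright
    from PIL import Image
    import pytesseract

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/prices")  # placeholder URL
        page.screenshot(path="page.png", full_page=True)
        browser.close()

    # Whatever the site renders for human eyes is recoverable here,
    # no matter how hostile the markup or the anti-bot JavaScript.
    text = pytesseract.image_to_string(Image.open("page.png"))
    print(text)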
Also, I think very few people use MechanicalSoup nowadays. There are libraries that allow you to use headless Chrome, e.g. Playwright.
It looks like the author of the article just googled some libraries for each language and didn't research the topic.
Yep, this seemed like an aggregate Google results page.
I was initially intrigued by the article and then realized it was a list of libraries the author found via Google. There were some notable omissions from this list and a bunch of weird stuff that feels unnecessary. I don't think the author has actually scraped a page before.
although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_
> Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.
although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor
> It looks like the author of the article just googled some libraries for each language and didn't research the topic
Heh, oh, new to the Internet, are you? :-D
Here's one: https://chrome.google.com/webstore/detail/headless-recorder/...
npx playwright codegen wikipedia.org
When I google, I see it advertised as a "testing" tool.
Can I also use it for scraping? Where would I learn more about doing so?
It's similar to Google's Puppeteer, but in my opinion much more pleasant and productive, even with Chrome. Microsoft's best developer tool IMO; saves me tons of time.
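You can use it for scraping just the same; the selector API you'd use in tests works for extraction too, and the docs at https://playwright.dev cover it. A minimal sketch in Python (the URL is just an example):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://en.wikipedia.org/wiki/Web_scraping")
        # The same locator API the test tooling uses works for extraction.
        title = page.locator("h1").inner_text()
        links = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
        print(title, len(links))
        browser.close()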
$30 / month for 300K requests, rotating residential proxies, uses headless Chromium, etc.
My main concerns, though, were about testing. What if you want to create tests to check that your scraper still gets the data you want? Colly allows nested scraping and it's easy to implement, but you end up with all your logic in one big function, which makes it harder to test.
Did you find a solution to this? I'm considering switching to net/http + GoQuery only to have more freedom.
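One pattern that helps, sketched here in Python since the idea is language-agnostic (the same split works with net/http + GoQuery): keep fetching and parsing separate, so the parse step can be unit-tested against saved HTML fixtures without touching the network.

    import requests
    from bs4 import BeautifulSoup

    def parse_items(html):
        # Pure function: feed it saved HTML fixtures in your tests.
        soup = BeautifulSoup(html, "html.parser")
        return [
            {"title": a.get_text(strip=True), "url": a.get("href")}
            for a in soup.select("a.item")  # placeholder selector
        ]

    def scrape(url):
        # Thin I/O wrapper around the testable core.
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return parse_items(resp.text)

A test then just feeds parse_items a saved fixture file and asserts on the result, no network needed.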
I don't remember exactly, but I think it was around 100 or 200 loc, so not exactly something that took long to write.
In fact the most difficult thing was to figure out how to pass the right args to Chromium.
I wonder, what does a scraping framework offer?
HTTP requests, HTML parsing, crawling, data extraction, wrapping complex browser APIs, etc. Nothing you couldn't do yourself, but like most frameworks, they abstract the messy details so you can get a scraper working quickly without having to cobble together a bunch of libraries or reinvent the wheel.
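To make that concrete, here's roughly the plumbing a framework hands you for free, sketched as a toy Python crawler with requests + BeautifulSoup (illustrative, not production code):

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, max_pages=50):
        # The bare-bones loop a framework would otherwise hide:
        # a frontier queue, a seen-set, fetching, parsing, link extraction.
        seen, frontier = {start_url}, deque([start_url])
        while frontier and len(seen) <= max_pages:
            url = frontier.popleft()
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            yield url, soup  # hand each parsed page to the caller
            for a in soup.select("a[href]"):
                link = urljoin(url, a["href"])
                if link.startswith(start_url) and link not in seen:
                    seen.add(link)
                    frontier.append(link)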
Currently sport a mix of curl + grep + xsltproc + lambdasoup (OCaml) and am happy with it. Sounds like a mess but is shallow, inspectable, changeable and concise. http://purl.mro.name/recorder
Not only has it been a blast to try out, but also surprisingly easy to setup.
I now have around 11 domains being scraped 4 times a day through a well-defined pipeline + ETL, which then pipes the data to Firebase Firestore for consumption.
Next step is to write the page on top of it.
My SaaS requires some technical knowledge to use (calling a web API), which I suppose is why it's never in these lists.
Some of my customers are *very* large businesses. If you are looking at evading bot countermeasures, my product probably isn't the best for you, but for TCO nothing beats it.
It looks like phantomjscloud.com also supports Puppeteer.
I categorically refuse to solve it when I’m browsing websites that use it. I find this new captcha utterly unacceptable.
It’s no longer “protection” at this point. Websites are using it as an excuse to become even more user-hostile. I am worried for the future of the web.
> Cloudflare's bot protection mostly makes use of TLS fingerprinting, and is thus pretty easy to bypass.
https://news.ycombinator.com/item?id=28251700 -> https://github.com/refraction-networking/utls
Disclaimer: haven't tried it.
I put together a toy site recently that uses this approach for JIT price comparisons of events. When you click on an event, the backend navigates to the requested ticket provider pages through a pool of Puppeteer instances and waits for JSON responses with pricing data.
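The site itself uses Puppeteer; the equivalent pattern in Playwright's Python API looks roughly like this (the URL and the "pricing" response filter are placeholders):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Block until the provider's pricing XHR comes back, instead of
        # scraping the rendered DOM afterwards.
        with page.expect_response(
            lambda r: "pricing" in r.url and r.status == 200
        ) as resp_info:
            page.goto("https://tickets.example.com/event/123")  # placeholder
        prices = resp_info.value.json()
        print(prices)
        browser.close()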
How do you know this if it is not your website?
Also, the internet has no time zone.
The Internet has no time zone, but its human users all do.
It was uncharacteristic of me, because I tend to use boring, older technologies. But this gamble paid off for me.
Can you tell me more?
What would be a better solution, if you have any to recommend?
This does put limits on how quickly they can crawl, of course, but scrapers find ways around it, like changing IP and user agent (IP is probably the main one, because you can then pretend to be multiple humans browsing the site normally).
CF has a view on a significant chunk of internet traffic across many sites and feeds that into some kind of heuristics/machine learning. Even if we assume that your behavior on the scraped website looks human-like, you may still get blocked or challenged because of your lack of traffic on other sites.
The IPs you'd get from a typical proxy service would only be used for bot activity and would've been classified as such a long time ago, and there's no "human activity" on them to compensate and muddy the waters, so to speak.
The best solution is to use IPs with a chunk of legitimate residential traffic, and keep scraping sessions constrained to their IPs - don't rotate your requests among all these IPs, instead every IP should be its own instance of a human-like scraper, using its own user account, browser cookies, etc.
Going as low as 0.2 dps could easily be doable I think.
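A sketch of that "one IP = one persistent identity" idea in Python with requests; the proxy URL and user agent are placeholders:

    import requests

    class ScraperIdentity:
        # One long-lived identity pinned to one residential IP: its own
        # proxy, cookie jar, and user agent, never shared or rotated.
        def __init__(self, proxy_url, user_agent):
            self.session = requests.Session()  # keeps cookies across requests
            self.session.proxies = {"http": proxy_url, "https": proxy_url}
            self.session.headers["User-Agent"] = user_agent

        def get(self, url):
            return self.session.get(url, timeout=15)

    # Each instance browses like a single human, on its own schedule.
    bot = ScraperIdentity(
        "http://user:pass@residential-proxy.example:8080",  # placeholder proxy
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",        # placeholder UA
    )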
All Chrome is doing is no longer appending the current semver to the UA it sends.
Yes, at any time the UA could be ignored and clients could be fingerprinted, but now that the UA is being made next to useless, fingerprinting will become the default everywhere.
The poster says that in order to scrape effectively you should appear to be a real human, use different UAs, etc.
So as this change happens, different UAs become one less thing you can easily vary to seem less suspicious, since a non-frozen UA would itself become a suspicious sign after some time.
So a sort of side effect.
on edit: so I'm thinking that since there will only be one UA floating around, sure, older UAs can still exist, but they become progressively more suspicious.
The strategy to deal with it is to behave well when making requests so that you don't get rate limited.
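For instance, a polite client might space requests out and honor 429 responses instead of hammering through them; a sketch in Python (the intervals are arbitrary, and it assumes a numeric Retry-After header):

    import time
    import requests

    def polite_get(url, min_interval=2.0, retries=3):
        # Space requests out and back off on 429 instead of retrying hard.
        for attempt in range(retries):
            resp = requests.get(url, timeout=10)
            if resp.status_code == 429:
                # Honor the server's hint if present, otherwise back off
                # exponentially (assumes Retry-After is given in seconds).
                time.sleep(float(resp.headers.get("Retry-After", 2 ** attempt * 10)))
                continue
            resp.raise_for_status()
            time.sleep(min_interval)  # fixed gap before the next request
            return resp
        raise RuntimeError("rate limited too many times: " + url)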
Here are a few comparisons if you're curious:
It's also fascinating to see how developers-who-aren't-me set up their APIs when they assume that nobody's looking.
* those which don't offer a _reasonable_ API, or (I would guess a larger subset) those which don't expose all the same information over their API
* those things which one wishes to preserve (yes, I'm aware that submitting them to the Internet Archive might achieve that goal)
* and then the subset of projects where it's just a fun challenge or the ubiquitous $other
As an example answer to your question, some sites are even offering bounties for scraped data, so one could scratch a technical itch and help data science at the same time:
Scraping works great to get the data.
I don't like Node/JS, but I use it to do the scraping, as I view the code as trash anyway, full of edge cases and unreliable data/types, and I can't complain: a dynamic scripting language is great for that.
It tells you who your governor, local/federal representative, senator, and municipal president are.
Each representative lives on a different website, so I wrote scrapers for each one.
GoodRX built a scraping system that tapped into all the major providers. That's what a group of vaccine hunters in my state used to get appointments for folks who had tried but were unable to get one.