I once worked on a spider that crawled article content and I ran into the same problem. I always wanted to try the following solution to it but never had the chance.
Assume you have a database of URLs and the fields you've scraped from them in the past (title, author, date, etc). If you ever fail to scrape one of those values from a new URL, here's what you do:
- Go back to one of the old URLs where you already have the correct value (let's say it's the title).
- Walk through the whole DOM until you find that known title. At each node you'll have to flatten its children down to their text, to handle titles like "Foo <span>Bar</span>" that you want to match against "Foo Bar". So this is going to be an expensive search.
- Generate several possible selectors which match the node you walked to (maybe you have ".title", ".title h2", ".content .top h2", etc).
- Test each new selector on several other already-crawled pages. If any of the selectors work 100% of the time, there's your new selector.
Any thoughts on whether something like this would work?
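For what it's worth, here's a rough sketch of the idea in Python with BeautifulSoup (the helper names are mine, purely illustrative):

    from bs4 import BeautifulSoup

    def nodes_matching(html, known_value):
        # Walk the whole DOM; a node "matches" if its flattened text equals the
        # known value, so "Foo <span>Bar</span>" matches "Foo Bar".
        soup = BeautifulSoup(html, "html.parser")
        return [node for node in soup.find_all(True)
                if node.get_text(" ", strip=True) == known_value]

    def candidate_selectors(node):
        # Yield increasingly specific selectors: "h2.title", "div.top h2.title", ...
        parts = []
        for ancestor in [node] + list(node.parents):
            if ancestor.name == "[document]":
                break
            classes = "".join("." + c for c in ancestor.get("class", []))
            parts.insert(0, ancestor.name + classes)
            yield " ".join(parts)

    def selector_still_works(html, selector, expected):
        # A selector "works" on an already-crawled page if it finds the known value.
        found = BeautifulSoup(html, "html.parser").select_one(selector)
        return found is not None and found.get_text(" ", strip=True) == expected

You'd keep only the candidate selectors for which selector_still_works passes on every other already-crawled page.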
What I do is run a regression test every x minutes. If it fails, we set a flag to save the raw HTML every time we crawl pages. Then we can go back and reprocess those saved pages once we've fixed our crawler.
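A minimal sketch of that pattern (the check and parser functions are hypothetical):

    import hashlib
    import pathlib

    SAVE_RAW = False  # flipped on when the scheduled regression check fails

    def regression_check():
        # parse_known_fixture_ok() is a hypothetical check against a page whose
        # correct values we already know.
        global SAVE_RAW
        if not parse_known_fixture_ok():
            SAVE_RAW = True

    def handle_page(url, html):
        if SAVE_RAW:
            # Keep the raw HTML so the page can be reprocessed after the fix.
            out = pathlib.Path("raw_pages")
            out.mkdir(exist_ok=True)
            (out / (hashlib.sha1(url.encode()).hexdigest() + ".html")).write_text(html)
        return parse(html)  # hypothetical parser that may currently be broken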
I do the same at $dayJob where I'm parsing results of an internal API. Instead of making a call later that may not return the same data, I store the JSON and just process that. I feel like treating network requests as an expensive operation, even though they're not really, helped me come up with some clever ideas I'd never had before. It's a premature optimization considering I've had something like a 0.000001% failure rate, but being able to replay that one breakage made debugging an esoteric problem waaaaaay simpler than it would've been otherwise.
Do you have any recommendations for places and/or interview practices?
Creating a domain-specific model can be done a million ways, obviously, but the nice thing about HTML is the markup gives hints all along the way. The tree, the class names, the tags, etc. Coupled with the content of each tag, it's absolutely possible to determine collections of items with metadata using ML.
But like most ML problems, the underlying data that feeds the model is the time-consuming part.
If you are working in a single domain or a few domains, I 100% recommend this approach. If you're scraping something far more generalized, first you need to have models that you care about, and then you need models to determine the content type of what your scraper is looking at:
1) What kind of content do I have?
2) Does this match a known domain with a model?
3) Apply appropriate model to domain, hopefully extract correct data
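A minimal sketch of steps 1 and 2, assuming scikit-learn plus BeautifulSoup and a set of already-labeled example pages (everything here is illustrative, not a production model):

    from bs4 import BeautifulSoup
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def page_features(html):
        # Combine the visible text with the markup hints (tag and class names).
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(" ", strip=True)
        tags = " ".join(tag.name for tag in soup.find_all(True))
        classes = " ".join(c for tag in soup.find_all(True) for c in tag.get("class", []))
        return " ".join([text, tags, classes])

    def train_content_type_model(labeled_pages):
        # labeled_pages: list of (html, label) pairs, e.g. ("<html>...", "product")
        X = [page_features(html) for html, _ in labeled_pages]
        y = [label for _, label in labeled_pages]
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(X, y)
        return model  # model.predict([page_features(new_html)]) -> content type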
Another huge issue is, of course, validation, because you're going to be dealing with an inordinate amount of unknown and unpredictable data depending on what you're looking at.
If it really works 100% of the time, then probably. A lot of sites, though, use multiple markup styles for seemingly no reason. E.g. if you created an account before a certain date then your profile keeps the old HTML, even though the old pages look identical to the new pages.
Suppose alternatively that I have previously shown the status as "available" or "out of stock" and now I change it for some products to "no longer available". Can your system handle those edge cases?
We'll make a note of this! It does seem like a cool idea to put these two features together and automate the updating of API endpoints.
That said, I haven't found a need to do that yet, so I can't verify the idea itself.
Maybe that is similar to what they do using their "ML approach" mentioned?
At Blekko I developed a number of ways to deal with people who tried to scrape the web site for web results. The three most effective were blackholing (your web site vanishes as far as these folks are concerned), hang holding (basically a crafted TCP/IP stack that completes the SYN/ACK sequence but then never sends data, so the client hangs forever), and data poisoning (returning a web page that has the same format as the one they requested but filling it with incorrect data).
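A toy illustration of the hang-holding idea (not the actual crafted stack, just the effect): accept the connection and then never answer.

    import socket

    def hang_hold(port=8080):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", port))
        srv.listen()
        held = []
        while True:
            conn, _addr = srv.accept()  # the TCP handshake completes normally...
            held.append(conn)           # ...but we never read or write, so the client just waits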
We had a couple of funny triggers of the anti-bot stuff during the run. Once, a presenter on stage showed an example query and enough people in the audience typed it into their phones/tablets/laptops, all coming from the same router, that it looked like a bot. The other time, an entire country was behind a single router and a school had all of their students making the same sort of query at the same time (in both cases the trigger was a rapid query rate for the exact same search query from a single address).
In Blekko's case since bots either never clicked on ads or always clicked on the same ad (in both cases we got no revenue) keeping bot traffic off the site was measurable in terms of income.
Many bypass protections by limiting request rate and using a pool of lesser known proxies/IPs.
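In practice that tends to look something like this sketch (the proxy addresses and delay are placeholders):

    import itertools
    import time
    import requests

    PROXIES = itertools.cycle(["http://proxy-a:3128", "http://proxy-b:3128"])  # placeholder pool

    def polite_get(url, delay=3.0):
        proxy = next(PROXIES)  # rotate through the pool
        resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
        time.sleep(delay)      # keep the request rate low
        return resp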
One of the things we learned at Blekko was that people who run botnets often sell 'proxy service' as a product; we identified several made up of users of the Time Warner "Road Runner" service. That put us, as a web site, in a bind: the proxy service running on an infected computer was violating our terms of service, but the user might be completely unaware. If they were also a customer and we blackholed their IP, it would cut off legitimate traffic too. Since we didn't keep logs that could identify these relations over time (privacy issues), we had to rely on other methods. We never got enough penetration into the search market to make this a huge concern, however, so the problem remained largely theoretical. We started a program of exponential banning, where an IP would be banned and then unbanned an hour later, and if it resumed its bad behavior it was banned for 2 hours, then 4, etc. Once you get to 1024 hours it is pretty safe to assume they are lawful evil, as it were.
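The banning schedule itself is simple to sketch (in-memory and illustrative only):

    import time

    ban_hours = {}     # ip -> current ban length in hours
    banned_until = {}  # ip -> unix time when the ban lifts

    def punish(ip):
        # 1 hour on the first offence, doubling each time, capped at 1024 hours.
        ban_hours[ip] = min(ban_hours.get(ip, 0.5) * 2, 1024)
        banned_until[ip] = time.time() + ban_hours[ip] * 3600

    def is_banned(ip):
        return time.time() < banned_until.get(ip, 0)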
These guys fake their user agent, mask their IP addresses, and generally work hard to defeat anti-bot measures. They know they are over the line, but the law has yet to catch up to them.
I'm thinking of Ryanair suing Expedia, United vs wandr.me, Southwest suing SWMonkey.com; I'm sure there are countless others.
Intuitively I would think that this sort of problem would benefit from asynchronous ingestion at the edge, pushing unprocessed content to a multi-threaded/multi-process backend. (Because I'd expect that network latencies mean you need lots of threads to saturate I/O, which I'd expect would conflict with effectively using the available CPU power to do the actual document processing.)
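A minimal sketch of that shape, assuming asyncio/aiohttp for the downloads and a process pool for the parsing (the processing function is a stand-in):

    import asyncio
    from concurrent.futures import ProcessPoolExecutor
    import aiohttp

    def process_document(html):
        # Stand-in for the CPU-heavy parsing/processing step.
        return len(html)

    async def fetch(session, url):
        async with session.get(url) as resp:
            return await resp.text()

    async def ingest(urls):
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as pool:
            async with aiohttp.ClientSession() as session:
                pages = await asyncio.gather(*(fetch(session, u) for u in urls))
            return await asyncio.gather(
                *(loop.run_in_executor(pool, process_document, p) for p in pages))

    # asyncio.run(ingest(["https://example.com/a", "https://example.com/b"]))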
Edit: In the UK I see this: https://imgur.com/a/zlWOByh
Of course if I was actually trying to read the link I would have to give up, because there appears to be no way to navigate through and opt out.
This is what I get in the UK: https://imgur.com/a/zlWOByh
I'm in Switzerland (Europe) and don't get the https://imgur.com/a/zlWOByh
The thing is that most ecommerce websites of note generate shopping feeds in easily machine-readable formats (JSON, XML) for Google Shopping, Facebook and the like. These feeds also go down to SKU level. The URL might not be advertised, but it won't be blocked or protected behind a username/password or API key.
If you're buying a T-shirt, the product page might list all sizes and all colours while only showing a master 'variant' SKU (which is not a real SKU); the backend then adds the actual size/colour-specific SKU to the basket, of which there could be twenty behind the one 'variant' product page.
Meanwhile the product feed will list every SKU variant, complete with latest pricing and other pertinent information, e.g. barcode, product image etc.
I am sure that most retailers would prefer to just point the scraping party to the feed rather than have them slow the site to a crawl with multi-threaded crawlers hiding behind proxies. So that is how these scrapers can be 'accommodated'.
The sitemaps that go with the ecommerce game are also pretty reliable; these are high up the SEO checklist and will say when products were last updated.
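Reading the lastmod dates out of such a sitemap is only a few lines (a sketch; the URL is a placeholder):

    import requests
    import xml.etree.ElementTree as ET

    SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def sitemap_entries(url):
        root = ET.fromstring(requests.get(url, timeout=30).content)
        return [(entry.findtext("sm:loc", namespaces=SITEMAP_NS),
                 entry.findtext("sm:lastmod", namespaces=SITEMAP_NS))
                for entry in root.findall("sm:url", SITEMAP_NS)]

    # sitemap_entries("https://example.com/sitemap.xml") -> [(product_url, last_updated), ...]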
Then there are rich snippets - or whatever they are called now. The trend in these is to have some JSON-LD attached to the page in some format GoogleBot likes. Not hard to ingest.
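Pulling the JSON-LD back out really is only a few lines (a sketch using BeautifulSoup):

    import json
    from bs4 import BeautifulSoup

    def extract_json_ld(html):
        # Rich snippets live in <script type="application/ld+json"> blocks.
        soup = BeautifulSoup(html, "html.parser")
        blocks = soup.find_all("script", type="application/ld+json")
        return [json.loads(b.string) for b in blocks if b.string]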
Sites that don't have their act together for Google Shopping and SEO really are not worth scraping; they will never make it to the Google top 100 search results unless they are selling something that nobody else sells, e.g. 'Tibetan Monkey Stones', where you probably don't need to compete.
To me it sounds like these scraping concerns just need to pay a bit extra for ecommerce developers to show them how the 'puzzle was made' and to stop abusing people's business websites that are not built to be scraped on a daily basis by some random third party on the other side of the globe.
Also, the plain old telephone helps. If your brand-owning ecommerce team gets a call from an interested party saying that they would really like a list of their products for their comparison/whatever site, then they just might say yes, here is the URL for the feed, oh and here is the one for locale_en_xy. But people would prefer to hack away at some hacky spider rather than just pick up the phone and ask.
So I would think most scraping services assume they will be refused when doing such requests so they never bother.
The only valid reason for starting a large scraping operation is that you can argue permission would have been granted anyway.
Additionally, in my local market the owners of e-commerce websites are extremely narrow-minded and have zero tech education, so all they will ever hear from you is "I want to steal that guy's data", which is of course not true at all. But try and argue with a 50-year-old guy with the mindset of a feudal master who never truly worked in their life but wants to control how everybody around them works.
If the survival of my business was at stake, I would just scrape one page every 3 or so seconds as a reasonable compromise. In fact I have done so for my amateur scraping experiments, although there the timeout was even steeper -- 10 seconds per page.
As demeaning and offensive as many people would find that statement to be, I still found it to be the sad reality most of the time.
Plus my local community is much smaller, and I would not want vengeful businessmen who understand NOTHING of what I am trying to achieve to actively sabotage me. They can easily call my ISP and deny me service, for example.
So I opted for ethical scraping without asking questions. Seems to be the best working compromise.
Thanks for sharing your experience. Let's bathe in the confirmation bias it dips us in. :D
I did find it surprising that this article has a whole section on "Challenge 4: Anti-Bot Countermeasures" (and how to bypass them) but doesn't mention giving any consideration as to whether this is a reasonable way to behave.
I've played with Elixir as well, and it's also great for this type of thing.
proxycrawl.com looks very cool, I'm actually looking for a proxy service for my current scraping project. Are they also a good choice if you're doing lower tiers (like thousands of requests a day)?
In our case, competitors were scraping pricing data in order to competitively price their products without having to do the work.
So we just randomly started to give them incorrect prices on every few products. Not only did it make the whole data set useless, they had no way of figuring out which data was correct without manually checking, and since we didn't do it to everything and started at random intervals, it made it too difficult for them to figure out when their IP had actually been quietly blacklisted.
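Purely as an illustration of that kind of poisoning (the fraction and skew are made up):

    import random

    def price_for(ip, real_price, flagged_ips):
        # Flagged scrapers get the real price most of the time, but a silently
        # skewed one on a small random fraction of products.
        if ip in flagged_ips and random.random() < 0.05:
            return round(real_price * random.uniform(0.7, 1.3), 2)
        return real_price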
E.g. Amazon and Walmart both do a lot of their own scraping.
Every major ecommerce site scrapes, it would be a competitive disadvantage if they didn’t.
But once they lose a bunch of money the first time, they tend to stop trying. We tracked down one competitor that was mirroring our prices on an hourly basis. So we waited until late at night, tanked our price on a few expensive items, then placed orders on the competitor's site.
The human touch tends to scare off scrapers faster than a technological fence anyway.
After all, the web was built on accessibility of information, not on purposeful obfuscation.
If you go so far as to essentially flatten the webpage to the point where you might as well print it out and then do OCR on it, then you've thrown out the baby with the bathwater: you had all that information when you started. Or at least, you should have had it.
Otherwise we might as well kiss HTML goodbye and render the web as PDFs, with or without links.
Essentially that's what Diffbot (https://www.diffbot.com/) does, except we don't render the pages as an image or do OCR.
Diffbot renders the page in a headless browser, and uses computer vision to automatically identify the key page attributes and extract normalized data for specific page types (Articles, Products, Discussions, Profiles, Images, and Videos).
This approach enables us to work in any language, and on sites that we've never come across before, automatically and with better-than-human accuracy.
Answered on the parent, but it's somewhat similar.
There's a lot of very bad HTML out there.
This article made me realize I assumed wrong.
Can you provide an example of such service?
How effective are scraping countermeasures anyway?
A dedicated person will eventually work his way around all available counter-measures, though.
I disagree on this point. Starting with a single-threaded model allowed my team to scale quickly and with little additional overhead. What we lost in performance we gained in simplicity and developer productivity. That being said, tuning and porting portions of the app to a multi-threaded system is slotted to take place within the next year.
Start with single threaded and simple, move to multi-threaded scrapers when the juice is worth the squeeze.
I've written several very amateur scrapers in the last several years, and I am never going back to languages with a global interpreter lock, ever.
Now that I think about it, it's even less than 4 lines:
    from multiprocessing.pool import Pool  # or ThreadPool
    pool = Pool()
    results = pool.map(scrape_page, urls)  # scrape_page and urls: your function and URL list
However, when generating reports for example, try to use the same instrument for serializing 4 pages of DB records into 4 pieces of a big CSV file, each working on a single CPU core. There the languages without a GIL truly shine, and languages like Python and Ruby struggle unless their GIL implementations compromise and yield without waiting for an I/O operation to complete.
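A sketch of that CSV example using processes, which sidestep the GIL (the DB and row-formatting helpers are hypothetical):

    from multiprocessing.pool import Pool

    def dump_chunk(page):
        rows = fetch_db_page(page)                         # hypothetical: one page of DB records
        with open(f"report_part{page}.csv", "w") as f:
            f.writelines(format_csv_row(r) for r in rows)  # hypothetical row formatting

    if __name__ == "__main__":
        with Pool(4) as pool:
            pool.map(dump_chunk, [1, 2, 3, 4])             # each chunk runs on its own core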
Am I mistaken?
If you were to use multithreading instead, you would generally have a problem if you were doing non-I/O work.
It seems that now we are both on the same page. Single process & many threads are problematic for GIL languages and that's why I gave up using Ruby for scrapers. GIL languages can work very well for the URL downloading part though.
As for Elixir itself, here's a quick example:
    # Assume this contains 1000 URLs
    urls = [....]
    # Up to 100 concurrent processes; if max_concurrency is omitted it defaults to the
    # number of schedulers (CPU cores). For I/O-bound work it's safe to go much higher.
    results =
      urls
      |> Task.async_stream(&YourScrapingModule.your_scraping_function/1, max_concurrency: 100)
      |> Enum.to_list()
It's honestly that simple in Elixir. For finer-grained control the line count is a little bigger -- but only a little. Not hundreds of lines, for sure.
The better handling of malformed HTML by default is the much bigger deal.
Valuable info, thanks!
On the plus side there were some nice memory improvements for Meeseeks in OTP 21.
Don't let this sound patronizing, because it's not -- but have you looked at how many times the boundary between the BEAM and the Rust code is crossed? I haven't inspected Meeseeks' code so I can't say, just wildly guessing.
My ancient experience with Java <-> C++ bridges has taught me that if your higher-level language calls the lower-level language very often then the gains of using the lower-level language almost disappear due to the high overhead of constantly serializing data back and forth.
Anyhow, we should probably take this discussion to ElixirForum and not here. :)
(I am @dimitarvp there and almost everywhere else on the net, HN is one of the very few exceptions of inconsistent username for me).