
Lessons learned scraping 100B product pages - Ian_Kerins
https://blog.scrapinghub.com/web-scraping-at-scale-lessons-learned-scraping-100-billion-products-pages
======
apeace
> As most companies need to extract product data on a daily basis, waiting a
> couple days for your engineering team to fix any broken spiders isn’t an
> option. When these situations arise, Scrapinghub uses a machine learning
> based data extraction tool that we’ve developed as a fallback until the
> spider has been repaired.

I once worked on a spider that crawled article content and I ran into the same
problem. I always wanted to try the following solution to it but never had the
chance.

Assume you have a database of URLs and the fields you've scraped from them in
the past (title, author, date, etc). If you ever fail to scrape one of those
values from a _new_ URL, here's what you do:

- Go back to one of the old URLs where you already have the correct value
(let's say it's the title).

- Walk through the whole DOM until you find that known title. At each node
you will have to compare against its text content with child tags stripped,
to deal with titles like "Foo <span>Bar</span>" which you want to match
against "Foo Bar". So this is going to be an expensive search.

- Generate several possible selectors which match the node you walked to
(maybe you have ".title", ".title h2", ".content .top h2", etc).

- Test each new selector on several other already-crawled pages. If any of
the selectors works 100% of the time, there's your new selector (see the
sketch below).
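
In rough Python (lxml and the candidate-selector heuristic here are just
guesses at one way to do it):

    import lxml.html

    def candidate_selectors(html, known_value):
        # Walk the DOM and collect nodes whose flattened text matches the
        # known field value ("Foo <span>Bar</span>" compares as "Foo Bar").
        doc = lxml.html.fromstring(html)
        candidates = set()
        for node in doc.iter():
            if not isinstance(node.tag, str):
                continue  # skip comments and processing instructions
            if node.text_content().strip() == known_value:
                classes = node.get("class", "").split()
                if classes:
                    candidates.add("." + classes[0])
                    candidates.add(f"{node.tag}.{classes[0]}")
                parent = node.getparent()
                if parent is not None and parent.get("class"):
                    candidates.add(f".{parent.get('class').split()[0]} {node.tag}")
        return candidates

    def validate(selectors, known_pages):
        # known_pages: list of (html, expected_value) from already-crawled URLs.
        # Keep only selectors that recover the expected value on every page.
        good = []
        for sel in selectors:
            ok = True
            for html, expected in known_pages:
                hits = lxml.html.fromstring(html).cssselect(sel)  # needs the cssselect package
                if not hits or hits[0].text_content().strip() != expected:
                    ok = False
                    break
            if ok:
                good.append(sel)
        return good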

Any thoughts on whether something like this would work?

~~~
AznHisoka
That sounds like a very creative idea.

What I do is run a regression test every x minutes. If it fails, I set a
flag to save/store the HTML every time we crawl pages. Then we can go back
and process these saved pages once we fix our crawler.

~~~
w0rd-driven
I crawl a specific site, somewhere up to 50 unique URLs a day. I store both
the unparsed full HTML as a file and the JSON I'm looking for as a separate
file. The idea is that if something breaks, instead of taking the hit of
making the call again, I already have the data and can just reprocess it.
It's come in extremely handy when a site redesign changed the DOM and broke
the parser.

I do the same at $dayJob where I'm parsing results of an internal API.
Instead of making a call later that may not return the same data, I store
the JSON and just process that. Treating network requests as an expensive
operation, even though they aren't really, has helped me come up with ideas
I'd never had before. It's arguably premature optimization, considering I've
had something like a 0.000001% failure rate, but being able to replay that
one breakage made debugging an esoteric problem waaaaaay simpler than it
would've been otherwise.
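
For what it's worth, the storage side is only a few lines of Python (the
paths and hashing scheme are just how I'd do it, nothing clever):

    import hashlib
    import json
    import pathlib

    def save_page(url, html, parsed, out_dir="pages"):
        # Keep the raw HTML and the extracted JSON side by side, keyed by a
        # hash of the URL, so a broken parser can be re-run later without
        # hitting the network again.
        key = hashlib.sha1(url.encode()).hexdigest()
        base = pathlib.Path(out_dir)
        base.mkdir(parents=True, exist_ok=True)
        (base / (key + ".html")).write_text(html, encoding="utf-8")
        (base / (key + ".json")).write_text(json.dumps(parsed, indent=2), encoding="utf-8")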

~~~
pdimitar
Off-topic: I so wish I worked for a company where my work involved scraping,
storing, and analyzing data. :(

~~~
hoju
Now is a good time to work in this field, since data science is hot and
companies need web scrapers to provide the data for these models. At least
that has been my experience in finance. Try applying!

~~~
pdimitar
I have zero experience in data science though. I am a pretty solid and
experienced programmer and can learn it all but... don't know. Maybe I should
just try indeed.

Do you have any recommendations for places and/or interview practices?

------
ChuckMcM
Ah yes, Challenge 4 (anti-bot measures).

At Blekko I developed a number of ways to deal with people who tried to
scrape the web site for web results. The three most effective are
blackholing (your web site vanishes as far as these folks are concerned),
hang holding (basically using a crafted TCP/IP stack that does the syn/ack
sequence but then never sends data, so the client hangs forever), and data
poisoning (returning a web page that has the same format as the one they are
requesting but filling it with incorrect data).
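
The hang-holding effect can be approximated in user space with a plain
socket server that accepts the connection and then never sends a byte (ours
was done lower in the stack; this is just a toy sketch of the idea):

    import socket

    def tarpit(port=8080):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", port))
        srv.listen()
        held = []
        while True:
            conn, _addr = srv.accept()  # the handshake completes normally...
            held.append(conn)           # ...then the socket stays open and silent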

We had a couple of funny triggers of the anti-bot stuff during the run.
Once, a presenter on stage showed an example query and enough people in the
audience typed it into their phones/tablets/laptops, all from behind the
same router, that it looked like a bot. The other time, an entire country
was behind a single router and a school had all of its students making the
same sort of query at the same time (in both cases the trigger was a rapid
query rate for the exact same search query from a single address).

In Blekko's case, since bots either never clicked on ads or _always_ clicked
on the same ad (in both cases we got no revenue), keeping bot traffic off
the site was measurable in terms of income.

~~~
jd20
I thought the number one anti-bot measure was a cease and desist letter :)
Seriously though, some of these websites clearly don't want to be scraped.
What's stopping them from sending Scrapinghub a C&D letter and forcing them
to comply?

~~~
ChuckMcM
Sure, if you can reasonably assume it is them scraping you. As they point
out in the article, they invest in proxy networks to make their requests
appear to come from a bunch of addresses that don't lead back to them.

One of the things we learned at Blekko was that people who run botnets often
sell 'proxy service' as a thing; we identified several made up of users of
the Time Warner "Road Runner" service. That put us as a web site in a bind,
in that the proxy service running on an infected computer was violating our
terms of service but the user might be completely unaware. If they were also
a customer and we black holed their IP, it would also cut off legitimate
traffic. Since we didn't keep logs that could identify these relations over
time (privacy issues), we had to rely on other methods. We never got enough
penetration into the search market to make this a huge concern, however, so
the problem remained largely theoretical. We started a program of
exponential banning, where an IP would be banned and then an hour later
unbanned, and if it resumed its bad behavior, banned for 2 hours, then 4,
etc. Once you get to 1024 hours it is pretty safe to assume they are lawful
evil, as it were.
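
In toy Python, the policy is roughly this (not our actual implementation,
just the shape of it):

    import time

    class ExponentialBan:
        def __init__(self, base_hours=1, cap_hours=1024):
            self.base = base_hours * 3600
            self.cap = cap_hours * 3600
            self.strikes = {}       # ip -> number of offences so far
            self.banned_until = {}  # ip -> unix time when the ban lifts

        def is_banned(self, ip):
            return self.banned_until.get(ip, 0) > time.time()

        def record_offence(self, ip):
            # Each repeat offence doubles the ban: 1h, 2h, 4h, ... up to the cap.
            n = self.strikes.get(ip, 0)
            duration = min(self.base * (2 ** n), self.cap)
            self.strikes[ip] = n + 1
            self.banned_until[ip] = time.time() + duration
            return duration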

These guys fake their user agent, mask their IP addresses, and generally work
hard to defeat anti-bot measures. They know they are over the line, but the
law has yet to catch up to them.

~~~
jd20
In my experience, when it comes to scraping airline websites, the airline's
legal department usually doesn't wait for proof that you were the one who
actually scraped them. If you have their data on your website, they send you
a C&D, and if they continue to find their data on your website, they will
happily sue you. In other words, it doesn't matter how you got the data; if
you have their data, you must've broken the law.

I'm thinking of RyanAir suing Expedia, United vs wandr.me, Southwest suing
SWMonkey.com; I'm sure there are countless others.

~~~
bryanrasmussen
But if you have Amazon Mechanical Turk workers 'scraping' the data, is that illegal?

~~~
nl
It’s the reproduction of the data which they are suing over, not the method.

------
blattimwind
> Multi-threading is a must, when scraping at scale. The more concurrent
> requests your spiders can make the better your performance - simple.

Intuitively I would think that this sort of problem would profit from
asynchronous ingestion at the edge, pushing unprocessed contents to a multi-
threaded/multi-process backend. (Because I'd expect that network latencies
mean you need lots of threads to saturate I/O, which would conflict with
effectively using the available CPU power to do the actual document
processing.)

~~~
mgliwka
That's been exactly my experience. Most time is spent connecting or waiting
for the server response (TTFB). Using an async I/O event-loop approach in
combination with epoll/kqueue, you can handle thousands of concurrent
connections. You then push the responses to your worker nodes, which process
the data in a multi-threaded fashion. Stream processing frameworks like
Apache Spark or Storm work great for that.
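
A minimal sketch of that split in plain Python, with aiohttp on the event
loop and a process pool for the CPU-bound parsing (the parse function here
is just a placeholder):

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    import aiohttp

    def parse(html):
        # Placeholder for the CPU-heavy document processing step.
        return len(html)

    async def fetch(session, url):
        async with session.get(url) as resp:
            return await resp.text()

    async def crawl(urls):
        loop = asyncio.get_running_loop()
        async with aiohttp.ClientSession() as session:
            # Thousands of these can be in flight on a single event loop.
            pages = await asyncio.gather(*(fetch(session, u) for u in urls))
        with ProcessPoolExecutor() as pool:
            # Hand the downloaded documents to worker processes for parsing.
            return await asyncio.gather(
                *(loop.run_in_executor(pool, parse, p) for p in pages)
            )

    # asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))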

------
afandian
Can I piggy-back off this submission to ask HN: if you're running a scraper,
has the recent wave of GDPR splash screens caused you issues? How are you
dealing with them?
[https://news.ycombinator.com/item?id=17471599](https://news.ycombinator.com/item?id=17471599)

~~~
moltar
You can just remove them from the DOM or hide them with CSS.

~~~
afandian
What about e.g.
[http://discourseontheotter.tumblr.com/](http://discourseontheotter.tumblr.com/)
?

Edit: In the UK I see this:
[https://imgur.com/a/zlWOByh](https://imgur.com/a/zlWOByh)

~~~
jklein11
I don't see a GDPR challenge on this page?

~~~
afandian
That's probably because it's being inconsistently applied and you're not in
Europe. If that's the case, it's all the more insidious! Or you accepted the
Tumblr terms in the past.

This is what I get in the UK:
[https://imgur.com/a/zlWOByh](https://imgur.com/a/zlWOByh)

~~~
YouKnowBetter
:s/Europe/EU/

I'm in Switzerland (Europe) and don't get the
[https://imgur.com/a/zlWOByh](https://imgur.com/a/zlWOByh) splash screen.

------
ainiriand
Just by chance, we experienced a scraper bot on our site this past week and
discovered some performance problems thanks to it. It literally fried our
ancient caching system, and we finally took the step of using a CDN for
static delivery and Redis for API responses. I wonder if it was these guys,
because it was some solid scraping.

~~~
jacquesm
Badly behaved scrapers should be blocked, not accommodated.

~~~
greenyouse
I'd agree. As somebody scraping content, what's so bad about increasing the
delay between requests to something like 10 seconds? That way the servers
can handle the traffic easily and you're not being a jerk. If you have one
async thread for each domain you can still get lots of data quickly. Causing
a denial of service is entirely avoidable.

~~~
jacquesm
And you're going to be hitting millions of hosts anyway, so all you have to
do is rotate from one host to the next and randomize your worker queues. It
might take a little longer, but you will not blow up someone's aging server.
Being a good citizen of the net means taking into account that even if you
have gigabits of bandwidth to burn, the counterparty may not (and could
easily be on the sharp end of a bandwidth-capped contract).
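
A small sketch of what that looks like in practice: shuffle the queue so
consecutive requests land on different hosts, and never hit the same host
more often than once per delay interval (the 10-second figure is just an
example):

    import random
    import time
    from urllib.parse import urlparse

    def polite_order(urls, per_host_delay=10.0):
        # Yield URLs in randomized order, sleeping only when the next URL
        # would hit a host we contacted less than per_host_delay seconds ago.
        last_hit = {}
        queue = list(urls)
        random.shuffle(queue)
        for url in queue:
            host = urlparse(url).netloc
            wait = last_hit.get(host, 0) + per_host_delay - time.time()
            if wait > 0:
                time.sleep(wait)
            last_hit[host] = time.time()
            yield url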

------
jjeaff
As a side note, I have had quite a bit of experience trying to block
automated scraping services, and I found that the best way is to quietly
detect scraping and then serve up tainted data.

In our case, competitors were scraping pricing data in order to competitively
price their products without having to do the work.

So we would randomly start giving them incorrect prices on every few
products. Not only did it make the whole data set useless, they had no way
of figuring out which data was correct without checking manually, and since
we didn't do it to everything and started at random intervals, it was too
difficult for them to figure out when their IP had actually been quietly
blacklisted.
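
Conceptually, the serving side is as simple as this (a toy illustration; the
detection logic and the skew range are where the judgment calls are):

    import random

    def price_for(true_price, is_flagged_scraper):
        # Normal visitors always get the real price. Flagged scrapers get a
        # plausibly-wrong price on a random subset of products, so the whole
        # dataset becomes untrustworthy without being obviously broken.
        if is_flagged_scraper and random.random() < 0.1:
            return round(true_price * random.uniform(0.9, 1.1), 2)
        return true_price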

~~~
ikeboy
What's ironic is that most of the sites with anti-scraping protection also
do scraping of their own.

E.g. Amazon and Walmart both do a lot of their own scraping.

~~~
Doctor_Fegg
Really going to call for a [citation needed] on that "most"!

~~~
ikeboy
Maybe I should rephrase to “put the most effort into anti-scraping”.

Every major ecommerce site scrapes; it would be a competitive disadvantage
if they didn’t.

------
taitems
I hope you too got a chuckle from reading their anti-bot countermeasures
section, only to see their own form protected by Google’s “I’m not a robot”
CAPTCHA plugin.

------
fareesh
I've always wondered if it makes more sense to render the page as a JPEG and
run some kind of machine learning to identify and read off the relevant
details.

~~~
jacquesm
I'd go the other way and say that pages that no longer contain relevant
information in a normally digestible format should be dropped from search
engines and other automated indices.

After all, the web was built on accessibility of information, not on
purposeful obfuscation.

If you go so far as to essentially flatten the webpage to the point where you
might as well print it out and then do OCR on it then you've thrown out the
baby with the bathwater, you _had_ all that information when you started. Or
at least, you should have had it.

Otherwise we might as well kiss HTML goodbye and render the web as PDFs,
with or without links.

~~~
zzzcpan
The biggest search engine doesn't have your best interests at heart and has
been trying to make HTML and the accessibility of information obsolete for
years. Some pages now render only with JavaScript, or require solving a
JavaScript challenge to even get to the rendering (hello, Cloudflare), and
have essentially kissed HTML goodbye.

------
cmjqol
I always assumed web scraping wasn't particularly challenging, given how
many libraries exist for this purpose.

This article made me realize I assumed wrong.

~~~
mipmap04
It gets especially difficult with dynamic content, or when trying to scrape
sites written on very heavy frameworks like ASP.NET WebForms that require
passing the view state with every request. I made a calendar aggregator for
adult hockey times in my area that scrapes rink websites[0], and it was far
more difficult than I had thought it would be because the rinks all used
Telerik WebForms controls for their calendars. It turned a 30-minute job
into a 2-hour job.

[0]:
[http://dpscschedule.azurewebsites.net/](http://dpscschedule.azurewebsites.net/)
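
For anyone curious, the WebForms round trip looks roughly like this in
Python with requests and BeautifulSoup (the exact set of hidden and custom
form fields varies per page):

    import requests
    from bs4 import BeautifulSoup

    def post_back(url, extra_fields):
        session = requests.Session()
        soup = BeautifulSoup(session.get(url).text, "html.parser")
        # WebForms rejects POSTs that don't echo these hidden fields back.
        form = {
            "__VIEWSTATE": soup.find(id="__VIEWSTATE")["value"],
            "__EVENTVALIDATION": soup.find(id="__EVENTVALIDATION")["value"],
        }
        form.update(extra_fields)
        return session.post(url, data=form).text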

------
polskibus
Wouldn't such scraping fall under the TOS of the scraped sites?

~~~
merinowool
A bot cannot consent to or understand a TOS...

~~~
jd20
What if the site requires you to create an account and log in? Can the bot
create its own account, and still claim not to consent?

~~~
merinowool
Is the bot aware of what it is doing? I don't think so.

------
baxtr
I had to stop reading the otherwise interesting-sounding piece when a "Get
the Enterprise Web Scraping Guide" box popped up in the bottom-left corner
(around the second paragraph). Maybe I'll give it a second chance later.

~~~
moltar
It was a pretty thin, self-promoting post. You didn’t miss much.

------
Exuma
> However, our recommendation is to go with a proxy provider who can provide a
> single endpoint for proxy configuration and hide all the complexities of
> managing your proxies.

Can you provide an example of such a service?

Thanks!

~~~
danni
I run [https://www.scraperapi.com](https://www.scraperapi.com) which does
this!

------
misterbowfinger
> A large proportion of these bot countermeasures use javascript to determine
> if the request is coming from a crawler or a human (Javascript engine
> checks, font enumeration, WebGL and Canvas, etc.).

How effective are scraping countermeasures anyway?

~~~
jd20
They work pretty well against any scraper that's not using an actual browser
with a JavaScript engine. It keeps the riff-raff out.

A dedicated person will eventually work his way around all available
countermeasures, though.
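
For what it's worth, "using an actual browser" is only a few lines these
days; e.g. with Playwright in Python (a sketch, assuming the target isn't
also fingerprinting headless browsers):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/some-product")
        html = page.content()  # DOM after JavaScript has executed
        browser.close()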

------
livando
"Multi-threading is a must, when scraping at scale."

I disagree on this point. Starting with a single threaded model allowed my
team to scale quickly and with little additional overhead. What we have lost
with performance we gained in simplicity and developer productivity. That
being said tuning and porting portions of the app to a multi-threaded system
is slotted to take place within the next year.

Start with single threaded and simple, move to multi-threaded scrapers when
the juice is worth the squeeze.

~~~
pdimitar
Or use a language where fully utilizing all CPU cores is transparent, like
Elixir? There's zero complexity; you basically add 4-5 lines of code and
that's it. Honestly, not exaggerating.

I've done several very amateur scrapers in the last several years, and I am
never going back to languages with a global interpreter lock, ever.

~~~
iooi
I'm assuming you're talking about Python, which is also "4-5 lines" to use
multithreading or multiprocessing. Can you explain what's wrong with GIL
languages?

Now that I think about it, it's even less than 4 lines:

    from multiprocessing.pool import Pool  # or ThreadPool

    pool = Pool()
    pool.map(scrape, urls)

~~~
pdimitar
When the pooled functions are I/O bound, the GIL is not a problem. Any GIL
language will do.

However, when generating reports, for example, try using the same instrument
to serialize 4 pages of DB records into 4 pieces of a big CSV file, each
working on a single CPU core. That's where languages without a GIL truly
shine, and languages like Python and Ruby struggle unless their GIL
implementations compromise and yield without waiting for an I/O operation to
complete.

~~~
iooi
I'm not sure you understand how the GIL works in Python. If you're using
multiprocessing, there's no locking across the code executing on each core.
Also, if you're writing to the same file from four processes, you're going to
need locking.

~~~
pdimitar
My last understanding was that GIL languages work well in multicore
scenarios as long as all N tasks have I/O calls that serve as yielding
points for the interpreter, and that they do not use preemptive scheduling
like the BEAM VM (Erlang, Elixir, LFE, Alpaca) does.

Am I mistaken?

~~~
iooi
As far as Python goes, yes. Multicore implies multiple processes, which
means that each process will have its own Python interpreter, each with its
own GIL.

If you were to use multithreading instead, you would generally have a
problem if you were doing non-I/O work.
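
To make it concrete (a toy example; the URLs and the parse stand-in are
placeholders):

    from multiprocessing.pool import Pool, ThreadPool
    from urllib.request import urlopen

    def download(url):
        return urlopen(url).read()   # I/O-bound: the GIL is released while waiting

    def parse(html):
        return sum(html)             # stand-in for CPU-bound work

    if __name__ == "__main__":
        urls = ["https://example.com"] * 4
        with ThreadPool(4) as tp:    # threads are fine for the I/O-bound part
            pages = tp.map(download, urls)
        with Pool(4) as pp:          # processes: one interpreter (and GIL) each
            sizes = pp.map(parse, pages)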

~~~
pdimitar
Then I think we have a misunderstanding of terms. To me "multicore" ==
"single process, many threads". Apologies for the confusion.

It seems we are now both on the same page. A single process with many
threads is problematic for GIL languages, and that's why I gave up using
Ruby for scrapers. GIL languages can work very well for the URL-downloading
part, though.

------
rosha
I tried several different queue systems; the best version I got uses an
Erlang queue, Elixir, and Kafka on top for a highly concurrent crawler. The
project was to develop a real-time Amazon product ASIN price-monitoring
system for our company as a challenger prototype. Our main problem was
basically proxies: we stopped buying them, as managing thousands of proxies
is a huge effort that we did not want to take on, and a lack of data means
our Hadoop clusters get thirsty and the machines stop learning properly.
Currently we are using a third party,
[https://proxycrawl.com](https://proxycrawl.com), on very high tiers (> 10B)
with a great discount, and we are happy to have that part solved. Another
lesson learned: sometimes things fail and logs help a lot, so you will need
a highly available logging and monitoring system.

~~~
jd20
Back when I worked for a very large tech company, building their web crawler,
I had good success with Golang. On four servers, with 10 GigE interconnect and
SSD, and a very fast pipe to the Internet, I was able to push about 10K pages
/ second sustained. At any given time, there were probably several million
connections open concurrently.

I've played with Elixir as well, and it's also great for this type of thing.

proxycrawl.com looks very cool; I'm actually looking for a proxy service for
my current scraping project. Are they also a good choice if you're doing
lower tiers (like thousands of requests a day)?

~~~
rosha
Golang is a good choice too, but in my experience it's nothing compared to
what you can do with an Erlang queue and Elixir. Regarding your question
about proxycrawl, I honestly do not know; I tested the service for a few
days at a few million requests per day and it was great too. I would say
they are good for very high volume. We are still using it, so that should be
a good signal to try them.

