
How would you build an internet scale web crawler? - JustinGarrison
What techniques would you use to build an internet-scale web crawler/search engine?

* How would you schedule and manage the crawlers?

* How would you collect and parse incoming data?

* What would you use to store the data?

* How would you process and draw meaning from the data?

* What information would be important to build a pagerank system for searching?

* What other major components are missing?
======
tgamba
I worked at Alexa, then affiliated with the Internet Archive, on exactly
this, in the late 90s. We built our own server farm to crawl, process and
store the data. We had 30TB of storage, holding three "snapshots" of the web,
and thought we were pretty hot stuff. That would sit on your desktop today.

Crawling was the easy part. We had two processes of up to 40 threads each
bringing the data down. Even this we had to throttle because we would use the
bandwidth for the entire office, then based in the Presidio.

Processing the data was the bottleneck. Parsing, extracting and pushing to the
database took months sometimes and the system broke down frequently. I was
online 24/7 maintaining this system and it put me off working for startups
forever.

All of the software, from the crawler to the parsers to the database system,
was built in-house -- there was nothing out there to handle data of that scale
at the time.

Our biggest concerns at that time were getting the cleanest data possible
without duplicate pages, and being able to retrieve that data as fast as
possible for real-time analysis. The engineers at Alexa produced some
remarkable solutions to these problems.

Alexa's plugin gave us real time information on what people were actually
looking at, and combining that with the crawl data, we could have built
PageRank. Alexa could have been Google, but went in another direction. We were
acquired by Amazon in 1999.

To do this today would be an entirely different problem. The dynamic nature of
the web, single-page apps, the orders of magnitude of scale: only the largest
companies could take it on from scratch.

However, you could build a simple system at home that could probably yield a
few billion pages, process those, get user logs from some big routing point,
and build a mini-Google.

~~~
ismail
"Alexa could have been Google, but went in another direction."

Curious, why did they go in another direction?

~~~
tgamba
Amazon's stated reason for acquiring Alexa was to use Alexa's technology to
build a recommendation engine. Search was never a priority for Alexa itself.
We were acquired for $100 million, so it was take the money and run.

------
bryanrasmussen
For PageRank you are going to need a system for large-scale graph processing,
a la Pregel. Neo4j has an algorithms module with PageRank built in, but I
don't think it will scale large enough.

On the personal projects I'm working on right now, I've decided to do as
follows:

Crawl and build a graph/index of links as the initial data, to get an overview
of the site(s). Secondary crawlers might be sent out to get extra data if
needed (generally when I'm using some library that analyzes the DOM
in-browser). If I just need to analyze the whole DOM with my own library, I
send that DOM off for analysis right away while crawling on to the next links.

Collecting and parsing incoming data really depends on what you want to do.
For stuff that will need access to the whole HTML of the page, I'm passing
that HTML, wrapped in an object, to a Resque queue. Crawling the page is going
to need a JavaScript-aware crawler given the current state of the web. One
thing I generally do is wait for links to be rendered: count the number of
links on the page, and if there are fewer than I think the page should have,
wait a few more seconds for the page to be rendered by React, Angular,
whatever, and then analyze the page again (see the sketch below).
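
A minimal sketch of that link-count heuristic, assuming Playwright as the
JS-aware browser driver (the commenter doesn't name one); MIN_EXPECTED_LINKS
and the extra wait are hypothetical tuning knobs:

    # Wait-for-render heuristic: if a page has suspiciously few links,
    # assume client-side rendering isn't finished and re-check later.
    import time
    from playwright.sync_api import sync_playwright

    MIN_EXPECTED_LINKS = 10   # hypothetical "page looks rendered" threshold
    EXTRA_WAIT_SECONDS = 5    # extra time for React/Angular/etc. to render

    def extract_links(url):
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)
            links = page.eval_on_selector_all(
                "a[href]", "els => els.map(e => e.href)")
            if len(links) < MIN_EXPECTED_LINKS:
                # Possibly still rendering client-side: wait, then re-check.
                time.sleep(EXTRA_WAIT_SECONDS)
                links = page.eval_on_selector_all(
                    "a[href]", "els => els.map(e => e.href)")
            browser.close()
            return links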

Storing data is a decision about money, how much time you have, and probably
what your competencies are.

However, what I am doing is not an internet-scale operation; it is at best
"Customer/Project X uses Crawler Y to crawl Site Z" scale.

------
asibiryakov
Disclaimer: I’m a creator of the Frontera crawl frontier framework, so the
view below may be biased accordingly.

It’s a tough question to answer in general; the answer differs significantly
depending on the business goals of the system. But I’ll assume the goal is to
acquire content at least once and re-crawl it from time to time, something
like what Common Crawl does. I’ll start by defining the problem and then
proceed to a possible solution.

My estimate of the world's total count of web hosts is ~500M, and there are at
least 900M registered domain names. There could be big websites with millions
of pages, or small ones with only a few. Let’s assume we want to create a
bird's-eye-view collection and will crawl no more than 100 pages per website.
Therefore the target is to crawl a maximum of ~50B pages. Some of the
registrars keep their zone files private, and the web is not perfectly
interlinked, so there is no way to discover all the hosts in general. But
let’s limit the problem to 50B pages, just to make further design easier.

The data volume can be estimated using an average uncompressed page size of
60KB. Overall, it turns out to be 3PB of data uncompressed and 1PB compressed
with Snappy.

The next thing to think about is the time we would like to spend acquiring
this content. Let’s say we’re poor and fine with just 6 months. That means
crawling ~280M pages daily, ~12M per hour and ~3240 per second. The overall
throughput is expected to be ~190MB per second, so a 2-4Gbit network
connection should be enough for our cluster.
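
A quick back-of-the-envelope check of these numbers (a sketch in Python;
units are decimal, 1KB = 1e3 bytes, 1PB = 1e15 bytes):

    PAGES = 50e9
    AVG_PAGE_BYTES = 60e3
    CRAWL_DAYS = 180  # ~6 months

    total_bytes = PAGES * AVG_PAGE_BYTES       # 3.0e15 -> ~3PB uncompressed
    pages_per_day = PAGES / CRAWL_DAYS         # ~278M pages/day
    pages_per_sec = pages_per_day / 86400      # ~3215 pages/sec
    bytes_per_sec = pages_per_sec * AVG_PAGE_BYTES  # ~193MB/s, ~1.5Gbit/s

    print(f"{total_bytes / 1e15:.1f} PB, {pages_per_day / 1e6:.0f}M pages/day, "
          f"{pages_per_sec:.0f} pages/s, {bytes_per_sec / 1e6:.0f} MB/s")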

Alright, so that's what our engineering problem looks like. Let’s move on to a
possible solution and answer the questions stated above.

There are not that many open-source crawlers capable of doing large-scale
crawls, but to name a few: Apache Nutch, Heritrix and Frontera/Scrapy. In this
case I’ll stick with the latter. It will require ~160 fetchers running in
parallel and 80 strategy workers running the crawling strategy code. At 1 core
per process, plus some overhead for monitoring, links DB storage and queue
operation, that comes to ~280 cores, or roughly 7 modern machines with 48
cores each. And this is only the crawler part: Frontera requires Apache Kafka
and HBase to operate, so it’s going to be around 10-12 more machines to run
those services.

Q: How would you schedule and manage the crawlers? I would say there are two
common ways to do this. Either your crawler operates in batches (crawl a
batch, stop, parse, extract links, crawl the next batch, and so on), or the
crawler is online: it crawls, parses, extracts and schedules links without
stopping. The latter is usually faster. Batch operation is implemented in
Nutch, using command-line calls; online operation is what Heritrix,
StormCrawler and Frontera/Scrapy do.
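
A minimal sketch of the two modes, assuming hypothetical fetch(), parse() and
extract_links() stand-ins for the real components:

    from collections import deque

    def fetch(url):          # stand-in: download the page
        return "<html>...</html>"

    def parse(page):         # stand-in: parse the HTML
        return page

    def extract_links(doc):  # stand-in: pull new URLs out of the document
        return []

    def batch_crawl(seed_urls, batch_size=1000):
        # Nutch-style: fetch a whole batch, stop, parse, extract, repeat.
        frontier = deque(seed_urls)
        while frontier:
            batch = [frontier.popleft()
                     for _ in range(min(batch_size, len(frontier)))]
            pages = [fetch(url) for url in batch]   # crawl the whole batch
            for page in pages:                      # then parse and extract
                frontier.extend(extract_links(parse(page)))

    def online_crawl(seed_urls):
        # Heritrix/Frontera-style: crawl, parse, extract and schedule
        # continuously, with no stops between batches.
        frontier = deque(seed_urls)
        while frontier:
            url = frontier.popleft()
            frontier.extend(extract_links(parse(fetch(url))))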

The modern way to deploy this is to run the crawler processes in a container farm.

Q: How would you collect and parse incoming data? It depends on the purpose.
Online processing is getting popular these days; have a look at Apache Kafka
and Storm. There are many open-source parsers available for HTML and other
document types.
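
One possible shape for the online path, a sketch assuming lxml for parsing
and the kafka-python client; the broker address and topic name are
hypothetical:

    import json
    import lxml.html
    from kafka import KafkaProducer  # kafka-python client

    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",  # hypothetical broker
        value_serializer=lambda v: json.dumps(v).encode())

    def handle_page(url, html):
        # Parse the fetched HTML, extract absolute links and push them
        # to a Kafka topic for downstream (online) processing.
        doc = lxml.html.fromstring(html)
        doc.make_links_absolute(url)
        links = [link for _, _, link, _ in doc.iterlinks()]
        producer.send("extracted-links", {"url": url, "links": links})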

Q: What would you use to store the data? The main issue here is the way you’re
going to access the data. If it’s ad-hoc random access to documents plus the
occasional full scan, then Apache HBase or another column-oriented store is
the way to go. If you’re going to perform full scans only (for example, to
collect some stats or build an index), then Apache Kafka or HDFS can be used
as storage.
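
For the HBase option, a sketch using the happybase client, with pages keyed by
reversed domain so a site's pages sort together; the table and column-family
names are hypothetical:

    from urllib.parse import urlparse
    import happybase

    def row_key(url):
        # "http://news.example.com/a" -> b"com.example.news/a"
        parts = urlparse(url)
        host = ".".join(reversed(parts.netloc.split(".")))
        return (host + parts.path).encode()

    connection = happybase.Connection("hbase-thrift-host")  # hypothetical
    table = connection.table("pages")                       # hypothetical
    table.put(row_key("http://news.example.com/a"),
              {b"content:raw": b"<html>...</html>",
               b"meta:fetched_at": b"2018-01-01T00:00:00Z"})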

Q: How would you process and draw meaning from the data? In the general case
it’s Solr or Elasticsearch. If something more specific is needed, it could be
sampling with a linear scan, with Apache Spark applied to the sample. Again,
running various workers in Docker containers and processing the content online
is very popular these days.

Q: What information would be important to build a pagerank system for
searching? PageRank requires only the link graph; you would extract it from
the HTML content after crawling. You could use Apache Giraph or Spark GraphX
to do the computations. These days PageRank is abused for many commercial
purposes, so antispam filtering will be needed to clean up the graph.
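
To make "requires only the link graph" concrete, here's a toy-scale
power-iteration sketch of PageRank over an in-memory graph; at 50B pages you'd
use Giraph or GraphX instead, as noted above:

    def pagerank(links, damping=0.85, iters=20):
        # links: {page: [outlinked pages]}; dangling pages simply leak
        # rank in this simplified version.
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new = {p: (1 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                if outlinks:
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        if target in new:
                            new[target] += share
            rank = new
        return rank

    print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))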

Q: What other major components are missing? A rendering engine: many websites
nowadays require JavaScript execution and limit link discovery for
non-browsers. Deduplication: sometimes a website serves the same content under
different URLs, and stealing content on the web is quite common, so
deduplication is needed if you’re going to build a search engine. Monitoring:
the crawl will take a long time, so it’s important to monitor the system's
performance (aggregate current crawl speed, queue contents, etc.) and health.
Orchestration: your crawler is going to be a multiprocess application, and
it’s important to be able to quickly and reliably start, stop and check the
status of all the processes.

Good luck.

~~~
technokrat233
Having worked with/on two large-scale web crawling systems at FAST Search &
Transfer (the C-Crawler that powered alltheweb.com, now owned by Yahoo, and a
Python-based crawler that was a workhorse for FAST ESP/Scirus.com/etc., now
owned by Microsoft), I'd say this run-down is pretty good. Some things I'd
add:

- Link traps. If you limit yourself to 100 pages per site this is not as big
an issue, but if you want to go deeper you need a way to detect when a site is
generating garbage.

- Near-duplicate detection. There are lots of sites, as you mention, that
republish the content of others, but some present it in different ways with
different headers, timestamps, etc. (see the sketch after this list).

- Content/metadata detection and extraction. Once a page is crawled you want
to do something with it, and detecting the actual content is non-trivial if
you don't want headers/ads/etc.

- How do you handle non-HTML content (PDF, Docs, etc.)?

- How do you handle large content (sample, truncate, ignore)?
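
On near-duplicate detection: a sketch of SimHash, one common technique for it
(the commenter doesn't say what FAST used); fingerprints of near-duplicate
pages differ in only a few bits:

    import hashlib

    def simhash(text, bits=64):
        # Weighted bit-vote over token hashes; the sign of each vote
        # becomes one bit of the fingerprint.
        vector = [0] * bits
        for token in text.lower().split():
            h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
            for i in range(bits):
                vector[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if vector[i] > 0)

    def hamming_distance(a, b):
        return bin(a ^ b).count("1")

    # Small distance -> likely near-duplicates.
    print(hamming_distance(simhash("breaking news: market rallies today"),
                           simhash("breaking news: markets rally today")))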

~~~
hcoyote
I used to work at a vertically-focused web search engine and ran the
operational side of the crawler.

Also missing from this discussion is a mechanism to rate limit the crawl, and
to determine adequate rate limits based on your error rates.

Related: detecting that you've been blocked, and backing off so as not to
further hammer the site you're crawling with requests.

IP management is an issue here as well: lots of places just carte blanche
block whole ranges from crawling activity. And will you be honoring robots.txt
or not?
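
A rough sketch of per-host politeness along those lines: robots.txt checks via
the standard-library parser, plus error-driven backoff (the delay bounds and
user-agent string are hypothetical):

    import time
    import urllib.robotparser

    class HostPolicy:
        def __init__(self, host, base_delay=1.0):
            self.delay = base_delay
            self.robots = urllib.robotparser.RobotFileParser(
                f"http://{host}/robots.txt")
            self.robots.read()

        def allowed(self, url):
            return self.robots.can_fetch("MyCrawler", url)  # hypothetical UA

        def record_result(self, ok):
            if ok:
                self.delay = max(1.0, self.delay / 2)    # recover slowly
            else:
                self.delay = min(300.0, self.delay * 2)  # back off on errors

        def wait(self):
            time.sleep(self.delay)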

Be prepared for people to block you in new and stupid ways: we once got
blocked from even hitting a site's name servers to do lookups against them.
They blackholed our packets. So what should have been a ~500ms DNS query at
each HTTP request turned into a 15s pause while the DNS request timed out ...
eventually this stacked up across all threads, backing the overall crawling
infrastructure up into deadlock.
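
One defensive sketch against exactly that failure mode: bound DNS lookups with
your own budget so a blackholed name server can't stall crawler threads for
the full OS timeout (the 1-second budget and pool size are hypothetical):

    import socket
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    _dns_pool = ThreadPoolExecutor(max_workers=32)  # hypothetical size

    def resolve(host, timeout=1.0):
        # The pool thread may still block until the OS gives up, but the
        # crawler thread moves on after `timeout` instead of piling up.
        future = _dns_pool.submit(socket.gethostbyname, host)
        try:
            return future.result(timeout=timeout)
        except (TimeoutError, socket.gaierror):
            return None  # treat as blocked: skip the URL or retry later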

The Wayback Machine architecture is probably a good, public implementation of
a large scale crawling mechanism. This post[1] about it may be a bit dated,
but it's probably still accurate.

[1] http://highscalability.com/blog/2014/5/19/a-short-on-how-the-wayback-machine-stores-more-pages-than-st.html

------
saran945
What other major components are missing?

- Autosuggestion

- Web block detection and removal of headers, footers, sidebars

- Handling of spam content

- Learning to rank

------
ThePhysicist
Concerning the crawling aspect of this, Michael Nielsen wrote a great article
about how to download/crawl a significant part of the Internet in 40 hours:

http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/

It's a bit dated (from 2012) but probably still relevant.

------
fghtr
You can look at the crawler of YaCy, a free, open-source, decentralized web
search engine:

[http://yacy.net](http://yacy.net)

------
xstartup
One of the major problems is removing duplicate or near-duplicate content:
images, text, etc.

------
usgroup
Have a look at Common Crawl. They are well documented.

------
leowoo91
IaaS + a lot of will

