
How to crawl a quarter billion webpages in 40 hours (2012) - _ao789
http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/
======
secondtimeuse
This was written in 2012. It's even easier these days using SQS and
CloudFormation. 250 million is a small number; you're better off first going
through Common Crawl and then using data from those crawls to build a better
seed list.

Common Crawl now contains repeated crawls conducted every few months, as well
as URLs donated by blekko.

[https://groups.google.com/forum/m/#!msg/common-crawl/zexccXg...](https://groups.google.com/forum/m/#!msg/common-crawl/zexccXgwg4w/oV8qeJnawJUJ)
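
A rough sketch of what "going through Common Crawl first" can look like, using
the public URL index to pull already-known URLs for a domain as seeds (the
crawl label "CC-MAIN-2016-07" and the endpoint layout are assumptions; check
index.commoncrawl.org for current collections):

    import json
    import urllib.request

    # Assumed index collection; pick a current one from index.commoncrawl.org.
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2016-07-index"

    def seed_urls(domain):
        """Yield URLs Common Crawl already knows about for a domain."""
        query = f"{INDEX}?url={domain}/*&output=json"
        with urllib.request.urlopen(query) as resp:
            for line in resp:  # one JSON record per line
                record = json.loads(line)
                if record.get("status") == "200":
                    yield record["url"]

    for url in seed_urls("example.com"):
        print(url)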

~~~
mei0Iesh
Common Crawl contains the HTML?

I wonder how this is legal and considered acceptable. I wish I knew how even
Google and others get away with scraping content, saving it, and utilizing it
for profit without sharing any revenue with the original webmasters.

I know people can opt out of crawling, for those that actually respect that.
But still, am I the only one who feels like this is wrong?

I guess I have this view that your domain is yours, and you invite the public
in like an open house. It's my house, my property, and the door is open;
people can come in and look around at my stuff. But the expectation is that
only locals will arrive, in small numbers, and they'll be good guests. If
someone is breaking the lock on the bedroom door and going through the private
drawers, that's wrong. If someone is taking photographs of everything to then
create a virtual tour of my house that they charge for, that's wrong. The
expectation is that you're being nice by providing free and open access to
information you created and own, and people should behave courteously in
return.

Then if you as the webmaster choose, you can provide an API, or database dumps
for people to download, along with the licensing terms. That is when it feels
right for people to do things like this with the data, because you
intentionally provided it through a non-personal interface.

To me the web is still a personal interface. I expect humans to use it, in an
ordinary human-like way where it is somewhat ephemeral and courteous. I feel
like Google cheated their way to success, and Common Crawl is stealing to
raise their position in a similarly unfair manner.

These all seem like parasites to me. They didn't create anything, they just
steal it en masse.

There are so many businesses like that, such as DomainTools, which gets rich
by hoarding everyone's contact details from WHOIS:
[http://whois.domaintools.com/commoncrawl.org](http://whois.domaintools.com/commoncrawl.org)

They have a screenshot history they won't ever delete even if you ask nicely.
Here is a picture of Common Crawl from 2011:
[http://thumbnails.domaintools.com/domaintools/2016-01-08T19:...](http://thumbnails.domaintools.com/domaintools/2016-01-08T19:20:04.000Z/juwC52sCoU2vfhzE8XBhOgzFINQ=/commoncrawl.org/fullsize/39bc3b10013e1f43ccfec67ed654dbd3/1322786356.jpg)

~~~
jahewson
> I wonder how this is legal and considered acceptable. I wish I knew how even
> Google and others get away with scraping content [...]

Well, here's the answer: "transformative" reuse of content is explicitly
permitted under copyright law. Simply reproducing the content and charging for
it would not fall under this provision, but building an archive of publicly
available information is, quite appropriately, permissible.

There was recently a very large court case regarding this principle and its
application to Google Books. Google won by demonstrating that their search
index is not equivalent to, and does not affect the market for, the original
work: a "transformative" use.

Sharing is good. Publicly available works achieve their aims only by being
consumed by others - anyone who publishes a work free of charge should expect
it to be, and remain, publicly accessible.

~~~
mei0Iesh
I don't think "Sharing is good" is true in the real world. If you apply that
as a blanket statement, you'll end up in trouble.

What is legal is not always ethical. I think there's an interesting story
there about how Google is legal, if someone doesn't automatically assume it
should be just because it is.

Text online isn't always similar to a published text of the past. There is a
personal overlap today that changes the rules. Take this text I'm publishing
right now. Forgetting about all the legalities and technicalities, I still
feel like it is different from a page published in a book. I still feel like I
should have the power to edit or delete it whenever I want in the future, yet
Hacker News disagrees and removes my right to modify it, forever capturing it
as if it owns it, not me. I still feel like this text is more transitory, its
relevance mostly right now; if it were deleted in a month it would be fine,
because it's mostly just chit-chat.

Certainly we could live in a world where everyone has microphones transcribing
everything they ever say, which is transmitted to Google and provided to
researchers, where all kinds of uses could emerge. But that's a different
world than the one we've developed rules for today. Right now, I feel like
most things I say are said in passing, and should not only disappear, but also
not spread, with someone capturing and propagating them beyond my control.

What control do I have over my text that is in this Common Crawl database?
What if it captured information that was considered ephemeral in the website's
context and ripped it out of its home, so that it's now part of this
collective publication where anyone can use it for anything?

Sharing could be good in a world where people are not selfish and malicious.
But in this one, many people will use whatever data they can get their hands
on for selfish and malicious purposes that do not benefit you, the author, in
any way. I bet a large percentage of the use of that Common Crawl database was
harmful to society, such as helping spammers generate fake content.

~~~
effie
> "These all seem like parasites to me."

Your impression is wrong. Search engines and other services based on web data
provide great value to society. They don't create the documents they link to,
but they deliver relevant links for people's queries. That's a great service.
Without search engines, people might never find the web page at all. That's
why a large portion of website owners and webmasters are glad that search
engine crawlers visit them, and even expect indexing into their databases to
be fast and smooth.

If you publish anything on your website, you're facilitating its free use and
duplication across the whole world. If that was not your intention but you
still published your content, you misunderstood the original intent and
reality of the Web as a medium for sharing information.

There is a widely known standard of communication between robots and websites,
the robots.txt standard. It is a file in which you can state your intent to
restrict crawler downloads. There is also an HTML tag, <meta name="robots"
content="noindex,nofollow">, that signals to crawlers your wish that the page
not appear in search engine results. If you want to prevent people from
accessing and using your documents, use these. Both Google and Common Crawl
seem to obey them. If you want to _make_sure_ nobody accesses and uses your
documents, don't publish them on the Web.
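
As a concrete illustration, this is roughly how a well-behaved crawler checks
those rules before fetching, using Python's standard urllib.robotparser (the
rules and the user agent name here are made up):

    import urllib.robotparser

    # A hypothetical robots.txt a site might publish:
    rules = [
        "User-agent: *",
        "Disallow: /private/",
    ]

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)

    # A polite crawler asks before every fetch:
    print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
    print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))   # False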

There is no practical way to ensure your documents are accessible only for
some limited period of your choosing. If you release them to the world, you
always lose control over their distribution and use.

------
worried_citizen
Web crawling is just like most things: 80% of the results for 20% of the work.
It's always the last mile that takes the most significant cost and engineering
effort.

As your index and scale grow, you bump into the really difficult problems:

1. How do you handle so many DNS requests per second without overloading
upstream servers? (A sketch of one mitigation follows this list.)

2. How do you discover and determine the quality of new links? It's only a
matter of time until your crawler hits a treasure trove of spam domains.

3. How do you store, update, and access an index that's growing exponentially?

Just some ideas.
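
On point 1, the usual first step is to stop asking the upstream resolver the
same question twice. A minimal sketch, assuming an in-process cache is
acceptable (real crawlers typically run a dedicated caching resolver and honor
TTLs, which this deliberately ignores):

    import socket
    from functools import lru_cache

    @lru_cache(maxsize=100_000)
    def resolve(hostname):
        """Resolve a hostname once and memoize the answer.
        Caveat: ignores DNS TTLs, so cached entries never expire."""
        return socket.gethostbyname(hostname)

    print(resolve("example.com"))  # hits the upstream resolver
    print(resolve("example.com"))  # answered from the in-process cache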

~~~
betolink
I've been working on something similar and I have run into some of the issues
you mention. As you correctly pointed out, quality and post-processing are
also relevant, so you don't crawl irrelevant/spam sites, which can be HUGE!
The work presented here is cool, but it does not address the whole picture.
Having a crawler that takes quality and user feedback into account is the hard
part. Not to mention being polite with the requests... we need to scale, but
without ignoring robots.txt.
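
Politeness at scale mostly comes down to per-host rate limiting. A minimal
sketch (the two-second delay is an arbitrary assumption; a real crawler would
also read any Crawl-delay its targets publish in robots.txt):

    import time
    from urllib.parse import urlparse

    MIN_DELAY = 2.0  # assumed per-host politeness interval, in seconds
    last_hit = {}    # host -> time of the previous request

    def polite_wait(url):
        """Block until it's polite to hit this URL's host again."""
        host = urlparse(url).netloc
        wait = last_hit.get(host, 0.0) + MIN_DELAY - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.monotonic()

    for url in ["http://example.com/a", "http://example.com/b"]:
        polite_wait(url)  # the second call sleeps roughly two seconds
        # fetch(url) would go here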

So crawling a billion links in X hours is not trivial, but it's not that hard
either, especially with cloud infrastructure like AWS; it's just a matter of a
good enough implementation and how much money one wants to spend on it.

------
supername
No one ever talks about one particular topic, though, when it comes to web
crawling: how do you avoid all the "bad" sites, as in really bad shit? The
stuff that your ISP could use as evidence against you, when in fact it was
just your code running and it happened to come across one of these sorts of
sites. How do you deal with all that? That is the only thing stopping me from
experimenting with web crawling.

~~~
johnward
I used to do some crawling on Comcast using what is now IBM Watson Explorer. I
got a ton of phone calls from them. It sounded like those calls would go away
if I just paid a little bit more for business-class service.

------
pella
Old HN comments (3 years ago):
[https://news.ycombinator.com/item?id=4367933](https://news.ycombinator.com/item?id=4367933)

------
tegansnyder
I feel like more companies are building their businesses around web crawling
and parsing data. There are lots of players in the eCommerce space that
monitor pricing, search relevance, and product integrity. Each one of these
companies has to build some sort of templating system for defining crawl jobs,
a set of parsing rules to extract the data, and a monitoring system to alert
when the underlying HTML of a site has changed out from under their predefined
rules. I'm interested in these aspects. Building a distributed crawler is
easier than ever.
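
Something like the following is one way such a templating system can start out
(the site name and CSS selectors are hypothetical; a missing field doubles as
the "underlying HTML changed" alert):

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # One dict of CSS selectors per site; editing this is the "template".
    RULES = {
        "shop.example.com": {
            "title": "h1.product-title",
            "price": "span.price",
        },
    }

    def extract(site, html):
        """Apply a site's parsing rules; None means the rule broke."""
        soup = BeautifulSoup(html, "html.parser")
        out = {}
        for field, selector in RULES[site].items():
            node = soup.select_one(selector)
            out[field] = node.get_text(strip=True) if node else None
        return out

    row = extract("shop.example.com",
                  '<h1 class="product-title">Widget</h1>'
                  '<span class="price">$9.99</span>')
    assert row == {"title": "Widget", "price": "$9.99"}
    # A None value here is the signal to alert that the template broke.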

------
jdrock
This isn't particularly difficult anymore. The most interesting challenges in
web crawling revolve around turning a diaspora of web content into usable
data. E.g., how do you get prices from 10 million product listings across
1,000 different e-retailers?
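
Even that last step hides real work: every retailer formats prices
differently. A rough sketch of that normalization sub-problem (the regex and
the separator heuristic are illustrative, not exhaustive):

    import re
    from decimal import Decimal

    def parse_price(text):
        """Extract a decimal amount from a messy price string."""
        m = re.search(r"(\d[\d.,]*)", text)
        if not m:
            raise ValueError("no number in %r" % text)
        num = m.group(1)
        if "," in num and "." in num:
            # Heuristic: the later separator is the decimal point.
            if num.rfind(",") > num.rfind("."):
                num = num.replace(".", "").replace(",", ".")
            else:
                num = num.replace(",", "")
        elif "," in num:
            # Ambiguous ("9,99" vs "1,299"); assume a decimal comma here.
            num = num.replace(",", ".")
        return Decimal(num)

    assert parse_price("$1,299.00") == Decimal("1299.00")
    assert parse_price("1.299,00 €") == Decimal("1299.00")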

------
packersville
I don't understand his hesitancy about releasing his crawler code. I imagine
there are plenty of crawlers out there for people to access and alter for
malicious use if they desired, so why is releasing his such a big deal?

------
pbreit
Is "quarter billion" used to make it sound like a bigger number? Even "half"
is aggressive, imo.

~~~
johnnymonster
I was going to ask this very question. It seems like the appropriate way would
be to say 250 million. I mean, you wouldn't say "I got a quarter hundred
dollars" or "a quarter thousand"...

