After crawling your list of seed URLs, where do you go next? How do you make sure you don't crawl the same content multiple times because it has a slightly different URL? How do you avoid getting stuck on unimportant spam sites full of autogenerated content?
Because the author only crawled domains from a limited set and only for a short time, he did not need to care about that part. Nonetheless, it's a great article that shows many of the pitfalls of writing a webcrawler.
... infinite ...
Filtering out certain query parameters (JSESSIONID, PHPSESSIONID, etc) can help.
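For illustration, a minimal sketch of that kind of query-parameter filtering in Python (the parameter list and the sort-then-rebuild canonicalization are just assumptions):

    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    # hypothetical list of session/tracking parameters to strip before dedup
    SESSION_PARAMS = {"jsessionid", "phpsessionid", "phpsessid", "sid"}

    def normalize_url(url):
        parts = urlparse(url)
        # drop session noise, sort the rest so equivalent URLs compare equal
        query = sorted((k, v) for k, v in parse_qsl(parts.query)
                       if k.lower() not in SESSION_PARAMS)
        return urlunparse(parts._replace(query=urlencode(query), fragment=""))

    # normalize_url("http://example.com/p?PHPSESSID=abc&id=1") -> "http://example.com/p?id=1"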
I'm rather impressed that search engines do it so well. I imagine the right approach involves examining the contents of the pages and doing checksums, but I'd love to know what the real search engines do.
I do know that a naive crawler will completely fail to crawl any significant portion of the real web without solving this. It's quite possible those 250 million "pages" were actually something like 1 million distinct pages.
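A naive version of the checksum idea is easy to sketch (this only catches byte-identical pages; real engines presumably use fuzzier fingerprints):

    import hashlib

    seen_checksums = set()

    def is_duplicate(page_bytes):
        # exact-match dedup on page content; misses near-duplicates entirely
        digest = hashlib.sha1(page_bytes).hexdigest()
        if digest in seen_checksums:
            return True
        seen_checksums.add(digest)
        return False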
Interesting point. Q: Is it possible for two different pages to give the same checksum? (Asking for my own info; I do know what checksums are, in overview, from way back, but haven't checked :) them out much.)
Also, it doesn't help you at all when pages you want to treat as the same aren't actually identical, because they include the request time or some other unique element.
Edit: Added sentence about comparing with domains.
A very simple and reliable web crawler can be built from something that takes a list of in-urls, fetches the pages, parses them, and emits a list of out-urls.
The out-urls from that phase become the in-urls of the next phase.
If you do this you'll find the size of each generation gets larger for a while, but eventually it starts getting smaller, and if you take a look you'll find that almost everything getting crawled is junk -- like online calendars that keep going forward into the future forever.
At some point you stop crawling, maybe between generation 8 and 12, and you've dodged the web trap bullet without even trying.
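A toy version of that generation-by-generation loop, in Python (fetching and link extraction here are deliberately crude placeholders):

    import re
    import urllib.request
    from urllib.parse import urljoin

    def fetch(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def extract_links(html, base_url):
        # crude href extraction; a real crawler would use a proper HTML parser
        return [urljoin(base_url, h) for h in re.findall(r'href="([^"]+)"', html)]

    def crawl_generations(seed_urls, max_generations=10):
        seen = set(seed_urls)
        in_urls = list(seed_urls)
        for gen in range(max_generations):
            out_urls = []
            for url in in_urls:
                try:
                    html = fetch(url)
                except Exception:
                    continue
                for link in extract_links(html, url):
                    if link.startswith("http") and link not in seen:
                        seen.add(link)
                        out_urls.append(link)
            print("generation %d: %d new urls" % (gen, len(out_urls)))
            in_urls = out_urls  # the out-urls become the next generation's in-urls
        return seen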
A classifier for spotting that kind of junk can be trained using logistic regression.
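The comment doesn't say what the classifier's inputs are, so here is a made-up possibility using a few URL-derived features and scikit-learn, purely to illustrate the shape of it:

    from sklearn.linear_model import LogisticRegression

    def url_features(url):
        # invented features: long, digit-heavy, parameter-laden, deep URLs tend to be junk
        return [len(url), sum(c.isdigit() for c in url), url.count("&"), url.count("/")]

    # tiny made-up training set: 1 = worth crawling, 0 = junk
    train_urls = [
        "http://example.com/article/intro",
        "http://example.com/blog/crawlers",
        "http://spam.example/calendar?year=2047&month=11&day=3",
        "http://spam.example/tag/a/b/c/d/e/f?page=99382",
    ]
    labels = [1, 1, 0, 0]

    clf = LogisticRegression().fit([url_features(u) for u in train_urls], labels)
    # at crawl time, only enqueue URLs scoring above some threshold
    score = clf.predict_proba([url_features("http://example.com/article/foo")])[0, 1]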
This is slightly tricky, but effective. I've been meaning to write in more depth about this topic (smart crawling).
[edit: I'm stepping out so I can't write more right now. You can email me if you have any questions about this.]
Something I didn't deal with at all was spider traps. My thinking there was that this was a relatively shallow crawl, so the dangers of getting badly stuck weren't too great. For deeper crawls this would be essential.
Then you have a giant heap with several billion elements, in which you're updating hundreds of entries for each webpage you crawl. I'm pretty sure that'll be your new bottleneck.
Besides, it still doesn't solve the problem of content farms and spam.
> set a max pages to crawl per site to prevent never-ending single-site crawls?
You either set the limit too low and don't crawl, for example, all of Wikipedia, or you set the limit too high and still waste tons of resources.
It's probably best to add an IP-based limit factor as well, to handle link farms.
Something like "max pages crawled is directly proportional to (incoming links / unique linking domains)"; or, to reduce processing cost, just make it naively proportional to the number of unique linking domains.
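Something like this, say (the constants are arbitrary, just there to make the idea concrete):

    def max_pages_for_site(unique_linking_domains, base=50, per_domain=10, cap=500000):
        # naive variant: budget grows with the number of distinct domains linking in
        return min(cap, base + per_domain * unique_linking_domains)

    # a site linked from 3 domains gets a small budget; one linked from tens of
    # thousands of distinct domains (e.g. wikipedia) gets a much larger one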
There are spam sites with more incoming links than Wikipedia....
Designing crawl policy is a lot like designing security policy. The opposition is very well-funded, smart, experienced, and very motivated.
If you're just smart, they've got you beat.
Google apparently solves this with a domain trust factor that discounts link juice from low-value domains for SERP purposes, but I'm not sure what they feed forward to control their bots.
This means that for some time it has appeared that large numbers of low quality inbound links have been a negative for Google SERP placement ... so I'm wondering if such [spam] domain owners are being as smart as you think?
It's been several years since I worked on Yahoo's crawler.
It's interesting that at the beginning of this you doubted that spam domains could have a large number of inbound links (and suggested that inbound link count was a powerful indicator), but by the end you seemed aware of such domains....
I'm not. I've never done any research on "spam domains". Do you have any examples or not?
Then who wrote
> This means that for some time it has appeared that large numbers of low quality inbound links have been a negative for Google SERP placement
But none of that presupposes any knowledge of "spam sites", nor of the particular instances you note of spam sites with lots of inbound links (i.e. the claimed far more than Wikipedia [per IP address]).
But it seems you don't have any examples, so I'm not sure why you're persisting. Even if I did know about specific spam sites with lots of incoming links, I don't see how my knowing that is relevant to whether you can give any examples of sites with the characteristics you claimed. The points are orthogonal: your prior knowledge of such sites is not in any way bound by my knowledge of them.
I daren't ask whether you can answer the question again. But just supposing you could, a response would still be of interest.
Other way round? A bloom filter provides a low probability of erroneously believing a URL had already been added when it had not, and zero probability of believing a URL had not been added to the filter when in fact it had.
Using a bloom filter this way guarantees you won't ever hit a page twice, but you'll have a non-zero rate of pages you think you've downloaded but actually haven't, depending on how you tune it.
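Roughly like this, assuming a third-party bloom filter implementation such as pybloom_live (the sizing numbers are just an example of the tuning mentioned above):

    from pybloom_live import BloomFilter

    # size the filter for the crawl you expect; error_rate is the false-positive rate
    seen = BloomFilter(capacity=10_000_000, error_rate=0.001)

    def should_crawl(url):
        if url in seen:   # may be a false positive: we'd skip a page we never fetched
            return False
        seen.add(url)     # never a false negative: an added URL always reads as present
        return True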
My (not-very fresh) memory of what a bloom filter actually is tells me that this "non-zero rate" you're talking about must be HUUUUGE. On the order of millions of pages. Am I right?
That is not a good reason. There are many crawlers out there. Anyone can easily modify the string "robots.txt" in the wget binary to "xobots.txt".
Release your code so that others can learn. Stop worrying that you are giving some special tool to bad people - you aren't.
while :; do curl "http://www.ycombinator.com"; done
One of the most interesting things for me is that a lot of the 'frothiest' web pages (those that change every day or several times a day) have become a pretty significant chunk of the web compared to even 5 years ago. I don't see that trend abating much.
Our crawler is written in perl. It uses an evented architecture (built on the AnyEvent library). We use Redis to store state: which URLs have been crawled (using a hash) and which URLs to crawl next (using sorted sets).
Instead of using a bloom filter, we used sorted sets to dedupe URLs and to pick the highest-priority URLs to crawl next (a sort of priority queue).
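The sorted-set pattern looks roughly like this (shown here with redis-py in Python rather than perl; the key name and scoring are assumptions):

    import redis

    r = redis.Redis()

    def enqueue(url, priority):
        # nx=True: only add if the URL isn't already queued, so rediscovered
        # links don't create duplicates
        r.zadd("frontier", {url: priority}, nx=True)

    def next_url():
        # pop the member with the highest score, i.e. the highest-priority URL
        popped = r.zpopmax("frontier", 1)
        return popped[0][0].decode() if popped else None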
For the actual distribution of crawling (the 'map reduce' part) we use the excellent Gearman work distribution server.
One major optimization I can suggest is caching DNS (and resolving it asynchronously). You can save a lot of time and resources, especially at that scale, simply by caching DNS lookups. Another optimization is to keep the socket connection open and download all the pages from the same domain asynchronously.
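For the DNS caching, one crude way to do it in Python is a process-local cache around the resolver (the commenter's crawler is perl/AnyEvent; this is just the same idea transplanted, and it ignores DNS TTLs):

    import functools
    import socket

    _orig_getaddrinfo = socket.getaddrinfo

    @functools.lru_cache(maxsize=100_000)
    def _cached_getaddrinfo(host, port, family=0, type=0, proto=0, flags=0):
        return _orig_getaddrinfo(host, port, family, type, proto, flags)

    # repeated requests to the same domain now skip the DNS lookup entirely
    socket.getaddrinfo = _cached_getaddrinfo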
*Shameless plug: We just launched our private beta. Please sign up and use our API using this link:
Edit: AWS still has free inbound BW.. my bad
My constraints were a bit different (only hitting the front page, CMS/webserver/... fingerprinting, a backend that has to support ad-hoc queries for site features), but it's nice to see that the process is always somewhat the same.
One of the most interesting things I experienced was that link crawling works pretty well for a while, but after you have visited a large number of pages, bloom filters are pretty much the only memory-efficient way to protect against duplication.
I moved to a hybrid model where I still follow links, but to limit the depth needed I use pinboard/twitter/reddit to find new domains. For bootstrapping you can get your hands on zone files from places on the internet (e.g. premiumdrops.com), which saves you from having to crawl very deep.
These days, I run on a combination of a worker approach with redis as a queue/cache and straight elasticsearch in the backend. I'm pretty happy with the easy scalability.
Web crawlers are a great weekend project: they let you fiddle with evented architectures (github sample ), scale a database, and watch the bottlenecks jump from place to place within your architecture. I can only recommend writing one :)
Even though your implementation is quite complex and very well documented, IMO non-blocking I/O is a much better solution, because crawling is very I/O intensive and most of the time is spent waiting on the connection (request + response time). With that many machines and processes, the run time should be much shorter with node.
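For what it's worth, the non-blocking pattern looks something like this (sketched in Python with asyncio/aiohttp as a stand-in for node; concurrency limit and error handling are minimal):

    import asyncio
    import aiohttp

    async def fetch(session, url):
        async with session.get(url) as resp:
            return url, await resp.text()

    async def crawl(urls, concurrency=50):
        sem = asyncio.Semaphore(concurrency)  # cap the number of requests in flight
        async with aiohttp.ClientSession() as session:
            async def bounded(url):
                async with sem:
                    return await fetch(session, url)
            # many downloads overlap, so time spent waiting on connections is shared
            return await asyncio.gather(*(bounded(u) for u in urls),
                                        return_exceptions=True)

    # asyncio.run(crawl(["http://example.com/"]))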
I suppose IMDB already has a pretty good architecture to handle that load, but please, if you're crawling from a single site, be careful. I host a similar database myself, and the CPU/load graphs of my server can tell me exactly when someone has a crawler active again. That's not fun if your goal is to keep a site responsive while keeping the hosting at low cost.
For me, it was more a proof of how efficient and fast a crawler can be.
Also, responses from IMDB were very fast, under 0.4 seconds, so not much time was lost there.
Did you know that IMDB makes a subset of their data publicly available? http://www.imdb.com/interfaces/
IMO, using kue was a success because it also offers a web interface where you can check the progress and restart/check failed jobs.
His question: "Who gets to crawl the web?" is an interesting one though.
Do not assume that Googlebot is a smart crawler. Or smarter than all others. The author of Linkers and Loaders posted recently on CircleID about how dumb Googlebot can be.
There is no such thing as a smart crawler. All crawlers are stupid. Googlebot resorts to brute force more often than not.
Theoretically no one should have to crawl the web. The information should be organised when it is entered into the index.
Do you have to "crawl" the Yellow Pages? Are listings arranged by an "algorithm"? PageRank? 80/20 rules?
Nothing wrong with those metrics; except of course that they can be gamed trivially, as experiments with Google Scholar have shown. But building a business around this type of ranking? C'mon.
If the telephone directories abandoned alpha and subject organisation for "popularity" as a means of organisation it would be total chaos. Which is why "organising the world's information" is an amusing mission statement when your entire business is built around enabling continued chaos and promoting competition for ranking.
Even worse are companies like Yelp. It's blackmail.
If the information was organised, e.g., alphabetically and regionally, it would be a lot easier to find stuff. Instead, search engines need to spy on users to figure out what they should be letting users choose for themselves. Where "user interfaces" are concerned, it is a fine line between "intuitive" and "manipulative".
The people who run search engines and directory sites are not objective. They can be bought. They want to be bought.
This brings quality down. As it always has for traditional media as well. But it's much worse with search engines.
What do you mean by this statement? I can see from your post you have a contrarian view on how to organize and find stuff on the web, but I'm struggling to understand what alternative you're proposing.
Maybe during the registration process more metadata should be demanded of people, and anonymity prohibited or reduced. That way, if you wanted a list of all the .com blogs, it would be just a grep away and tied mostly to real people. Corporate websites would be tied to their business entity with an EIN or something and verified, etc.
The thing is, that ship sailed a long time ago, so we are stuck with Google.
tl;dr - He didn't think about it until just before launching the experiment and worried that it would take too much time to understand the implications of changing his approach. Though he estimates it may represent a factor of five decrease in price.