
Ask HN: Web crawling theory - simlevesque
Hi, I'm looking for documents or books about web crawling, ideally not about a language but more general. Can you help me? Thank you.
======
blackflame7000
Well, you could start at 0.0.0.0 and try a TCP connection to ports 80 and 443
on each IP (~4.3 billion of them) up to 255.255.255.255, and you will have
fetched the default front page of every web server reachable on the IPv4
internet. Next, repeat the process, but this time follow every link in a
breadth-first pattern to begin indexing the internet, or web crawling if you
will.
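
A rough sketch of the link-following step, in Python, assuming only the standard library. The names crawl_bfs, LinkParser and fetch_http are made up for illustration, and the fetch parameter is an assumption added so the loop can be tried against a fake in-memory "web" before pointing it at real hosts:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url):
    """Default fetcher (illustrative): download a page, decode leniently."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_bfs(seed_urls, fetch=fetch_http, max_pages=100):
    """Visit pages breadth-first from the seeds; return {url: html}."""
    seeds = list(seed_urls)
    queue = deque(seeds)   # FIFO queue gives breadth-first order
    seen = set(seeds)      # URLs already queued, so nothing is fetched twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable or broken page: skip it
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The queue is what makes it breadth-first: every page one link away from the seeds is fetched before any page two links away. Swapping popleft() for pop() would turn it into a depth-first crawl.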

If you wanted to make a plagiarism detector, for example, build a histogram of
word triads (runs of three consecutive words) for each web page you index.
Then compare the document you want to check against the word-triad signatures
you created earlier to see if there is a potential match.
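
Something like this, as a minimal Python sketch of the word-triad idea. The function names and the scoring rule (the share of the candidate's triads that also appear in the indexed page) are my own assumptions, not a standard measure, and the tokenizer is deliberately crude:

```python
import re
from collections import Counter


def trigram_histogram(text):
    """Count every run of three consecutive words in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(zip(words, words[1:], words[2:]))


def overlap(candidate, indexed):
    """Fraction of the candidate's triads that also occur in the indexed page."""
    total = sum(candidate.values())
    if total == 0:
        return 0.0  # fewer than three words: nothing to compare
    # Counter & Counter keeps each shared triad with the smaller of its counts.
    shared = sum((candidate & indexed).values())
    return shared / total


indexed_page = trigram_histogram("the quick brown fox jumps over the lazy dog")
suspect = trigram_histogram("A quick brown fox jumps over a sleeping cat")
score = overlap(suspect, indexed_page)
```

A score near 1.0 means almost every triad in the suspect document also appears in the indexed page; where to set the "potential match" threshold is something you would have to tune on real data.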

