
Ask HN: Designing a crawler to extract all the links from a website (site map)? - jurgenwerk
How would you build a bot that takes a website address as input, then extracts and visits all the (sub)pages that can be found on that website? It then uses the gathered data to save the status of each page (response code, title, description, load time...) and stores the pages in a data structure from which a tree map of the pages can be built (like a folder structure in a file browser). This data structure needs weekly snapshots so crawls can be compared over time.<p>I’m thinking about the two main aspects of this bot - the first being the crawling strategy and the second the data structure for storing this data so it can be queried efficiently. Regarding the crawling algorithm, probably the easiest would be:<p>- Visit the page (level 1)<p>- Extract all the internal links<p>- Visit the first link, save data<p>- Go to step 2 (uncover the next level of links)<p>Obviously, there are some critical problems with this strategy. How do we know when we are done? How do we prevent cycles? What problems can arise when crawls are performed concurrently?<p>The second point in question is the database for storing these links. The data should have the following properties:<p>- Associated with a specific website crawl at some point in time (to be compared with other crawls at different times)<p>- Links in each crawl need to point to each other, so a website tree can be constructed.<p>This perhaps calls for a graph database, but that’s expensive (learning curve + maintenance cost). What about a traditional RDBMS (Postgres)? A “links” table, referenced by “crawls” and “websites” tables, where each link is uniquely identified by its URL and can point to other links - for example the parent link (previous level).<p>Can you point me to some good algorithms and strategies?
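
For what it's worth, the steps above can be sketched as a breadth-first crawl with a "seen" map, which is one common way to answer the "when are we done" and "cycles" questions: the crawl ends when the queue is empty, and a URL is never queued twice. This is only a rough sketch under assumptions, not a finished design. `fetch_links` is a hypothetical callable standing in for the HTTP request and HTML parsing, the two-table schema is just an illustration of the "crawls" and "links" idea, and SQLite is swapped in for Postgres to keep the example self-contained.

```python
import sqlite3
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, fetch_links):
    """Breadth-first crawl; returns {url: parent_url}, with None for the root.

    fetch_links is a hypothetical callable (url -> iterable of hrefs found on
    that page). A real bot would fetch and parse the page there.
    """
    host = urlparse(start_url).netloc
    parents = {start_url: None}  # doubles as the "already seen" set
    queue = deque([start_url])
    while queue:  # an empty queue means the crawl is done
        url = queue.popleft()
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments so duplicates collapse.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc != host:
                continue  # external link, not part of this site
            if link in parents:
                continue  # seen before; this check is what breaks cycles
            parents[link] = url  # the first page linking here becomes the tree parent
            queue.append(link)
    return parents


def save_crawl(conn, website, crawled_at, parents):
    """Store one snapshot. The schema is an illustrative assumption."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS crawls (
            id INTEGER PRIMARY KEY,
            website TEXT NOT NULL,
            crawled_at TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS links (
            crawl_id INTEGER NOT NULL REFERENCES crawls(id),
            url TEXT NOT NULL,
            parent_url TEXT,
            PRIMARY KEY (crawl_id, url)
        );
        """
    )
    cur = conn.execute(
        "INSERT INTO crawls (website, crawled_at) VALUES (?, ?)",
        (website, crawled_at),
    )
    crawl_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO links (crawl_id, url, parent_url) VALUES (?, ?, ?)",
        [(crawl_id, url, parent) for url, parent in parents.items()],
    )
    conn.commit()
    return crawl_id
```

Calling `crawl("https://example.com/", my_fetch)` and passing the result to `save_crawl(sqlite3.connect("crawls.db"), ...)` once a week would give one row set per snapshot, and two snapshots can then be compared by joining `links` on `url` across two `crawl_id` values. With concurrent workers, the "seen" check and the insert would have to happen atomically (a lock around the map, or a unique constraint in the database), otherwise two workers can queue the same URL.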
======
Piskvorrr
[https://www.gnu.org/software/wget/manual/html_node/Recursive...](https://www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html)

------
tedmiston
Check out [https://scrapy.org/](https://scrapy.org/) to start

