Hacker News new | past | comments | ask | show | jobs | submit login

There could be collisions where `fname2` is the same for different pages; resulting in unintentionally overwriting. A couple possible solutions: generate a random string and append it to the filename, set fname2 to a hash of the URL, replace unsafe filename characters like '/' and/or '\' and/or '\n' with e.g. underscores. IIRC, URLs can be longer than the max filename length of many filesystems, so hashes as filenames are the safest solution. You can generate an index of the fetched URLs and store it with JSON or e.g. SQLite (with Records and/or SQLAlchemy, for example).

If or when you want to parallelize (to do multiple requests at once because most of the time is spent waiting for responses from the network) write-contention for the index may be an issue that SQLite solves for better than a flatfile locking mechanism like creating and deleting an index.json.lock. requests3 and aiohttp-requests support asyncio. requests3 supports HTTP/2 and connection pooling.

SQLite can probably handle storing the text of as many pages as you throw at it with the added benefit of full-text search. Datasette is a really cool interface for sqlite databases of all sorts. https://datasette.readthedocs.io/en/stable/ecosystem.html#to...


Apache Nutch + ElasticSearch / Lucene / Solr are production-proven crawling and search applications: https://en.m.wikipedia.org/wiki/Apache_Nutch

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact