There’s also ARIA [1]. I actually think the one linked above seems more interesting. At the same time, it doesn’t seem all that different from just rolling your own solution.
> LLM Chain querying documents with citations [e.g. a scientific Zotero library]
> This is a minimal package for doing question and answering from PDFs or text files (which can be raw HTML). It strives to give very good answers, with no hallucinations, by grounding responses with in-text citations.
pip install paper-qa
> If you use Zotero to organize your personal bibliography, you can use the paperqa.contrib.ZoteroDB to query papers from your library, which relies on pyzotero. Install pyzotero to use this feature:
pip install pyzotero
> If you want to use [ paperqa with pyzotero ] in a Jupyter notebook or Colab, you need to run the following command:
> Semantic search and workflows for medical/scientific papers
python -m paperai.report report.yml 50 md <path to model directory>
> The following [columns and answers] report [output] formats are supported: [Markdown, CSV], Annotation - Columns and answers are extracted from articles with the results annotated over the original PDF files. Requires passing in a path with the original PDF files.
If you want a full-fidelity WARC file, browsertrix-crawler is nice. It is slower than wget because it drives Chrome, but it works better for sites with highly dynamic content and can generate WACZ files, which can be served efficiently from a file or object store when used with something like replayweb.page.
Webrecorder's tools are by far the best imo: https://archiveweb.page and Browsertrix Crawler. ArchiveBox uses wget internally for its WARC generation, but I'd love to integrate it with Webrecorder in the future for this part.
Is this something that could be used to create a chatbot with a knowledge base based on all pages and all snapshots of those pages of a specific domain archived by web.archive.org/?
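For the "all pages and all snapshots of a domain" part, one possible starting point is the Wayback Machine's CDX API, which lists captures for a URL pattern. The sketch below is only an assumption about how you might enumerate snapshots before feeding them to any question-answering tool; it is not part of any project mentioned in this thread, and the function names (`build_cdx_url`, `parse_cdx_rows`, `list_snapshots`) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, limit=50):
    """Build a CDX query that lists captures for pages under a domain."""
    params = {
        "url": domain,
        "matchType": "domain",        # match the host and its subdomains
        "output": "json",             # JSON array of rows, header row first
        "fl": "timestamp,original",   # only return the fields we need
        "filter": "statuscode:200",   # keep successful captures only
        "collapse": "digest",         # drop adjacent captures with identical content
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Turn CDX JSON rows (header row first) into Wayback replay URLs."""
    if not rows:
        return []
    header, *captures = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[orig]}" for row in captures]


def list_snapshots(domain, limit=50):
    """Fetch capture rows for a domain and return replay URLs (hypothetical helper)."""
    with urlopen(build_cdx_url(domain, limit), timeout=30) as resp:
        body = resp.read().decode("utf-8")
    # An empty body means no captures matched the query.
    return parse_cdx_rows(json.loads(body) if body.strip() else [])
```

Calling `list_snapshots("example.com", limit=10)` would give up to ten replay URLs. You would still have to fetch each one, extract the text, and index it yourself, and a large domain needs paging and polite rate limiting.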
> it is not meant to replace keyword and filter-based search, but instead to complement it, providing a different starting point for scholarly inquiry and potentially expanding access to otherwise hard-to-reach information.