Zotero can do this. It cannot always find the metadata, but there are additional plugins available too. The browser plugin also lets you easily find the item on a library's site or Amazon, grab the metadata, and associate it with the PDF or other file.
I’ve been using Zotero since I was in grad school and at one point wrote large chunks of the BibTeX export. My first papers were added to my library in October 2006. It’s amazing to have a single source of basically every paper I’ve read in the past 16 years, plus many books, newspaper articles, etc - including notes. I’ve always had it backed up to my own WebDAV server, which has somehow survived numerous migrations.
The UI is starting to feel a bit clunky, but it’s fine as a standalone application and the mobile version is great too.
Yes, it can be a tad clunky and the mobile app for a long time was just a catalog.
However, their new iOS app [1] is awesome. The ability to read and highlight right in the app is what I have been wanting for years.
I had previously been using the ZotFile [2] plugin to achieve something similar. ZotFile is also very powerful; I only used a subset of its features.
I throw all documents into Calibre [1]. It can extract metadata from files and supplement it from online sources. It has a lot of plugins too.
As a plus, it can convert documents into e-reader-friendly formats. There are also web applications that can read Calibre's metadata files and let you share the library.
I just put my Calibre library in Dropbox and get apps like Calibre Sync for Android. It integrates with Dropbox, downloads the library metadata on demand, and lets you download the books you are actually reading.
I combine this with sorting by recently modified and custom Calibre columns to manage my to-read pile and various 'shelves'. Calibre lets you create a custom 'tag' column.
Calibre-Web, running on a Raspberry Pi. It comes with login support (you can use GitHub to log in too). Expose it via your router (put Nginx + Let's Encrypt in front) or via cloudflared.
Recoll (http://www.recoll.org/) is an excellent metadata indexing and search program. It doesn't organize files by hash for uniqueness, but it will help you find things within the files.
PDFs only get names once I'm finished reading them. I click "Save As" and sit for 30 seconds and try to figure out what search terms I would use in the future to look for this document. Sometimes a title, sometimes a summary, sometimes a topic, author, year, journal, etc. Everything is built into the filename. It doesn't matter if it's long. Computers can handle it. This also has the side effect of ensuring I know what I just read and putting it into a mental filesystem as well as an electronic one.
Once I name it, I move it from the Downloads or temp folder to a Documents folder (though it should really be called 'Library') and I sync it to the cloud with the Google Drive app. In it, there's a Reading-List folder, with a _done folder. I have a few category folders - usually for reading groups. If I read the same document in multiple reading groups, I store it in the folder of the first reading group I read it in. This way feels nostalgic to me. I will also put jpeg screenshots, txt files for notes, etc. This way feels nice, like when I used to go to the public library and see the DVDs and audiobooks next to the books.
The main query interface is MacOS spotlight, though Google Drive also works, and so would something like fzf, or any other finder. I like not having to download or use software. For annotations, I usually save a xyz.pdf and a xyz-Annotated.pdf. For news articles, I just Right Click > Print to PDF and save to the subfolder for a particular topic. You never know when something will get taken down from the Internet. If I search for something and find duplicates, I try to prune then and there.
I don't have to download any apps except for Google Drive, which I already use for e-mail, etc. I can at any point port this entire system to Dropbox, Box, self-hosted FUSE solution, or a flash drive, and keep all its functionality with no software on a computer except for a filesystem and a document viewer.
Names are hyphenated - e.g. Intro-to-Civil-War.pdf. This way feels readable with my eyes, and also most filesystem search utilities seem to tokenize well on hyphens. Tokenizing on spaces would mostly work, except you have to escape spaces on the command line.
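That hyphenation step is easy to script. A minimal sketch (the function name and cleanup rules here are my own, not the commenter's):

```python
import re

def hyphenate(title: str, ext: str = "pdf") -> str:
    """Turn a free-form title into a hyphenated, search-friendly filename."""
    # Drop anything that isn't a word character, space, or hyphen.
    cleaned = re.sub(r"[^\w\s-]", "", title)
    # Collapse runs of whitespace/hyphens into single hyphens.
    slug = re.sub(r"[\s-]+", "-", cleaned).strip("-")
    return f"{slug}.{ext}"

print(hyphenate("Intro to Civil War"))  # Intro-to-Civil-War.pdf
```

Keeping hyphens (rather than spaces) means no shell escaping, and most search tools still tokenize on them.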
Years ago I standardized on a pretty flat folder hierarchy for such PDFs, along with a file naming convention like the following examples: topic-author-YYYY-MM-DD.pdf or article-title-website-domain-YYYY-MM-DD.pdf, etc. On about 2 or 3 occasions in my life, I have also drafted a small summary of the content of a PDF into a sibling README (text file) for ease of findability later on. This was because those few PDFs were very old docs that were scanned in, etc. The sibling README file is named exactly the same as the PDF, but with a file extension of .txt or .md, or ...-README.txt, etc.
Nowadays there are more tools available - as others have cited - which help with finding metadata. But I like to keep things simple, and because there are still way too many PDFs whose metadata starts with "Microsoft Word...", I stick to my file naming convention to help give hints about the contents.
You will then be able to find things chronologically, they’ll be grouped as you found them, and you can find things you can’t remember except that you worked on them at the time you were doing that other thing.
All your other ways of finding them still work.
Bonus: also works as a better versioning convention than “Title v4 Bob Copy 2.docx”.
I appreciate your suggestion! However... about two decades ago I in fact tried putting the date in front, and found that I actually remember a particular topic (or author, or even domain name) far more vividly than dates. My mind often squishes days together too much for them to be useful. Like people who remember faces great but not names, I seem to recall topics vastly better than "when" such material was obtained. Hence, date prefixes made it a tad more cumbersome to find things, at least at a glance. I use the date at the end merely for reference, not so much for searchability. If I really want to find something by date - rare, though it happens - I can search with a glob pattern in a GUI or on the command line, such as *YYYY-MM-*.pdf, etc. Thanks anyway!
EDIT: The asterisks were accidentally omitted from my search pattern earlier.
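That date-suffix lookup is a one-liner with Python's glob support too. A sketch (the directory and filenames are hypothetical):

```python
from pathlib import Path

def find_by_month(library: Path, year: int, month: int) -> list[Path]:
    """Match files whose names end with a YYYY-MM-DD suffix in the given month."""
    # Equivalent to the shell glob *YYYY-MM-*.pdf
    return sorted(library.glob(f"*{year:04d}-{month:02d}-*.pdf"))

# e.g. find_by_month(Path("~/Documents").expanduser(), 2022, 8)
```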
I think the original filename is a critical piece of information and I usually leave names.
Every day when I do literature research (or happen to come across an interesting paper), I create a new folder with the context of my search, e.g. `2022-08-23_einstein_original_papers`. I save the paper (and any others from that day) to the folder as is, then add them to Zotero (or Mendeley, if I were less privacy-sensitive).
I agree about the original filename being worth preserving. I usually rename files with the paper's title followed by the original filename, separated by a hyphen. Sometimes I add a keyword or company name.
e.g.
Wang03-shazam.pdf
=>
An Industrial-Strength Audio Search Algorithm (Shazam) - Wang03-shazam.pdf
Your folder system is nice, I like the idea of organizing by context in which I found it. Thanks for sharing.
I usually rename them manually, with Unix-y filenames that start with the last/company name of the author (e.g., `adobe-postscript-language-reference-manual.pdf`, `blandy-programming-rust-1e-early.epub`).
In a company, I put relevant ones in a Git repo, or sometimes in a wiki. (In any case, the company/engineering wiki will probably reference them, and code in the Git repo might as well.)
For non-public documents, I track the provenance of each. In a company Git repo, usually in `README.md`, or maybe `<name-of-the-doc-file>.README.md`. And you have to be clear whether something was received under NDA or other restrictions, which can limit sharing, quoting, use of the facts, or even who within the organization is allowed to look at it.
For paid docs I personally own, I'm currently experimenting with inventorying all my paid digital content (books, videos, games) in GnuCash (especially since the payment transactions will be there). (I'm still slightly uncomfortable with, say, a $50 `.pdf` just sitting in my homedir, without indication that it's not public and shareable.)
I manage things manually by year and folder, synchronizing thousands of items via Dropbox. Each folder has all the associated links and metadata I can muster.
By "manual" I mean various scripts on macOS. "bib", for example, is an Alfred keyword that runs a script scraping the frontmost browser page for BibTeX data and initializing a new folder based on that data. Other scripts write search output files that "Find Any File" thinks it saved; FAF offers a nice Finder-like interface for selectively opening search results.
I've avoided any canned solution. I don't want vendor lock, and I'd rather be crippled by my own lack of imagination than someone else's.
By far the most useful metadata are the dates of access. One wanders continuously through an idea space no software can adequately categorize, yet adjacent items in time are likely related.
The holy grail I haven't coded yet is to generate web pages with URI schemes that open each PDF in a viewer. GoodReader is a nice viewer on iOS with such an interface; stunningly, its own browser can't understand these links, but Dropbox can. To share one set of links between macOS and iOS, one can write a custom URI handler on macOS that recognizes the GoodReader link and redirects it appropriately on a Mac.
The goal here is to meaningfully browse closer to the speed of light than clumsy current standards. Just as lock-picking is both skill and cycles, research is both skill and cycles: One wants to have the right three ideas in close enough proximity for our feeble brains to notice the connection. We have to be in our happy place when we're flailing for hours, days, years; it would nevertheless be nice to accelerate this process.
I've been in constant friction with the MathSciNet team: I believe that it should present itself as the premier playground for machine learning mind mapping experiments, in addition to maintaining its stodgy hand curation. There is some controversy as to how math is organized. Many great mathematicians wander freely; others never read a paper outside their field after the age of 23. Departmental hiring meetings degenerate to "it's the number theory group's turn", perpetuating a dated view of how math is organized. Independent views of the MathSciNet database could give us new understandings of our field.
I keep the original names. The problem with metadata is that it can be stripped, or the same content can carry different metadata. If you download a paper from the publisher's site versus a copy provided by the author, they needn't be identical. When I was working on my PhD thesis I tried using tools like Zotero for keeping bibliographical references and pulling the metadata when I needed citations. I wasn't too satisfied with the results. Maybe I was using them the wrong way.
I've been using Tellico for cataloguing and ISBN lookup, in conjunction with a script to hunt for ISBNs in PDFs, but may try Zotero now. I've been normalizing file names to the form [author ‘:’] title [metadata] [‘.’ type] where the metadata is ‘[’ key ‘=’ value [‘;’ …] ‘]’, normally including ‘isbn=’ for books and ‘doi=’ for papers (and other things for audio and images).
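A rough parser for that naming grammar - `[author ':'] title [metadata] ['.' type]`, with metadata as `'[' key '=' value (';' ...) ']'` - could look like this (my own sketch; the regex and example values are assumptions, not the commenter's tooling):

```python
import re

NAME_RE = re.compile(
    r"^(?:(?P<author>[^:\[\]]+):\s*)?"  # optional "author: " prefix
    r"(?P<title>[^\[\]]+?)\s*"          # title (non-greedy)
    r"(?:\[(?P<meta>[^\]]*)\])?"        # optional [key=value;...] block
    r"(?:\.(?P<type>\w+))?$"            # optional file type
)

def parse_name(name: str) -> dict:
    """Split a normalized filename into author, title, metadata, and type."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"unparseable name: {name!r}")
    meta = dict(pair.split("=", 1)
                for pair in (m["meta"] or "").split(";") if "=" in pair)
    return {"author": m["author"], "title": m["title"],
            "meta": meta, "type": m["type"]}
```

The embedded `isbn=`/`doi=` keys then come back as a plain dict, ready for a Tellico or Zotero lookup.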
I manually name them as follows as soon as I download them:
[title] - [optional subtitle] - [comma separated contributor list] (publisher name, year of publication).<file-extension>
Contributors can include authors, editors, translators; in that order. If the name becomes too long for the file system, I opt for "et al." after a few names.
Examples:
Structure and Interpretation of Computer Programs - Harold Abelson, Gerald Jay Sussman, Julie Sussman (The MIT Press, 1996).pdf
The Phenomenology of Spirit - Georg Wilhelm Friedrich Hegel, Terry Pinkard (Cambridge University Press, 2018).pdf
Computer Graphics - Principles and Practice - John F. Hughes, Andries van Dam, Morgan McGuire, et al. (Addison-Wesley, 2014).pdf
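A sketch of composing that template programmatically (the function name and the cutoff of three names before "et al." are my assumptions):

```python
def book_filename(title, contributors, publisher, year,
                  ext="pdf", subtitle=None, max_names=3):
    """Compose '[title] - [subtitle] - [contributors] (publisher, year).ext'."""
    names = ", ".join(contributors[:max_names])
    if len(contributors) > max_names:
        names += ", et al."  # keep the name within filesystem limits
    parts = [title] + ([subtitle] if subtitle else []) + [names]
    return f"{' - '.join(parts)} ({publisher}, {year}).{ext}"
```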
I have a Zotero database to which everything gets added. Files are renamed by Zotfile to Author1_Author2_(etal)-Title-Publication-Year.pdf.
Zotero generally is pretty good about finding metadata, and there are plugins for full text search. I've probably had as many cases of SciHub serving up the wrong document as Zotero failing to find metadata.
I tried Recoll, but with the default parameters on my setup the database filled my disk; I think it was close to 1/15 of the indexed disk size. After freeing up disk space I still use it to find things occasionally, but haven't checked whether the database is updating.
I don't organize at the file level. I just dump all pdf files into a folder. That folder syncs with home-server and cloud-storage.
I organize at the app level. I have different apps on all my devices for different kinds of documents.
One for work/research, one for novels/non-fiction/poetry. I rely on the recently-opened section for continued reading. All of the apps have big bookshelf features, so the title doesn't matter. When transferring, I sometimes email the document to myself with the title in the subject.
I use Okular, the native document viewer in Pop!_OS, EBookDroid, and EBookDroid Pro.
A Git-versioned folder with a by-topic structure. I haven't found a way to generate metadata and file names based on content, though - I still name them manually.
In some cases, I prefix a YYYY-MM-DD date, for sorting. In cases where I want to link to something (like when I attribute images in my writing), I may use a title like "wikimedia-commons-man-on-a-unicycle.pdf", or somesuch.
Other times, I consider it important to keep the original filename.
I'll often rely on the container context, to provide classification.
Depends on the content. For academic papers and books there are tools out there that can rename files by scanning the document for a DOI and then looking it up on CrossRef or the like.
Surprised no one has mentioned iBooks (Books.app) from Apple. It works great and syncs across devices. You can add personal PDFs if you have an iCloud account.
If it's a paper with a DOI, it's easy to re-generate the metadata. Same for an ISBN. My guess is that this type of file won't hash well, because the metadata and the data (cover page) are mutable. Zotero tries to extract metadata from PDF files, but it doesn't always succeed.
I use Zotero with the ZotFile and BetterBibTex plugins for my academic papers, and Calibre for my ebooks.
I'm not an archivist, and I don't read enough to benefit from automating this problem. https://xkcd.com/1319/
It’s open source to boot.
https://www.zotero.org/