For a meta-analysis, I need to retrieve the full author names for around 10k medical papers. Unfortunately, most databases list the authors only by their initials and last names.
Some websites (like ResearchGate) list full names but usually only for a subset of the authors. Also, doing a Google search like “{author_last_name_with_initials} full name {institute_name}” usually returns the full name of the author somewhere in the search results.
My current approach, as a Python script, would be to:
* Retrieve the desired list of papers through the PubMed eutils API (gives me the title, the PMID and the DOI for every paper)
* Use the PMID to retrieve the metadata for every paper through the PubMed eutils API (gives me the last names of their authors, their initials and their institutes)
* Use Google Search via SerpApi to search for the term “{author_last_name_with_initials} full name {institute_name}”, forward the resulting JSON to an LLM, and ask it to return the full name of the author and the link to the source
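A minimal sketch of the second step and of building the search term for the third, assuming the efetch XML uses the usual `PubmedArticle` / `AuthorList` / `Author` layout. The helper names (`fetch_pubmed_xml`, `parse_authors`, `build_search_term`) are made up for illustration, and the SerpApi and LLM calls are left out on purpose. Note that the XML may also carry a `ForeName` element for some authors; when it holds more than initials, the search step might be skippable for that author, but that is something to check against your own data.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_pubmed_xml(pmids):
    """Download PubMed XML for a batch of PMIDs (network call, not rate-limited here)."""
    params = urllib.parse.urlencode(
        {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    )
    with urllib.request.urlopen(f"{EFETCH_URL}?{params}") as response:
        return response.read().decode("utf-8")


def parse_authors(pubmed_xml):
    """Return one dict per personal author found in the XML."""
    root = ET.fromstring(pubmed_xml)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        for author in article.iterfind(".//AuthorList/Author"):
            last_name = author.findtext("LastName")
            if last_name is None:
                continue  # group/collective authors have no LastName
            records.append(
                {
                    "pmid": pmid,
                    "last_name": last_name,
                    # ForeName may be missing or may contain only initials.
                    "fore_name": author.findtext("ForeName"),
                    "initials": author.findtext("Initials") or "",
                    "affiliations": [
                        (aff.text or "").strip()
                        for aff in author.iterfind("AffiliationInfo/Affiliation")
                    ],
                }
            )
    return records


def build_search_term(record):
    """Build the '{last name + initials} full name {institute}' search string."""
    institute = record["affiliations"][0] if record["affiliations"] else ""
    return f'{record["last_name"]} {record["initials"]} full name {institute}'.strip()
```

Usage would be along the lines of `parse_authors(fetch_pubmed_xml(["<pmid1>", "<pmid2>"]))`, with each resulting record passed to `build_search_term` before the search call.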
I tried this approach with a few papers and it seems to work. However, I wonder if there is a more elegant solution to this problem. I was not able to find a free, API-accessible service that provides this kind of information.
For example, there’s https://revstat.ine.pt/index.php/REVSTAT/article/view/382, with two authors who share the same first name and surname and work in the same institute and field. Try separating them without ORCID.
Such problems aren’t rare if you have papers that only mention initials, least of all with Chinese or Korean names, since name clashes are a lot more common in those countries than in the ‘west’. More than 1 in 5 South Koreans have the surname 김 (Kim) and 1 in 7 have 이 (Lee), according to https://en.wikipedia.org/wiki/List_of_Korean_surnames. China is slightly less bad (https://en.wikipedia.org/wiki/List_of_common_Chinese_surname...), but compensates by having a much larger population.
I can’t find it, but I remember seeing a paper with 4 or 5 authors named “Kim” with the same initials.
That author name disambiguation Wikipedia page links to https://github.com/neozhangthe1/disambiguation. I don’t know how good it is, but you should consider it.
And as others have said, you should use ORCID, if available. You should also use the email address (often included in article metadata, at least for the corresponding author), but you can assume neither that every author has a single email address nor that a single email address belongs to a single person.
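As a rough sketch of how that could look in code: treat ORCID as the only key strong enough to merge records automatically, and use a shared email address only to flag candidate matches for review. The record fields (`orcid`, `email`, `last_name`) and the function names are hypothetical, and the ORCID check below only looks at the shape of the identifier, not its checksum.

```python
import re
from collections import defaultdict

# Matches the 16-character ORCID iD with or without hyphens or a URL prefix.
ORCID_PATTERN = re.compile(r"(\d{4})-?(\d{4})-?(\d{4})-?(\d{3}[\dXx])")


def normalize_orcid(raw):
    """Return the iD as 'dddd-dddd-dddd-dddX', or None if none is found."""
    if not raw:
        return None
    match = ORCID_PATTERN.search(raw)
    if match is None:
        return None
    return "-".join(match.groups()).upper()


def group_by_orcid(records):
    """Merge records that share an ORCID; everything else stays unresolved."""
    groups = defaultdict(list)
    unresolved = []
    for record in records:
        orcid = normalize_orcid(record.get("orcid"))
        if orcid:
            groups[orcid].append(record)
        else:
            unresolved.append(record)
    return dict(groups), unresolved


def email_candidates(records):
    """Group records by (email, last name) as *candidates* only.

    One person can have several addresses and one address can be shared,
    so these groups should be reviewed rather than merged blindly.
    """
    by_key = defaultdict(list)
    for record in records:
        email = (record.get("email") or "").strip().lower()
        if email:
            by_key[(email, record.get("last_name"))].append(record)
    return {key: recs for key, recs in by_key.items() if len(recs) > 1}
```

Records left in `unresolved` after both passes are the ones where the name-clash problem described above still applies.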
Another case to worry about is that names can change, for example because of use of a different romanization (https://en.wikipedia.org/wiki/Romanization), marriage, or gender change.