Hacker News
Ask HN: Retrieve full author names for a large volume of medical papers?
3 points by oliverluk on Dec 24, 2023 | 13 comments
For a meta-analysis I need to retrieve the full author names for around 10k medical papers. Unfortunately, most databases only list the authors with their initials + last names.

Some websites (like ResearchGate) list full names but usually only for a subset of the authors. Also, doing a Google search like “{author_last_name_with_initials} full name {institute_name}” usually returns the full name of the author somewhere in the search results.

My current approach would be to (Python script):

* Retrieve the desired list of papers through the PubMed eutils API (gives me the title, the PMID and the DOI for every paper)

* Use the PMID to retrieve the metadata for every paper through the PubMed eutils API (gives me the last names of their authors, their initials and their institutes)

* Use Google Search via SerpApi to search for the term “{author_last_name_with_initials} full name {institute_name}”, forward the resulting JSON to an LLM, and ask it to return the full name of the author and the link to the source

I tried this approach with a few papers and it seems to work. However, I wonder if there is a more elegant solution to this problem. I was not able to find a free, API-accessible service that provides this kind of information.
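A minimal sketch of the metadata step, assuming the efetch endpoint is called with `db=pubmed` and `retmode=xml` and that the response follows the usual PubMed XML layout (`PubmedArticle`, `Author`, `LastName`, `Initials`, `ForeName`, `Affiliation`). The function names and the sample record are made up for illustration, and the HTTP call itself is left out. It may be worth checking `ForeName` first, since some records carry more than initials there.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids):
    """Build an efetch URL requesting PubMed XML for a batch of PMIDs."""
    params = {"db": "pubmed", "id": ",".join(str(p) for p in pmids), "retmode": "xml"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def parse_authors(pubmed_xml):
    """Return {pmid: [author dict, ...]} from a PubmedArticleSet XML string."""
    root = ET.fromstring(pubmed_xml.strip())
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        authors = []
        for author in article.iter("Author"):
            last = author.findtext("LastName")
            if last is None:
                # Group authors (CollectiveName) have no LastName; skip them.
                continue
            authors.append({
                "last_name": last,
                "initials": author.findtext("Initials") or "",
                "fore_name": author.findtext("ForeName") or "",
                "affiliations": [a.text or "" for a in author.iter("Affiliation")],
            })
        result[pmid] = authors
    return result


# Hypothetical record, shaped like an efetch response, for a quick offline check.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">12345</PMID>
      <Article>
        <AuthorList>
          <Author>
            <LastName>Doe</LastName>
            <ForeName>Jane Q</ForeName>
            <Initials>JQ</Initials>
            <AffiliationInfo><Affiliation>Example University</Affiliation></AffiliationInfo>
          </Author>
          <Author><CollectiveName>Example Study Group</CollectiveName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

if __name__ == "__main__":
    print(build_efetch_url([12345, 67890]))
    print(parse_authors(SAMPLE_XML))
```

Batching several PMIDs into one request keeps the number of calls manageable for 10k papers.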



You’re doing author name disambiguation (https://en.wikipedia.org/wiki/Author_name_disambiguation). That’s a difficult, messy problem.

For example, there’s https://revstat.ine.pt/index.php/REVSTAT/article/view/382, with two authors with the same first name and surname working in the same institute and field. Try separating them without ORCID.

Such problems aren’t rare if you have papers that only mention initials, especially with Chinese or Korean names, since name clashes are a lot more common in those countries than in the ‘west’. More than 1 in 5 South Koreans have the surname 김 (Kim) and about 1 in 7 have 이 (Lee), according to https://en.wikipedia.org/wiki/List_of_Korean_surnames. China is slightly less bad (https://en.wikipedia.org/wiki/List_of_common_Chinese_surname...), but compensates by having a much larger population.

I can’t find it, but I remember seeing a paper with 4 or 5 authors named “Kim” with the same initials.

That author name disambiguation Wikipedia page links to https://github.com/neozhangthe1/disambiguation. I don’t know how good it is, but you should consider it.

And as others have said, you should use ORCID, if available. You should also use email addresses (often included in article metadata, at least for the corresponding author), but you can assume neither that every author has a single email address nor that a single email address belongs to a single person.

Another case to worry about is that names can change, for example because of use of a different romanization (https://en.wikipedia.org/wiki/Romanization), marriage, or gender change.


Thanks a lot for the detailed response! I was indeed not aware of the scope of the problem and the fact that it has its own Wikipedia page.

Also, many thanks for making me aware of ORCID (together with "DamonHD"). After doing some digging I found ORCID API endpoints which are able to resolve DOIs and PMIDs to ORCID profiles, which is great.

I will take a closer look at the "disambiguation" project you linked to and will see what approach I can take for reliably resolving email addresses to first names (while filtering out non-personal email addresses).
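A rough sketch of the "filter out non-personal email addresses" part, assuming that shared mailboxes can be recognised by a role word in the local part. The word list and the function name are hypothetical and would need tuning against real data; this is a heuristic, not a reliable classifier.

```python
import re

# Hypothetical list of role words that suggest a shared mailbox.
ROLE_LOCAL_PARTS = {
    "info", "office", "contact", "admin", "editor",
    "secretary", "department", "lab", "support",
}


def is_probably_personal(email):
    """Return False for malformed addresses and for likely shared mailboxes."""
    local, sep, domain = email.strip().lower().partition("@")
    if not sep or not local or not domain:
        return False
    # Split the local part on common separators and look for role words.
    tokens = re.split(r"[._\-+]", local)
    return not any(token in ROLE_LOCAL_PARTS for token in tokens)


if __name__ == "__main__":
    for address in ["jane.doe@example.org", "info@example.org", "not-an-email"]:
        print(address, is_probably_personal(address))
```

An address that passes this filter still only hints at a first name, so the result should be treated as a candidate to confirm rather than as an answer.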

That being said, I fear that resolving non-ORCID + non-email authors using the SerpApi + LLM approach I described in my initial post is still the best shot I currently have.


Even ORCID is not foolproof; there have been cases where it misattributes papers. For a proper job you'd need to look at the actual journal article text and cross-reference possible authors' CVs if easily available. It's not something that can be seamlessly automated.


Interesting, thanks for the heads up. I will implement the following safeguard:

* Known: Paper DOI, paper PMID, author last name, author initials

* Get connected ORCIDs based on the DOI / PMIDs

* Check if the known last name matches the last name of the ORCID profile (also include the "Also known as" section of the profile)

This may lead to some false negatives (for example, in case of name changes that were not properly recorded), but if I can reduce the number of manual lookups to below 100, it's already a win.
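The last-name check could look roughly like this. It is a sketch that assumes the profile's last name and its "Also known as" entries have already been fetched as plain strings (the ORCID call is left out), and that aliases are written as "Given Family". All function names are made up for illustration.

```python
import unicodedata


def normalize_name(name):
    """Strip diacritics, case, spaces and punctuation so 'Müller' equals 'Muller'."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in stripped.casefold() if c.isalnum())


def last_name_matches(known_last_name, profile_last_name, also_known_as=()):
    """Check the known last name against the profile's last name and aliases."""
    target = normalize_name(known_last_name)
    if not target:
        return False
    if normalize_name(profile_last_name) == target:
        return True
    for alias in also_known_as:
        tokens = alias.split()
        # Compare every trailing run of words, so multi-word surnames
        # such as "van der Berg" match but "Ashlee" does not match "Lee".
        for start in range(len(tokens)):
            if normalize_name(" ".join(tokens[start:])) == target:
                return True
    return False


if __name__ == "__main__":
    print(last_name_matches("Muller", "Müller"))
    print(last_name_matches("Smith", "Jones", ["Jane Smith"]))
    print(last_name_matches("Lee", "Kim", ["Ann Ashlee"]))
```

Comparing whole trailing words instead of a plain substring keeps short surnames from matching by accident, at the cost of missing aliases written as "Family, Given".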


> forward the resulting JSON to a LLM

Sounds like a great way to get a ton of inaccurate information to be honest


I agree, and I'm not too happy about this approach either. Still, this appears to be my best shot for the authors without ORCIDs. I will try to increase the accuracy through guardrails, good prompting and occasional spot checks.


You can get lots of metadata from Crossref.org. See https://www.crossref.org/services/metadata-retrieval/


I already tried this route (using the DOI). Unfortunately, I was not able to get the first names of the authors this way. Example: https://search.crossref.org/search/works?q=10.1111%2Fanae.14...


Would you mind detailing the "forward the JSON to an LLM" step? I have a tangentially similar problem I'd like to solve, but I don't know how to do that step. Alternatively, could you point me at whatever helped you learn it?


The process described by "ailef" is exactly what I have in mind. I manually tried the following prompt (which surely has a lot of room for improvement) in ChatGPT (GPT 4.0) with a few examples and got exactly what I needed.

Prompt:

Disambiguate the initials of "X Y Lastname" based on the following JSON input. Do not conduct a web search. Return the full name and the link to the reference as a JSON object with the keys "first_name", "last_name", "link_to_reference". Return "not found" in case you are not able to disambiguate the initials. Do not return anything else.

JSON input:

[the array scoped under the key "organic_results" of the JSON object SerpApi returns when searching for "{author_last_name_with_initials} full name {institute_name}" using Google]
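In code, building that prompt and checking the reply could look roughly like this. It is a sketch under a few assumptions: each entry in "organic_results" is a dict with a "link" key, the model reply arrives as a plain string, and the function names are made up. The actual LLM call is left out. The two checks at the end (the link must come from the search results, the last name must match) are one possible guardrail, not something I have validated at scale.

```python
import json

PROMPT_TEMPLATE = (
    'Disambiguate the initials of "{author}" based on the following JSON input. '
    "Do not conduct a web search. Return the full name and the link to the "
    'reference as a JSON object with the keys "first_name", "last_name", '
    '"link_to_reference". Return "not found" in case you are not able to '
    "disambiguate the initials. Do not return anything else.\n\n"
    "JSON input:\n\n{results}"
)

REQUIRED_KEYS = ("first_name", "last_name", "link_to_reference")


def build_prompt(author, organic_results):
    """Fill the prompt with the author string and the serialized search results."""
    results = json.dumps(organic_results, ensure_ascii=False, indent=2)
    return PROMPT_TEMPLATE.format(author=author, results=results)


def parse_reply(reply, organic_results, expected_last_name):
    """Return the validated answer as a dict, or None if it should be rejected."""
    text = reply.strip()
    if text.strip('"').lower() == "not found":
        return None
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if not all(isinstance(data.get(k), str) and data[k] for k in REQUIRED_KEYS):
        return None
    # Guardrail: the cited link must be one of the links we actually sent.
    known_links = {result.get("link") for result in organic_results}
    if data["link_to_reference"] not in known_links:
        return None
    # Guardrail: the returned last name must match the one we asked about.
    if data["last_name"].casefold() != expected_last_name.casefold():
        return None
    return {k: data[k] for k in REQUIRED_KEYS}


if __name__ == "__main__":
    sample_results = [{"title": "Jane Doe", "link": "https://example.org/staff/jane-doe"}]
    print(build_prompt("J Q Doe", sample_results))
```

Anything that fails validation can go onto the list for manual lookup instead of being trusted.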


You write a text (the "prompt") containing the JSON with all the data plus instructions on how to analyze it, and you instruct the LLM to extract what you're looking for. The easiest way to start is probably the OpenAI API, or you can try one of the self-hosted LLMs, although I've never interacted with them programmatically.


Can you add ORCID into the mix for at least some of the papers?


Many thanks for the hint! After doing some digging I found ORCID API endpoints which are able to resolve DOIs and PMIDs to ORCID profiles. While not all of the authors of the papers I checked are included (because some obviously do not have ORCIDs), this will likely solve the issue for a good chunk of the authors. Great!



