Ask HN: Retrieve full author names for a large volume of medical papers?

Someone · on Dec 24, 2023

You’re doing author name disambiguation (https://en.wikipedia.org/wiki/Author_name_disambiguation). That’s a difficult, messy problem.

For example, there’s https://revstat.ine.pt/index.php/REVSTAT/article/view/382, with two authors who share the same first name and surname and work in the same institute and field. Try separating them without ORCID.

Such problems aren’t rare if you have papers that only mention initials, and certainly not with Chinese or Korean names, since name clashes are a lot more common in those countries than in the ‘West’. Over 1 in 5 South Koreans have the surname 김 (Kim) and 1 in 7 have 이 (Lee), according to https://en.wikipedia.org/wiki/List_of_Korean_surnames. China is slightly less bad (https://en.wikipedia.org/wiki/List_of_common_Chinese_surname...), but compensates by having a much larger population.

I can’t find it, but remember seeing a paper with 4 or 5 authors named “Kim” with the same initials.

That author name disambiguation Wikipedia page links to https://github.com/neozhangthe1/disambiguation. I don’t know how good it is, but you should consider it.

And as others have said, you should use ORCID, if available. You should also use email addresses (often included in article metadata, at least for the corresponding author), but you can assume neither that every author has a single email address nor that a single email address belongs to a single person.

Another case to worry about is that names can change, for example because of use of a different romanization (https://en.wikipedia.org/wiki/Romanization), marriage, or gender change.

oliverluk · on Dec 24, 2023

Thanks a lot for the detailed response! I was indeed not aware of the scope of the problem and the fact that it has its own Wikipedia page.

Also many thanks for making me aware of ORCID (together with "DamonHD"). After doing some digging I found ORCID API endpoints which are able to resolve DOIs and PMIDs to ORCID profiles, which is great.

I will take a closer look at the "disambiguation" project you linked to and will see what approach I can take for reliably resolving email addresses to first names (while filtering out non-personal email addresses).
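A rough sketch of what that email step could look like, in Python. The list of role-style local parts and the `first.last@` heuristic are my own assumptions, not something from the thread or from any library, and the helper names are made up for illustration. As noted above, one address does not reliably map to one person, so treat the output as a candidate to verify rather than an answer.

```python
import re
from typing import List, Optional

# Hypothetical, non-exhaustive list of non-personal local parts; extend for your data.
ROLE_LOCAL_PARTS = {
    "info", "office", "contact", "admin", "editor", "secretary",
    "department", "lab", "support", "noreply", "reply",
}


def _tokens(email: str) -> List[str]:
    """Lower-case the local part, drop any '+tag' suffix, split on '.', '_' and '-'."""
    local = email.strip().lower().split("@", 1)[0].split("+", 1)[0]
    return [t for t in re.split(r"[._\-]+", local) if t]


def is_probably_personal(email: str) -> bool:
    """Reject malformed addresses and ones whose local part looks like a role mailbox."""
    local, sep, domain = email.strip().lower().partition("@")
    if not sep or not local or "." not in domain:
        return False
    tokens = _tokens(email)
    return bool(tokens) and not any(t in ROLE_LOCAL_PARTS for t in tokens)


def first_name_candidate(email: str, last_name: str) -> Optional[str]:
    """Guess a first name from a 'first.last@' or 'last.first@' pattern, else None."""
    if not is_probably_personal(email):
        return None
    tokens = _tokens(email)
    last = last_name.strip().lower()
    if len(tokens) == 2 and last in tokens:
        other = tokens[0] if tokens[1] == last else tokens[1]
        # A single letter is just another initial, so it does not help.
        if len(other) > 1 and other.isalpha():
            return other.capitalize()
    return None
```

Anything this returns still needs to agree with the known initials before it is accepted.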

That being said, I fear that resolving non-ORCID + non-email authors using the SerpApi + LLM approach I described in my initial post is still the best shot I currently have.

zozbot234 · on Dec 24, 2023

Even ORCID is not foolproof; there have been cases where it misattributes papers. For a proper job you'd need to look at the actual journal article text and cross-reference possible authors' CVs if easily available. It's not something that can be seamlessly automated.

oliverluk · on Dec 24, 2023

Interesting, thanks for the heads up. I will implement the following safeguard:

* Known: Paper DOI, paper PMID, author last name, author initials

* Get connected ORCIDs based on the DOI / PMIDs

* Check if the known last name matches the last name of the ORCID profile (also include the "Also known as" section of the profile)

This may lead to some false negatives (for example in case of name changes that were not properly recorded), but if I can reduce the number of manual lookups to below 100, it's already a win.
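The name check in that list could be sketched roughly like this in Python. `OrcidProfile` is a simplified, hypothetical container that I made up for illustration; it is not the schema the ORCID API returns, so the fields would have to be filled from whatever the real response contains. The normalization rules (strip diacritics, treat hyphens as spaces) and the alias matching are also assumptions.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List


@dataclass
class OrcidProfile:
    # Hypothetical, simplified record; NOT the ORCID API response schema.
    orcid_id: str
    family_name: str
    given_names: str
    also_known_as: List[str] = field(default_factory=list)


def normalize(name: str) -> str:
    """Case-fold, strip diacritics, treat hyphens as spaces, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().replace("-", " ").split())


def last_name_matches(known_last_name: str, profile: OrcidProfile) -> bool:
    """True if the known last name equals the family name or starts/ends an alias."""
    target = normalize(known_last_name).split()
    if not target:
        return False
    if normalize(profile.family_name).split() == target:
        return True
    n = len(target)
    for alias in profile.also_known_as:
        words = normalize(alias.replace(",", " ")).split()
        # Covers both "Anna Schmidt" and "Schmidt, Anna" style aliases.
        if words[-n:] == target or words[:n] == target:
            return True
    return False


def initials_match(known_initials: str, given_names: str) -> bool:
    """Compare known initials with initials derived from the profile's given names."""
    known = "".join(ch for ch in normalize(known_initials) if ch.isalpha())
    derived = "".join(w[0] for w in normalize(given_names).split() if w[0].isalpha())
    if not known or not derived:
        return False
    # Papers and profiles often differ in how many given names they list.
    return known.startswith(derived) or derived.startswith(known)
```

Requiring both checks to pass before accepting an ORCID keeps the false positives down, at the cost of the false negatives mentioned above.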

KomoD · on Dec 24, 2023

> forward the resulting JSON to a LLM

Sounds like a great way to get a ton of inaccurate information to be honest

oliverluk · on Dec 24, 2023

I agree, and I'm not too happy about this approach either. Still, this appears to be my best shot for the authors without ORCIDs. I will try to increase the accuracy through guardrails, good prompting and occasional spot checks.
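To make "guardrails" a bit more concrete, here is a minimal Python sketch of the kind of check I have in mind. The function name and the specific rules are assumptions of mine, and it assumes each search result is a dict with a "link" key. It only accepts an LLM answer if the last name is unchanged, the first name is more than an initial and starts with the known first initial, and the reference link actually appeared in the search results that were sent.

```python
from typing import Any, Dict, List


def passes_guardrails(
    answer: Dict[str, Any],
    known_last_name: str,
    known_initials: str,
    organic_results: List[Dict[str, Any]],
) -> bool:
    """Cheap sanity checks on an LLM answer before it is stored (hypothetical helper)."""
    first = str(answer.get("first_name", "")).strip()
    last = str(answer.get("last_name", "")).strip()
    link = str(answer.get("link_to_reference", "")).strip()
    if not first or not last or not link:
        return False

    # The model must not have switched to a different person.
    if last.casefold() != known_last_name.strip().casefold():
        return False

    # A bare initial such as "J." is not a disambiguation.
    if len(first.replace(".", "")) < 2:
        return False

    # The first name has to agree with the first known initial.
    initials = [ch for ch in known_initials if ch.isalpha()]
    if not initials or first[0].casefold() != initials[0].casefold():
        return False

    # The cited link must be one we actually sent, which catches invented URLs.
    allowed = {
        str(r.get("link", "")).strip() for r in organic_results if isinstance(r, dict)
    }
    return link in allowed
```

Answers that fail go to the manual lookup pile instead of being retried blindly.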

andrewgilmartin · on Dec 24, 2023

You can get lots of metadata from Crossref.org. See https://www.crossref.org/services/metadata-retrieval/

oliverluk · on Dec 24, 2023

I already tried this route (using the DOI). Unfortunately, I was not able to get the first names of the authors this way. Example: https://search.crossref.org/search/works?q=10.1111%2Fanae.14...

samstave · on Dec 24, 2023

Would you mind detailing out the step of "forward the JSON to an LLM" - as I have a tangentially similar thing I'd like to solve, but I don't know how to do that step - or point me at whatever helped you to learn that step.

oliverluk · on Dec 24, 2023

The process described by "ailef" is exactly what I have in mind. I manually tried the following prompt (which surely has a lot of room for improvement) in ChatGPT (GPT 4.0) with a few examples and got exactly what I needed.

Prompt:

Disambiguate the initials of "X Y Lastname" based on the following JSON input. Do not conduct a web search. Return the full name and the link to the reference as a JSON object with the keys "first_name", "last_name", "link_to_reference". Return "not found" in case you are not able to disambiguate the initials. Do not return anything else.

JSON input:

[the array scoped under the key "organic_results" of the JSON object SerpApi returns when searching for "{author_last_name_with_initials} full name {institute_name}" using Google]
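In code, assembling that prompt and reading the reply could look roughly like the Python sketch below. It deliberately leaves out the actual SerpApi and LLM calls; `build_prompt` and `parse_answer` are hypothetical helper names, and the only thing assumed about the SerpApi response is that it is a dict with an "organic_results" key, as described above.

```python
import json
from typing import Any, Dict, Optional

PROMPT_TEMPLATE = (
    'Disambiguate the initials of "{author}" based on the following JSON input. '
    "Do not conduct a web search. Return the full name and the link to the "
    'reference as a JSON object with the keys "first_name", "last_name", '
    '"link_to_reference". Return "not found" in case you are not able to '
    "disambiguate the initials. Do not return anything else.\n\n"
    "JSON input:\n{results_json}"
)

REQUIRED_KEYS = ("first_name", "last_name", "link_to_reference")


def build_prompt(author_with_initials: str, serpapi_response: Dict[str, Any]) -> str:
    """Insert only the 'organic_results' array into the prompt, not the whole response."""
    organic = serpapi_response.get("organic_results", [])
    results_json = json.dumps(organic, ensure_ascii=False, indent=2)
    return PROMPT_TEMPLATE.format(author=author_with_initials, results_json=results_json)


def parse_answer(raw: str) -> Optional[Dict[str, str]]:
    """Return the three expected fields, or None for 'not found' or malformed output."""
    text = raw.strip()
    if text.strip('"').lower() == "not found":
        return None
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # the model added prose or broke the JSON
    if not isinstance(data, dict):
        return None
    if not all(isinstance(data.get(k), str) and data[k].strip() for k in REQUIRED_KEYS):
        return None
    return {k: data[k].strip() for k in REQUIRED_KEYS}
```

Treating anything that is not clean JSON as "not found" is a conservative choice: it loses a few usable answers but avoids storing half-parsed ones.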

ailef · on Dec 24, 2023

You write a text containing the JSON with all the data and instructions on how to analyze it, and you instruct the LLM to extract what you're looking for (the "prompt"). The easiest way to start is probably the OpenAI API, or you can try one of the self-hosted LLMs, although I've never interacted with them programmatically.

DamonHD · on Dec 24, 2023

Can you add ORCID into the mix for at least some of the papers?

oliverluk · on Dec 24, 2023

Many thanks for the hint! After doing some digging I found ORCID API endpoints which are able to resolve DOIs and PMIDs to ORCID profiles. While not all of the authors of the papers I checked are included (because some obviously do not have ORCIDs), this will likely solve the issue for a good chunk of the authors. Great!