Drug Discovery with Vector Search

dalke · 2024-05-12T04:41:51

FWIW, the traditional approach, dating to the 1980s, uses fingerprints, which are binary vectors. Here's the same search using RDKit's "Morgan" fingerprints with radius 2 and count simulation enabled, run through my "chemfp" project.

  % curl -O 'https://gist.githubusercontent.com/fzliu/8052bd4d609bc6260ab7e8c838d2f518/raw/f1c9efb816d6b8514c0a643323f7afa29372b1c4/fda_approved_structures.csv'
  % chemfp csv2fps --type "RDKit-Morgan radius=2 countSimulation=1" fda_approved_structures.csv -o fda_approved_structures.fps
  % simsearch --query "CC(C)CC1=CC=C(C=C1)C(C)C(O)=O" -k 10 fda_approved_structures.fps --out csv
  query_id,target_id,score
  Query1,Dexibuprofen,1.0000000
  Query1,Ibuprofen,1.0000000
  Query1,Loxoprofen,0.5185185
  Query1,Tyrosine,0.4782609
  Query1,Suprofen,0.4716981
  Query1,Iofetamine I-123,0.4583333
  Query1,N-acetyltyrosine,0.4339623
  Query1,Naproxen,0.4259259
  Query1,Hydroxyamphetamine,0.4222222
  Query1,Carprofen,0.3898305

This is where it would take a pharmaceutical chemist to compare the two lists.

For example, Suprofen is a nonsteroidal anti-inflammatory drug which made my list but did not make the other list. There are oodles of ways to compare molecules, and decades of discussion about when to use what.
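For readers new to fingerprints, here is a minimal sketch of how a similarity score between two binary fingerprints is commonly computed. It assumes the scores above are Tanimoto similarities (simsearch's default, as far as I know), and it represents each fingerprint as a set of "on" bit positions. The bit values are made up for illustration and are not real Morgan bits.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit positions."""
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: report 0.0 by convention (an assumption here).
        return 0.0
    return len(a & b) / union


# Hypothetical fingerprints, written as the positions of their set bits.
fp_query = {1, 5, 9, 12}
fp_target = {1, 5, 9, 40, 77}

print(tanimoto(fp_query, fp_target))  # 3 shared bits / 6 distinct bits
print(tanimoto(fp_query, fp_query))   # identical fingerprints
```

A score of 1.0 only means the fingerprints are identical, not that the molecules are. That is how both Ibuprofen and Dexibuprofen can score 1.0 against the same query.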

Also, RDKit could not parse two of the SMILES strings:

1) Cyanocobalamin has a 4-valent nitrogen

  Explicit valence for atom # 84 N, 4, is greater than permitted

2) Temoporfin uses an invalid bond notation in "C\-1N2".

These are the sorts of issues which require a deeper analysis to detect than molecule_vectorizer provides.

EDIT:

It looks like molecule_vectorizer uses RDKit to parse the SMILES string and generate RDKit fingerprints by default (or Morgan fingerprints as an option). These are binary, and count simulation wasn't used.
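To make "count simulation" concrete, here is a toy sketch of the general idea: instead of one bit per feature, each feature gets several bits, one per count threshold, so a feature seen five times sets more bits than a feature seen once. The thresholds (1, 2, 4, 8) and the bit layout are assumptions for illustration, and this is not RDKit's actual implementation.

```python
# Assumed count thresholds: one bit per threshold, per feature.
COUNT_BOUNDS = (1, 2, 4, 8)


def plain_bits(feature_counts: dict) -> set:
    """Plain binary fingerprint: one bit per feature that occurs at all."""
    return {feature for feature, count in feature_counts.items() if count > 0}


def simulated_count_bits(feature_counts: dict) -> set:
    """Binary fingerprint that encodes counts via threshold bits."""
    bits = set()
    for feature, count in feature_counts.items():
        for offset, bound in enumerate(COUNT_BOUNDS):
            if count >= bound:
                # Each feature owns a block of len(COUNT_BOUNDS) bits.
                bits.add(feature * len(COUNT_BOUNDS) + offset)
    return bits


# Hypothetical molecule: feature 0 occurs once, feature 3 occurs five times.
counts = {0: 1, 3: 5}
print(sorted(plain_bits(counts)))            # [0, 3]
print(sorted(simulated_count_bits(counts)))  # [0, 12, 13, 14]
```

With plain bits, the two features look equally important. With simulated counts, the repeated feature contributes three bits to any overlap calculation.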

Those SMILES parse errors should have been caught as part of the process. Without digging deeper, I can't figure out what happened.

(My initial guess was that it used text-based tokenization, like the n-mer approach in "Lingos, Finite State Machines, and Fast Similarity Searching", by Grant et al.)
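For the curious, a text-based n-mer comparison can be sketched in a few lines. This simplified version splits each SMILES string into overlapping 4-character substrings and compares the two multisets with a count-based Tanimoto. The substring length, the lack of any SMILES normalization, and the scoring formula are all assumptions here, and may differ from what the paper actually does.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count the overlapping q-character substrings of a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_similarity(smiles_a: str, smiles_b: str, q: int = 4) -> float:
    """Multiset Tanimoto: shared substring counts over combined counts."""
    ca, cb = lingos(smiles_a, q), lingos(smiles_b, q)
    shared = sum((ca & cb).values())  # element-wise minimum of counts
    total = sum((ca | cb).values())   # element-wise maximum of counts
    return shared / total if total else 0.0


ibuprofen = "CC(C)CC1=CC=C(C=C1)C(C)C(O)=O"  # query SMILES from the commands above
print(lingo_similarity(ibuprofen, ibuprofen))
print(lingo_similarity("CCCCC", "CCCCO"))
```

Note that this works purely on the text, so two different but equally valid SMILES strings for the same molecule can score well below 1.0.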

Knowing now that it uses the RDKit path fingerprints, I can re-run chemfp on the same data set:

  % chemfp csv2fps --type RDKit-Fingerprint fda_approved_structures.csv -o fda_approved_structures.fps
  % simsearch --query "CC(C)CC1=CC=C(C=C1)C(C)C(O)=O" -k 10 fda_approved_structures.fps --out csv
  query_id,target_id,score
  Query1,Ibuprofen,1.0000000
  Query1,Dexibuprofen,1.0000000
  Query1,Loxoprofen,0.6725664
  Query1,Phenylacetic acid,0.5175439
  Query1,Naproxen,0.5056180
  Query1,Fenoprofen,0.4825737
  Query1,Ketoprofen,0.4333333
  Query1,Dexketoprofen,0.4333333
  Query1,Mandelic acid,0.4127517
  Query1,Oxeladin,0.4080000

which exactly matches the list given in the linked-to writeup.
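With both top-10 lists now in this comment, a quick way to see how far the two fingerprint types disagree is to compare the hit names directly. The names below are copied from the two simsearch outputs above; the variable names are just for illustration.

```python
# Top-10 hits from the Morgan (radius 2, count simulation) search above.
morgan_hits = [
    "Dexibuprofen", "Ibuprofen", "Loxoprofen", "Tyrosine", "Suprofen",
    "Iofetamine I-123", "N-acetyltyrosine", "Naproxen",
    "Hydroxyamphetamine", "Carprofen",
]

# Top-10 hits from the RDKit path fingerprint search above.
rdkit_hits = [
    "Ibuprofen", "Dexibuprofen", "Loxoprofen", "Phenylacetic acid",
    "Naproxen", "Fenoprofen", "Ketoprofen", "Dexketoprofen",
    "Mandelic acid", "Oxeladin",
]

rdkit_set = set(rdkit_hits)
morgan_set = set(morgan_hits)

# Preserve each list's original rank order in the output.
common = [name for name in morgan_hits if name in rdkit_set]
morgan_only = [name for name in morgan_hits if name not in rdkit_set]
rdkit_only = [name for name in rdkit_hits if name not in morgan_set]

print("in both:    ", common)
print("Morgan only:", morgan_only)
print("RDKit only: ", rdkit_only)
```

Only four of the ten names appear in both lists (Dexibuprofen, Ibuprofen, Loxoprofen, Naproxen). Judging which list is more useful is still a job for a chemist.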