FWIW, the traditional approach, dating to the 1980s, uses fingerprints, which are binary vectors. Here's the same search using RDKit's "Morgan" fingerprints with radius 2 and count simulation enabled, run through my "chemfp" project.
This is where it would take a pharmaceutical chemist to compare the two lists.
For example, Suprofen is a nonsteroidal anti-inflammatory drug which made my list but did not make the other list. There are oodles of ways to compare molecules, and decades of discussion about when to use what.
Also, RDKit could not parse two of the SMILES strings:
1) Cyanocobalamin has a 4-valent nitrogen:
  Explicit valence for atom # 84 N, 4, is greater than permitted
2) Temoporfin uses an invalid bond notation in "C\-1N2".
These are the sorts of issues which require a deeper analysis to detect than molecule_vectorizer provides.
EDIT:
It looks like molecule_vectorizer uses RDKit to parse the SMILES string and generate RDKit fingerprints by default (or Morgan fingerprints as an option). These are binary, and count simulation wasn't used.
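To make the "binary" versus "count simulation" distinction concrete, here is a minimal sketch in plain Python. It is not RDKit's or chemfp's code: the feature ids, the 64-bit size, and the (1, 2, 4, 8) thresholds are assumptions picked for illustration, and Tanimoto is assumed as the similarity measure. The idea is that a plain binary fingerprint sets one bit per feature no matter how often it occurs, while count simulation gives each feature a few bits and sets more of them as the count grows.

```python
from collections import Counter

# Assumed count thresholds, chosen for illustration only.
COUNT_BOUNDS = (1, 2, 4, 8)


def binary_fingerprint(feature_counts, n_bits=64):
    """Set one bit per feature, ignoring how often the feature occurs."""
    return {feature % n_bits for feature in feature_counts}


def count_simulated_fingerprint(feature_counts, n_bits=64, bounds=COUNT_BOUNDS):
    """Give each feature len(bounds) adjacent bits; set bit i when count >= bounds[i]."""
    n_slots = n_bits // len(bounds)
    bits = set()
    for feature, count in feature_counts.items():
        base = (feature % n_slots) * len(bounds)
        for i, bound in enumerate(bounds):
            if count >= bound:
                bits.add(base + i)
    return bits


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical feature-id -> occurrence-count maps for two molecules.
# They share the same features, but feature 3 occurs six times in mol_b.
mol_a = Counter({3: 1, 7: 1, 11: 1})
mol_b = Counter({3: 6, 7: 1, 11: 1})

binary_score = tanimoto(binary_fingerprint(mol_a), binary_fingerprint(mol_b))
counted_score = tanimoto(
    count_simulated_fingerprint(mol_a), count_simulated_fingerprint(mol_b)
)
print(binary_score)   # 1.0: the binary fingerprints are identical
print(counted_score)  # 0.6: the repeated feature now separates the two
```

With these made-up inputs the binary fingerprints cannot tell the two molecules apart, while the count-simulated ones can, which is one reason the two settings can produce different nearest-neighbor lists.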
Those SMILES parse errors should have been caught as part of the process. Without digging deeper, I can't figure out what happened.
(My initial guess was that it used text-based tokenization, like the n-mer approach in "Lingos, Finite State Machines, and Fast Similarity Searching", by Grant et al.)
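For anyone unfamiliar with the n-mer idea, here is a rough sketch of what text-based tokenization of SMILES looks like. It is a simplification, not the method from the paper: it skips any SMILES normalization the paper describes, the substring length of 4 is an assumed default, and the similarity is a plain count-based Tanimoto rather than the paper's own formula. The function names are hypothetical.

```python
from collections import Counter


def lingos(smiles, q=4):
    """Return a multiset of all overlapping q-character substrings of a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_tanimoto(smiles_a, smiles_b, q=4):
    """Count-based Tanimoto over the two substring multisets."""
    la, lb = lingos(smiles_a, q), lingos(smiles_b, q)
    shared = sum((la & lb).values())  # per-substring minimum counts
    total = sum((la | lb).values())   # per-substring maximum counts
    return shared / total if total else 0.0


# Benzene written as aromatic SMILES: "cccc" occurs twice.
print(lingos("c1ccccc1"))

# 1-butanol vs. 1-butylamine share "CCCC" but differ in the last substring.
print(lingo_tanimoto("CCCCO", "CCCCN"))
```

Note that this works on the raw string, so it never checks valence or bond notation, and two different SMILES strings for the same molecule can score as dissimilar. A toolkit-based fingerprint avoids the second problem by parsing the structure first.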
Knowing now that it uses the RDKit path fingerprints, I can re-run chemfp on the same data set:
which is a perfect match to the list given in the linked-to writeup.