To me, this looks very similar to local sequence similarity search (e.g., BLAST), where there are very rapid methods that use tuple lookup and banded alignment to quickly identify "homologs" (the same entity). The nice thing about similarity searching algorithms is that they give you a very accurate probability of whether two strings are "homologous" (belong to the same entity). Perhaps I have the scale wrong, but it is routine to search thousands of queries (new entities) against hundreds of millions of sequences (known entities) in an hour or so, and sequences are typically an order of magnitude longer than entity names. The problem is embarrassingly parallel, and very efficient algorithms exist for the common case where matching names are highly similar.
BLAST is so deeply embedded in modern genome biology that it is hard to find short descriptions. Basically, there are two parts:
(1) local sequence (string) similarity scores (scores that allow mismatches, but need not extend to the end of either string) follow an "extreme value distribution", which makes it easy to distinguish similarity by chance from similarity due to common ancestry (I think of different names for the same entity as sharing an ancestor). This is a very powerful concept that lets one set false-positive rates accurately (it does not help with false negatives).
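Concretely, the extreme value statistics boil down to one formula (Karlin-Altschul): the expected number of chance local alignments scoring at least S between a query of length m and a database of total length n is E = K·m·n·e^(−λS). A minimal sketch of turning a score into an E-value and P-value; the λ and K values below are only illustrative (roughly the range reported for common protein scoring matrices), and the real parameters depend on the scoring matrix and gap penalties:

```python
from math import exp, expm1

def blast_evalue(score, m, n, lam=0.267, K=0.041):
    """Expected number of chance local alignments with score >= `score`
    for a query of length m against a database of total length n.
    lam and K are illustrative placeholder values, not fitted parameters."""
    return K * m * n * exp(-lam * score)

def blast_pvalue(score, m, n, lam=0.267, K=0.041):
    """P(at least one chance hit with score >= `score`) = 1 - exp(-E)."""
    return -expm1(-blast_evalue(score, m, n, lam, K))
```

The practical point is the exponential decay in S: once a score clears a modest threshold, the probability that it arose by chance drops off very fast, which is what makes the false-positive rate controllable.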
(2) BLAST calculates local similarity scores in roughly O(n^2)/K time, where K is very large, by first identifying significant conserved "words" with a lookup table, then looking for runs of those words along an alignment diagonal, and finally trying to extend that diagonal with an efficient banded dynamic programming algorithm.
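To make the word-lookup / diagonal idea concrete, here is a toy sketch of the seeding stage only. The function names are mine, and real BLAST is considerably more sophisticated (neighborhood words, two-hit triggering, X-drop extension), but the shape of the trick is the same: hash short words once, then count hits per diagonal so that expensive alignment is only attempted where seeds cluster.

```python
from collections import defaultdict

def build_word_index(database_seqs, w=3):
    """Map every length-w word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database_seqs):
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index

def find_seed_diagonals(query, index, w=3, min_hits=2):
    """Count word hits per (sequence id, diagonal), where diagonal = q_pos - db_pos.
    Diagonals with several hits are the candidates a real search would then
    extend with banded dynamic programming."""
    diag_hits = defaultdict(int)
    for q_pos in range(len(query) - w + 1):
        for seq_id, db_pos in index.get(query[q_pos:q_pos + w], ()):
            diag_hits[(seq_id, q_pos - db_pos)] += 1
    return {key: count for key, count in diag_hits.items() if count >= min_hits}

# Toy usage with made-up strings:
index = build_word_index(["GATTACAGATTACA", "CCCGGG"], w=3)
print(find_seed_diagonals("GATTACA", index, w=3))
```

The O(n^2)/K saving comes from the last step: instead of running full dynamic programming between every pair of strings, you run a banded alignment only inside a narrow band around each surviving diagonal.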