

Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers [pdf] - dasmiq
http://viraltexts.org/infect-bighum-2013.pdf

======
dang
One distinctive thing here is the need for the matching to be robust under OCR
errors. From skimming the paper, it sounds like the main thing you did to
address it was use shorter n-gram sequences. Is that correct? What are the
_dis_advantages of using shorter sequences—more false matches? Could one use
fuzzy matching where characters that OCR often confuses get normalized into a
single bucket before computing the n-gram? Is there any way to know, or at
least to estimate, how many legit matches failed because of OCR errors?

~~~
dasmiq
Yes, short n-grams or skip n-grams were the primary method to improve recall,
and yes, the problem is that you'll have to examine more candidates--i.e., run
slower.
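
For concreteness, here's a toy sketch of skip n-gram extraction (not our actual
indexing code): every in-order combination of n tokens within a small window
becomes an index key, so a single OCR error only knocks out the grams that
contain it, at the cost of generating many more candidates to examine.

    from itertools import combinations

    def skip_ngrams(tokens, n=3, window=5):
        """All in-order combinations of n tokens drawn from a window of
        consecutive tokens, anchored at the window's first token. More
        grams per position than plain n-grams, hence more candidates."""
        grams = set()
        for start in range(len(tokens) - n + 1):
            span = tokens[start:start + window]
            for rest in combinations(range(1, len(span)), n - 1):
                grams.add((span[0],) + tuple(span[i] for i in rest))
        return grams

    clean = "it was the best of times it was the worst".split()
    noisy = "it was thc best of tirnes it was the worst".split()  # OCR errors
    print(len(skip_ngrams(clean) & skip_ngrams(noisy)))  # still many shared grams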

Using some (high-precision) many-to-one hashes is a nice idea, and it has the
added benefit of reducing the index size. Some people might be
familiar with Soundex
([http://en.wikipedia.org/wiki/Soundex](http://en.wikipedia.org/wiki/Soundex)),
a hashing scheme for English homophones designed in 1918 to be computable by
hand. While Soundex in particular would probably lead to too many collisions,
it would be fun to think about learning a better hash from data.
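
As a toy illustration of the bucketing idea from the parent comment (the
confusion classes below are invented for the example, not learned from data),
collapsing frequently confused characters before computing character n-grams
lets two OCR variants of the same passage hash to the same keys:

    # Collapse characters that OCR frequently confuses into one bucket
    # before building n-grams. These classes are illustrative, not learned.
    OCR_BUCKETS = {'l': '1', 'i': '1', '!': '1', '|': '1',
                   'o': '0', 'e': 'c'}

    def normalize(text):
        text = text.lower().replace('rn', 'm')  # classic confusion: "rn" ~ "m"
        return ''.join(OCR_BUCKETS.get(ch, ch) for ch in text)

    def char_ngrams(text, n=5):
        text = normalize(text)
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    a = char_ngrams("imrnortal words")   # OCR read "m" as "rn"
    b = char_ngrams("lmmortal words")    # OCR read "i" as "l"
    print(len(a & b), "of", len(a | b), "5-grams shared")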

One thing we're working on now is unsupervised learning of error correction
models (in the form of probabilistic finite-state transducers), so that we can
augment the index with multiple hypotheses.
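
The effect on the index would be roughly this (the candidate_corrections
function below is a hypothetical stand-in for the learned transducer, not our
model):

    from collections import defaultdict

    def candidate_corrections(token):
        """Hypothetical stand-in for a learned error model: return
        (variant, probability) pairs for an observed OCR token."""
        return [(token, 0.7),
                (token.replace('1', 'l'), 0.2),
                (token.replace('0', 'o'), 0.1)]

    def build_index(documents, top_k=2):
        """Posting lists keyed by observed tokens *and* likely corrections."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for token in text.split():
                hypotheses = sorted(candidate_corrections(token),
                                    key=lambda vp: -vp[1])[:top_k]
                for variant, _prob in hypotheses:
                    index[variant].add(doc_id)
        return index

    docs = {"paper-1": "immorta1 words", "paper-2": "immortal words"}
    print(build_index(docs)["immortal"])  # finds both, despite the OCR error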

Further afield, an interesting open problem is the development of locality
sensitive hashing (LSH) schemes for edit distance. Some work has gotten as far
as Hamming distance and cyclic shifts of the bits:
[http://www.mit.edu/~andoni/papers/cyclicShifts.pdf](http://www.mit.edu/~andoni/papers/cyclicShifts.pdf)
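
(For anyone unfamiliar with the Hamming-distance case: the standard LSH family
there is just bit sampling, as in the sketch below, which is why it's striking
that nothing comparably simple is known for edit distance.)

    import random

    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    def bit_sampling_hash(dim, k, rng):
        """One draw from the bit-sampling LSH family for Hamming distance:
        hash a vector to k randomly chosen coordinates. The collision
        probability is (1 - d/dim)**k, so near pairs collide far more
        often than far pairs."""
        coords = [rng.randrange(dim) for _ in range(k)]
        return lambda bits: tuple(bits[i] for i in coords)

    rng = random.Random(42)
    dim, k = 64, 8
    x = [rng.randint(0, 1) for _ in range(dim)]
    y = list(x); y[3] ^= 1; y[40] ^= 1            # Hamming distance 2 from x
    z = [rng.randint(0, 1) for _ in range(dim)]   # unrelated vector

    for name, other in [("near", y), ("far", z)]:
        hits = 0
        for _ in range(1000):
            h = bit_sampling_hash(dim, k, rng)
            if h(x) == h(other):
                hits += 1
        print(name, "d =", hamming(x, other), "collision rate =", hits / 1000)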

In the paper, our evaluation uses the "pooling" of the results of several
different runs. This is similar to how recall is evaluated for information
retrieval over large collections. For a given query to a web search engine,
you don't want to read the whole web to figure out what your recall is.
Instead, you can take the union of the results from several different search
engines (and parameter settings thereof) and evaluate the "pseudo-recall" of
any one system with respect to that pooled set (cf. Boggle). In our case, by
including some runs that take a long time but have higher recall, we can
estimate the performance of the more realistic parameter settings.
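
Concretely (with made-up run outputs), pseudo-recall is just recall measured
against the pooled union:

    # Pooled evaluation: the union of matches found by all runs stands in
    # for the unknown true set of reused passages. Run outputs are made up.
    runs = {
        "fast-settings":    {("doc1", "doc2"), ("doc1", "doc3")},
        "skip-ngrams":      {("doc1", "doc2"), ("doc2", "doc4"), ("doc1", "doc3")},
        "slow-high-recall": {("doc1", "doc2"), ("doc2", "doc4"), ("doc3", "doc5")},
    }

    pool = set().union(*runs.values())

    for name, found in runs.items():
        print(f"{name}: pseudo-recall = {len(found & pool) / len(pool):.2f}")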

We're also working on getting clean transcripts of a sample of newspapers, so
that we can produce more direct estimates as well as train better error models.

------
dasmiq
Author (David Smith) here. My coauthor, Ryan Cordell, and I will be around for
any computer science/literary studies tag-team questions you have.

You can read a little more about the project at:
[http://viraltexts.org/](http://viraltexts.org/)

Also, our lab at Northeastern University in Boston will soon be posting an ad
for a research programmer to help us take this analysis further and to build
visualization and retrieval systems for large, noisy historical collections.

