Exercise for the reader, write a regex to distinguish random noise from English ...

jxf · on Jan 7, 2014

Almost any regex containing a common English phrase of a few words, like /of course/, would give you a very high accuracy rate on a large enough dataset and a very low false positive rate. English has low entropy per bit.

tlarkworthy · on Jan 7, 2014

That strategy would return results for any phrase not containing "of course". Which would be rather a lot of results in any document. I am envisioning a regex that plucks hashes from emails, for example. Some parts English, some parts cryptographic "noise". Pick the noise from the English

(note the inspiration was a nefarious regex scanner for finding bitcoin hashes. I have no intention of building such a thing, but the idea of a random detector regex intrigued me. Is it possible?)

My idea was more like calculate the statistics of character n-grams in english (all of which exist in random noise too), but count the most unlikely occurrences until you hit some probabilistic threshold that indicates some well thought out decision boundary.

thaumasiotes · on Jan 7, 2014

I had to recognize English for a cryptography project in college. This was my entire strategy, which the professor advised for rejecting non-English (it turns out recognizing non-English is easier than recognizing English), and works well:

1. Parse the full text of some long book available on Project Gutenberg. Record every trigram that occurs. Frequency is irrelevant; we just want to know whether a trigram occurs or not.

2. Go through your text, counting the number of trigrams that didn't exist in the sample text. If the number of strange trigrams exceeds a threshold (for my project I used a threshold of 3, but you can tune this), reject the text as non-English.

Given the constant finite threshold I used, that does amount to a regex, but I don't recommend trying to write it out explicitly.

pbhjpbhj · on Jan 8, 2014

How effective was your algo?

Presumably it would false-negative texts with neologisms and spelling errors and such.

There must be trigrams which simply don't appear in dictionary-English? Do you have results posted somewhere.

andreasvc · on Jan 7, 2014

If you're after hashes you could just match a string of hexadecimal digits of a specific length (such as [0-9a-f]{16} ); much easier to define than English text.

tlarkworthy · on Jan 7, 2014

did not think of that!

might be interesting to see how that stacks up against the other idea thread.

Groxx · on Jan 8, 2014

For detecting hashes? As in, SHA, or {key: value}?

If the former, you're possibly better just looking for high-entropy chunks of data of the right size. Of course, that'll match all kinds of encrypted data, but there may not be much you can do there.