
Editing distance clusters: a greedy string classification algorithm - wjholden
https://wjholden.com/clusters
======
mlyle
Edit distance for syslog message classification. Interesting!

One notable caveat is that there's no clear axis of classification. Sometimes
things are classified together because they talk about the same resource
(with a long name). Sometimes they are classified together because they share
a message template. Sometimes they are classified together because their
timestamps share a lot of digits (which doesn't mean they are _close_ in time).

Odds are you get some of each type of clustering.
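
To make that concrete, here is a minimal sketch using plain Levenshtein distance. That metric is an assumption on my part (the article's exact distance may differ), and the three log lines are made up for illustration. Lines `a` and `b` share a message template; lines `a` and `c` only share a long device name.

```python
def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical syslog-style messages, invented for this example.
a = "disk /dev/mapper/vg0-postgres-data is 91% full"
b = "disk /dev/sda1 is 91% full"                  # same template as a
c = "mount /dev/mapper/vg0-postgres-data failed"  # same resource as a

same_template = levenshtein(a, b)
same_resource = levenshtein(a, c)
print("same template, different resource:", same_template)
print("same resource, different template:", same_resource)
```

Here `a` and `b` differ in length by 20 characters, so their distance is at least 20, while `a` and `c` differ only in a short prefix and suffix. By raw edit distance, the long resource name outweighs the shared template.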

------
keanzu
Was just about to say that it doesn't work well with large lists of words,
because at some point the gaps between words are small enough that there's a
path right through the entire set, which causes them all to be "clustered".

    
    
      holden
      golden
      golder
      colder
      corder
      border
      bonder
      wonder
      wander
      warder
    

But I see that it works great once you get to longer strings. I like the log
message processing.
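
The chaining effect in the word list above can be sketched in a few lines. This is a single-linkage grouping with a distance threshold of 1, which is my own stand-in and not necessarily the article's greedy algorithm; `single_link_clusters` and `WORDS` are names made up for the example.

```python
def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def single_link_clusters(words, threshold):
    """Group words so that any two words within `threshold` edits share a cluster.

    Each new word absorbs every existing cluster that has at least one
    member within the threshold, so clusters grow along chains of neighbors.
    """
    clusters = []
    for w in words:
        merged, rest = [w], []
        for cluster in clusters:
            if any(levenshtein(w, m) <= threshold for m in cluster):
                merged.extend(cluster)
            else:
                rest.append(cluster)
        clusters = rest + [merged]
    return clusters


WORDS = ["holden", "golden", "golder", "colder", "corder",
         "border", "bonder", "wonder", "wander", "warder"]

clusters = single_link_clusters(WORDS, threshold=1)
print(len(clusters), "cluster(s)")
print("holden -> warder distance:", levenshtein("holden", "warder"))
```

Every neighboring pair in the list is one substitution apart, so all ten words land in a single cluster even though the two ends of the chain are four edits apart. With longer strings such as log lines, a one- or two-character bridge between unrelated messages is much less likely.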

