Data Structures for Text Sequences (1998) [pdf]

bellomello4 · on March 31, 2023

Really interesting how this concept evolved over time. It's fascinating that this paper was discussing a method for clustering (grouping) similar items together in a dataset 25 years ago, knowing that as this evolved, it would become vectorized. Well, not exactly: it's actually a "Sequential Dichotomizer" and works by dividing the data into subsets based on their similarities until each subset contains only items that are very similar to each other.

The method is useful for tasks such as image classification, where a computer needs to identify what objects are in an image. The Sequential Dichotomizer can group similar images together so that the computer only needs to learn a few key features of each group rather than trying to learn all the individual differences between each image.

The paper also discusses some technical details of how the Sequential Dichotomizer works, including how it decides which features to use to split the data and how it handles outliers (data points that are very different from the rest of the group).

Anyway, I found this really interesting; thanks for the find! It shows how useful clustering became for making machine learning algorithms more accurate.

rablackburn · on March 31, 2023

> it's actually a "Sequential Dichotomizer" and works by dividing the data into subsets based on their similarities until each subset contains only items that are very similar to each other.

As someone who was considering this exact problem about an hour ago, thank you for handing me the exact term to look up!

dang · on March 31, 2023

mcqueenjordan · on March 31, 2023

Kind of ironic that the fifth word of the first paragraph is presumably supposed to be "to" but is typo'd as "ot", when the sentence itself is about sequences of characters. I have to imagine this was an intentional little easter egg.