Are topic models reliable or useful? (medium.com/pew-research-center-decoded)
2 points by pvankessel on Sept 24, 2021 | 2 comments



This is consistent with my own experience with topic models, although I'm left wondering to what extent these observations generalize, and why. I tried to find more details in the previous posts about the models used, etc., but couldn't find much.

There's a lot of interest in overfitting in ML, but it tends to focus on supervised methods; I think unsupervised methods deserve more attention, with regard to overfitting in particular but also more broadly.


We started off by trying LDA and NMF, but the topics were too messy, so we wound up switching to CorEx (https://github.com/gregversteeg/corex_topic), a semi-supervised algorithm that lets you "nudge" the model in the right direction using anchor terms. By the time our topics started looking coherent, it turned out that a regex built from the anchor terms we'd picked outperformed the model itself. This case study was on a relatively small sample of relatively short documents (~4k survey open-ends), but for what it's worth, we also tried to use topic models to classify congressional Facebook posts (a much larger corpus with longer documents) and the results were the same.
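
(Not from the article, just to make the setup concrete: a rough sketch of what an anchored CorEx model plus a regex baseline built from the same anchor terms can look like, using the corextopic and scikit-learn packages. The documents, anchor lists, and parameters below are invented for illustration, not the ones we actually used.)

    # Illustrative sketch: anchored CorEx vs. a regex baseline on the same anchor terms.
    import re

    from sklearn.feature_extraction.text import CountVectorizer
    from corextopic import corextopic as ct

    # Toy documents standing in for survey open-ends (hypothetical).
    docs = [
        "The economy and jobs were my biggest concern this year",
        "I'm worried about healthcare costs and insurance premiums",
        "Taxes are too high and wages are stagnant",
    ]

    # Hypothetical anchor terms, one list per topic we want to nudge toward.
    anchors = [
        ["economy", "jobs", "wages", "taxes"],
        ["healthcare", "insurance", "premiums"],
    ]

    # Binary bag-of-words matrix, which is what CorEx expects.
    vectorizer = CountVectorizer(binary=True, stop_words="english")
    doc_word = vectorizer.fit_transform(docs)
    words = list(vectorizer.get_feature_names_out())

    # Semi-supervised (anchored) CorEx topic model.
    topic_model = ct.Corex(n_hidden=2, seed=42)
    topic_model.fit(doc_word, words=words, anchors=anchors, anchor_strength=3)

    for i, topic in enumerate(topic_model.get_topics()):
        print(f"topic {i}:", [w for w, *_ in topic])

    # The regex "baseline": flag documents containing any anchor term.
    patterns = [re.compile(r"\b(" + "|".join(a) + r")\b", re.I) for a in anchors]
    for doc in docs:
        print([bool(p.search(doc)) for p in patterns], doc)
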

Overfitting is certainly part of the problem - in one of my earlier posts I talk about "conceptually spurious words," which are essentially the product of overfitting - but the more difficult problem is polysemy. I'm sure there are ways to mitigate that - expanding the feature space with POS tagging, etc. - but ultimately I think the solution is to simply avoid using a dimensionality reduction method for text classification. Supervised models are clearly the way to go - even if those "models" are just keyword dictionaries curated based on domain knowledge.
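
(Again not from the article, just to illustrate the POS-tagging idea: one way to expand the feature space is to suffix each token with its part-of-speech tag before vectorizing, so different grammatical uses of the same surface form become distinct features. This sketch uses spaCy and scikit-learn; the documents are made up, and it assumes the en_core_web_sm model has been downloaded.)

    # Hypothetical sketch: append each token's POS tag so that, e.g., verb and
    # noun uses of the same word become separate features.
    # Requires: python -m spacy download en_core_web_sm
    import spacy
    from sklearn.feature_extraction.text import CountVectorizer

    nlp = spacy.load("en_core_web_sm")

    def pos_tokenize(text):
        # Produces tokens of the form "bank_VERB" or "bank_NOUN".
        return [
            f"{tok.lower_}_{tok.pos_}"
            for tok in nlp(text)
            if not (tok.is_punct or tok.is_space)
        ]

    docs = [
        "We plan to bank the grant money before the deadline",
        "The river bank eroded after the storm",
    ]

    # lowercase=False so spaCy sees the original casing; token_pattern=None
    # because we supply our own tokenizer.
    vectorizer = CountVectorizer(tokenizer=pos_tokenize, lowercase=False, token_pattern=None)
    X = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())

Of course, this only separates senses that differ by part of speech; it doesn't help when two meanings share the same grammatical role, which is part of why I lean toward supervised approaches instead.
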



