
I'd opine the choice of model matters less than the general approach used. In this case, the features taken from the data are largely metadata about an account, metadata about tweet frequency and the like, and automatically generated text features about the tweet text, which are used to train a binary classifier. Whether it's GBM, LDA, logistic regression, etc. applied to this data is probably not that huge a deal (maybe small % differences in accuracy), assuming each is used correctly.
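To illustrate the point, here is a minimal sketch (not from the article) showing that swapping classifiers on the same tabular, metadata-style features usually moves accuracy only by a few percent. The synthetic features are a stand-in for per-account metadata; the feature semantics are hypothetical.

```python
# Sketch: two interchangeable classifiers on synthetic "metadata" features.
# The features stand in for things like follower count, tweets per day,
# account age, etc. (hypothetical; not the article's actual features).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=8,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "gbm": GradientBoostingClassifier(random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}
accs = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
        for name, m in models.items()}
print(accs)  # both models land in a similar accuracy range
```

On data like this, the gap between the two models tends to be small relative to the gain from better features.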

What I'd also want to see, however, is a curated, relevant corpus of tweet content integrated into the classification process: actual tweet content (bag of words, sentence structure, and other NLP features, plus images and other media) used to measure a given account's similarity to known bots and non-bots. That would replace reliance on metadata similarity plus crude auto-extracted text features like total word and character counts. I can't say for sure that it would improve predictions, but it seems key for this type of problem.
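A toy sketch of the contrast being drawn: crude auto-extracted counts versus a bag-of-words representation of the tweet text itself, stacked into one feature matrix for a downstream classifier. The example tweets and feature choices here are hypothetical, purely for illustration.

```python
# Sketch: combining crude text features (word/char counts) with a
# TF-IDF bag-of-words over tweet content. Toy tweets, not real data.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "buy followers now click this link",        # bot-like (toy example)
    "great coffee with friends this morning",
    "click link win prize now now now",
    "watching the game tonight with family",
]

# Crude auto-extracted features: total word count and character count.
crude = np.array([[len(t.split()), len(t)] for t in tweets])

# Richer content features: TF-IDF bag of words over the corpus.
vec = TfidfVectorizer()
bow = vec.fit_transform(tweets)

# Combined feature matrix: content features plus the crude counts.
X = hstack([bow, crude])
print(X.shape)  # (4, vocabulary_size + 2)
```

The combined matrix could then be fed to any of the classifiers mentioned above; the point is that the content columns carry signal the count columns cannot.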


