Ask HN: NLP strategies to detect clickbait in HN titles?
3 points by throwawaybutwhy on Aug 21, 2020
HN boasts a very high S/N ratio, and that's why I use it.

However, there are still many nuisance items that can clog the /active and front pages.

Some of them are easily filtered out with CSS + XPath rules in uBlock Origin (major paywalled newspapers and the top-karma users associated with them, frequent blogspammers, posts with keywords from the two programming language proselytizing communities).

There are others that can be spotted with a bit of manual bipartite graph (poster -> site) analysis, yet I'm loath to set up an algorithm to hunt for the long tail of fresh accounts.
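For illustration, the manual poster -> site analysis could be sketched roughly as below. This is only a minimal sketch under assumptions: submissions are taken to be available as (poster, url) pairs, and the function names and thresholds (`min_posts`, `min_share`) are hypothetical, not tuned on real HN data.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def site_of(url):
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_edges(submissions):
    """Count edges of the bipartite graph: (poster, site) -> number of posts."""
    edges = Counter()
    for poster, url in submissions:
        site = site_of(url)
        if site:  # skip text posts and unparsable URLs
            edges[(poster, site)] += 1
    return edges


def suspicious_pairs(edges, min_posts=3, min_share=0.8):
    """Flag (poster, site) pairs where the poster submits almost only that
    site AND the site is submitted almost only by that poster.
    Thresholds are hypothetical defaults."""
    by_poster = defaultdict(int)
    by_site = defaultdict(int)
    for (poster, site), n in edges.items():
        by_poster[poster] += n
        by_site[site] += n

    flagged = []
    for (poster, site), n in edges.items():
        if n < min_posts:
            continue
        poster_share = n / by_poster[poster]
        site_share = n / by_site[site]
        if poster_share >= min_share and site_share >= min_share:
            flagged.append((poster, site, n))
    return sorted(flagged)


# Made-up sample data, for illustration only.
sample = [
    ("alice", "https://example.com/a"),
    ("alice", "https://www.example.com/b"),
    ("alice", "https://example.com/c"),
    ("bob", "https://nytimes.com/x"),
    ("carol", "https://nytimes.com/y"),
    ("bob", "https://github.com/z"),
]
flagged = suspicious_pairs(build_edges(sample))
```

Requiring both shares to be high keeps popular sites submitted by many users out of the result, and it also spares prolific users who post from many sites. It does nothing for the long tail of fresh accounts with one or two posts, which fall under `min_posts`.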

Catching new blogspammers whose account names match their URLs is low-hanging fruit for regexes. However, regular expressions are generally a slow trainwreck waiting to happen.
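A rough sketch of the account-name-vs-URL check follows. It keeps the regex part deliberately small: the host comes from `urllib.parse`, and a single precompiled character-class pattern is used only for normalization. The function names and the `min_len` cutoff are hypothetical, and the domain handling is naive (it just ignores the last label, so it will mishandle suffixes like `.co.uk`).

```python
import re
from urllib.parse import urlparse

# Single linear-time pattern: strip everything except lowercase letters/digits.
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def normalize(text):
    """Lowercase and drop punctuation so 'Acme_Widgets' ~ 'acme-widgets'."""
    return _NON_ALNUM.sub("", text.lower())


def looks_like_self_promo(username, url, min_len=4):
    """Return True if the account name and a host label contain one another.

    min_len is a hypothetical cutoff that avoids matching very short
    names or labels by accident.
    """
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    # Naive: treat the last label as the TLD and ignore it.
    labels = [normalize(label) for label in host.split(".")[:-1]]
    user = normalize(username)
    if len(user) < min_len:
        return False
    return any(
        len(label) >= min_len and (label in user or user in label)
        for label in labels
    )
```

For example, `looks_like_self_promo("acme_widgets", "https://blog.acme-widgets.io/post")` returns True, while `looks_like_self_promo("someuser", "https://github.com/foo")` returns False. Substring containment is a crude stand-in for fuzzy matching and will produce false positives on generic labels.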

I have searched for a deep learning-based NLP solution and found tons of clickbaity papers on clickbait on arXiv, yet their definitions of clickbait aren't necessarily aligned with mine [0]. Most use a fine-tuned BERT for classification. Are there any other approaches that I've missed?

[0] https://www.clickbait-challenge.org/



