
Show HN: My weekend project: A classifier that classifies text as stupid/clever. - StavrosK
Over a weekend at a friend's house we wrote a text classifier that classifies comments as stupid/clever. We used various techniques to train it, but here it is:

http://www.pythiafilter.com/

It doesn't understand content, only style (mostly), so if you post something incoherent but grammatical it will probably think it's clever. It's not perfect, but it got 94% accuracy or so on the test set (which, admittedly, wasn't of very high quality). YMMV, and this is just something we did for fun one day, so don't take anything it says seriously.
======
StavrosK
Clickable: <http://www.pythiafilter.com/>

~~~
mishmash
Is it possible to link to the results page to pass it around?

~~~
StavrosK
Unfortunately not, it doesn't store anything for now, it's an API that just
gives you a 0/1 for stupid/clever. That's a good suggestion, though. I'll add
a feature to save the results and give you a link in exchange for manually
marking your text as stupid/clever, so the filter gets trained.

A sort of a "tit for tat" scenario, it might actually work well, thank you for
the suggestion!
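
A minimal sketch of that "label in exchange for a link" idea, assuming an in-memory store. The class name `ResultStore`, its methods, and the `example.invalid` base URL are all hypothetical and are not part of the real site or its API:

```python
import uuid


class ResultStore:
    """Hypothetical in-memory stand-in for whatever storage the site might use."""

    def __init__(self, base_url="http://example.invalid/result/"):
        self.base_url = base_url      # assumed URL scheme, not the real one
        self.results = {}             # result id -> saved classification
        self.training_examples = []   # (text, user_label) pairs for retraining

    def save(self, text, predicted, user_label):
        """Save a result and return a shareable link, but only if the
        user supplies their own 0/1 (stupid/clever) label."""
        if user_label not in (0, 1):
            raise ValueError("a 0/1 label is required to get a link")
        result_id = uuid.uuid4().hex
        self.results[result_id] = {"text": text, "predicted": predicted}
        self.training_examples.append((text, user_label))
        return self.base_url + result_id

    def lookup(self, link):
        """Return the saved result for a link produced by save()."""
        return self.results[link.rsplit("/", 1)[-1]]
```

Every saved result adds one manually labelled example to `training_examples`, which is the trade described above.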

~~~
mishmash
Cool, good luck with it. :)

~~~
StavrosK
Thanks :)

------
solipsist
Everyone will be happy to know that entering the entire comment section of
Hacker News's top post at the moment[1] returned ' _Your text was of high
quality_ ' while Reddit's top post[2] returned ' _Your text was of low
quality_ '.

This text classifier is great!

[1] - <http://news.ycombinator.com/item?id=2086628>

[2] -
[http://www.reddit.com/r/gaming/comments/ez7po/i_agree_logite...](http://www.reddit.com/r/gaming/comments/ez7po/i_agree_logitech/)

~~~
StavrosK
Haha, I used upvoted reddit comments for the "high quality" dataset and
YouTube comments for the "low quality" one. I am sorry for the internet
racism!
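
For illustration, a labelled dataset of that shape could be assembled and scored roughly like this. This is only a sketch: the function names, the 1/0 label convention, and the 80/20 split are assumptions, not the actual training setup:

```python
import random


def build_dataset(high_quality, low_quality, test_fraction=0.2, seed=0):
    """Label comments by source (1 = high quality, 0 = low quality),
    shuffle them, and split them into (train, test) lists."""
    labelled = [(text, 1) for text in high_quality]
    labelled += [(text, 0) for text in low_quality]
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    rng.shuffle(labelled)
    n_test = int(len(labelled) * test_fraction)
    return labelled[n_test:], labelled[:n_test]


def accuracy(classify, examples):
    """Fraction of (text, label) examples where classify(text) == label."""
    if not examples:
        return 0.0
    correct = sum(1 for text, label in examples if classify(text) == label)
    return correct / len(examples)
```

Because the labels come from the source site rather than from the text itself, a high test accuracy here only says the classifier can tell the two sources apart.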

~~~
growt
You might end up with a classifier that considers some memes and topics
(tech, games) that originate from reddit as high quality and YouTube-centric
stuff (mainstream media, music, etc.) as low quality.

~~~
StavrosK
It doesn't analyse the content, so it doesn't know about that. I do plan to
have the ability to train it, though, so that would quickly improve it by
quite a bit...
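
To make "style, not content" concrete, here is a sketch of the kind of features such a classifier could look at. The specific features are guesses for illustration only; the thread does not say which ones the real classifier uses:

```python
import re


def style_features(text):
    """Compute a few content-agnostic style features of a comment.
    The feature set is hypothetical, chosen only to illustrate the idea."""
    words = re.findall(r"[A-Za-z']+", text)
    letters = [c for c in text if c.isalpha()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # longer words on average tend to read as more formal writing
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # share of letters that are capitals (high for ALL-CAPS comments)
        "uppercase_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        # runs such as "!!!" or "?!"
        "repeated_punctuation": len(re.findall(r"[!?]{2,}", text)),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }
```

None of these features depend on what the words mean, which is why grammatical nonsense could still score well.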

------
requinot59
"I think a grain of salt have more value than a possible bird might shake
nowadays then caliber it for turtles and China."

Mark V. Shaney -- _High quality_ troller since 1983 ;-)

------
templaedhel
Viewing the demo, then clicking the logo gives a 404 because it links to
demo/index.htm not index.htm

~~~
StavrosK
Thank you, I was wondering where that 404 was coming from in the logs. I'll
change it now.

------
instakill
"Hello, my name is John" is of low quality.

------
candre717
Will you be releasing the source code?

~~~
StavrosK
Unfortunately no, I'm thinking of developing it further into a full-blown
comment classifier for forums/etc.

------
johnthomas682
Unfortunately not, it doesn't store anything for now, it's an API that just
gives you a 0/1 for stupid/clever. That's a good suggestion, though. I'll add
a feature to save the results and give you a link in exchange for manually
marking your text as stupid/clever, so the filter gets trained. Thanks <a
href=http://used.gov-auctions.org>used cars</a>

~~~
sorbus
Wow, this is a new strategy for spambots. Taking the content of another
comment (in this case <http://news.ycombinator.com/item?id=2086932> ), and
posting it in the same thread (ensuring that it's somewhat relevant ... well,
not in this case, as it's out of context) but with a link to a website at the
end. Still has the new-account thing, though. Has anyone else seen this
happening?

If you're not a spambot, I am, of course, very sorry for the accusation, but
you must admit that the evidence is on my side.
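
A rough sketch of how this copy-and-append pattern could be checked automatically, assuming the earlier comments in the thread are available as plain strings. The function names, the 0.9 threshold, and the minimum length are illustrative assumptions, not anything HN is known to do:

```python
import re
from difflib import SequenceMatcher

# Links and HTML tags are stripped before comparing, since the appended
# link is the part a spammer changes.
URL_OR_TAG = re.compile(r"https?://\S+|<[^>]*>")


def normalise(text):
    """Lowercase, drop links/tags, and split into words."""
    return URL_OR_TAG.sub(" ", text).lower().split()


def find_copied_comment(new_comment, earlier_comments, threshold=0.9, min_words=8):
    """Return the index of the first earlier comment whose words are almost
    entirely reproduced, in order, inside new_comment; otherwise None."""
    candidate = normalise(new_comment)
    for index, earlier in enumerate(earlier_comments):
        original = normalise(earlier)
        if len(original) < min_words:
            continue  # short comments like "Thanks :)" repeat legitimately
        matcher = SequenceMatcher(None, original, candidate, autojunk=False)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        if matched / len(original) >= threshold:
            return index
    return None
```

Measuring coverage of the original comment, rather than overall similarity, means text appended at the end does not lower the score. Legitimate quoting of a parent comment would also trigger it, so a match is a reason to look closer and not proof of spam.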

~~~
StavrosK
I saw that comment and thought it was mine, so I was wondering why you were
accusing me of being a spambot and was confused. Then I saw the username. I
hope HN doesn't start to get overrun by spambots...

~~~
sorbus
Quite sorry about the confusion; but no harm done, at least.

If you have showdead turned on, there are lots of spam posts around - mostly
just bunches of links from users who are dead (they can post stuff and see
their own posts normally, but no one else can) - but this strategy is rather
worrying, because it's harder to catch automatically. On the upside, the
community is large enough now that most of these posts should be caught and
flagged.

Simply skimming through noobcomments looking for spam might not work as a
strategy anymore, though, especially if they start grabbing comments which
originally contained links (especially citations) and then replacing them with
legitimate looking links to new sites; basically, if they figure out a way to
stop the links from looking spammy, they're going to require a bit more human
effort to catch, which might let lots of them get through - especially by
posting on older articles which have fallen off the front-page.

(Great, I'm giving advice to spammers now. Bit worried about posting this,
actually, but someone else is going to think of it - probably already has - so
I'm probably not doing any harm.)

