

An open-source filter software that can detect rampant stupidity in written English - jakewolf
http://stupidfilter.org/main/index.php?n=Main.About

======
pg
I've often thought about doing something like this for comments. I think it
would work.

The hard part is getting the initial corpora of stupid and non-stupid text.
Stupid writing is harder to recognize than spam. It might work to use sites as
proxies.

Another related filter that might be worth trying to build would be one for
recognizing trolls. It would be easy to collect the bad corpus for this
filter, because the design of most forums makes it easy to see all the
comments by a particular user, e.g.

<http://reddit.com/user/qwe1234/>

~~~
mixmax
"The hard part is getting the initial corpora of stupid and non-stupid text"

Can't you simply use comment votes for this - use the comments with the least
votes? According to the users of a site these would be the stupid ones. Maybe
combine this with some algorithmic tool that factors in the karma of the user
who wrote the comment.

The advantage of this approach is that it will work across sites where the
definition of stupid might differ.
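
A minimal sketch of the vote-based labelling idea above. The `Comment` fields, the thresholds and the `karma_weight` factor are all hypothetical choices made for illustration, not something any site is known to use:

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    score: int         # net votes on the comment
    author_karma: int  # total karma of the user who wrote it


def label_by_votes(comments, low=-2, high=5, karma_weight=0.001):
    """Split comments into 'stupid' and 'non_stupid' training corpora.

    The adjusted score is score + karma_weight * author_karma (an assumed
    formula). Comments that fall between the two thresholds are treated
    as ambiguous and left out of both corpora.
    """
    corpora = {"stupid": [], "non_stupid": []}
    for c in comments:
        adjusted = c.score + karma_weight * c.author_karma
        if adjusted <= low:
            corpora["stupid"].append(c.text)
        elif adjusted >= high:
            corpora["non_stupid"].append(c.text)
    return corpora
```

Because the thresholds are applied to each site's own votes, the same function would give each site its own definition of stupid.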

~~~
comatose_kid
I don't know if the stupidity of the message correlates to its vote tally. For
example, a comment with a negative rating may hold an unpopular but valid
view.

In fact, if the mapping was that good, one wouldn't need to run a 'stupid
filter' on the message body in the first place.

~~~
derefr
The data might be more useful together--first generate a simple intelligence-
scale number (between -1 and 1), then weight all of a user's votes by their
intelligence, and recalculate everyone's intelligence from their weighted
karma. That is to say, if a lot of stupid people hate you, you appear smarter.

The problem might be with trolls, where the stupid are tricked and vote down
in retribution, but the intelligent notice right away but are simply amused at
the skill of the execution, and vote up in humor. This makes the troll appear
much more intelligent than they are.
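
A rough sketch of that reweighting step, under some assumptions of mine: votes are `(voter, target, direction)` tuples with direction +1 or -1, unknown voters count for nothing, and `tanh` is used to squash weighted karma back into the -1 to 1 range. None of these choices come from an existing system:

```python
import math


def reweight_intelligence(initial, votes, rounds=1):
    """Recalculate intelligence scores from intelligence-weighted votes.

    initial: dict mapping user -> starting score in [-1, 1]
    votes:   list of (voter, target, direction), direction is +1 or -1
    """
    scores = dict(initial)
    for _ in range(rounds):
        weighted_karma = {}
        for voter, target, direction in votes:
            weight = scores.get(voter, 0.0)  # unknown voters carry no weight
            weighted_karma[target] = (
                weighted_karma.get(target, 0.0) + direction * weight
            )
        updated = dict(scores)
        for user, karma in weighted_karma.items():
            # tanh keeps the result in (-1, 1); an assumed squashing function
            updated[user] = math.tanh(karma)
        scores = updated
    return scores
```

With this, a downvote (-1) from a voter scored at -0.8 contributes +0.8 to the target's weighted karma, which is the "if a lot of stupid people hate you, you appear smarter" effect.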

------
LogicHoleFlaw
The XKCD folks implemented what they call "Robot9000". Robot9000 attempts to
ensure that every comment being added to a site or chat channel is unique when
compared against the history of the channel. It basically hashes a somewhat
stripped-down version of each comment and compares it against the entire
historical corpus of their chat. If the comment is found then the user is
muted for an exponentially-increasing amount of time for each infraction. I
believe there's a slow decay on the mute duration as well. This sort of filter
won't stop stupid text, but it seems to be working for them. It's a novel
approach to the problem of signal-dilution as a social network grows.

Robot9000 release announcement:
[http://blag.xkcd.com/2008/01/14/robot9000-and-xkcd-signal-
at...](http://blag.xkcd.com/2008/01/14/robot9000-and-xkcd-signal-attacking-
noise-in-chat/)

Perl source: <http://media.peeron.com/tmp/ROBOT9000.html>
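
For illustration, here is a small Python sketch of the scheme as I described it above. It is not a port of the Perl source: the normalization rules, the SHA-256 hash, the base mute of 2 seconds and the `decay` step are all my assumptions.

```python
import hashlib
import re


class UniquenessFilter:
    """Reject comments whose normalized text has been seen before."""

    def __init__(self, base_mute_seconds=2):
        self.seen = set()       # hashes of every accepted comment
        self.infractions = {}   # user -> current infraction count
        self.base = base_mute_seconds

    @staticmethod
    def normalize(text):
        # Strip the comment down: lowercase, drop punctuation,
        # collapse runs of whitespace.
        text = re.sub(r"[^a-z0-9\s]", "", text.lower())
        return " ".join(text.split())

    def submit(self, user, text):
        """Return 0 if the comment is new, else the mute length in seconds."""
        digest = hashlib.sha256(
            self.normalize(text).encode("utf-8")
        ).hexdigest()
        if digest not in self.seen:
            self.seen.add(digest)
            return 0
        count = self.infractions.get(user, 0) + 1
        self.infractions[user] = count
        return self.base * 2 ** (count - 1)  # doubles with each infraction

    def decay(self):
        """Slowly forgive: lower every user's infraction count by one."""
        for user in list(self.infractions):
            self.infractions[user] -= 1
            if self.infractions[user] <= 0:
                del self.infractions[user]
```

Calling `decay()` on a timer would give the slow decay on mute duration; how often the real bot does this, if at all, I don't know.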

------
ambition
I'd be more interested in the complement -- filters tuned to pick up smart or
interesting writing. I'm not convinced that it's necessarily an identical
problem.

It would be neat to run a battery of standard semantic analysis tools against
the text of web pages ranked highly on HN, compared with pages not ranked
highly.

~~~
as
That's actually a fascinating idea.

------
TrevorJ
// _USER COMMENT REDACTED BY STUPIDITY FILTER_ //

------
aykall
That's a really hard task to accomplish. Is poor English stupid? What about
foreigners writing in English? Are they all stupid? They won't write 100%
proper English. What about misspelled words?

I think irrelevancy and inaccuracy are the best ways to distinguish stupid from
smart, and the key to telling one from the other is probably the subject of
the comments. That would be a related/non-related filter, not a stupid
filter.

Honestly, judging by the name, I think it is intended more to generate buzz
than to become a real product. Isn't Mr. Ortiz just trying to get some
attention? The definition of stupid is directly related to the reader so you
can't have a filter for that, it would have to be personal.

~~~
derefr
> The definition of stupid is directly related to the reader so you can't have
> a filter for that, it would have to be personal.

If you mean that it is subjective, Ortiz's entire experiment is to find the
_degree_ to which it is so, or rather, the degree to which one can objectively
measure intelligence in writing from its characteristics.

------
jakewolf
From WSJ blog [http://blogs.wsj.com/buzzwatch/2008/03/24/idea-watch-can-
thi...](http://blogs.wsj.com/buzzwatch/2008/03/24/idea-watch-can-this-man-
banish-stupidity-from-the-internet/?mod=WSJBlog?mod=homeblogmod_buzzwatch)

~~~
as
"these sample comments are then compared to “smart” text from a body of work
on sites like Project Gutenberg, an online catalog of great world literature.
Mr. Ortiz says he took snippets from classics by such authors as Jules Verne
and J.D. Salinger to serve as a baseline for “the edited English language.”"

Did anyone else laugh at the thought of setting Holden Caulfield as the
paragon of English prose?

~~~
aston
Anyone who didn't is a phony.

------
petercooper
It is an easy mistake to believe that tools as simple as Bayesian filters can
emulate intelligence. It requires intelligence to determine whether someone
else is intelligent or not, not a bunch of rules and filters.

As we've seen with spam, any unintelligent system can be circumvented given
enough time and ingenuity. Bayesian filters are now but a small part of e-mail
analysis.

~~~
jcl
For things as simple as _simple_ Bayesian filters, I'd agree. But I wouldn't
be surprised if a sufficiently advanced filter is indistinguishable from
intelligence. Otherwise, you could make the argument: "It is an easy mistake
to believe that things as simple as neurons can emulate intelligence. It
requires intelligence to determine whether someone else is intelligent or not,
not a bunch of firing thresholds."

------
earle
>>> An open-source filter software that can detect rampant stupidity in
written English

Text is not likely to be stupid.

CLASSIFY succeeds; success probability: 0.5043 pR: 0.0075 Best match to file
#0 (/home/sfp/code/nonstupid_cor.css) prob: 0.5043 pR: 0.0075

------
edw519
Input: "If I had 6 hours to chop down a tree, I'd spend the first 4 sharpening
the axe."

Output: "Text is not likely to be stupid."

Input: "You wanna see my pics?"

Output: "Text is likely to be stupid."

~~~
jakewolf
Yes, huge flaws and easily gamed. I just liked the attempt.

------
Prrometheus
Reddit should use a filter like this that a comment must pass before it is
allowed to be posted (except for on the lolcat and NSFW subreddits).

------
mtw
not scalable, look at their "stupid" and "non-stupid" data

------
xlnt
They give an example of filtering out lowercase text. Lowercase and stupid are
very different. I sometimes write whole essays in lowercase.

~~~
maximilian
++ on the lowercase. I do uppercase sentence starts, but the rest i like to
leave lowercase.

German uppercases all nouns. It's really hard to get used to doing.

