
Show HN: Automated writing analysis of Twitter and Reddit feeds - ddod
http://anthropologize.com/
======
ddod
I made this over the last two days in Node.js. The analysis is still pretty
simple, and I'd like to expand it over time. It updates every 30 minutes, and
you can already see that there are some significant shifts in literacy between
the daytime and very early morning.

I had originally started the project to grab Flesch-Kincaid grade levels, but
it was taking me 45 minutes to analyze about 20 pages of comments, so I
switched it out for Automated Readability Index.
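
For readers who haven't met the metric, here is a minimal Python sketch of the
standard Automated Readability Index formula. It is an illustration only, not
the site's Node.js code; the function name and the tokenization rules (words
split on whitespace, sentences split on ., ! and ?) are assumptions. ARI needs
only character, word, and sentence counts, with no syllable counting, which is
one plausible reason it is cheaper to compute than Flesch-Kincaid.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    # Words: whitespace-separated tokens reduced to letters and digits.
    words = [re.sub(r"[^A-Za-z0-9]", "", tok) for tok in text.split()]
    words = [w for w in words if w]
    # Sentences: approximated as runs of text ending in '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return None  # not enough text to score
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


score = automated_readability_index("The cat sat. The dog ran.")
```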

I really wanted to include HN, as I feel it's in a period of social shift, but
it wouldn't have the same sort of statistical significance due to its size (it
would most likely just be meaningless jarring jagged lines across the graphs).
If someone can think of a good way to include HN, I'm all ears. I'd also be
interested in other methods of writing analysis that I could include.

~~~
xaa
What exactly is the Twitter sample size? If it means "the number of records
processed", why does it vary over time?

One way to approach the "significance" problem for smaller communities like HN
is to create larger bins of misspellings/correct spellings. However, I don't
think you're going to see many "anyways" or "yea" on HN at all, much less
significant fluctuations over time.

Finally, an interesting question might be to what extent individuals fluctuate
in their ARI/FK levels in different contexts. What if a poster in /r/lolcats
writes a very asinine comment but then goes five minutes later to
/r/programming to speak intelligently. Or is it the case that an idiot is an
idiot, regardless of context?

~~~
ddod
Sample size is the anyway|anyways|yeah|yea instances. I get those as they get
posted, so that's why it fluctuates with time. Since they're so commonly used,
it should incidentally give you an idea of all of Twitter's load. I'm also
grabbing some other words that I haven't implemented on the clientside yet,
but I don't include them in the sample size.
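
A rough Python sketch of the counting described above, assuming each post
arrives as a plain string. The names TRACKED, count_tracked, and summarize are
hypothetical, and the real hooks may match differently. The word-boundary
anchors are there so that "yea" does not also match inside "yeah" or "year".

```python
import re
from collections import Counter

# The four tracked spellings; sample size is their combined instance count.
TRACKED = ("anyway", "anyways", "yeah", "yea")
PATTERN = re.compile(r"\b(" + "|".join(TRACKED) + r")\b", re.IGNORECASE)


def count_tracked(posts):
    """Count every whole-word instance of each tracked spelling."""
    counts = Counter({word: 0 for word in TRACKED})
    for post in posts:
        for match in PATTERN.findall(post):
            counts[match.lower()] += 1
    return counts


def summarize(counts):
    """Return the sample size and the share of each variant spelling."""
    def share(variant, standard):
        total = counts[variant] + counts[standard]
        return counts[variant] / total if total else None

    return {
        "sample_size": sum(counts.values()),
        "yea_share": share("yea", "yeah"),
        "anyways_share": share("anyways", "anyway"),
    }


stats = summarize(count_tracked(["Yeah, anyway I agree", "yea anyways"]))
```

Running this once per 30-minute window would give one point per graph; the
window length is the only part taken from the thread.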

HN comments tend to be more varied and sparse, so I think you're right that
measuring yea:yeah, etc. wouldn't be too enlightening. That said, I've noticed
a defined qualitative shift in HN comments over the past few months, and I'd
like to develop ways of measuring that before they reach Reddit/Twitter
levels.

As for your last point, I could track individuals, but a relational comparison
against /all/ data would be pretty difficult given the number of comments
overall vs. the small number of comments from any one individual. Also, ARI
isn't a great metric (hence my putting it in a tiny graph) because it measures
chars instead of syllables. For example, "FFFFFUUUUUUUUU" has the same score
as "constructivism".
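
To make that example concrete, the Python sketch below compares the only
word-level input each metric uses: character count for ARI, syllable count for
Flesch-Kincaid. The syllable counter is a crude vowel-run heuristic chosen for
illustration (an assumption, not how any particular library does it), and the
function names are hypothetical. The 4.71 and 11.8 weights are the per-word
coefficients from the standard ARI and Flesch-Kincaid grade formulas.

```python
import re


def estimate_syllables(word):
    """Crude estimate: each run of consecutive vowels counts as one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def ari_word_term(word):
    """Contribution of a single word to ARI's chars-per-word term."""
    return 4.71 * len(word)


def fk_word_term(word):
    """Contribution of a single word to Flesch-Kincaid's syllables-per-word term."""
    return 11.8 * estimate_syllables(word)


for word in ("FFFFFUUUUUUUUU", "constructivism"):
    print(word, len(word), estimate_syllables(word))
```

Both words are 14 characters long, so ARI cannot tell them apart, while even
this rough syllable estimate separates them.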

------
_delirium
Quibble: I don't think I'd consider either "yea" or "anyways" _misspellings_
per se, at least in informal chatting. "Anyways" is heard in some spoken
dialects, so you'd expect it in written forms of the same dialects. And I
think of "yea" (in modern usage) as a variant of "yeah", like "yah" is. All of
them are phonetic spellings of dialectal versions of "yes" to begin with
(along with "yep" and "yup" and such).

They don't, at least, seem in the same category as clear spelling errors like
copyright/copywrite, or bureaucracy/beaurocracy.

~~~
ddod
I chose those two to start things off for two reasons: The first is that they
stand out to me as pretty basic misspellings that don't appear in any sort of
popular literature, so it speaks to someone's reading experience. You can take
that for whatever importance it is to you, personally. The second reason is
their frequency. If the unit of measurement was larger than a half-hour or
hour, I could measure a bunch of other things (which I might still do). As it
stands, anyway|anyways|yea|yeah occur 25k-70k times per 30 minutes.

~~~
_delirium
I guess I'm disagreeing that they're "basic misspellings" in the first place,
or related to someone's reading experience. Informal spoken English is often
different from written English, and some people Tweet closer to how they'd
speak. That doesn't mean that the same people would write the same way if they
were writing a book (they probably wouldn't use "yeah" in a book, either), or
that a certain dialect of spoken English is "incorrect".

I think you're measuring something other than spelling here, closer to
measuring prevalence of certain dialects. Especially in the "anyways" example:
if there's something objectionable about that word, it's a usage objection,
not a spelling objection. People really do say "anyways", and that would be
the correct way of writing it if you accept that usage. Spelling it "enyways"
or "annyways" or something would be an orthographic error, on the other hand.

~~~
ddod
There's no link between spoken and written English in the context of yea:yeah,
as they're pronounced the same but either written correctly or incorrectly.

As for anyways, I think your classification there will depend on who you speak
to (which is the point of the analysis).

------
mst3kzz
"Yea" pronounced "yay" is a valid word with practically the same meaning as
"yeah" (affirmative). Granted "yea" is archaic and most of the instances you
are catching are just yeahs with the "h" dropped, but it would be nice to
choose something that is definitively wrong. The anyway/anyways is a better
choice and I hope that "anyways" does not become widely accepted because it is
like Bieber to my ears.

~~~
ddod
I'm working on the assumption that people using "yea" to mean "yeah" vastly
outnumber the very few people who might have reason to take to Twitter to use
"yea" in a voting or biblical context.

------
brador
Label yo axis son.

Expanding on this could lead to some interesting insights. How about "could
of" and "could have"? And the old "could care less"...

~~~
ddod
Those might work for Twitter, but there may be some lurking variables on
Reddit. While they never bother to catch yea and anyways, Redditors do have a
tendency to pile on about the few grammatical/spelling errors they can spot.
For example, if a post title says "could of", there are likely going to be
multiple comments discussing the error and thus tripping my hooks.

Edit: As for the axes, I made some stylistic decisions that I wouldn't dare to
make were this being submitted to the Annual New Media Socio-Linguistics
Journal.

------
gmu3
There is somewhat of an incentive to consciously misspell words on Twitter to
save characters, which muddles analysis like this.

Admittedly that doesn't apply here with "anyways" but it perhaps does with
"yea". Perhaps throw out tweets over a certain length?
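
One way that filter might look, as a Python sketch: the idea is that tweets
near the character limit are the ones most likely to be abbreviated on
purpose, so they are dropped before counting. The function name and the
120-character cutoff are assumptions picked for illustration, not values from
the site.

```python
def drop_long_tweets(tweets, max_len=120):
    """Keep only tweets short enough that saving characters was unlikely
    to be the reason for a shortened spelling. max_len is an assumed cutoff."""
    return [tweet for tweet in tweets if len(tweet) <= max_len]


kept = drop_long_tweets(["yea sounds good", "x" * 139 + " yea"])
```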

~~~
ddod
That's why I'm not relying on any single metric, as well as framing it in the
context of time. If Justin Bieber asks his fans to vote yea or nay on
something, it might mess up the yea:yeah metric, but it should eventually
normalize, and while it isn't normalized, the anyways:anyway metric should
still be fine (unless JBiebs is asking his fans to vote yea or nay on naming
his new album "Anyways").

