
Text Mining South Park - eamonncarey
http://kaylinwalker.com/text-mining-south-park/
======
nanis
I was in the process of reading this when I thought to check who this person
is. Of course, by that time the site had failed, so I haven't read the whole
thing yet.

But it seems to me that the author is falling into a trap many an unwary
data "scientist" falls into by not understanding the discipline of Statistics.

When one has the entire population data (i.e. a census), rather than a sample,
there is no point in carrying out statistical tests.

If I know _ALL_ the words spoken by someone, then I know which words they say
the most without resorting to any tests simply by counting.

No concept of "statistical significance" is applicable because there is no
sample. We can calculate the population value of any parameter we can think
of because we have the entire population (in this specific instance, _ALL_
the words spoken by all the characters).
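The counting point can be sketched in a few lines (the word list below is invented for illustration, not taken from the show's transcripts):

```python
from collections import Counter

# Hypothetical "entire population": every word a character ever speaks
# (made-up data for illustration).
population = ["you", "guys", "dude", "you", "guys", "seriously", "you"]

# With the whole population in hand, the most frequent word is not an
# estimate subject to sampling error -- it is the exact population
# parameter, obtained simply by counting.
counts = Counter(population)
print(counts.most_common(1))  # → [('you', 3)]
```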

FYI, all budding data "scientists" ...

~~~
vsbuffalo
You're treating this sample-is-the-population issue as if it's resolved in the
statistics literature. It is not. Gelman has written on this [1][2], as the
issue comes up in political science data frequently. As Gelman points out, the
50 states are not a sample of states—they're the entire population. Similarly,
the Correlates of War [3] data is every militarized international dispute between
Correlates of War [3] data is every militarized international dispute between
1816-2007 that fits certain criteria—it too is not a sample but the entire
population.

Treating his population as a large sample of a process that's uncertain or
noisy and then applying frequentist statistics is not _inherently_ wrong in
the way you say it is. It may be that there's a better way to model the
uncertainty in the process than treating the population as a sample, but
that's a different point than the one you make.

[1]: http://andrewgelman.com/2009/07/03/how_does_statis/

[2]: http://www.stat.columbia.edu/~gelman/research/published/econanova3.pdf
(see the finite-population section)

~~~
dragonwriter
> Similarly, the Correlates of War [3] data is every militarized international
> dispute between 1816-2007 that fits certain criteria—it too is not a sample
> but the entire population.

It's the entire population of wars meeting certain criteria in that time
frame. If that is the topic of interest, then it is also the whole population.
OTOH, datasets like that are often used in analyses intended to apply
to, for instance, "what-if" scenarios about hypothetical wars that _could_
have happened in that time frame, in which case the studied population is
clearly _not_ the population of interest, but is taken to be -- while there
may be specific reasons to criticize this in specific cases for reasons other
than "it's the whole population, not a sample" -- a representative sample of a
broader population.

------
seankross
Here's the accompanying GitHub repo:
https://github.com/walkerkq/textmining_southpark

------
wodenokoto
> Reducing the sparsity brought that down to about 3,100 unique words [from
> 30,600 unique words]

What does that mean? Does he remove words that are only said once or twice?

Can anyone point me to a text explaining the difference between identifying
characteristic words using _log likelihood_ and using _tf-idf_?

~~~
minimaxir
Relevant line in the code:

    # remove sparse terms
    all.tdm.75 <- removeSparseTerms(all.tdm, 0.75) # 3117 / 728215

I believe that's a document-frequency cutoff rather than tf-idf weighting:
terms absent from more than 75% of the documents get dropped.
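For what it's worth, the filtering can be approximated in a few lines of Python -- my own sketch with invented toy data, not code from the repo:

```python
# Rough approximation of tm's removeSparseTerms: drop any term that is
# absent from more than `sparse` of the documents.
def remove_sparse_terms(docs, sparse):
    n = len(docs)
    vocab = {w for d in docs for w in d}
    kept = set()
    for term in vocab:
        df = sum(term in d for d in docs)  # document frequency
        sparsity = 1 - df / n              # fraction of docs lacking the term
        if sparsity <= sparse:
            kept.add(term)
    return kept

# Toy "documents" (sets of words per episode -- invented data):
docs = [{"dude", "timmy"}, {"dude", "cartman"}, {"dude"}, {"kenny"}]
print(sorted(remove_sparse_terms(docs, 0.5)))  # → ['dude']
```

With a 0.5 threshold, only terms appearing in at least half of the documents survive; the repo's 0.75 threshold is more permissive.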

------
cadab
I've found an image, which I'm guessing is taken from the site:
http://imgur.com/IEudyni, worth looking at if the site's still down.

------
LoSboccacc
I would have loved to see the log-likelihood characterization for the Canadian
characters, even if they aren't part of the main cast.

------
dropdatabase
This is amazing. I wonder what results you'd get from The Simpsons.

~~~
charlieegan3
I'm not sure the subtitles contain character information, but the people
running https://frinkiac.com/ might have the data.

------
rhema
Pretty interesting. This Large Scale Study of Myspace paper
(http://www.cc.gatech.edu/projects/doi/Papers/Caverlee_ICWSM_2008.pdf)
shows a similar method for finding characteristic terms, using mutual
information.
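For context, a (pointwise) mutual-information score compares how often a term occurs in one group's text with how often chance alone would predict; a toy computation with invented counts:

```python
import math

# Invented counts for illustration:
n_total = 10000       # total tokens in the corpus
n_term = 200          # occurrences of "dude" anywhere
n_class = 2500        # tokens spoken by one character
n_term_class = 120    # occurrences of "dude" in that character's lines

# PMI = log2( P(term, class) / (P(term) * P(class)) )
pmi = math.log2((n_term_class / n_total) /
                ((n_term / n_total) * (n_class / n_total)))
print(round(pmi, 2))  # → 1.26
```

A positive score means the term is over-represented in that character's lines relative to chance; zero means it occurs exactly as often as independence would predict.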

------
peg_leg
This should be nominated for an Ig Nobel.

------
agentgt
I wonder how the results would change if they were based not on words but
on lines (not string lines, but actors' lines in conversation).

It's also funny how Stan talks more than Kyle, given the show now has a
recurring joke that makes fun of Kyle's long educational dialogues.

~~~
cdubzzz
Maybe because of Kyle's decision to not give long speeches last season (:

------
gulbrandr

    Error establishing a database connection

Does anyone have a cached version, please?

~~~
eeturunen
http://i.imgur.com/IEudyni.jpg

