
Why literature is the ultimate big-data challenge - Hooke
http://www.economist.com/blogs/prospero/2017/03/revenge-maths-mob
======
DrNuke
Can't really agree: literature is literature because of form, not content;
every literary work is its own separate world with references to other
similarly separate worlds from the past. Data analysis can help find
invariants, as in Meter in Poetry by Prof Nigel Fabb
([https://www.amazon.co.uk/Meter-Poetry-Theory-Nigel-Fabb/dp/0521713250](https://www.amazon.co.uk/Meter-Poetry-Theory-Nigel-Fabb/dp/0521713250)),
but not the reason why literature is literary at all, which comes from the
social sciences. Mythopoesis as the cumulative sum of the real and imaginary
worlds thought up by humans is another possible result, but it would not be
literature; it would be linguistics and the theory of language instead.

~~~
wjn0
> every literary work is its own separate world with references to other
> similarly separate worlds from the past.

This sounds like the basis of hypertext fiction
([https://en.wikipedia.org/wiki/Hypertext_fiction](https://en.wikipedia.org/wiki/Hypertext_fiction))
which has of course existed for much longer than big data as a concept.

As for what characterizes literature, I'm inclined to agree with you. However,
is there not inherent value in another (albeit computational in nature)
reading, so to speak? If we take Barthes to be correct, what does it say when
a well-trained NLP model draws similar conclusions to humans with regard to
imagery, analogy, irony, metaphor, etc. when reading major literary works?
Different conclusions? Or no conclusions? What if some works "compute" and
some don't? What if some works' features are culture-independent, i.e., a
model trained on an Eastern literary corpus computes similar features to a
model trained on a Western literary corpus, while some features aren't?

Perhaps these questions are more superficial than I'm making them out to be,
but it seems presumptuous to assume that methods that look at this problem
from this angle won't get at _any_ literary features.

~~~
DrNuke
You're right, of course. My point is that literary merit is not an invariant:
it changes with time, culture, and the historical events of a given community.
There are so many cases of artists being considered shit in one period, good
in another century, then shit again, indifferent, or even canonical for a
while. What exactly should data science train a universal literary / not
literary classifier on, then? Or use unsupervised learning to cluster what? HN
is full of bright minds, so I'm sure a formula might reasonably be suggested,
but that's not the point of literature as a form of its own, like maths or
music.

------
hackuser
We often start with the null hypothesis of artistic "exceptionalism, which
imagines him [or her] as a freak of isolated genius", i.e., we assume that the
credited creator worked alone.

I think that's wrong: few serious human endeavors are accomplished alone.
In the arts, look at stories about how works were really created. Pop music is
an easy example because it's well known: Songs are written with a suggestion
from a friend, input from the producer, are based on something the performer
heard on the train, are created when someone uncredited sits in on the session
and provides the key hook, etc. A friend is writing a book, and I spent hours
reading it and offering suggestions; I'll receive no credit (and I don't want
or deserve it). In the code people write, how much is done without any help
from others, without using existing code and ideas? As the saying goes, _good
artists borrow; great artists steal._

~~~
aghillo
Howard Becker's Art Worlds book looks at this very issue - it's a very
interesting read.

~~~
hackuser
Thanks. What does he say about it?

~~~
aghillo
It's specifically focused on the domain of Art and argues that Art is
essentially a cooperative, networked activity: a network of producers,
suppliers, distributors, influencers, etc., who all contribute to a final
piece of Art.

------
Jun8
Although the field had its crackpots, it was never as off-the-rails as the
Economist writer makes it out to be. Mosteller and Wallace's analysis of the
_Federalist Papers_ is a well-known early effort
([https://priceonomics.com/how-statistics-solved-a-175-year-old-mystery-about/](https://priceonomics.com/how-statistics-solved-a-175-year-old-mystery-about/)).

Part of my MA covered this topic (stylometry); you can take a look
([https://www.dropbox.com/s/3q9ljrgnntgs6ee/t.pdf](https://www.dropbox.com/s/3q9ljrgnntgs6ee/t.pdf)).
It lists some of the references up until that time (c. 2004).

~~~
Isamu
Thanks!

------
Isamu
I would be interested in reading some technical papers on the subject, could
someone post a few? Whether about Shakespeare or other literature analysis?

That said, this is SO not big data. It is small data, but potentially
interesting analysis.

~~~
NarcolepticFrog
I think it depends on what you mean. I don't think that just the number of
bytes of data should be used to measure the size of your dataset. For example,
if I have 1TB of all 1's, this is a lot of data, but not very interesting.

I think a more nuanced notion of size is the /information content/ of the
datasets. I haven't thought about it carefully, but I'm sure you can quantify
this more explicitly in terms of information theory or other non-bit-based
complexity measures.
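
One crude way to make that concrete is the empirical Shannon entropy of the
raw bytes — a sketch only, since it looks at symbol frequencies alone and
ignores all higher-order structure:

```python
import math
from collections import Counter

def entropy_per_symbol(data: bytes) -> float:
    """Empirical Shannon entropy of a byte string, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A run of identical bytes carries no information per byte...
uniform = b"1" * 1_000_000
# ...while natural-language text sits well above zero but well below
# the 8 bits/byte of incompressible noise.
prose = b"It was the best of times, it was the worst of times."
print(entropy_per_symbol(uniform), entropy_per_symbol(prose))
```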

From this point of view, literature may not be large (in terms of the number
of bytes), but in terms of information content, it is incredibly dense. A
large portion of human knowledge is written somewhere. From this point of
view, it is in fact big data.

~~~
Isamu
Maybe you would find the theory of information interesting:

[http://math.harvard.edu/~ctm/home/text/others/shannon/entrop...](http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf)

Here, in his famous tour de force, Claude Shannon lays out how we can
estimate the amount of actual information in an act of communication (e.g. a
literary work) and relates it to entropy.

To your point, 1TB of all 1's compresses to just a few bits of actual
information.
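
That's easy to check with any off-the-shelf compressor; a minimal sketch
with Python's zlib, where the prose sample is just an arbitrary stand-in:

```python
import zlib

# A megabyte of identical bytes deflates to roughly a kilobyte:
# almost all of the "data" was redundancy, not information.
ones = b"1" * 1_000_000

# A short passage of real prose barely shrinks, because most of it
# is genuinely new to the compressor.
prose = (b"When in disgrace with fortune and men's eyes, "
         b"I all alone beweep my outcast state, "
         b"And trouble deaf heaven with my bootless cries, "
         b"And look upon myself and curse my fate.")

ratio_ones = len(zlib.compress(ones, 9)) / len(ones)
ratio_prose = len(zlib.compress(prose, 9)) / len(prose)
print(ratio_ones, ratio_prose)
```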

But I suspect you are not actually speaking about the information contained in
a literary work. You are probably thinking: how much expansive commentary and
explanation could a literary work spawn? That is another question. The answer
is always: unbounded.

~~~
killjoywashere
> The answer is always: unbounded.

But after initial perturbations die out, the growth rate will likely look
something like O(log n) or less.

------
zwischenzug
Reminds me of my first ever blog post...

[https://zwischenzugs.wordpress.com/2011/03/06/shakespeare_un...](https://zwischenzugs.wordpress.com/2011/03/06/shakespeare_unexceptional_vocabulary/)

------
woliveirajr
I'm interested in this subject, as I've studied it before. It was really nice
to see how many techniques are used, with those function words being just one
(with good results, it must be said). Even reducing words to character
n-grams is interesting, since it captures the stem of each word (leaving
prefixes and suffixes out of the question).
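
The stem-capturing effect is easy to see in a toy sketch (the boundary
padding and n = 4 are arbitrary choices, not taken from any particular
study):

```python
def char_ngrams(word: str, n: int = 4) -> set:
    """All character n-grams of a word, padded to mark boundaries."""
    padded = f"_{word}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

# "computing" and "computation" share the n-grams of their common
# stem, even though a whole-word count treats them as unrelated.
shared = char_ngrams("computing") & char_ngrams("computation")
print(sorted(shared))  # ['_com', 'comp', 'mput', 'ompu']
```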

------
bmc7505
Interesting talk about Authorship Detection from 28c3:
[https://www.youtube.com/watch?v=-b0Ta9h62_E](https://www.youtube.com/watch?v=-b0Ta9h62_E)

------
awinter-py
Interesting that they use Romeo & Juliet as the first example and attribute
it to Marlowe.

Most of Romeo & Juliet is copied scene-for-scene from the English translation
of an Italian play of the same name; most of the lines you remember from the
play were added in Shakespeare's version.

In the example in this article, I'm getting chills from how much better the
Shakespeare line is vs. the Marlowe line he stole.

If you read R&J side by side with its Italian source, start with the 'if I
profane' scene ('let lips do what hands do').

