
Rappers, Sorted by Size of Vocabulary - sinned
http://rappers.mdaniels.com.s3-website-us-east-1.amazonaws.com
======
loso
I enjoyed reading this chart but I hope it doesn't reinforce the bias that
some fans have that word complexity is the only way to tell if a rapper is
good or not. There are several ways to judge the strengths and weaknesses of a
rapper. Complexity is one of them, flow is another. Storytelling ability is
another very strong indicator. The best rappers are able to bring a
mix while some are just so strong in one area that they explode no matter if
they are really weak in other areas.

~~~
harryh
To fully understand rap, we must first be fluent with its meter, rhyme, and
figures of speech. Then ask two questions: One, how artfully has the objective
of the song been rendered, and two, how important is that objective. Question
one rates the song's perfection, question two rates its importance. And once
these questions have been answered, determining a song's greatness becomes a
relatively simple matter.

If the song's score for perfection is plotted along the horizontal of a graph,
and its importance is plotted on the vertical, then calculating the total area
of the song yields the measure of its greatness.

~~~
octo_t
god. fucking. damn.

 _rips page out of book_

~~~
makaveli8
I completely agree.

------
unfunco
This is fascinating. I'm only a recent listener of hip-hop (primarily because
of Earl Sweatshirt and Odd Future) and I'm in awe of the vernacular.

And similarly, as a boredom exercise a few weeks ago I did some lexical
analysis of the song Timber (the monstrosity was being constantly played on
the radio at the time) and here's what I came out with:

"83.1% of the words in the lyrics are five letters or less, 58.9% are four
letters or less. The lexical density (the number of unique words divided by
the total number of words, multiplied by one-hundred) is 29.1%. There is only
one word in the song which has three or more syllables. Eleven people were
involved with the writing of the song, each of them capable of producing just
nine unique words each."
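
For anyone who wants to reproduce figures like these, here is a minimal sketch of the calculation as described above (unique words divided by total words, times one hundred). The tokenizer regex, the function name `lexical_stats` and the sample line are assumptions for illustration; they are not the script used for the Timber numbers.

```python
import re

def lexical_stats(lyrics: str) -> dict:
    """Rough lexical statistics for a block of lyrics."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", lyrics.lower())
    total = len(words)
    unique = len(set(words))
    short = sum(len(w) <= 5 for w in words)  # apostrophes count toward length
    return {
        "total_words": total,
        "unique_words": unique,
        # Lexical density: unique words / total words * 100.
        "lexical_density_pct": 100.0 * unique / total if total else 0.0,
        "pct_five_letters_or_less": 100.0 * short / total if total else 0.0,
    }

# Made-up sample line, not real lyrics.
stats = lexical_stats("yeah yeah we go up and we go down")
print(stats)
```

Different tokenization choices (contractions, ad-libs, hyphenated words) will shift the percentages a little, so treat the output as approximate.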

~~~
orbitur
> Eleven people were involved with the writing of the song, each of them
> capable of producing just nine unique words each.

I'm not sure why this is notable when you consider that lyrics are probably
the least important aspect of a song intended for the top 40.

If you take a moment to listen to the melody and production, you'd probably
see why it's credited to 10+ people. That song is a well-oiled machine.

~~~
unfunco
My last sentence was intended as satire; lyrics are obviously not uniformly
distributed between writers. I completely agree with you though, the song
(despite my disdain for it) is incredibly catchy, and definitely not intended
to be thoughtful or thought provoking in nature.

------
bretthopper
Looked for Canibus near the top and wasn't surprised to find him 4th. If
anyone hasn't heard of him, highly suggest listening to his older stuff such
as his first Can-I-Bus, 2000 BC and Mic Club.

He raps about science and space all the time which is cool.

Here's an example of his ridiculous lyrics: [http://rapgenius.com/Canibus-
poet-laureate-infinity-lyrics](http://rapgenius.com/Canibus-poet-laureate-
infinity-lyrics)

~~~
shawnz
Additionally, many HN users have probably already heard Canibus rapping even
if they don't know it, since he wrote the Office Space theme song. :)

~~~
Oxxide
Always loved that song.

My personal favorite Canibus track is Master Thesis, though.

~~~
StevenNunez
Yey! I'm not the only Canibus fan! Seriously though Mic Club is awesome.

------
seizethecheese
Many here seem to be interpreting vocabulary size as a signal for quality.
When it comes to rap I completely disagree. Firstly, the repetition is rap's
main ingredient. I read an article a while ago where researchers found that
listening to a spoken phrase that is looped activates the same part of the
brain as music, which helps explain this phenomenon.

Personally, if I want food for thought I read. Rap is not an intellectual
pursuit. I've been perusing rappers on this list, and the top artists have not
been good at all to my ears. It seems that the best rappers are in the middle,
and being on either extreme is a negative signal.

~~~
pandler
> Firstly, the repetition is rap's main ingredient.

Is it though? I don't dispute that sheer vocabulary size isn't a sign of
quality, but that seems like a very ignorant generalization of rap.

~~~
TheCoelacanth
Repetition is a key ingredient in all genres of music. You would be hard
pressed to find many significant pieces of music that don't use repetition.

------
Aardwolf
> Shakespeare’s vocabulary: across his entire corpus, he uses 28,829 words,
> suggesting he knew over 100,000 words

Why does that suggest he knew over 100k words? Maybe it means he knew 28,829
and used all of them? Would he really know over 70,000 words he never used in
his works? What would those 70,000 words be? Probably very obscure ones. How
can you know that many obscure ones?

~~~
mbillie1
Vocabulary has for a while been considered in terms of 'receptive' and
'productive' capacity, with the assumption being that one's 'receptive'
vocabulary can be larger, since it is easier to hear/read/understand a word
than it is to use it correctly in speaking/writing (this is not necessarily the
popular opinion anymore [[http://www.readingconnect.net/web/FILES/english-
language-and...](http://www.readingconnect.net/web/FILES/english-language-and-
literature/elal_LSWP_Vol_4_Pignot_Shahov.pdf)] but may provide the context for
the claim about Shakespeare). The notion is that you are able to understand
more words than you commonly use in your speech/writing, which is on some
level intuitive, although of course it is an empirical question.

~~~
maaku
Assumptions tend to break down at the extremes.

E.g. Shakespeare actually _invented_ a lot of the obscure words he used.

------
nmac
It's a nice touch including portmanteaus and 'incorrect' ebonics on the list
(like "ery'day"), since authors like Shakespeare, Joyce and others took the
same liberties with language. Arguably, that's how language develops and makes
it interesting to study and think about. The OP could have easily stuck to
words in the OED, kudos.

~~~
danielsf
op here: thanks yo!

------
krick
Really interesting, but not as representative as it should be. It's not clear
why some have larger vocabulary than others. It could be using words like
"zeitgeist" (in case of Aesop Rock) or some clever wordplay (I don't know much
about hip-hop, so I can't find an example for some artist from the list right off
the bat, but I remember Marilyn Manson using the word "gloominati" for instance)
or pretty meaningless made up words like "schizzle" (in case of Snoop Dogg) or
usual derivatives like "fuckedy fuck". Moreover, in many transcripts for hip-
hop people write down words as they are pronounced, which can be pretty much
distorted for some artists (which of course ideally shouldn't count as a "new
word", but that's complicated, yeah).

While Aesop Rock and DMX are clearly extreme and not surprising at all, it's
not that clear for some guys in the middle.

So, first off, for every data project sources should be provided, or at least
a more specific description of how the text was processed, tokenized and
analyzed. Second, several more "data slices" should be provided, for instance
_the 100 most used words which are unique to that artist compared to the other
artists in the list_.
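
The "unique to that artist" slice could be sketched roughly as below. This assumes each artist's lyrics are already tokenized into a list of lowercase words; the `corpora` dictionary, the artist names and the function name are hypothetical placeholders, not the chart's data.

```python
from collections import Counter

def distinctive_words(corpora, artist, top_n=100):
    """Most frequent words used by `artist` and by nobody else in `corpora`."""
    # Collect every word used by any other artist.
    others = set()
    for name, tokens in corpora.items():
        if name != artist:
            others.update(tokens)
    # Count only the words that no other artist used.
    counts = Counter(t for t in corpora[artist] if t not in others)
    return counts.most_common(top_n)

# Hypothetical toy data.
corpora = {
    "artist_a": ["zeitgeist", "zeitgeist", "the", "city", "lemmings"],
    "artist_b": ["the", "city", "dog", "dog"],
}
top = distinctive_words(corpora, "artist_a", top_n=2)
print(top)
```

Spelling variants in transcripts would still show up as "unique" words here, so some normalization step would be needed before reading much into the output.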

~~~
duney
The example you used for clever wordplay, "gloominati," is actually considered
a portmanteau word. It's the result of combining multiple words to create a
new word. (I say this not to be a pedant, but because I learned the term
recently and was amused that we actually have a word for it.)

~~~
krick
Yes, portmanteaus are exactly what I was meaning by "clever wordplay". English
isn't my primary language so it's hard to remember the right word sometimes,
sorry for that. :)

------
coherentpony
Maybe this is just me, but it's a little unfair to compare to literary
_texts_.

Humour me for a moment.

When an artist writes a song, he (or she) has constraints. Most rappers would
like to rhyme the ends of their sentences. I know sometimes they don't (like
poetry), but it's certainly pleasing to the ear to have that constraint.
Artists endeavour to make their songs catchy, that's highly correlated with
the gross sales of the product.

When an artist writes a novel, this constraint is not weighted quite as
highly. I know Shakespeare wrote poetry, too, and to call me out on this
comparison is entirely fair. That said, there's also an argument to be made
for eye rhymes. Shakespeare used these a lot. Eye rhymes are words that don't
rhyme aurally, but _do_ rhyme visually. It's the story that pleases the
reader, not necessarily its aural 'catchiness'. I probably made that word up.
But Shakespeare made words up too. The point is, you knew what I meant.

At the end of the day these comparisons, while certainly _interesting_,
should be taken with a pinch of salt. While I'm at it, this advice can easily
be extrapolated to any dataset. Always understand there may be unknown
correlations.

~~~
danielsf
OP here: the shakespeare thing is really just a hook, food for thought rather
than an academic/cultural judgement.

I also had several suggestions to use shakespeare's sonnets rather than plays,
which I should have done.

and yes, this is all just pinch of salt barbershop discussion :)

------
thinkpad20
Is Del tha Funkee Homosapien on this list? I'd be curious, since he has pretty
non-standard lyrics.

~~~
Xcelerate
I'd never heard of Aesop Rock before and decided to check out some of his
music. He sounds a lot like Del.

~~~
thrownaway2424
They are also on the same label. Definitive Jux has some quality product.

If you're looking for expansive vocabularies you should consider exploring
other dorky rappers like Scroobius Pip.

------
habosa
Not surprised to see Wu Tang at the top and Drake at the bottom. Started from
the bottom ... still there.

~~~
pandler
Haha I was thinking that as you move left on the scale the more likely you are
to see rappers that people tend to mock.

------
orblivion
This looks at the first so many lyrics in each rapper's career. Aesop Rock
came out with some weird stuff right off the bat. I wonder if some of these
other rappers became more sophisticated over time. Maybe an average per song,
or average uniques per word, would be better.

~~~
sfrank2147
The problem with average per song is that you "use up" words in every new
song, so all things being equal each marginal song has progressively fewer new
words.

~~~
plorg
I bet you could get something insightful from plotting "unique words" versus
"total words" - that might give a good idea of the amount of repetition over
time, the length or quantity of output, and the total vocabulary.
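
A rough sketch of the data behind such a plot, assuming the lyrics are already tokenized and in chronological order (the function name and sample tokens are made up for illustration):

```python
def vocabulary_growth(tokens):
    """Return (total_words, unique_words) after each token, in order."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((i, len(seen)))
    return curve

# Made-up tokens; repeated words flatten the curve.
curve = vocabulary_growth("we go up we go down we go".split())
print(curve)
```

The curve flattens as words get "used up", which is the diminishing-returns effect described in the parent comment.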

~~~
danielsf
here's what this looks like. ugly as sin and useless for comparing rappers.

[http://www.mdaniels.com/vocab/scatter.png](http://www.mdaniels.com/vocab/scatter.png)

love your other ideas – hopefully can do them later.

------
randomdrake
For those who aren't familiar with Aesop Rock, I'd invite you to give him a
listen sometime. His earlier albums, in particular, have been very influential
to me in many ways. Both in my artistic and professional careers.

From comments on the conditions of the working man and the condition of
feeling trapped in a "j-o-b"[1]:

    
    
       "Now we the American working population
       Hate the fact that eight hours a day
       Is wasted on chasing the dream of someone that isn't us
       And we may not hate our jobs
       But we hate jobs in general
       That don't have to do with fighting our own causes
       We the American working population
       Hate the nine-to-five day-in day-out
       When we'd rather be supporting ourselves
       By being paid to perfect the pastimes
       That we have harbored based solely on the fact
       That it makes us smile if it sounds dope"
    

To storytelling masterpieces regarding living and dreaming[2]:

    
    
       "Look, I've never had a dream in my life
       Because a dream is what you wanna do, but still haven't pursued
       I knew what I wanted and did it till it was done
       So I've been the dream that I wanted to be since day one!"
    

Aesop Rock takes language and linguistics to entirely different levels than
one might expect from the single genre that is hip-hop. He even challenges
himself and the listeners, playing fantastic word games, for instance re-using
the letters L, S, and D in odd and rhythmical ways after a mention[3]:

    
    
       "Lazy summer days
       Like some decrepit landshark dumb luck squad dog lurks sicker deluded
       Last sturdy domino lean's secluded
       Don't let stupid delusions lesson super-duty labor students
       Dragnet lifer solutions
       Daddy loved sloppy dimensions like son-daughter links
       Such determinated lepers, successfully disheveled
       Little soliders developed like serpents despite life sentence ducking
       Lemmings
       Some don't like sobriety's dirty lenses
       Some do"
    

And then there are just incredible gems that stick with you like[4]:

    
    
       "I don't flick neeedles like my sick friend
       I don't march like Beetle Bailey through a quick trend
       I don't frequent church's steeples on my weekend
       And I don't comment if you formulate a weak Zen"
    

There's a lot to explore from Aesop Rock. Should you find this type of hip-hop
interesting, a decent place to start is with the label you can find these
songs on, Definitive Jux[5]. Incredible talent has been on and off that label
over the years. So much good stuff.

[1] - "9-5ers Anthem" -
[http://rapgenius.com/Aesop-rock-9-5ers-anthem-lyrics](http://rapgenius.com/Aesop-rock-9-5ers-anthem-lyrics)

[2] - "No Regrets" -
[http://rapgenius.com/Aesop-rock-no-regrets-lyrics](http://rapgenius.com/Aesop-rock-no-regrets-lyrics)

[3] - "The Greatest Pac-Man Victory in History" -
[http://rapgenius.com/Aesop-rock-the-greatest-pac-man-victory...](http://rapgenius.com/Aesop-rock-the-greatest-pac-man-victory-in-history-lyrics)

[4] - "Save Yourself" -
[http://rapgenius.com/Aesop-rock-save-yourself-lyrics](http://rapgenius.com/Aesop-rock-save-yourself-lyrics)

[5] -
[http://en.wikipedia.org/wiki/Definitive_Jux](http://en.wikipedia.org/wiki/Definitive_Jux)

~~~
leorocky
I don't know man, I listened to a couple of the tracks and he definitely has
lyrical skills, and I like some of the tracks, but the quotes you selected
aren't very good at all, at best obvious topics with all the insight of a
million college freshmen. Having said that I like "None Shall Pass" that has a
really great sound.

To be entirely honest, I love rap, but not for any insight rappers have in
world affairs, but for their lyrical ability. Some are very good at providing
unique ways to describe their own insights about their lives but when someone
starts rapping about world problems I just want to shut my brain off because
it's usually pretty banal. Then with my brain off I can still at least enjoy
the way the rap sounds.

~~~
randomdrake
> at best obvious topics with all the insight of a million college freshman

Art is weird like that.

We have to remember that it isn't all about needing to learn something new
from the experience. Sometimes it's just about getting something out of it.

Looking over the lists of the best songs of all time[1], we can see that there
aren't a lot of incredibly _insightful_ songs. Quite frankly, most speak of
your "obvious topics" and probably don't talk about them with any sort of
magnificent linguistic grandeur.

But that doesn't mean they aren't great songs and don't offer their listener
an experience worth sharing and repeating for generations.

[1] -
[http://en.wikipedia.org/wiki/List_of_songs_considered_the_be...](http://en.wikipedia.org/wiki/List_of_songs_considered_the_best)

~~~
qwerty_asdf
I think the term you're grasping for is "Lowest Common Denominator."

~~~
LukeShu
I don't disagree that there is a lot of catering to the lowest common
denominator in the music industry, both in pop music and the "best songs of
all time" lists.

However, I think the parent post has a point. (Now, I'm having a hard time
figuring out how to effectively articulate it) The point of music isn't
necessarily an _insight_. Listening to music is an _experience_, which is
about how it makes you feel. Sometimes part of that is giving an insight,
sometimes it isn't. Often, it is about combining an idea or concept with a
performance or presentation; the idea/concept doesn't need to be insightful to
be effective.

~~~
catshirt
exactly. if you connect with art based on how "obvious" you find it, you are
going to have a very shallow and boring art career.

sometimes being obvious is what makes it art in the first place. hell, some
art _needs_ to be obvious.

~~~
qwerty_asdf
The inverse of broad appeal is the concept of The Long Tail, where there's a
vast array of niche artists, that appeal to a small number of people.

[https://en.wikipedia.org/wiki/Long_Tail](https://en.wikipedia.org/wiki/Long_Tail)

There's a book by the same title, written by Chris Anderson:

[http://www.thelongtail.com](http://www.thelongtail.com)

Prior to the internet, when advertising and media outlets were centralized,
and retail businesses were distributed geographically, it was very difficult
to gain a large following with niche appeal. But now that the internet has
inverted the scenario, with decentralized, global exposure, and centralized
market places like ebay and amazon, niche artists have a fighting chance at
becoming famous within their genre.

In other words, it used to be that the only way to catch some exposure was to
appeal to centralized broadcasting networks, and they only took chances on
performers who were low risk. Now, with the internet, risk doesn't really
matter, and mass appeal is literally measured by the size of your following.
The larger your following is, by default, the more compromises you'll have
made to appeal to everyone following you.

If you capture 1/2 the world as your audience, then you appeal to a broader,
and more diverse audience, which has less in common with each other member of
your audience, than if you managed to capture 1/4 the world. Getting half the
world to agree on something, as opposed to creating something that three
quarters of the world cannot relate to.

So, Aesop Rock raps about hating your boss, and many people say: "Gee, yeah, I
hate my boss too! This guy's awesome!", but Kool Keith raps about Kenworths
with wings, and lots of people are like: "Is he weird?" because lowest common
denominator.

------
Ryanmf
OP: Did your analysis of MF DOOM include his work alongside Madlib as
Madvillain or his various other pseudonyms (King Geedorah, Viktor Vaughn,
etc.)?

I find it a little hard to believe he's not at least in the Wu Tang/Canibus/KK
cluster, if not #1 overall.

~~~
Tycho
Yeah I would have thought Doom would be very high. But the density of his
lyrics perhaps stems more from allusions/references and humour than from the
words themselves.

------
quux
I wonder where Weird Al Yankovic would come in on this ranking.

~~~
coherentpony
Weird Al's songs are not articulate masterpieces, but cheap parodies of other
rap songs and rappers. He's probably somewhere around the 5,000 mark with the
other artists.

A cursory google on the size of the average vocabulary [1] yields an
interesting fact. I'm not sure how watertight it is. I realise it's probably
unfair to compare the size of the average vocabulary to that of a series of
songs. Songs being shorter for one. Still, it's interesting.

~~~
Moto7451
Not sure if that's fair to Weird Al. The people he parodies wouldn't really
agree either [1]. It's not like he's doing the cheap morning show tactic of
swapping clean words for dirty words or bad puns but leaving the rest of the
song intact. They maintain a consistent theme which is really tough.

[1]
[http://www.weirdalforum.com/viewtopic.php?t=5673](http://www.weirdalforum.com/viewtopic.php?t=5673)

~~~
coherentpony
Yeah, perhaps I was a little harsh. Don't get me wrong, I like Weird Al. I
probably could have phrased my comment a little better. I should have said
something along the lines of, "In my experience listening to Weird Al, it
doesn't feel like he explores a lot of the English language."

Your link is cool, thanks for sharing that.

------
DigitalSea
Makes me very happy to see Aesop Rock in the #1 spot. He isn't as
underground as many people assume, still relatively unknown in the mainstream,
but well known enough to sell records and sell-out shows. I wasn't a big fan
of his 2012 release Skelethon, but the way he structures his lyrics and the
meaning behind them means he never writes a bad lyric.

Interestingly Eminem whom I would have thought would rank pretty highly for
his clever method of word bending and enunciation is only in the middle of the
scale. Still a whole lot better than some of his counterparts, but still
surprising. Another interesting thing to note is Eminem being grouped in the
same league as the likes of Jay-Z, Rakim and Lupe Fiasco. With only a couple
of hundred unique words separating them from one another.

~~~
xentronium
I always thought eminem was famous for his clever wordplay, not his vocabulary
diversity. FWIW, as a non-native speaker I can gather most of his verses.
Aesop Rock, on the other hand, is totally indecipherable for me without
printed lyrics.

------
riggins
I find it hilarious that DMX is dead last.

I've now got empirical evidence of what I always thought.

I think DMX rhymes words with themselves more than any rapper I've ever heard.

~~~
poink
I'm pretty sure this fails to take into account DMX's rich canine vocabulary.

------
ballstothewalls
This is a great graph, but I think it would be neat if a y-axis was thrown in.
My first thought was album sales or some other metric of popularity that helps
you find specific rappers quickly instead of going through the huge bunch of
little pics.

------
sareon
This reminds me of a PyCon talk from this year on analyzing rap lyrics with
some basic NLP techniques

[http://pyvideo.org/video/2658/analyzing-rap-lyrics-with-
pyth...](http://pyvideo.org/video/2658/analyzing-rap-lyrics-with-python)

The author was trying to see if rappers are considered more hateful towards
women by their usage of "bitch per song". The results are quite interesting.

------
zopticity
Lil Jon should be at the bottom with 7 words: "Yeah!", "Okay!", "Shots!" and
"Turn down for what?"

~~~
axx
He's not, because you forgot: "WHAAAAAT?", "SKIT", "SKIT" and "SKIT!".

------
rthomas6
This infographic doesn't take into account other rappers possibly copying
earlier really influential artists, making the earlier influential artists
rank lower. More generally, it would be cool to see this chart ranked by the
amount of original words present in the first 35,000 lyrics _that were not
present yet at the albums' time of publication_.
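
One way this could be sketched, assuming a list of `(year, artist, tokens)` releases is available (the structure, names and sample data are hypothetical; the chart's dataset is not shown in this thread):

```python
def novel_word_counts(releases):
    """Count, per artist, the words no earlier-dated release had used yet.

    `releases` is a list of (year, artist, tokens) tuples. Releases from the
    same year are processed in input order, which is an arbitrary tie-break.
    """
    seen = set()   # every word used by any release processed so far
    novel = {}     # artist -> number of first-time words
    for year, artist, tokens in sorted(releases, key=lambda r: r[0]):
        new_words = set(tokens) - seen
        novel[artist] = novel.get(artist, 0) + len(new_words)
        seen |= new_words
    return novel

# Hypothetical toy data: the later artist reuses two earlier words.
releases = [
    (1994, "later_artist", ["mic", "check", "alpha"]),
    (1988, "early_artist", ["mic", "check"]),
]
novel = novel_word_counts(releases)
print(novel)
```

This only knows about words in the supplied releases, so a word counts as "original" if nobody in the dataset used it earlier, not if nobody ever did.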

------
ryan1234567890
To put some perspective on this:

    ryan@3G08:~/Desktop/bleh$ pdftotext David-Foster-Wallace-Infinite-Jest-v2.0.pdf
    ryan@3G08:~/Desktop/bleh$ python dfw.py
    size of vocabulary: 30725

The man passed Shakespeare by 1,896 words with that book.

code:

    
    
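      import string

      import nltk
      from nltk.stem import PorterStemmer

      # Text file produced by the pdftotext step (UTF-8 is assumed here).
      path = "/home/ryan/Desktop/bleh/David-Foster-Wallace-Infinite-Jest-v2.0.txt"
      with open(path, encoding="utf-8", errors="replace") as f:
          raw = f.read()

      # Drop ASCII punctuation and lowercase so "Jest," and "jest" match.
      exclude = set(string.punctuation)
      raw = "".join(ch for ch in raw if ch not in exclude).lower()

      # word_tokenize needs NLTK's tokenizer data, fetched once via nltk.download().
      tokens = nltk.word_tokenize(raw)

      # Stem each token so inflected forms collapse into one vocabulary entry.
      stemmer = PorterStemmer()
      stemmed_tokens = {stemmer.stem(token) for token in tokens}

      print("size of vocabulary:", len(stemmed_tokens))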

------
Tycho
I've been wanting to do some NLP on rap genius's corpus for ages. This is a
great analysis. What I had thought of is write a program to detect
ghostwriting. Rappers probably have some sort of lyrical 'DNA' in the
construction of their verses. How often they use certain words, number of
words per line, number of unique words per song, ratio of adjectives to nouns,
that kind of thing. You could probably unmask some ghost-writing secrets.
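
A very rough sketch of what such a lyrical 'DNA' vector might look like, using only features that need no external tools. The adjective-to-noun ratio is left out because it would need a part-of-speech tagger; the feature set, function names and sample verse are all assumptions, and a handful of numbers like this is nowhere near enough to attribute authorship.

```python
def verse_features(verse: str) -> dict:
    """Simple stylistic features of a verse (one lyric line per text line)."""
    lines = [ln.split() for ln in verse.lower().splitlines() if ln.strip()]
    words = [w for ln in lines for w in ln]
    return {
        "words_per_line": len(words) / len(lines) if lines else 0.0,
        "unique_ratio": len(set(words)) / len(words) if words else 0.0,
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
    }

def feature_distance(a: dict, b: dict) -> float:
    """Euclidean distance between two feature dicts with the same keys."""
    return sum((a[k] - b[k]) ** 2 for k in a) ** 0.5

# Made-up two-line verse.
features = verse_features("one two three\none two")
print(features)
```

Comparing a disputed verse against an artist's known verses with `feature_distance` would be the naive next step; the features would need scaling first, since they are on different ranges.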

Looking at the analysis here, it's interesting to see some clustering in the
results. IMO the second cluster is the sweet spot: Wu Tang's excessive
invention of vocabulary is cool but probably detracts from the poetic effect.
Meanwhile rappers like 2Pac are just kind of boring IMO, at least going by
their lyrics alone.

------
dmourati
I'm a big fan of the project and the way it is presented. Not sure why Wu-Tang
features so prominently but I guess I'm okay with that. Kool Keith should be
broken down further into his constituent parts. I also would have thought the
Beastie Boys would have run higher.

~~~
twic
I'm pretty sure Kool Keith would not be okay with that.

~~~
dmourati
He'd be Kool with it.

[http://www.koolkeith.co.uk/personas.html](http://www.koolkeith.co.uk/personas.html)

------
andybak
I would have been rather surprised not to see Aesop Rock fairly high up the
list. I was reading the Rap Genius pages for a few of his tracks the other
week and the sheer density of wordplay was fairly overwhelming.

It is rap for geeks though ;)

------
danielsf
author here: hit me up with questions you've got.

~~~
derwiki
What are Jay Z's stats?

[Edit] also Notorious B.I.G. :)

~~~
eieio
Jay Z's stats are there: he's at 4,506. Pretty middle of the pack.

About Biggie:"35,000 words covers 3-5 studio albums and EPs. I included
mixtapes if the artist was just short of the 35,000 words. Quite a few rappers
don’t have enough official material to be included (e.g., Biggie, Kendrick
Lamar)"
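
For reference, the counting rule as quoted in this thread (unique word forms within an artist's first 35,000 words, with no stemming) can be sketched like this. The tokenizer regex and function name are assumptions; the article's actual processing is not published here.

```python
import re

def unique_in_first_n(lyrics: str, n: int = 35_000) -> int:
    """Number of distinct word forms among the first `n` tokens."""
    # Assumed tokenization: lowercase runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", lyrics.lower())
    return len(set(tokens[:n]))

# Each surface form counts separately, since nothing is stemmed.
sample = "pimps pimp pimping pimpin pimp"
print(unique_in_first_n(sample))        # four distinct forms
print(unique_in_first_n(sample, n=2))   # only the first two tokens are considered
```

An artist with fewer than `n` tokens would simply get a count over everything available, which is why the article excludes artists without enough material.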

And to respond to your child comment, I'm sure the same problem (not enough
material) applied to Big L.

------
Aqueous
Greatly enjoyed the analysis but while I was reading it I felt a lot like this
guy:

[https://www.youtube.com/watch?v=GKlDBi0cyIA](https://www.youtube.com/watch?v=GKlDBi0cyIA)

------
NAFV_P
All the rappers listed seem to be American.

Whack this through your Bowers and Wilkins:

[https://www.youtube.com/watch?v=p_SQEUZomug](https://www.youtube.com/watch?v=p_SQEUZomug)

------
S_A_P
I think the only problem I see is that some rap groups are listed as rappers.
For instance beastie boys, de la soul and wu tang are listed. So there is some
collective vocabulary being compared to single rappers. That said this is cool
and pretty telling. From what I could see it is probably loosely coupled to the
intelligence of the rappers listed. I will echo the sentiments about DMX here.
Looks like some shock jock rappers definitely are low on the list (too short).

~~~
rmk2
> "Wu-Tang Clan at #6 is fucking impressive given that 10 members, with vastly
> different styles, are equally contributing lyrics. Add the fact that GZA,
> Ghostface, Raekwon, and Method Man's solo works are also in the top 20 –
> notably, GZA at #2. Perhaps their countless hours of studio time together
> (and RZA’s mentorship) exposed each rapper’s vocabulary to one another."

At least in the case of the Wu-Tang Clan, this seems to be done on purpose and
suggests that there is a strong correspondence between the individual members'
repertoire on one hand and the group's vocabulary as a whole, with a presumed
exchange in both directions (i.e. both as a deductive and an inductive
process).

------
kenjackson
This is an interesting analysis.

I love the fact that E-40 is about on par with Shakespeare. I'm sure he would
take it as a compliment to be called the modern day Shakespeare.

~~~
danielsf
author here: thanks!

------
htk
"Each word is counted once, so pimps, pimp, pimping, and pimpin are four
unique words"

So much for the modern Shakespeares on the list.

~~~
octo_t
unlike shakespeare who was _so_ high class and never made 'your mom' jokes or
used any toilet humour or anything like that.

~~~
redacted
And literally made pimp jokes:

[http://poetry.rapgenius.com/William-shakespeare-hamlet-
act-2...](http://poetry.rapgenius.com/William-shakespeare-hamlet-
act-2-scene-2-annotated#note-1567189)

------
koala_advert
I keep getting this error, in Firefox and Chrome:

<Error> <Code>AccessDenied</Code> <Message>Access Denied</Message>
<RequestId>3CB1F41D7DFDC794</RequestId> <HostId>
wHCPzEYPDsmkMJX+YIgjU40YPrGYytHrk5B44dApi7663NkQQI0RKx9A/6EX7Iph </HostId>
</Error>

~~~
dfc
Do you have https everywhere? I kept getting this error with HTTPS Everywhere.
You need to turn off the rule for AWS.

------
dnautics
How about a 2d visualization with a sliding 10000 word window, with the y axis
as unique words out of 10k and the x axis as time? Are there cultural trends
that are time dependent? Did young mc and Del use more words than contemporary
artists? Did their trends as artists follow the global trend over time?
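
A minimal sketch of the windowed count, assuming the tokens are in chronological order so that the window's start position can stand in for time (the function name, the `step` parameter and the toy data are assumptions):

```python
def windowed_unique_counts(tokens, window=10_000, step=1_000):
    """Unique-word count for each window of `window` tokens, moved by `step`.

    Returns (start_index, unique_count) pairs; the result is empty when there
    are fewer than `window` tokens.
    """
    counts = []
    for start in range(0, len(tokens) - window + 1, step):
        counts.append((start, len(set(tokens[start:start + window]))))
    return counts

# Toy example with a tiny window so the result is easy to check by hand.
points = windowed_unique_counts("a b a b c d e e".split(), window=4, step=2)
print(points)
```

Plotting the second value against the first (or against release dates, if known) would give the trend line per artist.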

------
selimthegrim
Maybe this will help me answer that nagging question at the back of my brain:
What does DJ Khaled actually _do_?

------
Grue3
Would be interesting to see how they compare to rock bands like Titus
Andronicus, Fucked Up or Bad Religion.

------
Totient
I wonder where things like classic rock / broadway musicals / opera / etc.
fits on this spectrum.

I really appreciate including Shakespeare and Moby Dick on the spectrum, but
I'd still like some more perspective. For that matter, I wonder how many
unique words _I_ use every day.

------
tokipin
Just a note, those artists don't necessarily use all their vocabulary. Eminem
for example clearly holds back on his vocabulary. Rap is as much an art as
anything can be so there are all sorts of factors. Be careful what you might
want to draw here other than curiosity.

~~~
x3ro
> Eminem for example clearly holds back on his vocabulary.

What makes you say this?

~~~
likeclockwork
I'm sure the answer to this question is "He looks more articulate than the
average rapper."

~~~
omegaham
Gibes regarding racism aside, some people seem more articulate due to the fact
that they carry themselves differently during interviews.

Of course, many musicians will keep a persona going during interviews as well,
so it's still not a very reliable metric.

The most extreme example I've seen was Marilyn Manson, but there are plenty of
musicians who rap / sing about really inane stuff and then show that they're
way smarter than the way they present themselves with their music.

~~~
matwood
_there are plenty of musicians who rap / sing about really inane stuff and
then show that they're way smarter than the way they present themselves with
their music._

As mentioned in the article, Jay-Z even raps about doing just that.

------
sbierwagen
Cool to see Canibus so high in the rankings.

It'd also be cool to add the members of AOTP to the analysis.

------
msutherl
I would love to see this analysis without filters. Who is _the_ rapper with
the largest vocabulary? What does the distribution look like at the top?
Surely Antipop Consortium or MF DOOM have larger vocabularies than Aesop for
instance.

~~~
Fishkins
MF DOOM is on the list. He's above average, but well below Aes or most of Wu-
Tang. I've listened to a fair number of rappers, and I was pretty confident
Aes would be at the top of this list.

I agree it would be cool to see a list of all rappers, though. I was surprised
not to see Del, and maybe there is some more obscure rapper I'm not thinking
of with a broader vocabulary than Aes.

------
gfody
I'm pretty sure E-40 scored so high because of all the made up words. He's
highly regarded for being innovative and influential but you know for every
piece of slang that stuck there's like ten that didn't.

------
ff10
Really surprised MF Doom is not ranked higher – are his side projects
included?

------
oakaz
Why is Jedi Mind Tricks not counted? He'd be the first in this list;
[https://www.youtube.com/watch?v=TlZgiK6FiO0](https://www.youtube.com/watch?v=TlZgiK6FiO0)

~~~
VeejayRampay
Jedi Mind Tricks is not one person. It's two rappers (Vinnie Paz and Jus
Allah).

~~~
oakaz
It doesn't matter, they are still rappers and I bet they'd be one of the top

~~~
VeejayRampay
You're right. And I agree, they most likely would.

Army of the Pharaohs would be up there as well.

------
Mikeb85
Not particularly surprised at the list. Aesop Rock, the whole Wu-tang Clan,
and guys like Nas, Wale, all near the top. DMX and Too Short at the bottom...

Definitely comes out in their music...

~~~
oinksoft

> Definitely comes out in their music...

That's right, Too $hort's music is laser-focused!

------
jarnix
How many words in "fo shizzle ma nizzle" ? 4 or 0 ?

~~~
danielsf
4!

------
camus2
I would love the same chart but sorted by vulgarity.

------
ignacioelola
I would love to see the same analysis across different music styles. How do
the vocabulary sizes of Madonna, Bob Dylan and Justin Bieber compare?

~~~
danielsf
I only have rap data, sadly :(

~~~
ignacioelola
Matt, I can very easily give you a corpus for other artists. I'll send you an
email.

------
bladecatcher
I would like to see Dälek included in the study. I'd be surprised if they
didn't show up on the far right on the scale.

------
konceptz
What I would like to see, is this same comparison done against album sales
with the implication of mainstream vs. underground.

~~~
danielsf
I kept this focused on vocab so that the data viz was very straightforward and
easy to digest/draw insights from. I've had many requests for album sales to
be added, and I plan to as soon as possible :)

------
b3b0p
Was it mentioned where the data was sourced from? I'm not seeing anything and
I went back and checked. Did I miss it?

~~~
mryan
> All lyrics are provided by Rap Genius, but are only current to 2012. My lack
> of recent data prevented me from using quite a few current artists.

------
jomtung
Killah Priest should be grouped with Wu-Tang.

~~~
danielsf
OP: he's an associate, not a member

------
zeppelinnn
This is awesome. Reminds me of all the data viz they are doing on rapgenius.
You forgot Atmosphere though (Slug)

------
m_mueller
I'd be interested in how Nerdcore rappers compare to this, such as MC
Frontalot or Professor Elemental.

~~~
xkarga00
Have you checked out MC Paul Barman?

~~~
m_mueller
Not yet, thanks for the hint.

------
devindotcom
Couldn't find Aceyalone - I thought he'd be in the top 10, I guess he wasn't
included.

~~~
crusty
Top 10? I would put money on him being at least #2 and giving Aesop Rock a run
for his money for #1 (depending on which albums his 35,000 words fall on -
he's got a solid discography). I just put on A Book of Human Language
([https://www.youtube.com/watch?v=lwVNp42l3Xo](https://www.youtube.com/watch?v=lwVNp42l3Xo) - full album, or
[https://www.youtube.com/watch?v=GnsCO0Fxw3A](https://www.youtube.com/watch?v=GnsCO0Fxw3A) - solid song selection).
Along with that, Wu-Tang as the only crew analyzed? How about Freestyle
Fellowship (Aceyalone's crew), Quannum Collective, or Jurassic 5?

I'd also like to see Mos Def on the list, along with everyone from Quannum,
and the Soulsides and SoundBombing volumes.

Also, (unique words : total words) might be an interesting scoring method, and
would allow comparison over entire works regardless of their individual
volumes of output. Or choosing a random sample of # of words as opposed to
first # of words, as someone who started publishing as a young buck may take a
hit for early immaturity.
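
A rough sketch of the two ideas above, in Python. The function names and the toy lyrics are made up for illustration and nothing here comes from the chart's actual pipeline. One caveat worth keeping in mind: the unique-to-total ratio tends to fall as a text gets longer, so comparing raw ratios across very different output volumes is probably less fair than it first looks.

```python
import random

def type_token_ratio(tokens):
    """Unique words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def unique_in_first(tokens, n):
    """Unique words among the first n tokens (the approach the chart describes)."""
    return len(set(tokens[:n]))

def unique_in_random_sample(tokens, n, seed=0):
    """Unique words among n tokens drawn at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    return len(set(rng.sample(tokens, min(n, len(tokens)))))

# Toy corpus (invented): a repetitive early period, then more varied later work.
lyrics = ("yo yo yo check it check it out " * 3
          + "arcane lexicon sprawls beyond orthodox cadence").split()

print(type_token_ratio(lyrics))
print(unique_in_first(lyrics, 10), unique_in_random_sample(lyrics, 10))
```

On this toy input the first-N count only sees the repetitive opening, while a random sample can also pick up words from the later material.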

~~~
officialjunk
Mos def is in there

~~~
crusty
thanks, must have missed him

------
thegasman
No mention of MF Doom? Metalface? Doom? Viktor Vaughn? (All the same
gentleman from LA)

~~~
joshschreuder
MF DOOM is on there pretty much right on the Shakespeare line. It's not clear
if it includes his work as Danger / Villain / etc.

------
tps12
So funny comparing this to the same graph they did for pop lyricists.

------
snarfy
I wasn't surprised to see Canibus and Outkast up there.

------
dnlserrano
Awesome. This guy should definitely work for RapGenius.com.

~~~
danielsf
author here: the data is straight from Rap Genius, so I have been working with them :)

------
shaggyfrog
Incredibly, a list about rapper vocabulary is missing anyone associated with
nerdcore.

I'm interested to see where the likes of MC Frontalot, Wordburglar, YTCracker,
etc. rank on that scale...

~~~
GFK_of_xmaspast
I don't see why novelty acts would be relevant.

~~~
shaggyfrog
Why do you dismiss nerdcore as "novelty"?

~~~
Oxxide
because most of the artists are gimmick acts.

------
jmt7les
I'd love to see Immortal Technique also.

------
prg318
Thank you based god!!

------
1ris
Shouldn't that be adjusted to the size of the text corpus?

~~~
jemfinch
The text corpus was normalized to the first 35k words for each rapper.

------
moron4hire
This might be the best-made infographic I've ever seen.

~~~
danielsf
author here: oh man. flattered.

------
pinkskip
Woah so awesome!

------
allan_
where is KRS-ONE?

~~~
twic
4585 unique words.

------
benihana
I'd really like to see this broken down by established vocabulary and made up
vocabulary. I think that would really start to show who were the best
lyricists on both ends. Rappers with a lot of made up words might be on the
far left, and rappers with a lot of unique words that aren't made up would be
on the far right. Both sides of the scale would show rapping talent on
different dimensions. Influential rappers like E-40 who add new words to the
vocabulary, and wordy rappers like Aes on the right who use a really dense and
descriptive vocabulary.

~~~
coherentpony
>Rappers with a lot of made up words might be on the far left

It's interesting that you think that. I'd recommend you look at this:
[http://www.shakespeare-
online.com/biography/wordsinvented.ht...](http://www.shakespeare-
online.com/biography/wordsinvented.html)

At what point do 'made up' words become 'established'. After they've been
published? If so, every made up word in a song should be considered
'established'.

~~~
dredmorbius
Somebody was listening to "To the Best of Our Knowledge" this weekend,
methinks.

~~~
coherentpony
I don't know what that is, but I'll check it out.

~~~
dredmorbius
They happened to have a bit on Shakespeare's word invention:
[http://www.ttbook.org/book/shakespeares-inflence-stephen-
mar...](http://www.ttbook.org/book/shakespeares-inflence-stephen-marche)

The claim is that some 1,700 or so words of his 30,000-word vocabulary were
invented by him, including terms such as "assassination" and "gnarled".

------
skylan_q
Kool Keith should be exempt from this list. He's not from any of the 4 regions
listed, but from Jupiter.

~~~
dylanz
I checked for him immediately once I saw this chart. Between Doc Oc, Dr. Dooom
and Black Elvis, he has some insanely weird lyrics.

    
    
      We stuck together when one of my parakeets died
      You broke down and cried, for the love of animals
      I used to always cut the legs off a roach
      See if he'll stay there on a piece of tissue
      And give him a piece of toast
      That morning, he would wake up and be gone
      What, the insect had a ambulance?

~~~
jboggan
I had never heard this particular Kool Keith rap before but as soon as I read
the lyrics I could hear his very distinctive delivery pattern. I listened to
the track on YouTube and was not surprised it was just as I had imagined. How
distinctive.

------
thrownaway2424
Gotta wonder about the garbage-in factor of Rap Genius. From one randomly
selected Aesop Rock cut:

"Please I want to donate my brain to the monstrous Panasonic profit"

I guess it could be. I always heard it as "monstrous Panasonic prophet." It
would be in keeping with the previous lyric "Television, all hail grand
pixelated god of fantasy."

~~~
thrownaway2424
There's some other howlers in the same song like "canope" instead of "canopy",
"intervenes" instead of "intravenous".

~~~
nightpool
fixed.

This song has no credited transcriber, so it was one of the originals added
to "bootstrap" the site, and hence is of much lower quality than the later
songs added and cleaned up by users.

------
sarreph
We might all be self-confessed _hackers_ , but we'll never explicitly confess
our adoration for the gloriousness of the genre that is _gangster rap_.

~~~
coherentpony
This comment is nothing more than a cheap attempt to inject musical elitism.
Please, if you have nothing critical to say that doesn't also inspire well-
meaning debate, then take your business to Slashdot.

------
simonster
The estimate of vocabulary size here is based on the number of unique words
used. This seems like it is strongly biased: if two artists have the same size
vocabulary, but one has released more albums and thus used more words, that
artist will probably have used more unique words. To underscore this point,
the number of unique words used by Aesop Rock is half of the estimated
vocabulary size of the average college student, although to be fair that
estimate is the number of words that an individual can recognize, not the
number of words they use. (Edit: the bias is somewhat mitigated by the fact
that the same number of words is used to estimate the vocabulary for each
artist, but the bias is not dependent on sample size alone but also upon the
size of the artist's underlying vocabulary; see my comments below.)

The underlying problem is one of estimating the cardinality of a multinomial
distribution given a fixed number of samples. In isolation this problem is
ill-posed, since it is always possible that there is a word in a given
lyricist's vocabulary that he uses with very low frequency and that is
unlikely to appear in any sample, but with appropriate prior information it
may be possible to obtain an accurate estimate.

This is not my field, but a brief Google Scholar search shows that there are
several papers on estimating vocabulary size, or equivalently, estimating the
number of species based on sampling. There is a somewhat dated review
([http://cvcl.mit.edu/SUNSeminar/BungeFitzpatrick_1993.pdf](http://cvcl.mit.edu/SUNSeminar/BungeFitzpatrick_1993.pdf))
that details some methods of estimation (in this case, I believe we are in the
domain of "infinite population, multinomial sample" with unequal class sizes).
The paper notes that there is no unbiased estimator available without
assumptions on the distribution of word use frequencies, but some of the
proposed estimators may be more accurate than the naive estimate used here.
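
For a concrete feel, here is a minimal sketch of one well-known estimator from the species-richness literature, Chao1. It is swapped in purely as an illustration: I am not claiming it is the estimator the review recommends for this setting, and it is generally treated as a lower bound rather than an unbiased estimate. It uses only the number of words seen exactly once (`f1`) and exactly twice (`f2`) to guess how many words were never seen at all.

```python
from collections import Counter

def chao1(tokens):
    """Chao1 lower-bound estimate of total vocabulary from a token sample."""
    counts = Counter(tokens)
    observed = len(counts)               # naive estimate: distinct words seen
    freq_of_freq = Counter(counts.values())
    f1 = freq_of_freq[1]                 # words seen exactly once
    f2 = freq_of_freq[2]                 # words seen exactly twice
    if f2 == 0:
        # Bias-corrected form, used here to avoid dividing by zero.
        return observed + f1 * (f1 - 1) / 2
    return observed + f1 * f1 / (2 * f2)

# Invented toy sample: 6 distinct words, 3 singletons, 2 doubletons.
sample = "a a a b b c c d e f".split()
print(len(set(sample)), chao1(sample))
```

The more singletons a sample has relative to doubletons, the larger the correction, which matches the intuition that many one-off words hint at a long unseen tail.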

~~~
logicallee
Are you looking at a different link than us? The intro reads "I decided to
compare this data point against the most famous artists in hip hop. I used
each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z,
could be compared to newer artists, such as Drake." and the title of the
infographic (if that's all you looked at) reads "# of Unique words used within
artist’s first 35,000 lyrics"

This seems to address your concern completely?

~~~
simonster
Yes, I missed that, even though it is very clearly spelled out (oops!). It
makes the ordinal comparison valid (modulo noise), but it does not completely
address the concern. If you have two artists, and artist 1 uses 5,000 unique
words in 35,000 lyrics while artist 2 uses 10,000 unique words in 35,000
lyrics, artist 2's vocabulary may be substantially more than twice as large as
artist 1's. It is unlikely that a lyricist exhausts their entire vocabulary in
such a small sample, particularly if their vocabulary is large and contains
many words that they use infrequently.
[http://www.jstor.org.libproxy.mit.edu/stable/2284147](http://www.jstor.org.libproxy.mit.edu/stable/2284147)
has a correction that can be applied, although even there the author notes
that, when applied to James Joyce's A Portrait of the Artist as a Young Man,
their technique appears to greatly underestimate Joyce's total vocabulary.

~~~
logicallee
This is a very good point - Aesop Rock, for example, uses one unique word
every 5 words (7k unique in 35k), and if this does not stop, maybe he would
continue and we would find the same average in 70k or 120k words. After all,
you still have to have filler words like "to", "a", "the", "have", etc - he
could be saturating the spots where he can put uniques.

So this could substantially underrepresent vocabularies. There are only so
many unique words you can put in a sentence. As an extreme, if we looked at
the first hundred words of every rapper, we would not find a hundred unique
words in any of them (due to repeats of grammatically common words), even
though, clearly, all rappers have a vocabulary over a hundred.

I wonder if this is a fatal flaw? How can we estimate where the distortion
stops? (For example, if someone uses 1000 words in their first 35,000,
intuitively this seems to imply to me that's most of their stock. But if
someone uses 5,000 in 35,000 - that is not so clear at all.)
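
One way to eyeball where the distortion stops is to plot unique words against words read so far. A small sketch (the function name and sample sentence are invented for illustration): if the curve is still climbing steeply at the 35,000 mark, the sample has probably not come close to exhausting the artist's stock, and if it has flattened out, it probably has.

```python
def vocab_growth(tokens, step):
    """Return (tokens_read, unique_so_far) pairs, one per `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Invented toy text; real use would pass an artist's tokenized lyrics
# and a step of, say, 1000.
text = "the cat and the dog and the bird saw the cat".split()
for read, unique in vocab_growth(text, 4):
    print(read, unique)
```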

~~~
simonster
The paper I linked to in my previous comment uses Zipf's law (briefly, the
frequency of word use is inversely proportional to its rank; more at
[http://en.wikipedia.org/wiki/Zipf%27s_law](http://en.wikipedia.org/wiki/Zipf%27s_law))
to estimate the "distortion." This should produce a better estimate than the
naive method, but there are still problems: the plot on the Wikipedia page
shows that Zipf's law is not a particularly good fit to word frequency for
Wikipedia past the ~10,000th word, and it's not clear that rap music
represents a typical natural language corpus. It is probably still possible to
devise a correction if one knows how word use frequencies are distributed.
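
To make the size of the distortion concrete, here is a back-of-the-envelope sketch. It assumes an idealized Zipf model (word frequency exactly proportional to 1/rank) and independent draws, neither of which is likely to hold for real lyrics, and the vocabulary sizes are invented. Under those assumptions, the expected number of distinct words in a sample of n tokens is the sum over words of 1 - (1 - p)^n.

```python
def expected_unique(vocab_size, sample_size):
    """Expected distinct words in an i.i.d. sample under an assumed Zipf model."""
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    # A word with probability p is missed by every draw with probability (1-p)^n.
    return sum(1.0 - (1.0 - w / total) ** sample_size for w in weights)

# Hypothetical artists: one with a 10,000-word vocabulary, one with 40,000.
small = expected_unique(10_000, 35_000)
large = expected_unique(40_000, 35_000)
print(round(small), round(large), round(large / small, 2))
```

Under this toy model the artist with four times the vocabulary shows up with well under four times the unique words, so the ordering survives but the gap is compressed.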

A second related problem that that paper touches on toward the end is that
sequential words from the same text are not independent samples from an
author's vocabulary. Two artists may have the same vocabulary, but if one
artist uses more non-sequiturs, fewer articles, fewer repeated phrases, or
generally tries to use more unique words within a given song, then that artist
will come out ahead in the measure used here. I'm not sure how much of a
problem this really is for comparing lyrics between artists (depending on what
is of interest, it may actually be desirable), but it may explain the poor
showings for Shakespeare and Melville, since prose is likely to repeat words
more frequently than rap lyrics for reasons unrelated to the authors'
vocabularies. (FWIW, even conservative estimates put Shakespeare's vocabulary
at >15,000 words, which would be hard to measure in a sample of 35,000 words.)

