
Most used words in programming languages - Walkman
https://anvaka.github.io/common-words/#?lang=py
======
Noughmad
In most languages, the top words are self/this and function/def. On the other
hand, in Go, the most common word is "err", and "if err != nil" are three of
the top four words. I really wonder how big a part of the code in Go is just
propagating errors.

~~~
fagnerbrack
"err". Because "error" is too hard to type, right?

~~~
stirner
No, because error variables have extremely limited scope and the length of a
variable name should be proportional to its scope.

~~~
kbart
Heh, I always use longer variable names for broader scope somewhat naturally,
but only _realized_ that after reading your comment.

~~~
adtac

        if theErrorReturnedByThePreviousFunction != nil {
            panic(theErrorReturnedByThePreviousFunction)
        }

~~~
kps
A real brogrammer can write COBOL in any language.

------
nonsince
Lots of uses of "the" in C++ code, massively skewed by giant projects that
have the massive section of boilerplate copyright info in a comment at the top
of every single file. I didn't realise until I looked at this, but although
that's really common to see in C, C++, and Java you barely ever see it in the
languages that I spend time in (Rust, Haskell, C#, Python, various flavours of
Lisp, JavaScript).

~~~
adrianN
Wait another fifteen to twenty years or so until Python and Javascript are
"enterprise ready" and this will change.

~~~
TorKlingberg
At least nobody is keeping a revision history at the top of the file any more.

~~~
adrianN
You'd be surprised...

------
chriswarbo
Would be nice to see Haskell included; would ".", "$", "<$>", "<*>", ">>=",
etc. count as words? ;)

Made even better by the use of the language's logo as the word cloud's shape;
the Haskell logo is ">λ="
[https://www.haskell.org/static/img/haskell-logo.svg](https://www.haskell.org/static/img/haskell-logo.svg)

~~~
anvaka
Added: [https://anvaka.github.io/common-words/#?lang=hs](https://anvaka.github.io/common-words/#?lang=hs)

Unfortunately the symbols will not show up, because I'm ignoring them:
[https://github.com/anvaka/common-words/blob/master/data-extr...](https://github.com/anvaka/common-words/blob/master/data-extract/ignore/index.js#L4)

~~~
chriswarbo
> Added: [https://anvaka.github.io/common-words/#?lang=hs](https://anvaka.github.io/common-words/#?lang=hs)

> Unfortunately the symbols will not show up, because I'm ignoring them

Makes sense. I notice that some funky unicode stuff has still managed to come
out quite high, e.g. ⊇ ("superset of or equal to") :)

------
TuringTest
The layout algorithm for the word cloud is awesome! How is it made?

~~~
sambeau
Unfortunately it's not using size as a metric like most word clouds. This
confused me at first. Look at the size of 'err' in the Go layout.

~~~
anvaka
It uses it as long as there is enough space. Once it fails to find a rectangle
to fit a new word, it tries to reduce the size of the word.

In general, word clouds are bad for comparing sizes. For that reason I used a
plain list in the sidebar on the left (or at the bottom if you are on mobile).
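
A minimal sketch of that shrink-to-fit idea, not the site's actual layout code: it assumes a plain grid of cells, a word of size `s` occupying a rectangle `len(word) * s` cells wide and `s` cells tall, and a first-free-spot scan. The names `fits` and `place_words` are made up for illustration.

```python
def fits(grid, x, y, w, h):
    """True if the w x h rectangle at (x, y) covers only free cells."""
    return all(not grid[r][c] for r in range(y, y + h) for c in range(x, x + w))


def place_words(words, width, height, min_size=1):
    """Greedy layout: try each word at its wanted size, shrink it if no room.

    words is a list of (word, wanted_size) pairs, most frequent first.
    Returns {word: (x, y, final_size)}; words that never fit are left out.
    """
    grid = [[False] * width for _ in range(height)]
    placed = {}
    for word, size in words:
        while size >= min_size and word not in placed:
            w, h = len(word) * size, size
            # Scan row by row for the first free rectangle of this size.
            spot = next(
                ((x, y)
                 for y in range(height - h + 1)
                 for x in range(width - w + 1)
                 if fits(grid, x, y, w, h)),
                None,
            )
            if spot is None:
                size -= 1  # no free rectangle: retry with a smaller word
                continue
            x, y = spot
            for r in range(y, y + h):
                for c in range(x, x + w):
                    grid[r][c] = True  # mark the cells as taken
            placed[word] = (x, y, size)
    return placed


layout = place_words([("err", 2), ("nil", 2)], width=6, height=3)
```

Here "nil" asks for the same size as "err" but ends up smaller because only one row is left, which is the effect that makes sizes in the cloud hard to compare.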

------
ddavis
The slow adoption of modern C++ is very apparent. Some things are not even
listed: forward, unique_ptr, shared_ptr, tuple, constexpr. nullptr is much
lower than NULL and move is quite low. These features are now 6+ years old! ;)

~~~
mschuetz
Smart pointers really need a native way to be specified, similar to how we use
& to declare references and * for pointers. Filling your code with
shared_ptr<Something> and the likes is just not going to win the majority
over, even if it's useful.

~~~
johannes1234321
A common way is having a "using FooPtr = shared_ptr<Foo>" somewhere and then
only use FooPtr. This also reduces the occurrences in that word cloud.

~~~
ddavis
> This also reduces the occurrences in that word cloud

Ah, yes, very good point.

------
kosma
This might just be my Python upbringing, but... am I the only one to be
troubled by Go's single-letter words? I've always found Go code very hard to
read because it isn't self-descriptive at all.

~~~
Walkman
When you have a strong static type system like Go has, you don't need
descriptive names that much, because even single letter variable names are
evident, because the declarations are there, close to the variable names. It
needs a bit of getting used to if you only programmed dynamic or scripting
languages.

~~~
masklinn
> you don't need descriptive names that much, because even single letter
> variable names are evident

That's an annoying lie, and makes codebases in languages with genuinely strong
static type systems (e.g. Haskell) much harder to read than necessary.

Hell, C#'s type system is stronger than Go's, yet there is no such prevalence
of meaningless variable names (first one comes in at #23, Go has 7 single-
letter variable names in the top 23).

~~~
bnegreve
Still, I think GP's comment is interesting. I do agree that type name and
variable/argument name can be redundant in statically typed languages, such as
c++ or java. (I've never used Go)

For example,

    
    
        int getAge(Person p);  
    

is clear enough, whereas

    
    
        int getAge(Person person);
    

is redundant.

Haskell does type inference, so even if types are static they are not
explicit. That's why you still need explicit variable names.

~~~
msluyter
I think that depends on the length of getAge(). If it's 3 lines, 'p' probably is
fine. If it's 200 -- which, perhaps it shouldn't be but that's another issue
-- then person is probably a better choice because you may lose the original
context as you scan the method.

Also, in Java, if you use IntelliJ then you'll probably get 'person' as an
autocomplete, which actually makes it roughly as easy to type out as 'p' (and
a better choice if your entire team has standardized on IntelliJ).

------
dustinmoris
I am quite surprised that "self" is so much more used than the next word in
the list "if". I know many languages where "self" is not a keyword at all, but
I cannot think of a single language where "if" is not a keyword.

EDIT: Oops, I missed that there is a language filter. Ignore my comment :)

~~~
goatlover
I don't think if is a keyword in Smalltalk. Io might not have it either.

~~~
masklinn
> I don't think if is a keyword in Smalltalk.

It's so not a keyword it doesn't even exist. At least in the Smalltalks I've
used, they provided ifTrue: and ifFalse: (and compositions thereof).

------
ape4
Besides "return", C/C++ doesn't have a particular word that stands out.
Probably because you just write things, e.g. you don't put "function" in
front of a function.

~~~
mojuba
Yes, I also thought that a list for each language shows exactly what's wrong
with that language. For some it's self/this, for Java it's import and return, etc.
For C++ there's no clear winner, because it is minimalistic by nature (mostly
due to C heritage of course).

~~~
kibwen
_> for C++ there's no clear winner, because it is minimalistic by nature_

I presume I'm missing the sarcasm here...

------
macygray
Cool stats about Java. "import" is the most frequently used word. It looks
like everything already exists in Java, so just import all the things, add some
glue code and you are done.

~~~
gravypod
Then why isn't this the case for Python? It's equally kitchen-sink,
right?

~~~
the_duke
Java doesn't have syntax for importing more than one member of a package in
one statement.

Either you import a single member, or all with *.

Combine that with IDEs that auto-create the import statement, and there you
go.

~~~
the_duke
I just remembered: also, each public class in Java has to be in its own file.
That might very well be the biggest factor.

------
mschuetz
The world would be a much better place if self (Python) and this (JavaScript ES6
classes) were implicit.

~~~
donatj
I heartily disagree. Implicit code is easy to write but hard to read. I'd
much rather have it be very explicit where a value comes from, both to avoid
hard-to-see bugs and to make the code easier to understand from a
fragment.

~~~
mschuetz
I find it unnecessarily bloats code and in most cases makes it harder to read
due to redundant text, e.g.

    
    
        # 1
        def length(self):
            return math.sqrt(self.x*self.x + self.y*self.y + self.z*self.z)

        # 2
        def length(self):
            return math.sqrt(self.x**2 + self.y**2 + self.z**2)

        # 3
        def length():
            return math.sqrt(x*x + y*y + z*z)
        

I prefer version 3 by far. Unfortunately, Javascript decided to take the same
path with ES6 classes which forces you to use this in the body. Fortunately,
it does not force you to use this in the argument list.

~~~
lojack
I think that goes against the python dogma of "Explicit is better than
implicit." In #1 and #2 there is no question about where 'x' comes from, while
in #3 it could be a class variable or a global variable or from just about
anywhere.
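
A small sketch of that ambiguity in today's Python, where bare names inside a method are looked up in local, enclosing, module and builtin scopes but never on the instance. The class `Vec` and the module-level `x`, `y`, `z` are invented for the example.

```python
import math

# Module-level names that happen to match the attribute names.
x = y = z = 10.0


class Vec:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def length_explicit(self):
        # Attribute access: always reads this instance's fields.
        return math.sqrt(self.x**2 + self.y**2 + self.z**2)

    def length_bare(self):
        # Bare names: resolved in the module scope here, not on the
        # instance, so the result ignores the vector's own components.
        return math.sqrt(x**2 + y**2 + z**2)


v = Vec(3.0, 4.0, 0.0)
explicit = v.length_explicit()  # uses 3, 4, 0
bare = v.length_bare()          # uses the module-level 10, 10, 10
```

With an implicit `self`, a reader of `length_bare` alone could not tell which of the two results to expect.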

~~~
blauditore
Huh, I experience python as quite the opposite of "Explicit is better than
implicit":

- No static types

- A variable might belong to the scope of a method, object or class,
depending on where it was first set and changed afterwards

- Implicit execution of code on import of a module (__init__.py, including
parent packages)

- Any object is truthy or falsey, i.e. conditional statements don't require
an explicit boolean

~~~
sweeneyrod
You can get the "Python dogma" with `import this`. "Explicit is better than
implicit" is part of it, but so is "practicality beats purity" (and "There
should be one-- and preferably only one --obvious way to do it, although that
way may not be obvious at first unless you're Dutch").

------
mpjme
This would improve a lot if they filtered out comments.

~~~
chriswarbo
Really? I think it's interesting to see "TODO" feature quite prominently in
Python, for example :)

~~~
imron
Perhaps both?

I also think filtering out comments would improve it - especially because so
many source files include a copyright statement at the top, and the same
licenses (MIT, GPL, Apache, etc) are found repeated in many different files
and it distorts the results somewhat.

~~~
anvaka
The copyrights are filtered out, because indeed there was a lot of them.

------
donatj
I think it speaks well towards the ferocity and forced completeness of Go's
error handling that err is the most used word.

~~~
di4na
I would say it shows that this is a place where the compiler could help :D

~~~
Klathmon
As much as I hate all the typing I do for error checking in Go, I just don't
think a compiler can handle errors that well on its own yet.

The explicitness of Go's error handling forces you to handle every error
specifically. There isn't a chance (for the most part) that you'll get an
error from deep in the program that you can't easily handle.

Forcing you to handle them everywhere and anywhere forces error handling to be
a part of your architecture.

~~~
di4na
Erlang went the other way around by designing a system that enables you to not
handle all these errors.

I am more from this school of designing a system around the reality instead of
trying to patch it everywhere, praying we have enough fabrics to catch it all.

But it would mean rethinking how we build stuff. That was not at all a goal of
Go.

------
cpsempek
I do appreciate this, but word clouds really are a terrible visualization
method for text data. With regard to the python example, I cannot grasp at all
if the frequency of self and None are similar or drastically different. The
table of values is more informative and less likely to be misread.

------
enitihas
Looking at Scala, the difference between val and var is huge, with val being
at 2nd, and var at 38.

~~~
virtualwhys
Usage of `var` as idiomatic Scala is an oft-used trolling mechanism.

I wonder what percentage of `_` usage is value/type discarding in pattern
matching and type signatures vs. function application.

Would be nice to somehow ditch `case`:

    
    
        adt match {
          Foo(x) if cond x => ...
          Bar(x) => ...
        }
    
        pairs.map{ (a,b) =>
          ...
        }

------
Walkman
One can draw very interesting conclusions based purely on this. Examples:

- Python developers do not follow Clean Code (à la Uncle Bob) as much as Ruby
developers, because the if statement is more frequent than def and return.

- Ruby makes it possible to write in a much more functional style than
Python. OR Ruby developers like to develop more in a functional style than
Python developers.

- People don't really care about good variable names ("a" is a terrible
variable name in scripting languages like JS and Python, yet it is still in the top 11)

- PHP developers might practice "return early" in functions (more return than
function keywords) OR their functions just do too much :)

~~~
dagw
_- Ruby makes it possible to write in a much more functional style than
Python. OR Ruby developers like to develop more in a functional style than
Python developers._

I instinctively think you might be right (at least with your second
statement), but what are you using as your metric here?

~~~
smnplk
I am not sure Ruby devs like functional style that much. The parent
mentioned Uncle Bob and his influence on the Ruby community with Clean Coders. I
think they are more into OO.

------
Insanity
I was really amused by the logos / names in the word clouds. Nice project!

~~~
anvaka
Thank you :)

------
pcwalton
The Rust compiler seems somewhat overrepresented in this data set. I see
"ccx", "fcx", and "CrateContext", which are only used in the Rust compiler
itself.

------
questerzen
It is probably a sign of a good language that the words used most should be
similar in frequency to their use in pseudo code. When words like "end"
(Ruby), "self" (Python), "import" (Java), "err"/"error" (Go and Node) are
over-represented, it's likely a sign that the language is introducing
accidental complexity. By this metric Swift looks astonishingly sane.

------
realworldview
It would be interesting to see the frequency of words found in comments (TODO,
FIXME, LATER, OMG) for each language too.

~~~
the_duke
Also, there is the glorious phrase "Should never happen", with 24 million
results on Github:
[https://github.com/search?q=should+never+happen&ref=simplese...](https://github.com/search?q=should+never+happen&ref=simplesearch&type=Code&utf8=%E2%9C%93)

~~~
bpicolo
Perfect time to throw an UnreachableCode exception or something, hah

------
cven714
Pretty cool! Some unexpected results, or at least not what I guessed.
"summary" as the top for C#, "SELECT" all the way down at #43 for SQL, "err"
as the top for Go (I'm sure that will spawn some pleasant discussion).

~~~
pepve
I think for SQL their sample includes just ".sql" files, which tend to contain
schema definitions and data dumps, hence CREATE and INSERT. Most of it not
handwritten also.

~~~
cr0sh
I was surprised by the SQL thing too; your explanation makes perfect sense!

------
cr0sh
This is pretty neat! I was surprised to see that for SQL, SELECT was so far
down the list.

I also wonder what the criteria are for which languages to analyze? There are a
few other languages I would like to see, but maybe on github they aren't well
represented...

~~~
anvaka
Thanks!

I'm using the file extension to differentiate between languages.

You can request other languages here:
[https://github.com/anvaka/common-words/issues/4](https://github.com/anvaka/common-words/issues/4)

As long as a language's extension is unique, I think I can make a visualization
of it.
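
A rough sketch of extension-based bucketing, under the assumption that each file is assigned to exactly one language by its suffix. The mapping `EXT_TO_LANG` and both function names are illustrative and are not taken from the project's code.

```python
from collections import Counter
from pathlib import PurePosixPath

# Illustrative mapping only; the real project may use different keys.
EXT_TO_LANG = {".py": "python", ".go": "go", ".hs": "haskell", ".cs": "csharp"}


def language_of(path):
    """Return the language for a file path, or None if the suffix is unknown."""
    return EXT_TO_LANG.get(PurePosixPath(path).suffix.lower())


def files_per_language(paths):
    """Count files per language, skipping files with unmapped extensions."""
    return Counter(lang for lang in map(language_of, paths) if lang)


counts = files_per_language(["src/Main.hs", "cmd/main.go", "lib/util.PY", "README"])
```

This also shows why a unique extension matters: a suffix shared by two languages (for example `.h` for both C and C++) cannot be assigned to one bucket by this kind of lookup.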

------
HissingSound
Good site.

What did you use to get and analyze code and how much code did you analyze?

I thought about something like this, but for the most used words in variable
names, method names, etc., or even a variable-name generator based on a Markov chain.

~~~
Matumio
[https://github.com/anvaka/common-words#how-was-the-data-
coll...](https://github.com/anvaka/common-words#how-was-the-data-collected)

------
imron
This is really nifty, unfortunately it includes comments and so with thousands
of files all including copyright notices, 'the' is the 3rd most popular word
in c++ files.

~~~
anvaka
I tried to exclude copyright lines as much as I could. I used "license
markers" for that, but I might have missed something.

Here is more information about it:
[https://github.com/anvaka/common-words#how](https://github.com/anvaka/common-words#how)

~~~
imron
That's good to hear. I didn't look into it in too much depth, I just thought
it was strange that 'the' was so high for C++, so I clicked on it to see example
usage and got things like:

    
    
       ** use the contact form at http://qt.digia.co/contact-us.
    
       furnished to do so, subject to the following conditions:
    
       * This file is part of the LibreOffice project.
    
       // with this library; see the file COPYING3. If not see
        

So assumed licenses had not been excluded.

Having a brief look at the source, I think with the licence marking approach
it's still leaving in quite a few lines from each licence (see above for
examples).
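
To illustrate why a per-line marker check can leave licence text behind, here is a sketch that assumes the filter drops only lines containing a marker word. The marker list and the first sample line are hypothetical; the next two sample lines are taken from the examples above.

```python
import re
from collections import Counter

# Hypothetical marker words; the project's real list may differ.
LICENSE_MARKERS = ("copyright", "license", "licence", "warranty")


def strip_license_lines(lines):
    """Drop every line that contains one of the marker words."""
    return [ln for ln in lines
            if not any(marker in ln.lower() for marker in LICENSE_MARKERS)]


def word_counts(lines):
    """Count lowercase alphabetic words across all lines."""
    return Counter(re.findall(r"[a-z]+", " ".join(lines).lower()))


sample = [
    "// Copyright (C) 2016 Example Corp.",  # made-up header line
    "// furnished to do so, subject to the following conditions:",
    "// with this library; see the file COPYING3. If not see",
    "int main() { return 0; }",
]
kept = strip_license_lines(sample)
counts = word_counts(kept)
```

Only the first line contains a marker, so the two continuation lines survive and each still contributes a "the" to the count.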

------
igravious
Contrary to popular opinion neither `s' nor `t' are words. At least not in
English anyway. :/ Or do they mean that these characters appeared as variable
names?

------
smnplk
I totally expected "should" to be at the top in ruby. Code bases with 2:1 or
even more test code to implementation code ratio = standard.

------
tomw1808
So happy I couldn't find "goto" :)

~~~
Matumio
You wish. It's #325 after you switch to cpp.

------
mi_lk
Well, Lua's word list starts with index 0...

------
xyclos
anyone else find it at all ironic that the _least_ common word in JS
appears to be "validate"?

------
traviswingo
Where do the language files get pulled from? GitHub API or web scraping? Or is
it not file parsing and some other method?

Super rad project :)

~~~
anvaka
Thanks! The data comes from a GitHub snapshot, stored on BigQuery. Here are more
details about it:
[https://github.com/anvaka/common-words#how](https://github.com/anvaka/common-words#how)

------
Dowwie
I don't know whether I could believe this study as "foo" and "bar" aren't on
the list.

~~~
qarioz
Well I will slay my developers before they put the metasyntactic variables on
master. I allow i to pass though. Go to j and you have O(n^2) and I will slay
them again.

------
d--b
summary is the most used word in C#! Only from comments! Amazing. EDIT: it
looks like this is reading files that are common to all projects... which is
why the sentence "// The following GUID is for the ID of the typelib if this
project is exposed to COM" appears 459k times!

~~~
masklinn
> EDIT: it looks like this is reading files that are common to all projects…

Yeah, autogenerated files/comments feature extremely prominently e.g. the top
two items for "should" in Ruby are autogenerated Rails comments.

------
woliveirajr
Java: funny to see that "if" is used a lot to point to "visit oracle.com
_if_..."

------
adrianlmm
Interesting that for Ruby the most used word is "option", which is not even a
Ruby word but a Rails one.

------
alexwebb2
Interesting that in JS, `let` is ranked 536, far lower than I expected.

------
lgessler
Is anyone investigating how well these numbers fit a Zipf distribution?
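
One simple way to check, sketched here under the assumption that a straight-line fit on log-log axes is good enough: estimate the slope of log(count) against log(rank). A Zipf-like list gives an exponent near 1. The function name `zipf_exponent` is made up, and the input is synthetic rather than the site's data.

```python
import math


def zipf_exponent(counts):
    """Negated least-squares slope of log(count) versus log(rank).

    Needs at least two positive counts; ranks are assigned after sorting
    the counts in descending order.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic counts that follow count = 1000 / rank exactly.
ideal = [1000 / rank for rank in range(1, 51)]
exponent = zipf_exponent(ideal)
```

Feeding it the per-language counts from the site's list would show how far each language sits from that ideal.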

------
oblib
"that didn't work"

------
AznHisoka
Word clouds may look cool but they are horrible for conveying information.
This should have been just a simple ordered list.

~~~
xyclos
There is an ordered list on the page. If you're on a phone you can tap "show
list" on the bottom.

------
singularity2001
django > try (20 vs 21) try again

~~~
Ensorceled
In my django projects, I have many lines like:

    
    
        from django.<path> import <stuff>
    

for every try block.

------
bryanrasmussen
this study is worthless without a section on frequency of swearwords and
virulence of same.

------
pomber
and the most used word in .cs files is "summary"

------
fagnerbrack
A lot of meaningful and legible words out there

/s

