In most languages, the top words are self/this and function/def. In Go, on the other hand, the most common word is "err", and the words of "if err != nil" make up three of the top four. I really wonder how big a part of Go code is just propagating errors.
One day, maybe half a decade from now, Rob Pike will wake up from a nightmare and suddenly realise they got this one wrong. Maybe it'll occur at the same time as the generics one.
There are some ways to make error handling much less annoying in Go. Since error is just an interface, it's pretty easy to make monadic constructs that carry the error information and let you write pipelined code. IMHO it makes the code much cleaner and easier to read, but a lot of golang enthusiasts haven't really adopted the idea because it's quite different from the established idioms around error handling. Hard-core gophers love them some verbose and stupidly explicit code, but I'm hoping that as the language gains adoption in the wider community the voice of reason will win here.
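For the unfamiliar, here's a minimal sketch of the idea (the names are mine; the pattern is close to the errWriter example from Rob Pike's "Errors are values" post): a struct carries the first error through a chain of steps, and you check once at the end instead of after every call.

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // parser carries the first error through a chain of steps;
    // once err is set, later steps become no-ops.
    type parser struct {
        err error
    }

    func (p *parser) trim(s string) string {
        if p.err != nil {
            return ""
        }
        return strings.TrimSpace(s)
    }

    func (p *parser) toInt(s string) int {
        if p.err != nil {
            return 0
        }
        n, err := strconv.Atoi(s)
        if err != nil {
            p.err = err
        }
        return n
    }

    func main() {
        p := &parser{}
        n := p.toInt(p.trim("  42 "))
        if p.err != nil { // one check at the end instead of one per step
            fmt.Println("parse failed:", p.err)
            return
        }
        fmt.Println(n) // prints 42
    }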
Interfaces require casting and hurt performance, though.
You could come up with `Either` types, but due to the lack of generic structs you'd have to have a whole lot of those.
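A sketch of what that duplication looks like (the type names here are hypothetical):

    package result

    // One Either type per payload, since structs can't be generic:
    type IntResult struct {
        Value int
        Err   error
    }

    type StringResult struct {
        Value string
        Err   error
    }

    // ...and so on for every payload type you want to pipeline.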
Interestingly, Rust has basically the same error handling idiom, but verbosity is considerably reduced by the '?' operator (previously the try! macro).
When I was coding in Go, I would have loved that feature. The mind-numbing error checking in Go is hugely annoying.
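For anyone who hasn't written Go, a sketch of the repetition (the function is hypothetical); in Rust, each of the two checks below collapses into a single '?' appended to the call:

    package config

    import (
        "io/ioutil"
        "os"
    )

    // readConfig opens and reads a file: two calls, two checks.
    func readConfig(path string) ([]byte, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()

        data, err := ioutil.ReadAll(f)
        if err != nil {
            return nil, err
        }
        return data, nil
    }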
I've been working on putting together an in-depth series of blog posts about it. I'm presenting on the topic of better error handling in Go at the Agile Tech Conference this year, and I'd like to start getting everything in order soon; I'll probably post here on HN.
I think the latter is more Go-ish, although also less generic than the monadic approach; it's also more in keeping with the ideology of Go, but in practice I still never see it used much in production code.
This is not super-conventional, but some of the standard library uses panics in a similar way.
I think as long as panics don't cross library boundaries you're fine. And panics can cross library boundaries if the function is called MustConnect(). (Which means it'd be neat to have a pre-processor that generated MustX from X if it wasn't already there...)
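A sketch of the convention, following stdlib precedents like regexp.MustCompile and template.Must (the db package here is hypothetical):

    package db

    import "errors"

    type Conn struct{ addr string }

    // Connect returns an error for the caller to handle.
    func Connect(addr string) (*Conn, error) {
        if addr == "" {
            return nil, errors.New("db: empty address")
        }
        return &Conn{addr: addr}, nil
    }

    // MustConnect panics on failure. The Must prefix warns callers that
    // the panic may cross the library boundary; it's meant for program
    // setup, not for request handling.
    func MustConnect(addr string) *Conn {
        c, err := Connect(addr)
        if err != nil {
            panic(err)
        }
        return c
    }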
I figure it's probably proportional to its importance and difficulty of understanding: a longer, more descriptive name aids understanding, and things with broader scope tend to be more important.
I wonder the same thing about "elif" in Python. It's two extra characters to make it the more pleasant, readable "elseif". I don't understand why dropping those letters was felt to be necessary.
A lot of older parts of Python inherit the naming style common in C, which has always seemed to me like a compulsion to shorten everything, e.g. ls for list, mk for make. Newer parts of Python are healthier in this regard. But this also results in some weird inconsistencies, like mkdir vs makedirs.
Lots of uses of "the" in C++ code, massively skewed by giant projects that have a massive section of boilerplate copyright info in a comment at the top of every single file. I didn't realise until I looked at this that, although that's really common in C, C++, and Java, you barely ever see it in the languages I spend time in (Rust, Haskell, C#, Python, various flavours of Lisp, JavaScript).
In C# the first word is "summary" which is massively influenced by the way Visual Studio formats comments. I wonder though if this is an indication that a lot of people write comments.
It uses it as long as there is enough space. Once it fails to find a rectangle to fit a new word, it tries to reduce the size of the word.
In general, word clouds are bad for comparing sizes. For that reason I used a plain list in the sidebar on the left (or at the bottom if you are on mobile).
The slow adoption of modern C++ is very apparent. Some things are not even listed: forward, unique_ptr, shared_ptr, tuple, constexpr. nullptr is much lower than NULL, and move is quite low. These features are now 6+ years old! ;)
Smart pointers really need a native way to be specified, similar to how we use & to declare references and * for pointers.
Filling your code with shared_ptr<Something> and the likes is just not going to win the majority over, even if it's useful.
It saves you from having to explicitly name the type parameter, and you just need to pass in the arguments of the constructor. It's not as terse as & or *, but it's not that bad.
This might just be my Python upbringing, but... am I the only one to be troubled by Go's single-letter words? I've always found Go code very hard to read because it isn't self-descriptive at all.
Depends; if it's self-evident in the context what it means, like in a method declared as (p *player) Attack, then p makes perfect sense (to me), rather than typing 'player' everywhere in the function, just like typing i instead of index.
In short, I don't think there is one particular rule that applies; that said, my impression is that variable names in Go code are typically two or three letters, like err, buf, src, dst, ok, etc.
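A sketch of that receiver convention (the types are hypothetical): the declaration is never more than a screenful away, so the short name stays unambiguous.

    package game

    type player struct {
        hp int
    }

    func (p *player) Attack(target *player) {
        // p is obviously the player; spelling out "player" on
        // every line would add nothing.
        target.hp -= 10
    }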
Go is a language in the almost-forgotten algebraic tradition (loosely descending: Fortran → Algol → BCPL → C → Go), in which people prefer E=mc² to MULTIPLY REST-MASS BY SPEED-OF-LIGHT-IN-A-VACUUM BY SPEED-OF-LIGHT-IN-A-VACUUM GIVING ENERGY.
When you have a strong static type system like Go has, you don't need descriptive names that much, because even single letter variable names are evident: the declarations are right there, close to where the names are used. It takes a bit of getting used to if you've only programmed in dynamic or scripting languages.
> you don't need descriptive names that much, because even single letter variable names are evident
That's an annoying lie, and makes codebases in languages with genuinely strong static type systems (e.g. Haskell) much harder to read than necessary.
Hell, C#'s type system is stronger than Go's, yet there is no such prevalence of meaningless variable names (the first one comes in at #23; Go has 7 single-letter variable names in the top 23).
Still, I think GP's comment is interesting. I do agree that type name and variable/argument name can be redundant in statically typed languages, such as C++ or Java. (I've never used Go.)
For example,
    int getAge(Person p);

is clear enough, whereas

    int getAge(Person person);

is redundant.
Haskell does type inference, so even if types are static they are not explicit. That's why you still need explicit variable names.
I think that depends on the length of getAge(). If it's 3 lines, 'p' is probably fine. If it's 200 -- which perhaps it shouldn't be, but that's another issue -- then person is probably a better choice, because you may lose the original context as you scan the method.
Also, in Java, if you use IntelliJ then you'll probably get 'person' as an autocomplete, which makes it roughly as easy to type out as 'p' (and a better choice if your entire team has standardized on IntelliJ).
> I do agree that type name and variable/argument name can be redundant in statically typed languages, such as C++ or Java.
Except neither exposes that pattern (they essentially have only one such variable in their top list, and it's "i"). And I already mentioned C# which is very similar to Java.
> is redundant.
I'm not saying single-letter variables (or no variable at all, which is also possible in Haskell) are universally bad; I'm saying it is disturbing to find how absolutely ubiquitous they are in Go. Variable names are sometimes redundant, but that is not a universal constant. That redundancy is a function of the expressiveness of the type system, and that's hardly a claim to fame for Go.
> Haskell does type inference, so even if types are static they are not explicit.
Go has local type inference, and leveraging Haskell's global type inference is usually recommended against.
> That's why you still need explicit variable names.
My background is scientific and systems programming in C and C++. Currently I write a lot of Python for my job and a lot of Rust for my spare time hobby programming, which consists of things like writing a BLAS, reimplementing other C and C++ projects, etc. So I haven't only programmed in dynamic or scripting languages.
I strongly disagree with the assertion regarding single-letter variables. I would gently correct a junior programmer who tried to do that, and I would chew the hell out of a senior programmer who tried it. In any language.
I am quite surprised that "self" is so much more used than the next word in the list "if". I know many languages where "self" is not a keyword at all, but I cannot think of a single language where "if" is not a keyword.
EDIT: Oops, I missed that there is a language filter. Ignore my comment :)
Besides "return", C/C++ doesn't have a particular word that stands out. Probably because you just write write things. eg you don't put "function" in front of a function.
Yes, I also thought that the list for each language shows exactly what's wrong with that language. For some it's self/this, for Java it's import and return, etc. For C++ there's no clear winner, because it is minimalistic by nature (mostly due to its C heritage, of course).
Cool stats about Java. "import" is the most frequently used word. It looks like everything already exists in Java, so just import all the things, write some glue code, and you are done.
In Python you usually import a module and use that as a symbol, or `from module import Symbol1, Symbol2, Symbol3`.
Java doesn't have the former, and the latter requires an import per symbol (it also has the ability to import every symbol in a namespace, and so does Python, but that's usually discouraged)
I wholeheartedly disagree. Implicit code is easy to write but hard to read. I'd much rather have it be very explicit where a value comes from, both to avoid hard-to-see bugs and to make the code easier to understand from a fragment.
I prefer version 3 by far. Unfortunately, JavaScript decided to take the same path with ES6 classes, which force you to use this in the body. Fortunately, they do not force you to use this in the argument list.
I think that goes against the Python dogma of "Explicit is better than implicit." In #1 and #2 there is no question about where 'x' comes from, while in #3 it could be a class variable, a global variable, or something from just about anywhere.
The beauty of self in Python is that self is not at all magic: it merely indicates that the object instance you're using will be passed as the first argument of the method.
Also you can get a custom font with typographic ligatures (e.g. for self, lambda, and so on) to make it more visually appealing. For instance (self > 圖):
You can get the "Python dogma" with `import this`. "Explicit is better than implicit" is part of it, but so is "practicality beats purity" (and "There should be one-- and preferably only one --obvious way to do it, although that way may not be obvious at first unless you're Dutch").
...Doesn't work with `self` because that would be implicit again, though.
Anyway, I think it's a minor inconvenience that isn't important for one-liners and is extremely helpful when reading larger functions.
It actually uses fewer characters, since you've eliminated all the repeated uses of 'this.'. As a general rule I'm in favor of turning a one-liner into a two-liner if it improves readability, especially when it reduces actual typing.
Python's self in the arglist is a C struct pointer sneaking in from the '80s, which is when Python was designed. It could be excused, but there should be a deprecation PEP by now. Make it a keyword and let us type it only when we need it.
That might seem nice for your limited example, but it breaks down when you realize that in example #3 there would be no easy way to differentiate between scopes.
    y = 100
    x = 15

    class MyClass(object):
        x = 50

        def __init__(self, x):
            self.x = x

        def length():
            return x * y
What does that mean? Does MyClass(10).length() raise an AttributeError because MyClass doesn't have an attribute named y? Does it automatically recognize there's a y in the outer scope and use that, or does it call __getattr__ first (i.e. the method that gets called when a missing attribute is accessed)? Furthermore, how do I specify that I want to access the class attribute x, or that I want to access the nonlocal x?
In my opinion, there are far bigger problems with Python's scoping. Why is for/if/etc. not its own scope? You can define a temporary variable in there and mistakenly use it somewhere down the line. In a "for i in ..", i should go out of scope after the loop, but it doesn't. Why can't you create a new scope within a function? That would let you bundle related stuff together and be sure the scope is cleared afterwards, so as to not mistakenly reuse variables. The thing you're promoting as an advantage of explicit self is already broken by having a single scope for the whole function.
Blocks like this are an immensely useful way to keep scopes clean:
    int someMethod() {
        ...
        {
            // Some small stuff that doesn't warrant a new method, but that
            // you don't want to bleed into the rest of the function, e.g.:
            float x = readNextFloat();
            float y = readNextFloat();
            float z = readNextFloat();
            float length = std::sqrt(x*x + y*y + z*z);
            std::cout << length;
        }
        // do some other things without worrying about potentially initialized variables
        ...
    }
Don't call it "s", because then you break consistency basically with the whole Python ecosystem. That's not very nice for fellow programmers who already used to read and understand "self".
JavaScript 'this' binding behaves in quite unexpected ways if you're coming from something like Python or Ruby, which is a major source of trouble for people.
In fact, our coding style demands that we avoid using instance variables in favor of adding `attr_accessor`, to explicitly encourage the use of implicit self.
One example: self.attribute = attribute in an ActiveRecord class, when the left side is a field of a table and the right side is a variable. A different name for the variable fixes it without using self.
For example, in C++, where it's implicit, most code styles require adding a prefix like "m_" or something similar to indicate that the variable is a class member.
So then I expect the same would happen in Python and JavaScript. I don't think it would be better that way.
Point of order: this comment has been downvoted into greyness. My understanding is that we should downvote when a comment is low-value, not when we disagree with it, but I suspect the latter has happened here. Could anyone who has downvoted or is tempted to downvote this instead make a comment presenting their critique?
Don't complain about down votes, especially after only 15 minutes. There is inherent randomness in votes and it only takes one or two votes to grey out a comment. Most of the time these things work themselves out within an hour or two. In this case the comment was black within 12 minutes of your post.
I also think filtering out comments would improve it - especially because so many source files include a copyright statement at the top, and the same licenses (MIT, GPL, Apache, etc) are found repeated in many different files and it distorts the results somewhat.
In that particular case, I'd say it's quite interesting indeed!
Presumably "summary" appears quite a lot because C# developers use markup like `<summary>` in their comments, so automated systems can build documentation (I've never used a Microsoft programming language, but a quick search brought me to https://msdn.microsoft.com/en-us/library/z04awywx.aspx ).
In that sense, it's not really a comment anymore: it's one machine-readable language embedded inside another.
That's certainly interesting, to me at least. It tells me about the signal/noise ratio of the language, the prevalence of various forms of documentation (e.g. <summary> is conventional, whilst something like <precondition> is not), etc.
Such terms clearly have an effect on a system's documentation, even if they don't have an effect on the CPU instructions being executed. But I'm a programmer, not a CPU; text files containing source code are my main I/O interface, and they most certainly do contain such markup, and hence I find it interesting to see statistics about. In comparison, I don't step through very much assembly day to day, so I don't really care very much about the compiler output (the part which the comments don't affect). I prefer to reason at the level of the language I'm using, where not only do comments appear, they're very useful!
> Presumably "summary" appears quite a lot because C# developers use markup like `<summary>` in their comments, so automated systems can build documentation
Yes, and the IDE will auto-generate a doc comment with a <summary> because that's pretty much the most basic doc comment you can get.
> In that sense, it's not really a comment anymore: it's one machine-readable language embedded inside another.
My issue is not that it's a comment, it's that it is essentially worthless as your IDE's basic "add method" intention (or whatever) is going to add it automatically.
> That's certainly interesting, to me at least. It tells me about the signal/noise ratio of the language, the prevalence of various forms of documentation (e.g. <summary> is conventional, whilst something like <precondition> is not), etc.
<summary> isn't just conventional; it's the primary tag used by the C# documentation system and shown by IntelliSense. <precondition> is not.
> it is essentially worthless as your IDE's basic "add method" intention (or whatever) is going to add it automatically
Just because an IDE will write boilerplate automatically, that doesn't mean the boilerplate wasn't written, checked into version control, presented to developers, etc. Even if such boilerplate were added by an IDE, and hidden from developers (e.g. using code folding), it's still there in the language.
In this case, the language is C#, not e.g. some "C#-like" language which gets preprocessed/transpiled by an IDE into C# by scattering boilerplate around.
Whilst tooling can help us live with a language's deficiencies, it doesn't remove those deficiencies ;)
Well, there is a sense in which the language you write (which may not be the language you read) is defined by how you interact with the development environment to produce code.
Which is why I prefer a language where I just need to learn one language, and not a separate input language because the language-as-read is too unergonomic to write, so that a different language needs to be defined for productively writing code.
As much as I hate all the typing I do for error checking in Go, I just don't think a compiler can handle errors that well on its own yet.
The explicitness of Go's error handling forces you to handle every error specifically. There isn't a chance (for the most part) that you'll get an error from deep in the program that you can't easily handle.
Forcing you to handle them everywhere and anywhere makes error handling a part of your architecture.
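Concretely, the idiom pushes you to decide at every call site what a failure means; the common move is to annotate the error with context as it propagates (a sketch, all names hypothetical):

    package profile

    import (
        "errors"
        "fmt"
    )

    type Profile struct{ Name string }

    // fetchUser stands in for a call that can fail deep in the program.
    func fetchUser(id int) (*Profile, error) {
        return nil, errors.New("connection refused")
    }

    func loadProfile(id int) (*Profile, error) {
        p, err := fetchUser(id)
        if err != nil {
            // Decide here: annotate and propagate, retry, or fall back.
            return nil, fmt.Errorf("loading profile %d: %v", id, err)
        }
        return p, nil
    }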
Erlang went the other way around, designing a system that enables you to not handle all these errors.
I am more from the school of designing the system around that reality, instead of trying to patch it everywhere, praying we have enough fabric to catch it all.
But it would mean rethinking how we build stuff. That was not at all a goal of Go.
I do appreciate this, but word clouds really are a terrible visualization method for text data. With regard to the Python example, I cannot grasp at all whether the frequencies of self and None are similar or drastically different. The table beside it is more informative and harder to misread.
One can make very interesting conclusions based purely on this. Examples:
- Python developers do not follow Clean Code (a la Uncle Bob) as much as Ruby developers, because the if statement is more frequent than def and return.
- Ruby makes it possible to write in a much more functional style than Python. OR Ruby developers like to develop more in a functional style than Python developers.
- People don't really care about good variable names ("a" is a terrible variable name in scripting languages like JS and Python, yet it's still in the top 11)
- PHP developers might practice "return early" in functions (more return than function keywords) OR their functions just do too much :)
> - Ruby makes it possible to write in a much more functional style than Python. OR Ruby developers like to develop more in a functional style than Python developers.
I instinctively think you might be right (at least with your second statement), but what are you using as your metric here?
I am not sure that Ruby devs like to do much functional style; the parent mentioned Uncle Bob and his influence on the Ruby community with Clean Coders. I think they are more into OO.
> One can make very interesting conclusions based on purely this.
With a large grain of salt
1. Generally, the thing seems to mix words from all contexts. For instance, #6 in Ruby is "should"; click on the word and it's mostly comments. Same for Rust's #4, "the".
3. It's also unclear which codebases are parsed; there are lots of Django-isms in the conditional examples (if context is None, if request.method == 'POST', if form.is_valid).
See also: Go, where most every function call requires an `if`
> - Ruby makes it possible to write in a much more functional style than Python. OR Ruby developers like to develop more in a functional style than Python developers.
Don't confuse writing more functions/methods and writing in a more functional style, they're very different things.
Comparing word clouds of comments would actually be interesting. But then you hit the issue of "what is a comment"; e.g. many languages use comments for item documentation, but Python uses (doc)strings.
The Rust compiler seems somewhat overrepresented in this data set. I see "ccx", "fcx", and "CrateContext", which are only used in the Rust compiler itself.
It is probably a sign of a good language that the words used most are similar in frequency to their use in pseudocode. When words like "end" (Ruby), "self" (Python), "import" (Java), or "err"/"error" (Go and Node) are over-represented, it's likely a sign that the language is introducing accidental complexity. By this metric, Swift looks astonishingly sane.
Pretty cool! Some unexpected results, or at least not what I guessed. "summary" as the top for C#, "SELECT" all the way down at #43 for SQL, "err" as the top for Go (I'm sure that will spawn some pleasant discussion).
I think for SQL their sample includes just ".sql" files, which tend to contain schema definitions and data dumps, hence CREATE and INSERT. Most of it isn't handwritten, either.
This is pretty neat! I was surprised to see that for SQL, SELECT was so far down the list.
I also wonder what the criteria are for which languages to analyze. There are a few other languages I would like to see, but maybe they aren't well represented on GitHub...
This is really nifty; unfortunately, it includes comments, and so with thousands of files all including copyright notices, "the" is the 3rd most popular word in C++ files.
That's good to hear. I didn't look into it in much depth; I just thought it was strange that "the" was so high for C++, so I clicked on it to see example usage and got things like:
** use the contact form at http://qt.digia.co/contact-us.
furnished to do so, subject to the following conditions:
* This file is part of the LibreOffice project.
// with this library; see the file COPYING3. If not see
So I assumed licenses had not been excluded.
Having had a brief look at the source, I think that with the licence-marking approach it's still leaving in quite a few lines from each licence (see above for examples).
Contrary to popular opinion, neither 's' nor 't' is a word. At least not in English, anyway. :/ Or do they mean that these characters appeared as variable names?
Well, I will slay my developers before they put metasyntactic variables on master. I allow i to pass, though. Go to j and you have O(n^2), and I will slay them again.
"summary" is the most used word in C#! And only from comments! Amazing. EDIT: it looks like this is reading files that are common to all projects... which is why the sentence "// The following GUID is for the ID of the typelib if this project is exposed to COM" appears 459k times!