In most languages, the top words are self/this and function/def. In Go, on the other hand, the most common word is "err", and the words of "if err != nil" make up three of the top four. I really wonder how big a part of Go code is just propagating errors.
One day, maybe half a decade from now, Rob Pike will wake up from a nightmare and suddenly realise they got this one wrong. Maybe it'll occur at the same time as the generics one.
There are some ways to make error handling much less annoying in Go. Since error is just an interface, it's pretty easy to make monadic constructs that carry the error information and let you write pipelined code. IMHO it makes the code much cleaner and easier to read, but a lot of golang enthusiasts haven't really adopted the idea because it's quite different from the established idioms around error handling. Hard-core gophers love them some verbose and stupidly explicit code, but I'm hoping that as the language gains adoption in the wider community the voice of reason will win here.
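For the unfamiliar, here's a minimal sketch of the idea (the names are mine; the pattern is close to the errWriter example from Rob Pike's "Errors are values" post): a struct carries the first error through a chain of steps, and you check once at the end instead of after every call.

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // parser carries the first error through a chain of steps;
    // once err is set, later steps become no-ops.
    type parser struct {
        err error
    }

    func (p *parser) trim(s string) string {
        if p.err != nil {
            return ""
        }
        return strings.TrimSpace(s)
    }

    func (p *parser) toInt(s string) int {
        if p.err != nil {
            return 0
        }
        n, err := strconv.Atoi(s)
        if err != nil {
            p.err = err
        }
        return n
    }

    func main() {
        p := &parser{}
        n := p.toInt(p.trim("  42 "))
        if p.err != nil { // one check at the end instead of one per step
            fmt.Println("parse failed:", p.err)
            return
        }
        fmt.Println(n) // prints 42
    }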
Interfaces require casting and hurt performance, though.
You could come up with `Either` types, but due to the lack of generic structs you'd have to have a whole lot of those.
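A sketch of what that duplication looks like (the type names here are hypothetical):

    package result

    // One Either type per payload, since structs can't be generic:
    type IntResult struct {
        Value int
        Err   error
    }

    type StringResult struct {
        Value string
        Err   error
    }

    // ...and so on for every payload type you want to pipeline.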
Interestingly, Rust has basically the same error handling idiom, but verbosity is considerably reduced by the '?' operator (previously the try! macro).
When I was coding in Go, I would have loved that feature. The mind-numbing error checking in Go is hugely annoying.
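For anyone who hasn't written Go, a sketch of the repetition (the function is hypothetical); in Rust, each of the two checks below collapses into a single '?' appended to the call:

    package config

    import (
        "io/ioutil"
        "os"
    )

    // readConfig opens and reads a file: two calls, two checks.
    func readConfig(path string) ([]byte, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()

        data, err := ioutil.ReadAll(f)
        if err != nil {
            return nil, err
        }
        return data, nil
    }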
I've been working on putting together an in-depth series of blog posts about it. I'm presenting on the topic of better error handling in Go at the Agile Tech Conference this year, and I'd like to start getting everything in order soon; I'll probably post here on HN.
I think the latter is more Go-ish, although also less generic than the monadic approach; it's also more in keeping with the ideology of Go, but in practice I still never see it used much in production code.
This is not super-conventional, but some of the standard library uses panics in a similar way.
I think as long as panics don't cross library boundaries you're fine. And panics can cross library boundaries if the function is called MustConnect(). (Which means it'd be neat to have a pre-processor that generated MustX from X if it wasn't already there...)
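A sketch of the convention, following stdlib precedents like regexp.MustCompile and template.Must (the db package here is hypothetical):

    package db

    import "errors"

    type Conn struct{ addr string }

    // Connect returns an error for the caller to handle.
    func Connect(addr string) (*Conn, error) {
        if addr == "" {
            return nil, errors.New("db: empty address")
        }
        return &Conn{addr: addr}, nil
    }

    // MustConnect panics on failure. The Must prefix warns callers that
    // the panic may cross the library boundary; it's meant for program
    // setup, not for request handling.
    func MustConnect(addr string) *Conn {
        c, err := Connect(addr)
        if err != nil {
            panic(err)
        }
        return c
    }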
I figure it's probably proportional to its importance and difficulty of understanding: a longer, more descriptive name aids understanding, and things with broader scope tend to be more important.
I wonder the same thing about "elif" in Python. It's two extra characters to make it the more pleasant, readable "elseif". I don't understand why dropping those letters was felt to be necessary.
A lot of older parts of Python inherit the naming style common in C, which has always seemed to me like a compulsion to shorten everything, e.g. ls for list, mk for make. Newer parts of Python are healthier in this regard. But this also results in some weird inconsistencies, like mkdir vs makedirs.
Lots of uses of "the" in C++ code, massively skewed by giant projects that have a massive section of boilerplate copyright info in a comment at the top of every single file. I didn't realise until I looked at this that, although that's really common in C, C++, and Java, you barely ever see it in the languages I spend time in (Rust, Haskell, C#, Python, various flavours of Lisp, JavaScript).
In C# the first word is "summary" which is massively influenced by the way Visual Studio formats comments. I wonder though if this is an indication that a lot of people write comments.
It uses it as long as there is enough space. Once it fails to find a rectangle to fit a new word, it tries to reduce the size of the word.
In general, word clouds are bad for comparing sizes. For that reason I used a plain list in the sidebar on the left (or at the bottom if you are on mobile).
The slow adoption of modern C++ is very apparent. Some things are not even listed: forward, unique_ptr, shared_ptr, tuple, constexpr. nullptr is much lower than NULL, and move is quite low. These features are now 6+ years old! ;)
Smart pointers really need a native way to be specified, similar to how we use & to declare references and * for pointers.
Filling your code with shared_ptr<Something> and the likes is just not going to win the majority over, even if it's useful.
It saves you from having to explicitly name the type parameter, and you just need to pass in the arguments of the constructor. It's not as terse as & or *, but it's not that bad.
This might just be my Python upbringing, but... am I the only one to be troubled by Go's single-letter words? I've always found Go code very hard to read because it isn't self-descriptive at all.
Depends; if it's self-evident in the context what it means, like in a method declared as (p *player) Attack, then p makes perfect sense (to me), rather than typing 'player' everywhere in the function, just like typing i instead of index.
In short, I don't think there is one particular rule that applies; that said, my impression is that variable names in Go code are typically two or three letters, like err, buf, src, dst, ok, etc.
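A sketch of that receiver convention (the types are hypothetical): the declaration is never more than a screenful away, so the short name stays unambiguous.

    package game

    type player struct {
        hp int
    }

    func (p *player) Attack(target *player) {
        // p is obviously the player; spelling out "player" on
        // every line would add nothing.
        target.hp -= 10
    }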
Go is a language in the almost-forgotten algebraic tradition (loosely descending: Fortran → Algol → BCPL → C → Go), in which people prefer E=mc² to MULTIPLY REST-MASS BY SPEED-OF-LIGHT-IN-A-VACUUM BY SPEED-OF-LIGHT-IN-A-VACUUM GIVING ENERGY.
When you have a strong static type system like Go has, you don't need descriptive names that much, because even single letter variable names are evident: the declarations are right there, close to where the names are used. It takes a bit of getting used to if you've only programmed in dynamic or scripting languages.
> you don't need descriptive names that much, because even single letter variable names are evident
That's an annoying lie, and makes codebases in languages with genuinely strong static type systems (e.g. Haskell) much harder to read than necessary.
Hell, C#'s type system is stronger than Go's, yet there is no such prevalence of meaningless variable names (the first one comes in at #23; Go has 7 single-letter variable names in the top 23).
Still, I think GP's comment is interesting. I do agree that type name and variable/argument name can be redundant in statically typed languages, such as C++ or Java. (I've never used Go.)
For example,
    int getAge(Person p);

is clear enough, whereas

    int getAge(Person person);

is redundant.
Haskell does type inference, so even if types are static they are not explicit. That's why you still need explicit variable names.
I think that depends on the length of getAge(). If it's 3 lines, 'p' is probably fine. If it's 200 -- which perhaps it shouldn't be, but that's another issue -- then person is probably a better choice, because you may lose the original context as you scan the method.
Also, in Java, if you use IntelliJ then you'll probably get 'person' as an autocomplete, which makes it roughly as easy to type out as 'p' (and a better choice if your entire team has standardized on IntelliJ).
> I do agree that type name and variable/argument name can be redundant in statically typed languages, such as C++ or Java.
Except neither exposes that pattern (they essentially have only one such variable in their top list, and it's "i"). And I already mentioned C# which is very similar to Java.
> is redundant.
I'm not saying single-letter variables (or no variable at all, which is also possible in Haskell) are universally bad; I'm saying it is disturbing to find how absolutely ubiquitous they are in Go. Variable names are sometimes redundant, but that is not a universal constant. That redundancy is a function of the expressiveness of the type system, and that's hardly a claim to fame for Go.
> Haskell does type inference, so even if types are static they are not explicit.
Go has local type inference, and leveraging Haskell's global type inference is usually recommended against.
> That's why you still need explicit variable names.
My background is scientific and systems programming in C and C++. Currently I write a lot of Python for my job and a lot of Rust for my spare time hobby programming, which consists of things like writing a BLAS, reimplementing other C and C++ projects, etc. So I haven't only programmed in dynamic or scripting languages.
I strongly disagree with the assertion regarding single-letter variables. I would gently correct a junior programmer who tried to do that, and I would chew the hell out of a senior programmer who tried it. In any language.
I am quite surprised that "self" is so much more used than the next word in the list "if". I know many languages where "self" is not a keyword at all, but I cannot think of a single language where "if" is not a keyword.
EDIT: Oops, I missed that there is a language filter. Ignore my comment :)
Besides "return", C/C++ doesn't have a particular word that stands out. Probably because you just write write things. eg you don't put "function" in front of a function.
Yes, I also thought that the list for each language shows exactly what's wrong with that language. For some it's self/this, for Java it's import and return, etc. For C++ there's no clear winner, because it is minimalistic by nature (mostly due to its C heritage, of course).
Cool stats about Java. "import" is the most frequently used word. It looks like everything already exists in Java, so just import all the things, write some glue code, and you are done.
In Python you usually import a module and use that as a symbol, or `from module import Symbol1, Symbol2, Symbol3`.
Java doesn't have the former, and the latter requires an import per symbol (it also has the ability to import every symbol in a namespace, and so does Python, but that's usually discouraged)
I wholeheartedly disagree. Implicit code is easy to write but hard to read. I'd much rather have it be very explicit where a value comes from, both to avoid hard-to-see bugs and to make the code easier to understand from a fragment.
I prefer version 3 by far. Unfortunately, JavaScript decided to take the same path with ES6 classes, which force you to use this in the body. Fortunately, they do not force you to use this in the argument list.
I think that goes against the Python dogma of "Explicit is better than implicit." In #1 and #2 there is no question about where 'x' comes from, while in #3 it could be a class variable, a global variable, or something from just about anywhere.
The beauty of self in Python is that self is not at all magic: it merely indicates that the object instance you're using will be passed as the first argument of the method.
Also you can get a custom font with typographic ligatures (e.g. for self, lambda, and so on) to make it more visually appealing. For instance (self > 圖):
You can get the "Python dogma" with `import this`. "Explicit is better than implicit" is part of it, but so is "practicality beats purity" (and "There should be one-- and preferably only one --obvious way to do it, although that way may not be obvious at first unless you're Dutch").
...Doesn't work with `self` because that would be implicit again, though.
Anyway, I think it's a minor inconvenience that isn't important for one-liners and is extremely helpful when reading larger functions.
It actually uses fewer characters, since you've eliminated all the repeated uses of 'this.'. As a general rule I'm in favor of turning a one-liner into a two-liner if it improves readability, especially when it reduces actual typing.
Python's self in the arglist is a C struct pointer sneaking in from the '80s, which is when Python was designed. It could be excused, but there should be a deprecation PEP by now. Make it a keyword and let us type it only when we need it.
That might seem nice for your limited example, but it breaks down when you realize that in example #3 there would be no easy way to differentiate between scopes.
    y = 100
    x = 15

    class MyClass(object):
        x = 50

        def __init__(self, x):
            self.x = x

        def length():
            return x * y
What does that mean? Does MyClass(10).length() raise an AttributeError because MyClass doesn't have an attribute named y? Does it automatically recognize there's a y in the outer scope and use that, or does it call __getattr__ first (i.e. the method that gets called when a missing attribute is accessed)? Furthermore, how do I specify that I want to access the class attribute x, or that I want to access the nonlocal x?
In my opinion, there are far bigger problems with Python's scoping. Why is for/if/etc. not its own scope? You can define a temporary variable in there and mistakenly use it somewhere down the line. In a "for i in ..", i should go out of scope after the loop, but it doesn't. Why can't you create a new scope within a function? That would let you bundle related stuff together and be sure the scope is cleared afterwards, so as to not mistakenly reuse variables. The thing you're promoting as an advantage of explicit self is already broken by having a single scope for the whole function.
Blocks like this are an immensely useful way to keep scopes clean:
    int someMethod() {
        ...
        {
            // Some small stuff that doesn't warrant a new method, but that
            // you don't want to bleed into the rest of the function, e.g.:
            float x = readNextFloat();
            float y = readNextFloat();
            float z = readNextFloat();
            float length = std::sqrt(x*x + y*y + z*z);
            std::cout << length;
        }
        // do some other things without worrying about potentially initialized variables
        ...
    }
Don't call it "s", because then you break consistency basically with the whole Python ecosystem. That's not very nice for fellow programmers who already used to read and understand "self".
JavaScript 'this' binding behaves in quite unexpected ways if you're coming from something like Python or Ruby, which is a major source of trouble for people.
In fact, our coding style demands that we avoid using instance variables in favor of adding `attr_accessor`, to explicitly encourage the use of implicit self.
One example: self.attribute = attribute in an ActiveRecord class, when the left side is a field of a table and the right side is a variable. A different name for the variable fixes it without using self.
For example, in C++, where it's implicit, most code styles require adding a prefix like "m_" or something similar to indicate that the variable is a class member.
So then I expect the same would happen in Python and JavaScript. I don't think it would be better that way.
Point of order: this comment has been downvoted into greyness. My understanding is that we should downvote when a comment is low-value, not when we disagree with it, but I suspect the latter has happened here. Could anyone who has downvoted or is tempted to downvote this instead make a comment presenting their critique?
Don't complain about down votes, especially after only 15 minutes. There is inherent randomness in votes and it only takes one or two votes to grey out a comment. Most of the time these things work themselves out within an hour or two. In this case the comment was black within 12 minutes of your post.
I also think filtering out comments would improve it - especially because so many source files include a copyright statement at the top, and the same licenses (MIT, GPL, Apache, etc) are found repeated in many different files and it distorts the results somewhat.
In that particular case, I'd say it's quite interesting indeed!
Presumably "summary" appears quite a lot because C# developers use markup like `<summary>` in their comments, so automated systems can build documentation (I've never used a Microsoft programming language, but a quick search brought me to https://msdn.microsoft.com/en-us/library/z04awywx.aspx ).
In that sense, it's not really a comment anymore: it's one machine-readable language embedded inside another.
That's certainly interesting, to me at least. It tells me about the signal/noise ratio of the language, the prevalence of various forms of documentation (e.g. <summary> is conventional, whilst something like <precondition> is not), etc.
Such terms clearly have an effect on a system's documentation, even if they don't have an effect on the CPU instructions being executed. But I'm a programmer, not a CPU; text files containing source code are my main I/O interface, and they most certainly do contain such markup, and hence I find it interesting to see statistics about. In comparison, I don't step through very much assembly day to day, so I don't really care very much about the compiler output (the part which the comments don't affect). I prefer to reason at the level of the language I'm using, where not only do comments appear, they're very useful!
> Presumably "summary" appears quite a lot because C# developers use markup like `<summary>` in their comments, so automated systems can build documentation
Yes, and the IDE will auto-generate a doc comment with a <summary> because that's pretty much the most basic doc comment you can get.
> In that sense, it's not really a comment anymore: it's one machine-readable language embedded inside another.
My issue is not that it's a comment, it's that it is essentially worthless as your IDE's basic "add method" intention (or whatever) is going to add it automatically.
> That's certainly interesting, to me at least. It tells me about the signal/noise ratio of the language, the prevalence of various forms of documentation (e.g. <summary> is conventional, whilst something like <precondition> is not), etc.
<summary> isn't just conventional; it's the primary tag used by the C# documentation system and shown by IntelliSense. <precondition> is not.
> it is essentially worthless as your IDE's basic "add method" intention (or whatever) is going to add it automatically
Just because an IDE will write boilerplate automatically, that doesn't mean the boilerplate wasn't written, checked into version control, presented to developers, etc. Even if such boilerplate were added by an IDE, and hidden from developers (e.g. using code folding), it's still there in the language.
In this case, the language is C#, not e.g. some "C#-like" language which gets preprocessed/transpiled by an IDE into C# by scattering boilerplate around.
Whilst tooling can help us live with a language's deficiencies, it doesn't remove those deficiencies ;)
Well, there is a sense in which the language you write (which may not be the language you read) is defined by how you interact with the development environment to produce code.
Which is why I prefer a language where I just need to learn one language, and not a separate input language because the language-as-read is too unergonomic to write, so that a different language needs to be defined for productively writing code.
As much as I hate all the typing I do for error checking in Go, I just don't think a compiler can handle errors that well on its own yet.
The explicitness of Go's error handling forces you to handle every error specifically. There isn't a chance (for the most part) that you'll get an error from deep in the program that you can't easily handle.
Forcing you to handle them everywhere and anywhere makes error handling a part of your architecture.
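Concretely, the idiom pushes you to decide at every call site what a failure means; the common move is to annotate the error with context as it propagates (a sketch, all names hypothetical):

    package profile

    import (
        "errors"
        "fmt"
    )

    type Profile struct{ Name string }

    // fetchUser stands in for a call that can fail deep in the program.
    func fetchUser(id int) (*Profile, error) {
        return nil, errors.New("connection refused")
    }

    func loadProfile(id int) (*Profile, error) {
        p, err := fetchUser(id)
        if err != nil {
            // Decide here: annotate and propagate, retry, or fall back.
            return nil, fmt.Errorf("loading profile %d: %v", id, err)
        }
        return p, nil
    }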
Erlang went the other way around, designing a system that enables you to not handle all these errors.
I am more from the school of designing the system around that reality, instead of trying to patch it everywhere, praying we have enough fabric to catch it all.
But it would mean rethinking how we build stuff. That was not at all a goal of Go.
I do appreciate this, but word clouds really are a terrible visualization method for text data. With regard to the Python example, I cannot grasp at all whether the frequencies of self and None are similar or drastically different. The table beside it is more informative and harder to misread.
One can make very interesting conclusions based purely on this. Examples:
- Python developers do not follow Clean Code (a la Uncle Bob) as much as Ruby developers, because the if statement is more frequent than def and return.
- Ruby makes it possible to write in a much more functional style than Python. OR Ruby developers like to develop more in a functional style than Python developers.
- People don't really care about good variable names ("a" is a terrible variable name in scripting languages like JS and Python, yet it's still in the top 11)
- PHP developers might practice "return early" in functions (more return than function keywords) OR their functions just do too much :)
> - Ruby makes it possible to write in a much more functional style than Python. OR Ruby developers like to develop more in a functional style than Python developers.
I instinctively think you might be right (at least with your second statement), but what are you using as your metric here?
I am not sure that Ruby devs like to do much functional style; the parent mentioned Uncle Bob and his influence on the Ruby community with Clean Coders. I think they are more into OO.
> One can make very interesting conclusions based on purely this.
With a large grain of salt
1. Generally, the thing seems to mix words from all contexts. For instance, #6 in Ruby is "should"; click on the word and it's mostly comments. Same for Rust's #4, "the".
3. It's also unclear which codebases are parsed; there are lots of Django-isms in the conditional examples (if context is None, if request.method == 'POST', if form.is_valid).
See also: Go, where most every function call requires an `if`
> - Ruby makes it possible to write in a much more functional style than Python. OR Ruby developers like to develop more in a functional style than Python developers.
Don't confuse writing more functions/methods and writing in a more functional style, they're very different things.
Comparing word clouds of comments would actually be interesting. But then you hit the issue of "what is a comment"; e.g. many languages use comments for item documentation, but Python uses (doc)strings.
The Rust compiler seems somewhat overrepresented in this data set. I see "ccx", "fcx", and "CrateContext", which are only used in the Rust compiler itself.
It is probably a sign of a good language that the words used most are similar in frequency to their use in pseudocode. When words like "end" (Ruby), "self" (Python), "import" (Java), or "err"/"error" (Go and Node) are over-represented, it's likely a sign that the language is introducing accidental complexity. By this metric, Swift looks astonishingly sane.
Pretty cool! Some unexpected results, or at least not what I guessed. "summary" as the top for C#, "SELECT" all the way down at #43 for SQL, "err" as the top for Go (I'm sure that will spawn some pleasant discussion).
I think for SQL their sample includes just ".sql" files, which tend to contain schema definitions and data dumps, hence CREATE and INSERT. Most of it isn't handwritten, either.
This is pretty neat! I was surprised to see that for SQL, SELECT was so far down the list.
I also wonder what the criteria are for which languages to analyze. There are a few other languages I would like to see, but maybe they aren't well represented on GitHub...
This is really nifty; unfortunately, it includes comments, and so with thousands of files all including copyright notices, "the" is the 3rd most popular word in C++ files.
That's good to hear. I didn't look into it in much depth; I just thought it was strange that "the" was so high for C++, so I clicked on it to see example usage and got things like:
** use the contact form at http://qt.digia.co/contact-us.
furnished to do so, subject to the following conditions:
* This file is part of the LibreOffice project.
// with this library; see the file COPYING3. If not see
So I assumed licenses had not been excluded.
Having had a brief look at the source, I think that with the licence-marking approach it's still leaving in quite a few lines from each licence (see above for examples).
Contrary to popular opinion, neither 's' nor 't' is a word. At least not in English, anyway. :/ Or do they mean that these characters appeared as variable names?
Well, I will slay my developers before they put metasyntactic variables on master. I allow i to pass, though. Go to j and you have O(n^2), and I will slay them again.
"summary" is the most used word in C#! And only from comments! Amazing. EDIT: it looks like this is reading files that are common to all projects... which is why the sentence "// The following GUID is for the ID of the typelib if this project is exposed to COM" appears 459k times!