Shame, because some of the languages not listed tend to be the most interesting ones.
There also needs to be better filtering of the raw data ("many others" near Python is kind of a head scratcher for example).
Filters for "just languages" or language + path length of X etc. might be cool as well.
A simple NON_AUTHORS = ["others", "et al."] list, checked inside normalize_data, would be a good start imo.
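Something along these lines is what I have in mind. It's only a rough sketch: I don't know what normalize_data actually looks like, so filter_authors and is_non_author are made-up helper names, and I'm assuming the author field arrives as a list of strings.

```python
# Placeholder strings that show up in the author field but are not real people.
# The two entries come from the suggestion above; extend as needed.
NON_AUTHORS = ["others", "et al."]


def is_non_author(name: str) -> bool:
    """Return True if the entry is a placeholder such as "many others"."""
    cleaned = name.strip().lower()
    # Match the marker on its own ("others") or as the last word ("many others"),
    # without hitting real names that merely end in the same letters ("Brothers").
    return any(
        cleaned == marker or cleaned.endswith(" " + marker)
        for marker in NON_AUTHORS
    )


def filter_authors(raw_authors):
    """Hypothetical helper that normalize_data could call on the author list."""
    return [a.strip() for a in raw_authors if a.strip() and not is_non_author(a)]
```

With that, filter_authors(["Guido van Rossum", "many others"]) would keep only the first entry, which should get rid of the odd node next to Python.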
Edit: Plankalkül also has encoding issues :)
* Dialect of
* Implementation of