
GitHub's language detection is broken - Allan_Smithee
https://github.com/github/linguist/pull/748#issuecomment-37633185
======
DannoHung
Why don't they just let project maintainers say, "This project contains x, y,
and z" or something? That'd at least let them get a leg up on doing the
categorization right and I don't think many people would mind _having_ that
capability.

~~~
mindcrime
+1000. Github routinely detects the wrong language for my projects, and there
is no way to manually override it. My take is this: If you want to auto-detect
the language, fine... but let the owner of the repo override your detection
when it's wrong.

It's probably also a bug to even have the notion of "a language" for a repo
given the burgeoning polyglot programming trend. So many repos these days
contain multiple languages, especially when you consider javascript, that I
question if it even makes sense to say 'This project is in language X' at all.

Like you say, the best option really would be to let the repo owners /
maintainers just specify this stuff. They are, after all, the ones who know.

------
pedalpete
I think the idea of automated language detection is pretty cool, but why
doesn't github just give you the option of correcting it, or labelling it with
the language you prefer?

For example, I've got JavaScript modules in repositories. For each module, I
make a demo version to show what the module does, and that demo includes a
bunch of CSS. Apparently, there is more CSS than there is JavaScript, so
GitHub labels the module as CSS, but the important part isn't CSS, the
important part is the javascript. In order to resolve this, I've had to move
the css into a different repository, and ignore it in the javascript
repository. Seems like a long way around, when all I want to do is correct
them and say that the module is actually a javascript module.
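
A minimal sketch of the situation described here, assuming the repository label simply goes to whichever language has the most bytes (a guess at the behaviour, not GitHub's confirmed method). The function name, the `override` argument, and the `ignore_prefixes` argument are all hypothetical:

```python
from collections import Counter


def primary_language(files, override=None, ignore_prefixes=()):
    """Pick a repository's label from (path, language, size_in_bytes) tuples.

    Assumption: the label is the language with the most bytes. An explicit
    override from the maintainer, if given, always wins.
    """
    if override:
        return override
    totals = Counter()
    for path, language, size in files:
        # Skip paths the maintainer marked as not representative (e.g. demos).
        if any(path.startswith(prefix) for prefix in ignore_prefixes):
            continue
        totals[language] += size
    return totals.most_common(1)[0][0] if totals else None


files = [
    ("src/module.js", "JavaScript", 4000),
    ("demo/style.css", "CSS", 9000),
]
```

With these made-up sizes, the demo CSS outweighs the module itself unless the maintainer either ignores `demo/` or sets the label directly, which is the kind of override being asked for.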

~~~
michaelmior
Language detection as discussed in the link is per-file. I don't think
overriding individual files makes sense since it's likely to be more trouble
than it's worth. But I can understand the desire to change the detected
language of the project.

~~~
Allan_Smithee
How 'bout a project-specific property list that looks something like this:

.rb=RealBasic .m=Mercury .pl=Prolog .js=SomeCrapOrOther …
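
A rough sketch of how such a property list could be read, assuming whitespace-separated `.ext=Language` pairs exactly as written above. The function names are hypothetical and nothing here reflects an existing GitHub or Linguist feature:

```python
import os


def parse_overrides(text):
    """Parse whitespace-separated '.ext=Language' pairs into a dict."""
    overrides = {}
    for token in text.split():
        ext, sep, language = token.partition("=")
        # Ignore anything that is not a well-formed '.ext=Language' pair.
        if sep and ext.startswith(".") and language:
            overrides[ext.lower()] = language
    return overrides


def language_for(filename, overrides, default=None):
    """Return the project-specific language for a file, else the default."""
    ext = os.path.splitext(filename)[1].lower()
    return overrides.get(ext, default)


overrides = parse_overrides(".rb=RealBasic .m=Mercury .pl=Prolog")
```

A file such as `solver.m` would then resolve to Mercury for this project only, and any extension not listed falls back to whatever the automatic detection says.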

~~~
michaelmior
Seems like more effort than it's worth still to deal with project-specific
settings. AFAIK the only two things this practically affects are syntax
highlighting and repository stats. That approach would be a good tradeoff
though if things are important enough.

------
013
This 'lewellyn' person seems to be complaining about the lack of support for
the language Limbo, a language for the Inferno OS. Both seem quite outdated
and out of use. He also complains about how Github is focusing on 'cool' kid
languages, which I am guessing refers to modern, popular languages (if this is
the definition of cool, then yes, they are). If I were Github, I would
do the same. It's called priorities. I kind of get the vibe that lewellyn is
some kind of 'hipster': his obscure language is better than the 'cool' kids'
simply because he's using it. I also wouldn't phrase it as "GitHub's language
detection is broken"; it's merely missing a feature/language.

~~~
choult
I suspect his - rather labored - point is that there are multiple use cases to
show that the design of Linguist's configuration is flawed as a rule and not
an exception, and the lack of attention paid to this particular issue is
perhaps indicative of a more general Github attitude towards the less trendy
languages and technologies out there.

~~~
scott_s
Which I think is an uncharitable way of saying "Github prioritizes working on
things that will impact the most people."

------
jperkin
Their language detection is indeed terrible. I have a repository
([https://github.com/jperkin/pilights](https://github.com/jperkin/pilights))
which is entirely composed of shell scripts and a single markdown README.
GitHub's analysis?

    Perl 83.5%	  Shell 16.5%

There is not a single .pl or .pm file, nor a single mention of 'perl' anywhere
in the repository, and all scripts begin with #!/bin/sh.

A number of my other repositories have similar problems, but this one is by
far the worst.

~~~
LeonidasXIV
Sorry, I see a huge blue bar saying 100% shell.

~~~
theOnliest
Earlier this morning when I looked, it looked like what OP described... it
looks like something has changed in the three hours or so since.

~~~
jperkin
Huh, yes, they appear to have coincidentally fixed it since I wrote that
comment. Maybe I need to start reporting all GitHub bugs as Hacker News
comments...

------
bru
Incriminated comment:
[https://github.com/github/linguist/pull/748#issuecomment-374...](https://github.com/github/linguist/pull/748#issuecomment-37418098)

> if you'd like Mercury language detection on GitHub then with the current
> implementation of Linguist you need to pick a different (unique as
> Objective-C already defines this) primary_extension and add .m to the
> extensions array which will force Linguist into using the other detection
> methods mentioned above.

~~~
moron4hire
what, then, is the point of the primary_extension field?

EDIT: or as I like to yell at Github for Windows when it can't revert out of a
merge conflict "WHAT IS EVEN THE POINT OF YOU?!"

~~~
skywhopper
It's just a design error in the original implementation where linguist assumes
that the "primary_extension" for any particular language will be unique among
all primary_extensions. Obviously that was a mistake, but that's where we are.
The comment that set people off was perhaps poorly worded, but it was an
honest suggestion to work around the design bug.
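
To make the assumption concrete, here is a toy registry that enforces a unique primary extension. It is an illustration only and not Linguist's actual code; the class, its methods, and the `.mercury` placeholder extension are all made up for this sketch:

```python
class ExtensionIndex:
    """Toy language registry; NOT Linguist's real implementation."""

    def __init__(self):
        self.primary = {}       # primary extension -> exactly one language
        self.by_extension = {}  # any extension -> list of candidate languages

    def register(self, name, primary_extension, extensions=()):
        # The design assumption under discussion: primaries must be unique.
        if primary_extension in self.primary:
            owner = self.primary[primary_extension]
            raise ValueError(f"{primary_extension} is already primary for {owner}")
        self.primary[primary_extension] = name
        for ext in (primary_extension, *extensions):
            self.by_extension.setdefault(ext, []).append(name)

    def candidates(self, extension):
        return list(self.by_extension.get(extension, []))


index = ExtensionIndex()
index.register("Objective-C", ".m")
try:
    index.register("Mercury", ".m")  # collides with Objective-C
    collided = False
except ValueError:
    collided = True
# The suggested workaround: a placeholder primary, real extension in the list.
index.register("Mercury", ".mercury", extensions=[".m"])
```

After the workaround, `.m` has two candidate languages and some other method has to pick between them, which matches the quoted description of forcing the other detection methods to run.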

~~~
moron4hire
A better suggestion: fix the defect. Just delete primary_extension. At best,
it does nothing that the extensions array can't do: at first glance, every
check of primary_extension appears to also include a check of
extensions.

At worst, it is confusing to implementers and requires chicanery to work
around... which is exactly the case we're in. We're in the worst case scenario
for this bit of code, and there is no upside to its best-case scenario. Just
delete the code.

~~~
DangerousPie
I bet it's not as easy as "just delete the code". They will probably have to
do quite a bit of refactoring to remove this, followed by a probably even
larger amount of testing.

Long-term this is probably the right solution, but why go through all this
trouble right now if there is a simple workaround? It seems like the only
problem right now is a few people's pride.

~~~
Allan_Smithee
How much you wanna bet?
([https://github.com/github/linguist/issues/985](https://github.com/github/linguist/issues/985))

~~~
DangerousPie
If I'm reading this right all this does is define the first extension as the
primary extension, which still has to be unique. So they would still have to
use a different extension as the first one in the list, it just wouldn't be
called "primary" anymore. How would this help?

~~~
nox_
There is no unicity check anymore; I just kept the first one special because
some private code at GitHub seems to rely on the primary_extension property
for the Gist editor. As I can't hack this (it is private), I can't remove it
entirely.

------
eCa
I have a few Perl projects on Github that use Bootstrap. Main language
(according to Github): Javascript.

I expect that Javascript's github popularity ranking is (a little bit)
inflated due to such issues.

~~~
mindcrime
I expect it's a _lot_ inflated. I also have repos that are primarily Groovy,
but show up as "Javascript" due to the presence of JQuery, Bootstrap, etc.

~~~
Allan_Smithee
Unfortunately, until this core issue is fixed, users can't really submit
further pull requests to fix the other issues which would correct the
"inflation" we all know and hate.

------
cl8ton
I’ve been in the programming trenches since the early ’90s, fluent in 5
languages at the production level, and have to say I have never heard of the
language 'Limbo'. I don’t fault GitHub one bit.

I suppose I could Google it and act like I know… naw

~~~
rkangel
I'm pretty sure there's no one with an encyclopaedic knowledge of programming
languages. The industry is enormous, and just because you haven't heard of it
doesn't mean that it's irrelevant. "Niche" is not the same as "irrelevant".

And even if it WAS irrelevant and only important to a very small number of
people, that doesn't mean it can be ignored.

~~~
nahname
> And even if it WAS irrelevant and only important to a very small number of
> people, that doesn't mean it can be ignored.

I don't follow. That sounds like the exact criteria for something to be
ignored.

~~~
moron4hire
This really explains SV's homeless problem.

------
Aqwis
I don't really care about Limbo, but GitHub seems to think my .m files are all
M(UMPS) files, and not Matlab files, the most obvious choice. Highly annoying.

------
RyanZAG
I don't get it - why is a bunch of people trolling the github project with
fairly irrelevant arguments interesting? Could someone who upvoted this
explain the logic?

~~~
girvo
How is having a differing opinion "trolling"? Seriously, this word has lost
all meaning at this point.

~~~
akerl_
Ignoring the suggested workarounds (setting a unique primary extension and
then having the correct extension in the array, for instance) and continuing
to rampage in the comments in an attempt to stir up the masses seems like the
canonical example of trolling an online community.

~~~
girvo
Except he actually has a point. GitHub's default behaviour is broken.

In my years of experience online, trolling was specifically riling someone up
by saying things the troll doesn't really believe.

Trolling isn't disagreeing that a workaround is sufficient to ignore an actual
issue. But that's just my opinion.

~~~
Allan_Smithee
"Troll" is like "terrorist" these days. It has absolutely no semantic content
beyond "person I disagree with about something".

------
jbranchaud
While it is unfortunate that a pull request on this project has been around
for 5 months without much progress, I think the commenter is being a bit
dramatic. He is acting as if GitHub is blocking all commits with Limbo code.
The language can still be under version control, it just might not have syntax
highlighting and its own color in the repo stats bar.

GitHub isn't discriminating against certain programmers. Stay calm and keep
coding!

~~~
bjz_
> it just might not have syntax highlighting and its own color in the repo
> stats bar.

It is discriminating, and harmful to all programmers. We need to be able to
easily search for these lesser known languages – they are important cultural
works. The commenter points out: "Limbo ... seems to have heavily inspired Go
(which is currently extremely fashionable)". We are worse off for not having
our history readily accessible.

------
mehwoot
All they are asking is to arbitrarily specify some other extension as the
"primary" extension and have ".m" as another extension. Users will still see
the same end result.

~~~
Allan_Smithee
Unless they use gist.

------
kalleboo
I miss Mac OS Classic Filetype/Creator codes... Filename extensions are such
an ugly hack.

~~~
hyperpape
Great example of worse is better in action.

------
deutronium
Could they use Bayesian classifiers trained on a corpus of different
languages, primarily concentrating on the symbols used in each language?
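
A minimal sketch of that idea: a multinomial naive Bayes classifier with add-one smoothing that looks only at punctuation symbols. The class name and the tiny training snippets are invented for illustration, and a real corpus would need to be far larger:

```python
import math
import re
from collections import Counter

SYMBOLS = re.compile(r"[^\w\s]")  # any character that is not a word char or space


def symbol_counts(source):
    return Counter(SYMBOLS.findall(source))


class SymbolNaiveBayes:
    def __init__(self):
        self.counts = {}       # language -> Counter of symbols seen in training
        self.docs = Counter()  # language -> number of training samples

    def train(self, language, source):
        self.counts.setdefault(language, Counter()).update(symbol_counts(source))
        self.docs[language] += 1

    def log_scores(self, source):
        vocab = set().union(*self.counts.values())
        total_docs = sum(self.docs.values())
        observed = symbol_counts(source)
        scores = {}
        for language, counts in self.counts.items():
            total = sum(counts.values())
            score = math.log(self.docs[language] / total_docs)  # log prior
            for symbol, n in observed.items():
                # Add-one smoothing, with one extra bucket for unseen symbols.
                score += n * math.log((counts[symbol] + 1) / (total + len(vocab) + 1))
            scores[language] = score
        return scores

    def classify(self, source):
        scores = self.log_scores(source)
        return max(scores, key=scores.get)


clf = SymbolNaiveBayes()
clf.train("Objective-C", "[obj message:arg]; @interface Foo @end [[NSString alloc] init];")
clf.train("Mercury", ':- module foo. :- pred main(io::di, io::uo) is det. '
                     'main(!IO) :- io.write_string("hi", !IO).')
```

Even with this toy training set, the symbol mix of `:-` and `::` versus square brackets and `@` is enough to separate the two `.m` languages from the thread.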

------
dclowd9901
I think if I was writing a language detector, it would have these features:

- learning heuristics based on user suggestion.

- extension filtering to differentiate similar languages.

- the algo would use prominence and placement of white space and non-word
characters to create the DNA of a language. If a file scores below a
threshold against the DNA, it doesn't presume; it asks the user. If a file
scores high against this DNA, it still allows user override. Whenever a user
submits their indicator, the file's source would be used to train the
heuristic.
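
The threshold-and-override part of that list could look roughly like this. It is a sketch under stated assumptions: scores are confidences between 0 and 1 from some unspecified scorer, and the function name, the 0.8 threshold, and the `training_log` list are all hypothetical:

```python
def decide(scores, threshold=0.8, user_choice=None, source=None, training_log=None):
    """Return (language, how), where how is 'user', 'detected' or 'ask'.

    scores: dict mapping language -> confidence in [0, 1] (assumed scorer).
    """
    if user_choice is not None:
        # A user's answer always wins and is kept as a labelled training sample.
        if training_log is not None and source is not None:
            training_log.append((user_choice, source))
        return user_choice, "user"
    if not scores:
        return None, "ask"
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, "ask"  # too uncertain: don't presume, ask the user
    return best, "detected"


log = []
```

The design choice is that a low score produces a question rather than a guess, and every answer feeds back into the data used to train the scorer.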

~~~
Allan_Smithee
This is because you likely think before you code.

------
awalton
> My esoteric programming language isn't properly supported by the popular
> kids' web tool that I'm likely not even paying to use in the first place.
> I'm OUTRAGED!

Yep, seems about right.

~~~
hk__2
Also, this is untrue. Omgrofl is supported on GitHub, even if nobody uses it.

------
joeblau
What's interesting about this PR is that this case was actually one of the
reasons that I created [http://www.gitignore.io](http://www.gitignore.io).
GitHub's original repo for .gitignore templates had nearly 1000 open PRs until
around Oct 2013 so I built my own repo that would actually accept PRs. Since
then, a few employees have worked on accepting PRs, but I had a similar
feeling of frustration. Unfortunately, the OP can't just fork this repo
because its features are integral to how GitHub works, whereas I was able to
hack around the system and create a separate product.

------
skywhopper
The rant linked to appears to misunderstand the problem and the workaround.
@arfon admits there's what amounts to a design bug in Linguist, and so to
identify ".m" files, you have to identify a different extension as the
"primary" and put the real extension into the "alternate" list. That's a hacky
workaround, but it would make the pull request work.

The alternative is to fix the design issue. But that's going to be a lot
harder and require more than a few days.

~~~
nbouscal
@arfon doesn't admit that there's a bug, rather he says that "requiring a
unique primary_extension isn't really a 'bug', rather it's a consequence of
how language detection works in Linguist."

The work to fix the design issue was already done by @nox, who submitted a
pull request which is still open:
[https://github.com/github/linguist/pull/985](https://github.com/github/linguist/pull/985)

~~~
hyperpape
I guess the charitable reading is "this isn't a bug, it's more of a bad design
choice, and we can't just fix it overnight".

But I honestly can't tell if that's what he meant, or if it was more of a "not
my problem" type of response.

~~~
Allan_Smithee
Except that it was totally fixed overnight. No, wait. Not overnight. Over two
hours.

Of course that PR isn't being accepted either.

------
eXpl0it3r
I can agree that the detection is broken. C++ gets often recognized as C. PHP
with some CSS file gets recognized as mostly CSS, etc.

Personally I'd like to have a fixed language that I can set and that the
search will use. Next to that, it would be fine for me to statically show what
the repository contains, but please use better language detection; just
going by extensions is quite naive.

~~~
nox_
> C++ gets often recognized as C.

The disambiguation test for C++ headers is ridiculous:

    matches << Language["C++"] if data.include?("#include <cstdint>")
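
For comparison, a slightly broader heuristic might check several C++-only markers instead of one include. This is a sketch in Python rather than Linguist's Ruby, the marker list is an arbitrary assumption, and it can still be fooled (for example by C++ keywords appearing inside comments in a C header):

```python
import re

# Hypothetical marker list; any one match is treated as evidence of C++.
CPP_MARKERS = [
    # C++ standard library headers have no '.h' suffix.
    re.compile(
        r"^\s*#\s*include\s*<(cstdint|cstdio|cstdlib|cstring|string|vector|map|iostream|memory)>",
        re.M,
    ),
    re.compile(r"^\s*(class|namespace)\s+\w+", re.M),
    re.compile(r"^\s*template\s*<", re.M),
    re.compile(r"\bstd::\w+"),
]


def looks_like_cpp_header(data):
    """Return True if any C++-only marker appears in the header text."""
    return any(pattern.search(data) for pattern in CPP_MARKERS)
```

A header that only includes `<stdint.h>` and declares plain functions matches none of these and would stay classified as C.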

~~~
Allan_Smithee
Well, I expect that's why so much C++ is misrecognized. Not enough people
write valid C++, in Github's narrow world view. :)

------
mcovey
I wish I could pick the language so I could upload shell scripts without
extensions, but it doesn't even read the shebang line.
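
Reading the shebang is not much code. A minimal sketch, assuming a small hand-written interpreter table (the table and function name are hypothetical, and only the plain `#!/path/to/prog` and `#!/usr/bin/env prog` forms are handled):

```python
import os

# Hypothetical, deliberately tiny interpreter -> language table.
INTERPRETERS = {
    "sh": "Shell",
    "bash": "Shell",
    "zsh": "Shell",
    "perl": "Perl",
    "python": "Python",
    "python3": "Python",
    "ruby": "Ruby",
}


def language_from_shebang(source):
    """Guess a language from the first line of a script, or return None."""
    first_line = source.splitlines()[0] if source else ""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    program = os.path.basename(parts[0])
    if program == "env":
        # Skip env's own flags and VAR=value assignments to find the program.
        args = [p for p in parts[1:] if not p.startswith("-") and "=" not in p]
        if not args:
            return None
        program = os.path.basename(args[0])
    return INTERPRETERS.get(program)
```

An extensionless script starting with `#!/bin/sh` would then be labelled Shell without any help from the filename.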

------
johnduhart
Sorry, but was there actually something useful in that comment? I couldn't
tell over the 6 paragraphs of childish moaning.

------
moron4hire
Okay, but the automatically updating comments view is pretty cool. I didn't
know Github did that. That is pretty awesome.

~~~
Allan_Smithee
And that would be part of the problem with Github. Emphasis on "pretty cool"
visual flair while letting fundamental architecture fly out that is flatly,
and very obviously, just plain broken.

~~~
akerl_
Considering that the comment you're replying to said "that feature is pretty
cool", and didn't even need to address the actual linked rant, it seems that
not everyone agrees with your "this is just plain broken" viewpoint.

I use Github for the visual flair and cool features. If I wanted to run my own
fundamental architecture, I'd be doing that.

~~~
Allan_Smithee
"I use Github for the visual flair and cool features."

The software crisis spelled out in a single sentence.

