A faster file programming language detector (github.com/go-enry)
17 points by ducktective on Jan 17, 2021 | 7 comments



It is similar to the tool GitHub uses to detect the language of files (https://github.com/github/linguist) but about 2x faster.

Here is the CLI binary repo: https://github.com/go-enry/enry


Ah, cool. I was initially thinking it was a tool that would look at binary executables and try to figure out what language they were compiled from. Once I realized this was only looking at source, I couldn't figure out what anyone would use it for. Totally makes sense after this comment. (Which should probably be in the readme too).


I think for binaries, `binwalk` is the proper tool. One use case for this tool could be "automatic syntax highlighting" for webapps (like git repo interfaces) or pastebin sites.


Have you got accuracy scores? Speed ain't everything if misclassification is frequent enough to worry.


I didn't find any in the README.md. Maybe someone from the project could provide an accuracy benchmark.


This is a surprisingly difficult problem and yet not as common an issue as you might anticipate. For most use cases you can rely on the filename or extension.
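As a rough illustration of the filename-or-extension approach, here is a minimal Go sketch. The tables and the `guessLanguage` helper are hypothetical and tiny; they are not taken from scc, tokei, or enry, whose real mappings are far larger.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// languageByExtension is a small illustrative table (an assumption for this
// sketch, not any tool's real mapping).
var languageByExtension = map[string]string{
	".go": "Go",
	".py": "Python",
	".rs": "Rust",
	".c":  "C",
}

// languageByFilename covers files identified by their full name rather than
// an extension.
var languageByFilename = map[string]string{
	"Makefile":   "Makefile",
	"Dockerfile": "Dockerfile",
}

// guessLanguage returns the language for a path and whether a match was
// found. It checks the exact filename first, then the extension.
// Lowercasing the extension is a simplification made for this sketch.
func guessLanguage(path string) (string, bool) {
	base := filepath.Base(path)
	if lang, ok := languageByFilename[base]; ok {
		return lang, true
	}
	ext := strings.ToLower(filepath.Ext(base))
	lang, ok := languageByExtension[ext]
	return lang, ok
}

func main() {
	for _, p := range []string{"cmd/main.go", "Makefile", "notes.txt"} {
		lang, ok := guessLanguage(p)
		fmt.Printf("%s: %q (matched: %v)\n", p, lang, ok)
	}
}
```

A lookup like this is just two map reads per file, which is why it is fast, and it is also why it cannot separate two languages that share an extension.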

I wrote scc, and it eventually reached enough people that I needed to add some basic checks to tell Coq from Verilog, which share the `.v` extension. The check is primitive, mostly because I try to prioritise speed without losing accuracy. Another code counter, tokei, does not do this at all and only uses extensions. The number of people unhappy with that decision is surprisingly small.

But you could easily get away with just remapping as I doubt they occur in the same repo very often.
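For a sense of what a basic content check for `.v` files might look like, here is a hedged Go sketch. The keyword lists and the `classifyV` function are made up for illustration and are not scc's actual checks; the sketch just counts a few distinctive substrings for each language and picks the larger count.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical hint lists chosen for illustration only. Matching is
// case-sensitive, so Coq's "Module " does not collide with Verilog's "module ".
var coqHints = []string{"Theorem ", "Lemma ", "Proof.", "Qed.", "Require Import"}
var verilogHints = []string{"module ", "endmodule", "always @", "wire ", "assign "}

// countHints sums the occurrences of every hint substring in content.
func countHints(content string, hints []string) int {
	n := 0
	for _, h := range hints {
		n += strings.Count(content, h)
	}
	return n
}

// classifyV guesses whether the content of a .v file is Coq or Verilog.
// It returns "Unknown" when the counts tie, including when both are zero.
func classifyV(content string) string {
	coq := countHints(content, coqHints)
	verilog := countHints(content, verilogHints)
	switch {
	case coq > verilog:
		return "Coq"
	case verilog > coq:
		return "Verilog"
	default:
		return "Unknown"
	}
}

func main() {
	coqSrc := "Theorem plus_O_n : forall n : nat, 0 + n = n.\nProof.\n  intros n. reflexivity.\nQed.\n"
	verilogSrc := "module counter(input clk, output reg [3:0] q);\n  always @(posedge clk) q <= q + 1;\nendmodule\n"
	fmt.Println(classifyV(coqSrc))
	fmt.Println(classifyV(verilogSrc))
}
```

Substring counting is crude (it would also match inside comments or strings), but it needs only a single pass over content the tool has already read, which fits the speed-first trade-off described above.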

That said, I’m looking at improving this sort of thing because of the number of code bases that mix C and C++.

Oh, I should also mention there is no universal list of extension and filename to language mappings anywhere online. Each tool builds its own. I’d love for someone to take this on and create some sort of standard for everyone to follow and build against.


For enry, there is an `alias.go` file in the tree that contains the extension-to-language mappings. Yeah, I've also found that there is no standard for this. Anyway, I've reformatted that file as JSON: https://dpaste.com/APYDEJPKJ



