
How to Identify Software Licenses Using Python, Vector Space Search and Ngrams - boyter
http://www.boyter.org/2017/05/identify-software-licenses-python-vector-space-search-ngram-keywords/
======
abetusk
Interesting!

I wrote something vaguely similar called 'WhatIsThisLicense' [1] available
under AGPLv3 [2].

Mine was less ambitious in that it used a straight approximate string
matching approach based on Levenshtein distance with a "thresholded" Ukkonen
algorithm, and was only meant to be used on the license text itself rather
than on a whole source file. It uses a C library ported over to Emscripten to
run in the browser [3].

[1]
[http://mechaelephant.com/whatisthislicense/](http://mechaelephant.com/whatisthislicense/)

[2]
[https://github.com/abetusk/www.mechaelephant.com/tree/releas...](https://github.com/abetusk/www.mechaelephant.com/tree/release/www/whatisthislicense)

[3]
[https://github.com/abeconnelly/asmukk](https://github.com/abeconnelly/asmukk)
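
To make the "thresholded" idea concrete, here is a minimal Python sketch of
edit distance with a cutoff. It is not the asmukk code (that is a C library);
the function name bounded_levenshtein and the max_dist parameter are made up
for illustration. The idea is that if two strings are within max_dist edits
of each other, the optimal alignment never strays more than max_dist cells
from the diagonal, so only that band of the table needs to be filled in.

```python
from typing import Optional


def bounded_levenshtein(a: str, b: str, max_dist: int) -> Optional[int]:
    """Return the Levenshtein distance between a and b, or None if it
    exceeds max_dist. Only cells within max_dist of the diagonal are filled."""
    # The length difference is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_dist:
        return None

    inf = max_dist + 1  # any value above the threshold is treated the same
    # Row 0: turning "" into b[:j] costs j insertions.
    prev = [j if j <= max_dist else inf for j in range(len(b) + 1)]

    for i in range(1, len(a) + 1):
        lo = max(1, i - max_dist)
        hi = min(len(b), i + max_dist)
        cur = [inf] * (len(b) + 1)
        if i <= max_dist:
            cur[0] = i  # turning a[:i] into "" costs i deletions
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # match or substitution
                inf,                 # cap so values never grow past the cutoff
            )
        # If every cell in this row is over the threshold, give up early.
        if min(cur) >= inf:
            return None
        prev = cur

    result = prev[len(b)]
    return result if result <= max_dist else None
```

For license matching you would presumably pick max_dist as some fraction of
the reference license length and keep whichever license comes back with the
smallest distance. A tuned implementation would also avoid allocating a full
row per iteration, which this sketch does for readability.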

------
pombreda
The vector space / IR ranking approach is useful but is not enough to get
accurate detection IMHO. Eventually you need something which is not a
probabilistic match, and the devil is in the details.

For instance, a license notice may be as short as "mit" or "gpl" or "gpl2", or
as long as a full AGPL license text, or a long text with multiple license
texts and notices.

In these cases the rankings are likely to be completely off, and your GCC scan
certainly detects a few weird things that are not right.

So you can get a decent indication but it will be inaccurate often enough at
scale.
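
A toy sketch of why short notices are a problem for this kind of ranking,
using plain bag-of-words cosine similarity. This is not the article's code,
the two "license texts" are abbreviated one-line excerpts standing in for the
real thing, and the names TOY_LICENSES, cosine and rank are made up for the
example.

```python
import math
import re
from collections import Counter

# Abbreviated stand-ins for real license texts (assumption: one line each).
TOY_LICENSES = {
    "MIT": "permission is hereby granted free of charge to any person "
           "obtaining a copy of this software",
    "GPL-2.0": "this program is free software you can redistribute it and or "
               "modify it under the terms of the gnu general public license",
}


def vectorize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query: str, corpus: dict) -> list:
    """Return (license name, score) pairs sorted from best to worst match."""
    q = vectorize(query)
    scores = [(name, cosine(q, vectorize(text))) for name, text in corpus.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With this setup, rank("Permission is hereby granted, free of charge",
TOY_LICENSES) puts MIT first, but rank("gpl2", TOY_LICENSES) scores 0.0
against both entries, because the token "gpl2" never appears in the license
body. A notice like that needs an exact rule or keyword lookup rather than a
similarity score.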

------
rahkiin
Nice writeup!

Have you tried Sublime Text 3 to open the 20MB file? I have opened 200MB+
text files with it without issue.

~~~
boyter
That's exactly what I tried opening it with. I suspect the issue may have been
down to the fact that it was all on a single line.

That said, I didn't want to distribute a 20MB file with the application anyway,
so the reduction to 3MB was a good choice regardless.

~~~
allenz
It's probably because of the syntax highlighting, which is fairly slow due to
regex. Sublime should have no problems reading it in plain text mode. The
more command would also work!

