

ScanCode: Open-source tool to scan code for licenses and copyrights - pombreda
https://github.com/nexB/scancode-toolkit/

======
hlieberman
How is this different from licensecheck that's part of devscripts?

~~~
pombreda
At a high level, scancode detects many more licenses and copyrights than
licensecheck does, reporting more details about the matches. It is likely
slower.

In more detail: ScanCode is a Python app using a data-driven approach (as
opposed to carefully crafted regexes):

\- for license scans, detection is based on a large set of license full
texts (~900) and license notices/rules (~1800) rather than on regexes. It
reports exactly where in a file a license text is found. To improve
detection, just add more license texts.

\- for copyright scans, the approach is natural language parsing (using NLTK)
with POS tagging and a grammar; it has a few thousand tests.

\- licenses and copyrights are detected in texts and binaries
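
To make the "data-driven" idea concrete, here is a minimal sketch of matching a file against a set of reference texts and reporting where the match sits. This is not ScanCode's actual algorithm or API: the `RULES` dictionary, the `detect` function, and the 0.8 coverage threshold are all made up for illustration, and the matching uses Python's standard `difflib` as a stand-in.

```python
import re
from difflib import SequenceMatcher

# Hypothetical miniature rule set: license key -> reference text.
# A real data set would hold hundreds of full texts and notices.
RULES = {
    "mit": "Permission is hereby granted, free of charge, to any person obtaining a copy",
    "apache-2.0": "Licensed under the Apache License, Version 2.0",
}


def tokenize(text):
    """Return (word, line_number) pairs, lowercased, punctuation dropped."""
    return [
        (match.group().lower(), lineno)
        for lineno, line in enumerate(text.splitlines(), start=1)
        for match in re.finditer(r"\w+", line)
    ]


def detect(text, rules=RULES, min_coverage=0.8):
    """Return (key, start_line, end_line, coverage) for each rule found in text."""
    tokens = tokenize(text)
    words = [word for word, _ in tokens]
    results = []
    for key, reference in rules.items():
        ref_words = [word for word, _ in tokenize(reference)]
        matcher = SequenceMatcher(None, words, ref_words, autojunk=False)
        # The final matching block is a zero-size sentinel, so filter it out.
        blocks = [b for b in matcher.get_matching_blocks() if b.size]
        coverage = sum(b.size for b in blocks) / len(ref_words)
        if coverage >= min_coverage:
            start = tokens[blocks[0].a][1]
            end = tokens[blocks[-1].a + blocks[-1].size - 1][1]
            results.append((key, start, end, round(coverage, 2)))
    return results
```

Adding a license in this sketch means adding an entry to `RULES`; no matching code changes.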

Debian's licensecheck (available here:
[https://anonscm.debian.org/cgit/collab-maint/devscripts.git/...](https://anonscm.debian.org/cgit/collab-maint/devscripts.git/tree/scripts/licensecheck.pl#n489)
for reference) is a Perl script using hand-crafted regex patterns to find
typical copyright statements and about 50 common licenses. There are about 50
license detection tests.
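
For contrast, a regex-driven checker looks roughly like the sketch below. It is written in Python rather than Perl, and the patterns are hypothetical: they are not copied from licensecheck, and `check` is an illustrative name.

```python
import re

# Hypothetical hand-written patterns, one per license to recognize.
LICENSE_PATTERNS = [
    ("Apache-2.0", re.compile(r"Apache License,?\s+Version 2\.0", re.I)),
    ("MIT", re.compile(r"Permission is hereby granted, free of charge", re.I)),
    ("GPL", re.compile(r"GNU General Public License", re.I)),
]

# Matches lines such as "Copyright (c) 2015 Example Corp".
COPYRIGHT_RE = re.compile(
    r"Copyright\s+(?:\(c\)\s*)?(\d{4}(?:\s*-\s*\d{4})?)\s+(.+)", re.I
)


def check(text):
    """Return (license names, [(years, holder), ...]) found in text."""
    licenses = [name for name, pattern in LICENSE_PATTERNS if pattern.search(text)]
    copyrights = [
        (match.group(1), match.group(2).strip())
        for match in COPYRIGHT_RE.finditer(text)
    ]
    return licenses, copyrights
```

Each new license or wording variant needs another hand-tuned pattern here, which is the maintenance trade-off against the data-driven approach.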

~~~
pombreda
FWIW, I ran a quick test using the HEAD of devscripts, and there are many
things that ScanCode detects and licensecheck does not.

