ScanCode: Open-source tool to scan code for licenses and copyrights

hlieberman · on July 25, 2015

How is this different from licensecheck that's part of devscripts?

pombreda · on July 26, 2015

At a high level, ScanCode detects many more licenses and copyrights than licensecheck does, and it reports more detail about each match. It is likely slower.

In more detail: ScanCode is a Python app using a data-driven approach (as opposed to carefully crafted regexes):

- for the license scan, detection is based on a large number of license full texts (~900) and license notices/rules (~1800) rather than on regexes. It detects exactly where in a file a license text is found. To improve detection, just add more license texts.

- for copyright scan, the approach is natural language parsing (using NLTK) with POS tagging and a grammar; it has a few thousand tests.

- licenses and copyrights are detected in both text and binary files
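
To make the "data-driven" idea from the license bullet concrete, here is a minimal sketch in Python. It is not ScanCode's actual algorithm or API: the names `REFERENCE_TEXTS`, `tokenize_with_lines`, and `detect` are hypothetical, the two reference texts are short excerpts standing in for a real collection, and the matching uses the standard library's `difflib.SequenceMatcher` as a stand-in technique. The point is only that the detector is driven by reference texts and reports where in the file the match sits.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference data: short excerpts standing in for a large
# collection of license texts and notices. Adding an entry here is the
# only change needed to detect one more license.
REFERENCE_TEXTS = {
    "mit-notice": (
        "Permission is hereby granted, free of charge, to any person "
        "obtaining a copy of this software"
    ),
    "apache-2.0-notice": (
        'Licensed under the Apache License, Version 2.0 (the "License"); '
        "you may not use this file except in compliance with the License"
    ),
}


def tokenize_with_lines(text):
    """Return (tokens, lines): lowercase word tokens and the 1-based
    line number each token came from."""
    tokens, lines = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            tokens.append(word)
            lines.append(lineno)
    return tokens, lines


def detect(text, min_coverage=0.8):
    """Return (key, start_line, end_line, coverage) for each reference
    text whose tokens are mostly found, in order, in the scanned text."""
    file_tokens, file_lines = tokenize_with_lines(text)
    results = []
    for key, reference in REFERENCE_TEXTS.items():
        ref_tokens, _ = tokenize_with_lines(reference)
        matcher = SequenceMatcher(None, ref_tokens, file_tokens, autojunk=False)
        # Drop the zero-length sentinel block that difflib always appends.
        blocks = [b for b in matcher.get_matching_blocks() if b.size]
        coverage = sum(b.size for b in blocks) / len(ref_tokens)
        if blocks and coverage >= min_coverage:
            start = file_lines[blocks[0].b]
            end = file_lines[blocks[-1].b + blocks[-1].size - 1]
            results.append((key, start, end, round(coverage, 2)))
    return results


sample = """\
# foo.py
# Copyright (c) 2015 Example Corp.
#
# Permission is hereby granted, free of charge, to any person
# obtaining a copy of this software
import os
"""

print(detect(sample))  # [('mit-notice', 4, 5, 1.0)]
```

Because the comparison runs on word tokens, the notice is still found when it is wrapped across comment lines, and the reported line range shows where it was found.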

Debian's licensecheck (available here: https://anonscm.debian.org/cgit/collab-maint/devscripts.git/... for reference) is a Perl script using hand-crafted regex patterns to find typical copyright statements and about 50 common licenses. There are about 50 license detection tests.
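
For contrast, a regex-driven checker in the style described above can be sketched as follows. This is an illustration in Python, not licensecheck's Perl code: the patterns, `LICENSE_PATTERNS`, `COPYRIGHT_PATTERN`, and `check` are all hypothetical and were written for this example only.

```python
import re

# Hypothetical hand-written patterns, one per license.
# Supporting a new license means writing and testing a new regex.
LICENSE_PATTERNS = [
    ("GPL", re.compile(r"GNU General Public License", re.I)),
    ("MIT", re.compile(r"Permission is hereby granted, free of charge", re.I)),
    ("Apache-2.0", re.compile(r"Licensed under the Apache License,? Version 2\.0", re.I)),
]

# Hypothetical pattern for a typical "Copyright (c) YEAR HOLDER" line.
COPYRIGHT_PATTERN = re.compile(
    r"copyright\s+(?:\(c\)\s*)?(\d{4}(?:\s*[-,]\s*\d{4})*)\s+(.+)", re.I
)


def check(text):
    """Return (license names, [(years, holder), ...]) found in text."""
    licenses = [name for name, pattern in LICENSE_PATTERNS if pattern.search(text)]
    copyrights = [
        (m.group(1), m.group(2).strip()) for m in COPYRIGHT_PATTERN.finditer(text)
    ]
    return licenses, copyrights


header = (
    "# Copyright (c) 2015 Example Corp.\n"
    "# Licensed under the Apache License, Version 2.0\n"
)
print(check(header))  # (['Apache-2.0'], [('2015', 'Example Corp.')])

# The same notice wrapped across two comment lines is missed by this pattern.
wrapped = "# Licensed under the Apache\n# License, Version 2.0\n"
print(check(wrapped))  # ([], [])
```

The second call shows the usual trade-off with this style: the patterns are quick to run, but each one only matches the phrasings and layouts its author anticipated.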

pombreda · on July 26, 2015

FWIW, I ran a quick test using the HEAD of devscripts, and ScanCode detects many things that licensecheck does not.