ScanCode: Open-source tool to scan code for licenses and copyrights (github.com/nexb)
12 points by pombreda on July 25, 2015 | 3 comments



How is this different from licensecheck that's part of devscripts?


At a high level, ScanCode detects many more licenses and copyrights than licensecheck does, and reports more details about each match. It is likely slower.

In more detail: ScanCode is a Python app using a data-driven approach (as opposed to carefully crafted regexes):

- for license scan, the detection is based on a (large) number of license full texts (~900) and license notices/rules (~1800), so it is data-driven as opposed to regex-driven. It detects exactly where in a file a license text is found. To improve the detection, just throw in more license texts.

- for copyright scan, the approach is natural language parsing (using NLTK) with POS tagging and a grammar; it has a few thousand tests.

- licenses and copyrights are detected in both text and binary files
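
To make the data-driven idea concrete, here is a minimal sketch. It is not ScanCode's actual algorithm or API: it assumes a tiny, hypothetical table of shortened reference texts and uses the standard library's difflib to align each reference against the scanned file, then reports the line range where the text was found. The names LICENSE_TEXTS, tokenize, detect_licenses and min_coverage are made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, heavily shortened reference texts. A real data set would
# hold the full license texts and notices; adding an entry here is all it
# takes to detect one more license.
LICENSE_TEXTS = {
    "mit": "Permission is hereby granted, free of charge, to any person "
           "obtaining a copy of this software",
    "apache-2.0": "Licensed under the Apache License, Version 2.0 (the "
                  "\"License\"); you may not use this file except in "
                  "compliance with the License",
}


def tokenize(text):
    """Return (word, line_number) pairs, lowercased, punctuation dropped."""
    tokens = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            tokens.append((word, line_no))
    return tokens


def detect_licenses(file_text, min_coverage=0.8):
    """Report each reference text mostly present in file_text, with lines."""
    file_tokens = tokenize(file_text)
    file_words = [word for word, _ in file_tokens]
    results = []
    for key, reference in LICENSE_TEXTS.items():
        ref_words = [word for word, _ in tokenize(reference)]
        matcher = SequenceMatcher(None, ref_words, file_words, autojunk=False)
        # The final matching block is always an empty sentinel; skip it.
        blocks = [b for b in matcher.get_matching_blocks() if b.size]
        matched = sum(b.size for b in blocks)
        coverage = matched / len(ref_words)
        if blocks and coverage >= min_coverage:
            first, last = blocks[0], blocks[-1]
            results.append({
                "license": key,
                "start_line": file_tokens[first.b][1],
                "end_line": file_tokens[last.b + last.size - 1][1],
                "coverage": coverage,
            })
    return results
```

Because matching works on word tokens rather than raw characters, comment markers and line wrapping in the scanned file do not matter, which is one thing a data-driven approach gets without any per-license pattern writing.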

Debian's licensecheck (available here: https://anonscm.debian.org/cgit/collab-maint/devscripts.git/... for reference) is a Perl script using hand-crafted regex patterns to find typical copyright statements and about 50 common licenses. There are about 50 license detection tests.
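
For contrast, the regex-driven style looks roughly like the sketch below. This is an illustration in Python, not licensecheck's Perl code, and the patterns are hypothetical examples rather than the ones it ships with.

```python
import re

# Hypothetical hand-written patterns, one per license. Every new license
# or wording variant needs a new or adjusted pattern.
LICENSE_PATTERNS = [
    ("MIT", re.compile(r"Permission is hereby granted, free of charge", re.I)),
    ("Apache-2.0", re.compile(
        r"Licensed under the Apache License,?\s+Version 2\.0", re.I)),
    ("GPL", re.compile(r"GNU General Public License", re.I)),
]

# A typical copyright statement: optional comment prefix, optional "(c)",
# a year or year range, then the holder up to the end of the line.
COPYRIGHT_PATTERN = re.compile(
    r"^\W*Copyright\s+(?:\(c\)\s*)?(\d{4}(?:\s*-\s*\d{4})?)\s+(.+)$",
    re.I | re.M,
)


def check_file(text):
    """Return the license names and (years, holder) pairs found in text."""
    licenses = [name for name, pattern in LICENSE_PATTERNS
                if pattern.search(text)]
    copyrights = COPYRIGHT_PATTERN.findall(text)
    return {"licenses": licenses, "copyrights": copyrights}
```

This is fast and simple, but a notice that is wrapped mid-phrase or worded slightly differently slips past a pattern unless someone extends it by hand.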


FWIW, I did run a quick test using the HEAD of devscripts and there are many things not detected by licensecheck that are detected by ScanCode.



