I'm currently compiling a large corpus of articles written by different authors, and I'm having difficulty verifying the copyright status of the contributions.
Basically, I want to make sure all articles are under a copyleft or other free license and that no excessive copy-pasting has happened. GFDL, CC-BY, and public domain content is acceptable, of course. Ideally, we'd be able to score user-generated content on submission.
We've been looking at plagiarism detection approaches: string comparison methods (e.g. RKR-GST), NLP (e.g. detecting differences in writing style), and search-based tools, but we obviously need help. There are some pretty smart people on this board, so I thought I'd ask:
How are you handling copyright issues in user-generated content? Does anyone have experience applying the above methods to user-generated content? What mechanisms exist to automatically make sure that content is not covered by restrictive licenses ("reverse DRM", if you want to call it that)?
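
To make the "score on submission" idea concrete, here is a rough sketch of the kind of string comparison we have in mind. It is not RKR-GST; it is a much simpler word n-gram (shingle) overlap, and the function names, the shingle size of 3, and the idea of keeping a list of known source texts are all just assumptions for illustration.

```python
import re

def shingles(text, n=3):
    """Return the set of lowercased word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(submission, source, n=3):
    """Fraction of the submission's shingles that also occur in source (0.0 to 1.0)."""
    sub = shingles(submission, n)
    if not sub:
        # Submission is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(sub & shingles(source, n)) / len(sub)

def score_submission(submission, known_sources, n=3):
    """Return the highest containment score against any known source text."""
    return max((containment(submission, s, n) for s in known_sources), default=0.0)
```

A submission scoring near 1.0 against some known text would be flagged for manual review. This only catches near-verbatim copying from texts we already have on hand, and it says nothing about which license the matched text is under, which is exactly the part we're stuck on.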