Graue, I am from Common Crawl. We don't filter for porn. A corpus of web data needs to include porn or it wouldn't be a representative sample of the web ;) We do want to enrich our sample of the web with high-value sites, and that is where the blekko data will be so incredibly valuable.

I really appreciate your mention of LGBT and sexual health sites being collateral damage. We need to draw more attention to that problem. I would love to see someone work with Common Crawl to improve methods of distinguishing such sites from porn. Lisa

Glad to have brought up the issue. Sounds like you should talk to randomstring, above.

