
Graue, I am from Common Crawl. We don't filter for porn. A corpus of web data needs to include porn or it wouldn't be a representative sample of the web ;) We do want to enrich our sample of the web with high-value sites, and that is where the blekko data will be so incredibly valuable.

I really appreciate your mention of LGBT and sexual health sites being collateral damage - we need to draw more attention to that problem. I would love to see someone work with Common Crawl to improve methods of distinguishing such sites from porn. Lisa

Glad to have brought up the issue. Sounds like you should talk to 'randomstring, above.
