

Ask HN: Scraping Answer Sites for ML Project? - kurtosis

Hey,<p>Does anyone have any advice on how to download Yahoo Answers without angering anyone?  Is this dataset published anywhere?<p>I'm putzing around with some statistical methods to automatically predict which answers will be highly ranked on question/answer sites like Yahoo Answers, stackoverflow.com, et al.<p>So far I have trained my methods using the published data dumps of stackoverflow.com. The results are interesting/encouraging, and I'd like to work with a softer dataset like Yahoo Answers, where the questions are less technical.<p>(Incidentally, this method gives interesting results for predicting points of comment threads on HN; however, I refuse to release this without making a nice interface for people to browse.)
======
tocomment
That's a cool project. Let me know if you need any help. I'm not sure what I
could do but it sounds interesting.

Here are a couple of ideas that might help (not that I condone violating a site's
TOS).

Have a random time interval between each page download so someone looking at
the logs doesn't see a regular pattern.
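
A rough sketch of the random-interval idea in Python, in case it helps. The
names `jittered_delays` and `fetch_with_jitter`, the 2 to 10 second range, and
the `fetch` callback are all my own placeholders, not anything from a real
scraper; plug in whatever download function you already have.

```python
import random
import time


def jittered_delays(count, min_seconds=2.0, max_seconds=10.0, rng=None):
    """Return `count` delays drawn uniformly from [min_seconds, max_seconds]."""
    if rng is None:
        rng = random.Random()
    return [rng.uniform(min_seconds, max_seconds) for _ in range(count)]


def fetch_with_jitter(urls, fetch, sleep=time.sleep, **delay_kwargs):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    `fetch` is a hypothetical callable supplied by the caller (assumption).
    `sleep` is injectable so the pauses can be skipped or recorded in tests.
    """
    urls = list(urls)
    # One pause between each pair of consecutive downloads, none before the first.
    delays = jittered_delays(max(len(urls) - 1, 0), **delay_kwargs)
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delays[i - 1])
        pages.append(fetch(url))
    return pages
```

Longer pauses also keep the load on the site down, so err on the slow side.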

Pick 4 or 5 common user agent strings to alternate randomly between.
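
Something like this would do the alternating, assuming your HTTP client takes
a headers dict. The strings in `USER_AGENTS` are made-up placeholders (swap in
real ones yourself), and `pick_user_agent` / `build_headers` are just names I
chose for the sketch.

```python
import random

# Placeholder values only (assumption); replace with real browser strings.
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder A)",
    "ExampleBrowser/2.0 (placeholder B)",
    "ExampleBrowser/3.0 (placeholder C)",
    "ExampleBrowser/4.0 (placeholder D)",
]


def pick_user_agent(rng=random):
    """Choose one of the configured user agent strings at random."""
    return rng.choice(USER_AGENTS)


def build_headers(rng=random):
    """Build a request headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": pick_user_agent(rng)}
```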

Go through free proxies, or package up your program and have 5 or more friends
run it on different sections of the data you want to download.
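
For handing out different sections, one simple way (a sketch, assuming you
already have a list of question IDs or page URLs) is to deal the items out
round-robin so nobody overlaps. `split_into_sections` is a hypothetical helper
name.

```python
def split_into_sections(items, workers):
    """Deal items round-robin into `workers` non-overlapping sections."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    items = list(items)
    # Section i gets items i, i + workers, i + 2 * workers, ...
    return [items[i::workers] for i in range(workers)]
```

Each person then runs the program on their own section only, and you merge
the results afterwards.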

