I teach machine learning applications to master's students. Many of them ask me whether it’s legally OK to scrape websites without using an API and use the data in their course projects. I usually just direct them to use APIs with authentication or tabular datasets on Kaggle, data.world, etc., because I’m not a lawyer and I don’t know the legality of web scraping. The most relevant article I know is from EFF (https://www.eff.org/deeplinks/2018/04/scraping-just-automated-access-and-everyone-does-it) but it’s more than a year old.
Can anyone who knows the law please guide me on this issue? Note that the concern is less about what’s ethical and more about what’s legal. This will also help me in my research, because these days some reviewers raise this concern when they see that authors used web-scraped data. There are a ton of opinion pieces online, but none of them is clear on the legal side. Mostly people oppose scraping because they think it’s unethical.
https://www.eff.org/cases/hiq-v-linkedin
Basically: if it's publicly visible, you can scrape it.
Caveat: the case is still making its way to the Supreme Court.
Edit: There's also Sandvig v. Sessions, which establishes that scraping publicly available data isn't a computer crime:
https://www.eff.org/deeplinks/2018/04/dc-court-accessing-pub...
Edit2: Two extra common-sense caveats:
- Don't hammer the site you're scraping, which is to say don't make it look like you're doing a denial of service attack.
- Don't sell or publish the data wholesale, as is -- that's basically guaranteed to attract copyright infringement lawsuits. Consume it, transform it, use it as training data, etc. instead.
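
For the "don't hammer the site" caveat, here is a minimal sketch of what students could do in practice: force a minimum pause between consecutive requests. The class name PoliteThrottle and the 2-second default are my own assumptions for illustration, not a legal standard or anything a site guarantees is acceptable.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests to one site.

    Hypothetical helper: the name and the default interval are assumptions.
    """

    def __init__(self, min_interval_s=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, None before the first

    def wait(self):
        """Pause until min_interval_s has passed since the last call.

        Returns the number of seconds slept (0.0 if no pause was needed).
        """
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval_s - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

Usage is just calling throttle.wait() right before every request in the scraping loop, with one throttle per site. A fixed delay is the simplest choice; it says nothing about whether a given site considers that rate acceptable.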