
So many ideas start to come to mind if scraping is legal.

Can we start to scrape Google Search in order to bootstrap building an alternative to Google Search? Search is a really hard problem (that somebody should tackle), but if we can leverage what Google has already scraped from the web and associated with popular search terms, we can use that to help train and validate our search model.

Can we scrape Reddit, Twitter, or Facebook in order to stand up a competing service that strips out all the ads? It's hard to bootstrap a social media website, but if you can import all the content from the existing giants, your site is no longer a wasteland.

Can we finally scrape and get rid of IMDB? I'd love to put all of their content on a wiki and be done with it.



Seems like a hard problem to solve legally. I can see so many valid use cases for bots scraping pages. But for all of your examples, I'm inclined to say that it shouldn't be allowed.

Maybe it falls into a "fair use" situation? Obviously copying an entire website would not be considered fair use, but something like scraping a bunch of public profiles on Steam to get aggregate data on what games are played the most seems totally valid.
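To make the aggregate case concrete, here is a rough sketch of the counting step, assuming the public profile data has already been collected somehow. The `profiles` structure, the field names, and the sample data are all made up for illustration; nothing here is Steam's actual API or page format.

```python
from collections import Counter


def most_played(profiles, top_n=3):
    """Count how many profiles list each game; return the top_n as (game, count)."""
    counts = Counter()
    for profile in profiles:
        # Use a set so a game listed twice on one profile is counted once.
        counts.update(set(profile["games"]))
    return counts.most_common(top_n)


# Hypothetical sample data, not real profiles.
sample_profiles = [
    {"user": "a", "games": ["Dota 2", "Portal 2"]},
    {"user": "b", "games": ["Dota 2", "Counter-Strike"]},
    {"user": "c", "games": ["Dota 2", "Portal 2", "Portal 2"]},
]

print(most_played(sample_profiles, top_n=2))
```

Only the totals come out the other end; no individual profile is republished, which is roughly what separates this from copying a whole site.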

Hopefully it doesn't end up with everything gated behind a sign-in and a TOS.


> Can we scrape Reddit, Twitter, or Facebook in order to stand up a competing service that strips out all the ads?

Even if web scraping were definitively legal (this preliminary injunction doesn't establish that), that doesn't mean you can bypass the content creators' copyright. Non-copyrightable functional data is one thing, but copying all of Reddit, for example, would include copying https://www.reddit.com/r/WritingPrompts/ and that would definitely violate the rights of the authors.


> Can we scrape Reddit, Twitter, or Facebook in order to stand up a competing service that strips out all the ads?

How are you going to pay for it? A subscription model doesn't work for search or social networks.


Ad-free Reddit would be sustainable if:

- Comments are ephemeral, expiring after two weeks (no growing storage costs)

- "Reddit Gold" helps to offset costs

- Run Wikipedia-like donation drives yearly

- Write everything in bare-metal Rust so that CPU is cheap. Likewise, make intelligent choices about schema and service design for scalability.

- Don't keep piling on unnecessary feature work (which usually exists just to drive ad engagement and growth).
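The ephemeral-comments point is the easiest one to sketch. This is a minimal version, assuming comments are dicts with a timezone-aware `posted_at` timestamp; the field names and the `prune_expired` helper are hypothetical, and a real service would more likely lean on a datastore's built-in TTL than filter in application code.

```python
from datetime import datetime, timedelta, timezone

# Two-week lifetime, per the list above.
COMMENT_TTL = timedelta(weeks=2)


def prune_expired(comments, now=None):
    """Return only the comments that are younger than COMMENT_TTL."""
    now = now or datetime.now(timezone.utc)
    return [c for c in comments if now - c["posted_at"] < COMMENT_TTL]


# Hypothetical sample data with a fixed "now" so the result is deterministic.
now = datetime(2017, 8, 15, tzinfo=timezone.utc)
comments = [
    {"id": 1, "posted_at": now - timedelta(days=3)},
    {"id": 2, "posted_at": now - timedelta(days=20)},
]

kept = prune_expired(comments, now=now)
print([c["id"] for c in kept])
```

Run on a schedule, something like this keeps storage roughly proportional to two weeks of activity instead of growing forever.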


> Can we finally scrape and get rid of IMDB? I'd love to put all of their content on a wiki and be done with it.

Just because you can scrape the content legally does not mean you can also republish it on your own website.


> Just because you can scrape the content legally does not mean you can also republish it on your own website.

Except IMDB copied all of its data by scraping publicly available data posted to Usenet back in the day. And they still rely on volunteer contributions. [1]

[1] https://en.wikipedia.org/wiki/IMDb#History



