I have no idea what will happen to this with the changes.
If what you suggest is done, even as a package, we could probably build a distributed pushshift[1] alternative that aggregates the data the way ArchiveTeam's Warrior[2] does, and keep publishing the monthly data.
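The Warrior-style idea above can be sketched as a tiny work-distribution loop: a tracker hands out small work units (here, ranges of post IDs), volunteer workers process them, and the results get merged centrally. Everything here is a made-up illustration, and `fetch_range` is a stub standing in for the actual scraping step.

```python
import queue
import threading

def fetch_range(start, end):
    # Stub: a real worker would scrape these post IDs from the site
    # and return the archived items.
    return [f"post_{i}" for i in range(start, end)]

def run_workers(ranges, n_workers=4):
    # Tracker side: a queue of claimable work units.
    work = queue.Queue()
    for r in ranges:
        work.put(r)

    results = []
    lock = threading.Lock()

    def worker():
        # Volunteer side: claim a unit, process it, report back,
        # repeat until the tracker has nothing left.
        while True:
            try:
                start, end = work.get_nowait()
            except queue.Empty:
                return
            posts = fetch_range(start, end)
            with lock:
                results.extend(posts)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

In the real thing the queue would live on a central tracker and the merged results would become the monthly dumps, but the claim/process/report shape is the same.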
Right now it seems legal to scrape Reddit. But given their trajectory of making the API fairly expensive to use, do you think it's likely that they would also limit/prohibit scraping (assuming apps like Apollo start scraping as an alternative)?
This is the first classic example I've encountered where a company uses its power and ownership to render smaller, independent products completely unsustainable.
It's perfect in that there's no way for them to differentiate from normal browser traffic. This is adversarial interoperability which is exactly how we're supposed to deal with corporations and their idiotic "terms". Nobody is forced to accept their "take it or leave it" BS.
If it's important enough, someone somewhere will care enough to fix it when it inevitably breaks. Look at youtube-dl and its variations, there's even a youtube.js now.
They can fight all they want. In the end it doesn't matter how many engineers they throw at the problem. The only way they can win is for the world to descend ever deeper into DRM tyranny. Google must literally own your computer in order to prevent you from copying YouTube videos. It has to come pwned straight from the factory so that Google can literally dictate that you, the user, are prohibited from running certain bad programs known to hurt their bottom line.
I realize we're not too far off from that dystopia but it's still a battle that deserves to be fought for the principle of it.
Sure. We can also control those variables. Youtube-dl ships a small JavaScript interpreter to figure out the audio and video URLs that YouTube generates for every client. In this thread it was also pointed out that people can use headless Chrome.
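The reason youtube-dl needs a JS interpreter is that the player applies a small, frequently changing function to stream signatures, and the client has to reproduce it to get a working URL. A toy stand-in (this is not YouTube's actual cipher, just the shape of the technique):

```python
# Illustrative only: the real player ships a JS function that
# "scrambles" stream signatures; youtube-dl either interprets that JS
# or re-implements it. The transform below is a made-up stand-in.

def scramble(sig: str) -> str:
    # Toy transform: reverse the string, then swap the first two chars.
    s = list(reversed(sig))
    s[0], s[1] = s[1], s[0]
    return "".join(s)

def unscramble(sig: str) -> str:
    # Invert the toy transform: swap back, then reverse again.
    s = list(sig)
    s[0], s[1] = s[1], s[0]
    return "".join(reversed(s))
```

Whenever the site rotates the real function, the downloader only has to re-derive this one transform, which is why the cat-and-mouse game stays winnable.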
It's simple. If they allow public access to their site at all, there's pretty much nothing they can do to stop any of this.
Yes and no: the terrible attempts to build a new front-end are, but the old front-end that runs on Python with Pylons (I believe) as its “front” isn’t.
I like React, and I love TypeScript, but sites like Wikipedia, old.Reddit.com, Stack Overflow, Hacker News and so on are nice showcases of why you should never be afraid of the page reload: your users won’t be, unless you’re building something where you need to update screen state without user input. Like when your user receives a mail and can’t just reload the page, because their input would be lost if you did. I think this last part is the primary reason (along with mobile) that Reddit has been attempting to move to React, because the “social” parts like chats and private messages don’t instantly show up for users in the old front-end. Unfortunately they haven’t been very successful so far.
You can probably scrape their current or their new.Reddit front-ends, since you can scrape SPAs, but it’s much, much easier to scrape the old.Reddit front-end.
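"Easier to scrape" here just means the content is already in the server-rendered HTML, so a stdlib parser is enough. A minimal sketch; the snippet and the `title` class name are made up to resemble old-style listing markup, and the real page's structure would need to be inspected first:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text of <a class="title"> links from a listing page."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "a" and ("class", "title") in attrs:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data)
            self._in_title = False

# Made-up sample resembling a server-rendered listing; a real scraper
# would fetch the page over HTTP instead.
SAMPLE = """
<div class="thing">
  <a class="title" href="/r/example/1">First post</a>
</div>
<div class="thing">
  <a class="title" href="/r/example/2">Second post</a>
</div>
"""

parser = TitleExtractor()
parser.feed(SAMPLE)
```

With an SPA none of this text is in the initial HTML, which is why scraping one means either running a browser or hitting its JSON endpoints directly.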
Don't know, to be honest. I assume new Reddit is while old Reddit isn't. Truth is, the job is even easier in the case of an SPA: you just use whatever internal APIs they built for themselves. They can't add idiotic restrictions to those without breaking the normal web service.
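The "use their own internal APIs" point boils down to issuing the same JSON requests the SPA's front-end makes, with ordinary browser headers so nothing distinguishes you from their own traffic. A sketch with the stdlib; the endpoint path is hypothetical and no request is actually sent here:

```python
import urllib.request

def build_api_request(url: str) -> urllib.request.Request:
    # Plain browser-like headers; from the server's perspective this
    # looks like the site's own front-end fetching data.
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)

# Hypothetical internal listing endpoint, for illustration only.
req = build_api_request("https://example.com/api/listing.json")
# A real scraper would now do: urllib.request.urlopen(req)
```

Blocking these requests without breaking the SPA itself is the hard part for the site, which is the "can't restrict it without affecting the normal web service" argument.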