
Good luck. The news is that the API fees are going to destroy Apollo.



I believe Teddit uses scraping; I run a Teddit instance for myself and haven't needed to set up an API key.


From a very brief skim it doesn't appear to scrape: https://codeberg.org/teddit/teddit/src/branch/main/routes/ho...

But I'm not really a web dev so I might be misreading things.

Also: https://codeberg.org/teddit/teddit/issues/400



That's another feature that seems to be on its way out, like locking up the API.


Do you mean that they've said it, or just that it makes sense?

Curious if they've said anything about it.


The .json thing is an API request, so it will fall under the new API rate limits.


It would be great for someone to scrape Reddit and expose that information in a format compatible with the official API.

So if you call /get-comments/1234 it scrapes post 1234 and returns the JSON object exactly as the official API does.

Then third party clients can just point to this endpoint.
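Sketching that idea (a minimal, hypothetical stand-in, not the real API schema; assuming Flask and BeautifulSoup, and old.reddit's server-rendered markup):

    import requests
    from bs4 import BeautifulSoup
    from flask import Flask, jsonify

    app = Flask(__name__)
    HEADERS = {"User-Agent": "Mozilla/5.0"}  # look like ordinary browser traffic

    @app.route("/get-comments/<post_id>")
    def get_comments(post_id):
        # old.reddit is plain server-rendered HTML, so it's easy to parse
        html = requests.get(f"https://old.reddit.com/comments/{post_id}",
                            headers=HEADERS, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        comments = [
            {"author": c.get("data-author"), "id": c.get("data-fullname")}
            for c in soup.select("div.comment")
        ]
        return jsonify({"post": post_id, "comments": comments})

The hard part isn't the scraping; it's reproducing Reddit's listing format exactly so existing clients don't notice the difference.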


Currently, you can just add .json to the end of the URL.

I have no idea what will happen to this with the changes.

If what you suggest is done, even as a package, we could probably build a distributed pushshift[1] alternative that aggregates the data the way ArchiveTeam's Warrior[2] does, and keep publishing the monthly data.

[1] - reddit.com/r/pushshift

[2] - https://wiki.archiveteam.org/index.php?title=ArchiveTeam_War...
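For reference, the .json trick is just appending the extension to any Reddit URL. A quick sketch (the subreddit and post ID below are made-up examples):

    import requests

    # Appending ".json" to almost any reddit URL returns the API-style JSON.
    url = "https://old.reddit.com/r/programming/comments/abc123.json"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    # Comments pages return [post listing, comments listing]
    post = resp.json()[0]["data"]["children"][0]["data"]
    print(post["title"], post["score"])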


Won’t they just get rate-limited or hit with captchas?


Right now it seems legal to scrape Reddit. But given their trajectory of making the API fairly expensive to use, do you think it's likely that they would also limit/prohibit scraping (assuming apps like Apollo start scraping as an alternative)?


My understanding is that scraping of public websites is generally legal, isn't it?


Legal, but probably against ToS


That ToS is meaningless if you scrape while logged out.


This is the first classic example I've encountered of a company using its power and ownership to render smaller, independent products completely unsustainable.


$12,000 per 50 million requests according to a post the dev made on Reddit, which he claims translates to $20 million a year.


And per user it's about $25 a year
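Taking the dev's figures at face value, the arithmetic works out roughly like this: $20M a year at $12,000 per 50M requests implies about 83 billion API calls annually, and $20M at $25 per user implies about 800,000 users, i.e. on the order of 285 requests per user per day.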


HTML is a perfectly good API.


Most HTML is undocumented and unstable, so I’d say it’s far from perfect.


It's perfect in that there's no way for them to differentiate it from normal browser traffic. This is adversarial interoperability, which is exactly how we're supposed to deal with corporations and their idiotic "terms". Nobody is forced to accept their "take it or leave it" BS.

https://www.eff.org/deeplinks/2019/10/adversarial-interopera...

If it's important enough, someone somewhere will care enough to fix it when it inevitably breaks. Look at youtube-dl and its variations; there's even a youtube.js now.

https://news.ycombinator.com/item?id=31021611


I would be very curious to know how many engineers inside Google have fighting programs like youtube-dl as their sole responsibility.

I am willing to bet it’s a whole division.


They can fight all they want. In the end it doesn't matter how many engineers they throw at the problem. The only way they can win is for the world to descend ever deeper into DRM tyranny. Google would have to literally own your computer in order to prevent you from copying YouTube videos. It would have to come pwned straight from the factory, so that Google can dictate that you, the user, are prohibited from running certain bad programs that are known to hurt their bottom line.

I realize we're not too far off from that dystopia but it's still a battle that deserves to be fought for the principle of it.


Can’t they just fingerprint the incoming requests based on a litany of variables, such as headers, IP, etc., to prevent this “scraping”?

Sorry I don’t follow


Sure. We can also control those variables. youtube-dl has a small JavaScript interpreter to figure out the audio and video URLs that YouTube generates for every client. In this thread it was also pointed out that people can use headless Chrome.

It's simple. If they allow public access to their site at all, there's pretty much nothing they can do to stop any of this.
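As a concrete illustration, a client can simply present the same surface a browser would. A hedged sketch (the header values would be copied from a real desktop Chrome session):

    import requests

    # Mimic a desktop Chrome request: same UA, accept headers, language.
    BROWSER_HEADERS = {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/113.0.0.0 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

    resp = requests.get("https://old.reddit.com/r/programming/",
                        headers=BROWSER_HEADERS, timeout=10)
    print(resp.status_code, len(resp.text))

(TLS-level fingerprinting can still tell clients apart, which is where the headless browsers mentioned below come in.)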


Reddit is a SPA though, right?


Yes and no. The terrible attempts to build a new front-end are, but the old front-end, which runs on Python with Pylons (I believe) as its “front”, isn’t.

I like React, and I love TypeScript, but sites like Wikipedia, old.reddit.com, Stack Overflow, Hacker News and so on are nice showcases of how you should never be afraid of the page reload. Your users won’t be either, unless you’re building something where you need to update screen state without user input, like when a user receives a mail and can’t just reload the page because their input would be lost. I think this last part is the primary reason (along with mobile) that Reddit has been attempting to move to React: the “social” parts like chats and private messages don’t instantly show up for users in the old front-end. Unfortunately they haven’t been very successful so far.

You can probably scrape their current or new.reddit front-ends, since you can scrape SPAs, but it’s much, much easier to scrape the old.reddit front-end.


Don't know to be honest. I assume new reddit is while old reddit isn't. Truth is, the job is even easier in the case of an SPA: you just use whatever internal APIs they built for themselves. They can't add idiotic restrictions to that without affecting the normal web service.


SPA being a problem was last decade. Headless chromium is pretty standard for scraping nowadays.
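For example, a minimal sketch using Playwright's Python bindings (the a.title selector assumes old.reddit's markup):

    from playwright.sync_api import sync_playwright

    # Drive a real (headless) Chromium, so the site sees ordinary browser
    # traffic and any client-side JS runs as normal.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://old.reddit.com/r/programming/")
        # "a.title" matches post-title links in old.reddit's markup
        titles = page.eval_on_selector_all(
            "a.title", "els => els.map(e => e.textContent)")
        browser.close()

    for t in titles[:10]:
        print(t)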


How does that work for clients, though? Should Apollo ship a headless Chromium in their mobile app?


Can they launch a hidden browser view and scrape it? I have no idea if they can read from it.


Isn’t that very expensive to run at scale, especially mixing in residential IPs to avoid blocking?


Each person would spider for their own needs, and most would use residential IPs.


Good to know; a good time to donate if I can. I love Apollo.


What poor timing for something like this.


Actually, perfect timing. It doesn't use the official API.


Well, Teddit is not new, if that's what you mean. I've been using it for a while; the only downside is that it's a bit slow.


I've found that the instance hosted by privacytools.io is significantly faster than the official instance.

https://teddit.privacytools.io/


I personally like the Adminforge instance https://teddit.adminforge.de. It's much quicker than the original teddit.net.



