
Good luck. The news is that the API fees are going to destroy Apollo.



I believe Teddit uses scraping; I run a Teddit instance for myself and haven't needed to set up an API key.


From a very brief skim it doesn't appear to scrape: https://codeberg.org/teddit/teddit/src/branch/main/routes/ho...

But I'm not really a web dev so I might be misreading things.

Also: https://codeberg.org/teddit/teddit/issues/400



That's another feature that seems to be on its way out, like locking up the API.


Do you mean that they've said it, or just that it makes sense?

Curious if they've said anything about it.


The .json thing is an API request, so it will fall under the new API rate limits.


It would be great for someone to scrape Reddit and expose that information in a format compatible with the official API.

So if you call /get-comments/1234 it scrapes post 1234 and returns the JSON object exactly as the official API does.

Then third party clients can just point to this endpoint.
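Sketching that idea (a minimal, hypothetical stand-in, not the real API schema; assuming Flask and BeautifulSoup, and old.reddit's server-rendered markup):

    import requests
    from bs4 import BeautifulSoup
    from flask import Flask, jsonify

    app = Flask(__name__)
    HEADERS = {"User-Agent": "Mozilla/5.0"}  # look like ordinary browser traffic

    @app.route("/get-comments/<post_id>")
    def get_comments(post_id):
        # old.reddit is plain server-rendered HTML, so it's easy to parse
        html = requests.get(f"https://old.reddit.com/comments/{post_id}",
                            headers=HEADERS, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        comments = [
            {"author": c.get("data-author"), "id": c.get("data-fullname")}
            for c in soup.select("div.comment")
        ]
        return jsonify({"post": post_id, "comments": comments})

The hard part isn't the scraping; it's reproducing Reddit's listing format exactly so existing clients don't notice the difference.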


Currently, you can just add .json to the end of the URL.

I have no idea what will happen to this with the changes.

If what you suggest is done, even as a package, we could probably build a distributed pushshift[1] alternative that aggregates the data the way ArchiveTeam's Warrior[2] does, and keep publishing the monthly data.

[1] - reddit.com/r/pushshift

[2] - https://wiki.archiveteam.org/index.php?title=ArchiveTeam_War...
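For reference, the .json trick is just appending the extension to any Reddit URL. A quick sketch (the subreddit and post ID below are made-up examples):

    import requests

    # Appending ".json" to almost any reddit URL returns the API-style JSON.
    url = "https://old.reddit.com/r/programming/comments/abc123.json"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    # Comments pages return [post listing, comments listing]
    post = resp.json()[0]["data"]["children"][0]["data"]
    print(post["title"], post["score"])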


Won’t they just get rate-limited or hit with captchas?


Right now it seems legal to scrape Reddit. But given their trajectory of making the API fairly expensive to use, do you think it's likely that they would also limit/prohibit scraping (assuming apps like Apollo start scraping as an alternative)?


My understanding is that scraping of public websites is generally legal, isn't it?


Legal, but probably against ToS


That ToS is meaningless if you scrape while logged out.


This is the first classic example I've encountered of a company using its power and ownership to render smaller, independent products completely unsustainable.


$12,000 per 50 million requests according to a post the dev made on Reddit, which he claims translates to $20 million a year.


And per user it's about $25 a year
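Taking the dev's figures at face value, the arithmetic works out roughly like this: $20M a year at $12,000 per 50M requests implies about 83 billion API calls annually, and $20M at $25 per user implies about 800,000 users, i.e. on the order of 285 requests per user per day.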


HTML is a perfectly good API.


Most HTML is undocumented and unstable, so I’d say it’s far from perfect.


It's perfect in that there's no way for them to differentiate it from normal browser traffic. This is adversarial interoperability, which is exactly how we're supposed to deal with corporations and their idiotic "terms". Nobody is forced to accept their "take it or leave it" BS.

https://www.eff.org/deeplinks/2019/10/adversarial-interopera...

If it's important enough, someone somewhere will care enough to fix it when it inevitably breaks. Look at youtube-dl and its variations; there's even a youtube.js now.

https://news.ycombinator.com/item?id=31021611


I would be very curious to know how many engineers inside Google have fighting programs like youtube-dl as their sole responsibility.

I am willing to bet it’s a whole division.


They can fight all they want. In the end it doesn't matter how many engineers they throw at the problem. The only way they can win is for the world to descend ever deeper into DRM tyranny. Google would have to literally own your computer in order to prevent you from copying YouTube videos. It would have to come pwned straight from the factory, so that Google can dictate that you, the user, are prohibited from running certain bad programs that are known to hurt their bottom line.

I realize we're not too far off from that dystopia but it's still a battle that deserves to be fought for the principle of it.


Can’t they just fingerprint the incoming requests based on a litany of variables, such as headers, IP, etc., to prevent this “scraping”?

Sorry I don’t follow


Sure. We can also control those variables. youtube-dl has a small JavaScript interpreter to figure out the audio and video URLs that YouTube generates for every client. In this thread it was also pointed out that people can use headless Chrome.

It's simple. If they allow public access to their site at all, there's pretty much nothing they can do to stop any of this.
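As a concrete illustration, a client can simply present the same surface a browser would. A hedged sketch (the header values would be copied from a real desktop Chrome session):

    import requests

    # Mimic a desktop Chrome request: same UA, accept headers, language.
    BROWSER_HEADERS = {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/113.0.0.0 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

    resp = requests.get("https://old.reddit.com/r/programming/",
                        headers=BROWSER_HEADERS, timeout=10)
    print(resp.status_code, len(resp.text))

(TLS-level fingerprinting can still tell clients apart, which is where the headless browsers mentioned below come in.)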


Reddit is a SPA though, right?


Yes and no. The terrible attempts to build a new front-end are, but the old front-end, which runs on Python with Pylons (I believe) as its “front”, isn’t.

I like React, and I love TypeScript, but sites like Wikipedia, old.reddit.com, Stack Overflow, Hacker News and so on are nice showcases of how you should never be afraid of the page reload. Your users won’t be either, unless you’re building something where you need to update screen state without user input, like when a user receives a mail and can’t just reload the page because their input would be lost. I think this last part is the primary reason (along with mobile) that Reddit has been attempting to move to React: the “social” parts like chats and private messages don’t instantly show up for users in the old front-end. Unfortunately they haven’t been very successful so far.

You can probably scrape their current or new.reddit front-ends, since you can scrape SPAs, but it’s much, much easier to scrape the old.reddit front-end.


Don't know to be honest. I assume new reddit is while old reddit isn't. Truth is, the job is even easier in the case of an SPA: you just use whatever internal APIs they built for themselves. They can't add idiotic restrictions to that without affecting the normal web service.


SPA being a problem was last decade. Headless chromium is pretty standard for scraping nowadays.
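For example, a minimal sketch using Playwright's Python bindings (the a.title selector assumes old.reddit's markup):

    from playwright.sync_api import sync_playwright

    # Drive a real (headless) Chromium, so the site sees ordinary browser
    # traffic and any client-side JS runs as normal.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://old.reddit.com/r/programming/")
        # "a.title" matches post-title links in old.reddit's markup
        titles = page.eval_on_selector_all(
            "a.title", "els => els.map(e => e.textContent)")
        browser.close()

    for t in titles[:10]:
        print(t)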


How does that work for clients, though? Should Apollo ship a headless Chromium in their mobile app?


Can they launch a hidden browser view and scrape it? I have no idea if they can read from it.


Isn’t that very expensive to run at scale, especially mixing in residential IPs to avoid blocking?


Each person would spider for their own needs, and most would use residential IPs.


Good to know; a good time to donate if I can. I love Apollo.


What poor timing for something like this.


Actually, perfect timing. It doesn't use the official API.


Well, Teddit is not new, if that's what you mean. I've been using it for a while; the only downside is that it's a bit slow.


I've found that the instance hosted by privacytools.io is significantly faster than the official instance.

https://teddit.privacytools.io/


I personally like the Adminforge instance https://teddit.adminforge.de. It's much quicker than the original teddit.net.



