
Has anyone ever done something like this for Facebook?

I know it'd have to be subjective per user (security ACLs mean different accounts see differing subsets of other accounts' posts), but I'd be fine with just getting my own account's subjective view, by logging into such a service using Facebook OAuth (or, if that isn't enough, I'd be fine with handing over my Facebook credentials themselves, à la XAuth, provided the service is FOSS and I'm running my own copy of it, e.g. in an ownCloud instance).

I also know that it'd likely require heavyweight scraping using e.g. Puppeteer to fool Facebook into thinking it's real traffic. But that's not really much of an impediment, as long as you don't need to scale it to more than a dozen-or-so scrapes per second — which you'd automatically be safe from if it were a host-it-yourself solution, since there'd only be one concurrent user of your instance.
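
Roughly what I have in mind, as a hedged sketch (the feed URL, the idea of reusing exported session cookies, and the file names are assumptions I'm making, not anything Facebook-specific):

    // Hedged sketch: a single slow, logged-in fetch of your own feed with Puppeteer.
    import puppeteer from 'puppeteer';
    import { promises as fs } from 'fs';

    async function fetchFeedHtml(): Promise<string> {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();

      // Reuse cookies from a real logged-in session (hypothetical cookies.json
      // exported from your own browser) instead of automating the login form.
      const cookies = JSON.parse(await fs.readFile('cookies.json', 'utf8'));
      await page.setCookie(...cookies);

      // One fetch per run; a self-hosted, single-user instance never needs to
      // scale past that, so request pacing is a non-issue.
      await page.goto('https://m.facebook.com/', { waitUntil: 'networkidle2' });
      const html = await page.content();

      await browser.close();
      return html;
    }

    fetchFeedHtml().then((html) => fs.writeFile('feed.html', html));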

Anyone done this?




RSSBox used to have Facebook support (public pages only, no personal content), but when Facebook started cordoning off their API two years ago, I had to turn it off since I was unable to get my application approved. The code is still there, but I doubt it would work even if you managed to get a working API key. I think the best option now, unfortunately, may be to scrape the web content.


I've assumed for a while that the only way to convert FB -> RSS would be to scrape the home page, but from what I recall the HTML and DOM are all kinds of messed up - intentionally obfuscated to prevent ad blocking. From a quick look just now, it does seem like it would be a nightmare to parse as-is - and I'd guess FB changes a lot of the output regularly anyway to defeat ad blockers, making it pretty challenging to keep up.


It almost sounds like a problem best solved with OCR, rather than scraping per se. Build a simple model to recognize “posts” from screenshots and output the rectangular viewport regions of their inner content; then build some GIS-like layered 2D interval tree over all the DOM regions, so that you can ask Puppeteer et al. to filter for every DOM node whose visible area overlaps a given post region; extract every single Unicode grapheme cluster within those nodes separately, annotated with its viewport XY position; and finally, use the same kind of model that lets PDF readers highlight “text” (i.e. arbitrary bags of absolutely-positioned graphemes), to “un-render” the DOM nodes’ bag of positioned graphemes back into a stream of space/line/paragraph-segmented text.
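
To make the last step concrete, here's a rough sketch of just the “un-render” part, assuming you've already extracted graphemes with viewport positions (the Grapheme shape and the pixel thresholds here are made-up assumptions, nothing Facebook-specific):

    // Hedged sketch: turn a bag of viewport-positioned grapheme clusters back
    // into line/paragraph-segmented text, PDF-text-extraction style.
    interface Grapheme { x: number; y: number; text: string; }

    function unrender(graphemes: Grapheme[], lineTol = 6, spaceGap = 4, paraGap = 24): string {
      const sorted = [...graphemes].sort((a, b) => a.y - b.y || a.x - b.x);

      // Cluster graphemes into lines: same line if y is within lineTol pixels.
      const lines: Grapheme[][] = [];
      for (const g of sorted) {
        const last = lines[lines.length - 1];
        if (last && Math.abs(g.y - last[0].y) <= lineTol) last.push(g);
        else lines.push([g]);
      }

      // Emit each line left-to-right, inserting a space on large x gaps and a
      // paragraph break on large y gaps between lines.
      let out = '';
      let prevY: number | null = null;
      for (const line of lines) {
        if (prevY !== null) out += line[0].y - prevY > paraGap ? '\n\n' : '\n';
        prevY = line[0].y;
        let prevRight: number | null = null;
        for (const g of line.sort((a, b) => a.x - b.x)) {
          if (prevRight !== null && g.x - prevRight > spaceGap) out += ' ';
          out += g.text;
          prevRight = g.x; // crude: ignores glyph width, fine for a sketch
        }
      }
      return out;
    }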


I wrote about trying to do this for Facebook page events as an example. Code sample included:

https://chrishardie.com/2019/10/unlocking-community-events-f...



