
Has anyone ever done something like this for Facebook?

I know it'd have to be subjective per user (security ACLs mean different accounts see differing subsets of other accounts' posts), but I'd be fine with just getting my own account's subjective view, by logging into such a service using Facebook OAuth (or, if that isn't enough, I'd be fine with handing over my Facebook credentials themselves, à la XAuth, provided the service is FOSS and I'm running my own copy of it, e.g. in an ownCloud instance).

I also know that it'd likely require heavyweight scraping using e.g. Puppeteer to fool Facebook into thinking it's real traffic. But that's not really much of an impediment, as long as you don't need to scale it to more than a dozen-or-so scrapes per second — which you'd automatically be safe from if it were a host-it-yourself solution, since there'd only be one concurrent user of your instance.
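
Roughly what I have in mind, as a hedged sketch (the feed URL, the idea of reusing exported session cookies, and the file names are assumptions I'm making, not anything Facebook-specific):

    // Hedged sketch: a single slow, logged-in fetch of your own feed with Puppeteer.
    import puppeteer from 'puppeteer';
    import { promises as fs } from 'fs';

    async function fetchFeedHtml(): Promise<string> {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();

      // Reuse cookies from a real logged-in session (hypothetical cookies.json
      // exported from your own browser) instead of automating the login form.
      const cookies = JSON.parse(await fs.readFile('cookies.json', 'utf8'));
      await page.setCookie(...cookies);

      // One fetch per run; a self-hosted, single-user instance never needs to
      // scale past that, so request pacing is a non-issue.
      await page.goto('https://m.facebook.com/', { waitUntil: 'networkidle2' });
      const html = await page.content();

      await browser.close();
      return html;
    }

    fetchFeedHtml().then((html) => fs.writeFile('feed.html', html));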

Anyone done this?




RSSBox used to have Facebook support (public pages only, no personal content), but when Facebook started cordoning off their API two years ago, I had to turn it off since I was unable to get my application approved. The code is still there, but I doubt it would work even if you managed to get a working API key. I think the best option now, unfortunately, may be to scrape the web content.


I've assumed for a while that the only way to convert FB -> RSS would be to scrape the home page, but from what I recall the HTML and DOM are all kinds of messed up - intentionally obfuscated to prevent ad blocking. From a quick look just now, it does seem like it would be a nightmare to parse as-is - and I'd guess FB changes a lot of the output regularly anyway to defeat ad blockers, making it pretty challenging to keep up.


It almost sounds like a problem best solved with OCR, rather than scraping per se. Build a simple model to recognize “posts” from screenshots and output the rectangular viewport regions of their inner content; then build some GIS-like layered 2D interval tree over all the DOM regions, so that you can ask Puppeteer et al. to filter for every DOM node whose visible area overlaps a given post region; extract every single Unicode grapheme cluster within those nodes separately, annotated with its viewport XY position; and finally, use the same kind of model that lets PDF readers highlight “text” (i.e. arbitrary bags of absolutely-positioned graphemes), to “un-render” the DOM nodes’ bag of positioned graphemes back into a stream of space/line/paragraph-segmented text.
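
To make the last step concrete, here's a rough sketch of just the “un-render” part, assuming you've already extracted graphemes with viewport positions (the Grapheme shape and the pixel thresholds here are made-up assumptions, nothing Facebook-specific):

    // Hedged sketch: turn a bag of viewport-positioned grapheme clusters back
    // into line/paragraph-segmented text, PDF-text-extraction style.
    interface Grapheme { x: number; y: number; text: string; }

    function unrender(graphemes: Grapheme[], lineTol = 6, spaceGap = 4, paraGap = 24): string {
      const sorted = [...graphemes].sort((a, b) => a.y - b.y || a.x - b.x);

      // Cluster graphemes into lines: same line if y is within lineTol pixels.
      const lines: Grapheme[][] = [];
      for (const g of sorted) {
        const last = lines[lines.length - 1];
        if (last && Math.abs(g.y - last[0].y) <= lineTol) last.push(g);
        else lines.push([g]);
      }

      // Emit each line left-to-right, inserting a space on large x gaps and a
      // paragraph break on large y gaps between lines.
      let out = '';
      let prevY: number | null = null;
      for (const line of lines) {
        if (prevY !== null) out += line[0].y - prevY > paraGap ? '\n\n' : '\n';
        prevY = line[0].y;
        let prevRight: number | null = null;
        for (const g of line.sort((a, b) => a.x - b.x)) {
          if (prevRight !== null && g.x - prevRight > spaceGap) out += ' ';
          out += g.text;
          prevRight = g.x; // crude: ignores glyph width, fine for a sketch
        }
      }
      return out;
    }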


I wrote about trying to do this for Facebook page events as an example. Code sample included:

https://chrishardie.com/2019/10/unlocking-community-events-f...



