My partner usually writes Substack posts, which I then mirror to our website's blog section.
To automate this, I made a simple tool that scrapes a post and cleans it up so that I can drop it into our blog easily. This might be useful to others as well.
Oh, and of course you can instruct GPT to make any final edits :D
1. Thoroughly scraping the content of the page (high recall)
2. Dropping all the ads/auxiliary content (high precision)
3. Getting the correct layout/section types (formatting)
For #2 and #3, solutions based on Trafilatura, Newspaper4k and python-readability work best out of the box. For #1, any scraping service plus Selenium is going to do a great job.
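To make #2 and #3 concrete, here is a toy sketch using only the Python standard library. It is not how Trafilatura, Newspaper4k or python-readability work internally; the tag lists, the `ToyExtractor` class and the `extract_markdown` helper are all made up for illustration, and real extractors use much richer heuristics.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as boilerplate (ads, menus, widgets).
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "form"}
# Assumption: a tiny mapping from block tags to Markdown-style prefixes.
BLOCK_TAGS = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class ToyExtractor(HTMLParser):
    """Keeps text from known block tags and ignores anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate container
        self.current = None   # prefix of the block being read, or None
        self.buffer = []      # text pieces of the current block
        self.blocks = []      # finished, prefixed blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.current = BLOCK_TAGS[tag]
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and self.current is not None:
            # Collapse whitespace left over from inline tags and line breaks.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.current + text)
            self.current = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current is not None:
            self.buffer.append(data)


def extract_markdown(html: str) -> str:
    parser = ToyExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


sample = """
<html><body>
<nav><p>Home | Subscribe</p></nav>
<h1>My Post</h1>
<p>First <em>paragraph</em>.</p>
<aside><p>Sponsored: buy now</p></aside>
<ul><li>Point one</li></ul>
<script>track()</script>
</body></html>
"""
print(extract_markdown(sample))
```

On the sample page this keeps the heading, the paragraph and the list item, and drops the nav, aside and script content. The hard part in practice is that real pages rarely mark ads and auxiliary content with such convenient tags, which is why the dedicated libraries exist.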
Could you elaborate on what your tool does differently or better? The area has been stagnant for a while, so I'm curious to hear what you've learned.