Hacker News
Ask HN: What is Substack doing with HN data?
9 points by cactusplant7374 on April 30, 2023 | 7 comments
After I posted a link to my website to HN I noticed this in my logs:

44.195.67.189 - - [30/Apr/2023:20:38:44 +0000] "GET / HTTP/1.1" 200 11321 "-" "SubstackContentFetch/1.0 (https://substack.com/)"

I've never seen this before.




My guess is that this is the Open Graph[0] crawler for Substack Notes[1], the Twitter-like product that Substack is building. When someone posts a link to your blog there, it visits the page and grabs the meta tags to display a link preview.

[0]: https://ogp.me/

[1]: https://substack.com/notes

Edit: I was indeed correct. I went on Substack Notes and typed in my own site's URL. The bad part is that it crawls as you type your post out, so instead of one request you get several!

> Sun Apr 30 2023 23:56:03 GMT+0000 (Coordinated Universal Time)]: | Ip=34.200.242.86 | Req_page=/?substack_notes | Agent=SubstackContentFetch/1.0 (https://substack.com/)
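
For anyone curious what a link-preview fetcher looks for, here is a minimal sketch of pulling Open Graph tags out of a page with Python's standard library. It is not Substack's code; the sample HTML and the names `OpenGraphParser` and `extract_open_graph` are made up for illustration.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also reach this handler by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and attr.get("content") is not None:
            # Keep the first value seen for each property.
            self.tags.setdefault(prop, attr["content"])


def extract_open_graph(html):
    """Return a dict mapping og:* property names to their content values."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


# Hypothetical page, invented for this example.
sample_html = """
<html><head>
<title>Example blog</title>
<meta property="og:title" content="Example blog" />
<meta property="og:description" content="Notes on things." />
<meta name="viewport" content="width=device-width" />
</head><body></body></html>
"""

preview = extract_open_graph(sample_html)
```

A real fetcher would download the page first and probably fall back to `<title>` when no `og:` tags are present; this sketch only covers the parsing step.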


I don't think anyone is linking to my blog. I didn't get any upvotes. Seems really unlikely.


It's most likely one of those bots that use the RSS feed to repost every new HN submission.


Was it a substack employee clicking the link while on the corporate VPN?

tl;dr - you won't know much from a single log entry

generally, there are many groups scraping HN and the links found there
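
One way to get more than a single data point is to tally user agents across the whole access log. A rough sketch, assuming the log is in the combined format shown in the original post; `parse_line` and `count_agents` are hypothetical helper names, and the only sample line is the one quoted above.

```python
import re
from collections import Counter

# Combined log format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_agents(lines):
    """Count how many requests each user agent made."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry:
            counts[entry["agent"]] += 1
    return counts


# The log line from the original post.
sample = (
    '44.195.67.189 - - [30/Apr/2023:20:38:44 +0000] "GET / HTTP/1.1" '
    '200 11321 "-" "SubstackContentFetch/1.0 (https://substack.com/)"'
)

entry = parse_line(sample)
counts = count_agents([sample])
```

To run it on a real file, pass the open file object to `count_agents` and look at `counts.most_common()`. The pattern assumes no escaped quotes inside the request or user-agent fields, so unusual lines are skipped rather than parsed.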


I wouldn't expect they are rewriting their user agents to announce their company. You'd get a regular user agent if they were using a regular browser.

It feels more likely that they are gathering stats on tech writers and weblogs to see who they might want to invite to Substack one day. They could hide this with a fake user agent, but have chosen not to.


Why would all browsers at Substack Co have their user agent changed?


A proxy could be at play. I'm just speculating, like you, about the nature of the log entry.



