
Having built an AI crawler myself for first party data collection:

1. I intentionally made my crawler slow (I prefer batch processing workflows in general, which also means I never need a machine-gun request rate)

2. For data updates, I first do a HEAD request and only fetch the page if it has actually changed. This is good for me (lower cost), the site owner, and the internet as a whole (less redundant data transfer)
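To make the two points above concrete, here is a minimal sketch, not my actual crawler. It assumes the site returns an ETag or Last-Modified header on HEAD responses (many don't, in which case it falls back to refetching). The names `has_changed`, `head_headers`, and `refresh`, and the 5 second delay, are made up for illustration.

```python
import time
import urllib.request

# Response headers compared between visits. That a site sends either of
# these on a HEAD response is an assumption; many sites send neither.
VALIDATORS = ("etag", "last-modified")


def has_changed(stored, current):
    """Return True unless every validator present in both header sets matches."""
    stored = {k.lower(): v for k, v in stored.items()}
    current = {k.lower(): v for k, v in current.items()}
    shared = [h for h in VALIDATORS if h in stored and h in current]
    if not shared:
        return True  # nothing to compare, so refetch to be safe
    return any(stored[h] != current[h] for h in shared)


def head_headers(url, timeout=10.0):
    """Issue a HEAD request and return the response headers as a dict."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return dict(resp.headers.items())


def refresh(urls, cache, fetch_head=head_headers, delay_s=5.0, sleep=time.sleep):
    """Return the URLs that need a full GET, pausing between HEAD requests.

    `cache` maps URL -> headers seen on the previous visit and is updated
    in place with the headers seen on this visit.
    """
    stale = []
    for i, url in enumerate(urls):
        if i:
            sleep(delay_s)  # deliberately slow: one request per delay_s
        headers = fetch_head(url)
        if has_changed(cache.get(url, {}), headers):
            stale.append(url)
        cache[url] = headers
    return stale
```

Only the URLs returned by `refresh` would then get a full GET, at the same slow pace. A conditional GET with `If-None-Match` or `If-Modified-Since` is an alternative that achieves the same thing in one round trip.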

Regarding individual site policies, I feel there’s often a “tragedy of the commons” dilemma for any market segment subject to aggregator dominance:

- individual sites often aggressively hide things like pricing information and explicitly disallow crawlers from accessing them

- humans end up having to access these sites manually: as a result, a given site is either not included at all, or is accessed once and never revisited, so the aggregator's data goes stale

- aggregators often outrank individual sites due to better SEO and a likely human preference for aggregators, because they save people research time

- this puts the original site at a competitive disadvantage in SEO, since its product ends up either not listed or listed with outdated/incorrect information

- that sequence of events leads to negative business outcomes, especially for smaller businesses, which often already have a higher chance of failure

Therefore, I believe it’s important to have some sort of standard policy that is implemented and enforced at various levels: CDNs, ISPs, etc.

The policy should be carefully balanced to account for all of these factors, and should have a baked-in mechanism for low-friction amendment as future emergent effects show up.

This would result in a much better internet, one with a built-in check on inequality (in the Gini coefficient sense), so that outcomes are well distributed and optimized for global socioeconomic prosperity as a whole.

Curious to hear others’ perspectives about this idea and how one would even kick off such an ambitious effort.



