Back in 2000, a friend and I built a proof of concept that used a variation of Latent Semantic Analysis to automatically build conceptual maps and compute the loadings of individual words against the latent concept vectors. In exploring what it would take to scale, I concluded, like you, that we should use professionally written and edited content such as books, news articles, and scientific journals as the corpus from which to build the core knowledge graph.
Twenty-four years later, I still regret not being able to raise the money to keep working on that nascent startup. In most ways it was still too early. Google was still burning through VC money at that point, and the midwestern investors we had some access to didn't get it. And honestly, they were probably correct. Compute power was still too expensive, and quality data sources like published text were mostly locked up and generally not available to harvest.