On that data point: I wonder if anyone can comment on how much useful training data we could get out of generating text based on knowledge graphs/databases that we have. You can construct an awful lot of sentences out of just a few facts (e.g. weights of various object classes to generate sentences like: "x's are heavier than y's, but not as heavy as z's"). All the variations would contain the same information (or subsets of it), but the same could be said of lots of text online. Obviously this is an inefficient way to incorporate the databases into a GPT-like model, but it might make sense economically given the race that is now playing out - just shoehorn it in or you'll be left behind (at least in the short term) by those who do. "We can work out how to make it efficient after we're rolling around in cash."
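Roughly the kind of thing I have in mind, as a minimal sketch (the fact table, template strings, and function names here are all made up for illustration):

```python
import itertools
import random

# Hypothetical fact table: average mass in kg for a few object classes.
WEIGHTS = {"bowling ball": 6.0, "house cat": 4.5, "watermelon": 9.0, "brick": 2.7}

# Hypothetical surface templates expressing the same ordering fact in different ways.
TEMPLATES = [
    "A {heavy} is heavier than a {light}.",
    "A {light} weighs less than a {heavy}.",
    "A {light} is lighter than a {heavy}, which weighs about {heavy_kg} kg.",
]

def comparison_sentences(weights, templates):
    """Yield every templated sentence for every ordered pair of facts."""
    for a, b in itertools.permutations(weights, 2):
        if weights[a] <= weights[b]:
            continue  # only emit pairs where 'a' is the heavier object
        for t in templates:
            yield t.format(heavy=a, light=b, heavy_kg=weights[a])

if __name__ == "__main__":
    sentences = list(comparison_sentences(WEIGHTS, TEMPLATES))
    print(random.sample(sentences, 3))
```

Even this toy version makes the combinatorics obvious: n facts and k templates give you on the order of n² · k sentences, all restating the same handful of relationships.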
The knowledge databases could also be used to generate what would essentially be the "word problems" you see in math classes, starting with simple things like "If I put three marbles in a cup, and then I take one out, and each marble weighs 20g, then the remaining marbles weigh 40g in total" and moving on to progressively more complex ones.
If that were to happen, then you'd see companies employing people to create templates that convert databases into sentences and paragraphs, which can then be consumed by the GPT-like model (a rough sketch of what such a template might look like is below).
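A hedged sketch of such a template, using the made-up "marbles in a cup" family from above (the numbers are drawn at random, and the function name is hypothetical):

```python
import random

def marble_problem(rng):
    """Generate one arithmetic word problem, with the answer stated inline."""
    start = rng.randint(2, 9)             # marbles put in the cup
    removed = rng.randint(1, start - 1)   # marbles taken back out
    weight = rng.choice([5, 10, 20, 25])  # grams per marble
    total = (start - removed) * weight
    return (
        f"If I put {start} marbles in a cup, and then I take {removed} out, "
        f"and each marble weighs {weight}g, then the remaining marbles weigh "
        f"{total}g in total."
    )

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(marble_problem(rng))
```

The point isn't that any one template is clever; it's that a small team writing templates over existing databases could churn out essentially unlimited, factually grounded text.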
It seems like this data would need to be used in a sort of pre-training step, though, because you want the model to encode all the relationships, but you don't want it to learn to generate these specific kinds of templated sentences.