Hacker News

Shouldn't competitors just clone the repositories locally and train their models on them, instead of relying on a search API, which probably costs more computational resources?



You could, yes, and many probably are, but you then have to git pull on all of those clones if you want to know, say, which LLM libraries are currently trending, or how quickly PopularLib v0.2 is being adopted in codebases related to Y, etc.
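To make the maintenance burden concrete, here's a minimal sketch of keeping a directory of local clones fresh. The `mirrors/` layout, the function names, and the `--ff-only` policy are all assumptions for illustration, not anyone's actual pipeline:

```python
import subprocess
from pathlib import Path

# Hypothetical layout: one local clone per tracked repo under mirrors/.
MIRROR_ROOT = Path("mirrors")

def pull_command(repo_dir):
    """Build the `git pull` invocation for one local clone."""
    return ["git", "-C", str(repo_dir), "pull", "--ff-only"]

def refresh_mirrors(root=MIRROR_ROOT):
    """Run `git pull` in every clone under `root`, tolerating failures."""
    for repo_dir in sorted(p for p in root.iterdir() if (p / ".git").exists()):
        subprocess.run(pull_command(repo_dir), check=False)
```

Even this toy version has to touch every repo on every refresh, which is exactly the cost a hosted search API amortizes for you.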

IMO it's much less about legacy code (terabyte-sized datasets covering much of GitHub already exist) and MUCH more about how up to date your LLM/AI is with new repos, "best practices" (or at least most common practices), etc.

Plus you often get "LLM code poisoning" from older training data, as the model tries to use functionality that has seen breaking changes relative to the current stable release. Current is king.
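A concrete instance of this from the Python standard library (chosen here purely as an illustration): `from collections import Iterable` appears all over pre-2021 code, but those ABC aliases were removed in Python 3.10, so a model parroting old training data emits an import that no longer works:

```python
import sys

# Stale pattern common in old training data; removed in Python 3.10
# after a long deprecation period:
#   from collections import Iterable   # ImportError on 3.10+
# The current location is collections.abc:
from collections.abc import Iterable

assert isinstance([1, 2, 3], Iterable)

if sys.version_info >= (3, 10):
    try:
        from collections import Iterable  # what an out-of-date model might emit
    except ImportError:
        pass  # breaks exactly as described: old idiom vs. current stdlib
```

The same pattern plays out with any library whose API churned between a model's training cutoff and today.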

Also, there's the whole goldmine of GitHub Discussions, Issues, etc. that a bare clone of the repo just... doesn't include.

Right now you can still index those fairly easily (though IIRC they sometimes ban datacenter IPs), but they may eventually fall behind a loginwall too.
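For now, that per-repo metadata is reachable through the public GitHub REST API (`GET /repos/{owner}/{repo}/issues`); a rough sketch, with unauthenticated requests and no pagination handling, so rate limits apply quickly:

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"

def issues_url(owner, repo, state="all", per_page=100):
    """Build the list-issues endpoint URL for one repository."""
    return (f"{API_ROOT}/repos/{owner}/{repo}/issues"
            f"?state={state}&per_page={per_page}")

def fetch_issues(owner, repo):
    """Fetch one page of a repo's issues; returns a list of dicts."""
    req = urllib.request.Request(
        issues_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

If this data ever moves behind a login, sketches like the above stop working entirely, which is the concern raised here.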



