There are 20M+ repos on there. Which leads to a serious discoverability + signal/noise problem around 'true' open source projects. As someone who wants to put up an open source project for collaboration (with willingness to maintain), it's hard to get the right eyes on it. Too often, it seems like no one cares. Conversely, as someone who wants to contribute to open source projects, it's a little hard to find small early ones where I feel like I can make a meaningful contribution (yes, the big ones maintained by well known companies are easy to find, but I much rather work on something put up by an individual programmer).
How do you cut through the noise on GitHub to find projects you actually want to work with (and want you too)?
Having access to the data set itself could unlock a lot of new, creative ideas and applications beyond the expected ones. That's one great thing I've learned from the open source community.
It could not only be used for search, but some data analysis, and what not. I think it would be fairly beneficial for github to do it, actually. Easy to work with, up to date dataset -> interesting projects -> github brand value++.
Or go the other way, and train a searching algorithm based on a data set of actively maintained open source projects on Github, with things like commit history/contributors/etc as features.