I admin a dozen FOSS repos for my dot-org away from Microsoft-Github; I spun up Sourcegraph but it "phones home" so much that it made me nervous, and I uninstalled it. It seems to assume a high-bandwidth, always-on sort of world, and who-knows-what-else it does because it was certainly a black box.
Overall: not safe for privacy or off-beat FOSS "In My Opinion"
Hey, I'm the CTO/co-founder of Sourcegraph. Thanks for the feedback here. I've opened up a pull request to make it clearer how to disable non-critical telemetry: https://github.com/sourcegraph/sourcegraph/pull/24171. Please let us know if you have any feedback on that!
Will respond to the OP's concerns, which I agree are important, in a direct reply.
Thanks for responding here, in the pull request, and committing to respond to the parent poster. That's more than required, and welcome.
In general I reckon a lot of us in tech believe that our e-mail addresses "are out there anyway", and as such we start to think that it's reasonable to opt-in collection of personal information on the behalf of others (per the Critical Telemetry[1] section) without their full consent.
I don't personally think that's OK behaviour. Good products grow and are shared by word-of-mouth and network effects over time when they're genuinely useful (as I think Sourcegraph is), and I'd debate whether there's greater overall value in silently transmitting e-mail addresses (something that many people will only learn about at a later date) versus the potential privacy and reputation costs (such as could arise from conversations like this).
There may be some kind of argument that the information collection's required in order to send security and policy update notices; I'm uncertain about that: it's honest and useful to announce relevant information to the public when ready, but some consumers may wish to stay current on those themselves rather than be (unwittingly, at least) added to push-based messaging.
These would probably be considerations you'd have to reconcile not only with your own codebase and perspective, but also with your colleagues and peers, and I understand that friction - this is just honest feedback from my perspective.
That's something to be aware of (and IMO phoning home should be called out in some extremely obvious fashion in the readme or getting started docs), but for people who care it's pretty easy in today's world to build a version with the telemetry ripped out or to just disable external network access for Sourcegraph at the OS level.
Yes, but no: git is a networked system, and updating git repos is exactly what is happening, so neither "just disable external network access" nor "build a version with the telemetry ripped out" is practical.
I feel like I'm being baited into Cunningham's Law [0].
> git is a networked system
Git has the potential to be part of a distributed system. It is not inherently networked. Most of Sourcegraph's features don't need any network connectivity at all, and most of the rest can get by with severely restricted access. Check out man unshare [1].
> nor build a version with telemetry ripped out
Of course that's possible! Telemetry is centralized in remarkably few places in the code, and nothing in the license prohibits that sort of thing. Replace the telemetry with no-ops and build it....
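To make the "replace the telemetry with no-ops" idea concrete, here is a minimal sketch of the pattern. It is not Sourcegraph's code (Sourcegraph is written largely in Go); `TelemetryClient`, `NoOpTelemetry`, `send_ping`, and `report_usage` are hypothetical names made up for illustration. The point is only that when callers go through one narrow interface, swapping in a stub that never touches the network is a small change.

```python
from typing import Protocol


class TelemetryClient(Protocol):
    """Hypothetical interface that every telemetry call site goes through."""

    def send_ping(self, payload: dict) -> bool:
        ...


class NoOpTelemetry:
    """Drop-in replacement that discards every ping instead of sending it."""

    def __init__(self) -> None:
        self.dropped = 0  # how many pings were discarded, for local auditing

    def send_ping(self, payload: dict) -> bool:
        # Nothing leaves the machine; report that nothing was sent.
        self.dropped += 1
        return False


def report_usage(client: TelemetryClient, user_count: int) -> bool:
    """Hypothetical call site: builds a ping and hands it to the client."""
    return client.send_ping({"users": user_count})


if __name__ == "__main__":
    client = NoOpTelemetry()
    sent = report_usage(client, user_count=3)
    print(f"sent={sent} dropped={client.dropped}")
```

How much work this is in a real codebase depends on how many call sites bypass the central interface, which this sketch does not try to answer.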
Sourcegraph CTO here, thanks for posting this feedback.
Yes, we do collect information from self-hosted instances of Sourcegraph. (Note to other readers: the blog post is talking about sourcegraph.com whereas what we are discussing here is running a standalone instance of Sourcegraph.) Here is what that info is:
(1) High-level "ping" data (https://docs.sourcegraph.com/admin/pings#critical-telemetry) that includes the email address you put in during installation, the version of Sourcegraph that's running, an aggregate count of users, total codebase size.
No information about individual user identities, behavior, repositories, or code is sent outside the Sourcegraph instance, unless you explicitly enable a feature (like code monitoring alerts) that does so.
We collect this data in order to sell our software to companies that use it. We have no plans to charge for Sourcegraph for open-source development, nor will we sell individual user data (which we don't collect).
We've made the decision to make (1) mandatory, because we haven't figured out a better way to ensure we have reasonable awareness of which companies are running Sourcegraph. We do try to keep this data as high-level/aggregate/non-invasive as possible, but I also understand that some might not want to send any data. For folks that fall into that camp, there are a number of other code search engines that are great: Livegrep, Zoekt, Hound, and OpenGrok are ones I'd recommend. If you think the feature set of Sourcegraph is cool and want to use it strictly for personal or open-source development while disabling all telemetry, feel free to reach out (beyang@sourcegraph.com) and I'd be happy to do what I can to make this happen.
Of course, if you don't want to run your own instance of Sourcegraph and are okay with using a cloud service, you can also add your repositories directly to sourcegraph.com, where they can be discovered and used by anyone from a single search box :)
If folks have any questions or feedback on the above, I'd love to hear it!
Sourcegraph have been sending unsolicited emails to my personal and work email addresses, and there's no way I signed up to any marketing on either of them. Very dodgy practices going on.
Sourcegraph CTO here. I'm very sorry this happened to you. If you shoot me your email (I'm reachable at beyang@sourcegraph.com), I'll ensure you don't receive any future sales/marketing emails from us.
This is something we should be clearer and more upfront about. We collect emails for sales purposes from self-hosted Sourcegraph instances, but I agree that entering your email as part of the installation process doesn't mean you want us to email you directly. Going to fix this!
Same here. While I was at my previous job they contacted me, and since I already knew the product I really wanted to use it.
I engaged with them because I wanted to convince my management to pay for the damn thing instead of running some crappy Hound instance. But you know how those things go: management didn't want to invest time in running the trial, and nobody could just install this in their spare time, because the very fact that we started this engagement with the trial program effectively turned fun into work, and so nothing happened.
That's why open source wins. Nobody is pushing it down your throat. You have a problem, you find a solution, and you have the incentive to follow through on your own drive and just use the best tool you found. Some companies have understood that and monetize support and other things that happen long after you've been hooked on the free and open product.
I'm not sure Sourcegraph can't do that as well, but I'm sure they have their good reasons. Business is hard, I'm just an engineer.
I'm sharing this perspective just because, when you sell stuff to engineers, you should know how engineers think.
Yes, some products for engineers are sold to managers/execs instead; you can tell because engineers hate to use those, but they usually have no choice.
Sourcegraph is a pleasure to use. I'm regularly impressed by the snappiness of the UI. No wonder they're trying to sell it to engineers, it seems to have been indeed built for engineers!
But yeah, their outreach strategy can feel a bit invasive and sloppy.
> By the end of the year, we will have grown our index to include every open source repository with more than 1 star on GitHub and GitLab (that’s 5M+ repos).
And what if your source code has a license that says "Do as you like, as long as you're not Sourcegraph"? I'm not entirely sold on the legalities of what they are doing...
With search engines you have your robots file and many other ways to opt out of indexing, I wonder what Sourcegraph have. My guess is nothing.
> "Do as you like, as long as you're not Sourcegraph"
It wouldn't matter that this provision is in the license that you offer to your users, because when you post your code publicly on GitHub you agree to their TOS which includes a section called "License Grant to Other Users". This grants all GitHub users, including Sourcegraph, the right to "use, display, and perform" your code. Even if this were not true, it has been claimed (I'm not qualified to judge whether it's correct) that fair use covers this type of usage even if there is no license.
License files (and copyright) aren’t omnipotent. I understand why you might feel the ethics lie in a different direction, but the established legal precedent is that parsing and building search indexes or neural nets is generally fair game regardless of what the copyright holder wants.
You mentioned robots.txt - there’s no legal enforcement behind that either. Notably, the Internet Archive bot ignores it completely.
Are you thinking along the lines of GitHub Copilot with this concern, out of interest? Or are there other scenarios where you think people would want their code to be public, but not indexed/crawled?
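On the robots.txt point, it may help to see why it is purely a courtesy: the check happens inside the crawler, and only if the crawler chooses to run it. Below is a minimal sketch using Python's standard `urllib.robotparser`. The user agent `ExampleCodeIndexer`, the rules, and the example.org URLs are all made up for illustration, and nothing here describes what Sourcegraph's indexer actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named indexer is kept out of /private/,
# everyone else is allowed everywhere.
ROBOTS_TXT = """\
User-agent: ExampleCodeIndexer
Disallow: /private/

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse from a string so the sketch needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleCodeIndexer", "OtherBot"):
        for path in ("/private/repo", "/public/repo"):
            url = "https://example.org" + path
            print(agent, path, may_fetch(ROBOTS_TXT, agent, url))
```

A crawler that never calls something like `may_fetch` sees no difference at all, which is the whole problem with relying on it as an opt-out.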