Sourcegraph: Why we're indexing the OSS universe

mistrial9 · on Aug 19, 2021

I admin a dozen FOSS repos for my dot-org away from Microsoft-Github; I spun up Sourcegraph but it "phones home" so much that it made me nervous, and I uninstalled it. It seems to assume a high-bandwidth, always-on sort of world, and who-knows-what-else it does because it was certainly a black box.

Overall: not safe for privacy or off-beat FOSS "In My Opinion"

jka · on Aug 19, 2021

Thanks - I wasn't aware of that before submitting this story, and it's a good concern to have.

It led me to https://docs.sourcegraph.com/dev/background-information/tele... and https://docs.sourcegraph.com/admin/pings. Neither mentions opt-in or opt-out settings.

Upvoting you to raise awareness.

beyang · on Aug 19, 2021

Hey, I'm the CTO/co-founder of Sourcegraph. Thanks for the feedback here. I've opened up a pull request to make it clearer how to disable non-critical telemetry: https://github.com/sourcegraph/sourcegraph/pull/24171. Please let us know if you have any feedback on that!

Will respond to the OP's concerns, which I agree are important, in a direct reply.

jka · on Aug 19, 2021

Thanks for responding here, in the pull request, and committing to respond to the parent poster. That's more than required, and welcome.

In general I reckon a lot of us in tech believe that our e-mail addresses "are out there anyway", and as such we start to think that it's reasonable to opt in to the collection of personal information on behalf of others (per the Critical Telemetry[1] section) without their full consent.

I don't personally think that's OK behaviour. Good products grow and are shared by word-of-mouth and network effects over time when they're genuinely useful (as I think sourcegraph is), and I'd debate whether there's greater overall value in silently transmitting e-mail addresses (something that many people will only learn about at a later date) versus the potential privacy and reputation costs (such as could arise from conversations like this).

There may be some kind of argument that the information collection's required in order to send security and policy update notices; I'm uncertain about that: it's honest and useful to announce relevant information to the public when ready, but some consumers may wish to stay current on those themselves rather than be (unwittingly, at least) added to push-based messaging.

These would probably be considerations you'd have to reconcile not only with your own codebase and perspective, but also with your colleagues and peers, and I understand that friction - this is just honest feedback from my perspective.

[1] - https://github.com/sourcegraph/sourcegraph/blob/66ce1f814946...

hansvm · on Aug 19, 2021

That's something to be aware of (and IMO phoning home should be called out in some extremely obvious fashion in the readme or getting started docs), but for people who care it's pretty easy in today's world to build a version with the telemetry ripped out or to just disable external network access for Sourcegraph at the OS level.

mistrial9 · on Aug 19, 2021

yes, but no - git is a networked system and updating git repos is what is happening.. so neither "just disable external network access" nor "build a version with the telemetry ripped out" is practical

Sourcegraph CTO responded below in detail

hansvm · on Aug 20, 2021

I feel like I'm being baited into Cunningham's Law [0].

> git is a networked system

Git has the potential to be part of a distributed system. It is not inherently networked. Most of Sourcegraph's features don't need any network connectivity at all, and most of the rest can get by with severely restricted access. Check out man unshare [1].
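
As a minimal sketch of the network-isolation idea, assuming a Linux host with util-linux `unshare` installed and unprivileged user namespaces enabled: the helper below only builds and runs a wrapped command. The function names and the `ip link` command are placeholders for illustration, not Sourcegraph's actual launch command.

```python
import shlex
import shutil
import subprocess


def isolated_command(cmd):
    """Prefix cmd so it starts in a new, empty network namespace.

    --map-root-user lets an unprivileged user create the namespace;
    --net gives the child its own network stack (loopback only, and down).
    """
    return ["unshare", "--map-root-user", "--net", *cmd]


def run_isolated(cmd):
    """Run cmd with no external network access (Linux only)."""
    if shutil.which("unshare") is None:
        raise RuntimeError("unshare not found; this sketch assumes util-linux")
    return subprocess.run(isolated_command(cmd), check=False)


if __name__ == "__main__":
    # Placeholder command: print the wrapped invocation without running it.
    print(shlex.join(isolated_command(["ip", "link"])))
```

This cuts the process off from the network entirely, so anything that fetches remote repositories would need a narrower setup (for example, a namespace with access to a single host), which is not shown here.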

> nor build a version with telemetry ripped out

Of course that's possible! Telemetry is centralized in remarkably few places in the code, and nothing in the license prohibits that sort of thing. Replace the telemetry with no-ops and build it....
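
A rough illustration of what "replace the telemetry with no-ops" means in general, written in Python rather than Sourcegraph's own language. Every name here (`TelemetrySink`, `NoOpTelemetry`, `report_startup`) is hypothetical and does not correspond to anything in the Sourcegraph codebase; the point is only that when reporting goes through one narrow interface, swapping in an implementation that discards events is a small change.

```python
from typing import Any, Mapping, Protocol


class TelemetrySink(Protocol):
    """Hypothetical interface that all reporting code goes through."""

    def send(self, event: str, payload: Mapping[str, Any]) -> None: ...


class NoOpTelemetry:
    """Drop every event: nothing is serialized and nothing leaves the process."""

    def __init__(self) -> None:
        self.dropped = 0  # counted only so the behaviour is observable

    def send(self, event: str, payload: Mapping[str, Any]) -> None:
        self.dropped += 1


def report_startup(sink: TelemetrySink, version: str) -> None:
    """Hypothetical call site; it neither knows nor cares which sink it gets."""
    sink.send("startup", {"version": version})


sink = NoOpTelemetry()
report_startup(sink, "0.0.0-example")
print(sink.dropped)
```

Call sites stay untouched; only the object handed to them changes.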

[0] https://meta.wikimedia.org/wiki/Cunningham%27s_Law

[1] https://man7.org/linux/man-pages/man1/unshare.1.html

beyang · on Aug 19, 2021

Sourcegraph CTO here, thanks for posting this feedback.

Yes, we do collect information from self-hosted instances of Sourcegraph. (Note to other readers: the blog post is talking about sourcegraph.com whereas what we are discussing here is running a standalone instance of Sourcegraph.) Here is what that info is:

(1) High-level "ping" data (https://docs.sourcegraph.com/admin/pings#critical-telemetry) that includes the email address you put in during installation, the version of Sourcegraph that's running, an aggregate count of users, total codebase size.

(2) Additional telemetry (https://docs.sourcegraph.com/admin/pings#other-telemetry) that includes aggregate usage, latencies, and product actions (e.g., which features are in use, progress through onboarding). This can be disabled in config: https://docs.sourcegraph.com/admin/config/site_config#disabl...

No information about individual user identities, behavior, repositories, or code is sent outside the Sourcegraph instance, unless you explicitly enable a feature (like code monitoring alerts) that does so.

We follow the open-core model and our enterprise-licensed code is also public. You can use Sourcegraph to explore the source code of Sourcegraph and see how telemetry is implemented: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-....

We collect this data in order to sell our software to companies that use it. We have no plans to charge for Sourcegraph for open-source development, nor will we sell individual user data (which we don't collect).

We've made the decision to make (1) mandatory, because we haven't figured out a better way to ensure we have reasonable awareness of which companies are running Sourcegraph. We do try to keep this data as high-level/aggregate/non-invasive as possible, but I also understand that some might not want to send any data. For folks that fall into that camp, there are a number of other code search engines that are great: Livegrep, Zoekt, Hound, and OpenGrok are ones I'd recommend. If you think the feature set of Sourcegraph is cool and want to use it strictly for personal or open-source development while disabling all telemetry, feel free to reach out (beyang@sourcegraph.com) and I'd be happy to do what I can to make this happen.

Of course, if you don't want to run your own instance of Sourcegraph and are okay with using a cloud service, you can also add your repositories directly to sourcegraph.com, where they can be discovered and used by anyone from a single search box :)

If folks have any questions or feedback on the above, I'd love to hear it!

mistrial9 · on Aug 19, 2021

OP here - thx Beyang, I am reading this now

AltF4me · on Aug 19, 2021

Sourcegraph have been sending unsolicited emails to my personal and work email addresses, and there's no way I signed up to any marketing on either of them. Very dodgy practices going on.

ushakov · on Aug 19, 2021

Can confirm, same thing happened to me and i’ve never signed up for any newsletter (and i always make sure not to)

beyang · on Aug 19, 2021

Sourcegraph CTO here, I'm very sorry this happened to you. If you shoot me your email (I'm reachable at beyang@sourcegraph.com), I'll ensure you don't receive any future sales/marketing emails from us.

This is something we should be clearer and more upfront about. We collect emails for sales purposes from self-hosted Sourcegraph instances, but I agree that entering your email as part of the installation process doesn't mean you want us to email you directly. Going to fix this!

ithkuil · on Aug 19, 2021

Same here, while at my previous job, they contacted me and since I knew the product already I really wanted to use it.

I engaged with them because I wanted to convince my management to pay for the damn thing instead of running some crappy hound instance. But you know how those things go, management didn't want to invest time in running the trial, nobody could just install this on their spare time because the very fact that we started this engagement with the trial program effectively turned fun into work and so nothing happened.

That's why opensource wins. Nobody is pushing it down your throat. You have a problem, you find a solution, you have the incentive to actually follow through your own drive and just use the best tool you found. Some companies have understood that and monetize on support and other things that happen long after you've been hooked to the free and open product.

Not sure sourcegraph can't do that as well, but I'm sure they have their good reason. Business is hard, I'm just an engineer.

I'm sharing this perspective just because, when you sell stuff to engineers, you should know how engineers think.

Yes, some products for engineers are sold to managers/execs instead; you can tell because engineers hate to use those, but they usually have no choice.

Sourcegraph is a pleasure to use. I'm regularly impressed by the snappiness of the UI. No wonder they're trying to sell it to engineers, it seems to have been indeed built for engineers!

But yeah, their outreach strategy can feel a bit invasive and sloppy.

ushakov · on Aug 19, 2021

Sourcegraph is literally pushing their trial and upgrade down your throat, once you have the open source edition

Then they send you emails from a mailing list you didn’t sign up for:

Thanks for installing Sourcegraph

Hello- Mark here with the Sourcegraph Team!

I've shared a couple of links below to help you get started with Sourcegraph. https://docs.sourcegraph.com/getting-started https://docs.sourcegraph.com/integration

Out of curiosity, how did you hear about Sourcegraph, and is there a specific use case you're evaluating Sourcegraph for?

Best, Mark Muldez — DevTools Advocate

unicodeveloper · on Aug 19, 2021

Hi, apologies for the invasiveness. I'm glad you are impressed with Sourcegraph and we're working hard to make the experience better.

unicodeveloper · on Aug 19, 2021

Hello,

Sorry about this. How can I reach you privately? I need your email addresses so that we can remove them from our communication line moving forward.

PS: I work at Sourcegraph.

bArray · on Aug 19, 2021

> By the end of the year, we will have grown our index to include every open source repository with more than 1 star on GitHub and GitLab (that’s 5M+ repos).

And what if your source code has a license that says "Do as you like, as long as you're not Sourcegraph"? I'm not entirely sold on the legalities of what they are doing...

With search engines you have your robots file and many other ways to opt out of indexing, I wonder what Sourcegraph have. My guess is nothing.

electroly · on Aug 20, 2021

> "Do as you like, as long as you're not Sourcegraph"

It wouldn't matter that this provision is in the license that you offer to your users, because when you post your code publicly on GitHub you agree to their TOS which includes a section called "License Grant to Other Users". This grants all GitHub users, including Sourcegraph, the right to "use, display, and perform" your code. Even if this were not true, it has been claimed (I'm not qualified to judge whether it's correct) that fair use covers this type of usage even if there is no license.

542458 · on Aug 19, 2021

License files (and copyright) aren’t omnipotent. I understand why you might feel the ethics lie in a different direction, but the established legal precedent is that parsing and building search indexes or neural nets is generally fair game regardless of what the copyright holder wants.

You mentioned robots.txt - there’s no legal enforcement behind that either. Notably, the Internet Archive bot ignores it completely.
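
For context on how the robots.txt opt-out works mechanically, here is a small sketch using Python's standard `urllib.robotparser`. The bot name `examplebot` and the URL are made up for illustration. The check runs on the crawler's side, so it only has an effect if the crawler chooses to perform it and honour the result.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: examplebot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.org/repo/README.md"
print(allowed("examplebot", url))
print(allowed("otherbot", url))
```

Nothing in the protocol stops a crawler from skipping this check, which is the sense in which it has no enforcement behind it.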

jka · on Aug 19, 2021

Are you thinking along the lines of GitHub co-pilot with this concern, out of interest? Or are there other scenarios where you think people would want their code to be public, but not indexed/crawled?

markl42 · on Aug 19, 2021

Big fan of sourcegraph, we use it at work. super excited to see this happen.

(I've been using grep.app until now which is awesome too!)

unicodeveloper · on Aug 19, 2021

Thanks for the great feedback.

asy22 · on Aug 19, 2021

If you're looking for a desktop application you can run locally, https://www.sourcetrail.com/ is quite nice.

c17r · on Aug 19, 2021

There's https://github.com/livegrep/livegrep for your own personal stuff.