Sourcegraph Server 2.4: free, powerful search for private code (sourcegraph.com)
188 points by azmenak 6 months ago | 82 comments

For people interested in searching code using an open source solution, you might be interested in Zoekt too, github.com/google/zoekt.

There is a demo site where you can search 30G of source code (including the Linux kernel, Android and Chrome) supporting regular expressions, and file name search:


For example, https://cs.bazel.build/search?q=+r%3Atorvalds+craz%5Byi%5D&n... looks for craz[iy] across the Linux kernel.
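
To make the pattern concrete, here is a minimal Python sketch of what a character class like craz[iy] matches. The sample lines are made up for illustration and are not taken from the Linux kernel, and Python's `re` module stands in for whatever regex engine the demo site actually uses.

```python
import re

# The character class [iy] matches exactly one character, either "i" or "y",
# so the pattern finds both "crazy" and "crazi" (as in "craziness").
PATTERN = re.compile(r"craz[iy]")


def matching_lines(lines):
    """Return the lines that contain at least one match for PATTERN."""
    return [line for line in lines if PATTERN.search(line)]


# Hypothetical sample input, only for illustration.
sample = [
    "/* this is crazy, but it works */",
    "pure craziness ahead",
    "nothing to see here",
    "craze is not matched",
]

for line in matching_lines(sample):
    print(line)
```

Only the first two sample lines are printed; "craze" has an "e" after "craz", so it falls outside the class.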

Zoekt looks very interesting. We'll consider adding it to GitLab in https://gitlab.com/gitlab-org/gitlab-ce/issues/41925 and https://gitlab.com/gitlab-org/gitlab-ce/issues/41450. Please comment in those issues if you have thoughts on whether we can use it.

BTW I know you're in Munich now but I wanted to say hi from your hometown Utrecht.

Sourcegraph CEO here. Thanks for this post. We packed a lot of stuff into 2.4: faster, more powerful code search, Google Alerts-style search monitoring, diff searches, and more.

It’s now free on a single server for any number of users and repositories.

Happy to answer questions here.

What benefits does it have over git grep, esp. if I'm using a monorepo? What new patterns/possibilities does it enable? Is it maybe speed somehow? (I assume it could then be more "live search/exploration"/rapid exploration than git grep - but OTOH wouldn't it require some slow reindexing after each change?)

For code search on a monorepo, Sourcegraph is often faster in UX and execution time for a lot of tasks. It's easier/faster to filter the results than `git grep`, you can see more on your screen, it's easier to jump to the full file, it's easier to see blame info for particular lines, etc.

Sometimes while coding you just need to find where something is so you can edit it or jump to it. In that case, your editor's search or `git grep` is definitely better. But when you're looking for example code, reviewing/reading code, or debugging code, it's often better to do it in a UI that's more optimized for those tasks than `git grep` and your editor.

And then Sourcegraph also has code intelligence, code host browser extension integrations, saved queries, etc., beyond the basic code search.

Google has a massive monorepo, and they have a similarly advanced code search system that they describe publicly. It's very well loved and frequently used by their developers. If you know any ex-Googlers (or ex-Facebookers, who have a similar system), ask them, and check out https://static.googleusercontent.com/media/research.google.c... and https://docs.google.com/document/d/1LQxLk4E3lrb3fIsVKlANu_pU....

BTW, Sourcegraph doesn't use an index for search. We heavily optimized the performance of searching an arbitrary revision that has never been indexed. So no slow reindexing after each change.

Just tried this out on some of our code bases. It appears to fail to generate snippets/highlights for all files that contain non-Unicode text, e.g. "Müller" in ISO-8859-1. Known issue?
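
For anyone wondering why such files are awkward to handle, here is a minimal Python sketch of the underlying problem: bytes written as ISO-8859-1 are often not valid UTF-8, so a tool that assumes UTF-8 has to fall back to something else. The function name `decode_with_fallback` is hypothetical, and this is not a description of how Sourcegraph decodes files.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to ISO-8859-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so this cannot raise.
        return raw.decode("iso-8859-1")


# "ü" is the single byte 0xFC in ISO-8859-1, which is not valid UTF-8.
raw = "Müller".encode("iso-8859-1")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict UTF-8 decoding fails:", err.reason)

print(decode_with_fallback(raw))
```

The fallback is a guess rather than detection: any byte sequence decodes as ISO-8859-1, so text in another legacy encoding would come out garbled rather than raise an error.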

Many queries time out for me although I'm running it on a pretty beefy AWS c5 instance with SSDs. Queries such as "type:diff" don't seem to work at all on my code bases. It also does not appear to cache any data from previous runs of "git log", so attempting the suggested reload doesn't really work around the issue. Are you working on improving the performance?

Sorry to hear that it didn't work out of the box for your repositories. I've filed the non-Unicode text issue here: https://github.com/sourcegraph/issues/issues/32

We're actively working on improving the performance of diff search, but I would expect other types of queries to complete quickly. Would you mind sharing more about the size / characteristics of your repositories? Feel free to email me at beyang@sourcegraph.com if a private channel is better.

Thanks, I'll keep an eye on this.

Three examples: 300MB size, 600 branches, 25k commits. 250MB, 250 branches, 15k commits. 80MB, 100 branches, 4k commits. Textwise it is a mix of Golang, JS and Python. Most of the repo size comes from binary resources (images etc).

> Google Alerts-style search monitoring

I expected email notifications but all I managed to do was to add the saved queries on the home page. Am I missing something?

Email and other kinds of notifications for saved queries are coming in the next release in early Feb. Email me (sqs@sourcegraph.com) if you'd like to preview them sooner. I agree they are crucial for this feature to feel truly complete and awesome.

The homepage does show a nice sparkline and results summary, though. Easy to see at a glance if new secret keys, deps, etc., are added to your repositories if you set up the queries.

For those of us associated with very many repos is there a way to limit cloning to only the ones we work with?

What's the analytics/telemetry feature? I don't see it discussed on the site at all.

Analytics lets you see statistics about how your own server's users are using it (each user's total count of pageviews and searches).

Telemetry lets you see the telemetry data it sends to Sourcegraph (which you can disable in the site config and never contains code/paths/repo names or anything derived from them).

Can this be run without docker, somehow? It's nigh-unusable on OS X with it.

It requires Docker for now. A lot of people use it on macOS successfully. What are you seeing?

Installed it, configured it, pointed it at the public cpython repo, it said 'cloning' and then pegged cpu for about 15 minutes (with about a third to a half of the usage in 'system') and eventually kernel-panicked.

Sorry about that. What version of macOS and Docker for Mac? We’ll look into that now.

OS 10.13.3 (17D34a) and Docker 17.12.0-ce-mac46 (21698). I don't think it's you, Docker is just still a bit flaky there, which is why I'm asking. I'll try it with something smaller.

Thanks. Theoretically it'd be possible for us to ship a static Go binary for everything except for our syntax highlighter and the (very convenient) bundled PostgreSQL and Redis. We'll monitor the feedback we get and see how to prioritize this. It definitely helps to hear that Docker did not work well for you.

I've heard that Docker does have an (inconsistent) CPU usage issue on OS X. I would definitely pin it on Docker.

Docker for Mac has known issues with file system. You may want to try docker-sync for the volumes.

Do you have a non-docker installation? The equivalent of gitlab's omnibus would be awesome.

Particularly for things like postgres configuration etc.

We'll definitely consider it. What benefits would you get from having non-Docker? (Not saying there aren't any, just curious what your biggest needs are.)

Re: PostgreSQL configuration, is it that you want to be able to manage and back up the data yourself (not using the Docker container's internal PostgreSQL), or is it a tuning/performance concern, or something else?

BTW, someone else down-thread asked for this, too (https://news.ycombinator.com/item?id=16121182).

unrelated question to the server but related to Sourcegraph: Why did you guys switch away from the VS code style editor on the web to an uneditable one? I loved using it.

on an unrelated note: did you work on AWS SQS?

Do you have any publicly available security review documents?

Our security page is at https://about.sourcegraph.com/security. We have a security assessment that we can share with customers, but not one that we post publicly yet.

We have customers who run Sourcegraph on machines that are completely blocked off from the Internet and only have access to the specific IP ranges of their code hosts on the same network. You can set it up like that if you'd like, which would significantly reduce the risks without needing to trust any third parties (us or the security reviewer).

Do you support mercurial? If not, is it planned?

Sourcegraph only supports Git natively, not Mercurial. You could use Sourcegraph with Git mirrors of your Mercurial repository, if that is appealing, and we'd definitely consider adding in some extra translation work so that the Mercurial metadata embedded in the Git mirror repository would be respected. Does your code host (or do you internally) already have a Git mirror of your repository?

I wonder if the work done for git-cinnabar, which maps a Mercurial repository to something git understands, could help at all here?


Do you accept REMOTE jobs?

Yes, we have some fantastic international and non-SF-based teammates at Sourcegraph, and we'd love to have more.

non US?

Yes. I was wrong in my previous answer - after sqs's response, I looked them up on LinkedIn, and they have at least one German developer in Berlin.

(FWIW - and I'm only saying this in case you were in a similar situation: I applied to them for a job not because I was looking for one, but because I accidentally saw it on HN and the match between my skills and their apparent need was simply "too good to be true" territory. I wasn't necessarily expecting an offer, but I expected to talk to someone - was curious to learn more about what they're doing. However, I got rejected straight away - so I just assumed that they said "REMOTE" for the heck of it... I know it sounds arrogant, but I have a hell of a hard time believing their other applications outclass me so obviously that it was not even worth talking to me, so I assumed it must be something else)

what's the underlying search engine?

How does this differ from OpenGrok https://oracle.github.io/opengrok/ ?

Wow, that's a pretty awful UI. The default search form screams "advanced search" from the late 90ies. The compare example looks pretty dated too. I think Kibana and Sourcegraph are on the right track with a single input field that accepts field:value type searches. They're great once you've learned them.

You can see a feature comparison table here: https://about.sourcegraph.com/products/code-search/

Let us know if you have any other questions or feedback!

Opengrok supports C/C++.

Code search on Sourcegraph Server is actually language agnostic.

Here's an example of a C++ search query: https://sourcegraph.com/search?q=repo:google/leveldb+FilterB...

Opengrok implements hyperlinking for C/C++ code which is the primary productivity multiplier since it allows you to easily jump around callgraphs. That functionality is sorely missing here (unless i am missing this feature somehow).

Very cool, will give it a try later for our gitlab repos. Is gitlab supported?

Also, I cannot find the code for sourcegraph on github. It used to be available under a fair source license. Anyone have a link to the code?

Yes, Sourcegraph supports GitLab repositories! Check out https://about.sourcegraph.com/docs/server/config/repositorie... and the section right below for auth. You'll need to add and authenticate them one-by-one in the config. Soon we'll add direct GitLab integration like we have for GitHub and GitHub Enterprise, which will sync all (or selected) repositories using the GitLab API.

The source code is not public for this version. I think that source-available but non-open-source licenses are an idea ahead of their time when applied to user-facing software like Sourcegraph. I hope that changes, and we'd love to make Sourcegraph source-available again, but it actually introduced (rather than eliminated) questions in the process of companies adopting Sourcegraph. I'll probably blog about this soon because it's something I care about a lot.

> I'll probably blog about this soon because it's something I care about a lot.

Yes, please! I have seen your videos and read a lot about your thoughts on this subject.

Another question, if you don't mind: does sourcegraph have a forum or irc/slack? A quick search for sourcegraph+slack ends up finding many hits.. for your name, lol.

Cool! We don't have a public Slack/IRC yet, but it seems like something we might do in the future. In the meantime, we're all pretty responsive on Twitter and on email.

And yeah, my last name being Slack does create some confusing moments sometimes. :)

Awesome to hear that GitLab support is coming. Please let me know if we can help.

I wasn't able to get the netrc file working, my password does contain spaces which seems to be an issue for netrc parsers.

In fact looks like I'm not able to get netrc or SSH keys working.

Does it support Gitea? Gitea has become my sole self-hosted git repository portal; it's lightweight compared to GitLab and gets the job done super well.

Do you have any plans to add more languages to code intelligence? (Particularly C and C++)

Why should we use this instead of the github search (for our private repos) ?

I mean honestly if you are happy with github search you should stay where you are.

Personally, I find it to be absolutely horrendous. It’s so much faster to clone the repo and search it locally.

> Personally, I find it to be absolutely horrendous.

Why is that so?

These are the reasons of which I am conscious:

1. There doesn't appear to be any relevancy sorting. It appears only the exact term is returned. If it’s not exact, I am not sure how to control whether or not it looks for an exact match and/or what strategy it uses to fuzzy match. Does it tokenize? Use some kind of Levenshtein distance algorithm?

2. The query results are hugely wasteful in terms of screen space. This means searching for a minorly common term in a large codebase is prohibitively time consuming compared to cloning + ripgrep or whatever.

3. There's no way to search file names + file content. It took me 7 years after github's creation to realize you could search for filenames if you press 't' on the repository.

4. No regex or globbing support, to my knowledge.
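
For reference on point 1, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal Python sketch of the standard dynamic-programming version. It only illustrates the metric; nothing here is known about what GitHub's search actually does.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert, delete, substitute)."""
    # previous[j] holds the distance between the prefix of a seen so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

A fuzzy search could rank candidates by this distance, at a cost proportional to the product of the two string lengths per comparison.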

This is before listing all the tooling (like sourcegraph) I would hope would be built into a source code repository to assist browsing but are strangely missing--every IDE and editor out there is much faster at casually browsing code because navigation is so much cheaper and frictionless.

I mean overall it's not broken, it's just way less useful for searching a tree of code files than find/xargs/grep is, let alone ack/thesilversearcher/ripgrep. If the capabilities I'm describing are there, they're well-hidden. Github just isn't a good place to browse code.

1. I never really noticed that because I mainly use Sourcegraph's code intelligence on open source projects, and as a result search is something that I rarely have to rely on.

2. You can stylize any page using something like stylebot or a homebrew browser extension.

3. Although not something that I do often, I find the filename search on google (for OSS projects) quite accurate and then the chrome extension allows you to open that file on sourcegraph.com or inject code intelligence within the github page as well.

4. Github sort of supports regex like search, you can learn more @ https://help.github.com/articles/searching-code/

2) isn’t about style, it’s about the fact that the results are paginated. You end up needing to search the damn search results, which is super slow when you’re paginated and it could have been a screen scan if you could fill the screen with results and scroll rapidly through the rest.

this is just honest feedback, not a value judgement.

I don’t think the person you replied to is complaining about Sourcegraph. He is talking about how he dislikes GitHub code search, and it seems that he likes Sourcegraph.

I dislike GitHub search too, especially the pagination part, but the issues GP mentioned can be avoided by using some Google-fu and Sourcegraph for Chrome :)

GitHub search is pretty useless if you want to rely on it to find all uses of a given word, say during a refactorization


So, you end up having to clone all repos locally and grep for the word.

Just pointing out that limitation of search in GitHub, not saying that this other tool is actually reliable to do this kind of things (I haven't used it before)

Sourcegraph CEO here.

To compare GitHub to Sourcegraph search, here is that same query on Sourcegraph.com (which is Sourcegraph Server running for all open-source code on GitHub):


It works as expected (and as the SO poster wanted)! It shows desired results that GitHub search does not.

Regexps are also supported...give it a try!

I'm the poster at SO :) Very cool! I'll definitely give it a try

can you only search a single repo at a time ?

does it parse the code and store the AST, or is it just plain text ?

It searches multiple repositories at a time. That query above searches all repositories in the given organization.

It does have code intelligence (parsing, semantic references and go-to-def, etc.) but that search is just a text search.

Our users prefer Sourcegraph over GitHub for code search for multiple reasons:

- Regular expression searches

- Exact searches (no ignoring punctuation, for example)

- Searches on any commit or branch, not just recently indexed master

- Diff searches (see https://about.sourcegraph.com/blog/introducing-sourcegraph-s...)

- Overall faster, more powerful searches and filtering capabilities

- Code intelligence (go-to-definition, find-references, hovers, etc.)
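
To illustrate the "exact searches" point in the list above, here is a minimal Python sketch contrasting a search that tokenizes on word characters (and so ignores punctuation) with a plain substring search. Both functions and the sample lines are hypothetical; they are not how GitHub or Sourcegraph implement search.

```python
import re


def word_tokens(text):
    """Split text into lowercase word tokens, discarding punctuation."""
    return re.findall(r"\w+", text.lower())


def tokenized_match(query, line):
    """Match if every word token of the query appears somewhere in the line."""
    line_tokens = set(word_tokens(line))
    return all(token in line_tokens for token in word_tokens(query))


def exact_match(query, line):
    """Match only if the query appears verbatim, punctuation included."""
    return query in line


# Hypothetical sample lines, only for illustration.
lines = ["foo.bar(baz)", "foo = bar + baz", "bar.foo()"]
query = "foo.bar("

print([line for line in lines if tokenized_match(query, line)])
print([line for line in lines if exact_match(query, line)])
```

The tokenized search returns all three lines because each contains the words "foo" and "bar", while the exact search returns only the actual call `foo.bar(baz)`.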

Not everyone needs these things. But users who do need them say that they save a lot of time and make them more productive.

At Google, for example, they have a similarly advanced internal code search system that developers love (see https://static.googleusercontent.com/media/research.google.c... and https://docs.google.com/document/d/1LQxLk4E3lrb3fIsVKlANu_pU... for research/numbers).

If your needs are met by GitHub's search, then I would still suggest using the Sourcegraph Chrome extension (also available for Firefox), which adds code intelligence to code you view on GitHub: https://chrome.google.com/webstore/detail/sourcegraph-for-gi....

Did you get permission from SourceGraph to post this comment to HN?

> You may not release the results of any performance or functional evaluation of any of the Software to any third party without prior written approval of Sourcegraph for each such release.

-- https://about.sourcegraph.com/terms/

We just removed that clause (also replied to your other comment about it). Didn’t intend for it to be in there; I agree it’s silly. Thanks for pointing it out.

Google also has this:


That's not an open source product and not available otherwise, though it is backed by kythe.

If you want to have an open source index search, you can try out github.com/google/zoekt/ . See here for a demo site: https://cs.bazel.build/

For example: https://cs.bazel.build/search?q=r%3Atorvalds+meltdown&num=50 searches the Linux kernel for "meltdown"
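
If you want to script queries like this, the URL above can be rebuilt with Python's standard library. This is a minimal sketch that only reproduces the URL shape shown in the comment; the helper name `search_url` is hypothetical, and reading `r:torvalds` as a repository filter is an assumption based on the example rather than a documented guarantee.

```python
from urllib.parse import urlencode

# Base URL of the demo site mentioned above.
BASE = "https://cs.bazel.build/search"


def search_url(query, num=50):
    """Build a search URL with the query and result count as parameters."""
    # urlencode percent-encodes ":" as %3A and encodes spaces as "+".
    return BASE + "?" + urlencode({"q": query, "num": num})


url = search_url("r:torvalds meltdown")
print(url)
```

The printed URL is identical to the one in the comment, which shows that the query string is just "r:torvalds meltdown" after decoding.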

Is code intelligence a paid upgrade for all languages?

Yes, code intelligence (go-to-definition, find references, hovers, etc.) on Sourcegraph Server is a paid upgrade for all languages.

But you can try/use it for free on open-source projects using our Chrome extension (to get it on code you view on GitHub) at https://chrome.google.com/webstore/detail/sourcegraph-for-gi... or on our public site directly at https://sourcegraph.com/github.com/gorilla/websocket/-/blob/... (for example).

Chrome extension is marvellous for anyone who hasn't used it. Particularly useful for looking up docs of calls to external packages in Python repos.

I don't really get the pricing on code intelligence. So if I have 50 users and want Javascript, Python, and PHP, that's $750/month, even if 25 users only ever use Python?

If you want 3 or more languages, then contact us (at https://about.sourcegraph.com/pricing) and we can give you a package discount. Overall, if pricing is a concern, I'd love to learn more. I'll email you.

I think it really boils down to a perception issue. I think more people would be happy paying a flat fee for a language package than dealing with the angst of "wasted" money paying for languages that some users don't leverage. The package as it currently stands would "work" if I could assign languages to users and have that reflect back up to pricing, but that's cumbersome to manage from a customer perspective and sounds pretty painful to implement from the provider perspective.

It may be useful to consider a baseline two-language deal, as Javascript + one server side language covers a huge amount of use cases.

That said, you cover 2 out of 3 of the following scenarios pretty well with the existing model, I just happen to fall into the third, which is probably the smallest sector for you guys anyway:

1 - Small startups, probably standardized on one or two languages, 8-10 people.

2 - Larger orgs (200+) where the cost is negligible compared to revenue.

3 - Medium-sized, microservice/squad-based orgs, with heterogeneous language support but focused within teams.

GitHub search is quite useless. Results are incomplete in unpredictable ways. I've had good results with etsy/houndd.

What is the License for the free install ? MIT/BSD/Apache ?

Here are the terms: https://about.sourcegraph.com/terms/

I note with some distaste that it includes an Oracle-esque prohibition on benchmarking.

> You may not release the results of any performance or functional evaluation of any of the Software to any third party without prior written approval of Sourcegraph for each such release.

Sourcegraph CEO here. That shouldn’t be in there, I agree—we meant to remove that section. It will be removed in a couple of minutes. Please try it, use it, and post lots of evaluations about our product. :)

No, it isn't open source. It's similar to GitHub Enterprise.

Which version control systems are supported?
