
How Google Code Search Worked (2012) - rsc
https://swtch.com/~rsc/regexp/regexp4.html
======
burntsushi
I am planning on adding functionality like this to ripgrep. If folks have
opinions on how it should work, I'd love to hear from you!
[https://github.com/BurntSushi/ripgrep/issues/1497](https://github.com/BurntSushi/ripgrep/issues/1497)

~~~
hangtwenty
Thank you for ripgrep! I will mention it in that thread too but I'm AFK so
before I forget...

I'm imagining a "drill down" TUI with rg and fzf. fzf can be good for both
filenames and other filter-downs. Thinking of breadcrumbs and easily stepping
forward or backward, ability to easily bookmark/"pin" parts of search paths as
presets for easy reuse later, etc.

EDIT: I recognize this would be outside of the scope of rg itself, I'm voicing
it in case it sparks ideas about the functionality you're thinking of adding.
I'll think more about it and see if I can explain better

------
rkhassen9
Code search was “too good to be true, gotta pinch myself” awesome. I miss it
to this day.

Here is how I used it - I’d type in some code I was working on and the search
result would show similar code and how it was used. Great for debugging and
thinking by looking at similar solutions. Sigh.

~~~
tehlike
For me it was f:cc$ f:contentads some keyword I wanted to learn more about.
Then a bunch of cross refs.

~~~
londons_explore
You can still play with it here:
[https://cs.chromium.org/](https://cs.chromium.org/)

~~~
fzzfff
wow this site is so laggy, even on chromium

~~~
londons_explore
Google engineers have superfast high-end desktops and laptops, and 10 Gbit
internet connections. They don't tend to optimise their internal tools for
low-spec machines or slow internet connections.

------
greenyouse
Weird, I was just looking into Google Code Search this weekend so I could use
something like it on my work computer. It's a little surprising that the big
git hosting companies don't have a proper code search tool as part of their
package. I use Bitbucket right now, but its search is built on Elasticsearch
and special characters aren't handled, so regular expressions won't work.

A couple open source projects that I've seen are Hound and Zoekt. Hound
actually uses this code search backend with a nice frontend in React. Zoekt is
what I was going to use since it scales really well, is faster, and has good
search operators for filtering by repo name, language, etc. Google was using
Zoekt until recently for code search across all their open source repos.[0]

[0][https://cs.chromium.org/](https://cs.chromium.org/)

~~~
Cyph0n
We use Opengrok at Cisco. It’s a pretty barebones interface, but it works
well.

~~~
sebastos
Interesting, because the search function that Cisco's intranet provides for
documents and such is perhaps the single most useless piece of technology I've
ever encountered. You could search something like "401k plan" and you'd get
marketing materials written in Japanese. Utter trash.

------
dvirsky
The current internal CodeSearch is one of the best tools available for Google
engineers. It's really a marvel.

~~~
jules
Do you have a list of all major Google tools and why they're better than
what's available elsewhere? (if so)

~~~
antoinealb
In my opinion, what makes them really great is the tight integration that they
have. For example, since the whole company uses one build system and one
single repository, you can build a truly awesome IDE that knows about every
library in the company and can autocomplete for it. Same for code search,
where cross references are accurate and work across languages (for example a
class generated from Protobuf).

~~~
londons_explore
Outside Google, the percentage of my coding time I spend hunting through some
dependency's source tree for the relevant header files or documentation or
"Where on earth is this constant defined" is huge.

With codesearch, answering those kinds of questions is near instant.

------
kannmig
Other great articles on regexes from rsc:

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

[https://swtch.com/~rsc/regexp/regexp2.html](https://swtch.com/~rsc/regexp/regexp2.html)

------
ninkendo
I’ve always wondered why regular expressions and full text indexes are the
best thing we expect out of a code search engine.

I mean, we’re talking about _code_ here. Text meant to be interpreted and
understood by a compiler. Why can’t we do better?

Why can’t I say “show me everyone that’s calling this function”, like an IDE
lets me do? Or “show me functions that accept <type> as one of their arguments
and return <type>”, in a way that integrates with the real grammar/AST of the
language(s) in question, without resorting to clunky regular expressions?

I should be able to write structured queries against a codebase, with regexes
being just one part of that query language.
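
To make the "show me everyone that's calling this function" idea concrete, here is a minimal sketch for one language only, using Python's standard `ast` module. The helper name `find_callers` is made up for illustration. It matches on the called name alone and does no type or import resolution, so it will confuse unrelated functions that share a name; a real tool would need the build and dependency information discussed elsewhere in this thread.

```python
import ast


def find_callers(source, target):
    """Return (enclosing function, line number) for each call to `target`.

    Hypothetical helper: matches by name only, with no type resolution.
    """
    hits = []

    class Visitor(ast.NodeVisitor):
        def __init__(self):
            # Stack of enclosing function names; calls outside any
            # function are attributed to "<module>".
            self.stack = ["<module>"]

        def visit_FunctionDef(self, node):
            self.stack.append(node.name)
            self.generic_visit(node)
            self.stack.pop()

        visit_AsyncFunctionDef = visit_FunctionDef

        def visit_Call(self, node):
            func = node.func
            if isinstance(func, ast.Name):          # parse(...)
                name = func.id
            elif isinstance(func, ast.Attribute):   # obj.parse(...)
                name = func.attr
            else:
                name = None
            if name == target:
                hits.append((self.stack[-1], node.lineno))
            self.generic_visit(node)

    Visitor().visit(ast.parse(source))
    return hits
```

Unlike a regex for `parse\(`, this skips occurrences inside comments and string literals, because those never become `Call` nodes in the syntax tree.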

~~~
michaelt
Only some languages can readily support such features.

For example C has a preprocessor and linking step driven by a build system.
And C has a bunch of different build systems available, some of which are
procedural rather than declarative.

Maybe you'll need to support package management - if a function signature
calls for a CopyOnWriteArrayList do you need to know what the subclasses and
superclasses of that type are? Do you need to resolve all the dependencies to
be able to do that?

If you're thinking "No problem, everyone compiles their programs in CI anyway",
are you happy to skip indexing unused code and uncompilable code?

And of course you'll be chasing after language and build tool changes - not
only to one language, but every language.

On the other hand, a nice simple grep? Sounds much simpler to me.

------
ovi256
It's worth taking a look at livegrep (Try it here:
[https://livegrep.com/search/linux](https://livegrep.com/search/linux)) as an
alternative to your git provider's code search.

Quoting patio11: "I intend to boot up a livegrep instance on the first day of
every startup for the rest of my life. It borders on miraculous."

It is indeed very good.

------
hnaccy
I'm baffled at how bad GitHub code search is, even for enterprise GitHub
deployments. Are there any third-party solutions that are popular or standard?

~~~
newusertoday
Try Sourcegraph:
[https://github.com/sourcegraph/sourcegraph](https://github.com/sourcegraph/sourcegraph).
It is backed by a company, so you can buy enterprise support as well if
required.

------
generatorguy
for some reason I always thought russ cox was 60 years old (even 15 years ago)
with a big grey neck beard. boy was I wrong!

~~~
kick
He's not that old, but he's not that young, either. He worked on Plan 9 for
like a decade before this happened. A few years before he put out this blog
post, he finished his PhD thesis. While a bit silly to hire someone with that
much experience as an intern, it _is_ how most PhDs are treated.

~~~
generatorguy
I was actually thinking of Alan Cox!

~~~
kick
Alan Cox is in his fifties!

------
bminor13
A minor note in the article reads:

> To minimize I/O and take advantage of operating system caching, csearch uses
> mmap to map the index into memory and in doing so read directly from the
> operating system's file cache. This makes csearch run quickly on repeated
> runs without using a server process.

Does anyone know some resources where I can read more about this technique?
(how-to, pros/cons, caveats, etc.) I'm interested in figuring out the best way
to have a commandline tool persist state that it can quickly access across
multiple runs, but so far a background server process is the only technique
I'm familiar with.

~~~
akkartik
Yeah, you have to run a server. When a process exits, all its mmap'd pages are
reclaimed. Just like any other memory.

~~~
bminor13
But this excerpt says that this technique obviates the need for a server
process? Are they saving the contents of memory into files using mmap, and
then using this state on every run?

~~~
akkartik
I just reread your quoted passage and noticed the reference to the OS's file
cache. Ok, this is outside my knowledge :)
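
For anyone with the same question, here is a rough sketch of the idea in the quoted passage, not csearch's actual code. The state lives in an ordinary file on disk, and each run maps that file read-only instead of parsing it into its own data structures. The mapping itself goes away when the process exits, but the operating system typically keeps recently read file pages in its page cache, so the next run that maps the same file can be served from RAM without a server process holding anything. The file layout (a count followed by little-endian 32-bit values) and the names `write_index` and `read_entry` are invented for illustration.

```python
import mmap
import struct


def write_index(path, values):
    """Write a toy index: a uint32 count followed by uint32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack(f"<{len(values)}I", *values))


def read_entry(path, i):
    """Map the index read-only and decode only entry i.

    Only the pages actually touched are read from disk (or, on a
    repeated run, usually served from the OS file cache).
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            (count,) = struct.unpack_from("<I", m, 0)
            if not 0 <= i < count:
                raise IndexError(i)
            (value,) = struct.unpack_from("<I", m, 4 + 4 * i)
            return value
```

The main caveats are that the on-disk format has to be usable in place (fixed-width fields, offsets instead of pointers), and that the cache is best effort: under memory pressure the pages are evicted and the next run pays for disk reads again.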

------
chubot
Related:

[https://codesearch.debian.net/](https://codesearch.debian.net/)

[https://github.com/Debian/dcs/](https://github.com/Debian/dcs/)

[https://codesearch.debian.net/research/bsc-thesis.pdf](https://codesearch.debian.net/research/bsc-thesis.pdf)

------
YesThatTom2
Wait... /usr/include on Mac OS Lion included constants for DATAKIT?

Was that a joke? Does someone have a Lion system around that can verify?

~~~
mjlee
[https://github.com/apple/darwin-xnu/blame/a449c6a3b8014d9406...](https://github.com/apple/darwin-xnu/blame/a449c6a3b8014d9406c2ddbdc81795da24aa7443/bsd/bsm/audit_domain.h#L47)

It's been there for 11 years in this repo. Along with DECnet. Suppose there's
no real drive to remove it.

------
ausjke
OpenGrok is the best I have used so far. It's Java-based and can search _huge_
code bases (e.g. the Android source code, the Linux kernel, whatever you throw at it).

[https://oracle.github.io/opengrok/](https://oracle.github.io/opengrok/)

