
Searchcode: A source code search engine - boyter
http://searchcode.com/
======
beliu
Neat project, thanks for sharing!

We've been working on another code search engine for open-source:
[https://sourcegraph.com/](https://sourcegraph.com/). Our approach is a little
different -- we parse and index things at a programming language level. This
has the benefit of being able to list usage examples of a function or jump to
a definition (i.e., smart IDE-like behavior). Obviously, one drawback is
having to implement language-specific support; right now Sourcegraph just
supports Python, Javascript (node.js) and Go.

Would love to hear people's feedback on either approach!

For other examples of code search, check out: Ohloh:
[http://www.ohloh.net/](http://www.ohloh.net/), Krugle:
[http://opensearch.krugle.org/](http://opensearch.krugle.org/), and Google
code search (or what remains of it)
[https://code.google.com/p/chromium/codesearch](https://code.google.com/p/chromium/codesearch)

~~~
boyter
Thanks.

I have been watching sourcegraph.com for a while. As you say, you do much
deeper and more expensive parsing of the code, which is not really feasible at
the moment for searchcode since it has 90 languages. It's also a constraint as
I don't have the hardware to support this (searchcode is very lean).

My goal was to provide codesearch for sources other than Github as there is so
much code out there to find. Github does an excellent job indexing and I am
certain that Bitbucket will be following with their own implementation soon.

I too would love to know what people prefer: a deeper analysis of the code
with IDE support, or more breadth. I can see both being useful.

~~~
pfraze
Maybe a merge?

~~~
Reich
Don't --force it

------
rane
If you compare these two search results:

[https://searchcode.com/?q=MongoDBObject+find+lang%3AScala](https://searchcode.com/?q=MongoDBObject+find+lang%3AScala)

[https://github.com/search?q=MongoDBObject+find+language%3Asc...](https://github.com/search?q=MongoDBObject+find+language%3Ascala&type=Code&ref=searchresults)

I find Github much, much more digestible and to the point than searchcode.

~~~
boyter
Yes, that's my fault. What's happening under the hood is searchcode is trying
to match "MongoDBObject find" exactly, whereas Github is going for a looser but
phrase heavy search.

It's something I am going to change based on feedback, as clearly what I am
doing is not what people are expecting. I will however make it an option, so
you can choose exact match (current) or loose.

~~~
rane
Thanks, would be great to have searchcode up to par with Github's search.

I would pay special attention to the yellow highlights. It's only confusing if
the technique is overused.

~~~
boyter
I am modifying it now to be a little more like github's.

I had planned on removing the highlights too; it's going to be modified to just
affect the portion where the match occurred. It's mostly there as a hangover
from when I was not actually highlighting the actual match and just the line.

------
JoshTriplett
Handy! However, this seems to do fuzzy searches by default, and I don't see an
obvious way to disable that.

I tried searching for "xcb_connect", and got a pile of results for
"xcb_connection_t". However, unlike with other search engines, I can't quote
the term to force an exact match, because it'll search for the quotes
literally (which I'd otherwise appreciate when trying to find a string).

~~~
boyter
Hi. That's actually a very good point. I need to have a think on how to allow
forcing exact matches, perhaps an advanced option which does this for you.
Thanks for the feedback.

~~~
JoshTriplett
As a first pass, a "match whole words" checkbox should suffice. Eventually you
might want a syntax that allows more flexibility, whether full regexes or
something more scalable, but "match whole words" would solve a large fraction
of the problem. (You might even want to make that the default, and have a
checkbox for "match partial words" instead; for code search, I'd bet that most
of the time you _want_ the whole word matches.)
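
A minimal sketch of the "match whole words" idea, using the `xcb_connect` example from this thread. It is not how searchcode works; the function names are made up for illustration, and it assumes the search term starts and ends with a word character (letter, digit, or `_`), since that is what `\b` relies on.

```python
import re


def whole_word_pattern(term):
    # \b treats letters, digits, and "_" as word characters, so
    # "xcb_connect" does not match inside "xcb_connection_t".
    return re.compile(r"\b" + re.escape(term) + r"\b")


def search_lines(lines, term, whole_words=True):
    """Return the lines containing term, optionally as a whole word only."""
    if whole_words:
        pattern = whole_word_pattern(term)
        return [line for line in lines if pattern.search(line)]
    # "Match partial words": plain substring search.
    return [line for line in lines if term in line]


lines = [
    "xcb_connection_t *c = xcb_connect(NULL, NULL);",
    "void close_display(xcb_connection_t *c);",
]

whole = search_lines(lines, "xcb_connect")
partial = search_lines(lines, "xcb_connect", whole_words=False)
```

Here `whole` keeps only the first line, while `partial` returns both, which is the behavior the checkbox would toggle.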

~~~
boyter
That sounds reasonable. I will probably go with that. Added to the queue and
thanks for the feedback.

------
fbellomi
That's a very interesting project.

I've been working on another tool,
[http://crossclj.info](http://crossclj.info) focused on cross-referencing a
large part of the open-source ecosystem for Clojure and ClojureScript (some
3600+ projects).

You can browse both the source code (jumping to definitions across projects)
and the auto-generated docs of the whole codebase.

~~~
srcmap
Very neat! Fast search of src code is also a personal itch of mine - I also
work on a site/webapp that lets you search source code with a mouse click.

[http://www.srcmap.org/s/sl.htm/p=about#c=ABOUT](http://www.srcmap.org/s/sl.htm/p=about#c=ABOUT)

It cross-references any string in any text file inside a project.

I indexed the linux, bsd kernel, android AOSP (JB), openstack, golang,
raspberrypi_userland, elasticsearch, nodejs and few other projects with it on
that site.

------
chdir
I use sourcegraph occasionally and mostly rely on GitHub search. I wish the
search had all those advanced refinement options that grep & Sublime Text
search have. Some examples would be to use regex, search a word within a scope
of lines, search within search results etc. Additionally, it's very useful to
be able to sort the search results by stars/forks. Sometimes I just want to
see how popular projects have implemented a certain feature. A keyword based
search isn't enough for that.

I guess these features are very expensive & slow to implement but it would be
super useful if it can be achieved. Source code search is for geeks so it is
probably fair to say that a _truly_ advanced & complex interface won't turn
away users.

~~~
boyter
Sounds like what I want to achieve with searchcode. I am working on most of
what you have listed there. If you want to email me with your ideas on how
these features should work I would be more than happy to add them. My email is
in my bio and listed on searchcode.

~~~
chdir
That sounds great. If I have something useful to contribute, I'll get in
touch.

------
laurent123456
This is nicely implemented; however, as with most code search engines, I'm
wondering what the exact use case is. When I want to search for a function, I
usually try Google and find sample codes on forums, GitHub or Stack Exchange
sites along with detailed information and discussions about them. What
additional feature would searchcode provide?

~~~
boyter
Personally I use it quite a lot when looking for implementations. I also used
code search before searchcode as well to do this, so it fits into my workflow.

However I also use DuckDuckGo, and a !code redirects to searchcode for this so
there is that.

Possibly it depends on how you learn and use things. I find examples useful.
For instance when looking at how to do things in Python's Fabric I would
rather see some examples than read about how someone else is doing it.

------
frik
Sphinx and searchcode:
[http://www.boyter.org/2014/06/sphinx-searchcode/](http://www.boyter.org/2014/06/sphinx-searchcode/)
(by Ben E. Boyter, searchcode.com)

------
arafalov
I like the search engines and discoverability. My own project - much, much
smaller - is for embedding search into the Javadocs themselves (along with
some SEO improvements, like iframes and better meta-info).

The search is driven by Apache Solr and can be seen at:
[http://www.solr-start.com/javadoc/solr-lucene/index.html](http://www.solr-start.com/javadoc/solr-lucene/index.html).
I use a custom doclet to generate the actual Javadoc, which is an interesting
challenge all by itself.

------
skybrian
GWT is not on Google Code anymore (well we were, but it's old code). We're at
gwt.googlesource.com.

I imagine you'll want android.googlesource.com and other googlesource.com
repos.

Also, ranking is really important. My usual test is to search for
java.lang.String and see what comes up. If the first result isn't the String
class from some version of the Java SDK, something is wrong :-)

Also: CrossSiteIframeLinker finds the right file on the first page (though not
at the top), but CrossSiteIframeLinker.java has no results.

~~~
boyter
Thanks. I'll update the details.

The reason you aren't seeing String from Java is that it actually by default
goes for files using the String class, rather than the implementation. It's
something I need to work on though, as I agree it should pop near the top, or
possibly appear in the documentation listing at the top.

As for CrossSiteIframeLinker, I don't index the filenames, although I am
considering it for these sort of cases.

------
ch
Well, they don't break tokens on '_', which is a plus (tried searching for
pthread_t), and they seem to prioritize definitions over declarations.

This could become a useful code search option.

I would request that relevancy should take into account provenance. Meaning a
search for pthread_t would return pthread implementations over uses.

This obviously can't make use of traditional tfidf for scoring.

~~~
boyter
Thanks. Yes, all characters are actually indexed, with some logic to split
intelligently when required. It will always go for the most exact match first
though.
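
As a rough illustration of keeping `_` inside tokens (so `pthread_t` is indexed as one term rather than `pthread` and `t`), here is a small sketch. It is an assumption-laden toy, not searchcode's actual splitting logic: it indexes only C-style identifiers, whereas the comment above says all characters are indexed, and the names `tokenize` and `build_index` are hypothetical.

```python
import re
from collections import defaultdict

# Identifiers may contain "_", so "pthread_t" stays a single token.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(source):
    """Split source text into identifier tokens without breaking on '_'."""
    return IDENTIFIER.findall(source)


def build_index(lines):
    """Map each token to the set of 1-based line numbers it appears on."""
    index = defaultdict(set)
    for number, line in enumerate(lines, start=1):
        for token in tokenize(line):
            index[token].add(number)
    return index


lines = ["pthread_t tid;", "pthread_create(&tid, NULL, run, NULL);"]
index = build_index(lines)
```

With this index, a lookup for `pthread_t` finds line 1 only, and there is no separate entry for `pthread`.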

I have started looking at ranking the main implementation over usage, but
found the results were less useful generally. This may have just been me
though, as I wanted to see usage rather than the implementation, since if I
know that I will just go to the source.

Perhaps an option to request that the original source is given greater weight
on an advanced search page would solve this.

~~~
ch
Perhaps simply an info box which points to the implementation, set off from
the usage results. No reason to give the results equal screen real-estate.

------
tsenkov
Does anyone know details on the API - what is the request quota? Will there be
api-keys, or just anonymous users to the api?

I am interested in how it works with Github search - their API allows
something like 5 unauthenticated req/min and 20 req/min if authenticated (at
least the search API)?

Congrats to @boyter.

~~~
boyter
Hi tsenkov,

There is no request quota at the moment, although if I discover any abuse I
may end up rate limiting per IP address (which I really do not want to do as I
don't currently track IP's). No API key is required either. Neither of these
are likely to change.

I only ask for a link back to the site and that you don't spam it. I want to
operate by the motto "Be excellent to each other".

If you want to pass a referrer in the GET request so I know who's using the
API that would be nice, but not required.

searchcode has no integration with Github search at all. I am unlikely to add
it either as I believe they do an excellent job, hence the link on the right
hand side to be redirected to Github search.

Thanks! Hope you find a nice use for the API. If so, let me know as I am
always happy to hear about usage.

~~~
tsenkov
Thanks. This is an awesome plan. I might be using the api through a desktop
app, so eventual limit per IP would work far better than overall.

About the link back to the site - I don't see anything in the API results to
link back to and just a link to the site will probably get a lot lower rate of
clicking than a link with context to the search... Perhaps, it will be a nice
idea to add a link_back_url (just url?) for every result, so people can
navigate to "searchcode.com/codesearch/view/n"?

~~~
boyter
Feel free to go nuts.

Ah that's a good point. I meant just link back to searchcode itself, but that
makes sense. Added for you. It should appear on new searches, and all once the
cache falls off (don't want to flush it right now while it's getting so much
traffic).

~~~
tsenkov
Awesome. Thank you. :)

------
pbreit
There was a code search engine wave a few years back which I thought was a
decent idea but they apparently never gained much traction (Google Code
Search, Krugle, Koders come to mind). One thing I don't recall seeing which is
a good call are the real world examples.

------
sevko
A similar (proof of concept) application:
[http://bitshift.it/](http://bitshift.it/) Demo video:
[https://vimeo.com/98697078](https://vimeo.com/98697078)

------
rubycodesearch
I made a site dedicated to Ruby developers. It's mostly about searching a
RubyGem's source code. RegExp supported.

[http://rubycodesearch.com](http://rubycodesearch.com)

------
eng_monkey
Congratulations! Excellent usability. The user interface is pretty nice too.
It would be nice being able to use it for a regular intranet search engine.

~~~
boyter
Actually you can if you set your default search provider in your browser. All
DDG bang searches are supported so it's actually possible to do.

------
adzicg
just tried searching for a few obscure things I can find through Github and
the site couldn't find anything. meh... not much of an index then

~~~
zapt02
I tried searching for: eval($_GET

No results.

Same search on GitHub gives over 100 000 results for PHP. Doh.

[https://github.com/search?q=eval%28%24_GET&ref=searchresults...](https://github.com/search?q=eval%28%24_GET&ref=searchresults&type=Code)

~~~
boyter
Howdy. That's mostly because searchcode is trying to match exactly "eval($_GET"
and currently has none in the index. Looking at GitHub it seems to come back
with only 4 results which are exact and the rest as loose matches.

I am seeing a pattern that most people would prefer the loose match over exact
to get back more results, but with the exact matches (if any) at the top.

I will take this on board and see if I can improve things.
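
One way to picture "loose matches, with any exact matches at the top" is the sketch below. It is only one possible reading of "loose" (every word-like token of the query appears somewhere in the document), not what GitHub or searchcode actually do, and `rank_results` is a made-up name.

```python
import re


def rank_results(documents, query):
    """Return matching documents, exact phrase matches before loose ones."""
    # Loose matching here means: every word-like token of the query
    # occurs somewhere in the document (an assumption for illustration).
    tokens = re.findall(r"\w+", query)
    exact, loose = [], []
    for doc in documents:
        if query in doc:
            exact.append(doc)
        elif tokens and all(token in doc for token in tokens):
            loose.append(doc)
    return exact + loose


docs = [
    "$x = $_GET['a']; eval($x);",
    "echo 'hello';",
    "eval($_GET['cmd']);",
]

results = rank_results(docs, "eval($_GET")
```

For the query `eval($_GET`, the third document is an exact match and is listed first, the first document is a loose match, and the second is dropped.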

~~~
zapt02
If you search for an exact match you still get almost 600 results, see:

[https://github.com/search?q=%22eval%28%24_GET%22&type=Code&r...](https://github.com/search?q=%22eval%28%24_GET%22&type=Code&ref=searchresults)

I would definitely expect the google pattern of doing exact match when in
quotes. Throw in a regexp engine as well, because I tried using wildcards and
that didn't do anything.

For example, if searching for vulnerabilities I would like to be able to do
something like:

      eval(.*$_[(GET|POST)].*)
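
Read literally, the pattern above is shorthand rather than valid regex syntax: unescaped `(` and `$` are metacharacters, and `[(GET|POST)]` is a character class, not an alternation. A hedged Python rendering of what it seems to intend (flag `eval` calls whose argument mentions `$_GET` or `$_POST`) could look like this; the sample lines and the name `find_suspicious` are invented for illustration.

```python
import re

# "\(" and "\$" match the literal characters; (GET|POST) is a real
# alternation. "[^)]*" keeps the match before the first closing
# parenthesis, a simplification chosen for this sketch.
PATTERN = re.compile(r"eval\s*\([^)]*\$_(GET|POST)\b")


def find_suspicious(lines):
    """Return lines where eval() is called on something from $_GET/$_POST."""
    return [line for line in lines if PATTERN.search(line)]


samples = [
    "eval($_GET['cmd']);",
    "eval(base64_decode($_POST['p']));",
    "$v = $_GET['id'];",
    "evaluate($name);",
]

hits = find_suspicious(samples)
```

Only the first two sample lines are flagged; the plain `$_GET` read and the unrelated `evaluate` call are ignored.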

------
splitbrain
A bit off-topic. But are there any open source projects like this that can be
self-hosted, working on a largish local directory of projects?

~~~
frik
It's called Desktop Search [1] or Enterprise Search [2].

If you want to create your own solution, the easiest way is using _Sphinx
Search_ [3] (searchcode.com uses it too [4]) and a bit more advanced with
Lucene [5] (and its related sub-projects) or Xapian [6].

If you want your personal _Google Code Search_ [7] with its powerful scalable
Regex functionality, Russ Cox published the related code [8].

[1]
[http://en.wikipedia.org/wiki/Desktop_search](http://en.wikipedia.org/wiki/Desktop_search)

[2]
[http://en.wikipedia.org/wiki/Enterprise_search](http://en.wikipedia.org/wiki/Enterprise_search)

[3]
[http://en.wikipedia.org/wiki/Sphinx_(search_engine)](http://en.wikipedia.org/wiki/Sphinx_\(search_engine\))

[4] [http://www.boyter.org/2014/06/sphinx-searchcode/](http://www.boyter.org/2014/06/sphinx-searchcode/)

[5] [http://en.wikipedia.org/wiki/Lucene](http://en.wikipedia.org/wiki/Lucene)

[6] [http://en.wikipedia.org/wiki/Xapian](http://en.wikipedia.org/wiki/Xapian)

[7]
[http://en.wikipedia.org/wiki/Google_Code_Search](http://en.wikipedia.org/wiki/Google_Code_Search)

[8] [http://swtch.com/~rsc/regexp/](http://swtch.com/~rsc/regexp/)

------
slashdotaccount
You can do regex based searches of Debian here:

[http://codesearch.debian.net/](http://codesearch.debian.net/)

------
talles
Ty so much, I've been waiting for something like this since they ruined
koders.com!

~~~
boyter
No problem at all.

I actually think the new koders (code.ohloh.net) is an improvement over the
old koders in many ways, but wanted a leaner implementation.

Also, while ohloh was working on the new version, koders was becoming stale,
hence starting my own. Lastly I thought that this sort of service should
provide an API which ohloh sadly does not (hopefully this will change).

~~~
PDegenPortnoy
Ooooh, good point. I'm an engineer for Black Duck and in charge of Ohloh.net.
We have an API for most everything available on Ohloh.net (Organizations,
which is still in Beta, is an exception and we're going to fix that real
soon).

We're working on an improved code search and I want to leverage this new
infrastructure, which should start being available for our internal
development in the next few weeks, to make code search a top-level citizen
within Ohloh. For example, searching for key phrases within a project right
from the Project page.

I'll add some stories to the back log to see what we can do to make the code
search itself API accessible.

You mentioned "leaner implementation"; could you expand on that a bit? I'd
love to hear your thoughts.

~~~
boyter
Sounds good to me. An exposed API is a big boon, and the lack of one is one of
the main reasons I started searchcode.

Sure, just email me at bboyte01@gmail.com and I'll be happy to discuss. Mostly
it comes down to not overcrowding the UI and running on minimal hardware.

------
voltagex_
grepcode.com seems to be good for Java/Android. I'm a little annoyed they have
a generic domain for a language-specific search but oh well.

------
chewxy
Did you abandon your .de domain?

~~~
boyter
It's still active, but just as a redirect.

Google does not like .de domains and moving over to .com gave a massive boost
in search results.

------
thejosh
Does this also sort gists?

~~~
boyter
No, it doesn't, sorry. If you want to send me an email with some details on
how to implement this I would be happy to though.

