Would love to hear people's feedback on either approach!
For other examples of code search, check out Google Code Search (or what remains of it): https://code.google.com/p/chromium/codesearch
I have been watching sourcegraph.com for a while. As you say, you do much deeper and more expensive parsing of the code, which is not really feasible for searchcode at the moment since it supports 90 languages. It's also a constraint because I don't have the hardware to support this (searchcode is very lean).
My goal was to provide code search for sources other than GitHub, as there is so much code out there to find. GitHub does an excellent job indexing, and I am certain that Bitbucket will follow with their own implementation soon.
I too would love to know what people prefer: a deeper analysis of the code with IDE support, or more breadth. I can see both being useful.
I find GitHub much, much more digestible and to the point than searchcode.
It's something I am going to change based on feedback, as clearly what I am doing is not what people are expecting. I will, however, make it an option, so you can choose exact matching (the current behaviour) or loose.
I would pay special attention to the yellow highlights. It's only confusing if the technique is overused.
I had planned on removing the highlights too; they are going to be modified to affect just the portion where the match occurred. They are mostly there as a hangover from when I was highlighting the whole line rather than the actual match.
I tried searching for "xcb_connect", and got a pile of results for "xcb_connection_t". However, unlike with other search engines, I can't quote the term to force an exact match, because it'll search for the quotes literally (which I'd otherwise appreciate when trying to find a string).
I did write some code a while ago which would turn a regex query into a normal Sphinx query, so I am going to try to implement that again and hopefully get similar results.
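The rough idea is to map the subset of regex that Sphinx's extended query syntax can actually express, falling back to plain keywords otherwise; a minimal sketch of the approach (not the original code):

```java
// Sketch only: map a small regex subset onto Sphinx extended-match syntax.
// Anything outside the subset would fall back to a plain keyword search.
public class RegexToSphinx {

    static String translate(String regex) {
        String q = regex;
        // Anchors have no per-term equivalent in Sphinx, so drop them.
        q = q.replace("^", "").replace("$", "");
        // ".*" maps to the Sphinx wildcard "*" (requires min_prefix_len /
        // min_infix_len to be enabled in the index config).
        q = q.replace(".*", "*");
        // Alternation "a|b" maps to the Sphinx OR operator "a | b".
        q = q.replace("|", " | ");
        return q.trim();
    }

    public static void main(String[] args) {
        System.out.println(translate("^xcb_connect.*"));  // -> xcb_connect*
        System.out.println(translate("(malloc|calloc)")); // -> (malloc | calloc)
    }
}
```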
I've been working on another tool, http://crossclj.info, focused on cross-referencing a large part of the open-source ecosystem for Clojure and ClojureScript (some 3600+ projects).
You can browse both the source code (jumping to definitions across projects) and the auto-generated docs of the whole codebase.
It cross-references any strings in any text files inside a project.
I indexed the Linux and BSD kernels, Android AOSP (JB), OpenStack, golang, raspberrypi_userland, Elasticsearch, Node.js, and a few other projects with it on that site.
I guess these features are very expensive & slow to implement, but it would be super useful if they could be achieved. Source code search is for geeks, so it is probably fair to say that a truly advanced & complex interface won't turn away users.
However, I also use DuckDuckGo, and a !code bang redirects to searchcode, so there is that.
Possibly it depends on how you learn and use things. I find examples useful. For instance, when looking at how to do things in Python's Fabric, I would rather see some examples than read about how someone else is doing it.
The search is driven by Apache Solr and can be seen at http://www.solr-start.com/javadoc/solr-lucene/index.html. I use a custom doclet to generate the actual Javadoc, which is an interesting challenge all by itself.
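For anyone curious, a custom doclet is just a class the javadoc tool calls back into. A minimal skeleton using the legacy com.sun.javadoc API (not the actual doclet behind that site):

```java
// Minimal custom doclet skeleton using the legacy com.sun.javadoc API
// (JDK 8 era; JDK 9+ replaced this with jdk.javadoc.doclet.Doclet).
import com.sun.javadoc.ClassDoc;
import com.sun.javadoc.MethodDoc;
import com.sun.javadoc.RootDoc;

public class ListingDoclet {
    // The javadoc tool invokes this entry point with the parsed source tree.
    public static boolean start(RootDoc root) {
        for (ClassDoc cls : root.classes()) {
            System.out.println(cls.qualifiedName());
            for (MethodDoc m : cls.methods()) {
                System.out.println("  " + m.name() + m.signature());
            }
        }
        return true; // returning false makes javadoc report failure
    }
}
// Invoke with: javadoc -doclet ListingDoclet -docletpath . -sourcepath src -subpackages com.example
```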
I imagine you'll want android.googlesource.com and other googlesource.com repos.
Also, ranking is really important. My usual test is to search for java.lang.String and see what comes up. If the first result isn't the String class from some version of the Java SDK, something is wrong :-)
Also: CrossSiteIframeLinker finds the right file on the first page (though not at the top), but CrossSiteIframeLinker.java has no results.
The reason you aren't seeing String from Java is that by default it goes for files using the String class, rather than the implementation. It's something I need to work on though, as I agree it should appear near the top, or possibly in the documentation listing at the top.
As for CrossSiteIframeLinker, I don't index the filenames, although I am considering it for these sorts of cases.
This could become a useful code search option.
I would request that relevancy take into account provenance, meaning a search for pthread_t would return pthread implementations over uses.
This obviously can't make use of traditional tf-idf for scoring.
I have started looking at ranking the main implementation over usage, but found the results were generally less useful. This may have just been me though, as I wanted to see usage rather than the implementation, since if I already know the implementation I will just go to the source.
Perhaps an option on an advanced search page to give the original source greater weight would solve this.
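To make the idea concrete, a definition-versus-usage heuristic could look something like the following; this is entirely hypothetical, not how searchcode actually scores anything:

```java
import java.util.regex.Pattern;

// Hypothetical provenance heuristic: lines that *define* a symbol
// (typedef, #define) get boosted over lines that merely use it.
public class ProvenanceBoost {

    static double boostFor(String line, String symbol) {
        Pattern definition = Pattern.compile(
                "^\\s*(typedef\\b.*\\b" + Pattern.quote(symbol) + "\\s*;" // typedef ... pthread_t;
                + "|#define\\s+" + Pattern.quote(symbol) + "\\b)");       // #define pthread_t ...
        return definition.matcher(line).find() ? 2.0 : 1.0; // definitions get double weight
    }

    public static void main(String[] args) {
        System.out.println(boostFor("typedef unsigned long int pthread_t;", "pthread_t")); // 2.0
        System.out.println(boostFor("pthread_t worker;", "pthread_t"));                    // 1.0
    }
}
```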
I am interested in how this works with GitHub search, since their API only allows something like 5 unauthenticated req/min and 20 req/min if authenticated (at least for the search API).
Congrats to @boyter.
There is no request quota at the moment, although if I discover any abuse I may end up rate limiting per IP address (which I really do not want to do, as I don't currently track IPs). No API key is required either. Neither of these is likely to change.
I only ask for a link back to the site and that you don't spam it. I want to operate by the motto "Be excellent to each other".
If you want to pass a referrer in the GET request so I know who's using the API, that would be nice, but not required.
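For example, a call with a courtesy referrer might look like this; a sketch assuming the codesearch API endpoint, with a made-up referrer value:

```java
// Minimal sketch of hitting the searchcode API with a courtesy Referer header.
// Assumes the /api/codesearch_I/ endpoint; the referrer value is an example only.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchcodeClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://searchcode.com/api/codesearch_I/?q=pthread_t"))
                .header("Referer", "https://example.com/my-code-search-tool")
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON results
    }
}
```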
searchcode has no integration with GitHub search at all. I am unlikely to add it either, as I believe they do an excellent job, hence the link on the right-hand side to be redirected to GitHub search.
Thanks! Hope you find a nice use for the API. If so, let me know as I am always happy to hear about usage.
About the link back to the site: I don't see anything in the API results to link back to, and a plain link to the site will probably get a much lower rate of clicking than a link with the search context. Perhaps it would be a nice idea to add a link_back_url (just url?) for every result, so people can navigate to "searchcode.com/codesearch/view/n"?
Ah, that's a good point. I meant just linking back to searchcode itself, but that makes sense. Added for you. It should appear on new searches, and on all of them once the cache falls off (I don't want to flush it right now while it's getting so much traffic).
Same search on GitHub gives over 100,000 results for PHP. Doh.
I am seeing a pattern that most people would prefer the loose match over exact to get back more results, but with the exact matches (if any) at the top.
I will take this on board and see if I can improve things.
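Expressed in Lucene terms (searchcode runs on Sphinx, so this is just a sketch of the ranking technique, not the real implementation), that amounts to combining a loose clause with a boosted exact clause:

```java
// Sketch of "loose match, but exact hits first" as a Lucene query.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class LooseThenExact {
    static Query build(String field, String term) {
        Query loose = new PrefixQuery(new Term(field, term)); // xcb_connect* style matches
        Query exact = new TermQuery(new Term(field, term));   // the exact token
        return new BooleanQuery.Builder()
                .add(loose, Occur.SHOULD)
                .add(new BoostQuery(exact, 5.0f), Occur.SHOULD) // exact hits float to the top
                .build();
    }
}
```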
I would definitely expect the Google pattern of doing an exact match when in quotes. Throw in a regex engine as well, because I tried using wildcards and that didn't do anything.
For example, if searching for vulnerabilities I would like to be able to do something like:
If you could supply some specific examples that would be very useful though.
If you want to create your own solution, the easiest way is using Sphinx Search (searchcode.com uses it too); a bit more advanced is Lucene (and its related sub-projects) or Xapian.
If you want your personal Google Code Search with its powerful, scalable regex functionality, Russ Cox published the related code.
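For a sense of what the Lucene route involves, a toy end-to-end example; this is a sketch only (in-memory index, one document, recent Lucene assumed), nothing production-grade:

```java
// Toy code search with Lucene: index one file in memory, then query it.
// A real engine needs code-aware tokenization, ranking, incremental updates, etc.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TinyCodeSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index a single source file: path stored as-is, content tokenized.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("path", "src/thread.c", Store.YES));
            doc.add(new TextField("content",
                    "pthread_t worker; pthread_create(&worker, NULL, run, NULL);", Store.YES));
            writer.addDocument(doc);
        }

        // Search the index and print matching file paths.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher
                    .search(new QueryParser("content", analyzer).parse("pthread_t"), 10)
                    .scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}
```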
I actually think the new Koders (code.ohloh.net) is an improvement over the old Koders in many ways, but I wanted a leaner implementation.
Also, while Ohloh was working on the new version, Koders was becoming stale, hence my starting my own. Lastly, I thought that this sort of service should provide an API, which Ohloh sadly does not (hopefully this will change).
We're working on an improved code search, and I want to leverage this new infrastructure, which should start being available for our internal development in the next few weeks, to make code search a first-class citizen within Ohloh. For example, searching for key phrases within a project right from the project page.
I'll add some stories to the backlog to see what we can do to make the code search itself API-accessible.
You mentioned "leaner implementation"; could you expand on that a bit? I'd love to hear your thoughts.
Sure, just email me at email@example.com and I'll be happy to discuss. Mostly it comes down to not overcrowding the UI and running on minimal hardware.
Google does not like .de domains, and moving over to .com gave a massive boost in search results.