
Towards Natural Language Semantic Code Search at GitHub - Chris911
https://githubengineering.com/towards-natural-language-semantic-code-search/
======
dgreensp
It is a real cultural problem how engineers get more excited about machine
learning than basic usability.

GitHub search can't even search for a literal string, let alone a regex. It
can't search a subdirectory. Ranking is indistinguishable from random. It's
been this way for years. How about building an actual, usable, basic code
search and then getting all fancy with your machine learning?

I almost built my own "online git grep for GitHub" last year.

~~~
kornish
Agreed. Luckily, we as a community have tools like Sourcegraph which are based
on battle-tested pragmatic systems from places like Google.

Disclaimer: no affiliation, just love the team and product.

~~~
bryanrasmussen
Damn, Sourcegraph is very close to something I've been thinking of building.

~~~
sqs
Join us! We are hiring and growing quickly.

------
rococode
This might just be me, but does anyone else feel that GitHub's code search has
other points that could be improved first?

My biggest gripe is that the order results show in seems to be totally random.
For example, if I have a Java class called A and I search "class A" in code
search, the actual A.java doesn't tend to show up anywhere near the front. I
just tried this in a repo and the actual A.java file was on the last page of
results when I searched "class A". The vast majority of the results before it
didn't even have the words "class" and "A" next to each other, which A.java
does...

Maybe I'm doing something wrong (I'd welcome any input on how to use code
search correctly!), but it just feels like they're jumping the gun on trying
to make their code search more advanced when the basic functionality doesn't
work that well.

~~~
samlambert
We are very aware of the problem. I think you are going to really love what we
are working on.

~~~
hiccuphippo
I really hope you are right and have your priorities straight when it comes to
search. I'd love a way to search for usages of a class::method, or for
strings that contain the text "hello" or for variables named foo. And if you
integrate that into the code itself, Ctrl+click a class method to find all
usages, maybe even usages in other repositories, so I can see how other people
use a certain library.

And of course good old regex search.

~~~
hv42
Sourcegraph does this quite well if you need it.

------
finnh
I would settle for the ability to use logical OR when searching issues/pull
requests, or to combine multiple negated searches.

"is:pr is:open ( author:bob OR author:jim )"

The lack of this pretty basic functionality makes issue & PR search much less
useful than it could be.

~~~
matmo
Agreed. It'd also be nice to see a list of issues you're subscribed to. Here's
a fun issue to follow for that -
[https://github.com/isaacs/github/issues/283](https://github.com/isaacs/github/issues/283)

------
sam0x17
It is awesome that they are working on this, but can I just say there are a
lot of basic search features they need to add before "doing the hard thing".
Here are some things that I should be able to do easily but can't (or can't
very easily or well) using GitHub's search mechanism:

1. exact or close string searches for code that involves ![]{}_-*() etc
characters

2. searches across past commits (e.g. find a line that used to be in the
code)

4. search across pull request + comments (not just issues and commit
messages)

5. advanced search operators -- there should be a full filtering UI with ands
and ors etc

Because of this I often find myself grepping locally, or (more often) totally
out of luck.

------
aaaaaaaaaab
Now that’s what I call a misfeature!

GitHub is used by programmers. Surprisingly, they tend to be very good at
telling computers _precisely_ what they want, in the computers’ own language.

Natural language search is the exact opposite of this, invented for mom & pops
who start their search phrase with “Dear Google, I’d like to search for ...”.

------
KenanSulayman
GitHub is building some amazing stuff recently, I guess now that Microsoft is
going to acquire them, there's far less pressure on making GitHub Enterprise
profitable.

------
DannyBee
I saw this created in another thread and it seems to accurately sum up the
comments here: [https://imgflip.com/i/2i90x2](https://imgflip.com/i/2i90x2)

~~~
sqs
What thread did you originally see that in?

------
paintstripper
They should add regex search support first before this stuff.

------
nraynaud
wait, they can't search through forks or collate identical results and they
are going into natural language processing?

------
manigandham
Devs don't search code repositories using natural language queries, and any
scenarios of searching for code examples that way are already extremely well
handled by StackOverflow and Google.

This is an incredible waste of time and resources that could be spent making
the existing search far better with very minor tweaks. A perfect example of
big company project management where nobody seems to know what their users
actually want.

------
tyingq
I'd settle for github search that's case sensitive and recognizes things like
dollar signs, semi-colons, commas, braces, and such.

------
HereBeBeasties
Dear GitHub,

Please build search that lets me actually find a given file by name.

You are busy building a space rocket when all we want is a bicycle.
Impressive, but useless for just popping down to the shops.

Love,

The rest of the world's developers

------
mullikine
I want to work at github. They're making cool things.

~~~
brian-armstrong
Do they? The main product appears to have 0 product velocity

~~~
nkantar
I used to feel this way, but then I discovered
[https://blog.github.com/](https://blog.github.com/) and no longer do.

Sure, they may not be addressing your/my specific concerns, but the product
_is_ changing.

