@danfox, sent you an email though commenting here too.
I'm the CTO @ GitHub. Would love to talk to you about this and other things we are building in this area at GitHub.
Feel free to email direct to jason at github.com
I hope this means we're getting grep searches for GitHub soon. Cheers.
My own side project uses a server with 20 cores (2x E5-2690v2 CPUs), 256GB RAM, and a 2TB SSD. It's a dedicated server I rented from tier.net in Texas after seeing it listed on webhostingtalk. It costs about $160/mo, and that recently fell further when I paid for 3 months up front.
Then I had to clone the repo locally, run VS Code (updating it first), search there, modify, and push.
I wish I could do this in the GitHub web GUI.
TIL some of them are on this page that you only see if you search for an empty string:
Click on 'prefixes'. This kind of thing should be readily available from any search box that searches through GitHub.
I think there were better solutions in the early 2000s.
@danfox, i'm always down to talk code search as well - firstname.lastname@example.org
This project looks like it pulls literal factors out of the regex that I type in, maybe to feed an index a la that Russ Cox blog post from a while back about code search. It seems to Not Like things that have very open-ended character classes (e.g. \w) unless there is a decent-length literal involved somewhere.
It seems to have a fairly rudimentary literal extraction routine, as it gives a partial result set when fed an alternation between two literals, each of which it handles pretty well on its own.
Either pattern in alternation works fine, but even a simple alternation of the two goes back to the behavior that you might expect to get from awful patterns like \d..\d..\w...\s...\d (i.e. reporting only a partial set of matches).
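For illustration, here's a toy sketch (my own, with made-up documents, not grep.app's actual code) of the trigram-index trick from that Russ Cox post: extract a required literal from the query, intersect posting lists to get candidate files, then run the real regex only on those. An alternation like foo|bar needs the union of each branch's candidate sets, which is exactly the spot where a rudimentary extractor could start dropping matches:

```python
import re
from collections import defaultdict

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(docs):
    """Toy trigram index: trigram -> set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for t in trigrams(text):
            index[t].add(doc_id)
    return index

def candidates(index, docs, literal):
    """Docs containing every trigram of `literal` -- a superset of true matches."""
    grams = trigrams(literal)
    if not grams:
        return set(docs)  # no usable literal: forced to scan everything
    ids = None
    for g in grams:
        posting = index.get(g, set())
        ids = posting if ids is None else ids & posting
    return ids

docs = {
    "a.py": "def parse_args(argv): ...",
    "b.py": "print('hello world')",
    "c.py": "args = parse_args(sys.argv)",
}
index = build_index(docs)

# A literal-bearing query narrows the search before the real regex runs:
cand = candidates(index, docs, "parse_args")
hits = {d for d in cand if re.search(r"parse_args\(\w+", docs[d])}
```

With an alternation, the correct move is `candidates(..., "foo") | candidates(..., "bar")`; intersecting the branches instead silently shrinks the result set.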
1. How did you build the index? Did you use a GitHub dump of some sort? How often do you refresh it?
2. Is it Elasticsearch or similar or a completely custom engine?
3. What kind of RAM/CPU are you using to power it?
4. Any plans to open source the code or commercialize the technology?
I could absolutely imagine paying for a private code search engine like this to run against a large internal company codebase spread across many repositories.
Blazing fast multi-repo regex code search. May be more expensive to run in prod, not sure.
I'm pretty sure Elasticsearch supports regex search, it's just that it's horrendously slow and can blow up the system.
If this tool can fill that hole in my world, I'll be stoked. I've bookmarked it.
It also keeps track of back references, so you can search "who calls any function in this file", which is very hard to do with any other search system.
Major disadvantages are it only indexes one build config, so if you're debugging android code in a multi-platform project and the indexing was done on the windows version, you won't find much (apart from dumb text based search which it does in addition).
Already has been publicly contacted by:
- GitHub CTO
- SerpApi CEO
- SourceGraph CEO
Search is hot right now!
CTOs from software companies interview at other software companies.
I bet they would all go back home and immediately fix their own hiring practices.
I like my coworkers and I find them perfectly fine to work and make small talk with, but I don't share interests with many of them and wouldn't really care to hang with them outside of work. That shouldn't be a criterion for hiring.
I have found it highly annoying to work in engineering orgs where everyone seems to have the same interests. Everyone talking about Star Wars, Dungeons and Dragons, Lord of the Rings, etc. constantly because it's assumed everyone else around also enjoys that conversation.
Give me someone who is talented who makes great code so I can be home at 4:30pm and I don't care what their personality is like.
Additionally, someone who tells me when something is an issue, even at my ego's expense, is extremely valuable, as opposed to back-patters and schmoozers who just want to keep everyone happy. That leads to a terrible product. I would hate to see what whatever product you're working on turns out like.
You all should take a long look at yourselves and ask why you have to work with people who are just like you instead of being adaptive to other walks of life, personality, and backgrounds.
Try getting out of yourselves for a minute. You might even learn something new outside of your own tiny, tiny worlds!
I mean, if one person who rejects bro culture only wants to collaborate with other people who also reject bro culture, does that mean they are now proponents of bro culture?
I also find it frankly a bit weird for you to make grand sweeping assumptions about who some strangers on an Internet forum choose to associate and collaborate with. How do you know people here don’t work with people from other backgrounds?
And not a single thing you just said makes any logical sense.
I do know I would never want to work on any project that you're in charge of because I guarantee they're nightmare environments.
Best of luck to you nonetheless.
If they get me out of work at 4:30 pm and keep the code of the project I'm working on in good shape so I have fewer fires to deal with, that's good enough for me.
I guess what I mean to ask is, how would people know this is a "correct" answer to the "how" question beforehand? Is the answer literally just "search" because that's simply what's trending right now?
(See the "Theory: the search problem" section)
Size: This is only indexing ~500k public repos. A first party solution would be expected to index all of it, public and private.
Indexing speed: This can take up to a few days to index. A first party solution would be expected to have a much lower index latency - seconds to minutes.
Query language: This can (and does) have its own simple query language. A first party solution would need to have support embedded into and not break backwards compatibility with the current query language.
Context-dependence: A first party solution would be expected to index private repos as well, and now the query context (logged in user) becomes another variable in an already multi-variate problem space.
Latency: Gets harder with scale, and a first party solution would likely provide a SLA/SLO around latency.
Access control: Same issue as context-dependence, with private repos being included.
There's also unknown but likely considerations around compliance and internationalization, which are quite tricky problems.
Note - I don't mean for this to be critical of the author at all. This is an awesome and useful tool, with a fantastic UX. I just want to make it clear that search at scale is a lot harder than it seems at first glance, especially as the feature requirements increase.
The more reasons you give people to go to GitHub, the better off they will be in the future. So I do agree with you that as a commercial solution, this may not be viable, but for GitHub's public repos, this can turn into a very positive thing.
Doing technical things that GitLab, Bitbucket, etc. can't is quite valuable. It also helps with recruiting, since smart people want to work on difficult problems.
It may well be costly to maintain, but I think the operating cost would be well within the realm of an incumbent that wants to maintain and expand its reach. I've been studying the code hosting space for quite some time, and GitHub, from an outsider's perspective, appears to be much more focused and ambitious, which should cause serious concern for GitLab.
I'm the CEO at Sourcegraph (universal code search for companies to use on their internal code). Our product is really optimized for searching a company's internal code right now, but soon we'll start working on offering much better search for public and open-source code as well. If you'd like to help out or just chat, please reach out! email@example.com
Sourcegraph.com is universal code search and navigation across all public repositories. To use it on private code inside your company, run a self-hosted instance at https://docs.sourcegraph.com/#quickstart.
We've been so focused on internal code search for companies. See https://about.sourcegraph.com for some of the logos of well-known companies whose devs all use Sourcegraph. Because of that, our "public demo" site at Sourcegraph.com has a few limitations that we're working on lifting, such as only searching across a subset of popular repositories by default (unless you specify a specific subset with `repo:` in the query).
While there were some tools for this, they fall short for older projects where using a library meant copy/pasting it into your project, which isn't reported in CDN stats, npm installs, or GitHub "uses".
Now I can run a search with a bit of code that is only present in my library and reliably find those who copy/pasted it. While I publish my code under the MIT license, this would also be very useful for those publishing under the GPL to detect bad actors.
my_function.*option_1 # search
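The same workflow works offline against local clones. A minimal Python sketch (the pattern is the hypothetical library-specific snippet from above, not a real one):

```python
import os
import re

# Hypothetical pattern distinctive to the library being traced.
pattern = re.compile(r"my_function.*option_1")

def find_copies(root):
    """Yield (path, line_no, line) for lines matching the distinctive snippet."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    for no, line in enumerate(f, 1):
                        if pattern.search(line):
                            yield path, no, line.rstrip()
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
```

Cheap and crude compared to an index, but it's the ground truth you'd verify index hits against.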
I almost always just resort to cloning and searching with ripgrep, which can be annoying if I have no other reason to have the codebase on my machine or it's just a one-off.
Can it grep on individual repos?
Obviously the source material is different (Debian packages vs GitHub repos) and grep.app also uses re2, but that is all I can see from a look at the “about” blurb.
I am the CEO at SerpApi. If you need a job, shoot me an email at julien _at_ serpapi.com.
That podcast is at: https://runninginproduction.com/, drop me a line at firstname.lastname@example.org if you're interested.
I've used Sourcegraph and it was cool; will have a look at this new tool too. But, GitHub, pretty please add plain good old grep abilities to your search!
Something I found when testing the regexp: the highlights seem to be off sometimes. When grepping for '<.*?@gmail.com>' (sorry, just the first thing that came to mind to try out the regexp), the second highlight in the first result seems to be in the wrong location:
Really like the minimalistic design: not too designy, but still easy on my eyes. Just the way I want it, letting me focus on the task at hand.
Anyway, to not only criticize, good job! It's definitely one of GitHub's missing features. And I can imagine it's not an easy job to build something like that. But as I wrote, it really has to be well explained to be actually usable.
So you know exactly how Google's index works?
I think "best effort", whatever it is, is useful even if I don't know the specifics of what it captures or misses. As long as it returns useful results.
Rust's regex engine does not support backreferences. RE2 does not either. ripgrep does however have a -P/--pcre2 flag which causes it to use PCRE2 instead of Rust's regex engine. PCRE2 supports backreferences and other things, like look-around. (ripgrep also has an --auto-hybrid-regex flag, which will automatically enable PCRE2 for you if you write a regex with backreferences or look-around.)
The reason not to use an engine like PCRE2 for a project like this is because it would be trivially exposed to ReDoS: https://en.wikipedia.org/wiki/ReDoS
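To make the trade-off concrete, here's a small Python illustration of my own (Python's `re` is a backtracking engine, roughly comparable to PCRE2 on these features):

```python
import re

# Backreference: \1 must repeat whatever group 1 captured.
# RE2 and Rust's regex crate reject this pattern; backtracking
# engines (PCRE2, Python's re) accept it.
m = re.search(r"\b(\w+) \1\b", "it was was a typo")
assert m.group(1) == "was"

# Look-ahead, also unsupported in RE2-style engines:
# match "foo" only when "(" follows, without consuming it.
assert re.search(r"foo(?=\()", "foo(1)") is not None
assert re.search(r"foo(?=\()", "foo bar") is None

# The price of this power: a pattern like (a+)+$ backtracks
# exponentially on non-matching input such as "aaaa...b",
# which is exactly the ReDoS risk with untrusted queries.
```

RE2-style engines avoid the blow-up by guaranteeing linear-time matching, which is why they give up backreferences and look-around.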
(1) Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions.
Besides, that paper was published 12 years ago. Where is the productionized version of it? Or are you suggesting that the OP go spend a few years writing a regex engine? :-) Doesn't seem like a particularly practical suggestion.
 - https://perl.plover.com/NPC/NPC-3SAT.html
 - https://branchfree.org/2019/04/04/question-is-matching-fixed...
By the way, good work on ripgrep and Rust.
At that point, it would be hopelessly difficult to build ripgrep. The right path then would be to build a new application that uses whatever of ripgrep's libraries make sense.
Popping up a level though, why would you want to compile to WASM? If you're using Node, then surely you can build an FFI bridge to Rust's regex library. At least at that point, you'd be using the same regex engine. I even maintain official C bindings for them: https://github.com/rust-lang/regex/tree/master/regex-capi
EDIT: Oh, and not sure if this is useful, but the regex crate itself should compile to WASM just fine. I know I've seen people run it in the browser before. If there's a problem here, then please file a bug!
As for wasm vs FFI, it would ideally work in browser (Monaco), which makes wasm the best bet I believe.
Primo, this is a crazy snappy proof that GitHub search can be done. Next, the UI is amazing. And finally, all my queries worked!
I am now going to remove "github search sucks" from my to-be-published rants, because this post demonstrates that 1. people care and 2. GitHub was already being worked on.
Backend for codegrep was Play framework + Elasticsearch and you could search by programming languages.
Details like this would help the OP to track down the exact cause of why it has indexed the forks but not the original repo.
In my case, I am not talking about forks but about people who copied my files into their repositories (with proper attribution and respecting the license). I just searched for my surname and was happily surprised to see it in major projects like ffmpeg, pytorch, bytedeco, scikit and opencv.
Can't wait for filename filters which would make this the perfect solution
But it's actually pretty #neat. It's all tidied up into a single app without any dependencies.
This rocks and, so far, seems way way WAY better than Github's own search tool.
I got a tooltip saying:
Error: JSON.parse: unexpected character at line 1 column 1 of the JSON data
Update: Oh, ^(.)"(.)"(.)$ works, and fast.
Cherry-picking one post from a statistical cloud and calling it typical is dodgy. Even the distribution in this thread doesn't match your description. Actually, even the comment you're picking on doesn't match your description.
We detached this subthread from https://news.ycombinator.com/item?id=22397156.
I hope u/dang sees your comment history; you are basically just spamming nerdydata.com
However, Eggexes are a thin, mostly-syntactic layer over regexes. You still have to understand the regex engine to use them. If this sounds useless to you because you don’t currently understand any flavor of regex or parsing, I encourage you not to give up on learning regexes. (https://www.regular-expressions.info/ was how I learned; it’s a great tutorial.) Text-parsing engines, including regex engines, are a powerful concept that can be used in many situations, and I think it’s worth spending the effort learning them until, to paraphrase another commenter, regexes become the human-readable language you were searching for. Or Eggexes, at least.
Yeah, regex can be a bit clunky at times and has a steeper learning curve, but they're pretty industry standard at this point, and portable across languages with a few caveats.
Is there an alternative that is clearly superior?
It uses operator overloading to build patterns from component parts. I don't think anything can replace the terseness of regex for command line use, or vim searching, cases like that.
But for a program, give me lpeg every time.
I can understand that a complex pattern might look scary if you're unfamiliar, but if you work with it long enough, you can put patterns together with relative ease.
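For what it's worth, verbose/extended mode is one thing that keeps longer patterns approachable. A small Python example of my own (not from anything in this thread): whitespace and comments inside the pattern are ignored, so the structure stays legible as it grows.

```python
import re

# A semver-ish matcher in verbose mode: each piece gets its own
# line, a named group, and a comment.
SEMVER = re.compile(r"""
    ^
    (?P<major>\d+) \.     # major version
    (?P<minor>\d+) \.     # minor version
    (?P<patch>\d+)        # patch version
    (?: - (?P<pre>[0-9A-Za-z.-]+) )?   # optional pre-release tag
    $
""", re.VERBOSE)

m = SEMVER.match("1.4.2-rc.1")
assert m.group("major") == "1"
assert m.group("pre") == "rc.1"
```

The same pattern crammed onto one line is write-only; spread out like this, it reads almost like a grammar.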