I do enjoy moments like these on Hacker News, when someone presents a project for X and the CTO of X shows up and wants to talk. It shows how direct an impact one can potentially have in this community.
I hope this means we’re getting grep searches for github soon. Cheers.
It would certainly be an expensive play. When he mentioned a 20-core system, I'm assuming it's some VM setup, since I don't know of any 20-core CPUs. I'm guessing he's using DigitalOcean, and since he has two of them, he's looking at $1000 a month in hosting costs.
It's an expensive side project for sure, but it doesn't have to be anywhere near as expensive as that.
My own side project uses a server with 20 cores (2x E5-2690v2 CPUs), with 256GB RAM and a 2TB SSD. This is a dedicated server I rented from tier.net in Texas, after seeing it listed on webhostingtalk [1]. It costs about $160/mo, and that's recently fallen further by paying for 3 months up front.
I'm not a dev, and what I like to do is search for code (exact match) to replace whatever variable or text needs to be changed.
GitHub's in-repo search kind of worked at some point, then it didn't.
So I had to clone the repo locally, run VS Code (updating it first), search there, modify, and push.
I wish I could do this in the GitHub web GUI.
That's awesome. I didn't know about the keyword filename:
I've been using the button to the left of "Clone or Download" this whole time.
Thanks for the info!
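(For anyone else who missed it: the qualifier goes right in the search box alongside your terms, e.g. `filename:.travis.yml cargo` - at least that was the syntax last time I checked.)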
iirc, Github uses (used?) my old project (https://github.com/intel/hyperscan) at Intel. It's probably faster than the alternatives, although if you want to support all types of regex you'll need to use Hyperscan as a prefilter for a richer regex engine like PCRE.
This project looks like it pulls literal factors out of the regex I type in, maybe to query an index a la that Russ Cox blog post from a while back about Code Search. It seems to Not Like things that have very open-ended character classes (e.g. \w) unless there is a decent-length literal involved somewhere.
The literal extraction routine seems fairly rudimentary, as it returns only a partial result set when fed an alternation of two literals that it handles fine individually.
Either pattern on its own works great, but even a simple alternation of the two falls back to the behavior you'd expect from awful patterns like \d..\d..\w...\s...\d (i.e. reporting only a partial set of matches).
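Here's a toy sketch of the kind of extraction I mean (illustrative types only, nothing from grep.app's actual code; the real scheme from Cox's post also handles concatenation, prefixes/suffixes, and OR queries):

```rust
// Toy model of trigram extraction for index prefiltering, a la
// Russ Cox's "Regular Expression Matching with a Trigram Index".

enum Ast {
    Literal(String),         // e.g. "hello"
    Alt(Box<Ast>, Box<Ast>), // e.g. foo|bar
    Open,                    // stand-in for \w, \d, .* etc.
}

/// Trigrams that must all appear in any matching document.
/// None means no usable literal: full scan, or partial results.
fn required(ast: &Ast) -> Option<Vec<&str>> {
    match ast {
        Ast::Literal(s) if s.len() >= 3 => {
            Some((0..=s.len() - 3).map(|i| &s[i..i + 3]).collect())
        }
        // Only trigrams required by BOTH branches survive; `foo|bar`
        // shares none, so a simple extractor degrades to a partial
        // result set here unless it can emit an OR of two queries.
        Ast::Alt(a, b) => {
            let (ta, tb) = (required(a)?, required(b)?);
            Some(ta.into_iter().filter(|t| tb.contains(t)).collect())
        }
        _ => None, // short literals, \w, .* etc. give us nothing
    }
}

fn main() {
    let lit = Ast::Literal("hello".into());
    println!("{:?}", required(&lit)); // Some(["hel", "ell", "llo"])

    let alt = Ast::Alt(
        Box::new(Ast::Literal("foo".into())),
        Box::new(Ast::Literal("bar".into())),
    );
    println!("{:?}", required(&alt)); // Some([]): nothing to index on

    println!("{:?}", required(&Ast::Open)); // None: forces a scan
}
```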
Impressive! Really fast, full featured code search across a huge corpus.
1. How did you build the index? Did you use a GitHub dump of some sort? How often do you refresh it?
2. Is it Elasticsearch or similar or a completely custom engine?
3. What kind of RAM/CPU are you using to power it?
4. Any plans to open source the code or commercialize the technology?
I could absolutely imagine paying for a private code search engine like this to run against a large internal company codebase spread across many repositories.
Thanks! It's built on top of Solr. It fetches the repos from GitHub - it should pick up any updates to repos within a few days. It's running on a couple servers with 20 cores each, which is not really enough for the traffic it's getting right now.
I'd be curious how you built the step from regex to ElasticSearch. My guess would be an n-gram (3-gram) index in ElasticSearch and then translating the regexes to that, but just curious if you built that custom or used something off-the-shelf. Love the site!
> I'd be curious how you built the step from regex to ElasticSearch. My guess would be an n-gram (3-gram) index in ElasticSearch and then translating the regexes to that, but just curious if you built that custom or used something off-the-shelf. Love the site!
I'm pretty sure Elasticsearch supports regex search; it's just horrendously slow and can blow up the system.
I still miss Google Code Search, which was a great way to find examples of anything I wanted to learn about in programming and usually answered my questions better than anything else, including Stack Overflow. Has it really been 8 years... https://news.ycombinator.com/item?id=3112029
If this tool can fill that hole in my world, I'll be stoked. I've bookmarked it.
The main difference, IMO, is that it indexes a symbolic code graph extracted from halfway through the compilation process. That means that when you search, it knows which functions are frequently called. For example, the LOG() macro is defined in hundreds of places, but the one in logging.h is the one everyone calls, so that's the one that comes top of the results.
It also keeps track of back references, so you can search "who calls any function in this file", which is very hard to do with any other search system.
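A rough sketch of the shape of that index (hypothetical names; the real thing is populated from compiler output, not text):

```rust
use std::collections::HashMap;

// Back-reference index: for every definition, the call sites that
// reference it. A real indexer fills this in from compile-time data.
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
struct Symbol {
    file: String,
    name: String,
}

struct CallIndex {
    // callee definition -> call sites referencing it
    callers: HashMap<Symbol, Vec<Symbol>>,
}

impl CallIndex {
    /// "Who calls any function defined in this file?"
    fn callers_of_file(&self, file: &str) -> Vec<&Symbol> {
        self.callers
            .iter()
            .filter(|(callee, _)| callee.file == file)
            .flat_map(|(_, sites)| sites.iter())
            .collect()
    }
}

fn main() {
    let log = Symbol { file: "logging.h".into(), name: "LOG".into() };
    let site = Symbol { file: "main.cc".into(), name: "main".into() };
    let index = CallIndex {
        callers: HashMap::from([(log, vec![site])]),
    };
    println!("{:?}", index.callers_of_file("logging.h"));
}
```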
The major disadvantage is that it only indexes one build config, so if you're debugging Android code in a multi-platform project and the indexing was done on the Windows version, you won't find much (apart from the dumb text-based search it does in addition).
The difficulty of compiling every project to build a decent index would make this approach hard at GitHub scale - all it takes is one missing header file from a dependency not in the repo and the build fails, so the whole project can't be indexed. Also, have fun with languages like JavaScript, which are so dynamic you'd have to solve the halting problem to know which bit of code calls which other bit.
I'm in a field and geographic area with a talent pool so shallow that it seems like straight-up madness to throw questions like that at people and kick them out the door over it.
Sure, you built an app on multiple 20-core machines with the functionality to search hundreds of millions of lines of code almost instantaneously, but are you someone I'd drink a beer with?
This snide remark dismisses the fact that working on software does mean working with other humans, not just unemotional robots devoid of any kind of irrational ideas. Being able to “drink a beer with” (and reasonably substituting the drinking of beer for just about any other social interaction) is an important part of being able to work with someone. Unless of course you believe an office environment consisting of a tyrannical manager barking orders at worker drones is a healthy relationship.
Are you having intimate romantic relationships with all of your co-workers?
If they get me out of work at 4:30 pm and keep the project I'm working on in quality code so I have fewer fires to deal with, that's good enough for me.
I think when people talk about this, they mean to push back against the fact that people will often be biased toward hiring someone they think they could be casual friends with, share interests with, etc.
I like my coworkers and find them perfectly fine to work and make small talk with, but I don't share interests with many of them and wouldn't really care to hang out with them outside of work. That shouldn't be a criterion for hiring.
I have found it highly annoying to work in engineering orgs where everyone seems to have the same interests. Everyone talking about Star Wars, Dungeons and Dragons, Lord of the Rings, etc. constantly because it's assumed everyone else around also enjoys that conversation.
It's an ego thing to want to work with someone just like you instead of adapting yourself to others. It's basically bro culture. It's kind of what's wrong with technology culture.
Give me someone who is talented who makes great code so I can be home at 4:30pm and I don't care what their personality is like.
Additionally, someone who tells me when something is an issue, even at my ego's expense, is extremely valuable, compared to back-patters and schmoozers who just want to keep everyone happy. That leads to a terrible product. I'd hate to see what whatever product you're working on looks like.
You all should take a long look at yourselves and ask why you have to work with people who are just like you instead of being adaptive to other walks of life, personality, and backgrounds.
Try getting out of yourselves for a minute. You might even learn something new outside of your own tiny, tiny worlds!
That’s a pretty unfortunate interpretation of my comment, and not entirely logically consistent either.
I mean, if one person who rejects bro culture only wants to collaborate with other people who also reject bro culture, does that mean they are now proponents of bro culture?
I also find it frankly a bit weird for you to make grand sweeping assumptions about who some strangers on an Internet forum choose to associate and collaborate with. How do you know people here don’t work with people from other backgrounds?
If only the answer to "how" were as simple as "write a web service for searching GitHub repos with regexes" - though the problem is probably non-trivial in itself, given there's this much interest in search at all. At least the specification is clear enough.
I guess what I mean to ask is, how would people know this is a "correct" answer to the "how" question beforehand? Is the answer literally just "search" because that's simply what's trending right now?
If this were to be offered by an actual company (a first party solution), there are some features that'd be expected that make the problem space a lot harder. Here's an "intro to search" article that's a good read, and I'll use it to highlight some of the things that'd be different in a first party solution - https://medium.com/startup-grind/what-every-software-enginee...
(See the "Theory: the search problem" section)
Size: This is only indexing ~500k public repos. A first party solution would be expected to index all of it, public and private.
Indexing speed: This can take up to a few days to index. A first party solution would be expected to have a much lower index latency - seconds to minutes.
Query language: This can (and does) have its own simple query language. A first party solution would need support embedded into the current query language without breaking backwards compatibility.
Context-dependence: A first party solution would be expected to index private repos as well, and now the query context (logged in user) becomes another variable in an already multi-variate problem space.
Latency: Gets harder with scale, and a first party solution would likely provide an SLA/SLO around latency.
Access control: Same issue as context-dependence, with private repos being included.
There's also unknown but likely considerations around compliance and internationalization, which are quite tricky problems.
Note - I don't mean for this to be critical of the author at all. This is an awesome and useful tool, with a fantastic UX. I just want to make it clear that search at scale is a lot harder than it seems at first glance, especially as the feature requirements increase.
Engineering manager for code search at GitHub here... this is an excellent summary of many of the concerns we have as we work on code search at GitHub scale!
For GitHub, I would have to imagine that only being able to search public repos with regexes would be good enough. GitHub has many strategies, but the main one is to maintain, if not expand, their open source mind share.
The more reasons you give people to go to GitHub, the better off they will be in the future. So I do agree with you that as a commercial solution, this may not be viable, but for GitHub's public repos, this can turn into a very positive thing.
That might well be true, but scaling this type of service to all public repos with decent latency and update cadence is a major technical challenge and likely very costly to maintain.
This is my personal observation, but GitHub appears to be a much more ambitious company, now that they are part of Microsoft. With a CEO that understands both the open source and the enterprise world and with Microsoft cash at hand, I don't think spending money to make search better would cause any concerns.
Doing technical things that GitLab, Bitbucket, etc. can't is quite valuable. It also helps with recruiting, since smart people want to work on difficult problems.
It may well be costly to maintain, but I think the operating cost would be well within the realm of an incumbent that wants to maintain and expand their reach. I've been studying the code hosting space for quite some time, and GitHub, from an outsider's perspective, appears to be much more focused and ambitious, which should cause serious concerns for GitLab.
This is really cool. What are you using it for? Usage examples, debugging, etc.?
I'm the CEO at Sourcegraph (universal code search for companies to use on their internal code). Our product is really optimized for searching a company's internal code right now, but soon we'll start working on offering much better search for public and open-source code as well. If you'd like to help out or just chat, please reach out! sqs@sourcegraph.com
Doesn't Sourcegraph let you just search with regexes over any files in a repo? This is textual search, so how are languages relevant to it? I didn't seem to have problems with that.
Sorry, maybe I have confused Sourcegraph with https://searchcode.com, but last time I tried, it supported only the most widely used languages, such as Java, Python and so on, but not the language I use (Delphi/Object Pascal). I'm sorry if I'm wrong.
Sourcegraph.com is universal code search and navigation across all public repositories. To use it on private code inside your company, run a self-hosted instance at https://docs.sourcegraph.com/#quickstart.
We've been so focused on internal code search for companies. See https://about.sourcegraph.com for some of the logos of well-known companies whose devs all use Sourcegraph. Because of that, our "public demo" site at Sourcegraph.com has a few limitations that we're working on lifting, such as only searching across a subset of popular repositories by default (unless you specify a specific subset with `repo:` in the query).
This is amazing! One thing it allows me to do, which I couldn't before, is search for the repos that use some of my open source code.
While there were some tools for this, they fall short for older projects, where using a library meant copy/pasting it into your project - which doesn't show up in CDN stats, npm installs, or GitHub's "used by" counts.
Now I can run a search for a bit of code that is only present in my library and reliably find those who copy/pasted it. While I publish my code under the MIT license, this would also be very useful for those publishing under the GPL to detect bad actors.
That was my first thought. I'll have to wait until tomorrow to try it, but I have one super rarely used function in a rarely used package that I'd love to see how other people are using.
To grep specific repos locally, I use a tool called Hound, https://github.com/hound-search/hound, developed by a couple of engineers at Etsy while I was there, but never released officially.
I built https://grephub.com for that. It doesn’t maintain an index so it’s not super snappy, but it’s good enough / better than you’d expect in many cases!
Why would you want to use this tool to grep individual repos? If you know the repo you're interested in, you can just clone it and then grep it locally...?
I like to grep for a code pattern throughout all repos.
I use GStreamer, and sometimes I just don't know how to use it to do a specific thing. So I search for substrings to find usage patterns from other people.
A tangent, but my biggest gripe with GitHub code search (within a repo), off the top of my head, is the inability to blacklist directories or only search whitelisted directories. Oftentimes I want to look up the implementation of a function, and bam, three pages of results from tests.
I'm glad I'm not the only one. It's very common that I'll be searching for a keyword that only appears in the actual code a handful of times but hundreds of times in tests. GitHub's search is practically useless in those cases.
I almost always just resort to cloning and searching with ripgrep, which can be annoying if I have no other reason to have the codebase on my machine or it's just a one-off.
Yeah, having this issue as well. When trying to find where a method is defined in JS/TS, I'd so much want to be able to exclude `*.(spec|test).(js|ts|jsx|tsx)`.
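In the meantime, if you've got the repo cloned, ripgrep's glob filters do exactly this: `rg someMethod -g '!*.spec.*' -g '!*.test.*'` (the `!` negates the glob).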
This is really cool! Awesome work. I assume you've seen https://sourcegraph.com/ as well? This to me seems much clearer and a bit more intuitive (though I've only spent a little time in sourcegraph). Really really cool. Does it also search code comments?
Obviously the source material is different (Debian packages vs GitHub repos) and grep.app also uses re2, but that is all I can see from a look at the “about” blurb.
Hey Dan, if you ever wanted to come on my podcast to talk about your tech stack (how your site is developed / deployed, lessons learned, etc.), I'd love to have you on.
There must be something else going on, or something wrong, because you indexed one of my small repos (~100 stars, ~20 forks, ~20MB) and not the bigger ones (~500 stars, ~100/150 forks, ~150MB).
I don't have a great example to try on my phone, but are results deduplicated? My big peeve with GitHub search is getting 5 pages of the same forked repo.
There isn't any deduplication, although that will hopefully be less of an issue at this point since there's a limited number of repositories in the index.
GitHub confirmed to me that their search is not able to match substrings; this is annoying because, before making a change, if you want to find all affected code among all possibly involved repositories, you need to clone them and grep locally. In the end this means you need to clone absolutely everything you work with, because otherwise you might miss changing that one repo you didn't think of.
I've used Sourcegraph and it was cool; I'll have a look at this new tool too. But GitHub, pretty please add plain good old grep abilities to your search!
Something I found when testing the regexp: the highlights seem to be off sometimes. When grepping for '<.*?@gmail.com>' (sorry, just the first thing that came to mind to try out the regexp), the second highlight in the first result seems to be in the wrong location:
I would say this needs a list of indexed repos and, mainly, an explanation of exactly how it works to be usable (how the index is built and how often it's refreshed, what types of files are indexed, etc.). Otherwise, there's not much value in searching unknown data, is there?
Anyway, so as not to only criticize: good job! It's definitely one of GitHub's missing features, and I can imagine it's not an easy thing to build. But as I wrote, it really has to be well explained to be actually usable.
> there's not much value in searching unknown data, is there?
So you know exactly how Google's index works?
I think "best effort", whatever it is, is useful even if I don't know the specifics of what it captures or misses. As long as it returns useful results.
Superb work. You built a better code search than GitHub (well, with some of its features missing, sure) with far fewer resources. It shows how progress stagnates at big companies once a service is deemed "good enough". Good for you, kicking them in the butt to lead the way. Hope you get something more out of this than HN karma.
Really like the minimalistic design - not too designy, but still easy on my eyes. Just the way I want it, letting me focus on the task at hand.
Not quite. ripgrep uses Rust's regex engine, not RE2. Rust's regex engine is descended from RE2, but there is no code sharing.
Rust's regex engine does not support backreferences. RE2 does not either. ripgrep does however have a -P/--pcre2 flag which causes it to use PCRE2 instead of Rust's regex engine. PCRE2 supports backreferences and other things, like look-around. (ripgrep also has an --auto-hybrid-regex flag, which will automatically enable PCRE2 for you if you write a regex with backreferences or look-around.)
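A tiny illustration of the split, using the regex crate:

```rust
use regex::Regex;

fn main() {
    // Rust's regex engine rejects backreferences at compile time;
    // supporting them would break its linear-time guarantee. This is
    // the case where ripgrep falls back to PCRE2.
    assert!(Regex::new(r"(\w+) \1").is_err());

    // Everything without backrefs/look-around compiles and matches
    // in time linear in the input.
    let re = Regex::new(r"\w+@gmail\.com").unwrap();
    assert!(re.is_match("someone@gmail.com"));
}
```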
The reason not to use an engine like PCRE2 for a project like this is because it would be trivially exposed to ReDoS: https://en.wikipedia.org/wiki/ReDoS
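To make that concrete, here's the classic pathological case. A backtracking engine like PCRE tries exponentially many ways to split up the 'a's before failing, which makes a public web service trivially easy to DoS; a finite-automata engine (RE2, Rust's regex) stays linear:

```rust
use regex::Regex;
use std::time::Instant;

fn main() {
    // Nested quantifiers plus a non-matching suffix: the textbook
    // ReDoS pattern.
    let re = Regex::new(r"^(a+)+$").unwrap();
    let evil = format!("{}b", "a".repeat(100_000));

    // Rust's regex rejects this in linear time, not exponential.
    let t = Instant::now();
    assert!(!re.is_match(&evil));
    println!("rejected in {:?}", t.elapsed());
}
```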
Nope. That still supports backreferences, and resolving backreferences is an NP-complete problem.[1] And I don't see anything in that paper that addresses that. Note that there may be some versions of the problem that aren't NP-complete[2], but again, they're not addressed by that paper.
Besides, that paper was published 12 years ago. Where is the productionized version of it? Or are you suggesting that the OP go spend a few years writing a regex engine? :-) Doesn't seem like a particularly practical suggestion.
The paper gives some bounds on the number of states in the automata as a function of the length of the input, so one could limit the input length when using backreferences to bound the complexity of the algorithm. They used their algorithm for Snort (network intrusion detection) on an ASIC. The author could contact the authors of the paper and ask for (or pay for) an implementation.
Thanks for the clarification. As an aside, how difficult do you think it would be to compile ripgrep to wasm? In VS Code we use ripgrep for full-workspace search and Node's regex library for in-memory searches, but this leads to discrepancies and issues such as catastrophic backtracking in the in-memory search.
I've never tried to compile to WASM. It really depends on how much of the OS API surface is needed. e.g., I don't think WASM supports memory maps, as one example. In that case, ripgrep could be made to compile without support for memory maps with a bit of work. But that's an easy case. What other things does WASM not support? What about typical file/directory APIs? I don't think it supports them, or at least, it looks like Rust's standard library doesn't implement anything for them: https://github.com/rust-lang/rust/blob/master/src/libstd/sys...
At that point, it would be hopelessly difficult to build ripgrep. The right path then would be to build a new application that uses whatever of ripgrep's libraries make sense.
Popping up a level though, why would you want to compile to WASM? If you're using Node, then surely you can build an FFI bridge to Rust's regex library. At least at that point, you'd be using the same regex engine. I even maintain official C bindings for them: https://github.com/rust-lang/regex/tree/master/regex-capi
EDIT: Oh, and not sure if this is useful, but the regex crate itself should compile to WASM just fine. I know I've seen people run it in the browser before. If there's a problem here, then please file a bug!
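Something like this ought to be all you need with wasm-bindgen (untested sketch; `find_first` is just an illustrative name, and I haven't run this in a browser myself):

```rust
use regex::Regex;
use wasm_bindgen::prelude::*;

// Exposes the regex crate's matching to JS. Build with, e.g.,
// `wasm-pack build --target web`.
#[wasm_bindgen]
pub fn find_first(pattern: &str, haystack: &str) -> Option<String> {
    let re = Regex::new(pattern).ok()?; // None on an invalid pattern
    re.find(haystack).map(|m| m.as_str().to_string())
}
```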
Thanks for all the advice. We'd just be running it on single buffers so I agree it makes more sense to start from the rust regex library than ripgrep. We do however need to continue supporting backrefs and lookaround, so we'll need to add `--auto-hybrid-regex` functionality to fall back to either Node's engine or a webassembly PCRE2.
As for wasm vs FFI, it would ideally work in browser (Monaco), which makes wasm the best bet I believe.
Ah yeah, for backrefs you'll need to find a way to use PCRE2. Not sure what the WASM story is there. But at that point, if your only problem with Node regexes is catastrophic backtracking, then you might as well just stick with Node. PCRE2 will have the same problem.
thank you so much for doing this! i hope it continues to open more doors of opportunity for you!
primo, this is a crazy snappy proof that shows that github search can be done. next, the UI is amazing. and finally, all my queries worked!
i am now going to remove "github search sucks" from my to-be-published rants because this post demonstrates that 1. people care 2. github was already working on it.
Can you provide detailed steps to reproduce? What strings did you search? Two examples of repos that appeared in the results? What is the link to your repo that did not appear in the results?
Details like this would help the OP to track down the exact cause of why it has indexed the forks but not the original repo.
The authors are quite explicit that this site only includes a fraction of all github repos. Thus, this is not a "bug" that needs to be corrected.
In my case, I am not talking about forks but about people who copied my files into their repositories (with proper attribution and respecting the license). I just searched for my surname and was happily surprised to see it in major projects like ffmpeg, pytorch, bytedeco, scikit and opencv.
Can I search only additions/deletions? Recently when searching GitHub I wanted to find if anyone had replaced the usage of a deprecated method with the new one, because the docs for that library don't mention the non-deprecated method name.
My last name(Ament) is really rare where I come from, so I've used the tool to find other people with the same last name. Was not disappoint.
Thank you!
There's no need for personal attack. We ban accounts that do that, so please don't.
Cherry-picking one post from a statistical cloud and calling it typical is dodgy. Even the distribution in this thread doesn't match your description. Actually, even the comment you're picking on doesn't match your description.
Why does regex still exist? It is unintuitive, requires mastering an obscure syntax, is very hard to debug, and is very difficult to explain to others. It feels like we are writing intermediate code by hand, when we should have a human-readable language that generates the regex.
However, Eggexes are a thin, mostly-syntactic layer over regexes. You still have to understand the regex engine to use them. If this sounds useless to you because you don’t currently understand any flavor of regex or parsing, I encourage you not to give up on learning regexes. (https://www.regular-expressions.info/ was how I learned; it’s a great tutorial.) Text-parsing engines, including regex engines, are a powerful concept that can be used in many situations, and I think it’s worth spending the effort learning them until, to paraphrase another commenter, regexes become the human-readable language you were searching for. Or Eggexes, at least.
The investment into learning regexes is worth it if you write or read enough of them. They become the human readable language you speak of, eventually. The question is where the threshold lies.
Do it! You will find that it's very easy, but the result will either be extremely verbose or just like regex. Since most regexes (at least for me) are meant as one-time use, the extra verbosity has no added benefit. If you have complex needs, you should probably be using something other than regex anyway.
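For what it's worth, most engines already ship a half-step in this direction: extended mode, which ignores whitespace and allows comments. A small sketch using Rust's regex crate (the same flag exists in PCRE, and Python has re.VERBOSE):

```rust
use regex::Regex;

fn main() {
    // The (?x) flag ignores whitespace and allows # comments -- about
    // as close to self-documenting as standard regex syntax gets.
    let date = Regex::new(
        r"(?x)
        (\d{4})  # year
        -
        (\d{2})  # month
        -
        (\d{2})  # day
        ",
    )
    .unwrap();
    assert!(date.is_match("2020-04-15"));
}
```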
Yeah, regex can be a bit clunky at times and has a steeper learning curve, but they're pretty industry standard at this point, and portable across languages with a few caveats.
Your mileage may vary, but to my taste, the lpeg flavor of Parsing Expression Grammars is clearly superior.
It uses operator overloading to build patterns from component parts. I don't think anything can replace the terseness of regex for command line use, or vim searching, cases like that.
Because it's really powerful, and some people actually like it (I'm one of them).
I can understand that a complex pattern might look scary if you're unfamiliar, but if you work with it long enough, you can put patterns together with relative ease.
@danfox, sent you an email, though I'm commenting here too.
I'm the CTO @ GitHub. Would love to talk to you about this and other things we are building in this area at GitHub.
Feel free to email direct to jason at github.com