GitHub switches to Rust based Blackbird code search (theregister.com)
161 points by fork-bomber on May 9, 2023 | 66 comments



My current employer uses Sourcegraph [1], which also uses n-gram based indexing like GitHub's Blackbird. It would be interesting to see a comparative benchmark of the two.

1. https://github.com/sourcegraph/sourcegraph/blob/main/doc/dev...
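For readers unfamiliar with n-gram indexing, here is a minimal sketch of the general idea using trigrams (n = 3). It is a toy, not Blackbird's or Sourcegraph's implementation; the `TrigramIndex` type and its methods are hypothetical names made up for illustration. The index maps each 3-byte window to the documents containing it, intersects those sets for a query, and then verifies the surviving candidates with a real substring check.

```rust
use std::collections::{HashMap, HashSet};

/// Toy trigram index (hypothetical, for illustration only): maps every
/// 3-byte window to the set of document ids that contain it.
pub struct TrigramIndex {
    postings: HashMap<[u8; 3], HashSet<usize>>,
    docs: Vec<String>,
}

impl TrigramIndex {
    pub fn new() -> Self {
        TrigramIndex { postings: HashMap::new(), docs: Vec::new() }
    }

    /// Adds a document and returns its id.
    pub fn add(&mut self, text: &str) -> usize {
        let id = self.docs.len();
        for w in text.as_bytes().windows(3) {
            self.postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
        }
        self.docs.push(text.to_string());
        id
    }

    /// Returns ids of documents containing `query` as a substring, ascending.
    pub fn search(&self, query: &str) -> Vec<usize> {
        let q = query.as_bytes();
        let mut candidates: Vec<usize> = if q.len() < 3 {
            // Too short to form a trigram: fall back to scanning every document.
            (0..self.docs.len()).collect()
        } else {
            // Intersect the posting sets of all trigrams in the query.
            let mut acc: Option<HashSet<usize>> = None;
            for w in q.windows(3) {
                let key = [w[0], w[1], w[2]];
                let set = match self.postings.get(&key) {
                    Some(s) => s,
                    None => return Vec::new(), // trigram never indexed: no match
                };
                acc = Some(match acc {
                    None => set.clone(),
                    Some(a) => a.intersection(set).copied().collect(),
                });
            }
            acc.unwrap_or_default().into_iter().collect()
        };
        // The index only narrows the candidates; confirm each one for real.
        candidates.retain(|&id| self.docs[id].contains(query));
        candidates.sort_unstable();
        candidates
    }
}
```

The point of the structure is that most documents are never read at query time: only those sharing every trigram with the query get the expensive substring check.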


Have you tried Sourcegraph's LLM assistant, Cody, yet?


No, the company hasn't enabled it yet for us.



The linked article contains the news that the feature became generally available. Your link isn't a press release; it's an explanation of the tech behind this feature, published a few months ago.



Oh, this is why this news felt very old and wrong. The "0.01 QPS" number for ripgrep is just on-paper math, not an actual performance number from a running system.


> The resulting system can manage about 640 queries per second, compared to about 0.01 queries per second for ripgrep.

They are trying to make a point about indexes, but it is overshadowed by the unit mismatch (distributed-system QPS vs. what sounds like single-machine ripgrep pace).


According to the engineering article [1], that ripgrep number is with 2048 cores combined with 115TB of data in memory.

[1] https://github.blog/2023-02-06-the-technology-behind-githubs...
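As a sanity check on the figures quoted in this thread (115 TB, 2048 cores, 0.01 QPS for ripgrep, 640 QPS for the indexed system), the arithmetic can be sketched as below. It assumes each brute-force query scans the whole corpus exactly once and that 1 TB = 1000 GB; the function names are made up for illustration.

```rust
/// Figures quoted in the thread.
pub const CORPUS_TB: f64 = 115.0;
pub const CORES: f64 = 2048.0;
pub const RIPGREP_QPS: f64 = 0.01;
pub const INDEXED_QPS: f64 = 640.0;

/// Seconds one brute-force query takes at a given queries-per-second rate.
pub fn seconds_per_query(qps: f64) -> f64 {
    1.0 / qps
}

/// Per-core scan throughput in GB/s implied by scanning the whole corpus
/// once per query (assumption: 1 TB = 1000 GB).
pub fn implied_gb_per_sec_per_core(corpus_tb: f64, cores: f64, qps: f64) -> f64 {
    corpus_tb * 1000.0 * qps / cores
}

/// How many times more queries per second the indexed system handles.
pub fn throughput_ratio(indexed_qps: f64, scan_qps: f64) -> f64 {
    indexed_qps / scan_qps
}
```

Under those assumptions, 0.01 QPS corresponds to roughly 100 seconds per query and a little under 0.6 GB/s per core, and the quoted 640 QPS is about 64,000 times higher.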


OK, thanks for that. Sounds like it's a valid comparison then.


Isn't ripgrep also written in Rust?


Yes. ripgrep is used as a basis of comparison for that sentence. ripgrep wasn't GitHub's previous search implementation. (It never was. It's not the right tool for that scale.)


Considering the massive debt GitHub owes to open-source, is this engine also open?


Does Github owe debt to open-source, or does open-source owe debt to Github?


You don't owe debts to open source. That's the point. Or at least that's the point of the free software philosophy.

This is also why I think it's unbelievably lucky we had luminaries like RMS in the beginning. To think of software freedom, what a concept! Legend!

Meanwhile everyone else is constantly trying to turn everything into a transactional interaction. "what are you going to give me for my code? How are you going to pay your debts?"


I think if they did not, and another git host offered free hosting to open source, it would have much of GitHub’s share of the market today. They still competed hard on UX and features, but it was necessary to win over open source projects to become the de facto solution for software.


Why would any FLOSS project owe a debt to GitHub?


They are using Github for hosting, visibility and ease of contribution. A lot of projects would never have been picked up if they were hosted on other platforms.


With the prior understanding that the code uploaded there would remain free and visible for all those who wished to contribute towards something greater than themselves. Microsoft sweeping in at the last minute to undermine all that wasn't exactly an act met with rapturous applause, nor was it expected by anyone who hosted on Github.

The users are trapped. They've made a time (and some a financial) investment into Github, and to pack up and go elsewhere would kill the communities they grew there.

Github owes FOSS everything for the community and buy-in fostered there, and would be nothing without it.


Undermine how?



So in other words not really at all.


Because they can host their software and community for free?


I think it would be accurate to say that the value GitHub extracts from the OSS community (in the form of its core VCS, network effects, ML models, etc.) substantially outstrips the value returned to the community. Which is not to say that the latter is insignificant!


Is that a qualitative or quantitative argument? Like, would you put numbers to either/both sides of that? I'm not confident Github gets more value than it returns (but you may be right).


Github are confident though so...


The OSS projects are also confident that they are getting a better deal, though, since they continue to host on Github.

Sometimes both sides win.


I don't think this requires belief in a "better" deal, only a good one (or an acceptable one, given the friction and network effects involved in switching).


A better measure: if GH were getting very high value in return, there should be some competition working at lower margins (giving more value to OSS).


GitHub had ~$1b in revenues in 2022, but I have no doubt that it created many multiples of that in value among its 90m users and communities in the same period.

I would wager that the cumulative revenues of all third-party services (CircleCI, Sourcegraph etc.) that are part of the GitHub ecosystem also far exceed GitHub, value and revenues that would unlikely exist without GH (or be at an earlier phase of their evolution).

I'd even go so far as to say the value created by pull requests alone exceeds $1bln annually, for the many commercial and open source teams that use it to get work done.


“Symbiotic” by definition means one depends on the other. I think it’s moot who depends on who more. They both depend on each other in a substantial way. Without the other they would likely adapt, but in a suboptimal way, at least for some time.

But I think folks dramatically undervalue what GitHub(Lab) brings, and how expensive it is to host such a system for everyone.


You said "symbiotic," not me :-)

I would describe the relationship as mutualistic: both parties benefit, but not necessarily proportionally. And they don't need to benefit proportionally, but there's also value in highlighting the disproportionality: it behooves the OSS community to remember that extremely valuable ecosystems like GitHub and GitLab derive a significant amount of their value from the continued (and mostly passive) goodwill of the community.


I think that would be grossly inaccurate, as the OSS _services_ market alone is worth 3-4x GitHub's valuation at the time of acquisition.


I don't think the OSS services market (I'm assuming you mean Red Hat, etc.) extracts disproportionate value from GitHub. If anything, I'd expect those companies to pay handsomely for their access -- mine certainly does, and we're primarily open source.


It may or may not be. But the magic is in the infrastructure, not the search engine algorithm per se.


Cmon bruh.


In 2008, Microsoft acquired Fast Search and Transfer, an enterprise search company that has decent search technologies [1]. It ran on common Linux and Unix systems as well as Windows at the point of the acquisition.

[1] I worked there


BlackBird is internally developed by GitHub engineers. I know the engineers who did the research and then wrote up the system.


Yeah, I thought so (that it was internally developed), because someone commented that it's not open sourced.

I was just posting to say Microsoft have (had?) a perfectly nice search engine many years ago. Either GitHub folks didn't want to use it, it's not good (I think it was good, but maybe it's not suitable, or maybe it isn't maintained anymore *shrug*) or Microsoft butchered it too much over the years.


I don't know much about the inner workings of Microsoft, but I could also imagine that an acquisition from that long ago just isn't known to those on the Github side of the house. Seems pretty typical for a large company.


BlackBird is a clean slate, code-specific search engine. Not a general purpose fulltext search engine.


annnnd it's down. https://news.ycombinator.com/item?id=35872835

Is this related?


I was wondering that... Probably not, but funny anyway.


Seems to be happening more and more frequently. Starting to regret moving over there.


I hate the new search. Most of the time it misses simple things, which is infuriating; I always end up just cloning the repo and searching with my own tools.

"About twice as fast": yeah, twice as fast but half as effective. What's the point?


Relevant threads:

Feb 23, 2020 (https://grep.app) Show HN: Search code in GitHub repos using regular expressions https://news.ycombinator.com/item?id=22396824

Dec 8, 2021 Improving GitHub Code Search https://news.ycombinator.com/item?id=29487237


Lately I feel like the old joke about vegans can be applied to Rust:

Q: How do you know if something is written in Rust?

A: The author/headline will tell you.


Considering "it's written in Rust" is synonymous with "it has been statically guaranteed to not contain a wide class of common bugs that frequently cause severe vulnerabilities", I think it's relatively reasonable for people to advertise when it's been used for a large project. Especially considering how much people like to gripe about its use.


Doesn't apply to this project, but that can also be read as "it has taken on a class of bugs it wouldn't have been exposed to in a GC language, with little benefit."

Or even worse, if the app needs to use unsafe, you'd be surprised how many Rust engineers think unsafe somehow "encapsulates" memory bugs inside the block itself. Which is not true. Using unsafe can cause memory issues that bubble up the call stack.

I don't feel any guarantees when a product is made in Rust. There are buggy Rust projects.


> [It's not true] that unsafe somehow "encapsulates" memory bugs.

I believe this real-life, 4-year, 1.5 million-line-of-code experiment disagrees with you directly [1]

> In general, use of unsafe in Android’s Rust appears to be working as intended. It’s used rarely, and when it is used, it’s encapsulating behavior that’s easier to reason about and review for safety.

[1] https://security.googleblog.com/2022/12/memory-safe-language...


I don't think you really understood my comment.

    unsafe fn create_null_pointer() -> *const u32 {
        std::ptr::null()
    }

    fn main() {
        // Creating a null raw pointer is harmless on its own.
        let ptr = unsafe { create_null_pointer() };

        // Dereferencing it is undefined behavior, and the compiler accepts it
        // because the dereference is wrapped in `unsafe`.
        let value = unsafe { *ptr };

        println!("Value: {}", value);
    }

This will typically segfault, and the compiler won't warn you about it. The null pointer comes out of one unsafe block, but nothing goes wrong until it is dereferenced later, somewhere else in the program.

What this article is getting at is that the unsafe keyword highlights pieces of code that Google scrutinizes, not that it has functionality that "magically" encapsulates memory bugs. And yes, it's a lot better than C++. But you still need to be aware this can happen and not assume you're safe

All unsafe does is permit operations the compiler cannot verify in sections of the code. Bugs written in an unsafe block can bubble outside the scope of "unsafe."
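One way to make that concrete without triggering undefined behavior is a sketch like the following, where the single unsafe block is only sound because of checks made in ordinary safe code elsewhere in the type. The `Cursor` type is hypothetical and exists purely for illustration.

```rust
/// Hypothetical type for illustration: a slice plus a position.
/// Invariant relied on by the unsafe block: `pos < data.len()`.
pub struct Cursor<'a> {
    data: &'a [u32],
    pos: usize,
}

impl<'a> Cursor<'a> {
    /// Safe constructor: refuses empty input so the invariant holds from the start.
    pub fn new(data: &'a [u32]) -> Option<Self> {
        if data.is_empty() {
            None
        } else {
            Some(Cursor { data, pos: 0 })
        }
    }

    /// Safe code that upholds the invariant. Deleting this bounds check would
    /// still compile, and the bug would then surface inside `current`.
    pub fn seek(&mut self, pos: usize) -> bool {
        if pos < self.data.len() {
            self.pos = pos;
            true
        } else {
            false
        }
    }

    /// The only unsafe block. It is sound solely because `new` and `seek`
    /// (both safe code) keep `pos` in bounds.
    pub fn current(&self) -> u32 {
        // SAFETY: `pos < data.len()` is maintained by `new` and `seek`.
        unsafe { *self.data.get_unchecked(self.pos) }
    }
}
```

The review burden therefore covers the whole module that can touch `pos`, not just the lines inside `unsafe { ... }`. Keeping the fields private is what stops code outside the module from breaking the invariant.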


Ah, I get what you're saying now. When you said

> "you'd be surprised how many Rust engineers think unsafe somehow 'encapsulates' memory bugs inside the block itself,"

you essentially mean:

> "you'd be surprised how many Rust engineers think that they can use unsafe without regard as to whether they are maintaining Rust's safety invariants by hand in their unsafe blocks."

Yes, I do find that surprising!


> Considering "it's written in Rust" is synonymous with "it has been statically guaranteed to not contain a wide class of common bugs that frequently cause severe vulnerabilities"

No it isn't. It's not like we've been writing much application code in C recently, and it's not like Rust offers that much fundamental security over e.g. Golang.


> it's not like Rust offers that much fundamental security over e.g. Golang

It absolutely does, while also being more expressive. I suggest reading more of the type system literature before making unfounded claims like this.


> I suggest reading more of the type system literature

Like what, in general? I've been working with variously typed languages of various maturity with stuff like structural typing, null-safety, sealed/enum types, inlined and dto/data types, functional types, various implementations of traiting/inheritance/prototyping, etc for years. Do you have a specific point or did you assume that being vague makes you look mysteriously smart?

Honestly, you are acting like all we ever had is JS, C and Rust.


Okay, so you've listed a lot of type system features... which is fine (and exciting!), but not relevant?

> Do you have a specific point or did you assume that being vague makes you look mysteriously smart?

My point was what I said: you should read the academic literature in this space to understand why Rust is fundamentally safer than languages like Go. It's not just a made-up marketing ploy. The type system is specifically designed to make it very difficult to incur issues with memory, which was Rust's primary goal as a modern systems language. Yes, other languages have fancy features, and some of those features are not in Rust. But the features you mentioned don't specifically relate to memory safety, which was the topic at hand. Most of the features you mentioned are actually just about expressiveness of the type system, but not fundamentally distinct forms of type-checking.

Rust's type system is based on earlier work from Cyclone. You can find the various Cyclone papers here: [http://cyclone.thelanguage.org/wiki/Papers/]. Specifically, you'll want to look at the papers concerned with memory safety, which is where the region-based type system was first introduced.

> Honestly, you are acting like all we ever had is JS, C and Rust.

No, I'm pretty familiar with the landscape of programming languages — it comes with the job! I think you just didn't understand my point, which is probably my fault for not being clearer originally. It happens when I reply from my phone sometimes. Apologies.
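As a small illustration of the region idea discussed above: Rust's lifetime parameters act roughly like named regions, and the compiler rejects programs where a reference could outlive the data it points to. This is a minimal sketch with a made-up function name, not an excerpt from the Cyclone papers.

```rust
/// Returns the longer of two string slices. The lifetime parameter `'a`
/// plays the role of a region: the result may not outlive either input.
pub fn longest<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a.len() >= b.len() {
        a
    } else {
        b
    }
}

// The borrow checker rejects the following at compile time, because `short`
// is dropped while `result` may still refer to it:
//
//     let result;
//     {
//         let short = String::from("hi");
//         result = longest("hello", &short);
//     }
//     println!("{result}");
```

A garbage-collected language such as Go avoids this particular dangling reference by keeping the value alive at run time. Rust rules it out in the type checker instead.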


Thank you, it was extremely unclear which specific niche of the type-system problem domain you were referring to.


I feel like you're being sassy with me, and I don't much care for it.

That Rust is fundamentally more secure (in certain, specific ways) than other languages is a fact, not an opinion. It is a fact well-known by members of the programming languages research community, users of the language, and other sufficiently interested people. I didn't think to qualify it by citing the relevant papers partly because the fact is well-known, partly because the fact is easily researched, and partly because this is an informal conversation where you don't always need to bring primary evidence simply to talk to people.

Just because you were not aware of this fact didn't grant you license to be so dismissive at the start. Your first interaction with me was a stern, corrective "No it isn't". You were wrong, but worse than that, you were authoritative in your wrongness. If you had just asked me to explain what I meant instead of trying to correct me, this conversation could've gone differently.

Now, maybe I've misread your tone, in which case I genuinely apologize. But the certainty with which you made your incorrect statements at the beginning, followed by your kind-of condescending interactions with me in the comments since then, makes me think I am right in my interpretation, and I think that's really unfortunate.


> I feel like you're being sassy with me

I am honestly not. Please don't read too much into my tone, English is not my native language.

It actually wasn't clear why you mentioned type theory initially; the semantics of "expressive types" confused me a bit, but I upvoted your later post where you elaborated.


All of the above, and also:

If it’s written in Rust, the tooling (Cargo, the Rust language server, etc.) makes it much easier than in most other languages to jump in and fix something or add a new feature. I’ve contributed to projects written in Ruby, Python, Java, C#, C++, Haskell and many others, and every Rust project I’ve ever touched has been much easier/quicker to go from `git clone` to PR submitted.

That may not be appealing to people who never fix or extend the software they use; for me, as someone who finds bugs and deficiencies in almost everything I lay eyes on, knowing that a piece of software is written in a language that is the least likely to frustrate me and suck up my time when I inevitably have to dive in and tweak something is a huge competitive selling point over similar software written in other languages.


It's not wise to mock vegans when they are tryin' something.


I have nothing but respect for vegans and vegetarians.


I love how Microsoft GitHub doesn’t allow you to search the code base unless you’re authenticated. /s


On mobile, the search bar is hidden from unauthenticated users, but you can still search. The easiest way is to go to https://github.com/search

That said, I'm also annoyed by various navigability issues, including that one.


Fix your damn database!


This type of comment does not belong on hn. Take your frustrations elsewhere or discuss your point.



