> Get rid of that noise and your hardware goes a lot longer.
What qualifies? What defines signal, what noise? I agree, that a lot (probably nearly all) pages will receive very, very little traffic/search requests. But are these therefore not relevant?
> I'm running a search engine on consumer hardware out of my living room that can index 100 million documents.
That's extremely cool. I would love to know more. To me an impressive feat already.
I think I was editing the comment while you were replying. Sorry about that. I was just adding to it though, didn't really rug pull on your response so I think it's fine.
> What qualifies? What defines signal, what noise? I agree, that a lot (probably nearly all) pages will receive very, very little traffic/search requests. But are these therefore not relevant?
Now this is a proper difficult problem with (probably) fairly subjective answers. I do however think it's something that warrants serious investigation. It's probably a decent candidate for a machine learning model combined with some manual tweaking for sites like Wikipedia or GitHub that have absurd amounts of parallel historical content.
Developing heuristics for this is a bit of a hobby horse of mine. It feels tantalizingly almost doable with just a little bit more resources and time than I have.
> That's extremely cool. I would love to know more. To me an impressive feat already.
Yeah it's at <https://search.marginalia.nu/>. I've built all the software myself from scratch in Java[1], and I'm doing my own crawling and indexing. The machine it's on is a Ryzen 3900X with 128 GB RAM. Most of the index is on a single 1 TB consumer-grade SSD.
I do use a MariaDB database for some metadata, but I think it will have to go as its hardware demands are becoming a serious bottleneck.
[1] I should say, regarding the index, that this is despite using Java rather than thanks to it. It's approaching sunk cost at this point. Building a search engine index is not something Java is at all suitable for; its limited low-level I/O capabilities are incredibly handicapping.
> I [...] didn't really rug pull on your response so I think it's fine.
No you didn't. All good. And I learned a lot from the extended answer. So I am thankful for the explanation.
> Developing heuristics for this is a bit of a hobby horse of mine. It feels tantalizingly almost doable with just a little bit more resources and time than I have.
I can totally understand the feeling. There are quite a few things that I'd like to go deeper into either at work or in private. But alas time.
> Now this is a proper difficult problem with (probably) fairly subjective answers.
I agree. And I don't have answers ready. A lot boils down to preference. Personally, for example, I prefer written content over video, except in a few areas where I like (some) explanatory videos. To me it comes down to the question of how easily I can skim the content when I am looking for an answer.
On the other hand - for deep immersion into a topic I use multiple media formats.
In terms of web search, I sadly need to sift through a lot of SEO-fied content nowadays that is there either to build a (personal) brand or to attract clicks for advertising/affiliate revenue.
So in principle I agree with you on the noise problem. Still, I also believe that there are real gems to be found in the long tail. I still feel like I came late to the party, but when I started out on the web in '97 there were so many lovely, quirky sites. So many places that people had put a lot of time, energy and thought into. And sites so packed full of information that I came away not only with more knowledge, but in awe that somebody would give this knowledge away for free.
There were also quite a number of horrible sites (my first ones probably included). So there was a noise vs. signal problem back then too. Maybe not to the extent there is today, though.
> The machine it's on is a Ryzen 3900X with 128 GB RAM. Most of the index is on a single 1 TB consumer-grade SSD.
Call me impressed. Sounds absolutely cool.
So even with a RAID setup for redundancy this is doable.
May I ask how you decide to add new content? Do you follow links? Do you use other search engines' results as a starting point?
I could probably shoot many more questions, but don't want to be a nuisance.
> May I ask how you decide to add new content? Do you follow links? Do you use other search engines' results as a starting point?
I initially did basically a DFS-walk originating at a few websites I liked, with some filtering criteria that deprioritized websites that didn't look too interesting. Now that I have a fairly comprehensive mapping of the space I want to index, I use a few factors like frequent outbound links from highly ranking domains to inform which new sites to index.
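To make the "outbound links from highly ranking domains" idea a bit more concrete, here is a minimal sketch. It is not the actual Marginalia code (which is written in Java); the function name `rank_candidates`, the `min_rank` threshold, and the shape of the inputs are all assumptions made up for illustration. It simply counts, for each domain not yet indexed, how many distinct high-ranking domains link to it.

```python
from collections import Counter


def rank_candidates(outlinks, rank, indexed, min_rank=0.5):
    """Score unindexed domains by how many high-ranking domains link to them.

    outlinks: dict mapping a source domain to the domains it links to
    rank:     dict mapping a domain to a ranking score (hypothetical scale 0..1)
    indexed:  set of domains already in the index
    min_rank: hypothetical cutoff for what counts as "highly ranking"
    """
    votes = Counter()
    for source, targets in outlinks.items():
        if rank.get(source, 0.0) < min_rank:
            continue  # links from low-ranking domains do not count
        for target in set(targets):  # at most one vote per source domain
            if target not in indexed and target != source:
                votes[target] += 1
    # Most frequently linked candidates first
    return votes.most_common()
```

Called with a small link graph, this returns the unindexed domains ordered by vote count, which could then feed a crawl queue. A real crawler would presumably weigh many more signals than this.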
> I could probably shoot many more questions, but don't want to be a nuisance.
Why? I don't change my reply based on the author. I reply to a statement to the best of my knowledge regardless of the author behind it.
And I already learned a lot in this thread after the explanations unfolded.
The initial statement sounded exactly like the armchair "experts" one so often encounters. Actually, this was the first time in a long while that there was a person with substantial experience in the problem space behind such a statement.
> The initial statement sounded exactly like the armchair "experts" one so often encounters.
Maybe - but [marginalia_nu](https://news.ycombinator.com/user?id=marginalia_nu) isn't an armchair expert - they've actually implemented their own publicly available search engine - which is linked in their profile.
I didn't say they are. Only that the initial comment sounded like that. And in the thread above we discussed a bit about their achievements. I really liked it and learned a lot.