SearchHut (searchhut.org)
385 points by tsujp on July 15, 2022 | 155 comments



I like the idea of searching a curated list of domains, but I'm not sure that doing the curation yourself is the best approach considering the huge number of useful but niche websites in existence.

I wonder if simply parsing all of Wikipedia (dumps are available, and so are parsers capable of handling them) and building a list of all domains used in external links would do the trick. Wikipedia already has substantial quality control mechanisms, and the resulting list should be essentially free of blog spam and other low-quality content. Wikipedia also maintains "official website" links for many topics, which can be obtained by parsing the infobox from the article.
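
Something like this rough sketch is all it would take to get a first candidate list (my own illustration, assuming a locally downloaded pages-articles dump and treating "cited from many pages" as a crude quality bar):

  # Extract external-link domains from a Wikipedia XML dump and count
  # how many times each one is cited.
  import bz2
  import re
  from collections import Counter
  from urllib.parse import urlparse

  URL_RE = re.compile(r'https?://[^\s\]<>"|}]+')

  def external_domains(dump_path):
      domains = Counter()
      with bz2.open(dump_path, "rt", encoding="utf-8", errors="replace") as f:
          for line in f:
              for url in URL_RE.findall(line):
                  host = urlparse(url).netloc.lower()
                  if host:
                      domains[host] += 1
      return domains

  counts = external_domains("enwiki-latest-pages-articles.xml.bz2")
  # Keep only domains cited from many different places.
  allowlist = [d for d, n in counts.most_common() if n >= 50]
  print(len(allowlist), "candidate domains")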


Oh, and more to the point: this is a role Wikipedia explicitly renounced, isn't it? When it became so big and Google's PageRank gave it high importance, the spam became unbearable, so Wikipedia decided it needed to change the incentives and applied rel=nofollow to all external links so it could stop working as an unpaid manual spam filter for the whole internet. Sure, your new search might ignore the rel=nofollow, but if you ever become big enough, the incentives of spammers would lead to bigger spamming pressure on Wikipedia...


That's easily fixed by relying only on protected or high-profile pages. Those already deal (mostly successfully) with spam and NPOV violations on a daily basis, so piggybacking on those protection mechanisms should yield a fairly high-quality pool of curated external links.


While that sounds good in theory, who is to say those who can edit protected and high-profile pages aren't in SEO spammers' pockets?

I mean, the edit history is public and there are plenty of people who actually pay attention to edits and the like, so they would be found out soon enough - but still.

I'm sure this is an ongoing discussion when e.g. political figures' pages are protected as well - who becomes the gatekeeper, and what is their political angle?


Sure, but at that point you're simply discussing Wikipedia's quality control system, which may be an interesting discussion, but has nothing to do with search engines per se.

Considering that Wikipedia has become a pillar of most scientific work (imagine writing a math or computer science paper without Wikipedia – utterly unthinkable), it's safe to say that knowledgeable people have collectively decided that its quality control mechanisms are "good enough", or at least better than those of any other resource of comparable depth and breadth.

And that puts Wikipedia's link pool leaps and bounds ahead of whatever dark magic current search engines are using, which mostly seems to be "funnel the entire web through some (easily gamed) heuristic".


Ehh, I think you’ve overestimated the quality of Wikipedia on highly specialist topics (like, say, the kinds of things you’d write academic papers on). It’s not so much that it doesn’t have the content, it’s that the coverage is super uneven; sometimes it has extremely detailed articles on a niche theorem in a particular field, and other times it has the barest stub of information on an entire sub-sub-field of study.


The "official website" links can be retrieved in a machine-readable way from Wikidata. E.g. a completely random silly example of official websites of embassies: https://w.wiki/5T3R


The only problem is that Wikidata is still incomplete compared to Wikipedia itself. But yeah, it's "trivial" to search it.


Is there documentation for rolling your own?

I've been considering building my own search engine for a while for my niche topic which has <50 websites and blogs on the web.

I can't tell how useful this will be but it'd be fun to give it a go.


Your own what? Your own query? Sure; it's just a SPARQL query over the Wikidata data model. The documentation portal for the query service is at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/... though you'd need some familiarity with Wikidata, its properties, etc. E.g. the "wdt:P856" in my query is the "official website" property on Wikidata: https://m.wikidata.org/wiki/Property:P856
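
For example, here's a minimal Python sketch (mine, just for illustration - not the exact query behind the link above) that runs a P856 query against the public Wikidata SPARQL endpoint:

  import requests

  QUERY = """
  SELECT ?item ?itemLabel ?website WHERE {
    ?item wdt:P856 ?website .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 100
  """

  resp = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "official-website-demo/0.1"},
  )
  resp.raise_for_status()
  for row in resp.json()["results"]["bindings"]:
      print(row["itemLabel"]["value"], row["website"]["value"])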


Elastic App Search would be well suited for something like that. It comes with a built-in crawler.


That's our strategy at you.com - we start with the most popular sites, crawl them, build apps for them (e.g. you.com/apps), and let users vote in the search results.

Full disclosure: I'm the CEO.


Your search actually performed better than Google for me on a random query. I queried both engines with "What is the specific heat of alcohol?". Google threw up a rich search answer that linked to some random Mexican site that is clearly exploiting SEO [1], while you.com linked me to answers.com (which is more trustworthy than a random Mexican website).

[1]: http://elempresario.mx/sites/default/files/scith/specific-he...


When I search Google I get a rich search answer of http://hyperphysics.phy-astr.gsu.edu/hbase/Tables/sphtt.html


> you.com/apps and let users vote

what is stopping companies/users from abusing/gaming this system with bots?


Richard, CEO at you.com here.

Users vote on their own preferences... not on other users' preferences, so I don't see how it will be gamed unless we start incorporating user preferences into global ones - which we would only do if we thought it worked better and wasn't being gamed.


I just learned about you.com from this thread. It looks very promising.


I've looked into this, and found Wikipedia's links not to be super useful. Wikipedia prefers references that don't change - so primarily books, and beyond that websites that don't change, i.e. academic journals where you hit a paywall, and Wayback Machine archives of websites (even if they are still live). You aren't getting much use out of Wikipedia.


The problem with only using curated lists is that you kill discoverability of new sites, but the approach does have promise, as we've seen with Brave's "Goggles".


The site has just been taken offline by Drew due to the unfortunate start. I hope we can come back to this once the project has been properly launched, although Drew notes that he is "really unhappy with how the roll-out went" and that "my motivation for this project has evaporated" [1].

Thanks for all the work Drew, I hope you guys manage to come to a conclusion that you are satisfied with!

[1]: https://paste.sr.ht/~sircmpwn/048293268d4ed4254659c3cd6abe67...


Oh, that's sad :-(

For what it's worth, my search engine got prematurely "announced" for me on HN as well, while likewise hilariously janky. I don't think the launch is the end of the world. (I guess I had the benefit of serving comically bizarre results when the queries failed, so it got some love for that)

The bigger struggle is that, because a search box is so ambiguous, people tend to have very high expectations for what it can do. Many people just assume it's exactly like Google. It's something a lot of indie search engine developers struggle with. Even if your tool is (or potentially could be) very useful, how can you make people understand how to use it when it looks like something else? Design-wise it's a real tough nut to crack, a capital-H Hard UX problem.


Drew has the right to cancel his projects, but I really hope others don't cling to the hopes of "the perfect rollout" with their projects.

Startups and side projects are messy and sometimes things don't go as planned. Contracts get canceled, a DoS takes down your homepage when you launch (losing all those free leads), people leak new features, and your sixth deployment erases most of the production database.

There are a lot of great ideas that started out as badly as the first release of thefacebook, Airbnb, Twitch, and YouTube. Still, their teams iterated on those wonky, almost-working sites and ended up making something great.

YC pushes this idea constantly: put something in front of people and iterate. Drew was following that advice, and I applaud him. https://www.youtube.com/c/ycombinator/videos


The amount of tweaking needed to make a search engine work well can't be overstated either. When you start out, it's inevitably going to be kinda shit. That's fine. Now you need to draw the rest of the owl.


Yeah, I agree. I was certainly underwhelmed with my first small search engine. It was so bad even I didn't want to use it - and I had spent months and months on it.

Still, most projects aren't a search engine. I see people put high expectations on how things will go and often it's just really hard to realize some of those hopes.

Sometimes you just have to take what you can get and iterate. Don't give up.


I think, with search engines, it's best to work on them for the problem domain itself. It's a fractal of interesting programming problems touching on pretty much every area of CS, programming, and programming-adjacent topics; take whatever comes out of it as an unexpected bonus.


Well said, my next version will be focused on a niche I actually need instead of general search. Still, I haven't finished studying the CS books + 47 algorithms I'll need to actually implement it.


I love your post. I have to say that my experience is that many in the HN community are not nearly as kind as you. Folks here will rip you apart if you don't have everything figured out :(

So, I started rolling out new features of our search engine on Twitter, Slack, etc instead of here.


Is the source available somewhere? I'm really interested in the internals, regardless of how the "product"/platform/performance turns out.

[edit: nevermind, found it easy enough. Seems to be Go and PostgreSQL with the RUM extension]


Jesus he needs to take it easy. It's not that big of a deal.


Hey everyone, I accidentally shared this too early.

I misinterpreted a don't-share-yet announcement as applying only to the announcement post, not to the software itself. I don't mean this as an excuse; that's just the context.

So this is out before Drew et al. intended it to be, hence some 404s and so forth, as noted by Drew in this very thread: https://news.ycombinator.com/item?id=32105407

I let my excitement get the better of me this time and I hope people revisit SearchHut in a week or so after these quirks are resolved.


With a bit of collaboration I bet we can flag it off the front page in no time.

You (or Drew or someone else) can resubmit it later with a bogus query string to skip HN's dupe checker.


Cat's out of the bag now.


Ok, unflagged!


A whole project trashed for some magic internet points


It's well communicated, so no harm done. And it gives an idea of what kind of "curiosity hit" you can expect when announcing it for real.


Good morning, HN. Please note that SearchHut is not done or in a presentable state right now, and those who were in the know were asked not to share it. Alas. I had planned to announce this next week, after we had more time to build up a bigger index, add more features, fix up the 404s and stub pages, do more testing, and so on - so if you notice any rough edges, this is why.

I went ahead and polished up the announcement for an early release:

https://sourcehut.org/blog/2022-07-15-searchhut/

Let me know if you have any questions!


Just a warning from a fellow search engine developer.

If you happen to be cloud hosting this, and if you do not have a global rate limit, implement one ASAP!

Several independent search engines have been hit hard by a botnet soon after they got attention - both mine and wiby.me, and I think a few others. I've had 10-12 QPS of sustained load for weeks on end from a rotating set of mostly Eastern European IPs.

It's fine if this is on your own infrastructure, but on the cloud, you'll be racking up bills like crazy from something like that :-/
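
For a sense of what I mean by a global limit: a single token bucket shared by every request, as opposed to per-IP limits, which a rotating botnet just walks around. A hedged sketch, not anyone's production code:

  import threading
  import time

  class GlobalRateLimiter:
      def __init__(self, rate_per_sec, burst):
          self.rate = rate_per_sec
          self.capacity = burst
          self.tokens = float(burst)
          self.updated = time.monotonic()
          self.lock = threading.Lock()

      def allow(self):
          with self.lock:
              now = time.monotonic()
              # Refill tokens based on elapsed time, capped at the burst size.
              self.tokens = min(self.capacity,
                                self.tokens + (now - self.updated) * self.rate)
              self.updated = now
              if self.tokens >= 1:
                  self.tokens -= 1
                  return True
              return False

  # In the request handler: return HTTP 429 when allow() is False.
  limiter = GlobalRateLimiter(rate_per_sec=5, burst=20)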


Could you clarify the domain submission rules?

e.g. "Any websites engaging in SEO spam are rejected from the index" - how is determined whether something is SEO spam or not? More clarification of whats allowed/not allowed would be nice!


The criteria are documented here:

https://searchhut.org/docs/docs/webadmins/requirements/

And there's some advice for web masters on ways to improve your site's ranking without running afoul of this rule:

https://searchhut.org/docs/docs/webadmins/recommendations/

But ultimately, it's subjective, and a judgement call will be made. If it's minor you might get a warning; if it's blatant then you'll just get de-listed.


I think it would be great if we had a code-forge index to search on its own. That index would contain only the myriad code hosting sites around the internet - shared hosts like GitLab, GitHub, SourceHut, SourceForge, Codeberg - and all the project instances like kernel.org, GNU Savannah, GNOME, KDE, BSD, etc. There are probably hundreds of them out there. It could also allow people to submit their own self-hosted Gitea/GitLab/sr.ht/etc. instances to be crawled - maybe even suggest a robots.txt entry your crawler could key in on as "yes please index me, hutbot".


Long ago -- 2006 to 2011 -- Google had a functional source code search engine: https://en.wikipedia.org/wiki/Google_Code_Search

I don't recall if it supported SourceForge and GitHub (launched in 2008), but it certainly included gzipped tarballs, which were popular and prevalent at the time.


Do you intend any of this to merge/cooperate with other similar initiatives?

e.g. opencrawl, internet-archive, archiveteam

It strikes me that the resources to crawl, update, and manage/index data are a common problem.


I intend to at least support other search engines by adding !bangs for them and recommending them in the UI if you didn't find the results you're looking for. I don't think that crawling is something that is easily distributed across independent orgs, though.
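
The !bang part is mostly a lookup table plus a redirect; a toy sketch (the engine list and names here are purely illustrative, not what SearchHut will ship):

  from urllib.parse import quote_plus

  BANGS = {
      "g":   "https://www.google.com/search?q={}",
      "ddg": "https://duckduckgo.com/?q={}",
      "w":   "https://en.wikipedia.org/w/index.php?search={}",
  }

  def resolve_bang(query):
      if query.startswith("!"):
          bang, _, rest = query[1:].partition(" ")
          template = BANGS.get(bang.lower())
          if template and rest:
              return template.format(quote_plus(rest))
      return None  # no bang: handle the query locally

  print(resolve_bang("!ddg searchhut"))  # redirect target, or None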


I guess cppreference.com isn't even a part of the list?

I tried a couple test queries:

> lambda decay to function pointer c++

I get some FSF pages and the Wikipedia page for Helium?

> std function

I get... tons of Rust docs?

> std function c++

All Rust docs? The Wikipedia page for C++??

Interesting idea, but this seems like it would be the primary failure mode for an idea like this: as soon as you are researching outside of the curator's specializations, it doesn't have what you're looking for. Yet these results would both be fixed simply by adding cppreference.com to the index. Let's try and give it a real challenge:

> How to define systemverilog interface

And as I might expect, I get Wikipedia pages: for "Verilog", for "System on a Chip", and for "Mixin".

1st google result:

> An Interface is a way to encapsulate signals into a block...

Working as expected


I added cppreference.com now and kicked off a crawl. It'll be a while. The list of domains is pretty small right now -- it was intended to be bigger before the announcement was made. Will also add RFCs and man pages soon.

There will (soon) be a form to request that new domains be added to the index, so if there are any sites you want indexed which are outside of my personal expertise, you'll be able to request them.


You've probably already thought about it, but just in case, a feature idea: add moderation support for collaboration - somewhat trusted persons vetting niche subjects.


I guess missing content can be explained by...

> Notice! This product is experimental and incomplete. User beware!

But in reality, if you already know where the answer you're looking for is, why would you use that search engine?

I use DDG, and if I want to search the Scala docs, I use "!scala whatever I'm searching" instead of just searching with DDG.


GitHub doesn't seem to be indexed either. I get that it's a competitor, but not being able to search GitHub is probably a deal breaker for most devs who aren't Drew.


I'm not opposed to indexing GitHub, but the signal-to-noise ratio on GitHub is poor. Nearly all GitHub repositories are useless, so we'd have to filter most of it out. I think instead I'll have to set it up so that people can request that specific interesting repositories be added to the index, and maybe crawl /explore to fill in a decent base set.


GitHub is hella tricky to crawl too, due to its sheer size and single entry point (meaning slow crawl speed). I've been looking at the problem as well, and so far have just ignored it as un-crawlable, but I might do something like crawling only the about pages for repos that are linked to externally at some point in the future.


There's an asterisk to that: they serve the underlying content through two different APIs, so one can side-step the HTML wrapper around the bytes. The discovery phase has a formal API (both REST and GraphQL) for finding repos, and then the in-repo content can be git-cloned so one can locally index every branch, commit, and blob without issuing hundreds of thousands of HTTP requests to GH. One would still need to hit GH for the issues, if that's in scope, but it'd be far fewer HTTP requests unless your repo is named kubernetes or terraform.
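
Roughly, the two phases could look like this (my own sketch, assuming a GitHub API token and git >= 2.19 for partial clones; nothing here is SearchHut code):

  import subprocess
  import requests

  def discover_repos(token, query="stars:>1000", per_page=20):
      resp = requests.get(
          "https://api.github.com/search/repositories",
          params={"q": query, "sort": "stars", "per_page": per_page},
          headers={"Authorization": f"token {token}",
                   "Accept": "application/vnd.github+json"},
      )
      resp.raise_for_status()
      return [item["clone_url"] for item in resp.json()["items"]]

  def blobless_clone(clone_url, dest):
      # --filter=blob:none fetches commits and trees only; blobs are pulled
      # lazily, so an indexer can check out just READMEs, docs, etc.
      subprocess.run(["git", "clone", "--filter=blob:none", clone_url, dest],
                     check=True)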


We're still talking about git-cloning a hundred thousand GitHub repos. Git repos get big very fast. That's a lot of data when realistically you're only interested in a few markdown files per repo.


Perhaps all repos that have a published package would be a good heuristic. Then you'd at least get all the repos behind npm, Python, and other packages.


Some interesting repos have no published packages. A combination of the number of commits, stars, and forks would probably be more relevant.


And likewise, some uninteresting repos do have published packages.


SearchHut was built to this point in about a week by Drew and contributors, which I think is amazing.

It is also meant to be very simple to run in case you want to index your own category of sites. For instance, cooking content is specifically not indexed, but if _you_ wanted to, you could spin up an instance and index cooking sites yourself.


This seems like a great idea, honestly. There are niche topics that are very hard to navigate in Google, because it's so skewed towards mainstream topics. I think it would make sense for these communities to maintain their own search engines.


DeVault is doing some beautiful things over at his SourceHut.

I watched him crank out a prototype of this in Go in about three hours.

It really feels like the old days at sr.ht. It’s fun again.


Watched as in streamed? Or refreshed the repo page?

How can I watch?


Too bad this came out before Drew intended, and I hope that after having a weekend to rest he’ll feel his motivation recover.

One meta-thought: I think projects like this are surfacing something interesting. The underlying technology to make a pretty good search engine is no longer especially difficult, either for the programmer or for the servers. This is potentially a very good thing, as it could mean the end of the Google era.

I can imagine a future that is almost a blast from the past, where there are a lot of different search engines, those engines are curated differently, and while none of them index the entire Internet, that’s what makes them valuable and better than Google (which I think cannot defeat spam).

I'm trying to think of a historical parallel, where some service used to be very difficult to provide and therefore could only effectively be done by a single natural monopoly, but technology progressed and opened up the playing field, breaking the monopoly. Television has some similarities. Perhaps radio vs. podcasting. What others?


What Google has is marketing and momentum... it's the ubiquitous search engine.


Google also funnels a lot of traffic to itself through Chrome's search bar, and Firefox does the same. Sure you can replace the search engine, but whatever you replace it with needs to have the same capabilities or the entire model falls apart. Meanwhile, alternative means of navigating the web (such as bookmarks) are made increasingly difficult to access, requiring multiple clicks.

I don't mean to be conspiratorial, I'm sure there are good intentions behind this, the consequence however is effectively locking in Google as the default gateway for the Internet.


Search results are presently poor. Mostly Wikipedia pages.

But it passes tests that are very important for me:

1) It's fully accessible by Tor. No CAPTCHAs or "We don't serve your kind in here" messages.

2) It works in a text browser without JavaScript and renders in a sensible way without style requirements.

10/10 for accessibility. Something Google and other search engines could learn from.


I'm not a fan of Google, but you can do exactly what this search engine does by curating your own list of domains to search against.

https://programmablesearchengine.google.com


Some downsides with this approach:

- search queries are performed directly from the client's computer, so you can't protect their privacy (since the Custom Search JSON API has a daily limit of 10k queries)

- you're forced to use JavaScript, and the way it's implemented makes it difficult if not impossible to do even basic things like loading-animation cards

- ads are loaded from an iframe, so you can't do any styling (beyond the extremely limited options they make available in their settings; no matter what, it will be very ugly if you want a light/dark theme)

But there are of course many benefits as well, such as it being 'free' (Bing is ridiculously expensive IMO, and it feels impossible to join their ad network to offset the costs... which might explain why you see countless Bing proxies shut down after a few months), and the search results are no doubt better than the ones you'd get from Bing.
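
For reference, the server-side route that would protect client privacy is the Custom Search JSON API - something like the sketch below, where the key and cx values are your own placeholder credentials - but that's exactly the path capped at 10k queries/day, which is why the client-side widget ends up being the practical option:

  import requests

  API_KEY = "your-api-key"    # placeholder
  CX = "your-engine-id"       # placeholder

  def proxied_search(query):
      resp = requests.get(
          "https://www.googleapis.com/customsearch/v1",
          params={"key": API_KEY, "cx": CX, "q": query},
      )
      resp.raise_for_status()
      return [(i["title"], i["link"]) for i in resp.json().get("items", [])]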


It requires a Google account, has tracking, and isn't open source. I'd say it's a no-go.


Add to that the likelihood that Google will just randomly cancel the product one day. Why invest time in this Google product?


It's been around forever, but your concern is real. Who's to say an OSS project won't get archived, or removed from the internet? Why invest time into anything when it will all be replaced eventually?

edit: Looks like this OSS project was launched and cancelled in a single day.


> Looks like this OSS project was launched and cancelled in a single day.

Touché :-)

However, the idea of federated search offers some measure of protection against that - if it ever happens.


Interesting. It looks like Custom Search Engines evolved into this?

I can't tell whether this is a neglected Google product that they were going to refresh but lost interest in, or something that is undergoing a breath of fresh air.

As you say, I was able to add a list of domains and get some pretty decent results from it. The UI makes me feel like Google are not interested in making it a truly successful product, though.


For those looking for an alternative to that, I've been building a self-hosted search engine that crawls what you want based on a basic set of rules. It can be a list of domains, a very specific list of URLs, and/or even some basic regexes.

https://github.com/a5huynh/spyglass


Great project! Given a local archive of Wikipedia and other sources, this can be very powerful.

Which raises the question: does archive.org offer their Wayback Machine index for download anywhere? Technically, why should anyone go through the trouble of crawling the web if archive.org has been doing it for years, and likely has one of the best indexes around? I've seen some 3rd-party downloaders for specific sites, but I'd like the full thing. Yes, I realize it's probably petabytes of data, but maybe it could be trimmed down to just the most recent crawls.

If there was a way of having that index locally, it would make a very powerful search engine with a tool like yours.


https://searchhut.org/search?q=js+service+worker

> List of accidents and incidents involving commercial aircraft


I think the idea of federation of domain-specific search engines, possibly tied together by one or more front-ends, is a brilliant idea.

I think it's similar to how Google's search works internally, though I doubt the separation is based on a list of domain (as in DNS) names. IIRC they have a set of search modules, and what they return (and how fast they return it) all gets mixed into the search results according to some weighting. Right below the ads.

If you look at a search system that way, it's easy enough to add modules that do things like search only Wikipedia and display those results in a separate box (like DDG), or parse out currency conversion requests and display those up top based on some API (like Google), etc.


How would you do ranking though?

It is possible for a site's results to vary in quality: maybe one article about MySQL is not so informative, while an article about Python on the same site is a definitive reference.

The search engine operated by the author is unlikely to account for that.



Here's the current list:

https://paste.sr.ht/~sircmpwn/0cab5e3137c2c2077b5aabf9e2fc8d...

It was intended to be larger prior to launch. Here's some other domains I want to index:

https://paste.sr.ht/~sircmpwn/84d052f14a9a282698b5e5f7a9d9d9...


I congratulate you for the novel approach, but this is impossible to scale in a way that would make the engine useful.


sqlite.org is not on the latter list yet. Should be added.


> erowid

Cool.



I wonder how the page ranking will work in the end. A quick look at the source doesn't show (me!) any planning for intelligent ranking. The database has a last_index_date and an authoritative field, which could be used for basic relevance sorting, but nothing exhaustive.

Postgres as a backend is maybe not the best choice, and there are already many sites that index specific pages and take suggestions. The hard part is getting relevant results when you have a large index.

Still, thank you for a new web search engine.


As I understand it, the idea is to only have manually curated, high-quality domains. In that regard, ranking is entirely secondary to BM25. It might work, but it leaves out a lot of long-tail sites that (in my experience at least) often have very good results. It's really the middle segment where most of the shit is.
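
For context, plain BM25 itself is tiny - the hard part is everything around it. A bare-bones illustration (not any particular engine's implementation; real engines keep these statistics inside the index, not in Python dicts):

  import math
  from collections import Counter

  def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len,
                 k1=1.2, b=0.75):
      tf = Counter(doc_terms)
      score = 0.0
      for term in query_terms:
          df = doc_freq.get(term, 0)
          if df == 0:
              continue
          # Inverse document frequency: rare terms count for more.
          idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
          f = tf[term]
          # Term-frequency saturation with document-length normalization.
          norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avg_len))
          score += idf * norm
      return score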


The about page has some good general info and links: https://searchhut.org/about


https://searchhut.org/about/domains returns 404, by the way.


Lots of broken links from https://searchhut.org/about .


"Notice! This product is experimental and incomplete. User beware!"

:).


Has anyone experimented with creating a search engine that only indexes the landing page of domains? I'm less interested in another Google, and more interested in a way to find new and interesting sites/blogs/etc. StumbleUpon was great for this back in the day.

Seems like it would be an interesting experiment to see what the results would be, indexing only the content / meta tags of “index.html”.
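
The crawler side of that experiment is almost trivial - something along these lines (my sketch; the library choices are arbitrary and not tied to any existing project):

  import requests
  from bs4 import BeautifulSoup

  def index_landing_page(domain):
      # Fetch only the front page and keep its title and meta description.
      resp = requests.get(f"https://{domain}/", timeout=10,
                          headers={"User-Agent": "landing-page-indexer/0.1"})
      resp.raise_for_status()
      soup = BeautifulSoup(resp.text, "html.parser")
      title = soup.title.string.strip() if soup.title and soup.title.string else ""
      meta = soup.find("meta", attrs={"name": "description"})
      description = meta.get("content", "").strip() if meta else ""
      return {"domain": domain, "title": title, "description": description}

  print(index_landing_page("example.com"))

The interesting part would then be the ranking and serendipity, not the crawl.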


I built a solution at https://mitta.us/ that lets you submit the sites you want crawled and puts them in a self-managed index (which isn't shared globally). I don't do link extraction, but instead let GPT-3 generate URLs based on keywords.

!url <keyterms> |synthesize

I also wrote a screenshot extension for Chrome that lets you save a page when you find it interesting. The site is definitely not "done" but it's usable if you want to try it. Some info in help and in commands is inaccurate/broken, so it is what it is for now.

It does the !google <search term> and !ddg <search term> thing to find pages to save to the index. There are a bunch of other commands I added, and there's an ability for others to write commands and submit them to a Github repo: https://github.com/kordless/mitta-community

!xkcd was fun to write. It shows comics. The rest of the commands can be viewed from !help or just !<tab>

I've been working on pivoting the site to do prompt management for GPT-3 developers and have been kicking around Open Sourcing the other version for use as a personal search engine for bookmarked pages.


Looks cool! Most of the results right now are all Wikipedia though.


Tried a few things:

- Beltalowda – no results (for reference: it's a term used in The Expanse books and TV series to refer to "people from the [asteroid] belt").

- The Expanse – bunch of results, but none are what I'm looking for (the TV series or books). It looks like it may drop the "the" in there?

- Star Trek – a bunch of results, but ordered very curiously; the first is the Wikipedia page for "Star Trek Star Fleet Technical Manual", and lots of pages like "Weapons in Star Trek" and such.

- NGC 3623 – lists "Messier object" and "Messier 65", in that order, which is somewhat wrong as NGC 3623 refers to Messier 65 specifically.

- NGC3623 (same as previous, but without a space) – no results.

- vim map key – pretty useless results, most of which have no bearing on Vim at all, much less mapping keys in Vim.

- python print list – the same; The Go type parameters proposal is the first result; automake the second, etc.

Conclusion: "this product is experimental and incomplete" is an understatement.


You could say it's in the Garbage stage (though Garbage is a bit harsh for a product that was built in a week).


I didn't call the product garbage, just some of the results, which I think is fairly accurate. But I edited it to "useless" now, as that comes off as a bit less harsh.


Hot Garbage


Never

Give

Up

This requires paying less attention to negative emotion and treating more of it as water off a duck's back.

Tweak it. Tweak it some more.

Focus on the goal, notably one tiny sub-goal at a time.

Good luck, entrepreneurial spirit is a tough beast to attain.

Whatever you stake on,

NEVER

EVER

GIVE

UP


Word!



I love that I can self-host this! Are there plans for federation?

Rather than maintaining a whole separate index for myself, I'd love to self-host an instance of this, only indexing sites that aren't in the main index, and then falling back to the main index / merging it with my index to answer queries. I wonder how easy that would be with the current architecture.


Does Sourcehut offer textual search within a repo's files? GitHub and GitLab offer it, but Codeberg doesn't seem to (and I couldn't find any information about its presence or absence on Sourcehut).


It doesn't appear to be a code search engine. Just a regular search engine focused on code.

Does Sourcegraph index SourceHut projects? It is a proper code search engine, and a very good one.


Looks like SourceHut is down, so I'm not sure which projects need indexing.

In other news, we now index 87k Rust packages from crates.io:

https://sourcegraph.com/search?q=context:global+repo:%5Ecrat...


All 4 search results for "searX" (a self-hostable meta-search engine):

Wikipedia: List of Search engines

Drew's blog: We can do better than DuckDuckGo (perhaps the impetus for this project)

Wikipedia: List of free and open source projects

Wikipedia: Internet Privacy


The point of the project is that it's a curated list of sites to crawl; it doesn't make sense to crawl other search engines.


You don't think a popular open source project fits within this niche?


Not really.

Given it’s not even ready for release either.

The best you could have hoped for is a GitHub link, but GitHub isn’t being crawled right now.

So I'm not sure what you're getting at. Your expectations for alpha-level software that wasn't even supposed to be announced are far too high.


In addition to the curated domains list, some searches would benefit from limiting the display of old results; often you might find an answer, but one solved in jQuery or an older version of the framework you are using.


Bad SERP... I searched 'mdn a'. Google returns '<a>: The Anchor element - HTML: HyperText Markup Language | MDN'; SearchHut returns a generic 'MDN Web Docs'.


It seems like it uses PostgreSQL's FTS, which will generally drop stop words, so "the", "a", "and", and similar words are dropped. I've been meaning to figure out the best way to deal with this myself, and I'm guessing looking for exact matches first and then running an FTS query could work.
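
Something like this two-pass approach is what I have in mind (just a sketch on my side, not SearchHut's code; the pages(title, body, tsv) table is hypothetical):

  import psycopg2

  conn = psycopg2.connect("dbname=searchdemo")  # placeholder DSN

  def search(query):
      with conn.cursor() as cur:
          # Pass 1: literal substring match, so stop-word-heavy queries
          # like "mdn a" still return something.
          cur.execute(
              "SELECT title FROM pages WHERE title ILIKE %s LIMIT 20",
              (f"%{query}%",),
          )
          rows = cur.fetchall()
          if rows:
              return rows
          # Pass 2: regular full-text search with ranking.
          cur.execute(
              "SELECT title FROM pages "
              "WHERE tsv @@ websearch_to_tsquery('english', %s) "
              "ORDER BY ts_rank(tsv, websearch_to_tsquery('english', %s)) DESC "
              "LIMIT 20",
              (query, query),
          )
          return cur.fetchall()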


You can write a custom stemming algorithm and load it as an extension library into Postgres, then use that with `CREATE TEXT SEARCH DICTIONARY` to create a custom dictionary. It's not as difficult as it sounds - you can use the default Snowball stemmer as a sample, and tweak it.


It's not just a custom dictionary. Stop words are usually excluded for a reason; you need to understand the nature of the query and know when to exempt what look like stop words from being pruned. It's not really a job for a Snowball stemmer, as you need to operate over multiple tokens to gather context.


Most of the time, keywords like this come from external anchors as well, which is something that you're gonna be able to leverage with this design (as I understand it).


>SearchHut indexes from a curated set of domains.

Are there already plans to expand it as a service? E.g. subreddits could maintain their preferred lists of domains.


Also, the curated list is 404.


This is super cool - especially the self-hosting angle.


I hope Drew opens his own Google some day


The curated set of domains page is down.

Also, selfish plug: I think it would be cool if you added Hackernoon to that list.


I think the point is to avoid including low quality websites…


May I ask what the limits of the API are?

How many requests per minute / hour are acceptable?


The API limits are not documented yet, like many other things, due to the early launch. For now I'll just say "be good". Don't hit it with a battering ram.


What does it promise as an alternative to other search engines?


Many 404 Not Found errors.


It was not supposed to be released now, OP accidentally shared it here because of a misunderstanding.


>What's the most popular web server

SearchHut: the first result is Django, which is not the most popular web server.

Google: Shows an answer box with the market share of various web servers.


Considering Google's answer box randomly picked multiple photos of unrelated people as pictures of murderers and rape victims (with Google being very uncooperative about resolving the issue) I'd say the lack of an answer box might not be that bad.


An answer box is the right thing for that query (the web servers one).

The part that Google seems to have unfortunately skimmed over is that the answers need to be relevant, exact, and correct.


It can certainly be a helpful feature, but I wonder whether it's really better than good, relevant search results presented in a readable way. For example I'd argue the manually curated infoboxes on Wikipedia are likely more reliable than the algorithmic versions Google shows in their results, especially as it's difficult to fix mistakes in Google's version. Google thinks their own solution is the best one because Google made it and so they circumvent the whole page ranking process. Some queries of course need more than just plain search results (see Semantic Web and related things) but for those most engines don't offer enough control and transparency.

But I'm glad people are trying to build alternatives. I'd love a search engine that ignores sites with antipatterns like required registration for any kind of usage, and this is the first step.


Even if you skip the answer box the first result is a page which breaks down the market share of the most popular web servers.


This one is actually hilarious, because Google cites the site wrong for me:

> Apache HTTP Server

> It is one of the most popular web servers around the world. As of May 2022, Apache holds 31.5% of the market according to W3Techs and 22.99% according to Netcraft.

It's quoting that from https://www.stackscale.com/blog/top-web-servers/ which clearly states Nginx as the top one:

> As of May 2022, Nginx holds 33.5% of the market according to W3Techs and 30.71% according to Netcraft.


Another day on Hacker News, another roll-your-own search engine - yet I still waste my time pitching to VCs that ‘can’t see it’.


SearchHut is a cool name


DDoS Drew for a year and he starts writing code for your next competitor.


> SearchHut indexes from a curated set of domains. The quality of results is higher as a result, but the index covers a small subset of the web.

[citation needed]

The quality of the results right now is not very high, and in theory I don't understand why one would believe a search engine with a hand-picked set of domains could outcompete a search engine that can crawl the entire web and determine reputation by itself. This also ignores the fact that a lot of domains have a mix of high-quality and low-quality content, for example Twitter or Medium. If you are going to rely on domain-level reputation then your search engine is going to be way behind the search engines that can judge content more specifically, which is all of the other search engines.

If you were to tell me curated domains is just a bootstrapping method and as the search engine evolves it will change, fine - but right now the search engine is so simplistic that the theory of how it might be good is really the only point. And if that underlying theory is dubious, and the infrastructure is simplistic and obviously won't scale, then I don't know what is interesting or novel about this right now. It doesn't seem worthy of reaching the top of HN.


> If you are going to rely on domain-level reputation then your search engine is going to be way behind the search engines that can judge content more specifically, which is all of the other search engines.

Then why do Google and DuckDuckGo return 90% garbage for most queries?

"All of the other search engines" have completely failed to keep pages from the results that are not only low-quality, but outright spam.


> Then why do Google and DuckDuckGo return 90% garbage for most queries?

If you can give me a list of 10 normal-ish queries where 9 out of the first 10 results on Google or DDG are "garbage", then I'll concede your point.

I think you are creating an impossible standard for search engines, then using it to deem the current ones failures - while at the same time ignoring that this new search engine is, at present, unusable, with no realistic argument for why it might eventually be better.


See my reply on the sibling comment for an illustrative example.


They definitely do not return "90% garbage for most queries". This is an unsubstantiated claim I see often on HN, and honestly it's not backed by any real data. E.g. you can check your search history and see for yourself.


I just tried searching for "python str" on Google. I expected the top result to be a link to the official Python docs for the `str` type, then ideally some relevant StackOverflow questions highlighting common Python issues with strings, bytes, Unicode etc.

Instead, the top result was W3Schools. Then came the Python docs, then 5 pages somewhere between blogspam and poor-quality tutorials. Then a ReadTheDocs page dating to 2015. And that was it. No more official Python resources, no StackOverflow. In the middle of the results were some worthless "Google Q&A" dropdowns that lead to more garbage-quality content.

So for this query, using my definition of "garbage", the "garbage percentage" is somewhere between 80% and 90+%, depending on how many Q&A dropdowns you waste your time opening.



The fact that the ranking of results for queries that have nothing to do with location-based services depends on where you are located (and, possibly, on whether or not you are logged in) is one of the worst things about Google. And the fact that you can't seem to disable that behavior is even worse.


I just tried searching for “python str” on SearchHut, and the top result is the Postgres docs, then the Wikipedia article for empty strings, and then Drew’s blog. The official Python docs aren’t in the index at all.


For me the second hit is: https://docs.python.org/3/howto/clinic.html

So at least some official Python docs are indexed.


Afaik Python does not have a str type (I think you meant string?).

You could instead search for "python string" to find more information about python strings.

Even then, the very first result for "python str" is actually relevant for me (Python documentation about built-in types).


> Afaik Python does not have a str type

It does have it:

  $ python3 -c 'print(type(""))'
  <class 'str'>


> why one would believe a search engine with a hand picked set of domains would be expected to outcompete a search engine that can crawl the entire web and determines reputation by itself.

Because SEO manipulation is a well-developed field, ensuring that search engines which try to determine reputation automatically will (and do) end up with bad results.


Indeed. Whatever "smart" algorithm you use to rank results, you can be certain that half the web will turn into adversarial examples once your engine becomes popular enough.


> Notice! This product is experimental and incomplete. User beware!

Seems like your expectations are misplaced. Being at top of HN is not an indicator of quality, just interest.


> If you were to tell me curated domains is just a bootstrapping method and as the search engine evolves it will change, fine

This makes me think of a possible approach: curate a giant set of domains that almost exclusively host high-quality content. Crawl said domains. Use all of the crawled data as a training set to create a model with which to ascertain the quality of random web pages from other domains. Then spider everything and run it against the model.
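
In its crudest form that's just a text classifier trained on "curated domain" pages vs. a random crawl sample - a loose sketch under those assumptions (data loading omitted):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  def train_quality_model(curated_pages, random_pages):
      texts = list(curated_pages) + list(random_pages)
      labels = [1] * len(curated_pages) + [0] * len(random_pages)
      model = make_pipeline(
          TfidfVectorizer(max_features=50000, ngram_range=(1, 2)),
          LogisticRegression(max_iter=1000),
      )
      model.fit(texts, labels)
      return model

  # model.predict_proba([page_text])[0][1] is then a crude quality score
  # for deciding whether a spidered page is worth keeping.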


good


Quite passive-aggressive if you ask me. Boo-hoo, someone shared your project before you were "ready".

If you don't want something disclosed, don't disclose it.

The only way for three people to keep a secret is if two of them are dead.

A thing is in the world. Let it be in the world. Harness the collective power and focus it into a force multiplier.

Or don’t.


It's not passive aggressive. It's sensitive, but he has a right to be if he wants to. He wasn't petty or mean spirited in his announcement to take it down. He only expressed that he was taking the feedback very hard, which is understandable if you had big plans to roll out and make a good first impression.


Meh. Hacker News is the place to get actual, real feedback. Frying pan to fire, etc.

Develop a thick skin or don’t read the comments lol!

He chose to disclose it to a few people. Word spreads. That’s what happens.

Execute NDAs and have a security program if you don’t want stuff getting out.


Expecting people to have to use NDAs to not spread your hobby tech project is pretty much the antithesis of the "hacker ethos" that this website is literally named after


Where did you read aggression?


Whole thing sounds made up. "Haha oops one of my fans from IRC totally misunderstood and got me additional publicity UwU"


Pretty much just a worse version of Wikipedia's search at this point.



