I like the idea of searching a curated list of domains, but I'm not sure that doing the curation yourself is the best approach considering the huge number of useful but niche websites in existence.
I wonder if simply parsing all of Wikipedia (dumps are available, and so are parsers capable of handling them) and building a list of all domains used in external links would do the trick. Wikipedia already has substantial quality control mechanisms, and the resulting list should be essentially free of blog spam and other low-quality content. Wikipedia also maintains "official website" links for many topics, which can be obtained by parsing the infobox from the article.
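If anyone wants to try it, here's a rough (untested) Python sketch that streams a pages-articles dump and tallies external-link domains - the dump path is whatever you downloaded, and the regex is a crude stand-in for a proper wikitext parser like mwparserfromhell:

  import bz2
  import collections
  import re
  import xml.etree.ElementTree as ET
  from urllib.parse import urlparse

  DUMP = "enwiki-latest-pages-articles.xml.bz2"  # hypothetical local path
  domains = collections.Counter()

  with bz2.open(DUMP, "rb") as f:
      for _, elem in ET.iterparse(f):
          # <text> elements hold the wikitext; the crude regex stands in for
          # a real wikitext parser such as mwparserfromhell
          if elem.tag.endswith("}text") and elem.text:
              for url in re.findall(r'https?://[^\s\]|<>"]+', elem.text):
                  domains[urlparse(url).netloc.lower()] += 1
          elem.clear()  # frees this element's payload; a production run needs more careful memory handling

  for domain, count in domains.most_common(1000):
      print(domain, count)

(If I remember the dump layout right, there's also an externallinks SQL dump that would skip the wikitext parsing entirely, at the cost of less context about where each link appears.)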
Oh, and more to the point: this is a role Wikipedia explicitly renounced, isn't it? When it grew so big that Google's PageRank gave it high importance, the spam became unbearable, so Wikipedia decided it needed to change the incentives and applied rel=nofollow to all external links, so that it could stop working as an unpaid manual spam filter for the whole internet.
Sure, your new search engine might ignore rel=nofollow, but if it ever becomes big enough, spammers' incentives would put even more spamming pressure on Wikipedia...
That's easily fixed by relying only on protected or high-profile pages. Those already deal (mostly successfully) with spam and NPOV violations on a daily basis, so piggybacking on those protection mechanisms should yield a fairly high-quality pool of curated external links.
While that sounds good in theory, who is to say those who can edit protected and high-profile pages aren't in SEO spammers' pockets?
I mean, the edit history is public and there are plenty of people who actually pay attention to edits and the like, so they would be found out soon enough - but still.
I'm sure this is an ongoing discussion when e.g. political figures' pages are protected as well - who becomes the gatekeeper, and what is their political angle?
Sure, but at that point you're simply discussing Wikipedia's quality control system, which may be an interesting discussion, but has nothing to do with search engines per se.
Considering that Wikipedia has become a pillar of most scientific work (imagine writing a math or computer science paper without Wikipedia – utterly unthinkable), it's safe to say that knowledgeable people have collectively decided that its quality control mechanisms are "good enough", or at least better than those of any other resource of comparable depth and breadth.
And that puts Wikipedia's link pool leaps and bounds ahead of whatever dark magic current search engines are using, which mostly seems to be "funnel the entire web through some (easily gamed) heuristic".
Ehh, I think you’ve overestimated the quality of Wikipedia on highly specialist topics (like, say, the kinds of things you’d write academic papers on). It’s not so much that it doesn’t have the content, it’s that the coverage is super uneven; sometimes it has extremely detailed articles on a niche theorem in a particular field, and other times it has the barest stub of information on an entire sub-sub-field of study.
The "official website" links can be retrieved in a machine-readable way from Wikidata. E.g. a completely random silly example of official websites of embassies:
https://w.wiki/5T3R
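For anyone curious, that short link expands to a Wikidata Query Service query; here's roughly the same thing from Python. P856 ("official website") and P31 ("instance of") are real property IDs; the embassy class QID is only meant to match the linked example and is worth double-checking:

  import requests

  query = """
  SELECT ?item ?itemLabel ?website WHERE {
    ?item wdt:P31 wd:Q3917681 ;   # instance of: embassy (QID from memory, verify before reuse)
          wdt:P856 ?website .     # official website
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 100
  """

  resp = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": query, "format": "json"},
      headers={"User-Agent": "domain-curation-sketch/0.1 (example)"},
  )
  for row in resp.json()["results"]["bindings"]:
      print(row["itemLabel"]["value"], row["website"]["value"])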
That's our strategy at you.com - we start with the most popular sites, crawl them and build apps for them (e.g. you.com/apps), and let users vote in search results.
Your search actually performed better than Google for me on a random query. I queried both engines with "What is the specific heat of alcohol?": Google threw up a rich search answer that linked to some random Mexican site that is clearly exploiting SEO [1], while you.com linked me to answers.com (which is more trustworthy than a random Mexican website).
Users vote on their own preferences... not on other users' preferences.
So I don't see how it will be gamed unless we start incorporating user preferences into global ones - which we would only do if we thought it worked better and wasn't being gamed.
I've looked into this, and found Wikipedia's links not to be super useful. Wikipedia prefers references that don't change - so primarily books, and beyond that websites that don't change, meaning academic journals where you hit paywalls, and Wayback Machine archives of websites (even when the originals are still live). You aren't getting much use out of Wikipedia.
The problem with only using curated lists is that you kill discoverability of new sites, but the approach does have promise, like we've seen with Brave's 'Goggles'.
The site has just been taken offline by Drew due to the unfortunate start.
I hope we can come back to this once the project has been properly launched, although Drew notes that he is "really unhappy with how the roll-out went" and that "my motivation for this project has evaporated" [1].
Thanks for all the work Drew, I hope you guys manage to come to a conclusion that you are satisfied with!
For what it's worth, my search engine got prematurely "announced" for me on HN as well, while likewise hilariously janky. I don't think the launch is the end of the world. (I guess I had the benefit of serving comically bizarre results when the queries failed, so it got some love for that)
The bigger struggle is, because a search box is so ambiguous, people tend to have very high expectations for what it can do. Many people just assume it's exactly like Google. It's something a lot of indie search engine developers struggle with. Even if your tool is (or potentially could be) very useful, how can you make people understand how to use it when it looks like something else? Design wise it's a real tough nut to crack, a capital H Hard UX problem.
Drew has the right to cancel his projects, but I really hope others don't cling to the hopes of "the perfect rollout" with their projects.
Startups and side projects are messy and sometimes things don't go as planned. Contracts get canceled, a DoS takes down your homepage when you launch, losing all those free leads, people leak new features, and your sixth deployment erases most of the production database.
There are a lot of great ideas that start out as bad as the first release of thefacebook, Airbnb, Twitch and YouTube. Still, they iterate on these wonky, almost-working sites and end up making something great.
The amount of tweaking needed to make a search engine work well can't be overstated either. When you start out, it's inevitably going to be kinda shit. That's fine. Now you need to draw the rest of the owl.
Yeah, I agree. I was certainly underwhelmed with my first small search engine. It was so bad even I didn't want to use it - and I had spent months and months on it.
Still, most projects aren't a search engine. I see people put high expectations on how things will go and often it's just really hard to realize some of those hopes.
Sometimes you just have to take what you can get and iterate. Don't give up.
I think, with search engines, it's best to work on them for the problem domain itself - it's a fractal of interesting programming problems touching on pretty much every area of CS, programming, and programming-adjacent topics - and to take whatever comes out of it as an unexpected bonus.
Well said, my next version will be focused on a niche I actually need instead of general search. Still, I haven't finished studying the CS books + 47 algorithms I'll need to actually implement it.
I love your post. I have to say that my experience is that many in the HN community are not nearly as kind as you. Folks here will rip you apart if you don't have everything figured out :(
So, I started rolling out new features of our search engine on Twitter, Slack, etc instead of here.
Hey everyone I accidentally shared this too early.
I misinterpreted a don't-share-yet request as covering only the announcement post, not the software and announcement as a whole. I don't mean this as an excuse; that's just the context.
Good morning, HN. Please note that SearchHut is not done or in a presentable state right now, and those who were in the know were asked not to share it. Alas. I had planned to announce this next week, after we had more time to build up a bigger index, add more features, fix up the 404's and stub pages, do more testing, and so on, so if you notice any rough edges, this is why.
I went ahead and polished up the announcement for an early release:
Just a warning from a fellow search engine developer.
If you happen to be cloud hosting this, and if you do not have a global rate limit, implement one ASAP!
Several independent search engines have been hit hard by a botnet soon after they got attention - both mine and wiby.me, and I think a few others. I've had 10-12 QPS of sustained load for weeks on end from a rotating set of mostly Eastern European IPs.
It's fine if this is on your own infrastructure, but on the cloud, you'll be racking up bills like crazy from something like that :-/
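Even something as dumb as a process-wide token bucket in front of the query handler goes a long way. A minimal sketch - the class and the run_search() handler are made up:

  import threading
  import time

  class GlobalRateLimiter:
      """Process-wide token bucket: at most max_qps queries per second overall."""

      def __init__(self, max_qps: float, burst: int = 10):
          self.rate = max_qps
          self.capacity = burst
          self.tokens = float(burst)
          self.updated = time.monotonic()
          self.lock = threading.Lock()

      def allow(self) -> bool:
          with self.lock:
              now = time.monotonic()
              # refill proportionally to elapsed time, capped at the burst size
              self.tokens = min(self.capacity,
                                self.tokens + (now - self.updated) * self.rate)
              self.updated = now
              if self.tokens >= 1:
                  self.tokens -= 1
                  return True
              return False

  limiter = GlobalRateLimiter(max_qps=5)

  def handle_query(q):
      if not limiter.allow():
          return "429 Too Many Requests"  # shed load instead of paying for it
      return run_search(q)  # hypothetical search handler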
E.g. "Any websites engaging in SEO spam are rejected from the index" - how is it determined whether something is SEO spam or not? More clarification of what's allowed/not allowed would be nice!
But ultimately, it's subjective, and a judgement call will be made. If it's minor you might get a warning; if it's blatant, you'll just get de-listed.
I think it would be great if we had a dedicated code-forge index to search. That index would contain only the myriad code hosting sites around the internet - shared hosts like GitLab, GitHub, SourceHut, SourceForge, and Codeberg, plus all the project-run instances like kernel.org, GNU Savannah, GNOME, KDE, the BSDs, etc. There are probably hundreds of them out there. And it could allow people to submit their own self-hosted Gitea/GitLab/sr.ht/etc. instances to be crawled - maybe even suggest a robots.txt entry your crawler could key on as "yes please index me, hutbot".
I intend to at least support other search engines by adding !bangs for them and recommending them in the UI if you didn't find the results you're looking for. I don't think that crawling is something that is easily distributed across independent orgs, though.
I guess cppreference.com isn't even a part of the list?
I tried a couple test queries:
> lambda decay to function pointer c++
I get some FSF pages and the Wikipedia page for Helium?
> std function
I get... tons of Rust docs?
> std function c++
All Rust docs? The Wikipedia page for C++??
Interesting idea, but this seems like it would be the primary failure mode for an idea like this: as soon as you are researching outside of the curator's specializations, it doesn't have what you're looking for. Yet these results would both be fixed simply by adding cppreference.com to the index. Let's try and give it a real challenge:
> How to define systemverilog interface
And as I might expect, I get wikipedia pages. For "Verilog", for "System on a Chip" and for "Mixin".
1st google result:
> An Interface is a way to encapsulate signals into a block...
I added cppreference.com now and kicked off a crawl. It'll be a while. The list of domains is pretty small right now -- it was intended to be bigger before the announcement was made. Will also add RFCs and man pages soon.
There will (soon) be a form to request that new domains be added to the index, so if there are any sites you want indexed which are outside of my personal expertise, you'll be able to request them.
You've probably already thought about this, but just in case, a feature idea: add moderation support for collaboration - somewhat-trusted people vetting niche subjects.
GitHub doesn't seem to be either. I get that it's a competitor but not being able to search GitHub is probably a deal breaker for most devs that aren't Drew.
I'm not opposed to indexing GitHub, but the signal to noise ratio on GitHub is poor. Nearly all GitHub repositories are useless, so we'd have to filter most of it out. I think instead I'll have to set it up where people can request that specific interesting repositories are added to the index, and maybe crawl /explore to fill in a decent base set.
GitHub is hella tricky to crawl too due to its sheer size and single entry point (meaning slow crawl speed). I've been looking at the problem as well, and so far just ignored it as un-crawlable, but I might do something like crawl only the about pages for repos that are linked to externally some time in the future.
There's an asterisk to that: they serve the underlying content through two different APIs, so one can side-step the HTML wrapper around the bytes. The discovery phase has a formal API (both REST and GraphQL) for finding repos, and then the in-repo content can be git-cloned and indexed locally - every branch, commit, and blob - without issuing hundreds of thousands of HTTP requests to GitHub. One would still need to hit GitHub for the issues, if those are in scope, but it'd be far fewer HTTP requests unless your repo is named kubernetes or terraform.
We're still talking about git-cloning a hundred thousand GitHub repos. Git repos get big very fast. That's a lot of data when realistically you're only interested in a few Markdown files per repo.
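A shallow, blobless, sparse clone cuts that down a lot, though - roughly something like this (assumes a reasonably recent git; only the repo's top-level files, README.md included, get checked out):

  import pathlib
  import subprocess
  import tempfile

  def fetch_repo_docs(clone_url: str) -> dict[str, str]:
      """Grab only the top-level Markdown files (README etc.) of a repo."""
      docs = {}
      with tempfile.TemporaryDirectory() as tmp:
          # --depth 1: no history; --filter=blob:none: no blobs up front;
          # --sparse: check out only the files in the repo root
          subprocess.run(
              ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse",
               clone_url, tmp],
              check=True, capture_output=True,
          )
          for path in pathlib.Path(tmp).glob("*.md"):
              docs[path.name] = path.read_text(errors="replace")
      return docs

  print(fetch_repo_docs("https://github.com/git/git.git").keys())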
SearchHut was built to this point in about a week by Drew and contributors which I think is amazing.
It is also meant to be very simple to run in case you want to index your own category of sites. For instance, cooking content is specifically not indexed, but if _you_ wanted to, you could spin up an instance and index cooking sites yourself.
This seems like a great idea, honestly. There are niche topics that are very hard to navigate in Google, because it's so skewed towards mainstream topics. I think it would make sense for these communities to maintain their own search engines.
Too bad this came out before Drew intended, and I hope that after having a weekend to rest he’ll feel his motivation recover.
One meta-thought: I think projects like this are surfacing something interesting. The underlying technology to make a pretty good search engine is no longer especially difficult, either in programmer effort or in server resources. This is potentially a very good thing, as it means the end of the Google era.
I can imagine a future that is almost a blast from the past, where there are a lot of different search engines, those engines are curated differently, and while none of them index the entire Internet, that’s what makes them valuable and better than Google (which I think cannot defeat spam).
I’m trying to think of a historical parallel, where some service used to be very difficult to provide and therefore could only effectively be done by a single natural monopoly, but technology progressed and opened up the playing field, breaking the monopoly. Television has some similarities. Perhaps radio vs. podcasting. What others?
Google also funnels a lot of traffic to itself through Chrome's search bar, and Firefox does the same. Sure you can replace the search engine, but whatever you replace it with needs to have the same capabilities or the entire model falls apart. Meanwhile, alternative means of navigating the web (such as bookmarks) are made increasingly difficult to access, requiring multiple clicks.
I don't mean to be conspiratorial, I'm sure there are good intentions behind this, the consequence however is effectively locking in Google as the default gateway for the Internet.
- search queries are performed directly from the client's computer, so you can't protect users' privacy (since the Custom Search JSON API has a daily limit of 10k queries)
- you're forced to use JavaScript, and the way it's implemented makes it difficult if not impossible to do even basic things like the loading animation cards
- ads are loaded from an iframe, so you can't do any styling (except the extremely limited options they make available in their settings - and no matter what, it will be very ugly if you want to have a light/dark theme)
But there are of course many benefits as well, such as it being 'free' (Bing is ridiculously expensive IMO, and it feels impossible to join their ad network to offset the costs... which might explain why you see countless Bing proxies shut down after a few months), and the search results are no doubt better than the ones you'd get from Bing.
It's been around forever, but your concern is real. Who's to say an OSS project won't get archived, or removed from the internet? Why invest time into anything when it will all be replaced eventually?
edit: Looks like this OSS project was launched and cancelled in a single day.
Interesting. It looks like Google's Custom Search Engines evolved into this?
I can't tell whether this is a neglected Google product that they were going to refresh but lost interest in, or something that is undergoing a breath of fresh air.
As you say, I was able to add a list of domains and get some pretty decent results from it. The UI makes me feel like Google isn't interested in making it a truly successful product, though.
For those looking for an alternative to that, I've been building a self-hosted search engine that crawls what you want based on a basic set of rules. It can be a list of domains, a very specific list of URLs, and/or even some basic regexes.
Great project! Given a local archive of Wikipedia and other sources, this can be very powerful.
Which raises the question: does archive.org offer their Wayback Machine index for download anywhere? Technically, why should anyone go through the trouble of crawling the web if archive.org has been doing it for years, and likely has one of the best indexes around? I've seen some 3rd-party downloaders for specific sites, but I'd like the full thing. Yes, I realize it's probably petabytes of data, but maybe it could be trimmed down to just the most recent crawls.
If there was a way of having that index locally, it would make a very powerful search engine with a tool like yours.
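As far as I know there's no bulk download of the full index, but the CDX API at least lets you enumerate what they hold for a domain without fetching the archive itself. A quick sketch using the documented CDX parameters:

  import requests

  resp = requests.get(
      "https://web.archive.org/cdx/search/cdx",
      params={
          "url": "example.com/*",
          "output": "json",
          "filter": "statuscode:200",
          "collapse": "urlkey",  # one row per unique URL
          "limit": "500",
      },
  )
  rows = resp.json()
  header, entries = rows[0], rows[1:]  # first row holds the field names
  for entry in entries:
      record = dict(zip(header, entry))
      print(record["timestamp"], record["original"])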
I think the idea of federation of domain-specific search engines, possibly tied together by one or more front-ends, is a brilliant idea.
I think it's similar to how Google's search works internally, though I doubt the separation is based on a list of domain (as in DNS) names. IIRC they have a set of search modules, and what they return (and how fast they return it) all gets mixed in to the search results according to some weighting. Right below the ads.
If you look at a search system that way, it's easy enough to add modules that do things like search only wikipedia, and display those results in a separate box (like DDG), or parse out currency conversion requests, and display those up top based on some API (like Google). etc
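A sketch of what I mean by modules - each one either claims the query and returns a boxed result for the front-end to merge above the organic hits, or bows out with None (helper calls like get_rate() are obviously made up):

  import re

  def currency_module(query: str):
      m = re.match(r"(\d+(?:\.\d+)?)\s*(usd|eur|gbp)\s+(?:to|in)\s+(usd|eur|gbp)", query, re.I)
      if not m:
          return None
      amount, src, dst = float(m.group(1)), m.group(2).upper(), m.group(3).upper()
      rate = get_rate(src, dst)  # hypothetical call to whatever FX API you like
      return {"box": "currency", "text": f"{amount} {src} = {amount * rate:.2f} {dst}"}

  def wikipedia_module(query: str):
      hits = search_wikipedia_index(query)  # hypothetical module-local index
      return {"box": "wikipedia", "hits": hits[:3]} if hits else None

  MODULES = [currency_module, wikipedia_module]

  def boxed_results(query: str):
      # each module either answers the query or returns None
      return [r for m in MODULES if (r := m(query)) is not None]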
It is possible for a site's pages to be of different quality: maybe one article about MySQL is not very informative, while an article about Python on the same site is a go-to reference. The search engine operated by the author is unlikely to account for that.
I wonder how the page ranking will work in the end. A quick look at the source doesn't show (me!) any plan for intelligent ranking. The database has a last_index_date and an authoritative field, which could be used for basic relevance sorting, but nothing exhaustive.
Postgres as a backend is maybe not the best choice, and there are already many sites that index specific pages and take suggestions. The hard part is getting relevant results once you have a large index.
As I understand it, the idea is to only have manually curated, high-quality domains. In that regard, any ranking beyond BM25 is entirely secondary. It might work, but it leaves out a lot of long-tail sites that (in my experience at least) often have very good results. It's really the middle segment where most of the shit is.
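For reference, plain BM25 over a small curated index is about this much code (standard k1/b defaults, toy documents just to show the shape):

  import math
  from collections import Counter

  def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
      """Score each doc (a list of tokens) against the query terms with Okapi BM25."""
      N = len(docs)
      avgdl = sum(len(d) for d in docs) / N
      df = Counter()
      for d in docs:
          df.update(set(d))  # document frequency per term
      scores = []
      for d in docs:
          tf = Counter(d)
          score = 0.0
          for t in query_terms:
              if t not in tf:
                  continue
              idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
              score += idf * tf[t] * (k1 + 1) / (
                  tf[t] + k1 * (1 - b + b * len(d) / avgdl))
          scores.append(score)
      return scores

  docs = [["curated", "search", "engine"], ["seo", "spam", "farm"], ["search", "ranking"]]
  print(bm25_scores(["search", "engine"], docs))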
Has anyone experimented with creating a search engine that only indexes the landing page of domains? I’m less interested in another Google, and more interested in a way to find new and interesting sites/blogs/etc. Stumbleupon was great for this back in the day.
Seems like it would be an interesting experiment to see what the results would be, indexing only the content / meta tags of “index.html”.
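The crawler for that would be almost trivially small - something like this, fetching only "/" and keeping the title and meta description (requests + BeautifulSoup; the user-agent string is made up):

  import requests
  from bs4 import BeautifulSoup

  def index_landing_page(domain: str) -> dict:
      resp = requests.get(f"https://{domain}/", timeout=10,
                          headers={"User-Agent": "landing-page-indexer-sketch/0.1"})
      soup = BeautifulSoup(resp.text, "html.parser")
      title = soup.title.string.strip() if soup.title and soup.title.string else ""
      meta = soup.find("meta", attrs={"name": "description"})
      description = meta.get("content", "").strip() if meta else ""
      return {"domain": domain, "title": title, "description": description}

  print(index_landing_page("example.com"))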
I built a solution at https://mitta.us/ that lets you submit the sites you want crawled, and puts them in a self-managed index (which isn't shared globally). I don't do link extraction, but instead let GPT-3 generate URLs based off keywords.
!url <keyterms> |synthesize
I also wrote a screenshot extension for Chrome that lets you save a page when you find it interesting. The site is definitely not "done" but it's usable if you want to try it. Some info in help and in commands is inaccurate/broken, so it is what it is for now.
It does the !google <search term> and !ddg <search term> thing to find pages to save to the index. There are a bunch of other commands I added, and there's an ability for others to write commands and submit them to a Github repo: https://github.com/kordless/mitta-community
!xkcd was fun to write. It shows comics. The rest of the commands can be viewed from !help or just !<tab>
I've been working on pivoting the site to do prompt management for GPT-3 developers and have been kicking around Open Sourcing the other version for use as a personal search engine for bookmarked pages.
- Beltalowda – no results (for reference: it's a term used to refer to "people from the [asteroid] belt" in The Expanse books and TV series).
- The Expanse – bunch of results, but none are what I'm looking for (the TV series or books). It looks like it may drop the "the" in there?
- Star Trek – a bunch of results, but ordered very curiously; the first is the Wikipedia page for "Star Trek Star Fleet Technical Manual", and lots of pages like "Weapons in Star Trek" and such.
- NGC 3623 – lists "Messier object" and "Messier 65", in that order, which is somewhat wrong as NGC 3623 refers to Messier 65 specifically.
- NGC3623 (same as previous, but without a space) – no results.
- vim map key – pretty useless results, most of which have no bearing on Vim at all, much less mapping keys in Vim.
- python print list – the same; The Go type parameters proposal is the first result; automake the second, etc.
Conclusion: "this product is experimental and incomplete" is an understatement.
I didn't call the product garbage, just some of the results, which I think is fairly accurate. But I edited it to "useless" now, as that comes off as a bit less harsh.
I love that I can self-host this! Are there plans for federation?
Rather than maintaining a whole separate index for myself, I'd love to self-host an instance of this, only indexing sites that aren't in the main index, and then falling back to the main index / merging it with my index to answer queries. I wonder how easy that would be with the current architecture.
Does Sourcehut offer textual search within a repo's files? GitHub and GitLab offer it, but Codeberg doesn't seem to (and I couldn't find any information about its presence or absence on Sourcehut).
In addition to the curated domains list, some searches would benefit from limiting the display of old results: often you find an answer, but it's solved in jQuery or an older version of the framework you're using.
Bad SERP... I searched 'mdn a'.
Google returns '<a>: The Anchor element - HTML: HyperText Markup Language | MDN'.
SearchHut returns a generic 'MDN Web Docs'.
It seems like it uses PostgreSQL's FTS, which will generally drop stop words, so "the", "a", "and" and similar words are discarded. I've been meaning to figure out the best way to deal with this myself, and I'm guessing looking for exact matches first and then running an FTS query could work.
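Something like a two-pass query might do it. A sketch against a hypothetical pages(url, title, tsv) table via psycopg2, where the first pass is a literal match that survives stop-word-only queries like "mdn a" and the second is regular ranked FTS:

  import psycopg2

  def search(conn, q: str, limit: int = 20):
      with conn.cursor() as cur:
          # pass 1: literal substring match, so stop-word-heavy queries
          # like "mdn a" can still hit exact titles
          cur.execute(
              "SELECT url, title FROM pages WHERE title ILIKE %s LIMIT %s",
              (f"%{q}%", limit),
          )
          rows = cur.fetchall()
          if rows:
              return rows
          # pass 2: regular ranked full-text search
          cur.execute(
              """
              SELECT url, title
              FROM pages
              WHERE tsv @@ websearch_to_tsquery('english', %s)
              ORDER BY ts_rank_cd(tsv, websearch_to_tsquery('english', %s)) DESC
              LIMIT %s
              """,
              (q, q, limit),
          )
          return cur.fetchall()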
You can write a custom stemming algorithm and load it as an extension library into Postgres, then use that with `CREATE TEXT SEARCH DICTIONARY` to create a custom dictionary. It's not as difficult as it sounds - you can use the default Snowball stemmer as a sample, and tweak it.
It's not just a matter of a custom dictionary. Stop words are usually excluded for a reason; you need to understand the nature of the query and know when to exempt what look like stop words from being pruned. It's not really a job for a Snowball stemmer, as you need to operate over multiple tokens to gather context.
Most of the time, keywords like this come from external anchors as well, which is something that you're gonna be able to leverage with this design (as I understand it).
The API limits are not documented yet, like many other things, due to the early launch. For now I'll just say "be good". Don't hit it with a battering ram.
Considering Google's answer box randomly picked multiple photos of unrelated people as pictures of murderers and rape victims (with Google being very uncooperative about resolving the issue) I'd say the lack of an answer box might not be that bad.
It can certainly be a helpful feature, but I wonder whether it's really better than good, relevant search results presented in a readable way. For example I'd argue the manually curated infoboxes on Wikipedia are likely more reliable than the algorithmic versions Google shows in their results, especially as it's difficult to fix mistakes in Google's version. Google thinks their own solution is the best one because Google made it and so they circumvent the whole page ranking process. Some queries of course need more than just plain search results (see Semantic Web and related things) but for those most engines don't offer enough control and transparency.
But I'm glad people are trying to build alternatives. I'd love a search engine that ignores sites with antipatterns like required registration for any kind of usage, and this is the first step.
This one is actually hilarious because google cites the site wrong for me.
>Apache HTTP Server
>It is one of the most popular web servers around the world. As of May 2022, Apache holds 31.5% of the market according to W3Techs and 22.99% according to Netcraft.
It's quoting that from https://www.stackscale.com/blog/top-web-servers/ which clearly states Nginx as the top one.
>As of May 2022, Nginx holds 33.5% of the market according to W3Techs and 30.71% according to Netcraft.
> SearchHut indexes from a curated set of domains. The quality of results is higher as a result, but the index covers a small subset of the web.
[citation needed]
The quality of the results right now is not very high, and in theory I don't understand why one would believe a search engine with a hand-picked set of domains would be expected to outcompete a search engine that can crawl the entire web and determine reputation by itself. This also ignores the fact that a lot of domains have a mix of high-quality and low-quality content, for example Twitter or Medium. If you are going to rely on domain-level reputation, then your search engine is going to be way behind the search engines that can judge content more specifically, which is all of the other search engines.
If you were to tell me curated domains is just a bootstrapping method and as the search engine evolves it will change, fine, but right now the search engine is so simplistic that the theory of how it might be good is really the only point. And if that underlying theory is dubious, and the infrastructure is simplistic and obviously won't scale, then I don't know what is interesting or novel about this right now. Doesn't seem worthy of reaching the top of HN.
> If you are going to rely on domain-level reputation then your search engine is going to be way behind the search engines that can judge content more specifically, which is all of the other search engines.
Then why do Google and DuckDuckGo return 90% garbage for most queries?
"All of the other search engines" have completely failed to keep pages from the results that are not only low-quality, but outright spam.
> Then why do Google and DuckDuckGo return 90% garbage for most queries?
If you can give me a list of 10 normal-ish queries where 9 out of the first 10 results on Google or DDG are "garbage", then I'll concede your point.
I think you are creating an impossible standard for search engines, then using it to deem the current ones failures - while at the same time ignoring that this new search engine is, at present, unusable, with no realistic argument for why it might eventually be better.
They definitely do not return "90% garbage for most queries". This is an unsubstantiated claim I see often on HN, and honestly it's not backed by any real data. E.g. you can check your search history and see for yourself.
I just tried searching for "python str" on Google. I expected the top result to be a link to the official Python docs for the `str` type, then ideally some relevant StackOverflow questions highlighting common Python issues with strings, bytes, Unicode etc.
Instead, the top result was W3Schools. Then came the Python docs, then 5 pages somewhere between blogspam and poor-quality tutorials. Then a ReadTheDocs page dating to 2015. And that was it. No more official Python resources, no StackOverflow. In the middle of the results some worthless "Google Q&A" dropdowns that lead to more garbage quality content.
So for this query, using my definition of "garbage", the "garbage percentage" is somewhere between 80% and 90+%, depending on how many Q&A dropdowns you waste your time opening.
The fact that the ranking of results for queries that have nothing to do with location-based services depends on where you are located (and, possibly, on whether or not you are logged in) is one of the worst things about Google. And the fact that you can't seem to disable that behavior is even worse.
I just tried searching for “python str” on SearchHut and the top result is the Postgres docs, then the Wikipedia article for the empty string, and then Drew’s blog. The official Python docs aren’t in the index at all.
> why one would believe a search engine with a hand picked set of domains would be expected to outcompete a search engine that can crawl the entire web and determines reputation by itself.
Because SEO manipulation is a well developed field, ensuring that the search engines trying to determine reputation automatically will (and does) end up with bad results.
Indeed. Whatever "smart" algorithm you use to rank results, you can be certain that half the web will turn into adversarial examples once your engine becomes popular enough.
> If you were to tell me curated domains is just a bootstrapping method and as the search engine evolves it will change, fine
This makes me think of a possible approach. Curate a giant set of domains that almost exclusively host high quality content. Crawl said domains. Use all of the crawled data as a training set to create a model with which to ascertain the quality of random Web pages from other domains. Then spider everything and run it against the model.
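Concretely, that last step could be as simple as a bag-of-words classifier trained on "curated crawl vs. random crawl" - a rough scikit-learn sketch, with the data-loading helpers assumed:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  # hypothetical corpora: page texts from the curated crawl vs. a random crawl
  curated_pages = load_pages("curated_crawl/")  # assumed helper
  random_pages = load_pages("random_crawl/")    # assumed helper

  texts = curated_pages + random_pages
  labels = [1] * len(curated_pages) + [0] * len(random_pages)

  model = make_pipeline(
      TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
      LogisticRegression(max_iter=1000),
  )
  model.fit(texts, labels)

  # probability that an unseen page "looks like" the curated set
  quality = model.predict_proba([new_page_text])[0][1]  # new_page_text assumed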
It's not passive aggressive. It's sensitive, but he has a right to be if he wants to. He wasn't petty or mean spirited in his announcement to take it down. He only expressed that he was taking the feedback very hard, which is understandable if you had big plans to roll out and make a good first impression.
Expecting people to have to use NDAs to not spread your hobby tech project is pretty much the antithesis of the "hacker ethos" that this website is literally named after