Creating a thin, non-working wrapper around DuckDuckGo as a "hire me" ad, claiming it's a search engine and mentioning Google in the headline because it'll get more clicks.
To be fair, a lot of resumes say "I created X" and "I developed Y" when, if you look even a little past the surface, the claim boils down to "Made a thin wrapper around one or more existing services." I interviewed a guy who was the sole developer of a web app that converted image files from one format to another. OK, cool project, bro, let's look at it. A little probing revealed it was just a small generic page that called into ImageMagick to do everything.
I'm convinced that's a lot of what passes for software development in the 2020s: Make a splash screen and display some branding on top of older, robust, actually-complex projects.
Leaning on IM seems like exactly the right way to handle the problem? If I needed web-accessible image handling tricks, I could get away with not much code, because it is ultimately going to be passing some POSTed bytes into an image handling library.
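To be clear, that's roughly all it takes. A minimal sketch of such a wrapper, assuming Flask and the ImageMagick 7 "magick" CLI are installed; the route and MIME handling are made up for illustration:

    # Illustrative sketch only: a thin HTTP wrapper that shells out to
    # ImageMagick. Assumes Flask and the ImageMagick 7 "magick" CLI are
    # installed; the route and MIME handling are naive placeholders.
    import subprocess
    from flask import Flask, Response, request

    app = Flask(__name__)

    @app.route("/convert/<fmt>", methods=["POST"])
    def convert(fmt):
        # Pipe the POSTed image bytes through ImageMagick:
        # "-" reads from stdin, "<fmt>:-" writes the requested format to stdout.
        result = subprocess.run(
            ["magick", "-", f"{fmt}:-"],
            input=request.get_data(),
            capture_output=True,
            check=True,
        )
        return Response(result.stdout, mimetype=f"image/{fmt}")

That's basically the whole product, which is the point being made above.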
It's a little unfair to say that Kagi "does that" and thereby imply that they merely repackage search results from Google and Bing. Kagi uses an array of different search sources, including Google, DuckDuckGo, Apple, Wikipedia, Wolfram Alpha, and others, alongside their own small web crawlers. Their mission is to present search results in a user-centric way, so they try to surface results that are directly useful to the user rather than to advertisers (they don't have advertising partners as far as I can tell), and their business model supports that approach rather than being in conflict with it.
https://help.kagi.com/kagi/search-details/search-sources.htm...
> Yeah I think they are trying to get a search-results-only view without the ads etc.
If that were the case, then their description is completely off, as what they are doing is not creating a search engine but filtering results from other search engines.
I mean, just because you pipe your output to grep, it does not mean you created an entirely new app.
"Our search result pages may include a small number of clearly labeled "sponsored links," which generate revenue and cover our operational costs. Those links are retrieved from platforms such as Google AdSense. In order to enable the prevention of click fraud, some non-identifying system information is shared, but because we never share personal information or information that could uniquely identify you, the ads we display are not connected to any individual user."
The phonebook capabilities of pretty much all Google alternatives suck. Google is currently the only search engine that actually works for local queries (e.g. "phone repair in <local town>").
Alternatives, while probably fine in America, suck in the Nordics. I think people forget just how much search traffic happens in this category.
This is arguably a large part of why Google's tough to compete with. They have an absurd number of seamless integrations. Even if you do search well, you'll have users going back to Google for a bunch of other stuff all the time.
I've been using SearXNG[1] via Perplexica[2] and I couldn't be happier. It replaced Google and ChatGPT/Perplexity type search engines for me and it's the first tool I use for question answer type searches.
As someone who cares about their online searches actually being good, fast and private, I cannot recommend SearXNG more. https://github.com/searxng/searxng/
It's a metasearch engine that can query multiple search providers at once, including Google, so you're not missing out on the good results you expect. Pick an instance at https://searx.space/ and tell your friends!
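For anyone curious what talking to a self-hosted instance looks like, here is a rough sketch against SearXNG's JSON output (which, as far as I know, has to be enabled in the instance's settings.yml); the instance URL is a placeholder:

    # Rough sketch: query a self-hosted SearXNG instance and print results.
    # Assumes the instance has JSON output enabled in settings.yml; the URL
    # below is a placeholder for your own instance.
    import requests

    INSTANCE = "http://localhost:8080"

    def search(query):
        resp = requests.get(
            f"{INSTANCE}/search",
            params={"q": query, "format": "json"},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json().get("results", [])

    for result in search("self-hosted metasearch"):
        print(result.get("title"), "->", result.get("url"))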
Yeah I ran that on like $5k worth of consumer hardware initially. The only caveat is that you want a lot of RAM, and since consumer machines are mostly limited to 128 GB, that's a bit of a hard limit on how much you can do. In terms of storage space and compute, it's relatively fine.
I'm on a single ~$17k server now, probably utilizing about 20% of its capacity.
The problem domain is sprawling and varied, so you ideally want a bit of everything. You definitely need to do a bunch of custom low-level programming, as a lot of standard tools can't cope with the data volumes involved. Like I'm not even storing known URLs in a DBMS anymore; the table would be too slow to update.
A high-level language with the ability to make downcalls to a low-level language is probably a sane compromise.
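As an illustration of what skipping the DBMS for known URLs can look like (just one possible approach, not necessarily what the parent commenter actually does): keep a compact probabilistic set of URL hashes in memory and accept a tiny false-positive rate.

    # Illustrative sketch only: track already-seen URLs with a Bloom filter
    # instead of a database table. One possible approach, not necessarily
    # what the parent commenter actually does.
    import hashlib

    class SeenUrls:
        def __init__(self, size_bits=1 << 27, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, url):
            # Derive several bit positions from independent keyed hashes.
            for i in range(self.num_hashes):
                digest = hashlib.blake2b(url.encode(), salt=i.to_bytes(8, "little")).digest()
                yield int.from_bytes(digest[:8], "little") % self.size

        def add(self, url):
            for pos in self._positions(url):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, url):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

    seen = SeenUrls()
    seen.add("https://example.com/")
    print("https://example.com/" in seen)        # True
    print("https://example.com/other" in seen)   # almost certainly False

A lookup is just a handful of hash computations and bit tests, with no table updates or index maintenance, at the cost of occasional false positives.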
Robust crawling with all the gratuitous anti-bot infra out there (requesting the RSS feed of a public gov site, a feed specifically meant to be consumed by machines, is 'protected' by Cloudflare's anti-botting. Seriously?) takes way more work than you think.
The way to get around anti-bot stuff as a search engine crawler is to be upfront about what you are doing, to respect robots.txt and rate limits, and to use a generous crawl delay (1 second is the bare minimum).
You can register with Cloudflare as a search engine crawler and they will largely let your traffic through. If you get blocked by individual sites, you can usually just email them explaining the situation and get unblocked.
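A bare-bones sketch of the polite-crawler side of that advice (honest user agent, robots.txt check, generous delay); the bot name and contact URL are placeholders:

    # Bare-bones sketch: identify yourself, honor robots.txt, and wait a
    # generous interval between requests. The user agent string and contact
    # URL are placeholders.
    import time
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlsplit, urlunsplit

    USER_AGENT = "ExampleSearchBot/0.1 (+https://example.invalid/bot-info)"
    CRAWL_DELAY = 2.0  # seconds; treat 1s as the bare minimum

    def allowed(url):
        parts = urlsplit(url)
        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", "")))
        robots.read()
        return robots.can_fetch(USER_AGENT, url)

    def fetch(url):
        if not allowed(url):
            return None
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=30) as resp:
            body = resp.read()
        time.sleep(CRAWL_DELAY)  # crude global delay; a real crawler does this per host
        return body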
I haven't looked at it for over a decade, but the p2p search engine YaCy is very old and it worked just fine. Something similar shouldn't be too hard to make. We are spoiled with tools now.
The sales pitch is simple: you download the crawler, point it at your own blog to index it, then build an index of the pages your blog links to. Then index the pages those pages link to, and so on. You simply crank up the depth whenever you like.
If you are a half-decent blogger, you have articles that link to most of the important websites that fit the subject of your blog.
You put a search box/page on your website that connects to your desktop client, and your visitors can search with options:
- articles on this blog,
- related pages you've linked to,
- depth 1-5 to broaden the topical search (but with less closely related articles),
- search other instances.
It scales so well because searching your own blog is the most important part, linked pages are pretty nice to have, deeper crawls are still useful but much less important, and searching other instances, the anticlimax if you like, is great but the least important.
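A toy sketch of the depth-limited crawl described above, starting from a seed blog and following links N hops out (naive on purpose; a real crawler would add politeness, canonicalization, and so on; the seed URL is a placeholder):

    # Toy sketch of the "start at your own blog, crank up the depth" idea:
    # a breadth-first crawl limited to N hops from the seed. Deliberately
    # naive; a real crawler would add politeness, dedup by canonical URL, etc.
    import re
    from collections import deque
    from urllib.parse import urljoin

    import requests

    LINK_RE = re.compile(r'href="(https?://[^"#]+)"')

    def crawl(seed, max_depth=2):
        index = {}                       # url -> page HTML
        queue = deque([(seed, 0)])
        seen = {seed}
        while queue:
            url, depth = queue.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            index[url] = html
            if depth < max_depth:
                for link in LINK_RE.findall(html):
                    link = urljoin(url, link)
                    if link not in seen:
                        seen.add(link)
                        queue.append((link, depth + 1))
        return index

    pages = crawl("https://example-blog.invalid/", max_depth=1)  # placeholder seed

Depth 0 is just your own blog, depth 1 adds the pages you link to, and each extra level broadens the topical reach at the cost of relevance, which is exactly the trade-off described above.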
Even the crappiest hardware can do 50,000 pages per day, and if you run it slowly in the background at, say, 100 pages per day on average, that is still 36,500 every year.
More usual is to be excited about the newfound tool and run it for a few hours the first day. You are initially shocked by how useful it is. The next day you crawl a few more pages until you get bored with it. You look again after a while and do one more good crawl. A few years later you have an oddly large index.
You might want to run it automatically whenever your RSS feed updates.
If you use it once in a while it is easy to ban some instances full of spam.
YaCy checks all results returned by other nodes by fetching the HTML and looking for the keywords on the page. This worked well. A very stale index may reflect poorly on the node, but it may also be full of material that is important to you.
You would get crusty results at times, but this is a feature, not a bug. There is no man behind the curtain deciding what you may and may not look at.
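That verification step boils down to something like the following (the gist only, not YaCy's actual code): fetch the candidate page and confirm the query keywords actually appear in it.

    # Sketch of the idea: before trusting a result from another node, fetch
    # the page and check that the query keywords actually appear in it.
    # Not YaCy's actual implementation, just the gist.
    import requests

    def result_still_matches(url, keywords):
        try:
            text = requests.get(url, timeout=10).text.lower()
        except requests.RequestException:
            return False
        return all(kw.lower() in text for kw in keywords)

    print(result_still_matches("https://example.com/", ["example", "domain"]))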
If your client is not running, the search box/page on your blog only does p2p, but it is likely still able to search your domain. What is a lot of posts for a blog is not a lot for a crawler. You can glue all kinds of products onto this. Besides a DB, YaCy keeps the full text of all pages crawled, but only the text. If users want a feature that can't be done for free, you can sell it to them. If someone has a website that is hard to index, they can customize their crawler themselves or pay to have it done.
If you want to throw money at it and have a blog search engine, you can limit the results to sites that have an RSS or Atom feed.
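Detecting "has an RSS or Atom feed" is cheap enough; a rough sketch of that check:

    # Rough sketch: decide whether a site qualifies for a "blogs only" index
    # by looking for an advertised RSS/Atom feed in its HTML.
    import re
    import requests

    FEED_RE = re.compile(r'<link[^>]+type="application/(?:rss|atom)\+xml"', re.IGNORECASE)

    def has_feed(url):
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            return False
        return bool(FEED_RE.search(html))

    print(has_feed("https://example.com/"))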
It's fashionable to bash Google and there are probably good reasons to do so. But I agree, their search is still good (enough) for me. Over the years, I've tried others, including DDG, but ultimately returned to Google every time.
So, until a better one comes along, Google's search is fine for me.
Google is mostly fine if you know what you are looking for and if it's not an SEO / ad target topic.
If those two don't apply, Google is often pretty much useless, just giving you low-quality AI spam blog posts, useless ad-ridden product comparison sites, and marketing pages.
There's a reason why adding "reddit" to Google searches has almost become a meme now.
Why do people always jump on 'umm Google bad' trends? It's all well and good to say it, but as someone who hasn't ever experienced this (apparently common) problem before, I would have to see it to believe it.
Don't get me wrong, I still find anything I search for, just on pages 2 through N.
I switched to DuckDuckGo about 5 years ago and look back occasionally.
DDG simply works better for me. I find what I search for usually on page 1.
Just to put things into context: I don't care about Google, meaning I have no beef with them. I used Google search for many years until I switched the default engine, simply because the search results for how I search got worse over time.
Google is fine unless you need something really specific; ChatGPT is better at this point for a lot of queries, because Google gives you the most simplistic answer no matter how specific you get with your keywords.
Quotes still work when there are good results. As far as I can tell, all that has changed is a) dropping obviously bad duplicate results like Stack Overflow scrape dumps, and b) adding in more unquoted results when there's a lack of good results for the quoted term. I personally haven't had any issues with quoting.
Last time I investigated (many years ago), quoted Google searches only work in some countries. I don't know why. I dug that fact up from a Google help page somewhere.
DuckDuckGo is now particularly riddled with AI spam. The top 5 summaries for lots of searches now begin with some variation of "In the fast-moving world of...". Utter shit.
It'd be so easy to filter I wonder why they/Microsoft don't bother. Oh, wait...