Hacker News
Show HN: Ichido, search engine that tags sites using Google and Cloudflare (ichi.do)
122 points by anthonyhn on Feb 26, 2023 | 71 comments
Hello HN,

In my spare time I work on an experimental search engine named Ichido. Search is fascinating: there are so many features you can add to a search engine, yet I find that the existing search engines are a bit limited in what they have to offer. So I decided to work on my own search engine to test out different features, search algorithms, and front ends in order to improve my (and hopefully others') searching experience.

Ichido includes a tagging system that provides more info on search results. For example, if a site links to Google services or uses Cloudflare, a tag is shown with the search result that lets the user know about that site's use of those services. Ichido also includes links to RSS feeds in search results, making it much easier to find RSS feeds.
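The idea, roughly: scan each crawled page's HTML and response headers for known signals and map them to tags. A simplified sketch, not the actual implementation (the pattern list and function name are illustrative):

```python
import re

# Hypothetical signal patterns; the real tagger's rules aren't public.
TAG_PATTERNS = {
    "Google services": re.compile(
        r"googletagmanager\.com|google-analytics\.com|fonts\.googleapis\.com"),
    "WEBP images": re.compile(r'<img[^>]+src="[^"]+\.webp"', re.IGNORECASE),
}

def tag_page(html: str, headers: dict) -> set:
    """Return the set of tags detected in a page's HTML and headers."""
    hdrs = {k.lower(): v for k, v in headers.items()}
    tags = {name for name, pat in TAG_PATTERNS.items() if pat.search(html)}
    # Cloudflare fronting is visible in response headers.
    if hdrs.get("server", "").lower() == "cloudflare" or "cf-ray" in hdrs:
        tags.add("Cloudflare")
    return tags
```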

This search engine is free to use, but if you like the service and want to support continued development, please consider making a donation (Ichido currently supports donations through Liberapay).




I think the tags could be grouped, like "Extreme trackers", "Moderate trackers", etc., and clicking on them expands the full list.

Also, one really useful tag would be "Affiliate links", if there is a way to identify that a page contains affiliate links (Amazon affiliate, etc.). Those pages are almost always crap.

Also a tag for "Modal popups"; those are too often just marketing-related websites, and I definitely want to skip them if I know prior to visiting.


I run this search engine comparison tool:

https://www.gnod.com/search/

Just added Ichido.

Click on "more engines" to activate it.


I made a post with a bunch of suggestions from my list and then my browser extension that limits my time on HN lost my whole comment including all the little explanations I had for each one. So here's my raw list instead haha

  meta
    https://www.gnod.com/search/
    https://github.com/searx/searx
  categories
    independent
      https://www.crawlson.com/
      https://search.marginalia.nu/
      https://wiby.me/
      https://searchmysite.net/
    international
      https://bonzamate.com.au/ australia
      https://www.baidu.com/ china
      https://yandex.com/ russia
    code
      https://searchcode.com/
      https://codesearch.ai/
      http://symbolhound.com/
      https://publicwww.com/
      https://search.feep.dev/
      http://codesearch.debian.net/
      https://codesearch.isocpp.org/
      https://www.programcreek.com/python/
      https://livegrep.com/search/linux
      https://grep.app/
    ai
      https://consensus.app/ scientific consensus
      https://github.com/jokenox/Goopt procedurally generated
      https://same.energy/ image similarity
    products
      https://www.looria.com/
      https://knifist.com/ knives
      https://attic.city/ home and fashion from indie stores
    topical
      https://biztoc.com/search business news
    premium
      https://kagi.com/
    other
      https://metager.org/ privacy centric engine that combines results of several engines
      https://thangs.com/ 3d models
      https://filmot.com/ youtube subtitles
  lists
    https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/
    https://web.archive.org/web/20200710091019/http://www.jaruzel.com/textfiles/Old%20Web%20Info/Internet%20Search%20Engines%20v2.61.txt
Hope it's useful still


Can you add https://biztoc.com/search ? (Real-time business/finance News) Zero Tracking/Cookies.


Nice! Please consider adding https://mwmbl.org

Thanks!


You could add other AI search such as perplexity.ai and phind.com


>Just added Ichido.

Thanks, much appreciated


Search engines will do literally anything except the option "never show results from this domain again"

Is there something obvious I'm missing that makes it infeasible, or maybe is it just something only I want?

As for this site there's too many tags for them to be useful imo. Give it 2 weeks of using the search engine and I bet you could hide silly fake tags in there and I'd never notice. Lots of tags = no tags.

I was picturing maybe a little pillbox type thing you might find appended to Google search results.

For instance when a result is a PDF: https://img.imgy.org/-7lq.jpg


Ability to block/boost domains as seen in the following link:

https://blog.kagi.com/kagi-features

I only know about it because it pops up often on hn. Haven't tried it because at this point I don't want to pay $10 per month for search.


Eventually someone will block enough big domains to render the search unusable, forget they actually did that, and there goes a user.


Including something like "7 results from sources you blocked are hidden. Click to show" would solve this nicely.
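Sketching that out: a per-user blocklist applied at render time, with a count of how many results were hidden (names and data shapes are illustrative, not any engine's actual code):

```python
from urllib.parse import urlparse

def filter_results(results, blocked_domains):
    """Split results into (shown, hidden_count) using a per-user
    domain blocklist. `results` is a list of dicts with a "url" key."""
    blocked = {d.lower() for d in blocked_domains}
    shown, hidden = [], 0
    for r in results:
        host = (urlparse(r["url"]).hostname or "").lower()
        # Match the blocked domain itself and any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in blocked):
            hidden += 1
        else:
            shown.append(r)
    return shown, hidden
```

The UI can then render "N results from sources you blocked are hidden. Click to show."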


yea but you can just apply the filter on a per user basis. a literal bootcamp grad could write this


We're actually going to do that!!


Yes, there are search engines that let you do that


Such as..? I've seen Kagi as linked in a sibling comment which I'll give a go.


Yes, I'm using Kagi and it has that feature.


I would prefer more logical tags like “top 1k”, “aggregator”, “user-generated content” than technical like “utm” and “obfuscated scripts”. Also, I would prefer tags grouped together into expandable lists and not shown all by default. Every site uses javascript, I don’t want to see it over and over again unless specifically queried for that.


SearX is also an interesting search proxy. Written in Python, it supports many backend engines and can be self-hosted.

And here's a lightweight frontend/proxy I wrote in C for using Google search on low-end phones that can't render bloated HTML (SearX was too complicated to install):

http://searc.4a.si:7327/search?q=news

It's also nice that the structured, essentially never-changing HTML it produces makes it ideal for programmatically querying Google. Although you still run into captchas, which it cannot solve, if queries get too suspicious.



This looks great, I am really glad to see things making it more obvious how pervasive malicious Google scripts are.

I find the webp flag interesting, as I don't think webp itself is inherently harmful, except for being an image spec that exists solely because Google NIHs everything and wants to write their own everything. (Long live JPEG-XL!)

I'm curious why you chose to tag it explicitly though.


I love that tag, as (to me) it indicates a site is trying to be bandwidth efficient instead of just defaulting to JPEG.

JXL is pretty much dead thanks to Google... and avif is still mostly suited to thumbnails.


In your about page, I see you are using Bing's API. I didn't even know Bing has a search API that everyone can use!

How much do you have to pay them for this?


It's $4/1000 queries, but the rate is increasing in May to $18/1000 queries. The Bing API is available through Azure.


> $4/1000

Is it just me or does this seem insanely expensive already?


Yeah for my search traffic alone that feels like it'd be a $4/mo service


I hope the author knows this, and won't be surprised by a bill more than 4x larger.


That comment was written by the author.


Oh! I am scatterbrained.


Quite a price hike. Bad for the sustainability of the other portals I use that are backed by the Bing index; they'll either have to increase their pricing or their monetization efforts.

Is it easy for you to switch to other search index providers? What are your options?


>Is it easy for you to switch to other search index providers? What are your options?

I have a few options:

* Switch index providers. For example, Mojeek has an index of 6bn pages and a web search API; it may be more sustainable to switch to their index in the long run.

* Build my own index. This is my preferred option and I've already started to work on this.

* Look for funding sources to offset the price hike.


own index: the wiby.me DIY writeup is fantastic. Each person will do this differently based on knowledge and experience, but the step by step guide was educational.

Marginalia put out news yesterday of getting an NLnet grant, with one of the sub-goals being "to produce and offer portable data in order to bolster adjacent efforts in the search and discovery space"; maybe talk?

As an end user I benefit from the many licensees of the Bing index; different interfaces and ideas can compete with less effort required.


Pretty sure that only Google and Microsoft have the money and resources to crawl the entire internet. Or perhaps the only that can AND are willing to.

Correct me if I’m wrong though, but I’m pretty certain that all other search engines in the same category use one of these as their backend. Eg I’m pretty certain that counts for duckduckgo as well.


> Pretty sure that only Google and Microsoft have the money and resources to crawl the entire internet. Or perhaps the only that can AND are willing to.

Money and resources and a dominant-enough position so that your crawlers are not blocked by websites.

Unfortunately.


There's definitely gatekeeping on websites but having done a bunch of crawler work I can say you'd be surprised how rarely a site will outright block you if you just do the right things: have an identifiable user agent with a working URL that explains what your crawler does, respect robots.txt, implement polite crawling. As for actually being able to crawl the whole thing though, yeah it's stupidly expensive :-/


Brave Search has its own independent index too - https://brave.com/brave-search-beta/


I think Yandex have their own index and some others too like Marginalia, but the latter couldn't be called "in the same category" as the other three.


Yandex also suffered a security breach and their source code is available[0] although utilizing it in any way is ethically and legally dubious (at best).

[0] - https://arstechnica.com/information-technology/2023/01/massi...


What about Common Crawl?

https://commoncrawl.org/


Mwmbl has its own index but it's orders of magnitude smaller than commercial search engines.


Thank you! I think any competition is welcome for search engines, with Google going down the monetization path.

A piece of feedback: When I select "Remove top ...." and click Submit, then click Next, the popularity filter is gone.

Edit: looks like the file type filter is dropped as well. Do add the arguments to the pagination links.


>Edit: looks like the file type filter is dropped as well. Do add the arguments to the pagination links.

Thank you, great feedback! You're right, I forgot to include some of the params in the pagination, will have to include those in the next update.
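For what it's worth, one common fix is to rebuild every pagination link from the full set of active query parameters rather than just the page number; a sketch using only the standard library (parameter names are illustrative):

```python
from urllib.parse import urlencode

def page_link(base_path: str, params: dict, page: int) -> str:
    """Build a pagination URL that carries every active filter
    (popularity, file type, ...) along with the new page number."""
    q = {k: v for k, v in params.items() if k != "page"}
    q["page"] = page
    return f"{base_path}?{urlencode(q)}"
```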


The pagination keeps increasing past the point where Bing will provide any more results. Testing a popular search term, for which there are no doubt millions of results, it was only possible to get new results up to page 45. Yet the website keeps incrementing the page number and result numbers as if new results are being returned.

Then I tried the same search with popularity set to 500000 and couldn't even get a single full page of 10 results. It's laughable to conclude from this "search" that only, say, 500004 of the millions of websites in existence include this term. Not that I want to browse a full list, but at least I want to know how many hits I got; then I can add more terms and try to reduce that number.


What would be the issue with being hosted on CF? I believe it's a better option than the rest of the shared hosting industry. If nothing critical, what's the intention of tagging?


http://crimeflare.eu.org

CloudFlare is a MitMaaS. Traffic is seen by them because they are in control of the HTTPS certificates, and you have to take them at their word that they do not log content (and even if they're not lying/under a gag order, just metadata is enough for a lot of evil things).


So are AWS and Azure also MitMaaS?

If yes, what's the endgame? Everyone goes back to managing their own servers?

If no, why is Cloudflare the only hosting provider that gets singled out?


Good point. Yes, they are, as are all cloud services. But with AWS and Azure it's more explicit, in that the server is running on their machines.

With Cloudflare, you may not realize this if you don't know how HTTPS works.

The end game is not necessarily to avoid all use of cloud services, but to be aware of their functioning and their trade-offs, and avoid them where third party spying is crucial to prevent.


Again, it's a better option than the rest of the shared hosting industry, where a company of 50 or 500 can lurk in user space and take a courtesy look at what's going on. I see CF promoting Zero Trust services and I believe they use them themselves first-hand. Any digital information is always prone to compromise, whether it remains in my pocket or in a bunker at Area 51...


Wouldn't this criticism apply to all content delivery networks? They have to terminate TLS in order to know which content to deliver.


Are you getting confused with a VPN? Those words in that order are a bad thing for a VPN, not so much a CDN.

These are all reasons I use Cloudflare lmao. Yes I need them to decrypt the traffic because they do various rules and caching for me. That DDoS protection would be pretty naff if they couldn't see the traffic! In one case I really wish they did log, I had to write my own Worker to log the info I needed.

If we were talking outbound proxy then fair enough but it's not like Cloudflare have strongarmed me into using them.. it was me that updated the NS records!

A lot of the list from that site just seems to describe what Cloudflare does; it doesn't say why each thing is actually a bad thing.

Really does feel like someone's got a hate-on for Cloudflare and tried to crowbar in as many VPN criticisms as they could, without understanding the difference between a VPN/proxy and a CDN.


I am equally repelled by VPNs[1] and CDNs[2]. I probably should have started with my own articles.

1 - https://danuker.go.ro/dont-use-vpns.html

2 - https://danuker.go.ro/how-to-protect-your-personal-data.html


Even your own article (VPN hate is a given, of course, moving on) doesn't really go into the why of the reasoning.

They might get hacked? So might my site! In fact if you were a dedicated attacker I'd advise you go after my hosting rather than try to crack Cloudflare, it'll be much quicker. Probably be even quicker to just come at me irl and start snapping fingers until I logged you in as an admin.

They might sell data or censor my site? I doubt they would but if they did start doing so it takes like 20 seconds to update the nameservers. I'm not being held hostage, promise!

The arguments just don't work when it's my site and I've consented to using Cloudflare.

You're welcome to block Cloudflare's networks on your side if you don't want to use them, of course, but I want all of the things they do in between you and my site. That's why I am using them.


> it takes like 20 seconds to update the nameservers

Users are being denied your site without your knowledge. People using nonstandard browsers (such as for accessibility or security reasons), Tor/I2P for privacy, scrapers, hobbyists and so on. Maybe the content is replaced in certain cases (like governments targeting certain individuals).

> I'm not being held hostage, promise!

If you can easily turn off the CDN and still serve your traffic with reasonable performance, then you are not. Otherwise, you are.


With my knowledge. Quit assuming people are naive haha. I'm aware that some people may get bounced. Chances are it was me that set that setting that bounced them.

If you got bounced from a Cloudflare fronted site it was more likely the site admin wanted to bounce traffic like you than Cloudflare acting nefarious. Have a browse through the settings some time, see what's available.

I'm pretty sure it's a single toggle in Cloudflare to enable my site on a tor domain too. Can't find the setting right now (on lunch, not used to mobile dash) but I've seen it somewhere.

> If you can easily turn off the CDN and still serve your traffic with reasonable performance, then you are not

Yeah that's easy enough, no worries. The bits I use Cloudflare for aren't strictly site performance related (cache close to the users is handy, of course) - I can make a webserver purr with the best of them.

I can't afford to deploy servers globally and anycast them right now. Not to mention the administration burden. I could, sure, with the right budget. I don't have that budget.

Might be a little pain moving redirects somewhere else but nothing I'd need to cancel plans for.


CF will force you through a reCAPTCHA if you try to remain anonymous


Is that universally true, or just when domains explicitly opt into specific traffic screening measures? Asking as I'm thinking of moving some stuff to Cloudflare Pages.


I think some amount of captchas are enabled by default, but you can certainly turn them off entirely for your domain if you prefer.

Note that users can also use Privacy Pass to avoid captchas while remaining anonymous.

(Disclosure: I am an engineer at Cloudflare, though I don't work specifically on anything related to captchas.)


I see you offer an opensearch.xml already - if you embed it as link node with the appropriate type it will be straightforward to add it to the browser as (default) search engine: https://developer.mozilla.org/en-US/docs/Web/OpenSearch#auto...

also: happy to give this a try, more knobs for power users
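For reference, the autodiscovery markup MDN describes is a single link element in the page's head (the href path here is a guess at where the file lives):

```html
<link rel="search"
      type="application/opensearchdescription+xml"
      title="Ichido"
      href="/opensearch.xml">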


> I see you offer an opensearch.xml already - if you embed it as link node with the appropriate type it will be straightforward to add it to the browser as (default) search engine

Thanks for the heads up, I used to have a <link rel="search"> to the opensearch in a prior iteration of the site, must have removed it by mistake. Will add in the link in the next release.


This is really cool! Please consider joining forces with us at mwmbl.org, would love to incorporate some of these ideas.


Nice project! However, when trying to search for my site (https://spacehey.com), it shows multiple tags, with most of them being false (Cloudflare, UTM Tracking, WEBP Images). I used Cloudflare at one point in the past, but don't anymore. Additionally, there has never been UTM tracking or anything like that nor WEBP images... Where do you get such data from?

Apart from that, awesome project!


Since spacehey includes user-submitted content, it's possible that:

* Someone uploaded a WEBP image to the site.

* Someone pasted a link with a utm_* param.

* The page was crawled when cloudflare was used.

Will look into it and see if I can find the pages that generated the tags. Search results are generally tagged by domain name (necessary since not all pages can be crawled, and even if the page the user connects to doesn't have, for example google trackers, a user would likely want to know if the site is using trackers elsewhere).

Also love the spacehey project, really captures the feel of Myspace!


EDIT: I found some of the pages with links that include UTM tracking params. Let me know if you want me to send you the pages with those links, can send them through email (my email is on the contact page of the site).
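Checking a page for such links is simple; a sketch of the kind of utm_* check involved (illustrative, not Ichido's actual code):

```python
import re
from urllib.parse import urlparse, parse_qs

HREF_RE = re.compile(r'href="([^"]+)"')

def utm_links(html: str) -> list:
    """Return every href in the page whose query string carries a
    utm_* tracking parameter."""
    hits = []
    for url in HREF_RE.findall(html):
        qs = parse_qs(urlparse(url).query)
        if any(k.startswith("utm_") for k in qs):
            hits.append(url)
    return hits
```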


Oh, I see - that makes perfect sense! Thank you for the clarification!

Glad you like SpaceHey :)

Keep up the great work!


what's wrong with webp?


> what's wrong with webp?

Nothing wrong with the format in particular. However some may prefer formats such as PNG and JPEG since:

* A lot more software supports PNG and JPEG (backwards compatibility, better integration with one's existing system and tools).

* With optimization, you can often get the same file size, visual quality, and performance from PNG and JPEG as you can from WEBP.


Interesting concept, but I find this one promotes going backwards. I understand people like rotary phones.


What's the use case for this? If I don't want Google scripts, I block them. I'll use a user agent that doesn't download or run them. If I don't want cookies, I'll instruct my browser not to save cookies. What situation would I be in where knowing whether a site uses these things is a search result I want to visit?


I find the extra information useful, as I don't have to visit the site to find out.


But what does it matter? If you're blocking it anyway, what difference does it make whether the site has it or not? I genuinely don't know why knowing this in advance is helpful and want to know what I'm missing


Brave Goggles also does something similar, allowing you to filter search results the way you want.


Too many tags, and if a site has something, like scripts, why do you say "may"?

If a site has scripts then it's not "This site may be using Javascript", it's for sure that the site uses it...?

And the popularity filter doesn't work: the results are empty, and if you try going to any of the other pages it removes the filter.



