Why don't we have personalized search engines?
44 points by enether on Aug 25, 2024 | 56 comments
- Search as it is today sucks

- Google is an ad-engine, not a search engine

- SEO is gamed all the time

The end result is search results that aren't that valuable.

Why isn't there a tool that allows me to:

- search good content I've read

- search curated (from other people I trust) content

- search books and other paid material I have bought

- search my notes (that are scattered throughout 5 apps)

All in one?



I’ve often wished I could publish a graph of myself, with 10-20+ items of interest, and let search engines and content recommendation engines (and good ad networks) bring me stuff I actually care about.

(especially calendar events, which used to be fun to track but everyone seems to have given up on event listings).

It wouldn’t have to track me, or infer other nefarious dimensions of my online habits, just target the things I’m asking to be targeted for.

I’m guessing that the implicit data collection of current tech is aggregating so much additional data about everyone that the recommendations we end up getting aren’t that great.

None of Google, Netflix, or Amazon get me at all, and I keep shoveling my habits right into their gaping data maw.


Even Instagram/Youtube's recommendation algorithms, which I find decent, tend to have a lot of recency bias.

This is probably not a bug but rather an optimization to ensure I spend more time on screen. (chances are higher I like the stuff I most recently watched)

Which goes toward the skewed incentives problem - what they want to happen is not what I want to happen.

I find value in being reminded of old content I've been interested in.


People probably won't like this, but I want to point out that Recall from Microsoft tries to do a little bit of this. Apparently, the specific implementation of that product is a spectacular privacy disaster. Which actually may not be an accident -- it is probably not simple to handle privacy well for a personalized search engine (even though Microsoft made a lot of obvious mistakes), and you probably want to ensure that the data you aggregate don't end up being sold off by some third party. Still, you need to build a viable business. That's hard.


As soon as I read the OP, I cmd-f'd for Recall. 'Why has no one..' Because people freak out about privacy violations tying too many disparate aspects of their lives together, that's why.


On my TODO list is to build a system that downloads the text content of all the sites I visit and dump it in a vector DB. Then make my own search engine using RAG.

I did write a script that does the downloading part. It looks at my browser history and downloads the text of every site going back years.

Ditto for decades worth of email. I want to see if I ask it for my nephew's birthday, will it figure it out?

Should be doable without much difficulty.
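A sketch of the searchable-store half of this idea. It swaps the vector DB + RAG for SQLite's built-in FTS5 full-text index (enough to prototype with), and the hardcoded rows stand in for whatever the history-downloader script collects:

```python
import sqlite3

# Minimal "personal search engine" store: a full-text index over page text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")

# In the real pipeline these rows would come from the browser-history
# downloader (and the email dump); hardcoded here for illustration.
pages = [
    ("https://example.com/raft", "Raft consensus", "leader election and log replication"),
    ("https://example.com/cake", "Birthday cake", "my nephew's birthday is in June"),
]
conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", pages)

def search(query: str) -> list[str]:
    """Return URLs of matching pages, best match first (FTS5's bm25 rank)."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank", (query,)
    )
    return [r[0] for r in rows]

print(search("birthday"))  # the cake page surfaces
```

The nephew's-birthday question is then a retrieval hit plus an LLM reading the retrieved text, but plain full-text search already gets you surprisingly far.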


I am already doing this with my app; hopefully I can release it soon :) There are, however, several players in this field, although most of them have their own goals.


This is the future of AI. Apple is getting close with their attempt (Small language models for specific contexts).

It'll take a few more generations to get there.


Agreed it's the future.

It's a pretty obvious move to take all the personalized data you have in an ecosystem and slap an AI on top to start answering questions.

I tried something similar with Google's Gemini. I have a gmail account I use exclusively for newsletter subscriptions. I found out I could ask Gemini questions about the content there.

It was atrociously bad.


I’ve been paying for Kagi search engine (a thing I never thought I’d pay for) for many months and it has a lot of what you’re asking for.


I pay for Kagi Ultimate, and it does almost none of what OP asked for. The only thing you could say it does is searching curated content from other people they trust - that is fulfilled by the custom lens feature, since they can be shared.


Same here. Paying for Kagi, and I don't ever use Google for search now. No ads, no SEO shilling, I can curate my own search results, and it's fast.

That's a winner for me.


another paying (happy) Kagi customer here...


I guess because something needs to track everybody to get good at doing that, and when that something manages to track everybody, the way to make profit is to sell ads.

It's not that Google doesn't know technically how to give good results. It's really that Google is optimizing for profit, not for quality, in a system that makes it extremely difficult for anyone to compete (and whoever succeeded at competing would presumably end up in the same situation and optimize for profit).


> It's not that Google doesn't know technically how to give good results.

Can you explain this sentiment? I’m a Googler and I believe the incentives bias very heavily towards offering the best organic search. When the user goes elsewhere (and quality competition exists) Google loses everything…


Can you expand more on the incentives?

I, and perhaps the other people on the thread, distrust Google actually trying hard to give you the best organic search.

The monetary incentives are simply too large to circumvent imo

If they really are, then perhaps the problem is that there's so much attention and competition to game the search engine, that it's an impossible-to-beat cat and mouse game. Due to their success, they're basically guaranteed to constantly have "parasites" trying to game the system to their advantage. (cf. the SEO industry and companies like ahrefs)


First, I did not say that individuals at Google are being malicious. I am saying that Google as an entity is a profit maximization machine.

Look at all the antitrust cases against Google. That's not the result of systematically doing what's best for the users.

> When the user goes elsewhere (and quality competition exists) Google loses everything…

Which is not an incentive for giving better results. Just for locking the users in.


> Google as an entity is a profit maximization machine

Google is still an ad company that dabbles in services and software mostly to sell more ads.

People keep forgetting it for some reason.


Neeva tried to do something like this. While it wasn’t everything you mention here, there was a feature to login to various online accounts so it would search across the web, but also your data and documents in those various accounts.

I never connected my other accounts, as I found the idea of a 3rd party having access to crawl and catalog them uncomfortable.

Neeva has since shutdown and was acquired by Snowflake.

What you’re mentioning would likely require a company to have a very large monopoly for a very long time, where all of a person’s digital media is controlled by one company. Google is close, but for books people paid for, that’s something that falls more into Amazon’s territory. Apple also has a bookstore, so maybe it would work for people who are 100% in Apple’s ecosystem and never stray, and who only befriend people also in Apple’s ecosystem (for the people-you-trust feature).

I don’t think we’ll ever see enough benevolent cooperation between companies, without ulterior motives, to do something like this well without it also being a security nightmare.


Agreed. I also don't see them having much incentive to do this.

It's pretty hard to get right, so unless they truly believe they can offer a superior alternative to Google (or any other search engine), they're unlikely to pursue it.

Plus, most users are likely not power users enough to truly benefit from something as rich/complex as this, so the reward for a behemoth pursuing it likely won't be worth the complexity and effort.


This is why I have one giant, enormous text file of all the technical notes I have ever taken.


I don't want a 'personalised' search engine because such a thing promotes tunnel vision. What I would like is a search engine which offers a 'spam filter' with a few different settings:

- no_SEO: demote anything which employs 'SEO' so it appears below search results not guilty of this sin

- no_Blogspam: demote blogspam below the original articles the bloggers refer to

- no_Sales: demote anything which tries to sell me something below results which do not. This is a tricky one to implement because not every site offering to sell something should be caught, e.g. a site explaining how to repair a flux capacitor which links to a source for these ubiquitous parts but mostly contains instructions how to install and tune the part is fine.

- no_GPT: demote anything recognised as being generated through 'AI'.

- $filter: an option to create custom filters

Depending on the reason for searching the 'net I'd have most of these options enabled, but every now and then I'd switch one off, e.g. no_SEO/no_Sales when looking for something to buy.

I'm running an instance of SearxNG and hardly ever interface directly with individual search engines so I mostly avoid the 'personalisation' problem but I do not yet have access to filtering options like the ones I mentioned.
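Filters like these could sit as a thin re-ranking layer over any engine's results. A sketch, where the classifier flags on each result (SEO, Blogspam, GPT, ...) are assumed to come from some upstream detector; only the demotion logic is shown:

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    url: str
    score: float                              # base relevance from the engine
    flags: set = field(default_factory=set)   # e.g. {"SEO", "GPT"} from a detector

def rerank(results, enabled_filters):
    """Demote (not hide) any result carrying an enabled filter's flag."""
    def key(r):
        penalized = bool(r.flags & enabled_filters)
        return (penalized, -r.score)  # clean results first, then by relevance
    return sorted(results, key=key)

results = [
    Result("https://blogspam.example", 0.9, {"Blogspam"}),
    Result("https://original.example", 0.7),
]
# With no_Blogspam on, the original article outranks the higher-scored blogspam.
print([r.url for r in rerank(results, {"Blogspam"})])
```

Toggling a filter off is just leaving it out of `enabled_filters`; the hard part, as noted above, is the classifiers themselves.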


I've tried YaCy, but the crawling community is way too small to make it usable, so for now there aren't many options. For maps the situation is better with OSM, but it's still far from being usable like Google Maps for navigation or even mere exploration: the coverage here and there is even MUCH superior, but in many other places it's next to void.

To be able to avoid commercial search engines we have only one option: public funding for public universities that maintain national infrastructure (something that already exists, but bigger) and a public indexing project, with a national plan for a home server in every connected home (much like current ISP 'routers', only pure FLOSS managed by the user, or at least running public code), which among its other functions also indexes a small part of the web in an open project like YaCy. Same thing for VoIP comms.

WE DAMN NEED institutionalized FLOSS.


What's insane is that pagerank, and most other graph centrality algorithms (the heart of modern search engines) have from the beginning supported a "personalization vector" which does EXACTLY this. It's available in all major graph analysis libraries (i.e. https://networkx.org/documentation/stable/reference/algorith...)

This exists, it's here, and no one uses it for anything except serving you better ads.
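For the curious: the personalization vector just changes where PageRank's random jump lands, so the jump favors your declared interests instead of being uniform. A pure-Python sketch of the same idea networkx exposes via its `personalization` kwarg (toy graph, power iteration):

```python
def personalized_pagerank(graph, personalization, alpha=0.85, iters=100):
    """graph: {node: [outgoing neighbors]}, personalization: {node: weight}."""
    nodes = list(graph)
    total = sum(personalization.get(n, 0.0) for n in nodes)
    p = {n: personalization.get(n, 0.0) / total for n in nodes}  # jump vector
    rank = dict(p)
    for _ in range(iters):
        # Random jump lands according to p, not uniformly -- that's the
        # "personalization" part; the link-following part is unchanged.
        new = {n: (1 - alpha) * p[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:  # dangling node: redistribute via the jump vector
                for m in nodes:
                    new[m] += alpha * rank[n] * p[m]
        rank = new
    return rank

g = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
# Biasing the jump toward "c" boosts its rank relative to the uniform case.
biased = personalized_pagerank(g, {"a": 1, "b": 1, "c": 8})
uniform = personalized_pagerank(g, {"a": 1, "b": 1, "c": 1})
```

Seed the vector with pages/domains you've marked as interests and the whole ranking tilts toward your corner of the graph.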


Among power users there would be immediate cries of privacy violation, as was on display with Microsoft's recent debacle with the screenshotting AI thing (the name is escaping me).

And among 95% of normal users there's no demand for it because what most people do is google restaurants, cinemas, dancing videos on TikTok or they just add "reddit" to their search for anything more complicated. Most people haven't bought any reading material on the internet and don't have notes.


Agreed 100%

The only way a screenshotting AI tool could get away with it is if it was open source.

There probably is a successful open source project waiting to be made there!

Somebody else shared https://www.perfectmemory.ai/ which seems intriguing, but I'd be reluctant to ever install it on my main PC.

There are some things it's just prudent not to trust... even if they're truly developed with good intentions


I built something like this using manual screenshotting, OCR, and indexing with Meilisearch, but now there are tools that do this automatically, like perfectmemory.ai. You can definitely build something yourself by gluing together a bunch of open source tools like I did, but if you want something ready-made, it kind of already exists, provided you're willing to trust your operating system or third-party software engineers not to leak your information.


A search engine's first job is to cull the crap. To distinguish the good from the bad.

Thus a personalized search engine could double as a forum moderator.

And you could share search engines. Get a copy of the search engine of somebody that you admire/trust and merge it with your own. Thus your search engine could learn from others what's good and bad.

You could have a family search engine, passed down through the generations.
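Merging could be as simple as a trust-weighted blend, assuming (purely for illustration) that an engine's learned state is a per-domain quality score:

```python
def merge_engines(mine, theirs, trust=0.3):
    """Blend another person's domain scores into yours, weighted by trust."""
    merged = dict(mine)
    for domain, score in theirs.items():
        merged[domain] = (1 - trust) * merged.get(domain, 0.0) + trust * score
    return merged

mine = {"blog.example": 0.9, "spam.example": 0.1}
theirs = {"spam.example": 0.0, "gems.example": 1.0}
merged = merge_engines(mine, theirs)
print(merged)  # their gems get discovered, their distrust nudges yours down
```

A "family engine" is then just repeated merging across generations, with `trust` encoding how much weight each ancestor's judgment gets.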


I was just reading about how you can add a local LLM that you have on your PC to be your default AI chat assistant in Brave browser. Elsewhere on HN today is news of the latest version of LM Studio that can act as your local RAG. I reckon this sort of tech will be built into operating systems soon enough, and in this way personalized search will be enabled.


How good is AI at search?


I think this has been the dream of the digital personal assistant for a long time.

When people started talking about LLMs and AI I was hoping for something that would monitor news and websites and find things that I was interested in. Something that would go beyond just keyword searches and also be able to pull in stories on radio and tv.


The smaller your corpus, the harder it is to find signals to get good results. This is why even corporate intranet search is much worse than Google et al on the public internet. Personal information graphs end up being much more unusual than the average of all information online since there’s much less to average.


Any questions about why we don't have innovation in search can only be attributed to one monopoly.


Photo libraries give you search across your photos and videos. Digital asset management systems provide search across all your documents. OSX and Windows provide terrible search across your local filesystem. I would consider these personalized search engines.


For sure! I meant something more along the lines of one unified engine. Today they're very disjoint. And some awfully terrible as you alluded to :)


This is why I have a personalized search engine... miniflux lets me search all my rss feeds, mail is searchable, logseq is searchable, everything is searchable... and you can combine it with 50 lines of python (with plugin support)
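The 50-lines-of-Python glue is roughly a fan-out over sources. A sketch with stub functions standing in for the real miniflux/mail/logseq queries (the stubs are hypothetical; miniflux does expose a search API, but the calls here are placeholders):

```python
# Each source is just a callable: query string in, tagged hits out.
def search_feeds(q):   # stand-in for a miniflux API call
    return [("feed", "RSS: " + q)]

def search_notes(q):   # stand-in for grepping logseq pages
    return [("note", "Note about " + q)]

SOURCES = [search_feeds, search_notes]  # plugin support = append to this list

def federated_search(query):
    """Query every source and tag each hit with where it came from."""
    hits = []
    for source in SOURCES:
        hits.extend(source(query))
    return hits

print(federated_search("kagi"))
```

Mail, bookmarks, shell history: anything searchable becomes one more entry in `SOURCES`.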


I think what Apple is building into its OSes now, combined with LLMs capable of running on mobile devices and new fine-tuning techniques (that might not be invented yet?), will give rise to exactly this.


20 years ago, Google had a custom search engine capability where you could give it a list of sites to search from. Is that functionality still around?


I don’t think it is, but I’ve seen someone make a browser extension to transparently add query parameters to always exclude some sites. I imagine doing the opposite (for some queries search only this list of sites) is also achievable.
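The opposite direction is just query rewriting with `site:` operators, which most engines support. A sketch of what such an extension would do to the query string:

```python
def restrict_to_sites(query, sites):
    """Rewrite a query so results only come from an allow-list of sites."""
    scope = " OR ".join("site:" + s for s in sites)
    return f"{query} ({scope})"

print(restrict_to_sites("rust async", ["docs.rs", "rust-lang.org"]))
```

Excluding sites is the same trick with `-site:` terms instead.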


Can't Spotlight already do most of that?


Not in any way that's close to good, at least in my experience


The idea is there.

Implementation: beyond horrible except for very simple things.


I always wanted to search for random or uncommon websites, but such links never surface.


It's actually quite a hard problem to solve; even Larry Page acknowledged that about 10 years ago[1], and nobody is even close to solving it, not even Google. My opinion is that LLMs are a good step towards an answer-machine type of search engine. Perhaps the ultimate search engine would be something like Elon Musk's Neuralink[2] brain-computer interface, where a chip implant could read your thoughts, know your feelings, and based on that give you "perfect" results. That would be really personal, I mean personal on another level, directly so. For now all we have is indirect personalisation, where the search engine gathers everything it can about you and assumes what you would want to see.

[1] https://www.youtube.com/watch?v=mArrNRWQEso

[2] https://en.wikipedia.org/wiki/Neuralink


Oof, quite the powerful thought with Neuralink

Thanks for the references


We do, it's just personalized to your cohort.


What about hosting your own instance of SearXNG?


A few years ago I thought about building a personal search engine. The idea was to save the HTML of every site I surfed and search it with a document engine like Apache Lucene. Because text compresses well and isn't big to begin with, a terabyte drive would last a long time and maybe forever. At the time, I thought RPi's would be a good idea because I didn't know any better. Now I might prototype it on an old Thinkpad.

Basically, it smells like a solved problem with open source tools built for the enterprise. I thought then and think now it could be scaled down to a hardware appliance that sits on a home network. But I am probably wrong about all of it. Good luck.


There are Chrome (and Firefox) extensions such as Full Text Tabs Forever [1] that offer some of the functions you describe

[1] https://github.com/iansinnott/full-text-tabs-forever


Thanks. I think I was thinking about the problem in 2016 or so.


What open source tools are you referring to? Do you just mean the search component?

There'd be two hard parts to this problem I reckon:

- gathering the data

If you make it too cumbersome and high-friction, it won't be used. If you make it too easy to dump data, the useful info might get drowned out.

- ensuring search gives you good results

We have open source engines like Lucene that let you search extensively, but what happens when you get 200 results back? How do you know which is the best/most useful one? Most users would likely get exhausted sifting through everything and just default back to Google


A script running curl on my browsing history would collect the html. I’d solve the 200 result problem if and when it was an actual problem in a way that addressed the actual problem. There’s a lot of success before too many results is a problem.

The idea that it might be more friction than it was worth is why I didn't build it. Probably why nobody has built it, and perhaps why you just listed a bunch of imagined problems as reasons not to build it.

I mean it would probably be shit if I built it and I liked my idea better than the idea of the work. That’s most things.

For what it is worth, I would default to Google for the things Google does better and use my personalized historic search when I wanted to see what I had seen before. It's both-and, not either-or.


I’d start from one’s bookmarked websites on Pinboard.


> gathering the data

I've been of the opinion that website content monitoring should be implemented with a browser extension (plus possibly a local agent app)[1]. An extension-based approach would work well and be easy to use IMO.

I've been extremely disappointed by how Chrome in particular likes to forget everything about my browsing history (except for tracking cookies) after three months. I don't see why a link I clicked on a year ago on any given page should turn blue just because computers from 2004 might have performance problems with it.

[1]: Enterprises seem to prefer MITM here instead, but I'd argue it's not truly required, given the overwhelming popularity of agent-based EDR solutions.


Kagi does some of this.


This is a solution in search of a problem


the problem is search sucks and doesn't give you the most valuable info you can find on a given topic

related posts about the suckiness of search:

- https://news.ycombinator.com/item?id=30347719

- https://news.ycombinator.com/item?id=30635720

- https://news.ycombinator.com/item?id=22091944




