Using Vector Embeddings to Overengineer 404 pages (aimd.app)
5 points by adaboese on Jan 17, 2024 | 7 comments


> In practice, this type of hint will be most useful for pages that were removed or renamed, e.g. I have accidentally introduced numerous 404s on this site by changing the dates of the posts.

Your blog or website software should really be setting up those redirects for page moves automatically.

> You might be wondering why not just use Levenshtein's distance to find the page that the user was looking for. The reason is that Levenshtein's distance is only going to be useful in case of typos. Whereas vector embeddings will go further and find the pages that are the most semantically similar

Which is not at all necessarily what they wanted, compared to an honest 404! This would be particularly bad given that you fail to set up redirects for ordinary page moves/renames: this means that I could link to your post on 404 pages, you move it, the embedding lookup fails, and it silently sends all future visitors to a different post (e.g. maybe you write a second post on 404 pages; after all, there's now a roughly 50-50 chance of which one would be ever so slightly more similar embedding-wise). This further means that anyone who updates their links to follow redirects will unwittingly bake the lie into their page, turning what used to be a working link or an honest error into a false live link.

In general, I would suggest that you set up correct redirects to obviate most of it, use embedding search to curate possible redirects, and only present a list of suggestions on a genuine 404 error.
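A minimal sketch of that ordering, assuming a hand-maintained redirect table and some suggestion function (embedding search, Levenshtein, or anything else). The names `resolve`, `live_pages`, `redirects`, and `suggest` are hypothetical and not taken from any particular blog software:

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


def resolve(
    path: str,
    live_pages: Set[str],
    redirects: Dict[str, str],
    suggest: Callable[[str], List[str]],
) -> Tuple[int, Optional[str], List[str]]:
    """Return (status, location, suggestions) for a requested path."""
    if path in live_pages:
        # The page exists: serve it.
        return 200, path, []
    if path in redirects:
        # A known, curated move: an honest permanent redirect.
        return 301, redirects[path], []
    # Genuinely unknown: stay a 404 and only *offer* candidates.
    return 404, None, suggest(path)
```

The point of the shape is that a guess never becomes a redirect on its own; it only shows up as a suggestion on an honest 404 until someone promotes it into the redirect table.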

(Incidentally, if you have a good list of redirects, Levenshtein search on the list itself with a new error will often correctly generate a redirect. So you don't even need embeddings/retrieval to fix a lot of regular 404s. I'd say the majority of my remaining 404s are handled by my Levenshtein search script.)
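To illustrate the idea, here is a rough sketch of a Levenshtein lookup against an existing redirect list. It is not the script mentioned above; the `REDIRECTS` table, the function names, and the `max_distance` cutoff of 2 are all made-up assumptions for the example:

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row DP."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest_redirect(missing_path: str,
                     redirects: Dict[str, str],
                     max_distance: int = 2) -> Optional[str]:
    """Return the target of the closest known redirect source, or None."""
    best_source, best_distance = None, max_distance + 1
    for source in redirects:
        distance = levenshtein(missing_path, source)
        if distance < best_distance:
            best_source, best_distance = source, distance
    return redirects[best_source] if best_source is not None else None


# Hypothetical redirect table: old path -> current path.
REDIRECTS = {
    "/blog/404-pages": "/blog/2024-01-17-404-pages",
    "/docs/about": "/about",
}
```

With this table, `suggest_redirect("/blog/404-page", REDIRECTS)` finds the first entry (one edit away), while a path that is far from every known source returns `None` and stays a plain 404.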


I agree with you. If the exact page was deleted, it may not be the best experience for the user to just redirect them to the closest match. However, you could solve this in the UI by telling them that they were redirected because the original page was not found. Showing just a 404 page is not a great experience either. You could also adjust this behavior based on the score of the match; anything that's 90% match should be good enough.

For what it is worth, I opted to show a 404 page and suggest the URL they were most likely looking for, e.g. https://aimd.app/blog/2023-12-27-programmatic-seo-what-is-it...


> anything that's 90% match should be good enough.

There's no such thing as a '90% match' when you are doing nearest-neighbor lookup. Percentages of what? There's just distance in a high-dimensional space, which is itself dependent on dimensionality and pretty arbitrary - different embeddings will have totally different 'units'.
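A toy illustration of why a fixed cutoff does not transfer between embeddings. The vectors are invented for the example (not output from any real model), and `shift` is a hypothetical helper that mimics an embedding space where every vector shares a large common component:

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def shift(v: List[float], offset: float) -> List[float]:
    """Append a constant component shared by every vector in the space."""
    return v + [offset]


# Two toy "documents" that are orthogonal, i.e. unrelated, in space A.
page_a = [1.0, 0.0]
page_b = [0.0, 1.0]

score_in_space_a = cosine_similarity(page_a, page_b)
# Same two documents in a space where all vectors share a big offset.
score_in_space_b = cosine_similarity(shift(page_a, 5.0), shift(page_b, 5.0))
```

In the first space the pair scores 0; in the second the same pair scores 25/26, comfortably above 0.9, even though nothing about the documents changed. A threshold only means something relative to one specific embedding and the distribution of scores it produces on your own pages.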

> You could solve this in the UI by telling them that they were redirected because the original page was not found.

That doesn't deal with issues like archiving or updating links to follow redirects which turn out to be spurious because you thought '90%' was a meaningful number.


Off-topic, not familiar with the website, but the design is very cool!


> Your blog or website software should really be setting up those redirects for page moves automatically.

There is theory and there is practice. Keeping URLs tidy is a simple task for a small site, but as the application/blog grows in scope, it becomes almost an impossible/full-time job. Anyone who has managed a platform with thousands of pages (I've had the luck to do that) will tell you that 404s are not avoidable. Feature flags, geofencing, and many more edge cases will inevitably lead to users experiencing 404s. This tactic is more about mitigating the impact of those edge cases.


> (I've had the luck to do that)

I have 20,000+ files/pages on gwern.net, where I have been trying to fix 404s over the 13 years or so that I've been building it, so I am well aware of the difficulty in maintaining links (particularly across very large changes in the website structure), and the perverse ingenuity of bots, scanners, and real users in coming up with brand-new 404s.

> it becomes almost an impossible/full-time job.

That it cannot be solved perfectly in full generality is true, but nevertheless, that is not an excuse to not set a redirect for the very simple and common case of 'mv foo.html bar.html', and all website software should support that.
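For the simple rename case, the bookkeeping can be quite small. A sketch, assuming redirects live in a plain old-path-to-new-path mapping; `record_move` is a hypothetical helper, not a feature of any specific site generator:

```python
from typing import Dict


def record_move(redirects: Dict[str, str], old_path: str, new_path: str) -> Dict[str, str]:
    """Record that old_path moved to new_path, keeping the table chain-free."""
    # Repoint earlier redirects that targeted old_path, so a visitor
    # never has to hop through a chain like foo -> bar -> baz.
    for source, target in list(redirects.items()):
        if target == old_path:
            redirects[source] = new_path
    redirects[old_path] = new_path
    # new_path is a live page now, so it must not redirect anywhere itself.
    redirects.pop(new_path, None)
    return redirects
```

Calling it once per rename (e.g. from whatever script performs the `mv`) keeps every old URL pointing straight at the current location, including when a page is later moved back to a previous name.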


All good points. I don't think we are disagreeing on anything; it seems I am just more adventurous when it comes to pioneering solutions and learning by doing. Will continue to learn from this experiment.



