I've been working on this kind of thing over the past several years (for a while full time as an attempted entrepreneur, now on the side for the past couple years). The latest iteration is https://yakread.com -- hit "take a look around" and you can see the "home page"/a list of recommendations without signing up. The recommendations are personalized, i.e. if you've signed up, the probability you'll see any particular post depends on your individual interactions with past posts. (It does collaborative filtering with Spark MLlib.) So that may be a bit different from what you had in mind, since your comment sounds more like an unpersonalized system, but with some extra exploration thrown in. However, in practice I suspect the biggest thing the collaborative filtering is doing at Yakread's current scale (not much) is learning which items are good/bad in general.
I also do have some methods baked in for doing exploration. "Epsilon greedy" is a common simple approach where x% of the recommendations are purely random. I do a bit more of a linear thing where I rank all the posts by how many times they've been recommended, then I pick a percentage x between 0 and 100, then I throw out the top x% most popular (previously recommended) items. That also gives you some flexibility to try out different distributions for the x variable.
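Here's a rough sketch of both in Clojure (Yakread's language); the function and key names are made up for illustration, not pulled from the actual code:

    ;; Epsilon greedy: with probability epsilon, recommend a random post
    ;; instead of the top-ranked one.
    (defn epsilon-greedy [ranked-posts epsilon]
      (if (< (rand) epsilon)
        (rand-nth ranked-posts)
        (first ranked-posts)))

    ;; The "throw out the top x%" variant: rank posts by how often they've
    ;; already been recommended, pick a cutoff percentage x, and drop the
    ;; top x% most-recommended posts before recommending from the rest.
    (defn drop-top-x-percent [posts]
      (let [ranked (sort-by :times-recommended > posts)
            x      (rand-int 101)   ; swap in whatever distribution you want here
            n-drop (long (* (count ranked) (/ x 100.0)))]
        (drop n-drop ranked)))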
Thank you so much! "Epsilon greedy" sounds like a great approach for the general idea I had in mind — I've only glanced at it, but I'll read about it more deeply.
I'll definitely try out your product, but I have to say — an "enter your email" box is surprisingly high-friction, and if you weren't a considerate person I'd met on Hacker News I'd probably close the tab when I saw that. I'll try it out and see if there's a particular reason why you need to capture an email address so early on, but I'd bet if you simplified it you'd get more traffic!
Thanks for the feedback. I've structured Yakread (and its predecessors) as a daily email newsletter because it increases user retention tremendously. It's much less work for users if Yakread can show up in a place they already check regularly (their email inbox) rather than asking them to build a habit of visiting a new website right away. The most common approach to this problem for consumer products is to make a mobile app so you can send push notifications; I like email a lot more since it's a bit more decentralized and is/can be less pushy (no pun intended).
But yeah, I wouldn't be opposed to trying out an alternate landing page that shows you article recommendations up front with a signup box somewhere. Could be interesting to see how both approaches perform in an A/B test. Especially if I ever made a concerted effort to get traffic from HN; then structuring the site a bit more like HN would probably be great. Maybe even aggregate comments from bluesky/mastodon? Once I get through the mountain of other TODO items that's been piling up :).
You can do that, it's just slow if there are a lot of results.
Agreed you want to keep data in your main database normalized since it's easier to reason about and avoid bugs/inconsistencies in the data. The inherent trade-off is just that it's more computationally expensive to get the denormalized data.
The idea of materialized views is to get the best of both worlds: your main database stays normalized, and you have a secondary data store (or certain tables/whatever inside your main database, depending on the implementation) that gets automatically precomputed from your normalized data. So you can get fast queries without needing to introduce a bunch of logic for maintaining the denormalized data yourself.
The hard part is how you actually keep those materialized views up to date. E.g. if you're OK with stale data, you can do a daily batch job to update your views. If you want the materialized views to always be up to date, then things get harder; the solution described in the article is one attempt at addressing that problem.
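To make that concrete, here's a minimal sketch of the "stale but simple" flavor, assuming a Postgres materialized view refreshed by a scheduled job and queried from Clojure via next.jdbc (the table and view names are made up):

    (require '[next.jdbc :as jdbc])

    ;; One-time setup: precompute a denormalized view from the normalized tables.
    (defn create-post-stats-view! [ds]
      (jdbc/execute! ds
        ["CREATE MATERIALIZED VIEW post_stats AS
          SELECT p.id, p.title, count(l.id) AS like_count
          FROM posts p LEFT JOIN likes l ON l.post_id = p.id
          GROUP BY p.id, p.title"]))

    ;; Daily batch job: recompute the whole view, accepting that it's a bit
    ;; stale in between runs.
    (defn refresh-post-stats-view! [ds]
      (jdbc/execute! ds ["REFRESH MATERIALIZED VIEW post_stats"]))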
Hey HN. Since this has shown up here, maybe a status update would be interesting? This continues to be my main side project--amusingly it's had more traction than any of the startups I tried to build with it. Over the past year I've been working on some experimental features for Biff that are meant to help with medium-to-large codebases[1] (I've been doing this as I rewrite one of my Biff apps from scratch). There haven't been many code releases in that time, so I've got a decently sized backlog of things I'd really like to get to. E.g. XTDB v2 is almost out of beta; once I finish the app rewrite, that's next on my list.
I've played around with Biff. It's an amazing project and a great way to get started with web development in Clojure. Clojure can be kinda confusing because the community defaults to orthogonal libraries; Biff makes it easy to see which libraries are useful to connect up.
Thank you for doing this. I am just checking out the Biff framework.
One part I would change is the dependence on htmx for HTML generation. I would really prefer an external template file into which we can replace fields.
I might have misunderstood your comment, but I don't think that's what htmx does; it just adds reactivity without needing to write JS. The HTML is represented in the project using Hiccup syntax, which is essentially HTML as Clojure data structures; that makes sense when "code is data" is a big part of the Lisp idea. It is an external template file into which you can replace fields, it's just a Clojure file too.
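For anyone who hasn't seen it, this is roughly what a Hiccup "template" looks like (a made-up example, not code from Biff itself):

    (require '[hiccup2.core :as h])

    ;; Just a Clojure function returning data; the fields get spliced in
    ;; like any other values.
    (defn article-card [{:keys [title author url]}]
      [:div.card
       [:h2 [:a {:href url} title]]
       [:p "by " author]])

    (str (h/html (article-card {:title "Hello" :author "Ada" :url "/p/1"})))
    ;; => "<div class=\"card\"><h2><a href=\"/p/1\">Hello</a></h2><p>by Ada</p></div>"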
AFAIK, all libraries are loosely coupled in Biff. Swapping out Hiccup / Rum for one of the other HTML templating options should be in "userspace"; straightforwardly so, without the framework maintainers' intervention.
The framework checklist[1] makes me think of Fulcro: https://fulcro.fulcrologic.com/. To a first approximation you could think of it like defining a GraphQL query alongside each of your UI components. When you load data for one component (e.g. a top-level page component), it combines its own query with the queries from its child UI components.
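From memory, the composition looks roughly like this (details may be off, so treat it as a sketch rather than polished Fulcro code):

    (ns example.ui
      (:require [com.fulcrologic.fulcro.components :as comp :refer [defsc]]
                [com.fulcrologic.fulcro.dom :as dom]))

    (defsc Person [this {:person/keys [name]}]
      {:query [:person/id :person/name]
       :ident :person/id}
      (dom/li name))

    (def ui-person (comp/factory Person {:keyfn :person/id}))

    ;; The page's query pulls in each child component's query, so loading
    ;; the page loads everything its children need.
    (defsc PeoplePage [this {:page/keys [people]}]
      {:query [{:page/people (comp/get-query Person)}]}
      (dom/ul (map ui-person people)))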
For the interleaving, yes we want to prefer the item that's been skipped fewer times. I got the wording backwards in the article; I'll fix that.
For shuffling, I was trying to come up with an approach that would recommend the top k items roughly the same amount regardless of how many total items are in the list. E.g. say you have 10 subscriptions that you really like--I want to have those be a reasonable portion of your recommendations whether you've subscribed to 100 other subs or 1000 other subs.
Contrast that to a weighted random shuffle where each subscription's weight is its affinity score and we sample them based on weight without regard to their order in the original list. That approach is much more influenced by the size of the total list, and my experience is the handful of subscriptions that I really liked were always drowned out by all the other "speculative" subscriptions I had accumulated in my account.
The computational complexity ends up being OK because we generally don't actually need to shuffle the whole list. I recommend items in batches of 30, so we just need to get that many items and then we can abort the shuffle. There probably is some more efficient way to implement this though.
During implementation I was mostly thinking of this as "sampling" rather than "shuffling" actually, and just ended up describing it as the latter when I wrote the post.
Also--I think the pseudocode you have isn't /quite/ correct. If x is greater than p, we don't immediately take a random element from the list; rather we go to the next element, generate a new x, and repeat. I.e. with p=0.1, there's a 10% chance we immediately take the first item, and if we don't do that, then there's a 10% chance we immediately take the second item, etc. We only pick a completely random item as a fallback if we get to the end of the list without picking anything.
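In Clojure, the sampling I'm describing is roughly this (simplified, not the actual Yakread code):

    ;; Walk down the ranked list; at each element take it with probability p,
    ;; otherwise move on. If we reach the end without picking anything, fall
    ;; back to a uniformly random element.
    (defn pick-one [ranked p]
      (or (some (fn [item] (when (< (rand) p) item)) ranked)
          (rand-nth ranked)))

    ;; Repeat to build a batch (e.g. 30 recommendations), removing each pick
    ;; so we don't recommend the same item twice -- no need to shuffle the
    ;; whole list.
    (defn pick-batch [ranked p n]
      (loop [remaining (vec ranked)
             picked    []]
        (if (or (= (count picked) n) (empty? remaining))
          picked
          (let [item (pick-one remaining p)]
            (recur (vec (remove #{item} remaining))
                   (conj picked item))))))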
Thanks for providing more insight into the development process that went into designing this algorithm! :)
> Contrast that to a weighted random shuffle where each subscription's weight is its affinity score and we sample them based on weight without regard to their order in the original list. That approach is much more influenced by the size of the total list, and my experience is the handful of subscriptions that I really liked were always drowned out by all the other "speculative" subscriptions I had accumulated in my account.
I personally see that as an argument for having the original position affect the weight.
I experimented with modifying Fisher-Yates to generate a slightly less left-skewed distribution, and decided upon inverse transform sampling as a fairly cheap solution -- basically, with some function f: [0,1) -> [0,1), you can transform a uniformly distributed random variable X ~ Unif[0,1) into f(X), which then allows you to bias the shuffle without affecting its asymptotic performance.
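A quick sketch of the technique, with an example transform (not the exact code I used):

    ;; Ordinary Fisher-Yates draws j uniformly from the remaining range; here
    ;; we run the uniform draw through f first. With f(x) = x*x, j tends to
    ;; stay close to i, so the shuffle mostly preserves the original order;
    ;; f(x) = x gives back a plain uniform shuffle.
    (defn biased-shuffle [items f]
      (let [a (object-array items)
            n (alength a)]
        (doseq [i (range (dec n))]
          (let [j   (+ i (long (* (f (rand)) (- n i))))   ; inverse-transform-sampled index
                tmp (aget a i)]
            (aset a i (aget a j))
            (aset a j tmp)))
        (vec a)))

    (biased-shuffle (range 10) (fn [x] (* x x)))
    ;; => mostly in-order, e.g. [0 1 3 2 4 5 7 6 8 9]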
> There probably is some more efficient way to implement this though.
One potential optimization is to use rejection sampling: keep track of the elements you've seen, then reject any duplicates when pulling out a sample (instead of manually constructing a new list every time). Now, this interacts poorly with your biasing -- in the extreme case where p=1, you'd keep trying to select the first element, then rejecting it. I might then suggest tracking the first element you haven't selected yet, but at some point this ends up being a lot of complexity for asymptotic behavior when what you have is more than good enough for the scale that you're considering.
This also only addresses the "copy the list every iteration of the sampling" aspect of your algorithm's performance, not the fact that with p=0 (or arbitrarily low) you may end up walking the entire list.
(Rejection sampling is more generally using some criterion to reject samples, e.g. if you were trying to select points within a circle by checking x^2 + y^2 < r^2.)
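A tiny sketch of that circle example, to make the pattern concrete:

    ;; Sample uniformly from the bounding square and reject anything that
    ;; falls outside the circle of radius r.
    (defn sample-in-circle [r]
      (let [x (- (rand (* 2 r)) r)   ; uniform in [-r, r)
            y (- (rand (* 2 r)) r)]
        (if (< (+ (* x x) (* y y)) (* r r))
          [x y]
          (recur r))))

    (sample-in-circle 1.0)
    ;; => e.g. [0.31 -0.74]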
It is pretty subjective--I design the algorithm with myself in mind first, but I can definitely see a heavier bias toward exploration/variety being better for some others. Something to experiment with.
Fun to see this on the front page! I worked on Yakread full time for about 8 months as an attempted startup, after a few years of other recommender system startup ideas. Now it's a side project that I develop on the weekends after my kids fall asleep, aided by caffeine (me, not the kids). I'm in the middle of open-sourcing/rewriting it. Hopefully will be done in a couple months? Then I can finally get back to adding new features. I talked about some potential ones in my previous post: https://obryant.dev/p/rewriting-yakread/