Hi, OP here. I started working on recommender systems in 2016 during my undergrad, specifically doing music recommendation (I was dissatisfied with the quality of Pandora/Spotify recommendations). I spent about five months in 2019 trying to make a music startup based on that. However, during that time I realized that there would probably be more value in having a really good general-purpose/cross-domain recommender system. i.e. if you're looking for something specific, use Google, and if you're not looking for something specific, use Findka. That's the vision anyway.
More specifically, the benefits I see from cross-domain recommendation are:
- More data per user => better recommendations.
- More potential users => (eventually) better recommendations. For example, to get users for a podcast recommender, you have to find people who like podcasts above a certain threshold. With Findka, anyone who's interested in getting recommendations for at least one content type is a potential user. (And even people who aren't above that threshold for podcasts might appreciate an occasional podcast recommendation).
- Lots of potential applications. I'm particularly interested in trying to use Findka for social networking (opt-in of course). Data from Findka could be useful for dating, job opportunities, forming online communities, etc. This is more long-term, but I also think Findka data could be useful for search.
The algorithm currently is dead simple. Just collaborative filtering without explicitly taking into account content type. So it's naively cross-domain. Since the data set is still small, there's no need for matrix factorization. I recompute the whole matrix every hour and store it in memory. See  for the implementation (it only took 30 LOC). That's a little out-of-date but the general approach hasn't changed.
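To make "naively cross-domain" concrete, here is a minimal sketch of the general idea, not the actual 30-line implementation mentioned above. It assumes ratings arrive as `(user, url, rating)` tuples with `+1` for a like and `-1` for a dislike, and it uses cosine similarity between items; the function name, the data shape, and the choice of cosine are all assumptions for illustration. Note that the code never looks at what kind of content a URL points to.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_item_similarity(events):
    """Recompute the whole item-item similarity matrix from scratch.

    events: iterable of (user, url, rating), rating +1 (like) or -1 (dislike).
    Returns a nested dict: sim[url_a][url_b] = cosine similarity.
    """
    by_user = defaultdict(dict)
    for user, url, rating in events:
        by_user[user][url] = rating  # a later rating overwrites an earlier one

    dot = defaultdict(float)   # (url_a, url_b) -> sum of rating products
    norm = defaultdict(float)  # url -> sum of squared ratings
    for ratings in by_user.values():
        for url, r in ratings.items():
            norm[url] += r * r
        # Every pair of items rated by the same user contributes to the dot product.
        for (a, ra), (b, rb) in combinations(sorted(ratings.items()), 2):
            dot[(a, b)] += ra * rb

    sim = defaultdict(dict)
    for (a, b), d in dot.items():
        s = d / sqrt(norm[a] * norm[b])
        sim[a][b] = s
        sim[b][a] = s
    return dict(sim)


# Hypothetical data: a song, a book and a movie are all just URLs here.
events = [
    ("ann", "song/1", 1), ("ann", "book/1", 1),
    ("bob", "song/1", 1), ("bob", "book/1", 1), ("bob", "movie/1", -1),
    ("cat", "song/1", 1), ("cat", "movie/1", 1),
]
sim = build_item_similarity(events)
```

With a small data set, rebuilding this dict on a timer and keeping it in memory is cheap, which is presumably why matrix factorization isn't needed yet.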
For the tech stack, I'm using a Clojure web framework + deployment solution that I made. It's like a self-hosted version of Firebase (I'm running it on DigitalOcean).
Hi! I'm very interested in this domain; I have long been dissatisfied with the results for music recommendation across almost every platform (and the respective apps, but I digress). So frequently it seems my own personal preferences are washed away in the breadth of ML and algorithmic recommendation systems, so that no matter what I start listening to, I invariably end up in the mainstream for that genre/artist. It also seems to me that many algorithms can't properly deduce that I would want songs centered around a specific year, and instead draw from what's "popular" in the genre regardless of the time period of the song I started from.
As an example, I can begin a GPM station based on I Gotta Feeling, a song from 2009. When I start a playlist like that, I am almost never rewarded with a song from that time period. I tried it just now, and it immediately jumped to Roar by Katy Perry, a song from 2013 with an entirely different vibe.
I suppose my question is, how meaningful are the music recommendations Findka produces?
I would love to talk more about the music startup you were pursuing, and what you accomplished there, as well as any roadblocks you may have faced. I have been whittling down an idea for a music startup myself and can't help but wonder where that road led for you.
Also your site link is broken/not resolving.
As for the current music recommendations: it's nothing intelligent. Just pure collaborative filtering, i.e. it has no understanding of "this is a rock song" or "this song is from 2005" or even "this is a song". Right now, the algorithm just sees a bunch of URLs and their rating data. I am interested in making the algorithm more intelligent over time though.
Could the site not resolving be an issue with your network? It Works For Me, and I'm getting plenty of traffic right now.
I think that scaling will be a challenge, both computationally and in gathering enough data to be meaningful. The biggest wins, I suspect, are when you can infer people's preferences incidentally (say, from watching what podcasts they actually talk about on social media) rather than from their self-reported preferences, where conscious intentions often override true emotions.
But I'd love to see this work. I know there are things out there that I'd enjoy but don't manage to connect with.
Quick heads up, the book links aren't working for me.
- Consider injecting information with "oracles". An oracle is a kind of virtual user that likes one thing and only one thing. For example, they only watch movies that have been tagged sci-fi. This sci-fi oracle adds information about sci-fi-ness to your data, which is useful for several things. It helps with the cold start problem, as new items can be automatically tagged by the appropriate oracles and get past the zero information horizon quickly. Also, you can measure a user's sci-fi affinity by measuring that user's similarity to the sci-fi oracle.
- Another way to think about co-occurrences is as connected nodes in a digraph. You have users and items and connections between them (user watched video). Start with an item and traverse all the links to the other side (all the users who watched this video), then for each user traverse back to the items side (you can roll up the occurrences for a score) and you have similar items. This works equally well for finding similar users.
- Create an "average user" and use that as a seed for new users. If we know nothing else we should expect a new user to be close to average. This means they will probably get recommended the most popular items but
- Find items with divisive scores or groups and ask new users their opinion on those items to find out about them. After a new user gets created consider asking them their opinion on five of these divisive items. Their ratings should swiftly put them in an informed space the way taking five steps down a binary tree does a lot to reduce search space.
- I like the way you use simple plus one smoothing for your scores. I'm not sure why this doesn't get used more often.
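Two of the ideas in the list above, the two-hop graph walk and the oracles, fit in a short sketch. Everything here is hypothetical: the data shapes (`likes` maps a user to the set of items they liked, `tags` maps an item to its tags), the function names, and the use of Jaccard similarity as the affinity measure are assumptions for illustration, not anything Findka is known to do.

```python
from collections import Counter, defaultdict


def similar_items(seed, likes):
    """Two-hop walk: seed item -> users who liked it -> other items they liked."""
    users_by_item = defaultdict(set)
    for user, items in likes.items():
        for item in items:
            users_by_item[item].add(user)

    scores = Counter()
    for user in users_by_item[seed]:
        for item in likes[user]:
            if item != seed:
                scores[item] += 1  # roll up co-occurrences into a score
    return scores


def add_oracles(likes, tags):
    """Return a copy of `likes` plus one virtual user per tag.

    Each oracle likes exactly the items carrying its tag, and nothing else.
    """
    out = {user: set(items) for user, items in likes.items()}
    for item, item_tags in tags.items():
        for tag in item_tags:
            out.setdefault(f"oracle:{tag}", set()).add(item)
    return out


def jaccard(a, b):
    """Overlap of two sets of liked items, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


likes = {
    "ann": {"Alien", "Arrival"},
    "bob": {"Alien", "Amelie"},
    "cat": {"Alien", "Arrival"},
}
tags = {
    "Alien": {"sci-fi"}, "Arrival": {"sci-fi"},
    "Dune": {"sci-fi"},  # a new item that nobody has rated yet
    "Amelie": {"romance"},
}

with_oracles = add_oracles(likes, tags)
plain = similar_items("Alien", likes)
boosted = similar_items("Alien", with_oracles)
ann_scifi = jaccard(likes["ann"], with_oracles["oracle:sci-fi"])
bob_scifi = jaccard(likes["bob"], with_oracles["oracle:sci-fi"])
```

In this toy data, "Dune" has no real ratings, so it is invisible to `plain`; once the sci-fi oracle is added it shows up in `boosted`, which is the cold-start benefit described above. Swapping the roles of users and items in `similar_items` gives similar users instead.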
Good luck with the project!
For music, Gnod definitely gives me better suggestions than iTunes and Youtube.
There once was a very nice German-based application called Foundd. The app was specialised in movie recommendation and it was awesome: great interface, good filtering, good recommendations.
Then, they introduced TV shows recommendations. As soon as I started rating TV shows, the quality of my movie recommendations plummeted. I guess my general dislike for TV shows wasn't helping the algorithm.
It seems really hard to build useful cross-domain recommendations.
Rather than being at the mercy of suggested movies, 80% of which, in my experience of using Findka, I hadn't watched. Thus I couldn't like or dislike them, so the system carried on showing me more and more content suggestions that weren't increasing in relevance, as it had no input data.
I'm sure by simple maths over extended use I would inevitably see movies I could express a preference on, but I must be on "Refresh 50" now and only managed to vote on a tiny number of suggestions.
Also in the name of speeding up the process of identifying movies (or whatever) that I liked and disliked, would it be possible to like/dislike all three suggested items per screen before moving on to the next batch? atm as soon as one "vote" is placed the system throws out three more - when I could well have wanted to express a view on the other items that were presented.
Findka is a great idea btw! :-)
When you rate an item, are you sure all three items get replaced? If you rate one item, the ones below it should be moved up, and then the item on the bottom will be new. So if you rate the top item, you'll see a visual change in all items, but the other two items will still be there. Perhaps I should replace rated items in place without moving the others.
From a UI perspective: Maybe there could be a searchbox for books, songs etc. so that one can quickly enter things one likes. With the current system of entering preferences for the 3 suggested items, the problem is that I don't know many of the items and so I can't enter a preference.
I'm pretty curious to see what the correlation ends up being.
- I'd like to change how many recommendations I get with each newsletter.
- I'd like to add some music, but it didn't find anything I searched. I could just add a link. That would however link the music with the source/url, which does not make any sense semantically.
- If you don't rate suggestions (because you don't know them), they'll soon appear again. Maybe it would be nice to block them in the current session.
- I'll add an option for this.
- What songs were you searching for? The search feature uses Last.fm for music. They've had pretty much everything I've ever searched for, but maybe they're missing certain categories. I might investigate other/additional search APIs in the future. In the meantime, adding songs via URL is a decent option. The way the algorithm is implemented currently, it won't make a difference, and eventually I'm planning to add a cron job that'll go through the URL items and classify them as the correct content type and fetch additional metadata (with manual intervention as needed).
- I started working on this today; should be fixed tonight or tomorrow.
- Some Swiss German music. However, a search for "Not afraid" didn't return anything either... Didn't manage to get any result so far.
- Again: nice.
It's very motivating to get a response this fast and you seem to care about the input/feedback from users. The world needs more of that, thank you. :)
EDIT: My bad, uBlock Origin was blocking requests to ws.audioscrobbler.com. Can search for music now.
Also you're welcome! It's been really nice to actually have some users, a luxury which escaped me in previous startup attempts.
Additionally, reviews led to discovery of people with shared interests, when I saw the same people liking, sharing and commenting on stuff I liked. Following these people then added another layer of filtering to my ^stumbles^.
Might be food for thought.
I'm actually just starting to implement features like these. If you make an account then go to the Account tab, you can enable a public profile like this. So far I'm planning to add follow/subscribe, commentary (i.e. comment on items instead of just rating them), and showing users who've rated the same items as you.
- RAM usage today has gone from ~1.1GB to ~1.6GB
- DigitalOcean prices RAM linearly at $5/GB (I'm currently on a $10/month droplet)
Today is a big spike obviously (the number of thumbs up/thumbs down events has gone from 3K to almost 10K today, vs going from 0 to 3K since February). But say RAM usage continues to grow at 0.5GB per week (growth hopefully won't be linear, but I'd say that's a pretty steep linear growth rate for now). At $5/GB, that's about 2GB and so roughly $10/month added to my hosting bill for each month of growth, which is not bad at all.
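The arithmetic behind that estimate, spelled out as a quick sketch. The price and growth rate are the figures from this comment; the weeks-per-month conversion and the function name are assumptions of the sketch, and it ignores the fact that droplets come in fixed sizes.

```python
PRICE_PER_GB = 5.0          # USD per GB of RAM per month, from the comment
GROWTH_GB_PER_WEEK = 0.5    # assumed linear growth, from the comment
WEEKS_PER_MONTH = 52 / 12   # about 4.33; an assumption of this sketch


def added_monthly_cost(weeks):
    """Extra USD/month on the hosting bill after `weeks` of linear RAM growth."""
    return GROWTH_GB_PER_WEEK * weeks * PRICE_PER_GB


per_week = added_monthly_cost(1)                 # one week of growth
per_month = added_monthly_cost(WEEKS_PER_MONTH)  # one month of growth
```

So each week of growth adds $2.50/month to the bill, and each month of growth adds a bit under $11/month, which rounds to the $10 figure if you count a month as four weeks.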
That's probably the crappiest scale estimation ever made ha ha, but I believe at least that I should have plenty of time to figure out a reliable marketing channel before I do any major re-architecting. Maybe some time I'll do a stress test on my laptop to get a better sense of how RAM usage increases with additional data.
I've been at 10s of weekly active users. I guess this should handle 100s just fine, but maybe not 1000s.
Also I hadn't heard of letterboxd or RYM. I'll check those out more when the time comes.
I might be missing the big picture, but wouldn't log-in with OAuth solve the integration problem and remove an impediment for new users coming in? Also, if you do not mind, which algo do you use to find matches?
The algorithm uses a simple item-based neighborhood model. i.e. if you like song A, Findka looks up all the content liked by other users who liked song A and probabilistically chooses an item that was well-liked. To help the algorithm keep learning, 35% of the recommendations are purely random ("epsilon-greedy"). I describe the implementation here, though it's changed slightly (now I export the database every day or so and generate a model on my laptop, then I load it into memory on the server). I experimented with a machine learning-based model (a latent factor model) last week, but it seems I don't yet have enough rating data for that to be useful.
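A minimal sketch of that description, under stated assumptions: `likes` maps each user to the set of items they liked, `catalog` is the list of all items, and "probabilistically chooses" is read as sampling in proportion to how many co-likers liked each item. The function name, data shapes, and weighting scheme are hypothetical; only the 35% exploration rate comes from the comment.

```python
import random
from collections import Counter

EPSILON = 0.35  # share of purely random recommendations, from the comment


def recommend(liked_item, likes, catalog, epsilon=EPSILON, rng=random):
    """Pick one recommendation for someone who liked `liked_item`.

    With probability `epsilon`, explore: return a uniformly random other item.
    Otherwise, exploit: sample among items liked by users who also liked
    `liked_item`, weighted by how many of those users liked each item.
    """
    others = [item for item in catalog if item != liked_item]
    if not others:
        return None

    if rng.random() < epsilon:
        return rng.choice(others)

    counts = Counter()
    for items in likes.values():
        if liked_item in items:
            counts.update(items - {liked_item})
    if not counts:
        return rng.choice(others)  # no neighbors yet, fall back to random

    candidates = sorted(counts)
    weights = [counts[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


likes = {
    "ann": {"song-a", "book-b"},
    "bob": {"song-a", "book-b", "film-c"},
    "cat": {"film-d"},
}
catalog = ["song-a", "book-b", "film-c", "film-d"]

rng = random.Random(1)
exploit_picks = {recommend("song-a", likes, catalog, epsilon=0.0, rng=rng) for _ in range(200)}
explore_picks = {recommend("song-a", likes, catalog, epsilon=1.0, rng=rng) for _ in range(200)}
```

With `epsilon=0.0` the picks can only ever be "book-b" or "film-c"; the random share is what gives an item like "film-d", which has no overlap with "song-a" yet, a chance to collect ratings.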