
Can I ask you a question? How computationally expensive is it?



Amazingly, matching can be done with a single MySQL query that takes anywhere between 0.05 and 2.0 seconds (on average 0.25 if I do a whole batch of URLs) -- it depends on how popular the tags of the URL are. Those numbers were too large for my liking, so I cooked up a Java program that loads the relevant db tables into memory and does the matching itself. The program uses about 80MB of memory (though it only 'really' uses 35MB -- damn Java heap) and takes 0.008s to compute 500 matches for a given URL -- which I think is suitably fast. However, it's 8ms because I've optimized it quite well -- a brute-force match would take URL_INDEX_SIZE^2 * 2 * log2(NUMBER_OF_TAGS_PER_URL) multiplications.
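
To make the in-memory idea concrete, here's a rough Java sketch (not the actual program -- class and field names are made up) of one way to organise matching over sparse tag vectors, using an inverted tag index so only URLs sharing at least one tag with the query get scored:

    import java.util.*;

    // Sketch only: each URL is a sparse tag vector (tagId -> weight); an inverted
    // index from tagId -> urlIds means we never touch URLs with no tags in common.
    public class TagMatcher {
        private final Map<Integer, Map<Integer, Double>> urlVectors = new HashMap<>(); // urlId -> sparse vector
        private final Map<Integer, List<Integer>> tagIndex = new HashMap<>();          // tagId -> urlIds

        public void addUrl(int urlId, Map<Integer, Double> tagWeights) {
            urlVectors.put(urlId, tagWeights);
            for (int tagId : tagWeights.keySet()) {
                tagIndex.computeIfAbsent(tagId, k -> new ArrayList<>()).add(urlId);
            }
        }

        // Score every URL that shares a tag with the query, then return the top k.
        public List<Map.Entry<Integer, Double>> topMatches(int queryUrlId, int k) {
            Map<Integer, Double> query = urlVectors.get(queryUrlId);
            Map<Integer, Double> scores = new HashMap<>();
            for (Map.Entry<Integer, Double> e : query.entrySet()) {
                for (int otherId : tagIndex.getOrDefault(e.getKey(), Collections.emptyList())) {
                    if (otherId == queryUrlId) continue;
                    double w = urlVectors.get(otherId).get(e.getKey());
                    scores.merge(otherId, e.getValue() * w, Double::sum); // accumulate the dot product
                }
            }
            List<Map.Entry<Integer, Double>> ranked = new ArrayList<>(scores.entrySet());
            ranked.sort(Map.Entry.<Integer, Double>comparingByValue().reversed());
            return ranked.subList(0, Math.min(k, ranked.size()));
        }
    }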

The main bottleneck is inserting those 500 rows into a separate table, which currently takes 0.1 seconds. I'm sure there's a better way to do this... I'll have to look into it if it becomes a scaling pain.
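
For reference, one common way to speed this up from the Java side is to send all 500 rows as a single JDBC batch in one transaction (the table and column names below are made up for illustration):

    import java.sql.*;

    // Sketch only: assumes a hypothetical matches(url_id, match_id, score) table.
    // One batched round trip in one transaction is usually much faster than 500
    // individual INSERTs; with MySQL Connector/J, adding rewriteBatchedStatements=true
    // to the JDBC URL helps further by rewriting the batch into a multi-row INSERT.
    public class MatchWriter {
        public static void insertMatches(Connection conn, int urlId,
                                         int[] matchIds, double[] scores) throws SQLException {
            String sql = "INSERT INTO matches (url_id, match_id, score) VALUES (?, ?, ?)";
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < matchIds.length; i++) {
                    ps.setInt(1, urlId);
                    ps.setInt(2, matchIds[i]);
                    ps.setDouble(3, scores[i]);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();
            }
        }
    }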


Interesting. Sounds like you're effectively doing a sparse matrix-vector multiplication for each request?

If you want to buy yourself a bit more time performance-wise, it might be worth looking into a numerical computing library for sparse linear algebra with optimised routines for this (e.g. http://math.nist.gov/spblas/ at first glance). Looks like there's been some recent work on parallelising this via the GPU too: http://graphics.cs.uiuc.edu/~wnbell/

Some kind of dimensionality reduction is probably one of the next steps scalability-wise though, and should hopefully improve the results too if you tune it right. Latent Semantic Analysis is the term to google if you're not already aware of it. Testing the success of your predictive model might be tricky, though, in the absence of user-specific tagging data.

I experimented with this stuff a while back for music recommendations - got as far as doing some simple SVD-based dimensionality reduction but the results weren't as good as I'd hoped and it got put on the backburner. So, respect for polishing this one up for launch :)


Good suggestion on dimensionality reduction. You'll find most ready-made packages under the name PCA (principal component analysis), though. It's almost always good to reduce dimensionality first; depending on the number of components you choose, you can remove a lot of the noise from the data.
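
As a toy illustration of what the truncation buys you (using Apache Commons Math purely for brevity -- a real 200,000 x 60,000 sparse matrix would need an iterative/sparse SVD rather than a dense one):

    import java.util.Arrays;
    import org.apache.commons.math3.linear.*;

    // Toy rank-k truncation: keep only the top k singular vectors and project each
    // row into that k-dimensional latent space. Similarities can then be computed
    // on short dense vectors instead of huge sparse ones.
    public class TruncatedSvdDemo {
        public static void main(String[] args) {
            double[][] data = {
                {1, 1, 0, 0},
                {1, 0, 1, 0},
                {0, 0, 1, 1},
                {0, 1, 0, 1},
            };
            RealMatrix m = new Array2DRowRealMatrix(data);
            SingularValueDecomposition svd = new SingularValueDecomposition(m);

            int k = 2; // number of components to keep
            RealMatrix uk = svd.getU().getSubMatrix(0, m.getRowDimension() - 1, 0, k - 1);
            RealMatrix sk = MatrixUtils.createRealDiagonalMatrix(
                    Arrays.copyOf(svd.getSingularValues(), k));

            // Each row of uk * sk is the original row expressed in k latent dimensions.
            System.out.println(uk.multiply(sk));
        }
    }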

I'm curious why it didn't work for music recommendation for you. In the Netflix challenge for movie recommendation, pretty much everyone ended up using SVD-based methods and variations thereof.

If anyone is interested, this is the most accessible write-up of the idea I know: http://sifter.org/~simon/journal/20061211.html
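
The heart of that write-up fits in a few lines: predict a rating as the dot product of user and item feature vectors, and learn the features one at a time by gradient descent on the prediction error. A rough sketch, roughly following the linked post (the constants are illustrative, not tuned):

    // Sketch of the incremental ("Funk-style") SVD from the linked post: learn one
    // latent feature at a time by gradient descent on the prediction error.
    public class FunkSvdSketch {
        static final int NUM_FEATURES = 20;
        static final double LRATE = 0.001; // learning rate
        static final double K = 0.015;     // regularization

        final double[][] userFeature;  // [feature][user]
        final double[][] itemFeature;  // [feature][item]

        FunkSvdSketch(int numUsers, int numItems) {
            userFeature = new double[NUM_FEATURES][numUsers];
            itemFeature = new double[NUM_FEATURES][numItems];
            for (double[] row : userFeature) java.util.Arrays.fill(row, 0.1);
            for (double[] row : itemFeature) java.util.Arrays.fill(row, 0.1);
        }

        double predict(int user, int item) {
            double sum = 0;
            for (int f = 0; f < NUM_FEATURES; f++) {
                sum += userFeature[f][user] * itemFeature[f][item];
            }
            return sum;
        }

        // One pass over the known ratings, updating a single feature f.
        void trainFeature(int f, int[] users, int[] items, double[] ratings) {
            for (int i = 0; i < ratings.length; i++) {
                double err = ratings[i] - predict(users[i], items[i]);
                double uf = userFeature[f][users[i]];
                double mf = itemFeature[f][items[i]];
                userFeature[f][users[i]] += LRATE * (err * mf - K * uf);
                itemFeature[f][items[i]] += LRATE * (err * uf - K * mf);
            }
        }
    }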


> I'm curious why it didn't work for music recommendation for you. In the Netflix challenge for movie recommendation, pretty much everyone ended up using SVD-based methods and variations thereof.

I expect the algorithm did reasonably well given the data available. Probably a mix of reasons why I was disappointed though:

* My expectations were too high, I didn't realise quite how hard of a problem good music recommendations are

* Dataset wasn't large enough

* Dataset was based on deliberately-stated opinions (ratings) rather than actual observed behaviour (listening data)

* I didn't find a way to use the timing data associated with the ratings

* Figuring out how best to normalise the dataset prior to attempting SVD was tricky, and I'm not convinced I found the best way

I admit I was also sort of naively hoping that the 'features' identified by SVD would have at least a vaguely-human-recognisable theme to them. The first 2 or 3 of them did but from there on in it all looked pretty random.

Also, like a lot of recommenders, it gave the impression of having based its recommendations on some kind of generic, averaged-out 'middle point' of your overall usage data.

Really I'd rather it clustered my usage data, then recommended new things in and around each of the clusters.

I have some ideas for what I'd do differently the next time around - that being one - but also ideas about how better to tie algorithmic recommendation tools in to human interaction.


Yeah, there is always the danger that there's just too little data or that it's too noisy. But the technical issues you mention (normalization, subtracting bias, using temporal data) all came up in the Netflix movie recommendation competition as well, so you can always look at how people handled them there.
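
For the bias part specifically, the usual first step is a baseline of global mean plus per-item and per-user offsets, with the factorization trained on the residuals. A rough sketch (the damping constant is illustrative; the KorBell report does considerably more):

    import java.util.*;

    // Sketch of the usual normalization step before factorization: model each rating
    // as globalMean + itemBias + userBias and train the factor model on the residuals.
    // Damping shrinks the biases of rarely-rated items/users toward zero.
    public class BaselineModel {
        double globalMean;
        final Map<Integer, Double> itemBias = new HashMap<>();
        final Map<Integer, Double> userBias = new HashMap<>();

        void fit(int[] users, int[] items, double[] ratings) {
            globalMean = Arrays.stream(ratings).average().orElse(0);
            double damping = 25;

            Map<Integer, double[]> itemAcc = new HashMap<>(); // itemId -> {sum, count}
            for (int i = 0; i < ratings.length; i++) {
                double[] acc = itemAcc.computeIfAbsent(items[i], k -> new double[2]);
                acc[0] += ratings[i] - globalMean;
                acc[1]++;
            }
            itemAcc.forEach((id, acc) -> itemBias.put(id, acc[0] / (acc[1] + damping)));

            Map<Integer, double[]> userAcc = new HashMap<>(); // userId -> {sum, count}
            for (int i = 0; i < ratings.length; i++) {
                double[] acc = userAcc.computeIfAbsent(users[i], k -> new double[2]);
                acc[0] += ratings[i] - globalMean - itemBias.get(items[i]);
                acc[1]++;
            }
            userAcc.forEach((id, acc) -> userBias.put(id, acc[0] / (acc[1] + damping)));
        }

        // Residual that the SVD/factorization should be trained on.
        double residual(int user, int item, double rating) {
            return rating - globalMean
                    - itemBias.getOrDefault(item, 0.0)
                    - userBias.getOrDefault(user, 0.0);
        }
    }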

Some useful information can be found in the report of the winners: http://www.netflixprize.com/assets/ProgressPrize2007_KorBell...

There's a lot of fancy stuff, which would be overkill in a real system, but a lot of practical info too.

Also, there was an article by the winners, "Collaborative filtering with temporal dynamics," which might be useful; I think it's freely available.


You seem to know quite a lot about this type of stuff. It's great. What's your story?


Really amazing input, thank you. Unfortunately, this stuff is over my head, in two ways.

First, I haven't touched C++ in five years and don't want to spend weeks figuring out how to get my data from MySQL loaded into those data structures, how to create sockets so my PHP scripts can connect to it, how to handle errors, etc, etc. It's probably not as hard as I'm imagining it to be, and I'd love doing it, but I personally can't afford getting sucked into those details (no matter how appealing they are).

Second, I'm not familiar enough with the math of this library to know how to use it. I have about 200,000 vectors with 60,000 dimensions, each vector being extremely sparse. Given one vector, I'd like to compute its dot product with all the others, then sort by value. I have no clue how to do this with that library.

It sounds like you have a lot of experience with this type of stuff. How much do you think my results could be improved (specifically)? Which results do you think are insufficient, and why?

Thanks for your input, I'm excited to nerd out on this type of stuff... even if it is over my head at times.


I wrote something similar a few years ago (finding photos with the most-similar tag vectors among a set of ~500,000). The core lookup was done by a few hundred lines of C code returning answers in ~0.1s, IIRC. This used a standard info-retrieval method, not a linear-algebra library -- all vectors pre-loaded into RAM in a sparse representation, then for each query scan them all, keeping a min-heap of the best vectors so far. (This could be sped up by indexing the vectors and skipping the ones with no tags in common with the query, but I didn't need to bother.) Email me if you'd like to discuss my experience with this, though it looks like you already get decently fast answers.
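
For what it's worth, a Java sketch of that scan-plus-min-heap approach (the original was C; the names here are made up, with each vector stored as id-sorted parallel arrays):

    import java.util.*;

    // Sketch of the scan-and-min-heap lookup described above (the original was C).
    // All vectors live in RAM as sparse, id-sorted (tagId, weight) parallel arrays;
    // each query scans every vector, and a size-k min-heap keeps the best scores.
    public class TopKScanner {
        static class Scored {
            final int id;
            final double score;
            Scored(int id, double score) { this.id = id; this.score = score; }
        }

        // Sparse dot product by merging two id-sorted arrays.
        static double dot(int[] idsA, double[] wA, int[] idsB, double[] wB) {
            double sum = 0;
            int i = 0, j = 0;
            while (i < idsA.length && j < idsB.length) {
                if (idsA[i] == idsB[j])      sum += wA[i++] * wB[j++];
                else if (idsA[i] < idsB[j])  i++;
                else                         j++;
            }
            return sum;
        }

        static List<Scored> topK(int[][] ids, double[][] weights, int query, int k) {
            PriorityQueue<Scored> heap =
                    new PriorityQueue<>(Comparator.comparingDouble((Scored s) -> s.score));
            for (int v = 0; v < ids.length; v++) {
                if (v == query) continue;
                double s = dot(ids[query], weights[query], ids[v], weights[v]);
                if (heap.size() < k) {
                    heap.add(new Scored(v, s));
                } else if (s > heap.peek().score) {
                    heap.poll();
                    heap.add(new Scored(v, s));
                }
            }
            List<Scored> result = new ArrayList<>(heap);
            result.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
            return result;
        }
    }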

You can do queries besides "sites most like site X", you know -- "What sites are like a cross of X and Y?"
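
Under the same made-up sparse representation as the matcher sketch above (tagId -> weight maps), a "cross of X and Y" query is just the two vectors merged, then ranked against everything else exactly as for a single URL:

    import java.util.*;

    // Sketch: combine the sparse tag vectors of X and Y; the combined vector is
    // then scored against every other URL's vector by dot product, as before.
    public class CrossQuery {
        static Map<Integer, Double> cross(Map<Integer, Double> x, Map<Integer, Double> y) {
            Map<Integer, Double> combined = new HashMap<>(x);
            y.forEach((tag, w) -> combined.merge(tag, w, Double::sum));
            return combined;
        }
    }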


Is there any way to do this once and for all for a link? As you're crawling the database you could look up the link and compute its matches once, so that the next time a user runs a query, if the answer is already there you can return it directly; if it isn't, you can use more resources to crunch it for them.

Is this possible? Also, is there any way I can help you to make this? I don't expect any monetary benefit. I just want to learn.
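
In sketch form, that precompute-or-fall-back idea might look like this (names are made up; the store here is just an in-memory map, but it could equally be a db table filled at crawl time):

    import java.util.*;
    import java.util.function.Function;

    // Sketch: results computed at crawl time go into a store; a query either hits
    // the store or is computed on demand and saved for next time.
    public class MatchCache {
        private final Map<String, List<String>> store = new HashMap<>();
        private final Function<String, List<String>> matcher; // the expensive computation

        MatchCache(Function<String, List<String>> matcher) { this.matcher = matcher; }

        List<String> similarTo(String url) {
            // Return the stored answer if this link has been seen before,
            // otherwise compute it now and remember it.
            return store.computeIfAbsent(url, matcher);
        }
    }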



