10/28/2006
10/25/2006
10-25-06 - 1
"supervised learning" , via nnets or whatever, perhaps CART decision trees. find f() to minimize |y - f(X)|.
categorizing/clustering. related to DT problem. how to find planes or rules to split a huge set of N-d points into related groups that minimize some overall cost function.
svm
10/24/2006
10-24-06 - 1
10/20/2006
10-20-06 - 2
10-20-06 - 1
I now seem to be getting around 0.920 and there still a million little areas for improvement. Every time I check off one "todo" I add five more as more questions are opened up (the vast majority of which are dead ends but need to be explored).
One thing's been troubling me lately - I've tried various schemes and thrown out the ones that weren't as good. Perhaps that's wrong though? They aren't as good over all the data, but maybe they're better on portions of the data, and if I could choose that portion of the data, they would help overall. This was part of the surprise of Volf's compression work - adding in predictors that are much worse overall can still be a huge win if they are selected on the portions they do well on. With CF it's intuitively obvious that you have these similar-item based predictors and similar-user based predictors. In general similar-item does better. However, on certain queries you might be able to guess that they will be more user-correlated than item-correlated, so you can user your similar-user predictor on that case and win.
10/19/2006
10-19-06 - 4
10-19-06 - 3
Consider two vectors A and B. The "MSE" of their difference is just |A-B|^2, the "rmse" is |A-B|. You can define a scale-invariant version of MSE as (|A-B|^2)/ab (where a = |A|, b = |B|). I'm not sure this scale invariance is a good thing or not, in practice is does weight things to the weightings, but people seem to like it because it's "elegant". Anyway, the Cosine law tells us that :
Cos(theta) = (a/b + b/a)/2 - (|A-B|^2)/2ab
The second term is familiar, the first term is this weird sort of ratio average. In the special case of unit vectors a = b = 1 this reduces to
Cos(theta) = 1 - (|A-B|^2)/2
So for unit vectors a "cosine" similarity and an MSE similarity will produce identical orderings. For non-unit vectors, we have this ratio term. The ratio term is minimized when a=b, so it gets bigger for length differences. In a sense it cancels out the length difference from |A-B|. Consider two parallel vectors of different length. The MSE metric considers these very different, but cosine says they are the same.
In practice with movie ratings there are some very weird things that happen here. The "A" vector is the vector of user A's ratings over N movies *subtracted by his average rating*. This subtraction is important, but it does a weird thing. It makes the vector of random direction (by definition of average), and generally makes it close to zero. If you look at a subset of movies that he rates close to average, his vector will be very close to zero and very random. The "B" vector is the same thing. If you now consider two vectors that are close to {0}, the MSE error between them is of course tiny, but the cosine between them is completely random. Obviously that's not a very sensible weighting. Similarly, consider the "scale invariant" version of MSE. You're dividing by the vector lengths. What that does in practice is make the error much larger for vectors near zero. (dividing by length does sort of work for Cosine because larger cosine = better, the opposite for mse)
Despite all this the "Pearson" correlation for user-user similarity does in fact seem to perform well.
10-19-06 - 2
I was thinking that to have any chance of winning I need to get some team-mates / helpers & use the power of the internet. I don't want to deal with signing people up & working out contracts where we'd have to negotiate shares & so on. I was thinking I'd do something like just give away my app, but make people agree that they won't do their own submission or something. Then I at least might attract some hobbyists who want to play with the problem and they might find some improvements or try some ideas that are beneficial.
10-19-06 - 1
10/18/2006
10-18-06 - 1
10/17/2006
10-17-06 - 6
1. They all use this "cosine" similarity measure. The cosine similarity is basically the dot product of the normalized vectors of ratings. What that means is that two movies with ratings like {2,2,2} and {4,4,4} are considered to be identical by this measure. Now, that's an okay thing assuming you're compensating for movie average, since if you subtract the average off both they are both {0,0,0}. However, the standard way of making an item-based prediction does NOT subtract off the average! It's reported in the literature as
Pred = Sum{other items} Weight(this,other) * Rating(other item) / sum of weights;
If you were to correct for averages it would be :
Pred = Average(this item) + Sum{other items} Weight(this,other) * (Rating(other item) - average(other item) / sum of weights;
But that's not what they report.
2. The exception to this (not correcting for average) is the "Similarity Fusion" paper (sigir06_similarityfuson.pdf) which DOES correct for averages, however, they specifically claim that their method in the case of item-only weights reduces to the standard item-based predictor which it in fact does NOT. It reduces to the 2nd term above which is the average-corrected item based predictor, which is quite different.
It seems to me the cosine weighting is very bad if you don't correct for averages, and yet the standard in the literature is to use the cosine weighting and not correct for averages. WTF. The standard "Item-Based Collaborative Filtering" paper (www10_sarwar.pdf) does try using a linear slope+bias to shift ratings between items, but they find that it hurts performance with lots of neighbors or dense data. Weird.
10-17-06 - 5
10-17-06 - 4
10-17-06 - 3
10-17-06 - 2
Some of the current leaders have finally spoken up in the Netflix forum. They seem to all be using some sort of standard academic bayesian/neural net learning system. This could be hard to beat if they're good & have powerful server clusters to do the training, which they presumably do at universities.
10-17-06 - 1
10/16/2006
10-16-06 - 2
So, I've read through a bunch of the literature. It seems like I independently invented most of the same techniques. Some interesting tidbits : I'm currently using 2 similar movies & 2 similar users, while the accepted good # in the academic community seems to be 30 (!!). (using more doesn't increase my run time at all since I'm looking at all of them anyway to pick the best 2). Also, the way they're comparing vectors is generally a dot product, while I've been using like a Euclidean distance. I'm not sure if that will make much of a difference. They also do fudging on the similarity to penalize small intersections and such; I have similar fudging and I've found that the details of those fudge factors are enormously important in practice. I can't tell in the literature if people have really investigated different similarity functions, or everyone is just using the one that was published first and has become standard.
One cool approach is the "Context-Boosted Collaborative Filtering" paper, though it seems completely unusable on large data sets because the first step is turning the sparse matrix into a full matrix by filling in the gaps with content-based predictions. That is, it would fill in the full 17k x 480k matrix of ratings, which is 8 billion ratings, which aside from taking about 8 GB of storage (not bad), would take around 1,000,000 hours to compute.
10-16-06 - 1
10/14/2006
10-14-06 - 1
10/13/2006
10-13-06 - 4
10-13-06 - 3
Having the probe data contained in the training set is actually really annoying, because it means I can't fairly precompute very much based on the training set. For example, if you precompute the "most similar" other movie for each movie, you will be using the probe data as part of that, and that helps a *TON* which totally throws off testing on the probe.
There's one case maybe y'all can give me ideas on. It's surprisingly common to have a movie for which I can't find any good "similar" movies. These "red herring" movies wind up contributing a lot to the RMSE score because they tend to be predicted badly. Right now I'm just using the average rating of them as their prediction, but there must be something better.
10-13-06 - 2
One thing I'm surprised at. 25% of movies fall into a category of "super predictable" cases. In these cases, I gave the same rating to the two most similar movies, and the most similar user gave the same rating to the query movie. The RMSE of these queries is still 0.85 ; I would have thought these cases would be more predictable.
Part of the problem with making a really amazing predictor is that the ratings are so discretized. If somebody felt like a "3.5" about a movie, they will sort of randomly shove it to a "3" or a "4" depending on how they feel, but the difference is a whole 1.0 which is big.
10-13-06 - 1
BTW I find the whole "Prize" aspect sort of annoying because it means no one is talking about their approaches publicly. Of course, even in fields where there is no prize, people are still super-secretive, because they always think they have some brilliant idea which is going to make them famous in the academic world, or get them patents and make them rich, blah blah blah.
Oh, I still could use some volunteers to run the heavy process if you have a fast machines that are idle.
10/12/2006
10-12-06 - 1
10/10/2006
10-10-06 - 2
10-10-06 - 1
10/09/2006
10-09-06 - 1
10/06/2006
10-06-06 - 3
RMSE for guessing each movie gets its average rating : 1.051938
RMSE for Netflix' Cinematch : 0.9474
RMSE for simple one-movie predictor : 1.037155
So, that's not very encouraging. Obviously I knew it wasn't going to be close to Cinematch, but the improvement over just using the average for each movie is microscopic, and cinematch is like light years ahead.
I'm starting to think that beating Cinematch by 10% is all but impossible. It's a lot like data compression where the closer you get to the true entropy the harder it is to make advances. The Netflix Prize FAQ justifies their 10% criterion by saying the difference between the "average" predictor and Cinematch is about 10%, so they want 10% more. That fails to address two major factors : "average" is actually not an awful predictor at all, and the next 10% is way way harder than the first 10%.
10-06-06 - 2
10-06-06 - 1
Anyone know anything about Memory mapping? Is there a reason I should memory map instead of just reading the whole file into big buffers?
Also, does anyone have a powerful machine that I can run a heavy process on for a long time? I estimate it will take about 200 hours of CPU time, 2 Gigs of RAM is necessary.
10/04/2006
10-04-06 - 2
Postman. Apparently very hard to get. I like walking around
Parks worker. Similar situation, I like planting plants and such, being outside
Apprentice Carpenter. I've seen some adds for this; I have no skills but would like to learn. I could offer to work the first few weeks with no pay
Bike shop retail or repair. Again my bike repair skills are only passable not expert, but I'd like to learn.
Cook. I'd love to be a chef but that's a long hard road and being an entry-level prep cook or something is pretty brutal
Other ?
10-04-06 - 1
Anybody who knows Windows - what's the best way to hold an 800 MB data structure on a machine with 1 GB of RAM ? I'm assuming if I just hold it in RAM I'm going to page like crazy and get killed, cuz Windoze is Stoopid. I guess I could write my program in basic C and boot to command line Linux or something ridiculous like that but that's not fun.
10/03/2006
10-03-06 - 2
At the moment the site seems to be a mess. The documentation sucks. The forums are full of people who don't know how to program asking nonsense questions.
10-03-06 - 1
10/02/2006
10-02-06 - 3
The new bill was tacked on the "SAFE ports" bill by Bill Frist primarily. It will soon be signed and go into law. Basically what it does is makes it illegal to provide online gambling or transactions to online gambling. It does not make it illegal for an individual/consumer to partake of said online gambling. Offshore sites will continue to operate. It will be illegal for US banks or other funding instruments to provide transactions with them, so the hard part will be getting your money in & out of them, but of course there will be ways. Another problem is that major publicily traded gambling companies will be shutting off their US customers because of this. This is a bit of a funny thing, because they're foreign companies they aren't directly affected by this law, but they have to comply with it if they want their stock traded in US markets, and so on and complicated stuff I don't really understand. Anyway, it seems all the major publicly trading companies will soon be shutting off US customers. That includes PartyPoker/PartyGaming, PokerStars, 888 (BetonSports), etc. There are other major private companies which will continue to allow US customers, the biggest being Full Tilt, Ultimate Bet, etc. While the law will go into affect immediately, the bank transfer prohibition won't go into affect into the bank regulatory body draws up the detailed regulations, which will take a while.
10-02-06 - 2
10-02-06 - 1
10/01/2006
10-01-06 - 1
old rants
-
▼
2006
(654)
-
▼
October
(39)
- 10-28-06 - 1
- 10-25-06 - 1
- 10-24-06 - 1
- 10-20-06 - 2
- 10-20-06 - 1
- 10-19-06 - 4
- 10-19-06 - 3
- 10-19-06 - 2
- 10-19-06 - 1
- 10-18-06 - 1
- 10-17-06 - 6
- 10-17-06 - 5
- 10-17-06 - 4
- 10-17-06 - 3
- 10-17-06 - 2
- 10-17-06 - 1
- 10-16-06 - 2
- 10-16-06 - 1
- 10-14-06 - 1
- 10-13-06 - 4
- 10-13-06 - 3
- 10-13-06 - 2
- 10-13-06 - 1
- 10-12-06 - 1
- 10-10-06 - 2
- 10-10-06 - 1
- 10-09-06 - 1
- 10-06-06 - 3
- 10-06-06 - 2
- 10-06-06 - 1
- 10-04-06 - 2
- 10-04-06 - 1
- 10-03-06 - 2
- 10-03-06 - 1
- 10-02-06 - 3
- 10-02-06 - 2
- 10-02-06 - 1
- 10-01-06 - 2
- 10-01-06 - 1
-
▼
October
(39)