Anyway, I was getting totally wrong results, so I went over my MS-SSIM implementation with a fine toothed comb, checked everything against the original Matlab; I found a few deviations, but they were all irrelevant. The problem was I was running the files in the TID database in a different order than the TID test program wanted, so it was matching the wrong file to the wrong file.
As part of that, I downloaded all the implementations of MS-SSIM I could find. The best one is in Metrix Mux . Most of the others have some deviation from the original. For example, many people get the Gaussian window wrong (A Gaussian with sdev 1.5 is e^(- 0.5 * (x/1.5)^2 ) - people leave off the 0.5), others incorrectly apply the window at every pixel position (you should only apply it where the whole window is inside the image, not off any edge), another common flaw is to get the downsample wrong; the way authors do it is with a Daub9 lowpass tap and then *point* downsampling (they use :1:2:end matlab notation which is a point downsample). Anyway, a lot of these details are pretty random. Also of note : Metrix Mux uses rec601 luma and does not gamma correct.
The TID2008 perceptual distortion database is a lot better than the Live database, but it's still not great.
The problem with both of them is that the types of distortions applied is just too small of a subset. Both of them mainly just apply some random noise, and then they both apply JPEG and JPEG2000 distortions.
That's okay if you want a metric that is good at specifically judging the human evaluation of those types of distortions. But it's a big problem in general. It means that metrics which consider other factors are not given credit for their considerations.
For example, TID2008 contains no hue rotations, or images that have constant luma channels but visible detail in chroma. That means that metrics which only evaluate luma fidelity do quite well on TID2008. It has no images where just R or just G or just OpponentBY is modified, so you can't tell anything about the importance of different color channels to perception of error.
TID2008 has 25 images, which is too many really; you only need about 8. Too many of the TID images are the same, in the sense that they are photographs of natural scenes. You only need 2 or 3 of those to be a reasonable representative sample, since natural scene photos have amazingly consistent characteristics. What is needed are more distortion types.
Furthermore, TID has a bunch of distortion types that I believe are bogus; in particular all the "exotic" distortions, such as injecting huge solid color rectangles into the image, or changing the mean luminance. The vast majority of metrics do not handle this kind of distortion well, and TID scores unfairly penalize those metrics. The reason it's bogus is that I believe these types of distortions are irrelevant to what we are doing most of the time, which is measuring compressor artifacts. No compressor will ever make distortions like that.
And also on that thread, too many of the distortions are very large. Again many metrics only work well near the threshold of detection (that is, the distorted image almost looks the same as the original). That limitted function is actually okay, because that's the area that we really care about. The most interesting area of work is near threshold, because that is where we want to be making our lossless compressed data - you want it to be as small as possible, but still pretty close to visually unchanged. By having very huge distortions in your database, you give too many points to metrics that handle those huge distortions, and you penalize
Lastly, because the databases are all too small, any effort to tune to the databases is highly dubious. For example you could easily do something like the Netflix prize winners where you create an ensemble of experts - various image metrics that you combine by estimating how good each metric will be for the current query. But training that on these databases would surely just give you overfitting and not a good general purpose metric. (as a simpler example, MS-SSIM has tons of tweaky hacky bits, and I could easily optimize those to improve the scores on the databases, but that would be highly questionable).
Anyway, I think rather than checking scores on databases, there are two other ways to test metrics that are better :
1. Metric competition as in "Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities". Unfortunately for any kind of serious metrics, the gradient is hard to solve analytically, so it requires numerical optimization and this can be very time consuming. The basic method I would take is to take an image, try a few random steps, measure how they affect metrics M1 and M2, then form a linear combo of steps such that the affect on M2 is zero, while the affect on M1 is maximized. Obviously the metrics are not actually linear, but in the limit of tiny steps they are.
2. Metric RD optimization. Using a very flexible hypothetical image coder (*), do R/D optimization using the metric you wish to test. A very general brute for RD optimizer will dig out any oddities in the metric. You can test metrics against each other by making the image that this compressor will general for 1.0 bits per pixel (or whatever).
* = you shouldn't just use JPEG or something here, because then you are only exploring the space of how the metric rates DCT truncation errors. You want an image compressor that can make lots of weird error shapes, so you can see how the metric rates them. It does not have to be an actual competitive image coder, in fact I conjecture that for this use it would be best to have something that is *too* general. For example one idea is to use an over-complete basis transform coder, let it send coefficients for 64 DCT shapes, and also 64 Haar shapes and maybe some straight edges and linear ramps, etc. H264-Intra at least has the intra predictors, but maybe also in-frame motion vectors would add some more useful freedom.