Why SSIM Falls Short of Human Vision

The structural similarity index measure (SSIM) arrived in 2004 promising something better than peak signal-to-noise ratio (PSNR), and in many respects it delivered. Where PSNR treats every misplaced pixel as equally offensive regardless of context, SSIM at least acknowledges that human perception organizes images into structures, that luminance and contrast matter differently than fine texture, and that a blurry edge in a sky region offends the eye less than the same blur applied to a face.

For two decades it has served as the default quality metric across compression research, medical imaging, and video streaming, and its longevity reflects genuine usefulness. But longevity and accuracy are different virtues, and the gap between what SSIM measures and what a person actually experiences when looking at an image has become harder to ignore as the field demands finer distinctions. The core problem is that SSIM computes local statistics across fixed windows and combines them into a single score, which sounds reasonable until you consider how radically human attention varies across a scene.
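To make that concrete, here is roughly what the computation looks like for grayscale images, a minimal sketch using the Gaussian window of the original paper; production implementations such as skimage.metrics.structural_similarity handle window truncation, edge effects, and parameter conventions more carefully. The last line is where the trouble starts: the final score is a flat mean over the local map.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ssim_map(x, y, data_range=255.0, sigma=1.5, k1=0.01, k2=0.03):
    """Per-pixel SSIM map for two grayscale images (Wang et al., 2004),
    using Gaussian-weighted local statistics."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term

    mu_x = gaussian_filter(x, sigma)
    mu_y = gaussian_filter(y, sigma)
    # Local (co)variance as E[ab] - E[a]E[b] under the same Gaussian window.
    var_x = gaussian_filter(x * x, sigma) - mu_x ** 2
    var_y = gaussian_filter(y * y, sigma) - mu_y ** 2
    cov_xy = gaussian_filter(x * y, sigma) - mu_x * mu_y

    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def ssim(x, y, **kwargs):
    # The reported score is a flat, unweighted mean over the local map:
    # every window counts equally, wherever a viewer would actually look.
    return float(ssim_map(x, y, **kwargs).mean())
```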

The visual system allocates processing resources in a highly nonuniform way, driven by saliency, task demands, prior expectations, and the particular story a viewer is constructing from the image. A compression artifact sitting directly on a speaker's mouth during a video call is catastrophic in a way that the same artifact floating in a background bookshelf simply is not, yet SSIM gives both regions exactly the same weight in its average.
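If a saliency map were available, from an eye-tracking model or a face detector, one could imagine replacing that flat mean with importance-weighted pooling. The sketch below is not part of SSIM, and the saliency_pooled_score name is invented for illustration; weighted pooling of this general shape has been explored in the quality-assessment literature.

```python
import numpy as np

def saliency_pooled_score(local_quality, saliency, eps=1e-8):
    """Pool a per-pixel quality map (e.g. the SSIM map above) with an
    externally supplied saliency map instead of a flat mean. With a
    uniform saliency map this reduces exactly to plain mean pooling."""
    w = saliency.astype(np.float64)
    w = w / (w.sum() + eps)
    return float((local_quality * w).sum())
```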

SSIM also struggles with certain distortion types that perception handles asymmetrically. Ringing artifacts around sharp edges, for instance, can score reasonably well because the structural information is largely preserved, while looking genuinely unpleasant to any human observer.
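This is straightforward to probe, assuming you are willing to synthesize the artifact: a brick-wall low-pass filter in the frequency domain produces Gibbs ringing around edges while leaving coarse structure mostly intact. Comparing its SSIM against a Gaussian blur of similar severity on edge-heavy content is an instructive exercise; the exact numbers depend on the image, which is rather the point.

```python
import numpy as np

def brickwall_lowpass(img, cutoff=0.3):
    """Ideal (brick-wall) low-pass filter; the hard cutoff in the frequency
    domain is what produces Gibbs ringing around sharp edges. `cutoff` is a
    fraction of the Nyquist frequency."""
    f = np.fft.fftshift(np.fft.fft2(img.astype(np.float64)))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    # Normalized radial frequency, zero at the center of the shifted spectrum.
    r = np.hypot((yy - h / 2) / (h / 2), (xx - w / 2) / (w / 2))
    f[r > cutoff] = 0.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))
```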

The metric was designed around a particular model of what structure means, and that model does not fully capture masking effects, orientation sensitivity, or the way the brain integrates information across spatial scales simultaneously rather than sequentially. Multi-scale variants recover part of this, as the sketch below suggests.
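MS-SSIM (Wang, Simoncelli, and Bovik, 2003) is the standard partial remedy, evaluating contrast and structure at several dyadic scales and combining them with fixed per-scale exponents. The toy version below shows only the bare multi-scale skeleton, reusing the ssim sketch from earlier, not the published weighting.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multiscale_ssim_sketch(x, y, levels=3):
    """Average single-scale SSIM (the `ssim` sketch above) over successive
    2x downsamplings. The published MS-SSIM weights scales unevenly and
    evaluates luminance only at the coarsest scale; this is just the skeleton."""
    scores = []
    for _ in range(levels):
        scores.append(ssim(x, y))
        # Low-pass before decimating to limit aliasing, then drop every
        # other row and column.
        x = gaussian_filter(x, 1.0)[::2, ::2]
        y = gaussian_filter(y, 1.0)[::2, ::2]
    return float(np.mean(scores))
```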

More fundamentally, SSIM evaluates images as static, context-free objects, but human judgments of quality are neither static nor context-free. Viewing distance changes which defects are visible. Display calibration, ambient lighting, and even the images a viewer has seen recently shift their threshold for acceptable quality. The emotional content of a scene influences how carefully people scrutinize it. None of this can be folded into a formula that takes two pixel arrays and returns a single score.

Learned perceptual metrics trained on human opinion scores, like LPIPS, have made genuine progress by letting neural representations absorb some of this complexity, but they trade one set of limitations for another, inheriting biases from their training distributions and remaining opaque about which visual properties they are actually rewarding.
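For reference, the lpips package reduces this to a few lines; the score is a distance, so lower means more perceptually similar, and the choice of backbone network is exactly the sort of opaque knob described above.

```python
import torch
import lpips

# Backbone can be 'alex', 'vgg', or 'squeeze'; pretrained weights are
# downloaded on first use.
loss_fn = lpips.LPIPS(net='alex')

# LPIPS expects NCHW float tensors scaled to [-1, 1]; random images here
# just to show the call shape.
img0 = torch.rand(1, 3, 64, 64) * 2 - 1
img1 = torch.rand(1, 3, 64, 64) * 2 - 1

distance = loss_fn(img0, img1)  # lower means more perceptually similar
print(float(distance))
```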

The honest position is that no single metric yet captures what humans see, and treating any of them as ground truth rather than as a coarse and provisional signal remains the more defensible practice.