Paper Review: SwinIR: Image Restoration Using Swin Transformer

Paper link

Code link

Image restoration is a vision problem that aims to restore high-quality images from low-quality images (e.g., downscaled, noisy, and compressed images).

The authors build their model on the Swin Transformer. Experimental results show that SwinIR outperforms state-of-the-art methods on several tasks (image super-resolution, image denoising, and JPEG compression artifact reduction) by up to 0.14∼0.45 dB, while reducing the total number of parameters by up to 67%.
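
To put the dB figures in perspective, here is a small sketch that assumes they are PSNR gains (the usual metric for these tasks) and converts a gain into the corresponding reduction in mean squared error. The helper names `psnr` and `mse_ratio_for_gain` are made up for this illustration.

```python
import math

def psnr(mse: float, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE must shrink to raise PSNR by gain_db."""
    return 10.0 ** (-gain_db / 10.0)

for gain in (0.14, 0.45):
    ratio = mse_ratio_for_gain(gain)
    print(f"+{gain:.2f} dB -> MSE x {ratio:.3f} ({(1 - ratio) * 100:.1f}% lower)")
```

Under this assumption, gains of 0.14 and 0.45 dB correspond to roughly 3% and 10% lower mean squared error, respectively.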

The approach

Compared with widely used CNN-based image restoration models, transformer-based SwinIR has several benefits:

  • content-based interactions between image content and attention weights, which can be interpreted as spatially varying convolution;
  • long-range dependency modeling, enabled by the shifted window mechanism;
  • better performance with fewer parameters.
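
The shifted window idea can be illustrated with a toy example. This is a minimal sketch in plain Python, not the authors' implementation: it only shows how a cyclic shift by half a window changes which pixels share a window, and it leaves out the attention computation and the masking that Swin Transformer applies to wrapped-around regions. The function names and the 8x8 grid are assumptions chosen for the illustration.

```python
def window_partition(grid, window_size):
    """Split a square 2D grid into non-overlapping window_size x window_size windows."""
    n = len(grid)
    windows = []
    for top in range(0, n, window_size):
        for left in range(0, n, window_size):
            windows.append([grid[r][c]
                            for r in range(top, top + window_size)
                            for c in range(left, left + window_size)])
    return windows

def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    n = len(grid)
    return [[grid[(r + shift) % n][(c + shift) % n] for c in range(n)]
            for r in range(n)]

size, window_size = 8, 4
# Label every pixel with its original (row, col) position.
grid = [[(r, c) for c in range(size)] for r in range(size)]

regular = window_partition(grid, window_size)
shifted = window_partition(cyclic_shift(grid, window_size // 2), window_size)

# Two diagonal neighbours that sit on opposite sides of a window border.
a, b = (3, 3), (4, 4)
print("same window without shift:", any(a in w and b in w for w in regular))
print("same window with shift:   ", any(a in w and b in w for w in shifted))
```

Alternating regular and shifted partitions across layers is what lets information cross window borders, even though each attention step stays local to a window.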

The architecture


SwinIR consists of three modules: shallow feature extraction, deep feature extraction, and high-quality image reconstruction:

  • The shallow feature extraction module uses a 3x3 convolution layer to extract shallow features. These are passed both to the deep feature extraction module and, through a long skip connection, directly to the reconstruction module;
  • The deep feature extraction module is composed of K residual Swin Transformer blocks (RSTB), each of which uses several Swin Transformer layers for local attention and cross-window interaction. The authors add a convolutional layer at the end of each block for feature enhancement and use a residual connection to provide a shortcut for feature aggregation;
  • The reconstruction module fuses the shallow and deep features to produce the high-quality image. Different tasks use different reconstruction modules: super-resolution uses a sub-pixel convolutional layer, while the other tasks use a single convolutional layer. In addition, the authors use residual learning: the model reconstructs the residual between the low-quality (LQ) and high-quality (HQ) images instead of the HQ image itself.
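
A sub-pixel convolutional layer is a convolution that produces scale² times more channels, followed by a rearrangement step (often called pixel shuffle) that moves those channels into spatial positions. Below is a minimal sketch of the rearrangement step only, written with nested lists so it runs without any dependencies. The channel ordering follows the convention commonly used in deep learning libraries, which is an assumption here rather than something taken from the SwinIR code.

```python
def pixel_shuffle(features, scale):
    """Rearrange (C * scale^2, H, W) nested lists into (C, H * scale, W * scale)."""
    channels_in = len(features)
    height, width = len(features[0]), len(features[0][0])
    channels_out = channels_in // (scale * scale)
    out = [[[0.0] * (width * scale) for _ in range(height * scale)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for h in range(height):
            for w in range(width):
                for i in range(scale):
                    for j in range(scale):
                        # Each group of scale^2 input channels fills one
                        # scale x scale block of output pixels.
                        src = c * scale * scale + i * scale + j
                        out[c][h * scale + i][w * scale + j] = features[src][h][w]
    return out

# 4 channels of 2x2 features become 1 channel of 4x4 pixels (scale 2).
features = [[[ch * 10 + h * 2 + w for w in range(2)] for h in range(2)]
            for ch in range(4)]
upscaled = pixel_shuffle(features, 2)
for row in upscaled[0]:
    print(row)
```

Because the upsampling happens only at the very end, all the Swin Transformer layers before it operate at the low input resolution.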



The loss function depends on the task:

  • For classical and lightweight image SR, the loss is a simple L1 pixel loss;
  • For real-world SR, they use a combination of pixel loss, GAN loss, and perceptual loss;
  • For image denoising and JPEG compression artifact reduction, they use the Charbonnier loss.
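
For reference, the Charbonnier loss is a smooth approximation of L1: it replaces |x| with sqrt(x² + eps²), which stays differentiable at zero. The sketch below compares the two on flat lists of pixel values. It uses a per-pixel averaged variant, and the value of `eps` and the sample numbers are assumptions for illustration.

```python
import math

def l1_loss(pred, target):
    """Mean absolute error over all pixels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt(diff^2 + eps^2): close to L1, but smooth around zero."""
    return sum(math.sqrt((p - t) ** 2 + eps ** 2)
               for p, t in zip(pred, target)) / len(pred)

# Hypothetical pixel values in [0, 1].
pred = [0.20, 0.50, 0.90]
target = [0.25, 0.50, 0.70]

print("L1:         ", l1_loss(pred, target))
print("Charbonnier:", charbonnier_loss(pred, target))
```

The two values differ by at most `eps`, so the choice mostly affects how gradients behave for very small errors.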



Across all tasks, SwinIR achieves better results with smaller models.


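  • The higher the channel number, the RSTB number, and the number of Swin Transformer layers (STL) in an RSTB, the better the performance;
  • The bigger the training image patches, the better the performance;
  • SwinIR converges faster and to better results than CNN-based models.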