Bridging the Training-Deployment Gap: Gated Encoding and Multi-Scale Refinement for Efficient Quantization-Aware Image Enhancement

2026-04-24

Quantization-aware training for the RGB image enhancement task

Foreword

I would like to sincerely thank all the members of my team: TA Tinh Anh, TA Tien Huy, Nguyen Trong Nghia, Vo Hoang and Bui Minh Hieu.

Disclaimer: This blog is not an official post for the paper or the project. It is based on my own experience and reflects what I learned.

Introduction

Image enhancement is a general task that covers many subtasks, such as improving the quality of a low-resolution image (Image Super-Resolution), de-blurring an image (Image Deblurring), or improving lighting conditions (Low-light Image Enhancement). It is a subset of Image Signal Processing (ISP): ISP uses signal processing to enhance the raw image coming from the image sensor, while image enhancement improves RGB images using algorithms or deep learning models.

In 2017, a research paper from ETH Zurich proposed a task related to image enhancement: (s)RGB image enhancement. The task is to enhance RGB images taken with old phones such as the iPhone 3GS, BlackBerry Passport and Sony Xperia Z so that they match the quality of a Canon DSLR camera. Models trained on this task can be applied to images of any resolution, and the method can be adapted to match any type of digital camera. The application is clear: "developing a method to take high quality photos using only an old smartphone". Imagine using only an iPhone 6 and still capturing photos as good as those from an expensive camera.

Related works

The original task was proposed in this paper together with the dataset DPED and a simple model.

Papers on image enhancement are abundant. A notable one is MobileIE, which introduces a tiny model with 4K parameters at inference time and achieves SOTA (state-of-the-art, i.e. the best) results on the LOL datasets. An important idea from the paper is that the model uses a reparameterization method to condense the weights of multiple convolutional layers into a single layer, which reduces the model's size by a significant amount while keeping the performance unchanged. LL-UNet++ tackles the low-light image enhancement task and inspired us to design the Multi-Scale Refinement blocks (see next section).
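To make the reparameterization idea concrete, here is a minimal sketch in plain Python. It assumes the simplest possible case: two parallel single-channel branches, a 3x3 convolution and a 1x1 convolution, whose outputs are summed. This is not MobileIE's actual block, and the helper names `conv2d_same` and `merge_branches` are made up for this post. Because convolution is linear, the two branches can be folded into one 3x3 kernel that gives the same output.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)  # square kernel with odd size is assumed
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - r, x + dx - r
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy][dx]
            out[y][x] = acc
    return out


def merge_branches(k3, k1):
    """Fold a 1x1 kernel into the centre tap of a 3x3 kernel."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1[0][0]
    return merged


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, 0.0, -0.1], [0.2, 0.5, -0.2], [0.1, 0.0, -0.1]]
k1 = [[0.25]]

# Training-time form: two parallel branches whose outputs are summed.
a = conv2d_same(image, k3)
b = conv2d_same(image, k1)
two_branch = [[a[y][x] + b[y][x] for x in range(3)] for y in range(3)]

# Inference-time form: a single convolution with the merged kernel.
one_branch = conv2d_same(image, merge_branches(k3, k1))

max_diff = max(abs(two_branch[y][x] - one_branch[y][x])
               for y in range(3) for x in range(3))
print("max difference between the two forms:", max_diff)
```

The merged form stores one kernel instead of two, which is where the parameter saving at inference comes from. Real reparameterized blocks typically also fold in normalization layers and more branches, which this sketch leaves out.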

Our approach

Our model looks similar to UNet (as many image enhancement models do) and differs in the design of the Down and Up blocks. Inspired by the DaHua-IIG team, we split the downsampling in the Encoder block into 3 branches: 2 feature maps from 2 parallel convolutional branches and 1 ensemble feature map built from them. The goal is to capture the interaction between the feature maps through the ensemble and then refine them with a Refinement block. Instance normalization layers are used (inspired by LL-UNet++) to handle each image's own structure and noise.

Pipeline
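As a reminder of what instance normalization does, below is a minimal plain-Python sketch. It normalizes one channel of one image with that image's own mean and variance, so statistics are never shared across a batch. The function name `instance_norm` and the epsilon value are illustrative assumptions, and the learnable scale and shift are omitted; this is not the layer from our code.

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalize one channel (H x W nested list) of a single image."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in channel]


# Two toy "images" with the same structure but different overall brightness.
dark = [[0.02, 0.04], [0.06, 0.08]]
bright = [[0.52, 0.54], [0.56, 0.58]]

nd = instance_norm(dark)
nb = instance_norm(bright)

# After normalization the brightness offset is gone and the structure remains.
gap = max(abs(nd[y][x] - nb[y][x]) for y in range(2) for x in range(2))
print("largest difference after normalization:", gap)
```

In this toy example the dark and the bright patch end up with practically identical normalized values, which is the per-image behaviour we want when every photo has its own exposure and noise level.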

Another contribution is that we successfully integrate Quantization-Aware Training (QAT) and keep the model's performance high even at 8-bit precision. During training, we add FakeQuant blocks at every block to simulate quantization error. This lets the model learn to correct the quantization error it will encounter when deployed in 8-bit format. The resulting QAT model achieves 22.194 PSNR and 0.796 SSIM when evaluated in FP32, and 21.050 PSNR and 0.725 SSIM in INT8. A normal PTQ model (Post-Training Quantization, i.e. converting a normal model to INT8 without training it to correct quantization error) only achieves 20.576 PSNR and 0.6139 SSIM. In other words, at INT8 the QAT model gains 0.474 PSNR and 0.1111 SSIM over the PTQ model. Moreover, the qualitative results also show a significant improvement of QAT over PTQ (rightmost columns).

Qualitative result
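For readers new to QAT, here is a rough sketch of what a fake-quantization step computes. It assumes per-tensor affine quantization to unsigned 8-bit integers with a simple min/max range; the exact scheme and observers in our model may differ, and `fake_quantize` is a name made up for this post. The value is rounded to the integer grid and immediately mapped back to float, so the next layer sees the same rounding error it would see after INT8 deployment.

```python
def fake_quantize(values, num_bits=8):
    """Quantize floats to num_bits integers, then dequantize back to floats."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The range is stretched to contain 0 so that 0.0 is represented exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    out = []
    for v in values:
        q = round(v / scale) + zero_point   # snap to the integer grid
        q = max(qmin, min(qmax, q))         # clamp to the representable range
        out.append((q - zero_point) * scale)  # back to float, error included
    return out


activations = [-1.0, -0.5, 0.0, 0.3, 1.0]
simulated = fake_quantize(activations)
errors = [abs(a - s) for a, s in zip(activations, simulated)]
print("max simulated quantization error:", max(errors))
```

Rounding has zero gradient almost everywhere, so QAT implementations usually pass the gradient through this step unchanged (the straight-through estimator). The sketch only covers the forward computation.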

Training configurations

GPU: A6000 48GB VRAM
Batch size: 64
Number of epochs: 50
Accumulated gradient steps: 2
Gradient clipping value: 1
Learning rate: 1e-4
Scheduler: CosineAnnealingWarmRestarts (T_0 = 10, T_mult = 2, eta_min = 5e-6, warmup epochs = 5, warmup factor = 0.1)
Precision: bf16
Loss functions: PSNR loss, cosine similarity, and outlier-aware loss with weights 2.0, 1.0 and 1.0 respectively. We chose cosine similarity because Multi-Scale Structural Similarity (MS-SSIM) increases the training time 5x compared to cosine similarity. The detailed formulas are in my paper.
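To give a feel for the scheduler settings listed above, the sketch below computes the learning rate per epoch in plain Python. The cosine part follows the standard warm-restart formula with cycle lengths 10, 20, 40, and so on. How the warmup is combined with it is an assumption on my side: here it is a linear multiplier that grows from the warmup factor to 1 over the warmup epochs. The function name `lr_at_epoch` is made up for this post.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-4, eta_min=5e-6, t_0=10, t_mult=2,
                warmup_epochs=5, warmup_factor=0.1):
    """Learning rate for a given epoch (0-based), stepping once per epoch."""
    # Find the current cycle length and the position inside that cycle.
    t_cur, t_i = epoch, t_0
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    lr = eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
    # ASSUMPTION: linear warmup multiplier from warmup_factor up to 1.0.
    if epoch < warmup_epochs:
        alpha = epoch / warmup_epochs
        lr *= warmup_factor + (1 - warmup_factor) * alpha
    return lr


num_epochs = 50
schedule = [lr_at_epoch(e) for e in range(num_epochs)]

# Epochs after warmup where the learning rate jumps back up are the restarts.
restarts = [e for e in range(5, num_epochs) if schedule[e] > schedule[e - 1]]
print("restart epochs:", restarts)
print("first / last learning rate:", schedule[0], schedule[-1])
```

Under these assumptions the third cycle is 40 epochs long and starts at epoch 30, so with 50 epochs in total the training ends partway through that cycle rather than at its minimum.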

What's next

At first, we aimed to distill a large model into a small one and combine that with QAT. However, we could not find a good reference for such a method. We therefore think that distillation + QAT for RGB image enhancement is a good direction for future work.