arXiv: 2505.05853

PICD: Versatile Perceptual Image Compression with Diffusion Rendering


PICD is a versatile perceptual image compression codec that uses diffusion rendering with three-tiered conditioning to achieve both high text accuracy and high visual quality on screen and natural images, outperforming existing perceptual codecs on key metrics such as FID and text accuracy.

Generative AI, Diffusion Model, Image Generation, Multimodal Data, Perceptual Quality, Text Accuracy

Tongda Xu, Jiahao Li, Bin Li, Yan Wang, Ya-Qin Zhang, Yan Lu

AIR, Tsinghua University, Microsoft Research Asia

Generated by grok-3

Background Problem

Perceptual image compression has advanced significantly for natural images, achieving high visual quality at low bitrates using generative models like GANs and diffusion models. However, these methods often fail for screen content, producing artifacts in text reconstruction, as they prioritize marginal distribution matching over text accuracy. Conversely, existing screen content codecs focus on text fidelity but compromise on perceptual quality, resulting in blurry images at low bitrates. The key problem addressed by this work is the lack of a versatile codec that simultaneously achieves high text accuracy and perceptual quality for both screen and natural images.

Method

The proposed method, PICD (Perceptual Image Compression with Diffusion Rendering), is a versatile codec framework that handles both screen and natural images by encoding text and image data separately and rendering them jointly with a diffusion model. For screen content, text is extracted via OCR (Tesseract) and compressed losslessly, while the image is compressed using a conditioned MLIC codec. Decoding renders the compressed image and text together using a conditional diffusion model (based on Stable Diffusion) with a three-tiered conditioning approach, sketched in code below:

1) Domain level: fine-tune the diffusion model on screen-content prompts using LoRA.
2) Adaptor level: a hybrid adaptor combining ControlNet and StableSR provides finer low-level control over the image and text inputs.
3) Instance level: guidance applied during sampling aligns intermediate outputs with the text and image conditions.

For natural images, the method simplifies by dropping the text condition and using image captions instead, maintaining perceptual quality.
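To make the three tiers concrete, here is a structural sketch of the encode/decode flow as I read it. Every helper name (run_ocr, mlic_encode, hybrid_adaptor, predict_x0, ddim_step, and so on) is a hypothetical stand-in, not the authors' API, and the real model denoises in Stable Diffusion's latent space rather than pixel space; treat this as control-flow pseudocode, not an implementation.

```python
# Hypothetical sketch of a PICD-style pipeline. All helpers are
# illustrative placeholders; the paper builds on Tesseract, MLIC,
# ControlNet/StableSR, and Stable Diffusion but exposes no such API.
import torch

def picd_encode(image):
    text, boxes = run_ocr(image)                         # Tesseract in the paper
    text_bits = lossless_compress((text, boxes))         # text stream, lossless
    image_bits = mlic_encode(image, cond=(text, boxes))  # conditioned MLIC codec
    return text_bits, image_bits

def picd_decode(text_bits, image_bits, steps=50, scale=1.0):
    text, boxes = lossless_decompress(text_bits)
    coarse = mlic_decode(image_bits, cond=(text, boxes))

    # Tier 1 (domain): Stable Diffusion U-Net fine-tuned on screen
    # content with LoRA.
    unet = load_unet_with_lora("screen-content-lora")
    # Tier 2 (adaptor): hybrid ControlNet/StableSR adaptor injects the
    # coarse reconstruction and a rendered text map as low-level control.
    control = hybrid_adaptor(coarse, render_text_map(text, boxes))

    x = torch.randn_like(coarse)                 # start from noise
    for t in reversed(range(steps)):
        x = x.detach().requires_grad_(True)
        eps = unet(x, t, control=control)
        x0_hat = predict_x0(x, eps, t)           # estimate of the clean image
        # Tier 3 (instance): steer sampling toward the image and text
        # conditions via a gradient on the intermediate estimate.
        loss = mse(x0_hat, coarse) + text_loss(x0_hat, text, boxes)
        grad = torch.autograd.grad(loss, x)[0]
        x = (ddim_step(x, eps, t) - scale * grad).detach()
    return x
```

The instance-level gradient step is what costs the extra decoding time discussed below: each sampling step requires a backward pass through the U-Net.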

Experiment

Experiments were conducted on screen content datasets (SCI1K, SIQAD; trained on WebUI) and natural image datasets (Kodak, CLIC; trained on OpenImages) on an A100 GPU. Metrics cover text accuracy (Jaccard similarity), perceptual quality (FID, LPIPS, CLIP, DISTS), and PSNR, with Bjøntegaard (BD) metrics for rate-distortion comparison at bitrates from 0.005 to 0.05 bpp. Baselines include perceptual codecs (Text-Sketch, CDC, MS-ILLM, PerCo) and MSE-optimized codecs (MLIC, VTM).

PICD outperforms the other perceptual codecs in text accuracy and FID on screen content, and achieves the lowest FID on natural images, indicating strong perceptual quality. Ablation studies confirm that each conditioning level matters, with the proposed adaptor and instance guidance contributing the largest gains. Compared to direct text rendering, PICD slightly lags in text accuracy but excels in visual quality (FID and CLIP).

The experimental setup is comprehensive, covering diverse datasets and metrics, though the reliance on OCR accuracy introduces potential failure cases, as the paper notes. Decoding time (30 s with guidance) is a notable drawback compared to faster baselines like MS-ILLM (0.1 s). Overall, the results match the expectation of balancing text accuracy and perceptual quality, but practical deployment may be limited by speed and OCR dependency.
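As a concrete reference for the text-accuracy metric above, here is a minimal Jaccard-similarity sketch. Whitespace tokenization and lowercasing are my assumptions; the paper's exact tokenization protocol may differ.

```python
def jaccard_text_accuracy(reference: str, decoded: str) -> float:
    """Jaccard similarity between the word sets read (e.g. via OCR) from
    the original and the decoded image. Whitespace tokenization and
    lowercasing are assumptions, not the paper's exact protocol."""
    ref = set(reference.lower().split())
    dec = set(decoded.lower().split())
    if not ref and not dec:
        return 1.0  # both empty: treat as a perfect match
    return len(ref & dec) / len(ref | dec)

# A perfect reconstruction scores 1.0; garbled glyphs drive the score down.
print(jaccard_text_accuracy("total revenue 2024", "total revenue 2024"))  # 1.0
print(jaccard_text_accuracy("total revenue 2024", "total revemue 2O24"))  # 0.2
```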

Further Thoughts

The reliance on OCR accuracy in PICD raises questions about its robustness in real-world scenarios where text extraction might fail due to complex layouts or low-quality inputs. Future work could explore integrating more robust text detection models or fallback mechanisms to mitigate OCR failures. The decoding speed issue (30 s with instance guidance) likewise suggests a need for optimization, perhaps by exploring faster diffusion sampling techniques or hybrid approaches combining diffusion with lighter generative models.

I also find the three-tiered conditioning framework inspiring for other multimodal tasks beyond compression, such as text-to-image synthesis or video rendering, where precise control over specific elements (like text or objects) is crucial. Relating this to broader AI research, PICD's approach could intersect with advancements in vision-language models, potentially enhancing tasks like document understanding or augmented reality by ensuring high fidelity in rendered text and visuals.

Lastly, the trade-off between text accuracy and perceptual quality compared to direct text rendering methods warrants further investigation: could a hybrid of direct rendering for critical text and diffusion for background visuals offer a better balance?


