SALAD-Pan: Sensor-Agnostic Latent Adaptive Diffusion for Pan-Sharpening

1Northwestern Polytechnical University, Xi'an, Shaanxi, China

2Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China

3Zhejiang University, Hangzhou, Zhejiang, China

* Corresponding author

Abstract

Recently, diffusion models have brought novel insights to pan-sharpening and notably boosted fusion precision. However, most existing models perform diffusion in the pixel space and train distinct models for different multispectral (MS) sensors, suffering from high latency and sensor-specific limitations. In this paper, we present SALAD-Pan, a sensor-agnostic latent-space diffusion method for efficient pan-sharpening. Specifically, SALAD-Pan trains a band-wise single-channel VAE to encode high-resolution multispectral (HRMS) images into compact latent representations, supporting MS images with arbitrary channel counts and establishing a basis for acceleration. Spectral physical properties, along with the PAN and MS images, are then injected into the diffusion backbone through unidirectional and bidirectional interactive control structures respectively, achieving high-precision fusion in the diffusion process. Finally, a lightweight cross-spectral attention module at the central layer of the diffusion model reinforces spectral connections, boosting spectral consistency and further elevating fusion precision. Experimental results on GaoFen-2, QuickBird, and WorldView-3 demonstrate that SALAD-Pan outperforms state-of-the-art diffusion-based methods across all three datasets, attains a 2–3$\times$ inference speedup, and exhibits robust zero-shot (cross-sensor) capability.
Keywords: Pan-sharpening, Latent Diffusion, Sensor-Agnostic, Remote Sensing Image Fusion

Methodology

***Framework Overview.*** As illustrated in *Figure 1*, SALAD-Pan employs a two-stage training strategy. ***Stage I: Single-Channel VAE Pretraining.*** We train a band-wise single-channel VAE that encodes each spectral band of HRMS independently into compact latent representations. This band-wise processing strategy naturally supports arbitrary numbers of spectral bands, enabling cross-sensor generalization. ***Stage II: Latent Conditional Diffusion.*** With the VAE encoder frozen, we perform conditional diffusion in the latent space. The diffusion backbone receives disentangled spatial-spectral conditioning from PAN and LRMS images via dual control branches, along with sensor-aware physical metadata via text cross-attention. A lightweight region-based cross-band attention (RCBA) module at the central layer further enhances spectral consistency.
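The band-wise encoding idea of Stage I can be sketched as follows. This is an illustrative stand-in, not the paper's VAE: `encode_bandwise`, `toy_encoder`, and the 4$\times$ average-pooling are our placeholder names and operations, chosen only to show how a shared single-channel encoder handles an arbitrary number of spectral bands.

```python
import numpy as np

def encode_bandwise(hrms, encoder):
    """Encode each spectral band independently with a shared
    single-channel encoder, so any band count C is supported."""
    # hrms: (C, H, W); encoder maps (1, H, W) -> (1, H/4, W/4)
    latents = [encoder(band[None]) for band in hrms]
    return np.concatenate(latents, axis=0)  # (C, H/4, W/4)

def toy_encoder(x):
    """Placeholder for the VAE encoder: 4x4 average pooling."""
    c, H, W = x.shape
    return x.reshape(c, H // 4, 4, W // 4, 4).mean(axis=(2, 4))

ms4 = np.random.rand(4, 64, 64)  # e.g. a 4-band sensor (GaoFen-2, QuickBird)
ms8 = np.random.rand(8, 64, 64)  # e.g. an 8-band sensor (WorldView-3)
print(encode_bandwise(ms4, toy_encoder).shape)  # (4, 16, 16)
print(encode_bandwise(ms8, toy_encoder).shape)  # (8, 16, 16)
```

Because the encoder never sees more than one channel at a time, the same frozen weights apply unchanged to sensors with different band counts, which is what enables the cross-sensor (zero-shot) setting.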
SALAD-Pan Architecture Overview
*Figure 1*. Overview of SALAD-Pan. Stage I trains a band-wise single-channel VAE to map each HRMS band into a compact latent space. Stage II performs band-wise conditional latent diffusion with disentangled spatial-spectral conditioning: a spatial branch encodes PAN for spatial guidance, while a spectral branch encodes the upsampled LRMS band-by-band for spectral guidance. We use hybrid coupling: bidirectional interaction in the encoder and unidirectional (branch$\rightarrow$backbone) control in the mid block and decoder. RCBA improves inter-band consistency, and sensor-aware metadata prompts from a frozen CLIP text encoder provide additional conditioning.
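Since the bands are encoded independently, some mechanism must reconnect them; a minimal sketch of the region-based cross-band attention idea is below. The function name `rcba`, the dot-product attention form, and the region size are our assumptions for illustration, not the paper's implementation: within each spatial region, the C band latents attend to one another, so mixing happens only across bands.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rcba(z, region=4):
    """Region-based cross-band attention (illustrative sketch).
    z: (C, h, w) band-wise latents. Within each region x region
    block, the C bands form the attention tokens."""
    C, h, w = z.shape
    out = np.empty_like(z)
    for i in range(0, h, region):
        for j in range(0, w, region):
            blk = z[:, i:i+region, j:j+region].reshape(C, -1)    # C tokens
            attn = softmax(blk @ blk.T / np.sqrt(blk.shape[1]))  # (C, C), rows sum to 1
            out[:, i:i+region, j:j+region] = (attn @ blk).reshape(C, region, region)
    return out

z = np.random.rand(4, 8, 8)  # 4-band latent
print(rcba(z).shape)         # (4, 8, 8)
```

Restricting attention to small regions keeps the cost linear in spatial size while the token count stays at C, which is why such a module can remain lightweight at the network's central layer.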

Experimental Results

Quantitative comparison. Left block: reduced-resolution metrics (Q↑, SAM↓, ERGAS↓, SCC↑); right block: full-resolution metrics (D$_\lambda$↓, D$_s$↓, HQNR↑).

| Models | Pub/Year | Q↑ | SAM↓ | ERGAS↓ | SCC↑ | D$_\lambda$↓ | D$_s$↓ | HQNR↑ |
|---|---|---|---|---|---|---|---|---|
| PanNet | ICCV'17 | 0.891±0.045 | 3.613±0.787 | 2.664±0.347 | 0.943±0.018 | 0.017±0.008 | 0.047±0.014 | 0.937±0.015 |
| FusionNet | TGRS'20 | 0.904±0.092 | 3.324±0.411 | 2.465±0.603 | 0.958±0.023 | 0.024±0.011 | 0.036±0.016 | 0.940±0.019 |
| LAGConv | AAAI'22 | 0.910±0.114 | 3.104±1.119 | 2.300±0.911 | 0.980±0.043 | 0.036±0.009 | 0.032±0.016 | 0.934±0.011 |
| BiMPan | ACMM'23 | 0.915±0.087 | 2.984±0.601 | 2.257±0.552 | 0.984±0.005 | 0.017±0.019 | 0.035±0.015 | 0.949±0.026 |
| ARConv | CVPR'25 | 0.916±0.083 | 2.858±0.590 | 2.117±0.528 | **0.989±0.014** | 0.014±0.006 | 0.030±0.007 | 0.958±0.010 |
| WFANet | AAAI'25 | 0.917±0.088 | 2.855±0.618 | 2.095±0.422 | **0.989±0.011** | <u>0.012±0.007</u> | 0.031±0.009 | 0.957±0.010 |
| PanDiff | TGRS'23 | 0.898±0.090 | 3.297±0.235 | 2.467±0.166 | 0.980±0.019 | 0.027±0.108 | 0.054±0.047 | 0.920±0.077 |
| SSDiff | NeurIPS'24 | 0.915±0.086 | 2.843±0.529 | 2.106±0.416 | 0.986±0.004 | 0.013±0.005 | 0.031±0.003 | 0.956±0.010 |
| SGDiff | CVPR'25 | <u>0.921±0.082</u> | <u>2.771±0.511</u> | <u>2.044±0.449</u> | <u>0.987±0.009</u> | <u>0.012±0.005</u> | <u>0.027±0.003</u> | <u>0.960±0.006</u> |
| SALAD-Pan | | **0.924±0.064** | **2.689±0.135** | **1.839±0.211** | **0.989±0.007** | **0.010±0.008** | **0.021±0.004** | **0.965±0.007** |

Best results are in bold. Second-best results are underlined.
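For reference, two of the reduced-resolution metrics above can be computed as sketched below, assuming the conventional definitions (mean Spectral Angle Mapper in degrees, and ERGAS with the PAN/MS resolution ratio); the function names are ours, and the paper's evaluation code may differ in detail.

```python
import numpy as np

def sam_degrees(fused, ref, eps=1e-8):
    """Mean Spectral Angle Mapper in degrees; inputs are (C, H, W)."""
    f = fused.reshape(fused.shape[0], -1)
    r = ref.reshape(ref.shape[0], -1)
    cos = (f * r).sum(0) / (np.linalg.norm(f, axis=0) * np.linalg.norm(r, axis=0) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()

def ergas(fused, ref, ratio=4):
    """ERGAS (lower is better); ratio is the PAN/MS resolution ratio."""
    rmse = np.sqrt(((fused - ref) ** 2).mean(axis=(1, 2)))
    mean_ref = ref.mean(axis=(1, 2))
    return 100.0 / ratio * np.sqrt(((rmse / mean_ref) ** 2).mean())

ref = np.random.rand(4, 64, 64) + 0.5
print(sam_degrees(ref, ref))  # ~0 for a perfect fusion
print(ergas(ref, ref))        # 0 for a perfect fusion
```

Both metrics are zero for a perfect reconstruction, which is why lower SAM and ERGAS in the table indicate better spectral and global fidelity.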

Visualizations

Interactive before/after sliders (20 examples) compare SALAD-Pan fusion results against the corresponding PAN and LRMS inputs.

BibTeX

@misc{li2026_saladpan,
      title={SALAD-Pan: Sensor-Agnostic Latent Adaptive Diffusion for Pan-Sharpening}, 
      author={Junjie Li and Congyang Ou and Haokui Zhang and Guoting Wei and Shengqin Jiang and Ying Li and Chunhua Shen},
      year={2026},
      eprint={2602.04473},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.04473}, 
}