This paper conducts an extensive review of natural image foundation models for seismic data processing, demonstrating that hierarchical models like Swin and ConvNeXt, especially with self-supervised pre-training, outperform non-hierarchical ones in demultiple, interpolation, and denoising tasks, while highlighting the benefits and limitations of natural image pre-training for seismic applications.
Foundation Model, Pre-training, Fine-tuning, Self-Supervised Learning, Transformer, CNN
Fabian Fuchs, Mario Ruben Fernandez, Norman Ettrich, Janis Keuper
Fraunhofer-Institut für Techno- und Wirtschaftsmathematik, DWS, University of Mannheim, IMLA, Offenburg University
Generated by grok-3
Background Problem
Seismic data processing is critical for generating high-quality subsurface images used in geoscience applications like hydrocarbon exploration and geothermal characterization. However, traditional methods struggle with noisy, damaged data and rely on manual, time-consuming workflows. While deep learning (DL) has offered promising alternatives, most DL approaches in seismic processing use specialized neural networks trained on synthetic data, which often fail to capture the diversity of real field data. The success of foundation models (FMs) in natural image processing has sparked interest in their application to seismic tasks, prompting this study to explore whether natural image FMs can effectively address key seismic processing challenges such as demultiple, interpolation, and denoising, and to identify optimal model characteristics for these tasks.
Method
The core methodology benchmarks a range of natural image foundation models (FMs) for seismic processing tasks within an encoder-decoder framework. The encoder, pre-trained on natural image datasets such as ImageNet, serves as the FM, while a UNet-style decoder adapts its features to the seismic tasks (demultiple, interpolation, denoising). Models are categorized as hierarchical (e.g., Swin, ConvNeXt) or non-hierarchical (e.g., ViT) according to their feature map resolutions, and they vary in architecture (transformer, convolutional, hybrid) and pre-training method (supervised, self-supervised, contrastive learning, etc.). Three downstream training strategies are tested: ‘frozen encoder’ (only the decoder is trained), ‘fine-tuned encoder’ (the pre-trained encoder is fine-tuned jointly with the decoder), and ‘non-pre-trained encoder’ (the full model is trained from scratch). Downstream training is supervised with a pixel-level loss on synthetic and open-source seismic datasets, using fixed hyperparameters to keep the wide range of models comparable. The approach aims to evaluate how pre-training techniques and architectural choices affect performance and efficiency in seismic processing.
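To make the encoder-decoder setup and the three downstream training strategies concrete, a minimal PyTorch sketch is given below. It uses toy stand-in modules: the paper's actual encoders are ImageNet pre-trained backbones such as Swin V2, ConvNeXt V2, or ViT, and its decoder is UNet-style; the class names, layer sizes, optimizer, and loss norm here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins: the paper's encoders are ImageNet pre-trained backbones
# (Swin, ConvNeXt, ViT, ...) and its decoder is UNet-style; these tiny modules
# only illustrate the wiring and the three training strategies.
class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(32, 1, 2, stride=2),
        )

    def forward(self, features):
        return self.net(features)

def build_model(strategy, pretrained_encoder_state=None):
    """Return (model, trainable_params) for one of the three downstream strategies."""
    encoder, decoder = TinyEncoder(), TinyDecoder()
    if strategy in ("frozen", "fine-tuned") and pretrained_encoder_state is not None:
        encoder.load_state_dict(pretrained_encoder_state)  # natural-image pre-training
    if strategy == "frozen":
        for p in encoder.parameters():  # encoder weights stay fixed; only the decoder learns
            p.requires_grad = False
    # "fine-tuned": encoder and decoder both train; "from-scratch": no weights loaded.
    model = nn.Sequential(encoder, decoder)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return model, trainable

# Illustrative supervised step on random stand-in gathers (input -> target mapping)
model, params = build_model("frozen", pretrained_encoder_state=TinyEncoder().state_dict())
optimizer = torch.optim.AdamW(params, lr=1e-4)   # optimizer and learning rate are assumptions
x, target = torch.randn(2, 1, 128, 128), torch.randn(2, 1, 128, 128)
loss = nn.functional.l1_loss(model(x), target)   # pixel-level loss; the exact norm is an assumption
loss.backward()
optimizer.step()
```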
Experiment
The experiments evaluate foundation models (FMs) on three seismic tasks (demultiple, interpolation, and denoising), using synthetic data for demultiple and the open-source 2007 BP Anisotropic Velocity Benchmark dataset for interpolation and denoising, with qualitative demultiple results on the North Sea Volve field data. The setup compares hierarchical (e.g., Swin, ConvNeXt) and non-hierarchical (e.g., ViT) models, various pre-training methods, and downstream training strategies, measuring performance with metrics such as SSIM (Structural Similarity Index Measure). Results show that hierarchical models, particularly Swin (V2) and ConvNeXt (V2) with self-supervised pre-training, outperform non-hierarchical ones, achieving the highest combined SSIM scores (with Swin V2 as the top performer). Pre-training on natural images proves beneficial, with fine-tuning often surpassing training from scratch, although bridging the domain gap requires sufficient task-specific data. The experimental design is comprehensive in model variety but limited by fixed hyperparameters and a focus on in-distribution generalization for interpolation and denoising, with qualitative results indicating potential out-of-distribution challenges. Inference times (e.g., 2.5 ms per gather for Swin V2) suggest practical applicability, although these runtimes are not optimized. Overall, while improvements are evident, the reliance on natural image pre-training and synthetic data raises questions about real-world generalization, only partially addressed by the qualitative field data tests.
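As a small illustration of the evaluation metric, the sketch below computes SSIM between a predicted and a target gather with scikit-image; the array names, normalization, and data-range handling are assumptions for illustration and do not reproduce the paper's evaluation pipeline.

```python
import numpy as np
from skimage.metrics import structural_similarity

def gather_ssim(predicted, target):
    """SSIM between predicted and target 2D gathers (float arrays)."""
    # data_range should span the amplitude range of the data; using the joint
    # min/max of both gathers here is an assumption, not the paper's choice.
    data_range = float(max(predicted.max(), target.max()) - min(predicted.min(), target.min()))
    return structural_similarity(target, predicted, data_range=data_range)

# Illustrative usage with random stand-in gathers (not real seismic data)
pred = np.random.randn(256, 256).astype(np.float32)
true = np.random.randn(256, 256).astype(np.float32)
print(f"SSIM: {gather_ssim(pred, true):.3f}")
```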
Further Thoughts
The findings of this paper open up several avenues for deeper exploration, particularly around the domain adaptation challenges between natural images and seismic data. The superior performance of hierarchical models like Swin suggests that preserving spatial resolution is critical for pixel-level tasks in seismic processing, which aligns with trends in biomedical imaging but contrasts with the non-hierarchical focus in some seismic foundation model research (e.g., ViT-based approaches). This raises a broader question: could a hybrid approach, combining hierarchical feature extraction with seismic-specific pre-training, yield even better results? Additionally, the reliance on natural image datasets for pre-training, while practical due to data availability, may be suboptimal given the structural differences of seismic data. Future work could explore self-supervised pre-training on large, unlabeled seismic field datasets to capture domain-specific features, potentially integrating techniques from geophysical signal processing to handle noise characteristics unique to seismic data. Another interesting connection is to federated learning, in which seismic data from diverse geological regions could be used to train a unified foundation model without centralizing sensitive data, addressing the generalization issues noted in the paper. Finally, the qualitative results on out-of-distribution data (e.g., the North Sea Volve field) hint at robustness challenges that could be investigated further through adversarial training or domain generalization techniques, ensuring that models perform well not just on synthetic or in-distribution data but also in real-world, varied seismic environments.