Journal article

VTFusion: A Vision–Text Multimodal Fusion Network for Few-Shot Anomaly Detection

Abstract

Few-shot anomaly detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. While recent methods have integrated textual semantics to complement visual data, they predominantly rely on features from models pretrained on natural scenes, thereby neglecting the granular, domain-specific semantics essential for industrial inspection. Furthermore, prevalent fusion strategies often resort to superficial concatenation, failing to address the inherent semantic misalignment between the visual and textual modalities, which compromises robustness against cross-modal interference. To bridge these gaps, this study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD. The framework rests on two core designs. First, adaptive feature extractors for both the image and text modalities are introduced to learn task-specific representations, bridging the domain gap between pretrained models and industrial data; this is further augmented by generating diverse synthetic anomalies to enhance feature discriminability. Second, a dedicated multimodal prediction fusion module is developed, comprising a fusion block that facilitates rich cross-modal information exchange and a segmentation network that generates refined pixel-level anomaly maps under multimodal guidance. VTFusion significantly advances FSAD performance, achieving image-level area under the receiver operating characteristic curve (AUROC) scores of 96.8% and 86.2% in the 2-shot scenario on the MVTec AD and VisA datasets, respectively. In addition, VTFusion achieves an area under the per-region overlap curve (AUPRO) of 93.5% on a real-world dataset of industrial automotive plastic parts introduced in this article, demonstrating its practical applicability in demanding industrial scenarios.
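
The abstract's multimodal prediction fusion module, a fusion block for cross-modal information exchange followed by a segmentation network that outputs pixel-level anomaly maps, can be pictured with a short sketch. The PyTorch code below is an illustrative assumption of how such a module might be wired (cross-attention between visual patch tokens and text tokens, then a lightweight segmentation head); the class names, dimensions, and the specific use of cross-attention are hypothetical and do not reproduce the authors' implementation.

```python
# Hypothetical sketch of a vision-text fusion block and a small segmentation
# head, in the spirit of the multimodal prediction fusion module described in
# the abstract. All names, dimensions, and design choices are assumptions.
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Exchanges information between visual patch tokens and text tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Vision queries attend to text keys/values, and vice versa.
        self.vis_from_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_from_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # vis_tokens: (B, N_patches, dim), txt_tokens: (B, N_text, dim)
        v2, _ = self.vis_from_txt(vis_tokens, txt_tokens, txt_tokens)
        t2, _ = self.txt_from_vis(txt_tokens, vis_tokens, vis_tokens)
        # Residual connections keep the original unimodal information.
        return self.norm_v(vis_tokens + v2), self.norm_t(txt_tokens + t2)


class SegmentationHead(nn.Module):
    """Maps fused patch tokens back to a pixel-level anomaly map."""

    def __init__(self, dim: int = 256, patch_grid: int = 14, out_size: int = 224):
        super().__init__()
        self.patch_grid = patch_grid
        self.proj = nn.Conv2d(dim, 1, kernel_size=1)
        self.upsample = nn.Upsample(size=(out_size, out_size),
                                    mode="bilinear", align_corners=False)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = vis_tokens.shape
        # Reshape the token sequence into a 2-D feature map before decoding.
        fmap = vis_tokens.transpose(1, 2).reshape(b, d, self.patch_grid, self.patch_grid)
        return torch.sigmoid(self.upsample(self.proj(fmap)))  # (B, 1, H, W)


if __name__ == "__main__":
    fusion = CrossModalFusionBlock()
    head = SegmentationHead()
    vis = torch.randn(2, 14 * 14, 256)   # toy visual patch features
    txt = torch.randn(2, 8, 256)         # toy text prompt embeddings
    fused_vis, _ = fusion(vis, txt)
    anomaly_map = head(fused_vis)
    print(anomaly_map.shape)             # torch.Size([2, 1, 224, 224])
```

In this toy configuration the anomaly map is produced only from the text-conditioned visual tokens; richer variants could also feed the updated text tokens back into the decoder, but how the published method does so is not specified in the abstract.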

Authors

Jiang Y; Cao Y; Cheng Y; Zhang Y; Shen W

Journal

IEEE Transactions on Cybernetics, Vol. PP, No. 99, pp. 1–10

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Publication Date

January 21, 2026

DOI

10.1109/tcyb.2026.3651630

ISSN

2168-2267
