Journal article

Reducing modal differences in zero-shot anomaly detection based on vision-language generation model

Abstract

Zero-shot anomaly detection methods based on vision-language models rely on alignment between image and text. These methods ignore the inherent differences between the two modalities, which hinders further improvement of cross-modal alignment. This paper reduces the modal difference between image and text by using a guiding vision feature and a text feature from a pre-trained vision-language generation model. A vision-perception text embedding is constructed by adding the guiding vision feature to a weight-shared text prompt, and a text-perception vision embedding is extracted by a vision-text fusion module designed to help the visual modality perceive textual information locally. Anomaly regions are detected by the cosine similarity between the cross-modal perception embeddings. Zero-shot anomaly detection performance is evaluated on five publicly available industrial anomaly detection datasets and a real-world dataset of automotive plastic parts. Experimental results show that the proposed method achieves highly competitive anomaly detection performance on multiple evaluation metrics.
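
The abstract describes the scoring step only at a high level. Below is a minimal sketch of how anomaly scores might be computed from such cross-modal perception embeddings via cosine similarity; the tensor shapes, the two-prompt (normal/abnormal) setup, and the function name are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: cosine-similarity anomaly scoring between
# patch-level vision embeddings and two text prompt embeddings.
import torch
import torch.nn.functional as F

def anomaly_map(patch_embeds: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
    """
    patch_embeds: (H*W, D) text-perception vision embeddings per image patch.
    text_embeds:  (2, D) vision-perception text embeddings for the
                  "normal" and "abnormal" prompts (assumed ordering).
    Returns an (H*W,) map of per-patch anomaly probabilities.
    """
    # Normalize so the dot product equals cosine similarity.
    patch_embeds = F.normalize(patch_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Cosine similarity of every patch to the normal / abnormal prompts.
    sims = patch_embeds @ text_embeds.t()          # (H*W, 2)
    # Softmax over the two prompts; keep the "abnormal" probability.
    return sims.softmax(dim=-1)[:, 1]

# Toy usage with random embeddings (D = 512, 14x14 patch grid assumed).
patches = torch.randn(14 * 14, 512)
prompts = torch.randn(2, 512)   # row 0: normal prompt, row 1: abnormal prompt
scores = anomaly_map(patches, prompts)
print(scores.shape)             # torch.Size([196])
```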

Authors

Song Y; Shen W; Pan B; Wu Q; Gu D

Journal

Engineering Applications of Artificial Intelligence, Vol. 162

Publisher

Elsevier

Publication Date

December 22, 2025

DOI

10.1016/j.engappai.2025.112541

ISSN

0952-1976
