Reducing modal differences in zero-shot anomaly detection based on vision-language generation model

Abstract

Zero-shot anomaly detection methods based on vision-language models rely on the alignment between image and text. These methods overlook the inherent differences between the two modalities, which limits how well the modalities can be aligned. This paper reduces the modal differences between image and text by using the guiding vision feature and text feature from a pre-trained vision-language generation model. A vision-perception text embedding is constructed by adding the guiding vision feature to a weight-shared text prompt. A text-perception vision embedding is extracted by a vision-text fusion module, which is designed to help the visual modality perceive textual information locally. Anomaly regions are detected by the cosine similarity between the cross-modal perception embeddings. Zero-shot anomaly detection performance is evaluated on five publicly available industrial anomaly detection datasets and on a real-world dataset of automotive plastic parts. Experimental results show that the proposed method achieves highly competitive anomaly detection performance on multiple evaluation metrics.
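
As an illustration of the final scoring step only, the following minimal sketch shows how a per-patch anomaly map could be derived from cosine similarity between patch-level vision embeddings and text embeddings. It is not the implementation used in this paper. The function names `cosine_similarity` and `anomaly_map`, the use of one "normal" and one "abnormal" text embedding, the softmax over the two similarities, and the `temperature` value are all assumptions made for the example; the abstract specifies only that cosine similarity between the cross-modal perception embeddings is used.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Pairwise cosine similarity between rows of a (N, D) and rows of b (M, D)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return a @ b.T  # shape (N, M)


def anomaly_map(patch_emb, text_emb, grid_hw, temperature=0.07):
    """Hypothetical scoring step (an assumption, not the paper's code).

    patch_emb: (H*W, D) patch-level vision embeddings.
    text_emb:  (2, D) text embeddings, row 0 = "normal", row 1 = "abnormal".
    grid_hw:   (H, W) layout of the patch grid.
    Returns an (H, W) map of abnormal-class probabilities in [0, 1].
    """
    h, w = grid_hw
    logits = cosine_similarity(patch_emb, text_emb) / temperature  # (H*W, 2)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=1, keepdims=True)  # softmax over the two prompts
    return prob[:, 1].reshape(h, w)


if __name__ == "__main__":
    # Toy data: three patches resemble the "normal" prompt, one the "abnormal" prompt.
    text = np.array([[1.0, 0.0], [0.0, 1.0]])
    patches = np.array([[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0]])
    print(anomaly_map(patches, text, (2, 2)))
```

In practice the patch-level map would be upsampled to the input resolution before thresholding or evaluation. That step is omitted here because the abstract does not describe it.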