The manufacturing paradigm is evolving from conventional automation to Agentic Smart Manufacturing (ASM), where generative Artificial Intelligence (AI) and decentralized multi-agent decision-making frameworks reconfigure systems into autonomous ecosystems. These ecosystems enable agents to perceive, decide, and act in real time through multimodal fusion, predictive analytics, and collaborative problem-solving. Integrating multimodal data within Agentic Smart Manufacturing Systems (ASMS) has emerged as a crucial means of providing the intelligence needed to optimize resilience and innovation without centralized oversight. Multi-Modal Deep Learning (MMDL) offers enhanced capabilities for comprehensive information processing and decision support by fusing diverse data modalities, such as text, images, and audio. However, fundamental differences in format, scale, and structure across modalities give rise to data heterogeneity; this heterogeneity, together with the difficulty of accurately mapping cross-modal relationships and the resulting computational complexity, poses major obstacles to the practical application of MMDL in ASM. This paper begins with an overview of MMDL advances in the industrial context, covering multimodal representation, alignment, fusion, and optimization. It then reviews the evolution of MMDL across the four layers of the manufacturing system architecture, from the equipment layer to the supply chain layer. The paper further highlights the critical challenges in data management, algorithm development, and deployment that MMDL encounters in ASMS applications, and proposes potential directions for future research. By providing a robust theoretical foundation and practical guidance for applying MMDL in ASMS, this paper aims to drive the intelligent transformation and sustainable development of manufacturing systems.
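To make the fusion step referenced above concrete, the following is a minimal sketch of late fusion across text, image, and audio features, in which per-modality projections map heterogeneous inputs into a shared space before a joint head produces a decision. The `MultimodalFusion` class, all dimensionalities, and the projection-then-concatenate design are illustrative assumptions for exposition, not an architecture drawn from the surveyed literature.

```python
# Illustrative sketch (not from the paper): late fusion of three modalities
# via learned projections into a shared embedding space. All names and
# dimensions below are hypothetical.
import torch
import torch.nn as nn


class MultimodalFusion(nn.Module):
    """Project heterogeneous modality features to a shared space, then fuse."""

    def __init__(self, text_dim=768, image_dim=2048, audio_dim=128,
                 shared_dim=256, num_classes=10):
        super().__init__()
        # Per-modality projections address heterogeneity in format and scale.
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        # Fusion head over the concatenated shared-space representations.
        self.head = nn.Sequential(
            nn.Linear(3 * shared_dim, shared_dim),
            nn.ReLU(),
            nn.Linear(shared_dim, num_classes),
        )

    def forward(self, text_feat, image_feat, audio_feat):
        # Concatenate aligned shared-space embeddings and classify.
        z = torch.cat([
            self.text_proj(text_feat),
            self.image_proj(image_feat),
            self.audio_proj(audio_feat),
        ], dim=-1)
        return self.head(z)


# Usage with random stand-in features for a batch of 4 samples.
model = MultimodalFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 2048), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 10])
```

This late-fusion design is only one of the fusion strategies a survey of MMDL would cover; early fusion and attention-based cross-modal alignment trade off differently against the heterogeneity and computational-cost challenges noted above.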