Skin lesion segmentation is a critical step in the early diagnosis of skin cancer, particularly melanoma, which has a high mortality rate when not detected early. However, indistinct lesion boundaries, lighting variations, and small lesion sizes often hinder accurate segmentation in dermoscopic images. In this paper, we propose UCMViT, a deep learning framework for skin lesion segmentation that addresses these challenges by improving feature extraction and integration. UCMViT incorporates a transformer-based MVT-Fusion Module, which uses a pre-trained Mobile Vision Transformer to capture long-range dependencies as multi-scale feature maps and fuses them with the encoder’s local features. In addition, a Semantics and Detail Infusion (SDI) module improves the fusion of semantic and detail information across encoder-decoder stages. To improve robustness, UCMViT applies data augmentation during training, including random flipping, rotation, and color jitter, and uses a test-time augmentation pipeline at inference. A novel inference-time strategy detects small lesions and crops the image around them, so that segmentation focuses on these challenging regions. Evaluated on the ISIC 2017 dataset, UCMViT achieves a Dice Similarity Coefficient (DSC) of 0.9266 and a sensitivity (SE) of 0.9185, outperforming prior methods such as U-Net, Attention Swin U-Net, and UCM-NetV2. These results demonstrate UCMViT’s effectiveness in improving segmentation accuracy and robustness, making it a promising solution for skin lesion analysis in clinical settings.
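For reference, the two reported metrics follow their standard definitions on binary masks: DSC = 2TP / (2TP + FP + FN) and SE = TP / (TP + FN), where TP, FP, and FN count true-positive, false-positive, and false-negative pixels. The sketch below illustrates these definitions in plain Python. It is not the evaluation code used for the results above; the function names, the list-of-lists mask format, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed mask format: 2-D sequence of 0/1 values (1 = lesion, 0 = background).
Mask = Sequence[Sequence[int]]


def confusion_counts(pred: Mask, target: Mask) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) over all pixels of two equally sized binary masks."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1          # lesion pixel correctly predicted
            elif p and not t:
                fp += 1          # background pixel predicted as lesion
            elif t and not p:
                fn += 1          # lesion pixel missed by the prediction
    return tp, fp, fn


def dice_coefficient(pred: Mask, target: Mask) -> float:
    """Dice Similarity Coefficient: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2 * tp / denom


def sensitivity(pred: Mask, target: Mask) -> float:
    """Sensitivity (recall): TP / (TP + FN)."""
    tp, _, fn = confusion_counts(pred, target)
    positives = tp + fn
    # Assumed convention: with no lesion pixels in the target, nothing was missed.
    return 1.0 if positives == 0 else tp / positives


if __name__ == "__main__":
    # Toy 2x3 masks: TP = 2, FP = 1, FN = 1.
    pred = [[1, 1, 0], [0, 1, 0]]
    target = [[1, 0, 0], [0, 1, 1]]
    print(f"DSC = {dice_coefficient(pred, target):.4f}")
    print(f"SE  = {sensitivity(pred, target):.4f}")
```

In practice these quantities are usually computed on thresholded network outputs with array libraries; the pixel loop here is only meant to make the definitions explicit.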