Fuelled by advances in multimedia technologies, users across the world have witnessed the proliferation of online videos. Compared with the visual content of these videos, the textual content, such as titles, tags, and descriptions, has been more broadly exploited in real-world video data mining and information retrieval tasks. To enhance the understanding of videos and improve performance on tasks such as automatic video annotation, video clustering, and cross-modal tag cleansing, the textual and visual content of videos have been combined through various methods. However, the absence of an empirical study on the properties of these two kinds of content leaves such methods without a solid foundation for achieving satisfactory performance. Therefore, in this paper, we conduct such a study to verify the properties of textual content, and we draw insights from these analyses to promote further developments in video data mining that combine the two kinds of content.