A topic modeling-based approach to executable file malware detection

Abstract

Malware is any malicious software used to harm or exploit a device, service, or network. Its presence in a system can disrupt operations and the availability of information in networks while also jeopardizing the integrity and confidentiality of that information, which poses a grave problem for sensitive and critical operations. Traditional approaches to malware detection, often used by antivirus software, are not robust in detecting previously unseen malware. As a result, they can often be circumvented by finding and exploiting vulnerabilities of the detection system. This study applies natural language processing techniques, motivated by recent advances in the field, to analyze the strings present in the executable files of malware. Specifically, we propose a topic modeling-based approach in which the strings of a malware executable file are treated as a language abstraction from which relevant topics are extracted; these topics can then be used to improve a classifier’s detection performance. Finally, through experiments on a publicly available dataset, the proposed approach is shown to outperform traditional techniques in detection ability, specifically in terms of precision and accuracy.
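To make the idea of treating an executable's strings as a language abstraction concrete, the following is a minimal sketch of the preprocessing such an approach might start from: pulling printable strings out of raw bytes and turning them into a bag of word-like tokens that a topic model could consume. The function names, the minimum string length of 4, and the tokenization rule are illustrative assumptions, not details taken from the study, and the abstract does not state which topic model or classifier is used.

```python
import re
from collections import Counter

# Assumption: a token is a letter followed by at least two letters, digits, or underscores.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,}")


def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least `min_len` printable ASCII bytes.

    This is similar in spirit to the Unix `strings` utility; the
    threshold of 4 is an assumed default, not a value from the paper.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


def to_tokens(strings: list) -> list:
    """Split extracted strings into lowercase word-like tokens."""
    tokens = []
    for s in strings:
        tokens.extend(t.lower() for t in TOKEN_RE.findall(s))
    return tokens


def bag_of_words(tokens: list) -> Counter:
    """Count token occurrences; one such count vector per file would
    form a row of the document-term matrix given to a topic model."""
    return Counter(tokens)


if __name__ == "__main__":
    # Hypothetical byte sequence standing in for the contents of an executable.
    sample = b"\x00\x01MZ\x90\x00kernel32.dll\x00\x02CreateFileA\x00\xffabc\x00GetProcAddress\x00"
    print(bag_of_words(to_tokens(extract_strings(sample))))
```

Under these assumptions, each file becomes one "document" of tokens. The per-file topic proportions inferred by a topic model fitted on such documents could then serve as input features for a detection classifier.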