Chapter

A Comparative Study on Source Code Attribution Using AI: Datasets, Features, and Techniques

Abstract

In recent years, applying artificial intelligence (AI) techniques to source code authorship attribution has attracted significant attention from academia and industry. Accurately attributing source code to its original author matters for intellectual property protection, cybersecurity, and software forensics. Code-generating AI tools such as ChatGPT present new challenges and opportunities in distinguishing human-written from machine-generated code. This article reviews existing research on source code authorship attribution and presents a series of experiments on a dataset of 600 source code samples. The study extracts lexical and layout features, applies ranking methods, and employs several machine learning models (SVM, LR, MLP, XGBoost, and RF) and deep learning models (LSTM, RNN, and CNN). The objectives are to identify the best model for determining whether a source code sample was written by a human or by ChatGPT-4, and to provide insights into two characteristics of human authors: gender and region. The results show up to 94.7% accuracy with RF using TF-IDF and 95% accuracy with the CNN model. Finally, we identify emerging trends and potential future research directions in AI for authorship attribution.
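
As a rough illustration of what layout features and TF-IDF weighting over code tokens can look like, the following minimal Python sketch uses only the standard library. It is not the chapter's implementation: the tokenizer regex, the four layout measures, the smoothed IDF formula, and the function names `layout_features`, `tokenize`, and `tfidf` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: identifiers/keywords, integer literals, or single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_0-9]")


def layout_features(source: str) -> dict:
    """Compute a few whitespace-based style measures (an assumed, illustrative set)."""
    lines = source.splitlines()
    n = len(lines) or 1  # avoid division by zero on empty input
    return {
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / n,
        "tab_indent_ratio": sum(1 for ln in lines if ln.startswith("\t")) / n,
        "space_indent_ratio": sum(1 for ln in lines if ln.startswith(" ")) / n,
        "mean_line_length": sum(len(ln) for ln in lines) / n,
    }


def tokenize(source: str) -> list:
    """Split source text into lexical tokens using TOKEN_RE."""
    return TOKEN_RE.findall(source)


def tfidf(docs: list) -> list:
    """Return one {token: weight} dict per document.

    Uses relative term frequency and a smoothed IDF,
    ln((1 + N) / (1 + df)) + 1, which is one common variant (an assumption here).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))  # count each token once per document
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append({
            tok: (c / total) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, c in counts.items()
        })
    return vectors


if __name__ == "__main__":
    samples = ["int main() {\n\treturn 0;\n}\n", "def main():\n    return 0\n\n"]
    for sample, vector in zip(samples, tfidf(samples)):
        print(layout_features(sample))
        print(sorted(vector, key=vector.get, reverse=True)[:3])
```

Feature dictionaries of this kind could then be converted to numeric vectors and passed to a classifier such as a random forest; that step is omitted here because the abstract does not describe its configuration.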

Authors

Alalawi S; Alrabaee S; Khan W; Al-Azzoni I; Parambil MMA

Book title

Security and Privacy in Communication Networks

Series

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

Volume

627

Pagination

pp. 332-353

Publisher

Springer Nature

Publication Date

January 1, 2026

DOI

10.1007/978-3-031-94445-1_18