ABSTRACT In today's distributed computing environments, the rapid generation of large-scale data from diverse sources poses significant storage, management, and processing challenges, particularly for traditional relational databases. Hadoop has emerged as a widely adopted framework for handling such data through parallel processing across distributed clusters. Despite its advantages in scalability, flexibility, and fault tolerance, Hadoop suffers from inefficiencies such as high data access latency, redundant computation, and I/O overhead, which degrade overall system performance. To mitigate these issues, researchers have proposed various caching mechanisms aimed at reducing data access time, enhancing data locality, eliminating duplicate computations, and optimizing resource utilization. This paper provides a comprehensive survey and a novel classification of existing caching strategies for Hadoop, categorizing them according to the specific performance bottlenecks they address. A detailed comparative analysis is then presented based on key caching characteristics such as cached item type, cache management policy, replacement strategy, and access pattern. To assess the effectiveness of these caching mechanisms, their impact on key Hadoop performance metrics is evaluated. Additionally, statistical insights are presented, highlighting the percentage of reviewed studies that address each Hadoop performance challenge and the frequency with which each performance metric is used for evaluation. Finally, this survey identifies hybrid caching as a promising future direction and proposes a novel approach, termed Hybrid Intelligent Cache (HIC), as an example. HIC combines the strengths of two previously developed methods from distinct categories. The first is the Hybrid Support Vector Machine–Least Recently Used (H-SVM-LRU) algorithm, which enhances the traditional LRU cache replacement strategy by employing a Support Vector Machine (SVM) to predict future data access patterns and guide eviction decisions. The second is Cache Locality with Q-Learning in MapReduce Scheduling (CLQLMRS), a reinforcement learning–based scheduling technique that optimizes task allocation by maximizing both cache locality and data locality. Experimental results demonstrate that HIC yields an average 31.2% improvement in job execution time, marking a significant advancement in intelligent caching for Hadoop ecosystems.
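To make the SVM-guided eviction idea concrete, the sketch below illustrates in Python how a trained SVM classifier could steer an otherwise standard LRU cache, in the spirit of H-SVM-LRU. This is a minimal sketch under stated assumptions, not the published algorithm: the `SVMGuidedLRUCache` name, the feature set (access count, recency, block size), and the hot/cold class labels are all illustrative choices not taken from the paper.

```python
# Minimal sketch of SVM-guided LRU eviction (in the spirit of H-SVM-LRU).
# Feature choices, labels, and names are illustrative assumptions.
from collections import OrderedDict
from sklearn.svm import SVC


class SVMGuidedLRUCache:
    """LRU cache that, when full, prefers to evict items an SVM
    classifier predicts are unlikely to be re-accessed soon."""

    def __init__(self, capacity, classifier):
        self.capacity = capacity
        self.classifier = classifier  # fitted SVC: features -> 0 (cold) / 1 (hot)
        self.store = OrderedDict()    # key -> (value, features), LRU order

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)   # standard LRU recency update
        return self.store[key][0]

    def put(self, key, value, features):
        if key in self.store:
            self.store.move_to_end(key)
        elif len(self.store) >= self.capacity:
            self._evict()
        self.store[key] = (value, features)

    def _evict(self):
        # Scan from least- to most-recently used and evict the first item
        # the SVM labels "cold"; fall back to plain LRU if all look "hot".
        for key, (_, features) in self.store.items():
            if self.classifier.predict([features])[0] == 0:
                del self.store[key]
                return
        self.store.popitem(last=False)  # plain LRU fallback


# Toy usage with hypothetical (access_count, recency, block_size) features.
X = [[1, 90, 64], [50, 2, 8], [3, 70, 32], [40, 5, 16]]
y = [0, 1, 0, 1]  # 0 = cold (evictable), 1 = hot (likely reused)
cache = SVMGuidedLRUCache(capacity=2, classifier=SVC().fit(X, y))
cache.put("blockA", b"...", [2, 80, 64])
cache.put("blockB", b"...", [45, 3, 8])
cache.put("blockC", b"...", [1, 95, 32])  # triggers SVM-guided eviction
```

The design point this sketch captures is that the classifier only reorders eviction preferences; recency tracking and the LRU fallback keep behavior sane when the model's predictions are uninformative. The scheduling side of HIC (CLQLMRS) is not shown here, as the abstract gives only its objective of maximizing cache and data locality.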