A Content-Based Indexing Scheme for Large-Scale...

A Content-Based Indexing Scheme for Large-Scale Unstructured Data

Abstract

The sheer volume of multimedia contents generated by todays Internet services are stored in the cloud. The traditional indexing method associating the user-generated metadata with the content is vulnerable to the inaccuracy caused by the low quality of the metadata. While the content-based indexing does not depend on the error-prone metadata. However, the state-of-the-art research focuses on developing descriptive features and miss the system-oriented considerations when incorporating these features into the practical cloud computing systems. We propose an Update-Efficient and Parallel-Friendly content-based multimedia indexing system, called Partitioned Hash Forest (PHF). The PHF system incorporates the state-of-the-art content-based indexing models and multiple system-oriented optimizations. PHF contains an approximate content-based index and leverages the hierarchical memory system to support the high volume of updates. Additionally, the content-aware data partitioning and lock-free concurrency management module enable the parallel processing of the concurrent user requests. We evaluate PHF in terms of indexing accuracy and system efficiency by comparing it with the state-of-the-art content-based indexing algorithm and its variances. We achieve the significantly better accuracy with less resource consumption, around 37% faster in update processing and up to 2.5X throughput speedup in a multicore platform comparing to other parallel-friendly designs.