Mining Top-k Sequential Patterns in Database Graphs:A New Challenging Problem and a Sampling-based Approach
Abstract
In many real world networks, a vertex is usually associated with a
transaction database that comprehensively describes the behaviour of the
vertex. A typical example is the social network, where the behaviour of every
user is depicted by a transaction database that stores his daily posted
contents. A transaction database is a set of transactions, where a transaction
is a set of items. Every path of the network is a sequence of vertices that
induces multiple sequences of transactions. The sequences of transactions
induced by all of the paths in the network forms an extremely large sequence
database. Finding frequent sequential patterns from such sequence database
discovers interesting subsequences that frequently appear in many paths of the
network. However, it is a challenging task, since the sequence database induced
by a database graph is too large to be explicitly induced and stored. In this
paper, we propose the novel notion of database graph, which naturally models a
wide spectrum of real world networks by associating each vertex with a
transaction database. Our goal is to find the top-k frequent sequential
patterns in the sequence database induced from a database graph. We prove that
this problem is #P-hard. To tackle this problem, we propose an efficient
two-step sampling algorithm that approximates the top-k frequent sequential
patterns with provable quality guarantee. Extensive experimental results on
synthetic and real-world data sets demonstrate the effectiveness and efficiency
of our method.