by Calders, Toon
Reference: Lecture Notes in Business Information Processing, 172 LNBIP, pages 1-32
Publication: Published, 2014
Peer-reviewed article
Abstract: We present an overview of data mining techniques for extracting knowledge from large databases, with a special emphasis on the unsupervised technique of pattern mining. Pattern mining is often defined as the automatic search for interesting patterns and regularities in large databases. In practice, this definition most often comes down to listing all patterns that exceed a user-defined threshold for a fixed interestingness measure. The simplest such problem is that of listing all frequent itemsets: given a database of sets, called transactions, list all sets of items that are subsets of at least a given number of the transactions. We revisit the two main strategies for mining all frequent itemsets, the breadth-first Apriori algorithm and the depth-first FPGrowth, after which we show the main issues that arise when extending to more complex patterns, such as listing all frequent subsequences or subgraphs. In the second part of the paper we look into the pattern explosion problem. Due to redundancy among patterns, the list of all patterns satisfying the frequency threshold is most often so large that post-processing is required to extract useful information from it. We give an overview of some recent techniques to reduce the redundancy in pattern collections, using statistical methods to model the expectation of a user given background knowledge on the one hand, and the minimal description length principle on the other. © Springer International Publishing Switzerland 2014.
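Illustrative note: the frequent itemset problem stated in the abstract can be made concrete with a minimal sketch of a breadth-first, Apriori-style search. This is not taken from the paper; the function name `frequent_itemsets`, the parameter `min_support`, and the toy data are assumptions chosen for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset that is a subset of
    at least `min_support` transactions (hypothetical helper, sketch only)."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s) >= min_support}

    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return result


# Toy database of five transactions (made-up data).
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
found = frequent_itemsets(toy, min_support=3)
```

With a threshold of 3 on this toy database, the three single items and the three pairs are frequent, while {a, b, c} occurs in only two transactions and is therefore not listed. The prune step relies on the fact that every subset of a frequent itemset must also be frequent.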