Doctoral thesis
Abstract: Movement pattern mining is the processing of movement data to understand the mobility behaviour of humans and animals. It has numerous applications, e.g., traffic optimization, event planning, optimization of public transport and carpooling. The recent digital revolution has led to the widespread use of smartphones and other GPS-equipped devices. These devices produce a tremendous amount of movement data that contains valuable mobility information. In recent years, many mobility patterns, together with algorithms to mine them, have been proposed to capture different types of mobility behaviour, e.g., convoy, flock, group, swarm and platoon. The drastic increase in the volume of data being generated limits the use of these algorithms on real-world data sizes because they lack scalability.

This thesis deals with three aspects of movement pattern mining, i.e., scalability, efficiency and real-timeliness, with a focus on convoy pattern mining. A convoy pattern is a group of objects moving together for a certain period. Mining convoy patterns involves clustering the movement dataset at each timestamp and then merging the clusters to form convoys. Clustering the whole dataset is a limiting factor in the scalability of existing algorithms. One way to solve the scalability problem is to mine convoys in parallel. Parallel mining can be done either with an existing distributed spatiotemporal data processing system such as Parallel Secondo or with a general-purpose distributed data processing system. We first test the scalability of Parallel Secondo for mining movement patterns and conclude that it is not an industrial-grade system and that its scalability is limited. An essential part of designing distributed data processing algorithms is the data partitioning strategy. We study three data partitioning strategies, i.e., object-based, spatial and temporal.
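As an illustration only, the three partitioning strategies can be sketched as three different partition keys over the same records. The record layout `(object_id, timestamp, x, y)`, the helper names, the grid cell size and the span length are all assumptions made for this sketch; they are not taken from the thesis.

```python
from collections import defaultdict

# Hypothetical record layout: (object_id, timestamp, x, y).

def partition(records, key_fn):
    """Group records into partitions according to a key function."""
    parts = defaultdict(list)
    for rec in records:
        parts[key_fn(rec)].append(rec)
    return dict(parts)

def object_key(rec, n_parts=2):
    # Object-based: all points of one object land in the same partition.
    return rec[0] % n_parts

def spatial_key(rec, cell=10.0):
    # Spatial: uniform grid cell (the cell size is an arbitrary assumption).
    return (int(rec[2] // cell), int(rec[3] // cell))

def temporal_key(rec, span=2):
    # Temporal: consecutive timestamps are grouped into fixed-length spans.
    return rec[1] // span

records = [
    (1, 0, 1.0, 1.0), (2, 0, 1.5, 1.2), (3, 0, 25.0, 3.0),
    (1, 1, 2.0, 1.1), (2, 1, 2.4, 1.3), (3, 1, 24.0, 3.5),
    (1, 2, 3.0, 1.0), (2, 2, 3.3, 1.4), (3, 2, 23.0, 4.0),
]

by_object = partition(records, object_key)
by_space = partition(records, spatial_key)
by_time = partition(records, temporal_key)
```

In this toy setup, the temporal key keeps every timestamp's snapshot inside a single partition, so the per-timestamp clustering step can run without exchanging data between partitions. Convoys that cross a span boundary would still have to be merged afterwards.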
We analyze their suitability for convoy pattern mining based on five properties, i.e., data exchange, data redundancy, partitioning cost, disk seeks and data ordering. Our study shows that the temporal partitioning strategy is best suited for convoy mining, as it is easily parallelizable and less complicated. The observations in our study also apply to other movement pattern mining algorithms, e.g., flock, group or platoon.

Based on the temporal partitioning strategy, we propose a generic distributed shared-nothing convoy mining algorithm called DCM, which scales linearly with the data size, the data density and the number of nodes. DCM can be implemented using any distributed data processing framework. For our experiments, we implemented it using the Hadoop MapReduce framework. It outperforms the existing sequential algorithms, i.e., the CuTs family of algorithms, by an order of magnitude on different computing architectures, e.g., a single x86 machine, a multi-core cluster with NUMA architecture and multi-node SMP clusters. Although DCM is a scalable distributed algorithm that can process huge datasets, the cost of maintaining the cluster is high. Also, the heavy computation it incurs, because it must cluster the whole dataset, is not resource-efficient.

To solve the efficiency problem of DCM, we propose a new sequential algorithm called k/2-hop which, despite being sequential, can run orders of magnitude faster than the existing state-of-the-art sequential and distributed algorithms. The main strength of the algorithm is its pruning capability. Our experiments show that it can prune up to 99% of the data. k/2-hop uses a notion of benchmark points, which are timestamps separated by k/2 timestamps, where k is the minimum length of the convoys to be mined. We prove that, to mine maximal convoys, we need to cluster only the data belonging to the benchmark points.
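The counting argument behind benchmark points can be checked with a small sketch. It is not the proof given in the thesis. It assumes integer timestamps and a spacing of `k // 2`, and both function names are hypothetical. The check confirms that every window of k consecutive timestamps contains at least two benchmark points, so a convoy of length at least k is always visible at two or more consecutive benchmark points.

```python
def benchmark_points(t_start, t_end, k):
    """Timestamps spaced k // 2 apart over [t_start, t_end], inclusive."""
    hop = k // 2
    if hop < 1:
        raise ValueError("k must be at least 2")
    return list(range(t_start, t_end + 1, hop))

def benchmarks_in_window(points, start, k):
    """Benchmark points inside the k-long window beginning at `start`."""
    return [t for t in points if start <= t < start + k]

k = 6
t_end = 20
points = benchmark_points(0, t_end, k)

# Smallest number of benchmark points hit by any k-long window in [0, t_end].
min_hits = min(
    len(benchmarks_in_window(points, s, k)) for s in range(0, t_end - k + 2)
)
```

Under these assumptions `min_hits` is 2. Objects that are not clustered together at consecutive benchmark points therefore cannot form a convoy of length k, which is one way to see where the pruning comes from.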
For the timestamps between two consecutive benchmark points, we propose an efficient mining algorithm called the Hop Window Mining Tree (HWMT). HWMT clusters only the data of those objects that are part of a cluster at the benchmark points.

k/2-hop is a batch algorithm that mines convoys very fast, but the result is available only once the complete dataset has been processed. It also requires the data to be indexed for better performance and thus cannot be used in real-time scenarios. We therefore propose a streaming variant of the k/2-hop algorithm that does not require the input dataset to be indexed and can process a stream of data. It outputs the mined convoys as and when they are discovered. The streaming k/2-hop algorithm is very memory-efficient and can process data many times larger than the memory made available to it. We show through experiments that, if the data loading and indexing time is included in the runtime of the k/2-hop algorithm, streaming k/2-hop is the fastest convoy mining algorithm to date.

The convoy pattern belongs to a broader category of co-movement patterns, and most (if not all) of the observations made in this thesis about convoy pattern mining also apply to other patterns of that category, such as flock, group or platoon. This applicability means that a generic batch and streaming distributed co-movement pattern mining framework can be built using the k/2 technique.
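The sketch below illustrates the "cluster at each timestamp, then merge" idea together with the "output as and when discovered" behaviour. It is a naive cluster-intersection baseline written as a generator, not the k/2-hop or streaming k/2-hop algorithm. It assumes consecutive integer timestamps and that the clusters for each timestamp are already computed. The function name and the parameters `m` (minimum number of objects) and `k` (minimum length) are assumptions of this example.

```python
def stream_convoys(snapshots, m, k):
    """Yield (objects, start, end) as soon as a group of >= m objects that
    stayed together for >= k consecutive timestamps can no longer be extended."""
    candidates = {}  # frozenset of object ids -> first timestamp together
    last_t = None
    for t, clusters in snapshots:
        nxt = {}
        for objs, start in candidates.items():
            survived_whole = False
            for c in clusters:
                common = objs & c
                if len(common) >= m:
                    # Keep the earliest start for each surviving object set.
                    nxt[common] = min(start, nxt.get(common, start))
                    survived_whole = survived_whole or common == objs
            if not survived_whole and last_t - start + 1 >= k:
                yield set(objs), start, last_t
        for c in clusters:
            if len(c) >= m:
                # A cluster may also start a new candidate at time t.
                nxt.setdefault(frozenset(c), t)
        candidates, last_t = nxt, t
    # End of stream: flush the candidates that are long enough.
    for objs, start in candidates.items():
        if last_t - start + 1 >= k:
            yield set(objs), start, last_t

snapshots = [
    (0, [{1, 2, 3}]),
    (1, [{1, 2, 3}]),
    (2, [{1, 2}, {3}]),
]
found = list(stream_convoys(snapshots, m=2, k=2))
```

Because this is a generator, the convoy `{1, 2, 3}` is emitted while timestamp 2 is being processed, before the stream ends. This simple baseline may report non-maximal convoys on some inputs and performs none of the pruning described above.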