par Mantrach, Amin ;van Zeebroeck, Nicolas ;Francq, Pascal ;Shimbo, Masashi;Bersini, Hugues ;Saerens, Marco
Référence Pattern recognition, 44, 6, page (1212-1224)
Publication Publié, 2011
Référence Pattern recognition, 44, 6, page (1212-1224)
Publication Publié, 2011
Article révisé par les pairs
Résumé : | This work addresses graph-based semi-supervised classification and betweenness computation in large, sparse, networks (several millions of nodes). The objective of semi-supervised classification is to assign a label to unlabeled nodes using the whole topology of the graph and the labeling at our disposal. Two approaches are developed to avoid explicit computation of pairwise proximity between the nodes of the graph, which would be impractical for graphs containing millions of nodes. The first approach directly computes, for each class, the sum of the similarities between the nodes to classify and the labeled nodes of the class, as suggested initially in [1,2]. Along this approach, two algorithms exploiting different state-of-the-art kernels on a graph are developed. The same strategy can also be used in order to compute a betweenness measure. The second approach works on a trellis structure built from biased random walks on the graph, extending an idea introduced in [3]. These random walks allow to define a biased bounded betweenness for the nodes of interest, defined separately for each class. All the proposed algorithms have a linear computing time in the number of edges while providing good results, and hence are applicable to large sparse networks. They are empirically validated on medium-size standard data sets and are shown to be competitive with state-of-the-art techniques. Finally, we processed a novel data set, which is made available for benchmarking, for multi-class classification in a large network: the U.S. patents citation network containing 3M nodes (of six different classes) and 38M edges. The three proposed algorithms achieve competitive results (around 85% classification rate) on this large networkthey classify the unlabeled nodes within a few minutes on a standard workstation. © 2010 Elsevier Ltd. All rights reserved. |