Supported by the National Basic Research Program of China (973 Program) under Grant No. 2005CB321905
This paper develops, for the first time, an effective clustering algorithm for probabilistic data streams, called “P-Stream”. To handle the uncertain tuples in a data stream, P-Stream introduces the concepts of strong cluster, transitional cluster, and weak cluster. Based on these concepts, an effective candidate-cluster selection strategy is designed, which finds a suitable cluster for each continuously arriving data point. Then, to support further high-level clustering and analysis of the evolving behavior of data streams, snapshots of the micro-clusters are stored at every checkpoint. Finally, an “aggressive” two-tier clustering model is introduced to judge whether the most recently arrived data point fits the first-level clustering model. The probabilistic data streams used in the experiments include the KDD-CUP’98 and KDD-CUP’99 real data sets and synthetic data sets with changing Gaussian distributions. Comprehensive experimental results demonstrate that P-Stream achieves high clustering quality and a fast processing rate, and adapts efficiently to the evolving behavior of data streams.