Abstract:Outliers are objects that do not comply with the general behavior of the data. The method of partition divides data space into a set of non-overlapping rectangular cells by partitioning every dimension into equal length. Statistical information of cells is used to find knowledge in datasets. There exists very large data skew in real-life datasets, so partition will produce many empty cells, which affects the efficiency of the algorithms. An efficient index structure called CD-Tree (cell dimension tree) is designed for indexing cells. Moreover, to guide partition and to optimize the structure of CD-Tree, the concept of SOD (skew of data) is proposed to measure the degree of data skew. Finally, the CD-Tree-based algorithm is designed for outlier detection based on CD-Tree and SOD. The experimental results show that the efficiency of CD-Tree-based algorithm and the maximum number of dimensions processed increase obviously comparing with the Cell-based algorithm on real-life datasets.