基于MapReduce与相关子空间的局部离群数据挖掘算法
作者:
基金项目:

国家自然科学基金(61272263)


Related-Subspace-Based Local Outlier Detection Algorithm Using MapReduce
Author:
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [30]
  • |
  • 相似文献 [20]
  • |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    针对高维海量数据,在MapReduce编程模型下,提出了一种基于相关子空间的局部离群数据挖掘算法.该算法首先利用属性维上的局部稀疏程度,重新定义了相关子空间,从而能够有效地刻画各种局部数据集上的分布特征;其次,利用局部数据集的概率密度,给出了相关子空间中的局部离群因子计算公式,有效地体现了相关子空间中数据对象不服从局部数据集分布特征的程度,并选取离群程度最大的N个数据对象定义为局部离群数据;在此基础上,采用LSH分布式策略,提出了一种MapReduce编程模型下的局部离群数据挖掘算法;最后,采用人工数据集和恒星光谱数据集,实验验证了该算法的有效性、可扩展性和可伸缩性.

    Abstract:

    In this paper, a related-subspace-based local outlier detection algorithm is proposed in MapReduce programming model for high-dimensional and massive data set. Firstly, the relevant subspace, which can effectively describe the local distribution of the various data sets, is redefined by using local sparseness of attribute dimensions. Secondly, a local outlier factor calculation formula in the relevant subspace is defined with probability density of local data sets. The formula can not only effectively reflect the outlierness of data object that does not obey the distribution of the local data set in relevant subspace, but also select N data objects with the greatest-outlierness as local outliers. Furthermore, a related-subspace-based local outlier detection algorithm is constructed by using LSH distributed strategy in MapReduce programming model. Finally, experimental results validate the effectiveness, scalability and extensibility of the presented algorithms by using artificial data and stellar spectral data as experimental data sets.

    参考文献
    [1] Knox EM, Ng RT. Algorithms for mining distance-based outliers in large datasets. In: Proc. of the Int'l Conf. on Very Large Data Bases. 1998. 392-403.
    [2] Zhang JF, Jiang YY, Hu LH, Cai JH, Zang SL. A concept lattice based recognition method of celestial spectra outliers. Acta Automatica Sinica, 2008,34(9):1060-1066 (in Chinese with English abstract).
    [3] Pham N, Pagh R. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In: Proc. of the 18th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. ACM Press, 2012. 877-885. [doi: 10.1145/2339 530.2339669]
    [4] Sequeira K, Zaki M. ADMIT: Anomaly-Based data mining for intrusions. In: Proc. of the 8th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. ACM Press, 2002. 386-395. [doi: 10.1145/775047.775103]
    [5] Lazarevic A, Ertoz L, Kumar V, Ozgur A, Srivastava J. A comparative study of anomaly detection schemes in network intrusion detection. Technical Report, University of Minnesota, 2003.
    [6] Liu H, Shah S, Jiang W. On-Line outlier detection and data cleaning. Computers & Chemical Engineering, 2004,28(9):1635-1647.
    [7] Barnett V, Lewis T. Outliers in Statistical Data. New York: Wiley, 1994. [doi: 10.1016/j.compchemeng.2004.01.009]
    [8] Tao Y, Xiao X, Zhou S. Mining distance-based outliers from large databases in any metric space. In: Proc. of the 12th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. ACM, 2006. 394-403.
    [9] Ramaswamy S, Rastogi R, Shim K. Efficient algorithms for mining outliers from large data sets. ACM SIGMOD Record, 2000, 29(2):427-438. [doi: 10.1145/335191.335437]
    [10] Breunig MM, Kriegel HP, Ng RT, Sander J. LOF: Identifying density-based local outliers. ACM SIGMOD Record, 2000,29(2):93-104. [doi: 10.1145/335191.335388]
    [11] Papadimitriou S, Kitagawa H, Gibbons PB, Faloutsos C. Loci: Fast outlier detection using the local correlation integral. In: Proc. of the 19th Int'l Conf. on Data Engineering (ICDE). IEEE, 2003. 315-326. [doi: 10.1109/ICDE.2003.1260802]
    [12] Sarawagi S, Agrawal R, Megiddo N. Discovery-Driven Exploration of OLAP Data Cubes. Berlin, Heidelberg: Springer-Verlag, 1998. [doi: 10.1007/BFb0100984]
    [13] Kriegel HP, Zimek A. Angle-Based outlier detection in high-dimensional data. In: Proc. of the 14th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. ACM Press, 2008. 444-452. [doi: 10.1145/1401890.1401946]
    [14] Kriegel HP, Kroger P, Schubert E, Zimek A. Outlier detection in arbitrarily oriented subspaces. In: Proc. of the 2012 IEEE 12th Int'l Conf. on Data Mining. IEEE Computer Society, 2012. 379-388. [doi: 10.1109/ICDM.2012.21]
    [15] Muller E, Schiffer M, Seidl T. Statistical selection of relevant subspace projections for outlier ranking. In: Proc. of the IEEE 27th Int'l Conf. on Data Engineering (ICDE). IEEE, 2011. 434-445. [doi: 10.1109/ICDE.2011.5767916]
    [16] Aggarwal CC, Yu PS. Outlier detection for high dimensional data. ACM Sigmod Record, 2001,30(2):37-46. [doi: 10.1145/ 376284.375668]
    [17] Keller F, Muller E, Bohm K. HiCS: High contrast subspaces for density-based outlier ranking. In: Proc. of the 28th Int'l Conf. on Data Engineering (ICDE). IEEE, 2012. 1037-1048. [doi: 10.1109/ICDE.2012.88]
    [18] Aggarwal CC, Philip SY. An effective and efficient algorithm for high-dimensional outlier detection. The VLDB Journal, 2005, 14(2):211-221. [doi: 10.1007/s00778-004-0125-5]
    [19] Zhang JF, Jiang YY, Chang KH, Zhang SL, Cai JH, Hu LH. A concept lattice based outlier mining method in low-dimensional subspaces. Pattern Recognition Letters, 2009,30(15):1434-1439. [doi: 10.1016/j.patrec.2009.07.016]
    [20] Kriegel HP, Kröger P, Schubert E, Zimek A. Outlier detection in axis-parallel subspaces of high dimensional data. In: Proc. of the 13th Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining. Berlin, Heidelberg: Springer-Verlag, 2009. 831-838. [doi: 10.1007/978-3-642-01307-2_86]
    [21] Bouguessa M, Wang S. Mining projected clusters in high-dimensional spaces. IEEE Trans. on Knowledge and Data Engineering, 2009,21(4):507-522. [doi: 10.1109/TKDE.2008.162]
    [22] Stupar A, Michel S, Schenkel R. RankReduce—Processing k-nearest neighbor queries on top of MapReduce. In: Proc. of the 8th Workshop on Large-Scale Distributed Systems for Information Retrieval. 2010. 15-20.
    [23] Lu W, Shen YY, Chen S, Ooi BC. Efficient processing of k nearest neighbor joins using mapreduce. Proc. of the VLDB Endowment, 2012,5(10):1016-1027. [doi: 10.14778/2336664.2336674]
    [24] Liu Y, Jing N, Chen L, Xiong W. Algorithm for processing k-nearest join based on R-tree in MapReduce. Ruan Jian Xue Bao/ Journal of Software, 2013,24(8):1836-1851(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4377.htm [doi: 10.3724/SP.J.1001.2013.04377]
    [25] Angiulli F, Basta S, Lodi S, Sartori C. Distributed strategies for mining outliers in large data sets. IEEE Trans. on Knowledge and Data Engineering, 2013,25(7):1520-1532. [doi: 10.1109/TKDE.2012.71]
    [26] Kriegel HP, Kröger P, Schubert E, Zimek A. LoOP: Local outlier probabilities. In: Proc. of the 18th ACM Conf. on Information and Knowledge Management. ACM Press, 2009. 1649-1652. [doi: 10.1145/1645953.1646195]
    [27] Zhang C, Li F, Jestes J. Efficient parallel kNN joins for large data in MapReduce. In: Proc. of the 15th Int'l Conf. on Extending Database Technology. ACM Press, 2012. 38-49. [doi: 10.1145/2247596.2247602]
    [28] Datar M, Immorlica N, Indyk P, Mirrokni VS. Locality-Sensitive hashing scheme based on p-stable distributions. In: Proc. of the 20th Annual Symp. on Computational Geometry. ACM Press, 2004. 253-262. [doi: 10.1145/997817.997857]
    [29] Houle ME, Kriegel HP, Kröger P, Schubert E, Zimek A. Can shared-neighbor distances defeat the curse of dimensionality. In: Proc. of the Scientific and Statistical Database Management. Berlin, Heidelberg: Springer-Verlag, 2010. 482-500. [doi: 10.1007/978-3- 642-13818-8_34]
    [30] White T. Hadoop: The Definitive Guide. O'Reilly, 2012.
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

张继福,李永红,秦啸,荀亚玲.基于MapReduce与相关子空间的局部离群数据挖掘算法.软件学报,2015,26(5):1079-1095

复制
相关视频

分享
文章指标
  • 点击次数:5337
  • 下载次数: 6572
  • HTML阅读次数: 1402
  • 引用次数: 0
历史
  • 收稿日期:2013-12-19
  • 最后修改日期:2014-05-21
  • 在线发布日期: 2014-08-22
文章二维码
您是第19936852位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号