基于MapReduce的图结构聚类算法
作者:
作者简介:

张伟鹏(1991-),男,广东汕头人,硕士生,主要研究领域为大规模图数据并行算法;刘宇鸿(1997-),男,学士,主要研究领域为社区搜索,时态图挖掘;李振军(1979-),男,博士,工程师,主要研究领域为数据挖掘,深度学习;毛睿(1975-),男,博士,教授,CCF高级会员,主要研究领域为数据挖掘,数据库,统计方法,机器学习,计算生物;李荣华(1985-),男,博士,讲师,主要研究领域为图数据挖掘,社交网络分析;乔少杰(1981-),男,博士,教授,CCF高级会员,主要研究领域为轨迹数据挖掘,机器学习.

通讯作者:

李振军,E-mail:15323940@qq.com

中图分类号:

TP311

基金项目:

国家自然科学基金(61402292,61772091);国家自然科学基金广东省联合基金(U1301252);教育部人文社会科学研究规划基金(15YJAZH058)


MapReduce-Based Graph Structural Clustering Algorithm
Author:
Fund Project:

National Natural Science Foundation of China (61402292, 61772091);National Natural Science Foundation of China Guangdong Joint Fund Project (U1301252);Planning Foundation for Humanities and Social Sciences of Ministry of Education of China (15YJAZH058)

  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [34]
  • |
  • 相似文献 [20]
  • | | |
  • 文章评论
    摘要:

    图结构聚类(SCAN)是一种著名的基于密度的图聚类算法,该算法不仅能够找到图中的聚类结构,而且还能发现图中的Hub节点和离群节点.然而,随着图数据规模越来越大,传统的SCAN算法的复杂度为Om1.5)(m为图中边的条数),因此很难处理大规模的图数据.为了解决SCAN算法的可扩展性问题,提出一种基于MapReduce的海量图结构聚类算法MRSCAN,这是一种计算核心节点以及两种合并聚类的MapReduce算法.最后,在多个真实的大规模图数据集上进行实验测试,实验结果验证了算法的准确性、有效性以及可扩展性.

    Abstract:

    Graph Clustering is a fundamental task for graph mining which has been widely used in social network analysis related applications. Graph structural clustering (SCAN) is a well-known density-based graph clustering algorithm. SCAN algorithm can not only find the clusters in a graph, but also be able to identify hub nodes and outliers. However, with the growing graph size, the traditional SCAN algorithm is very hard to handle massive graph data, as its time complexity is O(m1.5) (m is the number of edges in the graph). To overcome the scalability issue of SCAN algorithm, this paper proposes a MapReduce based graph structural clustering algorithm, called MRSCAN. Specifically, the paper develops a MapReduce based similarity computation, a core node computation, as well as two clustering merging algorithms. In addition, it conducts extensive experiments over serval real-world graph datasets, and results demonstrate the accuracy, effectiveness, and scalability of the presented algorithm.

    参考文献
    [1] Mayer-SchoSnberger V, Cukier K, Wrote; Zhou T, et al. Trans. Big Data:A Revolution That Will Transform How We Live, Work, and Think. John Murray, 2013(in Chinese).
    [2] Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In:Proc. of the IEEE Symp. on MASS Storage Systems and Technologies. IEEE Computer Society, 2010. 1-10.[doi:10.1109/MSST.2010.5496972]
    [3] Shiokawa H, Fujiwara Y, Onizuka M. SCAN++:Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs. Proc. of the VLDB Endowment, 2015.[doi:10.14778/2809974.2809980]
    [4] Chang L, Li W, Qin L, Zhang W. pSCAN:Fast and exact structural graph clustering. IEEE Trans. on Knowledge & Data Engineering, 2017,29(2):387-401.[doi:10.1109/TKDE.2016.2618795]
    [5] Li JJ, Cui J, Wang D, Yan L, Huang YS. Survey of MapReduce parallel programming model. Acta Electronic Journal, 2011,39(11):2635-2642(in Chinese with English abstract).
    [6] Wang F, Lei BH. Model analysis of hadoop distributed file system. Telecommunications Science, 2010,26(12):95-99(in Chinese with English abstract).[doi:10.3969/j.issn.1000-0801.2010.12.019]
    [7] Chen F, Kodialam M, Lakshman TV. Joint scheduling of processing and Shuffle phases in MapReduce systems. In:Proc. of the IEEE INFOCOM. IEEE, 2012. 1143-1151.[doi:10.1109/INFCOM.2012.6195473]
    [8] Xu X, Yuruk N, Feng Z, Schweiger TAJ. SCAN:A structural clustering algorithm for networks. In:Proc. of the ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining. ACM Press, 2007. 824-833.[doi:10.1145/1281192.1281280]
    [9] Zhou FF, Li JC, Huang W, Wang JW, Zhao Y. Based on dimension expansion Radviz visual clustering analysis method. Ruan Jian Xue Bao/Journal of Software, 2016,27(5):1127-1139(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4951.htm[doi:10.13328/j.cnki.jos.004951]
    [10] Guo QK. Study on connection method based on MapReduce[Ph.D. Thesis]. Jilin University, 2014(in Chinese with English abstract).
    [11] Cohen J. Graph twiddling in a MapReduce world. Computing in Science & Engineering, 2009,11(4):29-41.[doi:10.1109/MCSE. 2009.120]
    [12] Wang HY, Zhou L. Study on the maximum mission problem based on divide, prune and ant colony algorithm. Journal of Hefei Teachers College, 2011,29(3):59-62(in Chinese with English abstract).[doi:10.3969/j.issn.1674-2273.2011.03.020]
    [13] Qin L, Yu J X, Chang L, Cheng H, Zhang C, Lin XM. Scalable big graph processing in MapReduce. In:Proc. of the SIGMOD. 2014. 827-838.[doi:10.1145/2588555.2593661]
    [14] Feng XJ. Research on iterative distributed data processing based on MapReduce[Ph.D. Thesis]. Shandong University, 2013(in Chinese with English abstract).
    [15] Liu T, Wu SH, Qiang Y. Task scheduling algorithm for multiple MapReduce jobs. Microelectronics & Computer, 2013,30(12):156-159(in Chinese with English abstract).
    [16] Gao SL. Research on Web page parallel de-emphasis algorithm based on MapReduce framework. Heilongjiang Science, 2010,1(5):13-18(in Chinese with English abstract).
    [17] Que X, Wang Y, Xu C, Yu WK. Hierarchical merge for scalable MapReduce. In:Proc. of the Workshop on Management of Big Data Systems. 2012. 1-6.[doi:10.1145/2378356.2378358]
    [18] Tiwari N, Sarkar S, Indrawan-Santiago M, Bellur U. Improving energy efficiency of IO-intensive MapReduce jobs. In:Proc. of the Int'l Conf. 2015. 1-4.[doi:10.1145/2684464.2684484]
    [19] Zhang X. Design of DBSCAN algorithm based on query and search. Journal of Yili Normal University:Natural Science Edition, 2014,8(4):62-65(in Chinese with English abstract).[doi:10.3969/j.issn.1673-999X.2014.04.014]
    [20] Ma J, Iwama K, Ma SH. Search for parallel algorithm in loop of undirected graph. Ruan Jian Xue Bao/Journal of Software, 1997, 18(6):475-480(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/18/475.htm
    [21] Hou ZL, Wei XH, Huang DN, Xu S. Application of parallel computing and its performance analysis in gravity totally gradient data inversion. Applied Geophysics, 2015,12(3):292-302(in Chinese with English abstract).
    [22] Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel:A system for large-scale graph processing. In:Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. ACM Press, 2010. 135-146.[doi:10.1145/1582716.1582723]
    附中文参考文献:
    [1] Mayer-Schonborger V, Cukier K,著;周涛,等,译.大数据时代:生活、工作与思维的大变革.杭州:浙江人民出版社,2013.
    [5] 李建江,崔健,王聃,严林,黄义双.MapReduce并行编程模型研究综述.电子学报,2011,39(11):2635-2642.
    [6] 王峰,雷葆华.Hadoop分布式文件系统的模型分析.电信科学,2010,26(12):95-99.[doi:10.3969/j.issn.1000-0801.2010.12.019]
    [9] 周芳芳,李俊材,黄伟,王俊韡,赵颖.基于维度扩展的Radviz可视化聚类分析方法.软件学报,2016,27(5):1127-1139. http://www.jos.org.cn/1000-9825/4951.htm
    [10] 郭骐恺.基于MapReduce的连接方法研究[博士学位论文].长春:吉林大学,2014.
    [12] 王会颖,周琳.基于分治、剪枝和蚁群算法求解最大团问题.合肥师范学院学报,2011,29(3):59-62.[doi:10.3969/j.issn.1674-2273. 2011.03.020]
    [14] 冯新建.基于MapReduce的迭代型分布式数据处理研究[博士学位论文].济南:山东大学,2013.
    [15] 刘涛,武淑红,强彦.用于多个MapReduce作业的任务调度算法.微电子学与计算机,2013,30(12):156-159.
    [16] 高殊丽.基于MapReduce框架的网页并行去重算法研究.黑龙江科学,2010,1(5):13-18.
    [19] 张晓.基于并查集的DBSCAN算法设计.伊犁师范学院学报:自然科学版,2014,8(4):62-65.[doi:10.3969/j.issn.1673-999X.2014. 04.014]
    [20] 马军,岩间一雄,马绍汉.寻找无向图中回路的并行算法.软件学报,1997,18(6):475-480. http://www.jos.org.cn/1000-9825/18/475.htm
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

张伟鹏,李振军,李荣华,刘宇鸿,毛睿,乔少杰.基于MapReduce的图结构聚类算法.软件学报,2018,29(3):627-641

复制
分享
文章指标
  • 点击次数:4586
  • 下载次数: 7193
  • HTML阅读次数: 3323
  • 引用次数: 0
历史
  • 收稿日期:2017-08-03
  • 最后修改日期:2017-09-05
  • 在线发布日期: 2017-12-05
文章二维码
您是第19938456位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号