基于采样的在线大图数据收集和更新
作者:
作者简介:

尹子都(1990-),男,博士生,主要研究领域为海量数据处理与分析,知识融合.
岳昆(1979-),男,博士,教授,博士生导师,CCF高级会员,主要研究领域为海量数据处理与分析,大数据知识工程.
张彬彬(1982-),女,博士,讲师,CCF专业会员,主要研究领域为云计算和知识发现.
李劲(1975-),男,博士,副教授,CCF专业会员,主要研究领域为海量数据处理和机器学习.

通讯作者:

岳昆,E-mail:kyue@ynu.edu.cn

基金项目:

国家自然科学基金(U1802271,62002311);云南省基础研究计划杰出青年项目(2019FJ011);云南省青年拔尖人才培养支持计划(C6193032);云南大学东陆学者培育计划


Sampling-based Collection and Updating of Online Big Graph Data
Author:
Fund Project:

National Natural Science Foundation of China (U1802271, 62002311); Science Foundation for Distinguished Young Scholars of Yunnan Province (2019FJ011); Young Talent Support Program of Yunnan Province(C6193032); Donglu Scholars Training Program of Yunnan University

  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [34]
  • |
  • 相似文献
  • |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    互联网中,以网页、社交媒体和知识库等为载体呈现的大量非结构化数据可表示为在线大图.在线大图数据的获取包括数据收集和更新,是大数据分析与知识工程的重要基础,但面临着数据量大、分布广、异构和变化快速等挑战.基于采样技术,提出并行、自适应的在线大图数据收集和更新方法.首先,将分支限界方法与半蒙特卡罗采样技术相结合,提出能够自适应地收集在线大图数据的HD-QMC算法;然后,为了使收集的数据能反映实际中在线大图的动态变化,进一步基于信息熵及泊松过程,提出高效更新在线大图数据的EPP算法.从理论上分析了该算法的有效性,并将获取的各类在线大图数据统一表示为RDF三元组的形式,为在线大图数据分析及相关研究提供方便易用的数据基础.基于Spark实现了在线大图数据的收集和更新算法,人工生成数据和真实数据上的实验结果展示了该方法的有效性和高效性.

    Abstract:

    The large volume of unstructured data obtained from Web pages, social media and knowledge bases on the Internet could be represented as an online big graph (OBG). Confronted with many challenges, such as its large-scale, widespread, heterogeneous, and fast-changing properties, OBG data acquisition includes data collection and updating, which is the basis of massive data analysis and knowledge engineering. In this study, the method for adaptive and parallel data collection and updating is proposed based on sampling techniques. First, the HD-QMC algorithm is given for adaptive data collection of OBG data by combining the branch-and-bound method and quasi-Monte Carlo sampling technique. Next, the EPP algorithm is given for efficient data updating based on entropy and Poisson process to make the collected data reflect the dynamic change of OBGs in real-world environments. Further, the effectiveness of the proposed algorithms is analyzed theoretically, and various kinds of collected OBG data are represented by triples universally to provide an easy-to-use data foundation for OBG analysis and relevant studies. Finally, the proposed algorithms for data collection and updating are implemented with Spark, and experimental results on simulated and real-world datasets show the effectiveness and efficiency of the proposed method.

    参考文献
    [1] Wang JM. Key technologies in big data applications development and runtime support platform. Ruan Jian Xue Bao/Journal of Software, 2017,28(6):1516-1528(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5231.htm[doi:10.13328/j.cnki.jos.005231]
    [2] Wu XD, Chen HH, Wu GQ, Liu J, Zheng QH, He XF, Zhou AY, Zhao ZQ, Wei BF, Li Y, Zhang QP, Zhang SC. Knowledge engineering with big data. IEEE Intelligent Systems, 2015,30(5):46-55.[doi:10.1109/MIS.2015.56]
    [3] Zhang JZ, Meng XF. Mobile Web search. Ruan Jian Xue Bao/Journal of Software, 2012,23(1):46-64(in Chinese with English abstract). http://www.jos.org.cn/1000-9825/4120.htm[doi:10.3724/SP.J.1001.2012.04120]
    [4] Wang GL, Han YB, Zhang ZM, Zhu ML. Could-based integration and service of streaming data. Chinese Journal of Computers, 2017,2017(1):107-125(in Chinese with English abstract).[doi:10.11897/SP.J.1016.2017.00107]
    [5] Xia D, Wang YS, Zhao ZP, Cui D. Incremental and interactive data integration approach for hierarchical data in domain of intelligent livelihood. Journal of Computer Research and Development, 2017,54(3):586-596(in Chinese with English abstract).[doi:10.7544/issn1000-1239.2017.20151048]
    [6] Lin HL, Wang YZ, Jia YT, Zhang P, Wang WP. Network big data oriented knowledge fusion methods:A survey. Chinese Journal of Computers, 2017, 2017(1):1-27(in Chinese with English abstract).[doi:10.11897/SP.J.1016.2017.00001]
    [7] Surendran S, Prasad DC, Kaimal MR. A scalable geometric algorithm for community detection from social networks with incremental update. Social Network Analysis and Mining, 2016,6(1):Article No.90.[doi:10.1007/s13278-016-0399-9]
    [8] Xi SJ, Sun FC, Wang JM. A cognitive crawler using structure pattern for incremental crawling and content extraction. In:Proc. of the IEEE Int'l Conf. on Cognitive Informatics. 2010. 238-244.[doi:10.1109/COGINF.2010.5599733]
    [9] Pavai G, Geetha TV. Improving the freshness of the search engines by a probabilistic approach based incremental crawler. Information Systems Frontiers, 2017,19(5):1013-1028.[doi:10.1007/s10796-016-9701-7]
    [10] Matteo R, Fabio V. MiSoSouP:Mining interesting subgroups with sampling and pseudodimension. In:Proc. of the 24th ACM Int'l Conf. on Knowledge Discovery & Data Mining. 2018. 2130-2139.[doi:10.1145/3219819.3219989]
    [11] Nikolov A, Haase P, Herzig DM, Trame J, Kozlov A. Combining RDF graph data and embedding models for an augmented knowledge graph. In:Proc. of the Companion of the Web Conf. 2018. 977-980.[doi:10.1145/3184558.3191527]
    [12] Andreou AS, Chatzis SP. Software defect prediction using doubly stochastic Poisson processes driven by stochastic belief networks. Journal of Systems and Software, 2016,122:72-82.[doi:10.1016/j.jss.2016.09.001]
    [13] Stivala AD, Koskinen JH, Rolls DA, Wang P, Robins G. Snowball sampling for estimating exponential random graph models for large networks. Social Networks, 2016,47:167-188.[doi:10.1016/j.socnet.2015.11.003]
    [14] Tao J, Zhao QQ, Cao PF, Wang Z, Zhang Y. APK-DFS:An automatic interaction system based on depth-first-search for APK. In:Proc. of the Int'l Conf. on Algorithms and Architectures for Parallel Processing. 2017. 420-430.[doi:10.1007/978-3-319-65482-9_29]
    [15] Khan A, Sharma DK. Self-Adaptive ontology based focused crawler for social bookmarking sites. Int'l Journal of Information Retrieval Research, 2017,7(2):51-67.[doi:10.4018/IJIRR.2017040104]
    [16] Wu CS, Hou W, Shi YQ, Liu T. A Web search contextual crawler using ontology relation mining. In:Proc. of the Int'l Conf. on Computational Intelligence and Software Engineering. 2009. 1-4.[doi:10.1109/CISE.2009.5365842]
    [17] Batzios A, Dimou C, Symeonidis AL, Mitkas PA. BioCrawler:An intelligent crawler for the semantic Web. Expert Systems with Applications, 2008,35(1-2):524-530.[doi:10.1016/j.eswa.2007.07.054]
    [18] Arulampalam MS, Evans RJ, Letaief KB. Importance sampling for error event analysis of HMM frequency line trackers. IEEE Trans. on Signal Processing, 2002,50(2):411-424.[doi:10.1109/78.978395]
    [19] Ahmed NK, Duffield N, Willke TL, Rossi RA. On sampling from massive graph streams. Proc. of the VLDB Endowment, 2017, 10(11):1430-1441.[doi:10.14778/3137628.3137651]
    [20] Yin ZD, Yue K, Wu H, Su YJ. Adaptive and parallel data acquisition from online big graphs. In:Proc. of the Int'l Conf. on Database Systems for Advanced Applications. LNCS 10827, Gold Coast:Springer-Verlag, 2018. 223-331.[doi:10.1007/978-3-319-91452-7_21]
    [21] Sharma V, Kumar M, Vig R. A hybrid revisit policy for web search. Journal of Advances in Information Technology, 2012,3(1):36-47.[doi:10.4304/jait.3.1.36-47]
    [22] Radinsky K, Bennett PN. Predicting content change on the Web. In:Proc. of the 6th ACM Int'l Conf. on Web Search and Data Mining. 2013. 415-424.[doi:10.1145/2433396.2433448]
    [23] Cho J, Ntoulas A. Effective change detection using sampling. In:Proc. of the Very Large Data Bases Conf. 2002. 514-525.[doi:10. 1016/B978-155860869-6/50052-4]
    [24] Faure H, Lemieux C. Improved Halton sequences and discrepancy bounds. Monte Carlo Methods Applications, 2010,16(3):1-18.[doi:10.1515/mcma.2010.008]
    [25] Leskovec J, Lang K, Dasgupta A, Mahoney M. Community structure in large networks:Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 2009,6(1):29-123.[doi:10.1080/15427951.2009.10129177]
    [26] McAuley J, Leskovec J. Learning to discover social circles in ego networks. In:Proc. of the Int'l Conf. on Neural Information Processing Systems. 2012. 539-547.
    [27] Fu KW, Chan CH, Chau M. Assessing censorship on microblogs in China:Discriminatory keyword analysis and impact evaluation of the ‘real name registration’ policy. IEEE Internet Computing, 2013,17(3):42-50.[doi:10.1109/MIC.2013.28]
    [28] Le BD, Nguyen HX, Shen H, Falkner N. GLFR:A generalized LFR benchmark for testing community detection algorithms. In:Proc. of the Int'l Conf. on Computer Communication and Networks. 2017. 1-9.[doi:10.1109/ICCCN.2017.8038442]
    附中文参考文献:
    [1] 王建民.领域大数据应用开发与运行平台技术研究.软件学报,2017,28(6):1516-1528. http://www.jos.org.cn/1000-9825/5231.htm[doi:10.13328/j.cnki.jos.005231]
    [3] 张金增,孟小峰.移动Web搜索研究.软件学报,2012,23(1):46-64. http://www.jos.org.cn/1000-9825/4120.htm[doi:10.3724/SP.J. 1001.2012.04120]
    [4] 王桂玲,韩燕波,张仲妹,朱美玲.基于云计算的流数据集成与服务.计算机学报,2017,2017(1):107-125.[doi:10.11897/SP.J.1016. 2017.00107]
    [5] 夏丁,王亚沙,赵梓棚,崔达.面向智慧民生领域的增量交互式数据集成方法.计算机研究与发展,2017,54(3):586-596.[doi:10. 7544/issn1000-1239.2017.20151048]
    [6] 林海伦,王元卓,贾岩涛,张鹏,王伟平.面向网络大数据的知识融合方法综述.计算机学报,2017,2017(1):1-27.[doi:10.11897/SP.J. 1016.2017.00001]
    相似文献
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

尹子都,岳昆,张彬彬,李劲.基于采样的在线大图数据收集和更新.软件学报,2020,31(11):3540-3558

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2018-10-25
  • 最后修改日期:2019-01-16
  • 在线发布日期: 2020-11-07
  • 出版日期: 2020-11-06
文章二维码
您是第19937664位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号