Sampling-based Collection and Updating of Online Big Graph Data

doi:10.13328/j.cnki.jos.005843

Author:

YIN Zi-Du
School of Information Science and Engineering, Yunnan University, Kunming 650500, China
Born in 1990, male, Ph.D. candidate. Research interests: massive data processing and analysis, knowledge fusion.

YUE Kun
School of Information Science and Engineering, Yunnan University, Kunming 650500, China
Born in 1979, male, Ph.D., professor, doctoral supervisor, CCF senior member. Research interests: massive data processing and analysis, big data knowledge engineering.

ZHANG Bin-Bin
School of Information Science and Engineering, Yunnan University, Kunming 650500, China
Born in 1982, female, Ph.D., lecturer, CCF professional member. Research interests: cloud computing and knowledge discovery.

LI Jin
School of Software, Yunnan University, Kunming 650500, China
Born in 1975, male, Ph.D., associate professor, CCF professional member. Research interests: massive data processing and machine learning.

Corresponding author: YUE Kun, E-mail: kyue@ynu.edu.cn

Fund Project:

National Natural Science Foundation of China (U1802271, 62002311); Science Foundation for Distinguished Young Scholars of Yunnan Province (2019FJ011); Young Talent Support Program of Yunnan Province (C6193032); Donglu Scholars Training Program of Yunnan University


Abstract:

The large volume of unstructured data carried by Web pages, social media, and knowledge bases on the Internet can be represented as an online big graph (OBG). OBG data acquisition, which comprises data collection and updating, is the basis of big data analysis and knowledge engineering, but it is challenged by the scale, wide distribution, heterogeneity, and fast change of the data. This study proposes a parallel and adaptive method for OBG data collection and updating based on sampling techniques. First, the HD-QMC algorithm is proposed for adaptive collection of OBG data by combining the branch-and-bound method with quasi-Monte Carlo sampling. Next, so that the collected data reflect the dynamic changes of real-world OBGs, the EPP algorithm is proposed for efficient updating of OBG data based on information entropy and the Poisson process. The effectiveness of the proposed algorithms is analyzed theoretically, and the various kinds of collected OBG data are uniformly represented as RDF triples, providing an easy-to-use data foundation for OBG analysis and related studies. Finally, the collection and updating algorithms are implemented on Spark, and experimental results on synthetic and real-world datasets show the effectiveness and efficiency of the proposed method.

Key words: online big graph; data collection; data updating; parallel crawler; Spark
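The abstract names information entropy and the Poisson process as the basis of the updating step but does not give the EPP algorithm itself. The following is only a minimal sketch of those two ingredients under textbook assumptions, not the authors' method: changes to a node are assumed to follow a homogeneous Poisson process, the rate is estimated as observed changes per unit time, and the binary entropy of the change probability is used as a measure of uncertainty about the stored copy. The function names and the `history` records are hypothetical.

```python
import math


def estimate_rate(num_changes: int, observed_time: float) -> float:
    """Maximum-likelihood rate of a homogeneous Poisson process (changes per unit time)."""
    if observed_time <= 0:
        raise ValueError("observed_time must be positive")
    return num_changes / observed_time


def change_probability(rate: float, elapsed: float) -> float:
    """P(at least one change within `elapsed`) = 1 - exp(-rate * elapsed)."""
    return 1.0 - math.exp(-rate * elapsed)


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a Bernoulli(p) event; defined as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


# Hypothetical crawl history (illustrative values only):
# (node id, changes seen, days observed, days since last visit)
history = [
    ("page_a", 30, 30.0, 1.0),
    ("page_b", 3, 30.0, 1.0),
    ("page_c", 3, 30.0, 7.0),
]


def revisit_scores(rows):
    """Map each node to (probability it has changed, entropy of that event)."""
    scores = {}
    for node, changes, observed, elapsed in rows:
        p = change_probability(estimate_rate(changes, observed), elapsed)
        scores[node] = (p, binary_entropy(p))
    return scores


scores = revisit_scores(history)
# List nodes from most to least likely to have changed since the last visit.
for node, (p, h) in sorted(scores.items(), key=lambda kv: -kv[1][0]):
    print(f"{node}: P(changed)={p:.3f}, entropy={h:.3f} bits")
```

Note that entropy peaks when the change probability is 0.5 and falls again as a change becomes near certain, so the two scores rank nodes differently. How the paper combines them into an update schedule is not stated in the abstract.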

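Cite this article:

YIN Zi-Du, YUE Kun, ZHANG Bin-Bin, LI Jin. Sampling-based Collection and Updating of Online Big Graph Data. Journal of Software, 2020, 31(11): 3540-3558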


History

Received: 2018-10-25
Last revised: 2019-01-16
Published online: 2020-11-07
Publication date: 2020-11-06
