Incremental Data Sampling Method Using Affinity Propagation with Dynamic Weighting

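About the authors:

Chen Xiaoqi (陈晓琪) (1994-), female, master's student. Main research interest: big data knowledge discovery.
Xie Zhenping (谢振平) (1977-), male, Ph.D., professor, doctoral supervisor, CCF professional member. Main research interests: knowledge modeling, cognitive computing, intelligent system software.
Liu Yuan (刘渊) (1967-), male, professor, doctoral supervisor, CCF senior member. Main research interests: digital media, network security.
Zhan Qianyi (詹千熠) (1989-), female, Ph.D., associate professor, CCF professional member. Main research interests: data mining, social network analysis.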

Corresponding author:

Xie Zhenping (谢振平), E-mail: xiezp@jiangnan.edu.cn

CLC number:

TP311

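Fund project:

National Natural Science Foundation of China (61872166); Six Talent Peaks Project of Jiangsu Province (XYDXX-161); Science and Technology Planning Project of Jiangsu Province (BE2018056)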


    Abstract:

    Data sampling is an important means of quickly extracting useful information from large-scale datasets. To meet the growing demand for efficient processing of ever larger datasets, this paper proposes an incremental data sampling method built on the affinity propagation algorithm. The method introduces two strategies, hierarchical incremental processing and dynamic weighting of sample points, and is designed to balance computational efficiency against sampling quality. The hierarchical incremental processing strategy samples the original large-scale dataset in batches and then combines the batch results hierarchically. The dynamic weighting strategy re-weights the preferences of sample points during affinity propagation so that the sampled points retain better global consistency over the data space. In the experiments, artificial datasets, UCI benchmark datasets, and image datasets are used to analyze sampling performance. The results indicate that the proposed method reaches sampling partition quality comparable to that of existing related methods while substantially improving computational efficiency, especially on larger datasets. The method is further applied to a data augmentation task in deep learning. The corresponding results show that, when the original data augmentation method is combined with the proposed incremental sampling and the total size of the training set is kept unchanged, the performance of the trained model improves significantly compared with traditional data augmentation strategies.
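The two strategies named in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' algorithm: the abstract does not give the weighting rule, so the rule used here (a point's preference is the median similarity divided by the number of original points it represents) is purely an assumption, and the names `affinity_propagation`, `sample_batch`, and `incremental_sample` are hypothetical. The propagation step itself follows the standard responsibility/availability updates of affinity propagation.

```python
import random
from statistics import median

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def affinity_propagation(points, weights, damping=0.7, iters=100):
    """Standard affinity propagation; returns the exemplar index of each point."""
    n = len(points)
    if n == 1:
        return [0]
    # Similarity = negative squared Euclidean distance.
    S = [[-sq_dist(points[i], points[k]) for k in range(n)] for i in range(n)]
    base = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        # ASSUMPTION (not from the paper): heavier points get a preference
        # closer to zero, so they are more likely to be kept as exemplars.
        S[k][k] = base / weights[k]
    R = [[0.0] * n for _ in range(n)]
    A = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        for i in range(n):  # responsibility update
            vals = [A[i][k] + S[i][k] for k in range(n)]
            first = max(range(n), key=vals.__getitem__)
            second = max(v for k, v in enumerate(vals) if k != first)
            for k in range(n):
                new = S[i][k] - (second if k == first else vals[first])
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        for k in range(n):  # availability update
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]  # positive responsibilities from i != k
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:  # fallback if no point qualifies after the last iteration
        exemplars = [max(range(n), key=lambda k: R[k][k] + A[k][k])]
    chosen = set(exemplars)
    return [i if i in chosen else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]

def sample_batch(points, weights):
    """Run AP once; each exemplar inherits the total weight of its cluster."""
    merged = {}
    for i, k in enumerate(affinity_propagation(points, weights)):
        merged[k] = merged.get(k, 0) + weights[i]
    return [points[k] for k in merged], [merged[k] for k in merged]

def incremental_sample(data, batch_size):
    """Sample each batch separately, then re-sample the pooled exemplars."""
    pooled_pts, pooled_w = [], []
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        pts, w = sample_batch(batch, [1] * len(batch))
        pooled_pts += pts
        pooled_w += w
    return sample_batch(pooled_pts, pooled_w)

if __name__ == "__main__":
    random.seed(0)
    centers = [(0, 0), (5, 5), (0, 5)]
    data = [(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
            for cx, cy in centers for _ in range(30)]
    random.shuffle(data)
    samples, weights = incremental_sample(data, batch_size=30)
    print(len(samples), "samples kept, total weight", sum(weights))
```

Because each batch is clustered on its own, the quadratic cost of affinity propagation applies only to the batch size and to the pooled exemplar set, which is the efficiency argument the abstract makes. The weights carried into the second pass are one simple way to let a retained sample stand for the points it replaced.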

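Cite this article

Chen Xiaoqi, Xie Zhenping, Liu Yuan, Zhan Qianyi. Incremental data sampling method using affinity propagation with dynamic weighting. Journal of Software (软件学报), 2021, 32(12): 3884-3900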

History
  • Received: 2019-08-01
  • Revised: 2020-06-15
  • Accepted:
  • Published online: 2021-12-02
  • Publication date: 2021-12-06