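Liu Beibei (刘贝贝), Ma Runing (马儒宁), Ding Jundi (丁军娣). Density-Based Statistical Merging Algorithm for Large Data Sets. Journal of Software (软件学报), 2015, 26(11): 2820-2835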
Density-Based Statistical Merging Algorithm for Large Data Sets (大数据的密度统计合并算法)
Received: 2015-05-30  Revised: 2015-08-26
DOI: 10.13328/j.cnki.jos.004902
Keywords: clustering; sampling; leader; density; large data
Funding: National Natural Science Foundation of China (61103058, 61233011)
Abstract:
Traditional clustering algorithms often fail or perform poorly on large data sets. To address this, the paper proposes a density-based statistical merging algorithm for large data sets (DSML). The algorithm treats each feature of a data point as a set of independent random variables and derives a statistical merging criterion from the independent bounded difference inequality. First, DSML uses this criterion to improve the Leaders algorithm and applies the improved algorithm as a sampling step to obtain a set of representative points. Second, combining the density and neighborhood information of the representative points, it applies the statistical merging criterion again to cluster the whole data set. Theoretical analysis and experimental results show that DSML has nearly linear time complexity, handles data sets of arbitrary shape, and is robust to noise, which makes it well suited to large-scale data sets.
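For orientation, the sampling step builds on the Leaders algorithm, which can be sketched as a single pass over the data. The sketch below is the classic fixed-threshold variant only, not DSML itself: the abstract does not state the statistical merging criterion, so the distance threshold `tau`, the function name `leaders`, and the sample points are all assumptions made for illustration.

```python
import math

def leaders(points, tau):
    """Classic single-pass Leaders clustering with a fixed distance threshold.

    NOTE: `tau` is a hypothetical stand-in for DSML's statistical merging
    criterion, which the abstract does not specify.
    """
    reps = []    # representative points ("leaders") found so far
    groups = []  # groups[k] holds the indices of points assigned to reps[k]
    for idx, p in enumerate(points):
        for k, q in enumerate(reps):
            if math.dist(p, q) <= tau:   # close enough: follow the first matching leader
                groups[k].append(idx)
                break
        else:
            reps.append(p)               # no leader within tau: p becomes a new leader
            groups.append([idx])
    return reps, groups

# Illustrative data: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.2)]
reps, groups = leaders(points, tau=0.5)
print(reps)    # representative points
print(groups)  # indices of the points that follow each representative
```

The number of followers per representative gives a rough density estimate for that representative; how DSML actually combines density and neighborhood information is described in the full paper, not in this abstract.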