Mining Hotspots from Multiple Text Streams Based on Stream Information Distance

doi:10.3724/SP.J.1001.2011.03893

2011, Vol. 22, No. 8, pp. 1761-1770. DOI:10.3724/SP.J.1001.2011.03893

Funding: National Natural Science Foundation of China (600773169); National Key Technology R&D Program of China (2006BAI05A01)

Abstract:

This paper divides hotspots in text streams into local and global hotspots, analyzes the correlation between the two, and applies Kolmogorov complexity to hotspot mining in multiple text streams. First, the concept of redundant information is defined based on Kolmogorov complexity, and it is shown that a necessary condition for a text stream to contain a local hotspot is that its redundant information exceeds a certain threshold. Second, a similarity metric, termed stream information distance (SID), is proposed based on conditional Kolmogorov complexity to measure the similarity between different text streams. Borrowing the idea of phylogenetic trees from computational biology, a heuristic algorithm based on hierarchical clustering is then proposed to mine global hotspots from multiple text streams. Experiments on synthetic and real data sets validate the convergence, effectiveness, and scalability of the algorithm.
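The abstract does not give the formula for SID or the redundancy threshold, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that Kolmogorov complexity is approximated by zlib-compressed length, and it substitutes the well-known normalized compression distance (NCD) for SID. The function names `redundancy`, `ncd`, and `cluster` are hypothetical, and the average-linkage clustering is a generic choice rather than the paper's heuristic.

```python
import zlib
from itertools import combinations


def c(data: bytes) -> int:
    """Compressed length in bytes: a computable proxy for K(data)."""
    return len(zlib.compress(data, 9))


def redundancy(data: bytes) -> float:
    """Share of the raw length removed by compression (illustrative proxy
    for redundant information; the paper's definition is not in the abstract)."""
    if not data:
        return 0.0
    return max(0.0, 1.0 - c(data) / len(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, used here as a stand-in for SID."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def cluster(streams: dict[str, bytes], k: int) -> list[set[str]]:
    """Average-linkage agglomerative clustering of streams down to k groups."""
    clusters = [{name} for name in streams]
    # Pairwise distances between all streams, keyed by the unordered name pair.
    dist = {
        frozenset(pair): ncd(streams[pair[0]], streams[pair[1]])
        for pair in combinations(streams, 2)
    }

    def linkage(a: set[str], b: set[str]) -> float:
        total = sum(dist[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters


if __name__ == "__main__":
    demo = {
        "s1": b"earthquake hits the coastal city; rescue teams arrive. " * 15,
        "s2": b"rescue teams arrive after earthquake hits the coastal city. " * 15,
        "s3": b"stock market index falls as investors sell bank shares. " * 15,
    }
    for name, text in demo.items():
        print(name, round(redundancy(text), 3))
    print(cluster(demo, 2))
```

In this sketch, a stream whose `redundancy` is high is a candidate for containing a local hotspot, and streams that end up in the same cluster are candidates for sharing a global hotspot. Any threshold on `redundancy` would have to be chosen empirically.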

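Cite this article

Yang Ning, Tang Changjie, Wang Yue, Chen Yu, Zheng Jiaoling, Li Hongjun. Mining hotspots from multiple text streams based on stream information distance. Journal of Software, 2011, 22(8): 1761-1770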


History

Received: 2009-10-12
Revised: 2010-03-29