Performance Analysis of Clustering-Based Transductive Learning

doi:10.13328/j.cnki.jos.004726


Journal of Software, 2014, Vol. 25, No. 12, pp. 2865-2876


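Fund Project: National Natural Science Foundation of China (61103131, 61472391); Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education of China; Beijing Natural Science Foundation (4142050)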

Performance Analysis of Clustering-Based Transductive Learning

Author: ZHANG Xin, HE Ben, LUO Tie-Jian, LI Dong-Xing

Affiliation: School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 101408, China


Abstract:

Twitter search has recently attracted growing attention from researchers in social networks. Although learning to rank can incorporate the rich features available in Twitter, a lack of training data degrades its retrieval effectiveness. Transductive learning, a common semi-supervised learning method, plays an important role in coping with scarce training data. Because noise is generated during the iterative process of transductive learning, a clustering-based transductive method has been proposed. This method has two important parameters: the clustering threshold and the number of documents to be clustered. Building on that earlier work, this paper applies a different clustering algorithm. Extensive experiments on the standard TREC Tweets11 collection show that both parameters affect retrieval effectiveness. The robustness of the clustering-based transductive model across different query sets is also analyzed. Finally, the paper introduces a quality-control factor called cluster coherence and proposes an adaptive clustering-based transductive approach for Twitter retrieval. The experimental results show that the adaptive approach is more robust.

Key words: clustering; transductive learning; Twitter search; adaptive; performance
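The abstract names two parameters (a clustering threshold and the number of documents to be clustered) and a quality factor called cluster coherence, but it does not define them. The sketch below is only an illustration of how such a filter could work in one transduction step, not the paper's algorithm. It assumes that coherence is the mean pairwise cosine similarity of a cluster and that a candidate is accepted only if the grown cluster stays above the threshold. The names `coherence`, `transduce_once`, `top_n`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coherence(vectors):
    """Assumed definition: mean pairwise cosine similarity (1.0 for a singleton)."""
    n = len(vectors)
    if n < 2:
        return 1.0
    total = sum(
        cosine(vectors[i], vectors[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


def transduce_once(seed_vectors, scored_candidates, top_n, threshold):
    """Pick pseudo-relevant documents in one hypothetical transduction step.

    seed_vectors: vectors of documents already labeled relevant.
    scored_candidates: (score, vector) pairs produced by the current ranker.
    top_n: number of top-scored candidates considered for clustering.
    threshold: minimum coherence the grown cluster must keep.
    """
    cluster = list(seed_vectors)
    accepted = []
    ranked = sorted(scored_candidates, key=lambda pair: pair[0], reverse=True)
    for _, vec in ranked[:top_n]:
        # Accept a candidate only if it does not dilute the cluster too much.
        if coherence(cluster + [vec]) >= threshold:
            cluster.append(vec)
            accepted.append(vec)
    return accepted
```

In this sketch, a lower `threshold` or a larger `top_n` admits more pseudo-labeled documents and therefore more potential noise, which is the trade-off the abstract attributes to the two parameters.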


Cite this article

Zhang X, He B, Luo TJ, Li DX. Performance analysis of clustering-based transductive learning. Journal of Software, 2014, 25(12): 2865-2876.


History

Received: 2014-05-05
Revised: 2014-08-21
Published online: 2014-12-04
