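Performance Analysis of Clustering-Based Transductive Learning

Funding:

National Natural Science Foundation of China (61103131, 61472391); Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education; Beijing Natural Science Foundation (4142050)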
    Abstract:

    In recent years, Twitter search has attracted increasing attention from researchers in the social networking field. Although learning to rank can incorporate the rich features available in Twitter, a lack of training data degrades its retrieval effectiveness. Transductive learning, a common semi-supervised learning method, plays an important role in addressing the scarcity of training data. Because noise is generated during the iterative process of transductive learning, a clustering-based transductive method has been proposed. This method has two important parameters: the clustering threshold and the number of documents to be clustered. Building on that earlier work, this paper applies a different clustering algorithm. Extensive experiments on the standard TREC Tweets11 collection show that both parameters affect retrieval effectiveness. The paper also analyzes the robustness of the clustering-based transductive model on different query sets. Finally, it introduces a quality-control factor called cluster coherence and proposes an adaptive clustering-based transductive approach for Twitter retrieval. The experimental results show that the adaptive clustering-based learning algorithm has better robustness.
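
    The abstract names three controls: the number of documents to be clustered, the clustering threshold, and a cluster-coherence quality factor. The sketch below shows one way these could interact in a single transduction round. It is an illustration under stated assumptions, not the authors' method: the single-pass clustering, the cosine similarity, the definition of coherence as mean pairwise similarity, and every name (`select_pseudo_labels`, `top_k`, `threshold`, `min_coherence`) are hypothetical choices made for this example.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def threshold_clusters(vectors, threshold):
    """Single-pass clustering (assumed): join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster accepted this vector, so it seeds a new one.
            clusters.append([i])
    return clusters


def coherence(vectors, cluster):
    """Assumed coherence: mean pairwise cosine similarity; a singleton scores 0.0."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def select_pseudo_labels(scored_docs, top_k, threshold, min_coherence):
    """Pick pseudo-relevant documents for one transduction round.

    scored_docs: list of (doc_id, score, feature_vector) from the current ranker.
    Returns the doc ids in the largest cluster of the top_k documents, or an
    empty list when that cluster's coherence is below min_coherence.
    """
    top = sorted(scored_docs, key=lambda d: d[1], reverse=True)[:top_k]
    vectors = [d[2] for d in top]
    clusters = threshold_clusters(vectors, threshold)
    if not clusters:
        return []
    best = max(clusters, key=len)
    if coherence(vectors, best) < min_coherence:
        return []  # quality gate: skip this round rather than add noisy labels
    return [top[i][0] for i in best]
```

    In this sketch, raising `threshold` produces smaller and tighter clusters, raising `top_k` widens the candidate pool, and `min_coherence` lets a round add no pseudo-labels at all when the largest cluster is too loose.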

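Cite this article

Zhang X, He B, Luo TJ, Li DX. Performance analysis of clustering-based transductive learning. Journal of Software (软件学报), 2014,25(12):2865-2876 (in Chinese with English abstract).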

History
  • Received: 2014-05-05
  • Last revised: 2014-08-21
  • Published online: 2014-12-04