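Prediction Method for Question Deletion in Software Question and Answer Community
Authors: Jiang Jing, Miao Meng, Zhao Lixian, Zhang Li
Corresponding author:

Zhang Li, lily@buaa.edu.cn

CLC number:

TP311

Funding:

Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0102304); National Natural Science Foundation of China (62177003); Fundamental Research Funds for the Central Universities (YWF-20-BJ-J-1018)
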
    Abstract:

    Stack Overflow is one of the most popular software question and answer communities, where users post questions and receive answers from other users. To maintain question quality, the site needs to find and delete low-quality or off-topic questions as quickly as possible. Currently, Stack Overflow relies mainly on manual inspection to find questions that need to be deleted. However, manual inspection often fails to find and delete such questions in time, and it adds to the workload of community administrators. To find questions that need to be deleted quickly, this study proposes MulPredictor, a method that automatically predicts question deletion. The method extracts the semantic content features, semantic statistical features, and meta features of a question, and uses a random forest classifier to compute the probability that the question will be deleted. Experimental results show that, compared with the existing methods DelPredictor and NLPPredictor, MulPredictor improves accuracy by 16.34% and 12.78%, respectively, on the balanced test set, and by 12.38% and 14.14%, respectively, on the random test set. In addition, the study analyzes which features matter most for question deletion and finds that features of the code segments, the question title, and the first paragraph of the question body have an important influence on question deletion.
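
The pipeline described in the abstract (extract features from a question, then let a forest of trees vote on a deletion probability) can be illustrated with a minimal, standard-library sketch. Everything below is an assumption made for illustration and is not MulPredictor itself: the three toy features (title length, first-paragraph length, number of `<code>` segments), the function names `extract_features`, `fit_stump`, `fit_forest`, and `deletion_probability`, the tiny hand-written data set, and the use of one-split trees (stumps) in place of a full random forest.

```python
import random
import re

CODE_RE = re.compile(r"<code>.*?</code>", re.S)

def extract_features(title, body):
    """Toy stand-ins for question features (hypothetical, not the paper's feature set)."""
    code_segments = CODE_RE.findall(body)
    prose = CODE_RE.sub(" ", body).strip()
    first_paragraph = prose.split("\n\n")[0]
    return [
        len(title.split()),            # title length in words
        len(first_paragraph.split()),  # first-paragraph length in words
        len(code_segments),            # number of code segments
    ]

def fit_stump(rows, rng):
    """Fit a one-split tree on a single randomly chosen feature."""
    f = rng.randrange(len(rows[0][0]))
    best = None
    for t in sorted({x[f] for x, _ in rows}):
        for low_label in (0, 1):  # label predicted when x[f] <= t
            errors = sum(
                (low_label if x[f] <= t else 1 - low_label) != y for x, y in rows
            )
            if best is None or errors < best[0]:
                best = (errors, f, t, low_label)
    return best[1:]

def fit_forest(rows, n_trees=25, seed=0):
    """Bagging: each stump is trained on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    forest = []
    while len(forest) < n_trees:
        sample = rng.choices(rows, k=len(rows))
        if len({y for _, y in sample}) == 2:  # skip samples with only one class
            forest.append(fit_stump(sample, rng))
    return forest

def deletion_probability(forest, x):
    """Fraction of trees that vote 'deleted' (label 1)."""
    votes = [low if x[f] <= t else 1 - low for f, t, low in forest]
    return sum(votes) / len(votes)

# Hand-written toy questions: label 1 = deleted, label 0 = kept.
questions = [
    ("help me", "it fails", 1),
    ("urgent error please", "does not work", 1),
    ("code broken", "fix this now", 1),
    ("How to reverse a list in Python",
     "I want to reverse a list without copying it.\n\n<code>xs.reverse()</code>", 0),
    ("Why does this loop never terminate",
     "The loop below keeps running after the counter reaches ten.\n\n"
     "<code>while i != 10: i += 2</code>", 0),
    ("Parsing dates with a custom format",
     "My parser rejects dates that use dots as separators.\n\n"
     "<code>parse('1.2.2020')</code>\n\n<code>ValueError</code>", 0),
]
rows = [(extract_features(title, body), label) for title, body, label in questions]
forest = fit_forest(rows)

p_bad = deletion_probability(forest, extract_features("help", "broken"))
p_good = deletion_probability(
    forest,
    extract_features(
        "How do I sort a dictionary by value",
        "I need the keys ordered by their values.\n\n<code>sorted(d)</code>",
    ),
)
print(p_bad, p_good)
```

On this toy data the vague, code-free question receives a higher deletion probability than the detailed one. A realistic setup would replace the stumps with a library random forest and the toy features with the semantic content, semantic statistical, and meta features named in the abstract.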

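Cite this article

Jiang J, Miao M, Zhao LX, Zhang L. Prediction method for question deletion in software question and answer community. Journal of Software, 2022, 33(5): 1699-1710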

History
  • Received: 2021-08-10
  • Last revised: 2021-10-09
  • Published online: 2022-01-28
  • Publication date: 2022-05-06