[关键词]
[摘要]
文本分类任务作为文本挖掘的核心问题,已成为自然语言处理领域的一个重要课题.而短文本分类由于稀疏性、实时性和不规范性等特点,已成为文本分类亟待解决的问题之一.在某些特定场景,短文本存在大量隐含语义,由此给挖掘有限文本内的隐含语义特征等任务带来挑战.已有的方法对短文本分类主要采用传统机器学习或深度学习算法,但该类算法的模型构建复杂且工作量大,效率不高.此外,短文本包含有效信息较少且口语化严重,对模型的特征学习能力要求较高.针对以上问题,提出了KAeRCNN模型,该模型在TextRCNN模型的基础上,融合了知识感知与双重注意力机制.知识感知包含了知识图谱实体链接和知识图谱嵌入,可以引入外部知识以获取语义特征,同时,双重注意力机制可以提高模型对短文本中有效信息提取的效率.实验结果表明,KAeRCNN模型在分类准确度、F1值和实际应用效果等方面显著优于传统的机器学习算法.对算法的性能和适应性进行了验证,准确率达到95.54%,F1值达到0.901,对比4种传统机器学习算法,准确率平均提高了约14%,F1值提升了约13%.与TextRCNN相比,KAeRCNN模型在准确性方面提升了约3%.此外,与深度学习算法的对比实验结果也说明,该模型在其他领域的短文本分类中也有较好的表现.理论和实验结果都证明,所提出的KAeRCNN模型对短文本分类效果更优.
[Keywords]
[Abstract]
As a core problem in text mining, text classification has become an important topic in natural language processing. Short text classification, characterized by sparsity, real-time requirements, and non-standard language, is one of the problems in text classification that most urgently needs to be solved. In certain scenarios, short texts carry a large amount of implicit semantics, which makes it challenging to mine implicit semantic features from limited text. Existing methods mainly apply traditional machine learning or deep learning algorithms to short text classification; however, building models with such algorithms is complex and labor-intensive, and the resulting models are inefficient. In addition, short texts contain little effective information and are highly colloquial, which places high demands on the feature learning ability of the model. To address these problems, the KAeRCNN model is proposed, which builds on the TextRCNN model and integrates knowledge awareness with a dual attention mechanism. Knowledge awareness comprises knowledge graph entity linking and knowledge graph embedding, which introduce external knowledge to obtain semantic features, while the dual attention mechanism improves the model's efficiency in extracting effective information from short texts. Extensive experimental results show that the KAeRCNN model is significantly better than traditional machine learning algorithms in terms of classification accuracy, F1 score, and practical application effect. The performance and adaptability of the algorithm are further verified on different datasets. The accuracy of the proposed approach reaches 95.54%, and the F1 score reaches 0.901. Compared with four traditional machine learning algorithms, the accuracy is about 14% higher on average and the F1 score is about 13% higher. Compared with TextRCNN, the KAeRCNN model improves accuracy by about 3%. In addition, comparison experiments with deep learning algorithms show that the proposed model also performs well on short text classification in other domains. Both theoretical and experimental results indicate that the proposed KAeRCNN model is more effective for short text classification.
[CLC number]
TP18
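[Funding]
National Natural Science Foundation of China (62172351, 61728204); Key Laboratory of Safety-Critical Software, Ministry of Industry and Information Technology (NJ2018014); Chinese Society of Academic Degrees and Graduate Education (B-2017Y0904-162); Huawei Innovation DB IRP (CCF-HUAWEI DBIR2020001A)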