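[Keywords]
[Abstract]
Automatic term extraction automatically extracts domain-relevant words or phrases from a text collection. It is a key foundational problem and an active research topic in areas such as ontology construction, text summarization, and knowledge graphs. In particular, the recent rise of research on unstructured big text data has drawn further attention to automatic term extraction and has produced a rich body of results. Taking term ranking algorithms as the main thread, this paper surveys the theory, techniques, current state, and strengths and weaknesses of automatic term extraction methods. It first outlines the formal definition of the automatic term extraction problem and a framework for solving it. It then classifies recent research from China and abroad according to features at two levels of "shallow parsing", namely basic linguistic information and relational structure information, and systematically summarizes the progress of existing methods and the challenges they face. Finally, it analyzes the data resources and experimental evaluation used in term extraction, and discusses possible future research trends.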
[Keywords]
[Abstract]
Automatic term extraction is the task of automatically extracting domain-related words or phrases from document collections. It is a fundamental problem and an active research topic in ontology construction, text summarization, and knowledge graphs. In particular, with the recent rise of research on unstructured big text data, automatic term extraction has attracted growing attention from researchers and has yielded a rich body of results. Taking term ranking algorithms as the main thread, this study surveys the theory, techniques, current state, and advantages and disadvantages of automatic term extraction methods. First, the formal definition of the automatic term extraction problem and a solution framework are outlined. Then, recent results are classified according to features at two levels of "shallow parsing", namely basic linguistic information and relational structure information, and the research progress and major challenges of existing methods are systematically summarized. Finally, the data resources and experimental evaluation used in term extraction are analyzed, and possible future research trends are discussed.
[CLC number]
[Funding]
National Natural Science Foundation of China (61772537, 61772536, 61702522, 61532021); National Key Research and Development Program of China (2018YFB1004401)