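Zhao Jingsheng (1969-), male, associate professor. His main research interests are natural language processing and Chinese information processing.
Song Mengxue (1996-), female, master's degree. Her main research interest is intelligent information processing.
Gao Xiang (1992-), male, master's degree. His main research interest is intelligent information processing.
Zhu Qiaoming (1963-), male, Ph.D., professor, doctoral supervisor, CCF distinguished member. His main research interests are natural language processing and intelligent information processing.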
Zhao Jingsheng, zhao5199@163.com
TP391
National Natural Science Foundation of China (61773276, 61836007)
Natural language processing is a core technology of artificial intelligence, and text representation is a basic and necessary task within it that affects, and can even determine, the quality and performance of natural language processing systems. This paper discusses the basic principles of text representation, the formalization of natural language, language models, and the connotation and extension of text representation. It analyzes the technical classification of text representation at a macro level, and analyzes, categorizes, and summarizes the mainstream techniques and methods, including text representation based on vector space models, topic models, graphs, neural networks, and representation learning. Event-based, semantic-based, and knowledge-based text representations are also introduced. Finally, the development trends and directions of text representation technology are predicted and further discussed: deep learning built on neural networks, together with representation learning, will play an important role in text representation; the strategy of pre-training followed by fine-tuning will gradually become mainstream; text representation requires problem-specific analysis; and the integration of technology and application is the driving force.
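Zhao Jingsheng, Song Mengxue, Gao Xiang, Zhu Qiaoming. Research on text representation in natural language processing. Journal of Software (软件学报), 2022, 33(1): 102-128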