中文医疗文本中的嵌套实体识别方法
作者:
作者简介:

闫璟辉(1992-), 男, 博士, 主要研究领域为知识抽取, 自然语言处理.
宗成庆(1963-), 男, 博士, 研究员, 博士生导师, CCF会士, 主要研究领域为机器翻译, 自然语言处理.
徐金安(1970-), 男, 博士, 教授, 博士生导师, CCF杰出会员, 主要研究领域为机器翻译, 自然语言处理, 知识图谱及其应用

通讯作者:

宗成庆, E-mail: cqzong@nlpr.ia.ac.cn

中图分类号:

TP18


Nested Entity Recognition Approach in Chinese Medical Text
Author:
  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [35]
  • |
  • 相似文献 [20]
  • | | |
  • 文章评论
    摘要:

    实体识别是信息抽取的关键技术. 相较于普通文本, 中文医疗文本的实体识别任务往往面对大量的嵌套实体. 以往识别实体的方法往往忽视了医疗文本本身所特有的实体嵌套规则而直接采用序列标注方法, 为此, 提出一种融合实体嵌套规则的中文实体识别方法. 所提方法在训练过程中将实体的识别任务转化为实体的边界识别与边界首尾关系识别的联合训练任务, 在解码过程中结合从实际医疗文本中所总结出来的实体嵌套规则对解码结果进行过滤, 从而使得识别结果能够符合实际文本中内外层实体嵌套组合的组成规律. 在公开的医疗文本实体识别的实验上取得良好的效果. 数据集上的实验表明, 所提方法在嵌套类型实体识别性能上显著优于已有的方法, 在整体准确率方面比最先进的方法提高0.5%.

    Abstract:

    Entity recognition is a key technology for information extraction. Compared with ordinary text, the entity recognition of Chinese medical text is often faced with a large number of nested entities. Previous methods of entity recognition often ignore the entity nesting rules unique to medical text and directly use sequence annotation methods. Therefore, a Chinese entity recognition method that incorporates entity nesting rules is proposed. This method transforms the entity recognition task into a joint training task of entity boundary recognition and boundary first-tail relationship recognition in the training process and filters the results by combining the entity nesting rules summarized from actual medical text in the decoding process. In this way, the recognition results are in line with the composition law of the nested combinations of inner and outer entities in the actual text. Good results have been achieved in public experiments on entity recognition of medical text. Experiments on the dataset show that the proposed method is significantly superior to the existing methods in terms of nested-type entity recognition performance, and the overall accuracy is increased by 0.5% compared with the state-of-the-art methods.

    参考文献
    [1] Cowie MR, Blomster JI, Curtis LH, Duclaux S, Ford I, Fritz F, Goldman S, Janmohamed S, Kreuzer J, Leenay M, Michel A, Ong S, Pell JP, Southworth MR, Stough WG, Thoenes M, Zannad F, Zalewski A. Electronic health records to facilitate clinical research. Clinical Research in Cardiology, 2017, 106(1): 1–9. [doi: 10.1007/s00392-016-1025-6]
    [2] Denaxas SC, Morley KI. Big biomedical data and cardiovascular disease research: Opportunities and challenges. European Heart Journal-Quality of Care and Clinical Outcomes, 2015, 1(1): 9–16. [doi: 10.1093/ehjqcco/qcv005]
    [3] Li I, Pan J, Goldwasser J, Verma N, Wong WP, Nuzumlalı MY, Rosand B, Li YX, Zhang M, Chang D, Taylor RA, Krumholz HM, Radev D. Neural natural language processing for unstructured data in electronic health records: A review. Computer Science Review, 2022, 46: 100511. [doi: 10.1016/j.cosrev.2022.100511]
    [4] Li M, Xiang L, Kang XM, Zhao Y, Zhou Y, Zong CQ. Medical term and status generation from Chinese clinical dialogue with multi-granularity transformer. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3362–3374. [doi: 10.1109/TASLP.2021.3122301]
    [5] Sun J, Zhou Y, Zong CQ. One-shot relation learning for knowledge graphs via neighborhood aggregation and paths encoding. ACM Transactions on Asian and Low-Resource Language Information Processing, 2021, 21(3): 52. [doi: 10.1145/3484729]
    [6] 周永惠. 关于现代汉语语素. 西南民族学院学报·哲学社会科学版, 2001, 22(7): 202–205.
    Zhou YH. About modern Chinese morphemes. Journal of Southwest University for Nationalities (Philosophy and Social Sciences), 2001, 22(7): 202–205 (in Chinese with English abstract).
    [7] Chiu JPC, Nichols E. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 2016, 4: 357–370. [doi: 10.1162/tacl_a_00104]
    [8] Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural architectures for named entity recognition. In: Proc. of the 2016 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego: ACL, 2016. 260–270.
    [9] Dong CH, Zhang JJ, Zong CQ, Hattori M, Di H. Character-based LSTM-CRF with radical-level features for Chinese named entity recognition. In: Proc. of th 5th CCF Conf. on Natural Language Processing and Chinese Computing (NLPCC 2016), and the 24th Int’l Conf. on Computer Processing of Oriental Languages. Kunming: Springer, 2016. 239–250.
    [10] Huang ZH, Xu W, Yu K. Bidirectional LSTM-CRF models for sequence tagging. arXiv:1508.01991, 2015.
    [11] Sheikhshab G, Birol I, Sarkar A. In-domain context-aware token embeddings improve biomedical named entity recognition. In: Proc. of the 9th Int’l Workshop on Health Text Mining and Information Analysis. Brussels: ACL, 2018. 160–164.
    [12] Li XN, Yan H, Qiu XP, Huang XJ. FLAT: Chinese ner using flat-lattice transformer. In: Proc. of the 58th Annual Meeting of the Association for Computational Linguistics. Online: ACL, 2020. 6836–6842.
    [13] Bekoulis G, Deleu J, Demeester T, Develder C. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Systems with Applications, 2018, 114: 34–45. [doi: 10.1016/j.eswa.2018.07.032]
    [14] Yu JT, Bohnet B, Poesio M. Named entity recognition as dependency parsing. In: Proc. of the 58th Annual Meeting of the Association for Computational Linguistics. ACL, 2020. 6470–6476.
    [15] Shen YL, Ma XY, Tan ZQ, Zhang S, Wang W, Lu WM. Locate and label: A two-stage identifier for nested named entity recognition. In: Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int’l Joint Conf. on Natural Language Processing (Vol. 1: Long Papers). ACL, 2021. 2782–2794.
    [16] Isozaki H, Kazawa H. Efficient support vector classifiers for named entity recognition. In: Proc. of the 19th Int’l Conf. on Computational Linguistics. Taipei: ACL, 2002. 1–7.
    [17] Lee KJ, Hwang YS, Kim S, Rim HC. Biomedical named entity recognition using two-phase model based on SVMs. Journal of Biomedical Informatics, 2004, 37(6): 436–447. [doi: 10.1016/j.jbi.2004.08.012]
    [18] Ju ZF, Wang J, Zhu F. Named entity recognition from biomedical text using SVM. In: Proc. of the 5th Int’l Conf. on Bioinformatics and Biomedical Engineering. Wuhan: IEEE, 2011. 1–4.
    [19] Zhou GD, Su J. Named entity recognition using an HMM-based chunk tagger. In: Proc. of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia: ACL, 2002. 473–480.
    [20] Zhao SJ. Named entity recognition in biomedical texts using an HMM model. In: Proc. of the 2004 Int’l Joint Workshop on Natural Language Processing in Biomedicine and its Applications. Geneva: COLING, 2004. 87–90.
    [21] Zhang J, Shen D, Zhou GD, Su J, Tan CL. Enhancing HMM-based biomedical named entity recognition by studying special phenomena. Journal of Biomedical Informatics, 2004, 37(6): 411–422. [doi: 10.1016/j.jbi.2004.08.005]
    [22] McCallum A, Li W. Early results for named entity recognition with conditional random fields, feature induction and Web-enhanced lexicons. In: Proc. of the 7th Conf. on Natural Language Learning at HLT-NAACL 2003. Edmonton: ACL, 2003. 188–191.
    [23] Settles B. Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proc. of the 2004 Int’l Joint Workshop on Natural Language Processing in Biomedicine and its Applications. Geneva: COLING, 2004. 107–110.
    [24] Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011, 12: 2493–2537.
    [25] Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L. Deep contextualized word representations. In: Proc. of the 2018 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long Papers). New Orleans: ACL, 2018. 2227–2237.
    [26] Hakala K, Pyysalo S. Biomedical named entity recognition with multilingual BERT. In: Proc. of the 5th Workshop on BioNLP Open Shared Tasks. Hong Kong: ACL, 2019. 56–61.
    [27] Shen D, Zhang J, Zhou GD, Su J, Tan CL. Effective adaptation of hidden markov model-based named entity recognizer for biomedical domain. In: Proc. of the 2003 ACL Workshop on Natural Language Processing in Biomedicine. Sapporo: ACL, 2003. 49–56.
    [28] Zhou GD, Zhang J, Su J, Shen D, Tan C. Recognizing names in biomedical texts: A machine learning approach. Bioinformatics, 2004, 20(7): 1178–1190. [doi: 10.1093/bioinformatics/bth060]
    [29] Zhou GD. Recognizing names in biomedical texts using mutual information independence model and SVM plus sigmoid. International Journal of Medical Informatics, 2006, 75(6): 456–467. [doi: 10.1016/j.ijmedinf.2005.06.012]
    [30] Ju MZ, Miwa M, Ananiadou S. A neural layered model for nested named entity recognition. In: Proc. of the 2018 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long Papers). New Orleans: ACL, 2018. 1446–1459.
    [31] Wang J, Shou LD, Chen K, Chen G. Pyramid: A layered model for nested named entity recognition. In: Proc. of the 58th Annual Meeting of the Association for Computational Linguistics. ACL, 2020. 5918–5928.
    [32] Zheng CM, Cai Y, Xu JY, Leung HF, Xu GD. A boundary-aware neural model for nested named entity recognition. In: Proc. of the 2019 Conf. on Empirical Methods in Natural Language Processing and the 9th Int’l Joint Conf. on Natural Language Processing. Hong Kong: ACL, 2019. 357–366.
    [33] Su JL, Murtadha A, Pan SF, Hou J, Sun J, Huang WW, Wen B, Liu YF. Global pointer: Novel efficient span-based approach for named entity recognition. arXiv:2208.03054, 2022.
    [34] Zhang NY, Jia QH, Yin KP, Dong L, Gao F, Hua NW. Conceptualized representation learning for Chinese biomedical text mining. arXiv:2008.10813, 2020.
    引证文献
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

闫璟辉,宗成庆,徐金安.中文医疗文本中的嵌套实体识别方法.软件学报,2024,35(6):2923-2935

复制
分享
文章指标
  • 点击次数:676
  • 下载次数: 2241
  • HTML阅读次数: 1177
  • 引用次数: 0
历史
  • 收稿日期:2022-09-30
  • 最后修改日期:2022-11-03
  • 在线发布日期: 2023-08-23
  • 出版日期: 2024-06-06
文章二维码
您是第20525312位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号