Corpus Construction for Named Entities and Entity Relations on Chinese Electronic Medical Records
    Abstract:

    An electronic medical record (EMR) is a record of medical activity for an individual patient, written by health care providers and stored in digital format. EMRs contain a wealth of medical knowledge and information about patients' health. Information extraction from EMRs, in particular named entity recognition and entity relation extraction, is important for clinical decision support, evidence-based medicine, and personalized medical services, and building an annotated corpus of named entities and entity relations is the first prerequisite for that work. Based on a survey of existing work, in China and abroad, on corpus construction for named entities and entity relations in EMRs, and taking into account the characteristics of Chinese electronic medical records (CEMRs), this paper proposes an annotation scheme for named entities and entity relations suited to CEMRs. With the guidance and participation of physicians, detailed annotation guidelines were drawn up, and an annotated corpus was built that has a complete annotation scheme, a relatively large scale, and high consistency. The corpus comprises 992 medical record documents; inter-annotator agreement (IAA) reaches 0.922 for named entity annotations and 0.895 for entity relation annotations. This work lays a solid foundation for subsequent research on information extraction from CEMRs.
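The abstract reports agreement figures but does not say how they were computed. As an illustration only, the sketch below shows one measure often used for span annotation, pairwise F-measure with exact matching. The choice of metric, the function name `pairwise_f1`, the span representation, and the entity type labels are all assumptions made for this example, not details taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical representation of one entity annotation:
# (start offset, end offset, entity type). The type labels are illustrative.
Span = Tuple[int, int, str]


def pairwise_f1(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """Exact-match F-measure between two annotators' span sets.

    Annotator A is treated as the reference. The result is symmetric,
    so swapping the two annotators gives the same value.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # both annotators marked nothing: full agreement
    matched = len(annotator_a & annotator_b)
    if matched == 0:
        return 0.0  # no span in common (also covers one empty set)
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Two annotators agree on three of four spans; they disagree on one boundary.
a = {(0, 2, "disease"), (5, 8, "symptom"), (10, 14, "treatment"), (20, 23, "test")}
b = {(0, 2, "disease"), (5, 8, "symptom"), (10, 13, "treatment"), (20, 23, "test")}
print(round(pairwise_f1(a, b), 3))  # 0.75
```

Exact matching counts a boundary disagreement as a complete miss. A relaxed variant that credits overlapping spans would give higher scores on the same data, so reported agreement values are only comparable when the matching criterion is the same.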

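Cite this article

Yang JF, Guan Y, He B, Qu CY, Yu QB, Liu YX, Zhao YJ. Corpus construction for named entities and entity relations on Chinese electronic medical records. Journal of Software (软件学报), 2016,27(11):2725-2746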

History
  • Received: 2014-12-03
  • Last revised: 2015-06-24
  • Published online: 2016-03-24