    Abstract:

    This paper proposes SVM+BiHMM, a hybrid statistical model for metadata extraction that combines an SVM (support vector machine) with a BiHMM (bigram hidden Markov model). The BiHMM extends the HMM with both the bigram sequential relation and the position information of words by distinguishing the beginning emission probability from the inner emission probability. Extraction proceeds in three steps. First, a rule-based extractor segments documents into line blocks. Second, the SVM classifier tags the blocks as metadata elements. Finally, the SVM+BiHMM model is built on the BiHMM model, with the emission probability adjusted by a sigmoid function of the SVM score and the transition probability trained by the bigram HMM. The SVM classifier benefits from the structural patterns of document line data, while the bigram HMM considers both the bigram sequential relation and the position information of words. The two components are therefore complementary, and SVM+BiHMM outperforms the HMM, BiHMM, and SVM methods in experiments on the same task.
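    The combination described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so every function name, the calibration constants `a` and `b`, the add-one smoothing, the uniform start distribution, and the normalization of sigmoid outputs into pseudo-emission probabilities are assumptions made for illustration. The sketch works at the level of line blocks and does not model the beginning/inner emission distinction of the BiHMM.

```python
import math

def sigmoid(score, a=1.0, b=0.0):
    """Map a raw SVM decision score to (0, 1). a and b are assumed constants."""
    z = max(-50.0, min(50.0, a * score + b))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_transitions(label_sequences, labels, smoothing=1.0):
    """Estimate P(next | prev) from label bigram counts (additive smoothing assumed)."""
    counts = {p: {n: smoothing for n in labels} for p in labels}
    for seq in label_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1.0
    trans = {}
    for p in labels:
        total = sum(counts[p].values())
        trans[p] = {n: counts[p][n] / total for n in labels}
    return trans

def emissions_from_svm(svm_scores, labels):
    """Turn per-block SVM scores into normalized pseudo-emission probabilities."""
    result = []
    for block in svm_scores:
        probs = {l: sigmoid(block[l]) for l in labels}
        total = sum(probs.values())
        result.append({l: probs[l] / total for l in labels})
    return result

def viterbi(emissions, trans, labels):
    """Return the most likely label sequence, computed in log space.

    A uniform start distribution is assumed, so it is omitted from the scores.
    """
    best = [{l: math.log(emissions[0][l]) for l in labels}]
    back = []
    for t in range(1, len(emissions)):
        row, ptr = {}, {}
        for l in labels:
            prev = max(labels, key=lambda p: best[-1][p] + math.log(trans[p][l]))
            row[l] = (best[-1][prev] + math.log(trans[prev][l])
                      + math.log(emissions[t][l]))
            ptr[l] = prev
        best.append(row)
        back.append(ptr)
    last = max(labels, key=lambda l: best[-1][l])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the last block
        path.append(ptr[path[-1]])
    return path[::-1]

if __name__ == "__main__":
    # Hypothetical toy data: three metadata labels and two labeled documents.
    labels = ["title", "author", "abstract"]
    training = [
        ["title", "author", "abstract", "abstract"],
        ["title", "author", "author", "abstract"],
    ]
    # Hypothetical SVM scores for three line blocks; the second block is ambiguous.
    scores = [
        {"title": 2.0, "author": -1.0, "abstract": -1.0},
        {"title": 0.2, "author": 0.1, "abstract": -2.0},
        {"title": -1.0, "author": -1.0, "abstract": 2.0},
    ]
    trans = train_transitions(training, labels)
    print(viterbi(emissions_from_svm(scores, labels), trans, labels))
```

    In the toy data, the SVM score alone slightly prefers "title" for the second block, while the learned transitions make "author" the more plausible successor of "title". This is the kind of complementary behavior the abstract attributes to the hybrid model.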

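Citation

Zhang Ming, Yin Ping, Deng Zhihong, Yang Dongqing. SVM+BiHMM: A hybrid model for metadata extraction based on statistical methods. Journal of Software, 2008, 19(2): 358-368.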

History
  • Received: March 28, 2006
  • Revised: June 7, 2007