Abstract: This paper proposes SVM+BiHMM, a hybrid statistical model for metadata extraction that combines an SVM (support vector machine) with a BiHMM (bigram hidden Markov model). The BiHMM extends the HMM with both the bigram sequential relation and the position information of words by distinguishing the beginning emission probability from the inner emission probability. First, a rule-based extractor segments each document into line-blocks. Second, an SVM classifier tags the blocks as metadata elements. Finally, the SVM+BiHMM model is built on the BiHMM model, with the emission probability adjusted by a sigmoid function of the SVM score and the transition probability trained by the bigram HMM. The SVM classifier benefits from the structural patterns of document line data, while the bigram HMM considers both the bigram sequential relation and the position information of words. The two components are therefore complementary, and SVM+BiHMM outperforms the HMM, BiHMM, and SVM methods in experiments on the same task.
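The final combination step can be illustrated with a minimal sketch, which is not the implementation evaluated in the paper. It assumes that each line-block already has one SVM decision score per metadata label, that the emission probability of a label for a block is taken directly as a sigmoid of that score, and that the transition and initial probabilities have been estimated beforehand. The function names, the sigmoid parameters `a` and `b`, the label set, and all numeric values are hypothetical, and the word-level distinction between beginning and inner emission probabilities is omitted.

```python
import math

def sigmoid(score, a=1.0, b=0.0):
    # Map an SVM decision score to (0, 1); a and b are assumed scaling parameters.
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def viterbi(svm_scores, trans, init):
    """Return the most likely label index sequence for a list of line-blocks.

    svm_scores[t][k]: SVM score of label k for block t (assumed given).
    trans[j][k]: transition probability from label j to label k.
    init[k]: probability that the first block has label k.
    """
    n_states = len(init)
    # Emission scores derived from the SVM scores (an assumption of this sketch).
    emit = [[sigmoid(s) for s in row] for row in svm_scores]
    delta = [math.log(init[k]) + math.log(emit[0][k]) for k in range(n_states)]
    back = []
    for t in range(1, len(emit)):
        new_delta, ptr = [], []
        for k in range(n_states):
            # Best predecessor label for label k at block t.
            best_j = max(range(n_states),
                         key=lambda j: delta[j] + math.log(trans[j][k]))
            ptr.append(best_j)
            new_delta.append(delta[best_j] + math.log(trans[best_j][k])
                             + math.log(emit[t][k]))
        delta = new_delta
        back.append(ptr)
    # Backtrack from the best final label.
    state = max(range(n_states), key=lambda k: delta[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical toy data: three line-blocks, three labels.
labels = ["title", "author", "abstract"]
svm_scores = [[2.0, -1.0, -2.0],
              [0.2, 0.1, -1.5],   # the SVM alone narrowly prefers "title" here
              [-2.0, -1.0, 2.5]]
trans = [[0.20, 0.70, 0.10],
         [0.05, 0.45, 0.50],
         [0.05, 0.05, 0.90]]
init = [0.8, 0.1, 0.1]

svm_only = [max(range(len(labels)), key=lambda k: row[k]) for row in svm_scores]
path = viterbi(svm_scores, trans, init)
print([labels[k] for k in svm_only])  # ['title', 'title', 'abstract']
print([labels[k] for k in path])      # ['title', 'author', 'abstract']
```

In this toy case the SVM scores alone label the second block as a title, whereas the transition probabilities shift the decoded sequence to title, author, abstract, which illustrates in miniature how the two sources of evidence can complement each other.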