Abstract:Based on the information theory, this paper presents a model based on Web format information quantity in blog information extraction. First, the vision information in blog Web page and the effective text information are combined to locate the main text which represents the theme of the blog Web page. Second, the format information of blog Web page is used to calculate the information quantity of each block and the minimal separating information quantity of separate position is used to detect the boundary of posts and comments in the main text. This model is language insensitive and can be used in a lot of blogs which are written in different natural languages. Experimental results show that this method achieves high precision in locating main text and separating the post and comment.