Extraction Model Based on Web Format Information Quantity in Blog Post and Comment Extraction

Volume 20, Issue 5, 2009, pp. 1282-1291


Abstract:

Drawing on information theory, this paper presents a blog information extraction model based on the information quantity of Web page format. First, the visual information of a blog Web page is combined with its effective text information to locate the main text, which represents the theme of the page. Second, the format information of the page is used to calculate the information quantity of each block, and the minimal separating information quantity over candidate split positions is used to detect the boundary between posts and comments in the main text. The model is language-insensitive and can be applied to blogs written in different natural languages. Experimental results show that the method achieves high precision both in locating the main text and in separating posts from comments.
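The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea of splitting a block sequence by information quantity, not the authors' model. It assumes each block in the main text has already been reduced to a hypothetical "format signature" string (for example a tag or tag path), measures the information quantity of a block as its self-information -log2 p within a segment, and picks the split position where the two resulting segments together cost the fewest bits. The function names and the cost definition are illustrative assumptions.

```python
import math
from collections import Counter


def information_quantity(signatures):
    """Self-information -log2 p(s) of each block's format signature,
    with p estimated from frequencies inside this segment (an assumption)."""
    counts = Counter(signatures)
    total = len(signatures)
    return [-math.log2(counts[s] / total) for s in signatures]


def segment_cost(signatures):
    """Total bits needed to describe one segment's format on its own."""
    return sum(information_quantity(signatures))


def find_boundary(signatures):
    """Return the split index k (1 <= k < n) that minimizes the combined
    cost of signatures[:k] (post side) and signatures[k:] (comment side)."""
    if len(signatures) < 2:
        raise ValueError("need at least two blocks to place a boundary")
    return min(
        range(1, len(signatures)),
        key=lambda k: segment_cost(signatures[:k]) + segment_cost(signatures[k:]),
    )


# Hypothetical format signatures for the blocks of one blog page.
blocks = ["h2", "p", "p", "p",
          "li.comment", "li.comment", "li.comment", "li.comment"]
boundary = find_boundary(blocks)
print("post blocks:", blocks[:boundary])
print("comment blocks:", blocks[boundary:])
```

On this toy input the split falls at index 4, because the comment side then consists of a single repeated format and costs zero bits. Since the sketch looks only at format signatures and never at the words themselves, it also illustrates why such an approach can be language-insensitive.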

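Citation:

Cao Donglin, Liao Xiangwen, Xu Hongbo, Bai Shuo. Extraction model based on Web format information quantity in blog post and comment extraction. Journal of Software (软件学报), 2009, 20(5): 1282-1291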

History

Received: July 3, 2007
Revised: February 27, 2008

Copyright: Institute of Software, Chinese Academy of Sciences