Automatic Data Extraction from Template-Generated Web Pages


Volume 19, Issue 2, 2008, pp. 209-223

Abstract:

A substantial fraction of the Web consists of pages that are dynamically generated from a common template populated with data from a database, such as product description pages on e-commerce sites. The goal of this research is to automatically detect the template behind such pages and extract the embedded data (e.g., product name and price). The template detection problem is formalized, and the underlying structure of template-generated pages is analyzed. A template detection approach is presented, and the detected templates are used to extract data from instance pages. Unlike many existing approaches, this one is applicable to both "list pages" and "detail pages". Experimental results on two large third-party test beds show that the approach achieves high extraction accuracy.
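To make the idea of separating template from data concrete, here is a minimal sketch. It is not the approach proposed in the paper, whose details the abstract does not give. It assumes two pages are already known to share a template and simply aligns their token sequences with Python's standard `difflib`: tokens common to both pages are treated as template, and each differing region is treated as one data slot. The function names `tokenize` and `extract_data` and the two sample pages are hypothetical, chosen only for illustration.

```python
import re
from difflib import SequenceMatcher

# A token is either a whole HTML tag or a whitespace-delimited word.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split a page into tag tokens and word tokens."""
    return TOKEN_RE.findall(html)


def extract_data(page_a, page_b):
    """Align two pages assumed to share a template.

    Tokens that match in both pages are treated as template.
    Each non-matching region is treated as one data slot and
    returned as a (value_in_a, value_in_b) pair.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    slots = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            slots.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return slots


# Hypothetical "detail pages" generated from the same template.
page_1 = "<html><h1>Red Kettle</h1><span class='price'>$19.99</span></html>"
page_2 = "<html><h1>Blue Toaster</h1><span class='price'>$34.50</span></html>"

slots = extract_data(page_1, page_2)
for value_1, value_2 in slots:
    print(value_1, "|", value_2)
```

A pairwise diff like this only illustrates the intuition. It cannot handle "list pages", where one page repeats a record structure a variable number of times, which is one of the cases the abstract says the proposed approach covers.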

Citation:

Yang Shaohua, Lin Hailue, Han Yanbo. Automatic data extraction from template-generated Web pages. Journal of Software, 2008, 19(2): 209-223

History

Received: September 7, 2007
Revised: November 29, 2007

Copyright: Institute of Software, Chinese Academy of Sciences