    Abstract:

    A substantial fraction of the Web consists of pages that are dynamically generated by populating a common template with data from databases, such as product description pages on e-commerce sites. The objective of this research is to automatically detect the template behind such pages and extract the embedded data (e.g., product name and price). The template detection problem is formalized, and the underlying structure of template-generated pages is analyzed. A template detection approach is presented, and the detected templates are used to extract data from instance pages. Compared with many existing approaches, it is applicable to both "list pages" and "detail pages". Experimental results on two large third-party test beds show that the approach achieves high extraction accuracy.
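
    To make the idea of a shared template concrete, here is a minimal sketch. It is not the approach presented in the paper; it assumes that tokens shared by two pages belong to the template and that the differing tokens are data. The function name `split_template_and_data`, the `#DATA#` placeholder, and the sample pages are hypothetical.

```python
from difflib import SequenceMatcher


def split_template_and_data(page_a, page_b):
    """Align two pages token by token (illustrative assumption, not the paper's method).

    Tokens that match in both pages are treated as template text;
    tokens that differ are treated as embedded data values.
    """
    a, b = page_a.split(), page_b.split()
    # autojunk=False keeps frequent tokens (e.g. repeated tags) in the alignment.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)

    template, data_a, data_b = [], [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            template.extend(a[i1:i2])
        else:
            # Any non-matching span becomes one data slot in the template.
            template.append("#DATA#")
            data_a.append(" ".join(a[i1:i2]))
            data_b.append(" ".join(b[j1:j2]))
    return template, data_a, data_b


# Two hypothetical "detail pages" generated from the same template.
PAGE_1 = "<html> <h1> Red Kettle </h1> <span> $19.99 </span> </html>"
PAGE_2 = "<html> <h1> Blue Teapot </h1> <span> $24.50 </span> </html>"

if __name__ == "__main__":
    template, data_1, data_2 = split_template_and_data(PAGE_1, PAGE_2)
    print(" ".join(template))
    print(data_1)
    print(data_2)
```

    A pairwise token alignment like this only illustrates the problem setting. It would not handle repeated records on "list pages" or optional fields, which is where a formal template model becomes necessary.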

Get Citation

Yang Shaohua, Lin Hailue, Han Yanbo. An automatic data extraction method for template-generated Web pages. Journal of Software, 2008,19(2):209-223

History
  • Received: September 07, 2007
  • Revised: November 29, 2007