[Keywords]
[Abstract]
In the Web 2.0 era, more and more websites adopt dynamic scripts for user interaction. Transitions between pages are no longer all based on “<a>” tags, and the URL is no longer the unique identifier of a Web page. Traditional Web crawlers cannot handle Web pages containing dynamic scripts; as a result, search engines such as Google give up on these pages. Research on crawling websites with dynamic scripts is still at an early stage. This paper proposes an efficient approach for crawling valid pages on websites with dynamic scripts. First, through training, the approach learns the events, and the Web elements on which they are triggered, that lead users to the desired Web pages. It then generates XPath patterns for these elements and records the events that need to be triggered. During crawling, only these event and element combinations are considered, which accelerates the crawl. Finally, extensive experimental evaluation demonstrates the efficiency and effectiveness of the approach.
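The crawl-time filtering step can be illustrated with a minimal sketch. It assumes the training phase has already produced a list of (XPath pattern, event) pairs; the patterns, the sample page, and the function name `candidate_actions` are hypothetical and are not taken from the paper. The sketch uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, so it only works on well-formed markup, and it only selects the combinations to trigger rather than firing events in a real browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of the training phase: (XPath pattern, event) pairs.
LEARNED_PATTERNS = [
    (".//div[@class='pager']/span", "click"),
    (".//li[@class='tab']", "mouseover"),
]


def candidate_actions(page_source, patterns):
    """Return (element id, event) pairs that match a learned pattern.

    Elements that match no pattern are skipped, so the crawler never
    spends time triggering events on them.
    """
    root = ET.fromstring(page_source)
    actions = []
    for xpath, event in patterns:
        for element in root.findall(xpath):
            actions.append((element.get("id"), event))
    return actions


# Hypothetical, well-formed sample page (illustration only).
PAGE = """
<html><body>
  <div class="pager"><span id="next">Next</span></div>
  <ul><li class="tab" id="t1">News</li><li class="other" id="t2">Ads</li></ul>
  <div class="banner"><span id="promo">Promo</span></div>
</body></html>
"""

actions = candidate_actions(PAGE, LEARNED_PATTERNS)
for element_id, event in actions:
    print(f"trigger {event} on element #{element_id}")
```

In this sketch the banner and the non-tab list item are ignored, which reflects the idea that restricting the crawl to learned event and element combinations reduces the number of events to try.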
[CLC number]
[Foundation item]
Supported by the National Natural Science Foundation of China under Grant No. 60873062; the National High-Tech Research and Development Plan of China (863 Program) under Grant Nos. 2009AA01Z150, 2007AA01Z191, 2006AA01Z230; the Peking Universi