    Abstract:

    WAN-based distributed Web crawling systems face three core issues: Web partition, agent collaboration, and agent deployment. Centering on these issues, this paper presents a comprehensive overview of the strategies currently adopted by the academic and business communities. The experiences, problems, and challenges encountered by WAN-based distributed Web crawlers are classified and discussed in depth. A summary of current evaluation indicators is also given. Finally, conclusions and suggestions for future research are put forward.

Get Citation

Xu X, Zhang WZ, Zhang HL, Fang BX. WAN-based distributed Web crawler. Journal of Software, 2010,21(5):1067-1082

History
  • Received: September 27, 2008
  • Revised: September 3, 2009