基于P2P的Web搜索技术
作者:
基金项目:

Supported by the National Natural Science Foundation of China under Grant Nos.60433040, 90412006, 90412011, 60573110, 60673152, 90612016 (国家自然科学基金); the National Basic Research Program of China under Grant Nos.2003CB317007, 2004CB318000 (国家重点基础研究发展计划(973)); the National High-Tech Research and Development Plan of China under Grant Nos.2006AA01A101, 2006AA01A108, 2006AA01A111 (国家高技术研究发展计划(863))

  • 摘要
  • | |
  • 访问统计
  • |
  • 参考文献 [70]
  • |
  • 相似文献 [20]
  • |
  • 引证文献
  • | |
  • 文章评论
    摘要:

    Web搜索引擎已经成为人们从海量Web信息中快速找到所需信息的重要工具,随着Web数据量的爆炸性增长,传统的集中式搜索引擎已经越来越不能满足人们不断增长的信息获取需求.随着对等网络(peer-to-peer,简称P2P)技术的快速发展,人们提出了基于P2P的Web搜索技术并迅速成为研究热点.研究的目的是对现有的基于P2P的Web搜索技术进行总结,以期为进一步研究指明方向.首先分析了基于P2P的Web搜索面临的诸多挑战;然后重点总结分析了基于P2P的Web搜索的各项关键技术的研究现状,包括系统拓扑结构、数据存放策略、查询路由机制、索引切分策略、数据集选择、相关性排序、网页收集方法等;最后对已有的3个较有特色的基于P2P的Web搜索原型系统进行了介绍.

    Abstract:

    Web search engine has become a very important tool for finding information efficiently from the massive Web data. With the explosive growth of the Web data, traditional centralized search engines become harder to catch up with the growing step of people's information needs. With the rapid development of peer-to-peer (P2P) technology, the notion of P2P Web search has been proposed and quickly becomes a research focus. The goal of this paper is to give a brief summary of current P2P Web search technologies in order to facilitate future research. First, some main challenges for P2P Web search are presented. Then, key techniques for building a feasible and efficient P2P Web search engine are reviewed, including system topology, data placement, query routing, index partitioning, collection selection, relevance ranking and Web crawling. Finally, three recently proposed novel P2P Web search prototypes are introduced.

    参考文献
    [1] Bergman MK. The deep Web: Surfacing hidden value. White paper. The Journal of Electronic Publishing. 2001. http://www.press. umich.edu/jep/07-01/bergman.html
    [2] Wang CG, Li B. Peer-to-Peer overlay networks: A survey. Technical Report, Department of Computer Science, Hong Kong University of Science and Technology, 2003.
    [3] Reynolds P, Vahdat A. Efficient peer-to-peer keyword searching. In: Endler M, Schmidt D, eds. Middleware 2003. LNCS 2672, Berlin: Springer-Verlag, 2003. 21-40.
    [4] Zhong M, Moore J, Shen K, Murphy AL. An evaluation and comparison of current peer-to-peer full-text keyword search techniques. In: Doan A, Neven F, McCann R, Bex GJ, eds. Proc. of the 8th Int’l Workshop on the Web and Databases (WebDB 2005). New York: ACM Press, 2005. 61-66.
    [5] Yang Y, Dunlap R, Rexroad M, Cooper BF. Performance of full text search in structured and unstructured peer-to-peer systems. In: Proc. of the 25th IEEE Int’l Conf. on Computer Communications (INFOCOM 2006). IEEE Computer Society, 2006. 1-12.
    [6] Yang B, Garcia-Molina H. Improving search in peer-to-peer networks. In: Rodrigues L, Raynal M, Chen W, eds. Proc. of the 22nd Int’l Conf. on Distributed Computing Systems (ICDCS 2002). Washington: IEEE Computer Society, 2002. 5-14.
    [7] Risson J, Moors T. Survey of research towards robust peer-to-peer networks: Search methods. Computer Networks, 2006, 50(17):3485-3521.
    [8] Cholvi V, Felber P, Biersack E. Efficient search in unstructured peer-to-peer networks. In: Gibbons PB, Adler M, eds. Proc. of the 16th Annual ACM Symp. on Parallelism in Algorithms and Architectures (SPAA 2004). New York: ACM Press, 2004. 271-272.
    [9] Joung YJ, Fang CT, Yang LW. Keyword search in DHT-based peer-to-peer networks. In: Proc. of the 25th IEEE Int’l Conf. on Distributed Computing Systems (ICDCS 2005). IEEE Computer Society, 2005. 339-348.
    [10] Joung YJ, Yang LW. Wildcard search in structured peer-to-peer networks. IEEE Trans. on Knowledge and Data Engineering, 2007,19(11):1524-1540.
    [11] Skobeltsyn G, Luu T, Zarko IP, Rajman M, Aberer K. Query-Driven indexing for peer-to-peer text retrieval. In: Williamson CL, Zurko ME, Patel-Schneider PF, Shenoy PJ, eds. Proc. of the 16th Int’l World Wide Web Conf. (WWW’75). New York: ACM Press, 2007. 1185-1186.
    [12] Zeinalipour-Yazti D, Kalogeraki V, Gunopulos D. pFusion: A P2P architecture for Internet-scale content-based search and retrieval. IEEE Trans. on Parallel and Distributed Systems, 2007,18(6):804-817.
    [13] Zhang RM, Hu YC. Assisted peer-to-peer search with partial indexing. IEEE Trans. on Parallel and Distributed Systems, 2007,18(8):1146-1158.
    [14] Li JY, Loo BT, Hellerstein JM, Kaashoek MF, Karger DR, Morris R. On the feasibility of peer-to-peer web indexing and search. In: Kaashoek F, Stoica I, eds. Proc. of the 2nd Int’l Workshop on Peer-to-Peer Systems (IPTPS 2003). London: Springer-Verlag, 2003. 207-215.
    [15] Daswani N, Garcia-Molina H, Yang B. Open problems in data-sharing peer-to-peer systems. In: Calvanese D, Lenzerini M, Motwani R, eds. Proc. of the 9th Int’l Conf. on Database Theory (ICDT 2003). Siena: Springer-Verlag, 2003. 1-15.
    [16] Huebsch R, Hellerstein JM, Lanham N, Loo BT, Shenker S. Querying the Internet with PIER. In: Freytag JC, Lockemann PC, Abiteboul S, Carey MJ, Selinger PG, Heuer A, eds. Proc. of the 29th Int’l Conf. on Very Large Databases (VLDB 2003). Berlin: Morgan Kaufmann Publishers, 2003. 321-332.
    [17] Ratnasamy S, Francis P, Handley M, Karp R, Shenker S. A scalable content-addressable network. ACM SIGCOMM Computer Communication Review, 2001,31(4):161-172.
    [18] Stoica I, Morris R, Karger D, Kaashoek MF, Balakrishnan H. Chord: A scalable peer-to-peer lookup service for Internet applications. ACM SIGCOMM Computer Communication Review, 2001,31(4):149-160.
    [19] Rowstron A, Druschel P. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In: Guerraoui R, ed. Proc. of the 18th IFIP/ACM Int’l Conf. on Distributed Systems Platforms (Middleware 2001). London: Springer-Verlag, 2001. 329-350.
    [20] Zhao BY, Kubiatowicz J, Joseph AD. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report, UCB/CSD-01-1141, Berkeley: UC Berkeley, 2001.
    [21] Harvey NJA, Jones MB, Saroiu S, Theimer M, Wolman A. SkipNet: A scalable overlay network with practical locality properties. In: Proc. of the 4th USENIX Symp. on Internet Technologies and Systems (USITS 2003). Berkeley: USENIX Association, 2003. 113-126. http://research.microsoft.com/users/alecw//usits-2003.pdf
    [22] Saia J, Fiat A, Gribble S, Karlin AR, Saroiu S. Dynamically fault-tolerant content addressable networks. In: Druschel P, Kaashoek F, Rowstron A, eds. Proc. of the 1st Int’l Workshop on Peer-to-Peer Systems (IPTPS 2002). London: Springer-Verlag, 2002. 270-279.
    [23] Doulkeridis C, Norvag K, Vazirgiannis M. The SOWES approach to P2P Web search using semantic overlays. In: Carr L, Roure D, Iyengar A, Goble CA, Dahlin M, eds. Proc. of the 15th Int’l World Wide Web Conf. (WWW 2006). New York: ACM Press, 2006. 1027-1028.
    [24] Zhou J, Li K, Tang L. Towards a fully distributed P2P Web search engine. In: Proc. of the 10th IEEE Int’l Workshop on Future Trends of Distributed Computing Systems (FTDCS 2004). Washington: IEEE Computer Society, 2004. 332-338.
    [25] Zhou J, Li K, Tang L. Coopeer: A peer-to-peer Web search engine towards collaboration, humanization and personalization. In: Hassanein H, Oliver RL, Richard III GG, Wilson LF, eds. Proc. of the 23rd IEEE Int’l Performance, Computing and Communications Conf. (IPCCC 2004). Phoenix: IEEE Computer Society, 2004. 313-314.
    [26] Suel T, Mathur C, Wu JW, Zhang JG, Delis A, Kharrazi M, Long XH, Shanmugasundaram K. ODISSEA: A peer-to-peer architecture for scalable Web search and information retrieval. In: Christophides V, Freire J, eds. Proc. of the 6th Int’l Workshop on the Web and Databases (WebDB 2003). New York: ACM Press, 2003. 67-72.
    [27] Wang Y, Galanis L, Dewitt DJ. GALANX: An efficient peer-to-peer search engine system. http://www.cs.wisc.edu/~yuanwang/ papers
    [28] Bender M, Michel S, Triantafillou P, Weikum G, Zimmer C. MINERVA: Collaborative P2P search. In: B?hm K, Jensen CS, Haas LM, Kersten ML, Larson PA, Ooi BC, eds. Proc. of the 31st Int’l Conf. on Very Large Data Bases (VLDB 2005). Trondheim: VLDB Endowment, 2005. 1263-1266.
    [29] Michel S, Triantafillou P, Weikum G. MINERVA: A scalable efficient peer-to-peer search engine. In: Alonso G, ed. Proc. of the Middleware 2005. Berlin: Springer-Verlag, 2005. 60-81.
    [30] Sripanidkulchai K, Maggs B, Zhang H. Efficient content location using interest-based locality in peer-to-peer systems. In: Proc. of the IEEE INFOCOM 2003 Conf. San Francisco: IEEE Computer Society, 2003. 2166-2176.
    [31] Cohen E, Fiat A, Kaplan H. Associative search in peer-to-peer networks: harnessing latent semantics. In: Proc. of the IEEE INFOCOM 2003 Conf. San Francisco: IEEE Computer Society, 2003. 1261-1271. http://www.ieee-infocom.org/2003/papers/ 31_02.PDF
    [32] Rao A, Lakshminarayanan K, Surana S, Karp R, Stoica I. Load balancing in structured P2P systems. In: Kaashoek F, Stoica I, eds. Proc. of the 2nd Int’l Workshop on Peer-to-Peer Systems (IPTPS 2003). London: Springer-Verlag, 2003. 68-79.
    [33] Byers J, Considine J, Mitzenmacher M. Simple load balancing for distributed hash tables. In: Kaashoek F, Stoica I, eds. Proc. of the 2nd Int’l Workshop on Peer-to-Peer Systems (IPTPS 2003). London: Springer-Verlag, 2003. 80-88.
    [34] Karger DR, Ruhl M. Simple efficient load balancing algorithms for peer-to-peer systems. Theory of Computing Systems, 2006,39(6):787-804.
    [35] Godfrey B, Lakshminarayanan K, Surana S, Karp R, Stoica I. Load balancing in dynamic structured P2P systems. In: Proc. of the IEEE INFOCOM 2004 Conf. Hong Kong: IEEE Computer Society, 2004. 2253-2262.
    [36] Tsoumakos D, Roussopoulos N. A comparison of peer-to-peer search methods. In: Christophides V, Freire J, eds. Proc. of the 6th Int’l Workshop on the Web and Databases (WebDB 2003). New York: ACM Press, 2003. 61-66.
    [37] Kalogeraki V, Gunopulos D, Zeinalipour-Yazti D. A local search mechanism for peer-to-peer networks. In: Proc. of the 11th Int’l Conf. on Information and Knowledge Management (CIKM 2002). New York: ACM Press, 2002. 300-307.
    [38] Lv Q, Cao P, Cohen E, Li K, Shenker S. Search and replication in unstructured peer-to-peer networks. In: Proc. of the 16th Int’l Conf. on Supercomputing (ICS 2002). New York: ACM Press, 2002. 84-95.
    [39] Crespo A, Garcia-Molina H. Routing indices for peer-to-peer systems. In: Rodrigues L, Raynal M, Chen W, eds. Proc. of the 22nd Int’l Conf. on Distributed Computing Systems (ICDCS 2002). Vienna: IEEE Computer Society, 2002. 23-32.
    [40] Crespo A, Garcia-Molina H. Semantic overlay networks for P2P systems. In: Moro G, Bergamaschi S, Aberer K, eds. Proc. of the 3rd Int’l Workshop on Agents and Peer-to-Peer Computing (AP2PC 2004). Heidelberg: Springer-Verlag, 2005. 1-13.
    [41] Ratnasamy S, Stoica I, Shenker S. Routing algorithms for DHTs: Some open questions. In: Druschel P, Kaashoek F, Rowstron A, eds. Proc. of the 1st Int’l Workshop on Peer-to-Peer Systems (IPTPS 2002). London: Springer-Verlag, 2002. 45-52.
    [42] Harren M, Hellerstein JM, Huebsch R, Loo BT, Shenker S, Stoica I. Complex queries in DHT-based peer-to-peer networks. In: Druschel P, Kaashoek F, Rowstron A, eds. Proc. of the 1st Int’l Workshop on Peer-to-Peer Systems (IPTPS 2002). London: Springer-Verlag, 2002. 242-250.
    [43] Manber U, Myers G. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 1993,22(5):935-948.
    [44] Badue C, Ribeiro-Neto B, Baeza-Yates R, Ziviani N. Distributed query processing using partitioned inverted files. In: Proc. of the 8th Int’l Symp. on String Processing and Information Retrieval (SPIRE). Laguna de San Rafael: IEEE Computer Society, 2001. 10-20. http://homepages.dcc.ufmg.br/~berthier/conference_papers/spire_2001.pdf
    [45] Shi SM, Yang GW, Wang DX, Yu J, Qu SG, Chen M. Making peer-to-peer keyword searching feasible using multi-level partitioning. In: Voelker GM, Shenker S, eds. Proc. of the 3rd Int’l Workshop on Peer-to-Peer Systems (IPTPS 2004). Berlin: Springer-Verlag, 2004. 151-161.
    [46] Callan J. Distributed information retrieval. In: Croft WB, ed. Advances in information retrieval. Dordrecht: Kluwer Academic Publishers, 2000. 127-150.
    [47] Fuhr N. A decision-theoretic approach to database selection in networked IR. ACM Trans. on Information Systems, 1999,17(3): 229-249.
    [48] Gravano L, Garcia-Molina H, Tomasic A. Gloss: Text-Source discovery over the Internet. ACM Trans. on Database Systems, 1999,24(2):229-264.
    [49] Si L, Jin R, Callan J, Ogilvie P. A language modeling framework for resource selection and results merging. In: Proc. of the 11th Int’l Conf. on Information and Knowledge Management (CIKM 2002). New York: ACM Press, 2002. 391-397.
    [50] Byers J, Considine J, Mitzenmacher M, Rost S. Informed content delivery across adaptive overlay networks. ACM SIGCOMM Computer Communication Review, 2002,32(4):47-60.
    [51] Florescu D, Koller D, Levy A. Using probabilistic information in data integration. In: Jarke M, Carey MJ, Dittrich KR, Lochovsky FH, Loucopoulos P, Jeusfeld MA, eds. Proc. of the 23rd Int’l Conf. on Very Large Data Bases (VLDB’97). San Francisco: Morgan Kaufmann Publishers, 1997. 216-225.
    [52] Zhang Y, Callan J, Minka T. Novelty and redundancy detection in adaptive filtering. In: Proc. of the 25th annual Int’l ACM SIGIR Conf. (SIGIR 2002). New York: ACM Press, 2002. 81-88. http://www.cs.cmu.edu/~callan/Papers/sigir02-yiz.pdf
    [53] Nie ZQ, Kambhampati S, Nambiar U. Effectively mining and using coverage and overlap statistics in data integration. IEEE Trans. on Knowledge and Data Engineering, 2005,17(5):638-651.
    [54] Hernandez T, Kambhampati S. Improving text collection selection with coverage and overlap statistics. In: Ellis A, Hagino T, eds. Proc. of the 14th Int’l World Wide Web Conference (WWW 2005). New York: ACM Press, 2005. 1128-1129.
    [55] Bender M, Michel S, Triantafillou P, Weikum G, Zimmer C. Improving collection selection with overlap awareness in P2P search engines. In: Baeza-Yates RA, Ziviani N, Marchionini G, Moffat A, Tait J, eds. Proc. of the 28th Annual Int’l ACM SIGIR Conf. (SIGIR 2005). New York: ACM Press, 2005. 67-74.
    [56] Bender M, Michel S, Triantafillou P, Weikum G. Global document frequency estimation in peer-to-peer web search. In: Proc. of the 9th Int’l Workshop on the Web and Database (WebDB 2006). New York: ACM Press, 2006. 69-74.
    [57] Tang CQ, Xu ZC, Dwarkadas S. Peer-to-Peer information retrieval using self-organizing semantic overlay networks. In: Proc. of ACM SIGCOMM 2003 Conf. New York: ACM Press, 2003. 175-186.
    [58] Bhattacharya I, Kashyap SR, Parthasarathy S. Similarity searching in peer-to-peer databases. In: Proc. of the 25th IEEE Int’l Conf. on Distributed Computing Systems (ICDCS 2005). Washington: IEEE Computer Society, 2005. 329-338.
    [59] Cuenca-Acuna FM, Peery C, Martin RP, Nguyen TD. PlanetP: Using gossiping to build content addressable peer-to-peer information sharing communities. In: Proc. of the 12th IEEE Int’l Symp. on High Performance Distributed Computing (HPDC 2003). Washington: IEEE Computer Society, 2003. 236-246.
    [60] Bawa M, Gionis A, Garcia-Molina H, Gionis A, Motwani R. Estimating aggregates on a peer-to-peer network. Technical Report, 2003-24, Department of Computer Science, Stanford University, 2003.
    [61] Gopalakrishnan V, Morselli R, Bhattacharjee B, Keleher P, Srinivasan A. Ranking search results in P2P systems. Technical Report, CR-TR-4779, Department of Computer Science, University of Maryland, 2006.
    [62] Page L, Brin S, Motwani R, Winograd T. The pagerank citation ranking: Bringing order to the Web. Technical Report, Stanford University, 1998. http://dbpubs.stanford.edu/pub/1999-66
    [63] Kleinberg JM. Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999,46(5):604-632.
    [64] Sankaralingam K, Sethumadhavan S, Browne JC. Distributed pagerank for P2P systems. In: Proc. of the 12th IEEE Int’l Symp. on High Performance Distributed Computing (HPDC 2003). Washington: IEEE Computer Society, 2003. 58-68.
    [65] Shi SM, Yu J, Yang GW, Wang DX. Distributed page ranking in structured P2P networks. In: Proc. of the 32nd Int’l Conf. on Parallel Processing (ICPP 2003). 2003. Kaohsiung: IEEE Computer Society, 2003. 179-186.
    [66] Parreira JX, Donato D, Michel S, Weikum G. Efficient and decentralized pagerank approximation in a peer-to-peer Web search network. In: Proc. of the 32nd Int’l Conf. on Very Large Data Bases (VLDB 2006). Seoul: VLDB Endowment, 2006. 415-426.
    [67] Takahashi T, Soonsang H, Taura K, Yonezawa A. World Wide Web crawler. In: Poster Proc. of the 11th Int’l World Wide Web Conf. (WWW 2002). 2002. http://www2002.org/CDROM/poster/182/index.html
    [68] Loo BT, Cooper O, Krishnamurthy S. Distributed web crawling over DHTs. Technical Report, UCB/CSD-04-1305, Berkeley: UC Berkeley, 2004.
    [69] Singh A, Srivatsa M, Liu L, Miller T. Apoidea: A decentralized peer-to-peer architecture for crawling the World Wide Web. In: Proc. of the 26th Annual Int’l ACM SIGIR Conf. (SIGIR 2003). Berlin: Springer-Verlag, 2003. 126-142.
    [70] Bender M, Michel S, Weikum G. P2P directories for distributed Web search: From each according to his ability, to each according to his needs. In: Proc. of the 22nd Int’l Conf. on Data Engineering Workshops (ICDEW 2006). 2006. 51-60.
    网友评论
    网友评论
    分享到微博
    发 布
引用本文

方启明,杨广文,武永卫,郑纬民.基于P2P的Web搜索技术.软件学报,2008,19(10):2706-2719

复制
分享
文章指标
  • 点击次数:12311
  • 下载次数: 14472
  • HTML阅读次数: 0
  • 引用次数: 0
历史
  • 收稿日期:2007-07-30
  • 最后修改日期:2008-02-25
文章二维码
您是第20046329位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号