
CN 51-1686/N


Research on Collaborating Strategy among the Multi-agent Focused Crawlers

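  • Abstract (translated from the Chinese): When multiple focused Web crawlers crawl in parallel, how to avoid revisiting the same pages while efficiently retrieving topic-relevant pages has become one of the active research topics in focused crawling for search engines. To complete the system's crawling task while making full use of each crawler's own capabilities, this paper builds on the idea that the crawlers crawl relatively independently, cooperate with one another, and compete with one another. It treats the pages each crawler has visited as that crawler's background knowledge, analyzes the text of those pages, extracts the "concepts" in the pages and the semantic relationships among those concepts, and examines the semantic similarity between the background knowledge of different crawlers. On this basis it proposes a method for mutual understanding among crawlers, together with cooperation and competition strategies, based on a hierarchical concept context graph. The strategy has four parts: (1) a hierarchical concept context graph model for representing a focused crawler's background knowledge; (2) a method for semantic understanding among crawlers based on the hierarchical concept context graph; (3) the cooperation and competition mechanism, and its implementation, among multiple crawlers in the same group under the semantic understanding model; and (4) the cooperation and competition mechanism, and its implementation, among multiple crawlers in different groups under the semantic understanding model.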


    Abstract: In a focused crawling system, multiple crawlers crawl the Web in parallel and download Web pages. How different focused crawlers can avoid visiting the same URLs while efficiently downloading Web pages related to the search topic is one of the active research topics for search engines. In order to accomplish the system's crawling tasks for a specific topic rapidly and to make full use of every Web crawler's ability, the author considers that the Web pages (URLs) previously visited by each focused crawler reflect its background knowledge. On the basis of the system's Web crawlers crawling independently, collaborating with one another, and competing with each other, the paper proposes a novel understanding, cooperation and competition strategy based on the hierarchical concept context graph. It includes four aspects: (1) constructing a mathematical model of each Web crawler's background knowledge based on the hierarchical concept context graph, according to the semantic characteristics of Web pages, namely their concepts and the semantic relationships among those concepts; (2) studying the understanding method and model among Web crawlers based on the hierarchical concept context graph; (3) studying and implementing the cooperation and competition model among Web crawlers of the same group, managed by an F-Agent; and (4) studying and implementing the cooperation and competition model among Web crawlers of different groups, managed by F-Agents.
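
    The abstract does not give the graph model or the similarity measure, so the following is only a minimal sketch of the general idea, written under stated assumptions: a crawler's background knowledge is stored as concept counts per graph layer, two layers are compared by weighted Jaccard overlap, and shallower layers count more through a geometric decay. The names build_layers, layer_similarity, background_similarity and the decay parameter are hypothetical and do not come from the paper.

```python
from collections import Counter


def build_layers(pages):
    """Group concepts by layer. `pages` is a list of (layer, [concepts]).

    Assumption: the layer index stands in for a level of the hierarchical
    concept context graph; the paper's real model may differ.
    """
    layers = {}
    for layer, concepts in pages:
        layers.setdefault(layer, Counter()).update(concepts)
    return layers


def layer_similarity(a, b):
    """Weighted Jaccard overlap between two concept Counters (assumed measure)."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    shared = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total


def background_similarity(g1, g2, decay=0.5):
    """Combine per-layer overlaps, weighting layer d by decay ** d (assumed)."""
    layers = set(g1) | set(g2)
    weights = {d: decay ** d for d in layers}
    norm = sum(weights.values())
    if norm == 0:
        return 0.0
    score = sum(
        weights[d] * layer_similarity(g1.get(d, Counter()), g2.get(d, Counter()))
        for d in layers
    )
    return score / norm


# Two hypothetical crawlers that share top-layer concepts but diverge below.
crawler_a = build_layers([(0, ["crawler", "topic"]), (1, ["url", "page"])])
crawler_b = build_layers([(0, ["crawler", "topic"]), (1, ["recipe", "cooking"])])
similarity = background_similarity(crawler_a, crawler_b)
```

    Under these assumptions, a coordinator could use such a score to decide which crawlers hold overlapping background knowledge and should divide URLs between them, and which are distinct enough to crawl without coordination.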


