Analysis of Research on Intelligent Web Crawlers

Abstract: Web crawlers are a core component of modern search engines and an active research topic in information retrieval. This paper surveys the history and current state of research on web crawlers with a degree of intelligence, covering two areas: first, the application of traditional artificial intelligence methods, such as neural networks, genetic algorithms, and ant colony optimization, to web crawling, together with the focused (topic-specific) crawlers developed from these methods; second, agent technology for coordinating the crawlers in multi-crawler systems. On this basis, a basic approach to web crawling based on a semantic concept context graph is proposed.