www.sqnote.cn

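I have recently been looking into search-related topics and ended up using Heritrix. While searching online I found that Apache also has a project called Nutch (it does not seem very active).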

Heritrix: http://crawler.archive.org/

Nutch: http://lucene.apache.org/nutch/ (2 April 2007: Nutch 0.9 released)

----------------------------------------------------------------------------------------------

Installing Heritrix (environment: heritrix-1.14.2, Windows Vista, JDK 1.5.0)

1: Download heritrix-1.14.2.zip and unpack it.

2: Define the environment variable HERITRIX_HOME, pointing to the unpacked directory.

3: Copy

%HERITRIX_HOME%/conf/jmxremote.password.template -->

%HERITRIX_HOME%/jmxremote.password

Then edit the entries at the end of the copied file:
monitorRole @PASSWORD@  ==> monitorRole admin
controlRole @PASSWORD@  ==> controlRole admin
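
Step 3 can also be scripted. The following is only a minimal sketch in Python, assuming the template contains the literal placeholder @PASSWORD@ as shown above; the function name prepare_jmx_password is made up for illustration and is not part of Heritrix.

```python
import shutil
from pathlib import Path


def prepare_jmx_password(heritrix_home, password="admin"):
    """Copy conf/jmxremote.password.template to jmxremote.password
    and replace every @PASSWORD@ placeholder with the given password.

    Hypothetical helper for illustration; not shipped with Heritrix.
    """
    home = Path(heritrix_home)
    template = home / "conf" / "jmxremote.password.template"
    target = home / "jmxremote.password"

    # Copy the template into the Heritrix home directory (step 3).
    shutil.copyfile(template, target)

    # Fill in the placeholder on the monitorRole / controlRole lines.
    text = target.read_text(encoding="utf-8")
    target.write_text(text.replace("@PASSWORD@", password), encoding="utf-8")
    return target
```

It could be called as prepare_jmx_password(os.environ["HERITRIX_HOME"]) after importing os, which mirrors the manual copy-and-edit described above.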

4: In cmd, run  %HERITRIX_HOME%/bin/heritrix.cmd --admin=admin:admin

5: Open 127.0.0.1:8080 in a browser to access the web console.
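
If the page does not load, it can help to check whether anything is listening on the port at all. A minimal Python sketch, assuming the default address 127.0.0.1:8080 from step 5; the function name web_ui_reachable is hypothetical, and the check only tests the TCP port, not that the listener is actually Heritrix.

```python
import socket


def web_ui_reachable(host="127.0.0.1", port=8080, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only verifies that something accepts
    connections on the port, not that it is the Heritrix console.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Calling web_ui_reachable() shortly after step 4 gives a quick yes/no answer before opening the browser.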

Heritrix architecture diagram:

...

Tags: heritrix nutch

Posted by SQ on 2009-2-3 22:54 in Web
