Using Spiderman to crawl news from Baidu search results

Posted by XiaoXinMa on 2013/12/13 10:06

@自风 Hello, I would like to ask you a question:

I want to use Spiderman to crawl the content of all the pages returned by a Baidu news search. The XML config and the debug output are pasted below.


<?xml version="1.0" encoding="UTF-8"?>
<!--
  | Spiderman: an open-source vertical web crawler in Java
  | Project home: https://gitcafe.com/laiweiwei/Spiderman
  | author: l.weiwei@163.com
  | blog: http://laiweiweihi.iteye.com, http://my.oschina.net/laiweiwei
  | qq: 493781187
  | email: l.weiwei@163.com
  | create: 2013-01-08 16:12
  | update: 2013-04-10 18:06
-->
<beans>
	<!--
	  | name: site name
	  | url: seed URL
	  | skipStatusCode: HTTP status codes to ignore, comma-separated
	  | userAgent: crawler User-Agent string
	  | includeHttps: 0|1 whether to crawl https pages
	  | isDupRemovalStrict: 0|1 whether to strictly deduplicate TargetUrls, i.e. a TargetUrl visited once is never visited again; if not, duplicate TargetUrls are still visited as long as their source URLs differ
	  | isFollowRedirects: 0|1 whether to recursively follow the Location of 30X responses
	  | reqDelay: {n}s|{n}m|{n}h|n delay before each request
	  | enable: 0|1 whether crawling of this site is enabled
	  | charset: site character set
	  | schedule: scheduling interval; how often to re-crawl from the seed URL
	  | thread: number of threads allocated to this site's crawler
	  | waitQueue: how long the crawler waits for new tasks when the queue is empty
	  | timeout: HTTP request timeout
	-->
	<site name="oschina" enable="1" includeHttps="1" url="http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&amp;pn=00&amp;cl=2&amp;ct=1&amp;tn=news&amp;rn=20&amp;ie=utf-8&amp;bt=0&amp;et=0&amp;rsv_page=1" reqDelay="1s" charset="utf-8" schedule="1h" thread="2" waitQueue="10s">
		<!--
		  | Configure multiple seed links
		  | name: seed name
		  | url: seed URL
		-->
		<!--seeds>
			<seed name="" url="" />
		</seeds-->
		<!--
		  | Only crawl links whose host is in the list below; mostly used to handle second-level or deeper subdomains
		-->
		<!--validHosts>
			<validHost value="www.baidu.com" />
			<validHost value="www.softxy.com" />
			<validHost value="baike.baidu.com" />
		</validHosts-->
		
		<!--
		  | HTTP Header
		<headers>
			<header name="" value="" />
		</headers>-->
		<!--
		  | HTTP Cookie
		<cookies>
			<cookie name="" value="" host="" path="" />
		</cookies>-->
		<!--
		  | Rules deciding which URLs enter the task queue
		  | policy: strategy when multiple rules exist, and | or
		-->
		<queueRules policy="and">
			<!--
			  | Rule
			  | type: rule type, one of regex | equal | start | end | contains; prefix any rule with "!" to negate it
			  | value: the rule value
			-->
			<rule type="!regex" value="^.*\.(jpg|png|gif)$" />
		</queueRules>
		<!--
		  | Crawl targets
		-->
		<targets>
			<!--
			  | Restrict where the target URLs come from; typically this is a channel page of the site, e.g. the news list page of some category
			-->
			<sourceRules policy="and">
				<rule type="regex" value="http://news\.baidu\.com/ns\?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&amp;pn=00&amp;cl=2&amp;ct=1&amp;tn=news&amp;rn=\d+&amp;ie=utf-8&amp;bt=0&amp;et=0&amp;rsv_page=1">
					<!--
					  | Defines how to dig new URLs out of the source page
					  | This node has the same structure as the <model> node; it is simply named digUrls instead of model
					-->
					<digUrls>
						<field name="page_url" isArray="1">
							<parsers>
							
								<parser xpath="//body[1]//div[4]/p[1]/a[@href]" attribute="href" />
								<parser exp="'http://news.baidu.com'+$this" />
							</parsers>
						</field>
						<field name="target_url" isArray="1"> 
							<parsers>
									<parser xpath="//h3[@class='c-title']//a[@href]" attribute="href" />
							</parsers>
						</field>
					</digUrls>
				</rule>
			</sourceRules>
			<!--
			  | name: target name
			-->
			<target name="question">
				<!--
				  | Rules for target URLs
				-->
				<urlRules policy="and">
					<rule type="regex" value=".*" />
				</urlRules>
				<!--
				  | Data model of the target page
				  | cType: contentType of the target page
				  | isForceUseXmlParser: 0|1 whether to force the XML parser for the target page; this lets HTML pages support XPath 2.0
				  | isIgnoreComments: 0|1 whether to ignore comments
				  | isArray: 0|1 whether the target page contains multiple data models; typical for RSS/XML pages, where several Model objects must be parsed from one page
				  | xpath: used together with isArray; optional
				-->
				<model>
					<!--
					  | Namespace configuration for the target page, usually for XML pages
					  | prefix: namespace prefix
					  | uri: associated URI
					<namespaces>
						<namespace prefix="" uri="" />
					</namespaces>
					-->
					<!--
					  | Field configuration
					  | name: field name
					  | isArray: 0|1 whether the field holds multiple values
					  | isMergeArray: 0|1 whether to merge the multiple values into one; used with isArray
					  | isParam: 0|1 whether the field is exposed as a parameter to other field nodes; if so, its lifecycle does not last to the end
					  | isFinal: 0|1 whether the parameter is immutable; used with isParam; if so, it is never changed after the first assignment
					  | isAlsoParseInNextPage: 0|1 whether to keep parsing the field on subsequent pages, for paginated target pages
					  | isTrim: 0|1 whether to trim leading and trailing whitespace
					  | isForDigNewUrl: 0|1 whether to put the returned value into the task queue as a new URL
					-->
					
					<field name="content">
						<parsers>
							<parser xpath="//body" exp="$output($this)" />
							
							<!--attribute blacklist-->
							
							<!--  <parser xpath="//a[@href]" attribute="href" />
							<parser exp="$output($this)" />-->
							
							<!--tag blacklist: these tags and their nested content are removed-->
							<parser exp="$Tags.xml($this).rm('map').rm('iframe').rm('object').empty().ok()" />
							<!--tag whitelist: tags to keep; all other tags are removed (without removing the content nested inside them)-->
							<parser exp="$Tags.xml($this).kp('br').kp('h1').kp('h2').kp('h3').kp('h4').kp('h5').kp('h6').kp('table').kp('th').kp('tr').kp('td').kp('img').kp('p').kp('a').kp('ul').kp('ol').kp('li').kp('td').kp('em').kp('i').kp('u').kp('er').kp('b').kp('strong').ok()" />
							<!--other-->
						</parsers>
					</field>
				</model>
			</target>
		</targets>
		<!--
		  | Plugins
		-->
		<plugins>
			<!--
			  | enable: whether the plugin is enabled
			  | name: plugin name
			  | version: plugin version
			  | desc: plugin description
			-->
			<plugin enable="1" name="spider_plugin" version="0.0.1" desc="The official default plugin, which implements every extension point.">
				<!--
				  | Each plugin contains implementations of one or more extension points
				-->
				<extensions>
					<!--
					  | point: extension point name, one of task_poll, begin, fetch, dig, dup_removal, task_sort, task_push, target, parse, pojo, end
					-->
					<extension point="task_poll">
						<!--
						  | Extension point implementation class
						  | type: how to obtain the implementation class; by default the given class name is instantiated via its no-arg constructor; set to "ioc" to fetch it from the EWeb4J IOC container
						  | value: the IOC bean_id when type=ioc, otherwise the fully qualified class name
						  | sort: ordering; when an extension point has multiple implementation classes they are executed as a chain of responsibility, so their execution order matters
						-->
						<impl type="" value="org.eweb4j.spiderman.plugin.impl.TaskPollPointImpl" sort="0"/>
					</extension>
					<extension point="begin">
						<impl type="" value="org.eweb4j.spiderman.plugin.impl.BeginPointImpl" sort="0"/>
					</extension>
					<extension point="fetch">
						<impl type="" value="org.eweb4j.spiderman.plugin.impl.FetchPointImpl" sort="0"/>
					</extension>
					<extension point="dig">
						<impl type="" value="org.eweb4j.spiderman.plugin.impl.DigPointImpl" sort="0"/>
					</extension>
					<extension point="dup_removal">
						<impl type="" value="org.eweb4j.spiderman.plugin.impl.DupRemovalPointImpl" sort="0"/>
					</extension>
					<extension point="task_sort">
						<impl type="" value="org.eweb4j.spiderman.plugin.impl.TaskSortPointImpl" sort="0"/>
					</extension>
					<extension point="task_push">
						<impl type="" value="org.eweb4j.spiderman.plugin.impl.TaskPushPointImpl" sort="0"/>
					</extension>
					<extension point="target">
						<impl type="" value="org.eweb4j.spiderman.plugin.impl.TargetPointImpl" sort="0"/>
					</extension>
					<extension point="parse">
						<impl type="" value="org.eweb4j.spiderman.plugin.impl.ParsePointImpl" sort="0"/>
					</extension>
					<extension point="end">
						<impl type="" value="org.eweb4j.spiderman.plugin.impl.EndPointImpl" sort="0"/>
					</extension>
				</extensions>
				<providers>
					<provider>
						<orgnization name="CFuture" website="http://lurencun.com" desc="Color your future">
							<author name="weiwei" website="http://laiweiweihi.iteye.com | http://my.oschina.net/laiweiwei" email="l.weiwei@163.com" weibo="http://weibo.com/weiweimiss" desc="An IT guy who loves freedom, music, and painting" />
						</orgnization>
					</provider>
				</providers>
			</plugin>
		</plugins>
	</site>
</beans>
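Judging from the `host is not the same as site.host` messages in the debug output that follows, the dug `target_url` links appear to be discarded because their hosts differ from the seed's host. One direction worth trying, sketched here only and not verified against Spiderman's behavior, is to uncomment the `<validHosts>` section inside `<site>` and whitelist the external news hosts (the host values below are copied from the debug log):

```xml
<!-- Hypothetical sketch: whitelist the hosts of the dug target URLs so the
     crawler does not drop them for having a different host than the seed.
     Whether validHosts governs this particular check needs to be confirmed
     against Spiderman's documentation. -->
<validHosts>
	<validHost value="news.baidu.com" />
	<validHost value="news.163.com" />
	<validHost value="yuqing.people.com.cn" />
	<validHost value="www.cpnn.com.cn" />
</validHosts>
```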

The debug output shows:

[SPIDERMAN] 10:01:26 [INFO] ~ init thread pool size->1 success 
[SPIDERMAN] 10:01:26 [INFO] ~ site thread size -> 2
[SPIDERMAN] 10:01:26 [INFO] ~ spider tasks of site[oschina] start... 
2013-12-13 10:01:27 org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Cookie rejected: "[version: 0][name: BDRCVFR[C0p6oIjvx-c]][value: mk3SLVN4HKm][domain: www.baidu.com][path: /][expiry: null]". Illegal domain attribute "www.baidu.com". Domain of origin: "news.baidu.com"
[SPIDERMAN] 10:01:28 [DIG] ~ field->page_url, 10, [http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=20&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0, http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=40&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0, http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=60&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0, http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=80&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0, http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=100&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0, http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=120&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0, http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=140&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0, http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=160&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0, http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=180&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0, http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=20&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1]
	 from -> http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [DIG] ~ field->target_url, 20, [http://news.163.com/13/1212/08/9FSNIUOP00014JB6.html, http://www.pcpop.com/doc/0/970/970436.shtml, http://it.chinabyte.com/158/12802158.shtml, http://it.cri.cn/615213/973957926044b.shtml, http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131212_638701.htm, http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131211_638218.htm, http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131211_638216.htm, http://www.cpnn.com.cn/zdcmyqjc/ttrd/201312/t20131210_637894.htm, http://yuqing.people.com.cn/n/2013/1210/c210118-23794149.html, http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131210_637887.htm, http://nm.people.com.cn/n/2013/1209/c196689-20103060.html, http://yuqing.people.com.cn/n/2013/1211/c210118-23809085.html, http://yuqing.hexun.com/2013-12-11/160494319.html, http://yuqing.hexun.com/2013-12-10/160459916.html, http://www.cqn.com.cn/news/zjpd/dfdt/813558.html, http://epaper.oeeee.com/J/html/2013-12/12/content_1988983.htm, http://www.farmer.com.cn/xwpd/jsbd/201312/t20131209_921047.htm, http://yuqing.people.com.cn/n/2013/1211/c212785-23814825.html, http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131209_637275.htm, http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131209_637286.htm]
	 from -> http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://news.163.com/13/1212/08/9FSNIUOP00014JB6.html's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://www.pcpop.com/doc/0/970/970436.shtml's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://it.chinabyte.com/158/12802158.shtml's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://it.cri.cn/615213/973957926044b.shtml's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131212_638701.htm's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131211_638218.htm's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131211_638216.htm's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://www.cpnn.com.cn/zdcmyqjc/ttrd/201312/t20131210_637894.htm's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://yuqing.people.com.cn/n/2013/1210/c210118-23794149.html's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131210_637887.htm's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://nm.people.com.cn/n/2013/1209/c196689-20103060.html's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://yuqing.people.com.cn/n/2013/1211/c210118-23809085.html's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://yuqing.hexun.com/2013-12-11/160494319.html's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://yuqing.hexun.com/2013-12-10/160459916.html's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://www.cqn.com.cn/news/zjpd/dfdt/813558.html's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://epaper.oeeee.com/J/html/2013-12/12/content_1988983.htm's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://www.farmer.com.cn/xwpd/jsbd/201312/t20131209_921047.htm's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://yuqing.people.com.cn/n/2013/1211/c212785-23814825.html's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131209_637275.htm's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ task.url->http://www.cpnn.com.cn/zdcmyqjc/mtdj/201312/t20131209_637286.htm's host is not the same as site.host->http://news.baidu.com/ns?word=%E8%88%86%E6%83%85%E7%9B%91%E6%B5%8B&pn=00&cl=2&ct=1&tn=news&rn=20&ie=utf-8&bt=0&et=0&rsv_page=1
[SPIDERMAN] 10:01:28 [INFO] ~ C:\Users\����\Desktop\spiderman3\spiderman-sample\target\test-classes\Data\oschina\question/count_1_no_source_url_ create finished...
[SPIDERMAN] 10:01:28 [INFO] ~ site -> oschina task parse finished count ->1
2013-12-13 10:01:28 org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Cookie rejected: "[version: 0][name: BDRCVFR[C0p6oIjvx-c]][value: mk3SLVN4HKm][domain: www.baidu.com][path: /][expiry: null]". Illegal domain attribute "www.baidu.com". Domain of origin: "news.baidu.com"
[SPIDERMAN] 10:01:29 [INFO] ~ C:\Users\����\Desktop\spiderman3\spiderman-sample\target\test-classes\Data\oschina\question/count_2 create finished...
[SPIDERMAN] 10:01:29 [INFO] ~ site -> oschina task parse finished count ->2
2013-12-13 10:01:29 org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Cookie rejected: "[version: 0][name: BDRCVFR[C0p6oIjvx-c]][value: mk3SLVN4HKm][domain: www.baidu.com][path: /][expiry: null]". Illegal domain attribute "www.baidu.com". Domain of origin: "news.baidu.com"
[SPIDERMAN] 10:01:30 [INFO] ~ C:\Users\����\Desktop\spiderman3\spiderman-sample\target\test-classes\Data\oschina\question/count_3 create finished...
[SPIDERMAN] 10:01:30 [INFO] ~ site -> oschina task parse finished count ->3
2013-12-13 10:01:30 org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Cookie rejected: "[version: 0][name: BDRCVFR[C0p6oIjvx-c]][value: mk3SLVN4HKm][domain: www.baidu.com][path: /][expiry: null]". Illegal domain attribute "www.baidu.com". Domain of origin: "news.baidu.com"
[SPIDERMAN] 10:01:31 [INFO] ~ C:\Users\����\Desktop\spiderman3\spiderman-sample\target\test-classes\Data\oschina\question/count_4 create finished...
[SPIDERMAN] 10:01:31 [INFO] ~ site -> oschina task parse finished count ->4
2013-12-13 10:01:31 org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Cookie rejected: "[version: 0][name: BDRCVFR[C0p6oIjvx-c]][value: mk3SLVN4HKm][domain: www.baidu.com][path: /][expiry: null]". Illegal domain attribute "www.baidu.com". Domain of origin: "news.baidu.com"
[SPIDERMAN] 10:01:32 [INFO] ~ C:\Users\����\Desktop\spiderman3\spiderman-sample\target\test-classes\Data\oschina\question/count_5 create finished...
[SPIDERMAN] 10:01:32 [INFO] ~ site -> oschina task parse finished count ->5

How should I modify the config to achieve this? Thanks.





自风:

It looks like your Cookie is being rejected.
自风:

Hello, the new version, Spiderman2, uses exactly this case as its default test example. Take a look if you are interested: http://git.oschina.net/l-weiwei/Spiderman2
