I followed a tutorial and adapted a Python Scrapy spider. It reports no errors, but it doesn't scrape anything. Could someone take a look? Thanks

bb2018, posted on 2017/04/15 08:29
Views: 2K+
Bookmarked: 2


Python 2.7, Scrapy 1.3

Command: scrapy crawl dingdianspider

Output:

 

E:\pypro\dingdian>scrapy crawl dingdianspider 
2017-04-14 23:19:33 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: dingdian) 
2017-04-14 23:19:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'dingdian.spiders', 'SPIDER_MODULES': ['dingdian.spiders'], 'ROBOTSTXT_OBEY': True, 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'dingdian'}
2017-04-14 23:19:36 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-04-14 23:19:39 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats', 
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware'] 
2017-04-14 23:19:39 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-04-14 23:19:41 [scrapy.middleware] INFO: Enabled item pipelines: 
['dingdian.mysqlpipelines.pipelines.DingdianPipeline'] 
2017-04-14 23:19:41 [scrapy.core.engine] INFO: Spider opened 
2017-04-14 23:19:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-04-14 23:19:42 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-04-14 23:19:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/robots.txt> (referer: None) ['cached']
2017-04-14 23:19:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/1_1.htm> (referer: None) ['cached']
2017-04-14 23:19:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/2_1.htm> (referer: None) ['cached']
2017-04-14 23:19:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/3_1.htm> (referer: None) ['cached']
2017-04-14 23:19:43 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.23us.com/class/1_1.htm>: HTTP status code is not handled or not allowed
2017-04-14 23:19:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/4_1.htm> (referer: None) ['cached']
2017-04-14 23:19:43 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.23us.com/class/2_1.htm>: HTTP status code is not handled or not allowed
2017-04-14 23:19:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/5_1.htm> (referer: None) ['cached']
2017-04-14 23:19:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/6_1.htm> (referer: None) ['cached']
2017-04-14 23:19:43 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.23us.com/class/3_1.htm>: HTTP status code is not handled or not allowed
2017-04-14 23:19:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/7_1.htm> (referer: None) ['cached']
2017-04-14 23:19:43 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.23us.com/class/4_1.htm>: HTTP status code is not handled or not allowed
2017-04-14 23:19:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/8_1.htm> (referer: None) ['cached']
2017-04-14 23:19:43 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.23us.com/class/5_1.htm>: HTTP status code is not handled or not allowed
2017-04-14 23:19:44 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/9_1.htm> (referer: None) ['cached']
2017-04-14 23:19:44 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.23us.com/class/6_1.htm>: HTTP status code is not handled or not allowed
2017-04-14 23:19:44 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.23us.com/class/7_1.htm>: HTTP status code is not handled or not allowed
2017-04-14 23:19:44 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/10_1.htm> (referer: None) ['cached']
2017-04-14 23:19:44 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.23us.com/class/8_1.htm>: HTTP status code is not handled or not allowed
2017-04-14 23:19:44 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.23us.com/class/9_1.htm>: HTTP status code is not handled or not allowed
2017-04-14 23:19:44 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.23us.com/class/10_1.htm>: HTTP status code is not handled or not allowed
2017-04-14 23:19:44 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-04-14 23:19:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 2451, 
'downloader/request_count': 11, 
'downloader/request_method_count/GET': 11, 
'downloader/response_bytes': 18975, 
'downloader/response_count': 11, 
'downloader/response_status_count/404': 11, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 4, 14, 15, 19, 44, 268000), 
'httpcache/hit': 11, 
'log_count/DEBUG': 12, 
'log_count/INFO': 17, 
'response_received_count': 11, 
'scheduler/dequeued': 10, 
'scheduler/dequeued/memory': 10, 
'scheduler/enqueued': 10, 
'scheduler/enqueued/memory': 10, 
'start_time': datetime.datetime(2017, 4, 14, 15, 19, 42, 940000)} 
2017-04-14 23:19:44 [scrapy.core.engine] INFO: Spider closed (finished) 




There's too much code to paste here, so I zipped it up and put it on Baidu Pan: http://pan.baidu.com/s/1skBu8Pn

I'm a beginner. Could someone please take a look and point me in the right direction?

Why does the code run without errors yet scrape nothing, with no hint about what needs fixing?
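The log above does hold a clue: all 11 responses were 404s, and Scrapy's HttpErrorMiddleware silently drops non-2xx responses before they ever reach the spider, so there is nothing to scrape and no traceback. A sketch of settings.py changes that make such failures visible while debugging (setting names are Scrapy's; the values are debugging suggestions, not the project's actual config):

```python
# settings.py (debugging sketch): let 404 responses reach the spider
# callback instead of being silently dropped by HttpErrorMiddleware.
HTTPERROR_ALLOWED_CODES = [404]

# The log shows every response came from the cache (['cached']), which
# would also mask any fix; disable the HTTP cache while debugging so
# each run makes fresh requests.
HTTPCACHE_ENABLED = False
```

With 404s allowed through, the spider callback receives the response and can log `response.status` itself, which turns a silent failure into a visible one.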

magiclogy
[scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.23us.com/class/7_1.htm> (referer: None) ['cached']

It's a 404: the URL itself is wrong. Look closely and you'll see it should end in .html, not .htm.
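The spider's code isn't shown here, but if it builds its category URLs the way the log suggests (class/1_1 through class/10_1), the fix is just the extension. A hypothetical sketch (`base` and the range are assumptions based on the logged URLs):

```python
# Hypothetical reconstruction of how the start URLs might be built.
# The key change from the failing run: '.html' instead of '.htm'.
base = "http://www.23us.com/class/{}_1.html"
start_urls = [base.format(category) for category in range(1, 11)]

print(start_urls[0])  # http://www.23us.com/class/1_1.html
```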

chenhong00
When you're learning to program, debug step by step; you have to pin down where the bug actually is!
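To see why "no error" can still mean "no items", it helps to know that Scrapy's HttpErrorMiddleware only passes successful responses on to the spider's parse() callback. A simplified model of that behavior (this is an illustration, not the real middleware code):

```python
def will_reach_parse(status, allowed=()):
    """Simplified model of Scrapy's HttpErrorMiddleware: by default only
    2xx responses (or explicitly allowed codes) reach the spider callback."""
    return 200 <= status < 300 or status in allowed

# Every response in the log above was a 404, so parse() never ran and
# no items could be scraped; there was no traceback to report either.
print(will_reach_parse(404))          # False
print(will_reach_parse(404, (404,)))  # True once 404 is whitelisted
```

This is why localizing the bug step by step works: first confirm the URLs return 200 (for example with `scrapy shell <url>`), and only then debug the parsing and pipeline code.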