WebMagic on Linux: error when crawling pages whose Content-Encoding is gzip

@黄亿华 Hello, I'd like to ask you a question:

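Are you around? When I crawl pages with WebMagic, everything works locally, but on our production Linux server it frequently fails with (java.util.zip.ZipException: Not in GZIP format). I checked, and the URL's Content-Encoding is gzip. How can I fix this? I'm waiting online for a reply.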

The error log is as follows:


get page: http://ent.huanqiu.com/movie/yingshi-neidi/?pindao=22
[INFO][2016-11-04 16:08:36][us.codecraft.webmagic.Spider]-Spider ent.huanqiu.com started!
[INFO][2016-11-04 16:08:36][us.codecraft.webmagic.downloader.HttpClientDownloader]-downloading page http://ent.huanqiu.com/movie/yingshi-gangtai/?pindao=22
[WARN][2016-11-04 16:08:37][us.codecraft.webmagic.downloader.HttpClientDownloader]-download page http://ent.huanqiu.com/movie/yingshi-gangtai/?pindao=22 error
java.util.zip.ZipException: Not in GZIP format
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:78)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90)
    at org.apache.http.client.protocol.ResponseContentEncoding$1.create(ResponseContentEncoding.java:67)
    at org.apache.http.client.entity.LazyDecompressingInputStream.initWrapper(LazyDecompressingInputStream.java:54)
    at org.apache.http.client.entity.LazyDecompressingInputStream.read(LazyDecompressingInputStream.java:66)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1025)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
    at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:218)
    at us.codecraft.webmagic.downloader.HttpClientDownloader.getContent(HttpClientDownloader.java:189)
    at us.codecraft.webmagic.downloader.HttpClientDownloader.handleResponse(HttpClientDownloader.java:178)
    at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:96)
    at us.codecraft.webmagic.Spider.processRequest(Spider.java:408)
    at us.codecraft.webmagic.Spider$1.run(Spider.java:322)
    at us.codecraft.webmagic.selector.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[INFO][2016-11-04 16:08:37][us.codecraft.webmagic.Spider]-Spider ent.huanqiu.com started!
[INFO][2016-11-04 16:08:37][us.codecraft.webmagic.downloader.HttpClientDownloader]-downloading page http://ent.huanqiu.com/movie/yingshi-guoji/?pindao=22
[INFO][2016-11-04 16:08:38][org.jdiy.core.Dao]-SELECT * FROM TaskInfo WHER
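
The trace shows GZIPInputStream.readHeader rejecting the body, which happens when the first two bytes are not the gzip magic number (0x1f 0x8b) even though the response header says gzip. One way to confirm this is to look at the raw, undecoded bytes. Below is a minimal sketch of that check using only the JDK; the class and method names (GzipSniffer, looksLikeGzip, decodeIfGzip) are hypothetical and not part of WebMagic or HttpClient, and how to obtain the raw bytes inside WebMagic is left open here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of WebMagic or HttpClient.
class GzipSniffer {

    /** Returns true if the body starts with the gzip magic bytes 0x1f 0x8b. */
    static boolean looksLikeGzip(byte[] body) {
        return body != null
                && body.length >= 2
                && (body[0] & 0xff) == 0x1f
                && (body[1] & 0xff) == 0x8b;
    }

    /** Decompresses the body only if it really is gzip; otherwise returns it unchanged. */
    static byte[] decodeIfGzip(byte[] body) throws IOException {
        if (!looksLikeGzip(body)) {
            return body; // header may claim gzip, but the bytes are not gzip
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "<html>ok</html>".getBytes(StandardCharsets.UTF_8);

        // Build a real gzip body to compare against the plain one.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        }
        byte[] zipped = bos.toByteArray();

        System.out.println("plain looks like gzip:  " + looksLikeGzip(plain));
        System.out.println("zipped looks like gzip: " + looksLikeGzip(zipped));
        System.out.println(new String(decodeIfGzip(zipped), StandardCharsets.UTF_8));
    }
}
```

If the raw bytes from the Linux server turn out not to start with the magic number, the Content-Encoding header and the actual body disagree, and the server, a proxy, or something else between the crawler and the site would be the next thing to inspect. This is only a diagnostic under that assumption, not a confirmed fix for the error above.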


jibaole
Posted 1 year ago · 0 replies / 205 views