
Multi-threaded download of full-size images from Baidu Image Search

Posted by nkiy on 2012-03-20 at 17:00 · 21 comments / 3748 views
v0.1.0: Added two exception handlers so the program no longer exits when reading a URL raises an exception
v0.1.1: Fixed an incorrect conditional statement
v0.1.2: Fixed the nextPage function in the Baidu module and the nextPage call arguments in the App module
v0.1.3: Fixed several errors; switched to multi-threaded downloading
v0.1.4: Added proxy support
v0.1.5: Replaced the threadDown subclass with a Queue (see the sketch after this list)
v0.1.6: Split the configuration out into Config
v0.1.7: Added a Jandan module; functionality not implemented yet
v0.1.8: Added proxy password authentication; untested
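The v0.1.5 entry refers to the producer/consumer pattern that App.py and Downloader.py use: image URLs are pushed onto a Queue, worker threads started with thread.start_new_thread drain it, and queue.join() blocks until every item has been marked done with task_done(). A minimal, self-contained sketch of that pattern (the worker body here is illustrative, not taken from the project):

#!/usr/bin/env python2
from Queue import Queue
from thread import start_new_thread

def worker(queue):
  # Pull items until the queue is drained, marking each one done.
  while not queue.empty():
    item = queue.get()
    print 'processing', item
    queue.task_done()

queue = Queue()
for i in range(10):
  queue.put(i)
for _ in range(3):
  start_new_thread(worker, (queue,))
queue.join()  # returns once every put() item has a matching task_done()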

Tags: Baidu

Code snippets (9)

1. [File] App.py ~ 2KB (115 downloads)

#!/usr/bin/env python2
# -*- coding:utf-8 -*-
from Baidu import getImageUrlList, search, nextPage, searchResult
from Downloader import downloadFromQueue
from FileHelper import getFilenameFromURL, addExtension, makedir
from Queue import Queue
from thread import start_new_thread
from Config import Config
from NetworkPrepare import prepare
import os, sys

def baseURL():
  if Config.site == 'baidu':
    return search(Config.keyword, Config.addtional)
  if Config.site == 'jandan':
    return 'http://jandan.net/ooxx'

def main():
  # Initial setup
  prepare()
  while_n = 0 # loop counter
  imglist = []
  makedir(Config.directory)
  print 'Generate search url'
  URL = baseURL()
  # Download #############
  # Get the number of search results and take the smaller of it and Config.count
  count = min(searchResult(URL), Config.count)
  # Exit when there are no search results
  if not count:
    print "No search results for the current settings."
    sys.exit(1)
  # Collect the requested number of urls into a list
  print 'Fetching page',
  while len(imglist) < count:
    print while_n,
    while_n += 1
    tmplist = getImageUrlList(URL)
    imglist = imglist + tmplist
    URL = nextPage(URL, len(tmplist))
  print '' # newline
  count = len(imglist)
  print "There are %d files to download" % count
  # Drop files that already exist from imglist
  imglist = [url for url in imglist
             if not getFilenameFromURL(url) in os.listdir(Config.directory)]
  print "%d files have already been downloaded." % (count - len(imglist))
  # Download the list
  print 'Fetching list of %d files' % len(imglist)
  queue = Queue()
  for url in imglist:
    queue.put(url)
  failure = []
  for i in range(Config.thread_count):
    start_new_thread(downloadFromQueue,
                     (queue, failure, Config.directory, Config.timeout))
  queue.join()
  print "%d failed to fetch." % len(failure)

def clean():
  # Clean up
  # 1. Add file extensions
  print 'Adding extension ...'
  for fname in os.listdir(Config.directory):
    addExtension(Config.directory + os.sep + fname, '.jpg')
  print 'done.'
  # 2. Save the cookies
  Config.cj.save()

if __name__ == "__main__":
  main()
  clean()

2. [File] Baidu.py ~ 2KB (95 downloads)

#!/usr/bin/env python2
# -*- coding:utf-8 -*-
from Downloader import getStream
from MyParser import MyParser
from String import longestString, cutTo, cutBegin, getCodingContent
from urllib import urlencode
import json
import re

def getImageUrlFromScript(script):
  pattern = re.compile(r'(?<="objURL":").*?(?=")')
  groups = pattern.findall(script)
  new_group = [amatch.strip() for amatch in groups] # the more Pythonic way
  return new_group

def getImageUrlList(url):
  imglist = []
  for i in _getJsonList(url):
    imglist.append(i['objURL'].strip())
  return imglist

def _getJsonList(url):
  stream = getStream(url)
  data = getCodingContent(stream)
  pattern = re.compile(r'(?<=var imgdata =).*?(?=;v)')
  block = pattern.findall(data)[0]
  jsonlist = json.loads(block)
  return jsonlist['data'][:-1]

def nextPage(url, pn):
  url_pn = cutBegin(url, '&pn=')
  if not url_pn:
    url_pn = 0
  url_pn = int(url_pn) + pn
  return cutTo(url, '&pn') + '&pn=' + str(url_pn)

def search(keyword, addtionParams={}):
  """Generate a search url by the given keyword.
  params keyword: utf8 string"""
  url = 'http://image.baidu.com/i?'
  parser = MyParser()
  params = _getParams('http://image.baidu.com', parser)
  params.update(addtionParams)
  params.update({'word':keyword.decode('utf8').encode('gbk')})
  return url + urlencode(params)

def searchResult(url):
  parser = MyParser()
  parser.feed(getCodingContent(getStream(url)))
  block = longestString(parser.scriptList)
  parser.close()
  pattern = re.compile(r'(?<="listNum":)\d*(?=,)')
  count = pattern.findall(block)
  if count:
    count = int(count[0])
    return count
  return 0

def _getParams(url, parser):
  """Get a dict contained the url params"""
  stream = getStream(url)
  data = getCodingContent(stream)
  parser.feed(data)
  return parser.formParams

def _appendParams(adict):
  """Generate a url with params in adict."""
  p = [key + '=' + adict[key] for key in adict]
  return '&'.join(p)

3. [File] Config.py ~ 734B (94 downloads)

#!/usr/bin/env python2
# -*- coding:utf-8 -*-
from cookielib import LWPCookieJar
class Config:
  keyword = '美女' # keyword to search for; be careful not to change this file's encoding
  addtional = {'width':'1920', 'height':'1200'} # width and height; may be left empty: {}
  directory = r'image'  # where to store the downloads
  count = 30     # number of images to download; rounded up to a multiple of 20
  thread_count = 15 # number of threads
  timeout = 20 # download timeout; 20 is used here, 10 seemed a bit short
  # Proxy settings
  proxy = 'http://localhost:7001'
  use_proxy = False
  proxy_user = 'user_name'
  proxy_pass = 'password'
  proxy_auth = False
  cookies = 'cookies.txt'
  use_cookies = True
  cj = LWPCookieJar(cookies)
  site = 'baidu'  # site = 'jandan'
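A quick usage note, based on the fields above and on comment #3 below ("the main program is App.py; set a few parameters in it"): edit Config.py first (keyword, count, directory, and the proxy/cookie settings if needed), then run the entry script with Python 2:

  python2 App.py

Downloaded images end up in Config.directory; after the download finishes, clean() appends a .jpg extension to files that lack one and saves the cookie jar to Config.cookies.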

4. [File] Downloader.py ~ 1KB (87 downloads)

#!/usr/bin/env python2
# -*- coding:utf-8 -*-
from FileHelper import getFilenameFromURL, writeBinFile
import urllib2

def getStream(url, timeout=10):
  # Return a url stream, or False on failure
  request = urllib2.Request(url)
  request.add_header('User-Agent', UserAgent.Mozilla)
  try:
    stream = urllib2.urlopen(request, timeout=timeout)
  except (Exception, SystemExit): # catch SystemExit to keep running
    print "URL open error. Probably timed out."
    return False
  return stream

def downloadFromQueue(queue, failure, directory='.', timeout=10):
  """Get files from a list of urls.
  return : list, contained the failure fetch"""
  while not queue.empty():
    url = queue.get()
    stream = getStream(url, timeout=timeout)
    file_name = getFilenameFromURL(url)
    if stream and writeBinFile(stream, file_name, directory):
      queue.task_done()
      print "Fetching", url, 'done.'
      continue
    failure.append(url)
    queue.task_done()
  return failure

class UserAgent:
  Mozilla = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.14) Gecko/20080404 (FoxPlus) Firefox/2.0.0.14'

5. [File] FileHelper.py ~ 1KB (89 downloads)

#!/usr/bin/env python2
# -*- coding:utf-8 -*-
import re, os
def getFilenameFromURL(url):
  # Used by Downloader
  pos = url.rfind('/')
  shorted = url[pos + 1:]
  pattern = re.compile(r'\w*[\.\w]*')
  f_name = pattern.findall(shorted)[0]
  return f_name

def addExtension(fname, ext):
  # Used by App: append a file extension,
  # but only when the file has no extension yet
  if '.' not in fname:
    rename(fname, ext)
def rename(old, ext):
  # ext='.jpg'
  if os.path.isfile(old + ext):
    ext = '2' + ext
    rename(old, ext)
    return None
  print 'rename', old, old + ext
  os.rename(old, old + ext)

def makedir(directory):
  if not os.path.isdir(directory):
    os.mkdir(directory) # do not catch the exception when directory is an existing file; let the program exit

def writeBinFile(stream, file_name, directory='.', mode='wb'):
  """Read from the given url and write to file_name."""
  file_name = directory + os.sep + file_name
  if os.path.isfile(file_name):
    print 'File %s exists.' % file_name
    return False
  CHUNK_SIZE = 1024
  with open(file_name, mode) as fp:
    while True:
      try:
        chunk = stream.read(CHUNK_SIZE)
      except (Exception, SystemExit):
        print 'Fetching error. Probably timed out.'
        fp.close()
        os.remove(file_name)
        return False
      if not chunk: break
      fp.write(chunk)
  return True

6. [File] Jandan.py ~ 47B (86 downloads)

#!/usr/bin/env python2
# -*- coding:utf8 -*-
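# Jandan (jandan.net/ooxx) module: placeholder only; per the v0.1.7 changelog
# entry, the functionality has not been implemented yet.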

7. [File] MyParser.py ~ 1KB (88 downloads)

#!/usr/bin/env python2
# -*- coding:utf-8 -*-
import HTMLParser

class MyParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.toggle_script_parse = False
    self.toggle_form_parse = False
    self.scriptList = []
    self.formParams = {}
    self.result = 0

  def handle_starttag(self, tag, attrs):
    HTMLParser.HTMLParser.handle_starttag(self, tag, attrs)
    attrs = dict(attrs)
    if tag == 'script':
      self.toggle_script_parse = True
    # While inside the target form, collect the attrs of hidden input tags
    if tag == 'form' and attrs.has_key('name') and attrs['name'] == 'f1':
      self.toggle_form_parse = True
    if tag == 'input' and self.toggle_form_parse:
      if attrs.has_key('type') and attrs['type'] == 'hidden':
        key = attrs['name']
        value = attrs['value']
        self.formParams[key] = value

  def handle_endtag(self, tag):
    HTMLParser.HTMLParser.handle_endtag(self, tag)
    if tag == 'form' and self.toggle_form_parse:
      self.toggle_form_parse = False

  def handle_data(self, data):
    HTMLParser.HTMLParser.handle_data(self, data)
    if self.toggle_script_parse:
      self.scriptList.append(data)
      self.toggle_script_parse = False

  def reset(self):
    HTMLParser.HTMLParser.reset(self)
    self.toggle_script_parse = False
    self.toggle_form_parse = False
    self.scriptList = []
    self.formParams = {}
    self.result = 0

8. [File] NetworkPrepare.py ~ 852B (89 downloads)

#!/usr/bin/env python2
# -*- coding:utf-8 -*-
import urllib2
from Config import Config

def proxy_handler(proxy, use_proxy, proxy_auth=False, puser='', ppass=''):
  if use_proxy:
    return urllib2.ProxyHandler({"http" : proxy})
  return urllib2.ProxyHandler({})

def cookie_handler(cj):
  try:
    cj.revert()
  except Exception:
    pass
  cj.clear_expired_cookies()
  return urllib2.HTTPCookieProcessor(cj)

def prepare():
  ch = cookie_handler(Config.cj)
  ph = proxy_handler(Config.proxy, Config.use_proxy)
  if Config.proxy_auth:
    pm = urllib2.HTTPPasswordMgrWithDefaultRealm()
    pm.add_password(None, Config.proxy, Config.proxy_user, Config.proxy_pass)
    urllib2.install_opener(urllib2.build_opener(ch, ph, urllib2.ProxyBasicAuthHandler(pm)))
    return
  urllib2.install_opener(urllib2.build_opener(ch, ph))

9. [File] String.py ~ 1KB (93 downloads)

#!/usr/bin/env python2
# -*- coding:utf-8 -*-
def determinCoding(content, header):
  """Determine the coding of a url's content from the content and its headers.
  params header : the headers object of the response (stream.headers)"""
  content_type = header['Content-Type']
  tag = 'charset='
  if content_type:
    if tag in content_type:
      pos = content_type.index(tag)
      pos += len(tag)
      return content_type[pos:]
  content = content.lower()
  if tag in content:
    startpos = content.index(tag) + len(tag)
    endpos = content.index('"', startpos)
    return content[startpos:endpos]

def getCodingContent(stream):
  """Return the decoded content of the given stream.
  return - content : unicode string"""
  content = stream.read()
  coding = determinCoding(content, stream.headers)
  stream.close()
  return content.decode(coding)

def longestString(alist):
  """Return the longest string of a list of strings."""
  a_new_list = [len(a_str) for a_str in alist]
  pos = a_new_list.index(max(a_new_list))
  return alist[pos]

def cutTo(str_1, str_2):
  """Cut str_1 to the position just before str_2."""
  # str_2 itself is not included
  if not str_2 in str_1 :
    return str_1
  pos = str_1.index(str_2)
  return str_1[0:pos]

def cutBegin(str_1, str_2):
  # Return the part of str_1 after str_2 (used by Baidu.nextPage)
  if not str_2 in str_1:
    return None
  pos = str_1.index(str_2) + len(str_2)
  return str_1[pos:]



Comments (21)

  • #1: 丁杨帆, posted 2012-03-20 19:35
    Can't get it to download anything.
  • #2: 任民, posted 2012-03-20 20:18

    Quoting 丁杨帆:

    Can't get it to download anything.
    Your search keyword probably got filtered :-)
  • #3: nkiy, posted 2012-03-20 20:46

    Quoting 丁杨帆:

    Can't get it to download anything.
    I'll post an improved version shortly. The main program is App.py; set a few parameters in it.
  • #4: nkiy, posted 2012-03-20 20:58
    Updated.
  • #5: nkiy, posted 2012-03-21 14:15
    Comments welcome :)
  • #6: 蛋疼的淡定哥, posted 2012-03-22 17:00

    Quoting nkiy:

    Comments welcome :)
    I can only get online through a proxy. I added request.set_proxy('192.168.168.1', '2000') to getStream in Downloader to set the proxy, but downloads still fail. How should this kind of intranet proxy be handled?
  • #7: nkiy, posted 2012-03-22 17:38

    Quoting 蛋疼的淡定哥:

    I can only get online through a proxy. I added request.set_proxy('192.168.168.1', '2000') to getStream in Downloader to set the proxy, but downloads still fail. How should this kind of intranet proxy be handled?
    The feature has been added. It's not easy for me to test here; if anything goes wrong, leave a comment or message me on the site :)
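    For reference, the proxy support that was added lives in NetworkPrepare.py and installs a urllib2 opener globally rather than calling request.set_proxy() on each request. A minimal sketch of that approach for an intranet HTTP proxy (the address is the one from the comment above, used purely for illustration):

      import urllib2
      # Route all urllib2 traffic through the intranet proxy.
      proxy = urllib2.ProxyHandler({'http': 'http://192.168.168.1:2000'})
      urllib2.install_opener(urllib2.build_opener(proxy))
      print urllib2.urlopen('http://image.baidu.com').getcode()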
  • #8: KermitLau, posted 2012-03-22 21:08
    Traceback (most recent call last):
      File "App.py", line 3, in <module>
        from Baidu import getImageUrlList, search, nextPage, searchResult
      File "/home/kermit/download-baidu-pictures/Baidu.py", line 4, in <module>
        from MyParser import MyParser
      File "/home/kermit/download-baidu-pictures/MyParser.py", line 16
        attrs = {key:value for key, value in attrs}
                             ^

  • #9: nkiy, posted 2012-03-23 07:18

    Quoting liuleilei:

    Traceback (most recent call last):
      File "App.py", line 3, in <module>
        from Baidu import getImageUrlList, search, nextPage, searchResult
      File "/home/kermit/download-baidu-pictures/Baidu.py", line 4, in <module>
        from MyParser import MyParser
      File "/home/kermit/download-baidu-pictures/MyParser.py", line 16
        attrs = {key:value for key, value in attrs}
                             ^

    Sorry, is that the complete traceback output? You can change line 16 of MyParser.py to attrs = dict(attrs)
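    The SyntaxError comes from the dict-comprehension syntax, which needs Python 2.7 or newer; dict(attrs) builds the same mapping from the (name, value) pairs and works on earlier 2.x releases as well. A tiny illustration (the sample pairs are made up):

      attrs = [('name', 'word'), ('type', 'hidden')]
      # Python 2.7+ only:
      #   d = {key: value for key, value in attrs}
      # Equivalent and works on older 2.x:
      d = dict(attrs)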
  • #10: 永远对你好, posted 2012-03-23 17:15
    Has the OP done any simulated login or anything like that? I've been at it all afternoon and still haven't managed to log in to a single site.
    I'm just using urllib and urllib2...
  • #12: nkiy, posted 2012-03-23 18:43

    Quoting talentwang:

    Has the OP done any simulated login or anything like that? I've been at it all afternoon and still haven't managed to log in to a single site.
    I'm just using urllib and urllib2...
    Which site? Try cookielib
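    A minimal sketch of the cookielib approach suggested here, for a form-based login; the URL and form field names are placeholders, not taken from any real site:

      import urllib, urllib2, cookielib

      # Keep cookies across requests so the session survives the login POST.
      cj = cookielib.CookieJar()
      opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
      data = urllib.urlencode({'username': 'someone', 'password': 'secret'})
      response = opener.open('http://example.com/login', data)  # data makes this a POST
      print [c.name for c in cj]  # cookies captured from the login response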
  • #13: solu, posted 2012-03-24 00:10

    Quoting talentwang:

    Has the OP done any simulated login or anything like that? I've been at it all afternoon and still haven't managed to log in to a single site.
    I'm just using urllib and urllib2...
    Try twill, you'll find it a pleasure to use!
  • #14: KermitLau, posted 2012-03-26 13:57

    Quoting nkiy:

    Sorry, is that the complete traceback output? You can change line 16 of MyParser.py to attrs = dict(attrs)
    thx, it works...
  • #15: nkiy, posted 2012-03-26 15:13

    Quoting KermitLau:

    thx, it works...
    Heh, come back often, I'll be updating it frequently.
  • #16: lbxoqy, posted 2012-04-03 13:53
    What do I change if the proxy requires authentication (username, password)? Modify proxy to add username:password??
  • #17: nkiy, posted 2012-04-03 14:19

    Quoting lbxoqy:

    What do I change if the proxy requires authentication (username, password)? Modify proxy to add username:password??
    The key code:
      pm = urllib2.HTTPPasswordMgrWithDefaultRealm()
      pm.add_password(None, Config.proxy, Config.proxy_user, Config.proxy_pass)
      urllib2.install_opener(urllib2.build_opener(ch, ph, urllib2.ProxyBasicAuthHandler(pm)))
    The code will be updated shortly.
  • #18: nkiy, posted 2012-04-03 14:24

    Quoting lbxoqy:

    What do I change if the proxy requires authentication (username, password)? Modify proxy to add username:password??
    I have no way to test it here; give it a try and see if it works.
  • #19: lbxoqy, posted 2012-04-05 14:42

    Quoting nkiy:

    I have no way to test it here; give it a try and see if it works.
    Ha, OK, but I've already got it sorted out.
  • #20: bobjoin, posted 2012-04-11 14:30