Downloading Baidu Space articles with Python

Posted by hcqenjoy on 2011-01-22 at 17:00, 1 comment / 1330 views
Tags: Baidu

Code snippet (1): Python

#! /usr/bin/env python
#coding=utf-8
import urllib2
import re
import sys
import os

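# Module-level settings; filled in under __main__ before reptile() runs.
pattern = ""    # compiled regex that matches article links
reg_tail = ""   # compiled regex for the "last page" (尾页) link; unused below
username = ""   # Baidu Space user name, taken from the URL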

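def downURL(url, filename):
    """Download url and save it as filename in the user's directory."""
    print "Download %s, save as %s" % (url, filename)
    try:
        fp = urllib2.urlopen(url)
    except Exception:
        print "download exception"
        return 0
    # Join with os.path so a separator sits between cwd, user name and file.
    paths = os.path.join(os.getcwd(), username, filename)
    op = open(paths, "wb")
    try:
        while 1:
            s = fp.read(8192)  # read in chunks rather than all at once
            if not s:
                break
            op.write(s)
    finally:
        fp.close()
        op.close()
    return 1


def getURL(url):
    """Return every article link found on the index page at url."""
    print "Parsing %s" % url
    try:
        fp = urllib2.urlopen(url)
        contents = fp.readlines()
    except Exception:
        print "exception"
        return []

    item_list = []
    for s in contents:
        urls = pattern.findall(s)
        if urls:
            item_list.extend(urls)
    fp.close()
    return item_list


def CreateDirectory():
    """Create a directory named after the user under the current directory."""
    target = os.path.join(os.getcwd(), username)
    if not os.path.exists(target):
        os.mkdir(target)
        print 'step 2:Create Directory  Success!'
    else:
        print 'step 2:Directory already exists!'

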
def reptile(base_url):
    """
    Download all articles from base_url.
    Arguments:
    - `base_url`: Url of website.
    """
    base_page = base_url.rstrip("/") + "/blog/index/"
    sign_tail = u"尾页"  # link text of the "last page" link
    tail = ""
    total_page = 10  # fallback when the last-page link cannot be found
    print 'step 3:Number of index'

    try:
        fp = urllib2.urlopen(base_page + "0")
    except Exception:
        print "%s: No such url" % base_page
        print sys.exc_info()
    else:
        # Find the line holding the last-page link; the page is decoded as GBK.
        for s in fp.readlines():
            line = s.decode("gbk", "ignore")
            if sign_tail in line:
                tail = line
                break
        fp.close()

    if tail:
        # Take the last path component just before the link text as the
        # index of the final page.
        pos = tail.rfind(sign_tail)
        total_page = int(tail[:pos-3].split("/")[-1])

    output_list = []
    for idx in range(total_page + 1):
        item_page = "%s%d" % (base_page, idx)
        item_list = getURL(item_page)
        if item_list:
            output_list.extend(item_list)

    print 'step 4:Down pages!'
    item_list = list(set(output_list))  # drop duplicate links
    for item in item_list:
        # Turn the site-relative link into an absolute URL.
        down_url = item.replace("/%s" % username,
                                "http://hi.baidu.com/%s" % username)
        local_file = down_url.split("/")[-1]
        downURL(down_url, local_file)
    print "step 5:Total: %d articles." % (len(item_list))
    print "Congratulations"

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Usage: %s url of baidu space" % sys.argv[0]
        print "Such as: %s http://hi.baidu.com/Username" % sys.argv[0]
        sys.exit(1)
    base_url = sys.argv[1]
    if not base_url.startswith("http://hi.baidu.com/"):
        print "Wrong Type of URL??", "It works on Baidu Space only."
        sys.exit(1)

    username = base_url.rstrip("/").split("/")[-1]
    print ('step 1:' + username)
    CreateDirectory()
    reg_tail = re.compile(u"%s.*?尾页" % username)
    # Escape the user name so characters such as "." are matched literally.
    pattern = re.compile(r"/%s/blog/item/.*?\.html" % re.escape(username))
    reptile(base_url)
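
The link-extraction and file-naming steps of the script can be tried without any network access. The following is only a minimal sketch, written for Python 3 rather than the Python 2 of the script above. The helper names `extract_article_urls` and `local_path`, the `downloads` directory, and the sample HTML fragment are all made up for illustration; real Baidu Space markup may have looked different.

```python
import os
import re

BASE = "http://hi.baidu.com"  # host used by the script above


def extract_article_urls(html, username):
    """Return the sorted, de-duplicated absolute article URLs found in html."""
    # Same pattern shape as the script; re.escape makes the user name literal.
    pattern = re.compile(r"/%s/blog/item/.*?\.html" % re.escape(username))
    return sorted(BASE + path for path in set(pattern.findall(html)))


def local_path(root, username, url):
    """Return the path a downloaded article would be saved under."""
    # os.path.join inserts the separators that plain concatenation omits.
    return os.path.join(root, username, url.split("/")[-1])


# Hypothetical index-page fragment, made up for illustration only.
sample = (
    '<a href="/someuser/blog/item/abc123.html">First</a>'
    '<a href="/someuser/blog/item/def456.html">Second</a>'
    '<a href="/someuser/blog/item/abc123.html">First again</a>'
)

urls = extract_article_urls(sample, "someuser")
targets = [local_path("downloads", "someuser", u) for u in urls]
for url, target in zip(urls, targets):
    print(url, "->", target)
```

The duplicate link in the sample is collapsed by the `set`, which mirrors the `list(set(output_list))` step in `reptile`.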



Comments (1)

  • #1: lbxoqy, posted 2011-12-16 15:58
    The formatting is all wrong~~