A bizarre problem when scraping web page data with python + lxml

王囧 posted on 2015/12/23 00:51

The page I am fetching is:

https://www.theice.com/marketdata/reports/icebenchmarkadmin/ICELiborHistoricalRates.shtml?criteria.currencyCode=EUR&criteria.reportDate=17-Dec-2015

After fetching the page, I parse the data like this:

import urllib2
import lxml.etree
import lxml.html

# req (not shown in this post) refers to the URL above
content = urllib2.urlopen(req, timeout=60*3).read()
htmlSource = lxml.html.fromstring(content)
xpath_1 = '''//*[@id="ratesTable"]/tbody/tr'''
tree = htmlSource.xpath(xpath_1)
for idx, tr in enumerate(tree):
    # serialize this row and re-parse it as a separate fragment
    content = lxml.etree.tostring(tr)
    htmltmp = lxml.html.fromstring(content)
    print idx, htmltmp.xpath("//td[1]/span")[0].text.strip()
    print idx, tr.xpath("//td[1]/span")[0].text.strip()
    print idx, tr.xpath("//td[1]/span")[idx].text.strip()



Output:

0 Overnight
0 Overnight
0 Overnight
1 1 Week
1 Overnight
1 1 Week
2 1 Month
2 Overnight
2 1 Month
3 2 Month
3 Overnight
3 2 Month
4 3 Month
4 Overnight
4 3 Month
5 6 Month
5 Overnight
5 6 Month
6 1 Year
6 Overnight
6 1 Year

Inside the loop, tr and htmltmp should by rights be the same HTML fragment, so why do the first and the second print give different results?
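One likely explanation, which the small sketch below tries to reproduce: in XPath, an expression that starts with `//` is evaluated from the root of the document the element belongs to, not from the element itself. tr still lives inside the full page, so `//td[1]/span` matches the first cell of every row, and index 0 is always the first row. htmltmp was re-parsed into its own document that contains only that one row. The markup in SAMPLE is made up for illustration and is not the real ICE page; the names SAMPLE, absolute, relative and reparsed are hypothetical.

```python
import lxml.etree
import lxml.html

# Hypothetical stand-in for the real page: a tiny table with the same id.
SAMPLE = """
<html><body>
<table id="ratesTable"><tbody>
<tr><td><span> Overnight </span></td></tr>
<tr><td><span> 1 Week </span></td></tr>
<tr><td><span> 1 Month </span></td></tr>
</tbody></table>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE)
rows = doc.xpath('//*[@id="ratesTable"]/tbody/tr')

absolute = []  # "//..." called on tr: searches the whole document
relative = []  # ".//..." called on tr: searches only inside this row
reparsed = []  # "//..." on a re-parsed copy that contains only this row

for tr in rows:
    absolute.append(tr.xpath("//td[1]/span")[0].text.strip())
    relative.append(tr.xpath(".//td[1]/span")[0].text.strip())
    # Same round trip as in the question: serialize the row, parse it again.
    copy = lxml.html.fromstring(lxml.etree.tostring(tr))
    reparsed.append(copy.xpath("//td[1]/span")[0].text.strip())

print("absolute: %s" % absolute)
print("relative: %s" % relative)
print("reparsed: %s" % reparsed)
```

If this is indeed the cause, writing the path as `.//td[1]/span` (or simply `td[1]/span`) on tr should give the per-row value without the tostring/fromstring round trip.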
xjfengck
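I can't debug this on my phone. I suggest calling html.open_in_browser(tree), where tree is the parsed document, to look at the structure that is actually being parsed. Put the browser in offline mode while viewing it, and don't compare against what you see when browsing the page normally: urllib2 does not run JavaScript, so dynamically generated content will be missing from what it downloads.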
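A rough sketch of that kind of check, for cases where no browser is at hand: dump the tree that lxml built and count the rows the XPath finds. STATIC_HTML is an invented example of a page whose table would only be filled in by JavaScript; it is an assumption for illustration and not what the ICE server actually returns.

```python
import lxml.html

# Hypothetical response body: the table is absent until a script would add it.
STATIC_HTML = (
    '<html><body>'
    '<div id="report">Loading rates...</div>'
    '</body></html>'
)

doc = lxml.html.fromstring(STATIC_HTML)

# Serialize the tree exactly as lxml sees it, as text.
dump = lxml.html.tostring(doc, pretty_print=True, encoding="unicode")
print(dump)

# The same kind of lookup as in the question, run against the static HTML.
rows = doc.xpath('//*[@id="ratesTable"]//tr')
print("rows found: %d" % len(rows))

# On a desktop, this opens the same tree in the default browser:
# lxml.html.open_in_browser(doc)
```

If the row count is zero here but the table is visible in a normal browser session, the data is probably being added by a script after the page loads.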