Extracting a meta tag from the head with a regex

Fries posted on 2012/11/08 16:59

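I fetched a page's content with curl, and now I want to extract the description with a regular expression. It needs to handle both single and double quotes. I tried it myself, but I'm not familiar with regex and could not get the content I wanted, so I'm asking here for help.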

The head content is as follows:

<head>

    <meta name="viewport" content="width=device-width" />

        <meta name="keywords" content="Compression,Connectors,Mechanical,Connectors,Wire,Terminals,Cab">

        <meta name="description" content="Celestra is a global manufacturer, trade wholesalers, and a reliable partner for government works contracts trustworthy contractor.">

        <meta name="author" content="Celestra" />

        <meta name="copyright" content="Celestra Corporation Copyright 2002-2012,http://www.celestra.cn" />

            <meta charset="UTF-8"><title>Celestra | Home</title><link href="/base.css" media="screen" rel="stylesheet" type="text/css" >        <link href="http://www.celestra.cn/public/skins/celestra/css/index.css" rel="stylesheet" media="screen" type="text/css" />

        <script type="text/javascript" src="/jquery-1.7.1.min.js"></script>

        <script type="text/javascript" src="/base.js"></script>

        <link rel="stylesheet" media="screen" type="text/css" href="/css/en_US.css" /><script type="text/javascript" src="http://www.celestra.cn/public/js/celestra/miancarousel.js"></script>        <script> var baseUrl = "http://www.celestra.cn"; </script>

        <script type="text/javascript" src="/ipad.js"></script>

    </head>

    

骠骑将军

Just take the brute-force route and write the pattern to mirror the meta tag's format. Tested in Python:

>>> import re
>>> s = ''' <head>

    <meta name="viewport" content="width=device-width" />

        <meta name="keywords" content="Compression,Connectors,Mechanical,Connectors,Wire,Terminals,Cab">

        <meta name="description" content="Celestra is a global manufacturer, trade wholesalers, and a reliable partner for government works contracts trustworthy contractor.">

        <meta name="author" content="Celestra" />

        <meta name="copyright" content="Celestra Corporation Copyright 2002-2012,http://www.celestra.cn" />

            <meta charset="UTF-8"><title>Celestra | Home</title><link href="/base.css" media="screen" rel="stylesheet" type="text/css" >        <link href="http://www.celestra.cn/public/skins/celestra/css/index.css" rel="stylesheet" media="screen" type="text/css" />

        <script type="text/javascript" src="/jquery-1.7.1.min.js"></script>

        <script type="text/javascript" src="/base.js"></script>

        <link rel="stylesheet" media="screen" type="text/css" href="/css/en_US.css" /><script type="text/javascript" src="http://www.celestra.cn/public/js/celestra/miancarousel.js"></script>        <script> var baseUrl = "http://www.celestra.cn"; </script>

        <script type="text/javascript" src="/ipad.js"></script>

    </head> '''
>>> res = r'meta name="description" content="(.*?)"'
>>> m = re.findall(res,s)
>>> len(m)
1
>>> m
['Celestra is a global manufacturer, trade wholesalers, and a reliable partner for government works contracts trustworthy contractor.']
>>>
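
The question also asked for single and double quotes to be handled, which the pattern above does not do. A minimal sketch of one way to cover both, assuming the name attribute still comes directly before content; the names `DESCRIPTION_RE` and `find_description` and the sample strings are made up for illustration:

```python
import re

# (["']) captures the opening quote; the backreference (\1 or \2) then
# requires the same quote character to close that attribute value.
DESCRIPTION_RE = re.compile(
    r'''<meta\s+name=(["'])description\1\s+content=(["'])(.*?)\2''',
    re.IGNORECASE | re.DOTALL,
)

def find_description(html):
    """Return the description content, or None if no tag matches."""
    match = DESCRIPTION_RE.search(html)
    return match.group(3) if match else None

# Hypothetical inputs: one value contains the "other" quote character.
double_quoted = '<meta name="description" content="It\'s a test page.">'
single_quoted = "<meta name='description' content='Say \"hi\" to regex.' />"

print(find_description(double_quoted))
print(find_description(single_quoted))
print(find_description('<meta name="author" content="Celestra" />'))
```

Because the closing quote must equal the opening one, an apostrophe inside a double-quoted value no longer cuts the match short. Unquoted attribute values are not covered by this sketch.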

Fries
Thanks, but what if content and name appear in the opposite order?
骠骑将军

Thanks, but what if content and name appear in the opposite order?
-----------------------------------

Then you need to parse every meta tag, check each one's name, and print the one that matches. Again a Python example:

>>> import re
>>> s = ''' <head>

    <meta name="viewport" content="width=device-width" />

        <meta name="keywords" content="Compression,Connectors,Mechanical,Connectors,Wire,Terminals,Cab">

        <meta name="description" content="Celestra is a global manufacturer, trade wholesalers, and a reliable partner for government works contracts trustworthy contractor.">

        <meta name="author" content="Celestra" />

        <meta name="copyright" content="Celestra Corporation Copyright 2002-2012,http://www.celestra.cn" />

            <meta charset="UTF-8"><title>Celestra | Home</title><link href="/base.css" media="screen" rel="stylesheet" type="text/css" >        <link href="http://www.celestra.cn/public/skins/celestra/css/index.css" rel="stylesheet" media="screen" type="text/css" />

        <script type="text/javascript" src="/jquery-1.7.1.min.js"></script>

        <script type="text/javascript" src="/base.js"></script>

        <link rel="stylesheet" media="screen" type="text/css" href="/css/en_US.css" /><script type="text/javascript" src="http://www.celestra.cn/public/js/celestra/miancarousel.js"></script>        <script> var baseUrl = "http://www.celestra.cn"; </script>

        <script type="text/javascript" src="/ipad.js"></script>

    </head> '''
>>> res = r'<meta(.*?)>'
>>> m = re.findall(res,s)
>>> for item in m:
	if 'name="description"' in item:
		# Stop at the quote that closes content, even if name comes after it.
		start = item.index('content="') + len('content="')
		print(item[start:item.index('"', start)])

		
Celestra is a global manufacturer, trade wholesalers, and a reliable partner for government works contracts trustworthy contractor.
>>>
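
The same "parse every meta tag, then check its name" idea can be made independent of both attribute order and quote style by collecting each tag's attributes into a dict first. This is only a sketch: `META_RE`, `ATTR_RE`, and `meta_content` are hypothetical names, and it assumes that attribute values are quoted and that no value contains a `>` character:

```python
import re

# Captures the attribute text of every <meta ...> tag.
META_RE = re.compile(r'<meta\b([^>]*)>', re.IGNORECASE)
# One attribute: its name, '=', then a value closed by the quote that opened it.
ATTR_RE = re.compile(r'''([\w-]+)\s*=\s*(["'])(.*?)\2''', re.DOTALL)

def meta_content(html, wanted='description'):
    """Return the content of the first <meta> whose name equals `wanted`."""
    for attr_text in META_RE.findall(html):
        attrs = {name.lower(): value
                 for name, _quote, value in ATTR_RE.findall(attr_text)}
        if attrs.get('name', '').lower() == wanted:
            return attrs.get('content')
    return None

# Hypothetical input: content comes before name, with mixed quote styles.
html = ('<meta charset="UTF-8">'
        '<meta name="author" content="Celestra" />'
        "<meta content='Swapped order works.' name=\"description\" />")

print(meta_content(html))
print(meta_content(html, 'author'))
print(meta_content(html, 'keywords'))
```

For pages that are not this regular, a real HTML parser such as Python's standard `html.parser` module is likely to be more dependable than any regex.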
