Java web scraping: how do I extract the data inside a SCRIPT on a page?

qizi456258 posted on 2012/07/11 17:33
http://www.fedex.com/Tracking?clienttype=dotcomreg&ascend_header=1&cntry_code=cn&language=sim&mi=n&tracknumbers=874589732820
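On this page I want to scrape the progress entries in the shipment's travel history. Looking at the page source, though, the data lives inside JavaScript, so it can't be scraped the usual way. The data I need is in var detailInfoObject in the SCRIPT. Any help from the experts would be appreciated.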
catty

Use Jsoup (to parse the HTML) + ScriptEngine (to execute the JS)

import java.net.URL;

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import net.sf.json.JSONArray;
import net.sf.json.JSONObject;


public class BbTest3 {

	public static void main(String args[]) throws Exception {
		// Use Jsoup to fetch and parse the HTML
		String url = "http://www.fedex.com/Tracking?clienttype=dotcomreg&ascend_header=1&cntry_code=cn&language=sim&mi=n&tracknumbers=874589732820";
		Document doc = Jsoup.parse(new URL(url), 3000);

		// Get all script tags
		Elements eles = doc.getElementsByTag("script");
		for (Element ele : eles) {

			// Check whether this tag contains the string detailInfoObject
			String script = ele.toString();
			if (script.indexOf("detailInfoObject") > -1) {

				// Keep only the script's contents, without the surrounding tag
				script = ele.data();

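				// Evaluate the script with ScriptEngine
				ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
				if (engine == null) {
					throw new IllegalStateException("No JavaScript engine is available in this JDK");
				}
				engine.eval(script);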

				// Read the variable you want
				Object obj = engine.get("detailInfoObject");
				System.out.println("detailInfoObject = " + obj);

				// Convert obj to a JSON object
				JSONObject json = JSONObject.fromObject(obj);
				System.out.println("json = " + json);

				// Read a field
				System.out.println("destInfo = " + json.get("destInfo"));

				// Read a field (array type)
				JSONArray scans = json.getJSONArray("scans");
				for (int i = 0, max = scans.size(); i < max; i++) {
					JSONObject child = (JSONObject) scans.get(i);
					System.out.println("scans[" + i + "] = " + child);
				}

			}
		}
	}
}
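
If evaluating the whole script is not an option (for example, because it touches browser-only objects), the object literal can also be cut out of the script text directly. The following is only a rough sketch of that idea: ScriptVarExtractor and extractObjectLiteral are hypothetical names, not part of the code above, and the brace matching assumes the variable is assigned a plain object literal with no comments or regex literals inside it.

```java
import java.util.Optional;

/** Hypothetical helper: cuts `var <name> = {...}` out of script text. */
class ScriptVarExtractor {

    /**
     * Returns the object literal assigned to the named variable, or empty
     * if the declaration is missing or its braces never balance.
     */
    static Optional<String> extractObjectLiteral(String script, String name) {
        int decl = script.indexOf("var " + name);
        if (decl < 0) {
            return Optional.empty();
        }
        // Assumption: the first '{' after the declaration opens the literal.
        int start = script.indexOf('{', decl);
        if (start < 0) {
            return Optional.empty();
        }

        int depth = 0;
        char quote = 0; // 0 means we are outside a string literal
        for (int i = start; i < script.length(); i++) {
            char c = script.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;            // skip the escaped character
                } else if (c == quote) {
                    quote = 0;      // string literal ends
                }
            } else if (c == '"' || c == '\'') {
                quote = c;          // string literal starts
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return Optional.of(script.substring(start, i + 1));
                }
            }
        }
        return Optional.empty();
    }
}
```

The returned text is still JavaScript, not necessarily strict JSON (keys may be unquoted), so a JSON parser may or may not accept it as is.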


catty
Execution result

qizi456258
Reply to @catty: I can't open the link you gave; it says it cannot be accessed
catty
Reply to @qizi456258: I don't have QQ, so I've put it on Dropbox instead: https://dl.dropbox.com/u/19427089/temp/ParseHtmlAndEvalScript.zip
qizi456258
Could you send me that project? I've been debugging this for a long time without any progress. My QQ is 278473051
qizi456258
detailInfoObject = [object Object]
2012-7-12 13:43:36 net.sf.json.JSONObject _fromBean
警告: Property 'attributes' has no read method. SKIPPED
2012-7-12 13:43:36 net.sf.json.JSONObject _fromBean
警告: Property 'attributes' has no read method. SKIPPED
2012-7-12 13:43:36 net.sf.json.JSONObject _fromBean
警告: Property 'attributes' has no read method. SKIPPED
Exception in thread "main" net.sf.json.JSONException: There is a cycle in the hierarchy!
at net.sf.json.util.CycleDetectionStrategy$StrictCycleDetectionStrategy.handleRepeatedReferenceAsObject(CycleDetectionStrategy.java:73)
at net.sf.json.JSONObject._fromBean(JSONObject.java:658)
at net.sf.json.JSONObject.fromObject(JSONObject.java:182)
at net.sf.json.JSONObject._processValue(JSONObject.java:2426)
at net.sf.json.JSONObject._setInternal(JSONObject.java:2447)
at net.sf.json.JSONObject.setValue(JSONObject.java:1189)
at net.sf.json.JSONObject._fromBean(JSONObject.java:725)
at net.sf.json.JSONObject.fromObject(JSONObject.java:182)
at net.sf.json.JSONObject._processValue(JSONObject.java:2426)
at net.sf.json.JSONObject._setInternal(JSONObject.java:2447)
at net.sf.json.JSONObject.setValue(JSONObject.java:1189)
at net.sf.json.JSONObject._fromBean(JSONObject.java:725)
at net.sf.json.JSONObject.fromObject(JSONObject.java:182)
at net.sf.json.JSONObject._processValue(JSONObject.java:2426)
at net.sf.json.JSONObject._setInternal(JSONObject.java:2447)
at net.sf.json.JSONObject.setValue(JSONObject.java:1189)
at net.sf.json.JSONObject._fromBean(JSONObject.java:725)
at net.sf.json.JSONObject.fromObject(JSONObject.java:182)
at net.sf.json.JSONObject.fromObject(JSONObject.java:145)
at Test.Test.main(Test.java:39)

I get this error when I run it. Any guidance would be appreciated.
catty

I am using:

  1. JDK 7
  2. json-lib: 2.4
  3. jsoup: 1.6.1

 

I just tried it: JDK 6 does produce the error you are seeing..... =_=
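
One way to avoid handing the engine's native object to json-lib at all is to let the script engine serialize it to a string first. This is only a sketch under assumptions: ScriptJsonDump and stringify are hypothetical names, and it relies on the engine exposing a JSON global, which older engines (possibly including the one in JDK 6) may not provide, so it is not guaranteed to help there.

```java
import java.util.Optional;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

/** Hypothetical helper: serializes a script variable inside the engine. */
class ScriptJsonDump {

    /**
     * Evaluates the script, then asks the engine itself for
     * JSON.stringify(varName). Returns empty when no JavaScript
     * engine is registered in the running JDK.
     */
    static Optional<String> stringify(String script, String varName) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        if (engine == null) {
            return Optional.empty();
        }
        try {
            engine.eval(script);
            // Assumption: the engine provides a JSON global.
            Object json = engine.eval("JSON.stringify(" + varName + ")");
            return json == null ? Optional.<String>empty() : Optional.of(json.toString());
        } catch (ScriptException e) {
            throw new IllegalStateException("Script evaluation failed", e);
        }
    }
}
```

If this returns a value, the string can then be passed to JSONObject.fromObject, which should parse it as JSON text instead of reflecting over a Java bean.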

 

catty

http://blog.csdn.net/muye4455/article/details/7586756

You can refer to this post and configure JsonConfig

qizi456258
Thanks, I'll give it a try in a bit
星星爷
It seems scripting support only started in 7
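
The javax.script API has been part of the JDK since Java 6, but which JavaScript engine is bundled, if any, differs between releases. A quick way to see what a particular JDK offers is to list the registered engines. This is a small sketch; ListScriptEngines and describeEngines are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

/** Hypothetical helper: reports the script engines the running JDK knows about. */
class ListScriptEngines {

    /** Returns one line per registered engine: name, version and lookup names. */
    static List<String> describeEngines() {
        List<String> lines = new ArrayList<String>();
        for (ScriptEngineFactory factory : new ScriptEngineManager().getEngineFactories()) {
            lines.add(factory.getEngineName() + " " + factory.getEngineVersion()
                    + " " + factory.getNames());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        for (String line : describeEngines()) {
            System.out.println(line);
        }
    }
}
```

An empty list means getEngineByName("javascript") will return null on that JDK, and an engine would have to be added to the classpath separately.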