Regex extracts XML content 50x faster than dom4j...?

负心杏 posted on 2014/11/26 15:31

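JDK 6

The XML being parsed is the callback WeChat sends back after a mass send.

dom4j takes over 200 ms, while regex stays in single digits, around 4 ms: a 50x gap. What are those of us who habitually reach for a library supposed to make of that...

The code below runs as-is if you have the required packages (dom4j and a logger) on the classpath.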


Code:

			long t1 = System.nanoTime();
			String str = "<xml><ToUserName><![CDATA[gh_520f99dff7cc]]></ToUserName><FromUserName><![CDATA[oBAMOs3aZB0dkbILsBR1wksbmli4]]></FromUserName><CreateTime>1416900555</CreateTime><MsgType><![CDATA[event]]></MsgType><Event><![CDATA[MASSSENDJOBFINISH]]></Event><MsgID>2348714844</MsgID><Status><![CDATA[send success]]></Status><TotalCount>1</TotalCount><FilterCount>1</FilterCount><SentCount>1</SentCount><ErrorCount>0</ErrorCount></xml>";
//			Document doc = null;
//			try {
//				doc = DocumentHelper.parseText(str);
//			} catch (DocumentException e) {
//				log.error("Failed to parse mass-send XML: "+e.getMessage(), e);
//			}
//			
//			Element root = doc.getRootElement();
//			String msgid = root.elementTextTrim("MsgID");
//			String Status = root.elementTextTrim("Status");
//			String TotalCount = root.elementTextTrim("TotalCount");
//			String FilterCount = root.elementTextTrim("FilterCount");
//			String SentCount = root.elementTextTrim("SentCount");
//			String ErrorCount = root.elementTextTrim("ErrorCount");
			String msgid = RegExp.getString(str,
					"(?<=<MsgID>)[\\s\\S]*?(?=</MsgID>)").trim();
			String Status = RegExp.getString(str,
				"(?<=<Status><!\\[CDATA\\[)[\\s\\S]*?(?=\\]\\]></Status>)")
				.trim();
			String TotalCount = RegExp.getString(str,
				"(?<=<TotalCount>)[\\s\\S]*?(?=</TotalCount>)")
				.trim();
			String FilterCount = RegExp.getString(str,
				"(?<=<FilterCount>)[\\s\\S]*?(?=</FilterCount>)")
				.trim();
			String SentCount = RegExp.getString(str,
				"(?<=<SentCount>)[\\s\\S]*?(?=</SentCount>)")
				.trim();
			String ErrorCount = RegExp.getString(str,
				"(?<=<ErrorCount>)[\\s\\S]*?(?=</ErrorCount>)")
				.trim();
			long t2 = System.nanoTime();
			log.info(t2-t1);
			log.info((t2-t1)*0.000001);
			log.info(msgid+", "+Status+", "+TotalCount+", "+FilterCount+", "+SentCount+", "+ErrorCount);



dom4j result:

2014-11-26 15:25:29,716 INFO [Test] 70 - <220279310>
2014-11-26 15:25:29,719 INFO [Test] 71 - <220.27930999999998> <== look here
2014-11-26 15:25:29,719 INFO [Test] 72 - <2348714844, send success, 1, 1, 1, 0>

Regex result:

2014-11-26 15:28:08,575 INFO [Test] 70 - <4633684>
2014-11-26 15:28:08,578 INFO [Test] 71 - <4.633684> <== look here
2014-11-26 15:28:08,578 INFO [Test] 72 - <2348714844</MsgID>, <![CDATA[send success]]></Status>, 1</TotalCount>, 1</FilterCount>, 1</SentCount>, 0</ErrorCount>>
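
Each figure above comes from a single cold call, so it may be dominated by one-time costs such as class loading and parser setup rather than by the parsing itself. Below is a minimal sketch that repeats both approaches in a loop so the first round can be compared with later ones. Note the assumptions: it uses the JDK's built-in DOM parser as a stand-in for dom4j (so the numbers are not comparable with the ones in this post), and the class and method names (WarmupTiming, viaRegex, viaDom) are made up for illustration.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WarmupTiming {
    // Shortened version of the WeChat payload from the post.
    static final String XML =
        "<xml><MsgID>2348714844</MsgID><Status><![CDATA[send success]]></Status></xml>";

    // Compiled once, so repeated calls do not pay for Pattern.compile again.
    static final Pattern MSG_ID = Pattern.compile("<MsgID>(.*?)</MsgID>");

    static String viaRegex(String xml) {
        Matcher m = MSG_ID.matcher(xml);
        return m.find() ? m.group(1).trim() : "";
    }

    // Stand-in for the dom4j path: builds a full DOM tree, then reads one node.
    static String viaDom(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("MsgID").item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Round 1 is the cold call; later rounds show the steady-state cost.
        for (int round = 1; round <= 5; round++) {
            long t1 = System.nanoTime();
            String fromDom = viaDom(XML);
            long t2 = System.nanoTime();
            String fromRegex = viaRegex(XML);
            long t3 = System.nanoTime();
            System.out.println("round " + round
                + ": dom=" + (t2 - t1) * 0.000001 + " ms"
                + ", regex=" + (t3 - t2) * 0.000001 + " ms"
                + " (" + fromDom + ", " + fromRegex + ")");
        }
    }
}
```

If the first round is much slower than the rest, the gap in a one-shot test says more about startup than about the two techniques.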


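Regex helper code:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegExp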
{
  public static ArrayList<String> getStrs(String source, String regex)
  {
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(source);
    ArrayList<String> list = new ArrayList<String>();
    while (m.find()) {
      list.add(source.substring(m.start(), m.end()));
    }
    return list;
  }
  
  public static String getString(String source, String regex)
  {
    ArrayList<String> list = getStrs(source, regex);
    if (list.size() > 0) {
      return list.get(0);
    }
    return "";
  }
  
  public static ArrayList<String> getStrs(String source, String beginStr, String endStr, boolean isLong)
  {
    if (isLong) {
      return getStrs(source, "(?<=" + replay(beginStr) + ")[\\s\\S]*(?=" + replay(endStr) + ")");
    }
    return getStrs(source, "(?<=" + replay(beginStr) + ")[\\s\\S]*?(?=" + replay(endStr) + ")");
  }
  
  public static String getString(String source, String beginStr, String endStr, boolean isLong)
  {
    if (isLong) {
      return getString(source, "(?<=" + replay(beginStr) + ")[\\s\\S]*(?=" + replay(endStr) + ")");
    }
    return getString(source, "(?<=" + replay(beginStr) + ")[\\s\\S]*?(?=" + replay(endStr) + ")");
  }
  
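  private static String replay(String source)
  {
    // Escape backslashes first so the escapes added below are not doubled.
    String result = source.replace("\\", "\\\\");
    // Continue from result; restarting from source would discard the step above.
    result = result.replace(".", "\\.");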
    result = result.replace("(", "\\(");
    result = result.replace(")", "\\)");
    result = result.replace("[", "\\[");
    result = result.replace("]", "\\]");
    result = result.replace("{", "\\{");
    result = result.replace("}", "\\}");
    result = result.replace("$", "\\$");
    result = result.replace("?", "\\?");
    result = result.replace("&", "\\&");
    result = result.replace("*", "\\*");
    result = result.replace("!", "\\!");
    result = result.replace("^", "\\^");
    result = result.replace("+", "\\+");
    result = result.replace("#", "\\#");
    return result;
  }
}
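
The replay method above escapes regex metacharacters by hand, and a hand-written list is easy to leave incomplete (for instance, it has no entry for "|"). A minimal sketch of an alternative, assuming the delimiters are always meant literally: Pattern.quote does the escaping, and a capturing group replaces the lookbehind/lookahead pair. The class and method names (QuotedExtract, between) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedExtract {
    /** Returns the shortest text between begin and end, or "" if there is no match. */
    static String between(String source, String begin, String end) {
        // Pattern.quote makes both delimiters literal, so no manual escaping is needed.
        Pattern p = Pattern.compile(
            Pattern.quote(begin) + "([\\s\\S]*?)" + Pattern.quote(end));
        Matcher m = p.matcher(source);
        // Group 1 holds only the text between the delimiters.
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String xml = "<xml><Status><![CDATA[send success]]></Status>"
            + "<MsgID>2348714844</MsgID></xml>";
        System.out.println(between(xml, "<Status><![CDATA[", "]]></Status>")); // prints "send success"
        System.out.println(between(xml, "<MsgID>", "</MsgID>"));               // prints "2348714844"
    }
}
```

This still compiles a pattern on every call; for a fixed set of tags, the compiled Pattern objects could be kept in static fields.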



负心杏

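OK, the conclusion is settled:

Use SAXReader to read the POST stream, build the XML document, and read the values from it. Time SAXReader needs to read and build: the first call is fairly long, tens of milliseconds; later calls take a little over 2 ms. Reading the values through dom4j takes only 0.1 ms.

Regex approach: read the stream into a string and extract the values with regex, about 0.8 ms.

Summary: dom4j spends its time loading resources, but once loaded it reads extremely fast and the code is simple. Regex is more complicated to write but takes less time overall.

Memory usage was not tested.

Part of the code is attached: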

InputStream in = request.getInputStream();
			SAXReader sax = new SAXReader();
			Document doc = sax.read(in);
			long t1 = System.nanoTime();
			Element root = doc.getRootElement();
			String MsgType = root.elementTextTrim("MsgType");
			String content = root.elementTextTrim("Content");
			String MsgId = root.elementTextTrim("MsgId");
			String from = root.elementTextTrim("from");

			
//			byte[] rebyte = new byte[1024];
//			int len = 0;
//			int temp = 0;
//			int s = 1024;
//			while ((temp = in.read()) != -1) {
//				rebyte[len] = (byte) temp;
//				len++;
//				if (len == s) {
//					s *= 2;
//					byte[] nbyte = new byte[s];
//					System.arraycopy(rebyte, 0, nbyte, 0, rebyte.length);
//					rebyte = nbyte;
//				}
//			}
//			in.close();
//			xml = new String(rebyte, 0, len, "UTF-8");
//			String MsgType = RegExp
//					.getString(xml,
//							"(?<=<MsgType><!\\[CDATA\\[)[\\s\\S]*?(?=\\]\\]></MsgType>)")
//					.trim();
//			
//			String content = RegExp.getString(xml,
//			"(?<=<Content><!\\[CDATA\\[)[\\s\\S]*?(?=\\]\\]></Content>)")
//			.trim();
//			String MsgId = RegExp.getString(xml,
//			"(?<=<MsgId>)[\\s\\S]*?(?=</MsgId>)").trim();
//			String from = RegExp
//			.getString(xml,
//					"(?<=<FromUserName><!\\[CDATA\\[)[\\s\\S]*?(?=\\]\\]></FromUserName>)")
//			.trim();
			long t2 = System.nanoTime();
			log.info(t2-t1);
			log.info((t2-t1)*0.000001);
			log.info(MsgType+", "+content+","+MsgId+","+from);
			if(true){
				return null;
			}
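
The commented-out loop above reads the stream one byte at a time and grows the array by hand. A minimal sketch of the same step using a chunked read into a ByteArrayOutputStream, which does the growing itself; the class and method names (StreamText, readUtf8) are hypothetical, and UTF-8 is assumed as in the original loop.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamText {
    /** Reads the whole stream, closes it, and decodes the bytes as UTF-8. */
    static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        try {
            // read(byte[]) returns the number of bytes read, or -1 at end of stream.
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } finally {
            in.close();
        }
        return out.toString("UTF-8");
    }
}
```

In the servlet code above this would be called as String xml = StreamText.readUtf8(request.getInputStream()); before the regex extraction.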




牧沐

Nice. Regex does look more efficient, though it takes more effort to write.

dom4j needs less thought.

Each has its strengths.

负心杏
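This only proves that regex is much faster than dom4j when there is little data (few nodes). Other cases were not tested, for example many nodes or large data blocks...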
朱宏青

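Regex does its parsing as it executes.

You could run the comparison against SAX instead.

Also, I think long t1 = System.nanoTime(); should be placed right before/after Element root = doc.getRootElement();.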

负心杏
Speechless. I changed the approach and dom4j takes only 4 ms...
long t1 = System.nanoTime();
byte[] bs = str.getBytes();
SAXReader sax = new SAXReader();
doc = sax.read(new ByteArrayInputStream(bs));
负心杏
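Thanks. I tried it: of the 200+ ms, 99.9% is spent parsing the string into a Document object. Actually reading the values takes under 1 ms, tens of times faster than the regex extraction... But the XML string is dictated by the business side: WeChat sends it as a string. Oh wait, it actually arrives as a POST stream too. I'll try timing how long it takes to build the XML directly from the stream.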
一个角
Try it with a file of a few hundred MB.
公孙二狗

Quoting the comment from “一个角”:

Try it with a file of a few hundred MB.
In that case both regex and dom4j would simply fall over.
color丶苏色
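StAX should be the fastest; it outperforms all of these. Regex overhead is actually quite high, though I haven't benchmarked it... I have parsed strings of several hundred MB that came in from an interface...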
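
For anyone who wants to try the StAX route mentioned here, a minimal sketch using the javax.xml.stream API that ships with the JDK. It collects the text of each direct child of the root element without building a document tree. The class and method names (StaxFields, childTexts) are hypothetical, and it assumes every child holds only text or CDATA, as in the WeChat payload in this thread; it was not benchmarked against the other approaches.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxFields {
    /** Maps each direct child element name of the root to its trimmed text. */
    static Map<String, String> childTexts(String xml) {
        Map<String, String> fields = new HashMap<String, String>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            int depth = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        String name = reader.getLocalName();
                        // getElementText reads text and CDATA up to the matching end tag.
                        fields.put(name, reader.getElementText().trim());
                        // The reader now sits on that END_ELEMENT, so undo the depth bump here.
                        depth--;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        String xml = "<xml><MsgID>2348714844</MsgID>"
            + "<Status><![CDATA[send success]]></Status></xml>";
        Map<String, String> fields = childTexts(xml);
        System.out.println(fields.get("MsgID") + ", " + fields.get("Status"));
    }
}
```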