How do I convert HTML to plain text (txt)?

北柯一梦 posted on 2010/09/13 16:29

Let me explain with an example.

I want to convert 1.html into 1.txt.

The content of 1.html is as follows:

------------------------------------------------------------------------------------------------------

<html>
<head>天灾人祸</head>
<body>
<a href="www.baidu.com">百度</a>   
<a href="www.google.com">谷歌</a>   
<DIV><FONT color=#c0c0c0 size=2 face=Verdana><SPAN>张宇</SPAN> </FONT></DIV>
<HR color=#b5c4df SIZE=1>
</body>
</html>

------------------------------------------------------------------------------------------------------

 

The resulting 1.txt should look like this:

------------------------------------------------------------------------------------------------------

天灾人祸 百度 谷歌

张宇

------------------------------------------------------------------------------------------------------

 

 

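All the tags in the HTML that carry no real meaning should be removed, and the meaningful text just needs to be put on separate lines.

I don't know how to implement this and would appreciate a few pointers. Many thanks.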
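The transformation asked for above can be sketched in plain JavaScript. This is only a rough regex-based sketch, not a real HTML parser: `htmlToText` and `BLOCK_TAGS` are hypothetical names made up for this illustration, and the choice of which tags count as line breaks is an assumption. It does not decode entities such as `&nbsp;`, does not skip `<script>` or `<style>` content, and will be confused by a `>` inside an attribute value.

```javascript
// Hypothetical helper, not part of any library mentioned in this thread.
// Tags treated as line breaks; this list is an assumption and can be extended.
const BLOCK_TAGS = ['div', 'p', 'br', 'hr', 'li', 'tr', 'h[1-6]'];

function htmlToText(html) {
  const blockTag = new RegExp('<\\/?(?:' + BLOCK_TAGS.join('|') + ')\\b[^>]*>', 'gi');
  return html
    .replace(/\s+/g, ' ')        // line breaks in the HTML source carry no meaning
    .replace(blockTag, '\n')     // block-level tags become line breaks
    .replace(/<[^>]*>/g, '')     // drop every remaining tag
    .split('\n')
    .map(line => line.replace(/\s+/g, ' ').trim())
    .filter(line => line.length > 0)   // discard lines that ended up empty
    .join('\n');
}

// The 1.html sample from the question.
const sample = [
  '<html>',
  '<head>天灾人祸</head>',
  '<body>',
  '<a href="www.baidu.com">百度</a>   ',
  '<a href="www.google.com">谷歌</a>   ',
  '<DIV><FONT color=#c0c0c0 size=2 face=Verdana><SPAN>张宇</SPAN> </FONT></DIV>',
  '<HR color=#b5c4df SIZE=1>',
  '</body>',
  '</html>'
].join('\n');

console.log(htmlToText(sample));
```

For the sample this prints the two lines shown in the expected 1.txt. For anything beyond simple markup, a proper parser is likely the safer route.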

jing31

jsoup is the obvious choice.


北柯一梦

Quoting post #2 by “红薯”

Use jsoup. See here: http://www.oschina.net/bbs/thread/10227

Impressive! Thank you very much.

曾建凯

MooTools: the String extensions from more-1.2.4.2.js:

(function(){
  
var special = ['À','à','Á','á','Â','â','Ã','ã','Ä','ä','Å','å','Ă','ă','Ą','ą','Ć','ć','Č','č','Ç','ç', 'Ď','ď','Đ','đ', 'È','è','É','é','Ê','ê','Ë','ë','Ě','ě','Ę','ę', 'Ğ','ğ','Ì','ì','Í','í','Î','î','Ï','ï', 'Ĺ','ĺ','Ľ','ľ','Ł','ł', 'Ñ','ñ','Ň','ň','Ń','ń','Ò','ò','Ó','ó','Ô','ô','Õ','õ','Ö','ö','Ø','ø','ő','Ř','ř','Ŕ','ŕ','Š','š','Ş','ş','Ś','ś', 'Ť','ť','Ť','ť','Ţ','ţ','Ù','ù','Ú','ú','Û','û','Ü','ü','Ů','ů', 'Ÿ','ÿ','ý','Ý','Ž','ž','Ź','ź','Ż','ż', 'Þ','þ','Ð','ð','ß','Œ','œ','Æ','æ','µ'];

var standard = ['A','a','A','a','A','a','A','a','Ae','ae','A','a','A','a','A','a','C','c','C','c','C','c','D','d','D','d', 'E','e','E','e','E','e','E','e','E','e','E','e','G','g','I','i','I','i','I','i','I','i','L','l','L','l','L','l', 'N','n','N','n','N','n', 'O','o','O','o','O','o','O','o','Oe','oe','O','o','o', 'R','r','R','r', 'S','s','S','s','S','s','T','t','T','t','T','t', 'U','u','U','u','U','u','Ue','ue','U','u','Y','y','Y','y','Z','z','Z','z','Z','z','TH','th','DH','dh','ss','OE','oe','AE','ae','u'];

var tidymap = {
	"[\xa0\u2002\u2003\u2009]": " ",
	"\xb7": "*",
	"[\u2018\u2019]": "'",
	"[\u201c\u201d]": '"',
	"\u2026": "...",
	"\u2013": "-",
	"\u2014": "--",
	"\uFFFD": "&raquo;"
};

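// Build a regex for the given tag name (an empty name matches any tag).
// With contents set, the regex spans the whole element, from opening tag to closing tag.
var getRegForTag = function(tag, contents) {
	tag = tag || '';
	var regstr = contents ? "<" + tag + "[^>]*>([\\s\\S]*?)<\/" + tag + ">" : "<\/?" + tag + "([^>]+)?>";
	var reg = new RegExp(regstr, "gi"); // declared with var so it does not leak as a global
	return reg;
};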

String.implement({

	standardize: function(){
		var text = this;
		special.each(function(ch, i){
			text = text.replace(new RegExp(ch, 'g'), standard[i]);
		});
		return text;
	},

	repeat: function(times){
		return new Array(times + 1).join(this);
	},

	pad: function(length, str, dir){
		if (this.length >= length) return this;
		var pad = (str == null ? ' ' : '' + str).repeat(length - this.length).substr(0, length - this.length);
		if (!dir || dir == 'right') return this + pad;
		if (dir == 'left') return pad + this;
		return pad.substr(0, (pad.length / 2).floor()) + this + pad.substr(0, (pad.length / 2).ceil());
	},

	getTags: function(tag, contents){
		return this.match(getRegForTag(tag, contents)) || [];
	},

	stripTags: function(tag, contents){
		return this.replace(getRegForTag(tag, contents), '');
	},

	tidy: function(){
		var txt = this.toString();
		$each(tidymap, function(value, key){
			txt = txt.replace(new RegExp(key, 'g'), value);
		});
		return txt;
	}

});

})();

You only need to look at stripTags and getRegForTag.
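For anyone who wants to try those two functions without loading MooTools, here is a rough standalone sketch. It assumes the same regex construction as the code above but turns `stripTags` into a plain function that takes the text as its first argument; that signature is an adaptation made for this illustration, not the MooTools API.

```javascript
// Standalone adaptation of getRegForTag; an empty tag name matches any tag.
function getRegForTag(tag, contents) {
  tag = tag || '';
  const source = contents
    ? '<' + tag + '[^>]*>([\\s\\S]*?)</' + tag + '>'   // whole element, content included
    : '</?' + tag + '([^>]+)?>';                        // opening and closing tags only
  return new RegExp(source, 'gi');
}

// Plain-function version of stripTags (hypothetical signature: text comes first).
function stripTags(text, tag, contents) {
  return text.replace(getRegForTag(tag, contents), '');
}

const html = '<b>hi</b> <i>there</i>';

console.log(stripTags(html));            // removes every tag, keeps the text
console.log(stripTags(html, 'b'));       // removes only the <b> and </b> tags
console.log(stripTags(html, 'b', true)); // removes the <b> element together with its content
```

One thing to watch with this pattern: the tag name is matched as a prefix, so passing 'b' also strips tags such as `<body>`.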

And while I'm at it, I continue to look down on jQuery.

G.

Why look down on jQuery? I think it's pretty good.

曾建凯

The code in 红薯's post takes only two lines:

// Requires the jsoup library on the classpath (import org.jsoup.Jsoup;)
String html = "你好,我是来自<a href='http://www.oschina.net/' target='_blank'>开源中国社区</a>的红薯。";
System.out.println(Jsoup.parse(html).text());

Here is an even shorter one in JavaScript, also based on MooTools:

$$('html').get('text');

曾建凯

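Quoting post #6 by “linxiuxiu”

Why look down on jQuery? I think it's pretty good.

I can't help it. It's like how I fell in love with my wife at first sight, but I will never take a liking to jQuery.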

shijacky

Wouldn't a single regex that deletes every <.*> do the job?
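One caveat with that exact pattern, shown here as a small sketch: `.*` is greedy, so on a line containing several tags it matches from the first `<` to the last `>` and removes the text in between as well. A negated character class such as `<[^>]*>` (or the lazy form `<.*?>`) stops at the end of each tag instead. The input string is taken from the 1.html sample; the variable names are just for illustration.

```javascript
const html = '<a href="www.baidu.com">百度</a> <a href="www.google.com">谷歌</a>';

// Greedy: .* runs from the first "<" to the last ">" on the line,
// so the text between the tags is deleted too.
const greedy = html.replace(/<.*>/g, '');

// Negated character class: each match ends at the first ">".
const perTag = html.replace(/<[^>]*>/g, '');

console.log(JSON.stringify(greedy)); // ""
console.log(JSON.stringify(perTag)); // "百度 谷歌"
```

Even the per-tag version only strips markup; it does not add the line breaks the original question asks for.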
