http://www.oschina.net/code/snippet_12_834
See the link above: once you have the Element, just call its text() method to get the plain text.
Or use regular expressions:
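using System.Text.RegularExpressions;

// Place this method inside a class; the original post omits the enclosing class declaration.

/// <summary>
/// Strips HTML tags.
/// </summary>
/// <param name="HTML">Source HTML</param>
/// <returns>Result with tags removed</returns>
public static string StripHTML(string HTML) // found by googling "StripHTML"
{
    // Patterns are applied in order; Regexs[i] is replaced with Replaces[i].
    // Note: without RegexOptions.Singleline, '.' does not match a newline,
    // so a <script> block that spans several lines is not removed by the first pattern.
    string[] Regexs =
    {
        @"<script[^>]*?>.*?</script>",
        @"<(\/\s*)?!?((\w+:)?\w+)(\w+(\s*=?\s*(([""'])(\\[""'tbnr]|[^\7])*?\7|\w+)|.{0})|\s)*?(\/\s*)?>",
        @"([\r\n])[\s]+",
        @"&(quot|#34);",
        @"&(amp|#38);",
        @"&(lt|#60);",
        @"&(gt|#62);",
        @"&(nbsp|#160);",
        @"&(iexcl|#161);",
        @"&(cent|#162);",
        @"&(pound|#163);",
        @"&(copy|#169);",
        @"&#(\d+);",
        @"-->",
        @"<!--.*\n"
    };

    string[] Replaces =
    {
        "",
        "",
        "",
        "\"",
        "&",
        "<",
        ">",
        " ",
        "\xa1", // chr(161)
        "\xa2", // chr(162)
        "\xa3", // chr(163)
        "\xa9", // chr(169)
        "",
        "\r\n",
        ""
    };

    string s = HTML;
    for (int i = 0; i < Regexs.Length; i++)
    {
        s = new Regex(Regexs[i], RegexOptions.Multiline | RegexOptions.IgnoreCase).Replace(s, Replaces[i]);
    }

    // String.Replace returns a new string, so the result has to be assigned back
    // (the original discarded it, which made these three calls no-ops).
    s = s.Replace("<", "");
    s = s.Replace(">", "");
    s = s.Replace("\r\n", "");
    return s;
}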
It eats up all the HTML code!
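For comparison, here is a shorter sketch of the same regex idea. It is not from the original post: the class name HtmlText and the method ToPlainText are made up for illustration, and it leans on System.Net.WebUtility.HtmlDecode to decode entities instead of listing them one by one. It assumes reasonably well-formed markup; a '>' inside an attribute value will trip up the simple tag pattern, so treat it as a starting point rather than a full HTML parser.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original snippet.
public static class HtmlText
{
    public static string ToPlainText(string html)
    {
        // Remove <script> and <style> blocks together with their contents.
        // Singleline lets '.' match newlines, so multi-line blocks are covered.
        string s = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove HTML comments, which may also span several lines.
        s = Regex.Replace(s, @"<!--.*?-->", "", RegexOptions.Singleline);

        // Remove whatever tags are left (simple pattern: no '>' allowed inside a tag).
        s = Regex.Replace(s, @"<[^>]+>", "");

        // Decode entities such as &amp; and &lt; after the tags are gone,
        // so decoded '<' and '>' characters are kept as text.
        s = WebUtility.HtmlDecode(s);

        // Collapse runs of whitespace into single spaces and trim the ends.
        return Regex.Replace(s, @"\s+", " ").Trim();
    }
}
```

Calling HtmlText.ToPlainText("<p>Hello &amp; <b>world</b></p>") returns "Hello & world". Decoding entities last is a deliberate choice here: text like a&lt;b comes out as a<b instead of being mistaken for a tag.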