Garbled text when reading a web page's source

cooc123 posted on 2011/10/27 10:13
String url = "http://roll.sohu.com/20111026/n323511012.shtml";

String str = getHttp(url);
System.out.println(str);

    public String getHttp(String url) {
        try {
            URL u = new URL(url);
            HttpURLConnection http = (HttpURLConnection) u.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(http.getInputStream(), "gbk"));
            StringBuilder sb = new StringBuilder();
            String line = "";
            while ((line = in.readLine()) != null) {
                sb.append(line).append("\n");
            }
            in.close();
            http.disconnect();
            return sb.toString();
        } catch (Exception ex) {
            Logger.getLogger(Http.class.getName()).log(Level.SEVERE, null, ex);
            return null;
        }
    }

http://roll.sohu.com/20111026/n323511012.shtml 

The page is clearly GBK, so why does the text I read come out garbled?


TrulyBelieve
Is the response gzip-compressed?
Yisen

Try utf-8.

cooc123

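Quoting the answer from "TrulyBelieve":

Is the response gzip-compressed?

How can I check? And if it is gzip-compressed, how should I read it?

To the poster of reply #3: UTF-8 doesn't work, the output is even more garbled.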

鉴客
Try reading it with httpclient.
cooc123
            HttpURLConnection http = (HttpURLConnection) u.openConnection();
            if (http.getHeaderField("Content-Encoding") != null) {
                String html = this.getGZipString(http.getInputStream(),charset);
                http.disconnect();
                return html;
            }

After reading it through gzip, the Chinese text no longer displays correctly.

            gzin = new GZIPInputStream(fin);
            byte[] buf = new byte[1024]; // buffer size
            int num;
            StringBuilder sb = new StringBuilder();
            while ((num = gzin.read(buf, 0, buf.length)) != -1) { // until the stream is exhausted
                String line = new String(buf, 0, num);
                //line = new String(line.getBytes("Iso-8859-1"),charset);
                // With the line above enabled, all the Chinese becomes question marks; without it, strange characters
                sb.append(line);
            }
            gzin.close(); // close the decompressing input stream
            return sb.toString();

firstrose
It is indeed gzip-compressed.
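
One way to confirm this from code, offered as a rough sketch and not a definitive check: peek at the first two bytes of the response body and compare them with the gzip magic number 0x1f 0x8b. The class name `GzipSniff` and the helpers `maybeGunzip` and `toBytes` are made up for illustration, and the demo uses in-memory bytes instead of the sohu page. With `HttpURLConnection`, comparing `http.getContentEncoding()` against "gzip" is probably the more usual route.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSniff {

    // Hypothetical helper: returns a decompressing stream if the body starts
    // with the gzip magic bytes 0x1f 0x8b, otherwise the original bytes.
    public static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(raw, 2);
        int b1 = pb.read();
        int b2 = (b1 == -1) ? -1 : pb.read();
        // Push the peeked bytes back in reverse order so they are re-read in order.
        if (b2 != -1) pb.unread(b2);
        if (b1 != -1) pb.unread(b1);
        boolean gzip = (b1 == 0x1f && b2 == 0x8b);
        return gzip ? new GZIPInputStream(pb) : pb;
    }

    // Hypothetical helper: drains a stream into a byte array.
    public static byte[] toBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "<html>plain</html>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        }
        // Both a compressed and an uncompressed body should yield the same bytes.
        byte[] a = toBytes(maybeGunzip(new ByteArrayInputStream(bos.toByteArray())));
        byte[] b = toBytes(maybeGunzip(new ByteArrayInputStream(plain)));
        System.out.println(new String(a, StandardCharsets.US_ASCII));
        System.out.println(new String(b, StandardCharsets.US_ASCII));
    }
}
```

The magic-byte test is only a heuristic, but an HTML page is unlikely to begin with those two bytes, so it tends to work when the header is missing or unreliable.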
cooc123

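Quoting the answer from "firstrose":

It is indeed gzip-compressed.
OK, so how should I decode it? Right now the Chinese text I read through gzip comes out as strange characters.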
TrulyBelieve
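while ((num = gzin.read(buf, 0, buf.length)) != -1) {
    // This line looks like the problem: what gzip yields is GBK-encoded bytes,
    // and they are never decoded as GBK when they are turned into a String.
    String line = new String(buf, 0, num);
    sb.append(line);
}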
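
Following that diagnosis, here is a minimal sketch of one possible fix: wrap the `GZIPInputStream` in an `InputStreamReader` with the page's charset, so decoding happens once over the whole decompressed stream. The original loop decodes each 1024-byte chunk with the platform default charset, and a chunk boundary can also fall in the middle of a two-byte GBK character. The class name `GzipGbkDemo` and its helper methods are hypothetical, and the demo compresses a short string in memory instead of fetching the sohu page.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipGbkDemo {

    // Decompress first, then let a Reader decode the byte stream with the
    // given charset, so multi-byte characters are never split across chunks.
    public static String readGzip(InputStream raw, Charset charset) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(raw), charset))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    // Demo-only helper: gzip-compresses a byte array in memory.
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Charset gbk = Charset.forName("GBK");
        // "\u4e2d\u6587" is the two-character word for "Chinese", escaped so
        // the source file's own encoding does not matter.
        String original = "\u4e2d\u6587 page";
        byte[] compressed = gzip(original.getBytes(gbk));
        String decoded = readGzip(new ByteArrayInputStream(compressed), gbk);
        System.out.println(decoded.equals(original));
    }
}
```

In the code from the thread, the same idea would presumably mean passing `http.getInputStream()` as `raw` and `Charset.forName("GBK")` as `charset`, but only when the response really is gzip-compressed.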