Beginner web-scraping question: help needed, thanks!

A simple example of fetching a web page in Java:

Source code:

import java.io.*;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.PostMethod;

public class RetrivePage
{
 private static HttpClient httpClient = new HttpClient();
 // Configure the proxy server
 static
 {
  // Set the proxy server's IP address and port
  httpClient.getHostConfiguration().setProxy("10.110.0.52", 8080);
 }

 public static boolean downloadPage(String path) throws HttpException,
   IOException
 {
  InputStream input = null;
  OutputStream output = null;
  // Create the POST method
  PostMethod postMethod = new PostMethod(path);
  // Set the POST parameters
  NameValuePair[] postData = new NameValuePair[2];
  postData[0] = new NameValuePair("name", "baidu");
  postData[1] = new NameValuePair("password", "*****");
  postMethod.addParameters(postData);
  // Execute the request; the return value is the HTTP status code
  int statusCode = httpClient.executeMethod(postMethod);
  // Handle the status code (for simplicity, only 200 is handled)
  if (statusCode == HttpStatus.SC_OK)
  {
   input = postMethod.getResponseBodyAsStream();
   // Derive the file name from the URL
   String filename = path.substring(path.lastIndexOf('/') +1);
   // Open the output file stream
   output = new FileOutputStream(filename);
   // Copy the response body to the file
   int tempByte = -1;
   // read() returns -1 at end of stream; a 0 byte is valid data, so test for -1
   while ((tempByte = input.read()) != -1)
   {
    output.write(tempByte);
   }
   // Close the input and output streams
   if (input != null)
   {
    input.close();
   }
   if (output != null)
   {
    output.close();
   }
   return true;
  }
  return false;
 }

 /**
  * Test code
  */
 public static void main(String[] args)
 {
  // Fetch the Baidu home page and write it to a file
  try
  {
   RetrivePage.downloadPage("http://www.baidu.com/");
  }
  catch (HttpException e)
  {
   // TODO Auto-generated catch block
   e.printStackTrace();
  }
  catch (IOException e)
  {
   // TODO Auto-generated catch block
   e.printStackTrace();
  }
 }
}

The program throws an exception at run time. The stack trace is:

java.io.FileNotFoundException: 
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.<init>(Unknown Source)
 at java.io.FileOutputStream.<init>(Unknown Source)
 at RetrivePage.downloadPage(RetrivePage.java:39)
 at RetrivePage.main(RetrivePage.java:68)
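
A likely cause, judging from the trace and the URL passed in main: "http://www.baidu.com/" ends with '/', so path.substring(path.lastIndexOf('/') + 1) yields an empty string, and opening a FileOutputStream on an empty file name fails with FileNotFoundException. The sketch below shows one possible fallback. The class FileNameSketch, the helper fileNameFor, the default name "index.html", and the second URL are all hypothetical and not part of the original code.

```java
public class FileNameSketch {

    // Hypothetical helper: take the text after the last '/' as the file name,
    // and fall back to a default name when that text is empty.
    static String fileNameFor(String url) {
        String name = url.substring(url.lastIndexOf('/') + 1);
        // "index.html" is an assumed default, chosen only for illustration
        return name.isEmpty() ? "index.html" : name;
    }

    public static void main(String[] args) {
        // URL from the question: nothing follows the last '/'
        System.out.println(fileNameFor("http://www.baidu.com/"));
        // Hypothetical URL that does end in a file name
        System.out.println(fileNameFor("http://www.baidu.com/img/logo.gif"));
    }
}
```

Under these assumptions, downloadPage could call fileNameFor(path) where it currently computes filename, so that a URL ending in '/' still maps to a usable file name.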

赵长勇
Posted 6 years ago, 0 replies / 467 views