Jsoup is an HTML parser for Java. It can parse a page fetched directly from a URL, or HTML supplied as a text string.
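As a minimal sketch of the HTML-text case, the snippet below parses an inline fragment with Jsoup.parse and reads the links out with a CSS selector. The markup, the class name ParseHtmlTextExample and the helper linkMap are invented for illustration, and the snippet assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

class ParseHtmlTextExample {
    // Hypothetical markup, made up for this example only.
    static final String HTML =
            "<ul id=\"demo\">"
            + "<li><a href=\"/a.html\">First</a></li>"
            + "<li><a href=\"/b.html\">Second</a></li>"
            + "</ul>";

    // Parses the HTML text and maps each link's href to its visible text.
    static Map<String, String> linkMap(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> links = new LinkedHashMap<String, String>();
        for (Element a : doc.select("#demo li a")) {
            links.put(a.attr("href"), a.text());
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(linkMap(HTML));
    }
}
```

The same select/attr/text calls work on a Document obtained from a URL, so the selector logic can be tried out on a saved HTML string before pointing it at a live page.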
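The scenario is as follows:
1. Fetch the book categories from JD (京东).
2. Save them in a map, with the category id as the key and the category name as the value.
The code is as follows: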
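// Imports needed: java.util.HashMap, java.util.Map, org.jsoup.Connection, org.jsoup.Jsoup,
// org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
private static Map<String, String> getWareCategory() {
    Connection conn = Jsoup.connect(JDConstants.CATEGORY_URL_FORMAT)
            .userAgent(JDConstants.MOZILLA_AGENT)
            .timeout(JDConstants.TIME_OUT);
    Map<String, String> categoryMap = new HashMap<String, String>();
    Document document = null;
    try {
        Connection.Response response = conn.execute();
        if (response.statusCode() != JDConstants.HTTP_OK_CODE) {
            return categoryMap;
        }
        // Parse the response that was already fetched; calling conn.get() here
        // would send a second request for the same page.
        document = response.parse();
        // Locate the category list: div.left -> #booksort -> first "div.mc ul".
        Element bookSort = document.select("div.left").select("#booksort").first();
        Element list = (bookSort == null) ? null : bookSort.select("div.mc ul").first();
        if (list == null) {
            LOG.error("getCategory: category list not found in page");
            return categoryMap;
        }
        for (Element item : list.select("li")) {
            Elements link = item.select("a");
            String url = link.attr("href");
            String name = link.text();
            // The href is expected to split into three parts on "-";
            // the middle part is taken as the category id.
            String[] parts = StringUtils.isNotEmpty(url) ? url.split("-") : new String[0];
            if (parts.length == 3) {
                // Entries without a recognizable id are skipped rather than stored under "".
                categoryMap.put(parts[1], name);
            }
        }
    } catch (Exception e) {
        LOG.error("getCategory response:" + document);
        LOG.error("getCategory error:" + e.getMessage(), e);
    }
    LOG.info("***********categoryMap:" + categoryMap);
    return categoryMap;
}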
The constants used above are defined as follows:
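public abstract class JDConstants {
    // Timeout passed to Connection.timeout(), in milliseconds (30 minutes).
    public static final int TIME_OUT = 1000 * 60 * 30;
    // User-Agent header sent with the request.
    public static final String MOZILLA_AGENT = "Mozilla";
    public static final int HTTP_OK_CODE = 200;
    // Book category page to scrape.
    public static final String CATEGORY_URL_FORMAT = "http://www.360buy.com/products/1713-3269-000.html";
}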
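The category id extraction in getWareCategory depends only on the shape of the link, so it can be checked in isolation without any network access. The following is a small sketch of that step using only the standard library. The class name CategoryIdExample and the method extractCategoryId are hypothetical names introduced here, and the sample input is simply the value of CATEGORY_URL_FORMAT.

```java
class CategoryIdExample {
    // Returns the middle segment of a URL that splits into exactly three
    // parts on "-", or "" when the URL is empty or has a different shape.
    static String extractCategoryId(String url) {
        if (url == null || url.isEmpty()) {
            return "";
        }
        String[] parts = url.split("-");
        return parts.length == 3 ? parts[1] : "";
    }

    public static void main(String[] args) {
        // Sample input taken from JDConstants.CATEGORY_URL_FORMAT.
        String url = "http://www.360buy.com/products/1713-3269-000.html";
        System.out.println(extractCategoryId(url)); // prints 3269
    }
}
```

Because the split is applied to the whole URL, a hyphen anywhere else in it (for example in the host name) changes the number of parts and the method returns an empty string. Splitting only the last path segment would be more robust if that case matters.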
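Assessment:
Jsoup is very convenient to use: a few chained selector calls are enough to extract the category list.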
