Parsing HTML documents in .NET with the HtmlAgilityPack library

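HtmlAgilityPack is a small, free, open-source third-party library for .NET. It is mainly used to parse HTML documents on the server side (in a B/S application, the client side can parse HTML with JavaScript or jQuery). Download:

http://htmlagilitypack.codeplex.com/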

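Preparation:

If you have NuGet installed, you can simply search for the package and install it from there.

If you download it manually, the unzipped archive contains 3 files. You only need to add two of them to your solution: HtmlAgilityPack.dll (the assembly) and HtmlAgilityPack.xml (the documentation file, which Visual Studio 2008 uses for IntelliSense and help text). Nothing has to be installed, which makes it very convenient.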

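Add using HtmlAgilityPack; at the top of a C# file and you can use the types in that namespace. In practice almost everything revolves around the HtmlDocument class, which closely resembles the XmlDocument class in Microsoft's .NET Framework. XmlDocument works on XML documents, while HtmlDocument works on HTML documents (it can in fact handle XML as well), and both are DOM-based. The differences are that HtmlDocument has no GetElementsByTagName method, and that its ID lookup (spelled GetElementbyId in HtmlAgilityPack) works directly on any element with an id attribute, whereas XmlDocument.GetElementById only finds elements whose ID attributes are declared in a DTD.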
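As a rough illustration of the paragraph above, here is a minimal sketch that parses a string and looks up an element by its id. The markup and the class name GetElementDemo are made up for this example; Descendants("p") is shown as one possible substitute for GetElementsByTagName, not the only one.

```csharp
using System;
using HtmlAgilityPack;

class GetElementDemo
{
    static void Main()
    {
        // Hypothetical sample markup, used only for illustration.
        string html = "<html><body><div id='main'><p>Hello</p><p>World</p></div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string; no network access needed

        // Note the lowercase "b" in the library's method name.
        HtmlNode main = doc.GetElementbyId("main");
        Console.WriteLine(main.Name);

        // Descendants("p") enumerates all <p> elements below the node.
        foreach (HtmlNode p in main.Descendants("p"))
        {
            Console.WriteLine(p.InnerText);
        }
    }
}
```

LoadHtml is handy for experiments like this because it needs no web request; HtmlWeb.Load, used later in this post, fetches the page for you.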

HtmlAgilityPack locates nodes almost entirely with XPath expressions. An XPath reference is available at:

http://www.w3school.com.cn/xpath/xpath_syntax.asp (worth working through on your own).
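To make the XPath idea concrete, the following sketch runs two expressions against a small in-memory document. The markup and the class name XPathDemo are assumptions for illustration only; SelectNodes and SelectSingleNode are the library's XPath entry points on a node.

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        // Hypothetical sample markup, used only for illustration.
        string html =
            "<ul>" +
            "<li><h3><a href='/post/1' title='First'>First post</a></h3></li>" +
            "<li><h3><a href='/post/2' title='Second'>Second post</a></h3></li>" +
            "</ul>" +
            "<div class='content_single_1'>Body text</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//li/h3/a[@href]": every <a> under li/h3 that has an href attribute.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//li/h3/a[@href]");
        if (links != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode a in links)
            {
                Console.WriteLine(a.GetAttributeValue("href", "") + " " + a.InnerText);
            }
        }

        // starts-with() matches on the beginning of an attribute value.
        HtmlNode content = doc.DocumentNode.SelectSingleNode(
            "//div[starts-with(@class,'content_single')]");
        if (content != null)
        {
            Console.WriteLine(content.InnerText);
        }
    }
}
```

The null checks matter: both methods return null rather than an empty result when the expression matches nothing.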

That completes the preparation. The steps below show how to load a web page with HtmlAgilityPack and parse it.

1. Load the URL:

HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();

HtmlAgilityPack.HtmlDocument doc = hw.Load(url); // url is the address of the page you want to parse

ArrayList imagePaths = GetHrefs(doc); // GetHrefs is defined in step 2

2. Parse with XPath.

This step is fairly simple: use XPath to select the nodes you want, iterate over them, and read their values.

Sample code:

 
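// Images is assumed to be an ArrayList field of the containing class and
// ShowLogMsg an existing logging helper; neither is defined in this post.
private ArrayList GetHrefs(HtmlAgilityPack.HtmlDocument _doc)
{
    try
    {
        Images = new ArrayList();
        // Links inside <li><h3> that carry an href attribute.
        HtmlNodeCollection hrefs = _doc.DocumentNode.SelectNodes("//li/h3/a[@href]");
        // A second XPath example (not used below): divs whose class starts with "content_single".
        HtmlNodeCollection hrefs2 = _doc.DocumentNode.SelectNodes("//div[starts-with(@class,'content_single')]");
        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (hrefs == null)
            return new ArrayList();
        foreach (HtmlNode href in hrefs)
        {
            // Images.Add(href.Attributes["src"].Value);
            string hreff = href.Attributes["href"].Value; // exclude: 博海拾贝第二百零二期】吃完薯条寂寞了
            // GetAttributeValue avoids a NullReferenceException when title is missing.
            string title = href.GetAttributeValue("title", "");
            // Skip posts whose title contains an unwanted keyword.
            if (title.IndexOf("邪恶") >= 0)
            {
                continue;
            }
            if (title.IndexOf("恶搞") >= 0)
            {
                continue;
            }
            if (title.IndexOf("雷人") >= 0)
            {
                continue;
            }
            // Data-saving logic goes here, e.g. Images.Add(hreff);
        }
        // Return on the success path too; without this the method does not compile.
        return Images;
    }
    catch (Exception ex)
    {
        ShowLogMsg("Error: " + ex.Message + ex.StackTrace);
        return new ArrayList();
    }
}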

For any HtmlNode, you read an attribute's value like this: img.Attributes["src"].Value
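One caveat with that pattern: Attributes["src"] is null when the attribute is absent, so calling .Value on it throws. The sketch below, built on made-up markup and a hypothetical class name AttributeDemo, compares the direct indexer with GetAttributeValue, which takes a default value instead.

```csharp
using System;
using HtmlAgilityPack;

class AttributeDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical markup: the second <img> has no src attribute.
        doc.LoadHtml("<img src='a.png'><img alt='no source'>");

        foreach (HtmlNode img in doc.DocumentNode.Descendants("img"))
        {
            // The indexer returns null for a missing attribute, so check first.
            HtmlAttribute srcAttr = img.Attributes["src"];
            string direct = srcAttr != null ? srcAttr.Value : "(missing)";

            // GetAttributeValue returns the supplied default instead of throwing.
            string withDefault = img.GetAttributeValue("src", "(missing)");

            Console.WriteLine(direct + " | " + withDefault);
        }
    }
}
```

Either form works; GetAttributeValue is simply shorter when a sensible default exists.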

That covers some simple uses of HtmlAgilityPack.
