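Crawler steps
- Define the target (decide which website to crawl)
- Crawl (download the content)
- Extract (keep only the parts you want)
- Process the data (handle it however you need)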
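The steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `Crawler` type, its fields, and `demoCrawler` are hypothetical names that do not come from any library, and the fetch step returns an in-memory string so the sketch runs without network access.

```go
package main

import (
	"fmt"
	"strings"
)

// Crawler wires the steps together. All names here are hypothetical.
type Crawler struct {
	Fetch   func(target string) (string, error) // crawl: download the raw page
	Extract func(page string) []string          // extract: keep only the wanted pieces
	Process func(items []string) []string       // process: whatever you need afterwards
}

// Run executes fetch, extract, and process in order for one target.
func (c Crawler) Run(target string) ([]string, error) {
	page, err := c.Fetch(target)
	if err != nil {
		return nil, fmt.Errorf("fetch %s: %w", target, err)
	}
	return c.Process(c.Extract(page)), nil
}

// demoCrawler uses in-memory stand-ins instead of real HTTP and XPath calls.
func demoCrawler() Crawler {
	return Crawler{
		Fetch: func(target string) (string, error) {
			return "  first title \n\n second title ", nil
		},
		Extract: func(page string) []string { return strings.Split(page, "\n") },
		Process: func(items []string) []string {
			var out []string
			for _, it := range items {
				if t := strings.TrimSpace(it); t != "" {
					out = append(out, t)
				}
			}
			return out
		},
	}
}

func main() {
	// Step 1, defining the target, is simply the argument passed to Run.
	titles, err := demoCrawler().Run("https://learnku.com/go")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(titles)
}
```

Keeping each step behind its own function makes it easy to swap the stand-ins for a real downloader and a real XPath query later.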
Extension package
go get github.com/antchfx/htmlquery
The code is as follows
package main

import (
	"fmt"
	"github.com/antchfx/htmlquery"
	"strings"
	"sync"
)

const url = "https://learnku.com/go"

var wg sync.WaitGroup

// ParseEmails loads the page and prints the text of every topic title.
func ParseEmails() {
	defer wg.Done()
	defer func() {
		// Call recover only once and keep its value; a second call returns nil.
		if r := recover(); r != nil {
			fmt.Println(r)
		}
	}()
	doc, err := htmlquery.LoadURL(url)
	if err != nil {
		panic("error parsing URL")
	}
	// Select the text node of every <span class="topic-title">.
	rules := "//span[@class='topic-title']/text()"
	nodes, err := htmlquery.QueryAll(doc, rules)
	if err != nil {
		panic(`not a valid XPath expression.`)
	}
	if len(nodes) == 0 {
		fmt.Println("nothing found")
		return
	}
	for _, node := range nodes {
		res := htmlquery.InnerText(node)
		resTrim := strings.TrimSpace(res)
		if resTrim != "" {
			fmt.Printf("parse value == %s\n", resTrim)
		}
	}
}
func main() {
	wg.Add(1) // one goroutine to wait for
	go ParseEmails()
	wg.Wait()             // block until ParseEmails calls wg.Done
	fmt.Println("爬虫完成") // prints "crawl finished"
}
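The program above starts a single goroutine, so the WaitGroup has little to coordinate. Below is a minimal sketch of how the same pattern (WaitGroup plus a deferred recover) might extend to several pages. It rests on assumptions: `crawlAll`, the `fetch` parameter, and the page names are hypothetical stand-ins for the real parsing function and real URLs.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlAll runs fetch for every URL in its own goroutine and collects the outcomes.
// fetch is a hypothetical stand-in for the real page-parsing function.
func crawlAll(urls []string, fetch func(string) string) (map[string]string, map[string]error) {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex // guards both maps
		results = make(map[string]string)
		failed  = make(map[string]error)
	)
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() {
				// Call recover once and keep its value; a second call returns nil.
				if r := recover(); r != nil {
					mu.Lock()
					failed[u] = fmt.Errorf("recovered: %v", r)
					mu.Unlock()
				}
			}()
			body := fetch(u)
			mu.Lock()
			results[u] = body
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return results, failed
}

func main() {
	// A fake fetch that panics for one input, to exercise the recover path.
	fetch := func(u string) string {
		if u == "bad" {
			panic("cannot load " + u)
		}
		return "content of " + u
	}
	results, failed := crawlAll([]string{"page-1", "page-2", "bad"}, fetch)
	fmt.Println(len(results), "succeeded,", len(failed), "failed")
}
```

Because each goroutine recovers on its own, one failing page does not stop the others, and the mutex keeps concurrent map writes safe.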
Output
......
parse value == JWT身份认证(附带源码讲解)
parse value == [系列文章] Go 学习笔记 - Go 基础语法(2)
parse value == 第 14 课:并发 concurrency ?《Go 编程基础(视频)》
parse value == 组合函数 Collection《Go 编程实例 Go by Example 2020 》
parse value == 今日面试总结
爬虫完成
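The program only prints the titles, so the final step, processing the data, is left open. One possible sketch, assuming the titles have been collected into a slice rather than printed: `filterTitles` is a hypothetical helper that keeps unique titles containing a keyword, and the sample input is copied from the output above.

```go
package main

import (
	"fmt"
	"strings"
)

// filterTitles returns the unique titles that contain keyword, in their original order.
func filterTitles(titles []string, keyword string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, t := range titles {
		t = strings.TrimSpace(t)
		if t == "" || seen[t] || !strings.Contains(t, keyword) {
			continue
		}
		seen[t] = true
		out = append(out, t)
	}
	return out
}

func main() {
	// Sample input copied from the output shown above.
	titles := []string{
		"JWT身份认证(附带源码讲解)",
		"[系列文章] Go 学习笔记 - Go 基础语法(2)",
		"今日面试总结",
	}
	for _, t := range filterTitles(titles, "Go") {
		fmt.Println(t)
	}
}
```

Any other processing (saving to a file, counting, sorting) would slot into the same place.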