A Basic Web Scraper in Go

A basic web scraper usually involves a few steps: fetching the content, querying it for the data you’re after, and then sometimes using that data to find further pages to scrape in a loop.

In our example below, we use a package called goquery to do most of the heavy lifting for us. It goes further than Go’s standard library by making the first two steps easier: NewDocument() fetches the content, then Find() queries it for the elements we want.

package main

import (
    "fmt"
    "log"

    "github.com/PuerkitoBio/goquery"
)

func main() {

    url := "https://gophercoding.com/"

    // Fetch the page and extract its h1 headings
    headers, err := FetchContent(url)
    if err != nil {
        log.Fatal(err)
    }

    // Loop through headers and print out
    for _, h := range headers {
        fmt.Println(h)
    }
}

func FetchContent(url string) ([]string, error) {

    var headers []string

    // NewDocument fetches the URL and parses the response into a document
    doc, err := goquery.NewDocument(url)
    if err != nil {
        return headers, err
    }

    // Find the desired HTML elements and extract the data
    doc.Find("h1").Each(func(i int, s *goquery.Selection) {
        headers = append(headers, s.Text())
    })
    return headers, nil
}

This will list out every h1 element on the page; when we run it against this site, GopherCoding, it prints the titles of the latest posts from our home page.
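
A quick note on the fetching step: recent versions of goquery mark NewDocument() as deprecated, recommending you make the HTTP request yourself and parse the body with NewDocumentFromReader(). Below is a minimal sketch of FetchContent written that way; the name FetchContentWithReader is just an illustrative choice, and it assumes net/http is imported alongside fmt and goquery. It would be a drop-in replacement for FetchContent in main() above.

func FetchContentWithReader(url string) ([]string, error) {

    var headers []string

    // Make the HTTP request ourselves with the standard library
    res, err := http.Get(url)
    if err != nil {
        return headers, err
    }
    defer res.Body.Close()

    if res.StatusCode != http.StatusOK {
        return headers, fmt.Errorf("unexpected status code: %d", res.StatusCode)
    }

    // Parse the response body into a goquery document
    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        return headers, err
    }

    // Same query as before: collect the text of every h1
    doc.Find("h1").Each(func(i int, s *goquery.Selection) {
        headers = append(headers, s.Text())
    })
    return headers, nil
}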

There are additional steps we can take:

  1. Finding links in our content and following them, making a spider of sorts (see the first sketch below).
  2. Adding in concurrency using goroutines and WaitGroups (see the second sketch below).
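
For the first step, here is a sketch of link extraction plus a tiny crawl loop. ExtractLinks and Crawl are hypothetical helper names, and the loop reuses FetchContent from above. A real spider would also need to resolve relative hrefs (for example with net/url) and stay within the domains you intend to scrape, which this sketch skips.

// ExtractLinks is a hypothetical helper that collects the href of every
// <a> tag on the page, skipping anchors without an href attribute.
func ExtractLinks(url string) ([]string, error) {

    var links []string

    doc, err := goquery.NewDocument(url)
    if err != nil {
        return links, err
    }

    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        if href, ok := s.Attr("href"); ok {
            links = append(links, href)
        }
    })
    return links, nil
}

// Crawl is a very small spider: it visits each page once, prints its h1
// headings via FetchContent, and queues any links it finds, up to a limit.
func Crawl(start string, limit int) {
    visited := make(map[string]bool)
    queue := []string{start}

    for len(queue) > 0 && len(visited) < limit {
        url := queue[0]
        queue = queue[1:]

        if visited[url] {
            continue
        }
        visited[url] = true

        headers, err := FetchContent(url)
        if err != nil {
            log.Println(err)
            continue
        }
        for _, h := range headers {
            fmt.Println(h)
        }

        links, err := ExtractLinks(url)
        if err != nil {
            log.Println(err)
            continue
        }
        queue = append(queue, links...)
    }
}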
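
For the second step, here is a minimal sketch of scraping several pages concurrently with goroutines and a sync.WaitGroup. FetchConcurrently is a hypothetical name, it reuses FetchContent from above, and it assumes the sync package is imported; results are funnelled through a channel so all the printing happens in one place.

// FetchConcurrently scrapes each URL in its own goroutine and waits for
// them all to finish before returning.
func FetchConcurrently(urls []string) {
    var wg sync.WaitGroup
    results := make(chan string)

    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()

            headers, err := FetchContent(u)
            if err != nil {
                log.Println(err)
                return
            }
            for _, h := range headers {
                results <- h
            }
        }(url)
    }

    // Close the results channel once every goroutine has finished,
    // so the range loop below can end.
    go func() {
        wg.Wait()
        close(results)
    }()

    for h := range results {
        fmt.Println(h)
    }
}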