A basic web scraper usually involves a few steps: fetching the content, querying it for the data you're after, and then, sometimes, using that data to find more pages to fetch in a loop.
In our example below, we use a package called goquery to do most of the heavy lifting for us. This library goes further than Go's standard library by making the first two steps a bit easier: it uses NewDocument() to fetch the content and Find() to query it.
package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	url := "https://gophercoding.com/"

	// Fetch page content along with headers used
	headers, err := FetchContent(url)
	if err != nil {
		log.Println(err)
	}

	// Loop through headers and print out
	for _, h := range headers {
		fmt.Println(h)
	}
}

func FetchContent(url string) ([]string, error) {
	var headers []string

	doc, err := goquery.NewDocument(url)
	if err != nil {
		return headers, err
	}

	// Find the desired HTML elements and extract the data
	doc.Find("h1").Each(func(i int, s *goquery.Selection) {
		headers = append(headers, s.Text())
	})

	return headers, nil
}
This will print every h1 element on the page; when run against this site, GopherCoding, it lists all of the recent posts from our home page.
There are additional steps we can take:

- Finding links in our content and following them, making a spider of sorts (see the first sketch below).
- Adding in concurrency using goroutines and WaitGroups (see the second sketch below).
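For the link-following idea, a minimal sketch might look like the following. The FetchLinks helper and its structure are my own illustration rather than anything from goquery itself; it simply reuses NewDocument() and Find() to pull every href from the page, which a fuller spider would then queue up, de-duplicate and fetch in turn.

package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

// FetchLinks collects the href of every anchor element on a page.
// The name and structure here are illustrative, not part of goquery.
func FetchLinks(url string) ([]string, error) {
	var links []string

	doc, err := goquery.NewDocument(url)
	if err != nil {
		return links, err
	}

	// Every <a> with an href attribute is a candidate to follow.
	doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			links = append(links, href)
		}
	})

	return links, nil
}

func main() {
	start := "https://gophercoding.com/"

	links, err := FetchLinks(start)
	if err != nil {
		log.Fatal(err)
	}

	// A real spider would de-duplicate these and fetch each one
	// (with FetchContent or FetchLinks again) until it runs out of pages.
	for _, link := range links {
		fmt.Println(link)
	}
}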
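For the concurrency idea, here is one possible sketch using goroutines and a sync.WaitGroup. It reuses the FetchContent function from the example above; the URL list is a placeholder you would swap for the pages you actually want to scrape.

package main

import (
	"fmt"
	"log"
	"sync"

	"github.com/PuerkitoBio/goquery"
)

// FetchContent is the same function as in the example above.
func FetchContent(url string) ([]string, error) {
	var headers []string

	doc, err := goquery.NewDocument(url)
	if err != nil {
		return headers, err
	}

	doc.Find("h1").Each(func(i int, s *goquery.Selection) {
		headers = append(headers, s.Text())
	})

	return headers, nil
}

func main() {
	// Placeholder URLs; swap in whatever pages you want to scrape.
	urls := []string{
		"https://gophercoding.com/",
		"https://go.dev/",
	}

	var wg sync.WaitGroup

	for _, url := range urls {
		wg.Add(1)

		// Each page is fetched in its own goroutine.
		go func(u string) {
			defer wg.Done()

			headers, err := FetchContent(u)
			if err != nil {
				log.Println(err)
				return
			}

			for _, h := range headers {
				fmt.Println(u, "->", h)
			}
		}(url)
	}

	// Wait for every goroutine to finish before exiting.
	wg.Wait()
}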
Edd is a PHP and Go developer who enjoys blogging about his experiences, mostly about creating and coding new things he's working on, and is a big believer in open source and Linux.