My baby steps with Go — Building a basic web crawler with Neo4j integration
Well, I’m a Java developer who recently started learning Go, and I’m enjoying most of its features so far. For experimental purposes, I decided to build a small web crawler. Why a web crawler? Because it’s complex enough to provide good examples of parsing text, handling events, using the standard library and relying on third-party packages.
The Goal:
The goal of this post is to build a basic web crawler that captures your site structure by collecting all of its internal links and storing them in a Neo4j database. The idea is very simple and follows these steps:
1. Send a GET request for a given URL
2. Parse the response
3. Extract all internal links from the response
4. Store the extracted links in Neo4j
5. Repeat from the 1st step with each link until the whole site has been explored
Finally, we’ll use Neo4j Browser to display the resulting graph.
Prerequisites:
This post is accessible to Go beginners (just like me); I’ll provide a helpful link each time a new concept is introduced. For Neo4j, basic knowledge of graph-oriented databases will be helpful. I’m assuming that you have both Go and Neo4j installed on your local machine. If that’s not the case, please follow the installation instructions on the Go and Neo4j websites.
Creating the crawler:
Now that we have everything we need, let’s start coding.
The main function:
Go is a compiled language, but getting started is simple: all you need to run a program is a ‘main’ package with a ‘main’ function.
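A minimal ‘main.go’ could look like this (the printed message is just a placeholder):
package main

import "fmt"

func main() {
    fmt.Println("Hello, crawler!")
}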
Now, let’s run it.
go run main.go
Alternatively, you can compile the file and run the resulting binary manually:
go build main.go
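The build produces an executable named ‘main’ (or ‘main.exe’ on Windows) in the current directory, which you can then run directly:
./main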
Retrieving a single page from the internet:
Enough with the basics. It’s time to write some (not so) complicated code that retrieves a specific page from the internet.
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

// responseWriter implements the io.Writer interface.
type responseWriter struct{}

func main() {
    resp, err := http.Get("http://www.sfeir.com")
    if err != nil {
        fmt.Println("Error:", err)
        os.Exit(1)
    }
    defer resp.Body.Close()
    rw := responseWriter{}
    io.Copy(rw, resp.Body)
}

// Write prints the received bytes to standard output.
func (responseWriter) Write(bs []byte) (int, error) {
    fmt.Print(string(bs))
    return len(bs), nil
}
I started by declaring the main package and importing the required packages. Next, I declared a struct that will implement the ‘Writer’ interface. In the main function, you’ll notice a multiple variable assignment: ‘http.Get’ returns both the response and an error value if anything went wrong, which is the common way of handling errors in a Go program. If you take a look at the documentation, you’ll find that the ‘Writer’ interface has a single function. In order to implement this interface, we need to add a receiver function to our ‘responseWriter’ struct that matches the ‘Write’ function signature. If you’re coming from Java, you would probably expect an ‘implements Writer’ or similar syntax. That’s not the case in Go, since interfaces are implemented implicitly. Finally, I used ‘io.Copy’ to write the response body to our ‘responseWriter’, which prints it to standard output.
The next step is to modify our code to extract the links from a given website URL. After some refactoring we’ll have two files. This is main.go:
package main

import (
    "fmt"
    "os"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Web site url is missing")
        os.Exit(1)
    }
    url := os.Args[1]
    retrieve(url)
}
And this is retreiver.go:
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

// responseWriter implements the io.Writer interface.
type responseWriter struct{}

// Write prints the received bytes to standard output.
func (responseWriter) Write(bs []byte) (int, error) {
    fmt.Print(string(bs))
    return len(bs), nil
}

// retrieve fetches the page at the given URL and prints its HTML content.
func retrieve(uri string) {
    resp, err := http.Get(uri)
    if err != nil {
        fmt.Println("Error:", err)
        os.Exit(1)
    }
    defer resp.Body.Close()
    rw := responseWriter{}
    io.Copy(rw, resp.Body)
}
We can run this against a simple website:
go run main.go retreiver.go http://www.sfeir.com
Now we’ve taken our first step toward the crawler: it’s able to boot, parse a given URL, open a connection to the remote host, and retrieve the HTML content.
Getting all hyperlinks for a single page
Now, this is the part where we need to extract all the links from the HTML document. Unfortunately, the Go standard library doesn’t provide helpers for querying HTML documents, so we must look for a third-party package. Let’s consider ‘goquery’. As you might guess, it’s similar to ‘jQuery’, but for Go. You can easily get the ‘goquery’ package by running the following command:
go get github.com/PuerkitoBio/goquery
package main

import (
    "fmt"
    "os"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Web site url is missing")
        os.Exit(1)
    }
    url := os.Args[1]
    links, err := retrieve(url)
    if err != nil {
        fmt.Println("Error:", err)
        os.Exit(1)
    }
    for _, link := range links {
        fmt.Println(link)
    }
}
I changed our ‘retrieve’ function to return the list of links found on a given web page.
package main

import (
    "fmt"
    "net/http"
    "net/url"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

// retrieve fetches a page and returns the list of its internal links.
func retrieve(uri string) ([]string, error) {
    resp, err := http.Get(uri)
    if err != nil {
        fmt.Println("Error:", err)
        return nil, err
    }
    defer resp.Body.Close()
    doc, readerErr := goquery.NewDocumentFromReader(resp.Body)
    if readerErr != nil {
        fmt.Println("Error:", readerErr)
        return nil, readerErr
    }
    u, parseErr := url.Parse(uri)
    if parseErr != nil {
        fmt.Println("Error:", parseErr)
        return nil, parseErr
    }
    host := u.Host
    links := []string{}
    doc.Find("a[href]").Each(func(index int, item *goquery.Selection) {
        href, _ := item.Attr("href")
        lu, err := url.Parse(href)
        if err != nil {
            fmt.Println("Error:", err)
            return
        }
        if isInternalURL(host, lu) {
            links = append(links, u.ResolveReference(lu).String())
        }
    })
    return unique(links), nil
}

// isInternalURL ensures that the link points to the same host.
func isInternalURL(host string, lu *url.URL) bool {
    if lu.IsAbs() {
        return strings.EqualFold(host, lu.Host)
    }
    return len(lu.Host) == 0
}

// unique ensures that the list contains no duplicated links.
func unique(s []string) []string {
    keys := make(map[string]bool)
    list := []string{}
    for _, entry := range s {
        if _, ok := keys[entry]; !ok {
            keys[entry] = true
            list = append(list, entry)
        }
    }
    return list
}
As you can see, our ‘retrieve’ function has improved significantly. I removed the ‘responseWriter’ struct because it’s no longer needed: ‘goquery.NewDocumentFromReader’ reads the response body directly. I also added two helper functions: the first one detects whether a URL points to an internal page, and the second one ensures that the list contains no duplicated links. Again, we can run this against a simple website:
go run main.go retreiver.go http://www.sfeir.com
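Before moving on, here is a small standalone sketch with made-up URLs showing what the ‘net/url’ calls used by ‘isInternalURL’ and ‘retrieve’ actually do: a relative href has no host (so we treat it as internal), an absolute href has a host we can compare to ours, and ‘ResolveReference’ turns a relative link into a full URL.
package main

import (
    "fmt"
    "net/url"
)

func main() {
    base, _ := url.Parse("http://www.sfeir.com/en/services")
    rel, _ := url.Parse("/en/contact")          // relative link: no host
    ext, _ := url.Parse("http://example.org/x") // absolute link to another host

    fmt.Println(rel.IsAbs(), rel.Host)               // false and an empty host: treated as internal
    fmt.Println(ext.IsAbs(), ext.Host)               // true and "example.org": compared against our own host
    fmt.Println(base.ResolveReference(rel).String()) // http://www.sfeir.com/en/contact
}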
Getting all hyperlinks for the entire site
Yeah! We’ve made huge progress. The next thing we’re going to see is how to improve the ‘retrieve’ function in order to follow the links into other pages too. I’m going with the recursive approach: we’ll create another function called ‘crawl’, and this function will call itself recursively with each link returned by the ‘retrieve’ function. We’ll also need to keep track of the visited pages to avoid visiting the same page multiple times. Let’s check this out:
// part of retreiver.go

// visited keeps track of the pages that have already been crawled.
var visited = make(map[string]bool)

func crawl(uri string) {
    links, _ := retrieve(uri)
    for _, l := range links {
        if !visited[l] {
            fmt.Println("Fetching", l)
            visited[l] = true // mark the link before crawling it, so it is never fetched twice
            crawl(l)
        }
    }
}
Now we can call ‘crawl’ instead of the ‘retrieve’ function in ‘main.go’. The code becomes the following:
package main

import (
    "fmt"
    "os"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Web site url is missing")
        os.Exit(1)
    }
    url := os.Args[1]
    crawl(url)
}
Let’s run our program:
go run main.go retreiver.go http://www.sfeir.com
Implementing event listeners through channels
In the previous section, the fetched URL is displayed directly inside the ‘crawl’ function. This is not the best design, especially when you need to do more than just print to the screen. To fix this, we’ll implement a simple event listener mechanism for fetched links using channels. Let’s have a look at this:
// same imports

// link represents a hyperlink from a source page to a target page.
type link struct {
    source string
    target string
}

// retriever holds the event listeners and the set of visited pages.
type retriever struct {
    events  map[string][]chan link
    visited map[string]bool
}

// addEvent registers a channel as a listener for the given event name.
func (b *retriever) addEvent(e string, ch chan link) {
    if b.events == nil {
        b.events = make(map[string][]chan link)
    }
    if _, ok := b.events[e]; ok {
        b.events[e] = append(b.events[e], ch)
    } else {
        b.events[e] = []chan link{ch}
    }
}

// removeEvent unregisters a channel from the given event name.
func (b *retriever) removeEvent(e string, ch chan link) {
    if _, ok := b.events[e]; ok {
        for i := range b.events[e] {
            if b.events[e][i] == ch {
                b.events[e] = append(b.events[e][:i], b.events[e][i+1:]...)
                break
            }
        }
    }
}

// emit sends the link to every listener registered for the given event name.
func (b *retriever) emit(e string, response link) {
    if _, ok := b.events[e]; ok {
        for _, handler := range b.events[e] {
            go func(handler chan link) {
                handler <- response
            }(handler)
        }
    }
}
// crawl walks the site recursively, emitting a 'newLink' event for each new internal link.
func (b *retriever) crawl(uri string) {
    links, _ := b.retrieve(uri)
    for _, l := range links {
        if !b.visited[l] {
            b.emit("newLink", link{
                source: uri,
                target: l,
            })
            b.visited[l] = true // mark the link before crawling it, so it is never fetched twice
            b.crawl(l)
        }
    }
}
// retrieve fetches a page and returns the list of its internal links.
func (b *retriever) retrieve(uri string) ([]string, error) {
    resp, err := http.Get(uri)
    if err != nil {
        fmt.Println("Error:", err)
        return nil, err
    }
    defer resp.Body.Close()
    doc, readerErr := goquery.NewDocumentFromReader(resp.Body)
    if readerErr != nil {
        fmt.Println("Error:", readerErr)
        return nil, readerErr
    }
    u, parseErr := url.Parse(uri)
    if parseErr != nil {
        fmt.Println("Error:", parseErr)
        return nil, parseErr
    }
    host := u.Host
    links := []string{}
    doc.Find("a[href]").Each(func(index int, item *goquery.Selection) {
        href, _ := item.Attr("href")
        lu, err := url.Parse(href)
        if err != nil {
            fmt.Println("Error:", err)
            return
        }
        if isInternalURL(host, lu) {
            links = append(links, u.ResolveReference(lu).String())
        }
    })
    return unique(links), nil
}
// same helper functions
As you can see, we have three additional functions that help us manage the events for a given ‘retriever’. In this code I used the ‘go’ keyword: writing ‘go foo()’ starts the ‘foo’ function in a new goroutine, so it runs concurrently. In our case, we use ‘go’ with an anonymous function to send the event payload (the link) to all listeners through their channels. Note: I’ve set the channel data type to ‘link’, which contains the source and target pages.
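If this pattern is new to you, here is a tiny standalone sketch (not part of the crawler, the message is made up) of the ‘go’ keyword combined with a channel:
package main

import "fmt"

func main() {
    ch := make(chan string)
    go func() {
        ch <- "hello from a goroutine" // the send happens concurrently
    }()
    fmt.Println(<-ch) // the receive blocks until a value arrives
}
Now, let’s have a look at the ‘main’ function: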
package main

import (
    "fmt"
    "os"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Web site url is missing")
        os.Exit(1)
    }
    url := os.Args[1]
    ev := make(chan link)
    r := retriever{visited: make(map[string]bool)}
    r.addEvent("newLink", ev)
    go func() {
        for {
            l := <-ev
            fmt.Println(l.source + " -> " + l.target)
        }
    }()
    r.crawl(url)
}
Again, I used the ‘go’ keyword, this time to receive the events sent by the ‘crawl’ function. If we run the program now, we should see all the internal links of the given website. That’s it for the crawler.
Neo4j Integration
Now that we’re done with the crawler, let’s get to the Neo4j part. The first thing we’re going to do is install the driver.
go get github.com/neo4j/neo4j-go-driver/neo4j
After installing the driver, we need to create some basic functions that will allow us to work with Neo4j. Let’s create a new file called ‘neo4j.go’:
package main

import (
    "github.com/neo4j/neo4j-go-driver/neo4j"
)

// connectToNeo4j opens a Bolt connection and a write session on the local Neo4j instance.
func connectToNeo4j() (neo4j.Driver, neo4j.Session, error) {
    configForNeo4j40 := func(conf *neo4j.Config) { conf.Encrypted = false }
    driver, err := neo4j.NewDriver("bolt://localhost:7687", neo4j.BasicAuth(
        "neo4j", "alice!in!wonderland", ""), configForNeo4j40)
    if err != nil {
        return nil, nil, err
    }
    sessionConfig := neo4j.SessionConfig{AccessMode: neo4j.AccessModeWrite}
    session, err := driver.NewSession(sessionConfig)
    if err != nil {
        return nil, nil, err
    }
    return driver, session, nil
}

// createNode stores a crawled link as a WebLink node.
func createNode(session *neo4j.Session, l *link) (neo4j.Result, error) {
    r, err := (*session).Run("CREATE (:WebLink{source: $source, target: $target})", map[string]interface{}{
        "source": l.source,
        "target": l.target,
    })
    if err != nil {
        return nil, err
    }
    return r, err
}

// createNodesRelationship links WebLink nodes whose target matches another node's source.
func createNodesRelationship(session *neo4j.Session) (neo4j.Result, error) {
    r, err := (*session).Run("MATCH (a:WebLink),(b:WebLink) WHERE a.target = b.source CREATE (a)-[r:point_to]->(b)", map[string]interface{}{})
    if err != nil {
        return nil, err
    }
    return r, err
}
Basically, we have three functions responsible for initiating the connection to Neo4j and running basic queries. Note: you might need to adjust the connection URI and credentials to match your local Neo4j instance. To create a ‘WebLink’ node, we simply need to run the following query:
CREATE (:WebLink{source: "http://www.sfeir.com/", target: "http://www.sfeir.com/en/services"})
Once the nodes are created, we need to create the relationships between them by running the following query:
MATCH (a:WebLink),(b:WebLink)
WHERE a.target = b.source
CREATE (a)-[r:point_to]->(b)
Now, let’s update our ‘main’ function.
package main

import (
    "fmt"
    "os"

    "github.com/neo4j/neo4j-go-driver/neo4j"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Web site url is missing")
        os.Exit(1)
    }
    driver, session, connErr := connectToNeo4j()
    if connErr != nil {
        fmt.Println("Error connecting to Database:", connErr)
        os.Exit(1)
    }
    defer driver.Close()
    defer session.Close()
    url := os.Args[1]
    ev := make(chan link)
    r := retriever{visited: make(map[string]bool)}
    r.addEvent("newLink", ev)
    go func(session *neo4j.Session) {
        for {
            l := <-ev
            fmt.Println(l.source + " -> " + l.target)
            _, err := createNode(session, &l)
            if err != nil {
                fmt.Println("Failed to create node:", err)
            }
        }
    }(&session)
    r.crawl(url)
    fmt.Println("Creating relationships between nodes...")
    _, qErr := createNodesRelationship(&session)
    if qErr == nil {
        fmt.Println("Nodes updated")
    } else {
        fmt.Println("Error while updating nodes:", qErr)
    }
}
With these three functions declared in ‘neo4j.go’, our program will initiate a connection to Neo4j, subscribe to the ‘newLink’ event to insert nodes, and finally create the relationships between the nodes. I used the ‘defer’ keyword to defer the execution of a function call until the surrounding ‘main’ function returns.
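If ‘defer’ is new to you, here is a minimal sketch (unrelated to the crawler) showing that deferred calls run after the function returns, in last-in-first-out order, which is why the session is closed before the driver in our ‘main’:
package main

import "fmt"

func main() {
    defer fmt.Println("closing driver")  // deferred first, runs last
    defer fmt.Println("closing session") // deferred last, runs first
    fmt.Println("crawling...")
    // Output:
    // crawling...
    // closing session
    // closing driver
}
Let’s run this for the last time: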
go run main.go retreiver.go neo4j.go http://www.sfeir.com
To check the result in Neo4j, you can run the following query in Neo4j Browser:
MATCH (n:WebLink) RETURN count(n) AS count
Or this query to display all nodes:
MATCH (n:WebLink) RETURN n
Et voilà! Here is the result after running the last query:
It’s pretty, isn’t it?
Conclusion
Throughout this post we explored a lot of features of the Go programming language, including multiple variable assignment, implicit interface implementation, channels, and goroutines. We also used the standard library as well as some third-party libraries. Thank you for reading. The source code is available on my GitHub.