My baby steps with Go — Building a basic web crawler with Neo4j integration
Well, I’m a Java developer who recently started learning Go, and I’m enjoying most of its features so far. For experimental purposes, I decided to build a small web crawler. Why a web crawler? Because it’s complex enough to provide good examples of parsing text, handling events, using the standard library and relying on third-party packages.
The Goal:
The goal of this post is to build a basic web crawler that captures your site structure by collecting all of its internal links and storing them in a Neo4j database. The idea is very simple and follows these steps:
1. Send a GET request for a given URL
2. Parse the response
3. Extract all internal links from the response
4. Store the extracted links in Neo4j
5. Repeat from the 1st step with each link until the whole site has been explored
Finally, we’ll use Neo4j Browser to display the resulting graph.
Prerequisites:
This post is accessible to Go beginners (just like me); I’ll provide a helpful link each time a new concept is introduced. For Neo4j, basic knowledge of graph-oriented databases will be helpful. I’m assuming that you have both Go and Neo4j installed on your local machine. If that’s not the case, please follow the installation instructions on the Go and Neo4j websites.
Creating the crawler:
Now that we have everything we need, let’s start coding.
The main function:
Go is a compiled language, but getting started is simple: all you need to run a program is a ‘main’ package with a ‘main’ function.
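A minimal ‘main.go’ could look like this (the printed message is just a placeholder):
package main

import "fmt"

func main() {
    fmt.Println("Hello, crawler!")
}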
Now, let’s run it.
go run main.go
Alternatively, you can compile the file and run the resulting binary manually:
go build main.go
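The build produces an executable named ‘main’ (or ‘main.exe’ on Windows) in the current directory, which you can then run directly:
./main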
Retrieving a single page from the internet:
Enough with the basics. It’s time to write some (not so) complicated code that retrieves a specific page from the internet.
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

// responseWriter implements the io.Writer interface.
type responseWriter struct{}

func main() {
    resp, err := http.Get("http://www.sfeir.com")
    if err != nil {
        fmt.Println("Error:", err)
        os.Exit(1)
    }
    defer resp.Body.Close()
    rw := responseWriter{}
    io.Copy(rw, resp.Body)
}

// Write prints the received bytes to standard output.
func (responseWriter) Write(bs []byte) (int, error) {
    fmt.Print(string(bs))
    return len(bs), nil
}
I started by declaring the main package and importing the required packages. Next, I declared a struct that will implement the ‘Writer’ interface. In the main function, you’ll notice a multiple variable assignment: ‘http.Get’ returns both the response and an error value if anything went wrong, which is the common way of handling errors in a Go program. If you take a look at the documentation, you’ll find that the ‘Writer’ interface has a single function. In order to implement this interface, we need to add a receiver function to our ‘responseWriter’ struct that matches the ‘Write’ function signature. If you’re coming from Java, you would probably expect an ‘implements Writer’ or similar syntax. That’s not the case in Go, since interfaces are implemented implicitly. Finally, I used ‘io.Copy’ to write the response body to our ‘responseWriter’, which prints it to standard output.
The next step is to modify our code to extract the links from a given website URL. After some refactoring we’ll have two files. This is main.go:
package main

import (
    "fmt"
    "os"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Web site url is missing")
        os.Exit(1)
    }
    url := os.Args[1]
    retrieve(url)
}
And this is retreiver.go:
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

// responseWriter implements the io.Writer interface.
type responseWriter struct{}

// Write prints the received bytes to standard output.
func (responseWriter) Write(bs []byte) (int, error) {
    fmt.Print(string(bs))
    return len(bs), nil
}

// retrieve fetches the page at the given URL and prints its HTML content.
func retrieve(uri string) {
    resp, err := http.Get(uri)
    if err != nil {
        fmt.Println("Error:", err)
        os.Exit(1)
    }
    defer resp.Body.Close()
    rw := responseWriter{}
    io.Copy(rw, resp.Body)
}
We can run this against a simple website:
go run main.go retreiver.go http://www.sfeir.com
Now we’ve taken our first step toward the crawler: it’s able to boot, parse a given URL, open a connection to the remote host, and retrieve the HTML content.
Getting all hyperlinks for a single page
Now, this is the part where we need to extract all the links from the HTML document. Unfortunately, the Go standard library doesn’t provide helpers for querying HTML documents, so we must look for a third-party package. Let’s consider ‘goquery’. As you might guess, it’s similar to ‘jQuery’, but for Go. You can easily get the ‘goquery’ package by running the following command:
go get github.com/PuerkitoBio/goquery
package main

import (
    "fmt"
    "os"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Web site url is missing")
        os.Exit(1)
    }
    url := os.Args[1]
    links, err := retrieve(url)
    if err != nil {
        fmt.Println("Error:", err)
        os.Exit(1)
    }
    for _, link := range links {
        fmt.Println(link)
    }
}
I changed our ‘retrieve’ function to return the list of links found on a given web page.
package main

import (
    "fmt"
    "net/http"
    "net/url"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

// retrieve fetches a page and returns the list of its internal links.
func retrieve(uri string) ([]string, error) {
    resp, err := http.Get(uri)
    if err != nil {
        fmt.Println("Error:", err)
        return nil, err
    }
    defer resp.Body.Close()
    doc, readerErr := goquery.NewDocumentFromReader(resp.Body)
    if readerErr != nil {
        fmt.Println("Error:", readerErr)
        return nil, readerErr
    }
    u, parseErr := url.Parse(uri)
    if parseErr != nil {
        fmt.Println("Error:", parseErr)
        return nil, parseErr
    }
    host := u.Host
    links := []string{}
    doc.Find("a[href]").Each(func(index int, item *goquery.Selection) {
        href, _ := item.Attr("href")
        lu, err := url.Parse(href)
        if err != nil {
            fmt.Println("Error:", err)
            return
        }
        if isInternalURL(host, lu) {
            links = append(links, u.ResolveReference(lu).String())
        }
    })
    return unique(links), nil
}

// isInternalURL ensures that the link points to the same host.
func isInternalURL(host string, lu *url.URL) bool {
    if lu.IsAbs() {
        return strings.EqualFold(host, lu.Host)
    }
    return len(lu.Host) == 0
}

// unique ensures that the list contains no duplicated links.
func unique(s []string) []string {
    keys := make(map[string]bool)
    list := []string{}
    for _, entry := range s {
        if _, ok := keys[entry]; !ok {
            keys[entry] = true
            list = append(list, entry)
        }
    }
    return list
}
As you can see, our ‘retrieve’ function has improved significantly. I removed the ‘responseWriter’ struct because it’s no longer needed: ‘goquery.NewDocumentFromReader’ reads the response body directly. I also added two helper functions: the first one detects whether a URL points to an internal page, and the second one ensures that the list contains no duplicated links. Again, we can run this against a simple website:
go run main.go retreiver.go http://www.sfeir.com
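Before moving on, here is a small standalone sketch with made-up URLs showing what the ‘net/url’ calls used by ‘isInternalURL’ and ‘retrieve’ actually do: a relative href has no host (so we treat it as internal), an absolute href has a host we can compare to ours, and ‘ResolveReference’ turns a relative link into a full URL.
package main

import (
    "fmt"
    "net/url"
)

func main() {
    base, _ := url.Parse("http://www.sfeir.com/en/services")
    rel, _ := url.Parse("/en/contact")          // relative link: no host
    ext, _ := url.Parse("http://example.org/x") // absolute link to another host

    fmt.Println(rel.IsAbs(), rel.Host)               // false and an empty host: treated as internal
    fmt.Println(ext.IsAbs(), ext.Host)               // true and "example.org": compared against our own host
    fmt.Println(base.ResolveReference(rel).String()) // http://www.sfeir.com/en/contact
}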
Getting all hyperlinks for the entire site
Yeah! We’ve made huge progress. The next thing we’re going to see is how to improve the ‘retrieve’ function in order to follow the links into other pages too. I’m going with the recursive approach: we’ll create another function called ‘crawl’, and this function will call itself recursively with each link returned by the ‘retrieve’ function. We’ll also need to keep track of the visited pages to avoid visiting the same page multiple times. Let’s check this out:
// part of retreiver.go

// visited keeps track of the pages that have already been crawled.
var visited = make(map[string]bool)

func crawl(uri string) {
    links, _ := retrieve(uri)
    for _, l := range links {
        if !visited[l] {
            fmt.Println("Fetching", l)
            visited[l] = true // mark the link before crawling it, so it is never fetched twice
            crawl(l)
        }
    }
}
Now we can call ‘crawl’ instead of the ‘retrieve’ function in ‘main.go’. The code becomes the following:
package main

import (
    "fmt"
    "os"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Web site url is missing")
        os.Exit(1)
    }
    url := os.Args[1]
    crawl(url)
}
Let’s run our program:
go run main.go retreiver.go http://www.sfeir.com
Implementing event listeners through channels
In the previous section, the fetched URL is displayed directly inside the ‘crawl’ function. This is not the best design, especially when you need to do more than just print to the screen. To fix this, we’ll implement a simple event listener mechanism for fetched links using channels. Let’s have a look at this:
// same imports

// link represents a hyperlink from a source page to a target page.
type link struct {
    source string
    target string
}

// retriever holds the event listeners and the set of visited pages.
type retriever struct {
    events  map[string][]chan link
    visited map[string]bool
}

// addEvent registers a channel as a listener for the given event name.
func (b *retriever) addEvent(e string, ch chan link) {
    if b.events == nil {
        b.events = make(map[string][]chan link)
    }
    if _, ok := b.events[e]; ok {
        b.events[e] = append(b.events[e], ch)
    } else {
        b.events[e] = []chan link{ch}
    }
}

// removeEvent unregisters a channel from the given event name.
func (b *retriever) removeEvent(e string, ch chan link) {
    if _, ok := b.events[e]; ok {
        for i := range b.events[e] {
            if b.events[e][i] == ch {
                b.events[e] = append(b.events[e][:i], b.events[e][i+1:]...)
                break
            }
        }
    }
}

// emit sends the link to every listener registered for the given event name.
func (b *retriever) emit(e string, response link) {
    if _, ok := b.events[e]; ok {
        for _, handler := range b.events[e] {
            go func(handler chan link) {
                handler <- response
            }(handler)
        }
    }
}
// crawl walks the site recursively, emitting a 'newLink' event for each new internal link.
func (b *retriever) crawl(uri string) {
    links, _ := b.retrieve(uri)
    for _, l := range links {
        if !b.visited[l] {
            b.emit("newLink", link{
                source: uri,
                target: l,
            })
            b.visited[l] = true // mark the link before crawling it, so it is never fetched twice
            b.crawl(l)
        }
    }
}
// retrieve fetches a page and returns the list of its internal links.
func (b *retriever) retrieve(uri string) ([]string, error) {
    resp, err := http.Get(uri)
    if err != nil {
        fmt.Println("Error:", err)
        return nil, err
    }
    defer resp.Body.Close()
    doc, readerErr := goquery.NewDocumentFromReader(resp.Body)
    if readerErr != nil {
        fmt.Println("Error:", readerErr)
        return nil, readerErr
    }
    u, parseErr := url.Parse(uri)
    if parseErr != nil {
        fmt.Println("Error:", parseErr)
        return nil, parseErr
    }
    host := u.Host
    links := []string{}
    doc.Find("a[href]").Each(func(index int, item *goquery.Selection) {
        href, _ := item.Attr("href")
        lu, err := url.Parse(href)
        if err != nil {
            fmt.Println("Error:", err)
            return
        }
        if isInternalURL(host, lu) {
            links = append(links, u.ResolveReference(lu).String())
        }
    })
    return unique(links), nil
}
// same helper functions
As you can see, we have three additional functions that help us manage the events for a given ‘retriever’. In this code I used the ‘go’ keyword: writing ‘go foo()’ starts the ‘foo’ function in a new goroutine, so it runs concurrently. In our case, we use ‘go’ with an anonymous function to send the event payload (the link) to all listeners through their channels. Note: I’ve set the channel data type to ‘link’, which contains the source and target pages.
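If this pattern is new to you, here is a tiny standalone sketch (not part of the crawler, the message is made up) of the ‘go’ keyword combined with a channel:
package main

import "fmt"

func main() {
    ch := make(chan string)
    go func() {
        ch <- "hello from a goroutine" // the send happens concurrently
    }()
    fmt.Println(<-ch) // the receive blocks until a value arrives
}
Now, let’s have a look at the ‘main’ function: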
package main

import (
    "fmt"
    "os"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Web site url is missing")
        os.Exit(1)
    }
    url := os.Args[1]
    ev := make(chan link)
    r := retriever{visited: make(map[string]bool)}
    r.addEvent("newLink", ev)
    go func() {
        for {
            l := <-ev
            fmt.Println(l.source + " -> " + l.target)
        }
    }()
    r.crawl(url)
}
Again, I used the ‘go’ keyword, this time to receive the events sent by the ‘crawl’ function. If we run the program now, we should see all the internal links of the given website. That’s it for the crawler.
Neo4j Integration
Now that we’re done with the crawler, let’s get to the Neo4j part. The first thing we’re going to do is install the driver.
go get github.com/neo4j/neo4j-go-driver/neo4j
After installing the driver, we need to create some basic functions that will allow us to work with Neo4j. Let’s create a new file called ‘neo4j.go’:
package main

import (
    "github.com/neo4j/neo4j-go-driver/neo4j"
)

// connectToNeo4j opens a Bolt connection and a write session on the local Neo4j instance.
func connectToNeo4j() (neo4j.Driver, neo4j.Session, error) {
    configForNeo4j40 := func(conf *neo4j.Config) { conf.Encrypted = false }
    driver, err := neo4j.NewDriver("bolt://localhost:7687", neo4j.BasicAuth(
        "neo4j", "alice!in!wonderland", ""), configForNeo4j40)
    if err != nil {
        return nil, nil, err
    }
    sessionConfig := neo4j.SessionConfig{AccessMode: neo4j.AccessModeWrite}
    session, err := driver.NewSession(sessionConfig)
    if err != nil {
        return nil, nil, err
    }
    return driver, session, nil
}

// createNode stores a crawled link as a WebLink node.
func createNode(session *neo4j.Session, l *link) (neo4j.Result, error) {
    r, err := (*session).Run("CREATE (:WebLink{source: $source, target: $target})", map[string]interface{}{
        "source": l.source,
        "target": l.target,
    })
    if err != nil {
        return nil, err
    }
    return r, err
}

// createNodesRelationship links WebLink nodes whose target matches another node's source.
func createNodesRelationship(session *neo4j.Session) (neo4j.Result, error) {
    r, err := (*session).Run("MATCH (a:WebLink),(b:WebLink) WHERE a.target = b.source CREATE (a)-[r:point_to]->(b)", map[string]interface{}{})
    if err != nil {
        return nil, err
    }
    return r, err
}
Basically, we have three functions responsible for initiating the connection to Neo4j and running basic queries. Note: you might need to adjust the connection URI and credentials to match your local Neo4j instance. To create a ‘WebLink’ node, we simply need to run the following query:
CREATE (:WebLink{source: "http://www.sfeir.com/", target: "http://www.sfeir.com/en/services"})
Once the nodes are created, we need to create the relationships between them by running the following query:
MATCH (a:WebLink),(b:WebLink)
WHERE a.target = b.source
CREATE (a)-[r:point_to]->(b)
Now, let’s update our ‘main’ function.
package main

import (
    "fmt"
    "os"

    "github.com/neo4j/neo4j-go-driver/neo4j"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("Web site url is missing")
        os.Exit(1)
    }
    driver, session, connErr := connectToNeo4j()
    if connErr != nil {
        fmt.Println("Error connecting to Database:", connErr)
        os.Exit(1)
    }
    defer driver.Close()
    defer session.Close()
    url := os.Args[1]
    ev := make(chan link)
    r := retriever{visited: make(map[string]bool)}
    r.addEvent("newLink", ev)
    go func(session *neo4j.Session) {
        for {
            l := <-ev
            fmt.Println(l.source + " -> " + l.target)
            _, err := createNode(session, &l)
            if err != nil {
                fmt.Println("Failed to create node:", err)
            }
        }
    }(&session)
    r.crawl(url)
    fmt.Println("Creating relationships between nodes...")
    _, qErr := createNodesRelationship(&session)
    if qErr == nil {
        fmt.Println("Nodes updated")
    } else {
        fmt.Println("Error while updating nodes:", qErr)
    }
}
With these three functions declared in ‘neo4j.go’, our program will initiate a connection to Neo4j, subscribe to the ‘newLink’ event to insert nodes, and finally create the relationships between the nodes. I used the ‘defer’ keyword to defer the execution of a function call until the surrounding ‘main’ function returns.
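If ‘defer’ is new to you, here is a minimal sketch (unrelated to the crawler) showing that deferred calls run after the function returns, in last-in-first-out order, which is why the session is closed before the driver in our ‘main’:
package main

import "fmt"

func main() {
    defer fmt.Println("closing driver")  // deferred first, runs last
    defer fmt.Println("closing session") // deferred last, runs first
    fmt.Println("crawling...")
    // Output:
    // crawling...
    // closing session
    // closing driver
}
Let’s run this for the last time: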
go run main.go retreiver.go neo4j.go http://www.sfeir.com
To check the result in Neo4j, you can run the following query in Neo4j Browser:
MATCH (n:WebLink) RETURN count(n) AS count
Or this query to display all nodes:
MATCH (n:WebLink) RETURN n
Et voilà! Here is the result after running the last query:
It’s pretty, isn’t it?
Conclusion
Throughout this post we explored a lot of features of the Go programming language, including multiple variable assignment, implicit interface implementation, channels, and goroutines. We also used the standard library as well as some third-party libraries. Thank you for reading. The source code is available on my GitHub.