
First Steps with Go: Building a Simple Web Application with Neo4j

Goal:

The goal of this post is to build a simple web application that captures the structure of your website by retrieving all the links it contains and storing them in a Neo4j database. The idea is simple, and it consists of the following steps:

  • Make a request to the URL
  • Parse the response
  • Extract the links from the response
  • Store the extracted links in Neo4j
  • Repeat from step 1 with the retrieved links until the whole site has been explored
  • Finally, use the Neo4j web interface to look at the structure.

Requirements:

This article is suitable for beginners. Links will be provided each time a new idea is introduced. For Neo4j, a basic knowledge of graph-oriented databases will be helpful. It is assumed that Go and Neo4j are already installed on your machine.

Building the crawler:

Now that we have everything we need, let's get started.

Fetching a single page from the internet:

It's time to write some code that will help us fetch a given page from the internet.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

type responseWriter struct{}

func main() {

	resp, err := http.Get("http://www.sfeir.com")

	if err != nil {
		fmt.Println("Error:", err)
		os.Exit(1)
	}
	// Make sure the response body is closed when main returns.
	defer resp.Body.Close()

	// Copy the response body into our Writer implementation,
	// which prints it to stdout.
	rw := responseWriter{}
	io.Copy(rw, resp.Body)
}

// Write prints everything it receives, which is all it takes for
// responseWriter to satisfy the io.Writer interface.
func (responseWriter) Write(bs []byte) (int, error) {
	fmt.Print(string(bs))

	return len(bs), nil
}

We started by declaring the main package and importing the required packages. Next, we declared a struct that will implement the Writer interface. In the main function, we use multiple variable assignment: http.Get returns the response along with an error value if something goes wrong. This is the common way of handling errors in Go programs.
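
For readers who are new to Go, here is a tiny standalone sketch (not part of the crawler) of the 'value, err' pattern described above, using strconv.Atoi from the standard library:

package main

import (
	"fmt"
	"strconv"
)

func main() {
	// strconv.Atoi returns two values: the parsed integer and an error.
	n, err := strconv.Atoi("42")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("parsed:", n)

	// With bad input the error is non-nil and the int holds its zero value.
	if _, err := strconv.Atoi("not a number"); err != nil {
		fmt.Println("Error:", err)
	}
}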

If you take a look at the documentation, you'll find the Writer interface with a single function. In order to implement this interface, we need to add a receiver function to our responseWriter struct that matches the Writer function signature. If you're coming from Java, you would probably expect an 'implements Writer' or similar syntax. Well, this is not the case in Go, since interface implementation is implicit.
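
To make the implicit implementation concrete, here is a short standalone sketch. The io.Writer interface declares a single Write method, and any type with a matching method satisfies it automatically; the blank-identifier assignment is a common compile-time check (the responseWriter below mirrors the one from our program):

package main

import (
	"fmt"
	"io"
)

// The standard library declares io.Writer as:
//
//	type Writer interface {
//		Write(p []byte) (n int, err error)
//	}

type responseWriter struct{}

// Having a Write method with this exact signature is all it takes:
// there is no 'implements' keyword in Go.
func (responseWriter) Write(bs []byte) (int, error) {
	fmt.Print(string(bs))
	return len(bs), nil
}

// Compile-time check that responseWriter satisfies io.Writer.
var _ io.Writer = responseWriter{}

func main() {
	fmt.Fprintln(responseWriter{}, "hello from an io.Writer")
}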

Finally, I used io.Copy to write the response body to our responseWriter. The next step is to modify our code to extract links from a given website URL. After some refactoring, we'll have two files. Here is main.go:

package main

import (
	"fmt"
	"os"
)

func main() {

	if len(os.Args) < 2 {
		fmt.Println("Web site url is missing")
		os.Exit(1)
	}

	url := os.Args[1]

	retrieve(url)
}

And here is retreiver.go:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

type responseWriter struct{}

func (responseWriter) Write(bs []byte) (int, error) {
	fmt.Print(string(bs))

	return len(bs), nil
}

// retrieve fetches the page at uri and streams its body to stdout.
func retrieve(uri string) {
	resp, err := http.Get(uri)

	if err != nil {
		fmt.Println("Error:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	rw := responseWriter{}
	io.Copy(rw, resp.Body)
}

We can run this against a simple website:

go run main.go retreiver.go http://www.sfeir.com

Now we've made our first step toward creating the crawler. It's able to boot, parse a given URL, open a connection to the right remote host, and retrieve the HTML content.

Getting all hyperlinks for a single page

Now, this is the part where we need to extract all links from the HTML document. Unfortunately, there are no helpers for manipulating HTML in the Go standard library, so we have to look for a third-party package. Let's consider 'goquery'. As you might guess, it's similar to 'jquery', but for Go. You can easily get the 'goquery' package by running the following command:

go get github.com/PuerkitoBio/goquery
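
Just to give a feel for the library before we plug it into the crawler, here is a minimal standalone sketch of how goquery parses a document and selects anchor tags (the HTML snippet is made up for the example):

package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// A made-up HTML fragment, just to demonstrate the selection API.
	html := `<html><body><a href="/about">About</a><a href="/contact">Contact</a></body></html>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	// Iterate over every anchor that carries an href attribute.
	doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		fmt.Println(i, href)
	})
}

With goquery installed, here is the updated main.go:
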
package main

import (
	"fmt"
	"os"
)

func main() {

	if len(os.Args) < 2 {
		fmt.Println("Web site url is missing")
		os.Exit(1)
	}

	url := os.Args[1]

	links, err := retrieve(url)

	if err != nil {
		fmt.Println("Error:", err)
		os.Exit(1)
	}

	for _, link := range links {
		fmt.Println(link)
	}
}

I changed our retrieve function to return the list of links of a given web page.

package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func retrieve(uri string) ([]string, error) {
	resp, err := http.Get(uri)
	if err != nil {
		fmt.Println("Error:", err)
		return nil, err
	}
	defer resp.Body.Close()

	doc, readerErr := goquery.NewDocumentFromReader(resp.Body)
	if readerErr != nil {
		fmt.Println("Error:", readerErr)
		return nil, readerErr
	}
	u, parseErr := url.Parse(uri)
	if parseErr != nil {
		fmt.Println("Error:", parseErr)
		return nil, parseErr
	}
	host := u.Host

	links := []string{}
	doc.Find("a[href]").Each(func(index int, item *goquery.Selection) {
		href, _ := item.Attr("href")
		lu, err := url.Parse(href)
		if err != nil {
			fmt.Println("Error:", err)
			return
		}
		if isInternalURL(host, lu) {
			links = append(links, u.ResolveReference(lu).String())
		}

	})

	return unique(links), nil
}

// ensures that the link is internal
func isInternalURL(host string, lu *url.URL) bool {

	if lu.IsAbs() {
		return strings.EqualFold(host, lu.Host)
	}
	return len(lu.Host) == 0
}

// ensures that there are no repetitions
func unique(s []string) []string {
	keys := make(map[string]bool)
	list := []string{}
	for _, entry := range s {
		if _, value := keys[entry]; !value {
			keys[entry] = true
			list = append(list, entry)
		}
	}
	return list
}

As you can see, our 'retrieve' function has significantly improved. I removed the 'responseWriter' struct because it's no longer needed: 'goquery.NewDocumentFromReader' reads the response body directly. I also added two helper functions. The first one detects whether a URL points to an internal page; the second one ensures that the list does not contain any duplicate links. Again, we can run this against a simple website:

go run main.go retreiver.go http://www.sfeir.com
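
To make the first helper's behaviour more concrete, here is a small standalone sketch (the helper is copied from retreiver.go and the sample URLs are only illustrative):

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Copied from retreiver.go: a link is internal if it is relative,
// or if its host matches the host of the page being crawled.
func isInternalURL(host string, lu *url.URL) bool {
	if lu.IsAbs() {
		return strings.EqualFold(host, lu.Host)
	}
	return len(lu.Host) == 0
}

func main() {
	host := "www.sfeir.com"

	for _, raw := range []string{
		"/en/services",                 // relative, no host -> internal
		"http://www.sfeir.com/careers", // same host -> internal
		"https://example.org/page",     // different host -> external
	} {
		lu, err := url.Parse(raw)
		if err != nil {
			fmt.Println("Error:", err)
			continue
		}
		fmt.Println(raw, "->", isInternalURL(host, lu))
	}
}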

Getting all hyperlinks for the entire site

Yeah! We've made huge progress. The next thing we're going to see is how to improve the 'retrieve' function in order to get the links from other pages too. So, I'm going with a recursive approach. We'll create another function called 'crawl', and this function will call itself recursively with each link returned by the 'retrieve' function. We'll also need to keep track of the visited pages to avoid fetching the same page multiple times. Let's check this:

// part of retreiver.go
var visited = make(map[string]bool)

func crawl(uri string) {

	// Mark the current page as visited so it is never fetched twice.
	visited[uri] = true

	links, _ := retrieve(uri)

	for _, l := range links {
		if !visited[l] {
			fmt.Println("Fetching", l)
			crawl(l)
		}
	}
}

Now we can call 'crawl' instead of 'retrieve' in 'main.go'. The code will be the following:

package main

import (
	"fmt"
	"os"
)

func main() {

	if len(os.Args) < 2 {
		fmt.Println("Web site url is missing")
		os.Exit(1)
	}

	url := os.Args[1]

	crawl(url)

}

Let’s run our program:

go run main.go retreiver.go http://www.sfeir.com

Implementing event listeners through channels

In the previous section we saw that the fetched URL is displayed inside the 'crawl' function. This is not the best solution, especially when you need to do more than just print to the screen. To fix this, we'll implement an event listener for fetched URLs using channels. Let's have a look:

// same imports
type link struct {
	source string
	target string
}

type retriever struct {
	events  map[string][]chan link
	visited map[string]bool
}

func (b *retriever) addEvent(e string, ch chan link) {
	if b.events == nil {
		b.events = make(map[string][]chan link)
	}
	if _, ok := b.events[e]; ok {
		b.events[e] = append(b.events[e], ch)
	} else {
		b.events[e] = []chan link{ch}
	}
}

func (b *retriever) removeEvent(e string, ch chan link) {
	if _, ok := b.events[e]; ok {
		for i := range b.events[e] {
			if b.events[e][i] == ch {
				b.events[e] = append(b.events[e][:i], b.events[e][i+1:]...)
				break
			}
		}
	}
}

func (b *retriever) emit(e string, response link) {
	if _, ok := b.events[e]; ok {
		for _, handler := range b.events[e] {
			go func(handler chan link) {
				handler <- response
			}(handler)
		}
	}
}

func (b *retriever) crawl(uri string) {

	// Mark the current page as visited so it is never fetched twice.
	b.visited[uri] = true

	links, _ := b.retrieve(uri)

	for _, l := range links {
		if !b.visited[l] {
			b.emit("newLink", link{
				source: uri,
				target: l,
			})
			b.crawl(l)
		}
	}
}

func (b *retriever) retrieve(uri string) ([]string, error) {
	resp, err := http.Get(uri)
	if err != nil {
		fmt.Println("Error:", err)
		return nil, err
	}
	defer resp.Body.Close()

	doc, readerErr := goquery.NewDocumentFromReader(resp.Body)
	if readerErr != nil {
		fmt.Println("Error:", readerErr)
		return nil, readerErr
	}
	u, parseErr := url.Parse(uri)
	if parseErr != nil {
		fmt.Println("Error:", parseErr)
		return nil, parseErr
	}
	host := u.Host

	links := []string{}
	doc.Find("a[href]").Each(func(index int, item *goquery.Selection) {
		href, _ := item.Attr("href")
		lu, err := url.Parse(href)
		if err != nil {
			fmt.Println("Error:", err)
			return
		}
		if isInternalURL(host, lu) {
			links = append(links, u.ResolveReference(lu).String())
		}

	})

	return unique(links), nil
}

// same helper functions

As you can see, we have three additional functions to help us manage the events for a given 'retriever'. For this code I used the 'go' keyword: basically, writing 'go foo()' makes the 'foo' function run asynchronously. In our case, we use 'go' with an anonymous function to send the event parameter (the link) to all listeners through channels. Note: I've set the channel data type to 'link', which contains the source and the target page. Now let's have a look at the 'main' function:
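
If goroutines and channels are new to you, here is a minimal standalone sketch, unrelated to the crawler itself, of the send/receive pattern that 'emit' and its listeners rely on:

package main

import (
	"fmt"
	"sync"
)

func main() {
	ch := make(chan string)
	var wg sync.WaitGroup

	// A listener goroutine that receives values until the channel is closed.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for msg := range ch {
			fmt.Println("received:", msg)
		}
	}()

	// Sends block until the listener receives; 'emit' wraps each send in
	// its own goroutine so the crawler itself is never blocked.
	for _, msg := range []string{"a", "b", "c"} {
		ch <- msg
	}
	close(ch)

	wg.Wait()
}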

package main

import (
	"fmt"
	"os"
)

func main() {

	if len(os.Args) < 2 {
		fmt.Println("Web site url is missing")
		os.Exit(1)
	}

	url := os.Args[1]
	ev := make(chan link)
	r := retriever{visited: make(map[string]bool)}
	r.addEvent("newLink", ev)

	go func() {
		for {
			l := <-ev
			fmt.Println(l.source + " -> " + l.target)
		}
	}()

	r.crawl(url)

}

Again I used the 'go' keyword, this time for receiving the event parameter sent by the 'crawl' function. If we run our program now, we should see all the internal links of the given website. That's it for the crawler.

Neo4j Integration

Now that we're done with the crawler, let's get to the Neo4j part. The first thing we're going to do is install the driver.

go get github.com/neo4j/neo4j-go-driver/neo4j

After installing the driver, we need to create some basic functions that will allow us to work with Neo4j. Let's create a new file called 'neo4j.go':

package main

import (
	"github.com/neo4j/neo4j-go-driver/neo4j"
)

func connectToNeo4j() (neo4j.Driver, neo4j.Session, error) {

	configForNeo4j40 := func(conf *neo4j.Config) { conf.Encrypted = false }
	
	driver, err := neo4j.NewDriver("bolt://localhost:7687", neo4j.BasicAuth(
		"neo4j", "alice!in!wonderland", ""), configForNeo4j40)

	if err != nil {
		return nil, nil, err
	}

	sessionConfig := neo4j.SessionConfig{AccessMode: neo4j.AccessModeWrite}
	session, err := driver.NewSession(sessionConfig)
	if err != nil {
		return nil, nil, err
	}

	return driver, session, nil
}

func createNode(session *neo4j.Session, l *link) (neo4j.Result, error) {
	r, err := (*session).Run("CREATE (:WebLink{source: $source, target: $target}) ", map[string]interface{}{
		"source": l.source,
		"target": l.target,
	})

	if err != nil {
		return nil, err
	}

	return r, err
}

func createNodesRelationship(session *neo4j.Session) (neo4j.Result, error) {
	r, err := (*session).Run("MATCH (a:WebLink),(b:WebLink) WHERE a.target = b.source CREATE (a)-[r:point_to]->(b)", map[string]interface{}{})

	if err != nil {
		return nil, err
	}

	return r, err
}

Basically, we have three functions responsible for initiating the connection to Neo4j and running basic queries. Note: you might need to change the Neo4j configuration (URI, user, password) to match your local instance. To create a 'WebLink' node, we simply need to run the following query:

CREATE (:WebLink{source: "http://www.sfeir.com/", target: "http://www.sfeir.com/en/services"})

Once the nodes are created, we need to create the relationships between them by running the following query:

MATCH (a:WebLink),(b:WebLink) 
WHERE a.target = b.source 
CREATE (a)-[r:point_to]->(b)

Now, let’s update our ‘main’ function.

package main

import (
	"fmt"
	"os"

	"github.com/neo4j/neo4j-go-driver/neo4j"
)

func main() {

	if len(os.Args) < 2 {
		fmt.Println("Web site url is missing")
		os.Exit(1)
	}

	driver, session, connErr := connectToNeo4j()

	if connErr != nil {
		fmt.Println("Error connecting to Database:", connErr)
		os.Exit(1)
	}

	defer driver.Close()

	defer session.Close()

	url := os.Args[1]
	ev := make(chan link)
	r := retriever{visited: make(map[string]bool)}
	r.addEvent("newLink", ev)

	go func(session *neo4j.Session) {
		for {
			l := <-ev
			fmt.Println(l.source + " -> " + l.target)
			_, err := createNode(session, &l)

			if err != nil {
				fmt.Println("Failed to create node:", err)
			}

		}
	}(&session)

	r.crawl(url)

	fmt.Println("Creation of relationship between nodes.. ")
	_, qErr := createNodesRelationship(&session)

	if qErr == nil {
		fmt.Println("Nodes updated")
	} else {
		fmt.Println("Error while updating nodes:", qErr)
	}

}

Using the three functions declared in 'neo4j.go', our program initiates a connection to Neo4j, subscribes to the 'newLink' event to insert nodes, and finally creates the relationships between nodes. I used the 'defer' keyword to defer the execution of a function until the surrounding 'main' function returns. Let's run this one last time:

go run main.go retreiver.go neo4j.go http://www.sfeir.com

To check the result in Neo4j, you can run the following query in your Neo4j Browser:

MATCH (n:WebLink) RETURN count(n) AS count

Or this query to display all nodes:

MATCH (n:WebLink) RETURN n

Et voilà! The result after running the last query:

It’s pretty, isn’t it?

Conclusion

Throughout this post we explored a number of features of the Go programming language, including multiple variable assignment, implicit interface implementation, channels, and goroutines. We also used the standard library as well as some third-party libraries. Thank you for reading. The source code is available on my GitHub.