
HN Scraper

I’ve just released an example project written in Go that scrapes Hacker News and presents the results on a web page. Scraping is inherently brittle, so for a more robust way of obtaining HN data I would recommend the official API instead, since the markup can change at any time.

Making the request

Grabbing the HTML from the HN front page can be done with Go’s built-in HTTP client, found in the net/http package:

package main

import (
	"io/ioutil"
	"log"
	"net/http"
)

// check aborts the program if err is non-nil.
func check(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

func main() {
	// Fetch the first page of the HN front page listing.
	resp, err := http.Get("https://news.ycombinator.com/news?p=1")
	check(err)
	defer resp.Body.Close()

	// Read the whole response body and print it out.
	body, err := ioutil.ReadAll(resp.Body)
	check(err)
	log.Println(string(body))
}

Extracting article links

For each posting on HN, there are a number of attributes that we can capture:

  1. Number of comments
  2. Article title
  3. Article hyperlink
  4. Comment hyperlink
  5. Time posted
  6. Poster’s user id

To make things slightly trickier, there are also job postings, which do not display a comment count.

Fortunately, there’s a handy library called goquery that lets us query an HTML document using a jQuery-like API. In practice this means we can use CSS selectors to pick out the HTML elements we want to operate on. Queries in goquery run against a goquery document object, which we can create by passing in our http.Response directly:

doc, err := goquery.NewDocumentFromResponse(resp)
check(err)
doc.Find("#hnmain .title a").Each(func(i int, s *goquery.Selection) {
	// s is one article's title anchor:
	// - the href attribute contains the link to the article
	// - the anchor text contains the article title
})
doc.Find("#hnmain .subtext").Each(func(i int, s *goquery.Selection) {
	// s is one post's subtext row, which holds the remaining
	// properties: number of comments, time posted, and so on
})
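
As a rough sketch of what the second callback might do, here’s one way to pull the comment count out of a subtext row, falling back to zero for job postings; the markup details and the helper name are my assumptions rather than the project’s actual code:

// parseComments extracts the comment count from a post's subtext row.
// Job postings (and posts with no comments yet) simply yield zero.
func parseComments(s *goquery.Selection) int {
	count := 0
	s.Find("a").Each(func(i int, a *goquery.Selection) {
		text := a.Text() // e.g. "123 comments", "discuss", "hide"
		if strings.HasSuffix(text, "comments") || strings.HasSuffix(text, "comment") {
			if n, err := strconv.Atoi(strings.Fields(text)[0]); err == nil {
				count = n
			}
		}
	})
	return count
}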

Displaying the results

After extracting the properties we want, we need a way to serve the results. By encapsulating each article’s properties in a struct, we can marshal a slice of them to JSON and serve it using Go’s built-in HTTP server, also found in the net/http package:

type Article struct {
	Title    string `json:"title"`
	Href     string `json:"href"`
	Site     string `json:"site"`
	Time     string `json:"time"`
	User     string `json:"user"`
	Points   int    `json:"points"`
	Rank     int    `json:"rank"`
	Comments int    `json:"comments"`
	Id       int    `json:"id"`
}

func postsHandler(w http.ResponseWriter, r *http.Request) {
	page, err := strconv.Atoi(r.URL.Path[len("/posts/"):])
	if err != nil {
		// A malformed page number is the client's fault, not ours.
		w.WriteHeader(http.StatusBadRequest)
		return
	}
	doc, err := GetPage(page)
	if err != nil {
		w.WriteHeader(http.StatusInternalServerError)
		return
	}
	articles := ExtractArticles(doc)
	resp, err := json.Marshal(articles)
	if err != nil {
		w.WriteHeader(http.StatusInternalServerError)
		return
	}
	w.Header().Add("Content-Type", "application/json")
	_, _ = w.Write(resp)
}

func Serve() {
	http.HandleFunc("/posts/", postsHandler)
	log.Println("Running server at port 8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

For the web page itself, I’m just serving a skeleton index page which then requests the list of articles via an AJAX call. This static content could be served by a separate HTTP server such as nginx, or directly from our own server using http.ServeFile.
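
For example, a handler along these lines would serve it directly from Go (the file name index.html is just a placeholder for whatever the skeleton page is called):

func indexHandler(w http.ResponseWriter, r *http.Request) {
	// Serve the static skeleton page; the article list itself is
	// fetched from /posts/1 by the page's JavaScript.
	http.ServeFile(w, r, "index.html")
}

It can then be registered in Serve alongside the posts handler with http.HandleFunc("/", indexHandler).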

Caching results

To avoid hitting the HN web site too often, we can cache the extracted results together with an expiry time. An in-memory map is enough here, since entries expire quickly and there is no need to persist them.
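
A minimal sketch of such a cache, keyed on the page number; the five-minute expiry and the names here are placeholders rather than the project’s actual values:

type cacheEntry struct {
	articles []Article
	expires  time.Time
}

var (
	cacheMu sync.Mutex
	cache   = map[int]cacheEntry{}
)

// cachedArticles returns the cached articles for a page, refetching
// and re-extracting them once the entry has expired.
func cachedArticles(page int) ([]Article, error) {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	if entry, ok := cache[page]; ok && time.Now().Before(entry.expires) {
		return entry.articles, nil
	}
	doc, err := GetPage(page)
	if err != nil {
		return nil, err
	}
	articles := ExtractArticles(doc)
	cache[page] = cacheEntry{
		articles: articles,
		expires:  time.Now().Add(5 * time.Minute), // arbitrary expiry
	}
	return articles, nil
}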

Running the project

To run the project, clone the repository and run go run query.go server.go from the project directory. The command is slightly awkward, but it is necessary when a main program is split across multiple files, because go run only compiles the files named on the command line. This tends to push most of the code into packages, leaving the main program as a minimal stub that ties the various pieces together.
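
As an illustration of that last point, the stub can end up as small as this (the import path is hypothetical):

package main

import "github.com/example/hnscraper" // hypothetical library package

func main() {
	// All scraping and serving logic lives in the library;
	// main only wires it up.
	hnscraper.Serve()
}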