Go-Readability is a Go package that finds the main readable content and the metadata of an HTML page. It works by removing clutter like buttons, ads, background images, scripts, etc.
This package is based on Readability.js by Mozilla and ported line by line to keep its look and behavior as close to the original as possible. This way, hopefully every web page that can be parsed by Readability.js is parse-able by go-readability as well.
This package is stable enough for use and is up to date with Readability.js v0.4.4 (commit b359811).
To install this package, just run go get:
go get -u -v github.com/nano-interactive/go-readability
To get the readable content from a URL, you can use readability.FromURL. It fetches the web page from the specified URL, checks whether it is readable, then parses the response to find the readable content:
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	readability "github.com/nano-interactive/go-readability"
)

var urls = []string{
	// this one is an article, so it's parse-able
	"https://www.nytimes.com/2019/02/20/climate/climate-national-security-threat.html",
	// while this one is not an article, so readability will fail to parse it
	"https://www.nytimes.com/",
}

func main() {
	for i, url := range urls {
		article, err := readability.FromURL(url, 30*time.Second)
		if err != nil {
			log.Fatalf("failed to parse %s: %v\n", url, err)
		}

		// save the plain-text content; close explicitly instead of
		// deferring, so files don't stay open for the whole loop
		dstTxtFile, _ := os.Create(fmt.Sprintf("text-%02d.txt", i+1))
		dstTxtFile.WriteString(article.TextContent)
		dstTxtFile.Close()

		// save the HTML content
		dstHTMLFile, _ := os.Create(fmt.Sprintf("html-%02d.html", i+1))
		dstHTMLFile.WriteString(article.Content)
		dstHTMLFile.Close()

		fmt.Printf("URL     : %s\n", url)
		fmt.Printf("Title   : %s\n", article.Title)
		fmt.Printf("Author  : %s\n", article.Byline)
		fmt.Printf("Length  : %d\n", article.Length)
		fmt.Printf("Excerpt : %s\n", article.Excerpt)
		fmt.Printf("SiteName: %s\n", article.SiteName)
		fmt.Printf("Image   : %s\n", article.Image)
		fmt.Printf("Favicon : %s\n", article.Favicon)
		fmt.Printf("Text content saved to \"text-%02d.txt\"\n", i+1)
		fmt.Printf("HTML content saved to \"html-%02d.html\"\n", i+1)
		fmt.Println()
	}
}
However, sometimes you want to parse a URL regardless of whether it is an article or not, for example when you only want to get the metadata of the page. To do that, you have to download the page manually using http.Get, then parse it using readability.FromReader:
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"

	readability "github.com/nano-interactive/go-readability"
)

var urls = []string{
	// both will be parse-able now
	"https://www.nytimes.com/2019/02/20/climate/climate-national-security-threat.html",
	// but this one will not have any content
	"https://www.nytimes.com/",
}

func main() {
	for _, u := range urls {
		resp, err := http.Get(u)
		if err != nil {
			log.Fatalf("failed to download %s: %v\n", u, err)
		}

		parsedURL, err := url.Parse(u)
		if err != nil {
			log.Fatalf("failed to parse URL %s: %v\n", u, err)
		}

		article, err := readability.FromReader(resp.Body, parsedURL)
		resp.Body.Close()
		if err != nil {
			log.Fatalf("failed to parse %s: %v\n", u, err)
		}

		fmt.Printf("URL     : %s\n", u)
		fmt.Printf("Title   : %s\n", article.Title)
		fmt.Printf("Author  : %s\n", article.Byline)
		fmt.Printf("Length  : %d\n", article.Length)
		fmt.Printf("Excerpt : %s\n", article.Excerpt)
		fmt.Printf("SiteName: %s\n", article.SiteName)
		fmt.Printf("Image   : %s\n", article.Image)
		fmt.Printf("Favicon : %s\n", article.Favicon)
		fmt.Println()
	}
}
Go-Readability is distributed under the MIT license, which means you can use and modify it however you want. However, if you make an enhancement, please send a pull request if possible. If you like this project, please consider donating to me via PayPal or Ko-fi.