Fixed emails being saved to a wrong file under query=everything; improved page saving process; fixed pages being saved without considering the actual setting; added a non-link-resolving variation of FindPageLinks; added query=archive functionality; the working directory is now an actual working directory instead of the executable's directory
Unbewohnte committed Feb 27, 2023
1 parent c91986d commit 722f3fb
Showing 10 changed files with 324 additions and 418 deletions.
17 changes: 9 additions & 8 deletions README.md
@@ -4,9 +4,9 @@

A simple HTML web spider with no dependencies. It is possible to search for pages that contain a given text or for the text itself, extract images, video and audio, and save pages that satisfy the criteria along the way.

## Configuration
## Configuration Overview

The flow of work fully depends on the configuration file. By default `conf.json` is used as a configuration file, but the name can be changed via `-conf` flag. The default configuration is embedded in the program so on the first launch or by simply deleting the file, a new `conf.json` will be created in the same directory as the executable itself unless the `-wdir` (working directory) flag is set to some other value. To see al available flags run `wecr -h`.
The flow of work fully depends on the configuration file. By default `conf.json` is used as the configuration file, but the name can be changed via the `-conf` flag. The default configuration is embedded in the program, so on the first launch (or after simply deleting the file) a new `conf.json` will be created in the working directory, unless the `-wdir` (working directory) flag points to some other directory, in which case that directory takes precedence. To see all available flags run `wecr -h`.
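
To make the flag handling above concrete, here is a minimal, self-contained sketch of how `-conf` and `-wdir` might be declared with Go's standard `flag` package; the variable names are assumptions, not wecr's actual source:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// hypothetical equivalents of wecr's -conf and -wdir flags
	confName := flag.String("conf", "conf.json", "configuration file name")
	wDir := flag.String("wdir", "", "working directory (empty = current directory)")
	flag.Parse()

	workingDirectory := *wDir
	if workingDirectory == "" {
		wd, err := os.Getwd()
		if err != nil {
			fmt.Fprintf(os.Stderr, "failed to determine working directory: %s\n", err)
			os.Exit(1)
		}
		workingDirectory = wd
	}

	fmt.Printf("configuration file: %s; working in: %s\n", *confName, workingDirectory)
}
```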

The configuration is split into different branches such as `requests` (how requests are made, i.e. request timeout, wait time, user agent), `logging` (whether to use logs, output to a file), `save` (output file|directory, whether to save pages) or `search` (use regexp, query string), each of which contains tweakable parameters. There are global ones as well, such as `workers` (working threads that make requests in parallel) and `depth` (literally, how deep the recursive search should go). The names are simple and self-explanatory, so no attribute-by-attribute explanation is needed for most of them.
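
As a rough illustration only, the branch structure described above could map onto Go types roughly like this; every field and JSON key name here is an assumption and may not match wecr's real `config` package:

```go
package config

// Hypothetical shape of the configuration branches; illustrative names only.
type Requests struct {
	WaitTimeoutMs         uint64 `json:"wait_timeout_ms"`
	RequestTimeoutMs      uint64 `json:"request_timeout_ms"`
	ContentFetchTimeoutMs uint64 `json:"content_fetch_timeout_ms"`
	UserAgent             string `json:"user_agent"`
}

type Logging struct {
	OutputLogs bool   `json:"output_logs"`
	LogsFile   string `json:"logs_file"`
}

type Save struct {
	OutputDir string `json:"output_dir"`
	SavePages bool   `json:"save_pages"`
}

type Search struct {
	IsRegexp bool   `json:"is_regexp"`
	Query    string `json:"query"`
}

type Conf struct {
	Workers  uint     `json:"workers"`
	Depth    uint     `json:"depth"`
	Requests Requests `json:"requests"`
	Logging  Logging  `json:"logging"`
	Save     Save     `json:"save"`
	Search   Search   `json:"search"`
}
```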

@@ -18,20 +18,21 @@ You can change search `query` at **runtime** via web dashboard if `launch_dashbo

### Search query

There are some special `query` values:
There are some special `query` values to control the flow of work:

- `email` - tells wecr to scrape email addresses and output to `output_file`
- `images` - find all images on pages and output to the corresponding directory in `output_dir` (**IMPORTANT**: set `content_fetch_timeout_ms` to `0` so the images (and other content below) load fully)
- `videos` - find and fetch files that look like videos
- `audio` - find and fetch files that look like audio
- `documents` - find and fetch files that look like a document
- `everything` - find and fetch images, audio, video, documents and email addresses
- `archive` - no text to be searched, save every visited page

When `is_regexp` is enabled, the `query` is treated as a regexp string and pages will be scanned for matches that satisfy it.
When `is_regexp` is enabled, the `query` is treated as a regexp string (in Go "flavor") and pages will be scanned for matches that satisfy it.
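
For example, matching with `is_regexp` enabled boils down to standard Go `regexp` behavior, roughly as in this standalone sketch (not wecr's internal matching code):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// a "query" as it could appear in conf.json with is_regexp set to true
	query := `[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}`
	re := regexp.MustCompile(query)

	pageBody := []byte(`<p>Contact us at hello@example.com or admin@example.org</p>`)

	// every non-overlapping match found on the page would be reported
	for _, match := range re.FindAll(pageBody, -1) {
		fmt.Println(string(match))
	}
}
```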

### Output
### Data Output

By default, if the query is not something of special values all the matches and other data will be outputted to `output.json` file as separate continuous JSON objects, but if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc. - the additional contents will be put in the corresponding directories inside `output_dir`, which is neatly created by the executable's side.
If the query is not one of the special values, all text matches will be output to the `found_text.json` file inside `output_dir` as separate, continuous JSON objects; if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc., the additional content will also be put in the corresponding directories inside `output_dir`, which is created in the working directory (or, if the `-wdir` flag is set, in that directory). If `output_dir` happens to be empty, content will be output directly to the working directory.

The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use `-extractData` with the output JSON file as an argument (like `found_text.json`, which is the default output file name for simple text searches) to extract the actual data, filter out the duplicates and put each entry on its own line in a new text file.
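
The extraction step can be pictured with a short sketch: Go's `json.Decoder` reads back-to-back JSON objects from one file, a map filters out repeats, and each unique string value lands on its own line. This is illustrative only and assumes nothing about the exact structure of the stored objects:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

func main() {
	in, err := os.Open("found_text.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer in.Close()

	dec := json.NewDecoder(in) // happily decodes concatenated JSON objects
	seen := make(map[string]bool)

	for {
		var entry map[string]interface{}
		if err := dec.Decode(&entry); err == io.EOF {
			break
		} else if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		for _, value := range entry {
			text, ok := value.(string)
			if !ok || seen[text] {
				continue
			}
			seen[text] = true
			fmt.Println(text) // one unique entry per line
		}
	}
}
```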

@@ -43,7 +44,7 @@ Otherwise - `go build` in the `src` directory to build `wecr`. No dependencies.

## Examples

See [page on my website](https://unbewohnte.su/wecr) for some basic examples.
See [a page on my website](https://unbewohnte.su/wecr) for some basic examples.

Dump of a basic configuration:

@@ -87,4 +88,4 @@ Dump of a basic configuration:
```

## License
AGPLv3
wecr is distributed under the AGPLv3 license
1 change: 1 addition & 0 deletions src/config/config.go
@@ -31,6 +31,7 @@ const (
QueryEmail string = "email"
QueryDocuments string = "documents"
QueryEverything string = "everything"
QueryArchive string = "archive"
)

const (
58 changes: 30 additions & 28 deletions src/main.go
@@ -39,7 +39,7 @@ import (
"unbewohnte/wecr/worker"
)

const version = "v0.3.4"
const version = "v0.3.5"

const (
configFilename string = "conf.json"
@@ -107,12 +107,12 @@ func init() {
if *wDir != "" {
workingDirectory = *wDir
} else {
exePath, err := os.Executable()
wdir, err := os.Getwd()
if err != nil {
logger.Error("Failed to determine executable's path: %s", err)
logger.Error("Failed to determine working directory path: %s", err)
return
}
workingDirectory = filepath.Dir(exePath)
workingDirectory = wdir
}

logger.Info("Working in \"%s\"", workingDirectory)
@@ -294,6 +294,8 @@ func main() {
logger.Info("Looking for audio (%+s)", web.AudioExtentions)
case config.QueryDocuments:
logger.Info("Looking for documents (%+s)", web.DocumentExtentions)
case config.QueryArchive:
logger.Info("Archiving every visited page")
case config.QueryEverything:
logger.Info("Looking for email addresses, images, videos, audio and various documents (%+s - %+s - %+s - %+s)",
web.ImageExtentions,
@@ -309,30 +311,6 @@ func main() {
}
}

// create and redirect logs if needed
if conf.Logging.OutputLogs {
if conf.Logging.LogsFile != "" {
// output logs to a file
logFile, err := os.Create(filepath.Join(workingDirectory, conf.Logging.LogsFile))
if err != nil {
logger.Error("Failed to create logs file: %s", err)
return
}
defer logFile.Close()

logger.Info("Outputting logs to %s", conf.Logging.LogsFile)
logger.SetOutput(logFile)
} else {
// output logs to stdout
logger.Info("Outputting logs to stdout")
logger.SetOutput(os.Stdout)
}
} else {
// no logging needed
logger.Info("No further logs will be outputted")
logger.SetOutput(nil)
}

// create visit queue file if not turned off
var visitQueueFile *os.File = nil
if !conf.InMemoryVisitQueue {
@@ -401,6 +379,30 @@ func main() {
logger.Info("Launched dashboard at http://localhost:%d", conf.Dashboard.Port)
}

// create and redirect logs if needed
if conf.Logging.OutputLogs {
if conf.Logging.LogsFile != "" {
// output logs to a file
logFile, err := os.Create(filepath.Join(workingDirectory, conf.Logging.LogsFile))
if err != nil {
logger.Error("Failed to create logs file: %s", err)
return
}
defer logFile.Close()

logger.Info("Outputting logs to %s", conf.Logging.LogsFile)
logger.SetOutput(logFile)
} else {
// output logs to stdout
logger.Info("Outputting logs to stdout")
logger.SetOutput(os.Stdout)
}
} else {
// no logging needed
logger.Info("No further logs will be outputted")
logger.SetOutput(nil)
}

// launch concurrent scraping !
workerPool.Work()
logger.Info("Started scraping...")
90 changes: 8 additions & 82 deletions src/web/audio.go
@@ -20,99 +20,25 @@ package web

import (
"net/url"
"strings"
)

func HasAudioExtention(url string) bool {
for _, extention := range AudioExtentions {
if strings.HasSuffix(url, extention) {
return true
}
}

return false
}

// Tries to find audio URLs on the page
func FindPageAudio(pageBody []byte, from *url.URL) []string {
var urls []string
func FindPageAudio(pageBody []byte, from url.URL) []url.URL {
var urls []url.URL

// for every element that has "src" attribute
for _, match := range tagSrcRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int

linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}

linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}

if linkEndIndex <= linkStartIndex+1 {
continue
}

link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}

linkResolved := ResolveLink(link, from.Host)
if HasAudioExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageSrcLinks(pageBody, from) {
if HasAudioExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}

// for every "a" element as well
for _, match := range tagHrefRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int

linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}

linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}

if linkEndIndex <= linkStartIndex+1 {
continue
}

link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}

linkResolved := ResolveLink(link, from.Host)
if HasAudioExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageLinks(pageBody, from) {
if HasAudioExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}

// return all discovered audio urls
return urls
}
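
With the refactored signature, `FindPageAudio` takes a `url.URL` by value and returns ready-made `url.URL` values. A hypothetical caller might look like this; the example assumes `.ogg` is among the recognized audio extensions and is not code from the repository:

```go
package main

import (
	"fmt"
	"net/url"

	"unbewohnte/wecr/web"
)

func main() {
	pageBody := []byte(`<a href="/sounds/track.ogg">listen</a>`)

	from, err := url.Parse("https://example.com/music/")
	if err != nil {
		panic(err)
	}

	// collect every URL on the page that looks like audio
	for _, audioURL := range web.FindPageAudio(pageBody, *from) {
		fmt.Println(audioURL.String())
	}
}
```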
107 changes: 26 additions & 81 deletions src/web/documents.go
@@ -1,97 +1,42 @@
/*
Wecr - crawl the web for data
Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
"net/url"
"strings"
)

func HasDocumentExtention(url string) bool {
for _, extention := range DocumentExtentions {
if strings.HasSuffix(url, extention) {
return true
}
}

return false
}

// Tries to find docs' URLs on the page
func FindPageDocuments(pageBody []byte, from *url.URL) []string {
var urls []string
func FindPageDocuments(pageBody []byte, from url.URL) []url.URL {
var urls []url.URL

// for every element that has "src" attribute
for _, match := range tagSrcRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int

linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}

linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}

if linkEndIndex <= linkStartIndex+1 {
continue
}

link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}

linkResolved := ResolveLink(link, from.Host)
if HasDocumentExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageSrcLinks(pageBody, from) {
if HasDocumentExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}

// for every "a" element as well
for _, match := range tagHrefRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int

linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}

linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}

if linkEndIndex <= linkStartIndex+1 {
continue
}

link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}

linkResolved := ResolveLink(link, from.Host)
if HasDocumentExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageLinks(pageBody, from) {
if HasDocumentExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}
