Fixed emails being saved to a wrong file under query=everything; improved page saving process; fixed pages being saved without considering the actual setting; added a non-link-resolving variation of FindPageLinks; added query=archive functionality; the working directory is now an actual working directory instead of the executable's directory
Unbewohnte committed Feb 27, 2023
1 parent c91986d commit 722f3fb
Showing 10 changed files with 324 additions and 418 deletions.
17 changes: 9 additions & 8 deletions README.md
@@ -4,9 +4,9 @@

A simple HTML web spider with no dependencies. It is possible to search for pages that contain a given text or for the text itself, extract images, video and audio, and save pages that satisfy the criteria along the way.

## Configuration
## Configuration Overview

The flow of work fully depends on the configuration file. By default `conf.json` is used as a configuration file, but the name can be changed via `-conf` flag. The default configuration is embedded in the program so on the first launch or by simply deleting the file, a new `conf.json` will be created in the same directory as the executable itself unless the `-wdir` (working directory) flag is set to some other value. To see al available flags run `wecr -h`.
The flow of work fully depends on the configuration file. By default `conf.json` is used as the configuration file, but the name can be changed via the `-conf` flag. The default configuration is embedded in the program, so on the first launch (or after simply deleting the file) a new `conf.json` will be created in the working directory, unless the `-wdir` (working directory) flag points to some other directory, in which case that directory takes precedence. To see all available flags run `wecr -h`.
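
To make the flag handling above concrete, here is a minimal, self-contained sketch of how `-conf` and `-wdir` might be declared with Go's standard `flag` package; the variable names are assumptions, not wecr's actual source:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// hypothetical equivalents of wecr's -conf and -wdir flags
	confName := flag.String("conf", "conf.json", "configuration file name")
	wDir := flag.String("wdir", "", "working directory (empty = current directory)")
	flag.Parse()

	workingDirectory := *wDir
	if workingDirectory == "" {
		wd, err := os.Getwd()
		if err != nil {
			fmt.Fprintf(os.Stderr, "failed to determine working directory: %s\n", err)
			os.Exit(1)
		}
		workingDirectory = wd
	}

	fmt.Printf("configuration file: %s; working in: %s\n", *confName, workingDirectory)
}
```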

The configuration is split into different branches such as `requests` (how requests are made, i.e. request timeout, wait time, user agent), `logging` (whether to use logs, output to a file), `save` (output file|directory, whether to save pages) or `search` (use regexp, query string), each of which contains tweakable parameters. There are global ones as well, such as `workers` (working threads that make requests in parallel) and `depth` (literally, how deep the recursive search should go). The names are simple and self-explanatory, so no attribute-by-attribute explanation is needed for most of them.
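
As a rough illustration only, the branch structure described above could map onto Go types roughly like this; every field and JSON key name here is an assumption and may not match wecr's real `config` package:

```go
package config

// Hypothetical shape of the configuration branches; illustrative names only.
type Requests struct {
	WaitTimeoutMs         uint64 `json:"wait_timeout_ms"`
	RequestTimeoutMs      uint64 `json:"request_timeout_ms"`
	ContentFetchTimeoutMs uint64 `json:"content_fetch_timeout_ms"`
	UserAgent             string `json:"user_agent"`
}

type Logging struct {
	OutputLogs bool   `json:"output_logs"`
	LogsFile   string `json:"logs_file"`
}

type Save struct {
	OutputDir string `json:"output_dir"`
	SavePages bool   `json:"save_pages"`
}

type Search struct {
	IsRegexp bool   `json:"is_regexp"`
	Query    string `json:"query"`
}

type Conf struct {
	Workers  uint     `json:"workers"`
	Depth    uint     `json:"depth"`
	Requests Requests `json:"requests"`
	Logging  Logging  `json:"logging"`
	Save     Save     `json:"save"`
	Search   Search   `json:"search"`
}
```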

@@ -18,20 +18,21 @@ You can change search `query` at **runtime** via web dashboard if `launch_dashbo

### Search query

There are some special `query` values:
There are some special `query` values to control the flow of work:

- `email` - tells wecr to scrape email addresses and output to `output_file`
- `images` - find all images on pages and output to the corresponding directory in `output_dir` (**IMPORTANT**: set `content_fetch_timeout_ms` to `0` so the images (and other content below) load fully)
- `videos` - find and fetch files that look like videos
- `audio` - find and fetch files that look like audio
- `documents` - find and fetch files that look like a document
- `everything` - find and fetch images, audio, video, documents and email addresses
- `archive` - no text to be searched, save every visited page

When `is_regexp` is enabled, the `query` is treated as a regexp string and pages will be scanned for matches that satisfy it.
When `is_regexp` is enabled, the `query` is treated as a regexp string (in Go "flavor") and pages will be scanned for matches that satisfy it.
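
For example, matching with `is_regexp` enabled boils down to standard Go `regexp` behavior, roughly as in this standalone sketch (not wecr's internal matching code):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// a "query" as it could appear in conf.json with is_regexp set to true
	query := `[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}`
	re := regexp.MustCompile(query)

	pageBody := []byte(`<p>Contact us at hello@example.com or admin@example.org</p>`)

	// every non-overlapping match found on the page would be reported
	for _, match := range re.FindAll(pageBody, -1) {
		fmt.Println(string(match))
	}
}
```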

### Output
### Data Output

By default, if the query is not something of special values all the matches and other data will be outputted to `output.json` file as separate continuous JSON objects, but if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc. - the additional contents will be put in the corresponding directories inside `output_dir`, which is neatly created by the executable's side.
If the query is not one of the special values, all text matches will be output to the `found_text.json` file inside `output_dir` as separate, continuous JSON objects; if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc., the additional content will also be put in the corresponding directories inside `output_dir`, which is created in the working directory (or, if the `-wdir` flag is set, in that directory). If `output_dir` happens to be empty, content will be output directly to the working directory.

The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use `-extractData` with the output JSON file as an argument (like `found_text.json`, which is the default output file name for simple text searches) to extract the actual data, filter out the duplicates and put each entry on its own line in a new text file.
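
The extraction step can be pictured with a short sketch: Go's `json.Decoder` reads back-to-back JSON objects from one file, a map filters out repeats, and each unique string value lands on its own line. This is illustrative only and assumes nothing about the exact structure of the stored objects:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

func main() {
	in, err := os.Open("found_text.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer in.Close()

	dec := json.NewDecoder(in) // happily decodes concatenated JSON objects
	seen := make(map[string]bool)

	for {
		var entry map[string]interface{}
		if err := dec.Decode(&entry); err == io.EOF {
			break
		} else if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		for _, value := range entry {
			text, ok := value.(string)
			if !ok || seen[text] {
				continue
			}
			seen[text] = true
			fmt.Println(text) // one unique entry per line
		}
	}
}
```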

@@ -43,7 +44,7 @@ Otherwise - `go build` in the `src` directory to build `wecr`. No dependencies.

## Examples

See [page on my website](https://unbewohnte.su/wecr) for some basic examples.
See [a page on my website](https://unbewohnte.su/wecr) for some basic examples.

Dump of a basic configuration:

@@ -87,4 +88,4 @@ Dump of a basic configuration:
```

## License
AGPLv3
wecr is distributed under the AGPLv3 license
1 change: 1 addition & 0 deletions src/config/config.go
@@ -31,6 +31,7 @@ const (
QueryEmail string = "email"
QueryDocuments string = "documents"
QueryEverything string = "everything"
QueryArchive string = "archive"
)

const (
58 changes: 30 additions & 28 deletions src/main.go
@@ -39,7 +39,7 @@ import (
"unbewohnte/wecr/worker"
)

const version = "v0.3.4"
const version = "v0.3.5"

const (
configFilename string = "conf.json"
@@ -107,12 +107,12 @@ func init() {
if *wDir != "" {
workingDirectory = *wDir
} else {
exePath, err := os.Executable()
wdir, err := os.Getwd()
if err != nil {
logger.Error("Failed to determine executable's path: %s", err)
logger.Error("Failed to determine working directory path: %s", err)
return
}
workingDirectory = filepath.Dir(exePath)
workingDirectory = wdir
}

logger.Info("Working in \"%s\"", workingDirectory)
@@ -294,6 +294,8 @@ func main() {
logger.Info("Looking for audio (%+s)", web.AudioExtentions)
case config.QueryDocuments:
logger.Info("Looking for documents (%+s)", web.DocumentExtentions)
case config.QueryArchive:
logger.Info("Archiving every visited page")
case config.QueryEverything:
logger.Info("Looking for email addresses, images, videos, audio and various documents (%+s - %+s - %+s - %+s)",
web.ImageExtentions,
@@ -309,30 +311,6 @@ func main() {
}
}

// create and redirect logs if needed
if conf.Logging.OutputLogs {
if conf.Logging.LogsFile != "" {
// output logs to a file
logFile, err := os.Create(filepath.Join(workingDirectory, conf.Logging.LogsFile))
if err != nil {
logger.Error("Failed to create logs file: %s", err)
return
}
defer logFile.Close()

logger.Info("Outputting logs to %s", conf.Logging.LogsFile)
logger.SetOutput(logFile)
} else {
// output logs to stdout
logger.Info("Outputting logs to stdout")
logger.SetOutput(os.Stdout)
}
} else {
// no logging needed
logger.Info("No further logs will be outputted")
logger.SetOutput(nil)
}

// create visit queue file if not turned off
var visitQueueFile *os.File = nil
if !conf.InMemoryVisitQueue {
@@ -401,6 +379,30 @@ func main() {
logger.Info("Launched dashboard at http://localhost:%d", conf.Dashboard.Port)
}

// create and redirect logs if needed
if conf.Logging.OutputLogs {
if conf.Logging.LogsFile != "" {
// output logs to a file
logFile, err := os.Create(filepath.Join(workingDirectory, conf.Logging.LogsFile))
if err != nil {
logger.Error("Failed to create logs file: %s", err)
return
}
defer logFile.Close()

logger.Info("Outputting logs to %s", conf.Logging.LogsFile)
logger.SetOutput(logFile)
} else {
// output logs to stdout
logger.Info("Outputting logs to stdout")
logger.SetOutput(os.Stdout)
}
} else {
// no logging needed
logger.Info("No further logs will be outputted")
logger.SetOutput(nil)
}

// launch concurrent scraping !
workerPool.Work()
logger.Info("Started scraping...")
90 changes: 8 additions & 82 deletions src/web/audio.go
@@ -20,99 +20,25 @@ package web

import (
"net/url"
"strings"
)

func HasAudioExtention(url string) bool {
for _, extention := range AudioExtentions {
if strings.HasSuffix(url, extention) {
return true
}
}

return false
}

// Tries to find audio URLs on the page
func FindPageAudio(pageBody []byte, from *url.URL) []string {
var urls []string
func FindPageAudio(pageBody []byte, from url.URL) []url.URL {
var urls []url.URL

// for every element that has "src" attribute
for _, match := range tagSrcRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int

linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}

linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}

if linkEndIndex <= linkStartIndex+1 {
continue
}

link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}

linkResolved := ResolveLink(link, from.Host)
if HasAudioExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageSrcLinks(pageBody, from) {
if HasAudioExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}

// for every "a" element as well
for _, match := range tagHrefRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int

linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}

linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}

if linkEndIndex <= linkStartIndex+1 {
continue
}

link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}

linkResolved := ResolveLink(link, from.Host)
if HasAudioExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageLinks(pageBody, from) {
if HasAudioExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}

// return all discovered audio urls
return urls
}
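
With the refactored signature, `FindPageAudio` takes a `url.URL` by value and returns ready-made `url.URL` values. A hypothetical caller might look like this; the example assumes `.ogg` is among the recognized audio extensions and is not code from the repository:

```go
package main

import (
	"fmt"
	"net/url"

	"unbewohnte/wecr/web"
)

func main() {
	pageBody := []byte(`<a href="/sounds/track.ogg">listen</a>`)

	from, err := url.Parse("https://example.com/music/")
	if err != nil {
		panic(err)
	}

	// collect every URL on the page that looks like audio
	for _, audioURL := range web.FindPageAudio(pageBody, *from) {
		fmt.Println(audioURL.String())
	}
}
```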
107 changes: 26 additions & 81 deletions src/web/documents.go
@@ -1,97 +1,42 @@
/*
Wecr - crawl the web for data
Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
"net/url"
"strings"
)

func HasDocumentExtention(url string) bool {
for _, extention := range DocumentExtentions {
if strings.HasSuffix(url, extention) {
return true
}
}

return false
}

// Tries to find docs' URLs on the page
func FindPageDocuments(pageBody []byte, from *url.URL) []string {
var urls []string
func FindPageDocuments(pageBody []byte, from url.URL) []url.URL {
var urls []url.URL

// for every element that has "src" attribute
for _, match := range tagSrcRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int

linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}

linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}

if linkEndIndex <= linkStartIndex+1 {
continue
}

link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}

linkResolved := ResolveLink(link, from.Host)
if HasDocumentExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageSrcLinks(pageBody, from) {
if HasDocumentExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}

// for every "a" element as well
for _, match := range tagHrefRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int

linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}

linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}

if linkEndIndex <= linkStartIndex+1 {
continue
}

link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}

linkResolved := ResolveLink(link, from.Host)
if HasDocumentExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageLinks(pageBody, from) {
if HasDocumentExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}
