Website mirror app with priority for response consistency.
It makes it easy to set up and run a mirror that copies content from another site and provides a near-exact browsing experience if the source server or network goes down.
- All web assets should be downloaded with their metadata intact (content type, etc.).
- Links should be followed with some restriction to save resources.
- Cached data should be refreshed periodically.
- A web server should be provided to serve visitors.
Run the command below, then go to http://localhost:8080/https/github.com/ to see the GitHub home page.
This mode mirrors any requested site, which is quite dangerous: do NOT deploy it to the public internet, or it may be abused.
go-sitemirror -port 8080
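The example URL above suggests that, with -port, the source scheme and host become the first two path segments of the mirror URL. The following is a minimal Go sketch of that mapping, assuming the layout holds for any absolute source URL. mirrorURL is a hypothetical helper, not part of go-sitemirror, and passing the query string through unchanged is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// mirrorURL is a hypothetical helper (not part of go-sitemirror).
// It builds the address at which a source URL would be served by a
// mirror started with -port, assuming the /{scheme}/{host}/{path}
// layout seen in the example above.
func mirrorURL(mirrorBase, source string) (string, error) {
	u, err := url.Parse(source)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("source URL must be absolute: %q", source)
	}

	path := u.EscapedPath()
	if path == "" {
		path = "/"
	}

	out := strings.TrimRight(mirrorBase, "/") + "/" + u.Scheme + "/" + u.Host + path
	if u.RawQuery != "" {
		// Assumption: the query string is passed through unchanged.
		out += "?" + u.RawQuery
	}
	return out, nil
}

func main() {
	got, err := mirrorURL("http://localhost:8080", "https://github.com/")
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // prints http://localhost:8080/https/github.com/
}
```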
To mirror a single site, run the command below, then go to http://localhost:8081/ to see the GitHub home page.
go-sitemirror -mirror https://github.com \
-mirror-port 8081 \
-auto-download-depth=0 \
-no-cross-host
- -auto-download-depth=0 to turn off the auto downloader
- -no-cross-host to not modify asset URLs from other domains
The same GitHub mirror can be run with Docker:
docker run --rm -it \
-p 8081:8081 \
-v "$PWD/cache:/cache" \
ghcr.io/daohoangson/go-sitemirror -mirror https://github.com \
-mirror-port 8081 \
-auto-download-depth=0 \
-no-cross-host
See PR #8 for a couple of deployed demos.
The fly.toml looks something like this:
app = "app-name"
[build]
image = "ghcr.io/daohoangson/go-sitemirror:latest"
[experimental]
cmd = ["go-sitemirror", "-mirror", "https://github.com", "-mirror-port", "80", "-auto-download-depth", "0", "-no-cross-host"]
[http_service]
internal_port = 80
force_https = true
min_machines_running = 0
-auto-download-depth=1:
Maximum link depth for auto downloads, default=1
-auto-refresh=0s:
Interval for url auto refreshes, default=no refresh
-cache-bump=1m0s:
Validity of cache bump
-cache-path="":
HTTP Cache path (default working directory)
-cache-ttl=10m0s:
Validity of cached data
-header=map[]:
Custom request header, must be 'key=value'
-http-timeout=10s:
HTTP request timeout
-log=4:
Logging output level
-mirror=[]:
URL to mirror, multiple urls are supported
-mirror-port=[]:
Port to mirror a single site, each port number should immediately follow its URL.
For url that doesn't have any port, it will still be mirrored but without a web server.
-no-cross-host=false:
Disable cross-host links
-port=-1:
Port to mirror all sites
-rewrite=map[]:
Link rewrites, must be 'source.domain.com=https://domain.com/some/path'
-whitelist=[]:
Restricted list of crawlable hosts
-workers=4:
Number of download workers
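The -mirror-port description above says each port must immediately follow its URL, and that a URL without a port is mirrored without a web server. Here is a minimal Go sketch of that pairing rule, not the tool's actual parser. The site type and pairMirrors function are hypothetical, https://example.com is a placeholder, and only the space-separated "-flag value" form is handled.

```go
package main

import "fmt"

// site pairs a mirrored URL with its optional port.
// An empty Port means the URL is mirrored without a web server.
// Both the type and its fields are hypothetical names.
type site struct {
	URL  string
	Port string
}

// pairMirrors walks the arguments in order and attaches a -mirror-port
// value only to the -mirror URL that comes directly before it.
func pairMirrors(args []string) []site {
	var sites []site
	justAddedMirror := false // true only right after a "-mirror URL" pair

	for i := 0; i < len(args); i++ {
		hasValue := i+1 < len(args)
		switch {
		case args[i] == "-mirror" && hasValue:
			sites = append(sites, site{URL: args[i+1]})
			justAddedMirror = true
			i++ // skip the URL value
		case args[i] == "-mirror-port" && hasValue:
			if justAddedMirror {
				sites[len(sites)-1].Port = args[i+1]
			}
			justAddedMirror = false
			i++ // skip the port value
		default:
			justAddedMirror = false
		}
	}
	return sites
}

func main() {
	args := []string{
		"-mirror", "https://github.com", "-mirror-port", "8081",
		"-mirror", "https://example.com", // no port: cached only
	}
	for _, s := range pairMirrors(args) {
		if s.Port == "" {
			fmt.Printf("%s: mirrored without a web server\n", s.URL)
			continue
		}
		fmt.Printf("%s: served on port %s\n", s.URL, s.Port)
	}
}
```

In this sketch a port that does not directly follow a -mirror URL is ignored. How the real tool treats such a stray port is not stated in the usage text.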