The PMC Wayback Machine is a self-hosted archiving solution to capture screenshots and snapshots of any url.
This project mimics a subset of the functionality of the Internet Archive's Wayback Machine. Also inspired by Archivebox.
This tool helps us to monitor for regressions across our properties.
This project is a pre-release work in progress.
- Cleanup environment variables and naming.
- Update the Checkly tests to trigger snapshots.
- Update the PHP application to push to the S3 bucket.
There are three pieces of architecture that come together to make this work.
- A NextJS application for browsing captured screenshots and snapshots.
- A PHP application for capturing snapshots.
- Checkly tests for capturing screenshots.
A new capture can be triggered manually, or via Checkly schedule for any url.
The Playwright test will execute in Checkly. It will capture screenshots in a headless browser and uploads them to an S3 bucket. Before exiting, the test will also trigger the PHP application to capture a snapshot.
Note: The PHP application is not yet hooked up to the Checkly test. This is a work in progress.
Once uploaded to S3, you can use the NextJS application to browse your screenshots and snapshots.
Note: Support for searching, filtering, pagination, diff viewing, and more are on the planned list of improvements.
Checkly supports scheduling tests, which we use to mimic a cron job.
Note:
The URLs in __checks__/screenshots-group.check.ts
should be abstracted into its own
file.
The domains.yml
used in the PHP app will become redundant once integrated with the
Checkly tests.
This section will guide you through installation and configuration.
The root of this project includes the NextJS project. You can setup using Vercel or a similar host.
-
NEXT_PUBLIC_URL
- URL for the application. -
AWS_ACCESS_KEY_ID
- Access Key ID for IAM user with read/write to the bucket. -
AWS_REGION
- Bucket region. -
AWS_SECRET_ACCESS_KEY
- Secret Access Key for IAM user with read/write to the bucket. -
BUCKET_NAME
- Bucket name. -
NEXT_PUBLIC_BUCKET_NAME
- Bucket name again (needs to be deduplicated with above). -
GITHUB_TOKEN
- Used to trigger the dispatch workflow for one-off url captures. -
SLACK_CHANNEL_ID
- Slack Channel ID for new one-off captures. -
SLACK_TOKEN
- Slack Token with privledges to read/write to the correct channel.
Refer to the Checkly documentation on deploying tests to your account.
There is a GitHub workflow that can automate this for you if you fork this repo.
- PHP 7.4 or higher
- Composer
- Clone the repository:
git clone https://github.com/penske-media-corp/pmc-wayback-machine.git
- Install dependencies:
cd pmc-wayback-machine
composer install
- Create the required directories:
mkdir logs
mkdir frontend/public/proxied_pages
- Create a
domains.yml
file in the project root directory. This file should contain a list of domain names and URLs for the sites you want to take snapshots of. Refer to thedomains.yml.example
file for the expected format.
To create snapshots of websites, you can use the createsnapshots.php
script. This script accepts a list of domains as a comma-separated string via the $_POST['domains'] parameter.
Example usage:
# Testing all domains and pages defined in domains.yml
curl -d https://your-server.com/path/to/createsnapshots.php
# Testing pages from two domains (must be defined in domains.yml)
curl -X POST -d "domains=example.com,example.org" https://your-server.com/path/to/createsnapshots.php
# Triggering a test from a browser using GET parameters
https://your-server.com/path/to/createsnapshots.php?domains=example.com,example.org
This will create snapshots for the example.com
and example.org
domains.
If no domains are provided, the script will create snapshots for all domains specified in the domains.yml
configuration file.
Command line examples:
# Testing all domains from the command line
php pmc-wayback-machine/backend/src/createsnapshots.php
# Testing specific domains from the command line (must be defined in domains.yml)
php pmc-wayback-machine/backend/src/createsnapshots.php example.com example.org
To retrieve the snapshots of websites, you can use the getsnapshots.php
script. This script accepts the following GET parameters:
domain
: The domain to filter the snapshots by (optional).startDate
: The start date to filter the snapshots by (optional, format: YYYY-MM-DD).endDate
: The end date to filter the snapshots by (optional, format: YYYY-MM-DD).
Unlike createsnapshots.php
, getsnapshots.php
is intended to be triggered only from a browser and not the command line. Command line can still be used for testing but request parameters are not supported (it will always fetch all).
Example:
# Fetching the default results (3 months, all domains).
http://your-server.com/path/to/getsnapshots.php
# Fetching using GET parameters to filter the results. This will retrieve the snapshots for the example.com domain between May 1, 2023, and May 31, 2023.
http://your-server.com/path/to/getsnapshots.php?domain=example.com&startDate=2023-05-01&endDate=2023-05-31
This will retrieve snapshots for April 15, 2023. If no parameters are provided, the script will retrieve snapshots for the default date set in DirectoryReader
.
This project includes unit tests written using PHPUnit. To run the tests, execute the following command from the project root directory:
cd backend
vendor/bin/phpunit tests/
You can configure the following options in the AsyncProxyFetcher
class:
MAX_CONCURRENT_JOBS
: Adjust this constant to change the maximum number of concurrent jobs for fetching URLs.
This project relies on the following third-party libraries:
guzzlehttp/guzzle
: A PHP HTTP client for sending HTTP requests.monolog/monolog
: A logging library for PHP.symfony/yaml
: The Symfony YAML component for parsing YAML files.
The application logs all errors and job states (success or failure) using the Monolog library. Logs are written to the logs/app.log
file in the project root directory.
This project is under proprietary license.
Bits and pieces plucked from https://github.com/tlinhart/s3-browser