Configure browsertrix proxies #1847

Merged: 58 commits, Oct 3, 2024
f0e67c8
backend: add ssh proxies configuration
vnznznz Jul 30, 2024
d96fff4
frontend: add wip ssh proxy selection
vnznznz Jul 30, 2024
2d3e9ef
scripts: add minikube utilities
vnznznz Jul 30, 2024
fca5886
ssh proxy: fix changing proxy in workflow editor
vnznznz Jul 30, 2024
25b813c
formatting
vnznznz Jul 30, 2024
425bed6
Merge branch 'main' into configure-socks-proxies
ikreymer Jul 30, 2024
80542df
cleanup: various renaming / simplifications, remove 'ssh' from names,…
ikreymer Jul 31, 2024
eb4f9f1
fixes: ensure proxyId defaults to "" if none
ikreymer Jul 31, 2024
ba07896
version: bump to 1.12.0-beta.0
ikreymer Jul 31, 2024
f0a3d11
fixes: ssh proxy - allow multiline known_hosts file
vnznznz Jul 31, 2024
e893f89
add proxy support for profiles!
ikreymer Jul 31, 2024
e59e1c8
make proxies more generic, can support ssh://, socks5:// and http://
ikreymer Aug 1, 2024
d575b87
show default proxy in `select-crawler-proxy` + misc visual fixes
vnznznz Aug 7, 2024
dbd51ed
Merge branch 'main' into configure-socks-proxies
ikreymer Aug 8, 2024
3969513
reformat
ikreymer Aug 8, 2024
d96ee8c
Merge branch 'main' into configure-socks-proxies
ikreymer Aug 9, 2024
bd43426
Merge branch 'main' into configure-socks-proxies
ikreymer Aug 15, 2024
ce71535
fix ui post frontend refactor, remove authstate
ikreymer Aug 15, 2024
c7b33fc
more removal of authstate, including from comments
ikreymer Aug 15, 2024
310b647
move proxy config to subchart, allow updating proxies without re-depl…
vnznznz Aug 15, 2024
e48a074
move passwd hack to main chart
vnznznz Aug 15, 2024
7266d1d
add missing docstring
vnznznz Aug 15, 2024
cfaa3b8
fix lint error
vnznznz Aug 29, 2024
8663875
proxies: add shared flag, org proxy settings
vnznznz Sep 2, 2024
c702ba7
proxies: fix backend bugs
vnznznz Sep 3, 2024
b63322c
frontend: add `proxy_not_found` error message
vnznznz Sep 3, 2024
b3dbfe1
frontend: add wip admin proxy gui
vnznznz Sep 3, 2024
2e5fa5f
add missing docstring
vnznznz Sep 3, 2024
379f0b7
Merge branch 'main' into configure-socks-proxies
ikreymer Sep 12, 2024
0cb5d0e
proxy UI fixes after merge
ikreymer Sep 12, 2024
f591b4c
use proxyId from existing profile when running profile browser for ex…
ikreymer Sep 17, 2024
e08500a
proxies subchart: default to 'crawlers' namespace
ikreymer Sep 18, 2024
8d54e28
Merge branch 'main' into configure-socks-proxies
ikreymer Sep 18, 2024
9549123
backend: unpin motor dependency, fixes ImportError on backend start
vnznznz Sep 20, 2024
ca37b2b
backend: improve `get_all_crawler_proxies` endpoint path
vnznznz Sep 20, 2024
d958fa6
backend: disable org shared proxies by default
vnznznz Sep 20, 2024
eaff240
frontend: few more labels to org proxy admin modal
vnznznz Sep 20, 2024
827023a
frontend: misc text changes
vnznznz Sep 20, 2024
d0839b4
ensure proxyId saved on Profile
ikreymer Sep 20, 2024
4214572
Merge branch 'main' into configure-socks-proxies
ikreymer Sep 20, 2024
ae3e909
ensure proxyId is passed through to profile creation
ikreymer Sep 20, 2024
7b052b5
add proxy selector to org defaults
ikreymer Sep 21, 2024
81b07a6
form name fix
ikreymer Sep 21, 2024
8925a2b
fix proxy clearing
ikreymer Sep 21, 2024
4477a1f
misc tweaks: fix workflow default, EmailStr cast, add comments for bt…
ikreymer Sep 21, 2024
f94f31b
Merge branch 'main' into configure-socks-proxies
ikreymer Sep 25, 2024
4dc72a9
reextract strings
ikreymer Sep 25, 2024
4fd3631
WIP: Start adding documentation
tw4l Sep 26, 2024
27c753e
adjust placement of socks proxy to be below profiles
ikreymer Oct 1, 2024
74fa4a8
ensure proxyId included in cronjob, skip cronjob if proxy is missing
ikreymer Oct 2, 2024
68571db
lint fixes
ikreymer Oct 2, 2024
1b3c5dc
Update documentation based on review comments
tw4l Oct 2, 2024
f192bbd
Wordsmith docs
tw4l Oct 2, 2024
a07b4c6
More wordsmithing
tw4l Oct 2, 2024
7214895
update proxy docs
ikreymer Oct 3, 2024
93feaf2
update docs, add proxies subchart to release
ikreymer Oct 3, 2024
c90bc0a
more docs tweaks
ikreymer Oct 3, 2024
3e5302c
rename proxies-passwd-hack -> force-user-and-group-name for clarity
ikreymer Oct 3, 2024
2 changes: 1 addition & 1 deletion docs/deploy/customization.md
@@ -149,4 +149,4 @@ Browsertrix has the ability to cryptographically sign WACZ files with [Authsign]

## Enable Open Registration

You can enable sign-ups by setting `registration_enabled` to `"1"`. Once enabled, your users can register by visiting `/sign-up`.
You can enable sign-ups by setting `registration_enabled` to `"1"`. Once enabled, your users can register by visiting `/sign-up`.
2 changes: 1 addition & 1 deletion docs/deploy/index.md
@@ -13,6 +13,6 @@ The main requirements for Browsertrix are:
- A Kubernetes Cluster
- [Helm 3](https://helm.sh/) (package manager for Kubernetes)

We have prepared a [Local Deployment Guide](local.md) which covers several options for testing Browsertrix locally on a single machine, as well as a [Production (Self-Hosted and Cloud) Deployment](remote.md) guide to help with setting up Browsertrix in different production scenarios. Information about configuring storage, crawler channels, and other details in local or production deployments is in the [Customizing Browsertrix Deployment Guide](customization.md).
We have prepared a [Local Deployment Guide](local.md) which covers several options for testing Browsertrix locally on a single machine, as well as a [Production (Self-Hosted and Cloud) Deployment](remote.md) guide to help with setting up Browsertrix in different production scenarios. Information about configuring storage, crawler channels, and other details in local or production deployments is in the [Customizing Browsertrix Deployment Guide](customization.md). Information about configuring proxies to use with Browsertrix can be found in the [Configuring Proxies](proxies.md) guide.

Details on managing org export and import for existing clusters can be found in the [Org Import & Export](admin/org-import-export.md) guide.
27 changes: 27 additions & 0 deletions docs/deploy/proxies.md
@@ -0,0 +1,27 @@
# Configuring Proxies

Browsertrix can be configured to direct crawling traffic through dedicated proxy servers, so that websites can be crawled from a specific geographic location regardless of where Browsertrix itself is deployed.

This guide covers how to set up proxy servers for use with Browsertrix, as well as how to configure Browsertrix to make those proxies available.

## Proxy Configuration

Browsertrix supports crawling through HTTP and SOCKS5 proxies, including through a SOCKS5 proxy over an SSH tunnel. For more information on what is supported in the underlying Browsertrix Crawler, see the [Browsertrix Crawler documentation](https://crawler.docs.browsertrix.com/user-guide/proxies/).

Many commercial proxy services exist. If you are planning to use commercially-provided proxies, continue to [Browsertrix Configuration](#browsertrix-configuration) below.

To set up your own proxy server to use with Browsertrix as SOCKS5 over SSH, you first need a physical or virtual server to act as the proxy. Once you have access to this remote machine, add the public key of a public/private key pair (we recommend using a new ECDSA key pair) to it to support SSH connections. You will need to supply the corresponding private key to Browsertrix in [Browsertrix Configuration](#browsertrix-configuration) below.

(TODO: More technical setup details as needed)

## Browsertrix Configuration

Proxies are configured in Browsertrix through a separate deployment and subchart. This enables easier updates to available proxy servers without needing to redeploy the entire Browsertrix application.

To add or update proxies in your Browsertrix deployment, modify the `btrix-proxies` section of the main Helm chart or your local override.

First, set `enabled` to `true`, which will enable deploying proxy servers.

Next, provide the details of each proxy server that you want available within Browsertrix in the `proxies` list. Minimally, an id, connection string URL, label, and two-letter country code must be set for each proxy. If you want a particular proxy to be shared and potentially available to all organizations on a Browsertrix deployment, set `shared` to `true`. For SSH proxy servers, an `ssh_private_key` is required, and the contents of a known hosts file can additionally be provided to help secure a connection.
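
As a rough illustration of the settings described above, a `btrix-proxies` override might look something like the sketch below. Only `enabled`, `proxies`, `shared`, and `ssh_private_key` are named in this guide; the field names `id`, `label`, `country_code`, and `url`, along with every value shown, are assumptions made for illustration and should be checked against the chart's own `values.yaml`.

```yaml
# Hypothetical sketch of a local override file; not a definitive reference.
btrix-proxies:
  enabled: true                      # enables deploying the configured proxies
  proxies:
    - id: example-proxy-1            # assumed field name; placeholder identifier
      label: Example Proxy           # assumed field name; placeholder label
      country_code: US               # assumed field name; two-letter country code
      url: ssh://proxy-user@proxy.example.com   # assumed field name; placeholder host
      shared: true                   # potentially available to all organizations
      ssh_private_key: |             # required for SSH proxy servers
        -----BEGIN OPENSSH PRIVATE KEY-----
        <private key contents>
        -----END OPENSSH PRIVATE KEY-----
      # The contents of a known hosts file can also be supplied for SSH
      # proxies; the key name for that setting is not given in this guide.
```

Keeping proxy definitions in their own section matches the subchart design described above: the list can change without touching the rest of the deployment values.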

Once all proxy details are set, deploy the proxies by (TODO: add these details)
6 changes: 6 additions & 0 deletions docs/user-guide/workflow-setup.md
@@ -213,6 +213,12 @@ Sets the browser's [user agent](https://developer.mozilla.org/en-US/docs/Web/HTT

Sets the browser's language setting. Useful for crawling websites that detect the browser's language setting and serve content accordingly.

### Proxy

Sets the proxy server that [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) will direct traffic through while crawling. When a proxy is selected, crawled websites will see traffic as coming from the IP address of the proxy rather than where the Browsertrix Crawler node is deployed.

This setting will only be shown if proxies are available for use.

## Scheduling

Automatically start crawls periodically on a daily, weekly, or monthly schedule.