How to build and run in Docker
This document describes how to pull a pre-built OpenWayback image from DockerHub, or build one locally from source, and run it to serve WARC files, all within Docker. This is handy for development and testing in different environments without polluting the host machine with different versions of dependencies. The OpenWayback source code includes a Dockerfile. The generated Docker image is kept minimal, which makes it suitable for running in production as well.
Docker is required (version 17.05 or later for building the image).
OpenWayback provides up-to-date official Docker images on DockerHub that are built automatically from the source on GitHub.
The latest tag points to the latest stable release, while the master tag points to an image built from the bleeding-edge code on the master branch of the repo.
These two tags are overwritten whenever a corresponding build completes successfully.
In contrast, versioned tags such as openwayback-2.4.0 are meant to be permanent.
In order to run a test instance of OpenWayback, we first need to prepare the environment.
The default configuration of OpenWayback uses the automatic BDB Indexer and expects WARC files at ${WAYBACK_BASEDIR}/files1/ or ${WAYBACK_BASEDIR}/files2/.
By default, WAYBACK_BASEDIR is set to the /data volume in the Docker image.
Create the necessary directory structure on the host machine for testing and populate it with some test WARC files.
$ mkdir -p /tmp/owb/files1
$ wget -P /tmp/owb/files1/ https://github.com/iipc/openwayback-sample-overlay/raw/master/sample/warcs/example.com.warc.gz
In the above example, we created a folder for testing at /tmp/owb/files1 and downloaded a sample WARC file named example.com.warc.gz into that folder using wget.
Alternatively, if you have any WARC files available locally, copy them into that folder.
$ cp /path/to/sample/*.warc /tmp/owb/files1/
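Before starting a container, it can help to confirm that the directory actually contains WARC files. The following is a minimal sketch, not part of OpenWayback: count_warcs and the WARC_DIR variable are hypothetical names used only for illustration, and the check assumes the files end in .warc or .warc.gz.

```shell
#!/bin/sh
# count_warcs DIR
# Hypothetical helper: count *.warc and *.warc.gz files directly inside DIR.
count_warcs() {
    dir=$1
    n=0
    for f in "$dir"/*.warc "$dir"/*.warc.gz; do
        # An unmatched glob stays literal, so check that the file really exists.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Defaults to the directory used in this guide; override by setting WARC_DIR.
WARC_DIR=${WARC_DIR:-/tmp/owb/files1}
echo "Found $(count_warcs "$WARC_DIR") WARC file(s) in $WARC_DIR"
```

A count of zero suggests that the download or copy step did not place files where the container expects them.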
With WARC files in place, we can pull the iipc/openwayback image from DockerHub and then run a Docker container with appropriately mounted volumes and port mapping. By default, the container runs the Tomcat server.
$ docker pull iipc/openwayback
$ docker container run -it --rm -v /tmp/owb:/data -p 8080:8080 iipc/openwayback
Once the WARC files are indexed, they should be ready for lookup at http://localhost:8080/wayback/.
If you used the sample example.com.warc.gz file above, you can search for the http://example.com/ URL using the search form and expect to find a capture of it.
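When the container is started from a script, the web application may not answer immediately. The sketch below polls a URL with curl until it responds. wait_for_url is a hypothetical helper name, and the default attempt count and delay are arbitrary assumptions. A successful response only shows that the web application is reachable, not that indexing has finished.

```shell
#!/bin/sh
# wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: poll URL until it returns a successful HTTP status.
wait_for_url() {
    url=$1
    attempts=${2:-30}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: fail on HTTP errors, -s: silent, -o: discard the response body
        if curl -fs --max-time 5 -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example, to be run while the container is up:
#   wait_for_url http://localhost:8080/wayback/ && echo "OpenWayback is reachable"
```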
OpenWayback allows certain configuration overrides through environment variables that can be set when running a container, but these customizations are very limited. The variables are shown here with their default values:
WAYBACK_HOME=/usr/local/tomcat/webapps/ROOT/WEB-INF
WAYBACK_BASEDIR=/data
WAYBACK_URL_SCHEME=http
WAYBACK_URL_HOST=localhost
WAYBACK_URL_PORT=8080
WAYBACK_URL_PREFIX=http://localhost:8080
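As a rough illustration of overriding these variables, the sketch below writes them to a file and prints a docker command that passes the file with the --env-file option. The host name wayback.example.org and the file name owb.env are hypothetical. The sketch also assumes that Tomcat listens on port 8080 inside the container and that WAYBACK_URL_PREFIX has to be set explicitly rather than being derived from the other three values.

```shell
#!/bin/sh
# Hypothetical public address of the test instance; adjust to your setup.
SCHEME=http
HOST=wayback.example.org
PORT=8080

# Write the overrides in the KEY=VALUE format that --env-file expects.
ENV_FILE=${ENV_FILE:-./owb.env}
cat > "$ENV_FILE" <<EOF
WAYBACK_URL_SCHEME=${SCHEME}
WAYBACK_URL_HOST=${HOST}
WAYBACK_URL_PORT=${PORT}
WAYBACK_URL_PREFIX=${SCHEME}://${HOST}:${PORT}
EOF

# Print the command for review instead of running it directly.
echo "docker container run -it --rm -v /tmp/owb:/data -p ${PORT}:8080 --env-file $ENV_FILE iipc/openwayback"
```

Individual variables can also be passed with repeated -e NAME=VALUE options instead of a file.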
However, by strategically mounting certain volumes, it is possible to run the OpenWayback server with custom configuration files.
$ docker container run -it --rm -p 8080:8080 \
-v /tmp/owb:/data \
-v /path/to/custom/wayback.xml:/usr/local/tomcat/webapps/ROOT/WEB-INF/wayback.xml \
-v /path/to/custom/CDXCollection.xml:/usr/local/tomcat/webapps/ROOT/WEB-INF/CDXCollection.xml \
iipc/openwayback
This way of mounting configuration files can be handy for testing. However, for production purposes it is better to create a derived image and override the configuration files with custom ones. For more details on custom configuration, read the basic configuration documentation.
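A derived image of this kind could be set up roughly as follows. This is only a sketch: the build directory owb-custom and the image name my-openwayback are hypothetical, and the COPY targets simply reuse the paths from the volume mounts shown above.

```shell
#!/bin/sh
# Hypothetical build directory that will hold the Dockerfile and custom configs.
BUILD_DIR=${BUILD_DIR:-./owb-custom}
mkdir -p "$BUILD_DIR"

# Dockerfile for the derived image; it overwrites the stock configuration files.
cat > "$BUILD_DIR/Dockerfile" <<'EOF'
FROM iipc/openwayback
COPY wayback.xml /usr/local/tomcat/webapps/ROOT/WEB-INF/wayback.xml
COPY CDXCollection.xml /usr/local/tomcat/webapps/ROOT/WEB-INF/CDXCollection.xml
EOF

echo "Copy your wayback.xml and CDXCollection.xml into $BUILD_DIR, then run:"
echo "  docker image build -t my-openwayback $BUILD_DIR"
```

Pinning the FROM line to a versioned tag instead of the default tag would keep the derived image reproducible.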
While the official DockerHub-hosted iipc/openwayback[:<TAG>] images are quicker and easier to use, each one is built with whatever versions of Maven, JDK, Tomcat and JRE were the latest stable ones at build time.
One can build a custom image locally with a customized environment while still using the Dockerfile provided in the OpenWayback repo.
Building the image locally is also useful for development and testing with code changes that have not been pushed to the upstream repo yet.
First, acquire the source code.
$ git clone https://github.com/iipc/openwayback.git
$ cd openwayback
Make any changes to the source code if needed. Then build the Docker image.
$ docker image build -t iipc/openwayback .
This will download dependencies, compile the code, run tests, package the result, and place the necessary components in the appropriate places to build a minimal Docker image named iipc/openwayback.
This process may take a while, depending on network bandwidth and processor speed.
It uses Docker's multi-stage build feature to exclude the compile-time environment and dependencies from the final image, which makes the image both more secure and smaller.
By default, the source is built using the latest versions of Maven and JDK, and the image is then packaged with the latest versions of Tomcat and JRE.
However, it is possible to build and package with custom combinations of these dependencies using the MAVEN_TAG and TOMCAT_TAG build arguments.
These variations can be helpful for both testing and production needs without making any changes to the Dockerfile.
$ docker image build \
--build-arg=MAVEN_TAG=3.5-jdk-7 \
--build-arg=TOMCAT_TAG=7-jre7-alpine \
-t iipc/openwayback:custom .
The above command builds an image named iipc/openwayback with the tag custom, where the source code is built using Maven 3.5 with JDK 7 and the built artifacts are then packaged in a small Alpine Linux image with Tomcat 7 and JRE 7.
See the available values of the MAVEN_TAG and TOMCAT_TAG build arguments.
Now, run the OpenWayback server using this custom image and access it from a web browser.
$ docker container run -it --rm -v /tmp/owb:/data -p 8080:8080 iipc/openwayback:custom
The Docker image contains various executable utilities, along with their dependencies, that can be used in one-off mode.
The following command illustrates one possible use of cdx-indexer to index WARC files into CDX files on the host machine, using appropriate volume mounting and a one-off container.
$ docker container run --rm -v /tmp/owb:/data iipc/openwayback cdx-indexer /data/files1/sample1.warc > /tmp/owb/index1.cdx
Alternatively, access the bash prompt of the container to run utility scripts inside it or to debug.
$ docker container run -it --rm -v /tmp/owb:/data iipc/openwayback bash
[CONTAINER ID]# cdx-indexer /data/files1/sample1.warc > /data/index1.cdx
IMPORTANT: If you use the sort command to sort CDX files, you must set the environment variable LC_ALL=C.
This makes sort compare lines byte by byte, which matches how OpenWayback expects CDX indexes to be sorted.
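The effect of LC_ALL=C can be seen in a small sketch. The lines below are shortened placeholders rather than complete CDX records, and the file names are hypothetical. In the C locale, uppercase letters sort before lowercase ones because comparison follows byte values.

```shell
#!/bin/sh
# Work in a scratch directory so the demonstration does not touch real indexes.
WORK=$(mktemp -d)
printf '%s\n' 'org,example)/b' 'org,example)/B' 'org,example)/a' > "$WORK/unsorted.cdx"

# Setting LC_ALL=C on the command line affects only this sort invocation.
LC_ALL=C sort "$WORK/unsorted.cdx" > "$WORK/sorted.cdx"

cat "$WORK/sorted.cdx"
# Prints:
#   org,example)/B
#   org,example)/a
#   org,example)/b
```

The same prefix applies when combining several index files, for example LC_ALL=C sort index1.cdx index2.cdx > merged.cdx.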
For more details, read the description of all packaged utility scripts.
Copyright © 2005-2022 [tonazol](http://netpreserve.org/). CC-BY. https://github.com/iipc/openwayback.wiki.git