This repository provides the tools to build and maintain a containerized version of the Vertica Kafka Scheduler, a standalone Java application that automatically consumes data from one or more Kafka topics and then loads the structured data into Vertica. The scheduler is controlled by the vkconfig command line script.
You can download the pre-built vertica/kafka-scheduler image, or you can use the Dockerfile in this repo to build the image locally. The image is based on alpine:3.14 and includes the openjdk8-jre.
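As a minimal sketch of the first option, you can pull the pre-built image directly from Docker Hub. The `latest` tag below is only an example; pin a tag that matches your Vertica server version:

```shell
# Pull the pre-built scheduler image from Docker Hub.
# "latest" is an example tag; use the tag matching your Vertica server version.
docker pull vertica/kafka-scheduler:latest
```

To build the image locally instead, use the Makefile's build target described later in this document.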
For in-depth details about streaming data with Vertica and Apache Kafka, see Apache Kafka Integration in the Vertica documentation.
- Docker Desktop, Docker Engine, or another container runtime
- Vertica installation or Vertica server image
- vertica/kafka-scheduler image
- (Optional) Docker Compose to run example.sh
To use this repository, clone the vertica/vertica-containers repository and navigate to the vertica-kafka-scheduler subdirectory:
$ git clone https://github.com/vertica/vertica-containers.git
$ cd vertica-kafka-scheduler
The following commands provide information about the options you can use to configure a scheduler and its components:
$ docker run vertica/kafka-scheduler vkconfig scheduler --help
$ docker run vertica/kafka-scheduler vkconfig cluster --help
$ docker run vertica/kafka-scheduler vkconfig source --help
$ docker run vertica/kafka-scheduler vkconfig target --help
$ docker run vertica/kafka-scheduler vkconfig load-spec --help
$ docker run vertica/kafka-scheduler vkconfig microbatch --help
To view the options for querying runtime statistics on the scheduler, enter the following:
$ docker run vertica/kafka-scheduler vkconfig statistics --help
For in-depth details, see Configuring a scheduler.
To launch a scheduler, execute the following command from the /vertica-kafka-scheduler directory:
$ docker run -it \
-v $PWD/vkconfig.conf:/etc/vkconfig.conf \
-v $PWD/vkafka-log-config-debug.xml:/opt/vertica/packages/kafka/config/vkafka-log-config.xml \
-v $PWD/log:/opt/vertica/log \
--user $(perl -E '@s=stat "'"$PWD/log"'"; say "$s[4]:$s[5]"') \
vertica/kafka-scheduler \
vkconfig launch --conf /etc/vkconfig.conf &
For in-depth details, see Launch a scheduler.
This repository contains example.sh, a demonstration of a running scheduler. It uses containers and Docker Compose to create a complete Vertica/Kafka environment, automatically loads JSON-formatted test data into a Flex table, logs each action to the console, and then removes any build artifacts. To run the demo, use the following command:
$ make test
For in-depth details, see example.sh.
This repository contains the following utilities to help maintain and build a Vertica scheduler container.
The Makefile contains the following targets:
- make help: Displays the help for the Makefile.
- make version: Displays the Vertica version that will be used in the build process.
- make java: Copies the local install of the Java libraries from /opt/vertica/java and saves them in a /java directory in the /vertica-kafka-scheduler directory.
- make kafka: Copies the local install of the Kafka Scheduler from /opt/vertica/packages/kafka and saves it in a /kafka directory in the /vertica-kafka-scheduler directory.
- make build: Builds the container image.
- make push: Pushes the custom container image to the remote Docker Hub repository.
- make test: Runs example.sh to validate the vkconfig configuration.
A Compose file that starts the following services, each as a container:
The Compose file creates the scheduler network so that the containers can communicate with each other.
A sample configuration file. You can customize this file by replacing the default values or adding more vkconfig script options.
A bash script that demonstrates a running scheduler. It creates a complete Vertica/Kafka environment with Docker Compose, then creates JSON-formatted test data that the scheduler automatically loads from a Kafka topic into a Vertica Flex table. Each action is logged to the console.
The demonstration performs the following steps:
- Sets up a test environment with docker-compose.yaml. The environment includes the following:
- A Vertica database
- Required database packages
- Database table
- Database user
- Resource pool
- Two Kafka topics
- Downloads the vertica/kafka-scheduler image, then configures a scheduler with the following components:
- Target Flex table
- Parser
- Kafka source
- Two Kafka topics
- Two microbatches (one for each Kafka topic)
- Launches the scheduler.
- Generates and sends JSON-formatted test data to Kafka.
- Displays the test data in the Flex table.
- Gracefully shuts down the scheduler.
- Removes the images pulled with the Compose file.
A scheduler is composed of individual components that define the load frequency, data type, and Vertica and Kafka environments. After you define properties for each component, launch the scheduler with configuration and logging utilities mounted as volumes.
Each component and the running scheduler process require access to the same database and environment settings. To provide these settings, create a configuration file that provides the following:
- username: Vertica database user that runs the scheduler.
- dbhost: Vertica database host or IP address.
- dbport: Port used to connect to the Vertica database.
- config-schema: Name of the scheduler's schema.
The components and running scheduler process access configuration file values from within the scheduler container filesystem, so you must mount the configuration file as a volume. The scheduler expects the configuration file to be named vkconfig.conf and stored in the /etc directory. For example:
$ docker run -v <local-config.conf>:/etc/vkconfig.conf vertica/kafka-scheduler vkconfig <component> --conf /etc/vkconfig.conf <options>
For a sample configuration file, see example.conf in this repository.
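As a sketch, you can generate a minimal configuration file containing the settings listed above. All values below are placeholders for illustration, not defaults; replace them with your environment's settings:

```shell
# Write a minimal scheduler configuration file.
# Every value below is an example placeholder.
cat > vkconfig.conf <<'EOF'
username=dbadmin
dbhost=vertica.example.com
dbport=5433
config-schema=stream_config
EOF
```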
Vertica recommends that you create a scheduler and define its components as a separate step from launching the scheduler. This ensures that the scheduler configuration persists in the event of planned or unplanned system downtime.
A scheduler requires the following components:
- scheduler: The scheduler itself.
- target: The Vertica table that receives the streaming data.
- load-spec: Defines the parser for the streaming data.
- cluster: Details about the Kafka server.
- source: A Kafka topic that sends data to Vertica.
- microbatch: Combines each of the preceding components into a single COPY statement that the scheduler executes to load data into Vertica.
NOTE Additionally, the scheduler container includes the statistics component. This component does not configure the scheduler; it queries the stream_microbatch_history table for runtime statistics.
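For example, a sketch of querying runtime statistics, following the same invocation pattern as the component commands (any filtering options beyond --conf would come from vkconfig statistics --help):

```shell
# Query the scheduler's runtime statistics using the mounted configuration file.
docker run -v $PWD/vkconfig.conf:/etc/vkconfig.conf \
    vertica/kafka-scheduler \
    vkconfig statistics --conf /etc/vkconfig.conf
```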
The following command returns a list of all available options for a component:
$ docker run vertica/kafka-scheduler vkconfig <component> --help
For example, to view the description of each microbatch option, enter the following:
$ docker run vertica/kafka-scheduler vkconfig microbatch --help
To create a scheduler and its components, execute a docker run command that does the following:
- Mounts a configuration file as a volume.
- Defines the scheduler image name and version.
- Defines scheduler components as a single string with the bash -c script option.
The scheduler component string must first define the scheduler itself, and then add each additional required component with the --add option. Separate components with semicolons. You must pass the --conf /etc/vkconfig.conf option to each component definition to provide environment settings.
The following command provides an example format:
$ docker run \
-v <local-config.conf>:/etc/vkconfig.conf \
vertica/kafka-scheduler:<version> bash -c "
vkconfig scheduler \
--conf /etc/vkconfig.conf \
<scheduler-options> ...; \
vkconfig <component-1> --add \
--conf /etc/vkconfig.conf \
<component-1-options> ...; \
vkconfig <component-2> --add \
--conf /etc/vkconfig.conf \
<component-2-options> ...; \
...
"
For a complete example, see the Set up Scheduler section in the example.sh script in this repository. The following snippet from that section defines the first microbatch component:
$ docker run \
...
vkconfig microbatch --add \
--conf /etc/vkconfig.conf \
--microbatch KafkaBatch1 \
--add-source KafkaTopic1 \
--add-source-cluster KafkaCluster \
--target-schema public \
--target-table KafkaFlex \
--rejection-schema public \
--rejection-table KafkaFlex_rej \
--load-spec KafkaSpec; \
...
After you create a scheduler, launch it to begin scheduling microbatches. To launch a scheduler, execute a docker run command that does the following:
- Mounts a configuration file as a volume.
- Specifies the scheduler image name.
- Mounts vkafka-log-config-debug.xml in the current directory into the container's /opt/vertica/packages/kafka/config directory. This file configures log messages to help troubleshoot scheduler issues.
- Mounts log in the current directory into the container's /opt/vertica/log directory. This allows log messages to be written to the local log directory, assuming that vkafka-log-config-debug.xml configures log output in /opt/vertica/log.
- Passes the Docker --user option to specify the UID that has write access on the /log directory.
The following command provides an example format. Execute this command from the top-level directory of your cloned repository:
$ docker run -it \
-v $PWD/vkconfig.conf:/etc/vkconfig.conf \
-v $PWD/vkafka-log-config-debug.xml:/opt/vertica/packages/kafka/config/vkafka-log-config.xml \
-v $PWD/log:/opt/vertica/log \
--user $(perl -E '@s=stat "'"$PWD/log"'"; say "$s[4]:$s[5]"') \
vertica/kafka-scheduler \
vkconfig launch --conf /etc/vkconfig.conf &
Additionally, the preceding command does the following:
- Defines the --user value with a Perl script that extracts the /log directory's owner and group information, and then formats those values in user:group format.
- Uses the & operator to execute vkconfig launch as a background process.
In some circumstances, you might want to build a custom vertica/kafka-scheduler container. This repository provides a Makefile with targets that accept build variables to simplify the build process.
- A Vertica binary installation, or an extracted Vertica RPM: run rpm2cpio vertica.rpm | cpio -idmv, and then export VERTICA_INSTALL=./opt/vertica.
- Java libraries located in /vertica/java.
For additional information about Vertica and Java development, see Java SDK in the Vertica documentation.
Use the build target to create a custom container. Depending on your Vertica environment, you might need to include build variables described in the following table:
Variable | Description |
---|---|
VERTICA_INSTALL | The location of your Vertica binary installation. Define this variable if you want to copy the local install of the Java libraries. Default: /opt/vertica |
VERTICA_VERSION | The Vertica version that you want to use to build the scheduler container. The scheduler version must match the Vertica database version. Default: The version of the Vertica binary that VERTICA_INSTALL points to. |
For example, if you installed Vertica in a custom directory, use the following command:
$ make build VERTICA_INSTALL=/path/to/vertica
In addition to the build target and variables, the Makefile provides the make java and make kafka targets, which extract Java and Kafka installation files from your local Vertica installation. For details, see Makefile.
The Makefile has a push target that builds and pushes your custom scheduler container to Docker Hub:
$ VERTICA_VERSION=latest make push