This document details the step-by-step actions to deploy a Jenkins server in a new Nomad cluster, using Docker containers for both the master server and the execution agents. It currently runs against the Recommenders Live account, although moving to a different account should be fairly straightforward.
All actions against AWS are handled by Terraform and (mostly) implement the Core Engineering Team's Compliance Framework requirements.
This step configures a new AWS account managed by Terraform and creates the basic components like VPC, subnets, Route53 zones, etc. The corresponding code is in the recs-core-infra repository. It comprises 3 sub-folders to be run in this order: bootstrap, util, main. Note that these folders share the Makefile, dev.tfvars, live.tfvars and variables.tf files through symbolic links.
The full bootstrap procedure is described in the recs-core-infra README file. Among other things, it creates the S3 bucket for storing Terraform state files, as well as a KMS key that will subsequently be used to encrypt secrets.
> cd recs-core-infra/bootstrap
> awsudo -u recs-live make init ENV=live
> awsudo -u recs-live make plan ENV=live
> awsudo -u recs-live make apply
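As an optional sanity check, the bootstrap outputs can be inspected with the AWS CLI. The exact bucket and alias names are defined by the Terraform code, so the expectations in the comments below are assumptions:
> awsudo -u recs-live aws s3 ls                                           # The Terraform state bucket should now be listed
> awsudo -u recs-live aws kms list-aliases --query "Aliases[].AliasName"  # The KMS key used for secrets should appear here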
This step creates the Utility VPC which will host the Nomad cluster. It creates a Bastion and a Puppet server using standard CE Terraform modules.
Before running Terraform, a puppetdb-pass entry needs to be present in the recs-util-secrets AWS Secrets Manager secret to store the Puppet DB password, as explained here. Terraform also needs to be able to automatically open an SSH connection to the Puppet server through the Bastion, as detailed here.
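For reference, checking and setting that entry from the command line could look roughly like this, assuming the secret stores its entries as JSON keys (check the linked instructions for the exact format expected by the Terraform code):
> awsudo -u recs-live aws secretsmanager get-secret-value --secret-id recs-util-secrets --query SecretString --output text
> awsudo -u recs-live aws secretsmanager put-secret-value --secret-id recs-util-secrets --secret-string '{"puppetdb-pass":"<puppet_db_password>"}'   # Overwrites the whole secret value: merge any existing keys into the JSON first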
> cd recs-core-infra/util
> awsudo -u recs-live make init ENV=live
> awsudo -u recs-live make plan ENV=live
> awsudo -u recs-live make apply
This phase only creates the Route53 entries and a couple of S3 buckets for now.
> cd recs-core-infra/main
> awsudo -u recs-live make init ENV=live
> awsudo -u recs-live make plan ENV=live
> awsudo -u recs-live make apply
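The resulting resources can be listed as a quick check (again, the exact names depend on the Terraform code):
> awsudo -u recs-live aws route53 list-hosted-zones --query "HostedZones[].Name"
> awsudo -u recs-live aws s3 ls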
This phase creates the Nomad cluster in the Utility VPC.
As a first step, the cluster configuration needs to be updated as follows.
- A unique cluster ID must be generated, typically with python -c 'import uuid; print(str(uuid.uuid4()))', and updated in the Terraform variables and Puppet configuration (servers and worker nodes).
- Certificates must be generated for the new environment and uploaded to S3 as described here.
- The Puppet configuration needs to be manually updated with the content of the EYAML-encrypted files generated in the tls folder by the previous step. On macOS a convenient way of doing this is to copy a file to the clipboard with pbcopy < puppet-conf-server.yaml, for instance, and then paste it at the end of the corresponding config file (recs-util-puppet-config/hiera/instance_classification/prod/roles/container_server.yaml in this case).
- Then the new Puppet config needs to be pushed to Gitlab with
> git commit -am 'Nomad config for Live'
> git push origin master
- and the Puppet server made to load the new files:
> ssh centos@<puppet_server_ip> sudo /usr/bin/r10k_with_lock deploy environment -p -v debug
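To confirm that the deployment picked up the new configuration, the deployed environments can be listed on the Puppet server; the path below is the standard Puppet code directory and is an assumption about this setup:
> ssh centos@<puppet_server_ip> sudo ls /etc/puppetlabs/code/environments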
Terraform can then be launched to build the cluster:
> cd recs-util-nomad/terraform
> awsudo -u recs-live make init ENV=live
> awsudo -u recs-live make plan ENV=live
> awsudo -u recs-live make apply
The Consul and Nomad server GUIs can then be accessed through an SSH tunnel to the Bastion server, which is brought up as follows:
> cd recs-util-nomad/nomad
> make connect ENV=live # Set up an SSH tunnel to the servers
> make gui ENV=live # Launch the Consul and Nomad GUIs in a new web browser window
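For reference, the connect target amounts to something like the following SSH tunnel; this is only a sketch with placeholder host names, assuming the default Consul (8500) and Nomad (4646) ports, and the Makefile in recs-util-nomad/nomad is authoritative:
> ssh -L 8500:<nomad_server_ip>:8500 -L 4646:<nomad_server_ip>:4646 <bastion_user>@<bastion_ip>   # GUIs are then reachable on the forwarded local ports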
The most likely reason for the Consul or Nomad servers failing to start is that the Puppet configuration created above (in recs-util-puppet-config) is incorrect. To check this, SSH to one of the server or worker nodes and inspect the content of the /etc/consul.d/config.json and /etc/nomad.d/config.json files as described below. Check that the values are correct and consistent between nodes. If necessary, correct the Puppet configuration and repeat the steps above.
> ssh admin@<cluster_instance_ip>
> sudo -s
> apt-get -y install jq
> jq . < /etc/consul.d/config.json
{
"acl_datacenter": "us-east-1",
"acl_default_policy": "deny",
"acl_down_policy": "extend-cache",
"acl_master_token": "e60d406f-6c3c-48de-8948-d0891d0e28b9",
"acl_token": "e60d406f-6c3c-48de-8948-d0891d0e28b9",
"advertise_addr": "10.188.240.241",
"bind_addr": "0.0.0.0",
"ca_file": "/etc/consul.d/tls/recs-ca.crt",
"cert_file": "/etc/consul.d/tls/consul-server.crt",
"client_addr": "0.0.0.0",
"data_dir": "/opt/consul",
"datacenter": "us-east-1",
"disable_remote_exec": true,
"enable_syslog": true,
"encrypt": "jniUSnfs+X14gKOUAaXzFg==",
"key_file": "/etc/consul.d/tls/consul-server.key",
"log_level": "INFO",
"ports": {
"http": -1,
"https": 8500
},
"retry_join": [
"provider=aws region=us-east-1 addr_type=private_v4 tag_key=ClusterName tag_value=container-f1fdef65-495e-4f99-8ab5-98dc59faffec"
],
"server": true,
"verify_incoming_https": false,
"verify_incoming_rpc": true,
"verify_outgoing": true
}
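Still on the same node, it is also worth looking at the state of the services themselves. Assuming the Puppet modules install Consul and Nomad as systemd units with those names (an assumption about this setup), their status and recent logs can be checked with:
> systemctl status consul nomad
> journalctl -u consul -u nomad --since "15 minutes ago"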
When the configuration is correct, the following CLI commands, run on a server node, should give results similar to the output below:
> ssh admin@<cluster_server_ip>
> sudo -Es consul members
Node Address Status Type Build Protocol DC Segment
ip-10-188-240-156 10.188.240.156:8301 alive server 1.1.0 2 us-east-1 <all>
ip-10-188-240-206 10.188.240.206:8301 alive server 1.1.0 2 us-east-1 <all>
ip-10-188-241-241 10.188.241.241:8301 alive server 1.1.0 2 us-east-1 <all>
ip-10-188-240-137 10.188.240.137:8301 alive client 1.1.0 2 us-east-1 <default>
ip-10-188-241-185 10.188.241.185:8301 alive client 1.1.0 2 us-east-1 <default>
> sudo -Es nomad server members
Name Address Port Status Leader Protocol Build Datacenter Region
ip-10-188-240-156.us-east 10.188.240.156 4648 alive false 2 0.8.3 us-east-1 us-east
ip-10-188-240-206.us-east 10.188.240.206 4648 alive true 2 0.8.3 us-east-1 us-east
ip-10-188-241-241.us-east 10.188.241.241 4648 alive false 2 0.8.3 us-east-1 us-east
> sudo -Es nomad node status
ID DC Name Class Drain Eligibility Status
c4950737 us-east-1 ip-10-188-240-137 private false eligible ready
7eb9690c us-east-1 ip-10-188-241-185 private false eligible ready
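Since Jenkins and its agents run as Docker containers, it can also be worth confirming that the Docker driver is available on the worker nodes before going further; the attribute name below is an assumption based on standard Nomad behaviour:
> sudo -Es nomad node status -verbose c4950737 | grep -i docker   # driver.docker should appear among the node attributes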
At that point the Nomad cluster is up and ready to be used. The next step is to deploy the Jenkins server itself as a Nomad job in it.
Docker images need to be created for the Jenkins master and agents, and published to the corresponding AWS ECR repositories. This is done with the following commands:
> cd recs-util-nomad/docker/jenkins-master
> make pull # Retrieve latest version of the base Docker image
> make publish ENV=live # Build and push Jenkins master image to ECR
> cd recs-util-nomad/docker/jenkins-agent-terraform
> make pull
> make publish ENV=live
and so on for the other agent images.
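Under the hood the publish target roughly amounts to a Docker build and push to ECR. The sketch below is for illustration only: the account ID, region and repository name are placeholders, and the Makefile may use a different login mechanism depending on the AWS CLI version:
> aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account_id>.dkr.ecr.us-east-1.amazonaws.com
> docker build -t <account_id>.dkr.ecr.us-east-1.amazonaws.com/jenkins-master:latest .
> docker push <account_id>.dkr.ecr.us-east-1.amazonaws.com/jenkins-master:latest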
Note that the Jenkins configuration files in recs-util-nomad/docker/jenkins-master/ref/init.groovy.d are fairly specific to the Recommenders team and would need to be adapted to a new environment. More about this below.
Jenkins is launched as a Nomad job as follows. We first launch Fabio, which is a small load balancer for Nomad jobs.
> cd recs-util-nomad/nomad
> make import ENV=live # Import certificates required to connect to the cluster
> make deploy ENV=live JOB=fabio # Launch Fabio
> make deploy ENV=live JOB=jenkins # Launch Jenkins
At that point you should be able to connect to the Jenkins server at https://jenkins.util.recs.d.elsevier.com. You can also have a look at the Fabio routing table at https://fabio.util.recs.d.elsevier.com/routes.
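A quick command-line check that the server is answering behind Fabio is to request the login page and look at the HTTP response code:
> curl -skI https://jenkins.util.recs.d.elsevier.com/login   # Should return an HTTP status (e.g. 200) once the Jenkins job is healthy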
As mentioned above, while the Nomad configuration described up to this point is fairly generic and should be reusable with few changes, the Jenkins configuration itself is fairly specific to the Recommenders environment and would likely need to be extensively modified for a new deployment. The following paragraphs provide some high-level documentation that should help in doing that.
When a new Jenkins image is launched, the first thing that happens is that all files in the /usr/share/jenkins/ref folder are copied to Jenkins's home, /var/jenkins_home. Additionally, if a file has the .override extension, it replaces its existing counterpart there (a simplified sketch of this copy logic is shown after the list). The files that are copied include the following directories, described in more detail below:
- init.groovy.d: initialisation scripts
- plugins: local plugins
- jobs: predefined jobs
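The sketch below illustrates, in simplified form, the copy behaviour just described; the real logic lives in the official Jenkins image's entrypoint support script, so this is for understanding only:
# Simplified illustration of the reference-copy behaviour (not the actual entrypoint code)
for f in $(find /usr/share/jenkins/ref -type f); do
    rel="${f#/usr/share/jenkins/ref/}"            # Path relative to the ref folder
    dest="/var/jenkins_home/${rel%.override}"     # A trailing .override is stripped from the target name
    if [ ! -e "$dest" ] || [ "$f" != "${f%.override}" ]; then   # Copy if missing, or always for .override files
        mkdir -p "$(dirname "$dest")"
        cp "$f" "$dest"
    fi
done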
When the Jenkins server starts, it first executes all .groovy files in the /var/jenkins_home/init.groovy.d directory. In our case the files are as follows:
- g10_security.groovy: applies a number of security-related settings, including disabling all unsafe master-agent communication protocols;
- g12_users.groovy: parses the /usr/share/jenkins/creds/jenkins_users file and creates the corresponding local users;
- g14_credentials.groovy: loads the .txt and .id_rsa files present in the /usr/share/jenkins/creds folder and registers them with the Jenkins credentials facility;
- g20_jekins_url.groovy: sets the public (for end users) and private (for agents) server URLs;
- g30_shared_libs.groovy: registers the shared libraries used for pipeline definitions;
- g32_nomad_cloud.groovy: creates a Nomad Cloud for the Nomad plugin;
- g40_slack.groovy: allows publishing to Slack;
- g99_jenkins_save.groovy: saves the updated Jenkins configuration.
This installation uses the Nomad plugin for scheduling agents on the Nomad cluster. Normally, required plugins are listed in the ref/plugins.txt file and retrieved automatically at startup, together with their dependencies, from the official Jenkins plugins repository. However, since the published version of the Nomad plugin is currently only 0.4, which lacks features like the volume mounts required for Docker-capable agents, we had to compile and install a more recent version as follows:
> git clone https://github.com/jenkinsci/nomad-plugin
> cd nomad-plugin
> vi pom.xml # Bump version number to e.g. 0.5.1
> mvn package
> mvn install
then copy target/nomad.hpi to the ref/plugins folder.
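Concretely, that last step looks like the following; the relative path depends on where the two repositories are checked out:
> cp target/nomad.hpi <path_to>/recs-util-nomad/docker/jenkins-master/ref/plugins/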
Only one Seed job is created there, using a classic XML configuration. It creates all other jobs at startup time by downloading the recs-util-jenkins-jobs repository and executing all .groovy files there using the Job DSL plugin.
All Jenkins jobs are handled by dedicated agents with specific capabilities (e.g. terraform, sbt, maven, etc.). The agents are built as Docker images in the recs-util-nomad/docker directory and registered with Jenkins by the init.groovy.d/g32_nomad_cloud.groovy initialisation file. The important parameters in this configuration are described below:
- cpu, memory, disk: amount of CPU (in MHz), memory (in MiB) and storage (in MiB) allocated by Nomad to an instance of this agent;
- labels: space-separated list of capabilities provided by this agent, which will be required by Jenkins jobs using the node directive;
- idleMinutes: amount of time (in minutes) that the agent will stay around after completing a job, giving it a chance to be reused by another job, which is much faster than spawning a new agent;
- numExecutors: number of jobs that the agent is allowed to process concurrently;
- hostVolumes: definition of volumes to be mounted in the agent by the Docker driver, useful for example to give an agent access to the Docker daemon socket.