545 allow attached volumes (#562)
* renamed volumeSize to bootVolumeSize to avoid name issues

* added implementation for adding volumes to permanent workers (they are not deleted)

* implemented creating and terminating volumes without filesystem for permanent workers

* fully working for permanent workers. masterMount is broken for now, but it has also been replaced and will be fixed

* Added volume creation to create_server. Not yet working.

* hostvar for each host

* Fixed information handling and naming issues

* fixed host yaml creation

* removed unnecessary prints

* improved readability, fixed minor bugs

* added volume deletion and set volume_ap_version explicitly

* removed prints from test_provider.py

* improved readability greatly. Fixed overwriting host vars bug

* snapshot and existing volumes can now be attached to master and workers on startup

* snapshot and existing volumes can now be attached to master and workers on startup

* removed mountPoint from a log message in case no mount point is specified

* fixed lsblk not finding item.device due to race condition

* improved comments and naming

* removed server automount. This is now handled by a single automount task for both master and workers

* now allows starting new permanent volumes if a name is given. One could consider adding tmp to unnamed volumes for additional clarity

* fixed wrong function call

* renamed nfs_mount to nfs_shares

* added semipermanent as an option

* fixed wrong method of setting default values for Ansible

* started reworking

* added volumes and changed bootVolumes

* updated bibigrid.yaml and aligned naming of bootVolume and volume (see the sketch after this list)

* added newline at end of file

* removed superfluous provider parameter

* pleased linting

* removed argument from function call

* moved host vars creation, vars deletion, added comments

* largely reworked how volumes are attached to servers to be more explicit

* small naming fixes

* updated priority order of permanent and semiPermanent. Updated documentation to new explicit bool setup. Added type as a key.

* fixed bug regarding dontUploadCredentials

* updated schema validation

* Update linting.yml

* Update linting.yml

* Update linting.yml

* Update linting.yml

* added __init__.py where appropriate

* update bibigrid.yaml for more explicit volumes documentation

* volumes are now validated and fixed old state of masterInstance in validate_schema.py

* Update linting.yml

* Update linting.yml

* fixed long-standing naming bug for unknown openstack exceptions

* saves more info in .mem file

* restructured tests and added a basic integration_test file that needs to be expanded and improved

* moved tests

* added "not ready yet"

* updated bootVolume documentation

* moved tests, added __init__.py files for better discovery. Minor fixes

* updated tests and comments

* updated tests and comments

* updated tests, code and comments for ansible_configuration

* updated tests for ansible_configurator

* fixed test_ansible_configurator.py

* fixed test_configuration_handler.py

* improved exception messages

* pleased ansible linter

* fixed terminate return values test

* improved naming

* added tests to make sure that the server regex only deletes bibigrid servers with a fitting cluster id, and the same for volumes

* pleased pylint

* fixed validation issue when using exists in master

* removed forgotten print

* fixed description bug

* final bugfixes

* pleased linter

* fixed too many positional arguments
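
Taken together, the messages above describe replacing the old flat boot-volume keys with a structured bootVolume block. A minimal before/after sketch, using only keys that appear in the bibigrid.yaml diff below (the volume name is a placeholder, not a value taken from this commit):

# old keys, removed by this commit
bootFromVolume: False
terminateBootVolume: True
volumeSize: 50

# new structured key, available per instance or cloud wide
bootVolume:
  name: my-boot-volume   # placeholder; optional, boots from this specific volume
  terminate: True        # whether the volume is deleted on server termination
  size: 50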
XaverStiensmeier authored Dec 2, 2024
1 parent 7bdb8f0 commit 7569163
Showing 53 changed files with 1,507 additions and 712 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/linting.yml
@@ -5,10 +5,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python 3.10
- name: Set up Python 3.12.3
uses: actions/setup-python@v4
with:
python-version: '3.10'
python-version: '3.12.3'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
@@ -17,4 +17,4 @@ jobs:
- name: ansible_lint
run: ansible-lint resources/playbook/roles/bibigrid/tasks/main.yaml
- name: pylint_lint
run: pylint bibigrid
run: pylint bibigrid
4 changes: 2 additions & 2 deletions .pylintrc
@@ -562,8 +562,8 @@ min-public-methods=2
[EXCEPTIONS]

# Exceptions that will emit a warning when caught.
overgeneral-exceptions=BaseException,
Exception
overgeneral-exceptions=builtins.BaseException,
builtins.Exception


[STRING]
159 changes: 90 additions & 69 deletions bibigrid.yaml
@@ -1,105 +1,126 @@
# See https://cloud.denbi.de/wiki/Tutorials/BiBiGrid/ (after update)
# See https://github.com/BiBiServ/bibigrid/blob/master/documentation/markdown/features/configuration.md
# First configuration also holds general cluster information and must include the master.
# All other configurations mustn't include another master, but exactly one vpngtw instead (keys like master).
# For an easy introduction see https://github.com/deNBI/bibigrid_clum
# For more detailed information see https://github.com/BiBiServ/bibigrid/blob/master/documentation/markdown/features/configuration.md

- infrastructure: openstack # former mode. Describes what cloud provider is used (others are not implemented yet)
cloud: openstack # name of clouds.yaml cloud-specification key (which is value to top level key clouds)
- # -- BEGIN: GENERAL CLUSTER INFORMATION --
# The following options configure cluster wide keys
# Modify these according to your requirements

# -- BEGIN: GENERAL CLUSTER INFORMATION --
# sshTimeout: 5 # number of attempts to connect to instances during startup with delay in between
# cloudScheduling:
# sshTimeout: 5 # like sshTimeout but during the on demand scheduling on the running cluster

## sshPublicKeyFiles listed here will be added to access the cluster. A temporary key is created by bibigrid itself.
#sshPublicKeyFiles:
# - [public key one]
## sshPublicKeyFiles listed here will be added to the master's authorized_keys. A temporary key is stored at ~/.config/bibigrid/keys
# sshPublicKeyFiles:
# - [public key one]

## Volumes and snapshots that will be mounted to master
#masterMounts: (optional) # WARNING: will overwrite unidentified filesystems
# - name: [volume name]
# mountPoint: [where to mount to] # (optional)
# masterMounts: DEPRECATED -- see `volumes` key for each instance instead

#nfsShares: /vol/spool/ is automatically created as a nfs
# - [nfsShare one]
# nfsShares: # list of nfs shares. /vol/spool/ is automatically created as an nfs if nfs is true
# - [nfsShare one]

# userRoles: # see ansible_hosts for all options
## Ansible Related
# userRoles: # see ansible_hosts for all 'hosts' options
# - hosts:
# - "master"
# roles: # roles placed in resources/playbook/roles_user
# - name: "resistance_nextflow"
# varsFiles: # (optional)
# - [...]

## Uncomment if you don't want to assign a public ip to the master; for internal clusters (Tuebingen).
## If you use a gateway or start a cluster from the cloud, your master does not need a public ip.
# useMasterWithPublicIp: False # defaults True if False no public-ip (floating-ip) will be allocated
# gateway: # if you want to use a gateway for create.
# ip: # IP of gateway to use
# portFunction: 30000 + oct4 # variables are called: oct1.oct2.oct3.oct4

# deleteTmpKeypairAfter: False
# dontUploadCredentials: False
## Only relevant for specific projects (e.g. SimpleVM)
# deleteTmpKeypairAfter: False # warning: if you don't pass a key via sshPublicKeyFiles you lose access!
# dontUploadCredentials: False # warning: enabling this prevents you from scheduling on demand!

## Additional Software
# zabbix: False
# nfs: False
# ide: False # installs a web ide on the master node. A nice way to view your cluster (like Visual Studio Code)

### Slurm Related
# elastic_scheduling: # for large or slow clusters increasing these timeouts might be necessary to avoid failures
# SuspendTimeout: 60 # after SuspendTimeout seconds, slurm allows to power up the node again
# ResumeTimeout: 1200 # if a node doesn't start in ResumeTimeout seconds, the start is considered failed.

# Other keys - these are default False
# Usually Ignored
##localFS: True
##localDNSlookup: True
# cloudScheduling:
# sshTimeout: 5 # like sshTimeout but during the on demand scheduling on the running cluster

#zabbix: True
#nfs: True
#ide: True # A nice way to view your cluster as if you were using Visual Studio Code
# useMasterAsCompute: True

useMasterAsCompute: True
# -- END: GENERAL CLUSTER INFORMATION --

# bootFromVolume: False
# terminateBootVolume: True
# volumeSize: 50
# waitForServices: # existing service name that runs after an instance is launched. BiBiGrid's playbook will wait until service is "stopped" to avoid issues
# -- BEGIN: MASTER CLOUD INFORMATION --
infrastructure: openstack # former mode. Describes what cloud provider is used (others are not implemented yet)
cloud: openstack # name of clouds.yaml cloud-specification key (which is value to top level key clouds)

# waitForServices: # list of existing service names that affect apt. BiBiGrid's playbook will wait until service is "stopped" to avoid issues
# - de.NBI_Bielefeld_environment.service # uncomment for cloud site Bielefeld

# master configuration
## master configuration
masterInstance:
type: # existing type/flavor on your cloud. See launch instance>flavor for options
image: # existing active image on your cloud. Consider using regex to prevent image updates from breaking your running cluster
type: # existing type/flavor from your cloud. See launch instance>flavor for options
image: # existing active image from your cloud. Consider using regex to prevent image updates from breaking your running cluster
# features: # list
# - feature1
# partitions: # list
# bootVolume: None
# bootFromVolume: True
# terminateBootVolume: True
# volumeSize: 50

# -- END: GENERAL CLUSTER INFORMATION --
# - partition1
# bootVolume: # optional
# name: # optional; if you want to boot from a specific volume
# terminate: True # whether the volume is terminated on server termination
# size: 50
# volumes: # optional
# - name: volumeName # empty for temporary volumes
# snapshot: snapshotName # optional; to create volume from a snapshot
# mountPoint: /vol/mountPath
# size: 50
# fstype: ext4 # must support chown
# type: # storage type; available values depend on your location; for Bielefeld CEPH_HDD, CEPH_NVME
## Select up to one of the following options; otherwise temporary is picked
# exists: False # if True looks for existing volume with exact name. count must be 1. Volume is never deleted.
# permanent: False # if True volume is never deleted; overwrites semiPermanent if both are given
# semiPermanent: False # if True volume is only deleted during cluster termination

# fallbackOnOtherImage: False # if True, most similar image by name will be picked. A regex can also be given instead.

# worker configuration
## worker configuration
# workerInstances:
# - type: # existing type/flavor on your cloud. See launch instance>flavor for options
# - type: # existing type/flavor from your cloud. See launch instance>flavor for options
# image: # same as master. Consider using regex to prevent image updates from breaking your running cluster
# count: # any number of workers you would like to create with set type, image combination
# count: 1 # number of workers you would like to create with set type, image combination
# # features: # list
# # partitions: # list
# # bootVolume: None
# # bootFromVolume: True
# # terminateBootVolume: True
# # volumeSize: 50

# Depends on cloud image
sshUser: # for example ubuntu

# Depends on cloud site and project
subnet: # existing subnet on your cloud. See https://openstack.cebitec.uni-bielefeld.de/project/networks/
# or network:

# Uncomment if no full DNS service for started instances is available.
# Currently, the case in Berlin, DKFZ, Heidelberg and Tuebingen.
#localDNSLookup: True

#features: # list

# elastic_scheduling: # for large or slow clusters increasing these timeouts might be necessary to avoid failures
# SuspendTimeout: 60 # after SuspendTimeout seconds, slurm allows to power up the node again
# ResumeTimeout: 1200 # if a node doesn't start in ResumeTimeout seconds, the start is considered failed.
# # partitions: # list of slurm features that all nodes of this group have
# # bootVolume: # optional
# # name: # optional; if you want to boot from a specific volume
# # terminate: True # whether the volume is terminated on server termination
# # size: 50
# # volumes: # optional
# # - name: volumeName # optional
# # snapshot: snapshotName # optional; to create volume from a snapshot
# # mountPoint: /vol/mountPath # optional; not mounted if no path is given
# # size: 50
# # fstype: ext4 # must support chown
# # type: # storage type; available values depend on your location; for Bielefeld CEPH_HDD, CEPH_NVME
# ## Select up to one of the following options; otherwise temporary is picked
# # exists: False # if True looks for existing volume with exact name. count must be 1. Volume is never deleted.
# # permanent: False # if True volume is never deleted; overwrites semiPermanent if both are given
# # semiPermanent: False # if True volume is only deleted during cluster termination

# Depends on image
sshUser: # for example 'ubuntu'

# Depends on project
subnet: # existing subnet from your cloud. See https://openstack.cebitec.uni-bielefeld.de/project/networks/
# network: # only if no subnet is given

# features: # list of slurm features that all nodes of this cloud have
# - feature1

# bootVolume: # optional (cloud wide)
# name: # optional; if you want to boot from a specific volume
# terminate: True # whether the volume is terminated on server termination
# size: 50

#- [next configurations]
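
For orientation, here is a minimal sketch of a worker group using the new volume keys documented above; the flavor, image, volume name, and mount path are placeholders rather than values taken from this commit:

workerInstances:
  - type: de.NBI default       # placeholder; must be an existing flavor on your cloud
    image: Ubuntu 22.04        # placeholder; must be an existing active image
    count: 2
    volumes:
      - name: scratch              # placeholder; leave empty for a temporary volume
        mountPoint: /vol/scratch   # placeholder; not mounted if no path is given
        size: 50
        fstype: ext4               # must support chown
        semiPermanent: True        # deleted only during cluster termination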
Empty file added bibigrid/__init__.py
Empty file.
Empty file added bibigrid/core/__init__.py
Empty file.