545 allow attached volumes (#562)
* renamed volumeSize to bootVolumeSize to avoid name issues

* added implementation for adding volumes to permanent workers (they are not deleted)

* implemented creating and terminating volumes without filesystem for permanent workers

* fully working for permanent workers. masterMount is broken for now, but it has also been replaced and will be fixed

* Added volume creation to create_server. Not yet working.

* hostvar for each host

* Fixed information handling and naming issues

* fixed host yaml creation

* removed unnecessary prints

* improved readability, fixed minor bugs

* added volume deletion and set volume_ap_version explicitly

* removed prints from test_provider.py

* improved readability greatly. Fixed overwriting host vars bug

* snapshot and existing volumes can now be attached to master and workers on startup

* snapshot and existing volumes can now be attached to master and workers on startup

* removed mountPoint from a log message in case no mount point is specified

* fixed lsblk not finding item.device due to race condition

* improved comments and naming

* removed server automount. This is now handled by a single automount task for both master and workers

* now allows starting new permanent volumes if a name is given. One could consider adding tmp to unnamed volumes for additional clarity

* fixed wrong function call

* renamed nfs_mount to nfs_shares

* added semipermanent as an option

* fixed wrong method of setting default values for Ansible

* started reworking

* added volumes and changed bootVolumes

* updated bibigrid.yaml and aligned naming of bootVolume and volume (see the sketch after this list)

* added newline at end of file

* removed superfluous provider parameter

* pleased linting

* removed argument from function call

* moved host vars creation, vars deletion, added comments

* largely reworked how volumes are attached to servers to be more explicit

* small naming fixes

* updated priority order of permanent and semiPermanent. Updated documentation to new explicit bool setup. Added type as a key.

* fixed bug regarding dontUploadCredentials

* updated schema validation

* Update linting.yml

* Update linting.yml

* Update linting.yml

* Update linting.yml

* added __init__.py where appropriate

* update bibigrid.yaml for more explicit volumes documentation

* volumes are now validated and fixed old state of masterInstance in validate_schema.py

* Update linting.yml

* Update linting.yml

* fixed long-standing naming bug for unknown openstack exceptions

* saves more info in .mem file

* restructured tests and added a basic integration_test file that needs to be expanded and improved

* moved tests

* added "not ready yet"

* updated bootVolume documentation

* moved tests, added __init__.py files for better discovery. Minor fixes

* updated tests and comments

* updated tests and comments

* updated tests, code and comments for ansible_configuration

* updated tests for ansible_configurator

* fixed test_ansible_configurator.py

* fixed test_configuration_handler.py

* improved exception messages

* pleased ansible linter

* fixed terminate return values test

* improved naming

* added tests to make sure that the server regex only deletes bibigrid servers with a fitting cluster id, and the same for volumes

* pleased pylint

* fixed validation issue when using exists in master

* removed forgotten print

* fixed description bug

* final bugfixes

* pleased linter

* fixed too many positional arguments
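
Taken together, the messages above describe replacing the old flat boot-volume keys with a structured bootVolume block. A minimal before/after sketch, using only keys that appear in the bibigrid.yaml diff below (the volume name is a placeholder, not a value taken from this commit):

# old keys, removed by this commit
bootFromVolume: False
terminateBootVolume: True
volumeSize: 50

# new structured key, available per instance or cloud wide
bootVolume:
  name: my-boot-volume   # placeholder; optional, boots from this specific volume
  terminate: True        # whether the volume is deleted on server termination
  size: 50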
XaverStiensmeier authored Dec 2, 2024
1 parent 7bdb8f0 commit 7569163
Showing 53 changed files with 1,507 additions and 712 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/linting.yml
@@ -5,10 +5,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python 3.10
- name: Set up Python 3.12.3
uses: actions/setup-python@v4
with:
python-version: '3.10'
python-version: '3.12.3'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
@@ -17,4 +17,4 @@ jobs:
- name: ansible_lint
run: ansible-lint resources/playbook/roles/bibigrid/tasks/main.yaml
- name: pylint_lint
run: pylint bibigrid
run: pylint bibigrid
4 changes: 2 additions & 2 deletions .pylintrc
@@ -562,8 +562,8 @@ min-public-methods=2
[EXCEPTIONS]

# Exceptions that will emit a warning when caught.
overgeneral-exceptions=BaseException,
Exception
overgeneral-exceptions=builtins.BaseException,
builtins.Exception


[STRING]
159 changes: 90 additions & 69 deletions bibigrid.yaml
@@ -1,105 +1,126 @@
# See https://cloud.denbi.de/wiki/Tutorials/BiBiGrid/ (after update)
# See https://github.com/BiBiServ/bibigrid/blob/master/documentation/markdown/features/configuration.md
# First configuration also holds general cluster information and must include the master.
# All other configurations mustn't include another master, but exactly one vpngtw instead (keys like master).
# For an easy introduction see https://github.com/deNBI/bibigrid_clum
# For more detailed information see https://github.com/BiBiServ/bibigrid/blob/master/documentation/markdown/features/configuration.md

- infrastructure: openstack # former mode. Describes what cloud provider is used (others are not implemented yet)
cloud: openstack # name of clouds.yaml cloud-specification key (which is value to top level key clouds)
- # -- BEGIN: GENERAL CLUSTER INFORMATION --
# The following options configure cluster wide keys
# Modify these according to your requirements

# -- BEGIN: GENERAL CLUSTER INFORMATION --
# sshTimeout: 5 # number of attempts to connect to instances during startup with delay in between
# cloudScheduling:
# sshTimeout: 5 # like sshTimeout but during the on demand scheduling on the running cluster

## sshPublicKeyFiles listed here will be added to access the cluster. A temporary key is created by bibigrid itself.
#sshPublicKeyFiles:
# - [public key one]
## sshPublicKeyFiles listed here will be added to the master's authorized_keys. A temporary key is stored at ~/.config/bibigrid/keys
# sshPublicKeyFiles:
# - [public key one]

## Volumes and snapshots that will be mounted to master
#masterMounts: (optional) # WARNING: will overwrite unidentified filesystems
# - name: [volume name]
# mountPoint: [where to mount to] # (optional)
# masterMounts: DEPRECATED -- see `volumes` key for each instance instead

#nfsShares: /vol/spool/ is automatically created as a nfs
# - [nfsShare one]
# nfsShares: # list of nfs shares. /vol/spool/ is automatically created as an nfs if nfs is true
# - [nfsShare one]

# userRoles: # see ansible_hosts for all options
## Ansible Related
# userRoles: # see ansible_hosts for all 'hosts' options
# - hosts:
# - "master"
# roles: # roles placed in resources/playbook/roles_user
# - name: "resistance_nextflow"
# varsFiles: # (optional)
# - [...]

## Uncomment if you don't want to assign a public ip to the master; for internal clusters (Tuebingen).
## If you use a gateway or start a cluster from the cloud, your master does not need a public ip.
# useMasterWithPublicIp: False # defaults True if False no public-ip (floating-ip) will be allocated
# gateway: # if you want to use a gateway for create.
# ip: # IP of gateway to use
# portFunction: 30000 + oct4 # variables are called: oct1.oct2.oct3.oct4

# deleteTmpKeypairAfter: False
# dontUploadCredentials: False
## Only relevant for specific projects (e.g. SimpleVM)
# deleteTmpKeypairAfter: False # warning: if you don't pass a key via sshPublicKeyFiles you lose access!
# dontUploadCredentials: False # warning: enabling this prevents you from scheduling on demand!

## Additional Software
# zabbix: False
# nfs: False
# ide: False # installs a web ide on the master node. A nice way to view your cluster (like Visual Studio Code)

### Slurm Related
# elastic_scheduling: # for large or slow clusters increasing these timeouts might be necessary to avoid failures
# SuspendTimeout: 60 # after SuspendTimeout seconds, slurm allows to power up the node again
# ResumeTimeout: 1200 # if a node doesn't start in ResumeTimeout seconds, the start is considered failed.

# Other keys - these are default False
# Usually Ignored
##localFS: True
##localDNSlookup: True
# cloudScheduling:
# sshTimeout: 5 # like sshTimeout but during the on demand scheduling on the running cluster

#zabbix: True
#nfs: True
#ide: True # A nice way to view your cluster as if you were using Visual Studio Code
# useMasterAsCompute: True

useMasterAsCompute: True
# -- END: GENERAL CLUSTER INFORMATION --

# bootFromVolume: False
# terminateBootVolume: True
# volumeSize: 50
# waitForServices: # existing service name that runs after an instance is launched. BiBiGrid's playbook will wait until service is "stopped" to avoid issues
# -- BEGIN: MASTER CLOUD INFORMATION --
infrastructure: openstack # former mode. Describes what cloud provider is used (others are not implemented yet)
cloud: openstack # name of clouds.yaml cloud-specification key (which is value to top level key clouds)

# waitForServices: # list of existing service names that affect apt. BiBiGrid's playbook will wait until service is "stopped" to avoid issues
# - de.NBI_Bielefeld_environment.service # uncomment for cloud site Bielefeld

# master configuration
## master configuration
masterInstance:
type: # existing type/flavor on your cloud. See launch instance>flavor for options
image: # existing active image on your cloud. Consider using regex to prevent image updates from breaking your running cluster
type: # existing type/flavor from your cloud. See launch instance>flavor for options
image: # existing active image from your cloud. Consider using regex to prevent image updates from breaking your running cluster
# features: # list
# - feature1
# partitions: # list
# bootVolume: None
# bootFromVolume: True
# terminateBootVolume: True
# volumeSize: 50

# -- END: GENERAL CLUSTER INFORMATION --
# - partition1
# bootVolume: # optional
# name: # optional; if you want to boot from a specific volume
# terminate: True # whether the volume is terminated on server termination
# size: 50
# volumes: # optional
# - name: volumeName # empty for temporary volumes
# snapshot: snapshotName # optional; to create volume from a snapshot
# mountPoint: /vol/mountPath
# size: 50
# fstype: ext4 # must support chown
# type: # storage type; available values depend on your location; for Bielefeld CEPH_HDD, CEPH_NVME
## Select up to one of the following options; otherwise temporary is picked
# exists: False # if True looks for existing volume with exact name. count must be 1. Volume is never deleted.
# permanent: False # if True volume is never deleted; overwrites semiPermanent if both are given
# semiPermanent: False # if True volume is only deleted during cluster termination

# fallbackOnOtherImage: False # if True, most similar image by name will be picked. A regex can also be given instead.

# worker configuration
## worker configuration
# workerInstances:
# - type: # existing type/flavor on your cloud. See launch instance>flavor for options
# - type: # existing type/flavor from your cloud. See launch instance>flavor for options
# image: # same as master. Consider using regex to prevent image updates from breaking your running cluster
# count: # any number of workers you would like to create with set type, image combination
# count: 1 # number of workers you would like to create with set type, image combination
# # features: # list
# # partitions: # list
# # bootVolume: None
# # bootFromVolume: True
# # terminateBootVolume: True
# # volumeSize: 50

# Depends on cloud image
sshUser: # for example ubuntu

# Depends on cloud site and project
subnet: # existing subnet on your cloud. See https://openstack.cebitec.uni-bielefeld.de/project/networks/
# or network:

# Uncomment if no full DNS service for started instances is available.
# Currently, the case in Berlin, DKFZ, Heidelberg and Tuebingen.
#localDNSLookup: True

#features: # list

# elastic_scheduling: # for large or slow clusters increasing these timeouts might be necessary to avoid failures
# SuspendTimeout: 60 # after SuspendTimeout seconds, slurm allows to power up the node again
# ResumeTimeout: 1200 # if a node doesn't start in ResumeTimeout seconds, the start is considered failed.
# # partitions: # list of slurm features that all nodes of this group have
# # bootVolume: # optional
# # name: # optional; if you want to boot from a specific volume
# # terminate: True # whether the volume is terminated on server termination
# # size: 50
# # volumes: # optional
# # - name: volumeName # optional
# # snapshot: snapshotName # optional; to create volume from a snapshot
# # mountPoint: /vol/mountPath # optional; not mounted if no path is given
# # size: 50
# # fstype: ext4 # must support chown
# # type: # storage type; available values depend on your location; for Bielefeld CEPH_HDD, CEPH_NVME
# ## Select up to one of the following options; otherwise temporary is picked
# # exists: False # if True looks for existing volume with exact name. count must be 1. Volume is never deleted.
# # permanent: False # if True volume is never deleted; overwrites semiPermanent if both are given
# # semiPermanent: False # if True volume is only deleted during cluster termination

# Depends on image
sshUser: # for example 'ubuntu'

# Depends on project
subnet: # existing subnet from your cloud. See https://openstack.cebitec.uni-bielefeld.de/project/networks/
# network: # only if no subnet is given

# features: # list of slurm features that all nodes of this cloud have
# - feature1

# bootVolume: # optional (cloud wide)
# name: # optional; if you want to boot from a specific volume
# terminate: True # whether the volume is terminated on server termination
# size: 50

#- [next configurations]
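
For orientation, here is a minimal sketch of a worker group using the new volume keys documented above; the flavor, image, volume name, and mount path are placeholders rather than values taken from this commit:

workerInstances:
  - type: de.NBI default       # placeholder; must be an existing flavor on your cloud
    image: Ubuntu 22.04        # placeholder; must be an existing active image
    count: 2
    volumes:
      - name: scratch              # placeholder; leave empty for a temporary volume
        mountPoint: /vol/scratch   # placeholder; not mounted if no path is given
        size: 50
        fstype: ext4               # must support chown
        semiPermanent: True        # deleted only during cluster termination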
Empty file added bibigrid/__init__.py
Empty file.
Empty file added bibigrid/core/__init__.py
Empty file.