Commit
* fixed rule setting for security groups
* fixed multiple network is now list causing error bugs.
* trying to figure out why route applying only works once.
* Added more echo's for better debugging.
* updated most tests
* fixed validate_configuration.py tests.
* Updated tests for startup.py
* fixed bug in terminate that caused assume_yes to work as assume_no
* updated terminate_cluster tests.
* fixed formatting improved pylint
* adapted tests
* updated return threading test
* updated provider_handler
* tests not finished yet
* Fixed server regex issue
* test list clusters updated
* fixed too open cluster_id regex
* added missing "to"
* fixed id_generation tests
* renamed configuration handler to please linter
* removed unnecessary tests and updated remaining
* fixed remaining "subnet list gets handled as a single subnet" bug and finalized multiple routes handling.
* updated tests not finished yet
* improved code style
* fixed tests further. One to fix left.
* fixed additional tests
* fixed all tests for ansible configurator
* fixed comment
* fixed multiple tests
* fixed a few tests
* Fixed create
* fixed some issues regarding
* fixing test_provider.py
* removed infrastructure_cloud.yml
* minor fixes
* fixed all tests
* removed print
* changed prints to log
* removed log
* fixed None bug where [] is expected when no sshPublicKeyFile is given.
* removed master from compute if use master as compute is false
* reconstructured role additional in order to make it easier to include. Added quotes for consistency.
* Updated all tests (#448)
* updated most tests
* fixed validate_configuration.py tests.
* Updated tests for startup.py
* fixed bug in terminate that caused assume_yes to work as assume_no
* updated terminate_cluster tests.
* fixed formatting improved pylint
* adapted tests
* updated return threading test
* updated provider_handler
* tests not finished yet
* Fixed server regex issue
* test list clusters updated
* fixed too open cluster_id regex
* added missing "to"
* fixed id_generation tests
* renamed configuration handler to please linter
* removed unnecessary tests and updated remaining
* updated tests not finished yet
* improved code style
* fixed tests further. One to fix left.
* fixed additional tests
* fixed all tests for ansible configurator
* fixed comment
* fixed multiple tests
* fixed a few tests
* Fixed create
* fixed some issues regarding
* fixing test_provider.py
* removed infrastructure_cloud.yml
* minor fixes
* fixed all tests
* removed print
* changed prints to log
* removed log
* Introduced yaml lock (#464)
* removed unnecessary close
* simplified update_hosts
* updated logging to separate folder and file based on creation date
* many small changes and introducing locks
* restructured log files again. Removed outdated key warnings from bibigrid.yml
* added a few logs
* further improved logging hierarchy
* Added specific folder places for temporary job storage. This might solve the "SlurmSpoolDir full" bug.
* Improved logging
* Tried to fix temps and tried update to 23.11 but has errors so commented that part out
* added initial space
* added existing worker deletion on worker startup if worker already exists as no worker would've been started if Slurm would've known about the existing worker. This is not the best solution. (#468)
* made waitForServices a cloud specific key (#465)
* Improved log messages in validate_configuration.py to make fixing your configuration easier when using a hybrid-/multi-cloud setup (#466)
* removed unnecessary line in provider.py and added cloud information to every log in validate_configuration.py for easier fixing.
* track resources for providers separately to make quota checking precise
* switched from low level cinder to high level block_storage.get_limits()
* added keyword for ssh_timeout and improved argument passing for ssh.
* Update issue templates
* fixed a missing LOG
* removed overwritten variable instantiation
* Update bug_report.md
* removed trailing whitespaces
* added comment about sshTimeout key
* Create dependabot.yml (#479)
* Code cleanup and minor improvement (#482)
* fixed :param and :return to @param and @return
* many spelling mistakes fixed
* added bibigrid_version to common configuration
* added timeout to common_configuration
* removed debug verbosity and improved log message wording
* fixed is_active structure
* fixed pip dependabot.yml
* added documentation. Changed timeout to 2**(2+attempts) to decrease number of unlikely to work attempts
* 474 allow non on demandpermanent workers (#487)
* added worker server start without anything else
* added host entry for permanent workers
* added state unknown for permanent nodes
* added on_demand key for groups and instances for ansible templating
* fixed wording
* temporary solution for custom execute list
* added documentation for onDemand
* added ansible.cfg replacement
* fixed path. Added ansible.cfg to the gitignore
* updated default creation and gitignore. Fixed non-vital bug that didn't reset hosts for new cluster start.
* Code cleanup (#490)
* fixed :param and :return to @param and @return
* many spelling mistakes fixed
* added bibigrid_version to common configuration
* attempted zabbix linting fix. Needs testing.
* fixed double import
* Slurm upgrade fixes (#473)
* removed slurm errors
* added bibilog to show output log of most recent worker start. Tried fixing the slurm23.11 bug.
* fixed a few vpnwkr -> vpngtw remnants. Excluded vpngtw from slurm setup
* improved comments regarding changes and versions
* removed cgroupautomount as it is defunct
* Moved explicit slurm start to avoid errors caused by resume and suspend programs not being copied to their final location yet
* added word for clarification
* Fixed non-fatal bug that led to non 0 exits on runs without any error.
* changed slurm apt package to slurm-bibigrid
* set version to 23.11.*
* added a few more checks to make sure everything is set up before installing packages
* Added configuration pinning
* changed ignore_error to failed_when false
* fixed or ignored lint fatals
* Update tests (#493)
* updated tests
* removed print
* updated tests
* updated tests
* fixed too loose condition
* updated tests
* added cloudScheduling and userRoles in bibigrid.yml
* added userRoles in documentation
* added varsFiles and comments
* added folder path in documentation
* fixed naming
* added that vars are optional
* polished userRoles documentation
* 439 additional ansible roles (#495)
* added roles structure
* updated roles_path
* fixed upper lower case
* improved customRole implementation
* minor fixes regarding role_paths
* improved variable naming of user_roles
* added documentation for other configurations
* added new feature keys
* fixed template files not being j2
* added helpful comments and removed no longer used roles/additional/
* userRoles crashes if no role set
* fixed ansible.cfg path '"'
* implemented partition system
* added keys customAnsibleCfg and customSlurmConf as keys that stop the automatic copying
* improved spacing
* added logging
* updated documentation
* updated tests. Improved formatting
* fix for service being too fast for startup
* fixed remote src
* changed RESUME to POWER_DOWN and removed delete call which is now handled via Slurm that calls terminate.sh (#503)
* Update check (#499)
* updated validate_configuration.py in order to provide schema validation. Moved cloud_identifier setting even closer to program start in order to be able to log better when performing other actions than create.
* small log change and fix of schema key vpnInstance
* updated tests
* removed no longer relevant test
* added schema validation tests
* fixed ftype. Errors with multiple volumes.
* made automount bound to defined mountPoints and therefore customizable
* added empty line and updated bibigrid.yml
* fixed nfsshare regex error and updated check to fit to the new name mountpoint pattern
* hotfix: folder creation now before accessing hosts.yml
* fixed tests
* moved dnsmasq installation in front of /etc/resolv removal
* fixed tests
* fixed nfs exports by removing unnecessary "/" at the beginning
* fixed master running slurmd but not being listed in slurm.conf. Now set to drained.
* improved logging
* increased timeout. Corrected comment in slurm.j2
* updated info regarding timeouts (changed from 4 to 5).
* added SuspendTimeout as optional to elastic_scheduling
* updated documentation
* permission fix
* fixes #394
* fixes #394 (also for hybrid cluster)
* increased ResumeTimeout by 5 minutes. yml to yaml
* changed all yml to yaml (as preferred by yaml)
* updated timeouts. updated tests
* fixes #394 - remove host from zabbix when terminated
* zabbix api no longer used when not set in configuration
* pleased linting by using false instead of no
* added logging of traceroute even if debug flag is not set when error is not known. Added a few other logs
* Update action 515 (#516)
* configuration update possible 515
* added experimental
* fixed indentation
* fixed missing newline at EOF. Summarized restarts.
* added check for running workers
* fixed multiple workers due to faulty update
* updated tests and removed done todos
* updated documentation
* removed print
* Added apt-reactivate-auto-update to reactivate updates at the end of the playbook run (#518)
* changed theia to 900. Added apt-reactivate-auto-update as new 999.
* added new line at end of file
* changed list representation
* added multiple configuration keys for boot volume handling
* updated documentation
* updated documentation for new volumes and for usually ignored keys
* updated and added tests

---------

Co-authored-by: Jan Krueger <[email protected]>