From b1c61af6a95db9eb17f98b027905b436a9874a34 Mon Sep 17 00:00:00 2001 From: Jan Krüger Date: Tue, 13 Dec 2022 13:54:06 +0100 Subject: [PATCH] Merge Dev into Master (#347) * fix: print possible configuration errors to stdout to be more specific than "an error occurred" * fix #222: print possible configuration errors to stdout * enhancement #281: add support for cloud site region (Openstack) * Fix/#275 - Integrated 'useHostname' config parameter (#280) * Adjusted ideConf Readme * Extended configuration by useHostname parameter, adjusted AnsibleHostsConfig writing if useHostname is enabled * Added useHostname config parameter to config schema, adjusted default parameters marked with "" * Fixed 'useHostname' instead of 'useHostnames' typo * Added description to 'useHostnames' parameter * enhancement #283: add option to overwrite service CIDR mask * enhancement #283: adjust changelog * enhancement #283: adjust documentation * Moved serviceCIDR validation from config to validator, added method description for validateServiceCIDR(string), Added string pattern description to CONFIGURATION_SCHEMA.md * bump version to 2.2.2 (main bugfix release) - re-add missing 'break' in intentMode switch/case directive * fix/#217: - adjust theia task - remove unused (unlinked) cloud9 task * #285: move CIDR mask validation into Configuration setter method * security/#291: change default ip range for opened ports from 0.0.0.0/0 to current * enhancement/#289: - bump node version manager to 0.37.0 - bump prebuilt theia version to 1.8 * fix/#297: disable auto-upgrade as early as possible (#298) * fix/#297: disable auto-upgrade as early as possible * #297 add check for /var/lib/dpkg/lock * Move client2 module/#299 (#300) * Adjusted ideConf Readme * Moved client to providerModule * update ChangeLog.md * bump version from 2.2.2 to 2.3 * Determine quotas/#257 (#306) * Adjusted ideConf Readme * Moved Intents for Openstack to separate directory * Changed interface OSClient to correct implementation via OSClientV3 * ProviderTypes have not been tested and remained unused, check config independently only in -ch and -c case * Moved config validation to validator * Removed unnecessary duplicate block in checking instance and image resources, added documentation to ValidateIntent * Renamed ValidatorOpenstack method validate to validateCredentials to be more precise * Added OpenStack Implementation of ValidateIntent * Check quotas exceeded before launching worker instances in scaling up, Changed scale-up method createWorkerInstances into createAdditionalWorkerInstances, Assign clusterId in runIntent() in StartUp beforehand, Removed help intentMode from runIntent(), since it is handled before, Added ValidateIntentOpenstack, Added todos * Created configureClusterInstances() including master and worker instance configuration to check quotas beforehand, changed list of map entries for instanceTypes to map, removed unnecessary 'prepare' value permanently, Fixed some issues in Google Cloud, AWS and Azure and other warnings * Fixed ValidateIntent checkQuotas() wrong parameters issue * Fixed issue with master and worker having the same providerType, enhanced logging * adjust ChangeLog.md, bump jackson-databind to 2.9.10.7 * adjust ChangeLog.md * Bump guava from 28.0-jre to 29.0-jre in /bibigrid-openstack Bumps [guava](https://github.com/google/guava) from 28.0-jre to 29.0-jre. 
- [Release notes](https://github.com/google/guava/releases) - [Commits](https://github.com/google/guava/commits) Signed-off-by: dependabot[bot] * Bump snakeyaml from 1.25 to 1.26 in /bibigrid-core Bumps [snakeyaml](https://bitbucket.org/asomov/snakeyaml) from 1.25 to 1.26. - [Commits](https://bitbucket.org/asomov/snakeyaml/branches/compare/snakeyaml-1.26..snakeyaml-1.25) --- updated-dependencies: - dependency-name: org.yaml:snakeyaml dependency-type: direct:production ... Signed-off-by: dependabot[bot] * Fix ide config check/#293 (#313) * Adjusted ideConf Readme * Remove ConfigurationFile.java which has been fully replaced by Configuration * Moved CommandLine interpretation to separate class. * Moved config validation into validate and create only, minor fixes * Removed deprecated slave instances configuration * Added missing loadConfiguration() method doc * Added option to install the IDE subsequently * Added cluster initialization in CreateCluster * Fixed rest api createCluster components * Minor enhancements * Moved loaded cluster to IdeIntent instead of loading in intent * Moved Yaml Interpretation into separate model, minor bug fixes * Enhanced YamlInterpreter logging * Moved isIpAdressFile() to YamlInterpreter * Added LoadIdeConfiguration Method in LoadClusterConfigurationIntent, minor code enhancements * Fix rebase merging errors * error handling when ide not installed, separate installSubsequently() method to install afterwards (not yet implemented) * Abort IDE start when configuration not loaded successfully * fix(BibigridInfoIdGetHandler): just added a space to avoid confusion * Fixes missing clusterMap loading in BibigridTerminateIdDeleteHandler (#316) * Added ansible ping check in shell script, Removed unnecessary warn logging (#319) * add workaround for buggy openstack4j identity api v3 handling (#321) * bump version to 2.3.1 * adjust slurm task (slurm 20.11, configuration) * fixes security vulnerabilities (#323, #324, #326) - bump jackson-databind to version 2.9.10.8 - bump undertow to version 2.1.6.Final * adjust configuration (#327) * adjust common and master configuration for slurm (#327) * return on empty line to dismiss empty warnings, Replace 'XXXX' with '****' in credentials output (#320) * - move all slurm tasks into an extra role (slurm, #327) - add ansible.cfg to speed up configuration (#332) - add support for slurmdbd (#331) and slurmrestd * remove ganglia task from common, master and worker roles (#249) * remove ganglia, oge and other deprecated parts from documentation (#249) * make AnsibleConfig aware of removed OGE/Ganglia options. 
* cleanup * add ipRange values to example config * Add accounting and priority settings/#335 (#338) * Added AccountSettings and PrioritySettings and scheduler config in var/ * Transferred AccountingSettings and PrioritySettings directly into slurm.conf * fix(Ansible): wait till bielefeld_environment service is done if it exists (#345) * fix race condition * fix(cluster-scaling): added cluster initialization when adding worker instances (#346) * Probably the last release of the Java-Based Version of BiBiGrid * Update README.md Signed-off-by: dependabot[bot] Co-authored-by: tdilger <39946465+tdilger@users.noreply.github.com> Co-authored-by: tdilger Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dweinholz Co-authored-by: Christian Henke Co-authored-by: cnexcale <36278502+cnexcale@users.noreply.github.com> --- ChangeLog.md | 11 + README.md | 10 +- bibigrid-aws/pom.xml | 2 +- bibigrid-azure/pom.xml | 4 +- bibigrid-core/pom.xml | 2 +- .../bibigrid/core/intents/CreateCluster.java | 7 + .../bibigrid/core/model/AnsibleConfig.java | 33 +- .../bibigrid/core/model/Configuration.java | 100 +++-- .../bibigrid/core/util/SshFactory.java | 3 + .../src/main/resources/playbook/ansible.cfg | 8 + .../playbook/roles/common/handlers/main.yml | 1 + .../playbook/roles/common/tasks/001-apt.yml | 21 +- .../roles/common/tasks/010-ganglia.yml | 21 -- .../playbook/roles/common/tasks/041-slurm.yml | 36 -- .../playbook/roles/common/tasks/main.yml | 20 - .../common/templates/ganglia/gmond.conf.j2 | 342 ------------------ .../roles/master/tasks/002-zabbix.yml | 2 +- .../roles/master/tasks/006-database.yml | 8 + .../playbook/roles/master/tasks/021-slurm.yml | 39 -- .../playbook/roles/master/tasks/main.yml | 32 +- .../files/slurm => slurm/files}/cgroup.conf | 0 .../files}/cgroup_allowed_devices_file.conf | 3 +- .../roles/slurm/files/slurmrestd_default | 9 + .../slurm/files/slurmrestd_override.conf | 6 + .../playbook/roles/slurm/handlers/main.yml | 20 + .../playbook/roles/slurm/tasks/main.yml | 138 +++++++ .../slurm => slurm/templates}/slurm.conf | 34 +- .../roles/slurm/templates/slurmdbd.conf | 29 ++ .../playbook/roles/worker/tasks/021-slurm.yml | 25 +- .../playbook/roles/worker/tasks/main.yml | 28 +- .../src/main/resources/playbook/site.yml | 14 +- bibigrid-light-rest-4j/pom.xml | 9 +- .../openstack/OpenStackCredentials.java | 2 +- docs/CONFIGURATION_SCHEMA.md | 2 - docs/README.md | 19 +- 35 files changed, 414 insertions(+), 626 deletions(-) create mode 100644 bibigrid-core/src/main/resources/playbook/ansible.cfg create mode 100644 bibigrid-core/src/main/resources/playbook/roles/common/handlers/main.yml delete mode 100644 bibigrid-core/src/main/resources/playbook/roles/common/tasks/010-ganglia.yml delete mode 100644 bibigrid-core/src/main/resources/playbook/roles/common/tasks/041-slurm.yml delete mode 100644 bibigrid-core/src/main/resources/playbook/roles/common/templates/ganglia/gmond.conf.j2 create mode 100644 bibigrid-core/src/main/resources/playbook/roles/master/tasks/006-database.yml delete mode 100644 bibigrid-core/src/main/resources/playbook/roles/master/tasks/021-slurm.yml rename bibigrid-core/src/main/resources/playbook/roles/{common/files/slurm => slurm/files}/cgroup.conf (100%) rename bibigrid-core/src/main/resources/playbook/roles/{common/files/slurm => slurm/files}/cgroup_allowed_devices_file.conf (75%) create mode 100644 bibigrid-core/src/main/resources/playbook/roles/slurm/files/slurmrestd_default create mode 100644 
bibigrid-core/src/main/resources/playbook/roles/slurm/files/slurmrestd_override.conf create mode 100644 bibigrid-core/src/main/resources/playbook/roles/slurm/handlers/main.yml create mode 100644 bibigrid-core/src/main/resources/playbook/roles/slurm/tasks/main.yml rename bibigrid-core/src/main/resources/playbook/roles/{common/templates/slurm => slurm/templates}/slurm.conf (62%) create mode 100644 bibigrid-core/src/main/resources/playbook/roles/slurm/templates/slurmdbd.conf diff --git a/ChangeLog.md b/ChangeLog.md index 18d8734f6..9a16d670e 100644 --- a/ChangeLog.md +++ b/ChangeLog.md @@ -1,3 +1,14 @@ +## Version 2.3.1 (3/8/2021) + +This will probably be the last Java-based version of BiBiGrid. We are currently working on a complete +reimplementation: BiBiGrid2. + +## Fixes +- minor ansible configuration cleanup + +## Features +- (#327) + ## Version 2.3 (3/2/2021) ## Fixes diff --git a/README.md b/README.md index 31df53cdc..00cf50309 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,14 @@ # BiBiGrid + +> **Note** +> The Java-based version of BiBiGrid was discontinued. The latest release is version 2.3.1. +> Since this version isn't maintained any longer, you should consider using its successor. + + BiBiGrid is a tool for an easy cluster setup inside a cloud environment. It is written in Java and run on any OS a Java runtime is provided - any Java 8 is supported. BiBiGrid and its Cmdline UI based on a general cloud -provider api. Currently the implementation is based on OpenStack ([Openstack4j](https://github.com/openstack4j/openstack4j)). +provider api. Currently, the implementation is based on OpenStack ([Openstack4j](https://github.com/openstack4j/openstack4j)). There also exists implementations for Google (Compute Engine, using the official Google Cloud SDK), Amazon (AWS EC2 using the official AWS SDK) and Microsoft (Azure using the official Azure SDK) (WIP) which are currently not provided tested. @@ -16,7 +22,7 @@ a shared filesystem (on local discs and attached volumes), a cloud IDE for writi ([Theia Web IDE](https://github.com/theia-ide/theia)) and many more. During resource instantiation BiBiGrid configures the network, local and network volumes, (network) file systems and -also the software for an immediately usage of the started cluster. +also the software for an immediate usage of the started cluster. When using preinstalled images a full configured and ready to use cluster is available within a few minutes. 
diff --git a/bibigrid-aws/pom.xml b/bibigrid-aws/pom.xml index bcc6a7a94..305b821f2 100644 --- a/bibigrid-aws/pom.xml +++ b/bibigrid-aws/pom.xml @@ -16,7 +16,7 @@ de.unibi.cebitec.bibigrid bibigrid-core - 2.3 + 2.3.1 com.amazonaws diff --git a/bibigrid-azure/pom.xml b/bibigrid-azure/pom.xml index 12fd98be8..d02892f17 100644 --- a/bibigrid-azure/pom.xml +++ b/bibigrid-azure/pom.xml @@ -6,7 +6,7 @@ bibigrid de.unibi.cebitec.bibigrid - 2.3 + 2.3.1 4.0.0 @@ -17,7 +17,7 @@ de.unibi.cebitec.bibigrid bibigrid-core - 2.3 + 2.3.1 com.microsoft.azure diff --git a/bibigrid-core/pom.xml b/bibigrid-core/pom.xml index 23a813b9e..fe918d968 100644 --- a/bibigrid-core/pom.xml +++ b/bibigrid-core/pom.xml @@ -67,7 +67,7 @@ com.fasterxml.jackson.core jackson-databind - 2.9.10.7 + 2.9.10.8 compile diff --git a/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/intents/CreateCluster.java b/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/intents/CreateCluster.java index 6784d9c6c..92ea3131e 100644 --- a/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/intents/CreateCluster.java +++ b/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/intents/CreateCluster.java @@ -232,6 +232,13 @@ public boolean createWorkerInstances(int batchIndex, int count) { io.printStackTrace(); return false; } + + if (cluster.getMasterInstance() == null) { + LoadClusterConfigurationIntent loadIntent = providerModule.getLoadClusterConfigurationIntent(config); + loadIntent.loadClusterConfiguration(cluster.getClusterId()); + cluster = loadIntent.getCluster(cluster.getClusterId()); + } + Session sshSession = null; boolean success = true; try { diff --git a/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/model/AnsibleConfig.java b/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/model/AnsibleConfig.java index 86fe812fc..8b56fc94f 100644 --- a/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/model/AnsibleConfig.java +++ b/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/model/AnsibleConfig.java @@ -58,6 +58,19 @@ public static void writeHostsFile(ChannelSftp channel, String sshUser, List(); + role.put("role",name); + role.put("tags",tags); + return role; + } + /** * Generates site.yml automatically including custom ansible roles. 
* @@ -68,6 +81,7 @@ public static void writeHostsFile(ChannelSftp channel, String sshUser, List customMasterRoles, Map customWorkerRoles) { + List common_vars = Arrays.asList(AnsibleResources.LOGIN_YML, AnsibleResources.INSTANCES_YML, AnsibleResources.CONFIG_YML); String DEFAULT_IP_FILE = AnsibleResources.VARS_PATH + "{{ ansible_default_ipv4.address }}.yml"; @@ -83,9 +97,10 @@ public static void writeSiteFile(OutputStream stream, } } master.put("vars_files", vars_files); - List roles = new ArrayList<>(); + List roles = new ArrayList<>(); roles.add("common"); roles.add("master"); + roles.add(createRole("slurm",Arrays.asList("slurm","scale-up","scale-down"))); for (String role_name : customMasterRoles.keySet()) { roles.add("additional/" + role_name); } @@ -105,6 +120,7 @@ public static void writeSiteFile(OutputStream stream, roles = new ArrayList<>(); roles.add("common"); roles.add("worker"); + roles.add(createRole("slurm",Arrays.asList("slurm","scale-up","scale-down"))); for (String role_name : customWorkerRoles.keySet()) { roles.add("additional/" + role_name); } @@ -357,7 +373,6 @@ public static void writeConfigFile(OutputStream stream, Configuration config, St addBooleanOption(map, "enable_gridengine", config.isOge()); addBooleanOption(map, "enable_slurm",config.isSlurm()); addBooleanOption(map, "use_master_as_compute", config.isUseMasterAsCompute()); - addBooleanOption(map, "enable_ganglia",config.isGanglia()); addBooleanOption(map, "enable_zabbix", config.isZabbix()); addBooleanOption(map, "enable_ide", config.isIDE()); if (config.isNfs()) { @@ -367,12 +382,12 @@ public static void writeConfigFile(OutputStream stream, Configuration config, St if (config.isIDE()) { map.put("ideConf", getIdeConfMap(config.getIdeConf())); } + if (config.isSlurm()) { + map.put("slurmConf",getSlurmConfMap(config.getSlurmConf())); + } if (config.isZabbix()) { map.put("zabbix", getZabbixConfMap(config.getZabbixConf())); } - if (config.isOge()) { - map.put("oge", getOgeConfMap(config.getOgeConf())); - } if (config.hasCustomAnsibleRoles()) { map.put("ansible_roles", getAnsibleRoles(config.getAnsibleRoles())); } @@ -473,6 +488,14 @@ private static Map getZabbixConfMap(Configuration.ZabbixConf zc) return zabbixConf; } + private static Map getSlurmConfMap(Configuration.SlurmConf sc) { + Map slurmConf = new LinkedHashMap<>(); + slurmConf.put("db",sc.getDatabase()); + slurmConf.put("db_user",sc.getDb_user()); + slurmConf.put("db_password",sc.getDb_password()); + return slurmConf; + } + private static Map getOgeConfMap(Properties oc) { Map ogeConf = new HashMap<>(); for (final String name : oc.stringPropertyNames()) { diff --git a/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/model/Configuration.java b/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/model/Configuration.java index 968cf5bfb..f0a0deede 100644 --- a/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/model/Configuration.java +++ b/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/model/Configuration.java @@ -116,12 +116,12 @@ public static Configuration loadConfiguration(Class con private List workerInstances = new ArrayList<>(); private boolean oge; private boolean slurm; + private SlurmConf slurmConf = new SlurmConf(); private boolean localDNSLookup; private String mungeKey; private boolean nfs = true; private String serviceCIDR; private IdeConf ideConf = new IdeConf(); - private boolean ganglia; private boolean zabbix; private ZabbixConf zabbixConf = new ZabbixConf(); private List nfsShares = new 
ArrayList<>(Collections.singletonList("/vol/spool")); @@ -129,8 +129,6 @@ public static Configuration loadConfiguration(Class con private List extNfsShares = new ArrayList<>(); private FS localFS = FS.XFS; private boolean debugRequests; - @Deprecated - private Properties ogeConf = OgeConf.initOgeConfProperties(); private List ansibleRoles = new ArrayList<>(); private List ansibleGalaxyRoles = new ArrayList<>(); private boolean useHostnames = false; @@ -515,6 +513,14 @@ public void setSlurm(boolean slurm) { this.slurm = slurm; } + public SlurmConf getSlurmConf() { + return slurmConf; + } + + public void setSlurmConf(SlurmConf slurmConf) { + this.slurmConf = slurmConf; + } + public String getMungeKey() { if (mungeKey == null) { // create a unique hash @@ -539,18 +545,6 @@ public void setMungeKey(String mungeKey) { this.mungeKey = mungeKey; } - public boolean isGanglia() { - return ganglia; - } - - public void setGanglia(boolean ganglia) { - this.ganglia = ganglia; - if (ganglia) { - LOG.warn("Ganglia (oge) support is deprecated (only supported using Ubuntu 16.04.) " + - "and will be removed in the near future. Please use Zabbix instead."); - } - } - public boolean isZabbix() { return zabbix; } @@ -771,20 +765,6 @@ private static String bytesToHex(byte[] hash) { return hexString.toString(); } - public Properties getOgeConf() { - return ogeConf; - } - - /** - * Saves given values to ogeConf Properties. - * @param ogeConf Properties - */ - public void setOgeConf(Properties ogeConf) { - for (String key : ogeConf.stringPropertyNames()) { - this.ogeConf.setProperty(key, ogeConf.getProperty(key)); - } - } - /** * Provides support for GridEngine global configuration. */ @@ -811,29 +791,6 @@ private static OgeConf initOgeConfProperties() { } } - @Deprecated - public boolean isCloud9() { - return ideConf.isIde(); - } - - @Deprecated - public void setCloud9(boolean cloud9) { - LOG.warn("cloud9 parameter is deprecated. Please use IdeConf instead."); - LOG.warn("Cloud9 will not longer be supported and is replaced by the Theia Web IDE."); - ideConf.setIde(cloud9); - } - - @Deprecated - public boolean isTheia() { - return ideConf.isIde(); - } - - @Deprecated - public void setTheia(boolean theia) { - LOG.warn("theia parameter is deprecated. Please use IdeConf instead."); - ideConf.setIde(theia); - } - public boolean isIDE() { if (ideConf == null) { return false; @@ -907,6 +864,45 @@ public void setBuild(boolean build) { } } + /** + * Configuration of Slurm. + * Currently, all values are hard-coded. + */ + public static class SlurmConf { + private boolean slurm = true; + private String database = "slurm"; + private String db_user = "slurm"; + private String db_password = "changeme"; + + public boolean isSlurm() { + return slurm; + } + + public String getDatabase() { + return database; + } + + public void setDatabase(String database) { + this.database = database; + } + + public String getDb_user() { + return db_user; + } + + public void setDb_user(String db_user) { + this.db_user = db_user; + } + + public String getDb_password() { + return db_password; + } + + public void setDb_password(String db_password) { + this.db_password = db_password; + } + } + /** * Checks if custom ansible roles used. 
* diff --git a/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/util/SshFactory.java b/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/util/SshFactory.java index 10f5c44c0..657a22f4f 100644 --- a/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/util/SshFactory.java +++ b/bibigrid-core/src/main/java/de/unibi/cebitec/bibigrid/core/util/SshFactory.java @@ -198,6 +198,9 @@ class LineReaderRunnable implements Runnable { } private void work_on_line(String line) { + if (line.isEmpty()) { + return; + } if (regular) { if (line.contains("CONFIGURATION FINISHED")) { returnCode = 0; diff --git a/bibigrid-core/src/main/resources/playbook/ansible.cfg b/bibigrid-core/src/main/resources/playbook/ansible.cfg new file mode 100644 index 000000000..f5fae4200 --- /dev/null +++ b/bibigrid-core/src/main/resources/playbook/ansible.cfg @@ -0,0 +1,8 @@ +[defaults] +inventory = ./ansible_hosts +host_key_checking = False +forks=50 +pipelining = True + +[ssh_connection] +ssh_args = -o ControlMaster=auto -o ControlPersist=60s \ No newline at end of file diff --git a/bibigrid-core/src/main/resources/playbook/roles/common/handlers/main.yml b/bibigrid-core/src/main/resources/playbook/roles/common/handlers/main.yml new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/bibigrid-core/src/main/resources/playbook/roles/common/handlers/main.yml @@ -0,0 +1 @@ + diff --git a/bibigrid-core/src/main/resources/playbook/roles/common/tasks/001-apt.yml b/bibigrid-core/src/main/resources/playbook/roles/common/tasks/001-apt.yml index d381ad623..b7c5c524e 100644 --- a/bibigrid-core/src/main/resources/playbook/roles/common/tasks/001-apt.yml +++ b/bibigrid-core/src/main/resources/playbook/roles/common/tasks/001-apt.yml @@ -10,6 +10,16 @@ group: root mode: 0644 +- name: Populate service facts + ansible.builtin.service_facts: + +- name: Wait till Apt_Mirror de.NBI Bielefeld Service is done + ansible.builtin.service_facts: + until: services['de.NBI_Bielefeld_environment.service'].state == 'stopped' + retries: 35 + delay: 10 + when: services['de.NBI_Bielefeld_environment.service'] is defined + - name: Update apt: update_cache: "yes" @@ -100,11 +110,14 @@ deb: 'https://repo.zabbix.com/zabbix/5.0/{{ ansible_distribution | lower }}/pool/main/z/zabbix-release/zabbix-release_5.0-1+{{ ansible_distribution_release }}_all.deb' state: present -- name: Add ondrej/php repository - apt_repository: - repo: ppa:ondrej/php +- name: Add apt.bi.denbi.de repository key + apt_key: + url: 'https://apt.bi.denbi.de/repo_key.key' state: present - when: "ansible_distribution == 'Ubuntu' and ansible_distribution_release == 'xenial'" + +- name: Add apt.bi.denbi.de repository + apt_repository: + repo: 'deb https://apt.bi.denbi.de/repos/apt/{{ ansible_distribution_release | lower }} {{ ansible_distribution_release | lower }} main' - name: Update apt cache apt: diff --git a/bibigrid-core/src/main/resources/playbook/roles/common/tasks/010-ganglia.yml b/bibigrid-core/src/main/resources/playbook/roles/common/tasks/010-ganglia.yml deleted file mode 100644 index 987e0059d..000000000 --- a/bibigrid-core/src/main/resources/playbook/roles/common/tasks/010-ganglia.yml +++ /dev/null @@ -1,21 +0,0 @@ -- name: Install Ganglia - apt: - name: ganglia-monitor - state: present - -- name: Configure Ganglia - template: - src: ganglia/gmond.conf.j2 - dest: /etc/ganglia/gmond.conf - owner: root - group: root - mode: 0644 - register: gmonf_conf - -- name: Restart Ganglia - systemd: - name: ganglia-monitor - state: restarted - enabled: yes - when: 
gmonf_conf is changed - diff --git a/bibigrid-core/src/main/resources/playbook/roles/common/tasks/041-slurm.yml b/bibigrid-core/src/main/resources/playbook/roles/common/tasks/041-slurm.yml deleted file mode 100644 index 1f15a6bbb..000000000 --- a/bibigrid-core/src/main/resources/playbook/roles/common/tasks/041-slurm.yml +++ /dev/null @@ -1,36 +0,0 @@ -- name: Create SLURM configuration directory - file: - path: /etc/slurm-llnl - state: directory - owner: root - group: root - mode: 0755 - - -- name: SLURM configuration - template: - src: slurm/slurm.conf - dest: /etc/slurm-llnl/slurm.conf - owner: root - group: root - mode: 0444 - register: slurm_conf - - -- name: SLURM cgroup configuration - copy: - src: slurm/cgroup.conf - dest: /etc/slurm-llnl/cgroup.conf - owner: root - group: root - mode: 0444 - register: slurm_cggroup_conf - -- name: SLURM cgroup allowed devices conf - copy: - src: slurm/cgroup_allowed_devices_file.conf - dest: /etc/slurm-llnl/cgroup_allowed_devices_file.conf - owner: root - group: root - mode: 0444 - register: SLURM_cgroup_allowed_devices_conf \ No newline at end of file diff --git a/bibigrid-core/src/main/resources/playbook/roles/common/tasks/main.yml b/bibigrid-core/src/main/resources/playbook/roles/common/tasks/main.yml index d2ffbf35f..602dcd929 100644 --- a/bibigrid-core/src/main/resources/playbook/roles/common/tasks/main.yml +++ b/bibigrid-core/src/main/resources/playbook/roles/common/tasks/main.yml @@ -20,15 +20,6 @@ when: - local_dns_lookup == 'yes' -- block: - - debug: - msg: "[BIBIGRID] Setup Ganglia monitor" - - include: 010-ganglia.yml - tags: ['ganglia','common-ganglia'] - when: - - enable_ganglia == 'yes' - - ansible_distribution_release == 'xenial' - - block: - debug: msg: "[BIBIGRID] Setup Zabbix Agent" @@ -47,17 +38,6 @@ - import_tasks: 030-docker.yml tags: ["docker","common-docker"] -- block: - - debug: - msg: "[BIBIGRID] Munge" - - include: 040-munge.yml - - debug: - msg: "[BIBIGRID] SLURM Config" - - include: 041-slurm.yml - when: - - enable_slurm == 'yes' - tags: ['slurm',"common-slurm","scale-up","scale-down"] - - debug: msg: "[BIBIGRID] Measure cluster performance" - import_tasks: 999-bibigridperf.yml diff --git a/bibigrid-core/src/main/resources/playbook/roles/common/templates/ganglia/gmond.conf.j2 b/bibigrid-core/src/main/resources/playbook/roles/common/templates/ganglia/gmond.conf.j2 deleted file mode 100644 index 4fc4eb2f4..000000000 --- a/bibigrid-core/src/main/resources/playbook/roles/common/templates/ganglia/gmond.conf.j2 +++ /dev/null @@ -1,342 +0,0 @@ -#CFG_GMOND_CONF_MASTER -/* This configuration is as close to 2.5.x default behavior as possible - The values closely match ./gmond/metric.h definitions in 2.5.x */ -globals { - daemonize = yes - setuid = yes - user = ganglia - debug_level = 0 - max_udp_msg_len = 1472 - mute = no - deaf = no - host_dmax = 0 /*secs */ - cleanup_threshold = 300 /*secs */ - gexec = no - send_metadata_interval = 30 -} - -/* If a cluster attribute is specified, then all gmond hosts are wrapped inside - * of a tag. If you do not specify a cluster tag, then all will - * NOT be wrapped inside of a tag. */ -cluster { - name = "BiBiGrid" - owner = "BiBiCloud" - latlong = "unspecified" - url = "unspecified" -} - -/* The host section describes attributes of the host, like the location */ -host { - location = "unspecified" -} - -/* Feel free to specify as many udp_send_channels as you like. 
Gmond - used to only support having a single channel */ -udp_send_channel { - host = {{ master.ip }} - port = 8649 - ttl = 1 -} - -/* You can specify as many udp_recv_channels as you like as well. */ -udp_recv_channel { - port = 8649 -} - -/* You can specify as many tcp_accept_channels as you like to share - an xml description of the state of the cluster */ -tcp_accept_channel { - port = 8649 -} - -/* Each metrics module that is referenced by gmond must be specified and - loaded. If the module has been statically linked with gmond, it does not - require a load path. However all dynamically loadable modules must include - a load path. */ -modules { - module { - name = "core_metrics" - } - module { - name = "cpu_module" - path = "/usr/lib/ganglia/modcpu.so" - } - module { - name = "disk_module" - path = "/usr/lib/ganglia/moddisk.so" - } - module { - name = "load_module" - path = "/usr/lib/ganglia/modload.so" - } - module { - name = "mem_module" - path = "/usr/lib/ganglia/modmem.so" - } - module { - name = "net_module" - path = "/usr/lib/ganglia/modnet.so" - } - module { - name = "proc_module" - path = "/usr/lib/ganglia/modproc.so" - } - module { - name = "sys_module" - path = "/usr/lib/ganglia/modsys.so" - } -} - -include ('/etc/ganglia/conf.d/*.conf') - - -/* The old internal 2.5.x metric array has been replaced by the following - collection_group directives. What follows is the default behavior for - collecting and sending metrics that is as close to 2.5.x behavior as - possible. */ - -/* This collection group will cause a heartbeat (or beacon) to be sent every - 20 seconds. In the heartbeat is the GMOND_STARTED data which expresses - the age of the running gmond. */ -collection_group { - collect_once = yes - time_threshold = 20 - metric { - name = "heartbeat" - } -} - -/* This collection group will send general info about this host every 1200 secs. - This information doesn't change between reboots and is only collected once. */ -collection_group { - collect_once = yes - time_threshold = 1200 - metric { - name = "cpu_num" - title = "CPU Count" - } - metric { - name = "cpu_speed" - title = "CPU Speed" - } - metric { - name = "mem_total" - title = "Memory Total" - } - /* Should this be here? Swap can be added/removed between reboots. */ - metric { - name = "swap_total" - title = "Swap Space Total" - } - metric { - name = "boottime" - title = "Last Boot Time" - } - metric { - name = "machine_type" - title = "Machine Type" - } - metric { - name = "os_name" - title = "Operating System" - } - metric { - name = "os_release" - title = "Operating System Release" - } - metric { - name = "location" - title = "Location" - } -} - -/* This collection group will send the status of gexecd for this host every 300 secs */ -/* Unlike 2.5.x the default behavior is to report gexecd OFF. */ -collection_group { - collect_once = yes - time_threshold = 300 - metric { - name = "gexec" - title = "Gexec Status" - } -} - -/* This collection group will collect the CPU status info every 20 secs. - The time threshold is set to 90 seconds. In honesty, this time_threshold could be - set significantly higher to reduce unneccessary network chatter. 
*/ -collection_group { - collect_every = 20 - time_threshold = 90 - /* CPU status */ - metric { - name = "cpu_user" - value_threshold = "1.0" - title = "CPU User" - } - metric { - name = "cpu_system" - value_threshold = "1.0" - title = "CPU System" - } - metric { - name = "cpu_idle" - value_threshold = "5.0" - title = "CPU Idle" - } - metric { - name = "cpu_nice" - value_threshold = "1.0" - title = "CPU Nice" - } - metric { - name = "cpu_aidle" - value_threshold = "5.0" - title = "CPU aidle" - } - metric { - name = "cpu_wio" - value_threshold = "1.0" - title = "CPU wio" - } - /* The next two metrics are optional if you want more detail... - ... since they are accounted for in cpu_system. - metric { - name = "cpu_intr" - value_threshold = "1.0" - title = "CPU intr" - } - metric { - name = "cpu_sintr" - value_threshold = "1.0" - title = "CPU sintr" - } - */ -} - -collection_group { - collect_every = 20 - time_threshold = 90 - /* Load Averages */ - metric { - name = "load_one" - value_threshold = "1.0" - title = "One Minute Load Average" - } - metric { - name = "load_five" - value_threshold = "1.0" - title = "Five Minute Load Average" - } - metric { - name = "load_fifteen" - value_threshold = "1.0" - title = "Fifteen Minute Load Average" - } -} - -/* This group collects the number of running and total processes */ -collection_group { - collect_every = 80 - time_threshold = 950 - metric { - name = "proc_run" - value_threshold = "1.0" - title = "Total Running Processes" - } - metric { - name = "proc_total" - value_threshold = "1.0" - title = "Total Processes" - } -} - -/* This collection group grabs the volatile memory metrics every 40 secs and - sends them at least every 180 secs. This time_threshold can be increased - significantly to reduce unneeded network traffic. 
*/ -collection_group { - collect_every = 40 - time_threshold = 180 - metric { - name = "mem_free" - value_threshold = "1024.0" - title = "Free Memory" - } - metric { - name = "mem_shared" - value_threshold = "1024.0" - title = "Shared Memory" - } - metric { - name = "mem_buffers" - value_threshold = "1024.0" - title = "Memory Buffers" - } - metric { - name = "mem_cached" - value_threshold = "1024.0" - title = "Cached Memory" - } - metric { - name = "swap_free" - value_threshold = "1024.0" - title = "Free Swap Space" - } -} - -collection_group { - collect_every = 40 - time_threshold = 300 - metric { - name = "bytes_out" - value_threshold = 4096 - title = "Bytes Sent" - } - metric { - name = "bytes_in" - value_threshold = 4096 - title = "Bytes Received" - } - metric { - name = "pkts_in" - value_threshold = 256 - title = "Packets Received" - } - metric { - name = "pkts_out" - value_threshold = 256 - title = "Packets Sent" - } -} - -/* Different than 2.5.x default since the old config made no sense */ -collection_group { - collect_every = 1800 - time_threshold = 3600 - metric { - name = "disk_total" - value_threshold = 1.0 - title = "Total Disk Space" - } -} - -collection_group { - collect_every = 40 - time_threshold = 180 - metric { - name = "disk_free" - value_threshold = 1.0 - title = "Disk Space Available" - } - metric { - name = "part_max_used" - value_threshold = 1.0 - title = "Maximum Disk Space Used" - } -} - - - - - - - diff --git a/bibigrid-core/src/main/resources/playbook/roles/master/tasks/002-zabbix.yml b/bibigrid-core/src/main/resources/playbook/roles/master/tasks/002-zabbix.yml index 16dd6eff9..7e95a9d37 100644 --- a/bibigrid-core/src/main/resources/playbook/roles/master/tasks/002-zabbix.yml +++ b/bibigrid-core/src/main/resources/playbook/roles/master/tasks/002-zabbix.yml @@ -190,7 +190,7 @@ host_groups: - 'Linux servers' link_templates: - - 'Template OS Linux by Zabbix Agent' # new in Zabbix 4.4 + - 'Linux by Zabbix agent' interfaces: - type: 1 # agent main: 1 # default diff --git a/bibigrid-core/src/main/resources/playbook/roles/master/tasks/006-database.yml b/bibigrid-core/src/main/resources/playbook/roles/master/tasks/006-database.yml new file mode 100644 index 000000000..e7c19f166 --- /dev/null +++ b/bibigrid-core/src/main/resources/playbook/roles/master/tasks/006-database.yml @@ -0,0 +1,8 @@ +- name: Install maria-db-server + apt: + name: "mariadb-server" + +- name: Install PyMySQL via pip + pip: + name: pymysql + diff --git a/bibigrid-core/src/main/resources/playbook/roles/master/tasks/021-slurm.yml b/bibigrid-core/src/main/resources/playbook/roles/master/tasks/021-slurm.yml deleted file mode 100644 index 6380ab00d..000000000 --- a/bibigrid-core/src/main/resources/playbook/roles/master/tasks/021-slurm.yml +++ /dev/null @@ -1,39 +0,0 @@ -- name: Install Slurm packages - apt: - name: [slurm-wlm] - state: latest - -# (Re-)start slurmctl master daemon -- name: (Re-)start slurmctld master - systemd: - name: slurmctld - enabled: True - state: restarted - when: slurm_conf is changed or slurm_cggroup_conf is changed or SLURM_cgroup_allowed_devices_conf is changed - - -# (Re-)start slurmd worker daemon -- name: (Re-)start slurmd worker daemon - systemd: - name: slurmctld - enabled: True - state: restarted - when: - - use_master_as_compute == 'yes' - - slurm_conf is changed or slurm_cggroup_conf is changed or SLURM_cgroup_allowed_devices_conf is changed - -# copy -- name: Copy GridEngine compatible layer - copy: - src: slurm/qsub - dest: /usr/local/bin/qsub - mode: 0755 - owner: 
root - group: root - -- copy: - src: slurm/gridengine-rc - dest: /home/ubuntu/.gridengine-rc - mode: 0600 - owner: ubuntu - group: ubuntu \ No newline at end of file diff --git a/bibigrid-core/src/main/resources/playbook/roles/master/tasks/main.yml b/bibigrid-core/src/main/resources/playbook/roles/master/tasks/main.yml index 9d22e9ce5..6d6658849 100644 --- a/bibigrid-core/src/main/resources/playbook/roles/master/tasks/main.yml +++ b/bibigrid-core/src/main/resources/playbook/roles/master/tasks/main.yml @@ -1,12 +1,3 @@ -- block: - - debug: - msg: "[BIBIGRID] Setup Ganglia monitor" - - include: 001-ganglia.yml - tags: ["master-ganglia","ganglia"] - when: - - enable_ganglia == 'yes' - - ansible_distribution_release == 'xenial' - - block: - debug: msg: "[BIBIGRID] Setup Zabbix Server" @@ -20,6 +11,11 @@ - include: 005-disk.yml tags: ["master-disk","disk"] +- debug: + msg: "[BIBIGRID] Configure database" +- include: 006-database.yml + tags: ["database","slurm","master-slurm"] + - debug: msg: "[BIBIGRID] Setup NFS" when: @@ -29,24 +25,6 @@ - enable_nfs == 'yes' tags: ["master-nfs","nfs"] -- debug: - msg: "[BIBIGRID] Setup GridEngine" - when: - - enable_gridengine == 'yes' -- include: 020-gridengine.yml - when: - - enable_gridengine == 'yes' - tags: ["master-gridengine","gridengine"] - -- block: - - debug: - msg: "[BIBIGRID] Setup Slurm master" - - - include: 021-slurm.yml - when: - - enable_slurm == 'yes' - tags: ["master-slurm","slurm","scale-up","scale-down"] - - debug: msg: "[BIBIGRID] Setup Theia" when: diff --git a/bibigrid-core/src/main/resources/playbook/roles/common/files/slurm/cgroup.conf b/bibigrid-core/src/main/resources/playbook/roles/slurm/files/cgroup.conf similarity index 100% rename from bibigrid-core/src/main/resources/playbook/roles/common/files/slurm/cgroup.conf rename to bibigrid-core/src/main/resources/playbook/roles/slurm/files/cgroup.conf diff --git a/bibigrid-core/src/main/resources/playbook/roles/common/files/slurm/cgroup_allowed_devices_file.conf b/bibigrid-core/src/main/resources/playbook/roles/slurm/files/cgroup_allowed_devices_file.conf similarity index 75% rename from bibigrid-core/src/main/resources/playbook/roles/common/files/slurm/cgroup_allowed_devices_file.conf rename to bibigrid-core/src/main/resources/playbook/roles/slurm/files/cgroup_allowed_devices_file.conf index 756fc52f3..471ad8cfd 100644 --- a/bibigrid-core/src/main/resources/playbook/roles/common/files/slurm/cgroup_allowed_devices_file.conf +++ b/bibigrid-core/src/main/resources/playbook/roles/slurm/files/cgroup_allowed_devices_file.conf @@ -1,6 +1,7 @@ /dev/null /dev/urandom /dev/zero -/dev/sda* +/dev/sd* +/dev/vd* /dev/cpu/*/* /dev/pts/* \ No newline at end of file diff --git a/bibigrid-core/src/main/resources/playbook/roles/slurm/files/slurmrestd_default b/bibigrid-core/src/main/resources/playbook/roles/slurm/files/slurmrestd_default new file mode 100644 index 000000000..b6d2fd860 --- /dev/null +++ b/bibigrid-core/src/main/resources/playbook/roles/slurm/files/slurmrestd_default @@ -0,0 +1,9 @@ +# /etc/default/slurmrestd +# Additional options that are passed to the slurmrestd daemon +#SLURMRESTD_OPTIONS="" +SLURM_CONF="/etc/slurm/slurm.conf" +#SLURMRESTD_DEBUG="8" +SLURM_JWT="" +SLURMRESTD_LISTEN=":6820" +SLURMRESTD_AUTH_TYPES="rest_auth/jwt" +SLURMRESTD_OPENAPI_PLUGINS="openapi/v0.0.36" \ No newline at end of file diff --git a/bibigrid-core/src/main/resources/playbook/roles/slurm/files/slurmrestd_override.conf b/bibigrid-core/src/main/resources/playbook/roles/slurm/files/slurmrestd_override.conf new 
file mode 100644 index 000000000..eebbe66f7 --- /dev/null +++ b/bibigrid-core/src/main/resources/playbook/roles/slurm/files/slurmrestd_override.conf @@ -0,0 +1,6 @@ +# Override systemd service ExecStart command to disable unixSocket of slurmrestd +[Unit] +After=slurmdbd.service +[Service] +ExecStart= +ExecStart=/usr/sbin/slurmrestd $SLURMRESTD_OPTIONS \ No newline at end of file diff --git a/bibigrid-core/src/main/resources/playbook/roles/slurm/handlers/main.yml b/bibigrid-core/src/main/resources/playbook/roles/slurm/handlers/main.yml new file mode 100644 index 000000000..a6b8f991f --- /dev/null +++ b/bibigrid-core/src/main/resources/playbook/roles/slurm/handlers/main.yml @@ -0,0 +1,20 @@ +- name: slurmdbd + systemd: + name: slurmdbd + state: restarted + +- name: slurmrestd + systemd: + name: slurmrestd + state: restarted + daemon_reload: yes + +- name: slurmctld + systemd: + name: slurmctld + state: restarted + +- name: slurmd + systemd: + name: slurmd + state: restarted \ No newline at end of file diff --git a/bibigrid-core/src/main/resources/playbook/roles/slurm/tasks/main.yml b/bibigrid-core/src/main/resources/playbook/roles/slurm/tasks/main.yml new file mode 100644 index 000000000..fc343d4b2 --- /dev/null +++ b/bibigrid-core/src/main/resources/playbook/roles/slurm/tasks/main.yml @@ -0,0 +1,138 @@ +- debug: + msg: "[BIBIGRID] Setup Slurm" + +- name: Install Slurm base packages and dependencies + apt: + name: + - slurm-wlm + - munge + state: latest + +- name: Create new secret + copy: + content: '{{ munge_key }}' + dest: /etc/munge/munge.key + owner: munge + group: munge + mode: 0600 + register: munge_key + +- name: Restart Munge (on key change) + systemd: + name: munge + state: restarted + when: munge_key is changed + +- name: SLURM configuration + template: + src: slurm.conf + dest: /etc/slurm/slurm.conf + owner: slurm + group: root + mode: 0444 + notify: + - slurmctld + - slurmd + +- name: SLURM cgroup configuration + copy: + src: cgroup.conf + dest: /etc/slurm/cgroup.conf + owner: slurm + group: root + mode: 0444 + notify: + - slurmctld + - slurmd + +- name: SLURM cgroup allowed devices conf + copy: + src: cgroup_allowed_devices_file.conf + dest: /etc/slurm/cgroup_allowed_devices_file.conf + owner: root + group: root + mode: 0444 + notify: + - slurmctld + - slurmd + +- block: + - name: Create slurm db + mysql_db: + name: "{{slurmConf.db}}" + state: present + login_unix_socket: /var/run/mysqld/mysqld.sock + + - name: Create slurm db user + mysql_user: + name: "{{slurmConf.db_user}}" + password: "{{slurmConf.db_password}}" + priv: '*.*:ALL' + state: present + login_unix_socket: /var/run/mysqld/mysqld.sock + + - name: Install Slurm database and RestAPI packages + apt: + name: + - slurmdbd + - slurmrestd + state: latest + + - name: Create slurmdb configuration file + template: + src: slurmdbd.conf + dest: /etc/slurm/slurmdbd.conf + owner: slurm + group: root + mode: 0600 + notify: + - slurmdbd + - slurmctld + + - name: Generate random JWT Secret + command: + cmd: "dd if=/dev/random of=/etc/slurm/jwt-secret.key bs=32 count=1" + creates: "/etc/slurm/jwt-secret.key" # only run the command when file is not present + + - name: Change file Properties of JWT Secret file + file: + path: /etc/slurm/jwt-secret.key + owner: slurm + group: slurm + mode: 0600 + + - name: Copy env file for configuration of slurmrestd + copy: + src: slurmrestd_default + dest: /etc/default/slurmrestd + owner: root + group: root + mode: 0644 + notify: + - slurmdbd + - slurmrestd + + - name: Create Service Directory + 
file: + path: /etc/systemd/system/slurmrestd.service.d + group: root + owner: root + mode: 0755 + state: directory + + - name: Copy systemd Service override file + copy: + src: slurmrestd_override.conf + dest: /etc/systemd/system/slurmrestd.service.d/override.conf + mode: 0644 + owner: root + group: root + notify: + - slurmrestd + + - name: start slurm explicity after all dependencies are configured + systemd: + name: slurmctld + state: started + + when: "'master' in group_names" \ No newline at end of file diff --git a/bibigrid-core/src/main/resources/playbook/roles/common/templates/slurm/slurm.conf b/bibigrid-core/src/main/resources/playbook/roles/slurm/templates/slurm.conf similarity index 62% rename from bibigrid-core/src/main/resources/playbook/roles/common/templates/slurm/slurm.conf rename to bibigrid-core/src/main/resources/playbook/roles/slurm/templates/slurm.conf index 01df70923..5a2812c8a 100644 --- a/bibigrid-core/src/main/resources/playbook/roles/common/templates/slurm/slurm.conf +++ b/bibigrid-core/src/main/resources/playbook/roles/slurm/templates/slurm.conf @@ -3,6 +3,8 @@ ControlMachine={{ master.hostname }} AuthType=auth/munge CryptoType=crypto/munge SlurmUser=slurm +AuthAltTypes=auth/jwt +AuthAltParameters=jwt_key=/etc/slurm/jwt-secret.key # NODE CONFIGURATIONS {% set mem = master.memory // 1024 * 1000 %} @@ -18,13 +20,20 @@ NodeName={{ worker.hostname }} SocketsPerBoard={{ worker.cores }} CoresPerSocket PartitionName=debug Nodes={% if use_master_as_compute == 'yes' %}{{master.hostname}},{%endif%}{{sl|join(",")}} default=YES # ACCOUNTING -#AccountingStorageType=accounting_storage/slurmdbd -#AccountingStorageHost=lxcc01 +AccountingStorageType=accounting_storage/slurmdbd +AccountingStoreJobComment=YES +AccountingStorageHost={{ master.hostname }} +AccountingStorageUser={{ slurmConf.db_user }} + +# PRIORITY +PriorityType=priority/multifactor +PriorityFavorSmall=NO +PriorityWeightJobSize=100000 +AccountingStorageTRES=cpu,mem,gres/gpu +PriorityWeightTRES=cpu=1000,mem=2000,gres/gpu=3000 + #JobAcctGatherType=jobacct_gather/linux -{% if (ansible_distribution == 'Ubuntu' and ansible_distribution_release == 'focal') or (ansible_distribution == 'Debian' and ansible_distribution_release == 'buster') %} -# Debian 10 "buster" slurm package needs clustername to be set ClusterName=bibigrid -{% endif %} # CONNECTION SlurmctldPort=6817 @@ -35,28 +44,23 @@ SelectType=select/cons_res SelectTypeParameters=CR_Core # DIRECTORIES -JobCheckpointDir=/var/lib/slurm-llnl/job_checkpoint -SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd -StateSaveLocation=/var/lib/slurm-llnl/state_checkpoint +JobCheckpointDir=/var/lib/slurm/job_checkpoint +SlurmdSpoolDir=/var/lib/slurm/slurmd +StateSaveLocation=/var/lib/slurm/state_checkpoint # LOGGING SlurmctldDebug=debug -SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log +SlurmctldLogFile=/var/log/slurm/slurmctld.log SlurmdDebug=debug -SlurmdLogFile=/var/log/slurm-llnl/slurmd.log +SlurmdLogFile=/var/log/slurm/slurmd.log # ansible_distribution {{ ansible_distribution }} # ansible_distribution_release {{ ansible_distribution_release }} # ansible_distribution_version {{ ansible_distribution_version }} # STATE INFO -{% if ( ansible_distribution == 'Ubuntu' and ansible_distribution_release == 'focal' ) or ( ansible_distribution == 'Debian' and ansible_distribution_release == 'buster' ) %} SlurmctldPidFile=/run/slurmctld.pid SlurmdPidFile=/run/slurmd.pid -{% else %} -SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid -SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid -{% endif %} # 
SCHEDULING # FastSchedule=2 diff --git a/bibigrid-core/src/main/resources/playbook/roles/slurm/templates/slurmdbd.conf b/bibigrid-core/src/main/resources/playbook/roles/slurm/templates/slurmdbd.conf new file mode 100644 index 000000000..68d998f46 --- /dev/null +++ b/bibigrid-core/src/main/resources/playbook/roles/slurm/templates/slurmdbd.conf @@ -0,0 +1,29 @@ +ArchiveEvents=yes +ArchiveJobs=yes +ArchiveResvs=yes +ArchiveSteps=no +ArchiveSuspend=no +ArchiveTXN=no +ArchiveUsage=no +#ArchiveScript=/usr/sbin/slurm.dbd.archive +AuthInfo=/var/run/munge/munge.socket.2 +AuthType=auth/munge +DbdHost={{ master.hostname }} +DbdAddr=127.0.0.1 +DebugLevel=debug +PurgeEventAfter=1month +PurgeJobAfter=1month +PurgeResvAfter=1month +PurgeStepAfter=1month +PurgeSuspendAfter=1month +PurgeTXNAfter=1month +PurgeUsageAfter=1month +LogFile=/var/log/slurmdbd.log +PidFile=/var/run/slurmdbd.pid +SlurmUser=slurm +StorageLoc={{ slurmConf.db }} +StoragePass={{ slurmConf.db_password }} +StorageType=accounting_storage/mysql +StorageUser={{ slurmConf.db_user }} +StoragePort=3306 +StorageHost=127.0.0.1 \ No newline at end of file diff --git a/bibigrid-core/src/main/resources/playbook/roles/worker/tasks/021-slurm.yml b/bibigrid-core/src/main/resources/playbook/roles/worker/tasks/021-slurm.yml index 177bd2acb..047debbe5 100644 --- a/bibigrid-core/src/main/resources/playbook/roles/worker/tasks/021-slurm.yml +++ b/bibigrid-core/src/main/resources/playbook/roles/worker/tasks/021-slurm.yml @@ -1,13 +1,12 @@ -- name: Install SLURM worker packages - apt: - name: [slurmd] - state: latest - update_cache: 'yes' - -# (Re-)start slurmd worker daemon -- name: (Re-)start slurmd worker daemon - systemd: - name: slurmd - enabled: True - state: restarted - when: slurm_conf is changed or slurm_cggroup_conf is changed or SLURM_cgroup_allowed_devices_conf is changed \ No newline at end of file +#- name: Install SLURM worker packages +# apt: +# name: [slurmd] +# state: latest +# update_cache: 'yes' +# +## (Re-)start slurmd worker daemon +#- name: Enable and slurmd worker daemon +# systemd: +# name: slurmd +# enabled: True +# state: started \ No newline at end of file diff --git a/bibigrid-core/src/main/resources/playbook/roles/worker/tasks/main.yml b/bibigrid-core/src/main/resources/playbook/roles/worker/tasks/main.yml index f5f202fc1..9eb7e2926 100644 --- a/bibigrid-core/src/main/resources/playbook/roles/worker/tasks/main.yml +++ b/bibigrid-core/src/main/resources/playbook/roles/worker/tasks/main.yml @@ -11,20 +11,20 @@ when: enable_nfs == 'yes' tags: ['nfs','worker-nfs'] -- debug: - msg: "[BIBIGRID] Setup GridEngine" - when: - - enable_gridengine == 'yes' -- include: 020-gridengine.yml - when: - - enable_gridengine == 'yes' - tags: ['gridengine','worker-gridengine'] +#- debug: +# msg: "[BIBIGRID] Setup GridEngine" +# when: +# - enable_gridengine == 'yes' +#- include: 020-gridengine.yml +# when: +# - enable_gridengine == 'yes' +# tags: ['gridengine','worker-gridengine'] -- debug: - msg: "[BIBIGRID] Setup Slurm Worker" - when: enable_slurm == 'yes' -- include: 021-slurm.yml - tags: ['slurm','worker-slurm'] - when: enable_slurm == 'yes' +#- debug: +# msg: "[BIBIGRID] Setup Slurm Worker" +# when: enable_slurm == 'yes' +#- include: 021-slurm.yml +# tags: ['slurm','worker-slurm'] +# when: enable_slurm == 'yes' diff --git a/bibigrid-core/src/main/resources/playbook/site.yml b/bibigrid-core/src/main/resources/playbook/site.yml index 0ae8a9231..b5276e375 100644 --- a/bibigrid-core/src/main/resources/playbook/site.yml +++ 
b/bibigrid-core/src/main/resources/playbook/site.yml @@ -1,3 +1,7 @@ +# Attention. This file will be overwritten by BiBiGrid during Cluster creation. +# -> package de.unibi.cebitec.bibigrid.core.model +# -> public final class AnsibleConfig + - hosts: master become: 'yes' vars_files: @@ -5,8 +9,9 @@ - vars/instances.yml - vars/common_configuration.yml roles: - - common - - master + - { role: common, tags: ["common"] } + - { role: master, tags: [ "master" ] } + - { role: slurm, tags: ['slurm',"scale-up","scale-down"] } - hosts: workers become: 'yes' vars_files: @@ -15,5 +20,6 @@ - vars/common_configuration.yml - vars/{{ ansible_default_ipv4.address }}.yml roles: - - common - - worker + - { role: common, tags: ["common"] } + - { role: worker, tags: ["worker"] } + - { role: slurm, tags: ['slurm',"scale-up","scale-down"] } diff --git a/bibigrid-light-rest-4j/pom.xml b/bibigrid-light-rest-4j/pom.xml index 35d78746d..179c1bb6e 100644 --- a/bibigrid-light-rest-4j/pom.xml +++ b/bibigrid-light-rest-4j/pom.xml @@ -13,19 +13,14 @@ com.networknt.server.Server 2.0.4 2.9.9 - 2.9.10.7 + 2.9.10.8 1.7.25 0.6.3 1.2.3 4.13.1 - 2.0.23.Final + 2.1.6.Final 1.0.19 2.18.1 - - - - - 2.4 1.0.0 3.1.0 diff --git a/bibigrid-openstack/src/main/java/de/unibi/cebitec/bibigrid/openstack/OpenStackCredentials.java b/bibigrid-openstack/src/main/java/de/unibi/cebitec/bibigrid/openstack/OpenStackCredentials.java index 0725f3ed2..37ed640d6 100644 --- a/bibigrid-openstack/src/main/java/de/unibi/cebitec/bibigrid/openstack/OpenStackCredentials.java +++ b/bibigrid-openstack/src/main/java/de/unibi/cebitec/bibigrid/openstack/OpenStackCredentials.java @@ -107,7 +107,7 @@ public String toString(){ sb.append("\tproject : "+project+"\n"); sb.append("\tprojectId : "+projectId+"\n"); sb.append("\tusername : "+username+"\n"); - sb.append("\tpassword : XXXXXXXX\n"); + sb.append("\tpassword : *********\n"); sb.append("\tendpoint : "+endpoint+"\n"); sb.append("\tuserDomain : "+userDomain+"\n"); sb.append("\tuserDomainId : "+userDomainId+"\n"); diff --git a/docs/CONFIGURATION_SCHEMA.md b/docs/CONFIGURATION_SCHEMA.md index 1d75b713a..70d4fe296 100644 --- a/docs/CONFIGURATION_SCHEMA.md +++ b/docs/CONFIGURATION_SCHEMA.md @@ -59,7 +59,6 @@ serviceCIDR: string # Overwrites CIDR mask setti # HPC Cluster Software slurm: boolean [yes, no] # Enable / Disable SLURM Workload Manager. Default is no -oge: boolean [yes, no] # deprecated - supported for Ubuntu 16.04 only. Default is no # Monitoring zabbix: boolean [yes, "no"] # Use zabbix monitoring tool. Default is no @@ -70,7 +69,6 @@ zabbixConf: timezone: string # Default is "Europe/Berlin" server_name: string # Name of Server. Default is "bibigrid" admin_password: string # Admin password. Default is "bibigrid". Change hardly recommended! -ganglia: boolean [yes, "no"] # deprecated - supported for Ubuntu 16.04 only. Default is no # Network FileSystem nfs: boolean ["yes", no] # Enable / Disable Network File System, Default is yes diff --git a/docs/README.md b/docs/README.md index dedfac3d0..edadbbacb 100644 --- a/docs/README.md +++ b/docs/README.md @@ -54,11 +54,12 @@ workerInstances: ports: - type: TCP number: 80 + ipRange: current - type: TCP number: 443 + ipRange: current nfs: yes -theia: yes slurm: yes ``` @@ -124,20 +125,6 @@ http://ip.of.your.master/zabbix ``` The 'Username' to enter is `admin`, the following 'Password' is the previously specified admin password. 
-### GridEngine Configuration -If you decide to enable GridEngine (deprecated, supported for Ubuntu 16.04 only - you may use SLURM instead) -you have to use the `oge` configuration parameter. -See the [sge_conf(5) man page](http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html) to get -an overview as well as a description of the possible parameters. - -As an example you can set the max number of dynamic event clients (jobs submitted via qsub sync): -``` -ogeConf: - qmaster_params: MAX_DYN_EC=1000 -``` -The given value(s) will be overwritten in or added to the default configuration. -Check `qconf -sconf global` on master to proof the configuration. - ### Including Ansible (Galaxy) Roles You can include ansible roles from your local machine (compressed as .tar.gz files) automatically into your cluster setup by defining following configuration settings: @@ -351,4 +338,4 @@ Additionally, you have the possibility to terminate all clusters of a specific u ``` Here you have to insert your username instead of '[user]'. This may save time, -if you are absolutely certain you don't need any of your clusters anymore. \ No newline at end of file +if you are absolutely certain you don't need any of your clusters anymore.