added bibilog to show output log of most recent worker start. Tried fixing the slurm23.11 bug.
XaverStiensmeier committed Feb 9, 2024
1 parent 880b6d0 commit 71f3214
Showing 9 changed files with 74 additions and 2 deletions.
7 changes: 5 additions & 2 deletions bibigrid/core/actions/terminate.py
@@ -147,10 +147,12 @@ def delete_security_groups(provider, cluster_id, security_groups, log, timeout=5
         tmp_success = False
         while not tmp_success:
             try:
+                # TODO: Check if security group exists at all
+                not_found = not provider.get_security_group(security_group_name)
                 tmp_success = provider.delete_security_group(security_group_name)
             except ConflictException:
                 tmp_success = False
-            if tmp_success:
+            if tmp_success or not_found:
                 break
             if attempts < timeout:
                 attempts += 1
@@ -161,7 +163,8 @@ def delete_security_groups(provider, cluster_id, security_groups, log, timeout=5
                 log.error(f"Attempt to delete security group {security_group_name} on "
                           f"{provider.cloud_specification['identifier']} failed.")
                 break
-        log.info(f"Delete security_group {security_group_name} -> {tmp_success}")
+        log.info(f"Delete security_group {security_group_name} -> {tmp_success or not_found} on "
+                 f"{provider.cloud_specification['identifier']}.")
         success = success and tmp_success
     return success
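
The retry logic touched by this hunk can be sketched in isolation. The following is a minimal, self-contained illustration and not the BiBiGrid implementation: `FakeProvider`, `delete_group_with_retry`, the local `ConflictException` stand-in and the `wait` parameter are all hypothetical names introduced here. Unlike the hunk, this variant skips the delete call entirely when the group is missing and reports a missing group as success.

```python
import time


class ConflictException(Exception):
    """Hypothetical stand-in for the provider's 'group still in use' error."""


class FakeProvider:
    """Hypothetical in-memory provider, used only for this illustration."""

    def __init__(self, groups, busy_attempts=0):
        self.groups = set(groups)
        # Number of delete calls that raise a conflict before one succeeds.
        self.busy_attempts = busy_attempts

    def get_security_group(self, name_or_id):
        # Mirrors the documented contract: the group if found, else None.
        return {"name": name_or_id} if name_or_id in self.groups else None

    def delete_security_group(self, name_or_id):
        if name_or_id not in self.groups:
            return False
        if self.busy_attempts > 0:
            self.busy_attempts -= 1
            raise ConflictException(name_or_id)
        self.groups.remove(name_or_id)
        return True


def delete_group_with_retry(provider, name, timeout=5, wait=0.0):
    """Retry deletion; a group that does not exist counts as already deleted."""
    attempts = 0
    while True:
        # Look the group up first so a missing group never causes retries.
        not_found = provider.get_security_group(name) is None
        deleted = False
        if not not_found:
            try:
                deleted = provider.delete_security_group(name)
            except ConflictException:
                deleted = False
        if deleted or not_found:
            return True
        if attempts >= timeout:
            return False
        attempts += 1
        time.sleep(wait)
```

Calling `delete_group_with_retry(FakeProvider([]), "sg")` returns immediately because the lookup reports the group as missing, which is the case the added `not_found` check is meant to cover.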

8 changes: 8 additions & 0 deletions bibigrid/core/provider.py
@@ -265,6 +265,14 @@ def append_rules_to_security_group(self, name_or_id, rules):
         :return:
         """
 
+    @abstractmethod
+    def get_security_group(self, name_or_id):
+        """
+        Returns security group if found else None.
+        @param name_or_id:
+        @return:
+        """
+
     def get_mount_info_from_server(self, server):
         volumes = []
         for server_volume in server["volumes"]:
8 changes: 8 additions & 0 deletions bibigrid/openstack/openstack_provider.py
@@ -320,3 +320,11 @@ def append_rules_to_security_group(self, name_or_id, rules):
                 port_range_max=rule["port_range_max"],
                 remote_ip_prefix=rule["remote_ip_prefix"],
                 remote_group_id=rule["remote_group_id"])
+
+    def get_security_group(self, name_or_id):
+        """
+        Returns security group if found else None.
+        @param name_or_id:
+        @return:
+        """
+        return self.conn.get_security_group(name_or_id)
1 change: 1 addition & 0 deletions documentation/markdown/bibigrid_feature_list.md
@@ -13,5 +13,6 @@
 | [Configuration](features/configuration.md) | Contains all data regarding cluster setup for all providers. |
 | [Command Line Interface](features/CLI.md) | What command line arguments can be passed into BiBiGrid. |
 | [Multi Cloud](features/multi_cloud.md) | Explanation how BiBiGrid's multi-cloud approach works |
+| [BiBiGrid Cluster Commands](features/cluster_commands.md) | Short useful commands to get information on the cluster |
 
 ![](../images/actions.jpg)
31 changes: 31 additions & 0 deletions documentation/markdown/features/cluster_commands.md
@@ -0,0 +1,31 @@
# BiBiGrid Cluster Commands

## [bibiname](../../../resources/playbook/roles/bibigrid/templates/bin/bibiname.j2) [m|v|default: w] [number]

This command generates node names so that users do not need to copy the cluster-id.
It takes two arguments. The first selects the node type: master (`m`), vpnwkr (`v`) or worker (`w`, the default).
The second argument, used only when vpnwkr or worker is selected, gives the index of that vpnwkr or worker.

### Examples
Assume the cluster-id `20ozebsutekrjj4`.

```sh
bibiname m
# bibigrid-master-20ozebsutekrjj4
```

```sh
bibiname v 0
# bibigrid-vpnwkr-20ozebsutekrjj4-0
```

```sh
bibiname 0 # or bibiname w 0
# bibigrid-worker-20ozebsutekrjj4-0
```

A more advanced use is to pass the generated name to `ssh` to log in to a worker:
```sh
ssh $(bibiname 0) # or bibiname w 0
# ssh bibigrid-worker-20ozebsutekrjj4-0
```
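
The naming scheme shown in the examples above can be sketched in Python. This is only an illustration inferred from those examples and not the logic of the `bibiname.j2` template, which is not shown here; the function name, the `KINDS` table and the error raised for a missing number are assumptions.

```python
# Hypothetical mapping from the short argument to the node type in the name.
KINDS = {"m": "master", "v": "vpnwkr", "w": "worker"}


def bibiname(cluster_id, *args):
    """Build a node name following the pattern seen in the examples."""
    args = list(args)
    kind = "w"  # worker is the default node type
    if args and args[0] in KINDS:
        kind = args.pop(0)
    name = f"bibigrid-{KINDS[kind]}-{cluster_id}"
    if kind == "m":
        # The master name carries no index.
        return name
    if not args:
        # Assumption: vpnwkr and worker names always need an index.
        raise ValueError("a node number is required for vpnwkr and worker names")
    return f"{name}-{int(args[0])}"
```

With this sketch, `bibiname("20ozebsutekrjj4", "v", "0")` yields `bibigrid-vpnwkr-20ozebsutekrjj4-0`, matching the second example.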
Binary file not shown.
16 changes: 16 additions & 0 deletions resources/bin/bibilog
@@ -0,0 +1,16 @@
#!/bin/bash
if [ "$1" == "err" ]; then
err_out="err"
else
err_out="out"
fi

if [ "$2" == "fail" ]; then
fail_create="fail"
else
fail_create="create"
fi

LOG="/var/log/slurm/worker_logs/$fail_create/$err_out"
RECENT=$(ls -1rt "$LOG" | tail -n1)
tail -f "$LOG/$RECENT"
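
The selection step of the script (`ls -1rt ... | tail -n1`, i.e. the entry with the newest modification time) can be illustrated in Python. This is a sketch under assumptions, not part of the commit: the function names are hypothetical, and unlike `ls` this version considers regular files only.

```python
from pathlib import Path


def worker_log_dir(stream="out", phase="create", base="/var/log/slurm/worker_logs"):
    """Build the directory the script tails, mirroring its two arguments."""
    err_out = "err" if stream == "err" else "out"
    fail_create = "fail" if phase == "fail" else "create"
    return Path(base) / fail_create / err_out


def most_recent_log(log_dir):
    """Return the most recently modified file in log_dir, or None if there is none."""
    files = [p for p in Path(log_dir).iterdir() if p.is_file()]
    if not files:
        return None
    return max(files, key=lambda p: p.stat().st_mtime)
```

As in the script, any first argument other than `err` falls back to `out`, and any second argument other than `fail` falls back to `create`.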
4 changes: 4 additions & 0 deletions resources/playbook/roles/bibigrid/handlers/main.yml
@@ -17,22 +17,26 @@
   systemd:
     name: slurmdbd
     state: restarted
+  when: "'vpnwkr' not in group_names"
 
 - name: slurmrestd
   systemd:
     name: slurmrestd
     state: restarted
     daemon_reload: true
+  when: "'vpnwkr' not in group_names"
 
 - name: slurmctld
   systemd:
     name: slurmctld
     state: restarted
+  when: "'master' in group_names"
 
 - name: slurmd
   systemd:
     name: slurmd
     state: restarted
+  when: "'vpnwkr' not in group_names"
 
 - name: zabbix-agent
   systemd:
1 change: 1 addition & 0 deletions resources/playbook/roles/bibigrid/tasks/main.yml
@@ -136,6 +136,7 @@
 - debug:
     msg: "[BIBIGRID] Setup Slurm"
 - import_tasks: 042-slurm.yml
+  when: "'vpnwkr' not in group_names"
 - import_tasks: 042-slurm-server.yml
   when: "'master' in group_names"
