You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
job = ncluster.make_job(name=args.name,
run_name=f"{args.name}",
num_tasks=config.machines,
image_name=config.image_name,
instance_type=config.instance_type,
spot=not args.nospot,
skip_setup=args.skip_setup)
killer_task = ncluster.make_task()
killer_task.run(f'export AWS_ACCESS_KEY_ID={os.environ["AWS_ACCESS_KEY_ID"]}')
killer_task.run(f'export AWS_SECRET_ACCESS_KEY={os.environ["AWS_SECRET_ACCESS_KEY"]')
killer_task.run(f'export AWS_DEFAULT_REGION={os.environ["AWS_DEFAULT_REGION"]')
killer_task.run(f'python hung_job_killer.py --watchdir={job.logdir} --instances=",".join(t.name for t in job.tasks)')
The hung_job_killer would check watch_dir on a regular basis and kill all instances in --instances if watch_dir had no modifications for an hour.
Killing can be done with subset of the logic from ncluster command-line tool. Currently lookup_instances has special logic for exact_match which kicks in when fragment is wrapped in '', this should probably be a keyword argument instead
def kill(fragment=''):
instances = u.lookup_instances(fragment, valid_states=['running', 'stopped'])
instances_to_kill = []
for i in instances:
state = i.state['Name']
if LIMIT_TO_CURRENT_USER and i.key_name != u.get_keypair_name():
print(f"Skipping instance launched with key {i.key_name}, use reallykill to kill")
continue
print(u.get_name(i), i.instance_type, i.key_name,
state if state == 'stopped' else '')
instances_to_kill.append(i)
action = 'terminating'
if not _check_instance_found(instances, fragment):
return
ec2_client = u.get_ec2_client()
if answer.lower() == "y":
instance_ids = [i.id for i in instances_to_kill]
response = ec2_client.terminate_instances(InstanceIds=instance_ids)
assert u.is_good_response(response), response
print(f"{action}: success")
else:
print("Didn't get y, doing nothing")
The text was updated successfully, but these errors were encountered:
idea for babysitter job, cc @8enmann
The
hung_job_killer
would checkwatch_dir
on a regular basis and kill all instances in--instances
ifwatch_dir
had no modifications for an hour.Killing can be done with subset of the logic from
ncluster
command-line tool. Currentlylookup_instances
has special logic forexact_match
which kicks in when fragment is wrapped in '', this should probably be a keyword argument insteadThe text was updated successfully, but these errors were encountered: