Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

babysitter job to automatically kill hung jobs #8

Open
yaroslavvb opened this issue May 7, 2019 · 0 comments
Open

babysitter job to automatically kill hung jobs #8

yaroslavvb opened this issue May 7, 2019 · 0 comments

Comments

@yaroslavvb
Copy link

yaroslavvb commented May 7, 2019

idea for babysitter job, cc @8enmann

job = ncluster.make_job(name=args.name,
                            run_name=f"{args.name}",
                            num_tasks=config.machines,
                            image_name=config.image_name,
                            instance_type=config.instance_type,
                            spot=not args.nospot,
                            skip_setup=args.skip_setup)

killer_task = ncluster.make_task()
killer_task.run(f'export AWS_ACCESS_KEY_ID={os.environ["AWS_ACCESS_KEY_ID"]}')
killer_task.run(f'export AWS_SECRET_ACCESS_KEY={os.environ["AWS_SECRET_ACCESS_KEY"]')
killer_task.run(f'export AWS_DEFAULT_REGION={os.environ["AWS_DEFAULT_REGION"]')
killer_task.run(f'python hung_job_killer.py --watchdir={job.logdir} --instances=",".join(t.name for t in job.tasks)')

The hung_job_killer would check watch_dir on a regular basis and kill all instances in --instances if watch_dir had no modifications for an hour.

Killing can be done with subset of the logic from ncluster command-line tool. Currently lookup_instances has special logic for exact_match which kicks in when fragment is wrapped in '', this should probably be a keyword argument instead

def kill(fragment=''):
  instances = u.lookup_instances(fragment, valid_states=['running', 'stopped'])
  instances_to_kill = []
  for i in instances:
    state = i.state['Name']
    if LIMIT_TO_CURRENT_USER and i.key_name != u.get_keypair_name():
      print(f"Skipping instance launched with key {i.key_name}, use reallykill to kill")
      continue
    print(u.get_name(i), i.instance_type, i.key_name,
          state if state == 'stopped' else '')
    instances_to_kill.append(i)

  action = 'terminating'
  if not _check_instance_found(instances, fragment):
    return

  ec2_client = u.get_ec2_client()
  if answer.lower() == "y":
    instance_ids = [i.id for i in instances_to_kill]
    response = ec2_client.terminate_instances(InstanceIds=instance_ids)

    assert u.is_good_response(response), response
    print(f"{action}: success")
  else:
    print("Didn't get y, doing nothing")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant