[UX] Add infeasibility reasons to the exception message #3986

Conless · 2024-09-25T12:09:16Z

This PR fixes #3911 by summarizing the infeasibility reasons for each resource in a table and appending it to the end of the final exception message.

Here is a minimal example.

$ sky launch -c k8s --cloud kubernetes -i 10 -y
I 09-25 20:04:46 optimizer.py:719] == Optimizer ==
I 09-25 20:04:46 optimizer.py:742] Estimated cost: $0.0 / hour
I 09-25 20:04:46 optimizer.py:742] 
I 09-25 20:04:46 optimizer.py:867] Considered resources (1 node):
I 09-25 20:04:46 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-25 20:04:46 optimizer.py:937]  CLOUD        INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 09-25 20:04:46 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-25 20:04:46 optimizer.py:937]  Kubernetes   2CPU--2GB   2       2         -              kubernetes    0.00          ✔     
I 09-25 20:04:46 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-25 20:04:46 optimizer.py:937] 
Running task on cluster k8s...
I 09-25 20:04:46 cloud_vm_ray_backend.py:4421] Creating a new cluster: 'k8s' [1x Kubernetes(2CPU--2GB)].
I 09-25 20:04:46 cloud_vm_ray_backend.py:4421] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
W 09-25 20:04:48 cloud_vm_ray_backend.py:2000] sky.exceptions.NotSupportedError: The following features are not supported by Kubernetes:
W 09-25 20:04:48 cloud_vm_ray_backend.py:2000]  Feature  Reason                                     
W 09-25 20:04:48 cloud_vm_ray_backend.py:2000]  stop     Kubernetes does not support stopping VMs.  
W 09-25 20:04:48 cloud_vm_ray_backend.py:2026] 
W 09-25 20:04:48 cloud_vm_ray_backend.py:2026] Provision failed for 1x Kubernetes(2CPU--2GB) in kubernetes. Trying other locations (if any).

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes()
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
The reasons for the infeasibility of each resource are summarized below. For detailed explanations, please refer to the log above.
Resource               Reason                                                   
Kubernetes(2CPU--2GB)  The following features are not supported by Kubernetes:  
                        Feature  Reason                                                                                               
                        stop     Kubernetes does not support stopping VMs.
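
As a rough illustration of the approach (not the code in this PR), the sketch below builds a two-column table from a resource-to-reason mapping and appends it to the exception message. The `ResourcesUnavailableError` class here is a simplified stand-in, and `format_reason_table` and `build_error` are hypothetical helper names; the PR itself builds the table with `log_utils.create_table`.

```python
class ResourcesUnavailableError(Exception):
    """Simplified stand-in for sky.exceptions.ResourcesUnavailableError."""


def format_reason_table(reasons):
    """Render a plain-text 'Resource / Reason' table (hypothetical helper).

    `reasons` maps a resource description to the reason it was infeasible.
    """
    rows = [('Resource', 'Reason')]
    rows += [(str(resource), str(reason)) for resource, reason in reasons.items()]
    # Pad the first column to the widest resource name so reasons line up.
    width = max(len(resource) for resource, _ in rows)
    return '\n'.join(f'{resource.ljust(width)}  {reason}'
                     for resource, reason in rows)


def build_error(base_message, reasons):
    """Append the reason table to the end of the base exception message."""
    return ResourcesUnavailableError(
        base_message + '\n' + format_reason_table(reasons))


reasons = {
    'Kubernetes(2CPU--2GB)': 'Kubernetes does not support stopping VMs.',
}
error = build_error(
    'Failed to provision all possible launchable resources.', reasons)
print(error)
```

Keeping the table inside the exception message means the summary is the last thing the user sees, even when the detailed warnings have scrolled out of view.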

The output table is sized to fit the width of the terminal. Below is an example from a narrow terminal (output truncated).

$ sky launch --gpus H100:8 -y
The reasons for the infeasibility of each resource are 
summarized below. For detailed explanations, please 
refer to the log above.
Resource      Reason                                          
GCP(a3-       Failed to acquire resources in us-central1-a.   
highgpu-8g,   Try changing resource requirements or use       
{'H100': 8})  another zone.                                   
GCP(a3-       Failed to acquire resources in us-west1-a. Try  
highgpu-8g,   changing resource requirements or use another   
{'H100': 8})  zone.                                           
GCP(a3-       Failed to acquire resources in us-east4-a. Try  
highgpu-8g,   changing resource requirements or use another   
{'H100': 8})  zone.                                           
GCP(a3-       Failed to acquire resources in europe-west1-b.  
highgpu-8g,   Try changing resource requirements or use       
{'H100': 8})  another zone.
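
A minimal sketch of how a two-column table could be wrapped to the terminal width, using only the Python standard library. This is an assumption about one possible approach, not the PR's implementation; `wrap_table` and its `first_col_width` parameter are hypothetical names.

```python
import shutil
import textwrap


def wrap_table(rows, total_width=None, first_col_width=12):
    """Render two-column rows, wrapping each cell to fit `total_width`."""
    if total_width is None:
        # Falls back to 80 columns when the size cannot be detected.
        total_width = shutil.get_terminal_size().columns
    gap = 2
    second_col_width = max(total_width - first_col_width - gap, 10)
    lines = []
    for left, right in rows:
        left_lines = textwrap.wrap(left, first_col_width) or ['']
        right_lines = textwrap.wrap(right, second_col_width) or ['']
        # Pad the shorter cell with empty lines so both have equal height.
        height = max(len(left_lines), len(right_lines))
        left_lines += [''] * (height - len(left_lines))
        right_lines += [''] * (height - len(right_lines))
        for left_part, right_part in zip(left_lines, right_lines):
            line = left_part.ljust(first_col_width) + ' ' * gap + right_part
            lines.append(line.rstrip())
    return '\n'.join(lines)


rows = [
    ('Resource', 'Reason'),
    ("GCP(a3-highgpu-8g, {'H100': 8})",
     'Failed to acquire resources in us-central1-a. Try changing '
     'resource requirements or use another zone.'),
]
print(wrap_table(rows, total_width=40))
```

Wrapping both columns, rather than only the reason, also covers long resource descriptions that would otherwise push the table past the edge of the terminal.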

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Michaelvll · 2024-10-01T16:54:20Z

sky/backends/cloud_vm_ray_backend.py

+                table = log_utils.create_table(['Resource', 'Reason'])
+                for (resource, exception) in resource_exceptions.items():
+                    table.add_row([
+                        resource,
+                        _EXCEPTION_SUMMARY_MESSAGE[exception.__class__]
+                    ])
+                raise exceptions.ResourcesUnavailableError(
+                    _RESOURCES_UNAVAILABLE_LOG + '\n' + table.get_string(),
+                    failover_history=failover_history)


Instead of parsing the exceptions here, should we directly rely on the failover_history to generate the reason table at the caller? Or is there a reason we have to do it here?

It might be good to test with, e.g., sky launch --gpus H100:8 to see what the output looks like when failing over through many regions.

Conless · 2024-10-02

> Instead of parsing the exceptions here, should we directly rely on the failover_history to generate the reason table at the caller? Or is there a reason we have to do it here?

Yes, this was a design I tried, but I don't think failover_history gives users enough information to identify the problem. For example, when I run sky launch --gpus H100:8, the (partial) failover history is:

[ResourcesUnavailableError('Failed to acquire resources in us-central1-a. Try changing resource requirements or use another zone.'), ResourcesUnavailableError('Failed to acquire resources in us-west1-a. Try changing resource requirements or use another zone.'),

As you can see, it only contains the region of each failed provision; it does not even include the cloud provider or resource information. So I think constructing the mapping from each resource to its exception here is more user-friendly.
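
To make the difference concrete, here is a small sketch contrasting the two data shapes. The exception class is a stand-in, and the list of `(resource, exception)` pairs named `resource_failures` is a hypothetical simplification; the diff above uses a `resource_exceptions` mapping instead.

```python
class ResourcesUnavailableError(Exception):
    """Stand-in for sky.exceptions.ResourcesUnavailableError."""


# What failover_history carries: exceptions only, with no resource attached.
failover_history = [
    ResourcesUnavailableError(
        'Failed to acquire resources in us-central1-a. Try changing '
        'resource requirements or use another zone.'),
    ResourcesUnavailableError(
        'Failed to acquire resources in us-west1-a. Try changing '
        'resource requirements or use another zone.'),
]

# Hypothetical record kept where provisioning happens: each failure is
# paired with the resource that was being tried at that moment.
attempted = "GCP(a3-highgpu-8g, {'H100': 8})"
resource_failures = [(attempted, exc) for exc in failover_history]

rows = [f'{resource}: {exc}' for resource, exc in resource_failures]
print('\n'.join(rows))
```

Only the paired form can tell the user which cloud and instance type a given failure belongs to, because the messages in failover_history mention just the zone.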

> It might be good to test with, e.g., sky launch --gpus H100:8 to see what the output looks like when failing over through many regions.

Sure, here is the final output:

$ sky launch --gpus H100:8
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x <Cloud>({'H100': 8})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
The reasons for the infeasibility of each resource are summarized below. For detailed explanations, please refer to the log above.
Resource                         Reason
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
AWS(p5.48xlarge, {'H100': 8})    Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.
GCP(a3-highgpu-8g, {'H100': 8})  Requested resources cannot be satisfied on this cloud.

Seems that it works as expected.

A special case occurs when a resource has too many requirements: the 'Resource' column becomes very long, which affects the display in the terminal.

Conless · 2024-10-28T06:17:03Z

Hi @Michaelvll! I've just pushed a revised version of the PR, which changes the format of the output table to fit the width of the terminal and provides more details for users. The updated output is in the PR description.

yika-luo

Thanks @Conless!

Nit: this line is too verbose: "The reasons for the infeasibility of each resource are summarized below. For detailed explanations, please refer to the log above."
Suggestion: "Reasons for provision failures (for details, please check the log above):"

Conless · 2024-11-29T16:56:21Z

Thanks for your suggestion, @yika-luo! I just updated the message as you suggested.

Conless added 2 commits September 21, 2024 23:19

Append infeasibility reasons to the end of final exception message.

533e548

Fix format issues.

6932181

Michaelvll self-requested a review September 27, 2024 07:32

Michaelvll reviewed Oct 1, 2024

Beautify the output table of infeasibility reasons.

575ba56

Michaelvll requested a review from yika-luo November 15, 2024 17:07

yika-luo approved these changes Nov 21, 2024

Simplify the unavalible log.

4e129ce
