Document use of retry_on_error for dedicated inference endpoints (#554)

Shortly after #549, the inference endpoint backend
was updated to block by default on model loading.
This PR adds documentation explaining how to
circumvent that blocking so that users can handle
the 500 errors themselves if they prefer.
jinnovation authored Mar 14, 2024
1 parent 3226ad4 commit b757f81
Showing 1 changed file with 15 additions and 0 deletions.
packages/inference/README.md
@@ -506,6 +506,21 @@ const gpt2 = hf.endpoint('https://xyz.eu-west-1.aws.endpoints.huggingface.cloud/
const { generated_text } = await gpt2.textGeneration({inputs: 'The answer to the universe is'});
```

By default, all calls to the inference endpoint will wait until the model is
loaded. When [scaling to
0](https://huggingface.co/docs/inference-endpoints/en/autoscaling#scaling-to-0)
is enabled on the endpoint, this can result in a non-trivial wait. If you'd
rather disable this behavior and handle the endpoint's returned 500 HTTP errors
yourself, you can do so as follows:

```typescript
const gpt2 = hf.endpoint('https://xyz.eu-west-1.aws.endpoints.huggingface.cloud/gpt2');
const { generated_text } = await gpt2.textGeneration(
{inputs: 'The answer to the universe is'},
{retry_on_error: false},
);
```
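
For illustration only, here is a minimal sketch of what handling those errors yourself might look like: a manual retry loop around the call above. The helper `generateWithManualRetry`, the attempt count, the 10-second delay, and the assumption that `textGeneration` throws when the endpoint responds with a 500 are illustrative and not part of this change; the endpoint URL and token are placeholders.

```typescript
import { HfInference } from '@huggingface/inference';

// Placeholder token, as in the examples above.
const hf = new HfInference('hf_...');

// Illustrative helper (not part of the library): retry manually while the
// model behind the endpoint is still loading, instead of letting the client wait.
async function generateWithManualRetry(prompt: string, maxAttempts = 5): Promise<string> {
  const gpt2 = hf.endpoint('https://xyz.eu-west-1.aws.endpoints.huggingface.cloud/gpt2');

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const { generated_text } = await gpt2.textGeneration(
        { inputs: prompt },
        { retry_on_error: false },
      );
      return generated_text;
    } catch (error) {
      // Assumption: the call throws while the endpoint returns 500 during model loading.
      if (attempt === maxAttempts) throw error;
      await new Promise((resolve) => setTimeout(resolve, 10_000)); // wait before retrying
    }
  }
  throw new Error('unreachable');
}
```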

## Running tests

```console
