Document use of retry_on_error for dedicated inference endpoints (#554)

Shortly after #549, the inference endpoint backend
was updated to block by default on model loading.
This PR adds documentation explaining how to
circumvent that blocking so that users can handle
the 500 errors themselves if they prefer.
jinnovation authored Mar 14, 2024
1 parent 3226ad4 commit b757f81
Showing 1 changed file with 15 additions and 0 deletions.
packages/inference/README.md
@@ -506,6 +506,21 @@ const gpt2 = hf.endpoint('https://xyz.eu-west-1.aws.endpoints.huggingface.cloud/
const { generated_text } = await gpt2.textGeneration({inputs: 'The answer to the universe is'});
```

By default, all calls to the inference endpoint will wait until the model is
loaded. When [scaling to
0](https://huggingface.co/docs/inference-endpoints/en/autoscaling#scaling-to-0)
is enabled on the endpoint, this can result in a non-trivial wait. If you'd
rather disable this behavior and handle the endpoint's returned 500 HTTP errors
yourself, you can do so as follows:

```typescript
const gpt2 = hf.endpoint('https://xyz.eu-west-1.aws.endpoints.huggingface.cloud/gpt2');
const { generated_text } = await gpt2.textGeneration(
{inputs: 'The answer to the universe is'},
{retry_on_error: false},
);
```
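
For illustration only, here is a minimal sketch of what handling those errors yourself might look like: a manual retry loop around the call above. The helper `generateWithManualRetry`, the attempt count, the 10-second delay, and the assumption that `textGeneration` throws when the endpoint responds with a 500 are illustrative and not part of this change; the endpoint URL and token are placeholders.

```typescript
import { HfInference } from '@huggingface/inference';

// Placeholder token, as in the examples above.
const hf = new HfInference('hf_...');

// Illustrative helper (not part of the library): retry manually while the
// model behind the endpoint is still loading, instead of letting the client wait.
async function generateWithManualRetry(prompt: string, maxAttempts = 5): Promise<string> {
  const gpt2 = hf.endpoint('https://xyz.eu-west-1.aws.endpoints.huggingface.cloud/gpt2');

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const { generated_text } = await gpt2.textGeneration(
        { inputs: prompt },
        { retry_on_error: false },
      );
      return generated_text;
    } catch (error) {
      // Assumption: the call throws while the endpoint returns 500 during model loading.
      if (attempt === maxAttempts) throw error;
      await new Promise((resolve) => setTimeout(resolve, 10_000)); // wait before retrying
    }
  }
  throw new Error('unreachable');
}
```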

## Running tests

```console
