chore: Update README and cut 0.26.0 for publishing (#188)
Bring back some of the autogenerated README content and make sure our
manual sections are using the right syntax. Once we merge and
regenerate, 0.26.0 will be published.
awalker4 authored Oct 5, 2024
1 parent 6e1fa29 commit f6c6247
Showing 2 changed files with 184 additions and 104 deletions.
286 changes: 183 additions & 103 deletions README.md
@@ -11,20 +11,12 @@

<div align="center">

<a
href="https://www.phorm.ai/query?projectId=34efc517-2201-4376-af43-40c4b9da3dc5">
<img src="https://img.shields.io/badge/Phorm-Ask_AI-%23F2777A.svg?&logo=" />
</a>

</div>


<h2 align="center">
<p>Python SDK for the Unstructured API</p>
</h2>

NOTE: This README is for the `0.26.0-beta` version. The current published SDK, `0.25.5` can be found [here](https://github.com/Unstructured-IO/unstructured-python-client/blob/v0.25.5/README.md).

This is a Python client for the [Unstructured API](https://docs.unstructured.io/api-reference/api-services/saas-api-development-guide) and you can sign up for your API key on https://app.unstructured.io.

Please refer to the [Unstructured docs](https://docs.unstructured.io/api-reference/api-services/sdk-python) for a full guide to using the client.
@@ -73,94 +65,6 @@ poetry add unstructured-client
```
<!-- End SDK Installation [installation] -->

## SDK Example Usage

### Example

```python
import os

import unstructured_client
from unstructured_client.models import operations, shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

filename = "PATH_TO_FILE"
with open(filename, "rb") as f:
    data = f.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename,
        ),
        # --- Other partition parameters ---
        strategy=shared.Strategy.AUTO,
        languages=['eng'],
    ),
)

try:
    res = client.general.partition(request=req)
    print(res.elements[0])
except Exception as e:
    print(e)
```
Refer to the [API parameters page](https://docs.unstructured.io/api-reference/api-services/api-parameters) for all available parameters.

### Configuration

#### Splitting PDF by pages

See [page splitting](https://docs.unstructured.io/api-reference/api-services/sdk#page-splitting) for more details.

In order to speed up processing of large PDF files, the client splits up PDFs into smaller files, sends these to the API concurrently, and recombines the results. `split_pdf_page` can be set to `False` to disable this.

The number of workers used for splitting PDFs is set by the `split_pdf_concurrency_level` parameter, with a default of 5 and a maximum of 15 to keep resource usage and costs in check. The splitting process uses `asyncio` to manage concurrency.
The size of each batch of pages (ranging from 2 to 20) is determined internally from the concurrency level and the total number of pages in the document. Because the splitting process uses `asyncio`, the client can encounter event loop issues if it is nested in another async runner, such as a `gevent` spawned task. It is, however, safe to run in multiprocessing workers (e.g., using `multiprocessing.Pool` with the `fork` context).

Example:
```python
req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_concurrency_level=8
)
```

#### Sending specific page ranges

When `split_pdf_page=True` (the default), you can optionally specify a page range to send only a portion of your PDF for extraction. The parameter takes a list of two integers that specify the range, inclusive. A `ValueError` is raised if the page range is invalid.

Example:
```python
req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page_range=[10,15],
)
```

#### Splitting PDF by pages - strict mode

When `split_pdf_allow_failed=False` (the default), any error encountered while sending the parallel requests stops the process and raises an exception.
When `split_pdf_allow_failed=True`, the process continues even if some requests fail, and the results are combined at the end (the output from the failed pages is not included).

Example:
```python
req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_allow_failed=True,
)
```

<!-- Start Retries [retries] -->
## Retries
@@ -229,6 +133,59 @@ if res.elements is not None:
```
<!-- End Retries [retries] -->


<!-- Start Error Handling [errors] -->
## Error Handling

Handling errors in this SDK should largely match your expectations. All operations return a response object or raise an error. If Error objects are specified in your OpenAPI Spec, the SDK will raise the appropriate Error type.

| Error Object | Status Code | Content Type |
| -------------------------- | -------------------------- | -------------------------- |
| errors.HTTPValidationError | 422 | application/json |
| errors.ServerError | 5XX | application/json |
| errors.SDKError | 4xx-5xx | */* |

### Example

```python
from unstructured_client import UnstructuredClient
from unstructured_client.models import errors, shared

s = UnstructuredClient()

res = None
try:
    res = s.general.partition(request={
        "partition_parameters": {
            "files": {
                "content": open("example.file", "rb"),
                "file_name": "example.file",
            },
            "chunking_strategy": shared.ChunkingStrategy.BY_TITLE,
            "split_pdf_page_range": [
                1,
                10,
            ],
            "strategy": shared.Strategy.HI_RES,
        },
    })

    if res.elements is not None:
        # handle response
        pass

except errors.HTTPValidationError as e:
    # handle e.data: errors.HTTPValidationErrorData
    raise(e)
except errors.ServerError as e:
    # handle e.data: errors.ServerErrorData
    raise(e)
except errors.SDKError as e:
    # handle exception
    raise(e)
```
<!-- End Error Handling [errors] -->

<!-- Start Custom HTTP Client [http-client] -->
## Custom HTTP Client

@@ -310,13 +267,6 @@ s = UnstructuredClient(async_client=CustomClient(httpx.AsyncClient()))
```
<!-- End Custom HTTP Client [http-client] -->

<!-- No SDK Example Usage [usage] -->
<!-- No SDK Available Operations -->
<!-- No Pagination -->
<!-- No Error Handling -->
<!-- No Server Selection -->
<!-- No Authentication -->

<!-- Start IDE Support [idesupport] -->
## IDE Support

@@ -327,6 +277,131 @@ Generally, the SDK will work well with most IDEs out of the box. However, when u
- [PyCharm Pydantic Plugin](https://docs.pydantic.dev/latest/integrations/pycharm/)
<!-- End IDE Support [idesupport] -->


<!-- Start SDK Example Usage [usage] -->
## SDK Example Usage

### Example

```python
# Synchronous Example
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

s = UnstructuredClient()

res = s.general.partition(request={
    "partition_parameters": {
        "files": {
            "content": open("example.file", "rb"),
            "file_name": "example.file",
        },
        "chunking_strategy": shared.ChunkingStrategy.BY_TITLE,
        "split_pdf_page_range": [
            1,
            10,
        ],
        "strategy": shared.Strategy.HI_RES,
    },
})

if res.elements is not None:
    # handle response
    pass
```

</br>

The same SDK client can also be used to make asynchronous requests by importing asyncio.
```python
# Asynchronous Example
import asyncio
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

async def main():
    s = UnstructuredClient()
    res = await s.general.partition_async(request={
        "partition_parameters": {
            "files": {
                "content": open("example.file", "rb"),
                "file_name": "example.file",
            },
            "chunking_strategy": shared.ChunkingStrategy.BY_TITLE,
            "split_pdf_page_range": [
                1,
                10,
            ],
            "strategy": shared.Strategy.HI_RES,
        },
    })
    if res.elements is not None:
        # handle response
        pass

asyncio.run(main())
```
<!-- End SDK Example Usage [usage] -->

Refer to the [API parameters page](https://docs.unstructured.io/api-reference/api-services/api-parameters) for all available parameters.


## Configuration

### Splitting PDF by pages

See [page splitting](https://docs.unstructured.io/api-reference/api-services/sdk#page-splitting) for more details.

In order to speed up processing of large PDF files, the client splits up PDFs into smaller files, sends these to the API concurrently, and recombines the results. `split_pdf_page` can be set to `False` to disable this.

The number of workers used for splitting PDFs is set by the `split_pdf_concurrency_level` parameter, with a default of 5 and a maximum of 15 to keep resource usage and costs in check. The splitting process uses `asyncio` to manage concurrency.
The size of each batch of pages (ranging from 2 to 20) is determined internally from the concurrency level and the total number of pages in the document. Because the splitting process uses `asyncio`, the client can encounter event loop issues if it is nested in another async runner, such as a `gevent` spawned task. It is, however, safe to run in multiprocessing workers (e.g., using `multiprocessing.Pool` with the `fork` context).

Example:
```python
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=files,
        strategy="fast",
        languages=["eng"],
        split_pdf_concurrency_level=8
    )
)
```
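
The split, send concurrently, and recombine flow described above can be pictured with a small standard-library sketch. This is only an illustration of the general pattern, not the SDK's actual implementation: `send_batch` and `partition_in_batches` are hypothetical names, the fake request does no network I/O, and the batch size is passed in by hand rather than derived from the page count.

```python
import asyncio


async def send_batch(pages):
    # Hypothetical stand-in for one API request covering a batch of pages.
    await asyncio.sleep(0)
    return [f"element-from-page-{p}" for p in pages]


async def partition_in_batches(page_numbers, batch_size, concurrency_level):
    # Cap the number of in-flight requests, similar in spirit to
    # split_pdf_concurrency_level.
    semaphore = asyncio.Semaphore(concurrency_level)
    batches = [
        page_numbers[i:i + batch_size]
        for i in range(0, len(page_numbers), batch_size)
    ]

    async def limited(batch):
        async with semaphore:
            return await send_batch(batch)

    # gather() returns results in the order the batches were submitted,
    # so recombining is a simple flatten.
    results = await asyncio.gather(*(limited(b) for b in batches))
    return [element for batch_result in results for element in batch_result]


elements = asyncio.run(
    partition_in_batches(list(range(1, 11)), batch_size=4, concurrency_level=2)
)
```

Because the sketch calls `asyncio.run`, it would hit the same kind of nested event loop problem noted above if it were invoked from code that is already running inside another loop.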

### Sending specific page ranges

When `split_pdf_page=True` (the default), you can optionally specify a page range to send only a portion of your PDF for extraction. The parameter takes a list of two integers that specify the range, inclusive. A `ValueError` is raised if the page range is invalid.

Example:
```python
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=files,
        strategy="fast",
        languages=["eng"],
        split_pdf_page_range=[10,15],
    )
)
```
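
To make the inclusive range semantics concrete, here is a minimal sketch of how such a range could be validated and expanded. The helpers `validate_page_range` and `pages_to_send` are hypothetical and not part of the SDK, and the exact conditions under which the client raises `ValueError` are assumed here (1-based pages, start not after end, end within the document).

```python
def validate_page_range(page_range, total_pages):
    # Hypothetical helper: the specific checks are assumptions, not the SDK's rules.
    if len(page_range) != 2:
        raise ValueError("page range must contain exactly two integers")
    start, end = page_range
    if start < 1 or end > total_pages or start > end:
        raise ValueError(
            f"invalid page range {page_range} for a {total_pages}-page document"
        )
    return start, end


def pages_to_send(page_range, total_pages):
    start, end = validate_page_range(page_range, total_pages)
    # Both ends are included, so [10, 15] selects six pages.
    return list(range(start, end + 1))


selected = pages_to_send([10, 15], total_pages=20)
```

With these assumptions, `[10, 15]` on a 20-page document selects pages 10 through 15, while a reversed range such as `[15, 10]` is rejected.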

### Splitting PDF by pages - strict mode

When `split_pdf_allow_failed=False` (the default), any error encountered while sending the parallel requests stops the process and raises an exception.
When `split_pdf_allow_failed=True`, the process continues even if some requests fail, and the results are combined at the end (the output from the failed pages is not included).

Example:
```python
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=files,
        strategy="fast",
        languages=["eng"],
        split_pdf_allow_failed=True,
    )
)
```
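
The difference between the two modes can be sketched with plain `asyncio`. This is an illustration under stated assumptions rather than the client's real code: `send_batch` and `run_batches` are hypothetical, batch 2 is made to fail on purpose, and `asyncio.gather(..., return_exceptions=...)` stands in for whatever mechanism the SDK uses internally.

```python
import asyncio


async def send_batch(batch_id):
    # Hypothetical request; batch 2 fails to simulate a server error.
    await asyncio.sleep(0)
    if batch_id == 2:
        raise RuntimeError(f"batch {batch_id} failed")
    return [f"element-from-batch-{batch_id}"]


async def run_batches(batch_ids, allow_failed):
    # With return_exceptions=True, failures come back as values instead of
    # propagating; with False, the first failure is raised to the caller.
    results = await asyncio.gather(
        *(send_batch(b) for b in batch_ids),
        return_exceptions=allow_failed,
    )
    elements = []
    for result in results:
        if isinstance(result, Exception):
            continue  # output from the failed batch is left out
        elements.extend(result)
    return elements


lenient = asyncio.run(run_batches([1, 2, 3], allow_failed=True))

try:
    asyncio.run(run_batches([1, 2, 3], allow_failed=False))
    strict_error = None
except RuntimeError as exc:
    strict_error = exc
```

In the lenient run the combined output simply skips the failed batch, so callers who enable this mode may want to check for missing pages themselves.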

<!-- Start File uploads [file-upload] -->
## File uploads

@@ -380,6 +455,11 @@ s = UnstructuredClient(debug_logger=logging.getLogger("unstructured_client"))
```
<!-- End Debugging [debug] -->

<!-- No SDK Available Operations -->
<!-- No Pagination -->
<!-- No Server Selection -->
<!-- No Authentication -->

<!-- Placeholder for Future Speakeasy SDK Sections -->

### Maturity
2 changes: 1 addition & 1 deletion gen.yaml
@@ -10,7 +10,7 @@ generation:
   auth:
     oAuth2ClientCredentialsEnabled: false
 python:
-  version: 0.26.0-beta.4
+  version: 0.26.0
   additionalDependencies:
     dev:
       deepdiff: '>=6.0'