diff --git a/README.md b/README.md
index 3d867090..f798f76e 100755
--- a/README.md
+++ b/README.md
@@ -11,20 +11,12 @@
Python SDK for the Unstructured API
-NOTE: This README is for the `0.26.0-beta` version. The current published SDK, `0.25.5` can be found [here](https://github.com/Unstructured-IO/unstructured-python-client/blob/v0.25.5/README.md). - This is a Python client for the [Unstructured API](https://docs.unstructured.io/api-reference/api-services/saas-api-development-guide) and you can sign up for your API key on https://app.unstructured.io. Please refer to the [Unstructured docs](https://docs.unstructured.io/api-reference/api-services/sdk-python) for a full guide to using the client. @@ -73,94 +65,6 @@ poetry add unstructured-client ``` -## SDK Example Usage - -### Example - -```python -import os - -import unstructured_client -from unstructured_client.models import operations, shared - -client = unstructured_client.UnstructuredClient( - api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"), - server_url=os.getenv("UNSTRUCTURED_API_URL"), -) - -filename = "PATH_TO_FILE" -with open(filename, "rb") as f: - data = f.read() - -req = operations.PartitionRequest( - partition_parameters=shared.PartitionParameters( - files=shared.Files( - content=data, - file_name=filename, - ), - # --- Other partition parameters --- - strategy=shared.Strategy.AUTO, - languages=['eng'], - ), -) - -try: - res = client.general.partition(request=req) - print(res.elements[0]) -except Exception as e: - print(e) -``` -Refer to the [API parameters page](https://docs.unstructured.io/api-reference/api-services/api-parameters) for all available parameters. - -### Configuration - -#### Splitting PDF by pages - -See [page splitting](https://docs.unstructured.io/api-reference/api-services/sdk#page-splitting) for more details. - -In order to speed up processing of large PDF files, the client splits up PDFs into smaller files, sends these to the API concurrently, and recombines the results. `split_pdf_page` can be set to `False` to disable this. 
- -The amount of workers utilized for splitting PDFs is dictated by the `split_pdf_concurrency_level` parameter, with a default of 5 and a maximum of 15 to keep resource usage and costs in check. The splitting process leverages `asyncio` to manage concurrency effectively. -The size of each batch of pages (ranging from 2 to 20) is internally determined based on the concurrency level and the total number of pages in the document. Because the splitting process uses `asyncio` the client can encouter event loop issues if it is nested in another async runner, like running in a `gevent` spawned task. Instead, this is safe to run in multiprocessing workers (e.g., using `multiprocessing.Pool` with `fork` context). - -Example: -```python -req = shared.PartitionParameters( - files=files, - strategy="fast", - languages=["eng"], - split_pdf_concurrency_level=8 -) -``` - -#### Sending specific page ranges - -When `split_pdf_page=True` (the default), you can optionally specify a page range to send only a portion of your PDF to be extracted. The parameter takes a list of two integers to specify the range, inclusive. A ValueError is thrown if the page range is invalid. - -Example: -```python -req = shared.PartitionParameters( - files=files, - strategy="fast", - languages=["eng"], - split_pdf_page_range=[10,15], -) -``` - -#### Splitting PDF by pages - strict mode - -When `split_pdf_allow_failed=False` (the default), any errors encountered during sending parallel request will break the process and raise an exception. -When `split_pdf_allow_failed=True`, the process will continue even if some requests fail, and the results will be combined at the end (the output from the errored pages will not be included). 
- -Example: -```python -req = shared.PartitionParameters( - files=files, - strategy="fast", - languages=["eng"], - split_pdf_allow_failed=True, -) -``` ## Retries @@ -229,6 +133,59 @@ if res.elements is not None: ``` + + +## Error Handling + +Handling errors in this SDK should largely match your expectations. All operations return a response object or raise an error. If Error objects are specified in your OpenAPI Spec, the SDK will raise the appropriate Error type. + +| Error Object | Status Code | Content Type | +| -------------------------- | -------------------------- | -------------------------- | +| errors.HTTPValidationError | 422 | application/json | +| errors.ServerError | 5XX | application/json | +| errors.SDKError | 4xx-5xx | */* | + +### Example + +```python +from unstructured_client import UnstructuredClient +from unstructured_client.models import errors, shared + +s = UnstructuredClient() + +res = None +try: + res = s.general.partition(request={ + "partition_parameters": { + "files": { + "content": open("example.file", "rb"), + "file_name": "example.file", + }, + "chunking_strategy": shared.ChunkingStrategy.BY_TITLE, + "split_pdf_page_range": [ + 1, + 10, + ], + "strategy": shared.Strategy.HI_RES, + }, + }) + + if res.elements is not None: + # handle response + pass + +except errors.HTTPValidationError as e: + # handle e.data: errors.HTTPValidationErrorData + raise(e) +except errors.ServerError as e: + # handle e.data: errors.ServerErrorData + raise(e) +except errors.SDKError as e: + # handle exception + raise(e) +``` + + ## Custom HTTP Client @@ -310,13 +267,6 @@ s = UnstructuredClient(async_client=CustomClient(httpx.AsyncClient())) ``` - - - - - - - ## IDE Support @@ -327,6 +277,131 @@ Generally, the SDK will work well with most IDEs out of the box. 
However, when u - [PyCharm Pydantic Plugin](https://docs.pydantic.dev/latest/integrations/pycharm/) + + +## SDK Example Usage + +### Example + +```python +# Synchronous Example +from unstructured_client import UnstructuredClient +from unstructured_client.models import shared + +s = UnstructuredClient() + +res = s.general.partition(request={ + "partition_parameters": { + "files": { + "content": open("example.file", "rb"), + "file_name": "example.file", + }, + "chunking_strategy": shared.ChunkingStrategy.BY_TITLE, + "split_pdf_page_range": [ + 1, + 10, + ], + "strategy": shared.Strategy.HI_RES, + }, +}) + +if res.elements is not None: + # handle response + pass +``` + + + +The same SDK client can also be used to make asynchronous requests by importing asyncio. +```python +# Asynchronous Example +import asyncio +from unstructured_client import UnstructuredClient +from unstructured_client.models import shared + +async def main(): + s = UnstructuredClient() + res = await s.general.partition_async(request={ + "partition_parameters": { + "files": { + "content": open("example.file", "rb"), + "file_name": "example.file", + }, + "chunking_strategy": shared.ChunkingStrategy.BY_TITLE, + "split_pdf_page_range": [ + 1, + 10, + ], + "strategy": shared.Strategy.HI_RES, + }, + }) + if res.elements is not None: + # handle response + pass + +asyncio.run(main()) +``` + + +Refer to the [API parameters page](https://docs.unstructured.io/api-reference/api-services/api-parameters) for all available parameters. + + +## Configuration + +### Splitting PDF by pages + +See [page splitting](https://docs.unstructured.io/api-reference/api-services/sdk#page-splitting) for more details. + +In order to speed up processing of large PDF files, the client splits up PDFs into smaller files, sends these to the API concurrently, and recombines the results. `split_pdf_page` can be set to `False` to disable this. 
+ +The number of workers used for splitting PDFs is set by the `split_pdf_concurrency_level` parameter, with a default of 5 and a maximum of 15 to keep resource usage and costs in check. The splitting process uses `asyncio` to manage concurrency. +The size of each batch of pages (ranging from 2 to 20) is determined internally from the concurrency level and the total number of pages in the document. Because the splitting process uses `asyncio`, the client can encounter event loop issues if it is nested in another async runner, such as a `gevent` spawned task. It is, however, safe to run in multiprocessing workers (e.g., using `multiprocessing.Pool` with `fork` context). + +Example: +```python +req = operations.PartitionRequest( + partition_parameters=shared.PartitionParameters( + files=files, + strategy="fast", + languages=["eng"], + split_pdf_concurrency_level=8 + ) +) +``` + +### Sending specific page ranges + +When `split_pdf_page=True` (the default), you can optionally specify a page range to send only a portion of your PDF to be extracted. The parameter takes a list of two integers to specify the range, inclusive. A `ValueError` is raised if the page range is invalid. + +Example: +```python +req = operations.PartitionRequest( + partition_parameters=shared.PartitionParameters( + files=files, + strategy="fast", + languages=["eng"], + split_pdf_page_range=[10,15], + ) +) +``` + +### Splitting PDF by pages - strict mode + +When `split_pdf_allow_failed=False` (the default), any error encountered while sending the parallel requests will stop the process and raise an exception. +When `split_pdf_allow_failed=True`, the process will continue even if some requests fail, and the results will be combined at the end (the output from the errored pages will not be included). 
+ +Example: +```python +req = operations.PartitionRequest( + partition_parameters=shared.PartitionParameters( + files=files, + strategy="fast", + languages=["eng"], + split_pdf_allow_failed=True, + ) +) +``` + ## File uploads @@ -380,6 +455,11 @@ s = UnstructuredClient(debug_logger=logging.getLogger("unstructured_client")) ``` + + + + + ### Maturity diff --git a/gen.yaml b/gen.yaml index 43dc0c06..3a369a6e 100644 --- a/gen.yaml +++ b/gen.yaml @@ -10,7 +10,7 @@ generation: auth: oAuth2ClientCredentialsEnabled: false python: - version: 0.26.0-beta.4 + version: 0.26.0 additionalDependencies: dev: deepdiff: '>=6.0'
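
The Configuration text moved in this diff states two numeric rules: `split_pdf_concurrency_level` defaults to 5 with a maximum of 15, and `split_pdf_page_range` is an inclusive two-integer range that raises a `ValueError` when invalid. The sketch below illustrates those rules only; it is not the SDK's implementation. The helper names `clamp_concurrency` and `validate_page_range` are hypothetical, and the 1-based page numbering, the lower bound of 1 worker, and the choice to clamp rather than reject an over-limit concurrency level are all assumptions.

```python
from typing import Optional, Sequence, Tuple

DEFAULT_CONCURRENCY = 5  # default stated in the README text
MAX_CONCURRENCY = 15     # maximum stated in the README text


def clamp_concurrency(level: Optional[int]) -> int:
    """Hypothetical helper: use the default when unset, cap at the maximum.

    Assumption: out-of-range values are clamped rather than rejected,
    and at least one worker is always used.
    """
    if level is None:
        return DEFAULT_CONCURRENCY
    return max(1, min(level, MAX_CONCURRENCY))


def validate_page_range(page_range: Sequence[int], total_pages: int) -> Tuple[int, int]:
    """Hypothetical helper: check an inclusive [start, end] page range.

    Assumption: pages are numbered from 1.
    """
    if len(page_range) != 2:
        raise ValueError("page range must contain exactly two integers")
    start, end = page_range
    if not (1 <= start <= end <= total_pages):
        raise ValueError(
            f"page range {list(page_range)} is invalid for a {total_pages}-page document"
        )
    return start, end
```

With these helpers, `clamp_concurrency(8)` keeps the requested level, while `validate_page_range([15, 10], 20)` raises a `ValueError` because the start exceeds the end.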