Object storage clarifications
hcourdent committed Sep 24, 2024
1 parent 52b5bcd commit b222517
Showing 33 changed files with 257 additions and 356 deletions.
6 changes: 3 additions & 3 deletions blog/2023-11-24-data-pipeline-orchestrator/index.mdx
@@ -22,9 +22,9 @@ An ETL is nothing else than a [DAG](https://en.wikipedia.org/wiki/Directed_acycl

Windmill enables building fast, powerful, reliable, and easy-to-manage data pipelines:

- The DX in Windmill allows you to quickly [assemble flows](/docs/flows/flow_editor) that can process data step by step in a visual and easy-to-manage way ;
- You can control [parallelism](/docs/flows/flow_branches#branch-all) between individual steps, and set [concurrency limits](/docs/flows/concurrency_limit) in case external resources need are fragile or rate limited ;
- [Windmill flows can be restarted from any step](../2023-11-24-restartable-flows/index.mdx), making the iteration process of building a pipeline (or debugging one) smooth and efficient ;
- The DX in Windmill allows you to quickly [assemble flows](/docs/flows/flow_editor) that can process data step by step in a visual and easy-to-manage way;
- You can control [parallelism](/docs/flows/flow_branches#branch-all) between individual steps, and set [concurrency limits](/docs/flows/concurrency_limit) in case external resources are fragile or rate limited;
- [Windmill flows can be restarted from any step](../2023-11-24-restartable-flows/index.mdx), making the iteration process of building a pipeline (or debugging one) smooth and efficient;
- Monitoring is made easy with [error and recovery handlers](/docs/core_concepts/error_handling).
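The DAG framing above is concrete: each step runs once all the steps it depends on have finished. A minimal sketch of that ordering idea in Python (illustrative only; this is not how Windmill's flow engine is implemented, but the execution order it guarantees is the same):

```python
from graphlib import TopologicalSorter

# A toy ETL flow expressed as a DAG: each step maps to the set of
# steps it depends on.
flow = {
    "transform": {"extract"},   # transform depends on extract
    "load": {"transform"},      # load depends on transform
    "extract": set(),           # extract has no dependencies
}

# static_order() yields a valid execution order for the DAG.
order = list(TopologicalSorter(flow).static_order())
assert order.index("extract") < order.index("transform") < order.index("load")
```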

<iframe
2 changes: 1 addition & 1 deletion changelog/2024-05-31-secondary-storage/index.md
@@ -10,5 +10,5 @@ features:
'Add additional storages from S3, Azure Blob, AWS OIDC or Azure Workload Identity.',
'From script, specify the secondary storage with an object with properties `s3` (path to the file) and `storage` (name of the secondary storage).'
]
docs: /docs/core_concepts/object_storage_in_windmill#secondary-s3-storage
docs: /docs/core_concepts/object_storage_in_windmill#secondary-storage
---
2 changes: 1 addition & 1 deletion changelog/2024-06-04-customer-portal/index.md
@@ -4,7 +4,7 @@ version: v1.342.0
title: Windmill Customer Portal
tags: ['Enterprise Edition']
image: ./portal.png
description: We have released our new Windmill Billing Portal https://portal.windmill.dev/. <br><br> You can access your Portal from your Instance Settings, in the "Core" tab. Or by visiting https://portal.windmill.dev/, entering your email and then accessing the link sent via email. Update contact information, billing details and subscription (seats & vCPUs) from the portal. From there, you can also enable/disable any time automatic renewal and automatic debit (therefore payment by invoice).<br><br>In the Usage section, you can find the Seats of vCPUs usage of your Prod instance, and check whether your use of Windmill corresponds to your subscription. There is a ‘Report an error’ button, please use it if reported usage is incorrect.<br><br>It's also an opportunity for us to explain our new way of managing license keys for self-hosted instances.<br><br>As you know, when you subscribe to Windmill, you receive a license key to enter in the instance settings. Now, this key automatically updates every day as long as the subscription is valid. A key is valid for 35 days and expires as soon as an updated key replaces it. This system relieves you from having to worry about your key expiring. Now everything is automatic as long as your subscription is valid. You can still contact us for exceptions.
description: We have released our new Windmill Billing Portal https://portal.windmill.dev/. <br><br> You can access your Portal from your Instance settings, in the "Core" tab, or by visiting https://portal.windmill.dev/, entering your email and then accessing the link sent via email. Update contact information, billing details and subscription (seats & vCPUs) from the portal. From there, you can also enable or disable automatic renewal and automatic debit (and therefore payment by invoice) at any time.<br><br>In the Usage section, you can find the Seats or vCPUs usage of your Prod instance, and check whether your use of Windmill corresponds to your subscription. There is a ‘Report an error’ button; please use it if the reported usage is incorrect.<br><br>It's also an opportunity for us to explain our new way of managing license keys for self-hosted instances.<br><br>As you know, when you subscribe to Windmill, you receive a license key to enter in the instance settings. Now, this key automatically updates every day as long as the subscription is valid. A key is valid for 35 days and expires as soon as an updated key replaces it. This system relieves you from having to worry about your key expiring: everything is automatic as long as your subscription is valid. You can still contact us for exceptions.
features:
[
'Windmill Billing Portal available at https://portal.windmill.dev/',
2 changes: 1 addition & 1 deletion docs/advanced/15_dependencies_in_python/index.mdx
@@ -251,6 +251,6 @@ windmill_worker:
- PIP_INDEX_CERT=/custom-certs/root-ca.crt
```

"Pip Index Url" and "Pip Extra Index Url" are filled through Windmill UI, in [Instance Settings](../18_instance_settings/index.mdx#registries) under [Enterprise Edition](/pricing).
"Pip Index Url" and "Pip Extra Index Url" are filled through Windmill UI, in [Instance settings](../18_instance_settings/index.mdx#registries) under [Enterprise Edition](/pricing).

![Private PyPi Repository](./private_pip.png 'Private PyPi Repository')
10 changes: 5 additions & 5 deletions docs/advanced/18_instance_settings/index.mdx
@@ -2,19 +2,19 @@ import DocCard from '@site/src/components/DocCard';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Instance Settings
# Instance settings

Instance settings are accessible to all [superadmins](../../core_concepts/16_roles_and_permissions/index.mdx#superadmin) of your Windmill instance. This is where you manage settings and features across all workspaces.

This is from the Instance Settings that you can see on which Windmill [version](https://github.com/windmill-labs/windmill/releases) your instance is running.
The Instance settings page is also where you can see which Windmill [version](https://github.com/windmill-labs/windmill/releases) your instance is running.

![Instance version](./instance_version.png "Instance version")

## Admins workspace

The Admins workspace is for [superadmins](../../core_concepts/16_roles_and_permissions/index.mdx#superadmin) only and contains scripts whose purpose is to manage your Windmill instance, such as [keeping resource types up to date](../../core_concepts/3_resources_and_types/index.mdx#sync-resource-types-with-windmillhub) or the New User Setup App.

You can access it from the list of workspaces or from Instance Settings.
You can access it from the list of workspaces or from Instance settings.

![Admins workspace](./admins_workspace.png "Admins workspace")

@@ -100,7 +100,7 @@ This setting is only available on [Enterprise Edition](/pricing).

### Instance object storage

[Connect your instance](../../core_concepts/38_object_storage_in_windmill/index.mdx#instance-object-storage) to a S3 bucket to [store large logs](../../core_concepts/20_jobs/index.mdx#large-logs-management-with-s3) and [global cache for Python and Go](../../misc/13_s3_cache/index.mdx).
[Connect your instance](../../core_concepts/38_object_storage_in_windmill/index.mdx#instance-object-storage) to an S3 bucket to [store large logs](../../core_concepts/20_jobs/index.mdx#large-job-logs-management) and [global cache for Python and Go](../../misc/13_s3_cache/index.mdx).

This feature has no overlap with the [Workspace object storage](../../core_concepts/38_object_storage_in_windmill/index.mdx#workspace-object-storage).

@@ -132,7 +132,7 @@ This setting is only available on [Enterprise Edition](/pricing).

## SSO/OAuth

Windmill supports [SSO/OAuth](../../misc/2_setup_oauth/index.mdx) for user authentication. You can enable it from the Instance Settings.
Windmill supports [SSO/OAuth](../../misc/2_setup_oauth/index.mdx) for user authentication. You can enable it from the Instance settings.

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
6 changes: 3 additions & 3 deletions docs/advanced/1_self_host/index.mdx
@@ -1,10 +1,10 @@
---
title: Self Host
title: Self-host
---

import DocCard from '@site/src/components/DocCard';

# Self Host Windmill
# Self-host Windmill

Self-host Windmill on your own infrastructure.

@@ -193,7 +193,7 @@ More details at:

### Configuring Domain and Reverse Proxy

To deploy Windmill to the `windmill.example.com` domain, make sure to set "Base Url" correctly in the [Instance Settings](../18_instance_settings/index.mdx#base-url).
To deploy Windmill to the `windmill.example.com` domain, make sure to set "Base Url" correctly in the [Instance settings](../18_instance_settings/index.mdx#base-url).

You can use any reverse proxy as long as it behaves mostly like the default Caddy configuration provided below:

2 changes: 1 addition & 1 deletion docs/compared_to/prefect.mdx
@@ -41,7 +41,7 @@ where users share useful and proven scripts, flows, and applications.
href="https://github.com/windmill-labs/windmill"
/>
<DocCard
title="Self Host Windmill"
title="Self-host Windmill"
description="Self-host Windmill in 2 minutes."
href="/docs/advanced/self_host/"
/>
2 changes: 1 addition & 1 deletion docs/compared_to/retool.mdx
@@ -31,7 +31,7 @@ Unlike Retool, where you are limited to pre-written templates by the Retool team
href="https://github.com/windmill-labs/windmill"
/>
<DocCard
title="Self Host Windmill"
title="Self-host Windmill"
description="Self-host Windmill in 2 minutes."
href="/docs/advanced/self_host/"
/>
2 changes: 1 addition & 1 deletion docs/core_concepts/15_authentification/index.mdx
@@ -61,7 +61,7 @@ By default, users are not invited to any workspace, unless auto-invite has been

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Self Host Windmill"
title="Self-host Windmill"
description="Self-host Windmill in 2 minutes."
href="/docs/advanced/self_host#authentication-and-user-management"
/>
2 changes: 1 addition & 1 deletion docs/core_concepts/18_files_binary_data/index.mdx
@@ -44,7 +44,7 @@ When a script outputs a S3 file, it can be downloaded or previewed directly in W

Windmill provides helpers in its SDKs to consume and produce S3 files seamlessly.

All details on Workspace object storage, and how to [read](../38_object_storage_in_windmill/index.mdx#read-a-file-from-s3-within-a-script) and [write](../38_object_storage_in_windmill/index.mdx#create-a-file-in-s3-within-a-script) files to S3 as well as [Windmill embedded integration with Polars and DuckDB](../27_data_pipelines/index.mdx#windmill-embedded-integration-with-polars-and-duckdb-for-data-pipelines) for data pipelines, can be found in the [Object storage in Windmill](../38_object_storage_in_windmill/index.mdx) page.
All details on Workspace object storage, including how to [read](../38_object_storage_in_windmill/index.mdx#read-a-file-from-s3-or-object-storage-within-a-script) and [write](../38_object_storage_in_windmill/index.mdx#create-a-file-from-s3-or-object-storage-within-a-script) files to S3 and [Windmill's embedded integration with Polars and DuckDB](../27_data_pipelines/index.mdx#windmill-embedded-integration-with-polars-and-duckdb-for-data-pipelines) for data pipelines, can be found on the [Object storage in Windmill](../38_object_storage_in_windmill/index.mdx) page.

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
2 changes: 1 addition & 1 deletion docs/core_concepts/20_jobs/index.mdx
@@ -126,7 +126,7 @@ You can set a custom retention period for the jobs runs details. The retention p
/>
</div>

## Large logs management with S3
## Large job logs management

To optimize log storage and performance, Windmill leverages S3 for log management. This approach minimizes database load by treating the database as a temporary buffer for up to 5000 characters of logs per job.
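As a rough sketch of that buffering idea (illustrative only; `split_job_log` is a hypothetical helper, not Windmill's actual backend code), the 5000-character threshold from the paragraph above could be modeled as:

```python
# Illustrative sketch: the database buffers at most 5000 characters of
# logs per job; anything beyond that overflows to object storage (S3).

DB_LOG_LIMIT = 5000  # characters kept in the database per job

def split_job_log(log: str, db_limit: int = DB_LOG_LIMIT) -> tuple[str, str]:
    """Return (db_part, s3_part): the head stays in the database,
    the overflow tail is shipped to S3."""
    return log[:db_limit], log[db_limit:]

# A short log fits entirely in the database buffer
assert split_job_log("hello") == ("hello", "")

# A long log keeps only its first 5000 characters in the database
head, tail = split_job_log("x" * 7000)
assert len(head) == 5000 and len(tail) == 2000
```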

202 changes: 8 additions & 194 deletions docs/core_concepts/27_data_pipelines/index.mdx
@@ -51,14 +51,6 @@ because those results are serialized to Windmill database and kept as long as th

In most cases, S3 is a well-suited storage and Windmill now provides a basic yet very useful [integration with external S3 storage](../38_object_storage_in_windmill/index.mdx#workspace-object-storage) at the workspace level.

<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Workspace object storage"
description="Connect your Windmill workspace to your S3 bucket or your Azure Blob storage to enable users to read and write from S3 without having to have access to the credentials."
href="/docs/core_concepts/object_storage_in_windmill#workspace-object-storage"
/>
</div>

The first step is to define an [S3 resource](/docs/integrations/s3) in Windmill and assign it to be the Workspace S3 bucket in the workspace settings.

![S3 workspace settings](../../../blog/2023-11-24-data-pipeline-orchestrator/workspace_s3_settings.png 'S3 workspace settings')
@@ -85,193 +77,15 @@ Clicking on one of those buttons, a drawer will open displaying the content of t
From there you always have the possibility to use the S3 client library of your choice to read and write to S3.
That being said, Polars and DuckDB can read and write directly from and to files stored in S3. Windmill now ships with helpers to make the entire data processing mechanics cohesive.

### Read a file from S3 within a script

<Tabs className="unique-tabs">

<TabItem value="bun" label="TypeScript (Bun)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```ts
import * as wmill from 'windmill-client';
import { S3Object } from 'windmill-client';

export async function main(input_file: S3Object) {
  // Load the entire file_content as a Uint8Array
  const file_content = await wmill.loadS3File(input_file);

  const decoder = new TextDecoder();
  const file_content_str = decoder.decode(file_content);
  console.log(file_content_str);

  // Or load the file lazily as a Blob
  let fileContentBlob = await wmill.loadS3FileStream(input_file);
  console.log(await fileContentBlob.text());
}
```

</TabItem>

<TabItem value="deno" label="TypeScript (Deno)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```ts
import * as wmill from 'npm:[email protected]';
import { S3Object } from 'npm:[email protected]';

export async function main(input_file: S3Object) {
  // Load the entire file_content as a Uint8Array
  const file_content = await wmill.loadS3File(input_file);

  const decoder = new TextDecoder();
  const file_content_str = decoder.decode(file_content);
  console.log(file_content_str);

  // Or load the file lazily as a Blob
  let fileContentBlob = await wmill.loadS3FileStream(input_file);
  console.log(await fileContentBlob.text());
}
```

</TabItem>

<TabItem value="python" label="Python" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```python
import wmill
from wmill import S3Object

def main(input_file: S3Object):
    # Load the entire file_content as a bytes array
    file_content = wmill.load_s3_file(input_file)
    print(file_content.decode('utf-8'))

    # Or load the file lazily as a Buffered reader:
    with wmill.load_s3_file_reader(input_file) as file_reader:
        print(file_reader.read())
```

</TabItem>
</Tabs>

![Read S3 file](../18_files_binary_data/s3_file_input.png)

:::info
Certain file types, typically parquet files, can be [directly rendered by Windmill](../19_rich_display_rendering/index.mdx).
:::

### Create a file in S3 within a script

<Tabs className="unique-tabs">

<TabItem value="bun" label="TypeScript (Bun)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```ts
import * as wmill from 'windmill-client';
import { S3Object } from 'windmill-client';

export async function main(s3_file_path: string) {
  const s3_file_output: S3Object = {
    s3: s3_file_path
  };

  const file_content = 'Hello Windmill!';
  // file_content can be either a string or ReadableStream<Uint8Array>
  await wmill.writeS3File(s3_file_output, file_content);
  return s3_file_output;
}
```

</TabItem>

<TabItem value="deno" label="TypeScript (Deno)" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```ts
import * as wmill from 'npm:[email protected]';
import { S3Object } from 'npm:[email protected]';

export async function main(s3_file_path: string) {
  const s3_file_output: S3Object = {
    s3: s3_file_path
  };

  const file_content = 'Hello Windmill!';
  // file_content can be either a string or ReadableStream<Uint8Array>
  await wmill.writeS3File(s3_file_output, file_content);
  return s3_file_output;
}
```

</TabItem>

<TabItem value="python" label="Python" attributes={{className: "text-xs p-4 !mt-0 !ml-0"}}>

```python
import wmill
from wmill import S3Object

def main(s3_file_path: str):
    s3_file_output = S3Object(s3=s3_file_path)

    file_content = b"Hello Windmill!"
    # file_content can be either bytes or a BufferedReader
    wmill.write_s3_file(s3_file_output, file_content)
    return s3_file_output
```

</TabItem>
</Tabs>

![Write to S3 file](../18_files_binary_data/s3_file_output.png)

Even though the whole file is downloadable, the backend only sends the rows that the frontend needs for the preview. This means that you can manipulate objects of arbitrary size, and the backend will only return what is necessary.
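The preview mechanics can be sketched as a simple windowing function (illustrative only; `preview_window` is a hypothetical name, not part of the Windmill SDK):

```python
from itertools import islice

def preview_window(rows, offset: int, limit: int) -> list:
    """Materialize only the requested slice of rows; the rest of the
    (possibly huge) source is never loaded."""
    return list(islice(iter(rows), offset, offset + limit))

# Only rows 100..104 of a billion-row source are ever produced
assert preview_window(range(10**9), 100, 5) == [100, 101, 102, 103, 104]
```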

You can even display several S3 files through an array of S3 objects:

```ts
export async function main() {
  return [{ s3: "path/to/file_1" }, { s3: "path/to/file_2" }, { s3: "path/to/file_3" }];
}
```

![S3 list of files download](../19_rich_display_rendering/s3_array.png "S3 list of files download")

### Secondary S3 Storage

Read from and write to a storage that is not your main storage by specifying it in the S3 object, with the `storage` property set to the name of the secondary storage.

From the workspace settings, in the "S3 Storage" tab, click on "Add secondary storage", give it a name, and pick a resource of type "S3", "Azure Blob", "AWS OIDC" or "Azure Workload Identity". You can save as many additional storages as you want, as long as you give them different names.

Then, from a script, you can specify the secondary storage with an object with properties `s3` (path to the file) and `storage` (name of the secondary storage).

```ts
const file = {s3: 'folder/hello.txt', storage: 'storage_1'}
```

Here is an example that [creates](#create-a-file-in-s3-within-a-script) and then [reads](#read-a-file-from-s3-within-a-script) a file from S3 within a script, using a secondary storage named "storage_1":

```ts
import * as wmill from 'windmill-client';

export async function main() {
  await wmill.writeS3File({ s3: "data.csv", storage: "storage_1" }, "fooo\n1");

  const res = await wmill.loadS3File({ s3: "data.csv", storage: "storage_1" });

  const text = new TextDecoder().decode(res);

  console.log(text);
  return { s3: "data.csv", storage: "storage_1" };
}
```

<iframe
style={{ aspectRatio: '16/9' }}
src="https://www.youtube.com/embed/-nJs6E_1E8Y"
title="Perpetual Scripts"
frameBorder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
allowFullScreen
className="border-2 rounded-lg object-cover w-full dark:border-gray-800"
></iframe>
<div className="grid grid-cols-2 gap-6 mb-4">
<DocCard
title="Workspace object storage"
description="Connect your Windmill workspace to your S3 bucket or your Azure Blob storage to enable users to read and write from S3 without having to have access to the credentials."
href="/docs/core_concepts/object_storage_in_windmill#workspace-object-storage"
/>
</div>

## Windmill integration with Polars and DuckDB for data pipelines

