Merge pull request #5797 from EnterpriseDB/release-2024-06-24a
Release 2024-06-24a
djw-m authored Jun 24, 2024
2 parents 4ca8413 + 3fe81d7 commit ae8db43
Showing 151 changed files with 1,673 additions and 585 deletions.
5 changes: 3 additions & 2 deletions advocacy_docs/edb-postgres-ai/ai-ml/index.mdx
Expand Up @@ -3,15 +3,16 @@ title: EDB Postgres AI - AI/ML
navTitle: AI/ML
indexCards: simple
iconName: BrainCircuit
description: How to make use of EDB Postgres AI for AI/ML workloads and using the pgai extension.
navigation:
- overview
- install-tech-preview
- using-tech-preview
---

EDB Postgres® AI Database is designed to solve all AI data management needs, including storing, searching, and retrieving of AI data. This uplevels Postgres to a database that manages and serves all types of data modalities directly and combines it with its battle-proof strengths as an established Enterprise system of record that manages high-value business data.
EDB Postgres® AI Database is designed to solve all AI data management needs, including storing, searching, and retrieving of AI data. This up-levels Postgres to a database that manages and serves all types of data modalities directly and combines it with its battle-proof strengths as an established Enterprise system of record that manages high-value business data.

In this tech preview, you will be able to use the pgai extension to build a simple retrieval augmented generation (RAG) application in Postgres.
In this tech preview, you can use the pgai extension to build a simple retrieval augmented generation (RAG) application in Postgres.

An [overview](overview) of the pgai extension gives you a high-level understanding of the major functionality available to date.

Expand Up @@ -4,11 +4,35 @@ navTitle: Working with AI data in S3
description: How to work with AI data stored in S3-compatible object storage using the pgai extension.
---

We recommend you to prepare your own S3 compatible object storage bucket with some test data and try the steps in this section with that. But it is possible to simply use the example S3 bucket data as is in the examples here even with your custom access key and secret key credentials because these have been configured for public access.
The following examples demonstrate how to use the pgai functions with S3-compatible object storage. You can run them as is, because they use a publicly accessible example S3 bucket, or you can prepare your own S3-compatible object storage bucket with some test data and try the steps in this section with that data.

In addition we use image data and an according image encoder LLM in this example instead of text data. But you could also use plain text data on object storage similar to the examples in the previous section.
These examples also use image data and an appropriate image encoder LLM instead of text data. You could, though, use plain text data on object storage similar to the examples in [Working with AI data in Postgres](working-with-ai-data-in-postgres).

First let's create a retriever for images stored on s3-compatible object storage as the source. We specify torsten as the bucket name and an endpoint URL where the bucket is created. We specify an empty string as prefix because we want all the objects in that bucket. We use the [`clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32) open encoder model for image data from HuggingFace. We provide a name for the retriever so that we can identify and reference it subsequent operations:
### Creating a retriever

Start by creating a retriever for images stored on S3-compatible object storage as the source, using the `pgai.create_s3_retriever` function.

```
pgai.create_s3_retriever(
retriever_name text,
schema_name text,
model_name text,
data_type text,
bucket_name text,
prefix text,
endpoint_url text
)
```

* The `retriever_name` is used to identify and reference the retriever; set it to `image_embeddings` for this example.
* The `schema_name` is the schema where the source table is located.
* The `model_name` is the name of the embeddings encoder model for similarity data; set it to [`clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32) to use the open encoder model for image data from HuggingFace.
* The `data_type` is the type of data in the source table, which could be either `img` or `text`; set it to `img`.
* The `bucket_name` is the name of the S3 bucket where the data is stored; set this to `torsten`.
* The `prefix` is the prefix of the objects in the bucket; set this to an empty string because you want all the objects in that bucket.
* The `endpoint_url` is the URL of the S3 endpoint; set that to `https://s3.us-south.cloud-object-storage.appdomain.cloud` to access the public example bucket.

This gives the following SQL command:

```sql
SELECT pgai.create_s3_retriever(
Expand All @@ -27,8 +51,9 @@ __OUTPUT__
(1 row)
```
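
To illustrate why an empty `prefix` selects every object, here is a minimal Python sketch of prefix matching on object keys. It only models the general S3 convention that a prefix is matched against the start of each key. It isn't pgai code, and the object keys and the `select_objects` helper are hypothetical.

```python
def select_objects(keys, prefix=""):
    """Return the object keys that start with the given prefix."""
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical object keys, for illustration only.
keys = ["cats/1.jpg", "cats/2.jpg", "dogs/1.jpg", "readme.txt"]

# Every string starts with the empty string, so an empty prefix keeps all keys.
all_objects = select_objects(keys, prefix="")

# A non-empty prefix narrows the selection to one "folder".
cat_objects = select_objects(keys, prefix="cats/")

print(all_objects)
print(cat_objects)
```

If you only wanted part of a bucket, a non-empty prefix such as the hypothetical `cats/` above would restrict the retriever's source to the matching objects.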

### Refreshing the retriever

Next, run the refresh_retriever function.
Next, run the `pgai.refresh_retriever` function.

```sql
SELECT pgai.refresh_retriever('image_embeddings');
Expand All @@ -38,8 +63,13 @@ __OUTPUT__

(1 row)
```

Finally, run the retrieve_via_s3 function with the required parameters to retrieve the top K most relevant (most similar) AI data items. Be aware that the object type is currently limited to image and text files.

### Retrieving data

Finally, run the `pgai.retrieve_via_s3` function with the required parameters to retrieve the top K most relevant (most similar) AI data items. Be aware that the object type is currently limited to image and text files.

```sql
SELECT data from pgai.retrieve_via_s3(
Expand Up @@ -4,11 +4,11 @@ navTitle: Working with AI data in Postgres
description: How to work with AI data stored in Postgres tables using the pgai extension.
---

We will first look at working with AI data stored in columns in the Postgres table.
The examples on this page are about working with AI data stored in columns in the Postgres table.

To see how to use AI data stored in S3-compatible object storage, skip to the next section.
To see how to use AI data stored in S3-compatible object storage, skip to [working with AI data in S3](working-with-ai-data-in-S3).

First let's create a Postgres table for some test AI data:
Begin by creating a Postgres table for some test AI data:

```sql
CREATE TABLE products (
Expand All @@ -21,8 +21,33 @@ __OUTPUT__
CREATE TABLE
```

## Working with auto embedding

Now let's create a retriever with the just created products table as the source. We specify product_id as the unique key column to and we define the product_name and description columns to use for the similarity search by the retriever. We use the `all-MiniLM-L6-v2` open encoder model from HuggingFace. We set `auto_embedding` to True so that any future insert, update or delete to the source table will automatically generate, update or delete also the corresponding embedding. We provide a name for the retriever so that we can identify and reference it subsequent operations:
Next, create a retriever with the newly created products table as the source, using the `pgai.create_pg_retriever` function, which has this syntax:

```sql
pgai.create_pg_retriever(
retriever_name text,
schema_name text,
primary_key text,
model_name text,
data_type text,
source_table text,
columns text[],
auto_embedding boolean
)
```

* The `retriever_name` is used to identify and reference the retriever; set it to `product_embeddings_auto` for this example.
* The `schema_name` is the schema where the source table is located; set this to `public`.
* The `primary_key` is the primary key column of the source table.
* The `model_name` is the name of the embeddings encoder model for similarity data; set it to `all-MiniLM-L6-v2` to use the open encoder model for text data from HuggingFace.
* The `data_type` is the type of data in the source table, which could be either `img` or `text`. Set it to `text`.
* The `source_table` is the name of the source table. The source table created previously is `products`, so set it to that.
* The `columns` is an array of columns to use for the similarity search by the retriever. Set this to `ARRAY['product_name', 'description']` to use the product_name and description columns.
* The `auto_embedding` is a boolean value that sets a trigger for auto embeddings. Set it to TRUE so that any future insert, update, or delete to the source table automatically generates, updates, or deletes the corresponding embedding.

This gives the following SQL command:

```sql
SELECT pgai.create_pg_retriever(
Expand All @@ -42,9 +67,8 @@ __OUTPUT__
(1 row)
```



Now let's insert some AI data records into the products table. Since we have set auto_embedding to True, the retriever will automatically generate all embeddings in real-time for each inserted record:
You have now created a retriever for the products table. The next step is to insert some AI data records into the products table.
Since you set `auto_embedding` to true, the retriever automatically generates all embeddings in real time for each inserted record:

```sql
INSERT INTO products (product_name, description) VALUES
Expand All @@ -61,7 +85,21 @@ __OUTPUT__
INSERT 0 9
```

Now we can directly use the retriever (specifying the retriever name) for a similarity retrieval of the top K most relevant (most similar) AI data items:
Now you can use the retriever, by specifying the retriever name, to perform a similarity retrieval of the top K most relevant, in this case most similar, AI data items. You can do this by running the `pgai.retrieve` function with the required parameters:

```sql
pgai.retrieve(
query text,
top_k integer,
retriever_name text
)
```

* The `query` is the text to use to retrieve the top similar data. Set it to `I like it`.
* The `top_k` is the number of top similar data items to retrieve. Set this to 5.
* The `retriever_name` is the name of the retriever. The retriever's name is `product_embeddings_auto`.

This gives the following SQL command:

```sql
SELECT data FROM pgai.retrieve(
Expand All @@ -80,7 +118,9 @@ __OUTPUT__
(5 rows)
```
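
As a rough mental model of what a top K similarity retrieval does, the following Python sketch ranks items by cosine similarity to a query vector and keeps the first K. This is a conceptual illustration under stated assumptions, not pgai's implementation: the two-dimensional vectors are made up, the helper names are hypothetical, and the similarity measure pgai actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, k):
    """items is a list of (data, vector) pairs; return the k most similar data values."""
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,  # most similar first
    )
    return [data for data, _ in ranked[:k]]


# Made-up 2-D "embeddings"; a real encoder model produces far larger vectors.
items = [
    ("hat", [1.0, 0.0]),
    ("shoes", [0.0, 1.0]),
    ("cap", [0.9, 0.1]),
    ("boots", [0.1, 0.9]),
]

result = top_k([1.0, 0.0], items, k=2)
print(result)
```

In the SQL example, the encoder model turns both the query text and the stored rows into vectors, so a query such as `I like it` can be compared against the product data in the same way.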

Now let's try a retriever without auto embedding. This means that the application has control over when the embeddings are computed in a bulk fashion. For demonstration we can simply create a second retriever for the same products table that we just created above:
### Working without auto embedding

You can now create a retriever without auto embedding. This means that the application controls when the embeddings computation occurs, and that the computation runs as a bulk operation. For demonstration, you can create a second retriever for the same products table that the first retriever uses, this time setting `auto_embedding` to false.

```sql
SELECT pgai.create_pg_retriever(
Expand All @@ -100,8 +140,7 @@ __OUTPUT__
(1 row)
```


We created this second retriever on the products table after we have inserted the AI records there. If we run a retrieve operation now we would not get back any results:
The AI records are already in the table though. As this second retriever is newly created, it won't have created any embeddings. Running `pgai.retrieve` using the retriever now doesn't return any results:

```sql
SELECT data FROM pgai.retrieve(
Expand All @@ -115,7 +154,15 @@ __OUTPUT__
(0 rows)
```

That's why we first need to run a bulk generation of embeddings. This is achieved via the `refresh_retriever()` function:
You need to run a bulk generation of embeddings before performing any retrieval. You can do this using the `pgai.refresh_retriever` function:

```
pgai.refresh_retriever(
retriever_name text
)
```

The `retriever_name` is the name of the retriever. In this example, the retriever's name is `product_embeddings_bulk`, so the SQL command is:

```sql
SELECT pgai.refresh_retriever(
Expand All @@ -129,7 +176,7 @@ INFO: inserted table name public._pgai_embeddings_product_embeddings_bulk
(1 row)
```

Now we can run the same retrieve operation with the second retriever as above:
You can now run that retrieve operation using the second retriever and get the same results as with the first retriever:

```sql
SELECT data FROM pgai.retrieve(
Expand All @@ -148,7 +195,7 @@ __OUTPUT__
(5 rows)
```

Now let's see what happens if we add additional AI data records:
The next step is to see what happens when you add more AI data records:

```sql
INSERT INTO products (product_name, description) VALUES
Expand Down Expand Up @@ -177,7 +224,7 @@ __OUTPUT__
(5 rows)
```

At the same time the second retriever without auto embedding does not reflect the new data until there is another explicit refresh_retriever() run:
The second retriever, without auto embedding, doesn't reflect the new data. It can do so only after another explicit call to `pgai.refresh_retriever`. Until then, the results don't change:

```sql
SELECT data FROM pgai.retrieve(
Expand All @@ -196,7 +243,7 @@ __OUTPUT__
(5 rows)
```

If we now call `refresh_retriever()` again, the new data is picked up:
If you now call `pgai.refresh_retriever()` again, the embeddings computation uses the new data to refresh the embeddings:

```sql
SELECT pgai.refresh_retriever(
Expand All @@ -208,7 +255,7 @@ INFO: inserted table name public._pgai_embeddings_product_embeddings_bulk
-------------------
```

And will be returned when we run the retrieve operation again:
And the new data shows up in the results of the query when you call the `pgai.retrieve` function again:

```sql
SELECT data FROM pgai.retrieve(
Expand All @@ -227,6 +274,8 @@ __OUTPUT__
(5 rows)
```

We used the two different retrievers for the same source data just to demonstrate the workings of auto embedding compared to explicit `refresh_retriever()`. In practice you may want to combine auto embedding and refresh_retriever() in a single retriever to conduct an initial embedding of data that existed before you created the retriever and then rely on auto embedding for any future data that is ingested, updated or deleted.
You used the two different retrievers for the same source data just to demonstrate the workings of auto embedding compared to explicit `refresh_retriever()`.

In practice, you may want to combine auto embedding and `pgai.refresh_retriever` in a single retriever: run an initial embedding of data that existed before you created the retriever, and then rely on auto embedding for any future data that's ingested, updated, or deleted.

You should consider relying on `refresh_retriever()` only, without auto embedding, if you typically ingest a lot of AI data at once in a batched manner.
You should consider relying on `pgai.refresh_retriever`, and not using auto embedding, if you typically ingest a lot of AI data at once as a batch.
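
The difference between the two modes, and the combined pattern, can be modeled with a small Python simulation. This is a toy sketch under stated assumptions, not pgai code: `ToyRetriever`, `toy_embed`, and `insert` are hypothetical names, and the "embedding" is a stand-in with no semantic meaning.

```python
def toy_embed(text):
    """Stand-in for an encoder model; NOT a real embedding."""
    return [len(text), sum(map(ord, text)) % 97]


class ToyRetriever:
    def __init__(self, table, auto_embedding):
        self.table = table                  # shared list of source rows
        self.auto_embedding = auto_embedding
        self.embeddings = {}                # row index -> vector

    def on_insert(self, row_id):
        # Models the trigger that auto embedding sets on the source table.
        if self.auto_embedding:
            self.embeddings[row_id] = toy_embed(self.table[row_id])

    def refresh(self):
        # Models a bulk refresh: embed every row currently in the table.
        self.embeddings = {i: toy_embed(t) for i, t in enumerate(self.table)}


def insert(table, retrievers, text):
    """Append a row and notify every retriever, as a table trigger would."""
    table.append(text)
    for retriever in retrievers:
        retriever.on_insert(len(table) - 1)


table = []
auto = ToyRetriever(table, auto_embedding=True)
bulk = ToyRetriever(table, auto_embedding=False)

insert(table, [auto, bulk], "Hat")
insert(table, [auto, bulk], "Shoes")
counts_before = (len(auto.embeddings), len(bulk.embeddings))

bulk.refresh()  # explicit bulk generation of embeddings
counts_after = (len(auto.embeddings), len(bulk.embeddings))

# Combined pattern: a retriever created after rows already exist,
# refreshed once and then kept current by auto embedding.
late = ToyRetriever(table, auto_embedding=True)
late.refresh()
insert(table, [auto, bulk, late], "Scarf")
counts_final = (len(auto.embeddings), len(bulk.embeddings), len(late.embeddings))

print(counts_before, counts_after, counts_final)
```

In this model, the retriever without auto embedding stays one row behind after the last insert until it's refreshed again, which mirrors the behavior shown in the examples on this page.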
3 changes: 2 additions & 1 deletion advocacy_docs/edb-postgres-ai/analytics/concepts.mdx
Expand Up @@ -2,6 +2,7 @@
title: Concepts - EDB Postgres Lakehouse
navTitle: Concepts
description: Learn about the ideas and terminology behind EDB Postgres Lakehouse for Analytics workloads.
deepToC: true
---

EDB Postgres Lakehouse is the solution for running Rapid Analytics against
Expand Down Expand Up @@ -121,4 +122,4 @@ Here's a slightly more comprehensive diagram of how these services fit together:

Here's the more detailed, zoomed-in view of "what's in the box":

[![Level 200 Architecture](images/level-300.svg)](images/level-300.svg)
[![Level 300 Architecture](images/level-300.svg)](images/level-300.svg)
Expand Up @@ -19,9 +19,9 @@ The Lakehouse sync process organizes the transactional database data into Lakeho

### Navigate to Lakehouse Sync

1. Go to the [EDB Postgres AI Console]().
1. Go to the [EDB Postgres AI Console](https://portal.biganimal.com/beacon).

2. From the landing page, select the project with the database instance you want to sync. If it is not shown on the landing page, select the **View Projects** link in the **Projects** section and select your project from there.
2. From the landing page, select the project with the database instance you want to sync. If it's not shown on the landing page, select the **View Projects** link in the **Projects** section and select your project from there.

3. Select the **Migrate** dropdown in the left navigation bar and then select **Migrations**.

Expand Down Expand Up @@ -49,8 +49,9 @@ The Lakehouse sync process organizes the transactional database data into Lakeho

11. Select the **Start Lakehouse Sync** button.

12. If successful, you will see your Lakehouse sync with the 'Creating' status under 'MOST RECENT' migrations on the Migrations page. The time taken to perform a sync can depend upon how much data is being synchronized and may take several hours.
12. If successful, you'll see your Lakehouse sync with the 'Creating' status under 'MOST RECENT' migrations on the Migrations page. The time taken to perform a sync can depend upon how much data is being synchronized and may take several hours.

!!! Warning
!!! Note
The first sync in a project will take a couple of hours due to the provisioning of the required infrastructure.
!!!
!!!

1 change: 1 addition & 0 deletions advocacy_docs/edb-postgres-ai/analytics/index.mdx
Expand Up @@ -3,6 +3,7 @@ title: Lakehouse analytics
navTitle: Lakehouse analytics
indexCards: simple
iconName: Improve
description: How EDB Postgres Lakehouse extends the power of Postgres by adding a vectorized query engine and separating storage from compute, to handle analytical workloads.
navigation:
- concepts
- quick_start
15 changes: 9 additions & 6 deletions advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
title: Quick Start - EDB Postgres Lakehouse
navTitle: Quick Start
description: Launch a Lakehouse node and query sample data.
deepToC: true
---

In this guide, you will:
Expand Down Expand Up @@ -43,7 +44,7 @@ in object storage.

Here's what's in the box of a Lakehouse node:

![Level 300 Architecture of Postgres Lakehouse node](./images/level-300-architecture.png)
[![Level 300 Architecture](images/level-300.svg)](images/level-300.svg)

## Getting started

Expand All @@ -55,9 +56,11 @@ a project, you can create a cluster.
You will see a “Lakehouse Analytics” option under the “Create New” dropdown
on your project page:

![Create Lakehouse Node Dropdown](./images/create-cluster-dropdown.png)
<center>
<a href="./images/create-cluster-dropdown.svg"><img width="50%" src="./images/create-cluster-dropdown.svg" alt="Create Lakehouse Node Dropdown" /></a>
</center>

Clicking this button will start a configuration wizard that looks like this:
Selecting the "Lakehouse Analytics" button starts a configuration wizard that looks like this:

![Create Lakehouse Node Wizard Step 1](./images/create-cluster-wizard.png)

Expand Down Expand Up @@ -97,7 +100,7 @@ cluster). Then you can copy the connection string and use it as an argument to

In general, you should be able to connect to the database with any Postgres
client. We expect all introspection queries to work, and if you find one that
does not, then that is a bug.
doesn't, then that's a bug.

### Understand the constraints

Expand Down Expand Up @@ -125,11 +128,11 @@ see [Reference - Bring your own data](./reference/#advanced-bring-your-own-data)

## Inspect the benchmark datasets

Inspect the Benchmark Datasets. Every cluster has some benchmarking data
Inspect the Benchmark Datasets. Every cluster has some benchmarking data
available out of the box. If you are using pgcli, you can run `\dn` to see
the available tables.

The available benchmarking datsets are:
The available benchmarking datasets are:

* TPC-H, at scale factors 1, 10, 100 and 1000
* TPC-DS, at scale factors 1, 10, 100 and 1000
1 change: 1 addition & 0 deletions advocacy_docs/edb-postgres-ai/analytics/reference.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
title: Reference - EDB Postgres Lakehouse
navTitle: Reference
description: Things to know about EDB Postgres Lakehouse
deepToC: true
---

Postgres Lakehouse is an early product. Eventually, it will support deployment