Merge pull request #5750 from EnterpriseDB/e10ai-over-overview
DOCS-725 - Prepping for Big Animal -> Cloud Service
djw-m authored Jun 24, 2024
2 parents d9c45e7 + e7dfa45 commit 95e3933
Showing 54 changed files with 1,172 additions and 485 deletions.
5 changes: 3 additions & 2 deletions advocacy_docs/edb-postgres-ai/ai-ml/index.mdx
@@ -3,15 +3,16 @@
title: EDB Postgres AI - AI/ML
navTitle: AI/ML
indexCards: simple
iconName: BrainCircuit
description: How to make use of EDB Postgres AI for AI/ML workloads and using the pgai extension.
navigation:
- overview
- install-tech-preview
- using-tech-preview
---

EDB Postgres® AI Database is designed to solve all AI data management needs, including storing, searching, and retrieving of AI data. This uplevels Postgres to a database that manages and serves all types of data modalities directly and combines it with its battle-proof strengths as an established Enterprise system of record that manages high-value business data.
EDB Postgres® AI Database is designed to solve all AI data management needs, including storing, searching, and retrieving AI data. This up-levels Postgres to a database that manages and serves all types of data modalities directly, combining that capability with its battle-proven strengths as an established enterprise system of record for high-value business data.

In this tech preview, you will be able to use the pgai extension to build a simple retrieval augmented generation (RAG) application in Postgres.
In this tech preview, you can use the pgai extension to build a simple retrieval augmented generation (RAG) application in Postgres.

An [overview](overview) of the pgai extension gives you a high-level understanding of the major functionality available to date.

@@ -4,11 +4,35 @@
navTitle: Working with AI data in S3
description: How to work with AI data stored in S3-compatible object storage using the pgai extension.
---

We recommend you to prepare your own S3 compatible object storage bucket with some test data and try the steps in this section with that. But it is possible to simply use the example S3 bucket data as is in the examples here even with your custom access key and secret key credentials because these have been configured for public access.
The following examples demonstrate how to use the pgai functions with S3-compatible object storage. You can use the examples as is, because they use a publicly accessible example S3 bucket, or you can prepare your own S3-compatible object storage bucket with some test data and try the steps in this section with that data.

In addition we use image data and an according image encoder LLM in this example instead of text data. But you could also use plain text data on object storage similar to the examples in the previous section.
These examples also use image data and an appropriate image encoder LLM instead of text data. You could, though, use plain text data on object storage similar to the examples in [Working with AI data in Postgres](working-with-ai-data-in-postgres).

First let's create a retriever for images stored on s3-compatible object storage as the source. We specify torsten as the bucket name and an endpoint URL where the bucket is created. We specify an empty string as prefix because we want all the objects in that bucket. We use the [`clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32) open encoder model for image data from HuggingFace. We provide a name for the retriever so that we can identify and reference it subsequent operations:
### Creating a retriever

Start by creating a retriever for images stored on S3-compatible object storage as the source, using the `pgai.create_s3_retriever` function.

```sql
pgai.create_s3_retriever(
retriever_name text,
schema_name text,
model_name text,
data_type text,
bucket_name text,
prefix text,
endpoint_url text
)
```

* The `retriever_name` is used to identify and reference the retriever; set it to `image_embeddings` for this example.
* The `schema_name` is the schema where the source table is located.
* The `model_name` is the name of the embeddings encoder model for similarity data; set it to [`clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32) to use the open encoder model for image data from HuggingFace.
* The `data_type` is the type of data in the source table, which can be either `img` or `text`; set it to `img`.
* The `bucket_name` is the name of the S3 bucket where the data is stored; set this to `torsten`.
* The `prefix` is the prefix of the objects in the bucket; set this to an empty string because you want all the objects in that bucket.
* The `endpoint_url` is the URL of the S3 endpoint; set this to `https://s3.us-south.cloud-object-storage.appdomain.cloud` to access the public example bucket.

This gives the following SQL command:

```sql
SELECT pgai.create_s3_retriever(
@@ -27,8 +51,9 @@
__OUTPUT__
(1 row)
```
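
For reference, here's a sketch of the complete call, using the values described above. The `public` schema name is an assumption, since the text doesn't give a value for `schema_name`:

```sql
SELECT pgai.create_s3_retriever(
    'image_embeddings',                                         -- retriever_name
    'public',                                                   -- schema_name (assumed)
    'clip-vit-base-patch32',                                    -- model_name
    'img',                                                      -- data_type
    'torsten',                                                  -- bucket_name
    '',                                                         -- prefix: empty to include all objects
    'https://s3.us-south.cloud-object-storage.appdomain.cloud'  -- endpoint_url
);
```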

### Refreshing the retriever

Next, run the refresh_retriever function.
Next, run the `pgai.refresh_retriever` function.

```sql
SELECT pgai.refresh_retriever('image_embeddings');
@@ -38,8 +63,13 @@
__OUTPUT__

(1 row)
```

Finally, run the retrieve_via_s3 function with the required parameters to retrieve the top K most relevant (most similar) AI data items. Be aware that the object type is currently limited to image and text files.

### Retrieving data

Finally, run the `pgai.retrieve_via_s3` function with the required parameters to retrieve the top K most relevant (most similar) AI data items. Be aware that the object type is currently limited to image and text files.

```sql
SELECT data from pgai.retrieve_via_s3(
@@ -4,11 +4,11 @@
navTitle: Working with AI data in Postgres
description: How to work with AI data stored in Postgres tables using the pgai extension.
---

We will first look at working with AI data stored in columns in the Postgres table.
The examples on this page are about working with AI data stored in columns in the Postgres table.

To see how to use AI data stored in S3-compatible object storage, skip to the next section.
To see how to use AI data stored in S3-compatible object storage, skip to [working with AI data in S3](working-with-ai-data-in-S3).

First let's create a Postgres table for some test AI data:
Begin by creating a Postgres table for some test AI data:

```sql
CREATE TABLE products (
@@ -21,8 +21,33 @@
__OUTPUT__
CREATE TABLE
```
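
For reference, here's a minimal sketch of what the products table might look like, assuming a serial primary key and plain text columns for the fields used later in this example (the exact column types are assumptions):

```sql
-- Sketch only: column types are assumptions based on the columns
-- referenced later (product_id, product_name, description).
CREATE TABLE products (
    product_id   SERIAL PRIMARY KEY,
    product_name TEXT NOT NULL,
    description  TEXT
);
```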

## Working with auto embedding

Now let's create a retriever with the just created products table as the source. We specify product_id as the unique key column to and we define the product_name and description columns to use for the similarity search by the retriever. We use the `all-MiniLM-L6-v2` open encoder model from HuggingFace. We set `auto_embedding` to True so that any future insert, update or delete to the source table will automatically generate, update or delete also the corresponding embedding. We provide a name for the retriever so that we can identify and reference it subsequent operations:
Next, create a retriever with the just-created products table as the source, using the `pgai.create_pg_retriever` function, which has this syntax:

```sql
pgai.create_pg_retriever(
retriever_name text,
schema_name text,
primary_key text,
model_name text,
data_type text,
source_table text,
columns text[],
auto_embedding boolean
)
```

* The `retriever_name` is used to identify and reference the retriever; set it to `product_embeddings_auto` for this example.
* The `schema_name` is the schema where the source table is located; set this to `public`.
* The `primary_key` is the primary key column of the source table.
* The `model_name` is the name of the embeddings encoder model for similarity data; set it to `all-MiniLM-L6-v2` to use the open encoder model for text data from HuggingFace.
* The `data_type` is the type of data in the source table, which can be either `img` or `text`; set it to `text`.
* The `source_table` is the name of the source table. The source table created previously is `products`, so set it to that.
* The `columns` is an array of columns to use for the similarity search by the retriever. Set this to `ARRAY['product_name', 'description']` to use the product_name and description columns.
* The `auto_embedding` is a boolean value that sets a trigger for auto embeddings. Set it to `TRUE` so that any future insert, update, or delete on the source table automatically generates, updates, or deletes the corresponding embedding.

This gives the following SQL command:

```sql
SELECT pgai.create_pg_retriever(
@@ -42,9 +67,8 @@
__OUTPUT__
(1 row)
```
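
For reference, here's a sketch of the complete call, using the values described above:

```sql
SELECT pgai.create_pg_retriever(
    'product_embeddings_auto',              -- retriever_name
    'public',                               -- schema_name
    'product_id',                           -- primary_key
    'all-MiniLM-L6-v2',                     -- model_name
    'text',                                 -- data_type
    'products',                             -- source_table
    ARRAY['product_name', 'description'],   -- columns
    TRUE                                    -- auto_embedding
);
```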



Now let's insert some AI data records into the products table. Since we have set auto_embedding to True, the retriever will automatically generate all embeddings in real-time for each inserted record:
You have now created a retriever for the products table. The next step is to insert some AI data records into it.
Since you set `auto_embedding` to true, the retriever automatically generates the embeddings in real time for each inserted record:

```sql
INSERT INTO products (product_name, description) VALUES
@@ -61,7 +85,21 @@
__OUTPUT__
INSERT 0 9
```

Now we can directly use the retriever (specifying the retriever name) for a similarity retrieval of the top K most relevant (most similar) AI data items:
Now you can use the retriever, by specifying its name, to perform a similarity retrieval of the top K most relevant (most similar) AI data items. You do this by running the `pgai.retrieve` function with the required parameters:

```sql
pgai.retrieve(
query text,
top_k integer,
retriever_name text
)
```

* The `query` is the text to use to retrieve the top similar data; set it to `I like it`.
* The `top_k` is the number of top similar data items to retrieve; set this to `5`.
* The `retriever_name` is the name of the retriever; set it to `product_embeddings_auto`.

This gives the following SQL command:

```sql
SELECT data FROM pgai.retrieve(
@@ -80,7 +118,9 @@
__OUTPUT__
(5 rows)
```
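
For reference, here's a sketch of the complete call with those values:

```sql
SELECT data FROM pgai.retrieve(
    'I like it',                -- query
    5,                          -- top_k
    'product_embeddings_auto'   -- retriever_name
);
```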

Now let's try a retriever without auto embedding. This means that the application has control over when the embeddings are computed in a bulk fashion. For demonstration we can simply create a second retriever for the same products table that we just created above:
## Working without auto embedding

You can now create a retriever without auto embedding. This means that the application controls when the embeddings computation occurs, and that the computation runs as a bulk operation. For demonstration, create a second retriever for the same products table you used for the first retriever, this time setting `auto_embedding` to false.

```sql
SELECT pgai.create_pg_retriever(
@@ -100,8 +140,7 @@
__OUTPUT__
(1 row)
```
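
For reference, here's a sketch of that call. It reuses the same values as the first retriever, with a new name and `auto_embedding` set to `FALSE`:

```sql
SELECT pgai.create_pg_retriever(
    'product_embeddings_bulk',              -- retriever_name
    'public',                               -- schema_name
    'product_id',                           -- primary_key
    'all-MiniLM-L6-v2',                     -- model_name
    'text',                                 -- data_type
    'products',                             -- source_table
    ARRAY['product_name', 'description'],   -- columns
    FALSE                                   -- auto_embedding: compute embeddings in bulk later
);
```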


We created this second retriever on the products table after we have inserted the AI records there. If we run a retrieve operation now we would not get back any results:
The AI records are already in the table, though. Because this second retriever is newly created, it hasn't yet computed any embeddings. Running `pgai.retrieve` with this retriever now returns no results:

```sql
SELECT data FROM pgai.retrieve(
@@ -115,7 +154,15 @@
__OUTPUT__
(0 rows)
```

That's why we first need to run a bulk generation of embeddings. This is achieved via the `refresh_retriever()` function:
You need to run a bulk generation of embeddings before performing any retrieval. You can do this using the `pgai.refresh_retriever` function:

```sql
pgai.refresh_retriever(
retriever_name text
)
```

The `retriever_name` is the name of the retriever, in this case `product_embeddings_bulk`. So the SQL command is:

```sql
SELECT pgai.refresh_retriever(
@@ -129,7 +176,7 @@
INFO: inserted table name public._pgai_embeddings_product_embeddings_bulk
(1 row)
```
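
Written out in full, that command looks like this:

```sql
SELECT pgai.refresh_retriever('product_embeddings_bulk');
```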

Now we can run the same retrieve operation with the second retriever as above:
You can now run that retrieve operation using the second retriever and get the same results as with the first retriever:

```sql
SELECT data FROM pgai.retrieve(
@@ -148,7 +195,7 @@
__OUTPUT__
(5 rows)
```

Now let's see what happens if we add additional AI data records:
The next step is to see what happens when you add more AI data records:

```sql
INSERT INTO products (product_name, description) VALUES
@@ -177,7 +224,7 @@
__OUTPUT__
(5 rows)
```

At the same time the second retriever without auto embedding does not reflect the new data until there is another explicit refresh_retriever() run:
The second retriever, without auto embedding, doesn't reflect the new data. It can only do so once there's been another explicit call to `pgai.refresh_retriever`. Until then, the results don't change:

```sql
SELECT data FROM pgai.retrieve(
@@ -196,7 +243,7 @@
__OUTPUT__
(5 rows)
```

If we now call `refresh_retriever()` again, the new data is picked up:
If you now call `pgai.refresh_retriever()` again, the embeddings computation uses the new data to refresh the embeddings:

```sql
SELECT pgai.refresh_retriever(
@@ -208,7 +255,7 @@
INFO: inserted table name public._pgai_embeddings_product_embeddings_bulk
-------------------
```

And will be returned when we run the retrieve operation again:
And the new data shows up in the results of the query when you call the `pgai.retrieve` function again:

```sql
SELECT data FROM pgai.retrieve(
@@ -227,6 +274,8 @@
__OUTPUT__
(5 rows)
```

We used the two different retrievers for the same source data just to demonstrate the workings of auto embedding compared to explicit `refresh_retriever()`. In practice you may want to combine auto embedding and refresh_retriever() in a single retriever to conduct an initial embedding of data that existed before you created the retriever and then rely on auto embedding for any future data that is ingested, updated or deleted.
You used the two different retrievers for the same source data just to demonstrate the workings of auto embedding compared to explicit `refresh_retriever()`.

In practice, you may want to combine auto embedding and `pgai.refresh_retriever` in a single retriever: run an initial bulk embedding of data that existed before you created the retriever, and then rely on auto embedding for any future data that's ingested, updated, or deleted.

You should consider relying on `refresh_retriever()` only, without auto embedding, if you typically ingest a lot of AI data at once in a batched manner.
You should consider relying on `pgai.refresh_retriever`, and not using auto embedding, if you typically ingest a lot of AI data at once as a batch.
3 changes: 2 additions & 1 deletion advocacy_docs/edb-postgres-ai/analytics/concepts.mdx
@@ -2,6 +2,7 @@
title: Concepts - EDB Postgres Lakehouse
navTitle: Concepts
description: Learn about the ideas and terminology behind EDB Postgres Lakehouse for Analytics workloads.
deepToC: true
---

EDB Postgres Lakehouse is the solution for running Rapid Analytics against
@@ -121,4 +122,4 @@
Here's a slightly more comprehensive diagram of how these services fit together:

Here's the more detailed, zoomed-in view of "what's in the box":

[![Level 200 Architecture](images/level-300.svg)](images/level-300.svg)
[![Level 300 Architecture](images/level-300.svg)](images/level-300.svg)
@@ -19,9 +19,9 @@
The Lakehouse sync process organizes the transactional database data into Lakeho

### Navigate to Lakehouse Sync

1. Go to the [EDB Postgres AI Console]().
1. Go to the [EDB Postgres AI Console](https://portal.biganimal.com/beacon).

2. From the landing page, select the project with the database instance you want to sync. If it is not shown on the landing page, select the **View Projects** link in the **Projects** section and select your project from there.
2. From the landing page, select the project with the database instance you want to sync. If it's not shown on the landing page, select the **View Projects** link in the **Projects** section and select your project from there.

3. Select the **Migrate** dropdown in the left navigation bar and then select **Migrations**.

@@ -49,8 +49,9 @@
The Lakehouse sync process organizes the transactional database data into Lakeho

11. Select the **Start Lakehouse Sync** button.

12. If successful, you will see your Lakehouse sync with the 'Creating' status under 'MOST RECENT' migrations on the Migrations page. The time taken to perform a sync can depend upon how much data is being synchronized and may take several hours.
12. If successful, you'll see your Lakehouse sync with the 'Creating' status under 'MOST RECENT' migrations on the Migrations page. The time taken to perform a sync can depend upon how much data is being synchronized and may take several hours.

!!! Warning
!!! Note
The first sync in a project will take a couple of hours due to the provisioning of the required infrastructure.
!!!
!!!

1 change: 1 addition & 0 deletions advocacy_docs/edb-postgres-ai/analytics/index.mdx
@@ -3,6 +3,7 @@
title: Lakehouse analytics
navTitle: Lakehouse analytics
indexCards: simple
iconName: Improve
description: How EDB Postgres Lakehouse extends the power of Postgres by adding a vectorized query engine and separating storage from compute, to handle analytical workloads.
navigation:
- concepts
- quick_start
15 changes: 9 additions & 6 deletions advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx
@@ -2,6 +2,7 @@
title: Quick Start - EDB Postgres Lakehouse
navTitle: Quick Start
description: Launch a Lakehouse node and query sample data.
deepToC: true
---

In this guide, you will:
@@ -43,7 +44,7 @@
in object storage.

Here's what's in the box of a Lakehouse node:

![Level 300 Architecture of Postgres Lakehouse node](./images/level-300-architecture.png)
[![Level 300 Architecture](images/level-300.svg)](images/level-300.svg)

## Getting started

@@ -55,9 +56,11 @@
a project, you can create a cluster.
You will see a “Lakehouse Analytics” option under the “Create New” dropdown
on your project page:

![Create Lakehouse Node Dropdown](./images/create-cluster-dropdown.png)
<center>
<a href="./images/create-cluster-dropdown.svg"><img width="50%" src="./images/create-cluster-dropdown.svg" alt="Create Lakehouse Node Dropdown" /></a>
</center>

Clicking this button will start a configuration wizard that looks like this:
Selecting the "Lakehouse Analytics" option starts a configuration wizard that looks like this:

![Create Lakehouse Node Wizard Step 1](./images/create-cluster-wizard.png)

@@ -97,7 +100,7 @@
cluster). Then you can copy the connection string and use it as an argument to

In general, you should be able to connect to the database with any Postgres
client. We expect all introspection queries to work, and if you find one that
does not, then that is a bug.
doesn't, then that's a bug.
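
For example, a simple introspection query that lists the available schemas is a quick way to check that your client connection works (this is a generic Postgres catalog query, not specific to Lakehouse):

```sql
-- List schemas visible in the database.
SELECT nspname AS schema_name
FROM pg_catalog.pg_namespace
ORDER BY schema_name;
```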

### Understand the constraints

@@ -125,11 +128,11 @@
see [Reference - Bring your own data](./reference/#advanced-bring-your-own-data)

## Inspect the benchmark datasets

Inspect the Benchmark Datasets. Every cluster has some benchmarking data
Inspect the Benchmark Datasets. Every cluster has some benchmarking data
available out of the box. If you are using pgcli, you can run `\dn` to see
the available schemas.

The available benchmarking datsets are:
The available benchmarking datasets are:

* TPC-H, at scale factors 1, 10, 100 and 1000
* TPC-DS, at scale factors 1, 10, 100 and 1000
1 change: 1 addition & 0 deletions advocacy_docs/edb-postgres-ai/analytics/reference.mdx
@@ -2,6 +2,7 @@
title: Reference - EDB Postgres Lakehouse
navTitle: Reference
description: Things to know about EDB Postgres Lakehouse
deepToC: true
---

Postgres Lakehouse is an early product. Eventually, it will support deployment