Release: 2024-12-18a #6359

Merged
merged 30 commits on Dec 18, 2024
Commits
9b88070
First pass at changes.
jpe442 Nov 20, 2024
18b867c
Small changes
jpe442 Nov 20, 2024
6df8840
Changed title to gerund
jpe442 Nov 20, 2024
febc4a6
Small changes.
jpe442 Nov 20, 2024
17922af
Small change in wording.
jpe442 Dec 2, 2024
0f56699
Another small word change.
jpe442 Dec 2, 2024
b79b527
docs: review
NiccoloFei Dec 4, 2024
226e7a3
First draft added sub-json.
jpe442 Dec 12, 2024
8a036bd
Improved language.
jpe442 Dec 16, 2024
aef5202
Nested sub-field topic under full document retrieval
jpe442 Dec 16, 2024
e98e2b5
Fixed indentations of code blocks for using easier numerics
jpe442 Dec 16, 2024
a41ba4a
Update product_docs/docs/mongo_data_adapter/5/06_features_of_mongo_fd…
jpe442 Dec 16, 2024
303d5ff
Updated front page
djw-m Dec 17, 2024
a286607
Small change to formatting.
jpe442 Dec 17, 2024
a6d0877
Typo.
jpe442 Dec 17, 2024
cee516b
Merge pull request #6353 from EnterpriseDB/docs/pg4k/frontpageeditfor…
djw-m Dec 17, 2024
4e6caa5
Fixed typo.
jpe442 Dec 17, 2024
febb21e
Merge pull request #6270 from EnterpriseDB/DOCS-776
jpe442 Dec 17, 2024
0e0569a
Update installation_upgrade.mdx fix typo
codepope Dec 18, 2024
f98d2cf
Merge pull request #6358 from codepope/patch-3
djw-m Dec 18, 2024
6c7495f
Merge pull request #6350 from EnterpriseDB/DOCS-1142
jpe442 Dec 18, 2024
db35426
First pass update
djw-m Nov 1, 2024
dd21307
Revisions as per review comments.
djw-m Nov 4, 2024
5cce06b
Fixes for linkses
djw-m Nov 4, 2024
84cb51a
Case changes
djw-m Nov 4, 2024
c86608f
Update advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx
djw-m Dec 5, 2024
6c45b5f
Update advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx
djw-m Dec 5, 2024
50bd15b
Update advocacy_docs/edb-postgres-ai/analytics/reference/loadingdata.mdx
djw-m Dec 5, 2024
68b8ee9
Update advocacy_docs/edb-postgres-ai/analytics/reference/functions.mdx
djw-m Dec 5, 2024
5a2fcb0
Merge pull request #6199 from EnterpriseDB/DOCS-1100-update-lakehouse…
djw-m Dec 18, 2024
66 changes: 66 additions & 0 deletions advocacy_docs/edb-postgres-ai/analytics/external_tables.mdx
@@ -0,0 +1,66 @@
---
title: Querying Delta Lake Tables in S3-compatible object storage
navTitle: External Tables
description: Access and query data stored as Delta Lake Tables in S3-compatible object storage using external tables
deepToC: true
---

## Overview

External tables let you access and query data stored in S3-compatible object storage using SQL. You create an external table that references data in S3-compatible object storage and then query it with standard SQL commands.

## Prerequisites

* An EDB Postgres AI account and a Lakehouse node.
* An S3-compatible object storage location with data stored as Delta Lake Tables.
* See [Bringing your own data](reference/loadingdata) for more information on how to prepare your data.
* Credentials to access the S3-compatible object storage location, unless it is a public bucket.
* These credentials will be stored within the database. We recommend creating a separate user with limited permissions for this purpose.

!!! Note Regions, latency and cost
Using an S3 bucket that isn't in the same region as your node will:

* be slow because of cross-region latencies
* incur AWS costs (between $0.01 and $0.02/GB) for data transfer. Currently, these egress costs aren't passed through to you, but we do track them and reserve the right to terminate an instance.
!!!

## Creating an External Storage Location

The first step is to create an external storage location that references the S3-compatible object storage where your data resides. A storage location is an object within the database that you refer to by name when accessing the data.

You create a named storage location with SQL by executing the `pgaa.create_storage_location` function.
`pgaa` is the name of the extension and namespace that provides the functionality to query external storage locations.
The `create_storage_location` function takes a name for the new storage location and the URI of the S3-compatible object storage location as parameters.
The function can optionally take a third parameter, `options`, a JSON object for specifying optional settings, detailed in the [functions reference](reference/functions#pgaacreate_storage_location).
For example, in the options, you can specify the access key ID and secret access key for the storage location to enable access to a private bucket.

The following example creates an external storage location that references a public S3-compatible object storage location:

```sql
SELECT pgaa.create_storage_location('sample-data', 's3://pgaa-sample-data-eu-west-1');
```

The next example creates an external storage location that references a private bucket, supplying credentials in the `options` parameter:

```sql
SELECT pgaa.create_storage_location(
    'private-data',
    's3://my-private-bucket',
    '{"access_key_id": "my-access-key-id", "secret_access_key": "my-secret-access-key"}'
);
```

## Creating an External Table

After creating the external storage location, you can create an external table that references the data in the storage location.
The following example creates an external table that references a Delta Lake Table in the S3-compatible object storage location:

```sql
CREATE TABLE public.customer ()
USING PGAA
WITH (pgaa.storage_location = 'sample-data', pgaa.path = 'tpch_sf_1/customer');
```

Note that no columns are defined in the `CREATE TABLE` statement. The pgaa extension derives the table's schema from the Delta Lake Table stored at the path specified in the `pgaa.path` option, and it infers the best Postgres-equivalent data types for the columns in the Delta Table.
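
To confirm what was inferred, you can inspect the resulting column definitions with a standard catalog query. This is a minimal sketch; the column names and types you see depend on your Delta Table:

```sql
-- List the columns and Postgres types that pgaa inferred for the new table
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'customer'
ORDER BY ordinal_position;
```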

## Querying an External Table

After creating the external table, you can query its data using standard SQL commands. The following example queries the external table created in the previous step:

```sql
SELECT COUNT(*) FROM public.customer;
```
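
Beyond simple counts, any read-only SQL works against the external table. As a slightly larger sketch, the following aggregate assumes the standard TPC-H `customer` columns (such as `c_mktsegment`); your table's columns may differ:

```sql
-- Count customers per market segment; column names follow the TPC-H schema
SELECT c_mktsegment, COUNT(*) AS customers
FROM public.customer
GROUP BY c_mktsegment
ORDER BY customers DESC;
```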
47 changes: 15 additions & 32 deletions advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx
@@ -81,50 +81,33 @@ Persistent data in system tables (users, roles, etc) is stored in an attached
block storage device and will survive a restart or backup/restore cycle.
* Only Postgres 16 is supported.

For more notes about supported instance sizes,
see [Reference - Supported AWS instances](./reference/#supported-aws-instances).
For more notes about supported instance sizes, see [Reference - Supported AWS instances](./reference/instances).

## Operating a Lakehouse node

### Connect to the node

You can connect to the Lakehouse node with any Postgres client, in the same way
that you connect to any other cluster from EDB Postgres AI Cloud Service
(formerly known as BigAnimal): navigate to the cluster detail page and copy its
connection string.
You can connect to the Lakehouse node with any Postgres client, in the same way that you connect to any other cluster from EDB Postgres AI Cloud Service (formerly known as BigAnimal): navigate to the cluster detail page and copy its connection string.

For example, you might copy the `.pgpass` blob into `~/.pgpass` (making sure to
replace `$YOUR_PASSWORD` with the password you provided when launching the
cluster). Then you can copy the connection string and use it as an argument to
`psql` or `pgcli`.
For example, you might copy the `.pgpass` blob into `~/.pgpass` (making sure to replace `$YOUR_PASSWORD` with the password you provided when launching the cluster).
Then you can copy the connection string and use it as an argument to `psql` or `pgcli`.

In general, you should be able to connect to the database with any Postgres
client. We expect all introspection queries to work, and if you find one that
doesn't, then that's a bug.
In general, you should be able to connect to the database with any Postgres client.
We expect all introspection queries to work, and if you find one that doesn't, then that's a bug.
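
For example, here are a couple of ordinary introspection queries — a sketch only, since any standard catalog query should behave the same way here as on any other Postgres cluster:

```sql
-- Check the server version
SELECT version();

-- List the user-visible tables, excluding the system schemas
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY 1, 2;
```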

### Understand the constraints

* Every cluster uses EPAS or PGE. So expect to see boilerplate tables from those
flavors in the installation when you connect.
* Queryable data (like the benchmarking datasets) is stored in object storage
as Delta Tables. Every cluster comes pre-loaded to point to a storage bucket
with benchmarking data inside (TPC-H, TPC-DS, Clickbench) at
scale factors 1 and 10.
* Every cluster uses EPAS or PGE. So expect to see boilerplate tables from those flavors in the installation when you connect.
* Queryable data (like the benchmarking datasets) is stored in object storage as Delta Tables. Every cluster comes pre-loaded to point to a storage bucket with benchmarking data inside (TPC-H, TPC-DS, Clickbench) at scale factors from 1 to 1000.
* Only AWS is supported at the moment. Bring Your Own Account (BYOA) is not supported.
* You can deploy a cluster in any region that is activated in
your EDB Postgres AI Account. Each region has a bucket with a copy of the
benchmarking data, and so when you launch a cluster, it will use the
benchmarking data in the location closest to it.
* The cluster is ephemeral. None of the data is stored on the hard drive,
except for data in system tables, e.g. roles and users and grants.
If you restart the cluster, or backup the cluster and then restore it,
it will restore these system tables. But the data in object storage will
* You can deploy a cluster in any region that is activated in your EDB Postgres AI Account. Each region has a bucket with a copy of the
benchmarking data, and so when you launch a cluster, it will use the benchmarking data in the location closest to it.
* The cluster is ephemeral. None of the data is stored on the hard drive, except for data in system tables, such as roles, users, and grants.
If you restart the cluster, or back up the cluster and then restore it, it will restore these system tables. But the data in object storage will
remain untouched.
* The cluster supports READ ONLY queries of the data in object
storage (but it supports write queries to system tables for creating users,
* The cluster supports READ ONLY queries of the data in object storage (but it supports write queries to system tables for creating users,
etc.); see the sketch after this list. You cannot write directly to object storage. You cannot create new tables.
* If you want to load your own data into object storage,
see [Reference - Bring your own data](./reference/#advanced-bring-your-own-data).
* If you want to load your own data into object storage, see [Reference - Bring your own data](reference/loadingdata).
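
To make the read-only constraint concrete, here's a minimal sketch, assuming a benchmark table named `public.customer` (the actual table names depend on the dataset, and the exact error text may differ):

```sql
-- Writes against data in object storage are rejected:
DELETE FROM public.customer;   -- fails: the external data is read-only

-- Writes to system tables still work, for example creating a user:
CREATE ROLE analyst LOGIN PASSWORD 'change-me';
```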

## Inspect the benchmark datasets

@@ -140,7 +123,7 @@ The available benchmarking datasets are:
* 1 Billion Row Challenge

For more details on benchmark datasets,
see [Reference - Available benchmarking datasets](./reference/datasets).
see Reference - Available benchmarking datasets](./reference/datasets).

## Query the benchmark datasets
