
Trino Example

This is a simple example intended to be used to gather requirements. Trino is a distributed SQL query engine. In this example we're using it to query data stored as Parquet files in Cloud Storage buckets and as BigQuery tables, and then visualizing the query results with Apache Superset.

The idea is that if we have GCP Epi Workstations set up with RStudio and connected to data stored in Cloud Storage buckets, this could potentially be one solution for accessing federated data stored across various GCP projects and other locations.

An advantage of Trino is that it's quick - it pushes queries down in parallel, using query planning and optimization, to be processed at the source(s) - passing only the query results back over the network.

Trino acts as a translator, with SQL as the common query language. This allows us to query data from a variety of sources that have different query notations.

Trino also allows us to query data located in a multi-cloud setup.
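For example, a single SQL statement can reference tables in different catalogs. Here's a minimal sketch; the catalog, schema, table, and column names below are hypothetical placeholders, not objects from this repo:

-- Hypothetical federated join: one table lives in BigQuery, the other in Cloud Storage.
SELECT bq.province, count(*) AS n_rows
FROM bigquery_catalog.health.cases AS bq
JOIN gcs_catalog.reference.provinces AS gcs
  ON bq.province_code = gcs.province_code
GROUP BY bq.province;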

To connect to data sources, we use connectors. Some of these connectors are native to Trino, like the BigQuery connector, whereas others go through intermediate components, e.g. Apache Hive, to connect to sources like Google Cloud Storage.

Superset is a data query and visualization platform that can be used with Trino to explore the data and create new datasets.

Note

This proof-of-concept is not yet in a fully working state. While we are able to connect to BigQuery tables across various projects, at this point we can only connect to one project's Cloud Storage buckets at a time.

To run

  1. Add service account(s) and generate service account keys for each GCP project, granting BigQuery and Google Cloud Storage permissions. In this case we have lk-sa-key.json and te-sa-key.json. Add these JSON key files to the root of this project (an example key-generation command is shown after this list).

  2. Docker compose up. This generates a large quantity of logs, so we're writing the logs to a file, rather than the terminal, in order to be able to view and search the log history.

docker compose build && docker compose up > docker-compose.log 2>&1

  3. Once running (view the status in docker-compose.log), launch the Trino CLI in another terminal:
docker exec -it trino trino
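
For step 1, service account keys can be generated with gcloud; a sketch (the service account and project names below are placeholders):

gcloud iam service-accounts keys create te-sa-key.json \
  --iam-account=trino-example@your-project-id.iam.gserviceaccount.com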

Trino CLI commands

Here are the CLI commands: https://trino.io/docs/current/client/cli.html

Example (BigQuery)

show catalogs;
show schemas in tebq;
show tables in tebq.cansim_tables;
SELECT * FROM tebq.cansim_tables.life_expectancy LIMIT 5;

Example - Cloud Storage

We first need to create the schema and table metadata in the Hive metastore.

CREATE SCHEMA IF NOT EXISTS tebucket.iris WITH (location = 'gs://bucket-in-other-project/iris');
CREATE TABLE IF NOT EXISTS tebucket.iris.iris_data (
  sepal_length DOUBLE,
  sepal_width  DOUBLE,
  petal_length DOUBLE,
  petal_width  DOUBLE,
  class        VARCHAR
)
WITH (
  external_location = 'gs://bucket-in-other-project/iris/',
  format = 'PARQUET'
);   

Then we can query the table, and join or combine it with other datasets (a cross-catalog sketch follows the query below).

SELECT count(*) FROM tebucket.iris.iris_data;
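
A single statement can also touch both catalogs at once; for example, using the tables from the two examples above:

-- Cross-catalog query: one count from the Cloud Storage table, one from a BigQuery table.
SELECT
  (SELECT count(*) FROM tebucket.iris.iris_data) AS iris_rows,
  (SELECT count(*) FROM tebq.cansim_tables.life_expectancy) AS life_expectancy_rows;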

Trino UI

The Trino UI can be used to view query performance and who ran each query: http://localhost:8080/

Superset

Launch the browser, and log into the Superset UI:

localhost:8088
username: admin
password: admin

Connect each data source (catalog/database), and add a dataset for each table; see the Superset docs.

trino://trino@trino:8080/{catalog_name}
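
For example, for the tebq BigQuery catalog used above, the URI would be:

trino://trino@trino:8080/tebq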

Things to note

  • Use case-insensitive flags in the catalog properties, or put double quotes (" ") around table names, if the table names start with numbers or contain '-' or other special characters.
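
For instance, a hypothetical numeric table name has to be quoted:

SELECT * FROM tebq.cansim_tables."13100370" LIMIT 5;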

To tear down

docker compose down -v

Configuration

Trino

Trino uses connectors to access external data sources. In this example, we're using the BigQuery connector with a JSON GCP service account credentials file (for now) for each catalog (i.e. each data source connection). We're also using the Hive connector in order to access files in Google Cloud Storage, as there's no native GCS connector for Trino.

Configuration files for each catalog are stored in trino/etc/catalog (per the connector docs linked above). When we mount this folder to the /etc/trino/catalog location in the container, the other configuration defaults remain intact.
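
Here's a rough sketch of what the catalog files might look like (property names per the Trino BigQuery and Hive connector docs; the project ID, key-file path, and metastore host are placeholders to adapt):

# trino/etc/catalog/tebq.properties (sketch)
connector.name=bigquery
bigquery.project-id=your-gcp-project-id
bigquery.credentials-file=/etc/trino/lk-sa-key.json

# trino/etc/catalog/tebucket.properties (sketch)
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083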

Hive Metastore Service (HMS)

The HMS provides a central repository of metadata.

For Trino, we're using the Hive metastore service rather than the full Hive distribution. See this article for an explanation of why we're using just the metastore (and its database). Trino uses the Hive connector for Google Cloud Storage, which requires a little setup.

(Hive does have an internal database (Derby) that you could use, but here we're using MariaDB (MySQL-compatible), based on the available examples.)

Trino Architecture

Here are the docs for the Hive metastore.

hive-metastore.xml

  • Contains the database connection credentials, the Thrift URI, and the modifications that enable the standalone Hive metastore configuration.

core-site.xml

  • Contains the GCP/cloud connection credentials, the Hadoop file system configuration, and, in this case, additional connection permissions for GCP.

Dockerfile

Use these resources to set up:

Note - Superset comes with the MySQL database driver, but for any other database connection you will need to pip install the corresponding database driver in the Superset Dockerfile:
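
A minimal sketch of such a Dockerfile, assuming a Trino connection (the exact driver package depends on the database and your Superset version - check the Superset docs):

FROM apache/superset
USER root
# Install the SQLAlchemy driver for the extra database connection (here, Trino)
RUN pip install trino
USER superset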

UI

localhost:8088

Dataset

Presto

Connection SQL URI:

trino://hive-metastore@trino:8080/<catalog name, e.g. tebq>

Other Links and Resources

Trino

Trino - connecting to GCS

Trino Authorization

Hive Metastore

Good explanation of using Hive - you need the Hive metastore service and a database (and the Hadoop distributed filesystem). (We're using this example as a starting point.)

Hive quickstart docker image:

Running metastore without Hive

Configuring hive metastore service

Connect to Google Cloud Storage (with Hive Connector)

Configuration changes for gcs

HortonWorks: Working with Google Cloud Storage

Cloudera: Configuring Google Cloud Storage Connectivity

Setting up gcs as warehouse directory

Examples

This is the one referred to in the previous article, and the one we're using as a base here.

Hive metastore only

Superset
