
Trino Example

This is a simple example intended to be used to gather requirements. Trino is a distributed SQL query engine. In this example we're using it to query data stored as Parquet files in Cloud Storage buckets and as BigQuery tables, and then visualizing the query results with Apache Superset.

The idea is that if we have GCP Epi Workstations set up with RStudio and connected to data stored in Cloud Storage buckets, this could potentially be one solution for accessing federated data stored across various GCP projects and other locations.

An advantage of Trino is that it's quick - it pushes queries down in parallel, using query planning and optimization, to be processed at the source(s) - passing only the query results back over the network.

Trino acts as a translator, with SQL as the common query language. This allows us to query data from a variety of sources that have different query notations.

Trino also allows us to query data located in a multi-cloud setup.
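For example, a single SQL statement can reference tables in different catalogs. Here's a minimal sketch; the catalog, schema, table, and column names below are hypothetical placeholders, not objects from this repo:

-- Hypothetical federated join: one table lives in BigQuery, the other in Cloud Storage.
SELECT bq.province, count(*) AS n_rows
FROM bigquery_catalog.health.cases AS bq
JOIN gcs_catalog.reference.provinces AS gcs
  ON bq.province_code = gcs.province_code
GROUP BY bq.province;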

To connect to data sources, we use connectors. Some of these connectors are native to Trino, like the BigQuery connector, whereas others go through intermediate components, e.g. Apache Hive, to connect to sources like Google Cloud Storage.

Superset is a data query and visualization platform that can be used with Trino to explore the data and create new datasets.

Note

This proof-of-concept is not yet in a fully working state. While we are able to connect to BigQuery tables across various projects, at this point we can only connect to one project's Cloud Storage buckets at a time.

To run

  1. Add service account(s) and generate service account keys for each GCP project, granting BigQuery and Google Cloud Storage permissions. In this case we have lk-sa-key.json and te-sa-key.json. Add these JSON key files to the root of this project (an example key-generation command is shown after this list).

  2. Docker compose up. This generates a large quantity of logs, so we're writing the logs to a file, rather than the terminal, in order to be able to view and search the log history.

docker compose build && docker compose up > docker-compose.log 2>&1

  3. Once running (view the status in docker-compose.log), launch the Trino CLI in another terminal:
docker exec -it trino trino
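
For step 1, service account keys can be generated with gcloud; a sketch (the service account and project names below are placeholders):

gcloud iam service-accounts keys create te-sa-key.json \
  --iam-account=trino-example@your-project-id.iam.gserviceaccount.com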

Trino CLI commands

Here are the CLI commands: https://trino.io/docs/current/client/cli.html

Example (BigQuery)

show catalogs;
show schemas in tebq;
show tables in tebq.cansim_tables;
SELECT * FROM tebq.cansim_tables.life_expectancy LIMIT 5;

Example - Cloud Storage

We first need to create the schema and table metadata in the Hive metastore.

CREATE SCHEMA IF NOT EXISTS tebucket.iris WITH (location = 'gs://bucket-in-other-project/iris');
CREATE TABLE IF NOT EXISTS tebucket.iris.iris_data (
  sepal_length DOUBLE,
  sepal_width  DOUBLE,
  petal_length DOUBLE,
  petal_width  DOUBLE,
  class        VARCHAR
)
WITH (
  external_location = 'gs://bucket-in-other-project/iris/',
  format = 'PARQUET'
);   

Then we can query the table, and join or combine it with other datasets (a cross-catalog sketch follows the query below).

SELECT count(*) FROM tebucket.iris.iris_data;
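
A single statement can also touch both catalogs at once; for example, using the tables from the two examples above:

-- Cross-catalog query: one count from the Cloud Storage table, one from a BigQuery table.
SELECT
  (SELECT count(*) FROM tebucket.iris.iris_data) AS iris_rows,
  (SELECT count(*) FROM tebq.cansim_tables.life_expectancy) AS life_expectancy_rows;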

Trino UI

The Trino UI can be used to view query performance and who ran each query: http://localhost:8080/

Superset

Launch the browser, and log into the Superset UI:

localhost:8088
username: admin
password: admin

Connect each data source (catalog/database), and add a dataset for each table; see the Superset docs.

trino://trino@trino:8080/{catalog_name}
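
For example, for the tebq BigQuery catalog used above, the URI would be:

trino://trino@trino:8080/tebq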

Things to note

  • Use case-insensitive flags in the catalog properties, or put double quotes (" ") around table names, if the table names start with numbers or contain '-' or other special characters.
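
For instance, a hypothetical numeric table name has to be quoted:

SELECT * FROM tebq.cansim_tables."13100370" LIMIT 5;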

To tear down

docker compose down -v

Configuration

Trino

Trino uses connectors to access external data sources. In this example, we're using the BigQuery connector with a JSON GCP service account credentials file (for now) for each catalog (i.e. each data source connection). We're also using the Hive connector in order to access files in Google Cloud Storage, as there's no native GCS connector for Trino.

Configuration files for each catalog are stored in trino/etc/catalog (per the connector docs linked above). When we mount this folder to the /etc/trino/catalog location in the container, the other configuration defaults remain intact.
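
Here's a rough sketch of what the catalog files might look like (property names per the Trino BigQuery and Hive connector docs; the project ID, key-file path, and metastore host are placeholders to adapt):

# trino/etc/catalog/tebq.properties (sketch)
connector.name=bigquery
bigquery.project-id=your-gcp-project-id
bigquery.credentials-file=/etc/trino/lk-sa-key.json

# trino/etc/catalog/tebucket.properties (sketch)
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083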

Hive Metastore Service (HMS)

The HMS provides a central repository of metadata.

For Trino, we're using the Hive metastore service rather than the full Hive distribution. See this article for an explanation of why we're using just the metastore (and its database). Trino uses the Hive connector for Google Cloud Storage, which requires a little setup.

(Hive does have an internal database (Derby) that you could use, but here we're using MariaDB (MySQL-compatible), based on the available examples.)

Trino Architecture

Here are the docs for the Hive metastore.

hive-metastore.xml

  • Contains the database connection credentials, the Thrift URI, and the modifications that enable the standalone Hive metastore configuration.

core-site.xml

  • Contains the GCP/cloud connection credentials, the Hadoop file system configuration, and, in this case, additional connection permissions for GCP.

Dockerfile

Use these resources to set up:

Note - Superset comes with the MySQL database driver, but for any other database connection you will need to pip install the corresponding database driver in the Superset Dockerfile:
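
A minimal sketch of such a Dockerfile, assuming a Trino connection (the exact driver package depends on the database and your Superset version - check the Superset docs):

FROM apache/superset
USER root
# Install the SQLAlchemy driver for the extra database connection (here, Trino)
RUN pip install trino
USER superset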

UI

localhost:8088

Dataset

Presto

Connection SQL URI:

trino://hive-metastore@trino:8080/<catalog name, e.g. tebq>

Other Links and Resources

Trino

Trino - connecting to GCS

Trino Authorization

Hive Metastore

Good explanation of using Hive - you need the Hive metastore service and a database (and the Hadoop distributed filesystem). (We're using this example as a starting point.)

Hive quickstart docker image:

Running metastore without Hive

Configuring hive metastore service

Connect to Google Cloud Storage (with Hive Connector)

Configuration changes for gcs

HortonWorks: Working with Google Cloud Storage

Cloudera: Configuring Google Cloud Storage Connectivity

Setting up gcs as warehouse directory

Examples

This is the one referred to in the previous article, and the one we're using as a base here.

Hive metastore only

Superset
