This deploys a Google Cloud Function that periodically fetches records from subgraphs and exposes them in BigQuery, enabling post-processing of records (e.g. to generate daily balances).
This project has a few components:
- Google Cloud Storage (GCS) bucket to store records (in JSONL files)
  - Files are stored in the following location: `<bucket>/<subgraph name>/<deployment id>/<object>/dt=<YYYY-MM-DD>/records.jsonl`
  - Files are stored in the JSONL (newline-delimited) format in order to make them easy to ingest into BigQuery
  - The parent directory of the `records.jsonl` file contains `dt=` followed by the date, so that Hive partitioning is supported without further transformation (see the Pulumi sketch after this list)
- Google Cloud Function (GCF) to fetch records from the subgraph
  - It will fetch records starting from the latest date present in GCS, or from the finish date specified in a PubSub message (which is useful when a previous invocation timed out).
  - If the function detects that the timeout is approaching, it will stop fetching records and exit.
  - Before exiting, the function sends a message on the specified PubSub Topic, containing both the start date and the finish date of the fetching, so that the next invocation can resume (see the handoff sketch after this list).
- Google Cloud Scheduler Job to trigger the function in GCF.
- BigQuery table exposing the data stored in the GCS bucket.
- Alert policies to send notifications to Discord/email about failures and extraneous executions
- Monitoring dashboard for key metrics: function executions and status, GCS network activity
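
To make the resumption mechanism concrete, here is a minimal sketch of the timeout-handoff pattern described above. The helper name, buffer value, and message shape are assumptions for illustration, not the repository's actual code:

```typescript
import { PubSub } from "@google-cloud/pubsub";

// Hypothetical: the real fetch implementation lives in this repository.
declare function fetchRecordsForDate(date: Date): Promise<void>;

const pubsub = new PubSub();
const TIMEOUT_BUFFER_MS = 30_000; // stop well before the GCF timeout is hit

async function fetchWithHandoff(
  startDate: Date,
  finishDate: Date,
  deadline: number, // epoch ms at which the function would be killed
  topicName: string,
): Promise<void> {
  const current = new Date(startDate);
  while (current < finishDate) {
    if (Date.now() > deadline - TIMEOUT_BUFFER_MS) {
      // Report the range that was actually fetched; the next invocation
      // resumes from the finish date in this message.
      await pubsub.topic(topicName).publishMessage({
        json: {
          startDate: startDate.toISOString(),
          finishDate: current.toISOString(),
        },
      });
      return;
    }
    await fetchRecordsForDate(current); // writes records.jsonl for this date
    current.setUTCDate(current.getUTCDate() + 1);
  }
}
```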
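The `dt=<YYYY-MM-DD>` layout above is what allows the BigQuery table to read the bucket directly via Hive partitioning. A hedged sketch of how such an external table might be declared with Pulumi; the resource names, dataset, bucket URIs, and schema below are illustrative assumptions, not the repository's actual definitions:

```typescript
import * as gcp from "@pulumi/gcp";

// Illustrative only: names, URIs, and schema are assumptions.
const table = new gcp.bigquery.Table("records-table", {
  datasetId: "subgraph_cache",
  tableId: "cooler_loans_loan",
  deletionProtection: false,
  externalDataConfiguration: {
    sourceFormat: "NEWLINE_DELIMITED_JSON",
    autodetect: false,
    sourceUris: ["gs://my-bucket/my-subgraph/deployment-id/loans/*"],
    // The dt=YYYY-MM-DD directories become a queryable dt partition column.
    hivePartitioningOptions: {
      mode: "AUTO",
      sourceUriPrefix: "gs://my-bucket/my-subgraph/deployment-id/loans/",
    },
    schema: JSON.stringify([
      { name: "id", type: "STRING" },
      { name: "amount", type: "STRING" }, // fields stored as strings by default
    ]),
  },
});
```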
The Pulumi tool is used to manage the orchestration of resources in Google Cloud Platform, and has `dev` and `prod` stacks (environments).
This project is designed to cache any Graph Protocol subgraph that is specified, effectively creating a BigQuery pipeline. See the Subgraph Configuration section for more information.
GCP monitoring does not have a direct integration with Discord, and it seemed like overkill to write a GCP -> Discord webhook integration.
- Data is stored as a string by default, to avoid any issues with data types. When performing queries in BigQuery, you will need to cast the data appropriately (see the query sketch after this list).
- If specifying a BigQuery type override for a date in ISO string format (e.g. `2022-10-11T10:05.001Z`), the type to specify is `TIMESTAMP` (not `DATE`).
- After changing any of the contents of the `subgraphs/*.json` files, run `yarn codegen`.
- BigQuery does not support replacing the schema of an existing table. If the type overrides are changed, you will need to delete the BigQuery table manually, run `pulumi refresh`, and then run `pulumi up`.
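
Since values are stored as strings, queries typically cast them explicitly. A minimal sketch using the Node.js BigQuery client, assuming hypothetical dataset, table, and column names:

```typescript
import { BigQuery } from "@google-cloud/bigquery";

// Hypothetical dataset/table/column names, for illustration only.
const bigquery = new BigQuery();

async function dailyTotals(): Promise<void> {
  const [rows] = await bigquery.query({
    query: `
      SELECT
        dt,
        SUM(SAFE_CAST(amount AS NUMERIC)) AS total_amount
      FROM \`subgraph_cache.cooler_loans_loan\`
      WHERE dt >= DATE "2023-01-01" -- dt is derived from the Hive partition path
      GROUP BY dt
      ORDER BY dt`,
  });
  console.log(rows);
}
```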
For each subgraph, there is a corresponding JSON file in the `subgraphs/` directory. Note: the name of the file is irrelevant. This file is used to:

- generate a query to fetch all fields in the subgraph
- generate a schema that is used in conjunction with Pulumi to create a BigQuery table

See the `Cooler_Loans_Loan.json` file as an example.
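
The authoritative `SubgraphConfig` type is defined in this repository; as a rough illustration of the kind of fields such a definition file might carry (every field name below is an assumption, not the actual type):

```typescript
// Hypothetical sketch: the real SubgraphConfig type is defined in this
// repository, and its fields may differ from these assumptions.
type SubgraphConfig = {
  name: string;      // subgraph name, used in the GCS path
  url: string;       // GraphQL endpoint to query
  object: string;    // entity to fetch (e.g. "loans")
  dateField: string; // field used to fetch records one day at a time
  typeOverrides?: Record<string, string>; // field -> BigQuery type, e.g. { "createdTimestamp": "TIMESTAMP" }
};
```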
To cache a new subgraph, perform the following:

- Add a new definition file in `subgraphs/`, adhering to the `SubgraphConfig` type
- Run `yarn codegen` to generate the GraphQL & BigQuery schemas, and the TypeScript typings
- Run `pulumi up` on the appropriate stack to update the resources
- Once this is done, the function can be manually triggered by running `yarn run execute <subgraphFile>`