Federated (External) Tables
BQ recently added support for querying and joining tables stored in GCS. These are similar to regular tables, but their table resource has an additional section:
"externalDataConfiguration": {
  "sourceUris": [
    string
  ],
  "schema": {
    "fields": [
      {
        "name": string,
        "type": string,
        "mode": string,
        "fields": [
          (TableFieldSchema)
        ],
        "description": string
      }
    ]
  },
  "sourceFormat": string,
  "maxBadRecords": integer,
  "ignoreUnknownValues": boolean,
  "compression": string,
  "csvOptions": {
    "fieldDelimiter": string,
    "skipLeadingRows": integer,
    "quote": string,
    "allowQuotedNewlines": boolean,
    "allowJaggedRows": boolean,
    "encoding": string
  }
},
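As a concrete illustration of the section above, here is a minimal sketch of what a filled-in configuration might look like for an uncompressed CSV file. Only the key names come from the resource definition; the bucket, path, field names, and all values are hypothetical examples chosen for illustration.

```python
import json

# Hypothetical example values; only the key names come from the table resource.
external_data_configuration = {
    "sourceUris": ["gs://example-bucket/logs/events.csv"],  # hypothetical bucket/path
    "schema": {
        "fields": [
            {"name": "timestamp", "type": "TIMESTAMP", "mode": "REQUIRED"},
            {"name": "message", "type": "STRING", "mode": "NULLABLE"},
        ]
    },
    "sourceFormat": "CSV",
    "maxBadRecords": 0,
    "ignoreUnknownValues": False,
    "compression": "NONE",
    "csvOptions": {
        "fieldDelimiter": ",",
        "skipLeadingRows": 1,  # skip a single header row
        "quote": '"',
        "allowQuotedNewlines": False,
        "allowJaggedRows": False,
        "encoding": "UTF-8",
    },
}

# The section sits inside the table resource under this key.
table_resource_fragment = {"externalDataConfiguration": external_data_configuration}
print(json.dumps(table_resource_fragment, indent=2))
```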
Issue #644 tracks support for these.
The schema part should be straightforward, as it looks compatible with the existing Schema class.
Apart from sourceUris and compression, all the remaining fields correspond to arguments in the Table.load method. IIRC at one point we had support for compression too; I'm not sure why it is gone (or perhaps it only existed in Table.extract and was overlooked in Table.load). But essentially, these arguments represent the characteristics of a table in GCS and how it should be imported/accessed from BQ.
We could factor these out into an ExternalTable class (or FederatedTable, or GCSTable), and change Table.load to ExternalTable.import(tablename) (although we could keep Table.load for backward compatibility). We could also have ExternalTable.create(tablename) (or ExternalTable.manifest?), which does not do the import but creates the necessary representation in BQ to use this as a federated data source.
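A rough sketch of what such a class might hold is below. Everything here is hypothetical: the attribute names, the defaults, and the to_resource helper are illustrative assumptions, not an existing API, and the actual create/import calls to BQ are left out. Note also that import is a reserved word in Python, so the import method would need a different name in practice (for example import_into).

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ExternalTable:
    """Hypothetical holder for the characteristics of a table stored in GCS."""

    source_uris: List[str]
    schema: Optional[List[Dict[str, Any]]] = None  # list of field descriptions
    source_format: str = "CSV"
    max_bad_records: int = 0
    ignore_unknown_values: bool = False
    compression: str = "NONE"
    csv_options: Optional[Dict[str, Any]] = None

    def to_resource(self) -> Dict[str, Any]:
        """Build the externalDataConfiguration section of a table resource."""
        config: Dict[str, Any] = {
            "sourceUris": list(self.source_uris),
            "sourceFormat": self.source_format,
            "maxBadRecords": self.max_bad_records,
            "ignoreUnknownValues": self.ignore_unknown_values,
            "compression": self.compression,
        }
        # Optional sections are only emitted when they were supplied.
        if self.schema is not None:
            config["schema"] = {"fields": list(self.schema)}
        if self.csv_options is not None:
            config["csvOptions"] = dict(self.csv_options)
        return {"externalDataConfiguration": config}


# Hypothetical usage: describe a CSV file in GCS with one header row.
example = ExternalTable(
    source_uris=["gs://example-bucket/data.csv"],
    csv_options={"skipLeadingRows": 1},
)
print(example.to_resource())
```

Keeping the GCS-side characteristics in one object means the same description could feed both the import path and the federated-table path.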
It may be a good idea then to require a schema when creating an ExternalTable. We could then require that the Table and ExternalTable schemas be identical (for import()) if the mode is something other than create/overwrite. For create, we would simply take the new table's schema from the external table.
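The schema rule described above could be sketched roughly as follows. The function name and the representation of a schema as a list of field dictionaries are assumptions made for illustration; the mode names create and overwrite come from the text, while append is an assumed name for any other mode.

```python
from typing import Any, Dict, List, Optional

Schema = List[Dict[str, Any]]  # hypothetical representation: one dict per field


def resolve_import_schema(
    external_schema: Schema,
    target_schema: Optional[Schema],
    mode: str,
) -> Schema:
    """Return the schema the target table should have after an import."""
    if not external_schema:
        raise ValueError("an external table requires a schema")
    if mode in ("create", "overwrite"):
        # New or replaced table: take the schema from the external table.
        return list(external_schema)
    # Any other mode (e.g. a hypothetical 'append') needs an existing table
    # whose schema matches the external table exactly.
    if target_schema is None:
        raise ValueError(f"mode {mode!r} needs an existing target table")
    if target_schema != external_schema:
        raise ValueError("target and external table schemas must be identical")
    return list(target_schema)
```

With this shape, a mismatch is reported before any import job is started, rather than surfacing later as a job failure.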
Once done it would likely make sense to deprecate Table.load.
We could use this for extract too, although many of the fields aren't used, and it's unlikely that an extract would be followed by a load or by use as a federated table in a query, so this is less useful. It would make more sense if extract jobs could append to existing tables, but they can't.