For data collection and data exploration, we like to work with BigQuery. But we have not found a python library, to easily handle recurring tasks like adding new data (of potentially inconsistent schema) and schema migrations. So we took a couple of our solutions for those tasks and put them into this library.
bqtools
provides a light-weight solution to explicit schema management with python-native types (unlike pandas dtype) and
some convenient type checking, inference and conversions. Table-objects created by bqtools
can be read from BigQuery, stored locally, read from a local file and written to BigQuery. Table schemas can be changed and data can be added or modified.
pip install --upgrade bqtools
from fourtytwo import bqtools
schema = [
{'name': 'number', 'field_type': 'INTEGER'},
{'name': 'text', 'field_type': 'STRING'},
{'name': 'struct', 'field_type':'RECORD', 'mode':'REPEATED',
'fields': [
{'name':'integer', 'field_type':'INTEGER'},
{'name':'text', 'field_type':'STRING'}
]
}
]
# valid BigQuery types see:
# https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
# geo and array are currently not/not fully supported
# data = columns of lists
table = bqtools.BQTable(
schema=schema,
data=[[1, 2, 3, 4], ['a', 'b', 'c', 'd']]
)
# data = rows of dicts
table = bqtools.BQTable(
schema=schema,
data=[
{'number': 1, 'text': 'a'},
{'number': 2, 'text': 'b'},
...
]
)
print(table.data) # list of all columns
print(table.rows(n=10)) # list of first n rows
# convert to pandas.DataFrame
df = table.to_df()
# warning: pandas dtypes may be inconsistent
# with BigQuery Schema field_types
rows = [{'number': 5, 'text': 'e'}]
table.append(rows)
row = [[6, 'f']]
table.append(rows)
# requires environment variable GOOGLE_APPLICATION_CREDENTIALS
# or parameter credentials='path-to-credentials.json'
table = bqtools.read_bq(
table_ref='project_id.dataset_id.new_table_id',
limit=10, # limit query rows
schema_only=False # set True to only add data
)
# change column order and field_type
new_schema = [
{'name': 'text', 'field_type': 'STRING'},
{'name': 'number', 'field_type': 'FLOAT'},
]
table.schema(new_schema)
# change column names
table.rename(columns={'number': 'decimal'})
# requires environment variable GOOGLE_APPLICATION_CREDENTIALS
# or parameter credentials='path-to-credentials.json'
table.to_bq(table_ref, mode='append')
# write to local file (compressed binary format)
table.save('local_table.bqt')
# load from local file
table = bqtools.load('local_table.bqt')