Google BigQuery¶
Google BigQuery data sources and sinks are in the database category of I/O connectors.
Tool documentation¶
Connector type values¶
The "type": value to use in the source or sink JSON files.
| Connector type | Value |
|---|---|
| Data source | DKDataSource_BigQuery |
| Data sink | DKDataSink_BigQuery |
Connection properties¶
The properties to use when connecting to a Google BigQuery instance from Automation.
Connections to BigQuery databases are most often established via service accounts, each with an associated JSON key file.
You can choose from three connection methods:
- Credentials mode: specify private-key and service-account.
- Service Account Key mode: specify service-account-file only.
- IAM Role mode: no fields required.
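For illustration, a minimal sketch of the config block for each method (the vault paths shown are placeholders, not required names; add project-id as needed per the table below):

Credentials mode
"config": {
    "service-account": "#{vault://bigquery/service_account}",
    "private-key": "#{vault://bigquery/json_private_key}"
}

Service Account Key mode
"config": {
    "service-account-file": "#{vault://bigquery/service_account_file}"
}

IAM Role mode
"config": {}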
Tip
Using roles for connection permissions. Role-based permissions configured at the agent level may be used as an alternative to the key-based configuration described below. In that case, the BigQuery connection object need not be defined as a kitchen-level override or referenced in BigQuery sources or sinks.
| Field | Scope | Type | Required? | Description |
|---|---|---|---|---|
| private-key | source/sink | string | depends on method | A JSON key file provided by Google Cloud Platform (GCP), saved to the vault as a secret. |
| project-id | source/sink | string | depends on method | The Google Cloud Platform (GCP) Project ID. The default JSON key file provided by GCP contains the project-id. |
| service-account | source/sink | string | depends on method | The service account email generated by GCP. The default JSON key file provided by GCP contains the service-account. |
| service-account-file | source/sink | string/JSON | depends on method | A service account key file created within a GCP project. |
Example GCP JSON service account key file
{
"type": "service_account",
"project_id": "",
"private_key_id": "",
"private_key": "-----BEGIN PRIVATE KEY-----*****\n-----END PRIVATE KEY-----\n",
"client_email": "",
"client_id": "",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": ""
}
Connections¶
See Connection Properties for more details on connection configurations.
Defined in kitchen-level variables¶
bigqueryConfig in kitchen overrides
{
"bigqueryConfig": {
"service-account": "#{vault://bigquery/service_account}",
"private-key": "#{vault://bigqueryj/json_private_key}",
"project-id": "#{vault://bigquery/project_id}"
}
}
Connection syntax¶
For a data source¶
bigquery_datasource.json
{
"type": "DKDataSource_BigQuery",
"name": "bigquery_datasource",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys": {},
"tests": {}
}
For a data sink¶
bigquery_datasink.json
{
"type": "DKDataSink_BigQuery",
"name": "bigquery_datasource",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys": {},
"tests": {}
}
Local connections¶
GCP console¶
The Google Cloud Platform Console allows users to view database contents and submit queries.
IDE connections¶
You can configure connections from local IDEs like PyCharm.
Tip
Connecting PyCharm and DataGrip to BigQuery: See Using BigQuery from IntelliJ-based IDE for detailed information.
Other configuration properties¶
See the related topics for common properties and runtime variables.
Additional BigQuery step (key) properties¶
| Field | Type | Required? | Description |
|---|---|---|---|
| write-disposition | string | no | Available for BigQuery sinks only. Specifies the action that occurs if the destination table already exists. Supported values: WRITE_TRUNCATE overwrites the table data; WRITE_APPEND appends the data to the table; WRITE_EMPTY returns a 'duplicate' error in the job result if the table already exists and contains data. Each action is atomic and occurs only if BigQuery completes the job successfully; creation, truncation, and append actions occur as one atomic update upon job completion. |
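For example, a sink key that appends to the destination table rather than overwriting it (the key and table names here are illustrative):

"keys": {
    "append_source_data": {
        "query-type": "bulk_insert",
        "table-name": "{{kitchen_schema}}.output_table",
        "format": "csv",
        "first-row-column-names": true,
        "write-disposition": "WRITE_APPEND"
    }
}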
File encoding requirements¶
Files used with data sources and data sinks must be encoded in UTF-8. Otherwise, non-Unicode characters can cause problems when sinking data to database tables and errors when running related tests.
For CSV and other delimited files, use Save As in your spreadsheet program and select UTF-8 encoding, or use a text editor with encoding options.
Data source examples¶
Example 1¶
Submit an inline query and declare the result as a runtime variable
bigquery_datasource.json
{
"type": "DKDataSource_BigQuery",
"name": "bigquery_datasource",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys": {
"get_new_table_column_count": {
"query-type": "execute_scalar",
"sql": "SELECT COUNT (DISTINCT column_name) FROM `{{bigquery.project_id}}.{{kitchen_schema}}.INFORMATION_SCHEMA.COLUMNS` WHERE table_name='output_table'",
"set-runtime-vars": {
"result": "output_table_column_count"
}
}
},
"tests": {}
}
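Once set, the output_table_column_count runtime variable can be referenced with the same {{...}} substitution used throughout this topic (as Example 2 under Data sink examples does with source_files). A hypothetical downstream key, purely for illustration:

"check_column_count": {
    "query-type": "execute_scalar",
    "sql": "SELECT {{output_table_column_count}}"
}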
Example 2¶
Submit a query via a resource file
The resource file is saved to the Recipe's /resources directory. Its contents, a SQL query, are loaded into the get_new_table_row_count key via the load_text() built-in function.
bigquery_datasource.json
{
"type": "DKDataSource_BigQuery",
"name": "bigquery_datasource",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys": {
"get_new_table_row_count": {
"query-type": "execute_scalar",
"sql": "{{load_text('bigquery_row_count.sql')}}",
"set-runtime-vars": {
"result": "output_table_row_count"
}
}
},
"tests": {}
}
bigquery_row_count.sql
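The contents of the resource file are not reproduced here; a minimal sketch of what it might contain, assuming the same output_table used elsewhere in this topic:

SELECT COUNT(*)
FROM `{{bigqueryConfig.project_id}}.{{kitchen_schema}}.output_table`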
Data sink examples¶
Example 1¶
Note
BigQuery dataset creation. Because BigQuery does not presently support creating and deleting datasets (schemas) directly via SQL, perform these operations with container nodes and an appropriate container image upstream of the nodes that use BigQuery data sinks.
Bulk insert data from a CSV file into a BigQuery table
The following example bulk inserts data from a CSV source file into a new BigQuery table. A DataMapper node is used along with a file-based data source.
bigquery_datasink.json
{
"type": "DKDataSink_BigQuery",
"name": "bigquery_datasink",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys": {
"load_source_data_into_database": {
"table-name": "{{kitchen_schema}}.output_table",
"format": "csv",
"first-row-column-names": true,
"query-type": "bulk_insert",
"write-disposition": "WRITE_TRUNCATE"
}
},
"tests": {}
}
gcs_datasource.json
{
"name": "gcs_datasource",
"type": "DKDataSource_GCS",
"config": {
"bucket": "{{gcsConfig.bucket_name}}",
"service-account-file": "{{gcsConfig.json_private_key}}"
},
"keys": {
"get_source_file": {
"file-key": "{{bucket_subdirectory}}/file.csv",
"use-only-file-key": true,
"set-runtime-vars": {
"size": "gcs_source_file_size",
"row_count": "gcs_source_file_row_count",
"md5": "gcs_source_file_md5_hash"
}
}
},
"set-runtime-vars": {
"key_count": "count_gcs_files_pulled"
},
"tests":{}
}
notebook.json
{
"mappings": {
"map_gcs_to_bigquery": {
"source-name": "gcs_datasource",
"source-key": "get_source_file",
"sink-name": "bigquery_datasink",
"sink-key": "load_source_data_into_database"
}
}
}
Example 2¶
Note
UI forms do not support advanced Jinja templating. Use the File Editor to work with these examples.
Bulk insert data from many CSV files into a BigQuery table using Jinja templating.
The following example bulk inserts data from a number of .csv source files into a new BigQuery table. A file-based data source provides the data to be written to BigQuery via the data sink. Jinja templating generates an explicit key for each source file listed in the source_files runtime variable, which is produced by the node's data source.
bigquery_datasink.json
{
"type": "DKDataSource_BigQuery",
"name": "bigquery_datasource",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys" : {
{% for i,filename in enumerate(source_files) %}
{{ ',' if i > 0 else '' }}
"{{filename}}" : {
"format": "csv",
"query-type": "bulk_insert",
"table-name": "{{datasetname}}.{{filename.replace('.csv','')}}",
"first-row-column-names": true,
"write-disposition": "WRITE_TRUNCATE"
}
{% endfor %}
},
"tests": {}
}
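To see what the template produces, suppose source_files resolves to ['east.csv', 'west.csv'] and datasetname to sales (both values are assumptions for illustration). The rendered keys block would be:

"keys": {
    "east.csv": {
        "format": "csv",
        "query-type": "bulk_insert",
        "table-name": "sales.east",
        "first-row-column-names": true,
        "write-disposition": "WRITE_TRUNCATE"
    },
    "west.csv": {
        "format": "csv",
        "query-type": "bulk_insert",
        "table-name": "sales.west",
        "first-row-column-names": true,
        "write-disposition": "WRITE_TRUNCATE"
    }
},

The source_files variable itself is published by the wildcard match in gcs_datasource.json below, via the key_names entry under set-runtime-vars.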
gcs_datasource.json
{
"name" : "gcs_datasource",
"type" : "DKDataSource_GCS",
"config": {
"bucket": "{{gcsConfig.bucket_name}}",
"service-account-file": "{{gcsConfig.json_private_key}}"
},
"wildcard-key-prefix" : "stardata_{{build_date}}",
"wildcard" : "*.csv",
"cache" : true,
"set-runtime-vars" : {
"key_names" : "source_files"
},
"keys": {},
"tests": {}
}