Google BigQuery

Google BigQuery data sources and sinks are in the database category of I/O connectors.

Tool documentation

Connector type values

The "type" value to use in source or sink JSON files.

Connector type Value
Data source DKDataSource_BigQuery
Data sink DKDataSink_BigQuery

Connection properties

The properties to use when connecting to a Google BigQuery instance from Automation.

Connections to BigQuery databases are most often established via a service account and its associated JSON key file.

You can choose from three connection methods:

  • Credentials mode: specify private-key and service-account.
  • Service Account Key mode: specify service-account-file only.
  • IAM Role mode: no fields required.
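As an illustration only (this helper is not part of the product), the three methods can be distinguished by which fields a configuration supplies. A minimal sketch:

```python
# Illustrative only: infer which of the three BigQuery connection methods
# a connection config uses, based on which fields are present.
def connection_method(config):
    """Return the connection method implied by the supplied fields."""
    if "service-account-file" in config:
        return "service-account-key"   # key file supplied directly
    if "private-key" in config and "service-account" in config:
        return "credentials"           # explicit key plus account email
    return "iam-role"                  # no key material: rely on agent roles
```

For example, a config containing only `service-account-file` resolves to Service Account Key mode, while an empty config falls through to IAM Role mode.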

Tip

Using roles for connection permissions. Role-based permissions configured at the agent level may be used as an alternative to the key-based configuration described below. In these cases, the BigQuery connection object need not be defined as a kitchen-level override or referenced in BigQuery sources or sinks.

Field Scope Type Required? Description
private-key source/sink string depends on method A JSON key file provided by Google Cloud Platform (GCP), saved to the vault as a secret.
project-id source/sink string depends on method The Google Cloud Platform (GCP) Project ID. The default JSON key file provided by GCP contains the project-id.
service-account source/sink string depends on method The service account email generated by GCP. The default JSON key file provided by GCP contains the service-account.
service-account-file source/sink string/JSON depends on method A service account key file created within a GCP project.

Example GCP JSON service account key file

{
    "type": "service_account",
    "project_id": "",
    "private_key_id": "",
    "private_key": "-----BEGIN PRIVATE KEY-----*****\n-----END PRIVATE KEY-----\n",
    "client_email": "",
    "client_id": "",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": ""
}
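Before saving a downloaded key file to the vault, it can be worth checking that it contains the fields shown above. A hedged, stdlib-only sketch (the required-field set is an assumption based on the example file, not a documented contract):

```python
import json

# Illustrative sketch: verify that a GCP service account key file contains
# the fields used by the connection properties above.
REQUIRED = {"type", "project_id", "private_key", "client_email"}

def validate_key_file(text):
    """Parse a key file and return (project_id, client_email) if it looks valid."""
    key = json.loads(text)
    if key.get("type") != "service_account":
        raise ValueError("not a service account key")
    missing = REQUIRED - key.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return key["project_id"], key["client_email"]
```

The `project_id` and `client_email` values returned here correspond to the `project-id` and `service-account` connection properties.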

Connections

See Connection Properties for more details on connection configurations.

Defined in kitchen-level variables

bigqueryConfig in kitchen overrides

{
    "bigqueryConfig": {
        "service-account": "#{vault://bigquery/service_account}",
        "private-key": "#{vault://bigquery/json_private_key}",
        "project-id": "#{vault://bigquery/project_id}"
    }
}
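The `#{vault://…}` references are resolved by the platform at runtime; the mechanics are internal. Purely as an illustration of the substitution idea (the resolver function and secrets mapping here are assumptions, not the platform's implementation):

```python
import re

# Illustrative only: substitute "#{vault://path/key}" references with values
# from a secrets mapping, mimicking how kitchen overrides are resolved.
VAULT_REF = re.compile(r"#\{vault://([^}]+)\}")

def resolve(value, secrets):
    """Replace each vault reference in value with the matching secret."""
    return VAULT_REF.sub(lambda m: secrets[m.group(1)], value)
```

For example, `resolve("#{vault://bigquery/project_id}", secrets)` returns the secret stored under `bigquery/project_id`.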

Connection syntax

For a data source

bigquery_datasource.json

{
    "type": "DKDataSource_BigQuery",
    "name": "bigquery_datasource",
    "config": {
        "service-account-file": "{{bigqueryConfig.private_key}}"
    },
    "keys": {},
    "tests": {}
}

For a data sink

bigquery_datasink.json

{
    "type": "DKDataSink_BigQuery",
    "name": "bigquery_datasink",
    "config": {
        "service-account-file": "{{bigqueryConfig.private_key}}"
    },
    "keys": {},
    "tests": {}
}

Local connections

GCP console

The Google Cloud Platform Console allows users to view database contents and submit queries.

IDE connections

You can configure connections from local IDEs like PyCharm.

Tip

Connecting PyCharm and DataGrip to BigQuery: See Using BigQuery from IntelliJ-based IDE for detailed information.

Other configuration properties

See the common properties and runtime variables topics for additional configuration options.

Additional BigQuery step (key) properties

Field Type Required? Description
write-disposition string no Available for BigQuery sinks only.

Specifies the action that occurs if the destination table already exists. The following values are supported:

  • WRITE_TRUNCATE: If the table already exists, BigQuery overwrites the table data.
  • WRITE_APPEND: If the table already exists, BigQuery appends the data to the table.
  • WRITE_EMPTY: If the table already exists and contains data, a 'duplicate' error is returned in the job result.

Each action is atomic and occurs only if BigQuery is able to complete the job successfully. Creation, truncation, and append actions occur as one atomic update upon job completion.
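The three behaviours can be sketched with an in-memory list standing in for the destination table (illustrative only; this models the semantics, not BigQuery's implementation):

```python
# Illustrative sketch of the three write-disposition behaviours, using an
# in-memory list as a stand-in for the destination table.
def write(table, rows, disposition):
    if disposition == "WRITE_TRUNCATE":
        table[:] = rows                      # overwrite existing data
    elif disposition == "WRITE_APPEND":
        table.extend(rows)                   # append to existing data
    elif disposition == "WRITE_EMPTY":
        if table:
            raise RuntimeError("duplicate")  # table already contains data
        table.extend(rows)
    return table
```

So `WRITE_TRUNCATE` always leaves only the new rows, `WRITE_APPEND` preserves existing rows, and `WRITE_EMPTY` succeeds only against an empty table.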

File encoding requirements

Files used with data sources and data sinks must be encoded in UTF-8. Non-Unicode characters can cause problems when sinking data to database tables and errors when running related tests.

For CSV and other delimited files, use Save As in the originating program and select UTF-8 encoding, or use a text editor with encoding options.
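Re-encoding can also be done programmatically. A minimal sketch (the `cp1252` source encoding is an assumption; adjust it to match the exporting program):

```python
# Illustrative helper: re-encode a delimited file to UTF-8 so it can be
# used safely with data sources and data sinks.
def reencode_to_utf8(path_in, path_out, source_encoding="cp1252"):
    with open(path_in, encoding=source_encoding) as src:
        data = src.read()
    with open(path_out, "w", encoding="utf-8") as dst:
        dst.write(data)
```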

Data source examples

Example 1

Submit an inline query and declare the result as a runtime variable

bigquery_datasource.json

{
    "type": "DKDataSource_BigQuery",
    "name": "bigquery_datasource",
    "config": {
        "service-account-file": "{{bigqueryConfig.private_key}}"
    },
    "keys": {
        "get_new_table_column_count": {
            "query-type": "execute_scalar",
            "sql": "SELECT COUNT (DISTINCT column_name) FROM `{{bigquery.project_id}}.{{kitchen_schema}}.INFORMATION_SCHEMA.COLUMNS` WHERE table_name='output_table'",
            "set-runtime-vars": {
                "result": "output_table_column_count"
            }

        }
    },
    "tests": {}
}

Example 2

Submit a query via a resource file

The resource file is saved to the Recipe's /resources directory. Its contents, a SQL query, are loaded into the get_new_table_row_count key via the load_text() built-in function.

bigquery_datasource.json

{
    "type": "DKDataSource_BigQuery",
    "name": "bigquery_datasource",
    "config": {
        "service-account-file": "{{bigqueryConfig.private_key}}"
    },
    "keys": {
        "get_new_table_row_count": {
            "query-type": "execute_scalar",
            "sql": "{{load_text('bigquery_row_count.sql')}}",
            "set-runtime-vars": {
                "result": "output_table_row_count"
            }
        }
    },
    "tests": {}
}

bigquery_row_count.sql

SELECT COUNT(*)
FROM `{{bigquery.project_id}}.{{kitchen_schema}}.output_table` LIMIT 1;
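Conceptually, `load_text()` reads a file from the Recipe's /resources directory and inlines its contents into the key's "sql" field. A stand-in sketch (the function below is an illustration, not the platform's built-in; the directory argument is an assumption):

```python
from pathlib import Path

# Illustrative stand-in for the load_text() built-in: read a SQL resource
# file so its contents can be inlined into a key's "sql" field.
def load_text(name, resources_dir="resources"):
    return Path(resources_dir, name).read_text(encoding="utf-8")
```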

Data sink examples

Example 1

Note

BigQuery dataset creation. Because BigQuery does not presently support creating and deleting datasets (schemas) directly via SQL, perform these operations with container nodes and an appropriate container image upstream of the nodes that use BigQuery data sinks.

Bulk insert data from a CSV file into a BigQuery table

The following example bulk inserts data from a CSV source file into a new BigQuery table. A DataMapper node is used along with a file-based data source.

bigquery_datasink.json

{
    "type": "DKDataSink_BigQuery",
    "name": "bigquery_datasink",
    "config": {
        "service-account-file": "{{bigqueryConfig.private_key}}"
    },
    "keys": {
        "load_source_data_into_database": {
            "table-name": "{{kitchen_schema}}.output_table",
            "format": "csv",
            "first-row-column-names": true,
            "query-type": "bulk_insert",
            "write-disposition": "WRITE_TRUNCATE"
        }
    },
    "tests": {}
}

gcs_datasource.json

{
    "name": "gcs_datasource",
    "type": "DKDataSource_GCS",
    "config": {
        "bucket": "{{gcsConfig.bucket_name}}",
        "service-account-file": "{{gcsConfig.json_private_key}}"
    },
    "keys": {
        "get_source_file": {
            "file-key": "{{bucket_subdirectory}}/file.csv",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "size": "gcs_source_file_size",
                "row_count": "gcs_source_file_row_count",
                "md5": "gcs_source_file_md5_hash"
            }
        }
    },
    "set-runtime-vars": {
        "key_count": "count_gcs_files_pulled"
    },
    "tests":{}
}

notebook.json

{
    "mappings": {
        "map_gcs_to_bigquery": {
            "source-name": "gcs_datasource",
            "source-key": "get_source_file",
            "sink-name": "bigquery_datasink",
            "sink-key": "load_source_data_into_database"
        }
    }
}

Example 2

Note

UI forms do not support advanced Jinja templating. Use the File Editor to work with these examples.

Bulk insert data from many CSV files into a BigQuery table using Jinja templating.

The following example bulk inserts data from a number of .csv source files into a new BigQuery table. A file-based data source is used to get the data to be written to BigQuery via the data sink. Jinja templating is used to generate explicit keys for each source file provided by the source_files runtime variable generated by the node's data source.

bigquery_datasink.json

{
    "type": "DKDataSink_BigQuery",
    "name": "bigquery_datasink",
    "config": {
        "service-account-file": "{{bigqueryConfig.private_key}}"
    },
    "keys": {
{% for i,filename in enumerate(source_files) %}
        {{ ',' if i > 0 else '' }}
        "{{filename.replace('.csv','')}}": {
            "format": "csv",
            "query-type": "bulk_insert",
            "table-name": "{{datasetname}}.{{filename.replace('.csv','')}}",
            "first-row-column-names": true,
            "write-disposition": "WRITE_TRUNCATE"
        }
{% endfor %}
    },
    "tests": {}
}
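Rendered against a source_files list such as `["a.csv", "b.csv"]`, the loop above expands to one explicit bulk_insert key per file. A pure-Python sketch of the same expansion (illustrative only; key names here follow the table names and are an assumption):

```python
# Illustrative only: build the same per-file sink keys that the Jinja loop
# above generates, one bulk_insert key per CSV in source_files.
def build_keys(source_files, datasetname):
    keys = {}
    for filename in source_files:
        table = filename.replace(".csv", "")
        keys[table] = {
            "format": "csv",
            "query-type": "bulk_insert",
            "table-name": f"{datasetname}.{table}",
            "first-row-column-names": True,
            "write-disposition": "WRITE_TRUNCATE",
        }
    return keys
```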

gcs_datasource.json

{
    "name" : "gcs_datasource",
    "type" : "DKDataSource_GCS",
    "config": {
        "bucket": "{{gcsConfig.bucket_name}}",
        "service-account-file": "{{gcsConfig.json_private_key}}"
    },
    "wildcard-key-prefix" : "stardata_{{build_date}}",
    "wildcard" : "*.csv",
    "cache" : true,
    "set-runtime-vars" : {
        "key_names" : "source_files"
    },
    "keys": {},
    "tests": {}
}
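The wildcard-key-prefix and wildcard settings above select GCS object keys by prefix plus glob pattern. As an illustration of that selection (the exact matching semantics are an assumption; this sketch uses stdlib fnmatch):

```python
import fnmatch

# Illustrative only: select object keys the way wildcard-key-prefix and
# wildcard do, i.e. a prefix match plus a glob on the remainder.
def select_keys(keys, prefix, pattern):
    return [k for k in keys
            if k.startswith(prefix) and fnmatch.fnmatch(k, prefix + pattern)]
```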