Google BigQuery¶
Google BigQuery data sources and sinks are in the database category of I/O connectors.
Tool documentation¶
Connector type values¶
The "type": value to use in the source or sink JSON files.
| Connector type | Value |
|---|---|
| Data source | DKDataSource_BigQuery |
| Data sink | DKDataSink_BigQuery |
Connection properties¶
The properties to use when connecting to a Google BigQuery instance from Automation.
Connections to BigQuery databases are most often established via service accounts, each with an associated JSON key file.
You can choose from three connection methods:
- Credentials mode: specify private-key and service-account.
- Service Account Key mode: specify service-account-file only.
- IAM Role mode: no fields required.
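For illustration, a minimal sketch of the config block for each method (the vault paths shown are placeholders, not required names; add project-id as needed per the table below):

Credentials mode
"config": {
    "service-account": "#{vault://bigquery/service_account}",
    "private-key": "#{vault://bigquery/json_private_key}"
}

Service Account Key mode
"config": {
    "service-account-file": "#{vault://bigquery/service_account_file}"
}

IAM Role mode
"config": {}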
Tip
Using roles for connection permissions. Role-based permissions configured at the agent level may be used as an alternative to the key-based configuration described below. In that case, the BigQuery connection object need not be defined as a kitchen-level override or referenced in BigQuery sources or sinks.
| Field | Scope | Type | Required? | Description |
|---|---|---|---|---|
| private-key | source/sink | string | depends on method | A JSON key file provided by Google Cloud Platform (GCP), saved to the vault as a secret. |
| project-id | source/sink | string | depends on method | The Google Cloud Platform (GCP) Project ID. The default JSON key file provided by GCP contains the project-id. |
| service-account | source/sink | string | depends on method | The service account email generated by GCP. The default JSON key file provided by GCP contains the service-account. |
| service-account-file | source/sink | string/JSON | depends on method | A service account key file created within a GCP project. |
Example GCP JSON service account key file
{
"type": "service_account",
"project_id": "",
"private_key_id": "",
"private_key": "-----BEGIN PRIVATE KEY-----*****\n-----END PRIVATE KEY-----\n",
"client_email": "",
"client_id": "",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": ""
}
Connections¶
See Connection Properties for more details on connection configurations.
Defined in kitchen-level variables¶
bigqueryConfig in kitchen overrides
{
"bigqueryConfig": {
"service-account": "#{vault://bigquery/service_account}",
"private-key": "#{vault://bigqueryj/json_private_key}",
"project-id": "#{vault://bigquery/project_id}"
}
}
Connection syntax¶
For a data source¶
bigquery_datasource.json
{
"type": "DKDataSource_BigQuery",
"name": "bigquery_datasource",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys": {},
"tests": {}
}
For a data sink¶
bigquery_datasink.json
{
"type": "DKDataSink_BigQuery",
"name": "bigquery_datasource",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys": {},
"tests": {}
}
Local connections¶
GCP console¶
The Google Cloud Platform Console allows users to view database contents and submit queries.
IDE connections¶
You can configure connections from local IDEs like PyCharm.
Tip
Connecting PyCharm and DataGrip to BigQuery: See Using BigQuery from IntelliJ-based IDE for detailed information.
Other configuration properties¶
See the related topics for common properties and runtime variables.
Additional BigQuery step (key) properties¶
| Field | Type | Required? | Description |
|---|---|---|---|
| write-disposition | string | no | Available for BigQuery sinks only. Specifies the action that occurs if the destination table already exists. Supported values: WRITE_TRUNCATE overwrites the table data; WRITE_APPEND appends the data to the table; WRITE_EMPTY returns a 'duplicate' error in the job result if the table already exists and contains data. Each action is atomic and occurs only if BigQuery completes the job successfully; creation, truncation, and append actions occur as one atomic update upon job completion. |
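For example, a sink key that appends to the destination table rather than overwriting it (the key and table names here are illustrative):

"keys": {
    "append_source_data": {
        "query-type": "bulk_insert",
        "table-name": "{{kitchen_schema}}.output_table",
        "format": "csv",
        "first-row-column-names": true,
        "write-disposition": "WRITE_APPEND"
    }
}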
File encoding requirements¶
Files used with data sources and data sinks must be encoded in UTF-8. Otherwise, non-Unicode characters can cause problems when sinking data to database tables and errors when running related tests.
For CSV and other delimited files, use Save As in your spreadsheet program and select UTF-8 encoding, or use a text editor with encoding options.
Data source examples¶
Example 1¶
Submit an inline query and declare the result as a runtime variable
bigquery_datasource.json
{
"type": "DKDataSource_BigQuery",
"name": "bigquery_datasource",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys": {
"get_new_table_column_count": {
"query-type": "execute_scalar",
"sql": "SELECT COUNT (DISTINCT column_name) FROM `{{bigquery.project_id}}.{{kitchen_schema}}.INFORMATION_SCHEMA.COLUMNS` WHERE table_name='output_table'",
"set-runtime-vars": {
"result": "output_table_column_count"
}
}
},
"tests": {}
}
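Once set, the output_table_column_count runtime variable can be referenced with the same {{...}} substitution used throughout this topic (as Example 2 under Data sink examples does with source_files). A hypothetical downstream key, purely for illustration:

"check_column_count": {
    "query-type": "execute_scalar",
    "sql": "SELECT {{output_table_column_count}}"
}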
Example 2¶
Submit a query via a resource file
The resource file is saved to the Recipe's /resources directory. Its contents, a SQL query, are loaded into the get_new_table_row_count key via the load_text() built-in function.
bigquery_datasource.json
{
"type": "DKDataSource_BigQuery",
"name": "bigquery_datasource",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys": {
"get_new_table_row_count": {
"query-type": "execute_scalar",
"sql": "{{load_text('bigquery_row_count.sql')}}",
"set-runtime-vars": {
"result": "output_table_row_count"
}
}
},
"tests": {}
}
bigquery_row_count.sql
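The contents of the resource file are not reproduced here; a minimal sketch of what it might contain, assuming the same output_table used elsewhere in this topic:

SELECT COUNT(*)
FROM `{{bigqueryConfig.project_id}}.{{kitchen_schema}}.output_table`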
Data sink examples¶
Example 1¶
Note
BigQuery dataset creation. Because BigQuery does not presently support creating and deleting datasets (schemas) directly via SQL, perform these operations with container nodes and an appropriate container image upstream of the nodes that use BigQuery data sinks.
Bulk insert data from a CSV file into a BigQuery table
The following example bulk inserts data from a CSV source file into a new BigQuery table. A DataMapper node is used along with a file-based data source.
bigquery_datasink.json
{
"type": "DKDataSink_BigQuery",
"name": "bigquery_datasink",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys": {
"load_source_data_into_database": {
"table-name": "{{kitchen_schema}}.output_table",
"format": "csv",
"first-row-column-names": true,
"query-type": "bulk_insert",
"write-disposition": "WRITE_TRUNCATE"
}
},
"tests": {}
}
gcs_datasource.json
{
"name": "gcs_datasource",
"type": "DKDataSource_GCS",
"config": {
"bucket": "{{gcsConfig.bucket_name}}",
"service-account-file": "{{gcsConfig.json_private_key}}"
},
"keys": {
"get_source_file": {
"file-key": "{{bucket_subdirectory}}/file.csv",
"use-only-file-key": true,
"set-runtime-vars": {
"size": "gcs_source_file_size",
"row_count": "gcs_source_file_row_count",
"md5": "gcs_source_file_md5_hash"
}
}
},
"set-runtime-vars": {
"key_count": "count_gcs_files_pulled"
},
"tests":{}
}
notebook.json
{
"mappings": {
"map_gcs_to_bigquery": {
"source-name": "gcs_datasource",
"source-key": "get_source_file",
"sink-name": "bigquery_datasink",
"sink-key": "load_source_data_into_database"
}
}
}
Example 2¶
Note
UI forms do not support advanced Jinja templating. Use the File Editor to work with these examples.
Bulk insert data from many CSV files into a BigQuery table using Jinja templating.
The following example bulk inserts data from a number of .csv source files into a new BigQuery table. A file-based data source provides the data to be written to BigQuery via the data sink. Jinja templating generates an explicit key for each source file listed in the source_files runtime variable, which is produced by the node's data source.
bigquery_datasink.json
{
"type": "DKDataSource_BigQuery",
"name": "bigquery_datasource",
"config": {
"service-account-file": "{{bigqueryConfig.private_key}}"
},
"keys" : {
{% for i,filename in enumerate(source_files) %}
{{ ',' if i > 0 else '' }}
"{{filename}}" : {
"format": "csv",
"query-type": "bulk_insert",
"table-name": "{{datasetname}}.{{filename.replace('.csv','')}}",
"first-row-column-names": true,
"write-disposition": "WRITE_TRUNCATE"
}
{% endfor %}
},
"tests": {}
}
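To see what the template produces, suppose source_files resolves to ['east.csv', 'west.csv'] and datasetname to sales (both values are assumptions for illustration). The rendered keys block would be:

"keys": {
    "east.csv": {
        "format": "csv",
        "query-type": "bulk_insert",
        "table-name": "sales.east",
        "first-row-column-names": true,
        "write-disposition": "WRITE_TRUNCATE"
    },
    "west.csv": {
        "format": "csv",
        "query-type": "bulk_insert",
        "table-name": "sales.west",
        "first-row-column-names": true,
        "write-disposition": "WRITE_TRUNCATE"
    }
},

The source_files variable itself is published by the wildcard match in gcs_datasource.json below, via the key_names entry under set-runtime-vars.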
gcs_datasource.json
{
"name" : "gcs_datasource",
"type" : "DKDataSource_GCS",
"config": {
"bucket": "{{gcsConfig.bucket_name}}",
"service-account-file": "{{gcsConfig.json_private_key}}"
},
"wildcard-key-prefix" : "stardata_{{build_date}}",
"wildcard" : "*.csv",
"cache" : true,
"set-runtime-vars" : {
"key_names" : "source_files"
},
"keys": {},
"tests": {}
}