
Google Cloud Storage

Google Cloud Storage data sources and sinks are in the file-based category of I/O connectors.

Tool documentation

Connector type values

The "type" value to use in the source or sink JSON files.

| Connector type | Value            |
| -------------- | ---------------- |
| Data source    | DKDataSource_GCS |
| Data sink      | DKDataSink_GCS   |

Connection properties

The properties to use when connecting to a Google Cloud Storage instance from Automation.

Connections to GCS buckets are most often established via GCP service accounts, for which GCP generates an associated JSON key file.

You can choose from three connection options:

  • Credentials mode: specify private-key and service-account.
  • Service Account Key mode: specify service-account-file only.
  • IAM Role mode: no fields required.
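As a rough illustration (not part of the product), the rule for which of the three modes a connection config falls into can be sketched as a small Python helper. The function name is hypothetical; the field names follow the connection properties table below.

```python
def gcs_connection_mode(config):
    """Infer which connection option a GCS config dict uses.

    Hypothetical helper for illustration only; mirrors the three
    options listed above.
    """
    if "private-key" in config and "service-account" in config:
        return "credentials"
    if "service-account-file" in config:
        return "service-account-key"
    return "iam-role"
```

For example, a config containing only `service-account-file` resolves to Service Account Key mode, and an empty config falls through to IAM Role mode.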

Tip

Using roles for connection permissions. Role-based permissions configured at the agent level can be used as an alternative to the key-based configuration described below. In that case, the GCS connection object does not need to be defined as a kitchen-level override or referenced in GCS sources or sinks.

| Field                | Scope       | Type        | Required?         | Description |
| -------------------- | ----------- | ----------- | ----------------- | ----------- |
| bucket               | source/sink | string      | yes               | The name of the GCS bucket. Not included in the service account JSON key file provided by GCP. |
| private-key          | source/sink | string      | yes               | A JSON key file provided by GCP, saved to the vault as a secret. |
| service-account      | source/sink | string      | no                | The service account email generated by GCP. The default JSON key file provided by GCP contains the service-account. |
| service-account-file | source/sink | string/JSON | depends on method | A service account key file created within a GCP project. |

Example GCP JSON service account key file

{
  "type": "service_account",
  "project_id": "",
  "private_key_id": "",
  "private_key": "-----BEGIN PRIVATE KEY-----*****\n-----END PRIVATE KEY-----\n",
  "client_email": "",
  "client_id": "",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": ""
}
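A quick way to sanity-check a downloaded key file before saving it to the vault is to verify its required fields are present. This is an illustrative stdlib-only sketch; the field list is an assumption based on the example above, not an official GCP schema.

```python
import json

# Fields expected in a GCP service account key file (assumed from the
# example above; not an official schema).
REQUIRED_FIELDS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "auth_uri", "token_uri",
}

def check_service_account_key(raw_json):
    """Return the set of required fields missing from a key file string."""
    data = json.loads(raw_json)
    return REQUIRED_FIELDS - data.keys()
```

An empty return value means all expected fields are present; otherwise the result names what still needs to be filled in.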

Connections

See Connection Properties for more details on connection configurations.

Defined in kitchen-level variables

gcsConfig in kitchen overrides

{
    "gcsConfig": {
        "service_account": "#{vault://gcs/service_account}",
        "private_key": "#{vault://gcs/json_private_key}",
        "bucket_name": "#{vault://gcs/bucket_name}"
    }
}
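Conceptually, each `#{vault://...}` placeholder is replaced with the matching secret at runtime. The simplified sketch below shows that substitution with a plain dict standing in for the vault; the real Automation runtime handles this itself, and the function and regex here are illustrative assumptions.

```python
import re

# Matches placeholders of the form #{vault://some/secret/path}
VAULT_REF = re.compile(r"#\{vault://([^}]+)\}")

def resolve_vault_refs(config, secrets):
    """Replace #{vault://path} placeholders with values from `secrets`.

    Simplified illustration of how references like those in gcsConfig
    might be resolved.
    """
    resolved = {}
    for key, value in config.items():
        match = VAULT_REF.fullmatch(value)
        resolved[key] = secrets[match.group(1)] if match else value
    return resolved
```

Values that are not vault references pass through unchanged.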

Connection syntax

For a data source

gcs_datasource.json

{
    "type": "DKDataSource_GCS",
    "name": "gcs_datasource",
    "config": {
        "bucket": "{{gcsConfig.bucket_name}}",
        "service-account-file": "{{gcsConfig.private_key}}"
    },
    "keys": {},
    "tests": {}
}

For a data sink

gcs_datasink.json

{
    "type": "DKDataSink_GCS",
    "name": "gcs_datasink",
    "config": {
        "bucket": "{{gcsConfig.bucket_name}}",
        "service-account-file": "{{gcsConfig.private_key}}"
    },
    "keys": {},
    "tests": {}
}

Local connections

GCP Console

The Google Cloud Platform Console allows users to browse bucket contents and manage objects.

IDE connections

You can configure connections from local IDEs like PyCharm.

Tip

Connect PyCharm or DataGrip to GCS: Install a plugin to view GCS bucket contents in PyCharm or DataGrip. Go to PyCharm > Preferences > Plugins > Install JetBrains plugin... and install the Google Cloud Tools plugin. When first connecting, you will be prompted to enter the credentials you use for the Google Cloud Platform Console.

Other configuration properties

See the related topics for common properties, wildcards, and runtime variables.

File encoding requirements

Files used with data sources and data sinks must be encoded in UTF-8. Non-Unicode characters can cause problems when sinking data to database tables and errors when running related tests.

For CSV and other delimited files, use Save As in your spreadsheet program and select UTF-8 encoding, or use a text editor with encoding options.
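If a file's original encoding is known, it can also be rewritten as UTF-8 programmatically. A minimal sketch, assuming the source encoding is supplied by the caller (detecting it automatically would require a library such as chardet, not shown here):

```python
def reencode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Rewrite a delimited file as UTF-8.

    Illustrative sketch; `src_encoding` must be known in advance.
    """
    # Decode with the original encoding, then write back out as UTF-8.
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```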

Data source examples

Example 1

Get a file from GCS with an explicit key.

gcs_datasource.json

{
    "name": "gcs_datasource",
    "type": "DKDataSource_GCS",
    "config": {
        "bucket": "{{gcsConfig.bucket_name}}",
        "service-account-file": "{{gcsConfig.private_key}}"
    },
    "keys" : {
        "input" : {
            "use-only-file-key": true,
            "file-key": "input-files/input.csv"
        }
    }
}

Example 2

Get all available CSV files from GCS using a wildcard and set the file name to runtime variables.

Here, the list of files pulled by the *.csv wildcard is declared as a runtime variable.

gcs_datasource.json

{
    "name": "gcs_datasource",
    "type": "DKDataSource_GCS",
    "config": {
        "bucket": "{{gcsConfig.bucket_name}}",
        "service-account-file": "{{gcsConfig.private_key}}"
    },
    "wildcard-key-prefix" : "stardata_{{build_date}}",
    "wildcard" : "*.csv",
    "set-runtime-vars" : {
        "key_names" : "dimension_files"
    }
}
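The effect of the `wildcard-key-prefix` / `wildcard` pair can be approximated with the standard library: keep only keys under the prefix whose remaining name matches the pattern. This is an illustrative stdlib sketch; the actual connector performs this match against GCS bucket listings.

```python
import fnmatch

def match_wildcard_keys(keys, prefix, pattern):
    """Select bucket keys under `prefix` whose basename matches `pattern`.

    Stdlib approximation of wildcard-key-prefix + wildcard, for
    illustration only.
    """
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        # Strip the prefix (and any path separator) before matching.
        basename = key[len(prefix):].lstrip("/")
        if fnmatch.fnmatch(basename, pattern):
            selected.append(key)
    return selected
```

With prefix `stardata_20240101` and pattern `*.csv`, only CSV files in that directory are selected; their names would populate the `dimension_files` runtime variable.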

Example 3

Use Jinja templating to programmatically generate explicit keys and runtime variables.

The Jinja templating used generates an explicit key and runtime variable based on the file list dimension_files generated in the example above. The file list may come from alternative sources like a /resources file, for example.

check_gcs.json

{
    "name": "check_gcs",
    "type": "DKDataSource_GCS",
    "config": {
        "bucket": "{{gcsConfig.bucket_name}}",
        "service-account-file": "{{gcsConfig.private_key}}"
    },
    "keys" : {
        {% for file in dimension_files %}
            {% set key = file.replace('.csv','') %}
            {{',' if loop.index0 > 0 else ''}}
            "{{file}}" : {
                "file-key" : "stardata-{{build_date}}/{{file}}",
                "use-only-file-key" : true,
                "set-runtime-vars" : {
                    "row_count" : "row_count_file_{{key}}"
                }
            }
        {% endfor %}
    }
}
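The Jinja loop above builds one key entry per file. The same logic, expressed in plain Python for clarity (an illustrative equivalent, not how the product renders templates):

```python
def build_file_keys(dimension_files, build_date):
    """Build the keys mapping the Jinja loop above produces.

    Illustrative equivalent of the template logic; field names mirror
    the example.
    """
    keys = {}
    for file in dimension_files:
        # Matches {% set key = file.replace('.csv','') %} in the template.
        stem = file.replace(".csv", "")
        keys[file] = {
            "file-key": f"stardata-{build_date}/{file}",
            "use-only-file-key": True,
            "set-runtime-vars": {"row_count": f"row_count_file_{stem}"},
        }
    return keys
```

Each file gets its own explicit key plus a per-file `row_count` runtime variable named after the file's stem.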

Data sink examples

Example 1

Push a file to GCS with an explicit key.

gcs_datasink.json

{
    "name": "gcs_datasink",
    "type": "DKDataSink_GCS",
    "config": {
        "bucket": "{{gcsConfig.bucket_name}}",
        "service-account-file": "{{gcsConfig.private_key}}"
    },
    "keys" : {
        "push_file" : {
            "use-only-file-key": true,
            "file-key": "stardata_{{build_date}}/d_customer_profit.csv"
        }
    }
}

Example 2

Push an arbitrary list of files to GCS using a wildcard.

Here, the wildcard pushes all files from the DataMapper node's data source to a directory that matches the wildcard-key-prefix.

gcs_datasink.json

{
    "name": "gcs_datasink",
    "type": "DKDataSink_GCS",
    "config": {
        "bucket": "{{gcsConfig.bucket_name}}",
        "service-account-file": "{{gcsConfig.private_key}}"
    },
    "wildcard-key-prefix" : "stardata_{{build_date}}/",
    "keys" : {}
}
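The wildcard sink above maps every file from the node's data source to a destination key under `wildcard-key-prefix`. That mapping can be sketched as (hypothetical helper, for illustration only):

```python
def destination_keys(source_files, wildcard_key_prefix):
    """Map each source file name to the GCS key the sink would write.

    Sketch of the wildcard sink behavior described above: every file
    lands under wildcard-key-prefix.
    """
    return {f: wildcard_key_prefix + f for f in source_files}
```

So with prefix `stardata_20240101/`, a file `d_customer.csv` would be written to `stardata_20240101/d_customer.csv`.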