
Azure Data Lake Storage Gen2

Azure Data Lake Storage data sources and sinks are in the file-based category of I/O connectors.

Tool documentation

Connector type values

The "type" value to use in the data source or data sink JSON files.

Connector type    Value
Data source       DKDataSource_ADLS2
Data sink         DKDataSink_ADLS2

Connection properties

The properties to use when connecting to an Azure Data Lake Storage Gen2 instance from Automation.

To use this DataKitchen connector, you must have a pre-existing file system.

connection_string (source/sink, string, required)
    Secret access string used to identify the Azure account and its permissions.

filesystem (source/sink, string, required)
    Name of the file system on which to operate. The file system must exist before writing to the path with an ADLS connector.

retry-count (source/sink, string, optional)
    Number of attempts the system makes to establish a connection.

reconnect-interval (sink only, numeric, optional)
    Timeout interval in seconds that allows for tool reconnections and for large files to be uploaded to an ADLS sink. This setting must be added using the File Editor; the UI does not display the associated field.
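Putting all four properties together, a kitchen-level override might look like the following sketch; the vault path, file-system name, and the retry and interval values are illustrative:

```json
{
    "adls2Config": {
        "connection_string": "#{vault://adls2/connection_string}",
        "filesystem": "datakitchen-staging",
        "retry-count": "3",
        "reconnect-interval": 1800
    }
}
```

Because the UI does not display the reconnect-interval field, add that key with the File Editor.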

Connections

See Connection Properties for details on connection configurations.

Defined in kitchen-level variables

adls2Config in Kitchen Overrides

{
    "adls2Config":{
        "connection_string": "#{vault://adls2/connection_string}",
        "filesystem": "datakitchen-staging"
    }
}

The Connection tab in a Node Editor

(Screenshot of the Connection tab)

Expanded connection syntax

For a data source

azuredatalake_datasource.json

{
    "type": "DKDataSource_ADLS2",
    "name": "azuredatalake_datasource",
    "config": {
        "connection_string": "{{adls2Config.connection_string}}",
        "filesystem": "{{adls2Config.filesystem}}"
    },
    "keys": {
        "adls_source": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "post_download_md5"
            }
        }
    }
}

For a data sink

azuredatalake_datasink.json

{
    "type": "DKDataSink_ADLS2",
    "name": "azuredatalake_datasink",
    "config": {
        "connection_string": "{{adls2Config.connection_string}}",
        "filesystem": "{{adls2Config.filesystem}}",
        "reconnect-interval": 1800
    },
    "keys": {
       "adls_sink": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "pre_upload_md5"
            }
       }
    }
}

Condensed connection syntax

For a data source

azuredatalake_datasource.json

{
    "type": "DKDataSource_ADLS2",
    "name": "azuredatalake_datasource",
    "config-ref": "adls2Config",
    "keys": {},
    "tests": {}
}

For a data sink

azuredatalake_datasink.json

{
    "type": "DKDataSink_ADLS2",
    "name": "azuredatalake_datasink",
    "config-ref": "adls2Config",
    "keys": {},
    "tests": {}
}

Local connections

You can find access keys and connection strings in your Azure Storage account settings. Access keys are the credentials for your storage account; connection strings bundle the information the DataKitchen platform needs to connect and access data.

See Microsoft instructions to view and copy a connection string.
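An Azure storage connection string is a semicolon-delimited list of key=value pairs. The minimal Python sketch below splits one into its parts; the account name and key are made up:

```python
# Sketch: split an Azure storage connection string into its key=value parts.
# The sample account name and key below are illustrative, not real credentials.
def parse_connection_string(cs: str) -> dict:
    # Split on ";" first, then on the first "=" only, since base64
    # account keys may themselves end in "=" padding characters.
    return dict(part.split("=", 1) for part in cs.split(";") if part)

sample = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=mystorageacct;"
    "AccountKey=abc123==;"
    "EndpointSuffix=core.windows.net"
)
parts = parse_connection_string(sample)
print(parts["AccountName"])  # mystorageacct
```

Splitting on the first "=" only matters because the AccountKey value is base64 and can contain "=" padding of its own.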

Other configuration properties

Known issue with multiple file uploads: some node configurations that upload more than four files to Microsoft ADLS2 have failed during execution. This issue is under investigation; the root cause has not yet been determined.

See the following topics for common properties, wildcards, and runtime variables:

File encoding requirements

Files used with data sources and data sinks must be encoded in UTF-8. Non-Unicode characters can cause problems when sinking data to database tables and errors when running related tests.

For CSV and other delimited files, use Save As in the originating program and select UTF-8 encoding, or use a text editor with encoding options.
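If you need to check or repair a file's encoding programmatically, the following Python sketch detects whether a file is valid UTF-8 and re-encodes it if not; the file name and the Latin-1 fallback are assumptions for illustration:

```python
# Sketch: detect whether a file is valid UTF-8 and re-encode it if not.
# The file name and the latin-1 fallback encoding are illustrative assumptions.
from pathlib import Path

def ensure_utf8(path: str, fallback_encoding: str = "latin-1") -> str:
    raw = Path(path).read_bytes()
    try:
        raw.decode("utf-8")
        return "already UTF-8"
    except UnicodeDecodeError:
        # Decode with the assumed legacy encoding, then rewrite as UTF-8.
        text = raw.decode(fallback_encoding)
        Path(path).write_text(text, encoding="utf-8")
        return f"re-encoded from {fallback_encoding}"

# Example: a CSV saved with a Latin-1 degree sign (0xB0) is not valid UTF-8.
Path("sample.csv").write_bytes(b"temp_c\xb025.1\n")
print(ensure_utf8("sample.csv"))  # re-encoded from latin-1
```

If the true source encoding differs from the fallback you pass, the bytes will be misinterpreted rather than rejected, so confirm the original encoding where possible.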

Data source example

The ADLS2 data source below loads all JSON files present in the wildcard/ directory with a wildcard key. It also loads the specific test_upload.json file with a file key. The adls2Config variable defines the Azure account and file system for these files.

The source, when finished loading the file, stores the file’s md5 hash in the post_download_md5 runtime variable. As a file integrity test, the source then compares post_download_md5 to a predefined pre_upload_md5 variable.

source.json

{
    "name": "source",
    "type": "DKDataSource_ADLS2",
    "config-ref": "adls2Config",
    "wildcard": "*.json",
    "wildcard-key-prefix": "wildcard/",
    "keys": {
        "azure_source": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "post_download_md5"
            }
        }
    },
    "tests": {
        "verify_data": {
            "action": "stop-on-error",
            "test-variable": "pre_upload_md5",
            "type": "test-contents-as-string",
            "test-logic": "pre_upload_md5 == {{post_download_md5}}"
        }
    }
}

One could also run a test to compare the pre_upload and post_download runtime variables. See Tests for more information and examples.
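The verify_data test above amounts to checking that the hash recorded at upload time matches the hash recorded after download. A minimal Python sketch of the same idea, with an illustrative payload:

```python
import hashlib

# Sketch of the integrity check: hash the same bytes at "upload" and
# "download" time and compare. The payload below is made up.
def md5_of(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

payload = b'{"status": "ok"}'
pre_upload_md5 = md5_of(payload)     # what the sink records before upload
post_download_md5 = md5_of(payload)  # what the source records after download
print(pre_upload_md5 == post_download_md5)  # True when the file arrived intact
```

Any corruption or truncation in transit would change the downloaded bytes and make the two hashes differ, which is what the stop-on-error test catches.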

Data sink example

The ADLS2 data sink below uploads a single file named test_upload.json to the Azure account and file system defined by the adls2Config variable. If a file already exists at the specified location, it is overwritten. After uploading, the sink stores the file's md5 hash in the pre_upload_md5 runtime variable for later use.

sink.json

{
    "name": "sink",
    "type": "DKDataSink_ADLS2",
    "config-ref": "adls2Config",
    "keys": {
        "azure_sink": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "pre_upload_md5"
            }
        }
    }
}