Azure Data Lake Storage Gen2¶
Azure Data Lake Storage data sources and sinks are in the file-based category of I/O connectors.
Tool documentation¶
Connector type values¶
The value to use for the "type" key in source or sink JSON files.
| Connector type | Value |
|---|---|
| Data source | DKDataSource_ADLS2 |
| Data sink | DKDataSink_ADLS2 |
Connection properties¶
The properties to use when connecting to an Azure Data Lake Storage Gen2 instance from Automation.
To use this DataKitchen connector, the target file system must already exist.
| Field | Scope | Type | Required? | Description |
|---|---|---|---|---|
| connection_string | source/sink | string | yes | Secret access string used to identify the Azure account and permissions. |
| filesystem | source/sink | string | yes | Name of the file system on which to operate. The file system must exist before writing to the path with an ADLS connector. |
| retry-count | source/sink | string | no | Number of attempts the system makes to establish a connection. |
| reconnect-interval | sink only | numeric | no | The timeout interval in seconds that allows for tool reconnections and for large files to be uploaded to an ADLS sink. Note that this setting must be added using the File Editor; the UI does not display the associated field. |
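A kitchen-level override that sets both required and optional properties might look like the following sketch. The retry-count and reconnect-interval values are placeholder examples, and reconnect-interval must be added via the File Editor:

```json
{
    "adls2Config": {
        "connection_string": "#{vault://adls2/connection_string}",
        "filesystem": "datakitchen-staging",
        "retry-count": "3",
        "reconnect-interval": 1800
    }
}
```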
Connections¶
See Connection Properties for details on connection configurations.
Defined in kitchen-level variables¶
adls2Config in Kitchen Overrides
{
    "adls2Config": {
        "connection_string": "#{vault://adls2/connection_string}",
        "filesystem": "datakitchen-staging"
    }
}
The Connection tab in a Node Editor¶

Expanded connection syntax¶
For a data source¶
azuredatalake_datasource.json
{
    "type": "DKDataSource_ADLS2",
    "name": "azuredatalake_datasource",
    "config": {
        "connection_string": "{{adls2Config.connection_string}}",
        "filesystem": "{{adls2Config.filesystem}}"
    },
    "keys": {
        "adls_source": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "post_download_md5"
            }
        }
    }
}
For a data sink¶
azuredatalake_datasink.json
{
    "type": "DKDataSink_ADLS2",
    "name": "azuredatalake_datasink",
    "config": {
        "connection_string": "{{adls2Config.connection_string}}",
        "filesystem": "{{adls2Config.filesystem}}",
        "reconnect-interval": 1800
    },
    "keys": {
        "adls_sink": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "pre_upload_md5"
            }
        }
    }
}
Condensed connection syntax¶
For a data source¶
azuredatalake_datasource.json
{
    "type": "DKDataSource_ADLS2",
    "name": "azuredatalake_datasource",
    "config-ref": "adls2Config",
    "keys": {},
    "tests": {}
}
For a data sink¶
azuredatalake_datasink.json
{
    "type": "DKDataSink_ADLS2",
    "name": "azuredatalake_datasink",
    "config-ref": "adls2Config",
    "keys": {},
    "tests": {}
}
Local connections¶
You can find access keys and connection strings in your Azure Storage account settings. Access keys are the credentials for your storage account, and connection strings contain the information the DataKitchen platform needs to connect and access data.
See Microsoft instructions to view and copy a connection string.
Other configuration properties¶
Known issue with multiple file uploads: some node configurations that upload more than four files to Microsoft ADLS2 have failed during execution. This issue is under investigation; the root cause has not yet been determined.
See the following topics for common properties, wildcards, and runtime variables:
File encoding requirements¶
Files used with data sources and data sinks must be encoded in UTF-8; non-Unicode characters can cause problems when sinking data to database tables and errors when running related tests.
For CSV and other delimited files, use Save As in the originating program and select UTF-8 encoding, or use a text editor with encoding options.
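As an illustration, a file saved with a legacy encoding can be re-encoded to UTF-8 before it is used with a source or sink. This is a minimal sketch, not part of the platform: the file names and the original encoding (cp1252) are hypothetical examples.

```python
# Re-encode a delimited file to UTF-8 before handing it to an ADLS2
# source or sink. Paths and the cp1252 source encoding are examples.
from pathlib import Path

src = Path("export_latin.csv")
src.write_bytes("id,name\n1,Café\n".encode("cp1252"))  # sample non-UTF-8 file

text = src.read_bytes().decode("cp1252")        # decode with the known encoding
Path("export_utf8.csv").write_text(text, encoding="utf-8")  # re-save as UTF-8
```

The key point is decoding with the file's actual original encoding; decoding cp1252 bytes as UTF-8 would raise an error or corrupt characters such as "é".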
Data source example¶
The ADLS2 data source below loads all JSON files present in the wildcard/ directory with a wildcard key. It also loads the specific test_upload.json file with a file key. The adls2Config variable defines the connection string and file system for these files.
The source, when finished loading the file, stores the file’s md5 hash in the post_download_md5 runtime variable. As a file integrity test, the source then compares post_download_md5 to a predefined pre_upload_md5 variable.
source.json
{
    "name": "source",
    "type": "DKDataSource_ADLS2",
    "config-ref": "adls2Config",
    "wildcard": "*.json",
    "wildcard-key-prefix": "wildcard/",
    "keys": {
        "azure_source": {
            "file-key": "test_upload.json",
            "use-only-file-key": true,
            "set-runtime-vars": {
                "md5": "post_download_md5"
            }
        }
    },
    "tests": {
        "verify_data": {
            "action": "stop-on-error",
            "test-variable": "pre_upload_md5",
            "type": "test-contents-as-string",
            "test-logic": "pre_upload_md5 == {{post_download_md5}}"
        }
    }
}
One could also run a test to compare the pre_upload and post_download runtime variables. See Tests for more information and examples.
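The integrity check above can be sketched outside the platform: both runtime variables hold the md5 hex digest of the file's bytes, so they match when the file survives the round trip intact. This is an illustrative assumption about the digest format; the payload below is a stand-in for test_upload.json.

```python
# Sketch of the md5 integrity check: the digest computed before upload
# should equal the digest computed after download. The payload is a
# hypothetical stand-in for the uploaded file's bytes.
import hashlib

payload = b'{"example": true}'

pre_upload_md5 = hashlib.md5(payload).hexdigest()     # set by the sink
post_download_md5 = hashlib.md5(payload).hexdigest()  # set by the source

# Equivalent of the test-logic comparison in the source's verify_data test
assert pre_upload_md5 == post_download_md5
```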
Data sink example¶
The ADLS2 data sink below uploads a single file named test_upload.json to the Azure account and file system defined by the adls2Config variable. If a file exists at the specified location, it is overwritten. After uploading, the sink stores the md5 hash of the file in the pre_upload_md5 runtime variable for later use.
sink.json