Databricks Delta, also known as Delta Lake, is an open source storage layer that sits on top of your existing data lake file storage. Stitch’s Databricks Delta destination is compatible with Amazon S3 data lakes.
This guide serves as a reference for version 1 of Stitch’s Databricks Delta destination.
Details and features
High-level details about Stitch’s implementation of Databricks Delta, such as supported connection methods, availability on Stitch plans, etc.
|Detail|Value|
|---|---|
|Stitch plan availability|All Stitch plans|
|Databricks Runtime version|6.3+|
|Connect API availability|This version of the Databricks Delta destination can be created and managed using Stitch’s Connect API. Learn more.|
|SSH connections|Stitch supports using SSH tunnels to connect to Databricks Delta destinations.|
|SSL connections|Stitch will attempt to use SSL to connect by default. No additional configuration is needed.|
|VPN support|Virtual Private Network (VPN) connections may be implemented as part of an Enterprise plan. Contact Stitch Sales for more info.|
|Default loading behavior|Upsert|
|Nested structure support|Supported|
Details about the destination, including object names, table and column limits, reserved keywords, etc.
Note: Exceeding the limits noted below will result in loading errors or rejected data.
|Limit|Value|
|---|---|
|Maximum record size| |
|Table name length| |
|Column name length|122 characters|
|Maximum table size| |
|Maximum tables per database| |
|Reserved keywords|Refer to the Reserved keywords documentation.|
Replication process overview
A Stitch replication job consists of the following steps:
Step 1: Data extraction
Stitch requests and extracts data from a data source. Refer to the System overview guide for a more detailed explanation of the Extraction phase.
Step 2: Stitch's internal pipeline
The data extracted from sources is processed by Stitch. Stitch’s internal pipeline includes the Prepare and Load phases of the replication process:
- Prepare: During this phase, the extracted data is buffered in Stitch’s durable, highly available internal data pipeline and readied for loading.
- Load: During this phase, the prepared data is transformed to be compatible with the destination, and then loaded. Refer to the Transformations section for more info about the transformations Stitch performs for Databricks Delta destinations.
Refer to the System overview guide for a more detailed explanation of these phases.
Step 3: Amazon S3 bucket
Data is loaded into S3 files in the Amazon S3 bucket you provide during destination setup.
Step 4: Staging data
Data is copied from the Amazon S3 bucket and placed into staging tables in Databricks Delta.
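Conceptually, this staging step resembles Databricks’ COPY INTO command. The bucket path, table name, and file format below are hypothetical, and Stitch’s actual loading mechanism may differ:

```sql
-- Hypothetical sketch: load staged files from the S3 bucket into a
-- Delta staging table.
COPY INTO customers_staging
FROM 's3://your-stitch-bucket/customers/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
```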
Step 5: Data merge
Data is merged from the staging tables into real tables in Databricks Delta.
By default, Stitch will use Upsert loading when loading data into Databricks Delta.
If the conditions for Upsert loading aren’t met, data will be loaded using Append-Only loading.
Refer to the Understanding loading behavior guide for more info and examples.
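Conceptually, Upsert loading behaves like a Delta MERGE keyed on the table’s Primary Key. The table and key names below are a hypothetical sketch, not Stitch’s actual statements:

```sql
-- Update rows that already exist in the target (matched on the
-- Primary Key), and insert rows that don't.
MERGE INTO customers AS target
USING customers_staging AS staged
  ON target.id = staged.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```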
Stitch requires Primary Keys to de-dupe incrementally replicated data. To ensure Primary Key data is available, Stitch creates a stitch.pks table property comment when the table is initially created in Databricks Delta. The table property comment is an array of strings containing the names of the Primary Key columns for the table.
For example, the table property comment for a table with a single Primary Key contains one column name, while the comment for a table with a composite Primary Key lists every Primary Key column.
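One plausible shape for these properties, using hypothetical table and column names (Stitch’s actual DDL may differ):

```sql
-- Hypothetical table with a single Primary Key:
CREATE TABLE customers (id BIGINT, name STRING)
USING DELTA
TBLPROPERTIES ('stitch.pks' = '["id"]');

-- Hypothetical table with a composite Primary Key:
CREATE TABLE order_items (order_id BIGINT, item_id BIGINT, quantity INT)
USING DELTA
TBLPROPERTIES ('stitch.pks' = '["order_id", "item_id"]');
```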
Note: Removing or incorrectly altering Primary Key table property comments can lead to replication issues.
No compatibility issues have been discovered between Databricks Delta and Stitch's integration offerings.
System tables and columns
Stitch will create the following tables in each integration’s dataset:
Additionally, Stitch will insert system columns (prepended with _sdc) into each table.
Stitch converts data types only where needed to ensure the data is accepted by Databricks Delta. In the table below are the data types Stitch supports for Databricks Delta destinations, and the Stitch types they map to.
- Stitch type: The Stitch data type the source type was mapped to. During the Extraction and Preparing phases, Stitch identifies the data type in the source and then maps it to a common Stitch data type.
- Destination type: The destination-compatible data type the Stitch type maps to. This is the data type Stitch will use to store data in Databricks Delta.
- Notes: Details about the data type and/or its allowed values in the destination, if available. If a range is available, values that exceed the noted range will be rejected by Databricks Delta.
|Stitch type|Destination type|Notes|
|---|---|---|
Databricks Delta supports nested records within tables. When JSON objects and arrays are replicated, Stitch will load the JSON intact into a STRING column and add a comment ("json") specifying that the column contains JSON data.
Refer to Databricks’ documentation for examples and instructions on working with complex data structures.
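For example, assuming a hypothetical STRING column named details that holds JSON, the value can be queried with Spark SQL’s built-in JSON functions:

```sql
-- Pull a single field out of the JSON string, or parse the whole
-- value into a struct with an explicit schema.
SELECT
  get_json_object(details, '$.city') AS city,
  from_json(details, 'city STRING, zip STRING') AS parsed_details
FROM customers
```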
Column names in Databricks Delta:
- Must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_)
- Must begin with a letter or an underscore
- Must not exceed the maximum length of 122 characters. Columns that exceed this limit will be rejected by Databricks Delta.
- Must not be prefixed or suffixed with any of Stitch’s reserved keyword prefixes or suffixes
Stitch will perform the following transformations to ensure column names adhere to the rules imposed by Databricks Delta:
|Transformation|Source column|Destination column|
|---|---|---|
|Convert uppercase and mixed case to lowercase|CuStOmErId|customerid|
|Convert spaces to underscores|customer id|customer_id|
|Convert special characters to underscores|customer!id|customer_id|
|Prepend an underscore to names with leading numeric characters|4customerid|_4customerid|
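The transformations above can be sketched as a single normalization function. This is an illustrative approximation only; Stitch’s actual implementation may differ in details such as how consecutive special characters are collapsed.

```python
import re

def transform_column_name(name: str) -> str:
    """Illustrative sketch of the column-name transformations above."""
    name = name.lower()                      # uppercase/mixed case -> lowercase
    name = re.sub(r"[^a-z0-9_]", "_", name)  # spaces and special characters -> underscores
    if name and name[0].isdigit():           # leading digit -> prepend underscore
        name = "_" + name
    return name

print(transform_column_name("CuStOmErId"))   # customerid
print(transform_column_name("customer id"))  # customer_id
print(transform_column_name("4customerid"))  # _4customerid
```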
Databricks Delta will store the value as TIMESTAMP WITH TIMEZONE. In Databricks Delta, this data is stored with timezone information and expressed as UTC.
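As an illustration of this behavior (the source value is hypothetical), a timestamp carrying a -05:00 offset ends up expressed as its UTC equivalent:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical source value: a timestamp with a -05:00 UTC offset.
source_ts = datetime(2024, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=-5)))

# Normalize to UTC, mirroring how the value is expressed once stored.
utc_ts = source_ts.astimezone(timezone.utc)
print(utc_ts.isoformat())  # 2024-03-01T14:30:00+00:00
```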
Not sure if Databricks Delta is the destination for you? Check out the Choosing a Stitch Destination guide to compare each of Stitch’s destination offerings.