Databricks Delta Lake (AWS) is an open source storage layer that sits on top of your existing data lake file storage. Stitch’s Databricks Delta Lake (AWS) destination is compatible with Amazon S3 data lakes.

This guide serves as a reference for version 1 of Stitch’s Databricks Delta Lake (AWS) destination.


Details and features

Stitch features

High-level details about Stitch’s implementation of Databricks Delta Lake (AWS), such as supported connection methods, availability on Stitch plans, etc.

Release status

Released

Stitch plan availability

All Stitch plans

Stitch supported regions
  • North America (AWS us-east-1)
  • Europe (AWS eu-central-1)

Operating regions determine the location of the resources Stitch uses to process your data. Learn more.

Supported versions

Databricks Runtime Version 6.3+

Connect API availability

Supported

This version of the Databricks Delta Lake (AWS) destination can be created and managed using Stitch’s Connect API. Learn more.

SSH connections

Supported

Stitch supports using SSH tunnels to connect to Databricks Delta Lake (AWS) destinations.

SSL connections

Supported

Stitch will attempt to use SSL to connect by default. No additional configuration is needed.

VPN connections

Unsupported

Virtual Private Network (VPN) connections may be implemented as part of a Premium plan. Contact Stitch Sales for more info.

Static IP addresses

Supported

This version of the Databricks Delta Lake (AWS) destination has static IP addresses that can be whitelisted.

Default loading behavior

Upsert
Note: Append-Only loading will be used if all conditions for Upsert are not met. Learn more.

Nested structure support

Supported
Nested data structures (JSON arrays and objects) will be loaded intact into a STRING column with a comment specifying that the column contains JSON. Learn more.

Destination details

Details about the destination, including object names, table and column limits, reserved keywords, etc.

Note: Exceeding the limits noted below will result in loading errors or rejected data.

Maximum record size

20MB

Table name length

78 characters

Column name length

122 characters

Maximum columns per table

None

Maximum table size

None

Maximum tables per database

None

Case sensitivity

Insensitive

Reserved keywords

Refer to the Reserved keywords documentation.
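
The limits above can be checked before data is sent. The following is a minimal sketch, not part of Stitch: the function name check_limits is hypothetical, and reading "20MB" as 20 * 1024 * 1024 bytes of UTF-8 encoded JSON is an assumption, since the exact way Stitch measures record size is not stated here.

```python
import json

# Limits taken from the destination details above. Interpreting "20MB" as
# 20 * 1024 * 1024 bytes of UTF-8 encoded JSON is an assumption of this sketch.
MAX_RECORD_BYTES = 20 * 1024 * 1024
MAX_TABLE_NAME_LENGTH = 78
MAX_COLUMN_NAME_LENGTH = 122


def check_limits(table_name, record):
    """Return a list of limit violations for one record (empty if none)."""
    problems = []
    if len(table_name) > MAX_TABLE_NAME_LENGTH:
        problems.append(f"table name exceeds {MAX_TABLE_NAME_LENGTH} characters")
    for column in record:
        if len(column) > MAX_COLUMN_NAME_LENGTH:
            problems.append(f"column name exceeds {MAX_COLUMN_NAME_LENGTH} characters")
    size = len(json.dumps(record).encode("utf-8"))
    if size > MAX_RECORD_BYTES:
        problems.append(f"record is {size} bytes, over the {MAX_RECORD_BYTES}-byte limit")
    return problems
```

An empty list means the record is within the documented limits; each string in a non-empty list names one limit that would likely cause a loading error or rejected data.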


Replication

Replication process overview

A Stitch replication job for this destination consists of five steps:

Step 1: Data extraction

Stitch requests and extracts data from a data source. Refer to the System overview guide for a more detailed explanation of the Extraction phase.

Step 2: Stitch's internal pipeline

The data extracted from sources is processed by Stitch. Stitch’s internal pipeline includes the Prepare and Load phases of the replication process:

  • Prepare: During this phase, the extracted data is buffered in Stitch’s durable, highly available internal data pipeline and readied for loading.
  • Load: During this phase, the prepared data is transformed to be compatible with the destination, and then loaded. Refer to the Transformations section for more info about the transformations Stitch performs for Databricks Delta Lake (AWS) destinations.

Refer to the System overview guide for a more detailed explanation of these phases.

Step 3: Amazon S3 bucket

Data is loaded into S3 files in the Amazon S3 bucket you provide during destination setup.

Step 4: Staging data

Data is copied from the Amazon S3 bucket and placed into staging tables in Databricks Delta Lake (AWS).

Step 5: Data merge

Data is merged from the staging tables into real tables in Databricks Delta Lake (AWS).

Loading behavior

By default, Stitch will use Upsert loading when loading data into Databricks Delta Lake (AWS).

If the conditions for Upsert loading aren’t met, data will be loaded using Append-Only loading.

Refer to the Understanding loading behavior guide for more info and examples.
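
The difference between the two loading types can be illustrated on in-memory rows. This is a simplified sketch and not Stitch's implementation: the function name load is hypothetical, and it assumes the only condition for Upsert is that Primary Keys are defined for the table.

```python
def load(table, batch, primary_keys):
    """Load a batch of rows into `table` (a list of dicts) and return the result.

    Assumption of this sketch: Upsert is used when Primary Keys are defined,
    otherwise Append-Only is used.
    """
    if not primary_keys:
        # Append-Only: every incoming row is added, even if it repeats a record.
        return table + list(batch)

    # Upsert: an incoming row replaces the existing row with the same key.
    merged = {tuple(row[k] for k in primary_keys): row for row in table}
    for row in batch:
        merged[tuple(row[k] for k in primary_keys)] = row
    return list(merged.values())
```

With Primary Keys, a row whose key already exists overwrites the earlier version; without them, both versions are kept and the table grows with every load.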

Primary Keys

Stitch requires Primary Keys to de-dupe incrementally replicated data. To ensure Primary Key data is available, Stitch creates a stitch.pks table property comment when the table is initially created in Databricks Delta Lake (AWS). The table property comment is an array of strings that contain the names of the Primary Key columns for the table.

For example, a table property comment for a table with a single Primary Key:

(stitch.pks="id")

And a table property comment for a table with a composite Primary Key:

(stitch.pks="id,created_at")

Note: Removing or incorrectly altering Primary Key table property comments can lead to replication issues.
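
To read the Primary Key columns back out of a property value, a small helper is enough. This sketch assumes the value is a comma-separated string, as in the two examples above; the function name parse_primary_keys is hypothetical.

```python
def parse_primary_keys(stitch_pks):
    """Split a stitch.pks property value into a list of column names.

    Assumes the comma-separated form shown in the examples above.
    """
    return [name.strip() for name in stitch_pks.split(",") if name.strip()]
```

For the composite example above, this returns the two column names in the order they appear in the property value.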

Incompatible sources

No compatibility issues have been discovered between Databricks Delta Lake (AWS) and Stitch's integration offerings.

See all destination and integration incompatibilities.


Transformations

System tables and columns

Stitch will create the following tables in each integration’s dataset:

Additionally, Stitch will insert system columns (prepended with _sdc) into each table.

Data typing

Stitch converts data types only where needed to ensure the data is accepted by Databricks Delta Lake (AWS). The table below lists the Stitch data types supported for Databricks Delta Lake (AWS) destinations and the destination types they map to.

  • Stitch type: The Stitch data type the source type was mapped to. During the Extraction and Prepare phases, Stitch identifies the data type in the source and then maps it to a common Stitch data type.
  • Destination type: The destination-compatible data type the Stitch type maps to. This is the data type Stitch will use to store data in Databricks Delta Lake (AWS).
  • Notes: Details about the data type and/or its allowed values in the destination, if available. If a range is available, values that exceed the noted range will be rejected by Databricks Delta Lake (AWS).

Stitch type    Destination type    Notes
BIGINT         BIGINT
BOOLEAN        BOOLEAN
DATE           TIMESTAMP           Stored in UTC
DOUBLE         DECIMAL             Stored as decimal(38,6)
FLOAT          FLOAT
INTEGER        BIGINT
JSON ARRAY     STRING              Used to store nested JSON structures (objects and arrays). JSON is loaded intact into the column, which will have a comment ("json") specifying that the column contains JSON data.
JSON OBJECT    STRING              Used to store nested JSON structures (objects and arrays). JSON is loaded intact into the column, which will have a comment ("json") specifying that the column contains JSON data.
NUMBER         DECIMAL             Stored as decimal(38,6)
STRING         STRING
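
The mapping in the table above can be written as a simple lookup, which is handy when predicting the column types a source schema will produce. This is an illustrative sketch only: the names STITCH_TO_DESTINATION and destination_type are hypothetical and are not part of any Stitch API.

```python
# Stitch type -> destination type, copied from the table above.
STITCH_TO_DESTINATION = {
    "BIGINT": "BIGINT",
    "BOOLEAN": "BOOLEAN",
    "DATE": "TIMESTAMP",         # stored in UTC
    "DOUBLE": "DECIMAL(38,6)",
    "FLOAT": "FLOAT",
    "INTEGER": "BIGINT",
    "JSON ARRAY": "STRING",      # column carries a "json" comment
    "JSON OBJECT": "STRING",     # column carries a "json" comment
    "NUMBER": "DECIMAL(38,6)",
    "STRING": "STRING",
}


def destination_type(stitch_type):
    """Look up the destination type for a Stitch type (case-insensitive)."""
    return STITCH_TO_DESTINATION[stitch_type.upper()]
```

An unknown Stitch type raises KeyError, which makes gaps in the mapping visible rather than silently defaulting to a type.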

JSON structures

Databricks Delta Lake (AWS) supports nested records within tables. When JSON objects and arrays are replicated, Stitch will load the JSON intact into a STRING column and add a comment ("json") specifying that the column contains JSON data.

Refer to Databricks’ documentation for examples and instructions on working with complex data structures.
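
Because the JSON arrives as text, client code has to parse it before working with the nested values. The sketch below shows one way to do that in Python after a row has been fetched; the row contents and the column name line_items are made up for illustration.

```python
import json


def read_json_column(value):
    """Parse a STRING column value that holds JSON; pass NULLs through."""
    if value is None:
        return None
    return json.loads(value)


# A hypothetical row as it might be returned from the destination.
row = {"id": 1, "line_items": '[{"sku": "A1", "qty": 2}]'}
items = read_json_column(row["line_items"])
total_qty = sum(item["qty"] for item in items)
```

After parsing, arrays become Python lists and objects become dictionaries, so nested fields can be accessed with ordinary indexing.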

Column names

Column names in Databricks Delta Lake (AWS):

Stitch will perform the following transformations to ensure column names adhere to the rules imposed by Databricks Delta Lake (AWS):

Transformation                                                    Source column                 Destination column
Convert uppercase and mixed case to lowercase                     CUSTOMERID or cUsTomErId      customerid
Convert spaces to underscores                                     customer id                   customer_id
Convert special characters to underscores                         customer#id or !customerid    customer_id and _customerid
Prepend an underscore to names with leading numeric characters    4customerid                   _4customerid
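
The transformations listed above can be approximated in a few lines, which is useful for predicting a destination column name from a source name. This is a sketch under assumptions, not Stitch's own code: the function name normalize_column_name is hypothetical, the order in which the rules are applied is a guess, and any character other than a lowercase ASCII letter, digit, or underscore is treated as a special character.

```python
import re


def normalize_column_name(name):
    """Approximate the column name transformations listed above."""
    # Convert uppercase and mixed case to lowercase.
    name = name.lower()
    # Convert spaces and special characters to underscores.
    name = re.sub(r"[^a-z0-9_]", "_", name)
    # Prepend an underscore to names with a leading numeric character.
    if name and name[0].isdigit():
        name = "_" + name
    return name
```

Note that two different source names can map to the same destination name under these rules, for example "customer id" and "customer#id".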

Timezones

Databricks Delta Lake (AWS) will store date-time values as TIMESTAMP WITH TIMEZONE. This data is stored with timezone information and expressed as UTC.
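
To see what "expressed as UTC" means for a value that carries an offset in the source, the conversion can be reproduced with the Python standard library. This is an illustration only: the function name to_utc and the sample value are made up, and rejecting values without timezone information is a choice of this sketch, not documented Stitch behavior.

```python
from datetime import datetime, timedelta, timezone


def to_utc(value):
    """Convert a timezone-aware datetime to UTC."""
    if value.tzinfo is None:
        # Assumption of this sketch: naive values are rejected, not guessed at.
        raise ValueError("expected a timezone-aware datetime")
    return value.astimezone(timezone.utc)


# 09:30 at UTC-05:00 is the same instant as 14:30 UTC.
source_value = datetime(2020, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
utc_value = to_utc(source_value)
```

The instant in time is unchanged by the conversion; only the offset used to express it differs.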


Compare destinations

Not sure if Databricks Delta Lake (AWS) is the destination for you? Check out the Choosing a Stitch Destination guide to compare each of Stitch’s destination offerings.

