Databricks Delta is an open source storage layer that sits on top of your existing data lake file storage. Stitch’s Databricks Delta destination is compatible with Amazon S3 data lakes.

This guide serves as a reference for version 1 of Stitch’s Databricks Delta destination.


Details and features

Stitch features

High-level details about Stitch’s implementation of Databricks Delta, such as supported connection methods, availability on Stitch plans, etc.

Release status

Beta

Stitch plan availability

All Stitch plans

Supported versions

Databricks Runtime Version 6.3+

Connect API availability

Supported

This version of the Databricks Delta destination can be created and managed using Stitch’s Connect API. Learn more.

SSH connections

Supported

Stitch supports using SSH tunnels to connect to Databricks Delta destinations.

SSL connections

Supported

Stitch will attempt to use SSL to connect by default. No additional configuration is needed.

VPN connections

Unsupported

Virtual Private Network (VPN) connections may be implemented as part of an Enterprise plan. Contact Stitch Sales for more info.

Default loading behavior

Upsert
Note: Append-Only loading will be used if the conditions for Upsert loading aren’t met. Learn more.

Nested structure support

Supported
Nested data structures (JSON arrays and objects) will be loaded intact into a STRING column with a comment specifying that the column contains JSON. Learn more.

Destination details

Details about the destination, including object names, table and column limits, reserved keywords, etc.

Note: Exceeding the limits noted below will result in loading errors or rejected data.

Maximum record size

20MB

Table name length

78 characters

Column name length

122 characters

Maximum table size

None

Maximum tables per database

None

Case sensitivity

Insensitive

Reserved keywords

Refer to the Reserved keywords documentation.


Replication

Replication process overview

A Stitch replication job consists of five stages:

Step 1: Data extraction

Stitch requests and extracts data from a data source. Refer to the System overview guide for a more detailed explanation of the Extraction phase.

Step 2: Stitch's internal pipeline

The data extracted from sources is processed by Stitch. Stitch’s internal pipeline includes the Prepare and Load phases of the replication process:

  • Prepare: During this phase, the extracted data is buffered in Stitch’s durable, highly available internal data pipeline and readied for loading.
  • Load: During this phase, the prepared data is transformed to be compatible with the destination, and then loaded. Refer to the Transformations section for more info about the transformations Stitch performs for Databricks Delta destinations.

Refer to the System overview guide for a more detailed explanation of these phases.

Step 3: Amazon S3 bucket

Data is loaded as files into the Amazon S3 bucket you provide during destination setup.

Step 4: Staging data

Data is copied from the Amazon S3 bucket and placed into staging tables in Databricks Delta.

Step 5: Data merge

Data is merged from the staging tables into the final tables in Databricks Delta.
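
To illustrate this stage, Delta Lake’s MERGE INTO statement performs exactly this kind of staging-to-target merge. The sketch below is illustrative only: the sales.orders and sales.orders_staging table names are assumptions, and the actual statements Stitch issues may differ.

    -- Illustrative sketch; table names are assumed, matching on the Primary Key.
    MERGE INTO sales.orders AS target
    USING sales.orders_staging AS staging
      ON target.id = staging.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;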

Loading behavior

By default, Stitch will use Upsert loading when loading data into Databricks Delta.

If the conditions for Upsert loading aren’t met, data will be loaded using Append-Only loading.

Refer to the Understanding loading behavior guide for more info and examples.
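
For contrast with the Upsert merge sketched above, Append-Only loading behaves like a plain insert from the staging table: existing rows are never updated, so multiple versions of a record can accumulate. A minimal sketch, using the same assumed table names:

    -- Illustrative sketch; rows are appended without matching on a Primary Key.
    INSERT INTO sales.orders
    SELECT * FROM sales.orders_staging;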

Primary Keys

Stitch requires Primary Keys to de-dupe incrementally replicated data. To ensure Primary Key data is available, Stitch creates a stitch.pks table property when the table is initially created in Databricks Delta. The property’s value is a comma-separated list of the table’s Primary Key column names.

For example, the table property for a table with a single Primary Key:

(stitch.pks="id")

And the table property for a table with a composite Primary Key:

(stitch.pks="id,created_at")

Note: Removing or incorrectly altering the stitch.pks table property can lead to replication issues.
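
To verify the property on a table, Spark SQL’s SHOW TBLPROPERTIES command can read it back. A minimal sketch, assuming a hypothetical table named sales.orders:

    -- Read the stitch.pks table property back from a replicated table.
    SHOW TBLPROPERTIES sales.orders ('stitch.pks');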

Incompatible sources

No compatibility issues have been discovered between Databricks Delta and Stitch's integration offerings.

See all destination and integration incompatibilities.


Transformations

System tables and columns

Stitch will create system tables in each integration’s dataset.

Additionally, Stitch will insert system columns (prefixed with _sdc) into each table.
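
These system columns can be queried like any other column. In the hedged sketch below, _sdc_received_at and _sdc_batched_at are assumed column names, typical of Stitch system columns; check your tables for the exact set:

    -- Inspect when rows were received and batched by Stitch.
    -- Column names here are assumptions, not confirmed by this guide.
    SELECT _sdc_received_at, _sdc_batched_at
    FROM sales.orders
    ORDER BY _sdc_batched_at DESC
    LIMIT 10;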

Data typing

Stitch converts data types only where needed to ensure the data is accepted by Databricks Delta. The table below lists the Stitch data types and the Databricks Delta data types they map to.

  • Stitch type: The Stitch data type the source type was mapped to. During the Extraction and Preparing phases, Stitch identifies the data type in the source and then maps it to a common Stitch data type.
  • Destination type: The destination-compatible data type the Stitch type maps to. This is the data type Stitch will use to store data in Databricks Delta.
  • Notes: Details about the data type and/or its allowed values in the destination, if available. If a range is available, values that exceed the noted range will be rejected by Databricks Delta.
Stitch type   Destination type   Notes
BIGINT        BIGINT
BOOLEAN       BOOLEAN
DATE          TIMESTAMP          Stored in UTC
DOUBLE        DECIMAL            Stored as decimal(38,6)
FLOAT         FLOAT
INTEGER       BIGINT
JSON ARRAY    STRING             Nested JSON structures (objects and arrays) are loaded intact; the column carries a comment ("json") indicating it contains JSON data
JSON OBJECT   STRING             Nested JSON structures (objects and arrays) are loaded intact; the column carries a comment ("json") indicating it contains JSON data
NUMBER        DECIMAL            Stored as decimal(38,6)
STRING        STRING
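
As a concrete illustration of these mappings, the following is a hedged sketch of the kind of Delta table that could result. The table name, column names, and comment are assumptions for illustration, not output captured from Stitch:

    -- Illustrative only; one column per mapping of interest.
    CREATE TABLE sales.orders (
      id         BIGINT,               -- Stitch INTEGER
      is_test    BOOLEAN,              -- Stitch BOOLEAN
      amount     DECIMAL(38,6),        -- Stitch NUMBER or DOUBLE
      ordered_at TIMESTAMP,            -- Stitch DATE, stored in UTC
      line_items STRING COMMENT 'json' -- Stitch JSON ARRAY/OBJECT, loaded intact
    ) USING DELTA;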

JSON structures

Databricks Delta supports nested records within tables. When JSON objects and arrays are replicated, Stitch will load the JSON intact into a STRING column and add a comment ("json") specifying that the column contains JSON data.

Refer to Databricks’ documentation for examples and instructions on working with complex data structures.
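
For example, Spark SQL’s get_json_object function can extract values from the JSON stored in a STRING column. A minimal sketch, assuming the hypothetical line_items column from the example above holds a JSON array:

    -- Extract the first line item's sku from the JSON string.
    SELECT
      id,
      get_json_object(line_items, '$[0].sku') AS first_sku
    FROM sales.orders;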

Column names

Stitch will perform the following transformations to ensure column names adhere to the rules imposed by Databricks Delta:

Transformation                                                   Source column                Destination column
Convert uppercase and mixed case to lowercase                    CUSTOMERID or cUsTomErId     customerid
Convert spaces to underscores                                    customer id                  customer_id
Convert special characters to underscores                        customer#id or !customerid   customer_id and _customerid
Prepend an underscore to names with leading numeric characters   4customerid                  _4customerid

Timezones

Databricks Delta will store timestamp values as TIMESTAMP WITH TIMEZONE. In Databricks Delta, this data is stored with timezone information and expressed as UTC.
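
If you need timestamps in a local timezone, Spark SQL’s from_utc_timestamp function can convert them at query time. A minimal sketch, reusing the hypothetical ordered_at column from earlier:

    -- Convert a UTC timestamp to a local timezone at query time.
    SELECT
      ordered_at,
      from_utc_timestamp(ordered_at, 'America/New_York') AS ordered_at_local
    FROM sales.orders;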


Compare destinations

Not sure if Databricks Delta is the destination for you? Check out the Choosing a Stitch Destination guide to compare each of Stitch’s destination offerings.


Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.