Databricks Delta, also known as Delta Lake, is an open source storage layer that sits on top of your existing data lake file storage. Stitch’s Databricks Delta destination is compatible with Amazon S3 data lakes.
This guide serves as a reference for version 1 of Stitch’s Databricks Delta destination.
Details and features
High-level details about Stitch’s implementation of Databricks Delta, such as supported connection methods, availability on Stitch plans, etc.
|Detail|Value|
|---|---|
|Stitch plan availability|All Stitch plans|
|Databricks Runtime version|6.3+|
|Connect API availability|This version of the Databricks Delta destination can be created and managed using Stitch’s Connect API. Learn more.|
|SSH connections|Stitch supports using SSH tunnels to connect to Databricks Delta destinations.|
|SSL connections|Stitch will attempt to use SSL to connect by default. No additional configuration is needed.|
|VPN support|Virtual Private Network (VPN) connections may be implemented as part of an Enterprise plan. Contact Stitch Sales for more info.|
|Default loading behavior|Upsert|
|Nested structure support|Supported|
Details about the destination, including object names, table and column limits, reserved keywords, etc.
Note: Exceeding the limits noted below will result in loading errors or rejected data.
|Limit|Value|
|---|---|
|Maximum record size| |
|Table name length| |
|Column name length|122 characters|
|Maximum table size| |
|Maximum tables per database| |
|Reserved keywords|Refer to the Reserved keywords documentation.|
Replication process overview
A Stitch replication job consists of the following steps:
Step 1: Data extraction
Stitch requests and extracts data from a data source. Refer to the System overview guide for a more detailed explanation of the Extraction phase.
Step 2: Stitch's internal pipeline
The data extracted from sources is processed by Stitch. Stitch’s internal pipeline includes the Prepare and Load phases of the replication process:
- Prepare: During this phase, the extracted data is buffered in Stitch’s durable, highly available internal data pipeline and readied for loading.
- Load: During this phase, the prepared data is transformed to be compatible with the destination, and then loaded. Refer to the Transformations section for more info about the transformations Stitch performs for Databricks Delta destinations.
Refer to the System overview guide for a more detailed explanation of these phases.
Step 3: Amazon S3 bucket
Data is loaded into S3 files in the Amazon S3 bucket you provide during destination setup.
Step 4: Staging data
Data is copied from the Amazon S3 bucket and placed into staging tables in Databricks Delta.
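Conceptually, this staging step resembles Databricks’ COPY INTO command. The bucket path, table name, and file format below are hypothetical, and Stitch’s actual loading mechanism may differ:

```sql
-- Hypothetical sketch: load staged files from the S3 bucket into a
-- Delta staging table.
COPY INTO customers_staging
FROM 's3://your-stitch-bucket/customers/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
```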
Step 5: Data merge
Data is merged from the staging tables into real tables in Databricks Delta.
By default, Stitch will use Upsert loading when loading data into Databricks Delta.
If the conditions for Upsert loading aren’t met, data will be loaded using Append-Only loading.
Refer to the Understanding loading behavior guide for more info and examples.
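Conceptually, Upsert loading behaves like a Delta MERGE keyed on the table’s Primary Key. The table and key names below are a hypothetical sketch, not Stitch’s actual statements:

```sql
-- Update rows that already exist in the target (matched on the
-- Primary Key), and insert rows that don't.
MERGE INTO customers AS target
USING customers_staging AS staged
  ON target.id = staged.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```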
Stitch requires Primary Keys to de-dupe incrementally replicated data. To ensure Primary Key data is available, Stitch creates a stitch.pks table property comment when the table is initially created in Databricks Delta. The table property comment is an array of strings containing the names of the Primary Key columns for the table.
For example, the table property comment for a table with a single Primary Key contains one column name, while the comment for a table with a composite Primary Key lists every Primary Key column.
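One plausible shape for these properties, using hypothetical table and column names (Stitch’s actual DDL may differ):

```sql
-- Hypothetical table with a single Primary Key:
CREATE TABLE customers (id BIGINT, name STRING)
USING DELTA
TBLPROPERTIES ('stitch.pks' = '["id"]');

-- Hypothetical table with a composite Primary Key:
CREATE TABLE order_items (order_id BIGINT, item_id BIGINT, quantity INT)
USING DELTA
TBLPROPERTIES ('stitch.pks' = '["order_id", "item_id"]');
```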
Note: Removing or incorrectly altering Primary Key table property comments can lead to replication issues.
No compatibility issues have been discovered between Databricks Delta and Stitch's integration offerings.
System tables and columns
Stitch will create the following tables in each integration’s dataset:
Additionally, Stitch will insert system columns (prepended with _sdc) into each table.
Stitch converts data types only where needed to ensure the data is accepted by Databricks Delta. In the table below are the data types Stitch supports for Databricks Delta destinations, and the Stitch types they map to.
- Stitch type: The Stitch data type the source type was mapped to. During the Extraction and Preparing phases, Stitch identifies the data type in the source and then maps it to a common Stitch data type.
- Destination type: The destination-compatible data type the Stitch type maps to. This is the data type Stitch will use to store data in Databricks Delta.
- Notes: Details about the data type and/or its allowed values in the destination, if available. If a range is available, values that exceed the noted range will be rejected by Databricks Delta.
|Stitch type|Destination type|Notes|
|---|---|---|
Databricks Delta supports nested records within tables. When JSON objects and arrays are replicated, Stitch will load the JSON intact into a STRING column and add a comment ("json") specifying that the column contains JSON data.
Refer to Databricks’ documentation for examples and instructions on working with complex data structures.
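For example, assuming a hypothetical STRING column named details that holds JSON, the value can be queried with Spark SQL’s built-in JSON functions:

```sql
-- Pull a single field out of the JSON string, or parse the whole
-- value into a struct with an explicit schema.
SELECT
  get_json_object(details, '$.city') AS city,
  from_json(details, 'city STRING, zip STRING') AS parsed_details
FROM customers
```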
Column names in Databricks Delta:
- Must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_)
- Must begin with a letter or an underscore
- Must not exceed the maximum length of 122 characters. Columns that exceed this limit will be rejected by Databricks Delta.
- Must not be prefixed or suffixed with any of Stitch’s reserved keyword prefixes or suffixes
Stitch will perform the following transformations to ensure column names adhere to the rules imposed by Databricks Delta:
|Transformation|Source column|Destination column|
|---|---|---|
|Convert uppercase and mixed case to lowercase|CuStOmErId|customerid|
|Convert spaces to underscores|customer id|customer_id|
|Convert special characters to underscores|customer!id|customer_id|
|Prepend an underscore to names with leading numeric characters|4customerid|_4customerid|
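The transformations above can be sketched as a single normalization function. This is an illustrative approximation only; Stitch’s actual implementation may differ in details such as how consecutive special characters are collapsed.

```python
import re

def transform_column_name(name: str) -> str:
    """Illustrative sketch of the column-name transformations above."""
    name = name.lower()                      # uppercase/mixed case -> lowercase
    name = re.sub(r"[^a-z0-9_]", "_", name)  # spaces and special characters -> underscores
    if name and name[0].isdigit():           # leading digit -> prepend underscore
        name = "_" + name
    return name

print(transform_column_name("CuStOmErId"))   # customerid
print(transform_column_name("customer id"))  # customer_id
print(transform_column_name("4customerid"))  # _4customerid
```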
Databricks Delta will store the value as TIMESTAMP WITH TIMEZONE. In Databricks Delta, this data is stored with timezone information and expressed as UTC.
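As an illustration of this behavior (the source value is hypothetical), a timestamp carrying a -05:00 offset ends up expressed as its UTC equivalent:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical source value: a timestamp with a -05:00 UTC offset.
source_ts = datetime(2024, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=-5)))

# Normalize to UTC, mirroring how the value is expressed once stored.
utc_ts = source_ts.astimezone(timezone.utc)
print(utc_ts.isoformat())  # 2024-03-01T14:30:00+00:00
```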
Not sure if Databricks Delta is the destination for you? Check out the Choosing a Stitch Destination guide to compare each of Stitch’s destination offerings.