Prerequisites

  • An Amazon Web Services (AWS) account with a Databricks Delta Lake deployment. Instructions for configuring a Databricks Delta Lake deployment are outside the scope of this tutorial; our instructions assume that you have Databricks Delta Lake up and running. Refer to Databricks’ documentation for help configuring your AWS account with Databricks.

  • An existing Amazon S3 bucket. This bucket must be in the same AWS account as the Databricks deployment or have a cross-account bucket policy that allows access to the bucket from the AWS account with the Databricks deployment.

  • Permissions to manage S3 buckets in AWS. Your AWS user must be able to add and modify bucket policies in the AWS account or accounts where the S3 bucket and Databricks deployment reside.


Step 1: Configure S3 bucket access in AWS

Step 1.1: Grant Stitch access to your Amazon S3 bucket

To allow Stitch to access the bucket, you’ll need to add a bucket policy using the AWS console. Follow the steps below to add the bucket policy.

  1. Sign in to your Amazon Web Services (AWS) account as a user with privileges that allow you to manage S3 buckets.

  2. Click Services near the top-left corner of the page.

  3. Under the Storage option, click S3.

  4. A page listing all buckets currently in use will display. Click the name of the bucket that is used with Databricks.

  5. Click the Permissions tab.

  6. In the Permissions tab, click the Bucket Policy button.

  7. In the Bucket policy editor, paste the following bucket policy. Replace <YOUR-BUCKET-NAME> with the name of your S3 bucket:

    {
      "Version": "2012-10-17",
      "Id": "Policy-LoaderDelta-MWF",
      "Statement": [
        {
          "Sid": "Stmt123",
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::218546966473:role/LoaderDelta"
            ]
          },
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:DeleteObject"
          ],
          "Resource": [
            "arn:aws:s3:::<YOUR-BUCKET-NAME>",
            "arn:aws:s3:::<YOUR-BUCKET-NAME>/*"
          ]
        }
      ]
    }
    
  8. When finished, click Save.
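If you would rather generate the policy than edit the placeholder by hand, the following is a minimal Python sketch. It reuses the principal ARN and actions from the policy in this step; the helper name `build_stitch_bucket_policy` is made up for illustration and is not part of Stitch or AWS. Object-level actions such as s3:GetObject apply to object ARNs, so the sketch writes the second resource with a trailing `/*`.

```python
import json

# Principal and actions are copied from the bucket policy in this step.
STITCH_LOADER_ROLE = "arn:aws:iam::218546966473:role/LoaderDelta"
STITCH_ACTIONS = ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject"]


def build_stitch_bucket_policy(bucket_name: str) -> dict:
    """Return the bucket policy document for `bucket_name` (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Id": "Policy-LoaderDelta-MWF",
        "Statement": [
            {
                "Sid": "Stmt123",
                "Effect": "Allow",
                "Principal": {"AWS": [STITCH_LOADER_ROLE]},
                "Action": STITCH_ACTIONS,
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself (s3:ListBucket)
                    f"arn:aws:s3:::{bucket_name}/*",  # objects in the bucket
                ],
            }
        ],
    }


if __name__ == "__main__":
    # The bucket name below is the example name used later in this tutorial.
    policy = build_stitch_bucket_policy("stitch-databricks-delta-bucket")
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Bucket policy editor in place of the hand-edited version.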

The table below lists the Amazon S3 permissions Stitch requires to connect to and load data into Databricks Delta Lake.

Privilege name     Reason for requirement
s3:DeleteObject    Required to remove obsolete staging tables during loading.
s3:GetObject       Required to read objects in an S3 bucket.
s3:ListBucket      Required to determine if an S3 bucket exists, if access to the bucket is allowed, and to list the objects in the bucket.
s3:PutObject       Required to add objects, such as files, to an S3 bucket.
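As a quick sanity check before saving, you could confirm that a policy document grants all four privileges listed above. The sketch below is an illustration only: `missing_stitch_actions` is a hypothetical helper, and it does not expand wildcards such as `s3:*` or inspect the principal and resources.

```python
import json

# The four privileges listed in this step.
REQUIRED_ACTIONS = {"s3:DeleteObject", "s3:GetObject", "s3:ListBucket", "s3:PutObject"}


def missing_stitch_actions(policy_json: str) -> set:
    """Return the required actions that no Allow statement in the policy names."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be written as an object
        statements = [statements]

    granted = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be written as a string
            actions = [actions]
        granted.update(actions)

    return REQUIRED_ACTIONS - granted
```

An empty set means every required action is named explicitly in an Allow statement.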

Step 1.2: Grant Databricks access to your Amazon S3 bucket

Next, you’ll configure your AWS account to allow access from Databricks by creating an IAM role and policy. This is required to complete loading data into Databricks Delta Lake.

Follow steps 1-4 in Databricks’ documentation to create the IAM policy and role for Databricks.


Step 2: Configure access in Databricks

Step 2.1: Add the Databricks S3 IAM role to Databricks

Follow step 5 in this Databricks guide to add the IAM role you created for Databricks in Step 1.2 to your Databricks account.

After the Databricks IAM role has been added using the Databricks Admin Console, proceed to the next step.

Step 2.2: Create a Databricks cluster

  1. Sign in to your Databricks account.
  2. Click the Clusters option on the left side of the page.
  3. Click the + Create Cluster button.
  4. In the Cluster Name field, enter a name for the cluster.
  5. In the Databricks Runtime Version field, select a version that’s 6.3 or higher. This is required for Databricks Delta Lake to work with Stitch:

    Databricks Runtime Version field with version Runtime: 6.3 selected

  6. In the Advanced Options section, locate the IAM Role field.
  7. In the dropdown menu, select the Databricks IAM role you added to your account in the previous step.
  8. When finished, click the Create Cluster button to create the cluster.
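If you track cluster settings in scripts, the 6.3 minimum can be checked programmatically. The sketch below assumes runtime version strings begin with `major.minor` (for example `6.3` or `6.3.x-scala2.11`); that shape, and the helper name `meets_minimum_runtime`, are assumptions for illustration rather than something this tutorial specifies.

```python
import re

MINIMUM_RUNTIME = (6, 3)  # minimum Databricks Runtime version stated in this step


def meets_minimum_runtime(version: str) -> bool:
    """Return True if the leading major.minor of `version` is 6.3 or higher."""
    match = re.match(r"(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognized runtime version: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Compare as integer tuples so that, for example, 6.10 ranks above 6.3.
    return (major, minor) >= MINIMUM_RUNTIME
```

Comparing tuples of integers avoids the pitfall of treating versions as decimals, where 6.10 would sort below 6.3.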

Step 2.3: Retrieve the Databricks cluster's JDBC URL

Next, you’ll retrieve your Databricks cluster’s JDBC URL.

  1. On the Clusters page in Databricks, click the cluster you created in the previous step.
  2. Open the Advanced Options section.
  3. Click the JDBC/ODBC tab.
  4. Locate the JDBC URL field and copy the value:

    The Advanced Options section of the Cluster Details page in Databricks

Keep this value handy. You’ll need it to complete the setup in Stitch.
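Before pasting the URL into Stitch, it can help to confirm that it was copied in full. The sketch below assumes the copied value is a `jdbc:` URL followed by semicolon-separated `key=value` properties; the helper name `parse_jdbc_properties` and the sample URL in the usage note are hypothetical placeholders, not values from a real workspace.

```python
def parse_jdbc_properties(jdbc_url: str) -> dict:
    """Split a semicolon-delimited JDBC URL into its base URL and key=value properties."""
    jdbc_url = jdbc_url.strip()
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("Expected a URL beginning with 'jdbc:'")

    base, *pairs = jdbc_url.split(";")
    properties = {"base": base}
    for pair in pairs:
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        properties[key] = value
    return properties
```

For example, calling it on the made-up value `jdbc:spark://example.cloud.databricks.com:443/default;ssl=1;httpPath=sql/protocolv1/o/0/0000-000000-example` returns a dictionary whose `base` entry holds the host portion and whose remaining entries hold the individual properties, which makes a truncated copy easy to spot.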

Step 2.4: Generate a Databricks access token

  1. Click the user profile icon in the upper right corner of your Databricks workspace.
  2. Click User Settings.
  3. Click the Access Tokens tab:

    The Access Tokens tab in the User Settings page of Databricks

  4. In the tab, click the Generate New Token button.

    The Generate New Token window in Databricks

  5. In the window that displays, enter the following:
    • Comment: Stitch destination
    • Lifetime (days): Leave this field blank. If you enter a value, your token will eventually expire and break the connection to Stitch.
  6. Click Generate.

    A newly generated access token in Databricks

  7. Copy the token somewhere secure. Databricks will only display the token once.
  8. Click Done after you copy the token.
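To confirm the token works before entering it in Stitch, one option is to call the Databricks REST API with it. The sketch below only builds the request; it assumes the REST API 2.0 `clusters/list` endpoint and bearer-token authentication, and the helper name and the commented-out workspace URL and token variables are placeholders you would replace with your own.

```python
import urllib.request


def build_token_check_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that lists clusters using the token."""
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Sending the request requires network access and a real workspace, for example:
# request = build_token_check_request(my_workspace_url, my_token)
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

A successful response suggests the token is accepted by the workspace. Avoid hard-coding the token in scripts; read it from an environment variable or a secrets store instead.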

Step 3: Connect Stitch

  1. If you aren’t signed in to your Stitch account, sign in now.
  2. Click the Destination tab.

  3. Locate and click the Databricks Delta Lake icon.
  4. Fill in the fields as follows:

    • Access Token: Paste the access token you generated in Step 2.4.

    • JDBC URL: Paste the JDBC URL you retrieved in Step 2.3.

    • Bucket Name: Enter the name of the Amazon S3 bucket you configured in Step 1. Enter only the bucket name, without a URL, an https:// or s3:// prefix, or a path. For example: stitch-databricks-delta-bucket

When finished, click Check and Save.

Stitch will perform a connection test to the Databricks Delta Lake database; if successful, a Success! message will display at the top of the screen. Note: This test may take a few minutes to complete.
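If the connection test fails, a common cause is a Bucket Name value that includes a URL or prefix. The following sketch is a rough check based on Amazon S3's general naming rules for buckets (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). The helper name `is_bare_bucket_name` is hypothetical, and the check is not how Stitch itself validates the field.

```python
import re

# Rough approximation of S3 bucket naming rules: 3-63 characters made up of
# lowercase letters, digits, dots, and hyphens, starting and ending with a
# letter or digit. It does not cover every rule (for example, adjacent dots).
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_bare_bucket_name(value: str) -> bool:
    """Return True if `value` looks like a bucket name rather than a URL or ARN."""
    return _BUCKET_NAME.fullmatch(value) is not None
```

Values such as `s3://stitch-databricks-delta-bucket` fail the check because of the colon and slashes. A hostname-style value (a bucket name followed by an S3 domain) would still pass, since dots are legal in bucket names, so treat a passing result as a hint rather than proof.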

