Reference documentation for Stitch’s Amazon S3 destination, including info about Stitch features, replication, and transformations.

This guide serves as a reference for version 1 of Stitch’s Amazon S3 destination.


Details and features

Stitch features

High-level details about Stitch’s implementation of Amazon S3, such as supported connection methods, availability on Stitch plans, etc.

Release status

Released

Stitch plan availability

All Stitch plans

Stitch supported regions
  • North America (AWS us-east-1)
  • Europe (AWS eu-central-1)

Operating regions determine the location of the resources Stitch uses to process your data. Learn more.

Supported versions

Not applicable

Connect API availability Supported

This version of the Amazon S3 destination can be created and managed using Stitch’s Connect API. Learn more.

SSH connections

Not applicable

SSL connections

Not applicable

VPN connections

Not applicable

Static IP addresses Supported

This version of the Amazon S3 destination has static IP addresses that can be whitelisted.

Default loading behavior

Append-Only
Note: Existing rows will not be updated in the destination. Learn more.

Nested structure support

Depends on data storage format (CSV or JSON)

Destination details

Details about the destination, including object names, table and column limits, reserved keywords, etc.

Note: Exceeding the limits noted below will result in loading errors or rejected data.

Maximum record size

20MB

Table name length

1,024 bytes

Column name length

None

Maximum columns per table

None

Maximum table size

None

Maximum tables per database

None

Case sensitivity

Not applicable

Reserved keywords

None

Amazon S3 pricing

Amazon S3 pricing is based on two factors: The amount of data stored in and location (region) of your Amazon S3 bucket.

To learn more about pricing, refer to Amazon’s S3 pricing page. Note: Remember to select the correct region to view accurate pricing.


Replication

Replication process overview

A Stitch replication job consists of three stages:

Step 1: Data extraction

Stitch requests and extracts data from a data source. Refer to the System overview guide for a more detailed explanation of the Extraction phase.

Step 2: Preparation

During this phase, the extracted data is buffered in Stitch’s durable, highly available internal data pipeline and readied for loading. Refer to the System overview guide for a more detailed explanation of the Preparation phase.

Step 3: Loading

During this phase, the prepared data is transformed to be compatible with the destination, and then loaded. For Amazon S3 destinations, data is loaded in an Append-Only fashion into the file format (.csv or .jsonl) you select during destination setup. Refer to the Loading behavior section for more info and examples.

Note: The transformations Stitch performs when loading data into Amazon S3 depends on the data storage format selected during destination setup. Refer to the Transformations section for more info.

Loading behavior

When data is loaded into an Amazon S3 destination, it will be loaded in an Append-Only fashion. This means that:

  • A new CSV or JSON file for every table replicated is created during each load. A single table in the source will correspond to multiple files in the destination.
  • Existing records - that is, records in files already in the destination - are never updated
  • Data will not be de-duped, meaning that multiple versions of the same record may exist across multiple files in the data warehouse

Because of this loading strategy, querying may require a different approach than usual. Using some of the system columns Stitch inserts into tables will enable you to locate the latest version of a record at query time. Refer to the Querying Append-Only Tables documentation for more info.

Example: Key-based Incremental Replication

Below is an example of how tables using Key-based Incremental Replication will be loaded into Amazon S3:

Example Amazon S3 data loading diagram

File structure

The file structure of your integrations’ data in your Amazon S3 bucket depends on two destination parameters:

  1. The definition of the Object Key, and
  2. The selected data storage format (CSV or JSON)

Object Keys and file structure

Amazon S3 uses what is called an Object Key to uniquely identify objects in a bucket. During the Stitch setup process, you have the option of using our default Object Key or defining your own using a handful of Stitch-approved elements. Refer to the Amazon S3 Setup instructions for more info on the available elements.

The S3 Key setting determines the convention Stitch uses to create Object Keys when it writes to your bucket. It also defines the folder structure of Stitch-replicated data.

Below is the default Key and two examples of an Object Key that an integration named salesforce-prod might produce:

/* Default Key */
[integration_name]/[table_name]/[table_version]_[timestamp_loaded].[csv|jsonl]


/* Example Object Keys */
  - salesforce-prod/account/1_1519235654474.[csv|jsonl]
  - salesforce-prod/opportunity/1_1519327555000.[csv|jsonl]

As previously mentioned, the S3 Key also determines the folder structure of replicated data. In the AWS console, the folder structure for the salesforce-prod integration would look like the following:

.
└── salesforce-prod
    └── account
    |   └── 1_1519235654474.[csv|jsonl]
    └── opportunity
    |   └── 1_1519327555000.[csv|jsonl]
    └── _sdc_rejected
        └── 1_[timestamp].jsonl
        └── 1_[timestamp].jsonl

Data storage formats

Stitch will store replicated data in the format you select during the initial setup of Amazon S3. Currently Stitch supports storing data in CSV or JSON format for Amazon S3 destinations.

The tabs below contain an example of raw source data and how it would be stored in Amazon S3 for each data storage format type.

{
   "contacts":[
      {
         "id":2078178,
         "name":"Bubblegum",
         "phone_numbers":[
            {
               "mobile":"0987654321",
               "work":"7896541230"
            }
         ],
         "personas":[
            {
               "id":"persona_1",
               "type":"Princess"
            },
            {
               "id":"persona_2",
               "type":"Heroine"
            }
         ],
         "updated_at":"2018-01-01T00:59:16Z"
      }
   ]
}

Top-level Table

In Amazon S3, this data would create a file named contacts/1_[timestamp].csv, which would look like this:

id name phone_numbers__mobile phone_numbers__work updated_at
2078178 Bubblegum 0987654321 7896541230 2018-01-01T00:59:16Z

While objects (like phone_numbers) will be flattened into the table, arrays are handled differently.

Subfiles

Arrays will be de-nested and flattened into subfiles. In this example, the name of the file would be contacts/personas/1_[timestamp].csv:

_sdc_source_key_ id _sdc_level_0_id id type
2078178 0 persona_1 Princess
2078178 1 persona_2 Heroine

For more info and examples on how Stitch flattens nested data structures, refer to the Nested Data Structures guide.

With the exception of the _sdc columns, Stitch will store replicated data intact as .jsonl files. In this example, the name of the file would be contacts/1_[timestamp].jsonl:

{
   "id":2078178,
   "name":"Bubblegum",
   "phone_numbers":[
      {
         "mobile":"0987654321",
         "work":"7896541230"
      }
   ],
   "personas":[
      {
         "id":"persona_1",
         "type":"Princess"
      },
      {
         "id":"persona_2",
         "type":"Heroine"
      }
   ],
   "updated_at":"2018-01-01T00:59:16Z",
   "_sdc_extracted_at":"2018-01-01T01:10:53Z",
   "_sdc_received_at":"2018-01-01T01:10:53Z",
   "_sdc_batched_at":"2018-01-01T01:11:04Z",
   "_sdc_table_version":0,
   "_sdc_sequence":1514769053000
}

Incompatible sources

No compatibility issues have been discovered between Amazon S3 and Stitch's integration offerings.

See all destination and integration incompatibilities.


Transformations

Note: Aside from these transformations, the data loaded into Amazon S3 is in its raw form.

System tables and columns

Note: This applies to all data storage formats.

For every integration you connect, an _sdc_rejected folder will be created in the integration’s directory in Amazon S3. For every load where a rejection occurs, a .jsonl file containing data about the rejection will be placed in the _sdc_rejected folder. Note: These files will be .jsonl, even if .csv is selected as the file format during destination setup.

Additionally, Stitch will insert system columns (prepended with _sdc) into each file (.csv or .jsonl) created for each table.

JSON structures

How JSON structures are stored in Amazon S3 destinations depends on the data storage format selected during destination setup:

  • CSV: Nested JSON structures will be flattened into relational objects. This means that nested maps (JSON objects) will be flattened into the CSV file, and nested records (JSON arrays) will be de-nested into subfiles.

  • JSON: Nested JSON structures are stored as-is, as all data is stored in JSON files.

Refer to the Data storage formats section for examples of how nested data will be stored for each data storage format.


Compare destinations

Not sure if Amazon S3 is the data warehouse for you? Check out the Choosing a Stitch Destination guide to compare each of Stitch’s destination offerings.


Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.