Amazon S3 Destination Reference | Stitch Documentation

Reference documentation for Stitch’s Amazon S3 destination, including info about Stitch features, replication, and transformations.

This guide serves as a reference for version 1 of Stitch’s Amazon S3 destination.

Details and features

Stitch features
Destination details
Amazon S3 pricing

Stitch features

High-level details about Stitch’s implementation of Amazon S3, such as supported connection methods, availability on Stitch plans, etc.

Release status	Released
Stitch plan availability	All Stitch plans
Stitch supported regions	North America (AWS us-east-1) Europe (AWS eu-central-1) Operating regions determine the location of the resources Stitch uses to process your data. Learn more.
Supported versions	Not applicable
Connect API availability	Supported This version of the Amazon S3 destination can be created and managed using Stitch’s Connect API. Learn more.
SSH connections	Not applicable
SSL connections	Not applicable
VPN connections	Not applicable
Static IP addresses	Supported This version of the Amazon S3 destination has static IP addresses that can be whitelisted.
Default loading behavior	Append-Only Note: Existing rows will not be updated in the destination. Learn more.
Nested structure support	Depends on data storage format (CSV or JSON)

Destination details

Details about the destination, including object names, table and column limits, reserved keywords, etc.

Note: Exceeding the limits noted below will result in loading errors or rejected data.

Maximum record size	20MB
Table name length	1,024 bytes
Column name length	None
Maximum columns per table	None
Maximum table size	None
Maximum tables per database	None
Case sensitivity	Not applicable
Reserved keywords	None

Amazon S3 pricing

Amazon S3 pricing is based on two factors: The amount of data stored in and location (region) of your Amazon S3 bucket.

To learn more about pricing, refer to Amazon’s S3 pricing page. Note: Remember to select the correct region to view accurate pricing.

Replication process overview

A Stitch replication job consists of three stages:

Step 1: Data extraction
Step 2: Preparation
Step 3: Loading

Step 1: Data extraction

Stitch requests and extracts data from a data source. Refer to the System overview guide for a more detailed explanation of the Extraction phase.

Step 2: Preparation

During this phase, the extracted data is buffered in Stitch’s durable, highly available internal data pipeline and readied for loading. Refer to the System overview guide for a more detailed explanation of the Preparation phase.

Step 3: Loading

During this phase, the prepared data is transformed to be compatible with the destination, and then loaded. For Amazon S3 destinations, data is loaded in an Append-Only fashion into the file format (.csv or .jsonl) you select during destination setup. Refer to the Loading behavior section for more info and examples.

Note: The transformations Stitch performs when loading data into Amazon S3 depends on the data storage format selected during destination setup. Refer to the Transformations section for more info.

Loading behavior

When data is loaded into an Amazon S3 destination, it will be loaded in an Append-Only fashion. This means that:

A new CSV or JSON file for every table replicated is created during each load. A single table in the source will correspond to multiple files in the destination.
Existing records - that is, records in files already in the destination - are never updated
Data will not be de-duped, meaning that multiple versions of the same record may exist across multiple files in the data warehouse

Because of this loading strategy, querying may require a different approach than usual. Using some of the system columns Stitch inserts into tables will enable you to locate the latest version of a record at query time. Refer to the Querying Append-Only Tables documentation for more info.

Example: Key-based Incremental Replication

Below is an example of how tables using Key-based Incremental Replication will be loaded into Amazon S3:

Example Amazon S3 data loading diagram

File structure

The file structure of your integrations’ data in your Amazon S3 bucket depends on two destination parameters:

The definition of the Object Key, and
The selected data storage format (CSV or JSON)

Object Keys and file structure

Amazon S3 uses what is called an Object Key to uniquely identify objects in a bucket. During the Stitch setup process, you have the option of using our default Object Key or defining your own using a handful of Stitch-approved elements. Refer to the Amazon S3 Setup instructions for more info on the available elements.

The S3 Key setting determines the convention Stitch uses to create Object Keys when it writes to your bucket. It also defines the folder structure of Stitch-replicated data.

Below is the default Key and two examples of an Object Key that an integration named salesforce-prod might produce:

/* Default Key */
[integration_name]/[table_name]/[table_version]_[timestamp_loaded].[csv|jsonl]


/* Example Object Keys */
  - salesforce-prod/account/1_1519235654474.[csv|jsonl]
  - salesforce-prod/opportunity/1_1519327555000.[csv|jsonl]

As previously mentioned, the S3 Key also determines the folder structure of replicated data. In the AWS console, the folder structure for the salesforce-prod integration would look like the following:

.
└── salesforce-prod
    └── account
    |   └── 1_1519235654474.[csv|jsonl]
    └── opportunity
    |   └── 1_1519327555000.[csv|jsonl]
    └── _sdc_rejected
        └── 1_[timestamp].jsonl
        └── 1_[timestamp].jsonl

Data storage formats

Stitch will store replicated data in the format you select during the initial setup of Amazon S3. Currently Stitch supports storing data in CSV or JSON format for Amazon S3 destinations.

The tabs below contain an example of raw source data and how it would be stored in Amazon S3 for each data storage format type.

{
   "contacts":[
      {
         "id":2078178,
         "name":"Bubblegum",
         "phone_numbers":[
            {
               "mobile":"0987654321",
               "work":"7896541230"
            }
         ],
         "personas":[
            {
               "id":"persona_1",
               "type":"Princess"
            },
            {
               "id":"persona_2",
               "type":"Heroine"
            }
         ],
         "updated_at":"2018-01-01T00:59:16Z"
      }
   ]
}

Top-level Table

In Amazon S3, this data would create a file named contacts/1_[timestamp].csv, which would look like this:

id	name	phone_numbers__mobile	phone_numbers__work	updated_at
2078178	Bubblegum	0987654321	7896541230	2018-01-01T00:59:16Z

While objects (like phone_numbers) will be flattened into the table, arrays are handled differently.

Subfiles

Arrays will be de-nested and flattened into subfiles. In this example, the name of the file would be contacts/personas/1_[timestamp].csv:

_sdc_source_key_ id	_sdc_level_0_id	id	type
2078178	0	persona_1	Princess
2078178	1	persona_2	Heroine

For more info and examples on how Stitch flattens nested data structures, refer to the Nested Data Structures guide.

With the exception of the _sdc columns, Stitch will store replicated data intact as .jsonl files. In this example, the name of the file would be contacts/1_[timestamp].jsonl:

{
   "id":2078178,
   "name":"Bubblegum",
   "phone_numbers":[
      {
         "mobile":"0987654321",
         "work":"7896541230"
      }
   ],
   "personas":[
      {
         "id":"persona_1",
         "type":"Princess"
      },
      {
         "id":"persona_2",
         "type":"Heroine"
      }
   ],
   "updated_at":"2018-01-01T00:59:16Z",
   "_sdc_extracted_at":"2018-01-01T01:10:53Z",
   "_sdc_received_at":"2018-01-01T01:10:53Z",
   "_sdc_batched_at":"2018-01-01T01:11:04Z",
   "_sdc_table_version":0,
   "_sdc_sequence":1514769053000
}

Incompatible sources

No compatibility issues have been discovered between Amazon S3 and Stitch's integration offerings.

See all destination and integration incompatibilities.

Transformations

Note: Aside from these transformations, the data loaded into Amazon S3 is in its raw form.

System tables and columns
JSON structures

System tables and columns

Note: This applies to all data storage formats.

For every integration you connect, an _sdc_rejected folder will be created in the integration’s directory in Amazon S3. For every load where a rejection occurs, a .jsonl file containing data about the rejection will be placed in the _sdc_rejected folder. Note: These files will be .jsonl, even if .csv is selected as the file format during destination setup.

Additionally, Stitch will insert system columns (prepended with _sdc) into each file (.csv or .jsonl) created for each table.

JSON structures

How JSON structures are stored in Amazon S3 destinations depends on the data storage format selected during destination setup:

CSV: Nested JSON structures will be flattened into relational objects. This means that nested maps (JSON objects) will be flattened into the CSV file, and nested records (JSON arrays) will be de-nested into subfiles.
JSON: Nested JSON structures are stored as-is, as all data is stored in JSON files.

Refer to the Data storage formats section for examples of how nested data will be stored for each data storage format.

Compare destinations

Not sure if Amazon S3 is the data warehouse for you? Check out the Choosing a Stitch Destination guide to compare each of Stitch’s destination offerings.

Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.

Related	Troubleshooting
Choosing a Destination Destination & Integration Compatibility Loading Data into Your Destination Switching Destinations	Destination Connection Errors