Amazon S3 is a simple, reliable, and cost-effective object store that provides nearly limitless capacity for safely storing data in the cloud. Its flexibility lets you not only persist data ranging from bytes to petabytes, but also consume it with a myriad of tools like Amazon Athena and Qubole.


Pricing

Amazon S3 pricing is based on two factors: the amount of data stored in your Amazon S3 bucket and the bucket's location (region).

To learn more about pricing, refer to Amazon’s S3 pricing page. Note: Remember to select the correct region to view accurate pricing.


Setup

To use Amazon S3 as your Stitch destination, you’ll need to:

  • Create an S3 bucket,
  • Allow Stitch to access the bucket,
  • Define the format data will be stored in, and
  • Define the bucket’s Key, which determines how files are organized in the bucket
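As a sketch of the second step, the policy below would grant an external principal permission to write objects into the bucket. The principal ARN and bucket name here are placeholders, not Stitch's actual values; refer to the Amazon S3 Setup instructions for the account Stitch actually connects as.

```python
import json

# Placeholder principal -- use the value from the Stitch setup
# instructions, not this one.
STITCH_PRINCIPAL = "arn:aws:iam::000000000000:root"
BUCKET = "com-stitch-prod-s3-bucket"  # example bucket name

# A minimal bucket policy granting that principal the operations a
# loader needs: writing objects and listing the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": STITCH_PRINCIPAL},
            "Action": ["s3:PutObject", "s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

You would attach a policy like this in the AWS console under the bucket's Permissions tab.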

Spin up an Amazon S3 data warehouse


Replication

A Stitch replication job consists of three stages: Extraction, Preparation, and Loading.

Data extraction

During the Extraction phase, Stitch will check for structural changes to your data, query for data according to the integration’s replication settings, and extract the appropriate data.

Replication settings include the integration’s Replication Schedule, the data set to replicate, and the selected tables’ Replication Methods.

Note: Because Stitch’s Incremental Replication Method is inclusive, a single row will be replicated for every Incremental table even if there’s no new or updated data. Refer to the Replication Methods documentation for an explanation and examples.

Data preparation/transformations for Amazon S3

During the Preparation phase, Stitch applies some light transformations to the extracted data to ensure compatibility with the destination.

The transformations Stitch performs depend on the selected data storage format (CSV or JSON). Aside from these transformations, the data loaded into Amazon S3 is in its raw form. Refer to the Schema section for more info and examples.

For both the CSV and JSON storage formats:

  • Stitch (_sdc) system columns are inserted into every table

Loading data into Amazon S3

During Loading, Stitch loads the extracted data into the destination. For Amazon S3 destinations, data is loaded in an Append-Only fashion.

This means that:

  • A new CSV or JSON file for every table replicated is created during each load
  • Existing records - that is, records already in the bucket - are never updated
  • Data will not be de-duped, meaning that multiple versions of the same record may exist in the data warehouse

Because of this loading strategy, querying may require a different strategy than usual. Using some of the system columns Stitch inserts into tables will enable you to locate the latest version of a record at query time. Refer to the Querying Append-Only Tables documentation for more info.
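As a sketch of "latest version wins" deduplication at query time, the snippet below keeps, for each primary key, the row with the highest _sdc_sequence value. The rows are illustrative sample data, not real Stitch output.

```python
# Illustrative append-only rows: id 1 appears twice because a newer
# version of the record was loaded in a later batch.
rows = [
    {"id": 1, "name": "Bubblegum", "_sdc_sequence": 1514769053000},
    {"id": 1, "name": "Princess Bubblegum", "_sdc_sequence": 1514769099000},
    {"id": 2, "name": "Marceline", "_sdc_sequence": 1514769053000},
]

latest = {}
for row in rows:
    key = row["id"]
    # Keep only the row with the highest _sdc_sequence per primary key.
    if key not in latest or row["_sdc_sequence"] > latest[key]["_sdc_sequence"]:
        latest[key] = row

print(sorted(r["name"] for r in latest.values()))
```

In SQL-based query engines the same idea is typically expressed with a ROW_NUMBER() window partitioned by the primary key and ordered by _sdc_sequence descending, keeping only the first row.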

Example: Incremental Replication

Below is an example of how incrementally replicated tables will be loaded into Amazon S3:

Example Amazon S3 data loading diagram


Schema

The file structure of your integrations’ data in your Amazon S3 bucket depends on two destination parameters:

  1. The definition of the Object Key, and
  2. The selected data storage format (CSV or JSON)

Object Keys and file structure

Amazon S3 uses what is called an Object Key to uniquely identify objects in a bucket. During the Stitch setup process, you have the option of using our default Object Key or defining your own using a handful of Stitch-approved elements. Refer to the Amazon S3 Setup instructions for more info on the available elements.

The S3 Key setting determines the convention Stitch uses to create Object Keys when it writes to your bucket. It also defines the folder structure of Stitch-replicated data.

Below is the default Key and two examples of an Object Key that an integration named salesforce-prod might produce:

/* Default Key */
[integration_name]/[table_name]/[table_version]_[timestamp_loaded].[csv|jsonl]


/* Example Object Keys */
  - salesforce-prod/account/1_1519235654474.[csv|jsonl]
  - salesforce-prod/opportunity/1_1519327555000.[csv|jsonl]
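The default Key pattern can be sketched as a simple format string. The helper function below is illustrative, not part of Stitch; it just shows how the elements combine into an Object Key.

```python
def default_object_key(integration_name, table_name, table_version,
                       timestamp_loaded_ms, extension):
    # Mirrors the default Key pattern:
    # [integration_name]/[table_name]/[table_version]_[timestamp_loaded].[csv|jsonl]
    return (f"{integration_name}/{table_name}/"
            f"{table_version}_{timestamp_loaded_ms}.{extension}")

key = default_object_key("salesforce-prod", "account", 1, 1519235654474, "csv")
print(key)  # salesforce-prod/account/1_1519235654474.csv
```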

As previously mentioned, the S3 Key also determines the folder structure of replicated data. In the AWS console, the folder structure for the salesforce-prod integration would look like the following:

.
└── salesforce-prod
    ├── account
    │   └── 1_1519235654474.[csv|jsonl]
    ├── opportunity
    │   └── 1_1519327555000.[csv|jsonl]
    └── _sdc_rejected
        ├── 1_[timestamp].jsonl
        └── 1_[timestamp].jsonl

_sdc_rejected

For every integration you connect, an _sdc_rejected folder will be created in the integration’s directory in Amazon S3. _sdc_rejected acts as a log for records rejected during the loading process. For every load where a rejection occurs, a .jsonl file containing data about the rejection will be placed in the _sdc_rejected folder.
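Because rejection logs live under a predictable prefix, spotting them among replicated object keys is a simple filter. The keys below follow the folder structure shown above and are illustrative only.

```python
# Example object keys following the structure described above.
keys = [
    "salesforce-prod/account/1_1519235654474.csv",
    "salesforce-prod/_sdc_rejected/1_1519235654474.jsonl",
    "salesforce-prod/opportunity/1_1519327555000.csv",
]

# Rejection logs always sit in the integration's _sdc_rejected folder.
rejected = [k for k in keys if "/_sdc_rejected/" in k]
print(rejected)
```

Against a live bucket, you could apply the same prefix (for example, salesforce-prod/_sdc_rejected/) when listing objects rather than filtering client-side.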

Data storage formats

Stitch will store replicated data in the format you select during the initial setup of Amazon S3. Currently Stitch supports storing data in CSV or JSON format for Amazon S3 destinations.

Below is an example of raw source data, followed by how it would be stored in Amazon S3 for each data storage format type.

{
   "contacts":[
      {
         "id":2078178,
         "name":"Bubblegum",
         "phone_numbers":[
            {
               "mobile":"0987654321",
               "work":"7896541230"
            }
         ],
         "personas":[
            {
               "id":"persona_1",
               "type":"Princess"
            },
            {
               "id":"persona_2",
               "type":"Heroine"
            }
         ],
         "updated_at":"2018-01-01T00:59:16Z"
      }
   ]
}

Top-level Table

For the CSV storage format, this data would create a file in Amazon S3 named contacts/1_[timestamp].csv, which would look like this:

id,name,phone_numbers__mobile,phone_numbers__work,updated_at
2078178,Bubblegum,0987654321,7896541230,2018-01-01T00:59:16Z

While objects (like phone_numbers) will be flattened into the table, arrays are handled differently.

Subtables

Arrays will be de-nested and flattened into subtables. In this example, the name of the file would be contacts/personas/1_[timestamp].csv:

_sdc_source_key_id,_sdc_level_0_id,id,type
2078178,0,persona_1,Princess
2078178,1,persona_2,Heroine

For more info and examples on how Stitch flattens nested data structures, refer to the Nested Data Structures guide.
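The de-nesting above can be sketched in a few lines. The column names mirror the subtable example; the code is an illustration of the pattern, not Stitch's actual implementation.

```python
# A parent record whose "personas" array de-nests into a subtable.
record = {
    "id": 2078178,
    "personas": [
        {"id": "persona_1", "type": "Princess"},
        {"id": "persona_2", "type": "Heroine"},
    ],
}

subtable = [
    {
        "_sdc_source_key_id": record["id"],  # parent record's primary key
        "_sdc_level_0_id": index,            # position within the array
        "id": persona["id"],
        "type": persona["type"],
    }
    for index, persona in enumerate(record["personas"])
]

for row in subtable:
    print(row)
```

The _sdc_source_key_ and _sdc_level_0_id columns let you join subtable rows back to the parent record and preserve the original array ordering.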

For the JSON storage format, Stitch will store replicated data intact as .jsonl files, with the exception of the _sdc columns it adds. In this example, the name of the file would be contacts/1_[timestamp].jsonl:

{
   "id":2078178,
   "name":"Bubblegum",
   "phone_numbers":[
      {
         "mobile":"0987654321",
         "work":"7896541230"
      }
   ],
   "personas":[
      {
         "id":"persona_1",
         "type":"Princess"
      },
      {
         "id":"persona_2",
         "type":"Heroine"
      }
   ],
   "updated_at":"2018-01-01T00:59:16Z",
   "_sdc_extracted_at":"2018-01-01T01:10:53Z",
   "_sdc_received_at":"2018-01-01T01:10:53Z",
   "_sdc_batched_at":"2018-01-01T01:11:04Z",
   "_sdc_table_version":0,
   "_sdc_sequence":1514769053000
}

Webhook loading notifications

Webhooks allow external services to be notified when an event happens. If you choose, you can configure a webhook for Stitch to notify you when data is successfully loaded into your bucket.

Webhook notifications are sent on a per-integration basis. This means that every time Stitch successfully loads data for an integration, a summary webhook will be sent to the URL you define.

Sample use cases

Enabling loading notifications ensures that you and your team are alerted when fresh data is available in Amazon S3. For example, you could:

  • Set up a webhook Zap in Zapier that sends a Slack notification whenever data is loaded for a specific integration
  • Use loading notifications to kick off an internal process, such as DBT or a script

Webhook request body

A webhook loading request body will look like the following:

{
  "integration_name": "salesforce-prod",
  "switch_view": false,
  "table_version": 1,
  "sequence_field": "_sdc_sequence",
  "s3_bucket": "com-stitch-prod-s3-bucket",
  "tables": [
    {
      "account": [
        {
          "name": "salesforce-prod/account/1_1519235654474.[csv|jsonl]",
          "pk": [
            "id"
          ]
        }
      ]
    },
    {
      "opportunity": [
        {
          "name": "salesforce-prod/opportunity/1_1519235654474.[csv|jsonl]",
          "pk": [
            "id"
          ]
        }
      ]
    }
  ]
}
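As a sketch of consuming this notification, the snippet below collects the loaded object keys per table from a payload shaped like the sample above. The parsing logic is an assumption about how you might handle the body, not a Stitch-provided helper.

```python
# A payload shaped like the sample webhook request body above
# (abbreviated, with concrete .csv extensions for illustration).
payload = {
    "integration_name": "salesforce-prod",
    "s3_bucket": "com-stitch-prod-s3-bucket",
    "tables": [
        {"account": [{"name": "salesforce-prod/account/1_1519235654474.csv",
                      "pk": ["id"]}]},
        {"opportunity": [{"name": "salesforce-prod/opportunity/1_1519235654474.csv",
                          "pk": ["id"]}]},
    ],
}

# Map each table name to the list of object keys loaded for it.
loaded_files = {}
for entry in payload["tables"]:
    for table_name, files in entry.items():
        loaded_files[table_name] = [f["name"] for f in files]

print(loaded_files)
```

A webhook receiver could use a mapping like this to kick off downstream jobs only for the tables that actually received new files.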

Webhook request body fields

integration_name
STRING

The name of the integration.

switch_view
BOOLEAN

Indicates if the load resulted in a switch-view operation.

table_version
INTEGER

The version of the table.

sequence_field
STRING

The name of the sequence field in each of the integration’s tables. This value will be _sdc_sequence.

s3_bucket
STRING

The name of the Amazon S3 bucket the data was loaded into.

tables
ARRAY

An array of tables that were successfully loaded for the integration.

Each array contains the following attributes:

[table_name]
ARRAY

The name of the array will be the name of the table. For example: account

This array contains objects, each with the following attributes:

name
STRING

The Object Key of the loaded file. For example: salesforce-prod/account/1_1519235654474.[csv|jsonl]

pk
ARRAY

An array of the Primary Key column(s) of the table.

For example: id, or _sdc_source_key_id, _sdc_level_0_id



Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.