Release Status Released Supported By Stitch
Availability Free Status Page Amazon S3 CSV Status Page
Default Historical Sync 1 year Default Replication Frequency 1 hour
Whitelisting Tables and columns Destination Incompatibilities None

Connecting Amazon S3 CSV

Amazon S3 CSV Setup requirements

To set up Amazon S3 CSV in Stitch, you need:

  • An Amazon Web Services (AWS) account. Signing up is free - click here or go to https://aws.amazon.com to create an account if you don’t have one already.

  • Permissions in AWS Identity Access Management (IAM) that allow you to create policies, create roles, and attach policies to roles. This is required to grant Stitch authorization to your S3 bucket.

  • Verify that column names in CSV files adhere to your destination’s length limit for column names. If a column name exceeds the destination’s limit, the destination will reject the column. Compliant columns will persist to the destination.

    Column name limits vary by destination:

    • Amazon S3 - Not applicable to this destination

    • BigQuery - Limited to 128 characters

    • data.world - Not applicable to this destination

    • Panoply - Limited to 115 characters

    • PostgreSQL - Limited to 59 characters

    • Redshift - Limited to 115 characters

    • Snowflake - Limited to 251 characters

Step 1: Retrieve your Amazon Web Services account ID

  1. Sign into your Amazon Web Services (AWS) account.
  2. Click the user menu, located between the bell and Global menus in the top-right corner of the page.
  3. Click My Account.
  4. In the Account Settings section of the page, locate the Account Id field:

    An AWS account ID, highlighted in the AWS Account Settings page

Keep this handy - you’ll need it to complete the next step.

Step 2: Add Amazon S3 CSV as a Stitch data source

  1. Sign into your Stitch account.
  2. On the Stitch Dashboard page, click the Add Integration button.

  3. Click the Amazon S3 icon.

  4. Enter a name for the integration. This is the name that will display on the Stitch Dashboard for the integration; it’ll also be used to create the schema in your destination.

    For example, the name “Stitch Amazon S3 CSV” would create a schema called stitch_amazon_s3_csv in the destination. Note: Schema names cannot be changed after you save the integration.

  5. In the S3 Bucket field, enter the name of bucket. Enter only the bucket name: No URLs, https, or S3 parts. For example: com-test-stitch-bucket
  6. In the AWS Account ID field, paste the account ID you retrieve in Step 1.

Step 3: Configure tables

Next, you’ll indicate which CSV file(s) you want to include for replication. You can include a single CSV file, or map several CSV files to a table.

In the following sections, we’ll walk you through how to configure a table in Stitch.

Step 3.1: Define the table's search pattern

The Search Pattern field accepts regular expressions, which can be used to include a single file or multiple CSV files. What you enter into this field depends on how data for a particular entity is updated.

If a single file is replaced in your S3 bucket at some interval, it would make sense to enter location of the file. For example: Customer data is added to and updated in a single file named customers.csv, located in the analytics folder of the bucket:

analytics/customers.csv           /* Single file, no escaped characters */

analytics\/customers\.csv         /* Same file with escaped special characters */

If you include special characters (/ or .) in the file location and want the expression to match exactly, you’ll need to escape them in the expression as we did in the example above.

In other cases, there may be multiple files that contain data for an entity. For example: Every day a new CSV file is generated with new/updated customer data, and it follows the naming convention of customers-YYYY-MM-DD.csv.

To ensure data is correctly captured, you’d want to enter a search pattern that would match all files beginning with customers, regardless of the date in the file name. This would map all files in the analytics folder that begin with customers to a single table:

analytics\/customers.*\.csv

This search pattern would match customers-2018-07-01.csv, customers-2018-07-02.csv, customers-2018-07-03.csv, etc., and ensure files are replicated as they’re created or updated.

Step 3.2: Define the table's name

When creating table names, keep in mind that each destination has its own rules for how tables can be named.

Step 3.3: Define the table's Primary Key

In the Primary Key field, enter one or more header fields (separated by commas) Stitch can use to identify unique rows. For example:

account_id,date

If undefined, Stitch will load data into the table in an append-only fashion. This means that existing rows in the destination won’t be updated, but will be appended to the end of the table. Refer to the Primary Keys and Append-Only Replication section below for more info and examples.

Step 3.4: Specify datetime fields

In the Specify datetime fields field, enter one or more header fields (separated by commas) that should appear in the destination table as datetime fields instead of strings. For example:

created_at,updated_at

If columns are not specified and values cannot be parsed as dates, Stitch will load them as nullable strings. Refer to the Determining data types section for more info on how Stitch identifies data types.

Step 3.5: Configure additional tables

If you want to add another table, click the Configure another table? link below the Specify datetime fields field. Otherwise, move onto the Sync historical data section.

Stitch doesn’t enforce a limit on the number of tables that you can configure for a single integration.

Step 4: Define the Historical Sync

The Sync Historical Data setting will define the starting date for your Amazon S3 CSV integration. For Amazon S3 CSV integrations, Stitch will replicate all files that match the search pattern that have been updated since the Start Date selected here.

Change this setting if you want to replicate data beyond Amazon S3 CSV’s default setting of 1 year.

For example: Let’s say we’ve added a customers.*\csv search pattern and set the integration’s historical Start Date to 1 year. During the initial replication job, Stitch will fully replicate the contents of all files that match the search pattern that have been modified in the past year.

During subsequent replication jobs, Stitch will only replicate the files that have been modified since the last job ran.

As files included in a replication job are replicated in full during each job, how data is added to updated files can impact your row count. Refer to the Incremental Replication for Amazon S3 CSV section for more info on initial and subsequent replication jobs for Amazon S3 CSV.

Step 5: Create a replication schedule

In the Replication Frequency section, you’ll create the integration’s replication schedule. An integration’s replication schedule determines how often Stitch runs a replication job, and the time that job begins.

Stitch offers two methods of creating a replication schedule:

  • Replication Frequency: This method requires selecting the interval you want replication to run for the integration. Start times of replication jobs are based on the start time and duration of the previous job. Refer to the Replication Frequency documentation for more information and examples.
  • Anchor scheduling: Based on the Replication Frequency, or interval, you select, this method “anchors” the start times of this integration’s replication jobs to a time you select to create a predictable schedule. Anchor scheduling is a combination of the Anchor Time and Replication Frequency settings, which must both be defined to use this method. Additionally, note that:

    • A Replication Frequency of at least one hour is required to use anchor scheduling.
    • An initial replication job may not begin immediately after saving the integration, depending on the selected Replication Frequency and Anchor Time. Refer to the Anchor Scheduling documentation for more information.

    • You’ll need to contact support to request using an Anchor Time with this integration.

To help prevent overages, consider setting the integration to replicate less frequently. See the Understanding and Reducing Your Row Usage guide for tips on reducing your usage.

Step 6: Grant access to your bucket using AWS IAM

Next, Stitch will display a Grant Access to Your Bucket page. This page contains the info you need to configure bucket access for Stitch, which is accomplished via an IAM policy and role.

Note: Saving the integration before you’ve completed the steps below will result in connection errors.

Step 6.1: Create an IAM policy

An IAM policy is JSON-based access policy language to manage permissions to bucket resources. The policy Stitch provides is an auto-generated policy unique to the specific bucket you entered in the setup page.

For more info about the top-level permissions the Stitch IAM policy grants, click the link below.

Permission Name Operation Operation Description
s3:GetObject GET Object

Allows for the retrieval of objects from Amazon S3.

HEAD Object

Allows for the retrieval of metadata from an object without returning the object itself.

s3:ListBucket GET Bucket (List Objects)

Allows for the return of some or all (up to 1,000) of the objects in a bucket.

HEAD Bucket

Used to determine if a bucket exists and access is allowed.

To create the IAM policy:

  1. In AWS, navigate to the IAM service by clicking the Services menu and typing IAM.
  2. Click IAM once it displays in the results.
  3. On the IAM home page, click Policies in the menu on the left side of the page.
  4. Click Create Policy.
  5. In the Create Policy page, click the JSON tab.
  6. Select everything currently in the text field and delete it.
  7. In the text field, paste the Stitch IAM policy.
  8. Click Review policy.
  9. On the Review Policy page, give the policy a name. For example: stitch_s3
  10. Click Create policy.

Step 6.2: Create an IAM role for Stitch

In this step, you’ll create an IAM role for Stitch and apply the IAM policy from the previous step. This will ensure that Stitch is visible in any logs and audits.

To create the role, you’ll need the Account ID and External ID values provided on the Stitch Grant Access to Your Bucket page.

Note: If you’re creating multiple Amazon S3 CSV integrations, you need to only complete this process once. After you create the Stitch role, you can just create an additional IAM policy and attach it to the role.

  1. In AWS, navigate to the IAM Roles page.
  2. Click Create Role.
  3. On the Create Role page:
    1. In the Select type of trusted entity section, click the Another AWS account option.
    2. In the Account ID field, paste the Account ID from Stitch. Note: This isn’t your AWS account ID from Step 1 - this is the Account ID that displays in Stitch on the Grant Access to Your Bucket page:

      Account ID and External ID fields mapped from Stitch to AWS

    3. In the Options section, check the Require external ID box.
    4. In the External ID field that displays, paste the External ID from Stitch.
    5. Click Next: Permissions.
  4. On the Attach permissions page:
    1. Search for the policy you created in Step 6.1.
    2. Once located, check the box next to it in the table.
    3. Click Next: Review.
  5. In the Role name field, type Stitch.
  6. Click Create role.

Step 6.3: Check and save the connection in Stitch

After you’ve created the IAM policy and role, you can save the integration in Stitch. When finished, click Check and Save.

Step 7: Set tables to replicate

To complete the setup, you’ll need to select tables you want to replicate to your data warehouse.

Check out the Schema section to learn more about how Amazon S3 CSV data will be structured once it’s loaded into your destination.

  1. In the list of tables that displays - or in the Tables to Replicate tab, if you skipped this step during setup - locate a table you want to replicate.
  2. To track a table, click the checkbox next to the table’s name. A green checkmark means the table is set to replicate.

  3. Repeat this process for all the tables you want to replicate.
  4. When finished, click the Finalize Your Selections button at the bottom of the screen to save your selections.

Note: If you change these settings while a replication job is still in progress, they will not be used until the next job starts.

Initial and historical replication jobs

After you finish setting up Amazon S3 CSV, its Sync Status may show as Pending on either the Stitch Dashboard or in the Integration Details page.

For a new integration, a Pending status indicates that Stitch is in the process of scheduling the initial replication job for the integration. This may take some time to complete.

Free historical data loads

The first seven days of replication, beginning when data is first replicated, are free. Rows replicated from the new integration during this time won’t count towards your quota. Stitch offers this as a way of testing new integrations, measuring usage, and ensuring historical data volumes don’t quickly consume your quota.


Amazon S3 CSV Replication

In this section:

Incremental Replication for Amazon S3 CSV

While data from Amazon S3 CSV integrations is replicated using Key-based Incremental Replication, the behavior for this integration subtly different from other integrations.

The table below compares Key-based Incremental Replication and Replication Key behavior for Amazon S3 CSV to that of other integrations.

Amazon S3 CSV Other integrations
What's used as a Replication Key?

The time a file is modified.

A column or columns in a table.

Are Replication Keys inclusive?

No. Only files with a modification timestamp value greater than the last saved bookmark are replicated.

Yes. Rows with a Replication Key value greater than or equal to the last saved bookmark are replicated.

What's replicated during a replication job?

The entire contents of a modified file.

Only new or updated rows in a table.

Frequently updated files and impact on row usage

Because modified files are replicated in full during each replication job, large, frequently updated files can quickly use up a large number of rows.

To reduce row usage, you could create new files that only include updated records that match the table’s search pattern. This will ensure that only updated records are replicated and counted towards your usage.

Primary Keys and Append-Only Replication

For destinations that support upserts (that is, updating existing rows), Stitch uses Primary Keys to de-dupe data during loading. Primary Keys are used to identify unique rows within a table and ensure that only the most recently updated version of that record appears in your data warehouse.

If Primary Keys aren’t specified during setup, Stitch will load data using Append-Only Replication. This means that existing rows in the destination won’t be updated, but instead appended to the end of the table.

Additionally, Stitch will append a column (_sdc_primary_key) to the table to function as a Primary Key. Note: Appending this column will not enable Stitch to de-dupe data, as a unique value will be inserted every time a row is loaded, regardless of whether it’s ever been replicated before. This means that a record can have multiple _sdc_primary_key values, each of them unique.

Example: Append-Only with _sdc_primary_key

For example: The following rows are replicated during the initial replication job:

_sdc_primary_key name type magic
b6c0fd8c-7dec-4e34-be93-2b774fde32cc Finn human false
4b5c413c-1adf-4720-8ccc-48579d6b4e58 Jake dog true

Before the next job, the CSV file containing these rows is modified. This means that Stitch will replicate the contents of the entire file, including the rows for Finn and Jake even if they haven’t been updated.

In the destination, the table might now look like the table below. Notice that records for Finn and Jake have been appended to the end of the table with new _sdc_primary_key values:

_sdc_primary_key name type magic
b6c0fd8c-7dec-4e34-be93-2b774fde32cc Finn human false
0acd439b-cefe-436c-b8ba-285bd956057b Finn human false
4b5c413c-1adf-4720-8ccc-48579d6b4e58 Jake dog true
7e9fa5cf-1739-45a2-9a89-caa6f393efc9 Jake dog true
634d6945-1762-4049-b997-cd9240d4592b Beamo robot true
c5fb32b8-a16d-455d-96c9-b62fff22fe4b Bubblegum princess true

Querying Append-Only tables

Querying Append-Only tables requires a different strategy than you might normally use. For instructions and a sample query, check out the Querying Append-Only tables guide.

Determining data types

To determine a column’s data type, Stitch will analyze the first 1,000 lines of (up to) the first five files included for a given table.

Stitch’s Amazon S3 CSV integration will load data from CSV files and type it as one of the following data types:

  • DATETIME - If a value can’t be parsed as a date, Stitch will load the column as a nullable string. To ensure dates are typed properly, specify them during setup.

  • INTEGER

  • NUMBER

  • STRING


Amazon S3 CSV table schemas

In this section:

Column name transformations

When loading data, Stitch may perform some light transformation on column names to ensure the names follow the destination’s rules for column names. This might include removing or replacing spaces or illegal characters such as !#*.

Note: Stitch will not truncate column names to make them adhere to a destination’s length limit. If a column’s name is too long, the destination will reject the column. Compliant columns will persist to the table.

In the table below are some examples of column names and how they’ll be transformed to fit each destination. Hover over the icon for info about the example.

customer ID $customer! id 123customerid _customerid
Amazon S3 customer ID $customer! id 123customerid _customerid
BigQuery customer_id _customer_id _customerid _customerid
data.world customer ID $customer! id 123customerid _customerid
Panoply customer id $customer id customerid _customerid
PostgreSQL customer ID customer id customerid _customerid
Redshift customer id $customer id customerid _customerid
Snowflake CUSTOMERID CUSTOMERID CUSTOMERID CUSTOMERID


Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.