| Release Status | Released | Supported By | Stitch |
| --- | --- | --- | --- |
| Availability | Free | Status Page | Amazon S3 CSV Status Page |
| Default Historical Sync | 1 year | Default Replication Frequency | 1 hour |
Connecting Amazon S3 CSV
Amazon S3 CSV Setup requirements
To set up Amazon S3 CSV in Stitch, you need:
An Amazon Web Services (AWS) account. Signing up is free - go to https://aws.amazon.com to create an account if you don’t have one already.
Permissions in AWS Identity Access Management (IAM) that allow you to create policies, create roles, and attach policies to roles. This is required to grant Stitch authorization to your S3 bucket.
Verify that column names in CSV files adhere to your destination’s length limit for column names. If a column name exceeds the destination’s limit, the destination will reject the column. Compliant columns will persist to the destination.
Column name limits vary by destination:
Amazon S3 - Not applicable to this destination
BigQuery - Limited to 128 characters
data.world - Not applicable to this destination
Panoply - Limited to 115 characters
PostgreSQL - Limited to 59 characters
Redshift - Limited to 115 characters
Snowflake - Limited to 251 characters
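As a pre-flight check, you could validate header lengths yourself before loading. Below is a minimal sketch; the limits dictionary mirrors the list above, and the helper name and CSV usage are illustrative, not part of Stitch:

```python
# Column name length limits per destination, from the list above.
# Destinations with no applicable limit (Amazon S3, data.world) are omitted.
COLUMN_NAME_LIMITS = {
    "bigquery": 128,
    "panoply": 115,
    "postgresql": 59,
    "redshift": 115,
    "snowflake": 251,
}

def oversized_columns(header, destination):
    """Return header fields that exceed the destination's column name limit."""
    limit = COLUMN_NAME_LIMITS.get(destination)
    if limit is None:  # no limit applies to this destination
        return []
    return [name for name in header if len(name) > limit]

# Usage: check the header row of a CSV file before syncing, e.g.
#   import csv
#   with open("customers.csv", newline="") as f:
#       header = next(csv.reader(f))
#       print(oversized_columns(header, "postgresql"))
```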
Step 1: Retrieve your Amazon Web Services account ID
- Sign into your Amazon Web Services (AWS) account.
- Click the user menu, located between the bell and Global menus in the top-right corner of the page.
- Click My Account.
- In the Account Settings section of the page, locate the Account Id field. Keep this handy - you’ll need it to complete the next step.
Step 2: Add Amazon S3 CSV as a Stitch data source
- Sign into your Stitch account.
- On the Stitch Dashboard page, click the Add Integration button.
- Click the Amazon S3 icon.
- Enter a name for the integration. This is the name that will display on the Stitch Dashboard for the integration; it’ll also be used to create the schema in your destination. For example, the name “Stitch Amazon S3 CSV” would create a schema called `stitch_amazon_s3_csv` in the destination. Note: Schema names cannot be changed after you save the integration.
- In the S3 Bucket field, enter the name of the bucket. Enter only the bucket name - no URLs, `https`, or S3 parts. For example:
- In the AWS Account ID field, paste the account ID you retrieved in Step 1.
Step 3: Configure tables
Next, you’ll indicate which CSV file(s) you want to include for replication. You can include a single CSV file, or map several CSV files to a table.
In the following sections, we’ll walk you through how to configure a table in Stitch.
Tips for setting up CSV files
For the best results:
Each file should have a header row with names that adhere to your destination’s limits for column names. If a column name exceeds the destination’s limit, the destination will reject the column.
If including multiple files in a table, each file should have the same header row. Including multiple files in a single table depends on the search pattern you define in the next step.
Note: This is not the same as configuring multiple tables. See the search pattern section below for examples.
Step 3.1: Define the table's search pattern
The Search Pattern field accepts regular expressions, which can be used to include a single file or multiple CSV files. What you enter into this field depends on how data for a particular entity is updated.
If a single file is replaced in your S3 bucket at some interval, it would make sense to enter the location of the file. For example: Customer data is added to and updated in a single file named `customers.csv`, located in the `analytics` folder of the bucket:

```
analytics/customers.csv       /* Single file, no escaped characters */
analytics\/customers\.csv     /* Same file with escaped special characters */
```

If you include special characters (such as `.`) in the file location and want the expression to match exactly, you’ll need to escape them in the expression, as in the second example above.
In other cases, there may be multiple files that contain data for an entity. For example: Every day a new CSV file is generated with new/updated customer data, and it follows a naming convention like `customers-2018-07-03.csv`.

To ensure data is correctly captured, you’d want to enter a search pattern that matches all files beginning with `customers`, regardless of the date in the file name. This would map all files in the `analytics` folder that begin with `customers` to a single table:

This search pattern would match `customers-2018-07-03.csv`, etc., and ensure files are replicated as they’re created or updated.
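You can sanity-check a search pattern against sample object keys with any regex tool before saving it. A quick sketch (the file names are illustrative, and whether Stitch anchors the match is an assumption here, so treat this as an approximation):

```python
import re

# An unanchored pattern matching all customers-*.csv files in the
# analytics folder, with the dot before "csv" escaped.
pattern = re.compile(r"analytics/customers.*\.csv")

keys = [
    "analytics/customers-2018-07-03.csv",
    "analytics/customers-2018-07-04.csv",
    "analytics/orders.csv",
]

# Only the two customers-*.csv keys match; orders.csv is excluded.
matched = [k for k in keys if pattern.search(k)]
```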
Step 3.2: Define the table's name
When creating table names, keep in mind that each destination has its own rules for how tables can be named.
Important: Destination table name limits
The table name you enter should adhere to your destination’s length limit for table names. If the table name exceeds the destination’s limit, the destination will reject the table entirely.
Table name limits vary by destination type:
Amazon S3 - Not applicable to this destination
BigQuery - Limited to 1,024 characters
data.world - Not applicable to this destination
Panoply - Limited to 127 characters
PostgreSQL - Limited to 63 characters
Redshift - Limited to 127 characters
Snowflake - Limited to 255 characters
Step 3.3: Define the table's Primary Key
In the Primary Key field, enter one or more header fields (separated by commas) that Stitch can use to identify unique rows. For example:
If undefined, Stitch will load data into the table in an append-only fashion. This means that existing rows in the destination won’t be updated, but will be appended to the end of the table. Refer to the Primary Keys and Append-Only Replication section below for more info and examples.
Step 3.4: Specify datetime fields
In the Specify datetime fields field, enter one or more header fields (separated by commas) that should appear in the destination table as `datetime` fields instead of strings. For example:
If columns are not specified and values cannot be parsed as dates, Stitch will load them as nullable strings. Refer to the Determining data types section for more info on how Stitch identifies data types.
Step 3.5: Configure additional tables
If you want to add another table, click the Configure another table? link below the Specify datetime fields field. Otherwise, move on to the Sync historical data section.
Stitch doesn’t enforce a limit on the number of tables that you can configure for a single integration.
Step 4: Define the Historical Sync
The Sync Historical Data setting will define the starting date for your Amazon S3 CSV integration. For Amazon S3 CSV integrations, Stitch will replicate all files that match the search pattern that have been updated since the Start Date selected here.
Change this setting if you want to replicate data beyond Amazon S3 CSV’s default setting of 1 year.
For example: Let’s say we’ve added a `customers.*\.csv` search pattern and set the integration’s historical Start Date to 1 year. During the initial replication job, Stitch will fully replicate the contents of all files that match the search pattern that have been modified in the past year.
During subsequent replication jobs, Stitch will only replicate the files that have been modified since the last job ran.
As files included in a replication job are replicated in full during each job, how data is added to updated files can impact your row count. Refer to the Incremental Replication for Amazon S3 CSV section for more info on initial and subsequent replication jobs for Amazon S3 CSV.
Step 5: Create a replication schedule
In the Replication Frequency section, you’ll create the integration’s replication schedule. An integration’s replication schedule determines how often Stitch runs a replication job, and the time that job begins.
Stitch offers two methods of creating a replication schedule:
- Replication Frequency: This method requires selecting the interval you want replication to run for the integration. Start times of replication jobs are based on the start time and duration of the previous job. Refer to the Replication Frequency documentation for more information and examples.
- Anchor scheduling: Based on the Replication Frequency, or interval, you select, this method “anchors” the start times of this integration’s replication jobs to a time you select to create a predictable schedule. Anchor scheduling is a combination of the Anchor Time and Replication Frequency settings, which must both be defined to use this method. Additionally, note that:
  - A Replication Frequency of at least one hour is required to use anchor scheduling.
  - An initial replication job may not begin immediately after saving the integration, depending on the selected Replication Frequency and Anchor Time. Refer to the Anchor Scheduling documentation for more information.
  - You’ll need to contact support to request using an Anchor Time with this integration.
To help prevent overages, consider setting the integration to replicate less frequently. See the Understanding and Reducing Your Row Usage guide for tips on reducing your usage.
Step 6: Grant access to your bucket using AWS IAM
Next, Stitch will display a Grant Access to Your Bucket page. This page contains the info you need to configure bucket access for Stitch, which is accomplished via an IAM policy and role.
Note: Saving the integration before you’ve completed the steps below will result in connection errors.
Step 6.1: Create an IAM policy
Note: Creating an IAM policy requires the IAM permission CreatePolicy. Refer to Amazon’s documentation for more info.
An IAM policy is a JSON-based access policy document used to manage permissions to bucket resources. The policy Stitch provides is auto-generated and unique to the specific bucket you entered on the setup page.
For more info about the top-level permissions the Stitch IAM policy grants, click the link below.
| Permission Name | Operation | Operation Description |
| --- | --- | --- |
| s3:GetObject | GET Object | Allows for the retrieval of objects from Amazon S3. |
| s3:GetObject | HEAD Object | Allows for the retrieval of metadata from an object without returning the object itself. |
| s3:ListBucket | GET Bucket (List Objects) | Allows for the return of some or all (up to 1,000) of the objects in a bucket. |
| s3:ListBucket | HEAD Bucket | Used to determine if a bucket exists and access is allowed. |
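Use the exact auto-generated policy Stitch displays; the sketch below only illustrates the general shape such a policy takes, with a placeholder bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET_NAME"]
    }
  ]
}
```

Note that `s3:GetObject` is granted on the objects in the bucket (the `/*` resource), while `s3:ListBucket` is granted on the bucket itself.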
To create the IAM policy:
- In AWS, navigate to the IAM service by clicking the Services menu and typing IAM.
- Click IAM once it displays in the results.
- On the IAM home page, click Policies in the menu on the left side of the page.
- Click Create Policy.
- In the Create Policy page, click the JSON tab.
- Select everything currently in the text field and delete it.
- In the text field, paste the Stitch IAM policy.
- Click Review policy.
- On the Review Policy page, give the policy a name. For example:
- Click Create policy.
Step 6.2: Create an IAM role for Stitch
Note: Attaching a policy to a role requires the IAM permission AttachRolePolicy. Refer to Amazon’s documentation for more info.
In this step, you’ll create an IAM role for Stitch and apply the IAM policy from the previous step. This will ensure that Stitch is visible in any logs and audits.
To create the role, you’ll need the Account ID and External ID values provided on the Stitch Grant Access to Your Bucket page.
Note: If you’re creating multiple Amazon S3 CSV integrations, you only need to complete this process once. After you create the Stitch role, you can just create an additional IAM policy and attach it to the role.
- In AWS, navigate to the IAM Roles page.
- Click Create Role.
- On the Create Role page:
- In the Select type of trusted entity section, click the Another AWS account option.
- In the Account ID field, paste the Account ID from Stitch. Note: This isn’t your AWS account ID from Step 1 - this is the Account ID that displays in Stitch on the Grant Access to Your Bucket page:
- In the Options section, check the Require external ID box.
- In the External ID field that displays, paste the External ID from Stitch.
- Click Next: Permissions.
- On the Attach permissions page:
- Search for the policy you created in Step 6.1.
- Once located, check the box next to it in the table.
- Click Next: Review.
- In the Role name field, type
- Click Create role.
Step 6.3: Check and save the connection in Stitch
After you’ve created the IAM policy and role, you can save the integration in Stitch. When finished, click Check and Save.
Step 7: Set tables to replicate
To complete the setup, you’ll need to select tables you want to replicate to your data warehouse.
Check out the Schema section to learn more about how Amazon S3 CSV data will be structured once it’s loaded into your destination.
- In the list of tables that displays - or in the Tables to Replicate tab, if you skipped this step during setup - locate a table you want to replicate.
- To track a table, click the checkbox next to the table’s name. A green checkmark means the table is set to replicate.
- Repeat this process for all the tables you want to replicate.
- When finished, click the Finalize Your Selections button at the bottom of the screen to save your selections.
Note: If you change these settings while a replication job is still in progress, they will not be used until the next job starts.
Initial and historical replication jobs
After you finish setting up Amazon S3 CSV, its Sync Status may show as Pending on either the Stitch Dashboard or in the Integration Details page.
For a new integration, a Pending status indicates that Stitch is in the process of scheduling the initial replication job for the integration. This may take some time to complete.
Initial replication jobs with Anchor Scheduling
If using Anchor Scheduling, an initial replication job may not kick off immediately. This depends on the selected Replication Frequency and Anchor Time. Refer to the Anchor Scheduling documentation for more information.
Free historical data loads
The first seven days of replication, beginning when data is first replicated, are free. Rows replicated from the new integration during this time won’t count towards your quota. Stitch offers this as a way of testing new integrations, measuring usage, and ensuring historical data volumes don’t quickly consume your quota.
Amazon S3 CSV Replication
In this section:
- How new and updated data is identified and replicated
- How Primary Keys affect loading data
- How data types are determined
Incremental Replication for Amazon S3 CSV
While data from Amazon S3 CSV integrations is replicated using Key-based Incremental Replication, the behavior for this integration is subtly different from that of other integrations.
The table below compares Key-based Incremental Replication and Replication Key behavior for Amazon S3 CSV to that of other integrations.
| | Amazon S3 CSV | Other integrations |
| --- | --- | --- |
| What's used as a Replication Key? | The time a file is modified. | A column or columns in a table. |
| Are Replication Keys inclusive? | No. Only files with a modification timestamp value greater than the last saved bookmark are replicated. | Yes. Rows with a Replication Key value greater than or equal to the last saved bookmark are replicated. |
| What's replicated during a replication job? | The entire contents of a modified file. | Only new or updated rows in a table. |
Frequently updated files and impact on row usage
Because modified files are replicated in full during each replication job, large, frequently updated files can quickly use up a large number of rows.
To reduce row usage, you could create new files that only include updated records that match the table’s search pattern. This will ensure that only updated records are replicated and counted towards your usage.
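The exclusive bookmark behavior described above can be illustrated with a small sketch. The S3 listing is simulated here, and Stitch’s internals may differ; this only models the comparison rule from the table:

```python
from datetime import datetime

# Simulated S3 listing: object key -> last-modified timestamp.
listing = {
    "analytics/customers-2018-07-03.csv": datetime(2018, 7, 3, 12, 0),
    "analytics/customers-2018-07-04.csv": datetime(2018, 7, 4, 12, 0),
}

def files_to_replicate(listing, bookmark):
    """Exclusive comparison: only files modified strictly after the
    last saved bookmark are picked up (and then replicated in full)."""
    return sorted(k for k, mtime in listing.items() if mtime > bookmark)

# After a job that bookmarked 2018-07-03 12:00, only the newer file
# is selected for the next replication job.
upcoming = files_to_replicate(listing, datetime(2018, 7, 3, 12, 0))
```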
Primary Keys and Append-Only Replication
For destinations that support upserts (that is, updating existing rows), Stitch uses Primary Keys to de-dupe data during loading. Primary Keys are used to identify unique rows within a table and ensure that only the most recently updated version of that record appears in your data warehouse.
If Primary Keys aren’t specified during setup, Stitch will load data using Append-Only Replication. This means that existing rows in the destination won’t be updated, but instead appended to the end of the table.
Additionally, Stitch will append a column (`_sdc_primary_key`) to the table to function as a Primary Key. Note: Appending this column will not enable Stitch to de-dupe data, as a unique value will be inserted every time a row is loaded, regardless of whether it’s ever been replicated before. This means that a record can have multiple `_sdc_primary_key` values, each of them unique.
Example: Append-Only with _sdc_primary_key
For example: The following rows are replicated during the initial replication job:
Before the next job, the CSV file containing these rows is modified. This means that Stitch will replicate the contents of the entire file, including the rows for Jake, even if they haven’t been updated.
In the destination, the table might now look like the table below. Notice that the records for Jake have been appended to the end of the table with new `_sdc_primary_key` values.
Querying Append-Only tables
Querying Append-Only tables requires a different strategy than you might normally use. For instructions and a sample query, check out the Querying Append-Only tables guide.
What column do I use as a Primary Key when querying?
The answer depends on your destination:
- BigQuery and Amazon S3 - These destinations are Append-Only regardless of whether Primary Keys are specified. If you specified Primary Keys during setup, use the column(s) you specified (ex: `id`). Otherwise, use `_sdc_primary_key`.
- All other destinations - Use `_sdc_primary_key` when querying Append-Only tables.
Determining data types
To determine a column’s data type, Stitch will analyze the first 1,000 lines of (up to) the first five files included for a given table.
Stitch’s Amazon S3 CSV integration will load data from CSV files and type it as one of the following data types:
DATETIME - If a value can’t be parsed as a date, Stitch will load the column as a nullable `string`. To ensure dates are typed properly, specify them during setup.
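The sampling approach above can be approximated as follows. The parsing rules here are illustrative assumptions - Stitch’s actual typing logic isn’t published on this page - but the fall-back-to-string behavior matches the description:

```python
from datetime import datetime

def infer_type(values):
    """Infer a column type from sampled string values, falling back
    to string when values can't be parsed consistently."""
    def all_parse(parser):
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    return "string"

infer_type(["1", "2"])           # "integer"
infer_type(["2018-07-03", "x"])  # "string": one unparseable value
                                 # makes the whole column a string
```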
Amazon S3 CSV table schemas
In this section:
Column name transformations
When loading data, Stitch may perform some light transformation on column names to ensure the names follow the destination’s rules for column names. This might include removing or replacing spaces or illegal characters such as `$` and `!`, as in the examples below.
Note: Stitch will not truncate column names to make them adhere to a destination’s length limit. If a column’s name is too long, the destination will reject the column. Compliant columns will persist to the table.
In the table below are some examples of column names and how they’ll be transformed to fit each destination. Hover over the icon for info about the example.
| Destination | customer ID | $customer! id | 123customerid | _customerid |
| --- | --- | --- | --- | --- |
| Amazon S3 | customer ID | $customer! id | 123customerid | _customerid |
| data.world | customer ID | $customer! id | 123customerid | _customerid |
| Panoply | customer id | $customer id | customerid | _customerid |
| PostgreSQL | customer ID | customer id | customerid | _customerid |
| Redshift | customer id | $customer id | customerid | _customerid |
Table and column names in your destination
Depending on your destination, table and column names may not appear as they are outlined below.
For example: Object names are lowercased in Redshift (`customers`), while case is maintained in PostgreSQL destinations (`CusTomERs`). Refer to the Loading Guide for your destination for more info.