Google Sheets + Stitch: An easy way to sync your small data to Redshift

A few months back, I found myself in a situation I’m sure someone at your company has been in. We had some goal data in Google Sheets that we wanted to bring into Redshift to join with our sales data and trend against. This is small data – maybe a few dozen rows – but small as it is, it was important, and we needed to combine it with the rest of our data.

It would have been easy to populate a table with those goals directly in Redshift, but then our CFO would have had to manually update it in SQL, which would have been a harder task. So I spent a few hours coding up a Google Sheets integration for Stitch that makes loading to Redshift trivial. I’ll explain how it works.

Feel free to skip ahead to the repo. Please note: I am not a programmer by trade and pull requests are welcome!

I knew that there were a few key things that this script needed to do:

  • Allow nontechnical users to set it up and use it

  • Grab all the data from a sheet and format as JSON

  • Conform to the Stitch API for JSON formatting, datatyping, and records per request

Step 1: Prepare your sheet to load data to your data warehouse

The easiest way for a general user to set up and use the script is to add a dropdown to the menu. Google Sheets makes this easy with the addMenu function.

I added a menu with two options, each of which executes a function:

  • Sync with Stitch

  • Setup Spreadsheet For Push

The onOpen function makes sure the menu dropdown loads every time the spreadsheet loads, and it contains the options for the other functions.

function onOpen() {
  var spreadsheet = SpreadsheetApp.getActive();
  var menuItems = [
    {name: 'Sync with Stitch', functionName: 'push'},
    {name: 'Setup Spreadsheet For Push', functionName: 'onInstall'}
  ];
  spreadsheet.addMenu('Stitch Import', menuItems);
  function auth() {}
  auth();
}

Dropdown menu added to spreadsheet

The onInstall function prompts the user to set up the spreadsheet and current sheet for syncing. The Stitch API requires an API token and the client ID of the Stitch account. You will also need to define the primary key(s) for the sheet, so Stitch can determine what makes each row unique. When setting up additional sheets, users can simply click Cancel on the API token and client ID prompts if those values are already saved.

In the code below I use the normalizeHeaders function, which is defined later on; it makes sure the sheet name and column headers are formatted as valid keys for the API and Redshift.

function onInstall() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getActiveSheet();
  var key = Browser.inputBox("Input Stitch API Key here. Press cancel if no change.", Browser.Buttons.OK_CANCEL);
  var cid = Browser.inputBox("Input Stitch Client Id here. Press cancel if no change.", Browser.Buttons.OK_CANCEL);
  var primaryKey = Browser.inputBox("Enter a comma separated list of the primary key(s) for this Sheet (tab). Usually this will be one column, but if multiple columns make a row unique, add more. Press cancel if no change.", Browser.Buttons.OK_CANCEL);
  if (key && key != "cancel") ScriptProperties.setProperty("STITCHKEY", key);
  if (cid && cid != "cancel") ScriptProperties.setProperty("STITCHCID", cid);
  if (primaryKey && primaryKey != "cancel") ScriptProperties.setProperty(normalizeHeaders([sheet.getSheetName()]), primaryKey);
  auth();
}

Step 2: Grab data from your sheet

In Google Apps Script, you can select specific ranges of cells based on coordinates in the sheet. Below, you can see how to get all of the data from a sheet into a JavaScript object. I took this code from a section of the Google Apps Script documentation on converting sheets to JSON that seems to no longer be available, and edited it a bit to account for spaces in headers and to ensure each key starts with a letter.

// Normalizes a string by removing all non-alphanumeric characters and using
// mixed case to separate words. The output always starts with a lower-case letter.
// This function is designed to produce JavaScript object property names.
// Arguments:
//   - header: string to normalize
// Examples:
//   "First Name" -> "firstName"
//   "Market Cap (millions)" -> "marketCapMillions"
//   "1 number at the beginning is ignored" -> "numberAtTheBeginningIsIgnored"

function normalizeHeader(header) {
  var key = "";
  var upperCase = false;
  for (var i = 0; i < header.length; ++i) {
    var letter = header[i];
    if (letter == " " && key.length > 0) {
      upperCase = true;
      continue;
    }
    if (!isAlnum(letter)) {
      continue;
    }
    if (key.length == 0 && isDigit(letter)) {
      continue; // first character must be a letter
    }
    if (upperCase) {
      upperCase = false;
      key += letter.toUpperCase();
    } else {
      key += letter.toLowerCase();
    }
  }
  return key;
}

// Returns true if the cell where cellData was read from is empty.
// Arguments:
//   - cellData: string
function isCellEmpty(cellData) {
  return typeof(cellData) == "string" && cellData == "";
}

// Returns true if the character char is alphanumeric, false otherwise.
function isAlnum(char) {
  return char >= 'A' && char <= 'Z' ||
    char >= 'a' && char <= 'z' ||
    isDigit(char);
}

// Returns true if the character char is a digit, false otherwise.
function isDigit(char) {
  return char >= '0' && char <= '9';
}
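Two helpers referenced elsewhere in the script aren’t shown above: normalizeHeaders (the plural variant used in onInstall and push) and getRowsData (called later in smalldoc). Here is a minimal sketch of what I assume they do – normalizeHeaders maps normalizeHeader over an array, and the pure rows-to-objects core of getRowsData is written without the Sheets API calls so it can run standalone (the helpers from above are repeated for completeness):

```javascript
// Sketch (assumption): normalizeHeaders applies normalizeHeader to each
// header string and returns one normalized key per header.
function normalizeHeaders(headers) {
  var keys = [];
  for (var i = 0; i < headers.length; i++) {
    keys.push(normalizeHeader(headers[i]));
  }
  return keys;
}

// Sketch (assumption): the core of getRowsData, minus the Sheets API calls --
// turn a 2-D array of cell values into objects keyed by normalized headers.
// Rows with no data are skipped.
function getObjects(data, keys) {
  var objects = [];
  for (var i = 0; i < data.length; ++i) {
    var object = {};
    var hasData = false;
    for (var j = 0; j < data[i].length; ++j) {
      var cellData = data[i][j];
      if (isCellEmpty(cellData)) continue;
      object[keys[j]] = cellData;
      hasData = true;
    }
    if (hasData) objects.push(object);
  }
  return objects;
}

// normalizeHeader, isCellEmpty, isAlnum, and isDigit repeated from above
// so this sketch runs standalone.
function normalizeHeader(header) {
  var key = "";
  var upperCase = false;
  for (var i = 0; i < header.length; ++i) {
    var letter = header[i];
    if (letter == " " && key.length > 0) { upperCase = true; continue; }
    if (!isAlnum(letter)) continue;
    if (key.length == 0 && isDigit(letter)) continue;
    key += upperCase ? letter.toUpperCase() : letter.toLowerCase();
    upperCase = false;
  }
  return key;
}
function isCellEmpty(cellData) { return typeof(cellData) == "string" && cellData == ""; }
function isAlnum(char) { return char >= 'A' && char <= 'Z' || char >= 'a' && char <= 'z' || isDigit(char); }
function isDigit(char) { return char >= '0' && char <= '9'; }
```

In the real script, getRowsData would fetch the header row via the Sheets API and pass the cell values into something like getObjects above.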

Step 3: Format your data and send to Stitch

Next, the script needs to talk to the Stitch API. This involves three main requirements:

  1. Each package can contain up to 10,000 top-level records of data. (Additional records can be nested within top-level records. Learn more about de-nesting in our docs.)

  2. We need to authenticate with the API.

  3. The request needs to contain the table name, primary key(s), and sequence ID. It can be formatted as JSON, or preferably as Transit.

Points 2 and 3 are covered by the details we captured in Step 1. Below you’ll see what an ideal JSON package looks like to Stitch; the authentication goes in the request headers.

[
  {
    "client_id": 1,
    "action": "upsert",
    "sequence": 1469150213914,
    "table_name": "sheet1",
    "key_names": ["id"],
    "data": {
      "id": 1,
      "customer_id": 2,
      "email": "sad@mdrcou.org",
      "createdDaysAgo": 1,
      "utmSource": "somegreatsource"
    }
  }
]

Before we can push all of the data, we need to account for the 10K record limit per request that the Stitch API enforces. While most spreadsheets I’ve worked with are not larger than that limit, you never know. The push function includes a simple conditional that routes large and small sheets to the appropriate handler, while staging the relevant information for the other functions.

function push() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getActiveSheet();
  var tablename = normalizeHeaders([sheet.getSheetName()])[0];
  if (ScriptProperties.getProperty('STITCHKEY') == null || ScriptProperties.getProperty('STITCHCID') == null || ScriptProperties.getProperty(tablename) == null) {
    Browser.msgBox("You are missing some of the required information to send the data. Please add the required information in the next prompt.");
    onInstall();
  } else {
    var range = sheet.getDataRange();
    var lastcolumn = range.getLastColumn();
    Logger.log(tablename);
    var newkey = normalizeHeaders(ScriptProperties.getProperty(tablename).split(','));
    var firstrow = 2;
    // add 1 so that (lastrow - firstrow) equals the number of data rows,
    // assuming the first row is headers
    var lastrow = range.getLastRow() + 1;
    trackdoc(lastrow, tablename);
    var i = 10001;
    if (ScriptProperties.getProperty('STITCHKEY') == null || ScriptProperties.getProperty('STITCHCID') == null || tablename == null) {
      Browser.msgBox("You are missing some of the required information to send the data. Please click 'Setup Spreadsheet For Push' in the dropdown.");
    } else if (lastrow > i) {
      largedoc(lastrow, lastcolumn, i, tablename, sheet, newkey);
    } else {
      smalldoc(lastrow, lastcolumn, i, firstrow, tablename, sheet, newkey);
    }
  }
}
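The largedoc branch isn’t shown in this post, but conceptually it just splits the rows into batches of at most 10,000 records before sending. A hypothetical sketch of that chunking logic in plain JavaScript (chunkRecords is my name for it, not the script’s):

```javascript
// Hypothetical helper: split an array of records into batches of at most
// batchSize entries, so each Stitch request stays under the 10,000-record cap.
function chunkRecords(records, batchSize) {
  var batches = [];
  for (var i = 0; i < records.length; i += batchSize) {
    batches.push(records.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch would then be formatted and posted the same way the smalldoc path does.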

I’m only going to focus on the smalldoc function, since most of the largedoc code is the same and doesn’t need to be repeated.

There’s one function that handles the push, and one that handles formatting the records for the push, both shown below. The code makes one call out to transform the JavaScript object to Transit format, which is like JSON enriched with data type encoding. The Stitch API accepts Transit and uses the encoding so your data shows up in the destination with its data types intact.

At the top of the script, there’s a large minified library, which is the JavaScript Transit library. We use a simple function to transform the data to Transit.

smalldoc (less than 10,000 rows) function:

// Stages and sends the package to the Stitch API
function smalldoc(lastrow, lastcolumn, i, firstrow, tablename, sheet, newkey) {
  Logger.log('starting last rows');
  var length_left = lastrow - firstrow;
  var datarange = sheet.getRange(firstrow, 1, length_left, lastcolumn);
  var spreadsheetdata = getRowsData(sheet, datarange, 1);
  var api = ScriptProperties.getProperty('STITCHKEY');
  var cid = ScriptProperties.getProperty('STITCHCID');
  var payload_pre = insertKeys(spreadsheetdata, newkey, tablename, sheet, cid);
  var payload = toTransit(payload_pre);
  var url = 'https://pipeline-gateway.rjmetrics.com/push';
  var options = {
    'method': 'post',
    'contentType': 'application/transit+json',
    'payload': payload,
    'headers': {'Authorization': 'Bearer ' + api},
    'muteHttpExceptions': true
  };
  var response = UrlFetchApp.fetch(url, options);
  Logger.log(response);
  return response;
}

insertKeys function:

// Takes the defined keys and creates an array for the package
function insertKeys(spreadsheetdata, keys, tablename, sheet, cid) {
  var new_array = [];
  var arrayLength = spreadsheetdata.length;
  for (var i = 0; i < arrayLength; i++) {
    var record = {};
    record.client_id = parseInt(cid);
    record.action = "upsert";
    record.sequence = new Date().getTime();
    record.table_name = tablename;
    record.key_names = keys;
    record.data = spreadsheetdata[i];
    new_array.push(record);
  }
  return new_array;
}
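For example, applied to a single row of hypothetical data, insertKeys (repeated here so the example runs standalone) produces records shaped like the JSON package shown earlier:

```javascript
// insertKeys repeated from above so this example is self-contained.
function insertKeys(spreadsheetdata, keys, tablename, sheet, cid) {
  var new_array = [];
  for (var i = 0; i < spreadsheetdata.length; i++) {
    var record = {};
    record.client_id = parseInt(cid);
    record.action = "upsert";
    record.sequence = new Date().getTime();
    record.table_name = tablename;
    record.key_names = keys;
    record.data = spreadsheetdata[i];
    new_array.push(record);
  }
  return new_array;
}

// Hypothetical row data; the sheet argument is unused, so null is fine here.
var rows = [{id: 1, goal: 50000}];
var payload = insertKeys(rows, ["id"], "sheet1", null, "1234");
// payload[0] now has client_id 1234, action "upsert", a timestamp sequence,
// table_name "sheet1", key_names ["id"], and the original row as data.
```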

toTransit function:

// Takes a JavaScript object and transforms it to Transit format
function toTransit(data) {
  var w = t.writer("json");
  return w.write(data);
}

Additional notes

I haven’t yet explained the sequence value and why it’s so important. You could theoretically push two versions of the same record within milliseconds of each other. Stitch, which is a distributed system running in AWS, has to make sure that your latest update is the version of record. The chart below illustrates why this matters.

Good vs. bad data pipeline

The easiest way to handle sequencing is to assign a Unix timestamp as the sequence ID, which is shown in the insertKeys function above. Later syncs will have a higher number, which Stitch will account for.
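To illustrate the rule (this is my sketch of the behavior, not Stitch’s actual code): when two versions of the same record arrive, the one with the higher sequence value wins, regardless of arrival order.

```javascript
// Hypothetical illustration of sequence-based upserts: given two versions of
// the same record, keep the one with the higher (later) sequence value.
function resolveUpsert(existing, incoming) {
  return incoming.sequence >= existing.sequence ? incoming : existing;
}

var first  = {sequence: 1469150213914, data: {id: 1, goal: 40000}};
var second = {sequence: 1469150213920, data: {id: 1, goal: 50000}};
// Whichever order the two arrive in, the later sequence is kept.
```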

The full script, which accounts for sheets larger than 10K records, can be found here. Please let me know if you’d like clarification on anything above, and open an issue on GitHub if you run into problems using this script.