Skyeng, based in Russia, provides an interactive online school to help people learn English. Its web-based platform lets students attend interactive lessons with teachers, do homework, and take tests. The company also has mobile applications for iOS and Android and browser extensions that help train English skills.
All of its products have independent development teams that each use their favorite technology stacks, including separate sets of databases and tools for product analytics. This approach works well as long as each of team measures only its own KPIs. It works less well if the company wants to answer a question like “How much faster and better do students learn English only via lessons with a teacher, versus those who independently learn words in the application?” The need for cross-product analytics meant the company had to build a new platform to collect data from various sources.
Skyeng Chief Analytics Officer Gleb Sologub got the assignment. “I needed to build an infrastructure for collecting arbitrary events from all the products, combining them with the necessary data from different databases in a single repository, and analyzing and visualizing these data in the form of reports and dashboards.”
The first question was what platform to use to store the data. The main criteria were:
- query execution speed
- complexity of integration with analytical systems and data collection services.
Cloud data warehouses like Amazon Redshift and Google BigQuery met all the criteria. Both were fast enough, and both offer integration with a variety of services. Cost became the determining factor. After looking at both platforms’ pricing strategies and the company’s likely usage, Sologub chose Redshift.
The next step was to settle on a way to collect events, which are the basic elements of product analytics. All the services Skyeng was using had their own APIs for sending arbitrary events. Sologub wanted to choose one service that could collect events from all platforms (web, iOS, Android) and via HTTP (for browser extensions and the server) in some unified way. He chose Segment, which had an API to collect events for all the platforms they needed and could automatically upload all the data to Redshift. Segment was particularly cost-effective – it charges not for the number of collected events, but for the number of unique users per month.
Selecting a service for loading data from databases
In addition to collecting events directly from products, Skyeng also needed to download data from its databases – different instances of different versions of MySQL and PostgreSQL, both cloud-based and on their own servers. The company did not want to write its own ETL solution, since that would require at least one more programmer and a corresponding increase in the payroll.
Sologub looked at several services that could automate the data pipeline, and chose the least expensive one – Stitch. “Stitch gives you a free trial for a month, during which you can download all your historical data and configure incremental replication of new data. After that you can replicate up to 100 million lines per month for $500,” he says.
Unlike services that support data transformation, Stitch has a simple interface that allows users to simply mark which tables and which fields they want to replicate and how often.
Stitch supports full overwriting and incremental replication by key. Replication can use a primary key if you want only to add rows in the source table, and never change them, or it can use a field like updated_at that stores the date of the last line change. If you need to transform the data before moving it to the repository, you can create a view with the right data directly in the source database and replicate this view, or you can do the transformation after the raw data gets replicated to the data warehouse.
Sologub says Stitch itself performs some datatype conversions depending on the storage – for example, json and enum translate to varchar in the case of Redshift. It also monitors the structure of the source database and changes in the tables, and automatically hooks up new columns. “Stitch also has a capable support service, which helped me solve many different questions right in the chat window,” Sologub says.
I advise you to stop being afraid and start doing it now.
Chief Analytics Officer
Adding in analytics
Once it had a data warehouse populated with data from multiple sources, Skyeng needed a platform to analyze and visualize the data. Here, cost considerations again played a decisive role in making the decision. The choice was between Mode, Redash, and Plotly, all of which allow businesses to connect to different data stores and make SQL queries to them. They all let users visualize data in various ways, and build reports and dashboards and give access to them to other users. Ultimately Skyeng chose Redash.
Sologub says it took about four months to build the data warehouse and reporting infrastructure, connect data from all the products and databases, and create about 90 reports and dashboards on various KPIs and metrics. The project cost only about $2,000 over that time.
Skyeng’s infrastructure scales horizontally with an increasing amount of processed data, and in the future its service stack could be supplemented by additional CRM or marketing platforms as separate data sources.
Many companies want to build a cloud analytics infrastructure, but are afraid of the cost, Sologlub says. “I advise you to stop being afraid and start doing it now,” he says. “It’s become easier and cheaper than it was a couple of years ago.”