New research — we’re in the middle of a data engineering talent shortage

We’ve all become accustomed to hearing about the rising demand for data scientists, but according to our latest research, the real talent crisis lies in data engineering. There are 6,500 people who call themselves data engineers on LinkedIn. In the San Francisco Bay Area alone, there are 6,600 job listings for this same title!

Yes, there are developers doing data engineering work who are not using the data engineering title. But even with that caveat, it’s clear that there’s a growing demand for a specific skill set, and companies are hiring (and paying) aggressively for it.

Here are a few highlights from our research:

  • From 2013–2015 the number of data engineers grew 122%. The number of data scientists grew 47% during that same time period.

  • 42% of data engineers come from a software engineering background.

  • The top five data engineering skills are SQL, Java, Python, Hadoop, and Linux.

You can read the full report here, and you should – there’s some great stuff in there. In the meantime, let’s dig a bit deeper into the highlights.

Data engineer growth is far outpacing data science growth

From 2013–2015 the number of data engineers grew 122%. During that same time period, the number of data scientists grew by 47%.

Cumulative number of data engineers and data scientists

Does it really matter that data engineering roles are growing so much faster than data science roles while getting far less attention? In one respect, not really. But it reflects a fundamental flaw in the way many people think about working with data. It’s incredibly common for company leaders to say, “We need insights!” The usual response is to hire analysts and data scientists, who then spend hours of their time cleaning, munging, and moving data.

Data engineers build and maintain the pipelines that keep your data clean and flowing. Insights are great, and you need them. But to deliver insights at scale, you need data infrastructure. That’s delivered by data engineering. It’s not as fun to talk about as D3 visualizations and business intelligence dashboards, but it’s every bit as important. That importance often goes unrecognized until there’s a problem.

42% of data engineers come from a software engineering background

Data engineers by prior role

Data engineering is specialized. Many of the data engineers and experts we spoke with for this report mentioned that data engineering and devops share similar characteristics: a focus on uptime, scalability, and deliverability. This is a very different skill set from that of a software developer building a product.

Galvanize, a training program for data science and data engineering, just announced a $45 million Series B round. This is another manifestation of just how big the data engineering shortage is right now. Training software developers to work with big data is a massive market opportunity.

Of course, Stitch is also working to solve this problem. Stitch is an ETL service built for developers. We handle the data consolidation portion of a company’s business intelligence stack so that, instead of wasting hours on API maintenance or JSON wrangling, data engineers can spend their time on projects that make their products better.

The data engineering skill set is evolving

We looked at the top 20 skills of data engineers, and we found no big surprises there. SQL, Java, Python, and Hadoop top the list.

Top 20 skills of a data engineer

But there’s a noticeable difference in skill set when you look at skills by company size:

Skill differences across company size

Data engineers at larger companies are more likely to have skills in data warehousing, business intelligence, and ETL. Data engineers at smaller companies are more likely to have skills in Python, Java, and machine learning. This difference represents an evolution from data engineers being essentially managers of legacy BI systems to data engineers working on bigger architecture problems.

This shift has been enabled by the rise of composable data infrastructure tools. In the past, a company would use Oracle or SAP to build its entire data stack. These tools were difficult to work with and mind-numbingly expensive. Today, we’re seeing the business intelligence stack organize in three distinct layers:

  1. Data consolidation: This is the ETL portion of business intelligence, handled by tools like Stitch.

  2. Data warehousing: Options like Amazon Redshift and Google BigQuery are optimized for analytics and cheaper than ever.

  3. Analytics: Tools like Looker, Mode, and Chartio offer powerful interfaces for analysts, data scientists, and business users to explore data.

These composable stacks free up data engineers to spend their time on projects that use data to build better products (see machine learning way at the top of the skills list), rather than spending all their time just keeping the latest ETL script from breaking.
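
To make the three layers concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how Stitch or any of the tools above work internally: the payload, the table name, and every function name are hypothetical, and an in-memory SQLite database stands in for a real warehouse such as Amazon Redshift or Google BigQuery.

```python
import json
import sqlite3

# Hypothetical source payload. In practice this would come from an API response.
RAW_PAYLOAD = json.dumps([
    {"id": 1, "customer": {"name": "Ada"}, "total": "19.99"},
    {"id": 2, "customer": {"name": "Grace"}, "total": "5.01"},
    {"id": 3, "customer": {"name": "Ada"}, "total": "10.00"},
])


def extract(payload: str) -> list[dict]:
    """Layer 1 (consolidation): parse the raw JSON from the source."""
    return json.loads(payload)


def transform(records: list[dict]) -> list[tuple[int, str, float]]:
    """Layer 1 (consolidation): flatten nested fields and fix types."""
    return [(r["id"], r["customer"]["name"], float(r["total"])) for r in records]


def load(conn: sqlite3.Connection, rows: list[tuple[int, str, float]]) -> None:
    """Layer 2 (warehousing): write rows into an analytics-friendly table.

    SQLite is only a stand-in here for a real data warehouse.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
    )
    # INSERT OR REPLACE keeps the load idempotent if the same id arrives twice.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def revenue_by_customer(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Layer 3 (analytics): the kind of query a BI tool would issue."""
    cursor = conn.execute(
        "SELECT customer, ROUND(SUM(total), 2) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return cursor.fetchall()


def run_pipeline(payload: str = RAW_PAYLOAD) -> list[tuple[str, float]]:
    """Run all three layers end to end against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        load(conn, transform(extract(payload)))
        return revenue_by_customer(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for customer, revenue in run_pipeline():
        print(customer, revenue)
```

Each function maps to one layer, so any layer can be swapped out without touching the others. That separation is roughly what "composable" means here: the warehouse does not care how the rows were consolidated, and the analytics query does not care where the table lives.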

Now read the whole report

You can grab the whole report here. It’s packed with good stuff: 4,000 words, 11 charts, and insights from six experts in data engineering. If you have questions about the data or ideas on what we should explore next, leave us a comment.
