5 things you should know for a career in data engineering

The demand for big data professionals has never been higher. "Machine Learning Engineers, Data Scientists, and Big Data Engineers rank among the top emerging jobs on LinkedIn," Forbes proclaims. Many people are building high-salary careers working with big data. We've already talked about things you should know before getting a job in data science — now let's talk about data engineering.

First, you should know that a data science degree isn't training for a data engineering career. Data science is heavily math-oriented. By contrast, data engineers work primarily on the tech side, building data pipelines. What the two roles have in common is that both work with big data.

Working with big data often takes a big team. Data engineers work with people in roles like data warehouse engineer, data platform engineer, data infrastructure engineer, analytics engineer, data architect, and devops engineer.

To help students and mid-career professionals decide whether data engineering is for them, we spoke with people who've worked as data engineers themselves and hired data engineering teams:

1) You must be a strong developer

Everyone agrees that you need strong developer skills for a data engineering job.

"You'll have to write scripts and maybe some glue code," Ng says. "Everything is code now: infrastructure as code, pipeline as code, etc. Courses are OK but nothing beats real-world experience. A textbook doesn't teach you how to handle a data pipeline outage – at least none of mine did!"

In a blog post about what he looks for in a data engineer, Anderson said, "I can't stress enough how important it is for a data engineer to have a strong programming background. They also need a love of or at least an interest in data, in finding patterns in data, otherwise they may find the work boring. Also, they have to like and have the ability to create systems that are difficult and complex. Big data projects are 10 times more complex than small data. So it's a love of data combined with a love of programming to create data pipelines."

In addition to being comfortable coding, Lappas says, "You have to have the operations mindset that uptime is critically important. You have to be careful how you build your infrastructure for reliability, so that any changes won't break any of the pieces. Devops experience is very valuable. And you need DBA skills." In fact, he says, "I generally see the title data engineer in midmarket and smaller companies. When you get to bigger companies the title is still DBA or senior DBA."

2) You need to know about a lot of technologies

Lappas says, "A data engineer has three main duties:

  • To ensure that the data pipeline – the acquisition and processing of data – is working
  • To serve the needs of internal customers – the data scientists and data analysts
  • To control the cost of moving and storing data

"The critical skills are SQL, Python, and R, and ETL methodologies and practices."

Tam confirms the value of knowing SQL and having competency in a language. "Just understanding the foundation of a language will allow you to work at any company."
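
To show how SQL, a general-purpose language, and ETL practice fit together, here is a minimal sketch of an extract-transform-load step that uses only Python's standard library. It is an illustration rather than anything our experts described: the payments table, its columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """user_id,amount,currency
1,10.50,usd
2,,usd
3,7.25,USD
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, then normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["user_id"]), float(row["amount"]), row["currency"].upper())
        )
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline():
    """Run all three stages against an in-memory database (an assumption)."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(RAW_CSV)))
    return conn


if __name__ == "__main__":
    conn = run_pipeline()
    for row in conn.execute("SELECT * FROM payments ORDER BY user_id"):
        print(row)
```

Keeping the three stages as separate functions means each one can be tested and replaced on its own, which is most of what "ETL methodology" means at this scale.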

Anderson says most data engineering is done in Java, but "you have to be aware that most universities are teaching programming from an academic point of view, and there's a disparity between what industry wants and what academia is providing. A university may have classes on programming, but people who want to become data engineers may have to learn the technical and systems side on their own."

But there's more to being a data engineer than knowing SQL and programming, he says. "A qualified data engineer's value is to know the right tool for the job. People who are new to big data think I'm exaggerating when I say that data engineers need to know 10 to 30 different technologies to choose the right tool for the job, technologies such as:

  • Apache Hadoop
  • Apache Spark
  • Apache Hive
  • Apache HBase
  • Apache Impala
  • Apache Kafka
  • Apache Crunch
  • Hue
  • Apache Oozie
  • Apache NiFi
  • Apache Flink
  • Apache Apex
  • Apache Storm
  • Heron
  • Apache Beam
  • Apache Cassandra

"To make the right decision in choosing, for example, a NoSQL cluster, you'll need to have learned the pros and cons of five to 10 different NoSQL technologies," Anderson says. From that list, you can narrow it down to two or three for a more in-depth look.

Learning all those tools takes time. You can't spend a weekend watching YouTube videos and studying MOOCs and expect to do well in a job interview. Anderson also cautions that many low-cost classes are useless. "They're too general, taught by people with not enough knowledge, and they won't help you get a job." Similarly, most certifications just don't make enough of a difference to be worth it. "You'd be better off putting your time and money into a better personal project that shows true mastery.

"The best options are to get professional training, read books, and work on big data projects. You have to both internalize the knowledge and practice it. If you've learned passively but never practiced, you won't be able to code a project, and that will come out in an interview. Practice practice practice!"

Most companies standardize on a single vendor's suite of cloud computing services, so Ng recommends, "Go deep on one of the big three: Google Cloud Platform (GCP), Amazon Web Services (AWS), or Microsoft Azure. Understand how their service offerings can be used as building blocks for highly scalable and available data pipelines."

"So if your company uses AWS, for instance," Lappas adds, "you'd need to know Amazon Redshift, Amazon EMR, Amazon Athena, and Amazon S3."

Finally, Ng says, "A neglected skill that I feel is core to being a good engineer is technical diagramming. Data engineers will inevitably need to map out their pipeline architectures in a clear and presentable way. No amount of words can beat a thoughtful and clean diagram. My advice would be to pick a diagramming tool and get really good with it. My favorite is Lucidchart and I use it for everything from AWS architecture diagrams to block diagrams and flow charts."

3) Experience beats education

How do you pick up all those skills? Typically, on the job. Everyone we spoke with told us it wasn't necessary to have an advanced degree to get a job as a data engineer.

"I think education has its place, but a lot of things you don't learn until you operate in the real world, meaning you have to deal with real customers," Ng says. "I think anyone with a software background that has had some experience in operations or systems can make a smooth transition to data engineering. A lot of the skills that devops and site reliability engineers have overlap significantly with data engineering responsibilities."

"A lot of the skills you need you can pick up yourself or on the job," Lappas agrees. "If you're just starting out, I recommend starting as an analyst to get a feel for the business value that the data brings. Eventually you can move down the stack into data engineering."

Nevertheless, he says, training in both software development and data science skills such as statistics and math is important. "Data engineers are responsible for acquiring data for data scientists and data analysts, who need all the company's data available in a format that lets them query it with the tool of their choice. The data engineer has to migrate it from where it lives and transform it so that it makes sense to the data scientists and data analysts. That may require aggregating it and running statistical methods to derive higher insights. For example, if a mobile app generates 10,000 events per second, chances are you're going to have to do some transformation on that raw data to make it useful for the rest of the data team."
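
The kind of transformation Lappas describes can be sketched in a few lines. The example below is an assumption-laden illustration, not his method: it takes hypothetical raw app events (a Unix timestamp in "ts" and an event name in "type", both invented field names) and rolls them up into per-minute counts, which is the sort of summary an analyst can actually query.

```python
from collections import Counter


def aggregate_per_minute(events):
    """Roll raw events up into counts keyed by (minute_start, event_type).

    minute_start is the Unix timestamp, in seconds, at which the minute begins.
    """
    counts = Counter()
    for event in events:
        minute_start = int(event["ts"] // 60) * 60
        counts[(minute_start, event["type"])] += 1
    return counts


# Hypothetical sample; a real feed would arrive from a queue or log files.
SAMPLE_EVENTS = [
    {"ts": 0.0, "type": "tap"},
    {"ts": 1.5, "type": "tap"},
    {"ts": 59.9, "type": "scroll"},
    {"ts": 60.0, "type": "tap"},
]

if __name__ == "__main__":
    for key, count in sorted(aggregate_per_minute(SAMPLE_EVENTS).items()):
        print(key, count)
```

At 10,000 events per second, a single minute of raw data is 600,000 rows; after this roll-up it is one row per event type, which shows why aggregation usually happens before the data reaches the rest of the team.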

Tam says, "Education has its place but experience makes the best engineer. I've seen people in customer-facing areas like support and customer success, if they have interest in programming, move into a data engineering role. Most support people are so in tune with what customers are asking for in terms of custom data integrations that it's an easy sidestep for them, in the sense that they understand the use case and the thought process behind it. I've also seen data scientists move from the analysis side into data engineering. Generally, it's people who have a hand and the experience in the day-to-day use of the data."

She says, "I've hired people of many different educational backgrounds – from people who've just graduated with a computer science degree to people who've done bootcamp courses in Python. You shouldn't be pigeonholed by your background. It depends on the person's overall goal. If they have the vision and drive, anyone could make a good data engineer with time."

Anderson, however, says, "Data engineering teams generally skew toward senior people. More broadly based software engineering teams will have people with a wider range of experience.

"If you have only a bachelor's degree and want to get on a data engineering team, I recommend you make a personal project that shows what you can do, not just what you can talk about."

4) Social and communication skills are important

Ng says, "Aside from hard technical skills, a good data engineer should also have certain soft skills and qualities":

  1. Attention to detail: Data quality is extremely important when building pipelines. All downstream work is only as good as the quality and integrity of the data you're moving through the pipeline. You have to really care about and appreciate the "garbage in, garbage out" principle.

  2. Appreciation for clean design: There's never one way to design and build a pipeline for moving data from point A to point B. A good data engineer should appreciate the elegance of clean and simple designs that are not over-architected.

  3. Good communication skills: A lot of times there's a discovery period when you start to design a pipeline because your data is sitting in different silos that may be located in different areas of your infrastructure. You'll have to talk to people to understand the playing field before you design anything. This discovery step isn't easy, but it's a requirement for making sure you're building the right thing. A good data engineer should find satisfaction in helping their customers solve painful problems.

  4. Excitement about working on back-end systems: Data engineers don't build a lot of UIs and front-end apps. They work deep in the systems stack, and in many cases they won't be able to point to something shiny and say "I built that!" You have to be OK with that and take pride in being the hero behind the scenes.

  5. A love of learning: This isn't really data engineering-specific, it's just how the software engineering world operates. You have to keep up with new libraries, frameworks, and tools out there in the community. Things change fast and you need to be able to quickly understand, evaluate, and learn new tools if necessary.
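
The "garbage in, garbage out" point in the list above can be made concrete with a small check at the front of a pipeline. This is a minimal sketch under stated assumptions: the function name, the field names, and the rule that an empty string counts as missing are all hypothetical choices for illustration, and real pipelines typically check far more than completeness.

```python
def validate_records(records, required_fields):
    """Split records into (valid, rejected) using a simple completeness check.

    Each rejected entry is a (record, missing_fields) pair, so the reason for
    rejection travels with the record.
    """
    valid, rejected = [], []
    for record in records:
        missing = [
            field for field in required_fields
            if record.get(field) in (None, "")
        ]
        if missing:
            rejected.append((record, missing))
        else:
            valid.append(record)
    return valid, rejected


if __name__ == "__main__":
    sample = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": ""},
        {"email": "c@example.com"},
    ]
    good, bad = validate_records(sample, ["id", "email"])
    print(len(good), "valid;", len(bad), "rejected")
```

Routing bad records to a rejected list instead of silently dropping them keeps the failure visible, so someone can trace it back to the source system.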

"Having good people skills is critical," Lappas agrees. "A data engineer serves internal teams, so he or she has to understand the business goal that the data analyst wants to achieve to best support them. If a data scientist has a specific tool they want to use, the data engineer has to set up the environment in a way that lets them use it. So you have to be really good at interacting with the rest of the data team."

Tam says, "A lot of teams are really collaborative, so you have to communicate what you're doing technically with people who aren't as technical in the process of getting the requirements vetted out." But she says another kind of skill is also important. "I look for people who like stories and puzzles – the process of piecing together stories that might not seem like they make sense into a more complete picture. That mindset is useful as you move up the data engineering ladder and you have more input into designing systems that make sense."

5) The job is changing

While all of the above is important, data engineering is an evolving discipline.

Lappas says, "We're seeing a shift to data services, which means a change in the job of the data engineer to delivering data services. When it was expensive to store and process, data was siloed. Few people had access to it, and it was hard to make changes to it. With the cloud, it's now cheap and easy to store and process data, so everyone is putting data into cloud data warehouses and allowing anyone in the company to connect to it. Data engineers are still responsible for the performance of that infrastructure."

In other words, Tam says, "As data becomes even more ginormous than it is today, it becomes more about infrastructure and sustainable processes than it does about single processes. The job is growing toward more being able to maintain things."

Ng is more cautious. "I think this one is very company-specific. For large companies a data engineer may be able to have a narrower focus, e.g. just building pipelines. If you're at a small startup and you're the only data engineer, you'll inevitably have to wear multiple hats. Both of these scenarios exist today; you just have to decide what is best for you."

Conclusion

What would our experts tell someone who was considering a career as a data engineer?

Lappas says, "The job is very difficult. It's an unsexy job, but it's super-critical. Data engineers are kind of like the unsung heroes of the data world. Their job is incredibly complex, involving new skills and new tech. It's really hard to build new ETL pipelines."

Anderson agrees. "It's more difficult than a regular software engineering job. It may not be for everybody. The technical bar for data engineers is pretty high. Some people's efforts may be better put elsewhere."

Ng's advice: "Work for a startup and find a great mentor. Whether this is at an internship or your first job, find a place where you can work directly for someone who's a great teacher. More than anything else, a great mentor is the most efficient way to learn the right things and learn those things quickly. By working at a startup you'll be forced to wear multiple hats and will learn an incredible amount while doing that. Each hat is an opportunity to learn something new. Be a hat collector."

"Data engineering is a job that takes a lifetime to master," Tam says. "Every year there's something new to learn. You're never doing the same thing year after year."

Anderson agrees. "There is no such thing as future-proofing your career by choosing the right technology. The right technology will eventually become the wrong technology. You'll need to spend time and effort to keep up with what's happening."

Or as Tam puts it, "You can continue to grow forever."

Image credit: Image Editor