Questioning and answering. At times you may feel that you have found the correct answer. I assure you that this is a total delusion on your part. You will never find the correct, absolute, and final answer. — Professor Kingsfield, The Paper Chase
Organizations of all kinds are hiring data analysts to make sense of the growing amount and range of data they generate and collect, and wringing actionable answers out of that data has become a key business skill. Firms in fields as varied as B2B and B2C commerce, health care, manufacturing, and marketing all use data analytics to improve decisions, streamline processes, and enhance profits.
For example, "Medicine uses data analytics in clinical studies to predict the efficacy of medicines and survival rates," says Carl Howe, director of education at RStudio, a company that provides open source and enterprise tools for use with the R programming language. "Factories are always looking to improve production yields — if you can improve yield by one to two percent, that can mean millions of dollars to a chip or drug manufacturer."
And while companies are working on automating data analytics, "around 80% of the job hasn't been automated, and the 20% that is being automated still isn't automated really well," says Matthew May, lead data scientist at URSA. "More importantly, any problem that auto-machine learning can solve is a 'softball problem.' Hard problems take one or more people to work on. So jobs doing data analytics aren't going away."
If you've read our piece on what you should know before getting a degree in data science, you may wonder how data science and data analytics are related. Data science, says Howe, "is about 'can we model the world — and use these models to make predictions,' while data analytics is more about extracting insights from big datasets. For example, 'we forecast the demand for oil will peak by 2030,' or 'the proportion of the world living in poverty has almost halved in the last 20 years' are both results from data analysis."
Salarywise, a Northeastern University report that cites Robert Half Technology's 2019 Salary Guide says, "Data analysts have an earning potential of between $81,750 and $138,000... Since these professionals work mainly in databases, however, they are able to increase their salaries by learning additional programming skills, such as R and Python."
What skills and experience do you need to succeed as a data analyst? To find out, we got advice from four data analytics professionals.
- Carl Howe is a data scientist and director of education for RStudio, Inc., which provides open source and enterprise tools for use with the R programming language.
- Dr. Kristen Sosulski is a professor of tech operations and statistics at NYU Stern and author of the book Data Visualization Made Simple: Insights Into Becoming Visual.
- Matthew May is lead data scientist at URSA Inc. (for Unmanned Robotics Systems Analysis), an unmanned systems data analytics and visualization company that helps users unlock value from their telemetry data.
- Dr. Rosaria Silipo is the principal data scientist at KNIME, which provides the KNIME Analytics Platform tool for data science. She's also the author of six books, including Practicing Data Science: A Collection of Case Studies, as well as the three-hour video course KNIME: A Data Science Approach to Analytics and dozens of other technical publications.
1. You have to like working with numbers
"Doing data analytics makes use of two skills," Howe says: "One, statistics, and two, telling a story with those statistics in ordinary words.
"If you're going to be a data analyst, you must know how to use statistical techniques accurately. You have to like and be good at working with numbers. You have to be able to see data like a mystery or puzzle, and think, 'There's something in here that I want to discover.' Then you apply your math skills to find clues and eventually solve that mystery."
But that's only half the story. Jobs in data analytics focus not only on the numbers but also on how we communicate insight, Howe says. "You're turning data and statistics into a story that can influence others. That story probably has to be told in pictures, because that's the way we internalize information quickly."
Conveying the meaning of results in a way that can be quickly and easily grasped is essential, Sosulski agrees. "You have to be very descriptive in the work you do, be able to visually communicate what you've learned from your analysis — for example, creating charts and graphs, and solidly interpreting them, using the data as evidence."
In general, says May, "You have to be curious and inquisitive, and enjoy not knowing how to solve a problem, not knowing the answer, and working through that to get to a usable solution or usable actionable answer."
If you have some programming experience, and already have an MBA or other business degree, or your major was statistics or another math-related field, May suggests you ask yourself a couple of questions, both to help you assess whether you and data analytics are a good match, and because the questions will probably come up in job interviews:
- What kind of modeling have you done already?
- How comfortable are you working with a dataset — specifically, gleaning insights using statistical models and techniques, and creating insights that are interpretable by people who aren't quantitative? To do this, you need to have a solid foundation in coding, modeling, analysis, and data presentation, including data visualization.
Your answers can help you decide what new topics to learn about. Beyond this, May says, "There are varying levels of technical ability. Statistics helps. So does having a behavioral analysis background."
Behavioral analysis "is concerned with describing, understanding, predicting, and changing behavior." That, May says, "is a big part of data science/analytics. Often we are looking to be able to influence user behavior, i.e., ‘get them to click on this, or that’ or answer the question ‘why did the user click on this, or that?’ So having a background in a science that is all about behavior can be very useful."
Also valuable, May says: "Knowing a lot of the high-level math to do analysis using probability and statistics. To do the modeling that's at the core of machine learning models, you need linear algebra along with calculus, statistics, and probability. The better you understand this math, the better you can understand the underlying behavior of the algorithm you are using, the positive and negatives of a certain algorithm. Most people have the ability to learn calculus, linear algebra, statistics, and probability... But the desire to actually do that can't be learned — although it can be fostered."
"Math is definitely necessary," agrees Silipo. "Everything else — algorithms, more programming languages — you can learn. And since programming languages dedicated to AI change every few years, you'll need to learn new ones throughout your career. However, the math skills you must have from the beginning, and they are the basis to understand and learn everything else.
"Study the basic algorithms and the basic math behind the instructions in Python or another one of your favorite tools," she advises. "While the tools can make the programming easy, the complexity behind-the-scenes of AI algorithms is still there and must be handled with care."
All of that said, "If you have a non-tech or non-traditional background, it's probably going to be very difficult if not impossible to get into data analytics or data science without a master's degree," May says.
2. You have to know how to code, but you don't have to know computer science
Along with a love for numbers, data analysts need an affinity for working with them programmatically.
"You should learn to code, for reproducibility so others can build on what you've done," Howe says. If you can't write down a program that does what you are doing, "you're left with two choices: teach others how to do it or keep doing it yourself forever."
What computer languages and other software tools are most likely to be useful for a data analyst? SQL is essential — it's the standard language for data manipulation. Other useful options:
- Python. "Pick a graphing library in Python and get to where you're pretty good at it," May says, "and learn Pandas, the Python Data Analysis library."
- R, a free software environment for statistical computing and graphics. "R is written specifically for data analytics and science," RStudio's Howe says. However, URSA's May says if you don't already know R, it may not be worth your time to learn it. "If you mostly deliver one-off answers that don't get put into production, R is OK, but if you're shipping code to production, R gets really tricky."
- Hadoop, a collection of tools for processing large datasets, and Spark, a fast and general cluster computing system for big data. "I think anyone who gets hired as a data analyst and can't wrangle and clean data will struggle in a real-world environment because at least 90% of an analyst's work is cleaning and transforming data. If your data set is large enough that you can't process it on your laptop, you need big data (and usually cloud-based big data) skills such as Hadoop and Spark," says Howe.
- Cloud-native and desktop analytics platforms, such as Looker, Tableau, and Microsoft Power BI. However, notes Carl Howe, "Many of those tools are simply cloud-based versions of point-and-click visualization tools, which rely on manual and irreproducible processes for analyzing data. If you're an analyst who knows how to use a programming language, you'll have no trouble picking up those tools if you need them. On the other hand, if your skills are primarily in the point-and-click world, you'll find it difficult to make the transition to a code-based analysis environment, which is where hardcore data analysts work."
- Excel. Howe says, "Data scientists and analysts love to disparage Excel, but the reality is that many businesses run on Excel data. A good data analyst can build a dialog with end users and find ways to work with those users, and in many businesses, that may mean working with Excel data. The problem with Excel, though, is that it guarantees that you are doing the analysis manually, using your mouse and keyboard. That's not a recipe for creating reproducible results. Most analytics isn't doing one-offs, but doing things over and over again, and you always want to do them in the same way. The trick usually is to get that data out of Excel as quickly as possible and put it into a form more amenable to reproducible analysis."
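As a rough illustration of getting data out of Excel and into a reproducible form, the sketch below reads a CSV export and summarizes it with SQL. It uses only Python's built-in csv and sqlite3 modules. The spreadsheet contents, the column names (region, amount), the table name (sales), and the helper functions are all hypothetical, and the CSV text is inlined so the script runs on its own; a real workflow would read a file saved from Excel instead.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export of a spreadsheet; in practice this would be a file
# saved from Excel. The column names are assumptions for illustration.
CSV_EXPORT = """region,amount
East,100.50
West,200.00
East,50.25
West,25.00
"""


def load_sales(csv_text: str) -> sqlite3.Connection:
    """Load the CSV rows into an in-memory SQLite table named sales."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = [
        (row["region"], float(row["amount"]))
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def total_by_region(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Return (region, total amount) pairs, ordered by region."""
    query = (
        "SELECT region, SUM(amount) FROM sales "
        "GROUP BY region ORDER BY region"
    )
    return conn.execute(query).fetchall()


conn = load_sales(CSV_EXPORT)
totals = total_by_region(conn)
print(totals)
```

Because every step lives in the script, rerunning it on next month's export repeats the analysis in exactly the same way, which is the reproducibility Howe describes.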
"The cloud-based vendors like Amazon have done a great job of selling systems like Redshift, BigQuery, and Snowflake," Howe says. "However, if the company an analyst works for won't put their data in the cloud because of regulatory or just business concerns (and there are lots of companies in financial, pharma, health care, and other industries who are in this situation), then the analyst will need to know how to process big data without those cloud solutions, and that probably means using a Hadoop cluster or equivalent."
That said, while knowing how to code and knowing a programming language or three is essential to being a data analyst, coding for data analytics doesn't require the same depth of knowledge required for a degree in computer science.
"Data analytics and computer science are different disciplines," Howe says. "Data analytics is more about understanding large datasets. In a computer science course, you'd be introduced to the concept of loops and loop statements, but in data analytics, you might not encounter this concept until the end, because data analytics operations process a whole set at once; looping is only used rarely. So while a data analyst needs to be able to write code, they don't necessarily need a computer science background."
3. Communication skills are (almost) as important as math
You may have the technical chops to handle data analytics, but that might not be enough to get hired. What else do you need to ace an interview?
"One, make sure you can talk your way through a number of machine learning algorithms," says URSA's May. "Two, be able to speak to the bias-variance tradeoff in prediction models — and what you can do to/about it using SQL. And three, be able to talk through an end-to-end data science or data analytics problem that you've solved — what the problem was, your solution, and how you dealt with the roadblocks you encountered along the way."
Silipo says, "I look for many different things when I run interviews for these people and positions. First of all I look for technical skills. I give them an exercise and see how they approach it, how their way of thinking is, and whether they have the right math background. This applies to both data analytics and data science.
"Then I check their communication skills. It's true that a data analyst’s role, like that of a data scientist, is mainly technical, but for both roles, a minimum level of communication will be required to explain the results of a project or even to promote the project itself."
Communication skills cover a range of factors. Do you have a design sense, to create visualizations? Can you communicate with non-technical colleagues?
"And then finally, and most importantly, I check their attitude," says Silipo. "Data science is constantly evolving and there will be new concepts and new algorithms to learn every year. A curious attitude is what I need. I need somebody who isn't afraid of saying, 'I don't know. I'll research it.' It's impossible to know everything in the data science space, so a healthy humble attitude mixed with a self-starting curious attitude is the right combination."
But attitude only goes so far. "People want to see demonstrable evidence of your skills," NYU's Sosulski says. "How do you do this? Build portfolios of data projects from start to finish."
May agrees. "It's good to have one — and it shouldn't just be code that you've written. You should have writeups to accompany your code, using words to explain — in a concise way — what you did. Code can get pretty long, and can take a while to digest if you don't comment it correctly. And even then, nobody has hours to dig through your code. So you have to be able to explain what you did.
"Your portfolio should have at least two classification problems where you use different algorithms, and two regression problems where you use different algorithms," May advises. "And all of these problems need to have the proper data science workflow."
In terms of the datasets you use in your portfolio, May says, "Use some nice clean datasets, but if you can, get your hands on at least one that's very dirty and raw, so you can show what you did with missing values: do you fill them in or remove them?" And how can you find data and projects to work on? Many open source datasets are available online.
4. Much of what you'll do won't be at the top of the job description
You may have an intuitive idea about what a data analyst does, but what you imagine might not line up with how you actually spend your time. URSA's May says, "Mostly, you'll be thinking about a problem or question, and how you can use data to potentially solve or answer that. And you'll be doing EDA — exploratory data analysis — which means seeing if you can find a signal that can help answer that question or problem."
Some of this may get done on a blackboard (traditional or digital), and some with coding, May says. "You do EDA by writing code. There's also writing the code for the model that you build, and the code to create the graph to show your answer, and the code for the statistic you're going to derive."
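To make "doing EDA by writing code" a little more concrete, here is a minimal sketch of a first pass over a dataset using Python's built-in statistics module. The sample values and the summarize helper are made up for illustration; real EDA would typically use a library such as pandas and a plotting package as well.

```python
from statistics import mean, median, stdev

# Hypothetical measurements (for example, response times in seconds); the
# last value is an obvious outlier, placed there on purpose.
samples = [1.2, 0.9, 1.1, 1.0, 1.3, 9.5]


def summarize(values: list[float]) -> dict[str, float]:
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


summary = summarize(samples)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this toy data the mean sits well above the median, which is the kind of small signal that prompts a closer look: is the large value a genuine event or a recording error?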
Once you think you know how to answer a question, you still have a ways to go before you create a report or a visualization. "One irony of both data science and analytics is that while you need to know a great deal about models and machine learning, you will usually spend as much as 90% of your time cleaning real-world data before you analyze it. It's the old story of 'garbage in, garbage out,'" says RStudio's Howe. "You need clean data to work with before you can model it."
Cleaning the data, Howe says, includes:
- finding things that are clearly errors, are coded badly, or exhibit transcription problems
- transforming data into a consistent and meaningful format. Times and dates, for example, may be represented in many different ways with many different reference points, "so if you have a dataset of measurements that were recorded at local times around the world, all of those have to be converted to a standard such as Greenwich Mean Time before you can analyze them."
- deciding what to do with values that are missing, such as temperature readings lost to a faulty sensor. "Do you leave them out of your analysis?" says Howe. "Extrapolate from other nearby values? Replace them with the average value? There is no right answer; all will introduce bias into the results."
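Two of the cleaning steps above can be sketched in a few lines of standard-library Python: converting local timestamps to a single standard (UTC here) and choosing how to treat missing values. The readings, their UTC offsets, and the helper names are hypothetical, and the sketch assumes each timestamp already carries its offset in ISO 8601 form, which real-world data often does not.

```python
from datetime import datetime, timezone
from statistics import mean

# Hypothetical sensor readings: ISO 8601 local timestamps with UTC offsets,
# and None where a faulty sensor reported nothing.
readings = [
    ("2020-01-15T09:00:00-05:00", 20.0),
    ("2020-01-15T15:30:00+01:00", None),  # missing value
    ("2020-01-16T00:00:00+09:00", 22.0),
]


def to_utc(timestamp: str) -> datetime:
    """Parse an offset-aware ISO 8601 timestamp and convert it to UTC."""
    return datetime.fromisoformat(timestamp).astimezone(timezone.utc)


def drop_missing(rows):
    """Option 1: leave missing values out of the analysis."""
    return [(ts, value) for ts, value in rows if value is not None]


def fill_with_mean(rows):
    """Option 2: replace missing values with the mean of the known ones."""
    known = [value for _, value in rows if value is not None]
    fill = mean(known)
    return [(ts, fill if value is None else value) for ts, value in rows]


utc_rows = [(to_utc(ts), value) for ts, value in readings]
dropped = drop_missing(utc_rows)
filled = fill_with_mean(utc_rows)

for ts, value in filled:
    print(ts.isoformat(), value)
```

Neither option is the correct one. Dropping rows shrinks the sample and filling with the mean flattens its variation, so the choice should be stated alongside the results.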
Conclusion: So you want a job in analytics
Learning the skills you need to get a job in data analytics won't happen in a month — depending on your background and goals, expect to spend a year or more.
But if you have a mathematical mindset and no fear of coding, becoming a data analyst should be within your reach. To help you on your way, we've compiled a list of resources for learning and doing data analytics.
Start your journey to a data analytics job by plugging some keywords into an Internet search engine and skimming through results. "Search on YouTube too," suggests URSA's May. (My quick search there using "data analytics" yielded, among other things, "Data Analytics for Beginners" and "What does a data analyst do on a daily basis?")
Ready to go deeper? Here are some printed, online, and academic resources you may find useful:
- Storytelling with Data: A Data Visualization Guide for Business Professionals
- Analytics: How to Win with Intelligence
- Practical Statistics for Data Scientists: 50 Essential Concepts
- Data Visualization Made Simple: Insights into Becoming Visual by Dr. Sosulski, who says, "This book is a good entry point for understanding how to work with data and to present."
Learn by doing
- Kaggle not only lets you access "free GPUs and a huge repository of community published data and code," it also hosts data science competitions. "Data" means more than 20,000 multi-megabyte datasets you can do analytics on, from PGA Tour Golf Data and 1.88 Million US Wildfires to Zomato Bangalore Restaurants and Denver Police Pedestrian Stops and Vehicle Stops. KNIME's Silipo says you don't have to compete with the goal of winning — "just by participating, you'll see what your competition is, and the types of paths that are common."
But, advises URSA's Matthew May, "Don't go to Kaggle unless you can already code and have some understanding of what's going on. You can get lots of datasets at Kaggle, and lots of people there who are learning, who are extremely good."
May also offers one caveat to using Kaggle: "The data that you see leaves out the entire data cleaning portion of the job."
Degree programs and courses
"You can get a data analytics degree online," notes RStudio's Howe. "Johns Hopkins offers a variety of data science and analytics online programs, and Boston University runs a good online master's program in data analytics. And online programs like edX host programs from institutions around the world."
But do due diligence before selecting a degree program, May cautions. "Some of the master's degree programs in data science and data analytics were originally statistics or business analytics and have been rebranded. That doesn't mean the programs are necessarily bad, but if the content hasn't been updated, they may miss some of what we consider data analytics and data science are today."
Sosulski has been authoring a 12-week executive certification program through the New York University Stern School of Business, available starting January 2020, in which, she says, "Students will spend four weeks learning R, and then the other eight weeks learning to visualize."
There is, of course, no shortage of companies and educational institutions offering online courses relevant to data analytics, such as:
- IBM's Cognitive Class and Microsoft Learn (formerly Microsoft Virtual Academy)
- MIT Open Courseware, Harvard Extension, and Stanford Online
- LinkedIn Learning, Udemy, Khan Academy, Codecademy, and Code.com
- edX, Coursera, Open Culture, Class Central, and iTunesU Free Courses