This report is based on self-reported information from LinkedIn, including all publicly visible personal and company profiles, skills, professional experiences, and education.
We identified data scientists based on their professional headline and current title. We only included data scientists associated with companies we could identify in our sample. We considered the possibility that those listing “data scientist” in their profile without an association with an actual company may only have aspirations about a career in data science, so we did not include those profiles in our analysis.
A summary of the dataset is provided in the chart below.
For the data scientists identified, we analyzed in detail:
The analysis was carried out in Python and SQL using Jupyter, an open-source computing platform. The Python packages charts and python-highcharts were used to create interactive visualizations in Highcharts and Highmaps. Data was stored and processed using Amazon Redshift.
Data Scientists In the World
How many data scientists are there?
Rather than getting lost in the “What is a data scientist?” debate, as so many have done before, we chose to let data scientists speak for themselves. We certainly could have identified data scientists using a complex machine learning algorithm, employing skills, education, keywords, or other identifying characteristics, but the best solution is often the simplest. Similar to how LinkedIn simply asks its users, “does Joe know about Python?” we asked “Does Joe say he is a data scientist?” Specifically, we looked at people who actually state “data scientist” either in their title or in their professional headline.
A direct consequence of our approach is that it does not necessarily capture everyone doing data science. Today, many companies employ data analysts, business intelligence analysts, quantitative analysts, or simply scientists who may very well be doing the same work as someone with a data scientist title at another company. However, many companies also employ analysts who do very little with data beyond working with it in Excel. We intentionally avoid including this great diversity of titles so as not to contaminate our sample, and instead consider a very small list of permutations around the phrase “data scientist.”
In addition to searching for data scientists in English, we translated data science titles into eight other languages on LinkedIn: French, Spanish, Italian, Portuguese, German, Swedish, Dutch, and Turkish. If you are curious to see precisely how we identified someone as a data scientist, feel free to take a look at the final query that we ran on our Redshift cluster.
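To make the matching rule concrete, here is a minimal sketch in Python of a title-or-headline check. It is not the query that was run on Redshift: the phrase list, the function names, and the tolerance for hyphens and plurals are all assumptions made for illustration, and the translated phrases would need to be appended to the list.

```python
import re

# Hypothetical phrase list. Only the English phrase comes from the text;
# translated variants would be appended here.
PHRASES = ["data scientist"]


def build_pattern(phrases):
    """Compile a case-insensitive pattern that matches any phrase,
    tolerating extra whitespace or a hyphen between words and an
    optional plural "s"."""
    alternatives = []
    for phrase in phrases:
        words = [re.escape(word) for word in phrase.split()]
        alternatives.append(r"[\s\-]+".join(words))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")s?\b", re.IGNORECASE)


PATTERN = build_pattern(PHRASES)


def is_self_identified(title, headline):
    """Return True when the phrase appears in the current title or
    in the professional headline. Missing fields are treated as empty."""
    return any(PATTERN.search(text or "") for text in (title, headline))
```

A profile with the title "Senior Data Scientist" passes this check, while one whose title and headline mention only "Data Analyst" does not. The additional requirement that the person be associated with an identifiable company would be a separate filter.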
All in all, we found only 11,400 data scientists worldwide. While this number seems low at first glance, it is in line with the analysis by LinkedIn’s own Senior Data Scientist Peter Skomoroch, who shared his insights in this Quora answer in March of 2014. Taking Skomoroch’s estimate of 6,900 self-identified data scientists and factoring in both (a) the 15 months of growth in LinkedIn’s user base experienced during that time and (b) the 22% change in the number of new data scientists added between 2014 and 2015 (see the year-over-year chart below), we get 8,900 data scientists. This number is smaller than our estimate, most likely due to the fact that Skomoroch's search excluded other variations of the phrase “data scientist,” and was conducted only in English.
How has the number of data scientists changed over time?
Our analysis revealed impressive growth in the number of data scientists over time. In fact, at least 52% of all data scientists have earned that title within the past 4 years.
In the chart above, the cumulative number of data scientists in any given year corresponds to the number of present-day data scientists who started their first job that year. Since the first job of someone who is a data scientist today may not have been data-science related, the curve underestimates the growth in the total number of data scientists. One advantage of this approach is that we can see how the number of data scientists grew before LinkedIn was founded, because people list both their professional experience and education prior to 2003.
Note that while LinkedIn has certainly exhibited impressive growth ever since it was founded, our analysis in this chart does not depend on this growth. Specifically, what is important for this type of analysis is how many data scientists have profiles on LinkedIn today, and not how the number of LinkedIn profiles has changed over time.
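The cumulative curve described above can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the chart: the function name and the sample years are hypothetical, and the input is assumed to be one first-job year per present-day data scientist.

```python
from collections import Counter
from itertools import accumulate


def cumulative_by_first_job_year(first_job_years):
    """Return (year, new, cumulative) rows, one per year in the observed
    range, where `new` counts present-day data scientists who started
    their first job that year."""
    if not first_job_years:
        return []
    counts = Counter(first_job_years)
    years = range(min(counts), max(counts) + 1)
    new_per_year = [counts.get(year, 0) for year in years]
    return list(zip(years, new_per_year, accumulate(new_per_year)))


# Hypothetical first-job years for six present-day data scientists.
sample = [1998, 2001, 2001, 2003, 2003, 2003]
rows = cumulative_by_first_job_year(sample)
for year, new, total in rows:
    print(year, new, total)
```

Because the input is a snapshot of profiles that exist today, the result does not depend on how LinkedIn membership grew over the years.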
While this growth in the total number of data scientists is impressive, how does it compare to other technical fields? To answer this question, we looked at the change in the number of people having data science skills starting their first job, and compared that number to two other disciplines: software engineering and data analysis.
Over time, all three disciplines exhibited similar behavior, including contractions in the number of new people added during the dot-com bubble and the most recent recession. However, since 2012 the number of data scientists starting their first job has increased at a rate that is consistently 50% higher than that for software engineers and data analysts.
Note that the year-over-year change shown in this chart is different from the year-over-year growth of the field in general, as we are looking only at the number of people added to the field. Unfortunately, there is no way for us to determine when and how many people have left the field, as they would not be identified as present-day data scientists. However, given how young the field is, we speculate that very few people have left. If one were to take the outflow of people into account, the difference between data scientists and software engineers/data analysts would be even more pronounced.
Where are data scientists located?
55% of all the data scientists on LinkedIn are located in the United States. This makes sense, given that data science originated in the US, and that the US has arguably the highest concentration of high-tech companies in the world. However, we were surprised to see hotbeds of data scientists in countries like India, the Netherlands, and Israel. All three made it into the top 10 countries, ranked by the absolute number of data scientists.
There are of course many factors contributing to our estimates of the total number of data scientists in each country.
For example, LinkedIn adoption rates vary greatly on a country-by-country basis, and people in some countries do not use LinkedIn at all. However, given the education level of data scientists, and their propensity towards all things technological, we speculate that the vast majority of them are in fact on LinkedIn.
To show how LinkedIn adoption and population size affect the country ranking, we normalized the number of data scientists for each country by both the country’s total LinkedIn membership and its population according to the most recent census data. Each normalization paints a very different picture.
Specifically, LinkedIn adoption appears to be very poor in India, Israel, and Germany, yet India is high up on the list of the total number of data scientists.
At the same time, the “density” of data scientists, or the number of data scientists per unit of a country’s population, is the highest in Israel, followed by the United States and the Netherlands. While it is not surprising to see Israel, long known as the startup nation with Silicon Wadi as its own Silicon Valley, it is interesting to see such a high concentration of data scientists in the Netherlands.
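The two normalizations discussed in this section amount to dividing each country's count by a denominator and re-ranking. The sketch below shows the idea; the function name, the per-million scaling, and every figure in it are placeholders chosen for illustration, not values from the report.

```python
def rank_by_density(counts, denominators, per=1_000_000):
    """Rank countries by data scientists per `per` units of the
    denominator (population or LinkedIn members). Countries with a
    missing or zero denominator are skipped."""
    density = {
        country: counts[country] / denominators[country] * per
        for country in counts
        if denominators.get(country)
    }
    return sorted(density.items(), key=lambda item: item[1], reverse=True)


# Placeholder figures for three hypothetical countries.
counts = {"A": 600, "B": 100, "C": 50}
population = {"A": 300_000_000, "B": 8_000_000, "C": 50_000_000}
linkedin_members = {"A": 100_000_000, "B": 1_000_000, "C": 5_000_000}

by_population = [country for country, _ in rank_by_density(counts, population)]
by_membership = [country for country, _ in rank_by_density(counts, linkedin_members)]
print(by_population)
print(by_membership)
```

In this toy data, country A leads on absolute count, yet B leads under both normalizations and C overtakes A once membership is the denominator, which is the kind of reshuffling the normalized charts reveal.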
Data Scientists at Work
What industries employ the largest number of data scientists?
Today, the Information Technology and Services industry employs the largest number of data scientists, followed by Internet and Computer Software.
Both are noteworthy when examined in the context of Marc Andreessen’s “software is eating the world” hypothesis. For example, both Airbnb and Uber are listed as Internet companies, yet Airbnb is disrupting the Hotel Industry, and Uber is taking on incumbents in both the Transportation and Shipping industries with approaches that, in large part, are fueled by data science. Traditional businesses will need to quickly adopt best data science practices or risk having data-driven competitors simply out-innovate them.
What companies employ the most data scientists?
There is a very healthy mix of new companies and more established businesses employing data scientists. Facebook, LinkedIn, and Twitter are high on the list, along with Apple, Microsoft, IBM, GlaxoSmithKline, and GE. Facebook is the only young tech company to crack the top five, second to Microsoft, which employs almost twice as many data scientists in total.
It is important to note that these numbers account for all divisions within each company world-wide, so the figures should be considered with this context in mind.
While the above chart is interesting, it is just a snapshot of the state of data science at each company. Given how rapidly the field is evolving, we also wanted to see how businesses grow their data science teams. To do this, we used an approach similar to that described in our analysis of the number of data scientists over time. Specifically, we looked at the work experience of current data scientists to see when they last joined and left any of the top 10 employers of data scientists. Since a data scientist’s last employment with any of these companies may not have been in a data science role, the trends shown in this chart underestimate the actual efforts of these companies to grow their data science teams.
Two companies stand out in particular: Microsoft and Facebook. Both Microsoft and Facebook appear to be on a hiring spree, accelerating their data scientist recruiting during the 2014 calendar year by at least 151% and 39%, respectively, when compared to 2013 (Microsoft went from at least 49 to 123 people hired, and Facebook from 43 to 60).
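The growth figures quoted above follow from a plain percent-change calculation on the hiring counts given in the text. A short check in Python (the function name is ours):

```python
def percent_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


microsoft = percent_change(49, 123)  # hires in 2013 vs. 2014, roughly 151%
facebook = percent_change(43, 60)    # hires in 2013 vs. 2014, roughly 39.5%
print(f"Microsoft: {microsoft:.1f}%  Facebook: {facebook:.1f}%")
```

Since the underlying counts are lower bounds, the percentages should be read as approximate.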
While Microsoft appears to be bringing the largest number of new people with data science skills on board, it seems to be losing the largest number of data scientists as well. Note that our estimate of the number of people leaving each company excludes anybody who no longer identifies themselves as a data scientist on LinkedIn. Thus, the actual number of people with data science skills leaving each company is in reality larger.
Attrition and divisions aside, Microsoft still has an impressive lead both in the number of data scientists it has on staff and the pace of hiring.
The DNA of a Data Scientist
What are the primary skills of a data scientist?
In today’s world, a data scientist is expected to be a jack of all trades: a self-learner who has a solid quantitative foundation, an aptitude for programming, infinite intellectual curiosity, and great communication skills.
Instead of relying on personal and professional biases, we wanted to let the data speak for itself. We analyzed 254,000 skill records of self-identified data scientists and ranked each skill by the number of people listing it on their profile.
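Ranking skills by the number of people listing them can be sketched as below. The record layout of (person ID, skill) pairs, the case-insensitive matching, and the sample records are assumptions for illustration; the report does not describe how skill names were normalized.

```python
from collections import Counter


def rank_skills(skill_records):
    """Rank skills by the number of distinct people listing them.

    `skill_records` is an iterable of (person_id, skill) pairs. Each
    skill is counted at most once per person, ignoring case and
    surrounding whitespace."""
    unique_pairs = {(person, skill.strip().lower()) for person, skill in skill_records}
    counts = Counter(skill for _, skill in unique_pairs)
    return counts.most_common()


# Hypothetical skill records for three people.
records = [
    (1, "Python"), (1, "python "),
    (2, "Python"), (2, "R"),
    (3, "R"), (3, "Python"), (3, "SQL"),
]
ranking = rank_skills(records)
print(ranking)
```

Deduplicating per person before counting keeps a profile that lists the same skill twice from inflating the ranking.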
While “big data” and “hadoop” might still be buzzwords in some circles, they are not even in the top 10 actual skills employed by a garden-variety data scientist. Instead, generic “data analysis,” R, Python, and machine learning lead the way, followed by statistics, SQL, analytics, MATLAB, and Java.
Note that data analytics differs from data analysis in that it is a broader term, generally implying an understanding of techniques and methods as opposed to just familiarity with tools for exploring and analyzing data (see this article).
While this ranking does show the most prominent skills in the data science community as a whole, it averages out important hierarchical differences. We took a closer look at LinkedIn data and compared top skills across three different seniority levels: chief, senior, and junior.
The chief data scientist group included people in the C-suite, as well as founders, co-founders, owners and vice-presidents. The senior group consisted of directors of data science, managers, heads of data science, data science leads, principal and senior data scientists. Finally, the junior group included everybody not already captured by the chief and senior groups.
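One way to express this grouping is a keyword check applied in order of precedence, so that a title matching both the chief and senior lists lands in the chief group. The sketch below is an assumption-laden illustration: the keywords are adapted from the group descriptions above, while the C-suite abbreviations, the regular expressions, and the function name are ours.

```python
import re

# Keywords adapted from the group descriptions; the abbreviations and
# exact matching rules are assumptions.
CHIEF = re.compile(
    r"\b(?:chief|ceo|cto|cdo|cio|coo|(?:co-?)?founder|owner"
    r"|vice[\s-]?president|vp)\b",
    re.IGNORECASE,
)
SENIOR = re.compile(
    r"\b(?:director|manager|head|lead|principal|senior)\b",
    re.IGNORECASE,
)


def seniority(title):
    """Classify a title as chief, senior, or junior. Chief keywords take
    precedence; anything matching neither list is treated as junior."""
    if CHIEF.search(title):
        return "chief"
    if SENIOR.search(title):
        return "senior"
    return "junior"
```

For example, "Senior Vice President, Data Science" is classified as chief because the chief keywords are checked first, and a plain "Data Scientist" falls through to junior.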
To highlight differences across seniority levels and make these differences easier to digest, we compared each level to the same common denominator: the average data scientist.
Interestingly, chief data scientists were significantly more likely to list business intelligence, analytics, leadership, strategy and management among their skills than both junior and senior data scientists. At the same time, today’s chief data scientists appear to be less technical on average: only 27% and 26% listed Python and R, respectively. Compare this to the corresponding 52% and 53% of junior data scientists, along with 38% and 43% of senior practitioners.
While it is certainly true that chief data scientists may be simply emphasizing skills that are more relevant to their position within the company, we also speculate that many chief data scientists assumed these roles by virtue of being in the field longer or having additional qualifications, such as a business degree. Therefore, it is also possible that some chief data scientists never actually learned many of the skills listed by more junior people.
Like chief data scientists, senior data scientists de-emphasized data analysis and were instead more likely to emphasize data analytics than junior data scientists: more than 45% of senior data scientists listed that skill vs. only 30% who did so at the junior level.
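The comparison of each seniority level to the average data scientist can be sketched as a difference in shares. Whether the original charts used a difference or a ratio is not stated, so the percentage-point gap below is an assumption, as are the function names and the sample profiles.

```python
def skill_share(profiles, skill):
    """Fraction of profiles listing `skill`; each profile is a set of skills."""
    if not profiles:
        return 0.0
    return sum(skill in profile for profile in profiles) / len(profiles)


def difference_from_average(groups, skill):
    """Percentage-point gap between each group's share of `skill` and the
    share across all profiles combined (the "average data scientist")."""
    everyone = [profile for members in groups.values() for profile in members]
    overall = skill_share(everyone, skill)
    return {
        name: round((skill_share(members, skill) - overall) * 100, 1)
        for name, members in groups.items()
    }


# Hypothetical profiles, each reduced to a set of listed skills.
groups = {
    "chief": [{"strategy"}, {"python", "strategy"}],
    "senior": [{"python"}, {"r"}],
    "junior": [{"python"}, {"python", "r"}, {"python"}, {"sql"}],
}
gaps = difference_from_average(groups, "python")
print(gaps)
```

A positive value means the group lists the skill more often than data scientists overall, and a negative value means less often.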
What is a data scientist's level of education?
We analyzed 27,000 education records to evaluate what percentage of data scientists hold advanced degrees and what fields of academic speciality they come from. This is shown as a percentage of all distinct bachelor’s, master’s, and doctorate degrees listed by data scientists (there are typically multiple degrees per person). 12% of all self-identified data scientists did not list any degrees.
Over 79% of data scientists listing their education have earned a graduate degree, with 38% of all data scientists who had an education record earning a PhD, and close to 42% listing a Master’s degree as the highest degree attained. This shockingly large percentage of data scientists with graduate degrees is indicative of the increasing demand for specialists and a desire for advanced training in general. This trend is echoed by many of today’s data science initiatives that build on research backgrounds of PhDs by helping them learn the tools and the technology stack most commonly used in the industry. This allows them to quickly get up to speed and become productive members of any data science team.
As with our analysis of skills, we saw significant differences in education across seniority levels.
The ratio of data scientists with a PhD to data scientists with only a Master’s degree is the highest at the senior level. In fact, it is almost 31% higher for senior data scientists when compared to junior data scientists. This indicates that in today’s market, having a PhD helps data scientists climb the corporate ladder. We also noticed that fewer data scientists had a PhD at the chief data scientist level than at the senior level (35% vs 43%). Again, we speculate that this is largely due to the fact that people in more senior positions have been in the field longer and/or have other credentials that may be more relevant to their position.
What are the top academic backgrounds of data scientists?
Overall, Computer Science is the dominant field of study among data scientists. This supports what we found in our analysis of the skills listed, and what DJ Patil and Hilary Mason expressed in their book Data Driven: Creating a Data Culture. According to these two data science pioneers, “a data scientist who lacks the tools to get data from a database into an analysis package and back out again will become a second-class citizen in the technical organization.”
That being said, we speculate that another reason Computer Science is so prominent is that many more people graduate with backgrounds in Computer Science than with backgrounds in Biology, Neuroscience, Bioinformatics or Psychology. Furthermore, today’s Computer Science majors are arguably much more likely to work in the technology sector compared to any other graduate.
It is also interesting to note the differences between master’s and doctoral degrees:
Top 10 Backgrounds for People with Master’s and Doctoral Degrees
Master’s degrees (field, % of people): Business Administration/Management; Machine Learning/Data Science; Economics & Finance; All other fields
Doctoral degrees (field, % of people): Machine Learning/Data Science; Economics & Finance; All other fields
First, some programs, such as those in Business Administration, are rarely offered at the PhD level. This explains why Business Administration/Management does not appear in the list of top 10 PhD fields, but is ranked second on the list of Master’s fields, with over 12% of data scientists listing an MBA as their highest and most recent level of education.
Second, graduates across fields face different job prospects upon graduation. For example, Physics majors arguably have fewer career options in both industry and academia than people with a computer science, electrical engineering, computer engineering or a statistics background. For that very reason, Physics majors historically have applied their expertise to other fields post-graduation. We speculate that this is the same reason they are likely overrepresented in data science.
That being said, there does appear to be a strong connection between data science and the mindset that Physics encourages. Kevin Novak, Head of Data Science Platform at Uber, notes that Physics helps “you become very good at understanding how to approximate, as well as when and why it’s appropriate.” Physics, and other disciplines like Biology, Neuroscience, and Electrical Engineering all involve experimentation, problem solving, and working with empirical data. Of course, empiricism inevitably gets messy and practical, encouraging exactly the mindset one needs to be a great data scientist.
Closing Thoughts from Lillian Pierson
In 2012, Thomas H. Davenport and DJ Patil reported on the growing need for data scientists in a world where the amount of data is growing exponentially. In an article titled Data Scientist: The Sexiest Job of the 21st Century, they likened the new data scientist to the “Wall Street ‘quants’ of the 1980’s and 1990’s”:
In those days people with backgrounds in physics and math streamed to investment banks and hedge funds, where they could devise entirely new algorithms and data strategies. Then a variety of universities developed master’s programs in financial engineering, which churned out a second generation of talent that was more accessible to mainstream firms. The pattern was repeated later in the 1990s with search engineers, whose rarefied skills soon came to be taught in computer science programs.
Three years later, this report reveals the future that Davenport and Patil envisioned. Universities, bootcamps, and training programs have sprung up to bridge the skills gap. Simultaneously, organizations are clarifying and shaping the data scientist’s role. They are recognizing that the distributed tasks formerly carried out by a variety of roles can be most effectively executed when condensed under one title.
This report shows consolidation among the skills of data scientists, coupled with a growth in people with this title. I expect to see this upward trend continue as pioneers realize that the blend of machine learning, Python, and deep domain expertise that they have already mastered is actually data science, and as they inspire others to acquire these same skills.
This is exactly how my own career has played out. I spent years in technical roles in engineering and analytics, but there was limited opportunity for me to use the breadth of my skills in a single job. The growth in demand for data scientists led me to round out my own skillset and pursue a role within the field.
My personal belief is that the next four years of growth in the profession will come from those who work in adjacent fields and now just need to sharpen their skills. Now that I am a data scientist, I am excited to have the chance to help others who are just beginning this same journey.