What is a Data Scientist?

Extracting meaning from data is nothing new, but the world has really woken up to the value of predictive analytics and machine learning: preference and recommendation engines, effective marketing, spam filters that actually work, better medicine, even self-driving cars. This new focus has created a scramble as companies try to find people with the skills needed to get them into the predictive game. The scramble has led to two problems: 1) what, exactly, am I looking for (not just programmers and not just statisticians), and 2) where are these people?

Emergence of the Data Scientist

The world has settled on the terms Data Science and Data Scientist. HBR famously referred to the Data Scientist as the sexiest job of the 21st century.

I like the term because its practitioners apply the scientific method while working in the medium of data: creating and validating hypotheses, making discoveries, and improving life in myriad ways.

A data scientist is more than a statistician:

  • The data is not sitting in nice, neat SAS datasets. It’s in unstructured social media networks, streaming off of sensors, or in various other messy forms.
  • The machine learning algorithms bringing the breakthrough innovations are more computational than mathematical.
  • Implementation of the insights coming from the data requires significant programming.

A data scientist is more than a programmer:

  • Programmers don’t normally think in terms of designing and executing experiments.
  • Data scientists must understand what data these experiments require and what can be inferred from that data.
  • The big data aspect requires specialized skills in distributed computation.

So, What is a Data Scientist?

This rare combination of skills, and the hype surrounding the field, has led to some fun definitions of the data scientist:

[Image: Data Scientist definition]

These snarky definitions have been pretty popular as well:

  • “Data Scientist is a Data Analyst who lives in California”
  • “A data scientist is a business analyst who lives in New York.”
  • “A data scientist is a statistician who lives in San Francisco.”
  • “Data Science is statistics on a Mac.”

Hype and cynicism aside, the world needs more technologists who can program, who can handle data, and who have a mastery of inferential statistics. There is an incredible need, and the work is intellectually stimulating. This has motivated many developers to learn to be data scientists, myself included.

Next up: approaching the data science field as a developer.

An Unexpected Journey

It was just over a year ago that I started talking to a small company in Columbia, SC about heading up their Engineering team. They were a .NET shop, right in my wheelhouse. All I had to do was pick up the insurance domain and figure out what predictive analytics and machine learning are all about.

Technically, I didn’t have to understand machine learning because the company has a core research team that develops and maintains algorithms. I would lead the team that turns those algorithms into great software solutions and user experiences for the insurance industry. Of course, no engineer worth his salt is going to be content to treat the heart of his system as a mysterious black box. So, for me, taking the job meant diving into machine learning, which I knew nothing about. Having spent the previous five years building mobile and web field service automation solutions, I knew “big data” was a hot topic, but I had missed the rise to prominence of predictive analytics and the whole Data Scientist craze around “the sexiest job of the 21st century.”

Drew Conway created a helpful and widely referenced Venn diagram of the skills that define the Data Scientist:

[Image: Data Science Venn diagram]

I’ve spent two and a half decades filling in the red circle. As a Domain-Driven Design adherent, I’ve always committed myself to learning the domain my software is designed for, in this case insurance. However, I hadn’t given serious thought to higher math and statistics beyond batting averages and occasionally having to remind myself how rare three-sigma outliers are. I enjoyed these subjects in college, but left them behind as I built systems where the most complicated math could be done by a middle schooler. Sure, cryptography has some interesting math, but we rely on libraries for that.

I’m going to use this space to chronicle my journey from a transactional business system developer to a data scientist, or at least a machine learning/predictive analytics specialist. I’m early in the journey, but I’ve made enough missteps, as well as positive steps, that I can help others looking to get into the predictive analytics space.