What is a Data Scientist

Extracting meaning from data is nothing new, but the world has really woken up to the value of predictive analytics and machine learning: preference and recommendation engines, effective marketing, spam filters that actually work, better medicine, even self-driving cars. This new focus has created a scramble as companies try to find people with the skills needed to get them into the predictive game. The scramble has led to two questions: 1) what, exactly, am I looking for (not just programmers and not just statisticians), and 2) where are these people?

Emergence of the Data Scientist

The world has settled on the terms Data Science and Data Scientist. HBR famously referred to the Data Scientist as the sexiest job of the 21st century.

I like the term because its practitioners are applying the scientific method while working in the medium of data: creating and validating hypotheses, making discoveries, and improving life in myriad ways.

A data scientist is more than a statistician:

  • The data is not sitting in nice, neat SAS datasets. It’s in unstructured social media networks, streaming off of sensors, or in various other messy forms.
  • The machine learning algorithms bringing the breakthrough innovations are more computational than mathematical.
  • Implementation of the insights coming from the data requires significant programming.

A data scientist is more than a programmer:

  • Programmers don’t normally think in terms of designing and executing experiments.
  • Data scientists must understand what data these experiments require and what can be inferred from the data.
  • The big data aspect requires specialized skills in distributed computation.

So, What is a Data Scientist?

This rare combination of skills–and the hype surrounding the field–has led to some fun definitions of the data scientist:

[Image: DataScientistDefinition]

These snarky definitions have been pretty popular as well:

  • “Data Scientist is a Data Analyst who lives in California”
  • “A data scientist is a business analyst who lives in New York.”
  • “A data scientist is a statistician who lives in San Francisco.”
  • “Data Science is statistics on a Mac.”

Hype and cynicism aside, the world needs more technologists who can program, handle data, and apply inferential statistics with real mastery. There is an incredible need and the work is intellectually stimulating. This has motivated many developers to learn to be data scientists, myself included.

Next up… approaching the data science field as a developer.

An Unexpected Journey

It was just over a year ago when I started talking to a small company in Columbia, SC about heading up their Engineering team. They were a .NET shop, right in my wheelhouse. All I had to do was pick up the insurance domain and figure out what predictive analytics and machine learning are all about.

Technically, I didn’t have to understand machine learning because the company has a core research team that develops and maintains algorithms. I would lead the team that turns those algorithms into great software solutions and user experiences for the insurance industry. Of course, no engineer worth his salt is going to be content to treat the heart of his system as a mysterious black box. So, for me, taking the job meant diving into machine learning, which I knew nothing about. Having spent the previous five years building mobile and web field service automation solutions, I knew “big data” was a hot topic, but I had missed the rise to prominence of predictive analytics and the whole Data Scientist craze, the sexiest job of the 21st century.

Drew Conway created a helpful and widely referenced Venn diagram of skills that define the Data Scientist:

[Image: DataScienceVennDiagram]

I’ve spent two and a half decades filling in the red circle. As a Domain-Driven Design adherent, I’ve always committed myself to learning the domain my software is designed for, in this case insurance. However, I hadn’t given serious thought to higher math and statistics beyond batting averages and occasionally having to remind myself how obscure three sigma outliers are. I enjoyed these subjects in college, but left them behind as I built systems where the most complicated math could be done by a middle schooler. Sure, cryptography has some interesting math, but we rely on libraries for that.

I’m going to use this space to chronicle my journey from a transactional business system developer to a data scientist, or at least a machine learning/predictive analytics specialist. I’m early in the journey, but I’ve made enough missteps as well as positive steps that I can help others looking to get into the predictive analytics space.

Turtles All The Way Down

I enjoyed a moment today that would warm the heart of any geeky dad. Sitting in church this morning, the pastor told the story of his college days when he read Chariots of the Gods? and came to believe that life on Earth was seeded by aliens.  As an aside, he added that he hadn’t thought to ask the question of where the aliens came from, and if they came from aliens, then where did those aliens come from. My 11-year-old leaned over and whispered in my ear… “It’s turtles all the way down.”

Drive: The Peopleware of this Generation?

The 70’s and 80’s had The Mythical Man-Month. The 90’s had Peopleware. These works helped software managers understand and communicate to non-software people the dynamics involved in effectively managing software teams. The management models that got cars and TVs built at ever cheaper costs didn’t work on software projects. Brooks, Lister, and DeMarco helped thoughtful software managers figure out how to best manage professionals who must bring a challenging combination of creativity and technical rigor to their work.

In Drive: The Surprising Truth About What Motivates Us, Dan Pink provides the same kind of resource for a more general audience. He argues that the models for understanding human motivation that worked in the past are outdated and don’t apply to today’s knowledge workers.

Pink contrasts internal and external motivation. Our internal motivation is driven by three needs: autonomy, mastery, and purpose.

Autonomy: We need more than “buy in” or even independence. We crave true self-direction.  To provide meaningful autonomy to our teams, we need to give our people choice over:

  • Task – What they do
  • Time – When they do it
  • Team – Who they do it with
  • Technique – How they do it

Mastery: We are driven to grow, improve, and be increasingly capable of solving more and more complex problems.

  • Mastery is a mindset: It requires the capacity to see your abilities not as finite, but as infinitely improvable. Internally motivated people tend to have an incremental theory of intelligence, prize learning goals over performance goals, and welcome effort as a way to improve at something that matters.
  • Mastery is pain: It demands effort, grit and deliberate practice. The path to mastery – becoming ever better at something you care about – is a difficult process over a long period of time.
  • Mastery is asymptotic: It’s impossible to fully realize, which makes it simultaneously frustrating and alluring.

Purpose: The old models for understanding motivation assumed that we are primarily motivated by money. Today we see money as a necessary but not sufficient reward of our work.  We want to know that what we do makes a difference in the world.

Much of Drive is derivative of the research done in cognitive psychology and behavioral economics, but Pink brings it all together in an engaging and practical way that will allow managers to put these ideas into action. I highly recommend it to anyone who oversees the work of anyone else.

For a small taste of the ideas in the book, check out this presentation.

Comeback-a-thon

It’s only two days before the Marathon Data Systems-sponsored Jersey Shore Comeback-a-thon, a 24-hour hackathon with a theme of bringing business back to the Jersey Shore in the first season after Hurricane Sandy.

Many businesses along the Shore have worked very hard and made great investments to rebuild, clean up, and claw their way back into business in time for the summer season. Unfortunately, the images of the storm’s devastation have left many would-be visitors thinking there is no beach to vacation at this year.

The $1000 grand prize will be awarded to the individual or team that comes up with the best application, system, or other creative use of technology to help bring business back to the Shore. There will also be two $300 runner-up prizes.

We’ve had good coverage from patch.com, njbiz.com, and triCity News. Get more details here.

Hashtag: #combackathon

Attributes of a Good Team Room

As we look at new office space, I had to think about what we would want in new team rooms. Here is what I came up with, with the help of my team…

  • Four walls (ideally with at least one being glass or lots of windows)
    • high enough to provide a sound barrier
    • lots of white-board space
    • wall of offices or conference rooms is OK as long as the doors can be closed
  • At least 5 feet of desk space per person
  • Ability to run HDMI to a large, shared monitor (probably on a rolling stand)
  • 8 – 10 people per room (ideally with removable wall to combine two rooms)
  • Private space nearby
  • Manager’s office nearby

Confusion Over Structs

I was recently perusing an article called C# developer interview questions and answers. I do a lot of interviews of developers with C# experience, so I like to see what others think are good questions. The article was generally good, but there was this…

[Image: excerpt from the article, not reproduced]

Good question. Good first sentence. Then things start to go downhill.

First, let’s look at the claim that “Structs are passed by value and not by reference.” This is technically true, but betrays a superficial understanding of the language. In C#, all parameters are passed by value unless a modifier such as ref or out is used. It just so happens that the value of a reference type variable is in fact a reference. If that doesn’t make sense, read this article by Jon Skeet. A better way to state the point would be: “Structs are value types, while classes are reference types.” This would also cover the part about not being able to inherit from structs because all value types are sealed.
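
To make the distinction concrete, here is a minimal sketch. The type and method names (PointStruct, PointClass, PassingDemo) are made up for illustration and do not come from the article under discussion. It shows that a struct argument is copied, that a class argument is a copied reference to the same object, and that only the ref modifier lets a method rebind the caller’s variable.

```csharp
using System;

public struct PointStruct { public int X; }   // value type (hypothetical)
public class PointClass { public int X; }     // reference type (hypothetical)

public static class PassingDemo
{
    // The struct itself is copied, so the caller's variable is untouched.
    public static void MutateStruct(PointStruct p) { p.X = 99; }

    // The reference is copied; both copies point at the same object,
    // so the mutation is visible to the caller.
    public static void MutateClass(PointClass p) { p.X = 99; }

    // Reassigning the parameter only rebinds the local copy of the reference.
    public static void ReassignClass(PointClass p) { p = new PointClass { X = 42 }; }

    // With ref, the caller's variable itself is passed by reference.
    public static void ReassignClassByRef(ref PointClass p) { p = new PointClass { X = 42 }; }

    public static void Main()
    {
        var s = new PointStruct { X = 1 };
        MutateStruct(s);
        Console.WriteLine(s.X);   // 1

        var c = new PointClass { X = 1 };
        MutateClass(c);
        Console.WriteLine(c.X);   // 99

        ReassignClass(c);
        Console.WriteLine(c.X);   // 99

        ReassignClassByRef(ref c);
        Console.WriteLine(c.X);   // 42
    }
}
```

The third call is the telling one: if class instances were truly passed by reference, ReassignClass would have changed what c refers to.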

Next we have “Structs are stored on the stack not heap.” This is false. Look at this code:

[Image: code listing, not reproduced]

Where will the Point inside the Shape be stored? Clearly on the heap. It is true that value types declared as local variables, even though they may be newed up, are still allocated on the executing thread’s stack space…

[Image: code listing, not reproduced]
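
As a rough stand-in for the two cases described above, the sketch below uses hypothetical Point and Shape definitions. The member names (X, Y, Origin) and the StorageDemo helper are assumptions made for illustration, not the original listings. The comments describe where the data typically lives; the stack placement of locals is an implementation detail of the runtime, and a local captured by a lambda or iterator, for example, ends up on the heap.

```csharp
using System;

public struct Point          // value type (hypothetical definition)
{
    public int X;
    public int Y;
}

public class Shape           // reference type (hypothetical definition)
{
    // A value-type field is stored inline inside the Shape object,
    // and Shape objects live on the managed heap.
    public Point Origin;
}

public static class StorageDemo
{
    public static int ReadThroughShape()
    {
        var shape = new Shape();                    // heap allocation
        shape.Origin = new Point { X = 3, Y = 4 };  // Point data written into the heap object

        Point local = shape.Origin;  // copy into a local variable (typically on the stack)
        local.X = 10;                // changes only the local copy

        return shape.Origin.X;       // the heap copy is unchanged
    }

    public static void Main()
    {
        Console.WriteLine(ReadThroughShape());   // 3
    }
}
```

Note that “new” on a struct does not imply a heap allocation; it only runs the initializer for wherever the value happens to be stored.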

Now we get to the best of all… “The result is better performance with Structs.” If it were that simple we would always use structs and wouldn’t even need classes. The reality is that some objects are better modeled as classes and some are better modeled as structs. Discussing what kinds of objects are best modeled as structs would be a great question.

Looking for Agile Software Developer in Central NJ

Are you a software craftsman? Do you value growing your skills as you simultaneously learn from and teach the other members of your team? Want to help us build the world’s best cross-platform mobile solutions for field service workers?

The folks that fix your A/C when it’s 100 degrees and keep your toilet from backing up into your living room deserve the best software technology can provide, and we need your help. We are looking for a key addition to our Agile/Scrum/XP team that is building a highly scalable, event-driven, mobile-enabled platform. This person will be test-driven and team-focused. He or she will be a mentor to team members new to TDD and some of the other XP practices.

Must haves:

  • At least two years of test-driven development experience with the .NET platform (C#) on a collaborative team (Scrum, CI, TDD/BDD, Pair Programming, Collective Code Ownership, Refactoring, Iterative & Incremental Development)
  • Strong foundation in Object-Oriented Design and Programming (Design Patterns, SOLID principles, etc.)
  • Strong desire to learn and teach others

Highly valued:

  • Domain-Driven Design experience
  • SOA and/or SaaS experience (especially with NServiceBus or any ESB)
  • Entity Framework Code-first or other ORM experience (especially with RavenDB, MongoDB, or other document databases)
  • Significant JavaScript experience (especially with Sencha Touch, CoffeeScript, or any JavaScript TDD/BDD framework)
  • Mobility experience (iOS, Android, or Web)

Voted as one of the Top 30 Places to Work in NJ, Marathon develops, supports, and sells SaaS software and marketing solutions to SMBs in pest control, landscaping, HVAC, and other service verticals. Comprehensive benefits package including vacation, insurance, and a company-sponsored profit sharing plan. Contact me at efarr@marathondata.com or @efarr.