Learning Data Science: Getting Started

Get Started with Data Science on R

You’ve chosen R as your tool for learning data science and machine learning. If you come from a background on the Microsoft technology stack, that choice was affirmed by the recent announcement that Microsoft acquired Revolution Analytics, a leader in the R world.

Download and install R from CRAN, the Comprehensive R Archive Network. You’ll find installers for Mac, Windows, and Linux. I’ve installed it on both Mac and Windows, and both installs were simple and straightforward.

Next, download and install R Studio. Even if you are a command-line person who thinks that IDEs rot the mind and inhibit true learning of a new language, trust me: you will still be writing R code in a Notepad-like experience, and the integrated help, plots, and data views make R Studio a must-have. Just like R, R Studio is a straightforward install on Mac or Windows.
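Once both are installed, a quick first session is a reasonable way to confirm that everything works. The lines below are only a minimal sketch, not an official checklist: they use the `mtcars` data set that ships with base R, and the choice of columns and plot labels is mine, purely for illustration.

```r
# Confirm which version of R is running
print(R.version.string)

# mtcars is a small data set bundled with base R
data(mtcars)
str(mtcars)            # structure: column names and types
summary(mtcars$mpg)    # five-number summary plus the mean

# A scatter plot; in R Studio it should appear in the Plots pane
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "mtcars: weight vs. fuel economy")

# A simple linear regression of mpg on weight
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Draw the fitted line on top of the scatter plot
abline(fit)

# Uncomment to open the integrated help page for lm()
# ?lm
```

If the plot and the coefficients show up, R and R Studio are talking to each other and you are ready to start on the books and courses.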

While you’re at it, download copies of An Introduction to Statistical Learning with Applications in R and The Elements of Statistical Learning. Two of the best data science books on the R platform are made freely available by the authors in electronic format!

Learning Data Science: Choose a Platform



One of the first questions I confronted when setting out to learn data science was which platform to use. As you begin to look at books and courses, you realize that you’ll need a basic platform for working with data. Think of it as an IDE for data manipulation, statistics, and algorithms. For example, if you take Andrew Ng’s popular Machine Learning course, you’ll be doing the exercises in Octave. If you take the machine learning course on Pluralsight, you’ll be using Encog.

Data Scientists Love Them Some Python

Python is the most popular general-purpose programming language in the machine learning world. I’m not a Python guy (yet), but you can start at SciPy and go from there.

Why I Chose R

I initially started working through Andrew Ng’s course, but I wasn’t sold on spending a lot of time learning Octave. I had a Data Mining book with all the exercises in Weka, but I wasn’t loving that idea either. I kept hearing about this statistics language called R. After some investigation, I found that the R language is nothing to write home about, but R Studio and the vast collection of available packages make R a great choice.
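To give a feel for what working with a package looks like, here is a small sketch that fits a classification tree with `rpart`. Picking `rpart` and the built-in `iris` data set is my own assumption for the example; `rpart` normally ships with the standard R installers as a recommended package, and the snippet falls back to installing it from CRAN if it is missing.

```r
# Install rpart from CRAN only if it is not already available
if (!requireNamespace("rpart", quietly = TRUE)) {
  install.packages("rpart")
}
library(rpart)

# Fit a classification tree that predicts the species from the
# four measurements in the built-in iris data set
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)

# Predict on the same rows the tree was trained on.
# This is training accuracy, so it is an optimistic estimate.
pred <- predict(fit, newdata = iris, type = "class")
accuracy <- mean(pred == iris$Species)
print(accuracy)
```

The same pattern of installing once, loading with `library()`, and then calling a modeling function applies to most packages you will meet.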

R Studio has been great to work in. The popular Coursera Data Science specialization is essentially an extended course in R. Azure ML Studio now supports the R language. The list goes on and is growing. The folks at Kaggle show the popularity of tools used by their competitors, with R as the clear winner…

Figure: Kaggle Tools (popularity of tools used by Kaggle competitors)

Bottom line… if you have a tool that makes sense for you, then use it. Otherwise, start with R.