You’ve got R and R Studio installed. Now what? If you are more of a book learner, I’ll give you some places to start in the next post; here we’ll start with the MOOC space. New content arrives all the time, so all I can give you is a snapshot as of February 2015, though I’ll provide updates occasionally.
Coursera Data Science Track
Put together by three professors from the Johns Hopkins University Biostatistics department, the Data Science Specialization from Coursera is a series of nine 4-week courses on R and data science. As of this writing, I’ve taken R Programming, Getting and Cleaning Data, Statistical Inference, and Practical Machine Learning.
Pros:
- Consistent use of R. You will become pretty proficient with R just by taking a few of these courses.
- Nice combination of video lectures, quizzes, and practical projects.
- Popular courses with active discussion groups during class offerings.
Cons:
- Coverage of topics like statistical inference and machine learning is not in-depth enough to be called anything other than a survey.
- Some of the content and assignments appear to have been rushed and not well edited.
- The profs don’t interact on the discussion boards.
Machine Learning by Andrew Ng
Machine Learning by Stanford’s Andrew Ng was one of the earliest and most popular courses on machine learning. This course goes into enough depth for you to not just use machine learning as a black box, but to understand how and why it works. That level of understanding comes with a caveat: you’ll need to remember a bit of your college calculus and linear algebra (although Ng provides an optional section on the linear algebra you’ll need).
Statistical Learning with Trevor Hastie and Robert Tibshirani is new and in its first session. I started the course, but haven’t been able to make the time to keep up with it. However, I think this may be the best course to start with for several reasons:
Udacity offers some machine learning courses that look pretty good, but to access the course materials and exercises (which is necessary to really learn), you must use the paid version, which is pretty expensive. Also check out Pedro Domingos’s Machine Learning course.
Data Science in the Cloud with Azure ML and R is a short eBook that steps you through building a model and deploying it to Azure ML as a Web service. The book assumes you already know how to use R, so it’s not the best starting point if you are new to R. However, I’d go ahead and pick up the book anyway, because it covers the critical area of how to deploy a model once it’s built.
You’ve chosen R as your tool for getting started learning data science and machine learning. If you are coming from a background in the Microsoft technology stack, your choice of R was affirmed by the recent announcement that Microsoft acquired Revolution Analytics, a leader in the R world.
Download and install R from CRAN, the Comprehensive R Archive Network. You’ll find installers for Mac, Windows, and Linux. I’ve installed it on both Mac and Windows, and both installs are simple and straightforward.
Next, download and install R Studio. Even if you are a command-line person who thinks that IDEs rot the mind and inhibit true learning of a new language, trust me: you will still be writing R code in a Notepad-like editor, and the integrated help, plots, and data views make R Studio a must-have. Just like R, R Studio is a straightforward install on Mac or Windows.
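Once both are installed, it’s worth a quick check that everything works. Here is a minimal sketch of a first session you might type into the R Studio console. It assumes nothing beyond base R; the built-in cars dataset and the variable name fit are my picks for illustration, not anything the installers require.

```r
# Confirm which version of R is running
print(R.version.string)

# 'cars' is a small dataset that ships with base R (speed vs. stopping distance)
print(head(cars))
print(summary(cars))

# Draw a scatter plot; in R Studio it should appear in the Plots pane
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Fit a simple linear regression (illustrative only) and print its coefficients
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))
```

If the plot shows up and the last line prints an intercept and a slope, your install is probably in good shape. Typing ?lm in the console is an easy way to try out the integrated help.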
While you’re at it, download copies of An Introduction to Statistical Learning with Applications in R and The Elements of Statistical Learning. Two of the best data science books on the R platform are made freely available by the authors in electronic format!
One of the first questions I confronted when setting out to learn Data Science was what platform to use. As you begin to look at books and courses, you realize that you’ll need a basic platform for working with data. Think of it as an IDE for data manipulation, statistics, and algorithms. For example, if you take Andrew Ng’s popular Machine Learning course, you’ll be doing the exercises in Octave. If you take the machine learning course on Pluralsight, you’ll be using Encog.
Data Scientists Love Them Some Python
Python is the most popular general-purpose programming language in the machine learning world. I’m not a Python guy (yet), but you can start at SciPy and go from there.
Why I Chose R
I initially started working through Andrew Ng’s course, but I wasn’t sold on spending a lot of time learning Octave. I had a Data Mining book with all the exercises in Weka, but I wasn’t loving that idea either. I kept hearing about this statistics language called R. After some investigation, I found that the R language is nothing to write home about, but R Studio and the vast collection of available packages make R a great choice.
R Studio has been great to work in. The popular Coursera Data Science specialization is essentially an extended course in R. Azure ML Studio now supports the R language. The list goes on and is growing. The folks at Kaggle show the popularity of tools used by their competitors, with R as the clear winner…
Bottom line… if you have a tool that makes sense for you, then use it. Otherwise, start with R.