Learning Data Science: Brushing Up on Math

It doesn’t take a lot of math to understand machine learning, but you will need some calculus and linear algebra. If you haven’t covered these or it’s been long enough that you need to brush up, there are great free options out there.

The excellent Khan Academy covers Linear Algebra, Differential Calculus, and more. Khan is an incredible resource, but if you are intimidated by calculus or just prefer a methodical pace that allows you to develop intuition about the concepts as you go, I highly recommend Jim Fowler’s Calculus 1 Coursera course.

Learning Data Science: Video Courses

You’ve got R and R Studio installed. Now what? If you are more of a book learner, I’ll give you some places to start in the next post; here we’ll start with the MOOC space. New content arrives all the time, so all I can give you is a snapshot as of February 2015, but I’ll provide updates occasionally.


Coursera Data Science Specialization

Put together by three professors from the Johns Hopkins University Biostatistics department, the Data Science Specialization from Coursera consists of nine 4-week courses on R and Data Science. As of this writing, I’ve taken R Programming, Getting and Cleaning Data, Statistical Inference, and Practical Machine Learning.

Pros:

  • Consistent use of R. You will become pretty proficient with R just by taking a few of these courses.
  • Nice combination of video lectures, quizzes, and practical projects.
  • Popular courses with active discussion groups during class offerings.

Cons:

  • Coverage of topics like statistical inference and machine learning is not in-depth enough to be called anything other than a survey.
  • Some of the content and assignments appear to have been rushed and not well edited.
  • The profs don’t interact on the discussion boards.

Machine Learning by Andrew Ng

Machine Learning by Stanford’s Andrew Ng was one of the earliest and most popular courses on machine learning. This course goes into enough depth for you to not just use machine learning as a black box, but to understand how and why it works. That level of understanding comes with a caveat: you’ll need to remember a bit of your college calculus and linear algebra (although Ng provides an optional section on the linear algebra you’ll need).

Statistical Learning

Statistical Learning with Trevor Hastie and Robert Tibshirani is new and in its first session. I started the course, but haven’t been able to make the time to keep up with it. However, I think this may be the best course to start with for several reasons:

  • It follows the professors’ excellent text An Introduction to Statistical Learning with Applications in R.
  • The lectures are well-produced, featuring a nice combination of seeing the professors and the supporting graphics.
  • Much more in-depth on machine learning than the Data Science Specialization.

Udacity offers some machine learning courses that look pretty good, but to access the course materials and exercises (which you need in order to really learn), you must use the paid version, which is pretty expensive. Also check out Pedro Domingos’s Machine Learning course.

Learning Data Science: Choose a Platform



One of the first questions I confronted when setting out to learn Data Science was what platform to use. As you begin to look at books and courses, you realize that you’ll need a basic platform for working with data. Think of it as an IDE for data manipulation, statistics, and algorithms. For example, if you take Andrew Ng’s popular Machine Learning course, you’ll be doing the exercises in Octave. If you take the machine learning course on Pluralsight, you’ll be using ENCOG.

Data Scientists Love Them Some Python

Python is the most popular general-purpose programming language in the machine learning world. I’m not a Python guy (yet), but you can start at SciPy and go from there.

Why I Chose R

I initially started working through Andrew Ng’s course, but I wasn’t sold on spending a lot of time learning Octave. I had a Data Mining book with all the exercises in Weka, but I wasn’t loving that idea either. I kept hearing about this statistics language called R. After some investigation, I found that the R language is nothing to write home about, but R Studio and the vast collection of available packages make R a great choice.

R Studio has been great to work in. The popular Coursera Data Science specialization is essentially an extended course in R. Azure ML Studio now supports the R language. The list goes on and is growing. The folks at Kaggle show the popularity of tools used by their competitors, with R as the clear winner…

[Figure: Kaggle Tools]

Bottom line… if you have a tool that makes sense for you, then use it. Otherwise, start with R.