Introduction to Data Science using R

Programming with R

Brief history of R

  • Until the mid-70s, much of statistical computing work was being done using Fortran.
  • In 1975-76, the S programming language was developed at Bell Labs to provide an alternative, more interactive approach to statistical computing.
  • R was developed as an open-source reimplementation of S, with a first beta release in 2000. R is currently at its third major version.

Why R

  • R is written by statisticians, for statisticians.
  • R has been widely adopted by the statistics community.
  • R offers a wide range of statistical techniques, largely thanks to enthusiastic contributions from its user community. These include linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. It also has excellent tools for data acquisition, extraction, manipulation, and visualization.

Data Analytic Task

We are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets of their daily inflammation. The data sets are stored in comma-separated values (CSV) format: each row holds information for a single patient, and the columns represent successive days. The first few rows of our first file look like this:

0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1

We want to:

  • load that data into memory,
  • calculate the average inflammation per day across all patients, and
  • plot the result.
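
The three steps above can be sketched in a few lines of base R. This is a minimal sketch, not a definitive implementation: it uses only the five rows shown earlier, written to a temporary file as a stand-in for the real data file, whose name is not given here. The variable names `rows`, `csv_path`, `dat`, and `avg_day_inflammation` are illustrative assumptions.

```r
# The five sample rows shown above, used as a stand-in for the full data set.
rows <- c(
  "0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0",
  "0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1",
  "0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1",
  "0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1",
  "0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1"
)

# Assumption: a temporary file stands in for the real CSV file on disk.
csv_path <- tempfile(fileext = ".csv")
writeLines(rows, csv_path)

# Step 1: load the data. The file has no header row, so header = FALSE
# prevents the first patient from being read as column names.
dat <- read.csv(file = csv_path, header = FALSE)

# Step 2: average inflammation per day. Each column is one day,
# so colMeans() gives one mean per day, taken across all patients.
avg_day_inflammation <- colMeans(dat)

# Step 3: plot the daily averages against the day index.
plot(avg_day_inflammation, type = "l",
     xlab = "Day", ylab = "Average inflammation")
```

With a real data file, only `csv_path` would need to change. `colMeans()` is used here because patients are rows and days are columns; `rowMeans()` would instead give one average per patient.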