# Introduction to Data Science using R

## Faster Looping

## Learning objectives

- Compare loops and vectorized operations
- Use the apply family of functions

In R you have multiple options when repeating calculations: vectorized operations, `for`

loops, and `apply`

functions.

This lesson is an extension of Analyzing Multiple Data Sets. In that lesson, we introduced how to run a custom function, `analyze`

, over multiple data files:

```
setwd("/home/lngo/intro-data-science/")
analyze <- function(filename) {
# Plots the average, min, and max inflammation over time.
# Input is character string of a csv file.
dat <- read.csv(file = filename, header = FALSE)
avg_day_inflammation <- apply(dat, 2, mean)
plot(avg_day_inflammation)
max_day_inflammation <- apply(dat, 2, max)
plot(max_day_inflammation)
min_day_inflammation <- apply(dat, 2, min)
plot(min_day_inflammation)
}
filenames <- list.files(path = "data", pattern = "inflammation.*.csv", full.names = TRUE)
```

### Vectorized Operations

A key difference between R and many other languages is a topic known as vectorization. When you wrote the `total`

function, we mentioned that R already has `sum`

to do this; `sum`

is *much* faster than the interpreted `for`

loop because `sum`

is coded in C to work with a vector of numbers. Many of R’s functions work this way; the loop is hidden from you in C. Learning to use vectorized operations is a key skill in R.

For example, to add pairs of numbers contained in two vectors

```
a <- 1:10
b <- 1:10
```

You could loop over the pairs adding each in turn, but that would be very inefficient in R.

```
# Instead of using `i in a` to make our loop variable, we use
# the function `seq_along` to generate indices for each element `a` contains.
res <- numeric(length = length(a))
for (i in seq_along(a)) {
res[i] <- a[i] + b[i]
}
res
```

Instead, `+`

is a *vectorized* function which can operate on entire vectors at once

```
res2 <- a + b
all.equal(res, res2)
```

### Vector Recycling

When performing vector operations in R, it is important to know about recycling. If you perform an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:

```
a <- 1:10
b <- 1:5
a + b
```

The elements of `a`

and `b`

are added together starting from the first element of both vectors. When R reaches the end of the shorter vector `b`

, it starts again at the first element of `b`

and continues until it reaches the last element of the longest vector `a`

. This behaviour may seem crazy at first glance, but it is very useful when you want to perform the same operation on every element of a vector. For example, say we want to multiply every element of our vector `a`

by 5:

```
a <- 1:10
b <- 5
a * b
```

Remember there are no scalars in R, so `b`

is actually a vector of length 1; in order to add its value to every element of `a`

, it is *recycled* to match the length of `a`

.

When the length of the longer object is a multiple of the shorter object length (as in our example above), the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:

```
a <- 1:10
b <- 1:7
a + b
```

`for`

or `apply`

?

A `for`

loop is used to apply the same function calls to a collection of objects. R has a family of functions, the `apply`

family, which can be used in much the same way. You’ve already used one of the family, `apply`

in the first lesson. The `apply`

family members include

`apply`

- apply over the margins of an array (e.g. the rows or columns of a matrix)`lapply`

- apply over an object and return list`sapply`

- apply over an object and return a simplified object (an array) if possible`vapply`

- similar to`sapply`

but you specify the type of object returned by the iterations

Each of these has an argument `FUN`

which takes a function to apply to each element of the object. Instead of looping over `filenames`

and calling `analyze`

, as you did earlier, you could `sapply`

over `filenames`

with `FUN = analyze`

:

`sapply(filenames, FUN = analyze)`

Deciding whether to use `for`

or one of the `apply`

family is really personal preference. Using an `apply`

family function forces to you encapsulate your operations as a function rather than separate calls with `for`

. `for`

loops are often more natural in some circumstances; for several related operations, a `for`

loop will avoid you having to pass in a lot of extra arguments to your function.

### Loops in R Are Slow

No, they are not! *If* you follow some golden rules:

- Don’t use a loop when a vectorized alternative exists
- Don’t grow objects (via
`c`

,`cbind`

, etc) during the loop - R has to create a new object and copy across the information just to add a new element or row/column - Allocate an object to hold the results and fill it in during the loop

As an example, we’ll create a new version of `analyze`

that will return the mean inflammation per day (column) of each file.

```
analyze2 <- function(filenames) {
for (f in seq_along(filenames)) {
fdata <- read.csv(filenames[f], header = FALSE)
res <- apply(fdata, 2, mean)
if (f == 1) {
out <- res
} else {
# The loop is slowed by this call to cbind that grows the object
out <- cbind(out, res)
}
}
return(out)
}
system.time(avg2 <- analyze2(filenames))
```

Note how we add a new column to `out`

at each iteration? This is a cardinal sin of writing a `for`

loop in R.

Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results. Then we loop over the files but this time we fill in the `f`

th column of our results matrix `out`

. This time there is no copying/growing for R to deal with.

```
analyze3 <- function(filenames) {
out <- matrix(ncol = length(filenames), nrow = 40) ## assuming 40 here from files
for (f in seq_along(filenames)) {
fdata <- read.csv(filenames[f], header = FALSE)
out[, f] <- apply(fdata, 2, mean)
}
return(out)
}
system.time(avg3 <- analyze3(filenames))
```

In this simple example there is little difference in the compute time of `analyze2`

and `analyze3`

. This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations. If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.

Note that `apply`

handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to `apply`

. At its heart, `apply`

is just a `for`

loop with extra convenience.