Training Machine Learning model using Regression Method#

More examples of prediction#

The caret package supports a large number of prediction methods; see the list here. Let’s work through a few examples.

For Continuous output#

Linear regression with one predictor#

Pre-process the data and create a partition

library(caret)
data(airquality)

set.seed(123)
# Impute missing values using bagged trees
PreImputeBag <- preProcess(airquality,method="bagImpute")
airquality_imp <- predict(PreImputeBag,airquality)

indT <- createDataPartition(y=airquality_imp$Ozone,p=0.6,list=FALSE)
training <- airquality_imp[indT,]
testing  <- airquality_imp[-indT,]
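
To see why imputation comes before partitioning, it helps to count the missing values first. The following is a minimal base-R sketch (no caret needed); the variable names na_counts and n_complete are illustrative and not part of the original lesson.

```r
# Base-R check on the built-in airquality data
data(airquality)

# Number of missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Rows that would remain if incomplete rows were simply dropped
n_complete <- sum(complete.cases(airquality))
cat("Complete rows:", n_complete, "of", nrow(airquality), "\n")
```

Dropping incomplete rows would discard a sizeable share of the data, which is one reason to impute instead.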

Now, let’s build a model using a single predictor. Let’s predict Ozone from one variable, temperature (Temp), using a linear model (method=lm):

ModFit <- train(Ozone~Temp,data=training,
                preProcess=c("center","scale"),
                method="lm")
summary(ModFit$finalModel)
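
The preProcess=c("center","scale") option standardizes the predictors before fitting. As a rough illustration of what that means for a linear model, the base-R sketch below fits the same regression on raw and standardized Temp. It assumes complete cases only (na.omit) rather than the bagged imputation used above, and the names aq, fit_raw and fit_std are hypothetical.

```r
data(airquality)
aq <- na.omit(airquality)  # assumption: drop incomplete rows instead of imputing

# Fit on the raw predictor
fit_raw <- lm(Ozone ~ Temp, data = aq)

# Fit on the centered and scaled predictor (mean 0, standard deviation 1)
aq$Temp_std <- as.numeric(scale(aq$Temp))
fit_std <- lm(Ozone ~ Temp_std, data = aq)

# The coefficients change with the units of the predictor ...
print(coef(fit_raw))
print(coef(fit_std))

# ... but the fitted values are the same
same_fit <- isTRUE(all.equal(unname(fitted(fit_raw)), unname(fitted(fit_std))))
print(same_fit)
```

For ordinary least squares, standardizing changes how the coefficients are read (effect per standard deviation), not the predictions.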

Apply trained model to testing data set and evaluate output:

prediction <- predict(ModFit,testing)
cor.test(prediction,testing$Ozone)
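
Correlation is only one way to judge predictions on the test set; error metrics such as RMSE and MAE are often reported alongside it. The sketch below computes them by hand in base R. It assumes complete cases and a plain random 60/40 split (not caret's createDataPartition), and holdout_metrics is a hypothetical helper written for this illustration.

```r
data(airquality)
aq <- na.omit(airquality)   # assumption: complete cases only

set.seed(123)
# Simple random 60/40 split
idx   <- sample(nrow(aq), size = floor(0.6 * nrow(aq)))
train <- aq[idx, ]
test  <- aq[-idx, ]

fit  <- lm(Ozone ~ Temp, data = train)
pred <- predict(fit, newdata = test)

# Hypothetical helper: error metrics on held-out data
holdout_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,             # squared correlation
    MAE      = mean(abs(pred - obs)))        # mean absolute error
}

m <- holdout_metrics(pred, test$Ozone)
print(m)
```

caret's postResample(pred, obs) reports comparable quantities for a fitted model.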

Linear regression with multiple predictors (Multiple Linear Regression)#

Now, let’s predict Ozone from three predictors: solar radiation, wind, and temperature.

modFit2 <- train(Ozone~Solar.R+Wind+Temp,data=training,
                 preProcess=c("center","scale"),
                 method="lm")
summary(modFit2$finalModel)

prediction2 <- predict(modFit2,testing)
cor.test(prediction2,testing$Ozone)

The correlation between predicted and observed Ozone on the test set is higher than with the single-predictor model, so adding predictors improved the fit.

Train model using Stepwise Linear Regression#

Stepwise regression adds or removes covariates one step at a time to find the subset that best explains the dependent variable. Here the Akaike information criterion (AIC) is the selection criterion:

modFit_SLR <- train(Ozone~Solar.R+Wind+Temp,data=training,method="lmStepAIC")
summary(modFit_SLR$finalModel)
prediction_SLR <- predict(modFit_SLR,testing)
cor.test(prediction_SLR,testing$Ozone)
postResample(prediction_SLR,testing$Ozone)
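
To see the AIC-driven search without caret, base R's step() function can be applied directly to an lm fit. This is a sketch of the same idea, not a reproduction of what method="lmStepAIC" does internally; it assumes complete cases, and the names full and sel are illustrative.

```r
data(airquality)
aq <- na.omit(airquality)   # assumption: complete cases only

# Start from the full three-predictor model
full <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)

# Stepwise search in both directions, scored by AIC; trace = 0 silences the log
sel <- step(full, direction = "both", trace = 0)

# Formula of the selected model and the AIC of both models
print(formula(sel))
print(c(full = AIC(full), selected = AIC(sel)))
```

A lower AIC is better, so the selected model never has a higher AIC than the starting model.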


Train model using Polynomial Regression#


In this example, let’s use polynomial regression of degree 3:

modFit_poly <- train(Ozone~poly(Solar.R,3)+poly(Wind,3)+poly(Temp,3),data=training,
                     preProcess=c("center","scale"),
                     method="lm")
summary(modFit_poly$finalModel)
prediction_poly <- predict(modFit_poly,testing)
cor.test(prediction_poly,testing$Ozone)
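
The poly(x, 3) terms in the formula expand each predictor into three polynomial columns. By default these are orthogonal polynomials rather than the raw powers x, x^2, x^3. The base-R sketch below, which assumes complete cases and uses Wind only, shows that both bases give the same fitted curve.

```r
data(airquality)
aq <- na.omit(airquality)   # assumption: complete cases only

# Orthogonal polynomial basis of degree 3: one column per degree
P <- poly(aq$Wind, 3)
print(dim(P))

# The columns are orthonormal, so their cross-product is the identity matrix
print(round(crossprod(unclass(P)), 10))

# Same cubic fit with the orthogonal and the raw basis
fit_orth <- lm(Ozone ~ poly(Wind, 3), data = aq)
fit_raw  <- lm(Ozone ~ poly(Wind, 3, raw = TRUE), data = aq)

same_fit <- isTRUE(all.equal(unname(fitted(fit_orth)),
                             unname(fitted(fit_raw)),
                             tolerance = 1e-6))
print(same_fit)
```

The orthogonal basis is numerically better behaved, while raw = TRUE gives coefficients that map directly onto powers of the predictor.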

Train model using Principal Component Regression#

Principal Component Regression is a combination of linear regression and principal component analysis; it is particularly useful when the predictors are highly correlated.

install.packages("pls")
library(pls)
modFit_PCR <- train(Ozone~Solar.R+Wind+Temp,data=training,method="pcr")
summary(modFit_PCR$finalModel)
prediction_PCR <- predict(modFit_PCR,testing)
cor.test(prediction_PCR,testing$Ozone)
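
To make the two stages of principal component regression visible, the sketch below does them by hand in base R: prcomp() for the components, then lm() on the component scores. It assumes complete cases and standardized predictors, which may differ from the defaults used by method="pcr"; all object names are illustrative.

```r
data(airquality)
aq <- na.omit(airquality)   # assumption: complete cases only
X  <- aq[, c("Solar.R", "Wind", "Temp")]

# Stage 1: principal components of the standardized predictors
pca <- prcomp(X, center = TRUE, scale. = TRUE)
print(summary(pca)$importance)   # variance explained by each component

# Stage 2: regress the outcome on the component scores
scores <- as.data.frame(pca$x)   # columns PC1, PC2, PC3
scores$Ozone <- aq$Ozone
pcr_2 <- lm(Ozone ~ PC1 + PC2, data = scores)        # reduced model
pcr_3 <- lm(Ozone ~ PC1 + PC2 + PC3, data = scores)  # all components

# With all components kept, PCR reproduces ordinary least squares
ols <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)
same_fit <- isTRUE(all.equal(unname(fitted(pcr_3)), unname(fitted(ols))))
print(same_fit)
```

The benefit of PCR comes from keeping fewer components than predictors; the scores are uncorrelated, so correlated predictors no longer destabilize the coefficients.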

For categorical output#

Train model using Logistic Regression#

  • Logistic regression is a common method for binary classification problems (when outcomes fall into two categories).

  • Typical binary classification: True/False, Yes/No, Pass/Fail

  • Unlike linear regression, the prediction for the output is transformed using a non-linear function called the logistic function.

  • The standard logistic function is:

$$f(x) = \frac{1}{1 + e^{-x}}$$
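
A quick way to get a feel for the logistic function is to evaluate it at a few points. In this sketch, logistic is a hypothetical helper defined for illustration; base R already provides the same function as plogis().

```r
# Standard logistic function: maps any real number into (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

x <- c(-5, -1, 0, 1, 5)
p <- logistic(x)
print(round(p, 4))

# Base R's plogis() computes the same quantity
matches_plogis <- isTRUE(all.equal(p, plogis(x)))
print(matches_plogis)
```

Large negative inputs map close to 0, large positive inputs close to 1, and an input of 0 maps to exactly 0.5, which is why the output can be read as a class probability.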

In this example, we use the spam data set from the kernlab package. This data set, collected at Hewlett-Packard Labs, classifies 4601 e-mails as spam or non-spam. In addition to the class label, there are 57 variables indicating the frequency of certain words and characters in each e-mail. More information on this data set can be found here.

Train the model:

install.packages("kernlab")
library(kernlab)
data(spam)
names(spam)

indTrain <- createDataPartition(y=spam$type,p=0.6,list = FALSE)
training <- spam[indTrain,]
testing  <- spam[-indTrain,]

ModFit_glm <- train(type~.,data=training,method="glm")
summary(ModFit_glm$finalModel)

Predict based on testing data and evaluate model output:

predictions <- predict(ModFit_glm,testing)
confusionMatrix(predictions, testing$type)
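
The main numbers in a confusion matrix report can be computed by hand from a 2x2 table. The sketch below uses a tiny made-up set of labels (not the spam data) and treats "spam" as the positive class; both choices are assumptions for illustration.

```r
lv <- c("nonspam", "spam")

# Toy labels, invented for this example
actual    <- factor(c("spam", "spam", "spam", "spam", "nonspam",
                      "nonspam", "nonspam", "nonspam", "nonspam", "nonspam"),
                    levels = lv)
predicted <- factor(c("spam", "spam", "spam", "nonspam", "nonspam",
                      "nonspam", "nonspam", "nonspam", "spam", "nonspam"),
                    levels = lv)

# Rows are predictions, columns are true classes
tab <- table(Predicted = predicted, Actual = actual)
print(tab)

TP <- tab["spam", "spam"]        # spam correctly flagged
FN <- tab["nonspam", "spam"]     # spam missed
TN <- tab["nonspam", "nonspam"]  # non-spam correctly passed
FP <- tab["spam", "nonspam"]     # non-spam wrongly flagged

accuracy    <- (TP + TN) / sum(tab)
sensitivity <- TP / (TP + FN)    # true positive rate
specificity <- TN / (TN + FP)    # true negative rate
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

When reading confusionMatrix output, check which class is reported as the positive class, since sensitivity and specificity swap if it changes.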

Plotting ROC and computing AUC:

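# Requires the ROCR package
install.packages("ROCR")
library(ROCR)

# Class probabilities for the test set (one column per class)
pred_prob <- predict(ModFit_glm, testing, type = "prob")
head(pred_prob)

# Probability of "spam" and a 0/1 label (1 = spam)
data_roc <- data.frame(pred_prob    = pred_prob[, "spam"],
                       actual_label = ifelse(testing$type == "spam", 1, 0))

# Build the ROCR prediction object
roc <- prediction(predictions = data_roc$pred_prob,
                  labels      = data_roc$actual_label)

# ROC curve with the chance diagonal as a dashed line
plot(performance(roc, "tpr", "fpr"))
abline(0, 1, lty = 2)

# Area under the curve; the value is stored in the y.values slot
auc <- performance(roc, measure = "auc")
auc@y.values[[1]]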
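
AUC has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base-R sketch below checks this on a small invented set of scores and labels, computing AUC both by pair counting and by the equivalent rank formula. All names and values are illustrative.

```r
# Toy scores and 0/1 labels, invented for this example (no tied scores)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Pair counting: share of (positive, negative) pairs ranked correctly
auc_pairs <- mean(outer(pos, neg, ">"))

# Rank formula (Mann-Whitney U statistic divided by the number of pairs)
r <- rank(scores)
auc_rank <- (sum(r[labels == 1]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

print(c(pairs = auc_pairs, rank = auc_rank))
```

An AUC of 0.5 corresponds to the dashed diagonal in the ROC plot (random guessing), and 1 to a perfect ranking.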
