Identify the Successful Completion of Weight-Lifting Exercises
Human Activity Recognition (HAR) data has become ubiquitous with the advent of devices like the Jawbone Up, Nike FuelBand, Fitbit and even smartphones. Although users of these devices tend to quantify how much of an activity they do, they rarely consider “how well” they perform it.
Results
Using RStudio, we import HAR data for weight-lifting exercises provided by a Groupware@LES study to see whether we can predict how well exercises were performed by users of HAR devices. After some initial data wrangling, we analyze the data, visualize the results, and develop a machine learning model. This deep-dive analysis walks through the typical steps for this type of machine learning project.
Conclusion
RStudio is an excellent platform for this type of analytical project and helps analysts tell a story with data. In combination with RMarkdown, RStudio readily facilitates analysis, sharing, and reproducibility. Analysts can weave narrative text and code together to seamlessly produce elegantly formatted documents in a variety of output formats (e.g. HTML, PDF, MS Word).
Implementation
With devices like the Jawbone Up, Nike FuelBand and Fitbit, it is feasible to inexpensively collect and analyze a large quantity of HAR data. The Groupware@LES HAR data covers six study participants who were asked to perform dumbbell exercises with correct form (Class A) and with the most common mistakes: throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Using this class-labeled HAR data, we leverage machine learning algorithms to develop a model that distinguishes participants who exercised correctly from those who did not, and identifies which mistakes they made. We then use this model to assess how accurately we can classify the exercises performed by other individuals based solely on their HAR data.
Technical Notes: This analysis is statically generated with RStudio 3.3.2 using RMarkdown 1.3 for reproducibility. The complete codebase is available on my GitHub page. Many code blocks are cached to speed up iterative analysis. When updating a code block, pay special attention to subsequent cached blocks that may need to be refreshed as a result. To learn more, have a peek at the RStudio/RMarkdown documentation and cheatsheets.
Data Processing
Getting the Data
Download the training and test data, if not already in the project data folder.
setwd("/home/rstudio/code")
trnFile = "data/pml-har-trn.csv"
tstFile = "data/pml-har-tst.csv"
if (!file.exists(trnFile)) {
  trnfileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  download.file(trnfileUrl, destfile = trnFile, method = "curl")
}
if (!file.exists(tstFile)) {
  tstfileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  download.file(tstfileUrl, destfile = tstFile, method = "curl")
}
Data Wrangling and Preparation
The goal of data preparation for machine learning is to identify the variables that explain as much of the class variance as possible. In other words, we want the smallest set of variables that still captures what is going on. To improve model performance and simplify our analysis, we discard the variables not related to exercise performance.
# Load libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(knitr)
library(corrplot)
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(ggplot2)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Define number of parallel processes
registerDoMC(cores = 8)
# Load the training file
trn.dat = read.csv(trnFile, na.strings=c("NA","NaN", "", " "), stringsAsFactors=FALSE); rm("trnFile")
# Quick exploratory analysis
dim.init <- dim(trn.dat); dim.init
## [1] 19622 160
str(trn.dat)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : chr "carlitos" "carlitos" "carlitos" "carlitos" ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : chr "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
## $ new_window : chr "no" "no" "no" "no" ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : chr NA NA NA NA ...
## $ kurtosis_picth_belt : chr NA NA NA NA ...
## $ kurtosis_yaw_belt : chr NA NA NA NA ...
## $ skewness_roll_belt : chr NA NA NA NA ...
## $ skewness_roll_belt.1 : chr NA NA NA NA ...
## $ skewness_yaw_belt : chr NA NA NA NA ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : chr NA NA NA NA ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : chr NA NA NA NA ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : chr NA NA NA NA ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : chr NA NA NA NA ...
## $ kurtosis_picth_arm : chr NA NA NA NA ...
## $ kurtosis_yaw_arm : chr NA NA NA NA ...
## $ skewness_roll_arm : chr NA NA NA NA ...
## $ skewness_pitch_arm : chr NA NA NA NA ...
## $ skewness_yaw_arm : chr NA NA NA NA ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : chr NA NA NA NA ...
## $ kurtosis_picth_dumbbell : chr NA NA NA NA ...
## $ kurtosis_yaw_dumbbell : chr NA NA NA NA ...
## $ skewness_roll_dumbbell : chr NA NA NA NA ...
## $ skewness_pitch_dumbbell : chr NA NA NA NA ...
## $ skewness_yaw_dumbbell : chr NA NA NA NA ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : chr NA NA NA NA ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : chr NA NA NA NA ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
## Removing unnecessary covariates
# Columns not related to exercise performance.
col.nrel <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
"cvtd_timestamp", "new_window", "num_window")
# Columns with near zero variance
nsv <- nearZeroVar(trn.dat, saveMetrics = TRUE)
col.nzvs <- rownames(nsv[nsv$nzv == TRUE, ])
# Update the training set with the remaining fields
col.drops <- c(col.nrel, col.nzvs)
trn.dat.all <- trn.dat[,!(names(trn.dat) %in% col.drops)]
# Identify fields with large quantity of missing values.
col.names = colnames(trn.dat.all)
cnt.min = sum(complete.cases(trn.dat.all))
trn.cnt.df <- data.frame(Field=character(), FieldCnt=integer(), stringsAsFactors=FALSE)
# Create data-frame with field names and counts.
for (fld in 1:length(col.names)) {
trn.cnt.df[fld,] <- c(col.names[fld], sum(!is.na(trn.dat.all[[col.names[fld]]])))
}
# Filter out the fields with a high quantity of missing values
# (keep only the fields that are populated for every observation).
trn.cnt.df$FieldCnt <- as.numeric(trn.cnt.df$FieldCnt)
col.keep <- as.vector(subset(trn.cnt.df, FieldCnt == nrow(trn.dat.all))$Field)
# Update the training set with the remaining fields
trn.dat.all <- trn.dat.all[col.keep]
## Cleanup and default model settings
# Set the classe variable as a factor.
trn.dat.all$classe <- as.factor(trn.dat.all$classe)
dim.finl <- dim(trn.dat.all); dim.finl
## [1] 19622 53
# house-cleaning
rm("cnt.min", "col.drops", "col.keep", "col.names", "col.nrel", "col.nzvs", "fld", "trn.cnt.df", "nsv","trn.dat")
# Cutoff parameters
corrRt = 0.80 # Correlation Cutoff
dSplit = 0.60 # Training Cutoff
## Set model parameters for model tuning and training
cntFld = 10 # Number of cross-validation folds
cntRpt = 8 # Increase the Repeat count for improved cross-validation accuracy
cntTun = 5 # Parameter for tuning accuracy vs resource consumption
Our exploratory analysis shows that the Groupware@LES file is just shy of twenty thousand records and that some of the variables have near-zero variance or consist largely of missing values. After discarding them, our variable count drops from 160 down to 53. The class variable in the provided data is relatively balanced, which simplifies this analysis somewhat.
kable(t(table(trn.dat.all$classe)), caption = "Class Variable Frequency:")
A | B | C | D | E |
---|---|---|---|---|
5580 | 3797 | 3422 | 3216 | 3607 |
Handling Multicollinearity
Multicollinearity occurs when model variables are highly correlated not only with the response variable but also with each other. This inflates the standard errors, which can make variables appear statistically insignificant when they should be significant. By calculating the Variance Inflation Factor (VIF), we can quantify the severity of this effect. We implemented a stepwise VIF function that removes multicollinear variables until all remaining variables fall below a specified threshold. A VIF greater than 10 is generally considered highly collinear; the threshold may be set to the tolerance required for your analysis.
We can reduce the number of predictors to a smaller set of uncorrelated components using Partial Least Squares Regression (PLS) or Principal Components Analysis (PCA), or simply remove them from the model. Removing predictors may introduce some bias (the difference between the expected prediction and the truth), but may also reduce the prediction variance (increase accuracy). The goal is to optimize the trade-off between bias and variance. Either option reduces our variable count and speeds up model calculation.
The goal of PCA is to explain the maximum amount of variance with the fewest number of principal components. PCA creates these components by transforming the variables into a smaller sub-space of variables (dimensionality reduction) that are uncorrelated with each other. A challenge with the components is that it can be difficult to explain what is driving model performance. Using the Caret R package, we apply a Box-Cox Transformation to correct for skewness, center and scale each variable and then apply PCA in one call to pre-process the variables.
We are going to approach this three different ways to get a sense of effort and impact:
- remove the multicollinear predictors,
- apply PCA to the multicollinear predictors, and
- apply PCA to all the predictors.
# Compile stepwise VIF selection function for reducing collinearity among explanatory variables
vif_func <- function(in_frame, thresh = 10, trace = TRUE, ...) {
  require(fmsb)
  if (!is.data.frame(in_frame)) in_frame <- data.frame(in_frame)
  # Get initial VIF value for all comparisons of variables
  vif_init <- NULL
  var_names <- names(in_frame)
  for (val in var_names) {
    regressors <- var_names[-which(var_names == val)]
    form <- paste(regressors, collapse = '+')
    form_in <- formula(paste(val, '~', form))
    vif_init <- rbind(vif_init, c(val, VIF(lm(form_in, data = in_frame, ...))))
  }
  vif_max <- max(as.numeric(vif_init[, 2]))
  if (vif_max < thresh) {
    if (trace) { # print output of each iteration
      prmatrix(vif_init, collab = c('var', 'vif'), rowlab = rep('', nrow(vif_init)), quote = FALSE)
      cat('\n')
      cat(paste('All variables have VIF < ', thresh, ', max VIF ', round(vif_max, 2), sep = ''), '\n\n')
    }
    return(var_names)
  } else {
    in_dat <- in_frame
    # Backwards selection of explanatory variables; stops when all VIF values are below 'thresh'
    while (vif_max >= thresh) {
      vif_vals <- NULL
      var_names <- names(in_dat)
      for (val in var_names) {
        regressors <- var_names[-which(var_names == val)]
        form <- paste(regressors, collapse = '+')
        form_in <- formula(paste(val, '~', form))
        vif_add <- VIF(lm(form_in, data = in_dat, ...))
        vif_vals <- rbind(vif_vals, c(val, vif_add))
      }
      max_row <- which(vif_vals[, 2] == max(as.numeric(vif_vals[, 2])))[1]
      vif_max <- as.numeric(vif_vals[max_row, 2])
      if (vif_max < thresh) break
      if (trace) { # print output of each iteration
        prmatrix(vif_vals, collab = c('var', 'vif'), rowlab = rep('', nrow(vif_vals)), quote = FALSE)
        cat('\n')
        cat('removed: ', vif_vals[max_row, 1], vif_max, '\n\n')
        flush.console()
      }
      in_dat <- in_dat[, !names(in_dat) %in% vif_vals[max_row, 1]]
    }
    return(names(in_dat))
  }
}
Select the tabs below to switch between the plots, which highlight multicollinearity across variables in each approach. Review the summary below each plot to see the change in variable count and understand the impact on multicollinearity overall.
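The code behind those tabbed plots is not reproduced here. As a rough sketch of how the candidate datasets might be assembled (the object names pred.cols, col.vif, trn.dat.vif and prep.pca are assumptions; only trn.dat.pca is referenced in the modeling code below):
# Rough sketch (assumed object names); not the original tabbed code.
pred.cols <- setdiff(names(trn.dat.all), "classe")
# Correlation matrix used for the correlation plots
corrMatx <- cor(trn.dat.all[, pred.cols])
# Approach 1: drop the multicollinear predictors flagged by the stepwise VIF function
col.vif     <- vif_func(in_frame = trn.dat.all[, pred.cols], thresh = 10, trace = FALSE)
trn.dat.vif <- trn.dat.all[, c(col.vif, "classe")]
# Approach 2 (PCA applied only to the multicollinear predictors) follows the same pattern and is omitted here.
# Approach 3: Box-Cox, center, scale, and PCA applied to all predictors in a single preProcess call
prep.pca    <- preProcess(trn.dat.all[, pred.cols], method = c("BoxCox", "center", "scale", "pca"))
trn.dat.pca <- data.frame(predict(prep.pca, trn.dat.all[, pred.cols]), classe = trn.dat.all$classe)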
Machine Learning
Tuning Preparation
We are using a 60% data split to define our training set. To provide a meaningful estimation of performance, the remaining 40% will be divided equally to define our validation and test sets. The individual models are tuned with the training set, and assessed against the validation set. The test set is for estimating the ensemble model performance.
With the Caret R package, we use repeated k-fold cross-validation with 10 folds across 8 repetitions to improve our accuracy estimates. We preprocess the data by centering and scaling each predictor to better expose the underlying structure and relationships. We pre-calculate a vector of seeds for reproducibility across multiple parallel model runs.
# Select the PCA dataset for training.
trn.dat <- trn.dat.pca
# Partition the data with training, validation, and test sets.
set.seed(1732)
inTrain <- createDataPartition(y=trn.dat$classe, p=dSplit, list=FALSE)
trn.trn <- trn.dat[ inTrain,]
trn.vld <- trn.dat[-inTrain,]
inValid <- createDataPartition(y=trn.vld$classe, p=0.50, list=FALSE)
trn.tst <- trn.vld[-inValid,]
trn.vld <- trn.vld[ inValid,]
# Setup seeds for running fully reproducible model in parallel mode
ivect = cntRpt*cntFld # quantity of integer vectors
seeds <- vector(mode = "list", length = ivect+1) # length equals the integer vectors plus one
for(i in 1:ivect) seeds[[i]] <- sample.int(n=1000, cntTun*3) # seeds for tuning attempts (*3 for Radial SVM)
seeds[[length(seeds)]] <- sample.int(1000, 1) # seed for the final model
# Create cross-validation folds to use for model tuning
myMetr <- "Kappa" # ROC, Kappa, Accuracy
myFlds <- createMultiFolds(y=trn.trn$classe, k = cntFld, times = cntRpt)
myCtrl <- trainControl(method = "repeatedCV", seeds = seeds, allowParallel = TRUE,
#savePredictions=TRUE, classProbs=TRUE, summaryFunction=multiClassSummary,
number = cntFld, repeats = cntRpt)
# Standardize the training subset.
myPrep = c("center", "scale") # pre-process with center and scaling
# house-cleaning
rm("i", "cntFld", "cntRpt", "inTrain", "corrMatx", "vif_func")
Grid Searching
There are two ways to tune an algorithm using the Caret R package. The first is to let the system do it automatically by setting the tuneLength parameter, which indicates the number of different values to try for each algorithm parameter. This makes a crude guess at what values to try, but can often get you within a reasonable range; it is the approach we take on our first tuning pass. The second approach involves manually setting the tuning grid to search, which is what our second tuning pass uses. We start with the best performing parameters from the first pass and create a more focused tuning grid to further refine them.
We have selected three models to tune:
- Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel,
- Random Forest, and
- Stochastic Gradient Boosting Machine (GBM).
To improve SVM performance we doubled its grid-search area during its first pass. Also, since the GBM algorithm is computationally expensive, we did significant manual tuning during its second pass to attain reasonable performance. For good measure, we created an Ensemble (Stack) model using the three separate algorithms to improve overall performance.
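The tuning calls themselves are not reproduced in this section. As a minimal sketch of the two-pass approach for the radial SVM, assuming the second-pass object is named tune.svr2 (as referenced later) and that its grid is centered on the best first-pass parameters:
# Minimal sketch of the two-pass SVM tuning (assumed names; grid multipliers are illustrative).
set.seed(1732)
# Pass 1: let caret pick a coarse grid; the SVM search area is doubled via tuneLength.
tune.svr1 <- train(classe ~ ., data = trn.trn, method = "svmRadial",
                   metric = myMetr, trControl = myCtrl, preProcess = myPrep,
                   tuneLength = cntTun * 2)
# Pass 2: search a manual grid focused around the best parameters from pass 1.
svmGrid <- expand.grid(sigma = tune.svr1$bestTune$sigma * c(0.75, 1.00, 1.25),
                       C     = tune.svr1$bestTune$C     * c(0.50, 1.00, 2.00))
set.seed(1732)
tune.svr2 <- train(classe ~ ., data = trn.trn, method = "svmRadial",
                   metric = myMetr, trControl = myCtrl, preProcess = myPrep,
                   tuneGrid = svmGrid)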
Technical Notes: Since this process can be very resource intensive, we use OpenBLAS, an optimized replacement for the standard BLAS library used by R. In addition, we use the doMC library to set the number of cores, which corresponds to the maximum number of R jobs to run in parallel. Depending on your tuning parameters and training set size, this process can easily consume all of your computer's memory and may take hours to run. Running the machine learning on an H2O.ai cluster helps to alleviate these challenges. For more details, select the “Distributed Random Forest” tab below.
Tuning Model Accuracy
After tuning these models we can identify the parameters that generate the best performance per model and compare how well the models perform against each other.
Fitting the Models
Now that we have the tuned parameters, we select and fit the final models; this is fast compared to the tuning steps. The code below extracts the training error estimates and reduces the random forest's tree count based on OOB error, and a sketch of the final fitting calls follows.
# Retrieve the training error for the SVM model
err.svr <- tune.svr2$finalModel@error; err.svr
## [1] 0.003226902
# Retrieve the OOB error for the Random Forest model based on number of trees used
ntr <- tune.rft2$finalModel$ntree; ntr
## [1] 500
err.rft <- tune.rft2$finalModel$err.rate[ntr,1]; err.rft
## OOB
## 0.0303159
# Identify the lowest error rate by number of trees
treerr.df <- as.data.frame(tune.rft2$finalModel$err.rate[,1])
treerr.df$trees <- row(treerr.df)
colnames(treerr.df) <- c("err","trees")
# Reduce the number of trees for the Random Forest based on error rate
ntr = as.vector(head(treerr.df[with(treerr.df, order(err)), ], 1)$trees); ntr
## [1] 374
err.rft <- tune.rft2$finalModel$err.rate[ntr, 1]; err.rft
## OOB
## 0.03014606
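The final fitting calls are not shown above. As a minimal sketch, assuming each final model is refit on the training set with its bestTune parameters (tune.gbm2 is an assumed object name, by analogy with tune.svr2 and tune.rft2), and using the reduced tree count for the random forest:
# Minimal sketch of the final fits (assumed names); each model reuses its tuned parameters.
set.seed(1732)
fit.svr <- train(classe ~ ., data = trn.trn, method = "svmRadial", preProcess = myPrep,
                 metric = myMetr, trControl = myCtrl, tuneGrid = tune.svr2$bestTune)
fit.rft <- train(classe ~ ., data = trn.trn, method = "rf", ntree = ntr, preProcess = myPrep,
                 metric = myMetr, trControl = myCtrl, tuneGrid = tune.rft2$bestTune)
fit.gbm <- train(classe ~ ., data = trn.trn, method = "gbm", verbose = FALSE, preProcess = myPrep,
                 metric = myMetr, trControl = myCtrl, tuneGrid = tune.gbm2$bestTune)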
Separate Model Analysis & Validation
During this step we run the tuned models against our validation set to assess performance. The performance statistics indicate reasonable accuracy and suggest that we have not overfit the models.
# Assess the separate model accuracy on the validation set.
yhat.svr <- predict(fit.svr, trn.vld)
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
yhat.rft <- predict(fit.rft, trn.vld)
yhat.gbm <- predict(fit.gbm, trn.vld)
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loaded gbm 2.1.1
## Loading required package: plyr
cm.svr <- confusionMatrix(yhat.svr, trn.vld$classe)
cm.rft <- confusionMatrix(yhat.rft, trn.vld$classe)
cm.gbm <- confusionMatrix(yhat.gbm, trn.vld$classe)
Ensemble Model Analysis & Comparison
Although the separate models perform well individually, we built this ensemble model to highlight the process and the potential benefit to predictive performance. Different algorithms will find some variables more predictive than others, and will rank them differently. This shows up in the variables of importance (see below). To get additional lift, it is important that the models being stacked have varying variables of importance, because the ensemble model will capitalize on those differences to boost predictive performance.
After applying the separate models to our validation set, we use their predictions as “modelled predictors” to train the ensemble model. We then assess the ensemble's performance against our test set. This “stacked” model should have improved predictive performance over the separate models.
set.seed(1732)
# Dataframe with predictions of the separate models, combined with the classe variable (from the validation set)
# in preparation for training the ensemble model against the "validation" data.
pred.vld <- data.frame(yhat.svr, yhat.rft, yhat.gbm, classe=trn.vld$classe)
# Train the ensemble model on the "stacked" predictors
# Note: since it was the fastest of the trio, the SVM algorithm is used to build the ensemble
fit.stack <- train(classe ~ ., data=pred.vld, method="svmRadial",
metric = myMetr, trControl = myCtrl)
# Compare the ensemble model training performance to the separate models.
rbind(getTrainPerf(fit.stack),
getTrainPerf(fit.svr),
getTrainPerf(fit.rft),
getTrainPerf(fit.gbm))
## TrainAccuracy TrainKappa method
## 1 0.9862357 0.9825861 svmRadial
## 2 0.9824536 0.9778050 svmRadial
## 3 0.9674126 0.9587721 rf
## 4 0.9650669 0.9558121 gbm
# To test the ensemble performance we create "stacked" predictors based
# on the test set for ensemble prediction assessment.
yhat.svr <- predict(fit.svr, trn.tst)
yhat.rft <- predict(fit.rft, trn.tst)
yhat.gbm <- predict(fit.gbm, trn.tst)
# Create "stacked" dataframe of modelled predicters and the classe variable (from the test set)
# in preparation for training the ensemble model against the "test" data.
pred.tst <- data.frame(yhat.svr, yhat.rft, yhat.gbm, classe=trn.tst$classe)
# Predict the model classe variable using the "stacked" predictors
yhat.stk <- predict(fit.stack, pred.tst)
# Compare the predicted class against the actual class variable.
cm.stk <- confusionMatrix(yhat.stk, trn.tst$classe)
str(cm.stk$byClass)
## num [1:5, 1:11] 0.988 0.989 0.972 0.972 0.993 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:5] "Class: A" "Class: B" "Class: C" "Class: D" ...
## ..$ : chr [1:11] "Sensitivity" "Specificity" "Pos Pred Value" "Neg Pred Value" ...
Conclusion
Our statistics, including accuracy, precision and recall, indicate that the ensemble model is well generalized and not overfit to the training data. The ensemble model’s estimated training accuracy (98.62%) shows a 0.38% predictive improvement over the SVM model (98.25%).
Let’s see how the ensemble performs compared to the separate models against the test data (a sketch showing how these figures are extracted follows the list):
- Ensemble: 98.39%
- SVM-Radial: 98.34%
- Random Forest: 97.43%
- Gradient Boost: 97.12%
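These test-set accuracies can be read straight off the confusion matrices; a minimal sketch using the test-set predictions generated above (acc.tst is an illustrative name):
# Minimal sketch: pull the overall test-set accuracy for each model from its confusion matrix.
acc.tst <- c(Ensemble       = unname(confusionMatrix(yhat.stk, trn.tst$classe)$overall["Accuracy"]),
             SVM.Radial     = unname(confusionMatrix(yhat.svr, trn.tst$classe)$overall["Accuracy"]),
             Random.Forest  = unname(confusionMatrix(yhat.rft, trn.tst$classe)$overall["Accuracy"]),
             Gradient.Boost = unname(confusionMatrix(yhat.gbm, trn.tst$classe)$overall["Accuracy"]))
round(acc.tst * 100, 2)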
In addition to the improved estimated mean performance, the ensemble model has estimated per-class performance that is equal-to or better than the separate models:
# Assess how the ensemble gets higher accuracy
# Compare per-class F1 scores (the harmonic mean of precision and recall) across the models
pred.perf <- rbind(confusionMatrix(yhat.svr, trn.tst$classe)$byClass[, 7],
confusionMatrix(yhat.rft, trn.tst$classe)$byClass[, 7],
confusionMatrix(yhat.gbm, trn.tst$classe)$byClass[, 7],
confusionMatrix(yhat.stk, trn.tst$classe)$byClass[, 7])
row.names(pred.perf) <- c("SVM.Radial", "Random.Forest", "Gradient.Boost", "Ensemble.Stack")
# Class Summary
kable(pred.perf, caption = "Per-class F1 Score (Harmonic Mean of Precision and Recall):")
Model | Class: A | Class: B | Class: C | Class: D | Class: E |
---|---|---|---|---|---|
SVM.Radial | 0.9901257 | 0.9810581 | 0.9700948 | 0.9772549 | 0.9937543 |
Random.Forest | 0.9839429 | 0.9648308 | 0.9561467 | 0.9693637 | 0.9909281 |
Gradient.Boost | 0.9829443 | 0.9606815 | 0.9486623 | 0.9702194 | 0.9867411 |
Ensemble.Stack | 0.9910153 | 0.9810581 | 0.9700948 | 0.9788567 | 0.9937543 |
Technical Notes: Receiver Operating Characteristic (ROC) curve analysis was not designed for multi-class problems; however, using the pROC library, we calculate a mean area under the curve (AUC) of 99.37% on the test data (a multi-class AUC has no single curve to plot). To learn more about ROC curve analysis, have a look at this BioMed Central (BMC) research paper: “Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77”.
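The AUC calculation itself is not shown above. As a minimal sketch of how it might be computed with pROC's multiclass.roc(), using the discrete class predictions coded as integers (a common approximation when class probabilities are not retained):
# Minimal sketch: mean multi-class AUC on the test set via pROC.
# multiclass.roc() averages the pairwise AUCs, so there is no single curve to plot.
roc.multi <- multiclass.roc(response  = trn.tst$classe,
                            predictor = as.numeric(yhat.stk))
auc(roc.multi)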
The code and content in this post are licensed by Roberto Rivera to the public under a Creative Commons Attribution 4.0 License.