For this course project, we aim to predict the manner in which the exercise was performed, using a dataset of accelerometer measurements taken from the belt, forearm, arm, and dumbbell of six participants.
We load the caret package for data preparation and modeling, the tidyverse package for data cleansing, and the readr package for data loading.
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: ggplot2
library(readr)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v purrr 0.2.5 v forcats 0.3.0
## Warning: package 'purrr' was built under R version 3.4.4
## -- Conflicts ------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x BiocGenerics::Position() masks ggplot2::Position(), base::Position()
First, we load the training and testing datasets.
training <- read_csv("pml-training.csv")
testing <- read_csv("pml-testing.csv")
as_tibble(training)
## # A tibble: 19,622 x 160
## X1 user_name raw_timestamp_part~ raw_timestamp_part~ cvtd_timestamp
## <int> <chr> <int> <int> <chr>
## 1 1 carlitos 1323084231 788290 05/12/2011 11:~
## 2 2 carlitos 1323084231 808298 05/12/2011 11:~
## 3 3 carlitos 1323084231 820366 05/12/2011 11:~
## 4 4 carlitos 1323084232 120339 05/12/2011 11:~
## 5 5 carlitos 1323084232 196328 05/12/2011 11:~
## 6 6 carlitos 1323084232 304277 05/12/2011 11:~
## 7 7 carlitos 1323084232 368296 05/12/2011 11:~
## 8 8 carlitos 1323084232 440390 05/12/2011 11:~
## 9 9 carlitos 1323084232 484323 05/12/2011 11:~
## 10 10 carlitos 1323084232 484434 05/12/2011 11:~
## # ... with 19,612 more rows, and 155 more variables: new_window <chr>,
## # num_window <int>, roll_belt <dbl>, pitch_belt <dbl>, yaw_belt <dbl>,
## # total_accel_belt <int>, kurtosis_roll_belt <chr>,
## # kurtosis_picth_belt <chr>, kurtosis_yaw_belt <chr>,
## # skewness_roll_belt <chr>, skewness_roll_belt.1 <chr>,
## # skewness_yaw_belt <chr>, max_roll_belt <dbl>, max_picth_belt <int>,
## # max_yaw_belt <chr>, min_roll_belt <dbl>, min_pitch_belt <int>,
## # min_yaw_belt <chr>, amplitude_roll_belt <dbl>,
## # amplitude_pitch_belt <int>, amplitude_yaw_belt <chr>,
## # var_total_accel_belt <dbl>, avg_roll_belt <dbl>,
## # stddev_roll_belt <dbl>, var_roll_belt <dbl>, avg_pitch_belt <dbl>,
## # stddev_pitch_belt <dbl>, var_pitch_belt <dbl>, avg_yaw_belt <dbl>,
## # stddev_yaw_belt <dbl>, var_yaw_belt <dbl>, gyros_belt_x <dbl>,
## # gyros_belt_y <dbl>, gyros_belt_z <dbl>, accel_belt_x <int>,
## # accel_belt_y <int>, accel_belt_z <int>, magnet_belt_x <int>,
## # magnet_belt_y <int>, magnet_belt_z <int>, roll_arm <dbl>,
## # pitch_arm <dbl>, yaw_arm <dbl>, total_accel_arm <int>,
## # var_accel_arm <dbl>, avg_roll_arm <dbl>, stddev_roll_arm <dbl>,
## # var_roll_arm <dbl>, avg_pitch_arm <dbl>, stddev_pitch_arm <dbl>,
## # var_pitch_arm <dbl>, avg_yaw_arm <dbl>, stddev_yaw_arm <dbl>,
## # var_yaw_arm <dbl>, gyros_arm_x <dbl>, gyros_arm_y <dbl>,
## # gyros_arm_z <dbl>, accel_arm_x <int>, accel_arm_y <int>,
## # accel_arm_z <int>, magnet_arm_x <int>, magnet_arm_y <int>,
## # magnet_arm_z <int>, kurtosis_roll_arm <dbl>, kurtosis_picth_arm <chr>,
## # kurtosis_yaw_arm <chr>, skewness_roll_arm <dbl>,
## # skewness_pitch_arm <chr>, skewness_yaw_arm <chr>, max_roll_arm <dbl>,
## # max_picth_arm <dbl>, max_yaw_arm <int>, min_roll_arm <dbl>,
## # min_pitch_arm <dbl>, min_yaw_arm <int>, amplitude_roll_arm <dbl>,
## # amplitude_pitch_arm <dbl>, amplitude_yaw_arm <int>,
## # roll_dumbbell <dbl>, pitch_dumbbell <dbl>, yaw_dumbbell <dbl>,
## # kurtosis_roll_dumbbell <dbl>, kurtosis_picth_dumbbell <dbl>,
## # kurtosis_yaw_dumbbell <chr>, skewness_roll_dumbbell <dbl>,
## # skewness_pitch_dumbbell <dbl>, skewness_yaw_dumbbell <chr>,
## # max_roll_dumbbell <dbl>, max_picth_dumbbell <dbl>,
## # max_yaw_dumbbell <dbl>, min_roll_dumbbell <dbl>,
## # min_pitch_dumbbell <dbl>, min_yaw_dumbbell <dbl>,
## # amplitude_roll_dumbbell <dbl>, amplitude_pitch_dumbbell <dbl>,
## # amplitude_yaw_dumbbell <dbl>, total_accel_dumbbell <int>,
## # var_accel_dumbbell <dbl>, avg_roll_dumbbell <dbl>,
## # stddev_roll_dumbbell <dbl>, ...
Our target variable is “classe”, which records how the exercise was performed and takes five levels (A through E).
training <- training %>% mutate(classe = factor(classe))
table(training$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
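As a sanity check on class balance, we can also look at relative frequencies; the share of the largest class sets the baseline (no-information) accuracy that any classifier must beat. A minimal sketch:

```r
# Relative class frequencies; class A is the most common at roughly 28%,
# which corresponds to the no-information rate reported by confusionMatrix().
round(prop.table(table(training$classe)), 3)
```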
Looking at the dataset structure, we find that some columns are not useful for prediction: row indices, user names, timestamps, and window markers.
training %>% select(X1, user_name, raw_timestamp_part_1,
raw_timestamp_part_2, cvtd_timestamp,
new_window, num_window)
## # A tibble: 19,622 x 7
## X1 user_name raw_timestamp_part~ raw_timestamp_part~ cvtd_timestamp
## <int> <chr> <int> <int> <chr>
## 1 1 carlitos 1323084231 788290 05/12/2011 11:~
## 2 2 carlitos 1323084231 808298 05/12/2011 11:~
## 3 3 carlitos 1323084231 820366 05/12/2011 11:~
## 4 4 carlitos 1323084232 120339 05/12/2011 11:~
## 5 5 carlitos 1323084232 196328 05/12/2011 11:~
## 6 6 carlitos 1323084232 304277 05/12/2011 11:~
## 7 7 carlitos 1323084232 368296 05/12/2011 11:~
## 8 8 carlitos 1323084232 440390 05/12/2011 11:~
## 9 9 carlitos 1323084232 484323 05/12/2011 11:~
## 10 10 carlitos 1323084232 484434 05/12/2011 11:~
## # ... with 19,612 more rows, and 2 more variables: new_window <chr>,
## # num_window <int>
We remove these columns so that identity and ordering information cannot leak into the model.
training_2 <- training %>% select(-X1, -user_name, -raw_timestamp_part_1,
-raw_timestamp_part_2, -cvtd_timestamp,
-new_window, -num_window)
Beyond these columns, we also need to identify near-zero-variance predictors, i.e. columns whose values are almost constant and therefore carry little information.
chk_variance <- nearZeroVar(training_2, saveMetrics = TRUE)
head(chk_variance)
## freqRatio percentUnique zeroVar nzv
## roll_belt 1.101904 6.7781062 FALSE FALSE
## pitch_belt 1.036082 9.3772296 FALSE FALSE
## yaw_belt 1.058480 9.9734991 FALSE FALSE
## total_accel_belt 1.063160 0.1477933 FALSE FALSE
## kurtosis_roll_belt 5.000000 2.0181429 FALSE FALSE
## kurtosis_picth_belt 8.000000 1.6104373 FALSE FALSE
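Before dropping anything, it can be useful to count how many columns are flagged; a one-liner (the exact count depends on the data):

```r
# Number of columns flagged as near-zero variance (nzv == TRUE).
sum(chk_variance$nzv)
```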
We then remove the columns flagged as near-zero variance:
training_3 <- training_2[, !chk_variance$nzv]
Next, we drop every column that contains NAs or blanks, keeping only complete columns.
NAcol <- training_3 %>% sapply(is.na) %>% apply(2, sum)
training_df <- training_3[, NAcol == 0]
training_df
## # A tibble: 19,622 x 50
## roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 1.41 8.07 -94.4 3 0.
## 2 1.41 8.07 -94.4 3 0.0200
## 3 1.42 8.07 -94.4 3 0.
## 4 1.48 8.05 -94.4 3 0.0200
## 5 1.48 8.07 -94.4 3 0.0200
## 6 1.45 8.06 -94.4 3 0.0200
## 7 1.42 8.09 -94.4 3 0.0200
## 8 1.42 8.13 -94.4 3 0.0200
## 9 1.43 8.16 -94.4 3 0.0200
## 10 1.45 8.17 -94.4 3 0.0300
## # ... with 19,612 more rows, and 45 more variables: gyros_belt_y <dbl>,
## # gyros_belt_z <dbl>, accel_belt_x <int>, accel_belt_y <int>,
## # accel_belt_z <int>, magnet_belt_x <int>, magnet_belt_y <int>,
## # magnet_belt_z <int>, roll_arm <dbl>, pitch_arm <dbl>, yaw_arm <dbl>,
## # total_accel_arm <int>, gyros_arm_x <dbl>, gyros_arm_y <dbl>,
## # gyros_arm_z <dbl>, accel_arm_x <int>, accel_arm_y <int>,
## # accel_arm_z <int>, magnet_arm_x <int>, magnet_arm_y <int>,
## # magnet_arm_z <int>, roll_dumbbell <dbl>, pitch_dumbbell <dbl>,
## # yaw_dumbbell <dbl>, total_accel_dumbbell <int>,
## # gyros_dumbbell_x <dbl>, gyros_dumbbell_y <dbl>,
## # gyros_dumbbell_z <dbl>, accel_dumbbell_x <int>,
## # accel_dumbbell_y <int>, accel_dumbbell_z <int>,
## # magnet_dumbbell_x <int>, magnet_dumbbell_y <int>, roll_forearm <dbl>,
## # pitch_forearm <dbl>, yaw_forearm <dbl>, total_accel_forearm <int>,
## # gyros_forearm_x <dbl>, gyros_forearm_y <dbl>, gyros_forearm_z <dbl>,
## # accel_forearm_x <int>, accel_forearm_y <int>, accel_forearm_z <int>,
## # magnet_forearm_x <int>, classe <fct>
After cleansing, we have 49 features with which to predict classe.
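A quick dimension check confirms the shape of the cleaned data, which we already saw in the tibble header above:

```r
# Verify the cleaned dataset: 19,622 rows, 49 predictors plus the classe outcome.
dim(training_df)  # 19622 50
```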
To estimate the out-of-sample error and guard against overfitting, we hold out 20% of the training data as a cross-validation set:
set.seed(145)
training_split <- createDataPartition(training_df$classe, p = 0.8, list = FALSE)
training_f <- training_df[training_split, ]
cv_f <- training_df[-training_split, ]
We apply the same cleansing steps to the test set.
testing_2 <- testing %>% select(-X1, -user_name, -raw_timestamp_part_1,
-raw_timestamp_part_2, -cvtd_timestamp,
-new_window, -num_window)
testing_3 <- testing_2[, !chk_variance$nzv]
testing_f <- testing_3[, NAcol == 0]
As noted above, the prediction target is “classe”, with five levels describing how the exercise was performed. We choose a random forest because it is simple to apply and performs well on multi-class classification. To begin with, the model includes all predictors in the final dataset.
rf_full <- train(classe ~., data = training_f, method = "rf")
rf_full$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 25
##
## OOB estimate of error rate: 0.59%
## Confusion matrix:
## A B C D E class.error
## A 4458 3 3 0 0 0.001344086
## B 21 3009 7 0 1 0.009545754
## C 0 16 2713 9 0 0.009130752
## D 1 1 21 2549 1 0.009327633
## E 0 1 1 7 2877 0.003118503
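Note that caret's train() tunes mtry with 25 bootstrap resamples by default, which is slow on roughly 15,700 rows. A sketch of a faster alternative, swapping in 5-fold cross-validation for the tuning step (ctrl and rf_cv are names introduced here for illustration):

```r
library(caret)
# 5-fold cross-validation for mtry tuning instead of the default bootstrap;
# typically much faster and usually selects a near-identical final model.
ctrl  <- trainControl(method = "cv", number = 5)
rf_cv <- train(classe ~ ., data = training_f, method = "rf", trControl = ctrl)
```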
Second, we estimate a simpler model that omits the axis-specific (_x, _y, _z) measurements, keeping only the roll, pitch, yaw, and total-acceleration summaries for each sensor.
training_s_f <- training_f %>% select(roll_belt, roll_arm, roll_dumbbell, roll_forearm,
pitch_belt, pitch_arm, pitch_dumbbell, pitch_forearm,
yaw_belt, yaw_arm, yaw_dumbbell, yaw_forearm,
total_accel_belt, total_accel_arm, total_accel_dumbbell,
total_accel_forearm, classe)
rf_simple <- train(classe ~., data = training_s_f, method = "rf")
rf_simple$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 9
##
## OOB estimate of error rate: 0.85%
## Confusion matrix:
## A B C D E class.error
## A 4454 8 1 1 0 0.002240143
## B 23 2988 26 1 0 0.016458196
## C 1 15 2709 13 0 0.010591673
## D 2 3 13 2553 2 0.007773028
## E 1 6 11 7 2861 0.008662509
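To see which of the 16 summary features drive the simpler model, caret's variable-importance extractor can be inspected (output omitted here):

```r
# Scaled variable importance for the reduced random forest.
varImp(rf_simple)
```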
Using the cross-validation set we held out, we check prediction accuracy without touching the test set.
predict_full <- predict(rf_full, cv_f)
predict_simple <- predict(rf_simple, cv_f)
confusionMatrix(predict_full, cv_f$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1115 5 0 0 0
## B 1 753 1 0 0
## C 0 1 679 3 0
## D 0 0 4 639 1
## E 0 0 0 1 720
##
## Overall Statistics
##
## Accuracy : 0.9957
## 95% CI : (0.9931, 0.9975)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9945
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9921 0.9927 0.9938 0.9986
## Specificity 0.9982 0.9994 0.9988 0.9985 0.9997
## Pos Pred Value 0.9955 0.9974 0.9941 0.9922 0.9986
## Neg Pred Value 0.9996 0.9981 0.9985 0.9988 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2842 0.1919 0.1731 0.1629 0.1835
## Detection Prevalence 0.2855 0.1925 0.1741 0.1642 0.1838
## Balanced Accuracy 0.9987 0.9957 0.9957 0.9961 0.9992
confusionMatrix(predict_simple, cv_f$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1115 2 0 0 0
## B 1 745 2 0 3
## C 0 10 676 4 0
## D 0 1 6 637 0
## E 0 1 0 2 718
##
## Overall Statistics
##
## Accuracy : 0.9918
## 95% CI : (0.9885, 0.9944)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9897
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9816 0.9883 0.9907 0.9958
## Specificity 0.9993 0.9981 0.9957 0.9979 0.9991
## Pos Pred Value 0.9982 0.9920 0.9797 0.9891 0.9958
## Neg Pred Value 0.9996 0.9956 0.9975 0.9982 0.9991
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2842 0.1899 0.1723 0.1624 0.1830
## Detection Prevalence 0.2847 0.1914 0.1759 0.1642 0.1838
## Balanced Accuracy 0.9992 0.9898 0.9920 0.9943 0.9975
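The two overall accuracies can also be pulled out programmatically for a side-by-side comparison (acc_full and acc_simple are illustrative names):

```r
# Overall accuracy of each model on the held-out cross-validation set.
acc_full   <- confusionMatrix(predict_full, cv_f$classe)$overall["Accuracy"]
acc_simple <- confusionMatrix(predict_simple, cv_f$classe)$overall["Accuracy"]
c(full = unname(acc_full), simple = unname(acc_simple))
```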
Judging from the confusion matrices above, we choose the simpler specification to predict classe on the test set: it achieves comparably high accuracy (99.2% vs. 99.6% on the cross-validation set) while using far fewer predictors.
predict(rf_simple, testing_f)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E