Now it is time to train our machine learning models. Because we had difficulty spotting a connection between any single attribute and the success (or failure) of a project in the previous section, we will train a logistic regression model and use it to hypothesis test for relationships. We will then train a random forest, perform cross-validation to assess the forest's performance and tune its hyperparameters, and visualize what sort of improvement (if any) the tuned hyperparameters give us using an AUROC plot. Finally, we will test whether one model is better than the other.
We will start by randomly sampling 0.1% of our data because of the memory limitations of R's randomForest and logistic regression libraries (randomForest can handle a significantly larger dataset, which we will address later). Both our logistic regression model and our random forest will be built from this subset.
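The exact sampling call is not reproduced here; a minimal sketch of how such a subset could be drawn with dplyr, assuming the full cleaned data frame is called projects_df (a hypothetical name), is:

library(dplyr)

set.seed(1234)                # make the sample reproducible
subset <- projects_df %>%     # projects_df is a placeholder for the full data frame
  sample_frac(0.001)          # keep roughly 0.1% of the rows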
Logistic Regression
We want to see if we can spot a relationship between state and the other attributes that we could not spot in our exploratory data analysis section.
Below we train our logistic regression model on our subsetted data.
Let's see if we can spot a relationship between state and the three previously explored attributes…
lr <- glm(state ~ usd_goal_real + duration + main_category, data = subset, family = binomial)
lr %>%
tidy() %>%
kable(digits = 5)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1.23749 | 0.54008 | 2.29132 | 0.02195 |
| usd_goal_real | -0.00003 | 0.00001 | -3.77297 | 0.00016 |
| duration | 0.00000 | 0.00000 | -3.52843 | 0.00042 |
| main_categoryComics | 0.70342 | 0.83264 | 0.84481 | 0.39822 |
| main_categoryCrafts | -0.22895 | 0.90046 | -0.25426 | 0.79929 |
| main_categoryDance | -0.11701 | 1.01229 | -0.11558 | 0.90798 |
| main_categoryDesign | -0.34825 | 0.59805 | -0.58231 | 0.56036 |
| main_categoryFashion | -1.49444 | 0.79293 | -1.88470 | 0.05947 |
| main_categoryFilm & Video | 0.36330 | 0.51301 | 0.70818 | 0.47884 |
| main_categoryFood | -0.09686 | 0.62230 | -0.15565 | 0.87631 |
| main_categoryGames | -0.23463 | 0.55546 | -0.42240 | 0.67273 |
| main_categoryJournalism | -0.73984 | 0.83600 | -0.88498 | 0.37617 |
| main_categoryMusic | 0.19660 | 0.50250 | 0.39123 | 0.69562 |
| main_categoryPhotography | -0.19711 | 0.67078 | -0.29385 | 0.76887 |
| main_categoryPublishing | -0.71183 | 0.57241 | -1.24357 | 0.21366 |
| main_categoryTechnology | -0.37860 | 0.61640 | -0.61420 | 0.53908 |
| main_categoryTheater | 2.74847 | 1.33146 | 2.06426 | 0.03899 |
Contrary to our exploratory analysis, goal and duration are significantly related to state, while main_category largely is not (Theater is the only level with p < 0.05).
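Since these coefficients are on the log-odds scale, one optional way to make them easier to interpret is to exponentiate them into odds ratios, which broom's tidy() can do directly:

lr %>%
  tidy(exponentiate = TRUE) %>%   # report odds ratios instead of log-odds
  kable(digits = 5)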
Random Forest
Adapted from the example here, since that is the only way I know how to do random forests in R.
Split Training/Test Data
We will then build a random forest using an 80/20 train/test split of this subset. This split is pretty standard, but you can split differently if you have a good reason to.
set.seed(1234)
test_random_forest_df <- subset %>%
group_by(state) %>%
sample_frac(.2) %>%
ungroup()
train_random_forest_df <- subset %>%
anti_join(test_random_forest_df, by = "ID")
Training
Now that we have our training set we can train the model…
set.seed(1234)
rf <- randomForest(state ~ .,
data = train_random_forest_df %>%
select(-ID, -name, -category, -deadline, -launched))
rf
##
## Call:
## randomForest(formula = state ~ ., data = train_random_forest_df %>% select(-ID, -name, -category, -deadline, -launched))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.94%
## Confusion matrix:
## 0 1 class.error
## 0 168 14 0.07692308
## 1 4 117 0.03305785
Testing
Time to make predictions!
test_predictions <- predict(rf,
newdata = test_random_forest_df %>%
select(-ID, -name, -category, -deadline, -launched))
And let's check our performance on the test set…
table(pred = test_predictions, observed = test_random_forest_df$state)
## observed
## pred 0 1
## 0 42 2
## 1 4 28
Our error rate is less than 8%. Not bad.
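For reference, the overall test error rate implied by the table above (6 misclassified out of 76, or about 7.9%) can also be computed directly:

mean(test_predictions != test_random_forest_df$state)   # fraction of test projects misclassified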
Cross-Validation
We will now perform 5-fold cross-validation (the subset contains only 379 observations) to compare our large random forest (the default 500 trees) with a moderately sized random forest (100 trees). This process can be automated to try different fold counts (e.g. 10 instead of 5) and differently sized forests, ideally trained on a larger dataset; one could then plot the hyperparameter values against test accuracy to choose the best ones, which is how hyperparameters should really be tuned (a rough sketch of such a sweep appears at the end of this subsection). For the purposes of this tutorial, this comparison will give us an idea of whether performance plateaus after the forest reaches a certain size.
set.seed(1234)
results_df <- createFolds(subset$state, k = 5) %>%
imap(function(test_indices, fold_number) {
train_df <- subset %>%
select(-ID, -name, -category, -deadline, -launched) %>%
slice(-test_indices)
test_df <- subset %>%
select(-ID, -name, -category, -deadline, -launched) %>%
slice(test_indices)
rf1 <- randomForest(state ~ ., data = train_df, ntree = 500)
rf2 <- randomForest(state ~ ., data = train_df, ntree = 100)
test_df %>%
select(observed_label = state) %>%
mutate(fold = fold_number) %>%
mutate(prob_positive_rf1 = predict(rf1, newdata = test_df, type = "prob")[, "1"]) %>%
mutate(predicted_label_rf1 = ifelse(prob_positive_rf1 > 0.5, 1, 0)) %>%
mutate(prob_positive_rf2 = predict(rf2, newdata = test_df, type = "prob")[, "1"]) %>%
mutate(predicted_label_rf2 = ifelse(prob_positive_rf2 > 0.5, 1, 0))
}) %>%
reduce(bind_rows)
kable(head(results_df))
| observed_label | fold | prob_positive_rf1 | predicted_label_rf1 | prob_positive_rf2 | predicted_label_rf2 |
|---|---|---|---|---|---|
| 0 | Fold1 | 0.290 | 0 | 0.27 | 0 |
| 1 | Fold1 | 0.954 | 1 | 0.95 | 1 |
| 1 | Fold1 | 0.762 | 1 | 0.74 | 1 |
| 1 | Fold1 | 0.958 | 1 | 0.98 | 1 |
| 1 | Fold1 | 0.908 | 1 | 0.91 | 1 |
| 0 | Fold1 | 0.120 | 0 | 0.12 | 0 |
Now to test for a difference in error rate…
results_df %>%
mutate(error_rf1 = observed_label != predicted_label_rf1,
error_rf2 = observed_label != predicted_label_rf2) %>%
group_by(fold) %>%
summarize(big_rf = mean(error_rf1), small_rf = mean(error_rf2)) %>%
gather(model, error, -fold) %>%
lm(error ~ model, data = .) %>%
tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 0.068463431 0.01441413 4.7497456 0.001445722
## 2 modelsmall_rf -0.005333333 0.02038465 -0.2616347 0.800213126
There is not a statistically significant difference in error rate between the two random forests. This suggests that our performance starts to plateau at or before 100 trees.
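As noted above, a fuller tuning pass would sweep over several forest sizes and plot error against the number of trees. Here is a rough sketch of that idea using the out-of-bag error that randomForest already reports (the grid of tree counts is arbitrary):

set.seed(1234)
ntree_grid <- c(25, 50, 100, 250, 500)   # arbitrary candidate forest sizes

oob_errors <- sapply(ntree_grid, function(n) {
  fit <- randomForest(state ~ .,
                      data = train_random_forest_df %>%
                        select(-ID, -name, -category, -deadline, -launched),
                      ntree = n)
  tail(fit$err.rate[, "OOB"], 1)         # OOB error rate after all n trees
})

plot(ntree_grid, oob_errors, type = "b",
     xlab = "Number of trees", ylab = "OOB error rate")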
Cross-Validation cont. (AUROC)
Now let's visualize the difference…
labels <- split(results_df$observed_label, results_df$fold)
predictions_rf1 <- split(results_df$prob_positive_rf1, results_df$fold) %>% prediction(labels)
predictions_rf2 <- split(results_df$prob_positive_rf2, results_df$fold) %>% prediction(labels)
mean_auc_rf1 <- predictions_rf1 %>%
performance(measure = "auc") %>%
slot("y.values") %>% unlist() %>%
mean()
mean_auc_rf2 <- predictions_rf2 %>%
performance(measure = "auc") %>%
slot("y.values") %>% unlist() %>%
mean()
predictions_rf1 %>%
performance(measure = "tpr", x.measure = "fpr") %>%
plot(avg = "threshold", col = "orange", lwd = 2)
predictions_rf2 %>%
performance(measure = "tpr", x.measure = "fpr") %>%
plot(avg = "threshold", col = "blue", lwd = 2, add = TRUE)
legend("bottomright",
legend = paste(c("big", "small"), "rf, AUC:", round(c(mean_auc_rf1, mean_auc_rf2), digits = 3)),
col = c("orange", "blue"))

Both classifiers perform astonishingly well.
It turns out we do not have to worry about successes being rare (in retrospect they probably are not). But if they were, we might have wanted to adjust our classification threshold to increase the true positive rate at the expense of a higher false positive rate.
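For illustration, lowering the probability cutoff is one way to make that trade. A quick sketch using the cross-validated probabilities we already have (the 0.3 cutoff below is arbitrary):

threshold <- 0.3   # arbitrary; the predictions above implicitly used 0.5

results_df %>%
  mutate(predicted_label_lowt = ifelse(prob_positive_rf1 > threshold, 1, 0)) %>%
  summarize(true_positive_rate = mean(predicted_label_lowt[observed_label == 1] == 1),
            false_positive_rate = mean(predicted_label_lowt[observed_label == 0] == 1))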
Model Selection
It should be noted that the results of this section are purely illustrative. One can train a randomForest on roughly 30% of our complete dataset and observe an error rate well below 1%; see the extra section following the conclusion.
We want to see if there is a statistically significant difference between the error rates of our logistic regression model and our random forest model.
We start by applying 5-fold cross validation to both models and generating a table of error rates…
fold_indices <- cvFolds(n = nrow(subset), K = 5)
error_rates <- sapply(1:5, function(fold_index) {
test_indices <- which(fold_indices$which == fold_index)
ss <- subset %>%
select(-ID, -name, -category, -deadline, -launched)
test_set <- ss[test_indices,]
train_set <- ss[-test_indices,]
lr_fit <- glm(state ~ ., data = train_set %>% select(-currency, -country), family = "binomial")
lr_pred <- ifelse(predict(lr_fit, newdata = test_set, type = "response") > 0.5, 1, 0)
lr_error <- mean(test_set$state != lr_pred)
rf_fit <- randomForest(state ~ ., data = train_set)
rf_pred <- predict(rf_fit, newdata = test_set)   # use the forest trained on this fold (rf_fit), not the earlier rf
rf_error <- mean(test_set$state != rf_pred)
c(lr_error, rf_error)
})
rownames(error_rates) <- c("lr", "rf")
error_rates <- as.data.frame(t(error_rates))
error_rates <- error_rates %>%
mutate(fold = 1:n()) %>%
gather(model, error, -fold)
error_rates %>%
kable("html")
| fold | model | error |
|---|---|---|
| 1 | lr | 0.0657895 |
| 2 | lr | 0.0131579 |
| 3 | lr | 0.0526316 |
| 4 | lr | 0.0263158 |
| 5 | lr | 0.0266667 |
| 1 | rf | 0.0131579 |
| 2 | rf | 0.0131579 |
| 3 | rf | 0.0000000 |
| 4 | rf | 0.0131579 |
| 5 | rf | 0.0400000 |
We can plot the error rates for each fold for each model as follows…
dotplot(error ~ model, data = error_rates, ylab = "Mean Prediction Error")

Now to hypothesis test…
lm(error ~ model, data = error_rates) %>%
tidy() %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0369123 | 0.0082442 | 4.477346 | 0.0020630 |
| modelrf | -0.0210175 | 0.0116591 | -1.802672 | 0.1091065 |
There is not a statistically significant difference between the two models.
But again, this is contrived: a randomForest trained on 30% of the projects dataset has an error rate of roughly 0.4%.
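As a complementary check (both models were evaluated on the same five folds, so the per-fold errors are paired), one could also run a paired t-test on the same error table:

error_rates %>%
  spread(model, error) %>%                 # one row per fold, lr and rf errors side by side
  with(t.test(lr, rf, paired = TRUE)) %>%
  tidy() %>%
  kable()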