1 Introduction

We will be using machine learning in an effort to predict the success or failure of Kickstarter projects. The general question we hope to answer is: Given certain factors, such as goal (how much money the Kickstarter project wants to raise) and duration (for how long can people back the project), can we predict if the project will be successful or not? We will define success later. We will also be doing some exploratory data analysis and visualization throughout. We will train and/or analyze multiple machine learning models, tune hyperparameters, and compare their results.

Our tasks are as follows:

  1. Load our data
  2. Tidy our data
  3. Perform some EDA and visualization
  4. Train and analyze our machine learning models

1.1 Load Libraries

Below you can see the libraries we used to perform our data analysis…

library(lubridate)
library(dplyr)
library(knitr)
library(ggplot2)
library(randomForest)
library(caret)
library(purrr)
library(tidyr)
library(broom)
library(ROCR)
library(cvTools)

2 Data

Below we load our primary table obtained from https://www.kaggle.com/kemical/kickstarter-projects. The data originates from https://www.kickstarter.com. Kickstarter is one of many ‘crowdfunding’ sites where entrepreneurs can fund projects with the help of other people.

Also, please excuse any profanity in the names of projects as I did not create them.

2.1 Loading

projects <- read.csv("ks-projects-201801.csv", stringsAsFactors = FALSE)

str(projects)
## 'data.frame':    378661 obs. of  15 variables:
##  $ ID              : int  1000002330 1000003930 1000004038 1000007540 1000011046 1000014025 1000023410 1000030581 1000034518 100004195 ...
##  $ name            : chr  "The Songs of Adelaide & Abullah" "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" ...
##  $ category        : chr  "Poetry" "Narrative Film" "Narrative Film" "Music" ...
##  $ main_category   : chr  "Publishing" "Film & Video" "Film & Video" "Music" ...
##  $ currency        : chr  "GBP" "USD" "USD" "USD" ...
##  $ deadline        : chr  "2015-10-09" "2017-11-01" "2013-02-26" "2012-04-16" ...
##  $ goal            : num  1000 30000 45000 5000 19500 50000 1000 25000 125000 65000 ...
##  $ launched        : chr  "2015-08-11 12:12:28" "2017-09-02 04:43:57" "2013-01-12 00:20:50" "2012-03-17 03:24:11" ...
##  $ pledged         : num  0 2421 220 1 1283 ...
##  $ state           : chr  "failed" "failed" "failed" "failed" ...
##  $ backers         : int  0 15 3 1 14 224 16 40 58 43 ...
##  $ country         : chr  "GB" "US" "US" "US" ...
##  $ usd.pledged     : num  0 100 220 1 1283 ...
##  $ usd_pledged_real: num  0 2421 220 1 1283 ...
##  $ usd_goal_real   : num  1534 30000 45000 5000 19500 ...

Most of the attributes do not need an explanation. It should be noted that the ‘goal’, ‘pledged’ and ‘usd.pledged’ variables will be dropped during tidying because of differences in currency and inconsistencies in currency conversion. The attributes usd_pledged_real and usd_goal_real are conversions of pledged and goal made by the authors of the dataset using the Fixer.io API, giving accurate and uniform conversions.
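
As a quick sanity check (an optional aside, not part of the original write-up), we can look at a few non-USD projects and compare the raw goal with its converted counterpart; the first GBP project above, for example, has a goal of 1000 GBP but a usd_goal_real of roughly 1534 USD.

projects %>%
  filter(currency != "USD") %>%
  select(name, currency, goal, usd_goal_real) %>%
  head()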

2.2 Tidying

Our dataset is a mostly “tidy” dataset. However, we must still perform some transformations for our purposes.

As mentioned in the loading section, we must drop the attributes ‘goal’, ‘pledged’ and ‘usd.pledged’. We must also convert ‘launched’ and ‘deadline’ into datetimes, and add a variable ‘duration’ that equals ‘deadline’ - ‘launched’. Additionally, we will convert categorical attributes into factors. For simplicity we will also turn this into a binary classification problem (i.e. ‘state’ equals ‘0’ or ‘1’). It may be interesting to explore multiclass prediction later on if time permits.

2.2.1 Drop Unused Attributes

We start our transformations by dropping ‘goal’, ‘pledged’ and ‘usd.pledged’.

projects <- projects %>%
  select(-goal, -pledged, -usd.pledged)

2.2.2 Conversions

Next we will make our conversions for time and categorical attributes.

projects$launched <- ymd_hms(projects$launched)
projects$deadline <- ymd(projects$deadline)

projects$main_category <- as.factor(projects$main_category)
projects$category <- as.factor(projects$category)
projects$currency <- as.factor(projects$currency)
projects$country <- as.factor(projects$country)
#projects$state should also be a factor, but we address this in the next subsection.

2.2.3 Duration & Label Translation

Now let’s add an attribute ‘duration’ that equals ‘deadline’ - ‘launched’…

projects <- projects %>%
  mutate(duration = as.numeric(difftime(deadline, launched, units = "secs")))

Finally we take a look at the state attribute…

levels(as.factor(projects$state))
## [1] "canceled"   "failed"     "live"       "successful" "suspended" 
## [6] "undefined"

This means we must come up with a measure of success (‘1’) and failure (‘0’) ourselves. We will define success as ‘usd_pledged_real’ - ‘usd_goal_real’ >= 0. Translation below…

projects <- projects %>%
  mutate(state = ifelse(usd_pledged_real - usd_goal_real >= 0, 1, 0))

projects$state <- as.factor(projects$state)

str(projects)
## 'data.frame':    378661 obs. of  13 variables:
##  $ ID              : int  1000002330 1000003930 1000004038 1000007540 1000011046 1000014025 1000023410 1000030581 1000034518 100004195 ...
##  $ name            : chr  "The Songs of Adelaide & Abullah" "Greeting From Earth: ZGAC Arts Capsule For ET" "Where is Hank?" "ToshiCapital Rekordz Needs Help to Complete Album" ...
##  $ category        : Factor w/ 159 levels "3D Printing",..: 109 94 94 91 56 124 59 42 114 40 ...
##  $ main_category   : Factor w/ 15 levels "Art","Comics",..: 13 7 7 11 7 8 8 8 5 7 ...
##  $ currency        : Factor w/ 14 levels "AUD","CAD","CHF",..: 6 14 14 14 14 14 14 14 14 14 ...
##  $ deadline        : Date, format: "2015-10-09" "2017-11-01" ...
##  $ launched        : POSIXct, format: "2015-08-11 12:12:28" "2017-09-02 04:43:57" ...
##  $ state           : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 1 1 ...
##  $ backers         : int  0 15 3 1 14 224 16 40 58 43 ...
##  $ country         : Factor w/ 23 levels "AT","AU","BE",..: 10 23 23 23 23 23 23 23 23 23 ...
##  $ usd_pledged_real: num  0 2421 220 1 1283 ...
##  $ usd_goal_real   : num  1534 30000 45000 5000 19500 ...
##  $ duration        : num  5053652 5166963 3886750 2579749 4807497 ...

This completes our tidying.

Pragmatically speaking, a project that reaches its goal but does not deliver its product, or is otherwise cancelled or delayed, could be considered a failure from the consumer’s perspective. This is where one could deploy a model amenable to a multiclass setting. I have deemed our data insufficient to make accurate predictions in such a setting; one would hope for more information on the project teams, such as prior successes. This consideration is the reason for simplifying to a binary setting. Our predictions would therefore be mainly useful to other project teams looking to reach a certain goal for their own Kickstarter projects.

3 Exploratory Data Analysis and Visualization

Since our ultimate goal is prediction using machine learning models, it would be helpful to try and get a sense of what attributes may have a large impact on predicting ‘1’ (success) or ‘0’ (failure).

3.1 Setup

We will start by partitioning our main table into two intermediate tables, one for successes and one for failures. This way we can perform EDA separately and easily.

successes <- projects %>%
  filter(state == 1)

failures <- projects %>%
  filter(state == 0)

kable(head(successes))
| ID | name | category | main_category | currency | deadline | launched | state | backers | country | usd_pledged_real | usd_goal_real | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000014025 | Monarch Espresso Bar | Restaurants | Food | USD | 2016-04-01 | 2016-02-26 13:38:27 | 1 | 224 | US | 52375.0 | 50000.00 | 2974893 |
| 1000023410 | Support Solar Roasted Coffee & Green Energy! SolarCoffee.co | Food | Food | USD | 2014-12-21 | 2014-12-01 18:30:44 | 1 | 16 | US | 1205.0 | 1000.00 | 1661356 |
| 100005484 | Lisa Lim New CD! | Indie Rock | Music | USD | 2013-04-08 | 2013-03-09 06:42:58 | 1 | 100 | US | 12700.0 | 12500.00 | 2567822 |
| 1000057089 | Tombstone: Old West tabletop game and miniatures in 32mm. | Tabletop Games | Games | GBP | 2017-05-03 | 2017-04-05 19:44:18 | 1 | 761 | GB | 121857.3 | 6469.73 | 2348142 |
| 1000070642 | Mike Corey’s Darkness & Light Album | Music | Music | USD | 2012-08-17 | 2012-08-02 14:11:32 | 1 | 7 | US | 250.0 | 250.00 | 1244908 |
| 1000072011 | CMUK. Shoes: Take on Life Feet First. | Fashion | Fashion | USD | 2013-12-30 | 2013-11-25 07:06:11 | 1 | 624 | US | 34268.0 | 20000.00 | 2998429 |

kable(head(failures))
| ID | name | category | main_category | currency | deadline | launched | state | backers | country | usd_pledged_real | usd_goal_real | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 2015-08-11 12:12:28 | 0 | 0 | GB | 0 | 1533.95 | 5053652 |
| 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 2017-09-02 04:43:57 | 0 | 15 | US | 2421 | 30000.00 | 5166963 |
| 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 2013-01-12 00:20:50 | 0 | 3 | US | 220 | 45000.00 | 3886750 |
| 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 2012-03-17 03:24:11 | 0 | 1 | US | 1 | 5000.00 | 2579749 |
| 1000011046 | Community Film Project: The Art of Neighborhood Filmmaking | Film & Video | Film & Video | USD | 2015-08-29 | 2015-07-04 08:35:03 | 0 | 14 | US | 1283 | 19500.00 | 4807497 |
| 1000030581 | Chaser Strips. Our Strips make Shots their B*tch! | Drinks | Food | USD | 2016-03-17 | 2016-02-01 20:05:12 | 0 | 40 | US | 453 | 25000.00 | 3815688 |

3.1.1 One Quick Note

nrow(failures) / nrow(successes)
## [1] 1.763102

This is the ratio of failures to successes. The failures well outnumber the successes, which could be an indicator that being successful is a rare event. One should keep this in mind when viewing the graphs in this section. It will become particularly important in the machine learning section when we must decide how to evaluate our models.
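
As a rough, optional baseline (not part of the original analysis), note that a trivial classifier that always predicts failure would already be right roughly 64% of the time given the ratio above; any model we train should beat this comfortably.

nrow(failures) / nrow(projects)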

3.2 Comparing Attributes

3.2.1 Main Category

Let’s first take a look at which category of project has the most successes and which has the most failures. We look at main category here because R’s randomForest cannot handle categorical variables with more than 53 categories (i.e. the ‘category’ attribute).

successes %>%
  ggplot(aes(x = main_category)) +
  geom_histogram(stat = "count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

failures %>%
  ggplot(aes(x = main_category)) +
  geom_histogram(stat = "count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Keeping in mind that failures outnumber successes by a factor of ~1.7, music stands out as a particularly successful category. Let’s check music’s ratio…

nrow(filter(failures, main_category == "Music")) /
  nrow(filter(successes, main_category == "Music"))
## [1] 1.030982

As opposed to many other categories, music has almost a 1 to 1 ratio of failures to successes.
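
Rather than checking one category at a time, an optional extension (not in the original analysis) computes this failure-to-success ratio for every main category in one pass:

projects %>%
  group_by(main_category) %>%
  summarize(fail_to_success_ratio = sum(state == "0") / sum(state == "1")) %>%
  arrange(fail_to_success_ratio)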

3.2.2 Goal

One might assume that projects with a prohibitively high goal would be unlikely to be successful. Let’s start by seeing if we can spot a difference…

successes %>%
  ggplot(aes(x = usd_goal_real)) +
  geom_histogram()

It is not easy to see the distribution of goals for successes here. A simple transformation can fix this…

successes %>%
  mutate(min_goal = min(usd_goal_real)) %>%
  mutate(log_goal = log(usd_goal_real - min_goal)) %>%
  ggplot(aes(x = log_goal)) +
  geom_histogram()

Much better.

Now let’s take a look at the failures (after performing the same transformation as we did for successes)…

failures %>%
  mutate(min_goal = min(usd_goal_real)) %>%
  mutate(log_goal = log(usd_goal_real - min_goal)) %>%
  ggplot(aes(x = log_goal)) +
  geom_histogram()

(The min_goal happens to be the same for both successes and failures.)

The distributions seem fairly similar. For good measure, let’s take a look at them side by side…

projects %>%
  mutate(min_goal = min(usd_goal_real)) %>%
  mutate(log_goal = log(usd_goal_real - min_goal)) %>%
  ggplot(aes(x = state, y = log_goal)) +
  geom_boxplot()

While failures definitely have higher goals on average, it seems premature to conclude that the goal will have a significant effect on the success of a project.
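
To attach a number to that visual impression (a quick optional check, not in the original write-up), we can compare the median goal in USD for each class:

projects %>%
  group_by(state) %>%
  summarize(median_goal_usd = median(usd_goal_real))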

3.2.3 Duration

In the same vein as goal, it seems reasonable to assume that projects with a longer duration will be more likely to be successful. We will apply the same log transformation as we did for goal, and for the same reason…

successes %>%
  mutate(min_duration = min(duration)) %>%
  mutate(log_duration = log(duration - min_duration)) %>%
  ggplot(aes(x = log_duration)) +
  geom_histogram()

Now failures…

failures %>%
  mutate(min_duration = min(duration)) %>%
  mutate(log_duration = log(duration - min_duration)) %>%
  ggplot(aes(x = log_duration)) +
  geom_histogram()

(The min_durations are not equal this time, but we will use the projects data frame next.)

Our assumption that a longer duration makes a project more likely to be successful may also be false. Let’s view the boxplots as before…

projects %>%
  mutate(min_duration = min(duration)) %>%
  mutate(log_duration = log(duration - min_duration)) %>%
  ggplot(aes(x = state, y = log_duration)) +
  geom_boxplot()
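
Since duration is stored in seconds, a quick optional check (not in the original write-up) converts it to days and compares the medians by class:

projects %>%
  group_by(state) %>%
  summarize(median_duration_days = median(duration) / 86400)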

3.3 Intermediate Results

So at this point we have 2 possibly related things to consider:

  1. Successes may be a rare event.
  2. Because there is little discernible difference in the distribution of attribute values for successes and failures, prediction may be difficult.

For #2, we hope our machine learning model can combine attributes in such a way as to make better inferences than we can.

4 Machine Learning and Prediction

Now it is time to train our machine learning models. Because it was difficult to spot a connection between any one attribute and the success (or failure) of a project in the previous section, we will first train a logistic regression model and hypothesis test for relationships. We will then train a random forest, perform cross-validation to assess the forest’s performance and tune hyperparameters, and visualize what sort of improvement we can achieve (if any) with the tuned hyperparameters using an AUROC plot. Finally, we will test whether one model is better than the other.

We will start by randomly sampling 0.1% of our data because of the memory limitations of R’s randomForest (randomForest can handle a significantly larger dataset, which we will address shortly) and logistic regression implementations. Both our logistic regression model and our random forest will be built using this subset.

set.seed(1234)

subset <- projects %>%
  sample_frac(.001)

4.1 Logistic Regression

We want to see whether we can spot a relationship between state and the other attributes that we could not detect in our exploratory data analysis.

Below we train our logistic regression model on the subsetted data, using the three previously explored attributes (goal, duration and main category) as predictors…

lr <- glm(state ~ usd_goal_real + duration + main_category, data = subset, family = binomial)

lr %>%
  tidy() %>%
  kable(digits = 5)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1.23749 | 0.54008 | 2.29132 | 0.02195 |
| usd_goal_real | -0.00003 | 0.00001 | -3.77297 | 0.00016 |
| duration | 0.00000 | 0.00000 | -3.52843 | 0.00042 |
| main_categoryComics | 0.70342 | 0.83264 | 0.84481 | 0.39822 |
| main_categoryCrafts | -0.22895 | 0.90046 | -0.25426 | 0.79929 |
| main_categoryDance | -0.11701 | 1.01229 | -0.11558 | 0.90798 |
| main_categoryDesign | -0.34825 | 0.59805 | -0.58231 | 0.56036 |
| main_categoryFashion | -1.49444 | 0.79293 | -1.88470 | 0.05947 |
| main_categoryFilm & Video | 0.36330 | 0.51301 | 0.70818 | 0.47884 |
| main_categoryFood | -0.09686 | 0.62230 | -0.15565 | 0.87631 |
| main_categoryGames | -0.23463 | 0.55546 | -0.42240 | 0.67273 |
| main_categoryJournalism | -0.73984 | 0.83600 | -0.88498 | 0.37617 |
| main_categoryMusic | 0.19660 | 0.50250 | 0.39123 | 0.69562 |
| main_categoryPhotography | -0.19711 | 0.67078 | -0.29385 | 0.76887 |
| main_categoryPublishing | -0.71183 | 0.57241 | -1.24357 | 0.21366 |
| main_categoryTechnology | -0.37860 | 0.61640 | -0.61420 | 0.53908 |
| main_categoryTheater | 2.74847 | 1.33146 | 2.06426 | 0.03899 |

In contrast to what our exploratory analysis suggested, goal and duration show statistically significant relationships with state, while main_category largely does not (Theater being the lone exception).
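
Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios that are easier to read. This is an optional step, not part of the original write-up; for example, the goal coefficient of roughly -0.00003 per dollar implies that a $10,000 larger goal multiplies the odds of success by about exp(-0.3) ≈ 0.74.

lr %>%
  tidy() %>%
  mutate(odds_ratio = exp(estimate)) %>%
  select(term, odds_ratio, p.value) %>%
  kable(digits = 5)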

4.2 Random Forest

Adapted from example here since that is the only way I know how to do random forests in R.

4.2.1 Split Training/Test Data

We will build a random forest using an 80/20 train/test split of this subset. This split is fairly standard, though a different split can be used with good reason.

set.seed(1234)

test_random_forest_df <- subset %>%
  group_by(state) %>%
  sample_frac(.2) %>%
  ungroup()

train_random_forest_df <- subset %>%
  anti_join(test_random_forest_df, by = "ID")

4.2.2 Training

Now that we have our training set we can train the model…

set.seed(1234)

rf <- randomForest(state ~ ., 
                   data = train_random_forest_df %>% 
                     select(-ID, -name, -category, -deadline, -launched))

rf
## 
## Call:
##  randomForest(formula = state ~ ., data = train_random_forest_df %>%      select(-ID, -name, -category, -deadline, -launched)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 5.94%
## Confusion matrix:
##     0   1 class.error
## 0 168  14  0.07692308
## 1   4 117  0.03305785

4.2.3 Testing

Time to make predictions!

test_predictions <- predict(rf, 
                            newdata = test_random_forest_df %>% 
                              select(-ID, -name, -category, -deadline, -launched))

And lets check our performance on the test set…

table(pred = test_predictions, observed = test_random_forest_df$state)
##     observed
## pred  0  1
##    0 42  2
##    1  4 28

Our error rate on the test set is just under 8% (6 misclassifications out of 76). Not bad.
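
For completeness, the same error rate can be computed directly from the predictions rather than read off the confusion matrix:

mean(test_predictions != test_random_forest_df$state)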

4.2.4 Cross-Validation

We will now perform 5-fold cross-validation (the subset contains only 379 observations) to compare our large random forest (the default 500 trees) with a moderately sized one (100 trees). This process can be automated to try different numbers of folds (e.g. 10 instead of 5) and differently sized forests, ideally trained on a larger dataset. One could then plot each hyperparameter value against test accuracy to choose the best one; that is how hyperparameters should really be tuned, and a sketch of such a sweep appears after the comparison below. For the purposes of this tutorial, this comparison will give us an idea of whether performance plateaus once the forest reaches a certain size.

set.seed(1234)

results_df <- createFolds(subset$state, k = 5) %>%
  imap(function(test_indices, fold_number) {
    
    train_df <- subset %>%
      select(-ID, -name, -category, -deadline, -launched) %>%
      slice(-test_indices)
    
    test_df <- subset %>%
      select(-ID, -name, -category, -deadline, -launched) %>%
      slice(test_indices)
    
    rf1 <- randomForest(state ~ ., data = train_df, ntree = 500)
    rf2 <- randomForest(state ~ ., data = train_df, ntree = 100)
    
    test_df %>%
      select(observed_label = state) %>%
      mutate(fold = fold_number) %>%
      mutate(prob_positive_rf1 = predict(rf1, newdata = test_df, type = "prob")[, "1"]) %>%
      mutate(predicted_label_rf1 = ifelse(prob_positive_rf1 > 0.5, 1, 0)) %>%
      mutate(prob_positive_rf2 = predict(rf2, newdata = test_df, type = "prob")[, "1"]) %>%
      mutate(predicted_label_rf2 = ifelse(prob_positive_rf2 > 0.5, 1, 0))
    
  }) %>%
  reduce(bind_rows)

kable(head(results_df))
| observed_label | fold | prob_positive_rf1 | predicted_label_rf1 | prob_positive_rf2 | predicted_label_rf2 |
|---|---|---|---|---|---|
| 0 | Fold1 | 0.290 | 0 | 0.27 | 0 |
| 1 | Fold1 | 0.954 | 1 | 0.95 | 1 |
| 1 | Fold1 | 0.762 | 1 | 0.74 | 1 |
| 1 | Fold1 | 0.958 | 1 | 0.98 | 1 |
| 1 | Fold1 | 0.908 | 1 | 0.91 | 1 |
| 0 | Fold1 | 0.120 | 0 | 0.12 | 0 |

Now to test for a difference in error rate…

results_df %>%
  mutate(error_rf1 = observed_label != predicted_label_rf1,
         error_rf2 = observed_label != predicted_label_rf2) %>%
  group_by(fold) %>%
  summarize(big_rf = mean(error_rf1), small_rf = mean(error_rf2)) %>%
  gather(model, error, -fold) %>%
  lm(error ~ model, data = .) %>%
  tidy()
##            term     estimate  std.error  statistic     p.value
## 1   (Intercept)  0.068463431 0.01441413  4.7497456 0.001445722
## 2 modelsmall_rf -0.005333333 0.02038465 -0.2616347 0.800213126

There is not a statistically significant difference in error rate between the two random forests. This suggests that performance plateaus at or before 100 trees.
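
As mentioned earlier, this comparison could be extended into a proper sweep over forest sizes. The following is a minimal sketch of what that might look like (the ntree grid values are hypothetical choices, and this sketch is not run here); one would then inspect the plot of error against forest size to find where it levels off.

# Hypothetical sketch: estimate 5-fold CV error for several forest sizes
ntree_grid <- c(25, 50, 100, 250, 500)

cv_errors <- map_dfr(ntree_grid, function(nt) {
  fold_errors <- imap_dbl(createFolds(subset$state, k = 5), function(test_indices, fold) {
    train_df <- subset %>%
      select(-ID, -name, -category, -deadline, -launched) %>%
      slice(-test_indices)
    test_df <- subset %>%
      select(-ID, -name, -category, -deadline, -launched) %>%
      slice(test_indices)
    fit <- randomForest(state ~ ., data = train_df, ntree = nt)
    mean(predict(fit, newdata = test_df) != test_df$state)
  })
  tibble(ntree = nt, cv_error = mean(fold_errors))
})

# Plot CV error against forest size to look for the plateau
ggplot(cv_errors, aes(x = ntree, y = cv_error)) +
  geom_line() +
  geom_point()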

4.2.5 Cross-Validation cont. (AUROC)

Now let’s visualize the difference…

labels <- split(results_df$observed_label, results_df$fold)

predictions_rf1 <- split(results_df$prob_positive_rf1, results_df$fold) %>% prediction(labels)

predictions_rf2 <- split(results_df$prob_positive_rf2, results_df$fold) %>% prediction(labels)

mean_auc_rf1 <- predictions_rf1 %>%
  performance(measure = "auc") %>%
  slot("y.values") %>% unlist() %>% 
  mean()

mean_auc_rf2 <- predictions_rf2 %>%
  performance(measure = "auc") %>%
  slot("y.values") %>% unlist() %>% 
  mean()

predictions_rf1 %>%
  performance(measure = "tpr", x.measure = "fpr") %>%
  plot(avg = "threshold", col = "orange", lwd = 2)

predictions_rf2 %>%
  performance(measure = "tpr", x.measure = "fpr") %>%
  plot(avg = "threshold", col = "blue", lwd = 2, add = TRUE)

legend("bottomright",
       legend = paste(c("big", "small"), "rf, AUC:", round(c(mean_auc_rf1, mean_auc_rf2), digits = 3)),
       col = c("orange", "blue"))

Both classifiers perform astonishingly well.

It turns out we do not really have to worry about successes being rare (in retrospect they probably are not). But if they were, we might have wanted to lower the classification threshold to increase our true positive rate at the expense of a higher false positive rate.
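
Since results_df already holds cross-validated probabilities, trying a different cutoff is straightforward. A small hedged example with a hypothetical threshold of 0.3 (chosen only for illustration):

# Lowering the cutoff flags more projects as successes, raising TPR and FPR together
results_df %>%
  mutate(predicted_at_0.3 = ifelse(prob_positive_rf1 > 0.3, 1, 0)) %>%
  summarize(
    true_positive_rate  = sum(predicted_at_0.3 == 1 & observed_label == "1") / sum(observed_label == "1"),
    false_positive_rate = sum(predicted_at_0.3 == 1 & observed_label == "0") / sum(observed_label == "0")
  )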

4.3 Model Selection

It should be noted that the results of this section are purely illustrative. One can train a randomForest on approximately 30% of our complete dataset and observe an error rate well below 1%. See the extra section following the conclusion.

We want to see if there is a statistically significant difference between the error rates of our logistic regression model and our random forest model.

We start by applying 5-fold cross validation to both models and generating a table of error rates…

fold_indices <- cvFolds(n = nrow(subset), K = 5)

error_rates <- sapply(1:5, function(fold_index) {
  test_indices <- which(fold_indices$which == fold_index)
  ss <- subset %>%
    select(-ID, -name, -category, -deadline, -launched)
  test_set <- ss[test_indices,]
  train_set <- ss[-test_indices,]
  
  lr_fit <- glm(state ~ ., data = train_set %>% select(-currency, -country), family = "binomial")
  lr_pred <- ifelse(predict(lr_fit, newdata = test_set, type = "response") > 0.5, 1, 0)
  lr_error <- mean(test_set$state != lr_pred)
  
  rf_fit <- randomForest(state ~ ., data = train_set)

  rf_pred <- predict(rf_fit, newdata = test_set)
  rf_error <- mean(test_set$state != rf_pred)
  c(lr_error, rf_error)
  })

rownames(error_rates) <- c("lr", "rf")
error_rates <- as.data.frame(t(error_rates))

error_rates <- error_rates %>%
  mutate(fold = 1:n()) %>%
  gather(model, error, -fold)

error_rates %>%
  kable("html")
| fold | model | error |
|---|---|---|
| 1 | lr | 0.0657895 |
| 2 | lr | 0.0131579 |
| 3 | lr | 0.0526316 |
| 4 | lr | 0.0263158 |
| 5 | lr | 0.0266667 |
| 1 | rf | 0.0131579 |
| 2 | rf | 0.0131579 |
| 3 | rf | 0.0000000 |
| 4 | rf | 0.0131579 |
| 5 | rf | 0.0400000 |

We can plot the error rates for each fold of each model as follows…

dotplot(error ~ model, data = error_rates, ylab = "Mean Prediction Error")

Now to hypothesis test…

lm(error ~ model, data = error_rates) %>% 
  tidy() %>%
  kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0369123 | 0.0082442 | 4.477346 | 0.0020630 |
| modelrf | -0.0210175 | 0.0116591 | -1.802672 | 0.1091065 |

There is not a statistically significant difference between the two models.

But again, this is contrived. A randomForest trained on 30% of the projects dataset has an error rate of roughly 0.4%, as shown in the section after the conclusion.

5 Conclusion

Sometimes it is difficult to spot connections using EDA that machine learning models can learn easily. In our case, a random forest was able to classify held-out projects quite accurately. There are other considerations in machine learning that are not addressed in this tutorial (such as bias). For more information on machine learning see this text. The concepts illustrated in this tutorial can be adapted to almost any dataset. To answer our initial question: yes, we can (fairly accurately) predict the success or failure of a Kickstarter project given certain factors such as goal and duration. Now you can go start your own; just make sure you will reach your goal!

5.1 randomForest30

Below we illustrate just how well a random forest can perform for us…

set.seed(1234)

subset2 <- projects %>%
  sample_frac(.3)

test_random_forest_df2 <- subset2 %>%
  group_by(state) %>%
  sample_frac(.2) %>%
  ungroup()

train_random_forest_df2 <- subset2 %>%
  anti_join(test_random_forest_df2, by = "ID")

rf2 <- randomForest(state ~ ., 
                    data = train_random_forest_df2 %>% 
                      select(-ID, -name, -category, -deadline, -launched))

rf2
## 
## Call:
##  randomForest(formula = state ~ ., data = train_random_forest_df2 %>%      select(-ID, -name, -category, -deadline, -launched)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.44%
## Confusion matrix:
##       0     1  class.error
## 0 57361   377 0.0065294953
## 1    20 33121 0.0006034821
test_predictions2 <- predict(rf2, 
                             newdata = test_random_forest_df2 %>% 
                               select(-ID, -name, -category, -deadline, -launched))

table(pred = test_predictions2, observed = test_random_forest_df2$state)
##     observed
## pred     0     1
##    0 14350     1
##    1    84  8284