ST 558 Project 3: World Analysis Report

Suyog Dharmadhikari; Bennett McAuley 2022-11-02

Introduction

The goal of this report is to create and compare predictive models using Online News Popularity data from the UCI Machine Learning Repository.

This data contains information about articles published by Mashable over a period of two years–specifically the statistics associated with them.

The purpose of this analysis is to use linear regression and ensemble tree-based methods to predict the number of shares–our target variable–subset by data channel. For more details about these methods, see the Modeling section.

According to the data description, there are 39,797 observations and 61 variables present. 58 of them are predictors, but using all of them would be inefficient and show glaring redundancies. Instead, the ones that will be used were chosen intuitively–what might encourage a user to share an article, what underlying forces may influence them, and what would make for good exploratory analysis:

average_token_length - Average length of the words in the content
avg_negative_polarity - Avg. polarity of negative words
global_rate_positive_words - Rate of positive words in the content
global_sentiment_polarity - Text sentiment polarity
global_subjectivity - Text subjectivity
is_weekend - Was the article published on the weekend?
kw_avg_max - Best keyword (avg. shares)
kw_avg_min - Worst Keyword (avg. shares)
kw_max_max - Best keyword (max shares)
kw_min_min - Worst keyword (min shares)
n_non_stop_words - Rate of non-stop words in the content
n_tokens_content - Number of words in the content
n_tokens_title - Number of words in the title
n_unique_tokens - Rate of unique words in the content
num_hrefs - Number of links
num_imgs - Number of images
num_self_hrefs - Number of links to other articles published by Mashable
num_videos - Number of videos
self_reference_avg_shares - Avg. shares of referenced articles in Mashable
title_sentiment_polarity - Title polarity
weekday_is_monday - Was the article published on a Monday?
weekday_is_tuesday - Was the article published on a Tuesday?
weekday_is_wednesday - Was the article published on a Wednesday?
weekday_is_thursday - Was the article published on a Thursday?
weekday_is_friday - Was the article published on a Friday?
weekday_is_saturday - Was the article published on a Saturday?
weekday_is_sunday - Was the article published on a Sunday?

Which we store in a vector for fast retrieval:

news.vars <- c('n_tokens_title', 'n_tokens_content', 'n_non_stop_words', 'num_hrefs',
               'num_self_hrefs', 'num_imgs', 'num_videos', 'average_token_length',
               'kw_max_max', 'kw_min_min', 'kw_avg_max', 'kw_avg_min',
               'self_reference_avg_sharess','is_weekend', 'global_subjectivity', 'global_sentiment_polarity',
               'global_rate_positive_words', 'avg_negative_polarity', 'n_unique_tokens', 'title_sentiment_polarity',
               'weekday_is_sunday', 'weekday_is_monday', 'weekday_is_tuesday', 'weekday_is_wednesday',
               'weekday_is_thursday', 'weekday_is_friday', 'weekday_is_saturday'
               )

The Data

For this analysis, we will consider a single source for the data channel. In other words, we subset the data to the World channel.

So, we read in the data, perform the filter, and convert is_weekend into a binary factor:

news <- read_csv("OnlineNewsPopularity.csv")

## Rows: 39644 Columns: 61
## ── Column specification ───────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): url
## dbl (60): timedelta, n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words, n_non...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fltr <- paste0("data_channel_is_", params$channel)

news.tbl <- news %>% filter(get(fltr) == 1) %>% 
  select(all_of(news.vars), shares) %>%
  mutate(is_weekend = as.factor(is_weekend))

head(news.tbl)

## # A tibble: 6 × 28
##   n_token…¹ n_tok…² n_non…³ num_h…⁴ num_s…⁵ num_i…⁶ num_v…⁷ avera…⁸ kw_ma…⁹ kw_mi…˟ kw_av…˟ kw_av…˟
##       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1        10     231    1.00       4       1       1       1    5.09       0       0       0       0
## 2         9    1248    1.00      11       0       1       0    4.62       0       0       0       0
## 3        12     682    1.00      10       0       1       0    4.62       0       0       0       0
## 4         9     391    1.00       9       2       1       1    5.30       0       0       0       0
## 5        11     125    1.00       1       1       1       0    4.82       0       0       0       0
## 6        11     799    1.00       8       6       1       0    4.70       0       0       0       0
## # … with 16 more variables: self_reference_avg_sharess <dbl>, is_weekend <fct>,
## #   global_subjectivity <dbl>, global_sentiment_polarity <dbl>, global_rate_positive_words <dbl>,
## #   avg_negative_polarity <dbl>, n_unique_tokens <dbl>, title_sentiment_polarity <dbl>,
## #   weekday_is_sunday <dbl>, weekday_is_monday <dbl>, weekday_is_tuesday <dbl>,
## #   weekday_is_wednesday <dbl>, weekday_is_thursday <dbl>, weekday_is_friday <dbl>,
## #   weekday_is_saturday <dbl>, shares <dbl>, and abbreviated variable names ¹​n_tokens_title,
## #   ²​n_tokens_content, ³​n_non_stop_words, ⁴​num_hrefs, ⁵​num_self_hrefs, ⁶​num_imgs, ⁷​num_videos, …

Summarizations

Now that our data has been manipulated and filtered, we can perform a simple EDA.

Summary Statistics

First and foremost, we want to know what range of values we might expect when predicting for shares. The following table shows basic summary statistics about the variable:

s <- summary(news.tbl$shares)

tibble(var = 'shares', min = s[1], max = s[6], mean = s[3], stddev = sd(news.tbl$shares))

## # A tibble: 1 × 5
##   var    min     max     mean    stddev
##   <chr>  <table> <table> <table>  <dbl>
## 1 shares 35      284700  1100     6090.

If the maximum is tremendously larger in scale than the mean, then it is very likely an outlier. If the standard deviation is larger than the mean, then the number of shares varies greatly. Furthermore, the higher it is, the more disperse they are from the mean. The inverse also applies for a smaller standard deviation (i.e. the smaller, the less disperse).

It would also be good to know relationships between some of the variables present–and not just to shares. Observe the following correlation matrix of variables related to content sentiment. A value of 0 indicates no correlation and 1 indicates perfect correlation. The closer a value is to 1(+/-), the stronger the relationship is. The closer to 0, the weaker.

c <- news.tbl %>%
  select(avg_negative_polarity,
         global_rate_positive_words,
         global_sentiment_polarity,
         global_subjectivity,
         shares)

round(cor(c),2)

##                            avg_negative_polarity global_rate_positive_words
## avg_negative_polarity                       1.00                      -0.10
## global_rate_positive_words                 -0.10                       1.00
## global_sentiment_polarity                   0.23                       0.63
## global_subjectivity                        -0.45                       0.53
## shares                                     -0.01                       0.02
##                            global_sentiment_polarity global_subjectivity shares
## avg_negative_polarity                           0.23               -0.45  -0.01
## global_rate_positive_words                      0.63                0.53   0.02
## global_sentiment_polarity                       1.00                0.33   0.03
## global_subjectivity                             0.33                1.00   0.02
## shares                                          0.03                0.02   1.00

The new variable popularity is set up so that news is considered popular if it receives more shares than the median shares. Otherwise, it is considered unpopular.table() gives details about count of popular and unpopular news.

news.tbl$popularity <- ifelse(news.tbl$shares < mean(news.tbl$shares),
                              "Unpopular", "Popular")

table(news.tbl$popularity)

## 
##   Popular Unpopular 
##      1646      6781

The contingency table below depicts the relationship between the popularity of a news channel and whether or not it was published on the weekend. It informs us whether or not popular news was published over the weekend. Whether or not unpopular news was published over the weekend.

table(news.tbl$popularity,news.tbl$is_weekend)

##            
##                0    1
##   Popular   1345  301
##   Unpopular 5996  785

Plots

For visual analysis, we first construct a density plot to compare the densities of shares between the weekend (is_weekend = 1) and weekdays (is_weekend = 0). Observe the patterns generated for similarities and/or differences.

g <- ggplot(data = news.tbl, aes(x = shares))

g + geom_density(adjust = 0.5, alpha = 0.5, aes(fill = is_weekend)) +
  xlim(0, 10000) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Density Plot for Shares",
       subtitle = paste("channel = ", params$channel),
       x = "# of Shares",
       y = "Density")

## Warning: Removed 254 rows containing non-finite values (stat_density).

The second visual is a scatterplot with a trend line. The number of words in the content is on the x-axis and the rate of unique words on the y-axis. If the data points and trend line show an upward trend, then articles with more words tend to have a larger rate of unique ones. If there is a downward trend, then articles with more words tend to have a smaller rate of unique ones.

p <- ggplot(news.tbl, aes(x = n_tokens_content, y = n_unique_tokens))

p + geom_point() +
  geom_smooth(method = glm, col = "Blue") +
  labs(title = "Scatterplot for n_tokens_content vs n_unique_tokens",
       subtitle = paste("channel = ", params$channel),
       x = "# of Words in Article",
       y = "Rate of Unique Words in Article")

## `geom_smooth()` using formula 'y ~ x'

This subsequent bar plot shows a comparison between the total number of articles published on each day of the week and the total shares of those articles. Observe any interestingly sized columns that one might think would be higher or lower than others. Articles published on which day tend to get shared the most? Which one gets the least?

plt <- news.tbl %>% pivot_longer(cols = starts_with("weekday_is_"),
               names_to = "day",
               names_prefix = "weekday_is_",
               names_transform = as.factor,
               values_to = "count") %>%
  filter(count == 1) %>%
  select(-count, -starts_with("weekday_is_"))

#Reordering days
plt$day <- factor(plt$day,
    levels = c("monday", "tuesday", "wednesday", "thursday",
               "friday", "saturday", "sunday"))


plt <- plt %>%
  select(day, shares) %>%
  group_by(day) %>%
  summarise(articles = n(), shares = sum(shares))

plt <- reshape2::melt(plt, id.vars = c("day"))

p <- ggplot(plt, aes(x = day, y = value, fill = day))

p + geom_col() +
  facet_wrap(vars(variable), scales = "free") +
  scale_y_continuous(labels = scales::comma) +
  guides(x = guide_axis(angle = 45), fill = "none") +
  labs(title = "Total # of Articles vs Total # of Shares per Publishing Day",
       subtitle = paste("channel = ", params$channel),
       x = "Day",
       y = '')

The code chunk below, tells us about the number of shares of a particular published news with number of images in the published news. It also tells us about the popularity of the news published with images.

ggplot(news.tbl, aes(fill= popularity, y=shares, x=num_imgs)) +
  geom_bar(position='dodge', stat='identity',  width = 0.98) +
  ggtitle('Relation between number of images and shares') +
  xlab('Number of images in the news published') +
  ylab('Shares') +
  ylim(0, 100000)

## Warning: Removed 6 rows containing missing values (geom_bar).

Correlation plots are used to find out the correlation between two variables in the given data set. In the code chunk below, we have used ggcorrplt() to plot the correlation between all the variables in the data set and identify the most correlated variables. A violet color circle shows there is negative correlation between the two respective variables and a dark orange color is positive correlation between two variables.

corr <- cor(select_if(news.tbl, is.numeric))

ggcorrplot(corr,
           method = "circle",
           type = "lower",
           outline.color = "black",
           lab_size = 6)

In the code chunk a scatter plot with a trend line. The number of words in the content is on the x-axis and the number of shares on the y-axis. If the data points and trend line show an upward trend, then articles with more words tend to have more number of shares. If there is a downward trend, then articles with more words tend to have a smaller number of shares.

ggplot(news.tbl, aes(y=shares, x=n_tokens_content)) + 
    geom_point(aes(color=popularity)) +
    geom_smooth() + 
    ylim(0, 100000)

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 6 rows containing non-finite values (stat_smooth).

## Warning: Removed 6 rows containing missing values (geom_point).

Modeling

The goal now is to create models for predicting the number of shares using linear regression and ensemble trees. Before that, the data is split into a training (70%) and test (30%) set.

i <- createDataPartition(news.tbl[[1]], p = 0.7, list = FALSE)

train <- news.tbl[i,]
test <- news.tbl[-i,]

Linear Regression

Linear regression is a statistical technique that aims to demonstrate a relationship between variables. A dependent variable’s nature and degree of connection with a group of independent variables are assessed using linear regression. Symbolically: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ where $Y_i$ is the dependent variable/response value, $\beta_0$ is the intercept, and $\beta_1$ is the slope coefficient for observation value $x_i$

We can make predictions of shares using linear regression models.

Model 1

The linear regression model is fit using variables chosen from forward stepwise selection. The leaps library’s regsubsets() function performs best subset selection by identifying the best model with a given number of predictors, where best is quantified using $RSS$. The syntax is identical to that of lm(). The summary() command returns the best set of variables for each model size. In the code chunk below, we find the best predictors subset, then $R^2$ value is used to determine the number of predictors that can give best model results. Then we create a formula that can be used to train the model by using as.formula() and paste0().

fwdVars <- regsubsets(shares ~ ., train, method = "forward")

## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in = force.in, : 2 linear
## dependencies found

## Reordering variables and trying again:

fwdSumm <- summary(fwdVars)

adjRNum <- which.max(fwdSumm$adjr2)

bestModels <- fwdSumm$which[adjRNum,-1]

varNames <- names(which(bestModels == TRUE))

predictorFormulaPt2 <- paste(varNames, collapse = "+")

modelFormula <- as.formula(paste0("shares", "~", predictorFormulaPt2))

modelFormula

## shares ~ n_tokens_title + n_tokens_content + num_hrefs + num_imgs + 
##     average_token_length + global_subjectivity + global_sentiment_polarity + 
##     weekday_is_wednesday + popularityUnpopular
## <environment: 0x000002049755d2f8>

In the code chunk below, firstly we have converted categorical variables into numerical variables using dummyVars(). gsub() is used to rename the column names in order to match with the column names generated by regsubsets() in the function above. The caret package was used to train a linear regression model by using the best predictors obtained from regsubsets(). We have standardized the model using center and scale. Cross validation is used to avoid over fitting and add generalization in our model.

# convert categorical variables into numeric variables i.e one hot encoding
dummies <- dummyVars(shares ~ is_weekend+popularity, data = train)

trainDF <- as_tibble(predict(dummies, newdata = train)) %>%
    bind_cols(select(train,-is_weekend,-popularity))

# Replace . with spaces in column names to match the output names from regsubsets
names(trainDF) <- gsub(x = names(trainDF), pattern = "\\.", replacement = "") 

forwardFit <- train(modelFormula,
                    data = trainDF,
                    method = "lm",
                    metric = "Rsquared",
                    preProcess = c("center","scale"),
                    trControl = trainControl(method = "cv", number = 10))

summary(forwardFit)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6180   -589   -102    354 120008 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2219.99      56.90  39.018  < 2e-16 ***
## n_tokens_title              165.14      57.13   2.891 0.003858 ** 
## n_tokens_content           -238.09      66.28  -3.592 0.000331 ***
## num_hrefs                   237.34      63.96   3.711 0.000208 ***
## num_imgs                    260.11      62.45   4.165 3.16e-05 ***
## average_token_length       -214.76      75.81  -2.833 0.004631 ** 
## global_subjectivity         287.80      77.05   3.735 0.000189 ***
## global_sentiment_polarity  -135.63      60.95  -2.225 0.026116 *  
## weekday_is_wednesday       -120.30      56.96  -2.112 0.034721 *  
## popularityUnpopular       -2284.30      57.94 -39.427  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4371 on 5891 degrees of freedom
## Multiple R-squared:  0.2313, Adjusted R-squared:  0.2301 
## F-statistic: 196.9 on 9 and 5891 DF,  p-value: < 2.2e-16

Now we use the model to make predictions:

# convert categorical variables into numeric variables i.e one hot encoding
dummies <- dummyVars(shares ~ is_weekend+popularity, data = test)

testDF <- as_tibble(predict(dummies, newdata = test)) %>%
    bind_cols(select(test,-is_weekend,-popularity))

  predLR <- predict(forwardFit,newdata = testDF)
  
  predResultsFwd <- postResample(predLR,obs = testDF$shares)
  
  predResultsFwd

##        RMSE    Rsquared         MAE 
## 7630.443458    0.116278 1521.932315

Model 2

This second linear regression model is fit using variables chosen from backwards stepwise selection.

The regsubsets function is called, and the summary is returned to grab the models of $n$ terms that have the optimal values for $R^2$, Mallows’ Cp, and BIC. As performance will vary, the mode value between the three is chosen as the ‘best’.

backVars <- regsubsets(shares ~ ., train, method = "backward")

## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in = force.in, : 2 linear
## dependencies found

## Reordering variables and trying again:

backSumm <- summary(backVars, all.best = FALSE)

met <- c(R2 = which.max(backSumm$rsq), 
     Cp = which.min(backSumm$cp),
     BIC = which.min(backSumm$bic)
)

M <- which.max(tabulate(met))

The most frequent number is 9. The variables present in this size model are:

terms <- coef(backVars, M)
terms

##               (Intercept)            n_tokens_title                 num_hrefs 
##              1.567613e+03              1.003889e+02              4.420527e+01 
##                  num_imgs      average_token_length                kw_max_max 
##              1.056280e+02             -1.182415e+03             -6.236860e-05 
##       global_subjectivity global_sentiment_polarity           n_unique_tokens 
##              4.584708e+03             -1.183615e+03              5.153387e+03 
##       weekday_is_saturday 
##              5.244028e+02

Now, the model can be trained. The coefficient names are grabbed from the backwards selection, and input as a subset of the training data. The variables are standardized to account for the vastly different magnitudes between them, and to prevent any extreme coefficient values because of this.

prd <- names(terms)

#Assuming these terms appear in the model, but just a warning otherwise
prd <- replace(prd, prd %in% c('(Intercept)', 'is_weekend1'), c('shares', 'is_weekend'))

## Warning in x[list] <- values: number of items to replace is not a multiple of replacement length

BackwardFit <- train(shares ~ .,
                    data = train[,prd],
                    method = "lm",
                    metric = "Rsquared",
                    preProcess = c("scale", "center"),
                    trControl = trainControl(method = "cv", number = 10))

summary(BackwardFit)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8133  -1423   -786    -55 123109 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2219.99      63.88  34.752  < 2e-16 ***
## n_tokens_title              208.99      64.66   3.232  0.00123 ** 
## num_hrefs                   417.82      70.20   5.951 2.81e-09 ***
## num_imgs                    543.35      69.65   7.801 7.20e-15 ***
## average_token_length      -1036.14     109.66  -9.449  < 2e-16 ***
## kw_max_max                  -11.11      65.27  -0.170  0.86486    
## global_subjectivity         495.93      86.35   5.743 9.75e-09 ***
## global_sentiment_polarity   -96.09      68.64  -1.400  0.16157    
## n_unique_tokens             636.14     104.67   6.077 1.30e-09 ***
## weekday_is_saturday         128.89      64.01   2.014  0.04410 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4907 on 5891 degrees of freedom
## Multiple R-squared:  0.03095,    Adjusted R-squared:  0.02947 
## F-statistic: 20.91 on 9 and 5891 DF,  p-value: < 2.2e-16

The model is then tested against the testing data:

predLRBck <- predict(BackwardFit, newdata = test)
  
predResultsBck <- postResample(predLRBck, obs = test$shares)
predResultsBck

##         RMSE     Rsquared          MAE 
## 8.059496e+03 1.220383e-02 2.030549e+03

Random Forest

Random forest uses the same idea as bagging tree models. It is a bootstrap aggregation method, but the difference is that random forests randomly select a subset of predictors instead of using all the predictors. This is because if a really strong predictor exists, every bootstrap tree will probably use it for the first split. This will result in making bagged trees predictions more correlated, and variance will be large in aggregation. Hence, in random forest we avoid this limitation by using a random subset of predictors for each bootstrap sample/tree fit.

In the code chunk below, caret package was used to train the rf forest model and the evaluation metric used for the fit is Rsquared. We have tuned the model by using tuning parameter mtry ad used 5 folds cross validation for finding best fit.

#5 folds cross validation
control <- trainControl(method='cv', 
                        number=5)

tunegrid <- data.frame(mtry = 1:9)
rfFit <- train(modelFormula, 
                      data=trainDF, 
                      method='rf', 
                      metric='Rsquared', 
                      tuneGrid=tunegrid, 
                      trControl=control)

rfFit$bestTune

##   mtry
## 1    1

varImp(rfFit$finalModel) %>% arrange(desc(Overall))

##                               Overall
## popularityUnpopular       21203754843
## global_sentiment_polarity  8104988144
## n_tokens_content           7678074397
## average_token_length       7607489489
## num_hrefs                  7391079175
## global_subjectivity        7384481872
## num_imgs                   6667080752
## n_tokens_title             6580559526
## weekday_is_wednesday        580957049

# convert categorical variables into numeric variables i.e one hot encoding
dummies <- dummyVars(shares ~ is_weekend+popularity, data = test)

testDF <- as_tibble(predict(dummies, newdata = test)) %>%
    bind_cols(select(test,-is_weekend,-popularity))

predRF <- predict(rfFit,newdata = testDF)
  
predResultsRF <- postResample(predRF,obs = testDF$shares)
predResultsRF

##        RMSE    Rsquared         MAE 
## 7690.182248    0.118991 1454.685575

Boosted Tree Model

Boosting centers around the idea of sequential learning–taking information from preceding iterations as a basis for calculating the current iteration. Boosted trees are ensemble tree models consisting of a group of trees that are built in sequence, including errors from previous trees for predictions.

Boosted trees also have a unique characteristic of slow training. This is controlled by $\lambda$, a shrinkage parameter. The two other notable parameters for training a boosted tree are $B$ (the number of times the algorithm performs its procedure; n.trees in R) and $d$ (the number of splits; interaction.depth in R), both of which can be selected with cross-validation.

For regression, the boosting algorithm proceeds as follows:

The predictions are initialized to 0–$\hat{y}(x)=0$
The residuals between observed and predicted values are calculated
A tree with $d$ splits and $d+1$ terminal nodes is fit using the residuals as the response–denoted by $\hat{y}^b(x)$
The overall prediction is updated by recursively adding these pseudo-responses–$\hat{y}(x) +=\lambda\hat{y}^b(x)$
Residuals for new predictions are updated
Repeat $B$ times

This model is fit using cv = 10, and the train function will determine the optimal values of the parameters for us.

#Output for this chunk is hidden as it is rather long and does not provide information needed for reporting
boostFit <- train(shares ~ .,
              method = "gbm",
              data = train,
              trControl = trainControl(method = "cv", number = 10))

During training, the optimal values for the tuning parameters were found to be:

boostFit$bestTune

##   n.trees interaction.depth shrinkage n.minobsinnode
## 1      50                 1       0.1             10

Since coefficients aren’t interpretable for boosted trees, the calculation of variable importance for the final model is shown instead, in descending order:

varImp(boostFit$finalModel) %>% arrange(desc(Overall))

##                                Overall
## popularityUnpopular        83760208728
## num_hrefs                   5609929032
## n_tokens_title              5026157752
## num_imgs                    4748795375
## self_reference_avg_sharess  1221365040
## kw_avg_min                   691504040
## average_token_length         617837128
## global_sentiment_polarity    525061001
## avg_negative_polarity        478081500
## n_tokens_content                     0
## n_non_stop_words                     0
## num_self_hrefs                       0
## num_videos                           0
## kw_max_max                           0
## kw_min_min                           0
## kw_avg_max                           0
## is_weekend1                          0
## global_subjectivity                  0
## global_rate_positive_words           0
## n_unique_tokens                      0
## title_sentiment_polarity             0
## weekday_is_sunday                    0
## weekday_is_monday                    0
## weekday_is_tuesday                   0
## weekday_is_wednesday                 0
## weekday_is_thursday                  0
## weekday_is_friday                    0
## weekday_is_saturday                  0

Now, the results of applying the boosted tree to the test data:

predBoost <- predict(boostFit,newdata = test)
  
predResultsBoost <- postResample(predBoost,obs = test$shares)
predResultsBoost

##         RMSE     Rsquared          MAE 
## 7671.6774624    0.1066828 1493.2944002

Model Comparison

The metric results of each model are returned in the table below. The metric that they are “judged” against is RMSE, and the model with the minimum value is the “winner”.

rmseModels <- c(predResultsFwd[[1]],predResultsBck[[1]],
                predResultsRF[[1]],predResultsBoost[[1]])

modelName <- c("Linear Regression (Forward Var. Selection)", "Linear Regression (Backward Var. Selection)",
               "Random foresF", "Boosted Tree")

resultsDF <- tibble(modelName, rmseModels)
resultsDF

## # A tibble: 4 × 2
##   modelName                                   rmseModels
##   <chr>                                            <dbl>
## 1 Linear Regression (Forward Var. Selection)       7630.
## 2 Linear Regression (Backward Var. Selection)      8059.
## 3 Random foresF                                    7690.
## 4 Boosted Tree                                     7672.

bestModelName <- filter(resultsDF,rmseModels == min(rmseModels))[[1]]

paste0("The best model for the ", params$channel, " channel is: ", bestModelName)

## [1] "The best model for the world channel is: Linear Regression (Forward Var. Selection)"