Commit b51badc

Exercise 5B now with random forest
1 parent a24d904 commit b51badc

File tree

7 files changed: +447, -284 lines changed

data/Obt_Perio_ML.Rdata

-8.59 KB
Binary file not shown.

data/make_data.R

Lines changed: 19 additions & 15 deletions
@@ -371,8 +371,11 @@ opt <- opt %>%
   mutate(N.PAL.sites = as.factor(ifelse(N.PAL.sites >= 2, "3-33", as.character(N.PAL.sites))))
 
 
+
 outvars <- c("PID", "Apgar1", "Apgar5", "Birthweight", "GA.at.outcome", "Any.SAE.", "Preg.ended...37.wk")
 
+
+
 outcomes <- opt %>%
   dplyr::select(outvars)
 
@@ -431,34 +434,34 @@ write_csv(optImp, file = 'Obstetrics_Periodontal_Therapy.csv')
 
 
 
+# ------------------------------------------------------------------------
+# Balanced version for LASSO and RF
 
 
-# Check balance of factor variables for ML
-factor_counts <- optImp %>%
-  dplyr::select(where(is.factor)) %>%
-  map(~ as.data.frame(table(.))) %>%
-  imap(~ setNames(.x, c("Level", "Count")) %>% mutate(Variable = .y)) %>%
-  bind_rows() %>%
-  relocate(Variable, .before = Level)
 
+# Check balance of factor variables for ML
+factor_counts <- optImp %>%
+  dplyr::select(where(is.character)) %>%
+  pivot_longer(everything(), names_to = "Variable", values_to = "Level") %>%
+  count(Variable, Level, name = "Count")
 
 factor_counts
 
 
 
 
-
 # Smaller, more balanced version for LASSO and RF
 optML <- optImp %>%
   dplyr::select(-c(X..Vis.Elig,
+                   GA...1st.SAE,
                    Diabetes,
                    Fetal.congenital.anomaly,
                    Hypertension,
                    Traumatic.Inj,
                    BL.Bac.vag,
                    ETXU_CAT1)) %>%
-  mutate(Any.SAE. = as.factor(ifelse(Any.SAE. == 'Yes', 1, 0)),
-         Preg.ended...37.wk = as.factor(ifelse(Preg.ended...37.wk == 'Yes', 1, 0)))
+  mutate(Preg.ended...37.wk = as.factor(ifelse(Preg.ended...37.wk == 'Yes', 1, 0)))
+
 
 
 # Upsample to get more examples of rare class output
@@ -470,26 +473,27 @@ optML <- upSample(x = optML[, -which(names(optML) == "Preg.ended...37.wk")], y =
 
 optML <- optML %>%
   mutate(PID = paste0('P', 1:nrow(optML))) %>%
-  relocate(PID, .before = Clinic)
+  relocate(PID, .before = Apgar1)
 
 
 
-# Sample rows to remove 20% from the upsampled class
+# Sample rows to remove 50% from the upsampled class
 set.seed(123)
 
 nclass1 <- optML %>%
   filter(Preg.ended...37.wk == "1")
 
-down1 <- round(0.35 * nrow(nclass1))
+down1 <- round(0.5 * nrow(nclass1))
 
 PID1 <- nclass1 %>%
   slice_sample(n = down1) %>%
   pull(PID)
 
 
-# Combine with other class
+# Filter class 1
 optML <- optML %>%
-  dplyr::filter(!PID %in% PID1)
+  dplyr::filter(!PID %in% PID1) %>%
+  sample_frac(1)  # shuffle rows
 
 
 # Up-sample sparse class
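
Taken together, the balancing steps in these hunks amount to roughly the following (a sketch only; it assumes caret and dplyr are loaded, and the `y` argument of `upSample()`, truncated in the hunk header above, is the outcome factor):

```r
# Up-sample the rare class so both classes are equally frequent
optML <- caret::upSample(
  x = optML[, -which(names(optML) == "Preg.ended...37.wk")],
  y = optML$Preg.ended...37.wk,   # assumed from context
  yname = "Preg.ended...37.wk"
)

# Then drop a random 50% of the up-sampled class "1" and shuffle
set.seed(123)
PID1 <- optML %>%
  filter(Preg.ended...37.wk == "1") %>%
  slice_sample(prop = 0.5) %>%
  pull(PID)

optML <- optML %>%
  filter(!PID %in% PID1) %>%
  sample_frac(1)  # shuffle rows
```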

exercises/ExtraExercise5.qmd

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+---
+title: "Extra Exercise 5 - Models and Model Evaluation in R"
+format: html
+project:
+  type: website
+  output-dir: ../docs
+---
+
+## Extra exercises
+
+e1. Find the best single predictor in the Diabetes dataset. This is done by comparing the null model (no predictors) to all possible models with one predictor, i.e. `outcome ~ predictor1`, `outcome ~ predictor2`, etc. The null model can be formulated like so: `outcome ~ 1` (only the intercept). Fit all possible one-predictor models and compare their fit to the null model with a likelihood ratio test. Find the predictor with the lowest p-value in the likelihood ratio test. This can be done in a loop in order to avoid writing out all the models.
+
+::: {.callout-tip collapse="true"}
+## Hint
+
+To use a formula with a variable you will need to combine the literal part and the variable with paste, e.g. `paste("Outcome ~", my_pred)`.
+:::
+
+```{r}
+# Define the null model (intercept-only model)
+null_model <- glm(Diabetes ~ 1, data = train, family = binomial)
+
+# Get predictor names (excluding the outcome and ID variables)
+predictors <- setdiff(names(diabetes_nona), c("Diabetes", "ID"))
+
+# Initialize an empty data frame to store results
+results <- data.frame(Predictor = character(), ChiSq = numeric(), P_Value = numeric(), stringsAsFactors = FALSE)
+
+# Loop through each predictor and fit a logistic regression model
+for (pred in predictors) {
+
+  # Fit model with a single predictor
+  model_pred <- glm(as.formula(paste("Diabetes ~", pred)), data = train, family = binomial)
+
+  # Perform likelihood ratio test against the null model
+  test_result <- anova(null_model, model_pred, test = "Chisq")
+
+  # Extract chi-square statistic and p-value
+  # (the second row corresponds to the predictor model)
+  chi_sq <- test_result$Deviance[2]
+  p_value <- test_result$`Pr(>Chi)`[2]
+
+  # Store results
+  results <- rbind(results, data.frame(Predictor = pred, ChiSq = chi_sq, P_Value = p_value))
+}
+
+# Print the results sorted by p-value
+results <- results %>% arrange(P_Value)
+print(results)
+```
+
+e2. Write a function that handles visualization of k-means clustering results. Think about which information you need to pass and what it should return.
+
+---
+
+## Quarto
+
+Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.
+
+## Running Code
+
+When you click the **Render** button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
+
+```{r}
+1 + 1
+```
+
+You can add options to executable code like this:
+
+```{r}
+#| echo: false
+2 * 2
+```
+
+The `echo: false` option disables the printing of code (only output is displayed).
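
For e2, one possible shape for such a function (a sketch under assumptions: factoextra is installed, `km` is a fitted `kmeans` object, and `dat` is the numeric data it was fitted on):

```r
# Plot k-means results projected onto the first two principal components.
# Returns the ggplot object so the caller can tweak, combine or save it.
plot_kmeans <- function(km, dat, title = "K-means clusters") {
  factoextra::fviz_cluster(km, data = dat, geom = "point", main = title)
}

# Usage (names are illustrative):
# km <- kmeans(scale(num_dat), centers = 4)
# plot_kmeans(km, scale(num_dat))
```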

exercises/exercise5A.qmd

Lines changed: 19 additions & 29 deletions
@@ -15,6 +15,8 @@ In this exercise you will fit and interpret simple models.
 ```{r warning=FALSE, message=FALSE}
 library(tidyverse)
 library(readxl)
+library(ggfortify)
+library(factoextra)
 ```
 
 ## Part 1: Linear regression
@@ -66,7 +68,7 @@ plot(model)
 
 9. Now, use our test set to predict the response `medv` (`median value per house in 1000s`).
 
-10. Evaluate how well our model performs. There are different ways of doing this but lets use the classic measure of RMSE (Root Mean Square Error). The psedocode below shows how to calculate the RMSE. A small RMSE (close to zero), indicates a good model.
+10. Evaluate how well our model performs. There are different ways of doing this, but let's use the classic measure of RMSE (Root Mean Square Error). The pseudo-code below shows how to calculate the RMSE. A small RMSE (close to zero) indicates a good model.
 
 ```{r, eval=FALSE}
 #RMSE
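
The hunk is truncated just after `#RMSE`, so the exercise's own pseudo-code is not visible here; for reference, the standard calculation it refers to is simply (object names assumed from the surrounding text):

```r
# Root Mean Square Error between observed and predicted test values
rmse <- sqrt(mean((y_test - y_pred)^2))
```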
@@ -83,62 +85,50 @@ Plot `y_test` against `y_pred`.
 
 ## Part 2: Logistic regression
 
-For this part we will use the joined diabetes, so lets load the joined dataset we created in exercise 1, e.g. 'diabetes_join.xlsx' or what you have named it.
+For this part we will use the joined diabetes data, so let's load the joined dataset we created in exercise 1, e.g. `diabetes_join.xlsx` or whatever you have named it.
 
 As the outcome we are studying, `Diabetes`, is a categorical variable, we will perform logistic regression. We select serum calcium levels (`Serum_ca2`), `BMI` and smoking habits (`Smoker`) as predictive variables.
 
-12. Logistic regression does not allow for any missing values so first ensure you do not have NAs in your dataframe. Ensure that your outcome variable `Diabetes` is a factor.
+12. Read in the Diabetes dataset.
 
-13. Split your data into training and test data. Take care that the two classes of the outcome variable are represented in both training and test data, and at similar ratios.
+13. Logistic regression does not allow for any missing values, so first ensure you do not have NAs in your dataframe. Ensure that your outcome variable `Diabetes` is a factor.
 
-14. Fit a logistic regression model with `Serum_ca2`, `BMI` and `Smoker` as predictors and `Diabetes` as outcome, using your training data.
+14. Split your data into training and test data. Take care that the two classes of the outcome variable are represented in both training and test data, and at similar ratios.
+
+15. Fit a logistic regression model with `Serum_ca2`, `BMI` and `Smoker` as predictors and `Diabetes` as outcome, using your training data.
 
 ::: {.callout-tip collapse="true"}
 ## Hint
 
 glm(..., family = 'binomial')
 :::
 
-15. Check the model summary and try to determine whether you could potentially drop one of your variables? If so, remake your model and check the coefficients, and error terms again.
+16. Check the model summary and try to determine whether you could potentially drop one or more of your variables. If so, make this alternative model (model2) and compare it to the original model. Is there a significant loss/gain, i.e. a better fit, when including the serum calcium levels as a predictor?
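
For the comparison in item 16, a likelihood ratio test is one option (a sketch; it assumes the training split is called `train` and that the reduced model drops `Serum_ca2`):

```r
# Fit the full and the reduced model, then compare with a LRT;
# a large p-value suggests Serum_ca2 adds little to the fit
model  <- glm(Diabetes ~ Serum_ca2 + BMI + Smoker, data = train, family = binomial)
model2 <- glm(Diabetes ~ BMI + Smoker, data = train, family = binomial)
anova(model2, model, test = "Chisq")
```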
 
-16. Now, use your model to predict Diabetes class based on your test set. What does the output of the prediction mean?
+17. Now, use your model to predict Diabetes class based on your test set. What does the output of the prediction mean?
 
 ::: {.callout-tip collapse="true"}
 ## Hint
 
 `predict(... , type ='response')`
 :::
 
-17. Lets evaluate the performance of our model. As we are performing classification, measures such as mse/rmse will not work, instead we will calculate the Accuracy. In order to get the Accuracy you must first convert our predictions into Diabetes class (e.g. 0 or 1).
+18. Let's evaluate the performance of our model. As we are performing classification, measures such as MSE/RMSE will not work; instead we will calculate the accuracy. In order to get the accuracy you must first convert the predictions into Diabetes class labels (e.g. 0 or 1).
 
 ```{r, eval=FALSE}
-confusionMatrix(y_pred, y_test)
+caret::confusionMatrix(y_pred, y_test)
 ```
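
Converting the predicted probabilities (from `predict(..., type = 'response')`) into class labels first can look like this (a sketch; the 0.5 cutoff and the object names are assumptions):

```r
# Turn probabilities into 0/1 labels, matching the factor levels of y_test
y_pred <- factor(ifelse(pred_probs > 0.5, 1, 0), levels = levels(y_test))
caret::confusionMatrix(y_pred, y_test)
```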
 
 ## Part 3: Clustering
 
-In this part we will run clustering on the joined diabetes dataset from exercise 1. Load it here if you don't have it already from Part 2.
-
-14. Run the k-means clustering algorithm with 4 centers on the data. Consider which columns you can use and if you have to manipulate them before. If you get an error, check whether you have values that might not be admissible, such as NA.
-
-15. Check whether the data you have run k-means on has the same number of rows as the dataframe with meta information, e.g. whether the person had diabetes. If they are not aligned, create a dataframe with Diabetes info that matches the dataframe you ran clustering on.
-
-16. Visualize the results of your clustering.
-
-17. Investigate the best number of clusters.
+In this part we will run clustering on the joined diabetes dataset (`diabetes_join.xlsx`) from exercise 1. Load it here if you don't have it already from Part 2 above.
 
-18. Re-do the clustering (plus visualization) with that number.
+19. Before running k-means clustering, remove any missing values across all variables in your dataset.
 
-------------------------------------------------------------------------
+20. Run the k-means clustering algorithm with 4 centers on the data. Consider which columns you can use and whether you have to do anything to them before clustering.
 
-## Extra exercises
+21. Visualize the results of your clustering.
 
-e1. Find the best single predictor in the Diabetes dataset. This is done by comparing the null model (no predictors) to all possible models with one predictor, i.e. `outcome ~ predictor`, `outcome ~ predictor2`, ect. The null model can be formulated like so: `outcome ~ 1` (only the intercept). Fit all possible one predictor models and compare their fit to the null model with a likelihood ratio test. Find the predictor with the lowest p-value in the likelihood ratio test. This can be done in a loop in order to avoid writing out all models.
-
-::: {.callout-tip collapse="true"}
-## Hint
-
-To use a formula with a variable you will need to combine the literal part and the variable with paste, e.g. `paste("Outcome ~", my_pred)`.
-:::
+22. Investigate the best number of clusters.
 
-e2. Write a function that handles visualization of k-means clustering results. Think about which information you need to pass and what it should return.
+23. Re-do the clustering (plus visualization) with that number.
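
Items 19-23 together amount to a flow like this (a sketch; the dataset name is an assumption, and tidyverse/factoextra are loaded at the top of the exercise):

```r
# Keep complete rows, numeric columns only, and standardise them
num_dat <- diabetes %>%
  drop_na() %>%
  select(where(is.numeric)) %>%
  scale()

set.seed(123)
km <- kmeans(num_dat, centers = 4)                 # item 20

fviz_cluster(km, data = num_dat, geom = "point")   # item 21
fviz_nbclust(num_dat, kmeans, method = "wss")      # item 22
```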

exercises/exercise5B.qmd

Lines changed: 82 additions & 3 deletions
@@ -52,7 +52,7 @@ The remaining 28 variables we will consider as potential explanatory variables f
 
 Elastic Net regression is part of the family of penalized regressions, which also includes Ridge regression and LASSO regression. Penalized regressions are especially useful when dealing with many predictors, as they help eliminate less informative ones while retaining the important predictors, making them ideal for high-dimensional datasets. One of the key advantages of Elastic Net over other types of penalized regression is its ability to handle multicollinearity and situations where the number of predictors exceeds the number of observations.
 
-As described above we have five variables which could be considered outcomes as these where all measured at the end of pregnancy. We can only work with one outcome at a time and we will pick `Preg.ended...37.wk`. This variable is a factor variable which denotes if a women gave birth prematurely (1=yes, 0=no).
+As described above we have five variables which could be considered outcomes, as these were all measured at the end of pregnancy. We can only work with one outcome at a time and we will pick `Preg.ended...37.wk` for now. This variable is a factor variable which denotes whether a woman gave birth prematurely (1 = yes, 0 = no).
 
 5. As you will use the response `Preg.ended...37.wk`, you should remove the other five outcome measures from your dataset.
 
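Elastic Net itself is typically fitted with glmnet, where `alpha` mixes the two penalties (a sketch, not the exercise's own code; `x_train` as a numeric matrix and `y_train` are assumptions):

```r
library(glmnet)

# alpha = 0 is Ridge, alpha = 1 is LASSO, values in between are Elastic Net
enet_fit <- cv.glmnet(x = x_train, y = y_train,
                      family = "binomial", alpha = 0.5)

plot(enet_fit)                    # CV curve across lambda
coef(enet_fit, s = "lambda.min")  # coefficients at the best lambda
```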
@@ -99,7 +99,12 @@ Now, lets see how well your model performed.
 
 14. Predict if an individual is likely to give birth before the 37th week using your model and your test set. See pseudo-code below.
 
-15. Just like for the logistic regression model you can calculate the accuracy of the prediction by first converting the predicted probabilities back into class labels (0, 1) and then comparing these to `y_test` with `confusionMatrix()`. Do you have a good accuracy? N.B look at the 2x2 contingency table, what does it tell you?
+```{r, eval = FALSE}
+y_pred <- predict(model, test, type = 'class')
+```
+
+15. Just like for the logistic regression model you can calculate the accuracy of the prediction by comparing it to `y_test` with `confusionMatrix()`. Do you have a good accuracy? N.B. look at the 2x2 contingency table: what does it tell you?
 
 16. Lastly, let's extract the variables which were retained in the model (i.e. not penalized out). We do this by extracting the coefficients with `coef()` from our model. See pseudo-code below.
 
@@ -113,4 +118,78 @@ coeffsDat <- as.data.frame(as.matrix(coeffs))
 
 17. Make a plot that shows the absolute importance of the variables retained in your model. This could be a barplot with variable names on the x-axis and the height of the bars denoting the absolute size of each coefficient.
 
-17. Make a logistic regression using this same dataset (you already have your train data, test data, y_train and y_test). Do you get similar results?
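
The barplot of absolute coefficients asked for above can be sketched as follows (assuming tidyverse is loaded and `coeffsDat` comes from the pseudo-code in the hunk header, with variable names as row names and the coefficients in the first column):

```r
coeffsDat %>%
  rownames_to_column("Variable") %>%
  rename(Coef = 2) %>%                              # the single coefficient column
  filter(Variable != "(Intercept)", Coef != 0) %>%  # keep retained predictors only
  ggplot(aes(x = reorder(Variable, -abs(Coef)), y = abs(Coef))) +
  geom_col() +
  labs(x = NULL, y = "Absolute coefficient") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```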
+## Part 2: Random Forest
+
+Now, we will make a Random Forest.
+
+We will continue using `Obt_Perio_ML.Rdata` with `Preg.ended...37.wk` as outcome.
+
+18. Just like in the Elastic Net (EN) section above:
+
+- Load the dataset (if you have not already).
+
+- Remove the outcome variables you will not be using.
+
+- Split the dataset into test and train set - this time keep the outcome variable `Preg.ended...37.wk` in the dataset.
+
+- Remember to remove the `PID` column before training!
+
+19. Set up a Random Forest model with cross-validation. See pseudo-code below. Remember to set a seed.
+
+First the cross-validation parameters:
+
+```{r, eval=FALSE}
+set.seed(123)
+
+# Set up cross-validation: 5-fold CV
+RFcv <- trainControl(
+  method = "cv",
+  number = 5,
+  classProbs = TRUE,
+  summaryFunction = twoClassSummary,
+  savePredictions = "final"
+)
+```
+
+Next we train the model:
+
+```{r, eval=FALSE}
+# Train Random Forest
+set.seed(123)
+rf_model <- train(
+  Outcome ~ .,
+  data = Trainingdata,
+  method = "rf",
+  trControl = RFcv,
+  metric = "ROC",
+  tuneLength = 5
+)
+
+# Model summary
+print(rf_model)
+```
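
One caret detail worth flagging (an assumption about your data, since `Preg.ended...37.wk` was coded 0/1 earlier): with `classProbs = TRUE`, `train()` requires factor levels that are valid R variable names, so levels like "0"/"1" need recoding first, e.g.:

```r
# caret errors on numeric-looking class levels when classProbs = TRUE,
# so relabel 0/1 as No/Yes before calling train()
Trainingdata$Preg.ended...37.wk <- factor(
  Trainingdata$Preg.ended...37.wk,
  levels = c("0", "1"),
  labels = c("No", "Yes")
)
```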
+
+20. Plot your model fit. How does your model improve when you add 10, 20, 30, etc. predictors?
+
+```{r, eval=FALSE}
+# Best parameters
+rf_model$bestTune
+
+# Plot performance
+plot(rf_model)
+```
+
+21. Use your test set to evaluate your model performance. How does the random forest compare to the elastic net regression?
+
+22. Extract the predictive variables with the greatest importance from your fit.
+
+```{r, eval=FALSE}
+varImpOut <- varImp(rf_model)
+
+varImpOut$importance
+```
+
+23. Make a logistic regression using the same dataset (you already have your train data, test data, y_train and y_test). How do the results of the Elastic Net regression and the Random Forest compare to the output of your glm?
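
For items 21 and 23, the evaluation flow might look like this (a sketch; `Testdata` and the No/Yes labels follow the assumptions above, and `PID` has already been removed per item 18):

```r
# Item 21: predicted classes from the random forest on the test set
rf_pred <- predict(rf_model, newdata = Testdata)
caret::confusionMatrix(rf_pred, Testdata$Preg.ended...37.wk)

# Item 23: a plain logistic regression on the same split for comparison
glm_fit  <- glm(Preg.ended...37.wk ~ ., data = Trainingdata, family = binomial)
glm_prob <- predict(glm_fit, newdata = Testdata, type = "response")
glm_pred <- factor(ifelse(glm_prob > 0.5, "Yes", "No"),
                   levels = levels(Testdata$Preg.ended...37.wk))
caret::confusionMatrix(glm_pred, Testdata$Preg.ended...37.wk)
```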
