---
title: "Extra Exercise 5 - Models and Model Evaluation in R"
format: html
project:
  type: website
  output-dir: ../docs
---
## Extra exercises
e1. Find the best single predictor in the Diabetes dataset. This is done by comparing the null model (no predictors) to all possible models with one predictor, i.e. `outcome ~ predictor1`, `outcome ~ predictor2`, etc. The null model can be formulated like so: `outcome ~ 1` (only the intercept). Fit all possible one-predictor models and compare their fit to the null model with a likelihood ratio test. Find the predictor with the lowest p-value in the likelihood ratio test. This can be done in a loop to avoid writing out all the models.
::: {.callout-tip collapse="true"}
## Hint
To use a formula with a variable you will need to combine the literal part and the variable with `paste()`, e.g. `paste("Outcome ~", my_pred)`.
:::
```{r}
# Define the null model (intercept-only model)
null_model <- glm(Diabetes ~ 1, data = train, family = binomial)

# Get predictor names (excluding the outcome variable)
predictors <- setdiff(colnames(train), "Diabetes")

# Fit each one-predictor model and compare it to the null model
# with a likelihood ratio test (chi-squared test on the deviance)
p_values <- sapply(predictors, function(my_pred) {
  fit <- glm(as.formula(paste("Diabetes ~", my_pred)),
             data = train, family = binomial)
  anova(null_model, fit, test = "Chisq")$`Pr(>Chi)`[2]
})

# The predictor with the lowest p-value is the best single predictor
sort(p_values)[1]
```
e2. Write a function that handles visualization of k-means clustering results. Think about which information you need to pass and what it should return.
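One possible shape for such a function, as a minimal sketch: it takes the fitted k-means object and the numeric data it was fit on, and returns a ggplot object via `factoextra::fviz_cluster()`. The function name and arguments here are illustrative choices, not part of the exercise.

```{r, eval=FALSE}
# Sketch of a k-means visualization helper (name and arguments are illustrative)
plot_kmeans <- function(km_fit, data, title = "K-means clustering") {
  # km_fit: object returned by kmeans(); data: the numeric data it was fit on
  factoextra::fviz_cluster(km_fit, data = data, main = title)
}
```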
---
## Quarto
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.
## Running Code
When you click the **Render** button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
```{r}
1 + 1
```
You can add options to executable code like this:
```{r}
#| echo: false
2 * 2
```
The `echo: false` option disables the printing of code (only output is displayed).
`exercises/exercise5A.qmd`
In this exercise you will fit and interpret simple models.
```{r warning=FALSE, message=FALSE}
library(tidyverse)
library(readxl)
library(ggfortify)
library(factoextra)
```
## Part 1: Linear regression
9. Now, use our test set to predict the response `medv` (median value per house in $1000s).
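A minimal sketch, assuming the fitted linear model is called `model` and the held-out data `test`:

```{r, eval=FALSE}
# Predict medv for the held-out observations
y_pred <- predict(model, newdata = test)
```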
10. Evaluate how well our model performs. There are different ways of doing this, but let's use the classic measure of RMSE (Root Mean Square Error). The pseudo-code below shows how to calculate the RMSE. A small RMSE (close to zero) indicates a good model.
```{r, eval=FALSE}
# RMSE: root of the mean squared difference between predictions and truth
rmse <- sqrt(mean((y_test - y_pred)^2))
rmse
```

Plot `y_test` against `y_pred`.
## Part 2: Logistic regression
For this part we will use the joined diabetes dataset, so let's load the file we created in exercise 1, e.g. `diabetes_join.xlsx` or whatever you have named it.
As the outcome we are studying, `Diabetes`, is a categorical variable, we will perform logistic regression. We select serum calcium levels (`Serum_ca2`), `BMI`, and smoking habits (`Smoker`) as predictor variables.
12. Read in the Diabetes dataset.
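A minimal sketch, assuming the file from exercise 1 is called `diabetes_join.xlsx`:

```{r, eval=FALSE}
# Read the joined diabetes data created in exercise 1
diabetes <- read_excel("diabetes_join.xlsx")
```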
13. Logistic regression does not allow for any missing values, so first ensure you do not have NAs in your dataframe. Also ensure that your outcome variable `Diabetes` is a factor.
14. Split your data into training and test data. Take care that the two classes of the outcome variable are represented in both training and test data, and at similar ratios.
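One way to get a stratified split, as a sketch; `caret::createDataPartition()` samples within each class of the outcome, and the dataframe name `diabetes` is an assumption:

```{r, eval=FALSE}
set.seed(123)

# Stratified sampling: 80% of each Diabetes class goes into the training set
train_idx <- caret::createDataPartition(diabetes$Diabetes, p = 0.8, list = FALSE)
train <- diabetes[train_idx, ]
test  <- diabetes[-train_idx, ]

# Check that the class ratios are similar
table(train$Diabetes) / nrow(train)
table(test$Diabetes) / nrow(test)
```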
15. Fit a logistic regression model with `Serum_ca2`, `BMI` and `Smoker` as predictors and `Diabetes` as outcome, using your training data.
::: {.callout-tip collapse="true"}
## Hint
`glm(..., family = 'binomial')`
:::
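Putting the hint together, a sketch (assuming your training split is called `train`):

```{r, eval=FALSE}
model <- glm(Diabetes ~ Serum_ca2 + BMI + Smoker,
             data = train, family = binomial)
summary(model)
```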
16. Check the model summary and try to determine whether you could potentially drop one or more of your variables. If so, make this alternative model (model2) and compare it to the original model. Is there a significant loss/gain, i.e. a better fit, when including the serum calcium levels as a predictor?
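A sketch of such a comparison via a likelihood ratio test between the nested models; the name `model2` for the reduced model is an assumption:

```{r, eval=FALSE}
# Reduced model without the serum calcium levels
model2 <- glm(Diabetes ~ BMI + Smoker, data = train, family = binomial)

# Likelihood ratio test: does Serum_ca2 significantly improve the fit?
anova(model2, model, test = "Chisq")
```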
17. Now, use your model to predict Diabetes class based on your test set. What does the output of the prediction mean?
::: {.callout-tip collapse="true"}
## Hint
`predict(..., type = 'response')`
:::
18. Let's evaluate the performance of our model. As we are performing classification, measures such as MSE/RMSE will not work; instead we will calculate the accuracy. In order to get the accuracy you must first convert your predictions into Diabetes class labels (e.g. 0 or 1).
```{r, eval=FALSE}
# y_pred_prob: predicted probabilities from predict(..., type = 'response')
# Convert them into class labels using a 0.5 cutoff
y_pred <- factor(ifelse(y_pred_prob > 0.5, 1, 0), levels = levels(y_test))
caret::confusionMatrix(y_pred, y_test)
```
## Part 3: Clustering
In this part we will run clustering on the joined diabetes dataset (`diabetes_join.xlsx`) from exercise 1. Load it here if you don't have it already from Part 2 above.
19. Before running k-means clustering, remove any missing values across all variables in your dataset.
20. Run the k-means clustering algorithm with 4 centers on the data. Consider which columns you can use and whether you have to do anything to them before clustering.
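A sketch of one approach, assuming the NA-free dataframe is called `diabetes`: k-means needs numeric input, and scaling puts the variables on a comparable footing.

```{r, eval=FALSE}
# Keep the numeric columns and standardize them
clust_data <- diabetes |>
  select(where(is.numeric)) |>
  scale()

set.seed(123)
km_fit <- kmeans(clust_data, centers = 4, nstart = 25)
```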
21. Visualize the results of your clustering.
22. Investigate the best number of clusters.
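A sketch using `factoextra` (loaded in the setup chunk); the elbow ("wss") and silhouette methods are two common heuristics, shown here on the assumed `clust_data` from above:

```{r, eval=FALSE}
# Elbow method: look for the bend in the within-cluster sum of squares
fviz_nbclust(clust_data, kmeans, method = "wss")

# Average silhouette width: higher is better
fviz_nbclust(clust_data, kmeans, method = "silhouette")
```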
23. Re-do the clustering (plus visualization) with that number.
`exercises/exercise5B.qmd`
The remaining 28 variables we will consider as potential explanatory variables.
Elastic Net regression is part of the family of penalized regressions, which also includes Ridge regression and LASSO regression. Penalized regressions are especially useful when dealing with many predictors, as they help eliminate less informative ones while retaining the important predictors, making them ideal for high-dimensional datasets. One of the key advantages of Elastic Net over other types of penalized regression is its ability to handle multicollinearity and situations where the number of predictors exceeds the number of observations.
As described above, we have five variables which could be considered outcomes, as these were all measured at the end of pregnancy. We can only work with one outcome at a time, and we will pick `Preg.ended...37.wk` for now. This variable is a factor which denotes whether a woman gave birth prematurely (1 = yes, 0 = no).
5. As you will use the response `Preg.ended...37.wk`, you should remove the other four outcome measures from your dataset.
Now, let's see how well your model performed.
14. Predict if an individual is likely to give birth before the 37th week using your model and your test set. See pseudo-code below.
```{r, eval = FALSE}
y_pred <- predict(model, test, type = 'class')
```
15. Just like for the logistic regression model, you can calculate the accuracy of the prediction by comparing it to `y_test` with `confusionMatrix()`. Do you have a good accuracy? N.B. look at the 2x2 contingency table, what does it tell you?
16. Lastly, let's extract the variables which were retained in the model (i.e. not penalized out). We do this by calling `coef()` on our model. See pseudo-code below.
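The referenced pseudo-code is not shown here; as a sketch, assuming `model` is a cross-validated glmnet fit (as in the prediction pseudo-code above), the retained coefficients at the chosen lambda could be extracted like this:

```{r, eval=FALSE}
# Coefficients at the selected value of lambda
en_coefs <- coef(model, s = "lambda.min")

# Keep only the predictors whose coefficient was not penalized to zero
retained <- en_coefs[en_coefs[, 1] != 0, , drop = FALSE]
retained
```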
17. Make a plot that shows the absolute importance of the variables retained in your model. This could be a barplot with variable names on the x-axis and the height of the bars denoting the absolute size of the coefficients.
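Continuing the sketch above (dropping the intercept and plotting the absolute coefficient sizes):

```{r, eval=FALSE}
coef_df <- tibble(
  variable = rownames(retained),
  abs_coef = abs(as.numeric(retained[, 1]))
) |>
  filter(variable != "(Intercept)")

ggplot(coef_df, aes(x = reorder(variable, -abs_coef), y = abs_coef)) +
  geom_col() +
  labs(x = "Variable", y = "Absolute coefficient size") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```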
## Part 2: Random Forest
Now we will fit a Random Forest model.
We will continue using the `Obt_Perio_ML.Rdata` dataset with `Preg.ended...37.wk` as the outcome.
18. Just like in the section on Elastic Net above:
- Load the dataset (if you have not already)
- Remove the outcome variables you will not be using.
- Split the dataset into test and train sets; this time, keep the outcome variable `Preg.ended...37.wk` in the dataset.
- Remember to remove the `PID` column before training!
19. Set up a Random Forest model with cross-validation. See pseudo-code below. Remember to set a seed.
First the cross-validation parameters:
```{r, eval=FALSE}
set.seed(123)

# Set up cross-validation: 5-fold CV
RFcv <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)
```
Next we train the model:
```{r, eval=FALSE}
# Train Random Forest
set.seed(123)
rf_model <- train(
  Outcome ~ .,   # replace Outcome with your outcome variable, i.e. Preg.ended...37.wk
  data = Trainingdata,
  method = "rf",
  trControl = RFcv,
  metric = "ROC",
  tuneLength = 5
)

# Model summary
print(rf_model)
```
20. Plot your model fit. How does your model improve when you add 10, 20, 30, etc. predictors?
```{r, eval=FALSE}
# Best parameters
rf_model$bestTune

# Plot performance
plot(rf_model)
```
21. Use your test set to evaluate your model performance. How does the random forest compare to the elastic net regression?
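A sketch, assuming the held-out data is called `Testdata` (paralleling `Trainingdata` above); `predict()` on a caret `train` object returns class labels by default:

```{r, eval=FALSE}
rf_pred <- predict(rf_model, newdata = Testdata)
caret::confusionMatrix(rf_pred, Testdata$Preg.ended...37.wk)
```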
22. Extract the predictive variables with the greatest importance from your fit.
```{r, eval=FALSE}
varImpOut <- varImp(rf_model)
varImpOut$importance
```
23. Make a logistic regression using the same dataset (you already have your train data, test data, y_train and y_test). How do the results of the Elastic Net regression and the Random Forest compare to the output of your glm?
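A sketch under the same naming assumptions as above, and assuming the outcome levels are 0 and 1 as described earlier; the glm simply uses all remaining predictors, without penalization:

```{r, eval=FALSE}
glm_model <- glm(Preg.ended...37.wk ~ ., data = Trainingdata, family = binomial)

# Predicted probabilities, converted to class labels with a 0.5 cutoff
glm_prob <- predict(glm_model, newdata = Testdata, type = "response")
glm_pred <- factor(ifelse(glm_prob > 0.5, 1, 0),
                   levels = levels(Testdata$Preg.ended...37.wk))
caret::confusionMatrix(glm_pred, Testdata$Preg.ended...37.wk)
```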