Commit 7d2d737

Merge pull request #5 from UCSB-Library-Research-Data-Services/main
Merging back script with the data cleaning
2 parents 8ccb44e + a1161ca commit 7d2d737

4 files changed: +1453 -51 lines changed

.Rprofile

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+source("renv/activate.R")

data-cleaning.qmd

Lines changed: 1 addition & 1 deletion
@@ -284,7 +284,7 @@ Decisions:

 ```{r}
 snowsurvey_fixed <- snowsurvey_fixed %>%
-  filter((Total_cover_computed >= 80 & Total_cover_computed <= 120) | (Water_cover + Land_cover == 0 & Snow_cover > 0))
+  filter((Total_cover_computed >= 80 & Total_cover_computed <= 120) | (Water_cover + Land_cover == 0 & Snow_cover >= 0))
 ```

 ### Dates
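
The one-character change in this hunk (`Snow_cover > 0` becoming `Snow_cover >= 0`) can be checked on a small example. The sketch below is only illustrative: the `toy` tibble and its values are invented, and only the column names and the two filter expressions come from the diff. It suggests that the new condition additionally keeps rows where all three covers are recorded as 0.

```r
library(dplyr)

# Hypothetical toy data: column names follow the diff, values are invented.
toy <- tibble(
  Snow_cover  = c(100, 0, 50, 30),
  Water_cover = c(0, 0, 0, 20),
  Land_cover  = c(0, 0, 0, 10)
) %>%
  mutate(Total_cover_computed = Snow_cover + Water_cover + Land_cover)

# Filter as written before the commit (strict inequality on Snow_cover).
kept_before <- toy %>%
  filter((Total_cover_computed >= 80 & Total_cover_computed <= 120) |
           (Water_cover + Land_cover == 0 & Snow_cover > 0))

# Filter as written after the commit (Snow_cover >= 0).
kept_after <- toy %>%
  filter((Total_cover_computed >= 80 & Total_cover_computed <= 120) |
           (Water_cover + Land_cover == 0 & Snow_cover >= 0))

# Rows kept only by the new filter: here, the row where every cover is 0.
only_after <- anti_join(kept_after, kept_before, by = names(toy))
print(only_after)
```

Note that `filter()` drops rows where the condition evaluates to `NA`, so rows with missing covers are excluded by both versions.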

data-cleaning_empty.qmd

Lines changed: 17 additions & 50 deletions
@@ -50,8 +50,6 @@ Let's focus on the non-numeric values as a starting point:

 ```{r}
 snowsurvey_csv %>%
-  count(Snow_cover) %>%
-  filter(is.na(as.numeric(Snow_cover)))

 ```

@@ -70,29 +68,22 @@ snowsurvey_csv %>%
 Interestingly, when there is a "dot" for snow cover, it is also the case for all the other covers. Let's replace those with NA since there is no supplemental information in the provided metadata about the use of dots.

 ```{r}
-snowsurvey_fixed <- snowsurvey_csv %>%
-  # filter(Snow_cover ==".") %>%
-  mutate(Snow_cover = ifelse(Snow_cover==".", NA, Snow_cover))
-
+snowsurvey_fixed <-
 ```

 #### `-` values

 Is the problem similar with "-"?

 ```{r}
-snowsurvey_csv %>%
-  filter(Snow_cover == "-") %>%
-  View()
+snowsurvey_csv
 ```


 Let's set it to NA:

 ```{r}
-snowsurvey_fixed <- snowsurvey_fixed %>%
-  # filter(Snow_cover == "-") %>%
-  mutate(Snow_cover = ifelse(Snow_cover=="-", NA, Snow_cover))
+snowsurvey_fixed <-
 ```

 #### `n/a` values
@@ -101,17 +92,15 @@ snowsurvey_fixed <- snowsurvey_fixed %>%

 ```{r}
 snowsurvey_csv %>%
-  filter(Snow_cover == "n/a") %>%
-  View()
+
 ```


 Same pattern, let's substitute with NA:

 ```{r}
-snowsurvey_fixed <- snowsurvey_fixed %>%
-  # filter(Snow_cover == "n/a") %>%
-  mutate(Snow_cover = ifelse(Snow_cover=="n/a", NA, Snow_cover))
+snowsurvey_fixed <-
+
 ```

 #### `unk` values
@@ -120,31 +109,24 @@ What about "unk"? It is probably an abbreviation for unknown:

 ```{r}
 snowsurvey_csv %>%
-  filter(Snow_cover == "unk") %>%
-  View()
+
 ```


 ```{r}
-snowsurvey_fixed <- snowsurvey_fixed %>%
-  # filter(Snow_cover == "unk") %>%
-  mutate(Snow_cover = ifelse(Snow_cover=="unk", NA, Snow_cover))
+snowsurvey_fixed <-
 ```

 #### `<1` values

 Finally, what should we replace "<1" with?

 ```{r}
-snowsurvey_csv %>%
-  filter(Snow_cover == "<1") %>%
-  View()
+
 ```

 ```{r}
-snowsurvey_fixed <- snowsurvey_fixed %>%
-  # filter(Snow_cover == "<1") %>%
-  mutate(Snow_cover = ifelse(Snow_cover=="<1", "0", Snow_cover))
+
 ```

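
The removed lines in the hunks above handle each placeholder (".", "-", "n/a", "unk", "<1") with its own `ifelse()` call. As a compact illustration of the same idea, the sketch below applies all of those substitutions in one `mutate()` and then converts the column to numeric. The `toy` tibble, its values, and the `missing_codes` vector are hypothetical names introduced here; this is one possible way to fill in the exercise, not necessarily the authors' intended answer.

```r
library(dplyr)

# Hypothetical toy column mixing numbers stored as text with the placeholders
# discussed in the diff.
toy <- tibble(Snow_cover = c("10", ".", "-", "n/a", "unk", "<1", "85"))

# Placeholders treated as missing (assumption: same choices as the removed lines).
missing_codes <- c(".", "-", "n/a", "unk")

toy_fixed <- toy %>%
  mutate(
    # Replace every missing-value code with NA in one step.
    Snow_cover = if_else(Snow_cover %in% missing_codes, NA_character_, Snow_cover),
    # "<1" is mapped to "0", as in the removed solution line.
    Snow_cover = if_else(Snow_cover == "<1", "0", Snow_cover),
    # Only numeric-looking strings and NA remain, so the conversion is clean.
    Snow_cover = as.numeric(Snow_cover)
  )

glimpse(toy_fixed)
```

Keeping the placeholders in a single vector makes it easy to reuse the same rule for the other cover columns.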

@@ -161,10 +143,7 @@ snowsurvey_fixed %>%
 Ok, we can do the transformation:

 ```{r}
-snowsurvey_fixed <- snowsurvey_fixed %>%
-  mutate(Snow_cover = as.numeric(Snow_cover))
-
-glimpse(snowsurvey_fixed)
+snowsurvey_fixed <-
 ```

 Yeah we have finally a numeric column 🎉.
@@ -175,22 +154,20 @@ Yeah we have finally a numeric column 🎉.
 We are dealing with percentages, so we should verify that all the values are between 0 and 100:

 ```{r}
-snowsurvey_fixed %>%
-  filter(Snow_cover > 100)
+snowsurvey_fixed
 ```

 We have two values above 100, with an interesting 470%! ☃️ We should probably set those values to NAs:

 ```{r}
-snowsurvey_fixed <- snowsurvey_fixed %>%
-  mutate(Snow_cover = ifelse(Snow_cover > 100, NA, Snow_cover))
+snowsurvey_fixed <-
 ```

 Let's check for negative values:

 ```{r}
 snowsurvey_fixed %>%
-  filter(Snow_cover < 0)
+
 ```

 No negative value detected ✅
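
The range check described in this hunk (percentages should lie between 0 and 100) can be sketched on invented data as follows. The `toy` tibble and its values are hypothetical; the `Snow_cover > 100` rule mirrors the removed solution lines, and using `if_else()` with `NA_real_` instead of `ifelse()` is a choice made here for type safety.

```r
library(dplyr)

# Hypothetical numeric column including an impossible percentage and a missing value.
toy <- tibble(Snow_cover = c(0, 55, 100, 470, NA))

# Inspect values outside the valid 0-100 range (rows with NA are not returned).
out_of_range <- toy %>%
  filter(Snow_cover < 0 | Snow_cover > 100)

# Set values above 100 to NA; existing NAs stay NA.
toy_fixed <- toy %>%
  mutate(Snow_cover = if_else(Snow_cover > 100, NA_real_, Snow_cover))

print(out_of_range)
print(toy_fixed)
```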
@@ -236,32 +213,22 @@ This data model is not convenient for a database, we will have to switch to a lo
 ### Data cleaning

 ```{r}
-species_long <- species_csv %>%
-  pivot_longer(
-    cols = !c(Year, Site, Date, Jdate, Num_observers, All_obs_reported, Observer_hours),
-    names_to = "species",
-    values_to = "species_count",
-    values_transform = list(species_count = as.character)
-  )
+species_long <-
 ```


-```{r}
-
-```
-
 We want to focus on the presence and absence of species and not the count. Let's create a new column for presence where anything other than 0 is considered present.

 ```{r}
 species_presence <- species_long %>%
-  mutate(species_presence = ifelse(species_count == "0", 0, 1))
+
 ```

 We can remove some columns: "Num_observers", "All_obs_reported", "Observer_hours" are here to help compute the observation effort, but since we just want presence and absence, we do not need them. We can also remove all the zero values to reduce the size of our data set:

 ```{r}
 species_presence <- species_presence %>%
-  select(-c(Num_observers, All_obs_reported, Observer_hours))
+

 ```
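
The wide-to-long reshaping and presence flag described in this hunk can be illustrated on a tiny invented table. In the sketch below, `species_toy`, its values, and the species columns `sp1` and `sp2` are hypothetical; the `pivot_longer()` arguments and the `ifelse()` rule follow the removed solution lines. The final `filter()` is an assumption about how "remove all the zero values" might be done, since the diff does not show that step.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per species, counts stored with mixed types.
species_toy <- tibble(
  Year           = c(2000, 2000),
  Site           = c("a", "b"),
  Observer_hours = c(1.5, 2),
  sp1            = c(0, 3),       # numeric counts
  sp2            = c("2", "0")    # counts stored as text
)

# Long format: one row per site and species; counts are unified as character
# so that columns of different types can be stacked.
species_long <- species_toy %>%
  pivot_longer(
    cols = !c(Year, Site, Observer_hours),
    names_to = "species",
    values_to = "species_count",
    values_transform = list(species_count = as.character)
  )

species_presence <- species_long %>%
  # Anything other than "0" counts as present.
  mutate(species_presence = ifelse(species_count == "0", 0, 1)) %>%
  # Effort columns are not needed for presence/absence.
  select(-Observer_hours) %>%
  # Assumption: keep only presences to reduce the size of the data set.
  filter(species_presence == 1)

print(species_presence)
```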
