- # <div align="center">Audio Tagging & Sound Event Detection in PyTorch</div>
+ # <div align="center">Audio Classification, Tagging & Sound Event Detection in PyTorch</div>

Progress:

- - [ ] Mixup Augmentation
- - [ ] Random Noise Augmentation
- - [ ] Spec Augmentation
- - [ ] SED fine tuning
+ - [x] Fine-tune on audio classification
+ - [ ] Fine-tune on audio tagging
+ - [ ] Fine-tune on sound event detection
+ - [x] Add tagging metrics
+ - [ ] Add Tutorial
+ - [x] Add Augmentation Notebook
+ - [ ] Add more schedulers
+ - [ ] Add FSDKaggle2019 dataset
+ - [ ] Add MTT dataset
+ - [ ] Add DESED

## <div align="center">Model Zoo</div>
@@ -26,19 +32,38 @@ CNN14_DecisionLevelMax | SED | 38.5 | 32 | 1024 | 64 | 14k | [download][cnn14max
</details>

- > Note: These are the pretrained models from [audioset-tagging-cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn). Check out this official repo if you want to train on audioset. Training on audioset will not be supported in this repo due to resource constraints.
+ > Note: These models are used as the pretrained models for the fine-tuning tasks below. Check out [audioset-tagging-cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn) if you want to train on the AudioSet dataset.
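
These checkpoints are the starting point for the fine-tuning results below. As a rough sketch of the usual transfer pattern (not this repo's exact API; the `'model'` checkpoint key and the `CNN14` class in the usage comment are assumptions based on the PANNs release), matching weights can be copied over while the old 527-class AudioSet head is dropped:

```python
import torch

def load_pretrained(model: torch.nn.Module, ckpt_path: str) -> None:
    """Copy matching weights from a PANNs-style checkpoint, dropping mismatched layers."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)  # PANNs checkpoints usually nest weights under "model" (assumption)
    own = model.state_dict()
    # keep only tensors that also exist in the new model with identical shapes,
    # which silently drops the old 527-class AudioSet classification head
    state = {k: v for k, v in state.items() if k in own and v.shape == own[k].shape}
    model.load_state_dict(state, strict=False)

# hypothetical usage:
# model = CNN14(num_classes=50)        # new head sized for ESC50
# load_pretrained(model, "CNN14.pth")
```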

- [esc50cnn14]: https://drive.google.com/file/d/1oYFws7hvGtothbnzf1vDtK4dQ5sjbgR2/view?usp=sharing
+ [esc50cnn14]: https://drive.google.com/file/d/1itN-WyEL6Wp_jVBlld6vLaj47UWL2JaP/view?usp=sharing
+ [fsd2018]: https://drive.google.com/file/d/1KzKd4icIV2xF7BdW9EZpU9BAZyfCatrD/view?usp=sharing
+ [scv1]: https://drive.google.com/file/d/1Mc4UxHOEvaeJXKcuP4RiTggqZZ0CCmOB/view?usp=sharing

<details open>
- <summary><strong>Fine-tuned Models</strong></summary>
+ <summary><strong>Fine-tuned Classification Models</strong></summary>

- Model | Task | Dataset | Accuracy<br><sup>(%) | Sample Rate<br><sup>(kHz) | Window Length | Num Mels | Fmax | Weights
- --- | --- | --- | --- | --- | --- | --- | --- | ---
- CNN14 | Tagging | ESC50<br>(Fold-5) | 94.75<br>(no aug) | 32 | 1024 | 64 | 14k | [download][esc50cnn14]
- CNN14 | Tagging | FSDKaggle2018<br>(val) | ? | 32 | 1024 | 64 | 14k | -
- CNN14 | Tagging | UrbandSound8k<br>(Fold-10) | ? | 32 | 1024 | 64 | 14k | -
- CNN14 | Tagging | SpeechCommandsv1<br>(val/test) | ? | 32 | 1024 | 64 | 14k | -
+ Model | Dataset | Accuracy<br><sup>(%) | Sample Rate<br><sup>(kHz) | Weights
+ --- | --- | --- | --- | ---
+ CNN14 | ESC50 (Fold-5) | 95.75 | 32 | [download][esc50cnn14]
+ CNN14 | FSDKaggle2018 (test) | 93.56 | 32 | [download][fsd2018]
+ CNN14 | SpeechCommandsv1 (val/test) | 96.60/96.77 | 32 | [download][scv1]
+
+ </details>
+
+ <details>
+ <summary><strong>Fine-tuned Tagging Models</strong></summary>
+
+ Model | Dataset | mAP(%) | AUC | d-prime | Sample Rate<br><sup>(kHz) | Config | Weights
+ --- | --- | --- | --- | --- | --- | --- | ---
+ CNN14 | FSDKaggle2019 | - | - | - | 32 | - | -
+
+ </details>
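
For reference, the mAP, AUC and d-prime columns above are typically computed from clip-level scores as sketched below with scikit-learn and SciPy (d-prime is derived from AUC as in the PANNs paper; this repo's own evaluation code may differ):

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import average_precision_score, roc_auc_score

def tagging_metrics(targets: np.ndarray, scores: np.ndarray):
    """targets: (N, C) binary multi-hot labels; scores: (N, C) sigmoid outputs."""
    ap = average_precision_score(targets, scores, average=None)   # per-class average precision
    auc = roc_auc_score(targets, scores, average=None)            # per-class AUC
    d_prime = np.sqrt(2.0) * norm.ppf(auc)                        # d' derived from AUC
    return ap.mean(), auc.mean(), d_prime.mean()                  # mAP is ap.mean() * 100 in percent
```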
+
+ <details>
+ <summary><strong>Fine-tuned SED Models</strong></summary>
+
+ Model | Dataset | F1 | Sample Rate<br><sup>(kHz) | Config | Weights
+ --- | --- | --- | --- | --- | ---
+ CNN14_DecisionLevelMax | DESED | - | 32 | - | -

</details>
@@ -47,20 +72,29 @@ CNN14 | Tagging | SpeechCommandsv1<br>(val/test) | ? | 32 | 1024 | 64 | 14k | -
## <div align="center">Supported Datasets</div>

[esc50]: https://github.com/karolpiczak/ESC-50
- [fsdkaggle]: https://zenodo.org/record/2552860
+ [fsdkaggle2018]: https://zenodo.org/record/2552860
+ [fsdkaggle2019]: https://zenodo.org/record/3612637
[audioset]: https://research.google.com/audioset/
[urbansound8k]: https://urbansounddataset.weebly.com/urbansound8k.html
[speechcommandsv1]: https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html
[speechcommandsv2]: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
+ [mtt]: https://github.com/keunwoochoi/magnatagatune-list
+ [desed]: https://project.inria.fr/desed/

- Dataset | Type | Classes | Train | Val | Test | Audio Length | Audio Spec | Size
+ Dataset | Task | Classes | Train | Val | Test | Audio Length | Audio Spec | Size
--- | --- | --- | --- | --- | --- | --- | --- | ---
- [ESC-50][esc50] | Environmental | 50 | 2,000 | 5 folds | - | 5s | 44.1kHz, mono | 600MB
- [UrbanSound8k][urbansound8k] | Urban | 10 | 8,732 | 10 folds | - | <=4s | Vary | 5.6GB
- [FSDKaggle2018][fsdkaggle] | - | 41 | 9,473 | 1,600 | - | 300ms~30s | 44.1kHz, mono | 4.6GB
- [SpeechCommandsv1][speechcommandsv1] | Words | 30 | 51,088 | 6,798 | 6,835 | <=1s | 16kHz, mono | 1.4GB
- [SpeechCommandsv2][speechcommandsv2] | Words | 35 | 84,843 | 9,981 | 11,005 | <=1s | 16kHz, mono | 2.3GB
+ [ESC-50][esc50] | Classification | 50 | 2,000 | 5 folds | - | 5s | 44.1kHz, mono | 600MB
+ [UrbanSound8k][urbansound8k] | Classification | 10 | 8,732 | 10 folds | - | <=4s | Vary | 5.6GB
+ [FSDKaggle2018][fsdkaggle2018] | Classification | 41 | 9,473 | - | 1,600 | 300ms~30s | 44.1kHz, mono | 4.6GB
+ [SpeechCommandsv1][speechcommandsv1] | Classification | 30 | 51,088 | 6,798 | 6,835 | <=1s | 16kHz, mono | 1.4GB
+ [SpeechCommandsv2][speechcommandsv2] | Classification | 35 | 84,843 | 9,981 | 11,005 | <=1s | 16kHz, mono | 2.3GB
+ ||
+ [FSDKaggle2019][fsdkaggle2019]* | Tagging | 80 | 4,970+19,815 | - | 4,481 | 300ms~30s | 44.1kHz, mono | 24GB
+ [MTT][mtt]* | Tagging | 50 | 19,000 | - | - | - | - | 3GB
+ ||
+ [DESED][desed]* | SED | 10 | - | - | - | 10s | - | -

+ > Notes: datasets marked with `*` are not available yet. Classification datasets are treated as multi-class (single-label) classification; tagging and SED datasets are treated as multi-label classification.
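
In training terms, that note boils down to how targets and losses are set up; a minimal sketch of the distinction (the loss functions actually used by this repo's training code are not shown here and may differ):

```python
import torch
import torch.nn.functional as F

num_classes = 10
logits = torch.randn(4, num_classes)            # model outputs for a batch of 4 clips

# classification (multi-class, single-label): one integer class id per clip
class_ids = torch.tensor([3, 7, 0, 9])
clf_loss = F.cross_entropy(logits, class_ids)

# tagging / SED (multi-label): a multi-hot vector per clip, any number of active classes
multi_hot = torch.zeros(4, num_classes)
multi_hot[0, [3, 5]] = 1.0                      # clip 0 contains two sound events
tag_loss = F.binary_cross_entropy_with_logits(logits, multi_hot)
```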

<details>
<summary><strong>Dataset Structure</strong> (click to expand)</summary>
@@ -93,27 +127,64 @@ datasets
</details>

+ <br>
+ <details>
+ <summary><strong>Augmentations</strong> (click to expand)</summary>
+
+ Currently, the following augmentations are supported (a minimal usage sketch follows the lists below). More will be added in the future. You can test the effects of augmentations with this [notebook](./datasets/aug_test.ipynb).
+
+ Waveform Augmentations:
+
+ - [x] MixUp
+ - [x] Background Noise
+ - [x] Gaussian Noise
+ - [x] Fade In/Out
+ - [x] Volume
+ - [ ] CutMix
+
+ Spectrogram Augmentations:
+
+ - [x] Time Masking
+ - [x] Frequency Masking
+ - [x] Filter Augmentation
+
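As mentioned above, here is a minimal sketch of what these augmentations do, written with plain torchaudio/PyTorch ops rather than this repo's own transform classes (the audio file names are hypothetical, and the repo's implementations may differ in signature and defaults):

```python
import torch
import torchaudio

waveform, sr = torchaudio.load("dog_bark.wav")            # hypothetical input clip

# waveform-level: MixUp blends two clips (and their labels) with a random weight
other, _ = torchaudio.load("siren.wav")
lam = float(torch.distributions.Beta(0.2, 0.2).sample())
n = min(waveform.shape[-1], other.shape[-1])
mixed = lam * waveform[..., :n] + (1.0 - lam) * other[..., :n]

# waveform-level: additive Gaussian noise at a small fixed scale
noisy = mixed + 0.005 * torch.randn_like(mixed)

# spectrogram-level: SpecAugment-style frequency and time masking
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=64)(noisy)
mel = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)(mel)
mel = torchaudio.transforms.TimeMasking(time_mask_param=40)(mel)
```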
+ </details>

---

## <div align="center">Usage</div>
+ <details>
+ <summary><strong>Requirements</strong> (click to expand)</summary>
+
+ * python >= 3.6
+ * pytorch >= 1.8.1
+ * torchaudio >= 0.8.1
+
+ Other requirements can be installed with `pip install -r requirements.txt`.
+
+ </details>
+
+ <br>
<details>
<summary><strong>Configuration</strong> (click to expand)</summary>

- Create a configuration file in `configs`. Sample configuration for ImageNet dataset can be found [here](configs/tagging.yaml). Then edit the fields you think if it is needed. This configuration file is needed for all of training, evaluation and prediction scripts.
+ * Create a configuration file in [configs](./configs/). A sample configuration for the ESC50 dataset can be found [here](configs/esc50.yaml).
+ * Copy its contents and edit the fields as needed (a minimal sketch of the commonly referenced fields follows this list).
+ * This configuration file is needed for all of the training, evaluation and prediction scripts.
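
A rough sketch of the fields referenced elsewhere in this README, built as a Python dict and dumped to YAML (the authoritative schema is configs/esc50.yaml; the exact nesting and value types here are assumptions):

```python
import yaml  # pyyaml

config = {
    "DDP": False,                          # set to True for multi-GPU training
    "MODEL_PATH": "output/esc50_cnn14.pth",
    "DATASET": {"NAME": "esc50"},
    "TEST": {
        "FILE": "samples/dog_bark.wav",    # audio file used by the inference scripts
        "PLOT": True,                      # plot SED results in sed_infer.py
    },
}

with open("configs/my_experiment.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```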

</details>

<br>
<details>
<summary><strong>Training</strong> (click to expand)</summary>

- Train with 1 GPU:
+ To train with a single GPU:

```bash
$ python tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
```

- Train with 2 GPUs:
+ To train with multiple GPUs, set the `DDP` field in the config file to `true` and run as follows:

```bash
$ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
@@ -128,33 +199,36 @@ $ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py
Make sure to set `MODEL_PATH` of the configuration file to your trained model directory.

```bash
- $ python tools/val.py --cfg configs/CONFIG_FILE_NAME.yaml
+ $ python tools/val.py --cfg configs/CONFIG_FILE.yaml
```

</details>

<br>
<details open>
- <summary><strong>Audio Tagging Inference</strong></summary>
+ <summary><strong>Audio Classification/Tagging Inference</strong></summary>

* Set `MODEL_PATH` of the configuration file to your model's trained weights.
* Change the dataset name in `DATASET` >> `NAME` to your trained model's dataset.
* Set the testing audio file path in `TEST` >> `FILE`.
* Run the following command.

```bash
- $ python tools/tagging_infer.py --cfg configs/TAGGING_CONFIG_FILE.yaml
+ $ python tools/infer.py --cfg configs/CONFIG_FILE.yaml
+
+ ## for example
+ $ python tools/infer.py --cfg configs/audioset.yaml
```
You will get an output similar to this:

```bash
Class                  Confidence
---------------------- ------------
- Speech                 0.893114
- Telephone bell ringing 0.754014
- Inside, small room     0.235118
- Telephone              0.182611
- Music                  0.0922332
+ Speech                 0.897762
+ Telephone bell ringing 0.752206
+ Telephone              0.219329
+ Inside, small room     0.20761
+ Music                  0.0770325
```
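
Conceptually, the inference script loads the clip, resamples it to the model's sample rate, and ranks class confidences; a hedged sketch of that flow (`tools/infer.py` may organize this differently, and `model`/`labels` are placeholders you must provide):

```python
import torch
import torchaudio

def top5(model, labels, path, target_sr=32000):
    """Return the 5 most confident (label, score) pairs; assumes the model outputs clip-level logits."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
    wav = wav.mean(dim=0, keepdim=True)                # downmix to mono
    with torch.no_grad():
        probs = torch.sigmoid(model(wav)).squeeze(0)   # multi-label confidences
    conf, idx = probs.topk(5)
    return [(labels[i], float(c)) for i, c in zip(idx.tolist(), conf)]

# hypothetical usage:
# model.eval()
# print(top5(model, class_labels, "telephone.wav"))
```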

</details>
@@ -169,8 +243,12 @@ Music 0.0922332
* Run the following command.

```bash
- $ python tools/sed_infer.py --cfg configs/SED_CONFIG_FILE.yaml
+ $ python tools/sed_infer.py --cfg configs/CONFIG_FILE.yaml
+
+ ## for example
+ $ python tools/sed_infer.py --cfg configs/audioset_sed.yaml
```

You will get an output similar to this:

```bash
@@ -180,7 +258,7 @@ Speech 2.2 7
Telephone bell ringing 0 2.5
```
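
The onset/offset table is obtained by thresholding the model's frame-wise probabilities; a simplified sketch of that post-processing (the actual `sed_infer.py` may use different thresholds, smoothing, and frame rate):

```python
import torch

def framewise_to_events(framewise_probs, class_names, threshold=0.5, frames_per_second=100):
    """framewise_probs: (T, C) tensor of per-frame class probabilities."""
    events = []
    active = framewise_probs > threshold                 # (T, C) boolean mask
    for c, name in enumerate(class_names):
        mask = active[:, c]
        t = 0
        while t < len(mask):
            if mask[t]:
                onset = t
                while t < len(mask) and mask[t]:          # walk to the end of the active run
                    t += 1
                events.append((name, onset / frames_per_second, t / frames_per_second))
            else:
                t += 1
    return events  # list of (class, onset seconds, offset seconds)
```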

- If you set `TEST` >> `PLOT` to `true`, the following plot will also show:
+ The following plot will also be shown if you set `PLOT` to `true`:

![sed_result](./assests/sed_result.png)
@@ -190,8 +268,44 @@ If you set `TEST` >> `PLOT` to `true`, the following plot will also show:
<details>
<summary><strong>References</strong> (click to expand)</summary>

+ * https://github.com/qiuqiangkong/audioset_tagging_cnn
+ * https://github.com/YuanGongND/ast
+ * https://github.com/frednam93/FilterAugSED
+ * https://github.com/lRomul/argus-freesound
+
+ </details>
+
+ <br>
+ <details>
+ <summary><strong>Citations</strong> (click to expand)</summary>
+

```
- [1] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. "Panns: Large-scale pretrained audio neural networks for audio pattern recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 2880-2894
+ @misc{kong2020panns,
+   title={PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition},
+   author={Qiuqiang Kong and Yin Cao and Turab Iqbal and Yuxuan Wang and Wenwu Wang and Mark D. Plumbley},
+   year={2020},
+   eprint={1912.10211},
+   archivePrefix={arXiv},
+   primaryClass={cs.SD}
+ }
+
+ @misc{gong2021ast,
+   title={AST: Audio Spectrogram Transformer},
+   author={Yuan Gong and Yu-An Chung and James Glass},
+   year={2021},
+   eprint={2104.01778},
+   archivePrefix={arXiv},
+   primaryClass={cs.SD}
+ }
+
+ @misc{nam2021heavily,
+   title={Heavily Augmented Sound Event Detection utilizing Weak Predictions},
+   author={Hyeonuk Nam and Byeong-Yun Ko and Gyeong-Tae Lee and Seong-Hu Kim and Won-Ho Jung and Sang-Min Choi and Yong-Hwa Park},
+   year={2021},
+   eprint={2107.03649},
+   archivePrefix={arXiv},
+   primaryClass={eess.AS}
+ }
```

</details>