
Commit a52ec4a
release v0.2.0
1 parent: 7b8d1b5
28 files changed (+1405, -430 lines)

README.md (149 additions, 35 deletions)
@@ -1,11 +1,17 @@
-# <div align="center">Audio Tagging & Sound Event Detection in PyTorch</div>
+# <div align="center">Audio Classification, Tagging & Sound Event Detection in PyTorch</div>
 
 Progress:
 
-- [ ] Mixup Augmentation
-- [ ] Random Noise Augmentation
-- [ ] Spec Augmentation
-- [ ] SED fine tuning
+- [x] Fine-tune on audio classification
+- [ ] Fine-tune on audio tagging
+- [ ] Fine-tune on sound event detection
+- [x] Add tagging metrics
+- [ ] Add Tutorial
+- [x] Add Augmentation Notebook
+- [ ] Add more schedulers
+- [ ] Add FSDKaggle2019 dataset
+- [ ] Add MTT dataset
+- [ ] Add DESED
 
 
 ## <div align="center">Model Zoo</div>
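
One of the checked progress items above is "Add tagging metrics"; the Model Zoo tables below report mAP, AUC and d-prime for the tagging models. As a rough editorial illustration (not part of this commit), these multi-label metrics can be computed per class from sigmoid scores and then averaged; `y_true` and `y_score` are hypothetical `(num_clips, num_classes)` arrays:

```python
import numpy as np
from scipy import stats
from sklearn import metrics

def tagging_metrics(y_true: np.ndarray, y_score: np.ndarray):
    """y_true: binary labels, y_score: sigmoid scores, both (num_clips, num_classes)."""
    ap = metrics.average_precision_score(y_true, y_score, average=None)  # per-class AP
    auc = metrics.roc_auc_score(y_true, y_score, average=None)           # per-class ROC-AUC
    d_prime = np.sqrt(2.0) * stats.norm.ppf(auc)                         # d' = sqrt(2) * inverse-normal(AUC)
    return ap.mean(), auc.mean(), d_prime.mean()
```
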
@@ -26,19 +32,38 @@ CNN14_DecisionLevelMax | SED | 38.5 | 32 | 1024 | 64 | 14k | [download][cnn14max
 
 </details>
 
-> Note: These are the pretrained models from [audioset-tagging-cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn). Check out this official repo if you want to train on audioset. Training on audioset will not be supported in this repo due to resource constraints.
+> Note: These models are used as pretrained models for the fine-tuning tasks below. Check out [audioset-tagging-cnn](https://github.com/qiuqiangkong/audioset_tagging_cnn) if you want to train on the AudioSet dataset.
 
-[esc50cnn14]: https://drive.google.com/file/d/1oYFws7hvGtothbnzf1vDtK4dQ5sjbgR2/view?usp=sharing
+[esc50cnn14]: https://drive.google.com/file/d/1itN-WyEL6Wp_jVBlld6vLaj47UWL2JaP/view?usp=sharing
+[fsd2018]: https://drive.google.com/file/d/1KzKd4icIV2xF7BdW9EZpU9BAZyfCatrD/view?usp=sharing
+[scv1]: https://drive.google.com/file/d/1Mc4UxHOEvaeJXKcuP4RiTggqZZ0CCmOB/view?usp=sharing
 
 <details open>
-<summary><strong>Fine-tuned Models</strong></summary>
+<summary><strong>Fine-tuned Classification Models</strong></summary>
 
-Model | Task | Dataset | Accuracy<br><sup>(%) | Sample Rate <br><sup>(kHz) | Window Length | Num Mels | Fmax | Weights
---- | --- | --- | --- | --- | --- | --- | --- | ---
-CNN14 | Tagging | ESC50<br>(Fold-5) | 94.75<br>(no aug) | 32 | 1024 | 64 | 14k | [download][esc50cnn14]
-CNN14 | Tagging | FSDKaggle2018<br>(val) | ? | 32 | 1024 | 64 | 14k | -
-CNN14 | Tagging | UrbandSound8k<br>(Fold-10) | ? | 32 | 1024 | 64 | 14k | -
-CNN14 | Tagging | SpeechCommandsv1<br>(val/test) | ? | 32 | 1024 | 64 | 14k | -
+Model | Dataset | Accuracy<br><sup>(%) | Sample Rate <br><sup>(kHz) | Weights
+--- | --- | --- | --- | ---
+CNN14 | ESC50 (Fold-5) | 95.75 | 32 | [download][esc50cnn14]
+CNN14 | FSDKaggle2018 (test) | 93.56 | 32 | [download][fsd2018]
+CNN14 | SpeechCommandsv1 (val/test) | 96.60/96.77 | 32 | [download][scv1]
+
+</details>
+
+<details>
+<summary><strong>Fine-tuned Tagging Models</strong></summary>
+
+Model | Dataset | mAP(%) | AUC | d-prime | Sample Rate <br><sup>(kHz) | Config | Weights
+--- | --- | --- | --- | --- | --- | --- | ---
+CNN14 | FSDKaggle2019 | - | - | - | 32 | - | -
+
+</details>
+
+<details>
+<summary><strong>Fine-tuned SED Models</strong></summary>
+
+Model | Dataset | F1 | Sample Rate <br><sup>(kHz) | Config | Weights
+--- | --- | --- | --- | --- | ---
+CNN14_DecisionLevelMax | DESED | - | 32 | - | -
 
 </details>
 
@@ -47,20 +72,29 @@ CNN14 | Tagging | SpeechCommandsv1<br>(val/test) | ? | 32 | 1024 | 64 | 14k | -
 ## <div align="center">Supported Datasets</div>
 
 [esc50]: https://github.com/karolpiczak/ESC-50
-[fsdkaggle]: https://zenodo.org/record/2552860
+[fsdkaggle2018]: https://zenodo.org/record/2552860
+[fsdkaggle2019]: https://zenodo.org/record/3612637
 [audioset]: https://research.google.com/audioset/
 [urbansound8k]: https://urbansounddataset.weebly.com/urbansound8k.html
 [speechcommandsv1]: https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html
 [speechcommandsv2]: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
+[mtt]: https://github.com/keunwoochoi/magnatagatune-list
+[desed]: https://project.inria.fr/desed/
 
-Dataset | Type | Classes | Train | Val | Test | Audio Length | Audio Spec | Size
+Dataset | Task | Classes | Train | Val | Test | Audio Length | Audio Spec | Size
 --- | --- | --- | --- | --- | --- | --- | --- | ---
-[ESC-50][esc50] | Environmental | 50 | 2,000 | 5 folds | - | 5s | 44.1kHz, mono | 600MB
-[UrbanSound8k][urbansound8k] | Urban | 10 | 8,732 | 10 folds | - | <=4s | Vary | 5.6GB
-[FSDKaggle2018][fsdkaggle] | - | 41 | 9,473 | 1,600 | - | 300ms~30s | 44.1kHz, mono | 4.6GB
-[SpeechCommandsv1][speechcommandsv1] | Words | 30 | 51,088 | 6,798 | 6,835 | <=1s | 16kHz, mono | 1.4GB
-[SpeechCommandsv2][speechcommandsv2] | Words | 35 | 84,843 | 9,981 | 11,005 | <=1s | 16kHz, mono | 2.3GB
+[ESC-50][esc50] | Classification | 50 | 2,000 | 5 folds | - | 5s | 44.1kHz, mono | 600MB
+[UrbanSound8k][urbansound8k] | Classification | 10 | 8,732 | 10 folds | - | <=4s | Vary | 5.6GB
+[FSDKaggle2018][fsdkaggle2018] | Classification | 41 | 9,473 | - | 1,600 | 300ms~30s | 44.1kHz, mono | 4.6GB
+[SpeechCommandsv1][speechcommandsv1] | Classification | 30 | 51,088 | 6,798 | 6,835 | <=1s | 16kHz, mono | 1.4GB
+[SpeechCommandsv2][speechcommandsv2] | Classification | 35 | 84,843 | 9,981 | 11,005 | <=1s | 16kHz, mono | 2.3GB
+||
+[FSDKaggle2019][fsdkaggle2019]* | Tagging | 80 | 4,970+19,815 | - | 4,481 | 300ms~30s | 44.1kHz, mono | 24GB
+[MTT][mtt]* | Tagging | 50 | 19,000 | - | - | - | - | 3GB
+||
+[DESED][desed]* | SED | 10 | - | - | - | 10s | - | -
 
+> Notes: `*` datasets are not available yet. Classification datasets are treated as multi-class/single-label classification; tagging and SED datasets are treated as multi-label classification.
 
 <details>
 <summary><strong>Dataset Structure</strong> (click to expand)</summary>
@@ -93,27 +127,64 @@ datasets
 
 </details>
 
+<br>
+<details>
+<summary><strong>Augmentations</strong> (click to expand)</summary>
+
+Currently, the following augmentations are supported. More will be added in the future. You can test the effects of augmentations with this [notebook](./datasets/aug_test.ipynb).
+
+Waveform Augmentations:
+
+- [x] MixUp
+- [x] Background Noise
+- [x] Gaussian Noise
+- [x] Fade In/Out
+- [x] Volume
+- [ ] CutMix
+
+Spectrogram Augmentations:
+
+- [x] Time Masking
+- [x] Frequency Masking
+- [x] Filter Augmentation
+
+</details>
+
 ---
 
 ## <div align="center">Usage</div>
 
+<details>
+<summary><strong>Requirements</strong> (click to expand)</summary>
+
+* python >= 3.6
+* pytorch >= 1.8.1
+* torchaudio >= 0.8.1
+
+Other requirements can be installed with `pip install -r requirements.txt`.
+
+</details>
+
+<br>
 <details>
 <summary><strong>Configuration</strong> (click to expand)</summary>
 
-Create a configuration file in `configs`. Sample configuration for ImageNet dataset can be found [here](configs/tagging.yaml). Then edit the fields you think if it is needed. This configuration file is needed for all of training, evaluation and prediction scripts.
+* Create a configuration file in [configs](./configs/). A sample configuration for the ESC50 dataset can be found [here](configs/esc50.yaml).
+* Copy its contents and edit the fields as needed.
+* This configuration file is used by all of the training, evaluation and prediction scripts.
 
 </details>
 <br>
 <details>
 <summary><strong>Training</strong> (click to expand)</summary>
 
-Train with 1 GPU:
+To train with a single GPU:
 
 ```bash
 $ python tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
 ```
 
-Train with 2 GPUs:
+To train with multiple GPUs, set the `DDP` field in the config file to `true` and run as follows:
 
 ```bash
 $ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
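
As an aside on the waveform MixUp listed in the new Augmentations section above: the idea is to blend two clips and their targets with a Beta-sampled weight. A minimal sketch under that reading, not the repository's actual implementation (the `alpha` default simply mirrors `MIXUP_ALPHA` in the configs added below):

```python
import torch

def mixup(waveforms: torch.Tensor, targets: torch.Tensor, alpha: float = 10.0):
    """Mix a batch of waveforms (B, T) with their one-/multi-hot targets (B, C)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing weight in (0, 1)
    perm = torch.randperm(waveforms.size(0))                      # pair each clip with another clip
    mixed_waves = lam * waveforms + (1.0 - lam) * waveforms[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_waves, mixed_targets
```
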
@@ -128,33 +199,36 @@ $ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py
 Make sure to set `MODEL_PATH` of the configuration file to your trained model directory.
 
 ```bash
-$ python tools/val.py --cfg configs/CONFIG_FILE_NAME.yaml
+$ python tools/val.py --cfg configs/CONFIG_FILE.yaml
 ```
 
 </details>
 
 <br>
 <details open>
-<summary><strong>Audio Tagging Inference</strong></summary>
+<summary><strong>Audio Classification/Tagging Inference</strong></summary>
 
 * Set `MODEL_PATH` of the configuration file to your model's trained weights.
 * Change the dataset name in `DATASET` >> `NAME` to match your trained model's dataset.
 * Set the testing audio file path in `TEST` >> `FILE`.
 * Run the following command.
 
 ```bash
-$ python tools/tagging_infer.py --cfg configs/TAGGING_CONFIG_FILE.yaml
+$ python tools/infer.py --cfg configs/CONFIG_FILE.yaml
+
+## for example
+$ python tools/infer.py --cfg configs/audioset.yaml
 ```
 You will get an output similar to this:
 
 ```bash
 Class                  Confidence
 ---------------------- ------------
-Speech                 0.893114
-Telephone bell ringing 0.754014
-Inside, small room     0.235118
-Telephone              0.182611
-Music                  0.0922332
+Speech                 0.897762
+Telephone bell ringing 0.752206
+Telephone              0.219329
+Inside, small room     0.20761
+Music                  0.0770325
 ```
 
 </details>
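
The class/confidence listing above is a top-k readout of per-class (sigmoid) probabilities. A hypothetical sketch of producing such a table; `clipwise_probs` and `labels` are illustrative names, not the repository's API:

```python
from typing import List

import torch

def topk_table(clipwise_probs: torch.Tensor, labels: List[str], k: int = 5) -> str:
    """clipwise_probs: (num_classes,) sigmoid scores for a single audio clip."""
    conf, idx = torch.topk(clipwise_probs, k)  # k most confident classes
    rows = [f"{labels[i]:<23}{c:.6f}" for i, c in zip(idx.tolist(), conf.tolist())]
    return "\n".join(rows)
```
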
@@ -169,8 +243,12 @@ Music 0.0922332
 * Run the following command.
 
 ```bash
-$ python tools/sed_infer.py --cfg configs/SED_CONFIG_FILE.yaml
+$ python tools/sed_infer.py --cfg configs/CONFIG_FILE.yaml
+
+## for example
+$ python tools/sed_infer.py --cfg configs/audioset_sed.yaml
 ```
+
 You will get an output similar to this:
 
 ```bash
@@ -180,7 +258,7 @@ Speech 2.2 7
 Telephone bell ringing 0 2.5
 ```
 
-If you set `TEST` >> `PLOT` to `true`, the following plot will also show:
+The following plot will also be shown if you set `PLOT` to `true`:
 
 ![sed_result](./assests/sed_result.png)
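
The event table and the plot above come from thresholding the SED model's framewise probabilities and reading off onset/offset times. A rough editorial sketch of that post-processing step, assuming hypothetical inputs; the `threshold`, `hop_length` and `sample_rate` defaults mirror the new `audioset_sed.yaml` below, but the function itself is not the repository's code:

```python
import numpy as np

def probs_to_events(framewise_probs: np.ndarray, labels, threshold: float = 0.2,
                    hop_length: int = 320, sample_rate: int = 32000):
    """framewise_probs: (num_frames, num_classes) sigmoid scores for one clip."""
    seconds_per_frame = hop_length / sample_rate
    events = []
    for c, name in enumerate(labels):
        active = framewise_probs[:, c] >= threshold                # per-frame activity mask
        edges = np.diff(active.astype(int), prepend=0, append=0)   # rising/falling edges
        onsets, offsets = np.where(edges == 1)[0], np.where(edges == -1)[0]
        for on, off in zip(onsets, offsets):
            events.append((name, on * seconds_per_frame, off * seconds_per_frame))
    return events
```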
@@ -190,8 +268,44 @@ If you set `TEST` >> `PLOT` to `true`, the following plot will also show:
 <details>
 <summary><strong>References</strong> (click to expand)</summary>
 
+* https://github.com/qiuqiangkong/audioset_tagging_cnn
+* https://github.com/YuanGongND/ast
+* https://github.com/frednam93/FilterAugSED
+* https://github.com/lRomul/argus-freesound
+
+</details>
+
+<br>
+<details>
+<summary><strong>Citations</strong> (click to expand)</summary>
+
 ```
-[1] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. "Panns: Large-scale pretrained audio neural networks for audio pattern recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 2880-2894
+@misc{kong2020panns,
+  title={PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition},
+  author={Qiuqiang Kong and Yin Cao and Turab Iqbal and Yuxuan Wang and Wenwu Wang and Mark D. Plumbley},
+  year={2020},
+  eprint={1912.10211},
+  archivePrefix={arXiv},
+  primaryClass={cs.SD}
+}
+
+@misc{gong2021ast,
+  title={AST: Audio Spectrogram Transformer},
+  author={Yuan Gong and Yu-An Chung and James Glass},
+  year={2021},
+  eprint={2104.01778},
+  archivePrefix={arXiv},
+  primaryClass={cs.SD}
+}
+
+@misc{nam2021heavily,
+  title={Heavily Augmented Sound Event Detection utilizing Weak Predictions},
+  author={Hyeonuk Nam and Byeong-Yun Ko and Gyeong-Tae Lee and Seong-Hu Kim and Won-Ho Jung and Sang-Min Choi and Yong-Hwa Park},
+  year={2021},
+  eprint={2107.03649},
+  archivePrefix={arXiv},
+  primaryClass={eess.AS}
+}
 ```
 
 </details>

assests/noises/voices.wav

156 KB
Binary file not shown.

configs/audioset.yaml (49 additions, 0 deletions)
@@ -0,0 +1,49 @@
+DEVICE: cpu             # device used for training
+
+MODEL:
+  NAME: cnn14           # name of the model you are using
+  PRETRAINED: ''
+
+DATASET:
+  NAME: audioset        # dataset name
+  ROOT: ''              # dataset root path
+  METRIC: mAP
+  SAMPLE_RATE: 32000
+  AUDIO_LENGTH: 5
+  WIN_LENGTH: 1024
+  HOP_LENGTH: 320
+  N_MELS: 64
+  FMIN: 50
+  FMAX: 14000
+
+AUG:
+  MIXUP: 0.0
+  MIXUP_ALPHA: 10
+  SMOOTHING: 0.1
+  TIME_MASK: 96
+  FREQ_MASK: 24
+
+TRAIN:
+  EPOCHS: 100           # number of epochs to train
+  EVAL_INTERVAL: 10     # interval to evaluate the model during training
+  BATCH_SIZE: 16        # batch size used to train
+  LOSS: bcelogits       # loss function name (ce, bce, bcelogits, label_smooth, soft_target)
+  AMP: true             # use Automatic Mixed Precision training or not
+  DDP: false
+  SAVE_DIR: 'output'    # output folder name used for saving the trained model and logs
+
+OPTIMIZER:
+  NAME: adamw
+  LR: 0.0001            # initial learning rate used in optimizer
+  WEIGHT_DECAY: 0.001   # decay rate used in optimizer
+
+SCHEDULER:
+  NAME: steplr
+  PARAMS: [30, 0.1]
+
+
+TEST:
+  MODE: file            # inference mode (file, mic)
+  FILE: 'assests/test.wav'              # audio file name (not used if you choose MODE=mic)
+  MODEL_PATH: 'checkpoints/cnn14.pth'   # trained model path
+  TOPK: 5
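
For orientation only (again, an editorial note rather than part of the commit): a config like the one above can be loaded with PyYAML, and its `DATASET` fields map naturally onto a torchaudio mel-spectrogram front end. How the repository itself consumes these fields may differ:

```python
import yaml
import torchaudio.transforms as T

with open("configs/audioset.yaml") as f:
    cfg = yaml.safe_load(f)             # nested dict mirroring the YAML above

ds = cfg["DATASET"]
mel = T.MelSpectrogram(
    sample_rate=ds["SAMPLE_RATE"],      # 32000
    n_fft=ds["WIN_LENGTH"],             # 1024 (window length used as FFT size here)
    win_length=ds["WIN_LENGTH"],
    hop_length=ds["HOP_LENGTH"],        # 320
    f_min=ds["FMIN"],                   # 50
    f_max=ds["FMAX"],                   # 14000
    n_mels=ds["N_MELS"],                # 64
)
```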

configs/audioset_sed.yaml (50 additions, 0 deletions)
@@ -0,0 +1,50 @@
+DEVICE: cpu             # device used for training
+
+MODEL:
+  NAME: cnn14decisionlevelmax   # name of the model you are using
+  PRETRAINED: ''
+
+DATASET:
+  NAME: audioset        # dataset name
+  ROOT: ''              # dataset root path
+  METRIC: mAP
+  SAMPLE_RATE: 32000
+  AUDIO_LENGTH: 5
+  WIN_LENGTH: 1024
+  HOP_LENGTH: 320
+  N_MELS: 64
+  FMIN: 50
+  FMAX: 14000
+
+AUG:
+  MIXUP: 0.0
+  MIXUP_ALPHA: 10
+  SMOOTHING: 0.1
+  TIME_MASK: 96
+  FREQ_MASK: 24
+
+TRAIN:
+  EPOCHS: 100           # number of epochs to train
+  EVAL_INTERVAL: 10     # interval to evaluate the model during training
+  BATCH_SIZE: 16        # batch size used to train
+  LOSS: bcelogits       # loss function name (ce, bce, bcelogits, label_smooth, soft_target)
+  AMP: true             # use Automatic Mixed Precision training or not
+  DDP: false
+  SAVE_DIR: 'output'    # output folder name used for saving the trained model and logs
+
+OPTIMIZER:
+  NAME: adamw
+  LR: 0.0001            # initial learning rate used in optimizer
+  WEIGHT_DECAY: 0.001   # decay rate used in optimizer
+
+SCHEDULER:
+  NAME: steplr
+  PARAMS: [30, 0.1]
+
+
+TEST:
+  MODE: file            # inference mode (file, mic)
+  FILE: 'assests/test.wav'              # audio file name (not used if you choose MODE=mic)
+  MODEL_PATH: 'checkpoints/cnn14_decisionlevelmax.pth'   # trained model path
+  THRESHOLD: 0.2
+  PLOT: false
