Background
Hugging Face (http://www.huggingface.co) announced a challenge to build the best ASR (Automatic Speech Recognition) model for any language. The goal of the competition is to promote the wav2vec2 model for languages other than English, showing the quality of ASR you can achieve with only a limited amount of transcribed speech. Norwegian is such a language: there is only a limited amount of training data for Norwegian ASR.
Luckily, Språkbanken (https://www.nb.no/sprakbanken/) at the National Library of Norway (NLN) has released the Norwegian Parliament Speech Collection (NPSC) under a CC0 license. Even so, a lot of effort had to be put in to turn it into a dataset on Hugging Face. The dataset contains over 140 hours of high-quality transcribed speeches from the Norwegian parliament.
NbAiLab is participating in the challenge and is trying to build models for Norwegian. Since acoustic data for Norwegian is scarce, our first step was to build a good training set. From the NPSC released by Språkbanken (https://www.nb.no/sprakbanken/), we made an openly available training set for ASR models on Hugging Face.
The Experiment
The ASR challenge is about finetuning the wav2vec2 model (ref) for speech recognition in specific languages. Our experiment was to finetune the model for the two official written standards of Norwegian (Bokmål and Nynorsk). Our goal is to create a model for Norwegian that performs on the same level as models for high-resource languages such as English.
Models
We have built separate models for Bokmål and Nynorsk, preferring smaller models over larger ones where possible. We used the following baseline models for finetuning:
- XLSR_1B, XLSR model with 1 billion parameters
- XLSR_300M, XLSR model with only 300 million parameters
- VoxRex_300M, the VoxRex model from KB, Sweden (ref)
The acoustic models have been extended with a KenLM language model, a 5-gram model used during decoding to select words for the output. No language model has been applied after the words have been produced.
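To make the decoding step concrete: the acoustic model emits one symbol per audio frame (including a CTC blank token), and decoding first collapses repeats and removes blanks before any language model is consulted. The sketch below shows only this greedy CTC collapse in plain Python; in the actual pipeline a beam search with the 5-gram KenLM model scores candidate word sequences instead of taking the greedy path. The symbol inventory and example frames are made up for illustration.

```python
# Minimal sketch of greedy CTC decoding (illustrative only).
# The blank symbol and the example frames below are assumptions,
# not the actual tokenizer output of the wav2vec2 models.
BLANK = "_"

def ctc_greedy_collapse(frame_symbols):
    """Collapse consecutive repeats, then drop CTC blank tokens."""
    out = []
    prev = None
    for s in frame_symbols:
        if s != prev:          # merge consecutive repeats
            if s != BLANK:     # drop blank tokens
                out.append(s)
        prev = s
    return "".join(out)

# Frame-level output for the word "hei" (Norwegian for "hi"):
frames = ["h", "h", BLANK, "e", "e", BLANK, BLANK, "i"]
print(ctc_greedy_collapse(frames))  # hei
```

Note that the blank token is what lets CTC represent genuinely doubled letters: `["l", BLANK, "l"]` decodes to "ll", while `["l", "l"]` collapses to a single "l".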
Results
These are the results from testing our best models:
XLSR_1B
- Trained on Bokmål and Nynorsk, as separate models
- Includes a KenLM language model based on the ADMIN corpus from NCC.
- The Bokmål model achieved a Word Error Rate of 6.65 and a Character Error Rate of 2.53
- The Nynorsk model achieved a Word Error Rate of 13.35 and a Character Error Rate of 4.54
XLSR_300M
- Trained on Bokmål and Nynorsk, as separate models
- Includes a KenLM language model based on the ADMIN corpus from NCC.
- The Bokmål model achieved a Word Error Rate of 7.12 and a Character Error Rate of 2.82
- The Nynorsk model achieved a Word Error Rate of 12.22 and a Character Error Rate of 4.19
VoxRex_300M
- Trained on Bokmål and Nynorsk, as separate models
- Includes a KenLM language model based on the ADMIN corpus from NCC.
- The Bokmål model achieved a Word Error Rate of 7.55 and a Character Error Rate of 2.66
- The Nynorsk model achieved a Word Error Rate of 7.55 and a Character Error Rate of 4.1
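The Word Error Rate and Character Error Rate reported above are normalized edit distances: the Levenshtein distance between the reference and the model output, divided by the reference length, counted over words or characters respectively. A minimal sketch (assuming, as is conventional, that the numbers are percentages; the example sentence is made up and not from the NPSC test set):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, O(len(ref)*len(hyp))."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word Error Rate in percent."""
    ref_words = ref.split()
    return 100 * edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character Error Rate in percent."""
    return 100 * edit_distance(ref, hyp) / len(ref)

# "samla" for "samlet" is one wrong word out of three,
# but only two wrong characters out of twenty.
ref, hyp = "stortinget er samlet", "stortinget er samla"
print(round(wer(ref, hyp), 2))  # 33.33
print(round(cer(ref, hyp), 2))  # 10.0
```

As the example shows, CER is typically much lower than WER for the same output, which matches the pattern in the tables above.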