Background
Hugging Face (http://www.huggingface.co) announced a challenge to build the best ASR (Automatic Speech Recognition) model for any language. The goal of the competition is to promote the wav2vec2 model for languages other than English, showing the quality of ASR you can achieve with only a limited amount of transcribed speech. Norwegian is such a language: there is only a limited amount of training data for Norwegian ASR.
Luckily, Språkbanken (https://www.nb.no/sprakbanken/) at the National Library of Norway (NLN) has released the Norwegian Parliament Speech Collection (NPSC) under a CC0 license. Even so, a lot of effort had to be put in to turn it into a dataset on Hugging Face. The dataset contains over 140 hours of high-quality transcribed speeches from the Norwegian parliament.
NbAiLab is participating in the challenge and is trying to build models for Norwegian. Since acoustic data for Norwegian is scarce, our first step was to build a good training set. From the NPSC released by Språkbanken (https://www.nb.no/sprakbanken/), we made an openly available training set for ASR models on Hugging Face.
The Experiment
The ASR challenge is about finetuning the wav2vec2 model (ref) for speech recognition in specific languages. Our experiment was to finetune the model for the two official written standards of Norwegian (Bokmål and Nynorsk). Our goal is to create a model for Norwegian that performs on the same level as models for high-resource languages such as English.
Models
We have built separate models for Bokmål and Nynorsk, preferring smaller models over larger ones where possible. We used the following baseline models for finetuning:
- XLSR_1B, XLSR model with 1 billion parameters
- XLSR_300M, XLSR model with only 300 million parameters
- VoxRex_300M, the VoxRex model from KB, Sweden (ref)
The acoustic models have been extended with a KenLM language model, a 5-gram model used during decoding to select words for the output. No language model has been applied after the words have been produced.
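To make the decoding step concrete: the acoustic model emits one symbol per audio frame (including a CTC blank token), and decoding first collapses repeats and removes blanks before any language model is consulted. The sketch below shows only this greedy CTC collapse in plain Python; in the actual pipeline a beam search with the 5-gram KenLM model scores candidate word sequences instead of taking the greedy path. The symbol inventory and example frames are made up for illustration.

```python
# Minimal sketch of greedy CTC decoding (illustrative only).
# The blank symbol and the example frames below are assumptions,
# not the actual tokenizer output of the wav2vec2 models.
BLANK = "_"

def ctc_greedy_collapse(frame_symbols):
    """Collapse consecutive repeats, then drop CTC blank tokens."""
    out = []
    prev = None
    for s in frame_symbols:
        if s != prev:          # merge consecutive repeats
            if s != BLANK:     # drop blank tokens
                out.append(s)
        prev = s
    return "".join(out)

# Frame-level output for the word "hei" (Norwegian for "hi"):
frames = ["h", "h", BLANK, "e", "e", BLANK, BLANK, "i"]
print(ctc_greedy_collapse(frames))  # hei
```

Note that the blank token is what lets CTC represent genuinely doubled letters: `["l", BLANK, "l"]` decodes to "ll", while `["l", "l"]` collapses to a single "l".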
Results
These are the results from testing our best models:
XLSR_1B
- Trained on Bokmål and Nynorsk, as separate models
- Includes a KenLM language model based on the ADMIN corpus from NCC.
- The Bokmål model achieved a Word Error Rate of 6.65 and a Character Error Rate of 2.53
- The Nynorsk model achieved a Word Error Rate of 13.35 and a Character Error Rate of 4.54
XLSR_300M
- Trained on Bokmål and Nynorsk, as separate models
- Includes a KenLM language model based on the ADMIN corpus from NCC.
- The Bokmål model achieved a Word Error Rate of 7.12 and a Character Error Rate of 2.82
- The Nynorsk model achieved a Word Error Rate of 12.22 and a Character Error Rate of 4.19
VoxRex_300M
- Trained on Bokmål and Nynorsk, as separate models
- Includes a KenLM language model based on the ADMIN corpus from NCC.
- The Bokmål model achieved a Word Error Rate of 7.55 and a Character Error Rate of 2.66
- The Nynorsk model achieved a Word Error Rate of 7.55 and a Character Error Rate of 4.1
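The Word Error Rate and Character Error Rate reported above are normalized edit distances: the Levenshtein distance between the reference and the model output, divided by the reference length, counted over words or characters respectively. A minimal sketch (assuming, as is conventional, that the numbers are percentages; the example sentence is made up and not from the NPSC test set):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, O(len(ref)*len(hyp))."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word Error Rate in percent."""
    ref_words = ref.split()
    return 100 * edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character Error Rate in percent."""
    return 100 * edit_distance(ref, hyp) / len(ref)

# "samla" for "samlet" is one wrong word out of three,
# but only two wrong characters out of twenty.
ref, hyp = "stortinget er samlet", "stortinget er samla"
print(round(wer(ref, hyp), 2))  # 33.33
print(round(cer(ref, hyp), 2))  # 10.0
```

As the example shows, CER is typically much lower than WER for the same output, which matches the pattern in the tables above.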