Use case: NB AI Lab Hugging Face speech-to-text challenge

Background

Hugging Face (http://www.huggingface.co) announced a challenge to build the best ASR (automatic speech recognition) model for any language. The aim of the competition is to promote the wav2vec2 model for languages other than English, showing the ASR quality that can be achieved with a limited amount of transcribed speech. Norwegian is such a language: there is only a limited amount of ASR training data for Norwegian.

Luckily, Språkbanken (https://www.nb.no/sprakbanken/) at the National Library of Norway (NLN) has released the Norwegian Parliamentary Speech Corpus (NPSC) under CC0. Even so, a lot of effort had to be put in to make it into a dataset on Hugging Face. The dataset contains over 140 hours of high-quality transcribed speech from the national parliament.

NB AI Lab is participating in the challenge and trying to build models for Norwegian. Since acoustic language data is somewhat scarce for Norwegian, our first step was to build a good Norwegian training set. Thanks to Språkbanken (https://www.nb.no/sprakbanken/), we found the Norwegian Parliamentary Speech Corpus (NPSC). From this source we made an openly available training set for ASR models on Hugging Face.

The Experiment

The ASR challenge is about fine-tuning the wav2vec2 model (ref) for speech recognition in specific languages. Our experiment was to fine-tune the model on the two Norwegian languages (Bokmål and Nynorsk). Our goal is to create a model for Norwegian that performs on the same level as models for

Models

We have built separate models for Bokmål and Nynorsk, and have tried to prefer smaller models over larger ones. We used the following baseline models for fine-tuning:

The acoustic models have been extended with a KenLM language model, a 5-gram model used during decoding to select words from the output. No language model has been applied after the words have been produced.
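To illustrate how an n-gram language model can help select words, here is a minimal sketch in plain Python. It is not the KenLM setup used for our models: it uses a toy bigram model with add-one smoothing instead of a 5-gram model, and it rescores whole candidate transcripts rather than working inside the decoder. The corpus, the candidate transcripts, their acoustic scores, and the weight alpha are all made up for the example.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def lm_log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def pick_best(candidates, unigrams, bigrams, alpha=0.5):
    """Return the candidate text with the best combined score.

    candidates: list of (text, acoustic_log_prob) pairs (hypothetical scores).
    alpha: weight of the language model score (an assumed value).
    """
    def combined(candidate):
        text, acoustic_log_prob = candidate
        return acoustic_log_prob + alpha * lm_log_prob(text, unigrams, bigrams)

    return max(candidates, key=combined)[0]


# Toy text corpus standing in for the language model training data.
corpus = ["jeg er enig", "jeg er ikke enig", "det er bra"]
unigrams, bigrams = train_bigram(corpus)

# The acoustic model slightly prefers the misspelled candidate,
# but the language model tips the balance towards the real word.
candidates = [("jeg er enig", -4.2), ("jeg er eni", -4.0)]
best = pick_best(candidates, unigrams, bigrams)
print(best)
```

In this toy case the second candidate has the better acoustic score, but the language model penalises the unseen word enough that the first candidate is selected.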

Results

These are the best results from testing our best models:

XLSR_1B

XLSR_300M

VoxRex_300M
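ASR results are commonly reported as word error rate (WER). The metric is not stated in this text, so the following is only a sketch of how WER could be computed, assuming whitespace tokenisation and no text normalisation. The function name and the example sentences are made up for the illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is missing from the hypothesis.
print(word_error_rate("jeg er ikke enig", "jeg er enig"))
```

A lower value is better; 0.0 means the hypothesis matches the reference word for word.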