The NB AI-lab trains models for various purposes. Most of them are built on NB’s digital collection.
Pre-trained in a self-supervised fashion on enormous datasets, modern language models can have their weights adjusted for specific supervised downstream tasks at a fraction of the cost of the original training, with astonishing results. The NB AI-lab has released some of the best-performing text models for Norwegian and other Scandinavian languages to date:
- NB-BERT-base. NB-BERT-base is a general BERT-base (12 layers) model built on the large digital collection at the National Library of Norway.
- NB-BERT-large. NB-BERT-large is a general BERT-large (24 layers) model built on the same material as the base version.
These encoder-only models share the architecture of the cased multilingual BERT models and are trained on a wide variety of Norwegian text (both Bokmål and Nynorsk) from the last 200 years.
NB-GPT-J-6B is a Norwegian fine-tuned version of GPT-J 6B, a decoder-only transformer model trained using Mesh Transformer JAX. “GPT-J” refers to the class of model, while “6B” represents the number of trainable parameters (6 billion). It has been trained on a mixture of library data and Internet data. It can generate text from a prompt, and can even solve some tasks in zero-shot and few-shot settings.
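Few-shot use means placing a handful of solved examples in the prompt and letting the model continue the pattern. Below is a minimal sketch of how such a prompt might be assembled. The helper `build_few_shot_prompt`, the `Tekst`/`Sentiment` field names, and the example sentences are illustrative assumptions, not a format prescribed by NB-GPT-J-6B.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    input_label: str = "Tekst",       # hypothetical field name
    output_label: str = "Sentiment",  # hypothetical field name
) -> str:
    """Join solved examples and one open query into a single prompt string."""
    blocks = [
        f"{input_label}: {text}\n{output_label}: {answer}"
        for text, answer in examples
    ]
    # The final block leaves the answer empty so the model completes it.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


# Few-shot: two solved examples precede the query.
few_shot = build_few_shot_prompt(
    [("Filmen var fantastisk.", "positiv"),
     ("Maten var kald og kjedelig.", "negativ")],
    "Boka var veldig god.",
)
# Zero-shot is the same call with an empty example list.
zero_shot = build_few_shot_prompt([], "Boka var veldig god.")
print(few_shot)
```

The resulting string would be passed to the model as the generation prompt, and the generated continuation would be read as the answer.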
Text-to-Text Transfer Transformers (T5) are sequence-to-sequence models designed for tasks that transform one sequence of text into another, such as translation or text normalization. The framework is versatile enough to also handle classification and even regression.
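To make the text-to-text framing concrete, here is a small sketch of how translation, classification, and regression examples can all be reduced to pairs of input and target strings. The task prefixes and helper names are hypothetical. Rounding regression scores to steps of 0.2 and emitting them as text follows the approach the original T5 work describes for a sentence-similarity benchmark; applying it here is an assumption.

```python
from typing import Tuple


def as_translation(source: str, target: str) -> Tuple[str, str]:
    # Hypothetical prefix; any consistent task prefix would work.
    return (f"oversett bokmål til nynorsk: {source}", target)


def as_classification(text: str, label: str) -> Tuple[str, str]:
    # The class label is simply emitted as a word.
    return (f"klassifiser sentiment: {text}", label)


def as_regression(text_a: str, text_b: str, score: float) -> Tuple[str, str]:
    # Round to steps of 0.2 so the number becomes one of a small set of strings.
    bucket = round(score * 5) / 5
    return (f"likhet setning1: {text_a} setning2: {text_b}", f"{bucket:.1f}")


pairs = [
    as_translation("Jeg heter Kari.", "Eg heiter Kari."),
    as_classification("Filmen var fantastisk.", "positiv"),
    as_regression("En mann spiller gitar.", "En mann spiller et instrument.", 3.14),
]
for model_input, model_target in pairs:
    print(repr(model_input), "->", repr(model_target))
```

Because every task ends up as plain strings on both sides, the same model and training loop can serve all three.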
We are currently evaluating T5 models trained on Norwegian text data and have plans to release them before the end of the year.
The NB AI-lab has now released NB-Whisper Beta for Norwegian Bokmål and Nynorsk. The model is based on Whisper from OpenAI and trained on data from Språkbanken and the digital collection at the National Library of Norway. The training data includes:
- Transcribed speeches from the Norwegian Parliament produced by Språkbanken
- TV broadcast (NRK) subtitles (NLN digital collection)
- Audiobooks (NLN digital collection)
NB-Whisper Beta is available on Hugging Face. The NB AI-lab would welcome feedback on your experience using this model in various contexts.
There is a demo of the model which you can play with on this page.
Access and use
You may access the model and the model card on this page on Hugging Face.
Our plan is to release more Beta models in different sizes. The first is a Small model; Medium is next. Once the models reach a stable level of functionality, the official version will be released.
Similar to the way BERT models are trained, Wav2vec 2.0 models are trained in a self-supervised fashion by predicting speech units for masked parts of the audio. We have experimented with several of these models and released a series of fine-tuned models for Norwegian:
- nb-wav2vec2-1b-bokmaal. This model is fine-tuned on top of the feature extractor from XLS-R using the Bokmål subset of the NPSC dataset. A 5-gram KenLM is attached to improve accuracy in automatic speech recognition for Bokmål.
- nb-wav2vec2-1b-nynorsk. Similarly, this is a 1B-parameter model fine-tuned on the Nynorsk subset of NPSC for automatic speech recognition.
- nb-wav2vec2-300m-bokmaal. The 300M-parameter model versions are fine-tuned from the Swedish VoxRex model on the NPSC dataset.
- nb-wav2vec2-300m-nynorsk. This model is also fine-tuned from the 300M VoxRex model, but on the Nynorsk subset of NPSC.
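The masked-prediction objective mentioned before this list can be sketched in a few lines. The function below picks random start frames and masks a fixed-length span after each one; during pre-training the model would be asked to predict the speech units at the masked positions. The defaults (start probability 0.065, span length 10) are the values reported in the Wav2vec 2.0 paper. The function name and the simplified per-frame sampling are illustrative assumptions, not the actual training code.

```python
import random
from typing import List


def sample_span_mask(num_frames: int, start_prob: float = 0.065,
                     span_length: int = 10, seed: int = 0) -> List[bool]:
    """Return a per-frame mask where True marks frames hidden from the model."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        # Simplification: each frame independently becomes a span start.
        if rng.random() < start_prob:
            for i in range(start, min(start + span_length, num_frames)):
                mask[i] = True  # spans may overlap
    return mask


mask = sample_span_mask(num_frames=200)
# Visualise masked (#) and visible (.) frames.
print("".join("#" if m else "." for m in mask))
print(f"masked fraction: {sum(mask) / len(mask):.2f}")
```

Masking contiguous spans rather than isolated frames keeps the task from being solved by simply interpolating between neighbouring frames.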
Front page detector
The front-page-detector is a fine-tuned version of ViT designed specifically to detect newspaper front pages. Given a single page image, the model decides whether it is a front page, a middle page, or a back page of a newspaper. The model is now in production internally at the library.
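As an illustration of the decision step, an image classifier such as ViT produces one score (logit) per class, and the predicted page type is the class with the highest probability after a softmax. The following is a minimal sketch only; the label order and the example logits are made up and do not come from the actual model.

```python
import math
from typing import Dict, List, Tuple

# Assumed label order; the real model's configuration may differ.
LABELS = ["front", "middle", "back"]


def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_page(logits: List[float]) -> Tuple[str, Dict[str, float]]:
    """Map raw class scores to the most likely page type and all probabilities."""
    probs = softmax(logits)
    scores = dict(zip(LABELS, probs))
    return max(scores, key=scores.get), scores


label, scores = classify_page([2.1, 0.3, -1.0])  # made-up logits
print(label, scores)
```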