Datasets – AI-lab

The National Library of Norway has a very large digital collection, spanning all kinds of media. The collection is partly based on a massive digitization program, and partly on born-digital material. Parts of this collection are out of copyright and may be used without restrictions. A significant number of content objects also have metadata records related to the content.

Original datasets

The Norwegian Colossal Corpus (NCC)

After a year’s effort, we produced the first public version of a general large-scale text corpus for machine learning on the Norwegian language. This corpus, which we modestly call the “Norwegian Colossal Corpus”, represents a major shift for training modern language models for Norwegian (both Bokmål and Nynorsk).

The Norwegian Colossal Corpus is a collection of multiple smaller Norwegian corpora suitable for training large language models. We have done extensive cleaning on the datasets, and have made them available in a common format. The total size of the NCC is currently 45GB.
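As a rough illustration of what bringing several corpora into a common format can look like, the sketch below maps records from two made-up sources onto one shared schema and serializes them as JSON Lines. The source names, field names (`id`, `source`, `text`), and the whitespace cleanup are assumptions for illustration only; they are not the actual NCC schema or cleaning pipeline.

```python
import json

# Hypothetical mapping from each source corpus to the name of its text field.
TEXT_FIELD_BY_SOURCE = {
    "newspapers": "body",
    "books": "content",
}


def to_common_format(record, source):
    """Map a source-specific record onto one shared schema (illustrative field names)."""
    text = record[TEXT_FIELD_BY_SOURCE[source]]
    return {
        "id": f"{source}_{record['id']}",
        "source": source,
        # Collapsing whitespace stands in for real, more extensive cleaning.
        "text": " ".join(text.split()),
    }


def to_json_lines(records):
    """Serialize records as JSON Lines, one document per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Toy input invented for this example.
raw = [
    ({"id": 1, "body": "Dette  er en\navis."}, "newspapers"),
    ({"id": 7, "content": "Dette er ei bok."}, "books"),
]
common = [to_common_format(rec, src) for rec, src in raw]
print(to_json_lines(common))
```

Keeping one document per line makes it straightforward to stream, shard, and concatenate the individual corpora.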

The corpus, along with finer detail about its sources and entries, is available as a Hugging Face dataset. To find out more about the corpus, please go to the NoTram project on GitHub. This dataset will also be distributed as part of the Norwegian Language Bank. If you need text data from in-copyright sources, let’s have a chat.

Political Affiliation

The Norwegian Parliament Speeches dataset is a collection of text passages from speeches given at the Norwegian Parliament (Storting) between 1998 and 2016 by members of the two major parties: Fremskrittspartiet and Sosialistisk Venstreparti. Each passage is annotated with the party the speaker was associated with at the time, and the date of the speech is also included.

Newspaper Front Pages (in-progress)

This dataset contains pages from Norwegian newspapers where the front, back, and middle pages are all marked. It will be released before the end of the year. It has also been used in the implementation of a front-page detector for the internal library workflows.

Translated datasets

Norwegian PAWS-X

PAWS-X is a cross-lingual adversarial dataset for paraphrase identification. The original contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. We machine-translated all pairs into Norwegian Bokmål and Nynorsk using FAIR’s No Language Left Behind model with 3.3B parameters, resulting in our own Norwegian PAWS-X.

Norwegian MNLI

The Multi-Genre Natural Language Inference (MNLI) dataset is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The premise sentences are gathered from ten different sources, including transcribed speech, fiction, and government reports. The authors of the GLUE benchmark, which MNLI belongs to, use the standard test set, for which they obtained private labels from the MNLI authors, and evaluate on both the matched (in-domain) and mismatched (cross-domain) sections.

Using the Google Translate API, we machine-translated the entire MNLI dataset (including the matched and mismatched sets) into Norwegian.
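To make the task format concrete, here is a minimal sketch of three-way entailment classification scored separately on a matched and a mismatched set. The `NLIExample` class, the `always_neutral` baseline, and the Norwegian sentences are all invented for illustration; they are not drawn from the dataset, and the field names are assumptions rather than the published schema.

```python
from dataclasses import dataclass

LABELS = ("entailment", "neutral", "contradiction")


@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # one of LABELS


def accuracy(examples, predict):
    """Fraction of examples whose predicted label equals the gold label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(predict(ex.premise, ex.hypothesis) == ex.label for ex in examples)
    return correct / len(examples)


def always_neutral(premise, hypothesis):
    """Trivial baseline standing in for a real model."""
    return "neutral"


# Toy examples invented for this sketch, not taken from MNLI.
matched = [
    NLIExample("En mann sover.", "En person sover.", "entailment"),
    NLIExample("En mann sover.", "Mannen er lei av jobben.", "neutral"),
]
mismatched = [
    NLIExample("Rapporten ble publisert i mai.", "Rapporten ble aldri publisert.", "contradiction"),
]

# Report in-domain and cross-domain accuracy separately.
scores = {
    "matched": accuracy(matched, always_neutral),
    "mismatched": accuracy(mismatched, always_neutral),
}
print(scores)
```

Reporting the two sets separately shows how well a model transfers to genres it did not see during training.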

Derivative datasets

LIA

The Language Infrastructure made Accessible (LIA) dataset is based on the corpus of historical dialect recordings of the same name. LIA Norwegian comprises 3.5 million words elicited from 1382 informants from 227 local areas in Norway. The material is transcribed both (quasi) phonetically and orthographically (Nynorsk), and is morphologically tagged with the LIA tagger.

We took the dataset and converted it into a format more suitable for machine learning purposes.

NPSC

The Norwegian Parliament Speech Corpus (NPSC) was developed by the Norwegian Language Bank at the National Library of Norway between 2019 and 2021. The NPSC consists of audio recordings of meetings in Stortinget (the Norwegian parliament), and corresponding orthographic transcriptions in either Norwegian Bokmål or Norwegian Nynorsk, as well as various metadata about the speakers.

We re-worked the corpus to make it available for machine learning.

NST

The Nordic Language Technology (NST) database was created for the development of automatic speech recognition and dictation in Norwegian. The Language Bank re-organized the data to improve the usefulness of the database. We later transformed it into a dataset ready for use in machine learning.

Coscan Speech (in-progress)

The Continental Scandinavian Speech dataset is a subset of NST that meets a set of conditions:

Utterances last between 5 and 30 seconds.
Speakers were born and raised in the same region.
The gender of the speaker is known.
The age of the speaker is known.

The splits are roughly balanced and include information about area, region, age group, and gender.
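The selection conditions listed above can be sketched as a simple filter over utterance metadata. The `Utterance` class and its field names are hypothetical stand-ins for whatever the NST metadata actually provides, and treating the 5 and 30 second bounds as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    # Hypothetical metadata fields; None means the value is unknown.
    duration_s: float
    region_of_birth: Optional[str]
    region_of_youth: Optional[str]
    gender: Optional[str]
    age: Optional[int]


def meets_conditions(u: Utterance) -> bool:
    """Return True if the utterance satisfies all four selection conditions."""
    return (
        5 <= u.duration_s <= 30  # bounds assumed inclusive
        and u.region_of_birth is not None
        and u.region_of_birth == u.region_of_youth  # born and raised in the same region
        and u.gender is not None
        and u.age is not None
    )


# Toy records invented for this example.
utterances = [
    Utterance(12.0, "Oslo", "Oslo", "female", 34),     # kept
    Utterance(3.2, "Oslo", "Oslo", "male", 51),        # too short
    Utterance(18.5, "Bergen", "Oslo", "female", 27),   # born and raised in different regions
    Utterance(20.0, "Troms", "Troms", None, 45),       # gender unknown
    Utterance(25.0, "Troms", "Troms", "male", None),   # age unknown
]
selected = [u for u in utterances if meets_conditions(u)]
print(len(selected))
```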

NRK (in-progress)

The NRK dataset contains pointers to audio clips with speech from the Norwegian Broadcasting Corporation.

NB Tale (in-progress)

NB Tale is a basic acoustic-phonetic speech database for Norwegian. The database contains recordings of 380 speakers from 24 different dialect areas. The database is produced for the National Library of Norway by Lingit AS.

We are planning to release a dataset built upon NB Tale by early next year.