Datasets

The National Library of Norway has a very large digital collection, spanning all kinds of media. The collection is based partly on a massive digitization program and partly on born-digital material. Parts of this collection are out of copyright and may be used without restrictions. Many content objects also have associated metadata records.

The Norwegian Colossal Corpus (NCC)

After a year’s effort, we can finally announce the first public version of a general large-scale text corpus for machine learning on the Norwegian language. This corpus, which we modestly call the “Norwegian Colossal Corpus”, represents a major shift for training modern language models for Norwegian (both Bokmål and Nynorsk).

The Norwegian Colossal Corpus is a collection of multiple smaller Norwegian corpora suitable for training large language models. We have cleaned the datasets extensively and made them available in a common format. The total size of the NCC is currently 45 GB.

The corpus, along with finer detail about its sources and entries, is available as a Hugging Face Dataset. To find out more about the corpus, please go to the NoTram Project on GitHub. This dataset will also be distributed as part of the Norwegian Language Bank.
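
Since the corpus is published as a Hugging Face Dataset, it can probably be inspected without downloading all 45 GB by streaming it. The sketch below is a rough illustration only, using the `datasets` library. The identifier `NbAiLab/NCC` and the split name `train` are assumptions that this page does not state, so check the dataset page for the actual values. The helper names `stream_corpus` and `peek` are hypothetical.

```python
from itertools import islice

# Assumption: this Hub identifier is not given in the text above.
DATASET_ID = "NbAiLab/NCC"


def stream_corpus(dataset_id=DATASET_ID, split="train"):
    """Return a lazily streamed split so the corpus is not downloaded up front."""
    # Imported here so that peek() works even without the library installed.
    from datasets import load_dataset

    # Assumption: the dataset exposes a split called "train".
    return load_dataset(dataset_id, split=split, streaming=True)


def peek(records, n=3):
    """Return the first n records from any iterable of records."""
    return list(islice(records, n))


if __name__ == "__main__":
    for record in peek(stream_corpus()):
        # No field names are assumed; show whichever keys each record has.
        print(sorted(record.keys()))
```

Streaming keeps memory and disk use low for a first look at the data. For actual model training, a full download may be the more practical choice.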