The Swedish Culturomics Gigaword Corpus: A One Billion Word Swedish Reference Dataset for NLP

Stian Rødven Eide; Nina Tahmasebi; Lars Borin

Paper in proceedings, 2016

In this paper we present a dataset of contemporary Swedish containing one billion words. The dataset consists of a wide range of sources, all annotated using a state-of-the-art corpus annotation pipeline, and is intended to be a static and clearly versioned dataset. This will facilitate reproducibility of experiments across institutions and make it easier to compare NLP algorithms on contemporary Swedish. The dataset contains sentences from 1950 to 2015 and has been carefully designed to feature a good mix of genres balanced over each included decade. The sources include literary, journalistic, academic and legal texts, as well as blogs and web forum entries.

Authors

Stian Rødven Eide

Göteborgs universitet

Nina Tahmasebi

Göteborgs universitet

Lars Borin

Göteborgs universitet

Linköping Electronic Conference Proceedings. Digital Humanities 2016. From Digitization to Knowledge 2016: Resources and Methods for Semantic Processing of Digital Works/Texts, July 11, 2016, Krakow, Poland

1650-3740 (eISSN)

Vol. 126, article 002, pp. 8-12
978-91-7685-733-5 (ISBN)

Subject categories (SSIF 2011)

Language Technology (Computational Linguistics)

More information

Created

2017-10-10
