Automatic Annotation of Bibliographical References with Target Language
Paper in proceedings, 2008

In a large-scale project to list bibliographical references to all of the approximately 7,000 languages of the world, the need arises to automatically annotate the bibliographical entries with ISO-639-3 language identifiers. The task can be seen as a special case of a more general Information Extraction problem: classifying short text snippets in various languages into a large number of classes. We explore supervised and unsupervised approaches motivated by the distributional characteristics of the specific domain and the availability of data sets. In all cases, we make use of a database of language names and identifiers. The suggested methods are rigorously evaluated on a fresh, representative data set.
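
To make the task concrete, here is a minimal sketch of the simplest conceivable baseline: looking up words from a reference title in a database of language names. It is not the paper's method, which the abstract does not specify. The three-entry `LANGUAGE_NAMES` table and the `annotate` function are hypothetical stand-ins for the real database and annotator.

```python
import re

# Hypothetical toy database mapping lowercase language names to
# ISO-639-3 identifiers. The real database would cover thousands of
# languages and their alternative names.
LANGUAGE_NAMES = {
    "basque": "eus",
    "finnish": "fin",
    "hausa": "hau",
}


def annotate(title):
    """Return the sorted identifiers of all language names found in a title."""
    # Lowercase the title and split it into word tokens.
    tokens = re.findall(r"\w+", title.lower())
    # Keep each identifier once, even if the name is repeated.
    return sorted({LANGUAGE_NAMES[t] for t in tokens if t in LANGUAGE_NAMES})


print(annotate("A Grammar of Hausa"))
print(annotate("Basque and Finnish case systems compared"))
```

Exact lookup of this kind fails when a language name is also an ordinary word, when it spans several tokens, or when the title is written in a language that uses a different name for the target language. These are the kinds of cases that call for the supervised and unsupervised approaches the abstract mentions.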

Author

Harald Hammarström

Göteborgs universitet

Coling 2008: Proceedings of MMIES-2: Workshop on Multi-source, Multilingual Information Extraction and Summarization; August 2008, Manchester

57-64

Subject categories

Computer Science