Automatic Annotation of Bibliographical References with Target Language
Paper in proceedings, 2008

In a large-scale project to list bibliographical references to all of the ca. 7,000 languages of the world, the need arises to automatically annotate the bibliographical entries with ISO 639-3 language identifiers. The task can be seen as a special case of a more general Information Extraction problem: classifying short text snippets in various languages into a large number of classes. We explore supervised and unsupervised approaches motivated by the distributional characteristics of the specific domain and the availability of data sets. In all cases, we make use of a database of language names and identifiers. The suggested methods are rigorously evaluated on a fresh representative data set.
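
To make the task concrete, the following is a minimal sketch of the simplest conceivable baseline: looking up known language names in the title of a bibliographical entry. It is not the method of the paper, which this abstract does not describe in detail. The name table `NAME_TO_CODE` and the function `annotate` are hypothetical names introduced here for illustration, and the three entries stand in for a full database of language names and identifiers.

```python
import re

# Hypothetical toy name database: lowercase language name -> ISO 639-3 code.
# A real database would hold thousands of names, including alternate names.
NAME_TO_CODE = {
    "swahili": "swh",
    "hausa": "hau",
    "tok pisin": "tpi",
}


def annotate(title, name_to_code=NAME_TO_CODE):
    """Return the sorted ISO 639-3 codes whose language name occurs in the title."""
    # Lowercase and drop punctuation so multi-word names match across commas etc.
    text = " ".join(re.findall(r"\w+", title.lower()))
    codes = set()
    for name, code in name_to_code.items():
        # Match whole words only, so a name does not fire inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            codes.add(code)
    return sorted(codes)


print(annotate("A Grammar of Hausa"))
print(annotate("Tok Pisin and Swahili: a comparison"))
```

A lookup of this kind would presumably struggle with language names that are also ordinary words or place names, and with titles that name no language at all, which is one reason the task is framed as a classification problem.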


Harald Hammarström

University of Gothenburg

Coling 2008: Proceedings of MMIES-2: Workshop on Multi-source, Multilingual Information Extraction and Summarization; August 2008, Manchester


Subject Categories

Computer Science
